How does SimilarWeb measure how much traffic a site has?

As with Google Trends, I'm always surprised at how quickly conclusions are drawn from data without anyone asking where the data actually comes from and how plausible it is. With SimilarWeb this amazes me even more: Google has the search data and can derive trends from it, but how can SimilarWeb know how many visitors a website or app has? How reliable is this data? Is it reliable enough to base important business decisions on?

The ancestor of SimilarWeb

Back in 2006, my former colleague Matt Cutts investigated how reliable Alexa's data was (Alexa used to be an Amazon service that had nothing to do with speech recognition). Alexa collected its data with a browser toolbar (a category of software that has since died out), which logged every page a user looked at. Since Alexa was of interest mainly to webmasters, the pages that webmasters find interesting were over-represented in the logs, so the data was skewed. If you record the traffic of a panel of users, you have to make sure that this user base corresponds to the online population you want to draw conclusions about. That doesn't make the data completely worthless: if you compare two fashion sites with each other, both are probably equally "uninteresting" to the webmaster population (a prejudice, I know), so you could at least compare them to each other. But you couldn't compare a fashion site with a site about webmaster tools.

But where does SimilarWeb get its data from? On their website they name four sources:

  • An international panel
  • Crawling
  • ISP data
  • Direct measurements

Data collection via a panel

SimilarWeb Chrome Extension

The panel is not explained in detail, but even minimal research quickly leads you to browser extensions, which are presumably the successors of the earlier browser toolbars. What is the advantage of the SimilarWeb extension? It offers exactly what SimilarWeb offers: you can see with one click how many users the page you are currently viewing has, where they come from, and so on. The catch is that the SimilarWeb extension does not only phone home when you are actively looking at the data for a page; it reports every page you visit.

If you consider for whom such data is interesting and who therefore installs such an extension, we are back at the data quality of the Alexa top sites: webmasters, marketing people and search engine optimizers are all far more likely to install this extension than, say, a teenager or my mother.

Crawling

What exactly SimilarWeb crawls is still a mystery to me, and especially how crawling is supposed to reveal how much traffic a site has. Strictly speaking, a crawler only causes traffic 🙂 SimilarWeb says, "[we] scan every public website to create a highly accurate map of the digital world". Presumably links are being read here, and perhaps topics are being classified automatically.

ISP traffic

Unfortunately, SimilarWeb does not say which ISPs it gets traffic data from. This is probably forbidden in Germany, but in some countries it is certainly legal for an Internet service provider to let SimilarWeb record all the traffic flowing through its cables. That would of course be a very good data source. But not every ISP is the same: would we trust the data if, for example, it mainly contained AOL users (do they still exist at all)?

Direct measurements

This is where it gets interesting, because companies can connect their web analytics data, in this case Google Analytics, directly to SimilarWeb, so that the numbers measured by Google Analytics become visible to all SimilarWeb users. The site is then marked as "verified". Why would you do that? You don't really get anything in return; supposedly you can expect more advertising revenue or strengthen your brand. Rather weak arguments, I think, but there are still some sites that do it.

How reliable is SimilarWeb data really?

The direct measurements are of course reliable. It becomes difficult with all the other data sources, and those make up the majority of the measurements: in my sample, only a fraction of the SimilarWeb data is based on direct measurement. One could certainly build models that combine the accurately measured data with the inaccurately measured data. If I know the accurate numbers for spiegel.de and also see the inaccurate panel data for the same site, I could, for example, estimate the panel bias and use it to correct the figures for other sites. And the same could be done with all the other data sources.
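To make that idea concrete, here is a minimal sketch in R with invented numbers of how such a calibration might look. The sites, the figures and the simple log-linear model are illustrative assumptions of mine, not SimilarWeb's actual method:

```r
# Minimal sketch with invented numbers: calibrating panel-based estimates
# against sites whose real traffic is known from direct measurement.
calibration <- data.frame(
  site          = c("site_a", "site_b", "site_c", "site_d"),
  panel_visits  = c(12000, 450000, 3100, 98000),   # what the panel suggests
  actual_visits = c(30500, 1200000, 9800, 240000)  # verified analytics numbers
)

# Estimate the panel bias on a log scale (traffic is roughly log-normally distributed).
fit <- lm(log(actual_visits) ~ log(panel_visits), data = calibration)

# Correct a site for which only a panel estimate exists.
exp(predict(fit, newdata = data.frame(panel_visits = 56000)))
```

Whether a single correction like this would hold across countries and site categories is exactly the open question.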

But does it really work? Let's take a look at SimilarWeb's measurement for one of my sites:

Measurement from SimilarWeb

Apparently the number of visitors fluctuates between next to nothing and 6,000 users, with no clear pattern. Now let's look at the real numbers from Google Analytics:

Numbers from Google Analytics

It is the same time period, and yet the distinctive traffic patterns from the Google Analytics data are nowhere to be seen in the SimilarWeb data. The data is simply wrong.
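If you want to run this comparison yourself, a minimal sketch could look like the one below. The file names and column names are assumptions for illustration; you would use your own SimilarWeb and Google Analytics exports:

```r
# Minimal sketch (assumed file and column names): comparing the daily
# SimilarWeb estimate with the Google Analytics export for the same period.
similarweb <- read.csv("similarweb_daily.csv")  # columns: date (YYYY-MM-DD), visits
analytics  <- read.csv("ga_daily.csv")          # columns: date (YYYY-MM-DD), users

both <- merge(similarweb, analytics, by = "date")

# If SimilarWeb captured the real traffic pattern, the two series should
# at least be strongly rank-correlated.
cor(both$visits, both$users, method = "spearman")

# Visual check: both series on the same time axis.
plot(as.Date(both$date), both$users, type = "l",
     xlab = "Date", ylab = "Visitors", col = "blue")
lines(as.Date(both$date), both$visits, col = "red")
legend("topleft", legend = c("Google Analytics", "SimilarWeb"),
       col = c("blue", "red"), lty = 1)
```

In the case above, even a rough check like this makes the mismatch obvious.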

Conclusion

Can you use SimilarWeb at all? I would advise you to be very careful whenever the data does not come from a direct measurement. Of course, the question then arises what to use instead. The counter-question is what you gain from data whose correctness you cannot verify at all. If I had to make a business decision that might cost a lot of money, I would not rely on this data. For a first impression…? We also know how quickly a "first impression" becomes a "fact" because it fits so nicely into one's own argument.

How reliable is the keyword data from keywordtool.io?

Since the Google AdWords Keyword Planner only spits out granular data for accounts with a sufficient budget, the demand for alternatives is high. Rand Fishkin believes Google Trends is the answer, but apparently did not understand that Google Trends provides normalized, indexed and keyword-extended data, not absolute numbers. On one point, however, he is right: even the Keyword Planner does not provide really accurate data, as I described in another article.
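A tiny illustration in R with invented numbers of why indexed data cannot replace absolute volume: Google Trends rescales every series so that its own maximum becomes 100, so two keywords whose volumes differ by a factor of 100 produce exactly the same curve.

```r
# Invented weekly search volumes for two keywords; keyword B is 100x as big.
weekly_volume_a <- c(200, 350, 500, 450, 300)
weekly_volume_b <- weekly_volume_a * 100

# Google Trends-style indexing: scale each series to its own maximum of 100.
index <- function(x) round(100 * x / max(x))

index(weekly_volume_a)  # 40 70 100 90 60
index(weekly_volume_b)  # 40 70 100 90 60 -- identical, the absolute volume is gone
```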

Of course this is an unsatisfactory situation, so it is no wonder that alternatives like keywordtool.io are popular. But how accurate is their data? I could not find any hint as to where they get their data from, and that alone makes me suspicious. Where would the numbers come from, if not via the AdWords API? And access to that API is limited. At the initiative of my esteemed colleague Christian, I then took a closer look. On the one hand, he asked keywordtool.io where the data comes from (there was no satisfactory answer); on the other hand, he got a test account 🙂 A first pre-test brought disappointing results: the numbers were completely different from those in AdWords. However, my colleague was no longer sure whether he had chosen the same settings as I had in AdWords, so I ran the test again myself.

With the first keyword set, from the field of pedagogy, the surprise after the pre-test was big: apart from a few exceptions, every keyword had exactly the same search volume as in the Keyword Planner. The exceptions were keywords for which keywordtool.io did not return any numbers although the Keyword Planner did, and those were only 2 out of almost 600 keywords. The second keyword set, on the subject of acne, showed the same picture: the search volumes matched exactly, apart from a few exceptions. Interestingly, both keyword sets covered topics that do not exactly stand out for their high search volume; in some cases we are talking about 20 search queries per month. So it is very likely that the Google AdWords API is being tapped directly here, otherwise these exact matches cannot be explained. This would also explain why you can only query 10 sets of at most 700 keywords per day (more than 700 keywords per query are not possible with the Keyword Planner either, but more than 10 queries per day are). So keywordtool.io would be a good alternative… if it weren't for…

The third keyword set then showed a different picture: the deviations are dramatic. Unlike the previous keywords, we are now talking about high-volume keywords such as <used cars>. Unfortunately, no pattern is visible in the plot, except that keywordtool.io is always higher and never lower. Keywords with a high search volume can be just as far off as keywords with a low search volume. Nor is the deviation constant: there are keywords where the numbers match exactly and keywords where keywordtool.io reports 16 times the volume that Google reports. There is no order of any kind; the deviations look completely random. And they are far too big to ignore.
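The comparison itself is easy to reproduce; a minimal sketch in R is shown below. The file names and column names are assumptions for illustration, not the actual exports:

```r
# Minimal sketch (assumed file and column names): comparing keywordtool.io
# volumes with Keyword Planner volumes for the same keyword set.
kwt     <- read.csv("keywordtool_set3.csv")       # columns: keyword, volume_kwt
planner <- read.csv("keyword_planner_set3.csv")   # columns: keyword, volume_planner

both  <- merge(kwt, planner, by = "keyword")
ratio <- both$volume_kwt / both$volume_planner

sum(ratio == 1)   # how many keywords match exactly?
summary(ratio)    # how large do the deviations get?

# Check for structure: is the deviation related to the size of the keyword?
plot(both$volume_planner, ratio, log = "x",
     xlab = "Keyword Planner volume", ylab = "keywordtool.io / Keyword Planner")
```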

Of course, that won't stop many people from using keywordtool.io; after all, people like to say "well, it's usually about right" or "it's better than nothing". Whether it really is better is something I question. I would not want to make any decisions on the basis of such deviations. The Keyword Planner is the better option, even if it only delivers bucketed ranges when you don't have enough budget.

By the way, the data for the third set is available here in an R notebook.

5 Reasons why you misunderstand Google Trends

[Please note that this article was originally written in German in 2016; all screenshots are in German, but you will understand them nevertheless. The text has been translated automatically, so please excuse the English.]

In September 2015, I stood on a big Google stage in Berlin and presented the new features of Google Trends alongside a demo of voice search. For a data lover like me, Google Trends is a fascinating tool if you understand all the pitfalls and know how to avoid them. At the same time, the tool offers plenty of potential for misunderstandings 🙂 Search queries are set in <> brackets.

  1. Misunderstanding: Not all search queries are included.
  2. Misunderstanding: The lines say nothing about search volume / there are no absolute numbers.
  3. Misunderstanding: Google Trends search queries are different.
  4. Misunderstanding: Rising lines do not mean more search queries.
  5. Misunderstanding: Without a benchmark, Google Trends is worthless.

Continue reading “5 Reasons why you misunderstand Google Trends”

Why you should be careful when you see new and returning visitors in Google Analytics

Google Analytics can sometimes be mean, because some dimensions, paired with segments, do not behave as you might think at first. Thanks to the comments from Michael Janssens and Maik Bruns on my question in the analytics group founded by Maik, I can go to sleep tonight with peace of mind and have become a little smarter again.
The question came up today in the Analytics course: how can I have more new users than transactions when the "Has made a purchase" segment is applied? The link to the report is here. My assumption was this: if I take a segment of users who have made a purchase and apply it to the "New vs. Returning Users" report, then under new visitors combined with "Has made a purchase" I should only see users who made a purchase during their very first visit. However, in this report we see 691 users but only 376 transactions. If my assumption were correct, the two numbers would be the same. But they are not. Continue reading "Why you should be careful when you see new and returning visitors in Google Analytics"

R: dplyr/sparklyr vs data.table Performance

In their 2017 book "R for Data Science", Grolemund and Wickham state that data.table is recommended over dplyr when you regularly work with larger datasets (10 to 100 GB). Having started with Wickham's sparklyr (R's interface to Spark that uses the dplyr dialect), I was wondering how much faster data.table actually is. This is not the most professional benchmark, given that I just compare the system time before and after the script ran, but it gives an indication of the advantages and disadvantages of each approach.
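As a rough illustration of the pattern (not the actual benchmark from this post; the dataset size, columns and aggregation are arbitrary choices of mine), timing the same grouped aggregation in both dialects could look like this:

```r
library(dplyr)
library(data.table)

# Arbitrary example data: 10 million rows, one grouping column, one value column.
n  <- 1e7
df <- data.frame(group = sample(letters, n, replace = TRUE),
                 value = runif(n))
dt <- as.data.table(df)

# dplyr version
system.time(
  df %>% group_by(group) %>% summarise(mean_value = mean(value))
)

# data.table version
system.time(
  dt[, .(mean_value = mean(value)), by = group]
)
```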

Continue reading “R: dplyr/sparklyr vs data.table Performance”