Where does the SimilarWeb data come from?


[This is the new edition of an older article]

As with Google Trends, I’m always surprised at how quickly conclusions are drawn from data without anyone asking where the data actually comes from and how plausible it is. This is especially surprising for SimilarWeb: Google has the search data and can read trends from it, but where would SimilarWeb get data about how many visitors a website or an app has? How reliable is this data? Is the reliability sufficient to make important business decisions?

The ancestor of SimilarWeb

In 2006, my former colleague Matt Cutts investigated how reliable Alexa’s data was (Alexa used to be an Amazon service that had nothing to do with speech recognition). This service collected data with a browser toolbar (something that no longer exists today), i.e. every page a user looked at was logged. Since the Alexa data was mainly of interest to webmasters, it was mainly webmasters who installed the toolbar, and so pages that were of interest to webmasters were logged disproportionately often. The data was skewed.

If you record the traffic of users, then you have to make sure that the user base corresponds to the internet population you want to find out something about. That’s not to say that the SimilarWeb data is completely worthless. If you compare two fashion sites with each other, then both might be equally “uninteresting” for the webmaster population (a prejudice, I know), and then you could at least compare them with each other. But you couldn’t compare a fashion site to a webmaster tool page. Even these are only conjectures, though; we don’t know exactly. For such an expensive tool, that is actually unbelievable.
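The comparability argument can be made concrete with a small calculation. The following is a minimal sketch with invented numbers: the population shares, the visit rates and the three site names are assumptions for illustration only, not SimilarWeb or Alexa data.

```python
# Hypothetical illustration of panel bias; every number here is made up.

# Share of each user group in the real internet population and in a
# toolbar/extension panel that over-represents webmasters.
POPULATION_SHARE = {"webmaster": 0.02, "general": 0.98}
PANEL_SHARE = {"webmaster": 0.50, "general": 0.50}

# Assumed probability that a member of a group visits a site in a month.
VISIT_RATE = {
    "fashion_a": {"webmaster": 0.10, "general": 0.10},
    "fashion_b": {"webmaster": 0.05, "general": 0.05},
    "seo_tool": {"webmaster": 0.40, "general": 0.001},
}


def reach(site, shares):
    """Expected fraction of a user mix that visits the site."""
    return sum(shares[group] * VISIT_RATE[site][group] for group in shares)


def ratio(site_x, site_y, shares):
    """How many times bigger site_x looks than site_y for a user mix."""
    return reach(site_x, shares) / reach(site_y, shares)


# Two sites that webmasters and everyone else use alike.
true_fashion = ratio("fashion_a", "fashion_b", POPULATION_SHARE)
panel_fashion = ratio("fashion_a", "fashion_b", PANEL_SHARE)

# A webmaster tool compared with a fashion site.
true_mixed = ratio("seo_tool", "fashion_a", POPULATION_SHARE)
panel_mixed = ratio("seo_tool", "fashion_a", PANEL_SHARE)

print(f"fashion_a vs fashion_b: true {true_fashion:.2f}, panel {panel_fashion:.2f}")
print(f"seo_tool vs fashion_a:  true {true_mixed:.3f}, panel {panel_mixed:.3f}")
```

Under these assumptions the panel reproduces the ratio between the two fashion sites, because both groups use them at the same rate. The webmaster tool, however, looks roughly twice as big as the fashion site in the panel, although in the assumed population it reaches less than a tenth of that site's audience.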

But where does SimilarWeb get the data from? On their website, they cite four sources:

  • An international panel
  • Crawling
  • ISP Data
  • Direct measurements

Data collection via a panel

The panel is not explained in more detail, but with minimal research you quickly find browser extensions. These are probably the successors of the earlier browser toolbars. What are the advantages of the SimilarWeb extension? It offers exactly what SimilarWeb offers: you can see with one click how many users the page you are currently looking at has, where they come from, and so on. The extension phones home not only when you view the data for a page, but for every page you look at.

If you then think about who is interested in such data and who then installs such an extension, then we have arrived at the data quality of the Alexa Top Sites. Webmasters, marketing people, search engine optimizers, all these people have a higher probability of installing this extension than, for example, a teenager or my mother.

Crawling

What exactly SimilarWeb crawls is still a mystery to me, and especially why a crawl should provide information about how much traffic a page has. Strictly speaking, a crawler only causes traffic. SimilarWeb says, “[we] scan every public website to create a highly accurate map of the digital world”. Presumably, links are read here, and perhaps topics are also recognized automatically.

ISP Traffic

Unfortunately, SimilarWeb doesn’t say which ISPs they get traffic data from. In Germany it is probably forbidden, but in some countries an Internet service provider will certainly be allowed to let SimilarWeb record everything that passes through its cables in terms of traffic. Of course, that would be a very good data basis. But not every ISP is the same. Would we trust the data if, for example, AOL still existed and only its users were measured? Worse still, at no point does SimilarWeb make transparent which of its figures the ISP data flows into.

Direct measurements

This is where it gets exciting, because companies can connect their web analytics data, in this case Google Analytics, directly to SimilarWeb, so that the data measured by Google Analytics is available to all SimilarWeb users. The site is then marked as “verified”. Why should you do that? You don’t get anything for free in return; instead, the pitch is that you can expect more advertising revenue or strengthen your brand. Pretty weak arguments, I think, but there are some sites that do so anyway.

How reliable is the Similar Web data really?

Of course, the direct measurements are reliable. It gets difficult with all other data sources, and these account for the majority of the measurements. According to my sample, only a fraction of the SimilarWeb data is based on direct measurement. But here you could certainly build models from the precisely measured data and the inaccurately measured data. If I know exactly what the real data for spiegel.de looks like and what the inaccurately measured data for the same site looks like, then I could, for example, calculate the panel bias and correct for it on other pages. And I could do the same with all the other data sources. But does it really work? Let’s take a look at a measurement of SimilarWeb for one of my pages:

Apparently, the number of visitors fluctuates between next to nothing and 6,000 users. There are no clear patterns. And now let’s look at the real numbers from Google Analytics:

It’s the same time period. And yet, the distinctive traffic patterns from the Google Analytics data are not visible in the SimilarWeb data. The data is simply wrong.
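A comparison like this does not have to be done by eye. As a minimal sketch, the Pearson correlation between the two daily series measures whether they share the same pattern, independent of the absolute level. The numbers below are invented for illustration; they are not the data from my page, and the variable names are my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily visitors over two weeks (weekday/weekend rhythm).
measured = [900, 950, 920, 940, 880, 400, 380,
            910, 960, 930, 950, 890, 410, 390]

# Hypothetical third-party estimate that jumps around without a pattern.
estimated = [3000, 100, 5200, 0, 2500, 4100, 50,
             900, 6000, 200, 3300, 150, 2800, 4700]

# An estimate with the wrong level but the right shape, for contrast.
scaled = [2 * v for v in measured]

r_estimate = pearson(measured, estimated)
r_scaled = pearson(measured, scaled)

print(f"measured vs estimated: r = {r_estimate:.2f}")
print(f"measured vs scaled:    r = {r_scaled:.2f}")
```

An estimate that is off by a constant factor still correlates perfectly with the measured series, so a wrong level alone could be corrected. A correlation close to zero means the estimate does not even reproduce the weekly rhythm, and no rescaling will fix that.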

Result

Is it possible to use SimilarWeb at all? I advise against it completely, unless the data comes from a direct measurement. Of course, the question may now arise as to what else to use. The counter question is what you can do with data that you can’t be sure is true at all. In statistics, you rarely commit yourself anyway, but if not even the collection of the data is transparent, then you have to keep your hands off it. If I had to make a business decision that may cost a lot of money, I wouldn’t rely on this data. And “for a first look…?” We also know that a “first glance” can quickly become a “fact” because it fits so well into one’s own argumentation. Which brings us back to confirmation bias.

Comments (since February 2020 the comment function has been removed from my blog):

Konrad says

  1. August 2017 at 14:56 On the subject of crawling: Maybe they crawl unprotected AWStats installations and other counter services and then use the data to train the algorithms for extrapolation? Of course, it’s all very imprecise. As far as I know, they not only use the SimilarWeb toolbar but also buy data from third-party browser add-ons. Adblockers could already provide a much better data basis…

Tom Alby says

  1. August 2017 at 11:52 This is not documented anywhere, and if they did it, they would certainly write about it, because that would create more trust. However, adblocker users are not a scaled-down cross-section of the total population either, i.e. I could not draw conclusions about the total population from them.

And which sites still have AW-Stats and other counter services?

Daniel Brückner says

  1. August 2017 at 09:14 Hi Tom,

Regarding crawling: “data” does not necessarily mean that it is traffic data.

Regarding the panel: It could also be that they buy traffic data from other browser extensions. Based on this, you can already make an estimate.

But overall, you’re right that you should take the numbers with a grain of salt.

Best regards, Daniel

Tom Alby says 11. August 2017 at 17:22 Hello Daniel,

Regarding the panel: this is not correct, because with which other browser extensions can you be sure that they are installed by a section of the population that corresponds to the overall population of internet users? With extensions, I would always assume that the willingness to install extensions alone makes you stand out from the crowd. Regarding crawling: It’s also there.

Best regards

Tom

Jan says

  1. October 2017 at 05:28 Busted. I can see the extension is worthless. Is there something that works?

Tom Alby says

  1. October 2017 at 19:05 No. Sorry.

Tom says

  1. July 2019 at 13:34 Great article that speaks from my soul! I am also extremely skeptical about the data (except for the verified direct measurements). SimilarWeb’s official wishy-washy formulation on the subject only adds to my skepticism.

To be honest, I wonder how they managed to win over investors in the first place…
