New versus returning users: A useless KPI



I never understood the purpose of a certain diagram in Google Analytics, namely the pie chart that shows the ratio of new users to returning users. It used to be in the standard dashboard that a user saw after logging in, and I always apologized for this diagram when I gave a Google Analytics demo during my time at Google.

Pie charts: Only for static compositions

What's so bad about this diagram? First of all, a pie chart is meant for static compositions. If I want to know the gender distribution in my course, then a pie chart makes sense; that composition will largely not change during the course.

Most websites want to increase the number of their visitors, whether through new users, returning users or both. Development over time is therefore the goal, and a pie chart is not meaningful for that, since it shows a static composition. A line graph showing development over time is certainly a better choice in most cases.

The two metrics are independent of each other

But I'll go one step further and claim that these two metrics have nothing to do with each other and therefore should never be shown in the same diagram. New users can become returning users, but they don't have to. And returning users can also have been new users in the same period; in that case they are counted twice. If a user can appear in both slices of the pie chart, what does the ratio of the two slices actually tell us?

New users are created through marketing. Ideally, recurring users come about because the content is so great that they can’t live without it anymore. If I don’t get new users, then I have to optimize my marketing. If my users don’t return, then I have to optimize my content. Since we’re always on the hunt for so-called “actionable insights”, why should we then display two metrics in one diagram if they require different corrective measures?

Additionally, I can spend a lot of money on marketing for two weeks, so that the proportion of new users increases massively and the proportion of returning users is greatly reduced in the ratio. Even if the absolute number of returning users remains the same, the ratio would suggest that we have fewer returning users. For this reason, these two metrics should never be displayed together as a ratio, but always separately. Suggested display: A graph showing the development of new users with acquisition channels, a graph showing returning users and the content that may be responsible for their return.
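If you want to build such charts yourself, a minimal sketch in R could look like the following, assuming the googleAnalyticsR package is authorized for your account; the view ID, date range and chart choices are placeholders, not fixed recommendations:

library(googleAnalyticsR)
library(ggplot2)

ga_auth()

# New users per day, split by acquisition channel
new_users <- google_analytics(my_view_id,
                              date_range = c("2018-01-01", "2018-03-31"),
                              metrics    = "newUsers",
                              dimensions = c("date", "channelGrouping"))

ggplot(new_users, aes(x = date, y = newUsers, colour = channelGrouping)) +
  geom_line()

# Returning users per day, in a separate chart
returning <- google_analytics(my_view_id,
                              date_range = c("2018-01-01", "2018-03-31"),
                              metrics    = "users",
                              dimensions = c("date", "userType"))

ggplot(subset(returning, userType == "Returning Visitor"),
       aes(x = date, y = users)) +
  geom_line()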

What about the non-returning users?

This question was asked by a course participant today, and I think it is a good question for several reasons. We don't know whether new users will become returning users (apart from those who are both new and returning in our period because they came twice, and who could of course still decide against another visit in the future). In this sense, any user who has been there once could come back at some point. Technically, a user who has deleted their cookies can no longer reappear as a returning user, unless a user ID is used. But I still find the question exciting because I have dealt with it in a different context: from what point do I have to consider a customer lost for a product that is bought regularly?

The graphic is intended to illustrate my thoughts on this matter. We have a point “Today” and three users: blue, red, and green. Blue user comes by at more or less regular intervals. At the point “Today”, I would assume that they will also come back in the future, and the probability seems high. Green user was here recently. They may not have had the chance to come back yet. Red user was here a long time ago, and compared to the time intervals between blue user’s purchases, the probability of their return seems low. They can come back, but I would rather lure them with an incentive than green user, who might come back anyway (pull-forward cannibalization).

We can't say anything definite about non-returning users, because we don't know the future. But we can work with probabilities. For mere visitors, this may not be all that exciting; for shop customers, it is.

Why New and Returning Visitors in Google Analytics should sometimes be treated with caution


Google Analytics can sometimes be nasty, because some dimensions paired with segments don’t behave the way you might initially think. Thanks to Michael Janssens’ and Maik Bruns’ comments on my question in the analysis group founded by Maik, I can sleep peacefully tonight and have become a bit wiser again.

The question came up today in the analytics course: how can it be that I have more new users than transactions when I am in the segment "Has made a purchase"? The link to the report is here; my assumption was: if I have a segment of users who have made a purchase and apply this segment to the "New vs. returning users" report, then in the area New Visitors + Has made a purchase I should only see users who made a purchase on their first visit. However, in this report we see 691 users but only 376 transactions. If my expectation were correct, these numbers should be equal. But they are not.

New users + Returning users > All users

We see other contradictions here as well, and for the sake of completeness, we will start with them. The number of new users is 53,263, that of returning users is 14,578. However, we have a total of only 58,856 users, which is less than the sum of new and returning users.

This discrepancy is easily explained: if a user comes for the first time within the reporting period, they are a new user. If they come a second time within the reporting period, they are also a returning user. They are therefore counted twice, once among the new users and once among the returning users, whereas under "All Users" they are only counted once. In our example, 53,263 + 14,578 = 67,841, so 67,841 − 58,856 = 8,985 users appear in both buckets.

If we take a look at the Transactions column, we see that transactions are not counted multiple times. This makes sense, because each transaction can only be counted once.

Global pages distort the data

Maik also brought up the point that global pages display somewhat distorted data because Google Analytics restarts all sessions at midnight, so a new visitor who arrives at 23:55 and calls up a second page at 0:01 is also counted as a new visitor on this second day, just like on the first day. It’s the same user, but they are counted twice as a new user (see source here). But can that lead to us having so many more new users than transactions? Certainly, the Google Merchandising Store is globally and galactically active, but are things being bought around the clock to such an extent?

The solution: No Boolean AND

The solution (thanks, Michael!) lies in the fact that the segment “Made a Purchase” paired with “New Visitor” is not linked with an AND, i.e. we can have users here who have made a purchase at some point, but not necessarily on their first visit. This becomes clear when we compare our two segments with a segment that Michael has built:

Michael's segment, in contrast, is built with an AND condition:

Here we have sessions in which a user must be new and at the same time have at least one transaction. And we suddenly see that for 376 transactions we have 373 users, i.e. there must have been users with multiple transactions during their visit. In other words, the new visitors in our "Made a Purchase" segment did indeed make a purchase at some point, but 691 − 373 of them did not purchase during their first visit, only at a later time. The connection between the report and the segment could be formulated as follows: show me all users who had a transaction in any session and also made their first visit within the reporting period. It does not mean: show me all users who had a transaction in their first session.

In the future, I will take a closer look at how to interpret the connection of a segment with a report. Because that was, as I said, something nasty.

Data-driven personas with association rules


I have already talked about personas elsewhere, this article is about the data-driven generation of personas. I stick to the definition of the persona inventor Cooper and see a persona as a prototype for a group of users. This can also be interesting for marketing, because after all, you can use it to create a needs- and experience-oriented communication, for example on a website. Personas are not target groups, but more on that elsewhere.

How do you create a data-driven persona?

I haven’t found the perfect universal way for data-driven personas either. External data is not available for all topics, the original approach of 10-12 interviews is difficult, and internal data has the disadvantage that it only contains the data of those you already know, not those you might still want to reach. The truth lies in merging different data sources.

Data-driven persona meets web analytics

Web analytics data offers a lot of usage behavior, and depending on how a site is structured (for example, whether it is already geared to the different needs of different personas), you can understand to what extent the different user groups actually behave as expected. Or you try to generate data-driven personas from the usage behavior on the website itself, all under the restriction that users have to find the site first, so it is not certain that all relevant groups of people actually reach it, and important personas may be overlooked. This article is about a special case of this automated persona generation from web analytics data, which is exciting from an algorithmic point of view, and about the associated visualization. As is well known, everyone likes to report on successes; here is a case where a failure shows in which direction further work could go.

Experience from web mining is rarely associated with personas, although some research was done on this more than 10 years ago; for an overview, see, for example, Facca and Lanzi, Mining interesting knowledge from weblogs: a survey, from 2004 (published in 2005). Whereas in the past it was mainly weblogs (not blogs!) that were used, i.e. log files written by the server, today we have the opportunity to use much "better" data through Google Analytics & Co.

Reintroducing: Association Rules

But what exactly is better? In GA & Co. we can better distinguish people from bots (of which there are more than you might think), returning visitors are recognized more reliably, devices are identified, and so on. The question is whether you absolutely need this additional data for basic data-driven personas. Association rules, which I have already written about in a post about clustering with Google Analytics and R and which are also mentioned by Facca and Lanzi, can already identify basic groups of users. (I had mentioned in the other article that I once worked for one of the creators of the algorithm, Tomasz Imieliński, but I still have to tell one anecdote about him: in a meeting, he once said to me that you often think something is a low-hanging fruit, a quick win, but "Tom, often enough, the low hanging fruits are rotten". He has been right so many times.) The groups identify themselves through common behavior, for example the co-occurrence of page views. In R, this works wonderfully with the arules package and the apriori algorithm it contains.
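As a rough illustration of the idea, here is a minimal sketch with arules; the page paths are invented and only stand in for real page views per user.

library(arules)

# One "transaction" per user: the set of pages they have viewed (invented paths)
visits <- list(
  c("/", "/kurse/"),
  c("/", "/kurse/", "/publikationen/"),
  c("/scalable-capital-teil-1/", "/scalable-capital-teil-2/"),
  c("/scalable-capital-teil-1/", "/scalable-capital-teil-3/")
)

trans <- as(visits, "transactions")

rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.01, minlen = 2))

inspect(sort(rules, by = "lift"))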

Data-driven personas with Google Analytics & Co.

As already mentioned in the earlier article: a standard installation of Google Analytics is not sufficient (it never is anyway). Either you have the 360 version or you "hack" the free version ("hack" in the sense of "tinkering", not "being a criminal") and pull the data via the API. With Adobe Analytics, the data can be pulled from the data warehouse or via an API. Simply using Google Analytics out of the box and deriving personas from it is therefore not possible with this approach. You also have to think about which GA field is best used alongside the Client ID to form the transactions; this can vary greatly from website to website. And if you want to be really thorough, a page view alone may not be enough of a signal.
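To make the "pull via API and form transactions" step a bit more concrete, here is a hedged sketch with googleAnalyticsR and arules; the view ID, the date range and the assumption that the Client ID sits in custom dimension 1 are mine, not a given.

library(googleAnalyticsR)
library(arules)

# Assumption: the Client ID is written into custom dimension 1
hits <- google_analytics(my_view_id,
                         date_range = c("2018-01-01", "2018-03-31"),
                         metrics    = "pageviews",
                         dimensions = c("dimension1", "pagePathLevel1"),
                         max        = -1)

# One transaction per client: the set of pagePathLevel1 values they have seen
baskets <- lapply(split(hits$pagePathLevel1, hits$dimension1), unique)
trans   <- as(baskets, "transactions")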

However, this post is first of all about the visualization and about the limitations the apriori approach has for the automated generation of data-driven personas. For the visualization, I work with the package arulesViz. The resulting graphics are not easy to interpret, as I have experienced at the HAW, but also with colleagues. Below we see the visualization of association rules obtained from the data of this site, using the GA dimension pagePathLevel1 (which in my case is unfortunately also just the article title). One thing stands out here: I can actually only identify two groups, and that's pretty poor.

What exactly do we see here? We see that users who are on the homepage also go to the Courses section and vice versa. The lift is high here, the support less so. And then we see users moving between my four articles about Scalable Capital, with roughly the same low lift but different levels of support. Lift is the factor by which the co-occurrence of two items exceeds their expected co-occurrence if they were independent of each other. Support is the relative frequency with which the items occur. Support was set to 0.01 when creating the association rules, as was confidence. For details, see my first article.
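The corresponding visualization can be produced along these lines; this is a sketch, assuming the transactions object from one of the snippets above and the thresholds mentioned in the text:

library(arules)
library(arulesViz)

rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.01, minlen = 2))

# Graph-based view: items are nodes, rules connect them,
# with support and lift encoded in size and colour
plot(rules, method = "graph")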

But why don’t I see any other pages here? My article about Google Trends is a very frequently read article, as is the one about the Thermomix or AirBnB. So it’s not because there aren’t more user groups. The disadvantage of this approach is simply that users have to have visited more than one page for a rule to arise here at all. And since some users come via a Google search and apparently have no interest in a second article, because their need for information may already be satisfied or because I don’t advertise it well enough, apparently only students and those interested in Scalable Capital can be identified here in these rules.

Ways out of the a priori dilemma?

So far, I’ve identified three ways to solve this dilemma, and all of them require extra work:

  • I test whether I can get users to view more than one page through a better relevant offer, for example with Google Optimize, and if successful, I get better data.
  • I use the a priori data only as a base and merge it with other data (also very nice, but I won’t cover it here)
  • I lower the support and confidence.

In my opinion, the first approach is the nicest, but it requires time and brains, and there is no guarantee that anything will come of it. The last approach is unpleasant, because we are then dealing with cases that occur less frequently and are therefore not necessarily reliable. With a support of 0.005, the visualization looks different:

But again I have the problem that the individual pages do not appear. It seems to be extremely rare that someone moves from the Google Trends article to another article, so lowering the support value didn't help. From experience, I can say that this problem appears more or less strongly on most sites I see, but it always appears in some form. The annoying thing is: if you can already read good personas out of the rules, you are tempted not to look at the rest, even though it could be very large in scope.

We also see another problem in the graphic: the users in the right-hand strand do not have to be the same from arrow to arrow. In other words, it is not guaranteed that visitors who look at photography pages and courses will also look at the publications, even if it looks that way in the visualization. Just because A and B co-occur and B and C co-occur does not mean that A and C do! To solve this, the visualization would need an additional marking on the association rules that makes such exclusions visible. That does not exist yet and would be a task for the future.

Result

The path via association rules is exciting for the creation of data-driven personas with Google Analytics or other web analytics tools. However, it will usually not be sufficient at the moment, because a) the problem of one-page visitors is not solved, b) the rules do not provide enough information about different groups that merely overlap, and c) they can only say something about the groups that already reach the site anyway. I'm currently working on a) and b) on the side, and I'm always happy about thoughts from outside.

Hacking Google Optimize: On Bayes, p-values, A/A tests and forgotten metrics


Google Optimize is one of my favorite tools because it allows anyone to quickly build A/B tests; in my courses, participants are often amazed at how quickly such a test can be online. Of course, the preparatory work, the clean creation of a hypothesis, is not done so quickly, but it is also no fun to wait months for a test to go live. I don’t want to go into more detail about the advantages of Google Optimize, but instead point out three subtleties that are not so obvious.

Use Google Optimize data in Google Analytics raw data

The Google Analytics API also allows access to the Google Optimize data that flows into Analytics, so the raw Analytics data of a Google Optimize test can be analyzed. This is especially interesting if something can't be used as a KPI in Optimize, if you forgot to set a KPI in Google Optimize, or if you want to analyze side effects. Some of this can also be done afterwards with segments, but hey, this is about hacking (in the sense of tinkering, not crime): you also do things because you can, not because they are always necessary.

The two important Optimize dimensions are called ga:experimentId and ga:experimentVariant, and there is now also a combination of the two called ga:experimentCombination. If you only run one test, it is sufficient to query only the dimension ga:experimentVariant; 0 is the original (the control group), after which it counts up per variant. If you have several tests running, simply look up the ID in the Google Optimize interface; it can be found in the right-hand column under Google Analytics. It is usually rather cryptic, as you can see in the picture.
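Such a query could look roughly like this with googleAnalyticsR; the view ID, date range and chosen metrics are placeholders:

library(googleAnalyticsR)

opt_data <- google_analytics(my_view_id,
                             date_range = c("2019-01-01", "2019-01-28"),
                             metrics    = c("sessions", "avgSessionDuration"),
                             dimensions = c("experimentCombination", "deviceCategory"))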

In my example, I have two experiments running, so I output the combination alongside three custom dimensions (Client ID, Hit Type and UNIX timestamp) and the page title (I cut off the Client ID a bit in the image, even though it is only a pseudonymous identifier). In the second picture, we see the two experiments and the respective variants in one field. In the test that starts with c-M, a student hypothesized that visitors to my site would view more pages and spend more time if the search box were higher up. I didn't believe it, but believing is not knowing, so we ran the test with the KPI session duration. I had forgotten to define the number of searches as a second KPI. Well, it's good that I have the raw data, even if I could of course also build a segment for it.

As we can also see in the screenshot, users are in two tests at the same time, as the other test should not affect the first one. Now, during the test period of 4 weeks, there were only 3 users on my site who searched for anything; one of them searched for the same query several times, another searched for two different terms. With such a small number of cases, we don't even need to think about significance. At times, it looked as if the variant with the search box at the top would actually win, but more on that in the last section. The question now is: why can the variant be better at all if hardly anyone searched? Or did the mere presence of the search box lead to a longer session duration? Very unlikely!

Let’s take a closer look…

Note that in the raw data there are now two entries for each hit of a user, one per test. Also, not every user will be in a test, even if 100% of the traffic is targeted, which can already be seen in Google Analytics. We can also check whether the random assignment to test and control groups has resulted in a reasonably even distribution of users (e.g. mobile versus desktop, etc.). Of course, this is also possible with the interface.

The first thing I notice when I pull the data from the API is that the values don't match those from the GUI, which is worrying at first. If I only look at users and sessions, the values match exactly. As soon as I add the experimentCombination dimension, the numbers no longer fit, and it's not because of the differences between API v3 and v4. It's not uncommon for data to mismatch, most often because of sampling, but that can't be the case here. Interestingly, the numbers within the GUI don't match either when I look at the data under Experiments and compare it with the audience dashboard. However, the figures from the API do agree with the data from the Experiments report. So be careful when forming segments!

If I pull the data including my Client ID dimension, I get slightly fewer users. This is explained by the fact that not every user writes such an ID into the custom dimension: they probably have a Client ID (certainly, in fact, because otherwise GA could not identify them as an individual user), but in some cases the ID does not get written into the dimension, so that it contains e.g. "False".

Now let’s take a look at some data. For example, I’m interested in whether Optimize manages to get the same distribution across devices as I have on the site:

The majority of my traffic still takes place on the desktop. What does it look like in Optimize?

The distribution is definitely different. This is not surprising, because no Optimize experiment should be played out on AMP pages; so it is rather surprising why experiments on mobile devices have taken place here at all. And these cases have different values in relation to the target KPI, as you can also see in Analytics:

So we can’t draw conclusions about the whole page from the test results, but we also don’t know how big the effect of the unexpected mobile users is on the test result. To do this, we would have to redetermine the winner. But how is the winner determined in the first place? For example, we could use a chi-square test with the observation of the average SessionDuration:

chisq.test(x)

Pearson's Chi-squared test with Yates' continuity correction
data: x
X-squared = 1.5037, df = 1, p-value = 0.2201

In this case, p is above 0.05 (more on p in the next section). If the chi-square test is the correct test at all, it would show that the difference is not statistically significant. However, this is not the test that Google Optimize uses.
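For reference, such a chi-square test could be set up as follows; the counts are invented, since the post does not show how the matrix x was actually built.

# Rows: variant, columns: sessions below/above some session-duration threshold (invented counts)
x <- matrix(c(120, 80,
              100, 95),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("original", "variant"),
                            c("short", "long")))

chisq.test(x)  # applies Yates' continuity correction for 2x2 tables by default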

Bayesian Inference versus NHST

What exactly is happening under the hood? Let's take a look at how Google Optimize calculates whether a variant won or not. Unlike Adobe Test & Target, for example, or most significance calculators such as the one from Conversion (which doesn't even say what kind of test it uses), Google Optimize is not based on a t-test, Mann-Whitney U, or chi-square test, but on a Bayesian inference method. What does that mean?

Two different ideas collide here, that of the so-called frequentists (NHST stands for Null Hypothesis Significance Testing) and that of the Bayesian inference supporters. Some of these have been and still are intensively discussed in statistics, and I am not the right person to make a judgement here. But I try to shed light on these two approaches for non-statisticians.

Most A/B testing tools perform hypothesis tests. You have two groups of roughly the same size, one group is subjected to a "treatment", and then you observe whether the defined KPI changes "significantly" in the test group. For significance, the p-value is usually consulted; if it is below 0.05, or whatever the significance level has been defined as, the null hypothesis is rejected. Although you don't see anything about null hypotheses etc. in the tool interfaces, probably so as not to confuse users, that is the thinking behind them. For example, if you test whether a red button is clicked more often than a blue one, the null hypothesis would be that both are clicked equally often. The background is that a hypothesis cannot always be proven directly, but if the opposite of the hypothesis is rather unlikely, then the hypothesis itself can be assumed to be rather likely. The p-value is about nothing else.

Now the p-value is not a simple story; even scientists do not always manage to explain the p-value in an understandable way, and there is a debate about whether it makes sense at all. The p-value says nothing about how "true" a test result is. It only says how likely a result at least this extreme is if the null hypothesis is true. A p-value of 0.03 means that the probability of such a result under a true null hypothesis is 3%. Conversely, this does not say how "true" the alternative hypothesis is, and the inverse (97%) is not the probability that one variant beats the other.

Another common problem with A/B testing is that the sample size is not defined beforehand. The p-value can change over the course of an experiment, so results that were statistically significant can cease to be significant after a few days because the number of cases has changed. In addition, not only the significance is of interest, but also the power of a test, which only very few testing tools display.

But these are mainly problems with the tools, not with the frequentist approach most tools use. The "problem" with the frequentist approach is that the model does not change when new data comes in. For example, returning visitors may at some point learn a change on the page, so that an initial A/B test predicts a big impact while the actual effect is much smaller, because the frequentist approach simply counts the total number of conversions, not their development. In Bayesian inference, newly incoming data is taken into account to refine the model; decreasing conversion rates would influence the model. Data that exists "beforehand", so to speak, and influences the assumptions about the effect in an experiment is called the prior probability, or "priors" for short (I write priors because it's faster). The example in the Google Help Center (which is also often used elsewhere) is that if you misplace your cell phone in the house, Bayesian inference lets you use the knowledge that you tend to forget your phone in the bedroom in addition to "running after" a ring. Frequentists are not allowed to do that.
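To make the Bayesian idea a bit more tangible, here is a minimal sketch for two conversion rates; this is not Google's actual model, and the conversion counts are invented. With uninformative Beta(1, 1) priors, the posterior for each variant is again a Beta distribution, and the share of posterior draws in which B beats A is a "probability to beat baseline".

# Invented data: conversions and sessions per variant
conv_a <- 40; n_a <- 1000
conv_b <- 55; n_b <- 1000

set.seed(42)
post_a <- rbeta(100000, 1 + conv_a, 1 + n_a - conv_a)
post_b <- rbeta(100000, 1 + conv_b, 1 + n_b - conv_b)

mean(post_b > post_a)  # share of draws in which variant B is better than A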

And this is exactly where the problem arises: How do we know that the “priors” are relevant to our current question? Or, as it is said in the Optimizely blog:

The prior information you have today may not be equally applicable in the future.

The exciting question now is how Google obtains the priors in Optimize. The following statement is made about this:

Despite the nomenclature, however, priors don’t necessarily come from previous data; they’re simply used as logical inputs into our modeling.

Many of the priors we use are uninformative – in other words, they don’t affect the results much. We use uninformative priors for conversion rates, for example, because we don’t assume that we know how a new variant is going to perform before we’ve seen any data for it.

These two blog excerpts already make it clear how different the understanding of the usefulness of Bayesian inference is. At the same time, it is obvious that, as in any other tool, we lack transparency about how exactly the calculations were achieved. Another reason that if you want to be on the safe side, you need the raw data to carry out your own tests.

The Bayesian approach requires more computation time, which is probably why most tools don't use it. There is also criticism of Bayesian inference. The main problem, however, is that most users know far too little about what exactly the A/B testing tools do and how reliable the results are.

Why an A/A test can be salutary, too

Now the question arises as to why there was any difference in session duration at all when hardly anyone searched. This is where an A/A test can help. An A/A test? That's right, there is such a thing, and it helps to identify the variance of your own site. I once had a wonderful test in which I measured the AdSense click-through rate after a design change. The change was very successful. To be on the safe side, I tested again; this time the change had worse values. Now, of course, it may be that worse ads were simply served and the click-through rate therefore deteriorated. But it could also simply be that the page itself has a variance. And this can be found out by running an A/A test (or by using past raw data for such a test). In such a test, nothing is changed in the test variant, and then you see whether one of the main KPIs changes or not. Theoretically, nothing should change. But what if it does? Then we have identified a variance that lies in the page and the traffic itself, and that we should take into account in future tests.
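If you have the raw data, such a retrospective A/A test can be sketched quickly; the following assumes a data frame sessions with one row per session and a column duration, which is my naming, not something that comes out of GA like that.

set.seed(1)
diffs <- replicate(1000, {
  grp <- sample(c("A", "B"), nrow(sessions), replace = TRUE)
  mean(sessions$duration[grp == "A"]) - mean(sessions$duration[grp == "B"])
})

quantile(diffs, c(0.025, 0.975))  # how big a difference can arise from noise alone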

Result

  • Differences in test results can be caused by a pre-existing variance of a page. This is where an A/A test helps to get to know the variance.
  • The results may differ if different tools are used, as the tools have different approaches to how they determine the “winners”.
  • Raw data can help you run your own test statistics or verify test results, as the tools offer little transparency about how they arrived at their results. For example, as in my case, it can turn out that the test was not served evenly and the results are therefore not clearly usable.
  • The raw data is sometimes very different from the values in the GUI, which cannot be explained.
  • The p-value is only part of the truth and is often misunderstood.
  • With frequentist approaches, you should think about the required sample size before starting an A/B test.
  • A fool with a tool is still a fool.

How b4p and Statista create alternative “facts”


Firstly: I am not a fully trained statistician or market researcher. Although I passionately deal with numbers and also immerse myself in statistics books in my free time, the more you know, the more you know what you don't know. I don't claim to have all the wisdom either, and I'm always grateful when experts provide feedback. However, I believe that any reasonably clear-thinking person can understand, even without a basic statistics course, when data has not been collected sensibly or false conclusions are being drawn.

Are print ads more persuasive than those on social media?

I became aware of this thesis through the postings of Andre Alpar and Karl Kratz. It comes from Best4Planning; here is the whole study, "Quality is more important than likes". Unlike the very esteemed colleagues Andre and Karl, I don't think best4planning is a satire site, but the conclusion b4p draws from the data is at least courageous. Leaving aside the hair-growth products in the classifieds of a newspaper, the typical "lose 10 kilos in 2 days" or "millionaires don't want you to see this video" ads on Facebook & Co. don't exactly speak for social. So at least subjectively, I would confirm the thesis. The barrier to entry for advertising on the internet is wonderfully low; in print, it costs a lot of money upfront. Whether automotive advertising for clean diesel in print is still perceived as credible today remains to be seen, but any idiot can advertise a product in online media for little money. Therefore, I would not dispute b4p's finding that advertising in print is considered more credible.

The situation is different with the interpretation. Is "credible" the same as "stimulates a purchase"? At first, that looks like a misinterpretation of the data. But in my opinion, the problem lies in the graphic, because credibility is cited as the source characteristic, even though there is also a separate characteristic "stimulates a purchase":

So it does make sense. I don't know how they arrived at 28.1%, since I understand print to include both daily newspapers and magazines etc., but maybe they simply calculated (34+22)/2 = 28 and thus got close to the number (internally they will probably have decimal places). However, the labelling in the graphic is unfortunate, because it makes you think (at least it made me think) that "trustworthy" is the characteristic that stimulates buying.

Let’s go one step further. The data is spread across all age groups, what does it look like if we divide by age group?

Wow. The demographic factor seems to have struck. If I haven't done everything wrong, then the statement holds that older people find advertising in print more trustworthy and more likely to prompt a purchase, while younger people are more likely to say that of social media. As Karl already commented on Facebook, 65% of the respondents are at least 40 years old and 49% at least 50, because under-14s are not surveyed. So we have a surplus of older people, who are also more likely to use print. Now, of course, the under-14s are not strongly represented in social media and not at all in print, but the age structure in Germany does the rest to ensure that the number may well be correct. We hardcore onliners won't like that, dear Andre, dear Karl, but outside our bubble there are actually people who don't have a telepathic connection to the net.

If you take a closer look: according to the graphic, 28.1% of those surveyed find advertising in print credible. 28.1%. That is less than a third. You could just as well say: yes, advertising in print is not credible for the majority of respondents, but social advertising is even less so!

Conclusion: the numbers are OK, the interpretation is difficult if you don't look at it in a differentiated way. The excitement around the statement is understandable, because it is true, but an average across the total population is still an average, and it is usually suboptimal because it says nothing about the distribution.

Is best4planning legit?

Now you can also produce a lot of nonsense with b4p, especially if you don't understand how the data is collected. But the same is true of Google Trends, or of Similar Web. A fool with a tool is still a fool. Period. Of course, we could make it easy for ourselves and say that all market research is a lie anyway, but that wouldn't be fair, because there are enough market researchers who put a hell of a lot of effort into it.

But how does b4p's data come about? In fact, 30,121 people were surveyed for this study, which is far more people than are involved in the ratings panel (to which I once belonged), and it is a good basis. There were two waves of surveys, all of which can be read about here. I see no reason to doubt the data just because publishers are the clients, especially since you can also dig around in the data in ways that don't look particularly flattering for print.

The technical measurement took place with 10,231 participants in the GfK Crossmedia Link panel, who were also sent an apparently somewhat shortened form of the questionnaire. These data from the panel and the interviews were then "superimposed", so to speak. This is a common procedure, but it has consequences. Example: I build a target group in b4p of people over 50 who are self-employed in a construction company, etc.; n=30,121 then quickly becomes n=12 respondents. That's difficult, but fine. If I then look at media use, however, it may come from the roughly one third of respondents who are in the GfK panel, so strictly speaking I would have to divide my 12 by 3 and end up with 4. Unfortunately, the b4p site says nothing about such cases and offers no help. I wouldn't invest my advertising millions on the basis of such data if n is that small. Does this mean that b4p is dubious? No. The problem sits in front of the screen, because the small number in the upper left corner, which stands for the number of cases, is easily overlooked. And what would be the alternative? Nothing at all?

Where I actually do get a stomachache is with the questions, at least the ones that are published, and I have only found one: "Now think about the days from Monday to Saturday. In general, on how many of those 6 working days do you watch shows on TV between 6:00 a.m. and 9:00 a.m.? Please also remember that Saturday is often different from the other working days." The issue is social desirability: who with enough brain cells voluntarily admits to bombarding themselves with breakfast TV in the morning? But I hope that such questions are backed up by further control questions.

Conclusion: b4p is a good place to go for many concerns. It is important to back up the data with other data sources and common sense.

What is the age distribution of Germans on Pinterest?

Are there alternatives to b4p? Few. Some swear by Statista, but the unreflected use of surveys kills a statistics unicorn there just as it does with b4p (that's the reason why you don't see unicorns anymore). Statista is of course an extremely cool site with great data, but it's worth taking a look at the details of who actually conducted a study. Take, for example, this study from February 2018, which tells us that 14% of people aged 60 and over are on Pinterest, as are 14% of 50-59 year olds, with n for both groups at around 600. Sounds great for Pinterest. Unfortunately, you have to pay for Statista if you want the source, but someone kindly did that for me. The study comes from Faktenkontor; the survey was conducted by Toluna. Toluna? What is that? A website where you get rewards for taking part in surveys. Hmph. So the respondents aged 60 and over were not selected at random; the survey simply took people who are already online anyway and who, on top of that, know this portal, etc. It's just a shame that, according to destatis, only 55% of the over-65s are on the internet at all. So an already online-savvy group was surveyed, whose probability of being on Pinterest is correspondingly higher. I would at least be very cautious about recommending Pinterest as a channel for reaching pensioners and using this figure of 14% anywhere.

Conclusion: Take a close look at how the data was collected.

It’s not the data that is the problem, it’s you

The problem is not the data, as long as you understand where it comes from and how it was collected. But sometimes you don't want to know that too precisely, because you have an opinion and prove it by choosing only the data that confirms it (and I'm not exempt from that). This is called confirmation bias; I have written about it elsewhere. Or you don't have the time to check data. Or sometimes not the will. You see a single number, you have your own opinion, and off you go. Sometimes I wish for more differentiation, just by the way. But simple answers are always easier to communicate.

By the way, a nice book I’m reading right now: Thomas Bausch – Random Sampling Methods in Market Research. It’s available used for less than 5€ at Amazon.

Sistrix traffic vs. Google AdWords keyword planner


If you read along here often, you know that Sistrix is one of my absolute favorite tools (I'll brazenly link to it as the best SEO tool), if only because of the lean API, the absolutely lovable Johannes with his really clever blog posts, and the calm with which the Toolbox convinces again and again. Of course, all other tools are great too, but Sistrix is something like my first great tool love, which you can't and don't want to banish from your SEO memory. And even if the following data might scratch the paint, it didn't put a real dent in my preference for Sistrix.

What problem am I trying to solve?

But enough of the adulation. What is this about? As already described in the post about keywordtools.io and the inaccuracies of the Google AdWords Keyword Planner data mentioned there in passing, it is a challenge to get reliable data about the search volume of keywords. And if you still believe that Google Trends provides absolute numbers, well... For this purpose, Sistrix offers a traffic index from 0 to 100, which is calculated on the basis of various data sources and is supposed to be more accurate. But how accurate are these numbers? Along the way, I also want to show why box plots are a wonderful way to visualize such data.

The database and first plots with data from Sistrix and Google

The database here consists of 4,491 search queries from a sample for which I have both the Sistrix data and the Google AdWords Keyword Planner data. By the way, it's not the first sample I've pulled, and the data looks roughly the same everywhere, so it's not down to my sample. Let's first look at the raw data:

As we can see, you could draw a curve into this plot, but the relation doesn’t seem to be linear. But maybe we only have a distorted picture here because of the outlier? Let’s take a look at the plot without the giant outlier:

Maybe we still have too many outliers here, let’s just take those under a search volume of 100,000 per month:

In fact, we see a tendency to rise towards the upper right, though not a clear line (I didn't run a regression analysis). But we also see that at a traffic value of 5 there are search volumes that exceed those at the index values 10, 15, 20, 25 and 30, and even at 50, before the curve becomes visible again:

The median ignores the outliers within the smaller values:

So if we look at the medians, we see a plausible trend at least for the higher values, with the exception of the values for the Sistrix traffic index of 65 and 70. However, the variation around these values differs greatly, as plotting the standard deviation for each Sistrix traffic value shows:

We don't see a pattern in the spread. It is not the case that the dispersion increases with a higher index value (which is what you would expect); in fact, it is already higher at the index value of 5 than at 10, etc. We see the highest dispersion at the value of 60.
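For the record, the medians and standard deviations above can be computed along these lines; keywords, traffic and volume are assumed names, not the objects from my actual script.

# Assumes a data frame 'keywords' with columns 'traffic' (Sistrix index) and 'volume' (Keyword Planner)
med <- aggregate(volume ~ traffic, data = keywords, FUN = median)
sds <- aggregate(volume ~ traffic, data = keywords, FUN = sd)

plot(med$traffic, med$volume, type = "b",
     xlab = "Sistrix traffic index", ylab = "Median search volume")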

All-in-one: box plots

Because box plots are simply a wonderful thing, I'll add one here as well:

Here the axes are swapped (because it was not easy to read with the Sistrix data on the x-axis). The box shows where the middle 50% of the data lies: with a search volume of 390, for example, 50% of the data lies between the Sistrix values of 5 and 25, and the median, indicated by the line in the box, is 15. The sizes of the boxes increase at the beginning, then vary again, which points to differing dispersion. At some data points we see small circles, which R has flagged as outliers; we see these especially at the low search volumes. Almost everything we plotted above is visualized here in a single plot. Box plots are simply wonderful.
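A sketch of such a box plot, again with the assumed keywords data frame from above:

boxplot(traffic ~ volume, data = keywords,
        xlab = "Keyword Planner search volume", ylab = "Sistrix traffic index",
        varwidth = TRUE)  # box width proportional to the square root of the number of cases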

What do I do with this data now?

Does this mean that the traffic data in Sistrix is unusable? No, it doesn't. As described in the introduction, the Keyword Planner data is not always correct either, so nothing is known for certain. If you regard the Keyword Planner data as the ultimate benchmark, you won't be satisfied with the Sistrix data. It would be helpful if there were more transparency about where exactly the data comes from. Obviously, connected Search Console data would be very helpful, as it shows real impressions. My recommendation is to look at several data sources and to examine the overlaps and the deviations separately. This is unsatisfactory, as it is not automatic. But "a fool with a tool is still a fool".

Comments (since February 2020 the comment function has been removed from my blog):

Hanns says

  1. May 2018 at 21:18 Hello, thank you very much for the interesting analysis. Have you ever tried the new traffic numbers in the SISTRIX Toolbox? This also gives you absolute numbers and not index values. To do this, simply activate the new SERP view in the SISTRIX Labs. Information can be found here (https://www.sistrix.de/news/nur-6-prozent-aller-google-klicks-gehen-auf-adwords-anzeigen/) and here (https://www.sistrix.de/changelog/listen-funktion-jetzt-mit-traffic-und-organischen-klick-daten/)

Tom Alby says

  1. May 2018 at 10:58 I hadn’t actually seen that before. Thanks for the hint. But these are the ranges here, not the really absolute numbers. But still very cool.

Martin says

  1. April 2019 at 13:33 Moin, I read your post and tried to understand it, but I can't quite figure it out. Sistrix is cool, yes, but unfortunately I don't see how reliable the data is.

I actually don’t understand how this is supposed to work technically. How is Sistrix supposed to get the search queries that run through Google for each keyword? It’s not as if Google informs Sistrix briefly with every request.

The only thing I can think of is that they pull the data for each keyword from AdsPlanner. But… to present this as “own search volume” without any indication of where the data comes from, I would find grossly negligent.

Where could they still get data from?

Tom says

  1. April 2019 at 20:39 Hello Martin,

the answer is not 1 or 0, as the article also shows. You can't rely on the Ads Planner data either. Sistrix also gets data from customers who have linked their Search Console data, since that shows a site's impressions for a keyword. But of course this is not available for every keyword, and that's why inaccuracies arise.

BG

Tom

Data Science meets SEO, Part 5


This is the last part of this series on search engine optimization/SEO and data science, based on my presentation at SEO Campixx. I converted the data and the code into an HTML document via Knit, which makes my notebook, including the data, reproducible. There are also a few more analyses in the notebook, and I have documented everything in English, as this is not only interesting for Germans. So if you want to read all the results in one document (without the TF/IDF, WDF/IDF or stemming examples), please take a look at the Data Science & SEO notebook.

Let’s go back to the age of domains and their importance for SEO

First of all, an addendum: We lacked the age for some domains, and I have now received this data from another source. In our sample, most of the domains were older, and my concern was that the missing domains might be much younger and therefore the average age would be wrongly pulled down. Almost 20% of the domains were missing.

In fact, the missing domains are younger. While the median for our incomplete data set was 2001, it is 2011 for the missing domains. If you merge the data, however, the median is still 2001; only the mean has moved from 2002 to 2003. So the number of missing values was not high enough for this opposing trend to have a major impact. Of course, one could argue that this other source simply has completely different numbers, but that could not be confirmed in a sample of domains for which an age was already available. And if you look at the plot of the relationship between position on the search results page and the age of a domain, we haven't found anything new:

Box plots are a wonderful thing, because they show an incredible amount about the data at a glance. The box shows where 50% of the data lies, the thick line in the middle is the median, and the width of the box is proportional to the square root of the sample size. Even after several beers, there is no pattern to be seen here, except that the boxes are all at about the same height. Google had already said that the age of a domain does not matter.

Longer text = better position?

Another myth, and the great thing about this one is that we can clear it up relatively easily, because we can crawl the data ourselves. By the way, R is not so great for crawling; there is the rvest package, but if you really only want the content, nothing comes close to Python's Beautiful Soup. Conveniently, you can also run Python in RStudio notebooks, so only the actual text is used as text here, not the source code. However, navigation elements and footers are included, although we can assume that Google is able to extract the actual content. The following plot shows the relationship between content length and position:

As we can see, we don't see anything, except for an interesting outlier with more than 140,000 words in a single document (http://katalog.premio-tuning.de/), which ranked 3rd for the keyword "tuning catalogs". Otherwise, no correlation can be observed, so a general statement such as "more text = better position" cannot be derived. The median word count is 273, the mean 690. Just a reminder: we are looking at the top 10 here. I would actually be very interested in how the colleagues at Backlinko arrived at 1,890 words for the average first-place document. They looked at far more search results (what does "1 million search results" mean? Exactly that, i.e. about 100,000 search results pages, i.e. the results for about 100,000 search queries?), but they do not reveal which average they used. As you can see from my numbers, there is a big difference between the median and the mean, i.e. the arithmetic mean that most people call the average. It's not for nothing that I always say that the average is the enemy of statistics. Or maybe texts are simply longer in the USA? But since neither the numbers nor the methods are made available to us... well, at some point I learned that you have to publish both the data and the software used for the evaluation along with your results so that everything is really reproducible.
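For those who would rather stay in R than use Beautiful Soup, a rough rvest-based word count could look like this; it is an assumption on my part, not the code I actually used:

library(rvest)

page  <- read_html("https://example.com/some-article/")
text  <- html_text2(html_elements(page, "body"))  # visible text, incl. navigation and footer
words <- length(strsplit(text, "\\s+")[[1]])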

Is there really nothing to see at all?

In this final part, I have added more signals, including TF/IDF and WDF/IDF. And as you can see in the correlation matrix, there is no correlation anywhere across the whole keyword set. In the last part, however, we had already seen that this does not hold for every individual keyword: in the histogram of the correlation coefficients, we saw both positive and negative correlations, but no p-values. If you only look at the correlation coefficients where p < 0.05, the picture looks different again:

So we have keywords where the backlinks matter, and keywords where the other signals matter. If there is one conclusion we can draw from this keyword set, it is that there is no one-size-fits-all rule. As already stated in the last part, we need the above correlation matrix for each individual keyword. And that is exactly what makes this exciting, because we can look at how the ranking signals behave for each keyword, or perhaps for each topic.
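Such a per-keyword check could be sketched as follows; the data frame serps and its columns keyword, position and backlinks are assumed names, not the ones from my notebook.

cor_by_keyword <- lapply(split(serps, serps$keyword), function(d) {
  ct <- cor.test(d$position, d$backlinks, method = "spearman")
  data.frame(keyword = d$keyword[1], rho = unname(ct$estimate), p = ct$p.value)
})
cor_by_keyword <- do.call(rbind, cor_by_keyword)

subset(cor_by_keyword, p < 0.05)  # keep only keywords with a "significant" correlation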

And so you can see for the keyword "player update" (hash 002849692a74103fa4f867b43ac3b088 in the data in the notebook) that some signals are more prominent; see the figure on the left. Can you now be sure that you know exactly how the ranking works for this keyword? No, you can't (especially since we haven't calculated the p-values here yet). But if we look at several keywords from the same "region" (i.e. with similar values for this signal), then there could actually be something to it.

What about WDF/IDF now?

Unfortunately, nothing here either. And that was probably the biggest point of contention at SEO Campixx. In this example, I only use the exact match, i.e. I look in the text for exactly the keyword that was entered. Of course, we could go further and also match the individual components of the keywords, but to reduce complexity, let's stick with the exact match.

There is no clear pattern here, and there are no correlations. Very few observations even manage a p-value below 0.05 together with a correlation coefficient of more than 0.1. With this keyword set, it cannot be shown that WDF/IDF contributes anything, at least not for exact match. Neither does TF/IDF. I didn't even bother looking at keyword density.

Reporting Data

The last part of my presentation from the SEO Campixx was a short summary of my series of articles about SEO reporting with R and AWS (especially the part about action-relevant analyses and reporting).

Result

Once again the most important points:

  • My sample is quite small, so it may not be representative of the population of all searches. The claim is therefore not that what I have written here really applies to everything; the goal is to show how to approach the topic from a data science perspective. For some questions, however, the data situation is arguably sufficient, such as the relationship between document length and position.
  • My statements apply to German search results. The average document length may be different elsewhere, but I doubt it.
  • The data used for the calculations is not necessarily reliable. The backlink data is most likely incomplete, and what Google & Co. make of text is not completely transparent either. However, most tools out there don't even use standard procedures like stemming, so while it is certainly exciting to work with such WDF/IDF tools, it is not necessarily what actually changes everything.
  • The typical SEO statements cannot be proven for all keywords with the help of this sample, but this should not come as a surprise, because the ranking algorithm is dynamic. This means:
    • Speed is not a ranking factor, at most as a hygiene factor, and even that we can’t prove here
    • HTTPS is not yet a ranking factor.
    • Surprisingly, backlinks do not always correlate, but this may be due to the data basis
    • We have to look at what the ranking signals look like for each keyword.

The emotional reactions of some colleagues are not incomprehensible, because after all, a lot of money is paid for some tools (there were also tool operators in my presentation, one of whom let himself be carried away into remarking that you can tell I haven't worked as an SEO for a long time). It's a bit like going to a Christian and saying that his Jesus, unfortunately, never existed. That is not what I said. I have only said that I cannot reproduce the effect of common practices on the basis of my data set. But many SEOs whom I appreciate very much have told me that, for example, WDF/IDF works for them. In medicine it is said "he who heals is right", and at the end of the day it is the result that counts, even if it has been proven that homeopathy does not help. And perhaps the good results of these SEOs only come about because they also do many other things right, but then attribute them to WDF/IDF.

But what interests me as a data person is reproducibility. In which cases does WDF/IDF work and in which does it not? I would like to add that I have no commercial interest in calling any approach good or bad, because I don't sell a tool (let's see, maybe I'll build one someday) and I don't earn my money with SEO. In other words: I pretty much don't give a s*** what comes out of this. The probability that I will succumb to confirmation bias because I am only looking for facts that support my opinion is extremely low. I'm only interested in the truth in a post-factual world. And unlike the Backlinko investigation, for example, I provide my data and code so that everyone can verify it. This is complexity, and many try to avoid complexity and look for simple answers. But there are no easy answers to difficult questions, even if that would be much more attractive. My recommendation: don't believe any statistics that don't make the data and methods comprehensible. I hope that all critics will also disclose their data and software. This is not about vanity.

The Donohue–Levitt hypothesis is a good example for me: the New York Police Department's zero-tolerance approach in the 1990s was praised for significantly reducing crime, and this is still a widespread opinion today. Donohue and Levitt examined the figures and came to a different conclusion, namely that this was a spurious correlation: in their analysis, the legalization of abortion meant that many potential young offenders were simply never born, which then became noticeable in the 1990s. Of course, this was attacked, then confirmed again, and then someone also found that the removal of lead from gasoline was responsible for the reduction in juvenile delinquency (the lead-crime hypothesis). However, these are more complex models. "More police truncheons equals less crime" is easier to understand and is therefore still defended (and maybe there is a bit of truth to it?). But here, too, those who find one model more sympathetic will mainly look at the data that confirms that opinion.

I could have investigated a lot more, but as I said, I do this on the side. I'm already keen to collect more data on the topic, but for now other mountains of data are waiting here. The next step would then be a larger data set and machine learning to identify patterns more precisely.

Data Science meets SEO, Part 4


Now another month has passed, and I still haven't written everything down. However, this is also because I have acquired even more data in the last few weeks, so that I now have a data set that I can share and that is not customer-specific.

Why does data validation and cleansing take so long?

80% of the time is spent on validating and cleaning the data, according to the rule of thumb, and I would add one more point, namely transforming the data. Data is rarely available in such a way that it can be used immediately.

But one thing at a time. For this part, I wanted to add backlink data as well as crawl data, but Google's backlink data is only available for your own domains, and if you use tools like Sistrix, the API queries cost money or credits. Taking Sistrix as an example, you pay 25 credits for a backlink query (links.overview), so with 50,000 credits per week you can query the links for 2,000 URLs. However, I can only use the credits that are left over from the other tools at the end of the week, so I would need more than 7 weeks for the 14,099 unique hosts that I generated in the last part from the 5,000 searches. By then I will have 1,000 other projects on my plate and will have forgotten what code I wrote here, so I took a sample based on 500 searches randomly drawn from my 5,000 searches. Unfortunately, the ratio of unique hosts to all URLs was not as favorable here as in the overall set: 2,597 unique hosts still had to be queried.
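As a small worked example, the credit arithmetic from the paragraph above in R (the numbers are taken directly from the text):

credits_per_query <- 25     # Sistrix credits per links.overview call
credits_per_week  <- 50000  # weekly credit budget
queries_per_week  <- credits_per_week / credits_per_query  # 2,000 host queries per week

14099 / queries_per_week  # ~7.05 weeks for all unique hosts of the 5,000 searches
2597  / queries_per_week  # ~1.3 weeks for the 2,597 hosts of the 500-search sample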

Unfortunately, the Sistrix API also threw a spanner in the works here, because for over 250 URLs I got answers that my script had not handled properly, e.g.

{"method":[["links.overview"]], "answer":[{"total":[{"num":903634.75}], "hosts":[{"num":21491.504628108}], "domains":[{"num":16439.602383232}], "networks":[{"num":5979.5586911669}], "class_c":[{"num":9905.3625179945}]}], "credits":[{"used":25}]}

My script had expected integer values (fractions of a backlink don't exist, in my opinion) and simply wrote nothing at all to the dataframe whenever Sistrix returned a number that was not an integer. But even if it had caught that, the number I see here has nothing to do with the number I see in the web interface, although there are strange numbers there from time to time as well (see screenshot): is that 197,520 backlinks or 19,752,000? Please don't get me wrong, Sistrix is one of my favorite tools, but such things drive me crazy, and R does not make it easy either. There was no way around it: I had to look through the data first and add some of it manually (!!). And how fiddly R can sometimes be becomes apparent when you want to add newly queried values to a column without overwriting the existing values with NA. My inelegant solution to the transformation (which took me 2 hours):

# left join: keep every row of the sample, add the newly queried backlink data where available
test <- merge(sample_2018_04_02, backlinks_2, by = "host", all.x = TRUE)
# where the new query (total.y) returned NA, fall back to the existing value (total.x)
test <- cbind(test, "backlinks_raw" = with(test, ifelse(is.na(total.y), total.x, total.y)))
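A somewhat more compact alternative, just a sketch that assumes the same dataframes and column names, would be dplyr's left_join() together with coalesce():

library(dplyr)

test <- sample_2018_04_02 %>%
  left_join(backlinks_2, by = "host") %>%             # shared column names get the suffixes .x and .y
  mutate(backlinks_raw = coalesce(total.y, total.x))  # take total.y unless it is NA, then total.x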

We had already seen last time that we are missing age data for some domains, and on top of that there is the official statement from Google that the age of a domain has no influence on the ranking. However, the old domains were in the majority, so one could possibly assume that newer domains have less of a chance of getting into the index or into the top 10 at all, even if age doesn't matter for the respective position within the top 10. We have not made this claim explicitly, though, because it could be that the missing age values in my data set are exactly the younger domains. That remains to be found out, but no longer as part of this series. Then I might as well examine the top 100.

In summary: It always looks very simple, but collecting, transforming and cleaning the data simply costs a lot of time and energy.

So what can we do with this data?

First of all, let’s see if we can see something just by plotting the individual variables in relation to each other. The variables we have here now are:

  • Position
  • https yes/no
  • year (age of the domain)
  • Speed
  • IPs
  • SearchVolume (misspelled as "SwarchVolume" in the data set)
  • Number of backlinks for the host
  • Number of backlinks for the host, logarithmized
  • Number of results for a search query

There are still a few variables missing, but let's start with these. As we can see, we see next to nothing. So let's take a look at the bare numbers again:
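Both the plot matrix and the table of numbers are screenshots in the original post; a minimal sketch of how they could be produced, assuming a purely numeric dataframe called dataset containing the variables listed above:

pairs(dataset)  # plot every variable against every other variable
round(cor(dataset, use = "pairwise.complete.obs"), 2)  # the "bare numbers": the correlation matrix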

Here we see a little more: for example, a (sometimes very weak) correlation between http and backlinks_log, between year and backlinks_log, between speed and year, between backlinks_raw and ip, and so on. But why logarithmize the backlinks at all? The following example illustrates this:

If we look at the distribution of backlink frequencies in the histogram, we see one high bar on the far left and not much else. No wonder, because the search results include hosts like YouTube with a nine-digit number of backlinks, while most hosts have far fewer. If we logarithmize instead, i.e. "compress" the scale a bit, the histogram looks quite different:
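A minimal sketch of the two histograms, again assuming the dataset dataframe; the backlinks_log column in the variable list was presumably derived in a similar way, though the exact log base is not stated:

hist(dataset$backlinks_raw)             # raw counts: one tall bar on the left, a very long tail
hist(log10(dataset$backlinks_raw + 1))  # logarithmized (+1 so hosts with 0 backlinks don't drop out)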

We see here that many hosts are somewhere in the middle, some stand out with few links (that's the bar at 0), and a few hosts have a huge number of backlinks. Another interesting question is whether the number of backlinks of the individual search hits is comparable across search queries. The answer is no, as the following histogram (also logarithmized) shows:

I calculated the average number of backlinks for each keyword and plotted it, logarithmized, on a histogram (without the logarithm, it would look like the unlogarithmized histogram above). As we can see, there are areas where the search results come from hosts with few backlinks, most search results sit in the middle, and for very few search results we have an enormous number of backlinks. Admittedly, the average is not a great statistic here. But at least we see that there are different "regions".
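A sketch of this aggregation with dplyr, assuming a dataframe (here called google_serps, as in Part 3) with one row per search result and the columns keyword and backlinks_raw:

library(dplyr)

keyword_backlinks <- google_serps %>%
  group_by(keyword) %>%
  summarise(mean_backlinks = mean(backlinks_raw, na.rm = TRUE))

hist(log10(keyword_backlinks$mean_backlinks + 1))  # logarithmized histogram of the per-keyword averages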

Now that we’ve clarified why certain data is logarithmized, let’s take a closer look at what the correlations look like, starting with age and backlinks:

If we squint a little, we can make out the impression of a line; it almost looks as if there were a correlation between the age of a domain and its backlinks. Let's test it:
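The call that produced the output below is not shown in the post; reconstructed from the "data:" line of the output, it would simply be:

cor.test(dataset$year, dataset$backlinks_log)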

Pearson’s product-moment correlation
data: dataset$year and dataset$backlinks_log
t = -24.146, df = 4286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3721183 -0.3194161
sample estimates:
cor
-0.3460401

That looks good. And it’s not completely surprising. After all, the longer a domain exists, the more time it has had to collect links. While we don’t know which way a correlation is headed, it’s unlikely that a domain will get older the more backlinks it gets.

Let’s take a look at that again for the combination of age and speed:

Pearson’s product-moment correlation
data: dataset$year and dataset$SPEED
t = 13.129, df = 4356, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1663720 0.2234958
sample estimates:
cor
0.1950994

It is interesting here that the correlation coefficient is positive, i.e. the older a domain is, the slower it is.

And what about backlinks and position? A good question, because as already discussed in the last part, one answer does not fit every keyword. Let's look at the correlation between backlinks and position per keyword and then throw the resulting correlation coefficients onto a histogram:
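A sketch of this per-keyword correlation, again assuming the google_serps dataframe with the columns keyword, position and backlinks_log:

library(dplyr)

cor_per_keyword <- google_serps %>%
  group_by(keyword) %>%
  summarise(r = cor(position, backlinks_log, use = "complete.obs"))

hist(cor_per_keyword$r)  # distribution of the correlation coefficients across keywords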

Clearly, there are some keywords here whose ranking hosts show at least a weak, if not moderate, correlation with the number of backlinks. This means that we would have to look at the composition of the ranking for each keyword individually. And since we already know that the ranking is dynamic, we can see that a little more clearly here.

Unfortunately, there is no relationship between the average number of backlinks of the pages ranking for a keyword and the strength of the correlation between backlinks and position for that keyword. The screenshot on the left shows that it is a very colorful mix.

What could be the reason for this? For one thing, I only have the backlink data for the hosts, not for the respective landing pages. That would of course be even better, and ideally I would then also look at the composition of the individual backlink factors. In view of my credit poverty, however, this is not possible at the moment. And here we have a typical problem in the field of data science: we know the data is out there, but we can't get to it. Nevertheless, this approach already offers enormous advantages: I can now look at the composition of the current ranking for each keyword individually and act accordingly. In the example on the left, I see that I need a lot of backlinks to the host to rank for "privacy", but in my dataset (not in the screenshot) my host needs few backlinks for searches like "poems early retirement". So we need exactly this correlation matrix for each keyword instead of an overall view as above.
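A sketch of such a per-keyword correlation matrix, using "privacy" as the example keyword; the column names are taken from the variable list and the cor.test calls elsewhere in this series, so treat them as assumptions:

# all results for one keyword, restricted to the numeric signals
privacy <- google_serps[google_serps$keyword == "privacy",
                        c("position", "secure2", "year", "SPEED", "backlinks_log")]
round(cor(privacy, use = "pairwise.complete.obs"), 2)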

In the [next part][10] we will get more data (we started with TF/IDF and WDF/IDF).

Comments (since February 2020 the comment function has been removed from my blog):

Steffen Blankenbach says

1 April 2018 at 15:34

Hello Tom, very good and exciting article. Once again, it becomes clear how much effort you have to put into getting insights, only to find out that there are many other influencing factors that could not be taken into account in the calculation. This will probably always remain a problem of data evaluation. Personally, I also assume that thanks to RankBrain, ranking factors even differ at the keyword level and therefore such a mass evaluation is not expedient for companies. A case study for a specific industry would be exciting. I look forward to more articles from you.

[1]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-02-um-12.09.07.png
[2]: http://tom.alby.de/wp-content/uploads/2018/04/matrix.png
[3]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-03-um-01.14.04.png
[4]: http://tom.alby.de/wp-content/uploads/2018/04/hist.png
[5]: http://tom.alby.de/wp-content/uploads/2018/04/hist_log.png
[6]: http://tom.alby.de/wp-content/uploads/2018/04/00000b.png
[7]: http://tom.alby.de/wp-content/uploads/2018/04/data-science-seo-age-backlinks.png
[8]: http://tom.alby.de/wp-content/uploads/2018/04/plot_zoom_png.png
[9]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-03-um-02.17.36.png
[10]: http://tom.alby.de/data-science-meets-seo-teil-5/

Data Science meets SEO, Part 3


The first two parts were about what data science is and why WDF/IDF values are very likely to have little to do with what happens under the hood at Google. In this part, we go one step further, because we look at whether there are correlations between ranking signals and the position. In the lecture, I had shown this using the example of a search query and, in view of the time available, I had dealt with it rather briefly. Here I can go into depth. However, we will first only look at each individual ranking signal in relation to the position, not the possible effect of the ranking signals on each other.

Risks and Side Effects of Data Science

Since my presentation caused some colleagues to gasp and prompted some "interesting" statements, one point probably got lost: I had expressly said, and I repeat it here, that I am not claiming that this data shows how the ranking actually works. Anyone who has ever dealt with lawyers or statisticians knows that they are reluctant to be pinned down to definitive statements. After all, we usually do not know the total population and therefore have to draw conclusions about it from a small sample; who would be crazy enough to let themselves be pinned down on that? Hence all the complexity with confidence levels, confidence intervals, etc.

The following statements refer to a sample of 5,000 search queries. That sounds like a lot, but we don’t know if these searches correspond to the total population of all searches. So the results are for the sample, and I’m always willing to repeat that for other searches if those searches are made available to me.

Other problems with this approach: We have access to a few ranking signals, but not all of them, and the few signals we do have are also partly inaccurate. Of the more than 200 ranking signals, we have:

  • Domain age: inaccurate, as the individual sources contradict each other and there is no pattern to be seen as to whether one source generally outputs younger values than the other
  • Backlinks: here we have the values that are output from tools, and they usually don’t have a complete overview of all backlinks
  • Anchor Texts: Since we can’t assume that the backlinks are complete, we can’t expect to have all the anchor texts
  • Text Matches: We can identify exact matches, but we saw in the previous part that search engines don't limit themselves to exact matches
  • Text length
  • Speed
  • HTTPS versus HTTP
  • Readability

So we lack signals such as

  • User Signals
  • Quality Score
  • Domain History
  • RankBrain
  • and much more

In summary, we only have a fraction of the data, and some of it is not even accurate. And my calculations are also based on search queries that we don’t know if they are representative.

A word about correlations

My favorite example of the fatal belief in correlations is the statistical correlation between Microsoft Internet Explorer's market share and the murder rate in the U.S. between 2006 and 2011. While it may be funny to claim a connection here (and this regularly draws laughs in lectures), the fact is that a statistical relationship, which is what we call a correlation, does not have to be a real causal connection. Correlation does not mean cause and effect. Worse still, in statistics we don't even know in which direction the statistical relationship runs: in this example, did Internet Explorer's market share lead to more murders, or did the murders lead to Internet Explorer being used afterwards to cover the perpetrators' tracks?

Of course, the connections are clear in some situations: If I spend more money on AdWords, then I may get more conversions. And if we examine the statistical relationship between ranking signals, then it is likely that more backlinks lead to a better position, even though of course a better position can ensure that more website owners find interesting content and link to it… But we don’t know whether, for example, the individual signals can influence each other, and we usually only look at the top 10. As described in the previous part, this is a kind of Jurassic Park, where we don’t have the whole picture.

A little descriptive statistics

Every analysis begins with a description of the data. For the 5,000 search queries, we got more than 50,000 search results, but we take out the results that point to Google services for the time being, because we don’t know whether they are ranked according to the normal ranking factors. There are 48,837 URLs left, spread across 14,099 hosts.

Lately I've been working more and more with dplyr, which allows piping as under Linux/UNIX; this functionality was borrowed from the magrittr package, and it makes the code wonderfully clear. In the example in the figure, after the commented-out line, I pipe my google_serps dataframe to group_by(), which groups it by host; I pipe the result to summarise(), which calculates the frequencies per host; and finally I pipe that to arrange(), which sorts the result in descending order. I write the result to hostFrequency, and because I want to see it immediately in my R notebook, I wrap the whole expression in parentheses so that the result is not only written to the hostFrequency dataframe but also printed right away. Every time I do something like this with dplyr, I'm a little happy. And if you have really large data sets, you do the same with sparklyr :).
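The figure is a screenshot in the original post; reconstructed from the description, the pipeline would look roughly like this:

library(dplyr)

# group the results by host, count how often each host ranks, sort descending;
# the outer parentheses print hostFrequency immediately in the notebook
(hostFrequency <- google_serps %>%
    group_by(host) %>%
    summarise(n = n()) %>%
    arrange(desc(n)))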

But back to the topic: So we see here that few hosts rank very frequently, and that means conversely that many hosts rank only once. No surprise here.

To warm up: Speed as a ranking factor

The speed data for each host is very easy to get, because Google offers the PageSpeed Insights API for this, and kindly there is also an R package for it. With more than 10,000 hosts, the query takes a while, and you are not allowed to make more than 25,000 requests per day or more than 100 (?) requests per 100 seconds. I just let it run, and after a day my R session crashed and all the data was lost. Not pretty, but there is a workaround: write the dataframe to disk after each request.
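The post doesn't name the R package, so here is a minimal sketch against the PageSpeed Insights REST API using httr and jsonlite; the v5 endpoint, the hosts vector and the api_key variable are assumptions (v5 reports scores between 0 and 1, hence the factor of 100):

library(httr)
library(jsonlite)

for (host in hosts) {
  res   <- GET("https://www.googleapis.com/pagespeedonline/v5/runPagespeed",
               query = list(url = paste0("https://", host), key = api_key))
  json  <- fromJSON(content(res, "text", encoding = "UTF-8"))
  score <- 100 * json$lighthouseResult$categories$performance$score
  # write to disk after every single request so a crash doesn't lose everything
  write.table(data.frame(host = host, speed = score), "pagespeed.csv", sep = ",",
              append = file.exists("pagespeed.csv"),
              col.names = !file.exists("pagespeed.csv"), row.names = FALSE)
  Sys.sleep(1)  # stay well below the rate limit of ~100 requests per 100 seconds
}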

But let’s take a closer look at the data now. Here’s a histogram of the distribution of the speed values of 14,008 hosts (so I got a PageSpeed value for 99.4% of the hosts):

We see that most hosts make it over the 50 points, and summary gives us the following numbers:

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00   55.00  70.00 66.69   82.00 100.00

It's nice to see here how misleading the average can be. And now let's plot position versus speed:

As we see, we don’t see anything. Let’s check this again in more detail:

cor.test(google_serps$position, google_serps$SPEED)

Pearson's product-moment correlation

data: google_serps$position and google_serps$SPEED
t = -5.6294, df = 48675, p-value = 1.818e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.03438350 -0.01662789
sample estimates:
cor
-0.02550771

It looks like there is no correlation between PageSpeed and position within the top 10 (we don't know whether there is one in the top 100; it could be that places 11 to 100 have worse PageSpeed values). But it is also not unlikely that we are dealing with a hygiene factor: being fast gives you no advantage, but being slow gets you punished. It's a bit like taking a shower: if you have showered, no one notices, but if you haven't, everyone does. Or, in a nutshell (that's how I should have put it at the SEO Campixx): a slow website is like not having showered. However, we also see hosts that rank despite a dreadful speed score. But if you search for "ganter shoes", then http://ganter-shoes.com/de-at/ is probably the best result, even if the page takes 30 seconds to load for me.

We should also keep in mind that the PageSpeed API performs a real-time measurement… maybe we just caught a bad moment? You would actually have to measure the PageSpeed several times and create an average from it.

SSL as a ranking factor

What we also get very easily in terms of data is the distinction between whether a host uses https or not. While some see the use of secure protocols as a weighty ranking factor, more moderate voices like Sistrix see the use of SSL as a weak ranking factor. In this dataset, 70% of all URLs have an https. But does that also mean that these pages rank better?

We are dealing here with a special form of calculation, because we are trying to determine the relationship between a continuous variable (the position) and a dichotomous variable (SSL yes/no). We convert the two variants of the protocol into numbers, http becomes a 0, https becomes a 1 (see screenshot below in the secure and secure2 columns).
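A sketch of that recoding; the secure and secure2 column names come from the screenshot description, while the url column is an assumption:

# secure: the protocol as text, secure2: the same information recoded as 0/1
google_serps$secure  <- ifelse(grepl("^https://", google_serps$url), "https", "http")
google_serps$secure2 <- ifelse(google_serps$secure == "https", 1, 0)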

The point-biserial correlation coefficient is a special case of the Pearson correlation coefficient; normally in R we would simply type cor.test(x, y), but for this special form an additional package is loaded. In fact, the values of the two calculations hardly differ, and cor.test also gives me the p-value.
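The post doesn't say which package; one common option is ltm with its biserial.cor() function, so the following is only a sketch under that assumption:

library(ltm)

# point-biserial correlation: position (continuous) vs. https yes/no (dichotomous);
# the level argument picks which group counts as "1" and thus determines the sign
biserial.cor(google_serps$position, google_serps$secure2, level = 2)
# for comparison: plain Pearson correlation, which also gives a p-value
cor.test(google_serps$position, google_serps$secure2)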

As we can see, we see nothing or almost nothing: with a correlation coefficient of -0.045067, there is no sign that https has had a noteworthy impact on the ranking within the top 10. Does this mean that we should all do without https again? No, because it's better for the user. And as soon as browsers indicate even more clearly that a connection is not secure, users will say goodbye to a site even more quickly. Not to mention that we only looked at the top 10 here. It could be that places 11 to 1000 are mainly occupied by sites without SSL, and then the results could look different again.

Perhaps SSL as a ranking factor is also a reverse hygiene factor: Google would like to see SSL, but since some important pages may not yet have SSL, it does not weight the signal heavily. Just as you might want your dream partner to have showered, but if he or she suddenly stands in front of you, it doesn't matter because you're so in love. In the long run, of course, that does not go well. And that's how it will be with SSL.

Age of a domain as a ranking factor

Let's move on to the next ranking signal, the age of a domain, even though Google says that the age of a domain doesn't matter. Here we face a first challenge: how do I find out the age of a domain as automatically and, above all, as reliably as possible? Sistrix offers the age of a domain, but not that of a host (the difference is explained in the script for my SEO seminar at HAW). Still, Sistrix has the advantage that the API is very lean and fast. However, Sistrix does not find an age for 2,912 of the domains (not hosts); with 13,226 unique domains, that is 22% without a domain age. Jurassic Park sends its regards (if you don't understand this allusion, please read the second part about data science and SEO). Nevertheless, let's take a look at the distribution:

We see a slightly right-skewed distribution, i.e. a higher frequency on the left, where the older domains sit (no, I'm not reading it mirror-inverted, that's really what it's called). Younger domains seem to have less of a chance here. However, it is also possible that Sistrix simply does not have younger domains in its database and that the almost 3,000 missing domains would be more likely to sit on the right.

Can we at least see a correlation here? Let’s plot the data first:

Again, we don’t see anything, and if we let the correlation be calculated, then this is confirmed:

cor.test(google_serps$position, as.numeric(google_serps$year))

Pearson's product-moment correlation

data: google_serps$position and as.numeric(google_serps$year)
t = 1.1235, df = 44386, p-value = 0.2612
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.003970486 0.014634746
sample estimates:
cor
0.005332591

Not only is there no correlation, the result is not even statistically significant (the p-value is well above the usual threshold). However, the result should be taken with a grain of salt, because, as mentioned above, our selection may simply not be representative: probably not every domain had the same chance of getting into Sistrix's domain-age database. So we would have to supplement this data and possibly also verify the data from Sistrix (I haven't been able to find out where Sistrix gets it from); unfortunately, I have not been able to identify a pattern either, because sometimes one source shows older data, sometimes another. In principle, you would have to take all data sources and then always use the oldest date. Unfortunately, most sources are not so easy to scrape. So we not only have missing data, but also partly incorrect data. And this is not an atypical problem in the field of data science.

Intermediate thoughts

Since I'll soon have reached 2,000 words (and I know the correlation between word count and "read to the end of the article" for my blog), I'll look at the next ranking factors in the next blog post. Important: for supposedly very important ranking factors, we have found no evidence here that they actually have an influence. But that doesn't mean this is the final word:

  • We have some missing and incorrect data
  • We may not have a valid sample
  • And Google has dynamic ranking, which means that for some queries certain signals correlate with position, while for others they don't. The sledgehammer method we used here is certainly not the way to go.

Example:

cor.test(google_serps$position[google_serps$keyword == "acne vitamin a"], google_serps$secure2[google_serps$keyword == "acne vitamin a"])

Pearson's product-moment correlation

data: google_serps$position[google_serps$keyword == "acne vitamin a"] and google_serps$secure2[google_serps$keyword == "acne vitamin a"]
t = -4.6188, df = 8, p-value = 0.001713
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9645284 -0.4819678
sample estimates:
cor
-0.8528029

Even though the confidence interval here is very wide, it is still in the range of a medium to strong correlation for this query "acne vitamin a" (the correlation is negative because position numbers count upward from 1 to 10 and beyond, so a better ranking is a smaller number). It is therefore also important to identify the segments or "regions" in which certain signals have an effect. More on this in the next part about data science and SEO.