Data Science meets SEO, Part 3


The first two parts were about what data science is and why WDF/IDF values very likely have little to do with what happens under the hood at Google. In this part, we go one step further and look at whether there are correlations between ranking signals and position. In the talk, I showed this using the example of a single search query and, given the time available, dealt with it rather briefly; here I can go into more depth. However, we will first look only at each individual ranking signal in relation to position, not at the possible effects the ranking signals may have on each other.

Risks and Side Effects of Data Science

Since my presentation caused some colleagues to gasp and prompted some “interesting” statements, one point had probably been lost. I had expressly said, and I repeat it here, that I am not claiming that one can conclude from this data that the ranking actually works this way. Anyone who has ever dealt with lawyers or statisticians knows that they are reluctant to be pinned down to definitive statements. After all, we usually do not know the total population and therefore have to draw conclusions about it from a small sample; who would be so reckless as to let themselves be pinned down on that? Hence all the complexity with confidence levels, confidence intervals, etc.

The following statements refer to a sample of 5,000 search queries. That sounds like a lot, but we don’t know whether these queries are representative of the total population of all searches. So the results hold for this sample, and I’m happy to repeat the analysis for other search queries if they are made available to me.

Other problems with this approach: we have access to a few ranking signals, but not all of them, and even the few signals we do have are partly inaccurate. Of the more than 200 ranking signals, we have:

  • Domain age: inaccurate, as the individual sources contradict each other and there is no discernible pattern as to whether one source generally reports younger values than another
  • Backlinks: here we have the values reported by tools, which usually do not have a complete view of all backlinks
  • Anchor texts: since we cannot assume that the backlink data is complete, we cannot expect to have all the anchor texts either
  • Text matches: we can identify exact matches, but as we saw in the previous part, search engines don’t limit themselves to exact matches
  • Text length
  • Speed
  • HTTPS versus HTTP
  • Readability

So we lack signals such as

  • User Signals
  • Quality Score
  • Domain History
  • RankBrain
  • and much more

In summary, we have only a fraction of the data, and some of it is not even accurate. On top of that, my calculations are based on search queries whose representativeness we cannot verify.

A word about correlations

My favorite example of the fatal belief in correlations is the statistical correlation between Microsoft Internet Explorer’s market share and the murder rate in the U.S. between 2006 and 2011. While it may be funny to claim a connection here (it regularly draws laughs in talks), the fact is that a statistical relationship, which is what we call a correlation, does not have to be a real connection. Correlation does not imply cause and effect. Worse still, statistics alone do not even tell us in which direction the relationship runs: in this example, whether Internet Explorer’s market share led to more murders, or whether the murders led to Internet Explorer being used afterwards to cover the perpetrators’ tracks.

Of course, the connections are clear in some situations: if I spend more money on AdWords, I may get more conversions. And if we examine the statistical relationship between ranking signals and position, it is plausible that more backlinks lead to a better position, even though a better position can in turn ensure that more website owners find the content interesting and link to it… But we don’t know whether, for example, the individual signals influence each other, and we usually only look at the top 10. As described in the previous part, this is a kind of Jurassic Park, where we don’t have the whole picture.

A little descriptive statistics

Every analysis begins with a description of the data. For the 5,000 search queries, we got more than 50,000 search results, but we exclude, for the time being, the results that point to Google services, because we don’t know whether they are ranked according to the normal ranking factors. This leaves 48,837 URLs, spread across 14,099 hosts.

Lately I’ve been working more and more with dplyr, which allows piping as under Linux/UNIX; this functionality was borrowed from the magrittr package, and it makes code incredibly clear. In the example in the figure, after the commented-out line, I throw my google_serps dataframe to group_by(), which groups it by host; I throw the result of this step to summarise(), which calculates the frequencies per host; and finally I throw that to arrange(), which sorts the result in descending order. I write the result to hostFrequency, and because I want to see the result immediately in my R notebook, I put the whole expression in parentheses so that the result is not only written to the hostFrequency dataframe but also printed immediately. Every time I do something like this with dplyr, I’m a little happy. And if you have really large data sets, you do the same with sparklyr :).
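
For readers who don’t have the figure at hand, here is a minimal sketch of the pipeline described above; the column name host is taken from the screenshots, everything else follows the description:

library(dplyr)

# group by host, count the rows per host, sort descending;
# the surrounding parentheses assign AND print the result
(hostFrequency <- google_serps %>%
  group_by(host) %>%
  summarise(count = n()) %>%
  arrange(desc(count)))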

But back to the topic: we see here that a few hosts rank very frequently, which conversely means that many hosts rank only once. No surprise here.

To warm up: Speed as a ranking factor

The speed data for each host is very easy to get, because Google offers the PageSpeed Insights API, and kindly there is also an R package for it. With more than 10,000 hosts, the query takes a while, and you are not allowed to make more than 25,000 requests per day, nor more than 100 (?) requests per 100 seconds. I just let it run, and after one day my R session crashed and all the data was lost. Not pretty, but there’s a workaround: write the dataframe to disk after each request.
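
That workaround could look like the following minimal sketch. Note the assumptions: I call the REST endpoint directly via httr instead of the (unnamed) R package, the ruleGroups$SPEED$score field follows v4 of the PageSpeed Insights API (which returned scores from 0 to 100), and hosts, api_key, and speed.csv are placeholder names:

library(httr)

# fetch the PageSpeed score (0-100) for one host; returns NA on failure
get_speed <- function(host, api_key) {
  res <- GET("https://www.googleapis.com/pagespeedonline/v4/runPagespeed",
             query = list(url = paste0("http://", host), key = api_key))
  if (status_code(res) != 200) return(NA)
  score <- content(res, as = "parsed")$ruleGroups$SPEED$score
  if (is.null(score)) NA else score
}

for (host in hosts) {
  score <- get_speed(host, api_key)
  # append one row per request: a crash now loses at most one measurement
  write.table(data.frame(host = host, SPEED = score), "speed.csv",
              append = TRUE, sep = ",", row.names = FALSE, col.names = FALSE)
  Sys.sleep(1)  # stay under the roughly 100 requests per 100 seconds quota
}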

But let’s take a closer look at the data now. Here’s a histogram of the distribution of the speed values of 14,008 hosts (so I got a PageSpeed value for 99.4% of the hosts):

We see that most hosts make it above the 50-point mark, and summary() gives us the following numbers:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   55.00   70.00   66.69   82.00  100.00

It’s nice to see how misleading the average can be. And now let’s plot position versus speed:

As we see, we don’t see anything. Let’s check this again in more detail:

cor.test(google_serps$position, google_serps$SPEED)

	Pearson's product-moment correlation

data:  google_serps$position and google_serps$SPEED
t = -5.6294, df = 48675, p-value = 1.818e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03438350 -0.01662789
sample estimates:
        cor
-0.02550771

It looks like there is no meaningful correlation between PageSpeed and position in the top 10 (we don’t know whether there might be one in the top 100; it could be that positions 11 to 100 have worse PageSpeed values). But it is also not unlikely that we are dealing with a hygiene factor: if you rank well, you have no advantage, but if you are slow, you get punished. It’s a bit like showering: if you have showered, no one notices, but if you haven’t, it shows. Or, in a nutshell (that’s how I should have put it at SEO Campixx): a slow website is like not having showered. However, we also see hosts that rank despite a dreadful speed. But if you search for “ganter shoes”, then http://ganter-shoes.com/de-at/ is probably the best result, even if the page takes 30 seconds to load for me.

We should also keep in mind that the PageSpeed API performs a real-time measurement… maybe we just caught a bad moment? Really, you would have to measure the PageSpeed several times and average the results.
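
A quick sketch of what that could look like, reusing the hypothetical get_speed() from above; the median is more robust against a single bad moment than the mean:

# five measurements per host, spaced out to respect the quota
runs <- replicate(5, { Sys.sleep(2); get_speed("example.com", api_key) })
median(runs, na.rm = TRUE)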

SSL as a ranking factor

What we can also get very easily is the distinction between whether a host uses https or not. While some see the use of a secure protocol as a weighty ranking factor, more moderate voices such as Sistrix see SSL as a weak ranking factor. In this dataset, 70% of all URLs use https. But does that also mean that these pages rank better?

We are dealing here with a special form of calculation, because we are trying to determine the relationship between a continuous variable (the position) and a dichotomous variable (SSL yes/no). We convert the two variants of the protocol into numbers: http becomes 0, https becomes 1 (see the secure and secure2 columns in the screenshot below).
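
As a minimal sketch, assuming the URLs live in a url column (the column names url and secure2 follow the screenshots):

# 1 if the URL starts with https://, 0 otherwise
google_serps$secure2 <- ifelse(grepl("^https://", google_serps$url), 1, 0)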

Determining the point-biserial correlation coefficient is a special case of the Pearson correlation; normally in R we would just type cor.test(x,y), but for this special form an additional package is loaded. In fact, the values of the two tests hardly differ, and cor.test() also provides me with the p-value.
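
The post doesn’t name the package; one package that provides this test is ltm with its biserial.cor() function, so take the following as one possible sketch rather than the exact setup used here:

library(ltm)

# point-biserial correlation between position and the 0/1 SSL indicator;
# level = 2 treats the second level (https = 1) as the reference category
biserial.cor(google_serps$position, google_serps$secure2, level = 2)

# for comparison: plain Pearson, which also reports a p-value
cor.test(google_serps$position, google_serps$secure2)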

As we can see, we see nothing, or almost nothing: with a correlation coefficient of -0.045067, there is no sign that https had any meaningful impact on the ranking in the top 10. Does this mean we should all do without https again? No, because it’s better for the user. And as soon as browsers flag even more clearly that a connection is not secure, users will be quicker to abandon a site. Not to mention that we only looked at the top 10 here. It could be that positions 11 to 1000 are mainly occupied by sites without SSL; then the results could look different again.

Perhaps SSL is also a reverse hygiene factor as a ranking signal: Google would like to see SSL, but since some important pages may not yet have it, the signal is not given much weight. Just as you might want your dream partner to be freshly showered, but when he or she suddenly stands in front of you, it doesn’t matter because you’re so in love. In the long run, of course, that doesn’t go well. And that’s how it will be with SSL.

Age of a domain as a ranking factor

Let’s move on to the next ranking signal, the age of a domain, even though it is said that the age of a domain doesn’t matter. Here we face a first challenge: how do I find out the age of a domain as automatically and, above all, as reliably as possible? Sistrix offers the age of a domain, but not that of a host (the difference is explained in the script for my SEO seminar at HAW). Nevertheless, Sistrix has the advantage that its API is very lean and fast. However, Sistrix does not find an age for 2,912 of the domains (not hosts); with 13,226 unique domains, that is 22% without a domain age. Jurassic Park sends its regards (if you don’t get this allusion, please read the second part on data science and SEO). Nevertheless, let’s take a look at the distribution:

We see a slightly right-skewed distribution, i.e. the bulk of the frequency sits on the left, where the older domains are (no, I’m not reading it mirror-inverted; that’s really what it’s called). Younger domains seem to have fewer chances here. However, it is also possible that Sistrix simply doesn’t have younger domains in its database, and that the almost 3,000 missing domains would be more likely to sit on the right.

Can we at least see a correlation here? Let’s plot the data first:

Again, we don’t see anything, and if we let the correlation be calculated, then this is confirmed:

cor.test(google_serps$position, as.numeric(google_serps$year))

	Pearson's product-moment correlation

data:  google_serps$position and as.numeric(google_serps$year)
t = 1.1235, df = 44386, p-value = 0.2612
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.003970486  0.014634746
sample estimates:
        cor
0.005332591

Not only is there no correlation, the result is not even statistically significant (the p-value of 0.26 is well above the usual 0.05 threshold). The result should be taken with a grain of salt anyway, because, as mentioned above, our selection may simply not be representative: probably not every domain had the same chance of getting into Sistrix’s domain-age database. So we would have to supplement this data and possibly also verify the data from Sistrix (I haven’t been able to find out where Sistrix gets it from); unfortunately, I have not been able to identify a pattern either, because sometimes one source shows older data, sometimes another. In principle, you would have to take all data sources and then always take the oldest date. Unfortunately, most sources are not so easy to scrape. So we have not only missing but also partly incorrect data, and that is not an atypical problem in data science.
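
A sketch of that “always take the oldest date” idea; the source dataframes (sistrix_age, whois_age, archive_age) and their columns are entirely hypothetical stand-ins for whatever sources one manages to scrape:

library(dplyr)

# stack all sources and keep the oldest known year per domain
domain_age <- bind_rows(sistrix_age, whois_age, archive_age) %>%
  group_by(domain) %>%
  summarise(year = min(year, na.rm = TRUE))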

Intermediate thoughts

Since I’ll soon have hit the 2,000-word mark (and I know the correlation between word count and “read to the end of the article” for my blog), I’ll look at the next ranking factors in the next blog post. Important: what we have seen here is that we find no evidence that supposedly very important ranking factors actually have an influence. But that doesn’t mean this is really true:

  • We have some missing and incorrect data
  • We may not have a valid sample
  • And Google ranks dynamically, which means that for some queries certain signals correlate with position while others don’t. The sledgehammer approach we used here is certainly not the way to go.

Example:

cor.test(google_serps$position[google_serps$keyword == "acne vitamin a"], google_serps$secure2[google_serps$keyword == "acne vitamin a"])

	Pearson's product-moment correlation

data:  google_serps$position[google_serps$keyword == "acne vitamin a"] and google_serps$secure2[google_serps$keyword == "acne vitamin a"]
t = -4.6188, df = 8, p-value = 0.001713
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9645284 -0.4819678
sample estimates:
       cor
-0.8528029

Even though the confidence interval here is very wide, it remains in the range of a medium to strong correlation for the query “acne vitamin a” (the correlation is negative because position numbers increase from 1 to 10 and beyond, so a better rank is a smaller number). It is therefore important to identify the segments or “regions” where certain signals actually have an effect. More on this in the next part on data science and SEO.
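
A sketch of how one could hunt for such segments systematically, computing the correlation per keyword with the columns used above (keywords where a signal is constant, e.g. all results on https, yield NA and are dropped):

library(dplyr)

# correlation between position and the SSL indicator, per keyword,
# sorted so the strongest negative correlations come first
(per_keyword <- google_serps %>%
  group_by(keyword) %>%
  summarise(r = cor(position, secure2)) %>%
  filter(!is.na(r)) %>%
  arrange(r))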
