This is the last part of my series on search engine optimization (SEO) and data science, based on my presentation at SEO Campixx. I have converted the data and the code into an HTML document via Knit, which makes my notebook, including the data, reproducible. The notebook also contains a few additional analyses, and I have documented everything in English, since the topic is not only of interest to German readers. So if you want to read all the results in one document (without the TF/IDF, WDF/IDF or stemming examples), please take a look at the Data Science & SEO Notebook.
Let’s go back to the age of domains and their importance for SEO

First of all, an addendum: we lacked the age for some domains, and I have now obtained this data from another source. In our sample, most of the domains were older, and my concern was that the missing domains might be much younger, so that the true average age would be lower than the incomplete data suggested. Almost 20% of the domains were missing.
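The check itself is straightforward; here is a minimal R sketch of how such a merge-and-compare step could look (the objects `ages` and `missing_ages` and the column `birth_year` are my assumptions for illustration, not the notebook's actual names):

```r
library(dplyr)

# ages:         domains for which we already had a registration year
# missing_ages: the roughly 20% of domains filled in from the second source
# (object and column names are assumptions for illustration)
combined <- bind_rows(ages, missing_ages)

median(missing_ages$birth_year, na.rm = TRUE)  # are the formerly missing domains younger?
median(combined$birth_year, na.rm = TRUE)      # does the merged median move?
mean(combined$birth_year, na.rm = TRUE)        # and the mean?
```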
In fact, the missing domains are younger. While the median for our incomplete data set was 2001, it is 2011 for the missing domains. If you merge the data, however, the median is still 2001; only the mean has shifted from 2002 to 2003. So the share of missing data was not high enough for this opposing trend to have a major impact. Of course, one could argue that this other source simply reports completely different numbers, but a spot check against domains for which we already had an age did not confirm that. And if you look at the plot of the relationship between position on the search results page and domain age, we still haven't found anything new:

Box plots are a wonderful thing, because they show an incredible amount about the data at a glance. The box shows where the middle 50% of the data lies, the thick line in the middle marks the median, and the width of the box is proportional to the square root of the sample size. Even after several beers, no pattern can be seen here, except that the boxes all sit at roughly the same height. Google had already said that the age of a domain does not matter.
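For reference, this is roughly how such a plot can be produced in base R; `serp`, with the columns `position` and `birth_year`, is an assumed data frame, not the original object from the notebook:

```r
# Domain age per SERP position as a variable-width box plot; varwidth = TRUE
# makes the box width proportional to the square root of the number of
# observations per position, which is the "width of the box" mentioned above.
boxplot(birth_year ~ position, data = serp,
        varwidth = TRUE,
        xlab = "Position on the search results page",
        ylab = "Domain registration year")
```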
Longer text = better position?
Another myth, and the great thing about this one is that we can clear it up relatively easily, because we can crawl the data ourselves. By the way, R is not great for crawling; there is the package rvest, but if you really only want the content, nothing comes close to Python's Beautiful Soup. Conveniently, you can also run Python inside RStudio notebooks. Only the visible text is taken as text here, not the source code, although navigation elements and footers are still included; we can assume that Google is able to extract the actual content. A short sketch of this extraction step follows; the plot after it shows the relationship between content length and position:
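As a rough idea of what this step looks like, here is a minimal sketch in R with rvest (the notebook itself used Python's Beautiful Soup inside the RStudio notebook); the URL is simply the outlier that comes up below, used for illustration:

```r
library(rvest)

# Fetch one result page and extract only the rendered text of the body,
# i.e. no markup - but navigation elements and footers are still part of it.
page <- read_html("http://katalog.premio-tuning.de/")
visible_text <- html_text2(html_element(page, "body"))

# Crude word count: split the visible text on whitespace
word_count <- length(strsplit(trimws(visible_text), "\\s+")[[1]])
word_count
```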

As we can see, we don't see anything, except for an interesting outlier with more than 140,000 words in one document (http://katalog.premio-tuning.de/), which ranked 3rd for the keyword "tuning catalogs". Otherwise, no correlation can be observed, so a general statement such as "more text = better position" cannot be derived from this. The median word count is 273, the mean is 690. Just a reminder: we are only looking at the top 10 here. I would actually be very interested in how the colleagues from Backlinko arrived at 1,890 words for the average 1st-place document. They looked at far more search results (what does "1 million search results" mean? Exactly that, i.e. about 100,000 search results pages, i.e. the results for about 100,000 search queries?), but they do not reveal which average they used. As you can see from my numbers, there is a big difference between the median and the mean, the arithmetic mean being what most people call the average. It's not for nothing that I keep saying that the average is the enemy of statistics, but maybe texts really are longer in the USA? Since neither the numbers nor the methods are made available to us, we cannot tell. At some point I learned that you have to publish both the data and the analysis code along with your results so that everything is truly reproducible.
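For what it's worth, checking such numbers takes only a few lines once data and code are published; a minimal sketch, again under the assumed column names `words` and `position` in `serp`:

```r
# The two location measures quoted above, plus a rank correlation between
# document length and SERP position (serp is an assumed data frame).
median(serp$words)   # 273 in this sample
mean(serp$words)     # 690 - pulled up by a few very long documents
cor.test(serp$words, serp$position, method = "spearman")
```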
Is there really nothing to see at all?

For this final part, I have added more signals, including TF/IDF and WDF/IDF. And as you can see in the correlation matrix, there is no correlation anywhere across the whole data set. In the last part, however, we had already seen that this does not hold for every keyword: the histogram of the correlation coefficients showed both positive and negative correlations, but without any p-values. If you only keep the correlation coefficients where p < 0.05, the picture looks different again; a sketch of this filtering step follows, and after it the histogram:
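A minimal sketch of this per-keyword filtering, assuming a data frame `serp` with columns named `keyword`, `backlinks` and `position` (these names are illustrative, not the notebook's):

```r
library(dplyr)

# One Spearman correlation per keyword between a single signal (here:
# backlinks) and position; keep only coefficients with p < 0.05.
sig_cors <- serp %>%
  group_by(keyword) %>%
  summarise(
    r = cor.test(backlinks, position, method = "spearman")$estimate,
    p = cor.test(backlinks, position, method = "spearman")$p.value
  ) %>%
  filter(p < 0.05)

hist(sig_cors$r, breaks = 20,
     main = "Significant correlation coefficients per keyword",
     xlab = "Spearman's rho")
```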

So we have keywords where backlinks matter, and we also have keywords where the other signals matter. If we can draw one conclusion from this keyword set, it is that there is no one-size-fits-all rule. As already stated in the last part, we need the above correlation matrix for each keyword. And that is exactly what makes this exciting, because we can look at how the ranking signals behave for each keyword individually, or perhaps per topic.

And so you can see for the keyword "player update" (hash 002849692a74103fa4f867b43ac3b088 in the notebook's data) that some signals are more prominent; see the figure on the left. Can you now be sure that you know exactly how the ranking works for this keyword? No, you can't (especially since we haven't calculated the p-values here yet). But if we look at several keywords from the same "region" (i.e. with similar values in this signal), then there could actually be something to it.
What about WDF/IDF now?
Unfortunately, nothing here either. And that was probably the biggest point of contention at SEO Campixx. In this example, I am only using the exact match for now, i.e. I look for exactly the keyword as it was entered in the text. Of course, we could go further and split the keywords into their components and match those as well, but to keep the complexity down, let's stick with the exact match.

There is no clear pattern here, and there are no correlations. Very few observations even manage a p-value below 0.05 together with a correlation coefficient of more than 0.1. In this keyword set, there is no evidence that WDF/IDF contributes anything, at least not for the exact match. Neither does TF/IDF. I didn't even bother with keyword density.
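For readers who want to reproduce the weighting themselves, here is one common textbook formulation of the two measures; the exact variants used in commercial tools differ, so treat this as a sketch rather than a definitive implementation:

```r
# freq:     exact-match count of the keyword in one document
# doc_len:  total number of words in that document
# n_docs:   number of documents in the corpus
# doc_freq: number of documents containing the keyword
wdf <- function(freq, doc_len) log2(freq + 1) / log2(doc_len + 1)
idf <- function(n_docs, doc_freq) log10(n_docs / doc_freq)

wdf_idf <- function(freq, doc_len, n_docs, doc_freq)
  wdf(freq, doc_len) * idf(n_docs, doc_freq)

tf_idf <- function(freq, doc_len, n_docs, doc_freq)
  (freq / doc_len) * idf(n_docs, doc_freq)

# Illustrative numbers only: the keyword appears 5 times in a 690-word
# document and in 120 of 1,000 crawled documents.
wdf_idf(5, 690, 1000, 120)
tf_idf(5, 690, 1000, 120)
```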
Reporting Data
The last part of my presentation at SEO Campixx was a short summary of my series of articles about SEO reporting with R and AWS (especially the part about action-relevant analyses and reporting).
Result
Once again the most important points:
- My sample is quite small, so it may not be representative of the total population of all searches. I am therefore not claiming that what I have written here applies to everything; the goal is to show how to approach the topic from a data science perspective. For some questions, however, the data is arguably sufficient, such as the relationship between document length and position.
- My statements apply to German search results. The average document length may differ between Germany and other markets such as the US, but I doubt it.
- The data used for the calculations is not necessarily reliable. The backlink data is most likely incomplete, and what Google & Co. make of a text is not completely transparent either. Most tools out there don't even use standard procedures such as stemming, though, so while it is certainly interesting to work with such WDF/IDF tools, they are not necessarily what actually makes the difference.
- The typical SEO statements cannot be proven for all keywords with the help of this sample, but this should not come as a surprise, because the ranking algorithm is dynamic. This means:
- Speed is not a ranking factor, at most a hygiene factor, and even that we cannot prove here.
- HTTPS is not yet a ranking factor.
- Surprisingly, backlinks do not always correlate, but this may be due to the underlying data.
- We have to look at what the ranking signals look like for each keyword.
The emotional reactions of some colleagues are not incomprehensible; after all, people pay dearly for some tools (there were also tool vendors in my presentation, one of whom got carried away enough to remark that you can tell I haven't worked as an SEO for a long time). It's a bit like going to a Christian and saying that his Jesus, unfortunately, never existed. That is not what I said. I only said that I cannot confirm the effect of common practices on the basis of my data set. But many SEOs, whom I appreciate very much, have told me that WDF/IDF, for example, works for them. In medicine it is said that "he who heals is right", and at the end of the day it is the result that counts, even if it has been proven that homeopathy does not help. And perhaps the good results of these SEOs only come about because they also do many other things right, but then attribute them to WDF/IDF.
But what interests me as a data person is reproducibility: in which cases does WDF/IDF work, and in which does it not? I would like to add that I have no commercial interest in calling any approach good or bad, because I don't sell a tool (let's see, maybe I'll build one someday) and I don't earn my money with SEO. In other words: I pretty much don't give a s*** what comes out of this. The probability that I will succumb to confirmation bias because I am only looking for facts that support my opinion is therefore extremely low. I am only interested in the truth in a post-factual world. And unlike Backlinko's study, for example, I provide my data and code so that everyone can check the results. This is complex, and many people try to avoid complexity and look for simple answers instead. But there are no easy answers to difficult questions, even if easy answers are much more attractive. My recommendation: don't believe any statistics whose data and methods are not made transparent. I hope that all critics will also disclose their data and their software. This is not about vanity.
The Donohue–Levitt hypothesis is a good example for me: the New York Police Department's zero-tolerance approach in the 1990s was praised for significantly reducing crime, and this is still a widespread opinion today. Donohue and Levitt examined the figures and came to a different conclusion, namely that this was a spurious correlation. In their analysis, the legalization of abortion meant that many potential young offenders were never born in the first place, which then became noticeable in the 1990s. Of course, this was attacked, then confirmed again, and then someone also found evidence that the removal of lead from gasoline was responsible for the decline in juvenile delinquency (the lead-crime hypothesis). However, these are more complex models. "More police truncheons equals less crime" is easier to understand and is therefore still defended (and maybe there is a bit of truth to it?). But here, too, whoever finds one model more appealing will mainly look at the data that confirms that opinion.
I could have investigated a lot more, but as I said, I do this on the side. I am already keen to collect more data on the topic, but for now other mountains of data are waiting here. The next step would then be a larger data set and machine learning to identify patterns more precisely.