With every Google update, the rankings of some pages get shaken up, and every now and then you wonder why some pages are affected and others are not. Sometimes pages are "punished" and you ask yourself: how can that be? Isn't that actually a good site?
It is well known that Google uses machine learning to optimize the relevance calculation algorithms. But how exactly does it work? And what does this mean for search engine optimizers?
How machine learning works
First of all, a distinction is made between supervised and unsupervised learning. Either you leave it to the machine to find patterns in the data on its own, or you give the machine training material in advance and say, for example, which documents are good and which are bad, so that the machine can later decide on the basis of what it has learned whether new documents are good or bad.
Machine learning often works with distances, and this is such a central concept that it should be examined in more detail. The following lines are greatly simplified. We’ll also start with an unsupervised learning example, which probably doesn’t matter that much in the search engine world, but it shows the concept of distance in a very simple way.
Let's imagine that we have a set of documents and that properties have been measured for each document. For example, one property is the number of words in a document (X1), another a measure such as the highly simplified PageRank of the domain on which the document is located (X2). These values are purely fictitious; the point is not to claim that there is a correlation here, but simply to illustrate the concept.
First, the values are brought to the same scale (scale command), then a distance matrix is created (dist command). The commands for the cluster will be discussed later.
The distance matrix shows the distances between the individual rows. For example, the distance from row 1 to row 3 is smaller than that from row 1 to row 4. In the next step, clusters are formed and plotted in a dendrogram:
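To make this concrete, here is a minimal sketch of how such a clustering can be done in R; the ten rows of X1 (word count) and X2 (simplified PageRank) are made-up values in the spirit of the example above.

# Ten fictitious documents with two measured properties
docs <- data.frame(
  X1 = c(120, 450, 200, 2000, 80, 600, 1500, 90, 700, 1510),  # number of words
  X2 = c(2, 5, 3, 8, 1, 6, 7, 1, 6, 7)                        # simplified PageRank of the domain
)

docs_scaled <- scale(docs)   # bring both properties to the same scale
d <- dist(docs_scaled)       # distance matrix between the rows
print(d)

hc <- hclust(d)              # form clusters from the distances alone
plot(hc)                     # dendrogram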
Here, too, it is easy to understand why the values from rows 7 and 10 belong together rather than the values from rows 1 and 3. The machine was able to calculate these clusters from the distances alone.
What do Google’s Human Quality Raters have to do with machine learning?
Now let's go one step further. We know that Google has people judge search results, from "highest" to "lowest", etc.; the Rater Guidelines are easy to find. Again, distances come into play as soon as "highest", "lowest" and all the values in between are assigned numbers.
Of course, the Human Quality Raters can’t look through all the search results. Instead, certain “regions” are trained, i.e. the ratings are used to optimize the algorithm for certain search queries or signal constellations. Unlike in the previous example, we are dealing with supervised learning here, because we have a target variable, the rating. If we now assume that more than 200 factors are used for ranking, then the task for the algorithm could be formulated in such a way that it has to adjust all these factors so that it gets to the target rating.
To understand in more detail how something like this works, let’s take another highly simplified example, this time from a Support Vector Machine.
The principle of Support Vector Machines is a simple but quite clever approach to calculating the optimal separation between two different segments. Let's take the red line in the image above. It separates the blue and the green circles. But it could just as easily be rotated a few degrees to the left or right and would still perfectly separate the two segments. And now comes the trick: to find the optimal separation, the line is extended by two parallel lines, and the orientation at which these two parallel lines can be pushed farthest apart is the optimal orientation for the red line.
Now let's assume that the two segments are again signals from the ranking: x1 is the PageRank, x2 the PageSpeed. The data is plotted here in a two-dimensional space, and you can see that the two segments are wonderfully separated from each other. So we could train our machine on this data, and when new elements are added in the future, they could be classified on the basis of what has been learned. And this works not only with 2 variables, but with many; the separating boundary in such a higher-dimensional space is then called a hyperplane.
Now, data is not always so neatly separable. Let's take the example with PageRank and PageSpeed: just because a page has a high PageRank doesn't mean it also has a great speed. So it could happen in the picture above that there are a few green circles among the blue ones and vice versa. How can a separating line through the segments be calculated then? Quite simply: for every circle that is not clearly on "its" side, there is a minus point. And then it is calculated which line and which position collect the fewest minus points. This is called a "loss function". To put it another way: even "good" pages can be classified as "bad" by a support vector machine; the trick is to classify as few good pages as bad as possible, and vice versa. It is just unlikely that all "good" sites have exactly the same characteristics.
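As a small illustration, here is a minimal sketch with the e1071 package; the two signals, the labels and all values are invented for the example, and the cost parameter is what controls how heavily the "minus points" for misclassified circles weigh.

library(e1071)

set.seed(42)
train <- data.frame(
  x1    = c(rnorm(20, mean = 2), rnorm(20, mean = 6)),    # fictitious PageRank
  x2    = c(rnorm(20, mean = 30), rnorm(20, mean = 70)),  # fictitious PageSpeed
  label = factor(rep(c("bad", "good"), each = 20))
)

# Linear SVM; a small cost tolerates more points on the wrong side (softer margin)
model <- svm(label ~ x1 + x2, data = train, kernel = "linear", cost = 1)

# Classify new, unseen documents on the basis of what was learned
new_docs <- data.frame(x1 = c(2.5, 5.8), x2 = c(35, 65))
predict(model, new_docs)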
What does this mean for search engine optimizers?
First of all, it means what I said more than a year ago at the SEO Campixx conference: there is no static weighting, the ranking is dynamic. At Ask.com, we had trained individual regions, for example for queries with no backlinks or little text, or for health search queries, etc. No one size fits all. And today we do not have all 200 signals available to re-engineer the ranking per search term.
At the same time, however, it also becomes clear why sometimes sites are punished that don’t really deserve it. It’s not because they were found to be bad, it’s just that they have too many signals that speak for a worse ranking. And since the raters didn’t consciously look for any signals, the algorithm, be it Support Vector Machines or something else, just selected the signals that mean a minimal loss. And since we don’t have all 200 signals, it’s often impossible for us to understand what exactly it might have been. With a re-engineering, one can only hope that there is already something useful among the available signals.
This makes it all the more important to deal with the Quality Rater Guidelines. What do the raters use to determine expertise, trust and authority? What leads to the “highest” rating? Even if it’s boring, you can hardly give a better tip besides the hygiene factors.
By the way, Support Vector Machines were developed back in the 1960s, when there was no talk of data science yet. Ranking SVMs are also interesting.
If you want to use R to access APIs automatically, then authorization via browser is not an option. The solution is called Service User: With a Service User and the associated JSON file, an R program can access the Google Analytics API, the Google Search Console API, but also all the other wonderful machine learning APIs. This short tutorial shows what needs to be done to connect to the Google Search Console.
First, you create a project if you don't have a suitable one yet, and then the appropriate APIs have to be enabled. After a short search, you will find the Google Search Console API, which only needs to be activated.
Click on IAM & admin and then click on Service accounts (it's highlighted a bit oddly in this screenshot):
Click Create Service Account:
Important: Give a meaningful name for the e-mail address, so that you can at least keep track of it a bit…
Browse Project is sufficient for this step:
Click on “Create Key” here:
Create a JSON key, download it and then put it in the desired RStudio directory.
Now the service user must be added as a user in the Search Console; the important thing is that it receives full permissions.
What’s really great about the Google Search Console API is that you can see the query and landing page at the same time, unlike in the GUI. By the way, I get the data every day and write it to a database, so I have a nice history that goes beyond the few months of Search Console.
Last but not least, I also provide the R notebook with which I query the data; it's basically the code written by Mark Edmondson, the author of the R package, included here just for the sake of completeness to show how the JSON file is tied in. There is a more elegant variant with R environment variables, but I don't know whether it works on Windows.
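A sketch of how the JSON key file can be tied in; it loosely follows the googleAuthR/searchConsoleR documentation, the file name and site URL are placeholders, and argument names may differ slightly depending on the package versions you use.

library(googleAuthR)
library(searchConsoleR)

# Scope for the Search Console API
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/webmasters")

# Authenticate with the service user's JSON key instead of the browser flow
gar_auth_service(json_file = "my-service-account.json")

# Query and landing page at the same time, unlike in the GUI
sc_data <- search_analytics(siteURL    = "https://www.example.com/",
                            startDate  = Sys.Date() - 30,
                            endDate    = Sys.Date() - 3,
                            dimensions = c("query", "page"),
                            rowLimit   = 5000)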
If you read along here often, you know that Sistrix is one of my absolute favorite tools (I’ll brazenly link as the best SEO tool), if only because of the lean API, the absolutely lovable Johannes with his really clever blog posts and the calmness with which the toolbox convinces again and again. Of course, all other tools are great, but Sistrix is something like my first great tool love, which you can’t or don’t want to banish from your SEO memory. And even if the following data might scratch the paint, they didn’t cause a real dent in my Sistrix preference.
What problem am I trying to solve?
But enough of the adulation. What is it about? As already described in the post about keywordtools.io or the inaccuracies in the Google AdWords Keyword Planner data mentioned in the margin, it is a challenge to get reliable data about the search volume of keywords. And if you still believe that Google Trends provides absolute numbers, well… Sistrix offers a traffic index of 0-100 for this purpose, which is calculated on the basis of various data sources, which is supposed to result in higher accuracy. But how accurate are the numbers here? Along the way, I also want to show why box plots are a wonderful way to visualize data.
The database and first plots with data from Sistrix and Google
The database here is 4,491 search queries from a sample, where I have both the Sistrix and the Google AdWords Keyword Planner data. By the way, it’s not the first sample I’ve pulled, and the data looks about the same everywhere. So it’s not because of my sample. So let’s first look at the pure data:
As we can see, you could draw a curve into this plot, but the relation doesn’t seem to be linear. But maybe we only have a distorted picture here because of the outlier? Let’s take a look at the plot without the giant outlier:
Maybe we still have too many outliers here, let’s just take those under a search volume of 100,000 per month:
In fact, we see a tendency to go up towards the right, not a clear line (I didn't do a regression analysis), but we also see that with a traffic value of 5, we have values that go beyond those of the index values 10, 15, 20, 25 and 30, even 50; so let's look at the curve again:
The median ignores the outliers within the smaller values:
So if we look at the median data, we see a correct trend at least for the higher values, with the exception of the value for the Sistrix traffic value of 65 or 70. However, the variation around these values is very different when plotting the standard deviations for each Sistrix traffic value:
We don’t see a pattern in the spread. It is not the case that the dispersion increases with a higher index value (which would be expected), in fact it is already higher with the index value of 5 than with 10 etc. We see the highest dispersion at the value of 60.
All-in-one: box plots
Because box plots are simply a wonderful thing, I'll add one here as well:
Here the data is plotted the other way around (because it was not really easy to see with the Sistrix data on the X-axis). The box shows where 50% of the data is located; with a search volume of 390, for example, 50% of the data lies between the Sistrix values of 5 and 25, and the median is indicated by the line in the box, here 15. The sizes of the boxes increase at the beginning, then vary again; smaller boxes indicate a lower dispersion. At some data points, we see small circles that R has flagged as outliers, especially in the low search volumes. Almost everything we plotted above is visualized here in a single plot. Box plots are simply wonderful.
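For reference, such a plot only needs one line in R; a minimal sketch, assuming a dataframe kw with the hypothetical columns searchvolume (Keyword Planner value) and sistrix (traffic index):

boxplot(sistrix ~ searchvolume, data = kw,
        xlab = "Search volume (Keyword Planner)",
        ylab = "Sistrix traffic index",
        varwidth = TRUE)   # box width proportional to the square root of the group size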
What do I do with this data now?
Does this mean that the traffic data in Sistrix is unusable? No, it doesn't. As described in the introduction, the Keyword Planner data is not always correct either, so nothing is known for sure. If you treat the Keyword Planner data as the ultimate benchmark, you won't be satisfied with the Sistrix data. It would be helpful if there were more transparency about where exactly the data comes from. Obviously, connected Search Console data would be very helpful, as it shows real impressions. My recommendation is to look at several data sources and to examine the overlaps and the deviations separately. This is unsatisfactory, because it isn't automatic. But "a fool with a tool is still a fool".
Comments (since February 2020 the comment function has been removed from my blog):
May 2018 at 10:58 I hadn’t actually seen that before. Thanks for the hint. But these are the ranges here, not the really absolute numbers. But still very cool.
Martin Says
April 2019 at 13:33 Moin, I read your post and tried to understand it, but I can't quite figure it out. Sistrix is cool, yes, but unfortunately I don't know how reliable the data is.
I actually don’t understand how this is supposed to work technically. How is Sistrix supposed to get the search queries that run through Google for each keyword? It’s not as if Google informs Sistrix briefly with every request.
The only thing I can think of is that they pull the data for each keyword from AdsPlanner. But… to present this as “own search volume” without any indication of where the data comes from, I would find grossly negligent.
Where could they still get data from?
Tom says
April 2019 at 20:39 Hello Martin,
the answer is not 1 or 0, as the article also shows. You can't rely on the AdPlanner data either. Sistrix also gets data from customers who have linked their Search Console data there, since that shows a page's impressions for a keyword. But of course all this is not available for every keyword, and that's how inaccuracies come about.
The last part of this series on search engine optimization/SEO and data science based on my presentation at SEO Campixx. I converted the data and the code into an HTML document via Knit, which makes my notebook including data comprehensible. There are also a few more examinations in the notebook, but I have documented everything in English, as this is not only interesting for Germans. So if you want to read all the results in one document (without the TF/IDF, WDF/IDF or stemming examples), please take a look at the Data Science & SEO Notebook.
Let’s go back to the age of domains and their importance for SEO
First of all, an addendum: We lacked the age for some domains, and I have now received this data from another source. In our sample, most of the domains were older, and my concern was that the missing domains might be much younger and therefore the average age would be wrongly pulled down. Almost 20% of the domains were missing.
In fact, the missing domains are younger. While the median for our holey data set was 2001, it is 2011 for the missing domains. If you merge the data, however, it is still back to 2001, only the mean has changed from 2002 to 2003. Thus, the number of missing data was not so high that this opposing trend would have had a major impact. Of course, one could now argue that this other source simply has completely different numbers, but this could not be verified in a sample of domains for which an age was already available. And if you look at the plot for the relationship between position on the search results page and the age of a domain, we haven’t found anything new:
Box plots are a wonderful thing, because they show an incredible amount about the data at a glance. The box shows where 50% of the data is, the thick line in the middle is the median, and the width of the box is proportional to the square root of the sample size. Even after several beers, there is no pattern to be seen here, except that the boxes are all at about the same height. Google had already said that the age of a domain does not matter.
Longer text = better position?
Another myth, and the great thing about this myth is that we can clear it up relatively easily, because we can crawl the data ourselves. By the way, R is not so great for crawling; there is the rvest package, but if you really only want the content, then nothing comes close to Python's Beautiful Soup. Conveniently, you can also run Python in RStudio notebooks, so only the actual text is extracted here, not the source code. Navigation elements and footers are included, however, although we can assume that Google is able to extract the actual content. The following plot shows the relationship between content length and position:
As we can see, we don't see anything, except for an interesting outlier with more than 140,000 words in a document (http://katalog.premio-tuning.de/), which ranked 3rd for the keyword "tuning catalogs". Otherwise, no correlation can be observed, so a general statement such as "more text = better position" cannot be derived. The median word count is 273, the mean 690. Just a reminder: we are in the top 10 here. I would actually be very interested in how the colleagues from Backlinko came up with 1,890 words for the average 1st-place document. They looked at far more search results (what does "1 million search results" mean? Exactly that, i.e. about 100,000 search results pages, i.e. the results for about 100,000 search queries?), but they do not reveal which average they used. As you can see in my numbers, there is a big difference between the median and the mean (the arithmetic mean is what most people call the average). It's not for nothing that I always say that the average is the enemy of statistics. Or maybe texts are simply longer in the USA? Since neither the numbers nor the methods are made available to us, we can't tell. At some point I learned that you have to provide both the data and the software used for evaluation along with your results, so that everything is really comprehensible.
Is there really nothing to see at all?
In this final part, I have added more signals, including TF/IDF and WDF/IDF. And as you can see in the correlation matrix, we don't have a correlation anywhere. In the last part, however, we had already seen that this does not apply to all keywords. In the histogram of the correlation coefficients, we saw both positive and negative correlations, but without taking the p-values into account. If you only look at the correlation coefficients where p < 0.05, the picture looks different again:
So we have keywords where the backlinks matter, and we also have keywords where the other signals matter. If we can draw one conclusion from the keyword set, it’s that there’s no one-size-fits-all rule. As already stated in the last part, we need the above correlation matrix for each keyword. And that’s exactly what is exciting, because we can look at how the ranking signals behave for each keyword individually or perhaps a topic.
And so you can see for the keyword “player update” (as hash 002849692a74103fa4f867b43ac3b088 in the data in the notebook) that some signals are more prominent, see the figure on the left. Can you now be sure that you now know exactly how the ranking works for this keyword? No, you can’t (especially since we haven’t calculated the p-values here yet). But if we look at several keywords from the same “region” (i.e. similar values in this signal), then there could actually be something in them.
What about WDF/IDF now?
Unfortunately, nothing here either. And that was probably the biggest point of contention at SEO Campixx. In this example, I'm only using the exact match for now, so I look for exactly the keyword I entered in the text. Of course, we could go further and also match the keywords broken down into their component parts, but to reduce the complexity, let's just look at the exact match.
There is no clear pattern here, and there are no correlations. Very few observations even manage a p-value below 0.05 and a correlation coefficient of more than 0.1. In this keyword set, there is no evidence that WDF/IDF contributes anything, at least not for exact match. Neither does TF/IDF. I didn't even bother looking at keyword density.
Reporting Data
The last part of my presentation from the SEO Campixx was a short summary of my series of articles about SEO reporting with R and AWS (especially the part about action-relevant analyses and reporting).
Result
Once again the most important points:
My sample is quite small, so it may not be representative of the total population of all searches. So the statement is not that what I have written here really applies to everything. The goal is to show how to approach the topic from a data science perspective. For some topics, however, it cannot be denied that the data situation is sufficient, such as the correlation between document length and position.
My statements apply to German search results. The average document length may be different in DE, but I doubt it.
The data used for the calculation is not necessarily reliable. The backlink data is most likely not complete, and what Google & Co. make of text is not completely transparent either. Most tools out there don't even use standard procedures like stemming; it may well be exciting to work with such WDF/IDF tools, but they are not necessarily what actually changes everything.
The typical SEO statements cannot be proven for all keywords with the help of this sample, but this should not come as a surprise, because the ranking algorithm is dynamic. This means:
Speed is not a ranking factor, at most a hygiene factor, and even that we can't prove here.
HTTPS is not yet a ranking factor.
Surprisingly, backlinks do not always correlate, but this may be due to the data basis.
We have to look at what the ranking signals look like for each keyword.
The emotional reactions of some colleagues are not incomprehensible, because after all, some tools are paid for dearly (there were also tool operators in my presentation, one of whom got carried away with the statement that you can tell I haven't worked as an SEO for a long time). It's a bit like going to a Christian and saying that his Jesus unfortunately never existed. But I didn't say that. I only said that I cannot verify the effect of common practices on the basis of my data set. Many SEOs whom I appreciate very much have told me that, for example, WDF/IDF works for them. In medicine it is said "he who heals is right", and at the end of the day it is the result that counts, even if it has been proven that homeopathy does not help. And perhaps the good results of these SEOs only come about because they also do many other things right, but then attribute it to WDF/IDF.
But what interests me as a data person is reproducibility. In which cases does WDF/IDF work and when does it not? I would like to add that I have no commercial interest in calling any way good or bad, because I don’t sell a tool (let’s see, maybe I’ll build one someday) and I don’t earn my money with SEO. In other words: I pretty much don’t give a s*** what comes out of here. The probability that I will succumb to confirmation bias because I am only looking for facts that support my opinion is extremely low. I’m only interested in the truth in a post-factual world. And unlike the investigation of Backlinko, for example, I provide my data and code so that everyone can understand it. This is complexity, and many try to avoid complexity and are looking for simple answers. But there are no easy answers to difficult questions, even if that is much more attractive to people. My recommendation: Don’t believe any statistics that don’t make the data and methods comprehensible. I hope that all critics will also disclose the data and their software. This is not about vanity.
The Donohue–Levitt hypothesis is a good example for me: For example, the New York Police Department’s Zero Tolerance approach in the 1990s was praised for significantly reducing crime as a result. This is still a widespread opinion today. Donohue and Levitt had examined the figures, but came to a different conclusion, namely that this was a spurious correlation. In reality, the spread of the baby pill was responsible for the fact that the young offenders were not born in the first place, which then became noticeable in the 90s. Of course, this was attacked again, then confirmed again, and then someone also found out that the disappearance of the lead content from gasoline was responsible for the reduction of juvenile delinquency (lead crime hypothesis). However, these are more complex models. More police truncheons equals less crime is easier to understand and is therefore still defended (and maybe there is a bit of truth to it?). But here, too, those who find a model more sympathetic will mainly look at the data that confirm this opinion.
I could have investigated a lot more, but as I said, I do this on the side. I'm already in the mood for more data on the topic, but for now there are other mountains of data waiting here. The next step would then be a larger data set and machine learning to identify patterns more precisely.
Now it’s been another month, and I still haven’t written everything down. However, this is also due to the fact that I have acquired even more data in the last few weeks so that I have a data set that I can share and that is not customer-specific.
Why does data validation and cleansing take so long?
80% of the time is spent on validating and cleaning the data, according to the rule of thumb, and I would add one more point, namely transforming the data. Data is rarely available in such a way that it can be used immediately.
But one thing at a time. For this part, I wanted to add backlink data as well as crawl data, but Google’s backlink data is only available for your own domain, and if you use tools like Sistrix, then the API queries cost money or credits. Using the example of Sistrix, you pay 25 credits for querying for backlinks (links.overview), so with 50,000 credits per week you can query the links for 2,000 URLs. However, I can only use the credits that were not spent on other tools at the end of the week, so I would need more than 7 weeks for 14,099 unique hosts that I generated in the last part with the 5,000 searches. Until then, I have 1,000 other projects and I forget what code I wrote here, so I took a sample based on 500 searches that I randomly pulled from my 5,000 searches. Unfortunately, the ratio unique hosts/all URLs here was not as nice as with the overall set, 2,597 unique hosts had to be queried.
Unfortunately, the Sistrix API also threw a spanner in the works here, because for over 250 URLs I got answers that my script had not handled properly, e.g.
My script had expected integer values (fractions of a backlink don’t exist in my opinion) and then simply wrote nothing at all to the dataframe when a number came from Sistrix that was not an integer. But even if it had caught that, the number I see here has nothing to do with the number I see in the web interface, although there are also strange numbers here from time to time (see screenshot). Is that 197,520 backlinks or 19,752,000? Please don’t get me wrong, Sistrix is one of my favorite tools, but such things drive me crazy, and R is not easy there either. It didn’t help, I had to look through the data first and add some of it manually (!!). And how difficult R can sometimes be becomes apparent when you want to add existing data, but don’t want to set the existing data in a column to NA. My inelegant solution to the transformation (which took me 2 hours):
test <- merge(sample_2018_04_02, backlinks_2, by = "host", all.x = TRUE)  # keep all rows of the sample
test <- cbind(test, "backlinks_raw" = with(test, ifelse(is.na(total.y), total.x, total.y)))  # take the new value where it exists, otherwise keep the old one
We had already seen last time that we were missing data for the age of a domain, but here there is also the official statement from Google that the age of a domain has no influence on the ranking. However, the old domains were in the majority, so one could possibly assume that newer domains have less chance of getting into the index or into the top 10, but that it doesn't matter for the respective position within the top 10. However, we did not make this statement explicitly, because it could be that the missing age values in my data set are exactly the younger domains. So that remains to be found out, but no longer as part of this series. Then I might as well examine the top 100.
In summary: It always looks very simple, but collecting, transforming and cleaning the data simply costs a lot of time and energy.
So what can we do with this data?
First of all, let’s see if we can see something just by plotting the individual variables in relation to each other. The variables we have here now are:
Position
https yes/no
year (age of the domain)
Speed
Ips
SearchVolume (spelling error in the data: "SwarchVolume")
Number of backlinks for the host
Number of backlinks for the host, logarithmized
Number of results for a search query
There are still a few variables missing, but let's start with these. As we can see, we see next to nothing. So let's take a look at the bare numbers again:
Here we see a little more. For example, a (sometimes very weak) correlation between http and backlinks_log, year and backlinks_log, speed and year, backlinks_raw and ip, etc. But why the logarithmization of the backlinks at all? This is illustrated by the following example:
If we look at the distribution of the frequencies of the backlinks in the histogram, we see a high bar on the far left and not much else. No wonder, because in the search results we have hosts like Youtube that have a nine-digit number of backlinks, but most hosts have much, far fewer backlinks. If we use a logarithm instead, i.e. if we “compress” it a bit, then the histogram looks quite different:
We see here that many hosts are somewhere in the middle, some stand out with few links (that’s the bar at 0) and few hosts have a lot of backlinks. The question of whether the number of backlinks of the individual search hits is comparable for each search query is also exciting. The answer here is no, as the following histogram shows (also logarithmized):
I calculated the average number of backlinks for each keyword and then put it logarithmized on the histogram (without a logarithm, the histogram would look like the unlogarithmized one before it). And as we can see, we also have areas where the search results come from hosts that have few backlinks, most of the search results are in the middle, and with very few search results we have an enormous number of backlinks. It must always be said that the average is not really a great story. But at least we see that we have different “regions”.
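A sketch of how these histograms can be produced; backlinks_raw is the column created above (the exact dataframe name may differ), the keyword column and the base-10 logarithm are assumptions, and the +1 simply avoids log(0) for hosts without known backlinks.

hist(dataset$backlinks_raw, breaks = 50)              # one tall bar on the far left
hist(log10(dataset$backlinks_raw + 1), breaks = 50)   # the "compressed" view

library(dplyr)
avg_bl <- dataset %>%
  group_by(keyword) %>%
  summarise(mean_backlinks = mean(backlinks_raw, na.rm = TRUE))
hist(log10(avg_bl$mean_backlinks + 1), breaks = 50)   # average backlinks per keyword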
Why do age and backlinks correlate?
Now that we’ve clarified why certain data is logarithmized, let’s take a closer look at what the correlations look like, starting with age and backlinks:
If we squint a little, we see an impression of a line, it almost looks as if there is a correlation between the age of a domain and its backlinks. Once tested:
Pearson's product-moment correlation

data:  dataset$year and dataset$backlinks_log
t = -24.146, df = 4286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3721183 -0.3194161
sample estimates:
       cor
-0.3460401
That looks good. And it’s not completely surprising. After all, the longer a domain exists, the more time it has had to collect links. While we don’t know which way a correlation is headed, it’s unlikely that a domain will get older the more backlinks it gets.
Let’s take a look at that again for the combination of age and speed:
Pearson's product-moment correlation

data:  dataset$year and dataset$SPEED
t = 13.129, df = 4356, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1663720 0.2234958
sample estimates:
      cor
0.1950994
It is interesting here that the correlation coefficient is positive, i.e. the older a domain is, the slower it is.
Why don’t position and number of backlinks correlate?
Good question. Because as already discussed in the last part, this does not apply to every keyword. Let’s take a look at the correlation between backlinks and position per keyword and then throw the correlation coefficients output onto a histogram:
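A minimal sketch of this step with dplyr; it assumes the dataframe contains keyword, position and backlinks_log columns as used elsewhere in this series, and it skips keywords with too few results for a meaningful coefficient.

library(dplyr)

cors_per_keyword <- dataset %>%
  group_by(keyword) %>%
  filter(n() >= 5, sd(backlinks_log, na.rm = TRUE) > 0) %>%  # a correlation over 2 or 3 rows says little
  summarise(r = cor(position, backlinks_log, use = "complete.obs"))

hist(cors_per_keyword$r, breaks = 20)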
Clearly, we have some keywords here whose ranking hosts have at least a weak if not moderate correlation with the number of backlinks. This means that we would have to look at the composition of the ranking for each keyword individually. And since we already know that the ranking is dynamic, we see it a little more clearly here.
Unfortunately, it is not the case that there is a correlation between the average number of backlinks of the ranked pages and the strength of the correlation between backlinks and position. We can see in the screenshot on the left that this is a very colorful mix.
What can be the reason for this? For example, because I only have the data for the backlinks for the hosts, not for the respective landing page. Of course, that would be even nicer, and the most ideal thing would be if I could then look at what the composition of the individual factors of the backlinks would look like. In view of my credit poverty, however, this is not possible at the moment. And here we have a typical problem in the field of data science: We know that the data is out there, but we can’t get to it. Nevertheless, this approach already offers enormous advantages: I can now look at the composition of the current ranking for each keyword individually and act accordingly. In the example on the left, I see that I need a lot of backlinks for the host for “privacy”, but in my dataset (not on the screenshot), my host needs few backlinks for searches like “poems early retirement”. So we need exactly this correlation matrix for each keyword instead of an overall view as above.
In the next part we will get more data (starting with TF/IDF and WDF/IDF).
Comments (since February 2020 the comment function has been removed from my blog):
Steffen Blankenbach says
April 2018 at 15:34 Hello Tom, very good and exciting article. Once again, it becomes clear how much effort you have to put into getting insights, only to find out that there are many other influencing factors that could not be taken into account in the calculation. This will probably always remain a problem of data evaluation. Personally, I also assume that thanks to RankBrain, ranking factors even differ at the keyword level and therefore such a mass evaluation is not expedient for companies. A case for a specific industry would be exciting. I look forward to more articles from you.
The first two parts were about what data science is and why WDF/IDF values are very likely to have little to do with what happens under the hood at Google. In this part, we go one step further, because we look at whether there are correlations between ranking signals and the position. In the lecture, I had shown this using the example of a search query and, in view of the time available, I had dealt with it rather briefly. Here I can go into depth. However, we will first only look at each individual ranking signal in relation to the position, not the possible effect of the ranking signals on each other.
Risks and Side Effects of Data Science
Since my presentation caused some colleagues to gasp and some “interesting” statements, one point had probably been lost. Because I had expressly said, and I repeat this here, that I am not making the statement that one can assume from this data that the ranking works that way. Anyone who has ever had to deal with lawyers or statisticians knows that they are reluctant to be pinned down to reliable statements. After all, we usually do not know the total population and therefore have to draw conclusions about the population from a small sample; Who is so crazy and lets himself be pinned down to it? Hence all the complexity with confidence level, confidence intervals etc…
The following statements refer to a sample of 5,000 search queries. That sounds like a lot, but we don’t know if these searches correspond to the total population of all searches. So the results are for the sample, and I’m always willing to repeat that for other searches if those searches are made available to me.
Other problems with this approach: We have access to a few ranking signals, but not all of them, and the few signals we do have are also partly inaccurate. Of the more than 200 ranking signals, we have:
Domain age: inaccurate, as the individual sources contradict each other and there is no pattern to be seen as to whether one source generally outputs younger values than the other
Backlinks: here we have the values that are output from tools, and they usually don’t have a complete overview of all backlinks
Anchor Texts: Since we can’t assume that the backlinks are complete, we can’t expect to have all the anchor texts
Text matches: We can identify exact matches, but we saw in the previous part that search engines don't limit themselves to exact matches
Text length
Speed
HTTPS versus HTTP
Readability
So we lack signals such as
User Signals
Quality Score
Domain History
RankBrain
and much more
In summary, we only have a fraction of the data, and some of it is not even accurate. And my calculations are also based on search queries that we don’t know if they are representative.
A word about correlations
My favorite example of the fatal belief in correlations is the statistical correlation between Microsoft Internet Explorer’s market share and the murder rate in the U.S. between 2006 and 2011. While it may be funny to claim that there is a connection here (and this regularly leads to laughter in lectures), the fact is that a statistical relationship, which we call a correlation, does not have to be a real connection here. Correlation does not mean cause and effect. Worse still, in statistics we don’t even know in which direction the statistical connection runs. In this example, whether Internet Explorer’s market shares led to more murders, or whether the murders led to Internet Explorer being used afterwards to cover their tracks.
Of course, the connections are clear in some situations: If I spend more money on AdWords, then I may get more conversions. And if we examine the statistical relationship between ranking signals, then it is likely that more backlinks lead to a better position, even though of course a better position can ensure that more website owners find interesting content and link to it… But we don’t know whether, for example, the individual signals can influence each other, and we usually only look at the top 10. As described in the previous part, this is a kind of Jurassic Park, where we don’t have the whole picture.
A little descriptive statistics
Every analysis begins with a description of the data. For the 5,000 search queries, we got more than 50,000 search results, but we take out the results that point to Google services for the time being, because we don’t know whether they are ranked according to the normal ranking factors. There are 48,837 URLs left, spread across 14,099 hosts.
Lately I've been working more and more with dplyr, which allows piping as under Linux/UNIX; this functionality was borrowed from the magrittr package, and it makes the code incredibly clear. In the example in the figure, after the commented-out line, I throw my dataframe google_serps to group_by(), which groups by host, then I throw the result of this step to summarise(), which calculates the frequencies per host, and finally I throw that to arrange(), which sorts the result in descending order. I write the result to hostFrequency, and because I want to see the result immediately in my R notebook, I put the whole expression in parentheses so that the result is not only written to the hostFrequency dataframe but also printed immediately. Every time I do something like this with dplyr, I'm a little happy. And if you have really large data sets, then you do the same with sparklyr :).
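For readers without the figure, this is roughly what the described pipeline looks like (the host column name is taken from the description above):

library(dplyr)

(hostFrequency <- google_serps %>%
   group_by(host) %>%
   summarise(n = n()) %>%
   arrange(desc(n)))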
But back to the topic: So we see here that few hosts rank very frequently, and that means conversely that many hosts rank only once. No surprise here.
To warm up: Speed as a ranking factor
The speed data for each host is very easy to get, because Google offers the PageSpeed Insights API for this, and kindly there is also an R package for it. With more than 10,000 hosts, the query takes a while, and you are not allowed to make more than 25,000 requests per day, nor more than 100 (?) requests per 100 seconds. I just let it run, and after 1 day my R session crashed and all the data was lost. Not pretty, but there is a workaround: write the dataframe to the hard disk after each request.
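The workaround itself is simple; here is a sketch of the loop, where get_pagespeed() stands in as a placeholder for whichever client you use (an R package or a direct call to the PageSpeed Insights REST API), and hosts is the vector of hostnames.

speed_results <- data.frame(host = character(), SPEED = numeric(), stringsAsFactors = FALSE)

for (h in hosts) {
  score <- tryCatch(get_pagespeed(h), error = function(e) NA)  # placeholder function
  speed_results <- rbind(speed_results, data.frame(host = h, SPEED = score))
  write.csv(speed_results, "pagespeed_results.csv", row.names = FALSE)  # checkpoint after every request
  Sys.sleep(1)  # stay under the requests-per-100-seconds limit
}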
But let’s take a closer look at the data now. Here’s a histogram of the distribution of the speed values of 14,008 hosts (so I got a PageSpeed value for 99.4% of the hosts):
We see that most hosts make it over the 50 points, and summary gives us the following numbers:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   55.00   70.00   66.69   82.00  100.00
It's nice to see how misleading the average can be. And now let's plot positions versus speed:
As we see, we don’t see anything. Let’s check this again in more detail:
cor.test(google_serps$position, google_serps$SPEED)

Pearson's product-moment correlation

data:  google_serps$position and google_serps$SPEED
t = -5.6294, df = 48675, p-value = 1.818e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03438350 -0.01662789
sample estimates:
        cor
-0.02550771
Looks like there is no correlation between PageSpeed and position in the top 10 (we don't know whether there might be one in the top 100; it could be that places 11 to 100 have worse PageSpeed values). But it is also not unlikely that we are dealing with a hygiene factor: if you rank well, you have no advantage; if your page is slow, you get punished. It's a bit like showering: if you have showered, no one notices, but if you haven't, everyone does. Or, in a nutshell (that's how I should have put it at the SEO Campixx), a slow website is like not having showered. However, we also see hosts that rank despite a dreadful speed. But if you search for "ganter shoes", then http://ganter-shoes.com/de-at/ is probably the best result, even if the page takes 30 seconds to load for me.
We should also keep in mind that the PageSpeed API performs a real-time measurement… maybe we just caught a bad moment? You would actually have to measure the PageSpeed several times and create an average from it.
SSL as a ranking factor
What we also get very easily in terms of data is the distinction between whether a host uses https or not. While some see the use of secure protocols as a weighty ranking factor, more moderate voices like Sistrix see the use of SSL as a weak ranking factor. In this dataset, 70% of all URLs have an https. But does that also mean that these pages rank better?
We are dealing here with a special form of calculation, because we are trying to determine the relationship between a continuous variable (the position) and a dichotomous variable (SSL yes/no). We convert the two variants of the protocol into numbers, http becomes a 0, https becomes a 1 (see screenshot below in the secure and secure2 columns).
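A sketch of this recoding; secure2 is the column mentioned above, while the url column name is an assumption.

google_serps$secure2 <- ifelse(grepl("^https://", google_serps$url), 1, 0)  # https = 1, http = 0
cor.test(google_serps$position, google_serps$secure2)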
As we can see, we see nothing or almost nothing: With a correlation coefficient of -0.045067, we can rule out the possibility that https has had an impact on the ranking in the top 10. Does this mean that we should all do without https again? No, because it’s better for the user. And as soon as browsers show even more clearly that a connection is not secure, users are more likely to say goodbye to a site more quickly. Not to mention that we only looked at the top 10 here. It could be that the places 11 to 1000 were mainly occupied by sites without SSL. And then the results could look different again.
Perhaps SSL as a ranking factor is also a reverse hygiene factor: Google would like to have SSL, but since some important pages may not yet have SSL, it doesn't insist on it. Just as you might want your dream partner to have showered, but if he or she suddenly stands in front of you, it doesn't matter because you're so in love. In the long run, of course, that does not go well. And that's how it will be with SSL.
Age of a domain as a ranking factor
Let's move on to the next ranking signal, the age of a domain, even though Google says that the age of a domain doesn't matter. Here we are faced with a first challenge: how do I find out the age of a domain as automatically and, above all, as reliably as possible? Sistrix offers the age of a domain, but not that of a host (the difference is explained in the script for my SEO seminar at the HAW). Nevertheless, Sistrix has the advantage that the API is very lean and fast. However, Sistrix does not find an age for 2,912 of the domains (not hosts); with 13,226 unique domains, that is 22% without a domain age. Jurassic Park sends its regards (if you don't understand this allusion, please read the second part about data science and SEO). Nevertheless, let's take a look at the distribution:
We see a slightly right-skewed distribution, i.e. a higher frequency on the left with the older domains (no, I'm not seeing it mirror-inverted; that really is what it's called). Younger domains seem to have fewer chances here. However, it is also possible that Sistrix simply does not have younger domains in its database and that the almost 3,000 missing domains would be more likely to sit on the right.
Can we at least see a correlation here? Let’s plot the data first:
Again, we don’t see anything, and if we let the correlation be calculated, then this is confirmed:
cor.test(google_serps$position, as.numeric(google_serps$year))

Pearson's product-moment correlation

data:  google_serps$position and as.numeric(google_serps$year)
t = 1.1235, df = 44386, p-value = 0.2612
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.003970486  0.014634746
sample estimates:
        cor
0.005332591
Not only is there no correlation, the result is not even statistically significant. In any case, the result should be taken with a grain of salt, because as mentioned above, our selection may simply not be representative: probably not every domain had the same chance of getting into Sistrix's domain-age database. So we would have to supplement this data and possibly also check the data from Sistrix (I haven't been able to find out where Sistrix gets the data from); unfortunately, I have not been able to identify a pattern either, because sometimes one source shows older data, sometimes another. In principle, you would have to take all data sources and then always take the oldest date. Unfortunately, most sources are not so easy to scrape. And so we have not only missing but also partly incorrect data. And this is not an atypical problem in the field of data science.
Intermediate thoughts
Since I'll soon have the 2,000 words together (and I know the correlation between word count and "read to the end of the article" for my blog), I'll look at the next ranking factors in the next blog post. Important: we find no evidence here that supposedly very important ranking factors actually have an influence. But that doesn't mean this is the final word:
We have some missing and incorrect data
We may not have a valid sample
And Google has dynamic ranking, which means that for some queries some signals correlate with position, while for others they don't. The sledgehammer method we used here is certainly not expedient.
Example:
cor.test(google_serps$position[google_serps$keyword == "acne vitamin a"],
         google_serps$secure2[google_serps$keyword == "acne vitamin a"])

Pearson's product-moment correlation

data:  google_serps$position[google_serps$keyword == "acne vitamin a"] and google_serps$secure2[google_serps$keyword == "acne vitamin a"]
t = -4.6188, df = 8, p-value = 0.001713
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9645284 -0.4819678
sample estimates:
       cor
-0.8528029
Even though the confidence interval here is very large, it is still in the range of a medium to strong correlation for this query “acne vitamin a” (it correlates negatively because the positions go up from 1 to 10 or further). It is therefore also important to identify the segments or “regions” where certain signals have an effect. More on this in the next part about data science and SEO.
After I explained in the first part what data science is and what already exists in this area on the subject of SEO, now comes the second part, where we take a closer look at what effect the linguistic processing of a document by a search engine has on SEO concepts such as keyword density, TF/IDF and WDF/IDF. Since I showed live code at SEO Campixx, I offer everything for download here, which makes following the examples much easier. By the way, this also works without installing R; here you can find the complete code with explanations and results.
For those who want to recreate it in R
(Please skip if you don’t want to recreate this yourself in R)
I recommend using RStudio in addition to R, because handling the data is a bit easier for newbies (and for professionals as well). R is available at the R Project, RStudio on rstudio.com. First install R, then RStudio.
In the ZIP file there are two files, a notebook and a CSV file with a small text corpus. I refer to the notebook from time to time in this text, but the notebook can also be worked through in this way. Important: Please do not read the CSV file with the import button, because then a library will be loaded, which will nullify the functionality of another library.
The notebook has the great advantage that both my description, the program code and the result can be seen in the same document.
To execute the program code, simply click on the green arrow in the upper right corner, and off it goes.
What is TF/IDF or WDF/IDF?
(Please skip if the concepts are clear!)
There are people who can explain TF/IDF (Term Frequency/Inverse Document Frequency) or WDF/IDF (Word Document Frequency/Inverse Document Frequency) better than me; on WDF/IDF, Karl has distinguished himself with an incredibly good article (in which, by the way, he already said that "actually no small or medium-sized provider of analysis tools can offer such a calculation for a large number of users" ;-)).
A simplistic explanation is that the term frequency (TF) is the frequency of a term in a document, while the Inverse Document Frequency (IDF) measures the importance of a term in relation to all documents in a corpus in which the term occurs (corpus is the linguists’ term for a collection of documents; this is not the same as an index).
In TF, the number of occurrences of a term is counted (put simply, and even that is already complicated enough) and then normalized, usually by dividing that number by the number of all words in the document (this definition can be found in the book "Modern Information Retrieval" by Baeza-Yates et al., the bible of search engine builders). But there are other weightings of the TF, and WDF is actually nothing more than such a different weighting, because here the term frequency is dampened with a logarithm to base 2.
In the IDF, the number of all documents in the corpus is divided by the number of documents in which the term appears, and the result is then provided with a logarithm to base 2. The Term Frequency or the Word Document Frequency is then multiplied by the Inverse Document Frequency, and we have TF/IDF (or, if we used WDF instead of TF, WDF/IDF). The question that arises to me as an ex-employee of several search engines, however, is what exactly is a term here, because behind the scenes the documents are eagerly tinkered with. More on this in the next section “Crash Course Stemming…”.
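Put into a few lines of R, and deliberately simplified (this is the formula logic just described with invented numbers, not what a search engine does internally; the WDF line is one common variant of that formula):

term_count     <- 4      # occurrences of the term in the document
words_in_doc   <- 200    # all words in the document
docs_in_corpus <- 100    # all documents in the corpus
docs_with_term <- 10     # documents that contain the term

tf  <- term_count / words_in_doc                    # normalized term frequency
wdf <- log2(term_count + 1) / log2(words_in_doc)    # term frequency dampened with log2
idf <- log2(docs_in_corpus / docs_with_term)

tf_idf  <- tf * idf
wdf_idf <- wdf * idf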
Crash Course Stemming, Composite and Proximity
Everyone has had the experience of searching for a term and then getting a results page where the term also appears in a modified form. This is because, for example, stemming is carried out, in which words are reduced to their word stem. German is not the only language that conjugates and declines; other languages also change words depending on person, tense, etc. In order to increase recall, not only the exact term is searched for, but also variants of this term (I did not use the term "recall" in the lecture; in information retrieval it describes how many of the relevant documents are found for a search query). For example, for the search query "Data scientist with experience", documents are also found that contain the terms "data scientist" and "experience" (see screenshot on the left).
In the lecture, I stemmed live, with the SnowballC stemmer. This is not necessarily the best stemmer, but it gives an impression of how stemming works. In the R notebook for this article, the stemmer is slightly expanded, because unfortunately SnowballC only ever stems the last word, so the stemmer has been wrapped in a function that can then handle a whole sentence:
> stem_text("This is a great post")
[1] "This is a great post"
> stem_text("Hope you've read this far.")
[1] "Hopefully you have read this far."
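A sketch of what such a stem_text() helper can look like with SnowballC: split the sentence into words, stem each word with wordStem(), and glue the sentence back together (the example call is my own).

library(SnowballC)

stem_text <- function(text, language = "english") {
  words <- unlist(strsplit(text, "\\s+"))                      # split the sentence into words
  paste(wordStem(words, language = language), collapse = " ")  # stem each word, reassemble
}

stem_text("ranking rankings ranked")   # all three forms are reduced to the stem "rank"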
But stemmed forms are not the only variants of a term. We Germans in particular are world champions at inventing new words by "composing" them, i.e. joining them together with other words; it is not for nothing that these new creations are called compounds. An example of the processing of compounds can be found in the screenshot on the right, where the country triangle becomes the border triangle. At first it sounds quite simple: you simply separate terms that could stand as single words and then index them. However, it is not quite that simple, because not every compound may be separated, since the parts can take on a different meaning on their own. Think, for example, of the German "Druckabfall" (pressure drop), which taken apart could also be read as "Druck" (pressure) and "Abfall" (waste).
The example of the "data scientist with experience" also illustrates another method used by search engines: terms do not have to stand next to each other. They can be further apart, which can sometimes be extremely annoying for the searcher if one of the terms in the search query appears in a completely different context. Proximity, i.e. how close the terms are to each other, can be a signal for how relevant the terms in the document are to the search query. Google offers proximity as a feature, but it is not clear how proximity is used as a ranking signal. And this only refers to textual proximity, not semantic proximity. Of course, there is much more processing in lexical analysis, apart from stop word removal etc., but for now this is just a small insight.
So we see three points here that most SEO tools can’t do: stemming, proximity and decompounding. So when you talk about keyword density, TF/IDF or WDF/IDF in the tools, it’s usually based on exact matches and not the variants that a search engine processes. This is not clear with most tools; for example, the ever-popular Yoast SEO plugin for WordPress can only use the exact match and then calculates a keyword density (number of the term in relation to all words in the document). But ryte.com says, for example:
In addition, the formula WDF*IDF alone does not take into account that search terms can also occur more frequently in a paragraph, that stemming rules could apply or that a text works more with synonyms.
So as long as SEO tools can’t do that, we can’t assume that these tools give us “real” values, and that’s exactly what dear Karl Kratz has already said. These are values that were calculated on the basis of an exact match, whereas search engines use a different basis. Maybe it doesn’t matter at all, because everyone only uses the SEO tools and optimizes for them, we’ll take a look at that in the next part. But there are other reasons why the tools don’t give us the whole view, and we’ll take a closer look at them in the next section.
Can we measure TF/IDF or WDF/IDF properly at all?
Now, the definition of IDF already says why a tool that spits out TF/IDF has a small problem: It doesn’t know how many documents with the term are in the Google corpus, unless this number is also “scraped” from the results. And here we are dealing more with estimates. It is not for nothing that the search results always say “Approximately 92,800 results”. Instead, most tools use either the top 10, the top 100, or maybe even their own small index to calculate the IDF. In addition, we also need the number of ALL documents in the Google index (or all documents of a language area, which Karl Kratz drew my attention to again). According to Google, this is 130 trillion documents. So, in a very simplified way, we would have to calculate like this (I’ll take TF/IDF, so that the logarithm doesn’t scare everyone off, but the principle is the same):
TF/IDF = (frequency of the term in the document / number of words in the document) * log(130 trillion / “approximately x results”),
where x is the result count that Google displays for the query. Then we would have one value per document, but we do not know what TF/IDF or WDF/IDF values the documents have that are not among the top 10 or top 100 results we examined. There could be a document in 967th place with better values. We only see our small excerpt and assume that this excerpt explains the world to us.
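Plugged into R, the back-of-the-envelope calculation looks like this. All numbers are placeholders, and since the formula above does not specify a base for the logarithm, I am simply using base 10 here:

```r
term_freq_in_doc <- 12        # the term occurs 12 times in the document
words_in_doc     <- 1500      # the document contains 1,500 words
google_index     <- 130e12    # "130 trillion documents", according to Google
approx_results   <- 92800     # "Approximately 92,800 results"

tf  <- term_freq_in_doc / words_in_doc
idf <- log10(google_index / approx_results)
tf * idf   # roughly 0.07 - and only as good as the two estimates it rests on
```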
And this is where chaos theory comes into play: anyone who has seen Jurassic Park (or even read the book) may remember the chaos theorist, played by Jeff Goldblum in the film. Chaos theory plays a major role in Jurassic Park, because complex systems can exhibit behavior that is difficult to predict. Large parts of the park are monitored by video cameras, except for 3% of the area, and that is exactly where the female dinosaurs reproduce, because they can change their sex (which some frogs, for example, can do as well). Transferred to TF/IDF and WDF/IDF, this means: in our case it is not just 3% of the area that goes unobserved; we see less than 1% (the top 10 or top 100 of the search results) and have no idea what is lying dormant in the rest of the results. Nevertheless, we try to predict something on the basis of this small sample.
Does this mean that TF/IDF or WDF/IDF is nonsense? No. So far I have only shown that these values do not necessarily have anything to do with the values a search engine uses internally. And this is not even new information; Karl and some tool providers have documented it before. In the next part, we will therefore take a closer look at whether we can find a correlation between TF/IDF or WDF/IDF and the position on the search results page.
Say it with data or it didn’t happen
In the accompanying R notebook, I have chosen an example that (hopefully) reminds us all of school: a small corpus of Goethe poems (at least here I have no copyright problems; I was not quite so sure about using search results). A little more than 100 poems, one of which I still know by heart after 30 years. In this wonderful little corpus, I first normalize all the words: everything is converted to lowercase, numbers are removed, periods, commas and the like are stripped, and stop words are removed.
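For reference, these normalization steps take only a few lines with the tm package. This is a sketch that assumes the poems have already been loaded into a character vector called poems; the file handling lives in the notebook:

```r
library(tm)

corpus <- VCorpus(VectorSource(poems))                      # one document per poem
corpus <- tm_map(corpus, content_transformer(tolower))      # lowercase everything
corpus <- tm_map(corpus, removeNumbers)                     # remove numbers
corpus <- tm_map(corpus, removePunctuation)                 # remove periods, commas, etc.
corpus <- tm_map(corpus, removeWords, stopwords("german"))  # remove German stop words
corpus <- tm_map(corpus, stripWhitespace)                   # tidy up leftover whitespace
```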
R does have the tm library, with which you can calculate TF/IDF as well as TF (normalized)/IDF, but, scandal (!!), there is nothing for WDF/IDF. Maybe I will build a package for that myself one day. For illustrative purposes, I simply built all the variants myself and put them next to each other, so you can see in my code exactly what I did. Before we get to the data, here is a rough sketch of the formulas:
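This is a simplified sketch, not the exact notebook code: it reuses the corpus object from the sketch above, the WDF part follows the log2 formula commonly cited for WDF, and the example term “herz” is merely an assumption for illustration.

```r
dtm <- as.matrix(DocumentTermMatrix(corpus))   # term counts per poem

n_docs     <- nrow(dtm)          # number of poems in the corpus
doc_length <- rowSums(dtm)       # words per poem (after cleaning)
df         <- colSums(dtm > 0)   # in how many poems does each term occur?

term <- "herz"                   # example term (assumption)

tf_idf      <- dtm[, term] * log(n_docs / df[term])
tf_norm_idf <- (dtm[, term] / doc_length) * log(n_docs / df[term])
wdf_idf     <- (log2(dtm[, term] + 1) / log2(doc_length)) * log(n_docs / df[term])

head(sort(wdf_idf, decreasing = TRUE), 10)     # the "top 10" poems for this term
```

Now let’s take a look at the data for the unstemmed version: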
We can see here that the “ranking” would change if we sorted by TF/IDF or TF (normalized)/IDF instead. So there is a difference between WDF/IDF and the “classical” methods. And now let’s take a look at the data for the stemmed version:
We see that two documents have sneaked their way into the top 10, and also that we suddenly have 22 instead of 20 documents. This is logical, because different variants of a term are now reduced to the same stem, so more documents contain the (stemmed) term. But we also see very quickly that all the numbers have changed, because we now have a different basis. In other words, whatever SEOs read from WDF/IDF values, it is most likely not what actually happens at Google. And again: this is not news! Karl Kratz has said it before, and some tools say it in their glossaries as well. After my talk, however, it seemed as if I had claimed that God does not exist.
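For reference, the stemmed variant needs just one extra transformation before the document-term matrix is rebuilt; a sketch, reusing the corpus and dtm objects from above (SnowballC provides the German Snowball stemmer that tm uses):

```r
library(SnowballC)

corpus_stemmed <- tm_map(corpus, stemDocument, language = "german")
dtm_stemmed    <- as.matrix(DocumentTermMatrix(corpus_stemmed))

ncol(dtm_stemmed) < ncol(dtm)   # fewer distinct terms, because variants collapse
```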
And perhaps an approximation alone already does the job. We will look at that in the next parts on data science and SEO.
Was that data science?
Phew. No. Not really. Just because you are working with R does not mean you are doing data science. But at least we warmed up a bit by really understanding how our tools actually work.
On March 1, 2018, I gave a talk on data science and SEO at SEO Campixx, and since there were some discussions afterwards :-), I will describe the contents in more detail here, in several parts. This first part is about what data science actually is and what already exists on the topic.
What exactly is data science?
“The sexiest job of the 21st century” is rather dull on closer inspection, because most of the time is spent acquiring and cleaning data and using it to build models. It is coding, it is math, it is statistics, and with larger amounts of data it is also a lot of knowledge about how to wire up instances with each other, for example on Amazon Web Services or the Google Cloud Platform. To my knowledge there is no universally accepted definition of data science, but I would define it as the intersection of
Data Mining
Statistics and
Machine Learning
These are not new topics, but what is new is that we have much more data, much faster processors, cheap cloud computing and many development libraries. For the statistics language and development environment R used here, libraries exist for almost every purpose; somewhere there was someone who faced the same problem and built a solution for it. What is also new is that more and more companies sense that you can do something with data; after all, Spotify uses data to know what music you might also like, and Google knows when you should set off if you want to get to work on time.
Unfortunately, the data hype (which, after a trough of disillusionment, will eventually give way to a healthy understanding of what is actually possible) is met by relatively few people who feel at home in all three disciplines (plus cloud computing). This in turn leads to these data scientist unicorns sometimes being offered unreasonable sums, and to thousands of courses on Udemy & Co. that promise to teach the necessary knowledge.
A real problem with data science, however, is that it requires not only knowledge in several areas, but also the understanding that data is there to solve a problem. I can occupy myself with algorithms and data all day long; for me it is a kind of meditation and relaxation, and sometimes it even feels like playing with Lego. But at the end of the day it is about solving problems: not just collecting data, but extracting the right information from it and then deriving the right actions (the holy trinity of data). And here lies the challenge: often enough, people simply say, “Here is data, make something out of it.” It is therefore an art for the data scientist to understand what the problem actually is and to translate it into code.
In addition, many people have bad memories of math. Accordingly, the audience’s willingness to consume slides full of numbers and formulas tends to be at the lower end of the scale. That is why I worked with small examples in the talk that everyone should be able to follow.
What kind of topics do I work on? All sorts. Classification. Clustering. Personalization. Chatbots. But also analyses of somewhat larger data volumes, 25 million rows of Analytics data and more, which have to be processed within a few minutes. All kinds of things.
What is already there?
On the search engine side, there is already a lot. When I was still at Ask, we were already working with Support Vector Machines, for example to create the ranking for queries whose result pages had almost no backlinks. Even then, the ranking was dynamic. The topic detection of most search engines is based on machine learning, and RankBrain is most likely based on machine learning as well. So for search engines this is not a new topic.
On the other side, that of the SEOs, the topic still seems relatively fresh. Search Engine Land writes that every search marketer can think of themselves as a data scientist. I am not sure I would subscribe to that, because most search marketers I know do not build their own models; as a rule, they use tools that do it for them. On SEMrush you can find a collection of ideas, though more for SEA. Remi Bacha is exciting, although I have not seen any data from him yet. Keyword Hero have come up with something pretty cool, using deep learning to identify the organic keywords that are no longer passed along since the switch to https. Otherwise, I have not seen much on the subject. So we are at the very beginning.
What would we like to have?
Back to the question of what problem I actually want to solve with my work. In an ideal world, an SEO would of course like to reverse-engineer the Google algorithm. That is unlikely to happen, because of the more than 200 ranking signals, only a few are available to us. What we can do, however, is try to build models with the signals we do have, and possibly create smaller tools along the way. And that is exactly what the next part is about.
This is the final part of the series on SEO with R and AWS. In the first part, we set up the AWS instance with RStudio; in the second part, we carried out a small SEO analysis; and in the third part, we created a visibility index and an “actionable” report. This part is about the fact that even the most hardened data scientist finds it too exhausting to run the individual scripts through RStudio every day. The SEO monitoring should therefore be accessible through an appealing interface.
GUI development with Shiny
With Shiny, the developers of RStudio provide a framework that enables interaction with data in just a few lines of code. Conveniently, Shiny is already included in the AMI that we installed in part 1 of this series.
The code is divided into two parts, the server code and the UI code; both can go into a single file called app.R, which is placed in its own folder in RStudio. The server part uses the individual functions that we wrote in the earlier parts of this series (which is why you will not find all of the code here).
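A heavily reduced sketch of what such an app.R can look like. The two helper functions get_visibility_index() and get_ranking_losses() are placeholders for the functions from the earlier parts, and the site list is made up:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("SEO Monitoring"),
  sidebarLayout(
    sidebarPanel(
      selectInput("site", "Website", choices = c("example.com", "another-site.de"))
    ),
    mainPanel(
      plotOutput("visibility"),   # visibility index over time
      tableOutput("losses")       # biggest ranking losses
    )
  )
)

server <- function(input, output) {
  output$visibility <- renderPlot({ plot(get_visibility_index(input$site)) })
  output$losses     <- renderTable({ get_ranking_losses(input$site) })
}

shinyApp(ui = ui, server = server)
```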
Not necessarily pretty enough for a customer, but perfectly sufficient for building a reporting system for yourself. In the upper left corner, I select the website I want the report for, and after the scripts have run, the visibility index plot and the table with the largest ranking losses appear. Building this dashboard took just 20 minutes (OK, to be fair, I had built other dashboards with Shiny before, so I had some experience).
Next steps
In my current setup, I simply reused the code written so far, which also means that it takes a while before anything shows up. It would be better if the data were automatically evaluated every day and the dashboard simply retrieved the aggregated data (a sketch follows after this list). It’s on my to-do list.
Then it would be nice if you could click on an entry in the table and then analyze it further. It’s also on my to-do list.
A connection to Google Analytics would be great.
Not every keyword is interesting and should be included in the report.
All good points that I could tackle over time, and above all good material for further posts. And I am always open to more ideas!
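For the first point, a daily cron job on the EC2 instance that runs the pull-and-aggregate script would already be enough. The cronR package is my own suggestion here, not part of the original setup, and the script path is an assumption:

```r
library(cronR)

# Register a cron job that runs the daily pull/aggregation script at 4 a.m.
cmd <- cron_rscript("/home/rstudio/seo/daily_pull.R")   # path is an assumption
cron_add(command = cmd, frequency = "daily", at = "04:00",
         id = "seo_daily_pull", description = "Daily Webmaster Console pull")
```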
Result
With a little brainpower, a little code, and the help of Amazon Web Services and R, we have built automated and free SEO reporting. Of course, Sistrix & Co. offer far more functionality. This approach is primarily about making better use of the Webmaster Console data, which is available free of charge. Because let’s be honest: we usually have enough data. We just don’t make much of it most of the time.
This approach has another advantage: the data in the Search Console is gone after 90 days, so you cannot look back any further. Here, however, we are building up an archive and can also examine longer-term developments.
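A minimal sketch of how such an archive can grow: pull the most recent days via searchConsoleR and append anything new to a local CSV. The site URL, the file name and the three-day offset (the Search Console data lags a little behind) are assumptions:

```r
library(searchConsoleR)

scr_auth()   # authenticate against the Search Console API

new_data <- search_analytics(siteURL    = "https://www.example.com",
                             startDate  = Sys.Date() - 5,
                             endDate    = Sys.Date() - 3,
                             dimensions = c("date", "query", "page"))

archive_file <- "gsc_archive.csv"
if (file.exists(archive_file)) {
  archive  <- read.csv(archive_file, stringsAsFactors = FALSE)
  new_data <- new_data[!paste(new_data$date, new_data$query, new_data$page) %in%
                          paste(archive$date, archive$query, archive$page), ]
}
write.table(new_data, archive_file, sep = ",", row.names = FALSE,
            col.names = !file.exists(archive_file), append = file.exists(archive_file))
```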
The third episode on search engine optimization with R and AWS is about creating your own visibility index to get an aggregated overview of the rankings of many keywords. In the first part, we looked at how to automatically pull data from the Webmaster Console with R and an AWS Free Tier EC2 instance; the second part was about initial analyses based on click-through rates per position.
What does a visibility index do?
To do this, let’s first look at what a visibility index is supposed to accomplish. With Sistrix, the visibility index is independent of seasonal effects, which is quite charming, because you already know in summer whether you could win a flower pot with winterreifen.de (a winter tire site). This can be achieved, for example, by using the average number of searches from the AdWords Keyword Planner. It is just a shame that this tool only spits out reasonably useful values if you spend enough budget in Google AdWords. That approach is therefore out for us, because we want to keep our monitoring as cheap as possible, ideally free of charge.
Sistrix has the disadvantage that a) despite its low price it is still too expensive for the students who attend my SEO courses, and b) my little niche projects are not always included in the Sistrix keyword database. Sistrix and Co. are especially interesting if you want to compare a site with other sites (ideally from the same industry with the same portfolio). An abstract number like a visibility index of 2 is otherwise pretty meaningless. The number only becomes meaningful if I can relate it to other websites and/or use it to track how my own site’s rankings develop over time. On its own, the number still says little, because what unit is it even measured in? If I lose 2 kilos, the unit is clear. But losing 0.002 visibility points? How many visitors is that?
We want to build an index that lets us see whether our rankings change over time across a large number of keywords. How our competitors are developing could only be determined by scraping Google, and that is not allowed.
Visibility index based on the Webmaster Console
Obviously, it is better to rank 3rd for a search term that is searched for 100 times a day than for one that is searched for only twice a day. So the number of searches should play a role in our visibility index. We have ruled out the search volume from the AdWords Keyword Planner for the reasons mentioned above, so the only source left is the impressions from the Webmaster Console. Provided we are on the first page and far enough up (I am not sure whether an impression is counted if you are in 10th place but were never actually scrolled into view), we should be able to use the number of impressions from the Webmaster Console, even on a daily basis!
A small sanity check for the keyword “Scalable Capital Experiences” (AdWords search volume / actual impressions from the Webmaster Console):
September (only half a month): 2,400 / 1,795
October: 5,400 / 5,438
November: 1,000 / 1,789
For September and October it looks good; only November is a bit strange, with almost 80% more impressions than there were supposedly searches. Something must have happened in September/October for Scalable Capital suddenly to get that many searches; in fact, this was also visible in the traffic on my site. We will not get this discrepancy fully resolved and simply accept that the AdWords numbers are not perfect either.
The following figures illustrate how different the visibility indices are depending on the weighting:
In the simplest model, only 11 minus the position is calculated; everything above (greater than) position 10 gets 1 point. These values are then summed up per day across all results. This model has the disadvantage that I can climb very quickly even if I only rank 1st for terms that are searched for once a month.
In the second model, the same procedure is used, except that the value is multiplied by the impressions.
In the third model, the average CTR per SERP position from the second part of this series is multiplied by the impressions. A sketch of all three models follows below.
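As a sketch, all three weightings can be computed from a data frame gsc with the columns date, query, page, position and impressions from the Webmaster Console, plus a lookup table ctr_by_position with the expected CTR per (rounded) position from part 2 of this series. All of these names are assumptions:

```r
library(dplyr)

visibility <- gsc %>%
  mutate(pos    = round(position),
         points = ifelse(pos > 10, 1, 11 - pos)) %>%
  left_join(ctr_by_position, by = c("pos" = "position")) %>%  # adds expected_ctr
  group_by(date) %>%
  summarise(model_1 = sum(points),                            # position only
            model_2 = sum(points * impressions),              # weighted by impressions
            model_3 = sum(expected_ctr * impressions, na.rm = TRUE))  # expected CTR x impressions
```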
If you now look at the actual traffic, you can see that the third model is already very close to it. The spikes in the real traffic are not quite as pronounced as in the index, and in the end I do not get quite as much traffic, but that may be because the actual CTR is below the expected CTR.
Alternative approach with the Webmaster Console
If you look at the plots, however, it becomes clear that this approach with daily impressions makes little sense. If the curve goes down, it does not mean that there is anything I can do about it; maybe people are simply searching less for this topic and my rankings have not changed at all. (Usually you only write about what works, but I find the failures just as interesting, because you can learn a lot from them :-).) This is exactly why Sistrix presumably also factors out seasonal fluctuations.
Alternatively, you could simply average all impression data for each keyword/landing page pair and calculate with that average, again weighted with the CTR per position. The nice thing about this approach is that seasonal or temporary fluctuations balance each other out. Plotted, it looks like this:
This plot looks very similar to the first one, but that does not mean it always has to be that way. And when I compare it with my Sistrix values (even though I am at a very low level there), it looks very similar as well.
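The averaged variant, again only as a sketch on top of the same assumed gsc data frame and CTR lookup:

```r
visibility_avg <- gsc %>%
  group_by(query, page) %>%
  mutate(avg_impressions = mean(impressions)) %>%   # smooths out seasonal swings
  ungroup() %>%
  mutate(pos = round(position)) %>%
  left_join(ctr_by_position, by = c("pos" = "position")) %>%
  group_by(date) %>%
  summarise(index = sum(expected_ctr * avg_impressions, na.rm = TRUE))
```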
From data to action relevance
Now we have a plot that shows us the weighted development of our rankings, but what do we do with it? It is not really “actionable” yet. It only gets interesting when we also look at which rankings have changed the most and influence our visibility index. To do this, we first take the best ranking for each keyword/landing page pair (the minimum, because lower numbers are better and better than 1st place is not possible) and then the current ranking. Finally, we calculate the delta and sort by it:
The higher the delta, the greater the loss of ranking positions, so to speak. And the greater the need for action, provided the keyword is actually interesting. In my example, I would not mind ranking for “seo monitoring”; after all, the articles in this series are relevant to it. You could now weight by the impressions or by the visibility index we chose earlier:
This looks more interesting: sorted by “actionability”, some search queries now appear at the top that I do find quite relevant. And now you could combine this with the data from the second part and build a dashboard… more on this in the fourth part.
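Finally, a sketch of this “actionability” table, using the same assumed gsc data frame; the impressions serve as the weighting here:

```r
losses <- gsc %>%
  group_by(query, page) %>%
  summarise(best_position    = min(position),              # best ranking ever seen
            current_position = position[which.max(date)],  # most recent ranking
            avg_impressions  = mean(impressions)) %>%
  ungroup() %>%
  mutate(delta = current_position - best_position) %>%     # positive = lost positions
  arrange(desc(delta * avg_impressions))                   # weight the loss by impressions

head(losses, 10)
```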