How Search Engines Use Data Science (Even Before It Was Called Data Science)


With every Google update, the rankings of some pages are shaken up, and every now and then you wonder why some pages are affected and others are not. Sometimes pages are “punished” and you ask yourself: how can that be? Isn’t that actually a good site?

It is well known that Google uses machine learning to optimize its relevance algorithms. But how exactly does that work? And what does it mean for search engine optimizers?

How machine learning works

First of all, a distinction is made between unsupervised and supervised learning. Either you leave it to the machine to find patterns in the data on its own, or you give the machine training material in advance, for example by labeling which documents are good and which are bad, so that it can later decide whether new documents are good or bad based on what it has learned.

Machine learning often works with distances, and this concept is so central that it is worth examining in more detail. The following lines are greatly simplified. We will also start with an unsupervised learning example, which probably plays a smaller role in the search engine world, but it illustrates the concept of distance in a very simple way.

Let’s imagine that we have a set of documents, and several properties have been measured for each of them. For example, one property is the number of words in a document (X1), another is a measure such as the highly simplified PageRank of the domain on which the document is located (X2). The values are purely fictitious and are not meant to suggest any correlation; they only serve to illustrate the concept.

First, the values are brought to the same scale (the scale command), then a distance matrix is created (the dist command). The commands for the clustering will be discussed later.
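A minimal sketch in R of how this could look; the data frame and its values are made up purely for illustration:

    # Fictitious documents: X1 = number of words, X2 = simplified PageRank of the domain
    docs <- data.frame(
      X1 = c(120, 450, 150, 900, 300, 310, 50, 800, 290, 55),
      X2 = c(2, 5, 2, 8, 4, 4, 1, 7, 4, 1)
    )

    # Bring both variables to the same scale
    docs_scaled <- scale(docs)

    # Distance matrix between the individual rows (documents)
    d <- dist(docs_scaled)
    round(d, 2)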

The distance matrix shows the distances between the individual rows. For example, the distance from row 1 to row 3 is smaller than that from row 1 to row 4. In the next step, clusters are formed and plotted in a dendrogram:
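Sticking with the fictitious data from the sketch above, the clustering itself only needs the hclust command on the distance matrix and plot for the dendrogram:

    # Hierarchical clustering on the distance matrix
    hc <- hclust(d)

    # Plot the clusters as a dendrogram
    plot(hc)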

Here, too, it is easy to see in the dendrogram why the values from rows 7 and 10 belong together more closely than those from rows 1 and 3. The machine was able to calculate these clusters from the distances alone.

What do Google’s Human Quality Raters have to do with machine learning?

Now let’s go one step further. We know that Google has people judge search results, from “highest” down to “lowest”, etc.; the Rater Guidelines are easy to find. Again, distances come into play as soon as “highest”, “lowest” and all the values in between are assigned numbers.

Of course, the Human Quality Raters can’t look through all the search results. Instead, certain “regions” are trained, i.e. the ratings are used to optimize the algorithm for certain search queries or signal constellations. Unlike in the previous example, we are dealing with supervised learning here, because we have a target variable: the rating. If we now assume that more than 200 factors are used for ranking, then the task for the algorithm can be formulated as adjusting the weighting of all these factors so that its output comes as close as possible to the target rating.
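As a purely hypothetical sketch of the idea in R (the signals, the numeric rating scale and the simple linear model are my own assumptions, not Google’s): the ratings become a numeric target variable, and a model is fitted that weights the signals so that its prediction comes as close as possible to that target.

    # Hypothetical training data: each row is a rated result with a few of its signals
    ratings <- data.frame(
      rating    = c(5, 4, 1, 2, 5, 3),   # "highest" = 5 ... "lowest" = 1
      pagerank  = c(8, 6, 1, 2, 7, 4),
      wordcount = c(1500, 900, 200, 300, 1200, 700),
      pagespeed = c(90, 80, 40, 55, 85, 70)
    )

    # A simple linear model as a stand-in for "adjust the factors to hit the target rating"
    model <- lm(rating ~ pagerank + wordcount + pagespeed, data = ratings)
    summary(model)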

To understand in more detail how something like this works, let’s take another highly simplified example, this time with a Support Vector Machine.

The principle behind Support Vector Machines is a simple but clever approach to calculating the optimal separation between two different segments. Take the red line in the image above: it runs between the blue and the green circles. But it could just as easily be rotated a few degrees to the left or right and would still perfectly separate the two segments. And now comes the trick: to calculate the optimal separation, the line is simply extended by two parallel lines, and the angle at which these two parallel lines are farthest apart from each other is the optimal angle for the red line.

Now let’s assume that the two segments again represent ranking signals: x1 is the PageRank, x2 is the PageSpeed. The data is plotted here in a two-dimensional space, and you can see that the two groups are cleanly separated from each other. So we could train our machine on this data, and when new data points enter this space in the future, they would be classified based on what has been learned. This works not only with 2 variables but with many more; the separating boundary is then called a hyperplane.
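A minimal sketch of this in R, using the svm function from the e1071 package (the data is again made up, and a linear kernel is assumed):

    library(e1071)  # assumed to be installed; provides svm()

    set.seed(42)
    # Fictitious signals: x1 = PageRank, x2 = PageSpeed
    good <- data.frame(x1 = rnorm(20, mean = 7, sd = 1),
                       x2 = rnorm(20, mean = 85, sd = 5), label = "good")
    bad  <- data.frame(x1 = rnorm(20, mean = 2, sd = 1),
                       x2 = rnorm(20, mean = 45, sd = 5), label = "bad")
    train <- rbind(good, bad)
    train$label <- factor(train$label)

    # Train a linear SVM: it looks for the separating line with the widest margin
    model <- svm(label ~ x1 + x2, data = train, kernel = "linear")

    # Classify a new, unseen page based on its signals
    new_page <- data.frame(x1 = 5, x2 = 75)
    predict(model, new_page)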

Now, data is not always so cleanly separable. Take the example with PageRank and PageSpeed: just because a page has a high PageRank doesn’t mean it also has to be super fast. So in the picture above there could well be a few green circles among the blue ones and vice versa. How can a separating line through the segments be calculated then? Quite simply: for every circle that is not clearly on “its” side, there is a minus point, and the line whose position produces the fewest minus points is chosen. This is called a “loss function”. To put it another way: even “good” pages can be classified as “bad” by a Support Vector Machine; the trick is to classify as few good pages as possible as bad and vice versa. It is just unlikely that all “good” sites share exactly the same characteristics.
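Continuing with the training data from the sketch above, this trade-off appears in e1071’s svm as the cost parameter: a high value punishes every point on the wrong side heavily, a low value tolerates a few minus points in exchange for a wider margin (the values here are arbitrary):

    # Strict separation: misclassified training points are penalized heavily
    model_strict <- svm(label ~ x1 + x2, data = train, kernel = "linear", cost = 100)

    # Soft margin: a few points on the wrong side are tolerated
    model_tolerant <- svm(label ~ x1 + x2, data = train, kernel = "linear", cost = 0.1)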

What does this mean for search engine optimizers?

First of all, it means what I said over a year ago at the SEO Campixx conference: there is no static weighting; the ranking is dynamic. At Ask.com, we had trained individual regions, for example for cases where there were no backlinks or little text, or for health search queries, etc. There is no one size fits all. On top of that, we do not have all 200 signals available today to reverse-engineer the ranking per search term.

At the same time, however, it also becomes clear why sites are sometimes punished that don’t really deserve it. It’s not that they were found to be bad; they simply exhibit too many signals that speak for a worse ranking. And since the raters don’t consciously look for specific signals, the algorithm, be it a Support Vector Machine or something else, simply selects whichever signals lead to a minimal loss. And since we don’t have all 200 signals, it’s often impossible for us to work out what exactly it might have been. When reverse-engineering, one can only hope that there is something useful among the signals that are available.

This makes it all the more important to engage with the Quality Rater Guidelines. What do the raters use to determine expertise, trust and authority? What leads to the “highest” rating? Even if it sounds boring, there is hardly a better tip to give beyond the hygiene factors.

By the way, Support Vector Machines were developed back in the 1960s, when there was no talk of data science yet. Ranking SVMs are also worth a look.