Data Science meets SEO, Part 4


Another month has passed, and I still haven't written everything down. This is partly because I have acquired even more data over the last few weeks, so that I now have a data set that I can share and that is not customer-specific.

Why does data validation and cleansing take so long?

According to the rule of thumb, 80% of the time is spent on validating and cleaning the data, and I would add a third task: transforming the data. Data is rarely available in a form that can be used immediately.

But one thing at a time. For this part, I wanted to add backlink data as well as crawl data, but Google's backlink data is only available for your own domain, and tools like Sistrix charge money or credits for API queries. With Sistrix, for example, a backlink query (links.overview) costs 25 credits, so with 50,000 credits per week you can query the links for 2,000 URLs. However, I can only use the credits that have not been spent on other tools by the end of the week, so the 14,099 unique hosts that I generated in the last part from the 5,000 searches would take me more than 7 weeks. By then I'll have 1,000 other projects going on and will have forgotten what code I wrote here, so I took a sample of 500 searches drawn randomly from my 5,000 searches. Unfortunately, the ratio of unique hosts to all URLs was not as favorable here as in the overall set: 2,597 unique hosts had to be queried.
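Drawing such a sample is only a few lines in R. A minimal sketch, assuming the SERP data from the last part lives in a data frame called serps with keyword and host columns (both names are my assumptions; sample_2018_04_02 reappears in the merge code further down):

    set.seed(2018)  # arbitrary seed, only for reproducibility
    # Draw 500 of the 5,000 search queries at random ...
    sampled_keywords <- sample(unique(serps$keyword), 500)
    # ... and keep only the result rows belonging to those queries
    sample_2018_04_02 <- serps[serps$keyword %in% sampled_keywords, ]
    # Number of unique hosts that now have to be queried
    length(unique(sample_2018_04_02$host))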

Unfortunately, the Sistrix API also threw a spanner in the works here, because for over 250 URLs I got responses that my script had not handled properly, e.g.:

    {"method":[["links.overview"]], "answer":[{"total":[{"num":903634.75}], "hosts":[{"num":21491.504628108}], "domains":[{"num":16439.602383232}], "networks":[{"num":5979.5586911669}], "class_c":[{"num":9905.3625179945}]}], "credits":[{"used":25}]}
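Just to illustrate how such a response could be parsed defensively, here is a sketch using the jsonlite package; the rounding is my assumption about how to handle the fractional counts (my actual script did something else, which is exactly what went wrong):

    library(jsonlite)

    # Parse a links.overview response without assuming integer counts
    parse_links_overview <- function(json_string) {
      res <- fromJSON(json_string, simplifyVector = FALSE)
      answer <- res$answer[[1]]
      # Sistrix sometimes returns fractional "counts" such as 903634.75,
      # so round them instead of silently dropping the row
      sapply(c("total", "hosts", "domains", "networks", "class_c"),
             function(field) round(answer[[field]][[1]]$num))
    }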

My script had expected integer values (fractions of a backlink don't exist, in my opinion) and simply wrote nothing at all to the data frame whenever Sistrix returned a number that was not an integer. But even if it had caught that, the number I see here has nothing to do with the number I see in the web interface, although strange numbers appear there from time to time as well (see screenshot). Is that 197,520 backlinks or 19,752,000? Please don't get me wrong, Sistrix is one of my favorite tools, but such things drive me crazy, and R does not make it easy either. There was no way around it: I had to look through the data first and add some of it manually (!!). And how difficult R can sometimes be becomes apparent when you want to add new data without setting the existing data in a column to NA. My inelegant solution to this transformation (which took me 2 hours):

    # Merge the backlink data into the sample, keeping all rows of the sample
    test <- merge(sample_2018_04_02, backlinks_2, by = "host", all.x = TRUE)
    # Use the new value (total.y) where it exists, otherwise keep the old one (total.x)
    test <- cbind(test, "backlinks_raw" = with(test, ifelse(is.na(total.y), total.x, total.y)))
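For comparison, a more idiomatic version of the same transformation, assuming the dplyr package is available; coalesce() picks the first non-NA value, which is exactly what the ifelse() construct above does:

    library(dplyr)

    test <- sample_2018_04_02 %>%
      left_join(backlinks_2, by = "host") %>%
      # prefer the freshly queried value (total.y), fall back to the old one (total.x)
      mutate(backlinks_raw = coalesce(total.y, total.x))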

We had already seen last time that we were missing data for the age of some domains, and there is also the official statement from Google that the age of a domain has no influence on the ranking. However, the old domains were in the majority, so one could possibly assume that newer domains have less of a chance of getting into the index or into the top 10, even if age doesn't matter for the respective position within the top 10. I am deliberately not making this claim, though, because it could be that the missing age values in my data set belong to exactly the younger domains. So that remains to be found out, but no longer as part of this series. Then I might as well examine the top 100.

In summary: It always looks very simple, but collecting, transforming and cleaning the data simply costs a lot of time and energy.

So what can we do with this data?

First of all, let’s see if we can see something just by plotting the individual variables in relation to each other. The variables we have here now are:

  • Position
  • https yes/no
  • year (age of the domain)
  • Speed
  • IPs
  • SearchVolume (misspelled "SwarchVolume" in the data set)
  • Number of backlinks for the host
  • Number of backlinks for the host, logarithmized
  • Number of results for a search query
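Plotting all of these variables against each other is a one-liner in base R. A minimal sketch, assuming the data frame is called dataset (the name used in the tests further down):

    # Scatterplot matrix: every variable plotted against every other
    pairs(dataset)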

There are still a few variables missing, but let's start with these. As we can see, we see next to nothing. So let's take a look at the bare numbers instead:
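The "bare numbers" are the correlation matrix. A sketch of how to compute it, again assuming the data frame dataset and restricting it to numeric columns:

    # Correlation matrix of all numeric variables;
    # pairwise.complete.obs tolerates the missing age values
    num_vars <- dataset[, sapply(dataset, is.numeric)]
    round(cor(num_vars, use = "pairwise.complete.obs"), 2)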

Here we see a little more: for example, a (sometimes very weak) correlation between https and backlinks_log, year and backlinks_log, speed and year, backlinks_raw and ip, etc. But why logarithmize the backlinks at all? The following example illustrates this:

If we look at the distribution of the backlink frequencies in the histogram, we see one tall bar on the far left and not much else. No wonder: the search results contain hosts like YouTube with a nine-digit number of backlinks, while most hosts have far fewer. If we logarithmize instead, i.e. if we "compress" the scale a bit, the histogram looks quite different:
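The two histograms can be reproduced roughly like this; whether the original used log10 or the natural logarithm is an assumption on my part, and the +1 merely guards against log(0) for hosts without backlinks:

    # Raw counts: one tall bar on the far left, the rest barely visible
    hist(dataset$backlinks_raw, main = "Backlinks (raw)")

    # Logarithmized counts "compress" the long tail
    dataset$backlinks_log <- log10(dataset$backlinks_raw + 1)
    hist(dataset$backlinks_log, main = "Backlinks (log)")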

We see here that many hosts are somewhere in the middle, some stand out with few links (that's the bar at 0), and few hosts have a great many backlinks. Another exciting question is whether the number of backlinks of the individual search hits is comparable across search queries. The answer is no, as the following histogram (also logarithmized) shows:

I calculated the average number of backlinks for each keyword and plotted it, logarithmized, on the histogram (without the logarithm, it would look like the unlogarithmized one above). As we can see, there are areas where the search results come from hosts with few backlinks, most of the search results are in the middle, and for very few search results we have an enormous number of backlinks. Admittedly, the average is not really ideal here. But at least we see that we have different "regions".
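A sketch of that calculation, assuming dplyr and a keyword column in dataset (the column name is my guess):

    library(dplyr)

    # Average number of backlinks of the ranked hosts, per keyword
    per_keyword <- dataset %>%
      group_by(keyword) %>%
      summarise(mean_backlinks = mean(backlinks_raw, na.rm = TRUE))

    # Logarithmized, as in the histograms above
    hist(log10(per_keyword$mean_backlinks + 1))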

Now that we’ve clarified why certain data is logarithmized, let’s take a closer look at what the correlations look like, starting with age and backlinks:

If we squint a little, we can make out the impression of a line; it almost looks as if there were a correlation between the age of a domain and its backlinks. Let's test that:
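In R this is a single call; the output below is exactly what cor.test() prints:

    cor.test(dataset$year, dataset$backlinks_log)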

Pearson’s product-moment correlation
data: dataset$year and dataset$backlinks_log
t = -24.146, df = 4286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3721183 -0.3194161
sample estimates:
cor
-0.3460401

That looks good, and it's not entirely surprising: the longer a domain has existed, the more time it has had to collect links. And while a correlation says nothing about the direction of causality, it is unlikely that a domain gets older the more backlinks it gets.

Let’s take a look at that again for the combination of age and speed:

Pearson’s product-moment correlation
data: dataset$year and dataset$SPEED
t = 13.129, df = 4356, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1663720 0.2234958
sample estimates:
cor
0.1950994

It is interesting that the correlation coefficient is positive here: the larger the year (i.e. the younger the domain), the higher the speed value. In other words, the older a domain, the slower it is.

And what about the correlation between backlinks and position?

Good question. As already discussed in the last part, this does not hold for every keyword. Let's look at the correlation between backlinks and position for each keyword and then throw the resulting correlation coefficients onto a histogram:
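A sketch of this per-keyword computation, again assuming dplyr and the column names keyword and position:

    library(dplyr)

    # Correlation between backlinks and position, per keyword
    cors_per_keyword <- dataset %>%
      group_by(keyword) %>%
      summarise(cor_bl_pos = cor(position, backlinks_log,
                                 use = "complete.obs"))

    hist(cors_per_keyword$cor_bl_pos)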

Clearly, we have some keywords here whose ranking hosts show at least a weak, if not moderate, correlation with the number of backlinks. This means we would have to look at the composition of the ranking for each keyword individually. And since we already know that rankings are dynamic, this picture makes that even clearer.

Unfortunately, there is no correlation between the average number of backlinks of the ranked pages for a keyword and the strength of the correlation between backlinks and position for that keyword. As the screenshot on the left shows, this is a very colorful mix.
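This can be checked by joining the two per-keyword tables from the sketches above; again only a sketch under the same naming assumptions:

    # One row per keyword: average backlinks plus the backlinks/position correlation
    per_keyword_stats <- inner_join(per_keyword, cors_per_keyword, by = "keyword")

    cor.test(log10(per_keyword_stats$mean_backlinks + 1),
             per_keyword_stats$cor_bl_pos)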

What could be the reason for this? For one thing, I only have the backlink data for the hosts, not for the respective landing pages. That would of course be even nicer, and ideally I could then also look at how the individual backlink factors are composed. Given my lack of credits, however, this is not possible at the moment. And here we have a typical problem in the field of data science: we know that the data is out there, but we can't get to it. Nevertheless, this approach already offers enormous advantages: I can now look at the composition of the current ranking for each keyword individually and act accordingly. In the example on the left, I can see that I need a lot of backlinks for the host for "privacy", whereas in my data set (not in the screenshot) my host needs few backlinks for searches like "poems early retirement". So we need exactly this correlation matrix for each keyword instead of an overall view as above.

In the [next part][10] we will get more data (we started with TF/IDF and WDF/IDF).

Comments (since February 2020 the comment function has been removed from my blog):

Steffen Blankenbach says

1 April 2018 at 15:34

Hello Tom, very good and exciting article. Once again, it becomes clear how much effort you have to put into gaining insights, only to find out that there are many other influencing factors that could not be taken into account in the calculation. This will probably always remain a problem of data evaluation. Personally, I assume that thanks to RankBrain, ranking factors even differ at the keyword level, and that such a mass evaluation is therefore not expedient for companies. A case study for a specific industry would be exciting. I look forward to more articles from you.

[1]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-02-um-12.09.07.png
[2]: http://tom.alby.de/wp-content/uploads/2018/04/matrix.png
[3]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-03-um-01.14.04.png
[4]: http://tom.alby.de/wp-content/uploads/2018/04/hist.png
[5]: http://tom.alby.de/wp-content/uploads/2018/04/hist_log.png
[6]: http://tom.alby.de/wp-content/uploads/2018/04/00000b.png
[7]: http://tom.alby.de/wp-content/uploads/2018/04/data-science-seo-age-backlinks.png
[8]: http://tom.alby.de/wp-content/uploads/2018/04/plot_zoom_png.png
[9]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-03-um-02.17.36.png
[10]: http://tom.alby.de/data-science-meets-seo-teil-5/
