Give me a reason why: SEO & Data Science?

This is one part of an analysis I did for a talk “Data Science meets SEO” at the SEO Campixx in Berlin on March 1st, 2018. My main focus was on looking at a larger amount of data and applying basic data science approaches. The whole series (in German) is available on my homepage.

While Google uses more than 200 ranking signals according to their own blog, only a fraction of these are available to us.

This data relies on 500 search queries, i.e. 4,890 results, given that I have removed Google domains from the results. The dataset is obviously limited, mainly because access to APIs containing backlink data costs money. The backlink data for this set comes from the beautiful German tool Sistrix. 500 queries are a tiny sample, and it is questionable whether this is enough to derive reliable information. This is more about the approach itself, and some insights seem to be valid anyway.

This document has been written as a Markdown document in RStudio using R.

Getting started

Libraries, importing data and tidying it

Load Libraries

library(tidyverse)
library(digest)

load dataset

all_data <- read.csv("/home/tom/data-science-seo/data-science-seo/data/all_data.csv")

pure numbers, i.e. remove keywords, host names, etc

(dataset <- all_data %>%
  select(hash,position, secure2, year, SPEED, backlinks_raw, backlinks_log,title_word_count,desc_word_count,content_word_count,content_tf_idf,content_wdf_idf,totalResults,SwarchVolume) %>%
  group_by(hash,position) %>%
  arrange(hash, position))

Let’s remove infinite numbers because they make things more difficult

dataset$backlinks_log[!is.finite(dataset$backlinks_log)] <- NA
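Where do the infinite values come from? Assuming that backlinks_log was computed as the logarithm of backlinks_raw (an assumption, that step is not shown here), a page with zero backlinks ends up with log(0) = -Inf. A minimal sketch with made-up numbers:

```r
# Hypothetical backlink counts; the real values come from all_data.csv
backlinks_raw <- c(0, 1, 10, 250, NA)

# log(0) is -Inf, which would distort cor() and plots
backlinks_log <- log(backlinks_raw)

# Same replacement as above: everything that is not finite becomes NA
backlinks_log[!is.finite(backlinks_log)] <- NA

print(backlinks_log)
```

Values that were already NA stay NA, because is.finite(NA) is FALSE as well.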

First Look at the Data

plot the raw data in a scatterplot matrix

pairs(dataset[2:12])

Visualization is always great for identifying patterns, but as we can see here, there are no interesting patterns. A correlation matrix may disclose more. However, it shows all correlation coefficients, even those where p > 0.05.

Correlation Matrix, added “complete.obs” for all the NAs

cor(dataset[2:13], use = "complete.obs", method="pearson")
##                        position      secure2         year       SPEED
## position            1.000000000 -0.038356221  0.020731298 -0.02759370
## secure2            -0.038356221  1.000000000 -0.056481291  0.08849541
## year                0.020731298 -0.056481291  1.000000000  0.14196196
## SPEED              -0.027593704  0.088495407  0.141961962  1.00000000
## backlinks_raw       0.018421465  0.087199717  0.027382804  0.05877407
## backlinks_log      -0.059347963  0.235114256 -0.430682752  0.00779184
## title_word_count   -0.010990383 -0.025079491 -0.004333514 -0.02713053
## desc_word_count    -0.005155549 -0.055839650  0.013637362 -0.03073151
## content_word_count -0.005152130 -0.025663786  0.039421840 -0.06848420
## content_tf_idf     -0.020095666 -0.037562046  0.030942858  0.02560561
## content_wdf_idf    -0.017440558 -0.053175664  0.023868711  0.03538900
## totalResults       -0.009186989 -0.007294347 -0.052269211 -0.03102022
##                    backlinks_raw backlinks_log title_word_count
## position            0.0184214648   -0.05934796     -0.010990383
## secure2             0.0871997167    0.23511426     -0.025079491
## year                0.0273828038   -0.43068275     -0.004333514
## SPEED               0.0587740667    0.00779184     -0.027130527
## backlinks_raw       1.0000000000    0.36591714     -0.213687201
## backlinks_log       0.3659171390    1.00000000      0.024943806
## title_word_count   -0.2136872013    0.02494381      1.000000000
## desc_word_count    -0.1125314088   -0.08507592      0.083838293
## content_word_count -0.0327708234   -0.05269331      0.009694726
## content_tf_idf     -0.0253931259   -0.07992594     -0.058933273
## content_wdf_idf    -0.0205034149   -0.06646933     -0.046400601
## totalResults       -0.0002951899    0.03921995      0.001780916
##                    desc_word_count content_word_count content_tf_idf
## position              -0.005155549       -0.005152130   -0.020095666
## secure2               -0.055839650       -0.025663786   -0.037562046
## year                   0.013637362        0.039421840    0.030942858
## SPEED                 -0.030731506       -0.068484196    0.025605607
## backlinks_raw         -0.112531409       -0.032770823   -0.025393126
## backlinks_log         -0.085075921       -0.052693313   -0.079925941
## title_word_count       0.083838293        0.009694726   -0.058933273
## desc_word_count        1.000000000        0.179998982   -0.004068675
## content_word_count     0.179998982        1.000000000   -0.012556464
## content_tf_idf        -0.004068675       -0.012556464    1.000000000
## content_wdf_idf       -0.033608496       -0.021117337    0.941572002
## totalResults           0.019239501       -0.008309859    0.016201171
##                    content_wdf_idf  totalResults
## position               -0.01744056 -0.0091869894
## secure2                -0.05317566 -0.0072943470
## year                    0.02386871 -0.0522692110
## SPEED                   0.03538900 -0.0310202206
## backlinks_raw          -0.02050341 -0.0002951899
## backlinks_log          -0.06646933  0.0392199532
## title_word_count       -0.04640060  0.0017809161
## desc_word_count        -0.03360850  0.0192395009
## content_word_count     -0.02111734 -0.0083098595
## content_tf_idf          0.94157200  0.0162011709
## content_wdf_idf         1.00000000  0.0127050011
## totalResults            0.01270500  1.0000000000

For those not looking at such numbers every day: correlation coefficients are numbers between -1 and +1, where 0 means no linear correlation at all and -1 or +1 mean a perfect negative or positive correlation. Correlation is not cause and effect but at least something to look into. In this matrix, we see correlations where it would be strange not to see them, e.g., obviously, there should be a correlation between WDF/IDF in content and TF/IDF in content (0.94). However, we are mainly interested in the first column, the correlation of position with all other signals that we have here. Total Results and Search Volume are of course not ranking signals but are interesting to look at later.
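Since cor() only returns the coefficients, the matching p values have to be computed separately. A minimal sketch, assuming a plain data frame with numeric columns; the helper name cor_p_matrix and the simulated toy data are made up for illustration and are not part of the original analysis:

```r
# Hypothetical helper: pairwise p values for all columns of a data frame
cor_p_matrix <- function(df) {
  vars <- names(df)
  p <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
              dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i != j) {
        # cor.test() drops incomplete pairs on its own
        p[i, j] <- cor.test(df[[i]], df[[j]], method = "pearson")$p.value
      }
    }
  }
  p
}

# Simulated stand-in for the real dataset (made-up values)
set.seed(42)
toy <- data.frame(position = rep(1:10, times = 50),
                  SPEED = sample(0:100, 500, replace = TRUE))
toy$content_tf_idf <- runif(500)
toy$content_wdf_idf <- toy$content_tf_idf * 0.9 + rnorm(500, sd = 0.05)

round(cor_p_matrix(toy), 4)
```

On the real data, something like cor_p_matrix(as.data.frame(dataset)[2:13]) should give the p value for every cell of the correlation matrix above. Note that cor.test() removes missing values per pair, whereas "complete.obs" removes every row with any NA, so the two can be based on slightly different rows.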

Is Speed a ranking factor?

look at the summary for Speed

summary(dataset$SPEED)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   60.00   73.00   69.18   81.00  100.00      15

On average, pages in this sample have a speed of 69.18, but the mean seems to be pulled down by low outliers since the median is at 73. Unfortunately, base R does not provide a function for the statistical mode (the built-in mode() returns the storage mode of an object), so we have to use our own function for this.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Mode(dataset$SPEED)
## [1] 80
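A quick check of the Mode() helper on made-up values, together with table() as a cross-check. The vector speeds is hypothetical and only there to make the snippet run on its own:

```r
# Same helper as above, repeated so that the snippet runs on its own
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

speeds <- c(80, 73, 80, 60, 73, 80)   # made-up speed values
Mode(speeds)                           # most frequent value
table(speeds)                          # frequency of every value, as a cross-check

# Base R's mode() is something else: it returns the storage mode of an object
mode(speeds)
```

If several values are tied for the highest frequency, which.max() picks the first one it encounters, so the helper returns only one of them.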

This means that the most frequent page speed value in the sample is 80. Let’s look at a histogram:

plot a histogram for speed

hist(dataset$SPEED, breaks=50)

Now, is there a correlation between speed and position on a Google SERP?

Plot the relation between Position and SiteSpeed

boxplot(position~SPEED, data=dataset)

Looks like there is nothing in it. Let’s do the cor.test:

Correlation test

cor.test(dataset$position, dataset$SPEED, method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  dataset$position and dataset$SPEED
## t = -2.0287, df = 4929, p-value = 0.04255
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0567504633 -0.0009718484
## sample estimates:
##         cor 
## -0.02888364

With r ≈ -0.03, there is practically no correlation between Speed and position on a Google SERP; the p value of 0.043 is just below 0.05, but with almost 5,000 observations even a negligible correlation can become statistically significant. One reason for the missing effect could be that having a slow page is like not having showered: everyone will leave, while no one notices when you have showered. In other words, having a fast page will not help you but a slow page may have an impact. However, we only look at the top 10 results. Maybe there would be a higher impact in the top 100.
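Statistical significance and effect size are two different things. As a rough check, the test statistic and the p value can be recomputed from nothing but the coefficient and the degrees of freedom printed by cor.test() above; r squared then shows how much of the variance is shared. This is a sketch of the standard formula, not an additional analysis:

```r
# Values copied from the cor.test() output above
r  <- -0.02888364
df <- 4929

# Test statistic of the Pearson correlation test: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance that speed and position have in common
r_squared <- r^2

round(c(t = t_stat, p = p_value, r_squared = r_squared), 5)
```

r squared is below 0.001 here, i.e. speed accounts for less than 0.1% of the variance in position in this sample.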

Is SSL a ranking factor?

cor.test(dataset$position,dataset$secure2)
## 
##  Pearson's product-moment correlation
## 
## data:  dataset$position and dataset$secure2
## t = -3.214, df = 4944, p-value = 0.001318
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07343806 -0.01781376
## sample estimates:
##        cor 
## -0.0456613

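No, it is not, at least not in any practically relevant way: the correlation is statistically significant (p = 0.0013), but with r ≈ -0.05 it is far too weak to matter.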

Does Domain Age really play no role at all?

Data Review

Histogram of Domain Ages

hist(dataset$year, breaks=25, main="Histogram Domain Ages", xlab="Year in which a domain was seen first")

Obviously, there are far more older domains in the dataset, which could lead to the interpretation that they are more likely to rank. However, that is not the case.

Plot position & age

boxplot(position~year,data = dataset, varwidth=TRUE, main="Boxplot of domain age and position", xlab="Domain Age", ylab="Position")

The boxplot displays the range of 50 percent of all results in a box with the median as a line. As we can see, the median of the oldest domains here is at position 6 whereas a domain from 2017 has a median of position 5. However, there are far fewer domains from 2017. Domains from 2018 rank worst.

Correlation Test Position & Age

cor.test(dataset$position,dataset$year)
## 
##  Pearson's product-moment correlation
## 
## data:  dataset$position and dataset$year
## t = 1.4669, df = 4929, p-value = 0.1425
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.007027885  0.048772927
## sample estimates:
##        cor 
## 0.02088879

The correlation test is not significant, with a p value of 0.1425. The plot alone already shows that it is impossible to spot a relationship. For all readers of the Google blog, this is not news: Google said years ago that domain age doesn’t play a role. Having said that, there are some other interesting observations.

Other observations re Domain Age

Age and Speed

plot(dataset$year,dataset$SPEED, main="Speed versus Domain Age", xlab="Domain Age", ylab="Speed")

We could also try to draw a line here although it is not as obvious as the one before.

Correlation Test

cor.test(dataset$year,dataset$SPEED)
## 
##  Pearson's product-moment correlation
## 
## data:  dataset$year and dataset$SPEED
## t = 10.87, df = 4915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1258000 0.1803932
## sample estimates:
##       cor 
## 0.1532135

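There is a weak positive correlation here (r ≈ 0.15), so pages on younger domains tend to have slightly higher speed values. The correlation matrix above also shows a very weak negative correlation between year and HTTPS (r ≈ -0.06), so we could conclude that website owners with old domains are marginally more likely to get an SSL certificate for their site.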

Number of words and position

Plot

A study by Backlinko said that the average number 1 document on Google has 1,890 words. With my (much smaller) dataset, I cannot reproduce this. But let’s look at a plot first.

plot(dataset$position,dataset$content_word_count, main="Number of words in document and ranking position", xlab = "Position in SERP", ylab = "Number of words")

Looks like we cannot draw a line here.

Correlation Test

cor.test(dataset$position,dataset$content_word_count)
## 
##  Pearson's product-moment correlation
## 
## data:  dataset$position and dataset$content_word_count
## t = -0.49707, df = 4944, p-value = 0.6192
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03493254  0.02080510
## sample estimates:
##          cor 
## -0.007069213

The test does not show a significant correlation either. So let’s look at the raw data:

summary(dataset$content_word_count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      1.0     77.0    273.0    690.3    839.0 140910.0

The “average” document has 690 words but the median is at 273 words; unfortunately, Backlinko did not disclose what “average” they used. Also, they did not disclose how they got the word count. For this study, Python’s BeautifulSoup was used for scraping and extracting the content. I have also only looked at documents in Germany, so maybe more words are used in the US. But this seems unlikely. So let’s look at number 1 documents only:

summary(dataset$content_word_count[dataset$position==1])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00    98.75   313.00   640.53   733.75 10469.00

The mean is lower whilst the median is higher. Looks like there is something in there, right?

summary(dataset$content_word_count[dataset$position==11])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   201.2   565.5   848.0  1222.8  4288.0

But if I write a bit more, then it seems to be more likely for me to get to position 11. In other words, writing more does not mean that I will get higher on the Google SERP. Let’s look at a log(wordcount):

boxplot(log(content_word_count)~position, data = dataset)
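Instead of calling summary() once per position, mean and median can be compared for all positions in one table. A minimal sketch on simulated, right-skewed word counts; the toy data and the name words_by_position are assumptions for illustration, only the column names follow the dataset above:

```r
library(dplyr)

# Simulated stand-in: right-skewed word counts that do not depend on position
set.seed(1)
toy <- data.frame(
  position = rep(1:10, each = 200),
  content_word_count = round(rlnorm(2000, meanlog = 5.6, sdlog = 1.4)) + 1
)

# Median and mean number of words for every position in one table
words_by_position <- toy %>%
  group_by(position) %>%
  summarise(
    n = n(),
    median_words = median(content_word_count, na.rm = TRUE),
    mean_words = mean(content_word_count, na.rm = TRUE)
  )

print(words_by_position)
```

With skewed data like this, a few very long documents pull the mean well above the median, which is why it matters which “average” a study reports.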

To sum up, in my small dataset, I cannot conclude that writing more will bring you to a higher position on a Google results page. And I cannot reproduce the number from backlinko.

WDF/IDF

WDF/IDF has been one of the hottest things for many SEOs in recent years, so what if we look at exact match WDF/IDF?

Plot

plot(dataset$position,dataset$content_wdf_idf)

Does not look like we can see a pattern here.

Correlation Test

cor.test(dataset$position,dataset$content_wdf_idf)
## 
##  Pearson's product-moment correlation
## 
## data:  dataset$position and dataset$content_wdf_idf
## t = -1.0225, df = 4745, p-value = 0.3066
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04327262  0.01361174
## sample estimates:
##         cor 
## -0.01484245

Our test does not show a significant correlation here either. Based on this data, I would be hesitant to recommend WDF/IDF, also because we are missing all the stemming and other linguistic methods applied to text.

Looking at Keyword Level

Why look at data on a Keyword Level

Mainly because ranking is dynamic. Backlink data is not available for all documents, and in such cases, other ranking signals become more dominant. So we may want to find out where a ranking signal is a statistically reliable signal.

cor_per_keyword_backlinks <- dataset %>%
  group_by(hash) %>%
  mutate(cor = cor(position,backlinks_raw), p = cor.test(position,backlinks_raw)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
   distinct()

hist(cor_per_keyword_backlinks$cor, breaks=100, main="Histogram of Correlations of Backlinks/Positions on a Keyword Level", xlab="Correlation Coefficients")

This is the distribution of the correlation coefficients for all raw backlink data. We would expect a negative correlation (the further down the results page, i.e. the larger the position number, the lower the number of backlinks), but we also see keywords where the opposite is the case. Unfortunately, the pure correlation coefficient is not reliable; we also need the p value. So let’s look only at results where p < 0.05:
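The filtering step could look like the following sketch on simulated data. The toy data frame and its values are assumptions for illustration, and summarise() is used here instead of mutate() plus distinct() to get one row per keyword; the column names follow the code above:

```r
library(dplyr)

# Simulated stand-in for the real data: 200 keywords with 10 results each,
# backlinks drawn independently of position (pure noise)
set.seed(2018)
toy <- data.frame(
  hash = rep(sprintf("kw%03d", 1:200), each = 10),
  position = rep(1:10, times = 200),
  backlinks_raw = rpois(2000, lambda = 50)
)

# One row per keyword with correlation coefficient and p value
cor_per_keyword <- toy %>%
  group_by(hash) %>%
  summarise(
    cor = cor(position, backlinks_raw),
    p = cor.test(position, backlinks_raw)$p.value
  )

# Keep only keywords where the correlation is significant at the 5% level
significant <- cor_per_keyword %>% filter(p < 0.05)

hist(significant$cor, breaks = 100,
     main = "Histogram of Correlations of Backlinks/Positions on a Keyword Level",
     xlab = "Correlation Coefficients below p<0.05")
```

Because the simulated backlinks are pure noise, about one keyword in twenty is expected to pass p < 0.05 by chance alone, which is worth keeping in mind when reading such histograms.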

Speed

Let’s do the same for speed:

cor_per_keyword_speed <- dataset %>%
  group_by(hash) %>%
  mutate(cor = cor(position,SPEED), p = cor.test(position,SPEED)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
   distinct()

hist(cor_per_keyword_speed$cor[cor_per_keyword_speed$p<0.05], breaks=100, main="Histogram of Correlations of Speed/Positions on a Keyword Level", xlab="Correlation Coefficients below p<0.05")

Again, it is unlikely that having a slow page will improve your ranking on Google. So please take this with a grain of salt.

HTTPS

cor_per_keyword_https <- dataset %>%
  group_by(hash) %>%
  mutate(cor = cor(position,secure2), p = cor.test(position,secure2)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
   distinct()

hist(cor_per_keyword_https$cor[cor_per_keyword_https$p<0.05], breaks=100, main="Histogram of Correlations of HTTPS/Positions on a Keyword Level", xlab="Correlation Coefficients below p<0.05")

Same here.

WDF/IDF

cor_per_keyword_wdf_idf <- dataset %>%
  group_by(hash) %>%
  filter(!is.na(content_wdf_idf)) %>%
  mutate(cor = cor(position,content_wdf_idf), p = cor.test(position,content_wdf_idf)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
   distinct()

hist(cor_per_keyword_wdf_idf$cor[cor_per_keyword_wdf_idf$p<0.05], breaks=100, main="Histogram of Correlations of WDF-IDF/Positions on a Keyword Level", xlab="Correlation Coefficients below p<0.05")

There are only very few results with a significant correlation. Compared to the other results, I would be extremely careful about using exact match-based WDF/IDF tools for an analysis.

Example for one Keyword

Now, let’s look at only one keyword, “player update”:

player_update <- dataset[ which(dataset$hash == "002849692a74103fa4f867b43ac3b088"),]
cor(player_update[2:12])
##                        position     secure2        year       SPEED
## position            1.000000000 -0.14213381  0.68232305  0.66909158
## secure2            -0.142133811  1.00000000 -0.12309149  0.03240948
## year                0.682323045 -0.12309149  1.00000000  0.13563725
## SPEED               0.669091578  0.03240948  0.13563725  1.00000000
## backlinks_raw      -0.680702773  0.47885649 -0.36529823 -0.68277810
## backlinks_log      -0.266761578 -0.11383705 -0.41150574 -0.34237645
## title_word_count   -0.051752817 -0.86063153  0.06519164 -0.24155722
## desc_word_count     0.004887624 -0.41838105  0.33015892 -0.09250187
## content_word_count  0.236537896 -0.87914035  0.34344837 -0.27081845
## content_tf_idf     -0.058025885 -0.40824829 -0.20100756  0.19626152
## content_wdf_idf    -0.058025885 -0.40824829 -0.20100756  0.19626152
##                    backlinks_raw backlinks_log title_word_count
## position              -0.6807028   -0.26676158      -0.05175282
## secure2                0.4788565   -0.11383705      -0.86063153
## year                  -0.3652982   -0.41150574       0.06519164
## SPEED                 -0.6827781   -0.34237645      -0.24155722
## backlinks_raw          1.0000000    0.60288490      -0.12780842
## backlinks_log          0.6028849    1.00000000       0.39313530
## title_word_count      -0.1278084    0.39313530       1.00000000
## desc_word_count        0.0220935    0.04238384       0.44923620
## content_word_count    -0.3810450    0.10522050       0.76857971
## content_tf_idf        -0.1984321    0.03267363       0.35135135
## content_wdf_idf       -0.1984321    0.03267363       0.35135135
##                    desc_word_count content_word_count content_tf_idf
## position               0.004887624          0.2365379    -0.05802589
## secure2               -0.418381048         -0.8791404    -0.40824829
## year                   0.330158915          0.3434484    -0.20100756
## SPEED                 -0.092501869         -0.2708185     0.19626152
## backlinks_raw          0.022093496         -0.3810450    -0.19843212
## backlinks_log          0.042383844          0.1052205     0.03267363
## title_word_count       0.449236202          0.7685797     0.35135135
## desc_word_count        1.000000000          0.2758416     0.69725202
## content_word_count     0.275841618          1.0000000     0.12353875
## content_tf_idf         0.697252022          0.1235387     1.00000000
## content_wdf_idf        0.697252022          0.1235387     1.00000000
##                    content_wdf_idf
## position               -0.05802589
## secure2                -0.40824829
## year                   -0.20100756
## SPEED                   0.19626152
## backlinks_raw          -0.19843212
## backlinks_log           0.03267363
## title_word_count        0.35135135
## desc_word_count         0.69725202
## content_word_count      0.12353875
## content_tf_idf          1.00000000
## content_wdf_idf         1.00000000

Whilst we can see correlations here, cor() does not offer p values. However, we can see that on a keyword level, signal impact looks different, and looking at keywords of the same region (i.e. similar correlations for one signal), we may find other, more robust signals for that region.
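Even without running cor.test() for every pair, one can estimate how large a coefficient has to be before it counts as significant for a single keyword. A minimal sketch, assuming 10 complete results for the keyword (the actual number may be lower because of removed Google domains and missing values):

```r
# Assumption: 10 complete results for the keyword, so df = n - 2 = 8
n  <- 10
df <- n - 2

# Critical t value for a two-sided test at the 5% level
t_crit <- qt(0.975, df)

# Invert t = r * sqrt(df) / sqrt(1 - r^2) to get the smallest significant |r|
r_crit <- t_crit / sqrt(t_crit^2 + df)

round(r_crit, 3)
```

Under that assumption, only coefficients with an absolute value above roughly 0.63 would be significant, which in the first column of the matrix above would apply to year, SPEED and backlinks_raw only.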

Summary

Looking at this small dataset, it was not possible to prove that some of the common SEO practices really have an impact as long as they are regarded as general advice that will always work. Having said that, this is a really small data set. Nevertheless, the reactions to the presentation were emotional, to pick my words carefully.

Unlike other participants, I do not sell SEO tools, and I also do not earn money selling SEO services. It is like telling a Christian that Jesus never existed. But that’s not what I had said; I only said that based on the data I currently have, I don’t see that WDF/IDF has an impact on SEO. This does not mean that SEO consultants cannot be successful using WDF/IDF: my dataset is small. But these SEO consultants may also do a lot of other great stuff for a website and then think it was WDF/IDF that moved the needle. And of course, as soon as you believe in something, you only look at the data that supports your opinion (confirmation bias). Unfortunately, people love easy explanations for things they see. But there are no easy answers to complex questions.