Give me a reason why: SEO & Data Science?

This is one part of an analysis I did for my talk “Data Science meets SEO” at SEO Campixx in Berlin on March 1st, 2018. My main focus was on looking at a larger amount of data and applying basic data science approaches to it. The whole series (in German) is available on my homepage.

While Google uses more than 200 ranking signals according to their own blog, only a fraction of these are available to us.

The data is based on 500 search queries, i.e. 4,890 results after removing Google domains. The dataset is obviously limited, mainly because access to APIs that provide backlink data costs money. The backlink data for this set comes from the beautiful German tool Sistrix. 500 queries are a tiny sample, and it is questionable whether this is enough to draw reliable conclusions. This is more about the approach itself, although some of the insights seem to be valid anyway.

This document has been written as a Markdown document in RStudio using R.

Getting started

Libraries, importing data and tidying it

Load Libraries

library(tidyverse)
library(digest)

Load dataset

all_data <- read.csv("/home/tom/data-science-seo/data-science-seo/data/all_data.csv")
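Before tidying, it can help to confirm that the import matches the sample size described above. The following is only a minimal sketch in base R: it uses a small made-up data frame called `toy` in place of the real file, and it assumes that `hash` identifies one search query, which the original does not state explicitly.

```r
# Hypothetical stand-in for all_data; the real file has many more columns.
toy <- data.frame(
  hash     = c("a1", "a1", "a1", "b2", "b2"),  # assumed: one hash per search query
  position = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

n_queries <- length(unique(toy$hash))  # number of distinct queries
n_results <- nrow(toy)                 # number of result rows

# Number of result rows per query.
results_per_query <- table(toy$hash)

print(n_queries)
print(n_results)
print(results_per_query)
```

Run against `all_data` instead of `toy`, the same three values could be compared with the 500 queries and 4,890 results mentioned earlier.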

Keep only the numeric columns, i.e. remove keywords, host names, etc.

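# Keep the query hash, the ranking position and the numeric features.
# The outer parentheses print the result after assigning it.
(dataset <- all_data %>%
  select(hash, position, secure2, year, SPEED, backlinks_raw, backlinks_log,
         title_word_count, desc_word_count, content_word_count,
         content_tf_idf, content_wdf_idf, totalResults, SwarchVolume) %>%
  group_by(hash, position) %>%
  arrange(hash, position))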
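Since the aim of this step is to keep only numbers, a quick type check can show whether any selected column is still non-numeric. This is a sketch under assumptions rather than part of the original analysis: `toy_dataset` is a made-up miniature of the tidied data, and `hash` is assumed to be stored as a character string.

```r
# Hypothetical miniature of the tidied dataset (values are made up).
toy_dataset <- data.frame(
  hash          = c("a1", "a1", "b2"),  # assumed to be a character digest
  position      = c(1, 2, 1),
  backlinks_raw = c(120, 15, 3400),
  stringsAsFactors = FALSE
)

# TRUE for every column that holds numbers.
is_num <- vapply(toy_dataset, is.numeric, logical(1))

# Names of the columns that are not numeric.
non_numeric <- names(is_num)[!is_num]
print(non_numeric)
```

In this toy example only `hash` is reported, which would be expected if it serves purely as a grouping key rather than as a feature.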