This is one part of an analysis I did for the talk “Data Science meets SEO” at the SEO Campixx in Berlin on March 1st, 2018. My main focus was on looking at a larger amount of data and applying basic data science approaches. The whole series (in German) is available on my homepage.
While Google uses more than 200 ranking signals according to their own blog, only a fraction of these are available to us:
This data is based on 500 search queries, i.e. 4,890 results, after removing Google domains from the result set. The dataset is obviously limited, mainly because access to APIs containing backlink data costs money. The backlink data for this set comes from the excellent German tool Sistrix. 500 queries are a tiny sample, and it is questionable whether this is enough to derive reliable conclusions. This post is more about the approach itself, though some insights appear to be valid regardless.
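As an aside, removing Google's own domains from scraped results is a simple filter step. The following is a minimal, hypothetical sketch: the data frame `serp_results`, its columns, and the list of domains are assumed names for illustration and not taken from the dataset used here.

```r
library(tidyverse)

# Hypothetical example data; `keyword`, `domain`, and `position`
# are assumed column names, not the original dataset's schema.
serp_results <- tibble(
  keyword  = c("seo", "seo", "seo"),
  domain   = c("example.com", "google.com", "youtube.com"),
  position = c(1, 2, 3)
)

# Google-owned domains that should not count as organic competitors
# (illustrative list only).
google_domains <- c("google.com", "google.de", "youtube.com")

# Drop Google-owned results; with 500 queries of roughly 10 organic
# results each, a filter like this leaves about the 4,890 rows
# analysed below.
serp_results_clean <- serp_results %>%
  filter(!domain %in% google_domains)
```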
This document was written as an R Markdown document in RStudio.
library(tidyverse)
library(digest)

# Load the crawled SERP data (one row per query/result combination).
all_data <- read.csv("/home/tom/data-science-seo/data-science-seo/data/all_data.csv")

# Keep only the ranking signals available to us, grouped and ordered
# by query (hash) and SERP position. The column name "SwarchVolume"
# is kept as it appears in the source CSV.
(dataset <- all_data %>%
  select(hash, position, secure2, year, SPEED, backlinks_raw, backlinks_log,
         title_word_count, desc_word_count, content_word_count,
         content_tf_idf, content_wdf_idf, totalResults, SwarchVolume) %>%
  group_by(hash, position) %>%
  arrange(hash, position))
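As a quick sanity check, the resulting grouped tibble can be explored with standard tidyverse verbs. The following is an illustrative sketch and not part of the original analysis; it only uses columns selected above.

```r
# First look at the structure of the prepared dataset.
glimpse(dataset)

# Example exploration: average log-scaled backlink count per SERP
# position, to see whether backlinks vary across positions at all.
dataset %>%
  ungroup() %>%
  group_by(position) %>%
  summarise(mean_backlinks_log = mean(backlinks_log, na.rm = TRUE)) %>%
  arrange(position)
```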