Sometimes R displays results in scientific notation, which wouldn’t be bad if the exponent base didn’t differ:

This can easily be prevented with `format(x, scientific = FALSE)`.
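A quick illustration of the difference:

```r
# A number that R prints in scientific notation by default:
x <- 123456789012345
print(x)
# [1] 1.234568e+14

# format() with scientific = FALSE forces fixed notation:
format(x, scientific = FALSE)
# [1] "123456789012345"
```

Note that `format()` returns a character string, which is usually fine for display purposes; alternatively, `options(scipen = 999)` changes the behavior globally.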

There are 26 posts tagged r (this is page 2 of 3).

Louis Aslett’s AMIs are useful for quickly starting an RStudio server on an AWS EC2 instance. However, these AMIs do not always contain the latest version of R or RStudio. These two commands help to update both R and RStudio:
# Update RStudio Server
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.442-amd64.deb
sudo gdebi rstudio-server-1.1.442-amd64.deb

# Update R (add the CRAN repository, then upgrade)
echo "deb http://cran.stat.ucla.edu/bin/linux/ubuntu `lsb_release -sc`/" | sudo tee --append /etc/apt/sources.list.d/cran.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt update
apt list --upgradable
If you read along here often, you know that Sistrix is one of my absolute favorite tools (I'll brazenly link it as the best SEO tool), if only because of the lean API, the absolutely lovable Johannes with his really clever blog posts, and the calm way the toolbox keeps winning you over. Of course, all the other tools are great too, but Sistrix is something like my first great tool love, one you can't and don't want to banish from your SEO memory. And even if the following data might scratch the paint, it didn't put a real dent in my Sistrix preference.
But enough of the adulation. What is this about? As already described in the post about keywordtools.io, and in the side note there about the inaccuracies of the Google AdWords Keyword Planner data, it is a challenge to get reliable data on the search volume of keywords. And if you still believe that Google Trends provides absolute numbers, well… For this purpose, Sistrix offers a traffic index from 0 to 100, calculated from various data sources, which is supposed to yield higher accuracy. But how accurate are those numbers? Along the way, I also want to show why box plots are a wonderful way to visualize data.
The basis here is 4,491 search queries from a sample for which I have both the Sistrix data and the Google AdWords Keyword Planner data. It is not the first sample I have pulled, by the way, and the data looks about the same everywhere, so it is not an artifact of my sample. Let's first look at the raw data:

As we can see, you could draw a curve into this plot, but the relationship does not appear to be linear. Or maybe the picture is just distorted by the outlier? Let's look at the plot without the giant outlier:

Maybe we still have too many outliers here; let's take only those with a search volume under 100,000 per month:

In fact, we see a tendency to rise toward the upper right, though not a clear line (I did not run a regression analysis). But we also see that at a traffic index of 5 there are search volumes exceeding those at the index values of 10, 15, 20, 25, and 30, and even at 50, so here is the curve again:

The median ignores the outliers within the smaller values:

So if we look at the median data, we see a plausible trend at least for the higher values, with the exception of the values for the Sistrix traffic indexes of 65 and 70. However, the variation around these values differs considerably, as plotting the standard deviation for each Sistrix traffic index shows:

We don't see a pattern in the spread. It is not the case that the dispersion increases with a higher index value (which is what you would expect); in fact, it is already higher at the index value of 5 than at 10, and so on. We see the highest dispersion at the value of 60.
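The per-index medians and standard deviations discussed above can be computed with `aggregate()`. A minimal sketch with made-up data; the column names `sistrix` (traffic index) and `sv` (Keyword Planner search volume) are hypothetical stand-ins for the actual data set:

```r
# Hypothetical columns: sistrix (traffic index), sv (search volume)
df <- data.frame(
  sistrix = rep(c(5, 10, 15), each = 4),
  sv      = c(10, 20, 30, 5000,   40, 50, 60, 70,   90, 100, 110, 120)
)

# Median search volume per traffic index (robust against the outlier at 5000)
aggregate(sv ~ sistrix, data = df, FUN = median)

# Standard deviation per traffic index as a measure of spread
aggregate(sv ~ sistrix, data = df, FUN = sd)
```

The median for the index value 5 is 25 despite the 5000 outlier, which is exactly why the median plot above looks so much tamer than the raw data.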
And because box plots are simply a wonderful thing, I'll follow up with one:

Here the axes are swapped (with the Sistrix data on the x-axis, it was hard to see anything). The box shows where the middle 50% of the data lies: at a search volume of 390, for example, 50% of the data falls between the Sistrix values of 5 and 25; the median is indicated by the line in the box and is 15. The boxes grow at the beginning, then vary in size again; the smaller the box, the lower the dispersion. At some data points we see small circles, which R has flagged as outliers, especially at the low search volumes. Almost everything we plotted above is visualized here in a single plot. Box plots are simply wonderful.
Does this mean that the traffic data in Sistrix is unusable? No. As described in the introduction, the Keyword Planner data is not always correct either, so nothing is known for sure. If you treat the Keyword Planner data as the gold standard, you will not be satisfied with the Sistrix data. It would be helpful if there were more transparency about where exactly the data comes from. Obviously, linked Search Console data would be very helpful, as it shows real impressions. My recommendation is to look at several data sources and to examine the overlaps and deviations separately. That is unsatisfying because it is not automatic. But "a fool with a tool is still a fool".
Comments (since February 2020 the comment function has been removed from my blog):
Hanns says
Tom Alby says
Martin says
I actually don’t understand how this is supposed to work technically. How is Sistrix supposed to get the search queries that run through Google for each keyword? It’s not as if Google informs Sistrix briefly with every request.
The only thing I can think of is that they pull the data for each keyword from AdsPlanner. But… to present this as “own search volume” without any indication of where the data comes from, I would find grossly negligent.
Where could they still get data from?
Tom says
the answer is not 1 or 0; that also comes out in the article. You can't rely on the Ad Planner data either. Sistrix also gets data from customers who have linked their Search Console data, since you can see your page's impressions for a keyword there. But of course, all this is not available for every keyword, and that is how inaccuracies come about.
BG
Tom
This is the last part of this series on search engine optimization/SEO and data science, based on my presentation at SEO Campixx. I converted the data and the code into an HTML document via Knit, which makes my notebook, including the data, reproducible. There are a few more analyses in the notebook, and I have documented everything in English, as this is not only interesting for Germans. So if you want to read all the results in one document (without the TF/IDF, WDF/IDF, or stemming examples), please take a look at the Data Science & SEO notebook.

First of all, an addendum: we lacked the age for some domains, and I have now received this data from another source. In our sample, most of the domains were older, and my concern was that the missing domains might be much younger, so that our incomplete data would overstate the average age. Almost 20% of the domains were missing.
In fact, the missing domains are younger: while the median for our incomplete data set was 2001, it is 2011 for the missing domains. If you merge the data, however, the median is still 2001; only the mean has changed, from 2002 to 2003. So the amount of missing data was not large enough for this opposing trend to have a major impact. One could argue that this other source simply has completely different numbers, but that could not be confirmed in a sample of domains for which an age was already available. And if you look at the plot of the relationship between position on the search results page and the age of a domain, we haven't found anything new:

Box plots are a wonderful thing because they show an incredible amount about the data at a glance. The box shows where the middle 50% of the data lies, the thick line in the middle is the median, and the width of each box is proportional to the square root of its group's sample size. Even after several beers, there is no pattern to be seen here, except that the boxes are all at about the same height. Google has in fact said that the age of a domain does not matter.
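A sketch of how such a plot is produced, with simulated data (the columns `position` and `year` stand in for the actual data set); `varwidth = TRUE` is what scales each box width with the square root of its group size:

```r
# Hypothetical data: ranking position (1-10) and domain registration year
set.seed(1)
serps <- data.frame(
  position = factor(rep(1:10, each = 30)),
  year     = sample(1995:2015, 300, replace = TRUE)
)

# varwidth = TRUE makes box widths proportional to sqrt(group size)
boxplot(year ~ position, data = serps, varwidth = TRUE,
        xlab = "Position", ylab = "Domain age (year)")
```

With random years like these, the boxes all sit at about the same height, which is exactly what "no pattern" looks like.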
Another myth, and the great thing about this one is that we can clear it up relatively easily, because we can crawl the data ourselves. R, by the way, is not that great for crawling; there is the rvest package, but if you really only want the content, nothing comes close to Python's Beautiful Soup. Conveniently, you can also run Python inside RStudio notebooks, so only the actual text is extracted here, not the source code. Navigation elements and footers are still included, however, although we can assume that Google is able to extract the actual content. The following plot shows the relationship between content length and position:
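For completeness, a rough rvest equivalent of the word-counting step (the post itself used Python's Beautiful Soup; `count_words` is a hypothetical helper, and as noted above, `html_text2()` still includes navigation and footer text):

```r
library(rvest)

# Rough sketch: extract the visible text of a page and count its words.
count_words <- function(html) {
  text <- html_text2(html_element(html, "body"))
  length(strsplit(trimws(text), "\\s+")[[1]])
}

# Usage against a live page (not run here):
# count_words(read_html("http://katalog.premio-tuning.de/"))
```

This counts whitespace-separated tokens, which is crude but good enough for order-of-magnitude comparisons like the one above.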

As we can see, we see nothing, except for an interesting outlier with more than 140,000 words in a document (http://katalog.premio-tuning.de/), which ranked 3rd for the keyword "tuning catalogs". Otherwise, no correlation can be observed, so a blanket statement such as "more text = better position" cannot be derived. The median word count is 273, the mean 690; and remember, we are looking at the top 10 here. I would be very interested in how the colleagues from Backlinko arrived at 1,890 words for the average 1st-place document. They looked at far more search results (though what does "1 million search results" mean? About 100,000 search results pages, i.e. the results for about 100,000 search queries?), but they do not reveal which average they used. As my numbers show, there is a big difference between the median and the mean, the arithmetic mean being what most people call the average. It is not for nothing that I keep saying that the average is the enemy of statistics. Or maybe texts are simply longer in the USA? But since neither the data nor the methods are made available to us… well, at some point I learned that you have to publish both the data and the software used for evaluation along with your results so that everything is truly reproducible.

In this final part, I have added more signals, including TF/IDF and WDF/IDF. And as you can see in the correlation matrix, we don't have a correlation anywhere. In the last part, however, we had already seen that this does not apply to all keywords: in the histogram of the correlation coefficients, we saw both positive and negative correlations, but no p-values. If you only look at the correlation coefficients where p < 0.05, the picture changes again:

So we have keywords where the backlinks matter, and we also have keywords where the other signals matter. If we can draw one conclusion from the keyword set, it’s that there’s no one-size-fits-all rule. As already stated in the last part, we need the above correlation matrix for each keyword. And that’s exactly what is exciting, because we can look at how the ranking signals behave for each keyword individually or perhaps a topic.

And so you can see for the keyword "player update" (hash 002849692a74103fa4f867b43ac3b088 in the data in the notebook) that some signals are more prominent; see the figure on the left. Can you now be sure that you know exactly how the ranking works for this keyword? No (especially since we have not yet calculated the p-values here). But if we look at several keywords from the same "region" (i.e. with similar values in this signal), then there could actually be something to it.
Unfortunately, nothing here either. And this was probably the biggest point of contention at SEO Campixx. In this example, I only use the exact match, i.e. I look for exactly the keyword I entered in the text. Of course, we could go further and match the keyword broken into its parts, but to reduce complexity, let's stick with the exact match.
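To make "exact match" concrete, here is a minimal sketch of how such occurrences can be counted (`exact_match_count` is a hypothetical helper; `fixed = TRUE` prevents the keyword from being interpreted as a regular expression):

```r
# Count exact-match occurrences of a keyword in a document's text
exact_match_count <- function(text, keyword) {
  hits <- gregexpr(keyword, tolower(text), fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

exact_match_count("Player update: download the player update here", "player update")
# 2
```

Matching the keyword "picked apart" would instead tokenize both keyword and text and count partial overlaps, which is considerably more complex.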

There is no clear pattern here, and there are no correlations. Very few observations even manage a p-value below 0.05 together with a correlation coefficient above 0.1. In this keyword set, there is no evidence that WDF/IDF contributes anything, at least not for the exact match. Neither does TF/IDF. I didn't even bother looking at keyword density.
The last part of my presentation from the SEO Campixx was a short summary of my series of articles about SEO reporting with R and AWS (especially the part about action-relevant analyses and reporting).
Once again the most important points:
The emotional reactions of some colleagues are not incomprehensible; after all, some tools are paid for dearly (there were also tool vendors in my presentation, one of whom got carried away with the remark that you can tell I haven't worked in SEO for a long time). It's a bit like going to a Christian and saying that his Jesus, unfortunately, never existed. I didn't say that. I only said that I cannot reproduce the effect of common practices on the basis of my data set. But many SEOs, whom I appreciate very much, have told me that WDF/IDF, for example, works for them. In medicine it is said "he who heals is right", and at the end of the day it is the result that counts, even if it has been proven that homeopathy does not help. And perhaps the good results of these SEOs come about only because they also do many other things right, but then credit WDF/IDF.
But what interests me as a data person is reproducibility: in which cases does WDF/IDF work, and when does it not? I would like to add that I have no commercial interest in calling any approach good or bad, because I don't sell a tool (let's see, maybe I'll build one someday) and I don't earn my money with SEO. In other words: I pretty much don't give a s*** what comes out of this. The probability that I will succumb to confirmation bias because I only look for facts that support my opinion is extremely low. I'm only interested in the truth in a post-factual world. And unlike the Backlinko investigation, for example, I provide my data and code so that everyone can check it. This is complexity, and many try to avoid complexity and look for simple answers. But there are no easy answers to difficult questions, even if that would be much more attractive. My recommendation: don't believe any statistics that don't make the data and methods comprehensible. I hope that all critics will also disclose their data and their software. This is not about vanity.
The Donohue–Levitt hypothesis is a good example for me: the New York Police Department's zero-tolerance approach in the 1990s was praised for significantly reducing crime, an opinion that is still widespread today. Donohue and Levitt examined the figures and came to a different conclusion, namely that this was a spurious correlation: in reality, the legalization of abortion meant that many would-be young offenders were never born in the first place, which then became noticeable in the 1990s. Of course, this was attacked, then confirmed again, and then someone found that the removal of lead from gasoline was responsible for the reduction in juvenile delinquency (the lead-crime hypothesis). These, however, are more complex models. "More police truncheons equals less crime" is easier to understand and is therefore still defended (and maybe there is a bit of truth to it?). But here, too, those who find one model more sympathetic will mainly look at the data that confirms it.
I could have investigated a lot more, but as I said, I do this on the side, and I'm already keen on gathering more data on the topic. For now, however, other mountains of data await here. The next step would then be a larger data set and machine learning to identify patterns more precisely.
Now it’s been another month, and I still haven’t written everything down. However, this is also due to the fact that I have acquired even more data in the last few weeks so that I have a data set that I can share and that is not customer-specific.
According to the rule of thumb, 80% of the time is spent validating and cleaning data, and I would add one more task: transforming the data. Data is rarely available in a form that can be used immediately.
But one thing at a time. For this part, I wanted to add backlink data as well as crawl data. However, Google's backlink data is only available for your own domains, and if you use tools like Sistrix, the API queries cost money or credits. With Sistrix, for example, a backlink query (links.overview) costs 25 credits, so with 50,000 credits per week you can query the links for 2,000 URLs. However, I can only use the credits that were not spent on other tools by the end of the week, so I would need more than 7 weeks for the 14,099 unique hosts generated in the last part from the 5,000 searches. By then, I will have 1,000 other projects and will have forgotten what code I wrote here, so I took a sample based on 500 searches drawn at random from my 5,000. Unfortunately, the ratio of unique hosts to all URLs was not as favorable here as in the overall set: 2,597 unique hosts had to be queried.
Unfortunately, the Sistrix API had also thrown a spanner in the works here, because for over 250 URLs I got answers that my script had not properly intercepted, e.g.
{"method":[["links.overview"]], "answer":[{"total":[{"num":903634.75}], "hosts":[{"num":21491.504628108}], "domains":[{"num":16439.602383232}], "networks":[{"num":5979.5586911669}], "class_c":[{"num":9905.3625179945}]}], "credits":[{"used":25}]}

My script had expected integer values (fractions of a backlink don't exist, in my opinion) and simply wrote nothing at all to the data frame when Sistrix returned a number that was not an integer. But even if it had caught that, the number I see here has nothing to do with the number in the web interface, which also shows strange numbers from time to time (see screenshot): is that 197,520 backlinks or 19,752,000? Please don't get me wrong, Sistrix is one of my favorite tools, but such things drive me crazy, and R does not make it easy either. There was no way around looking through the data first and adding some of it manually (!!). And how difficult R can sometimes be becomes apparent when you want to add new values to a column without setting the existing values to NA. My inelegant solution to the transformation (which took me 2 hours):
`
test <- merge(sample_2018_04_02, backlinks_2, by = "host", all.x = TRUE)
test <- cbind(test, backlinks_raw = with(test, ifelse(is.na(total.y), total.x, total.y)))
`
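The same merge-and-fill can be written more compactly with `dplyr::coalesce()`, which takes the first non-NA value per row. A minimal sketch with made-up stand-ins for the two data sets (the real ones are `sample_2018_04_02` and `backlinks_2`), assuming the freshly queried value should win where present:

```r
library(dplyr)

# Hypothetical stand-ins for the two data sets in the post
sample_df   <- data.frame(host  = c("a.com", "b.com", "c.com"),
                          total = c(100, 200, 300))
backlinks_2 <- data.frame(host  = c("a.com", "c.com"),
                          total = c(150, NA))

# Left join, then prefer the freshly queried value (total.y) and fall
# back to the existing one (total.x) where the query returned nothing
test <- merge(sample_df, backlinks_2, by = "host", all.x = TRUE)
test$backlinks_raw <- coalesce(test$total.y, test$total.x)
test
```

`coalesce()` avoids the `ifelse(is.na(...))` dance and reads closer to the intent: "take the new number if there is one, otherwise keep the old one".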
We had already seen last time that we were missing data on the age of some domains; there is also the official statement from Google that the age of a domain has no influence on the ranking. However, the old domains were in the majority, so one could speculate that newer domains have less chance of getting into the index or into the top 10, even if age does not matter for the position within the top 10. We did not make this statement explicitly, though, because the missing age values in my data set could be exactly the younger domains. That remains to be found out, but no longer as part of this series; then I might as well examine the top 100.
In summary: It always looks very simple, but collecting, transforming and cleaning the data simply costs a lot of time and energy.

First of all, let’s see if we can see something just by plotting the individual variables in relation to each other. The variables we have here now are:
There are still a few variables missing, but let's start with these. As we can see, we see next to nothing. So let's look at the bare numbers again:

Here we see a little more: for example, a (sometimes very weak) correlation between http and backlinks_log, year and backlinks_log, speed and year, backlinks_raw and ip, etc. But why logarithmize the backlinks at all? The following example illustrates this:

If we look at the distribution of backlink frequencies in the histogram, we see a tall bar on the far left and not much else. No wonder: the search results include hosts like YouTube with a nine-digit number of backlinks, while most hosts have far fewer. If we apply a logarithm instead, i.e. "compress" the scale a bit, the histogram looks quite different:

We see here that many hosts are somewhere in the middle, some stand out with few links (the bar at 0), and a few hosts have a huge number of backlinks. Another interesting question is whether the number of backlinks of the individual search hits is comparable across search queries. The answer is no, as the following histogram (also logarithmized) shows:

I calculated the average number of backlinks for each keyword and plotted it, logarithmized, in the histogram (without the logarithm, it would look like the unlogarithmized one before). As we can see, there are areas where the search results come from hosts with few backlinks, most search results sit in the middle, and for very few search results we have an enormous number of backlinks. It must be said again that the average is not really a great statistic, but at least we see that there are different "regions".
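The effect of the log transform is easy to reproduce with simulated data; lognormally distributed counts are an assumption here, chosen only because they span many orders of magnitude like real backlink counts:

```r
# Hypothetical data: backlink counts spanning many orders of magnitude
set.seed(42)
backlinks <- round(rlnorm(1000, meanlog = 6, sdlog = 3))

# Raw counts: one tall bar on the left, not much else
hist(backlinks)

# Log scale "compresses" the axis and reveals the actual distribution
hist(log10(backlinks + 1))
```

The `+ 1` avoids `log10(0)` for hosts without any backlinks, which would otherwise produce `-Inf` and break the histogram.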
Now that we’ve clarified why certain data is logarithmized, let’s take a closer look at what the correlations look like, starting with age and backlinks:

If we squint a little, we see an impression of a line, it almost looks as if there is a correlation between the age of a domain and its backlinks. Once tested:
Pearson's product-moment correlation

data: dataset$year and dataset$backlinks_log
t = -24.146, df = 4286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3721183 -0.3194161
sample estimates:
cor
-0.3460401
That looks good, and it's not completely surprising: the longer a domain exists, the more time it has had to collect links. While correlation tells us nothing about the direction of causation, it is at least unlikely that a domain gets older the more backlinks it receives.
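The output above is produced by R's `cor.test()`. A minimal reproducible call with simulated data, since the original columns `dataset$year` and `dataset$backlinks_log` are not available here (the negative relationship is built in on purpose):

```r
# Simulated stand-ins: younger domains (larger year) tend to have
# fewer backlinks, mimicking the relationship found in the post
set.seed(1)
year <- sample(1995:2015, 4000, replace = TRUE)
backlinks_log <- 10 - 0.1 * (year - 1995) + rnorm(4000, sd = 2)

ct <- cor.test(year, backlinks_log)
ct$estimate   # Pearson correlation coefficient (negative here)
ct$p.value    # significance
```

`cor.test()` returns the coefficient, the confidence interval, and the p-value in one go, which is why it is preferable to a bare `cor()` when you want to judge significance.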
Let’s take a look at that again for the combination of age and speed:
Pearson’s product-moment correlation
data: dataset$year and dataset$SPEED
t = 13.129, df = 4356, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1663720 0.2234958
sample estimates:
cor
0.1950994
It is interesting that the correlation coefficient here is positive: the later the year (i.e. the younger the domain), the higher the speed value. In other words, the older a domain is, the slower it is.
Good question. Because as already discussed in the last part, this does not apply to every keyword. Let’s take a look at the correlation between backlinks and position per keyword and then throw the correlation coefficients output onto a histogram:

Clearly, we have some keywords here whose ranking hosts show at least a weak, if not moderate, correlation with the number of backlinks. This means we would have to look at the composition of the ranking for each keyword individually. And since we already know that the ranking is dynamic, we see this even more clearly here.
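A sketch of this per-keyword computation, with simulated SERP data (the column names `keyword`, `position`, and `backlinks_log` are assumptions standing in for the real data set):

```r
# Hypothetical SERP data: 10 positions for each of 200 keywords
set.seed(7)
serps <- data.frame(
  keyword       = rep(paste0("kw", 1:200), each = 10),
  position      = rep(1:10, times = 200),
  backlinks_log = rnorm(2000, mean = 5)
)

# Correlation between position and backlinks, computed per keyword
cors <- sapply(split(serps, serps$keyword),
               function(d) cor(d$position, d$backlinks_log))

hist(cors, main = "Per-keyword correlation coefficients")
```

With purely random data the coefficients scatter around zero; in the real data set, the histogram showed a tail of keywords with noticeably negative or positive coefficients.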

Unfortunately, there is no correlation between the average number of backlinks of the pages ranking for a keyword and the correlation between backlinks and position for that keyword. The screenshot on the left shows that this is a very colorful mix.
What can be the reason for this? For one thing, I only have the backlink data for the hosts, not for the respective landing pages. That would of course be even better, and the ideal would be to also see the composition of the individual backlink factors. In view of my credit poverty, however, this is not possible at the moment. And here we have a typical problem in the field of data science: we know the data is out there, but we can't get to it. Nevertheless, this approach already offers enormous advantages: I can now look at the composition of the current ranking for each keyword individually and act accordingly. In the example on the left, I see that I need a lot of backlinks for the host for "privacy", but my data set (not in the screenshot) shows that my host needs few backlinks for searches like "poems early retirement". So we need exactly this correlation matrix for each keyword, instead of one overall view as above.
In the [next part][10] we will get more data (we started with TF/IDF and WDF/IDF).
Comments (since February 2020 the comment function has been removed from my blog):
Steffen Blankenbach says

[1]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-02-um-12.09.07.png
[2]: http://tom.alby.de/wp-content/uploads/2018/04/matrix.png
[3]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-03-um-01.14.04.png
[4]: http://tom.alby.de/wp-content/uploads/2018/04/hist.png
[5]: http://tom.alby.de/wp-content/uploads/2018/04/hist_log.png
[6]: http://tom.alby.de/wp-content/uploads/2018/04/00000b.png
[7]: http://tom.alby.de/wp-content/uploads/2018/04/data-science-seo-age-backlinks.png
[8]: http://tom.alby.de/wp-content/uploads/2018/04/plot_zoom_png.png
[9]: http://tom.alby.de/wp-content/uploads/2018/04/Bildschirmfoto-2018-04-03-um-02.17.36.png
[10]: http://tom.alby.de/data-science-meets-seo-teil-5/
In their 2017 book "R for Data Science", Grolemund and Wickham state that data.table is recommended over dplyr when regularly working with larger datasets (10 to 100 GB). Having started with Wickham's sparklyr (R's interface to Spark using the dplyr dialect), I was wondering how much faster data.table actually is. This is not the most professional benchmark, given that I just compare system time before and after each step, but it gives an indication of the advantages and disadvantages of each approach.
My work involves dealing with larger files almost every day, and for this test I used a 16 GB CSV file with 88,956,866 rows and 7 columns. After reading the file, I do a few simple operations on the data and then write the result back to disk. The test is performed on an AWS EC2 m4.2xlarge instance with 32 GB of RAM and 8 vCPUs, of which we will use 4. Let's start with data.table:
`
library(data.table)
overallStart_time <- Sys.time()
start_time <- Sys.time()
userDataSetDT <- fread("/home/tom/huge-file.csv")
# Read 88956065 rows and 7 (of 7) columns from 15.969 GB file in 00:02:52
end_time <- Sys.time()
end_time - start_time
# Time difference of 6.507585 mins
`
I have no idea why fread says it only needed 2:52 minutes; there were no other CPU-hungry processes running or processes that had a huge impact on IO.
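As an aside, instead of bracketing every step with `Sys.time()`, `system.time()` is the more idiomatic way to time a single expression in R; a small sketch (not part of the benchmark itself):

```r
# system.time() returns user/system/elapsed seconds for one expression,
# avoiding the manual start_time/end_time bookkeeping used here
timing <- system.time({
  x <- sort(runif(1e6))
})
timing["elapsed"]
```

The difference between "user" and "elapsed" is also a quick way to spot discrepancies like the fread one above, where the reported CPU time and the wall-clock time disagree.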
`
start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(Timestamp)]
end_time <- Sys.time()
end_time - start_time
# Time difference of 39.44712 secs

start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(URL)]
end_time <- Sys.time()
end_time - start_time
# Time difference of 38.62926 secs

start_time <- Sys.time()
configs <- userDataSetDT[configSection == "Select Engine"]
end_time <- Sys.time()
end_time - start_time
# Time difference of 2.412425 secs

start_time <- Sys.time()
fwrite(configs, file = "configsDT.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.07708573 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
# Time difference of 8.341271 mins
`
data.table uses only one vCPU (one core) by default but consumes more virtual memory (43 GB, versus 13 GB for the R/sparklyr combination). We could use packages such as parallel, but data.table is in fact a bit more complex with respect to parallelization.
Now, let’s look at sparklyr:
`
overallStart_time <- Sys.time()
conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
conf$sparklyr.cores.local <- 4
conf[["sparklyr.shell.driver-memory"]] <- "8G"
conf[["sparklyr.shell.executor-memory"]] <- "16G"
conf$spark.yarn.executor.memoryOverhead <- "4g"
conf$spark.memory.fraction <- 0.9
conf$spark.driver.maxResultSize <- "8G"
sc <- spark_connect(master = "local", config = conf)

start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 2.432772 mins

# The filter steps below mirror the data.table steps above
start_time <- Sys.time()
userDataSet <- userDataSet %>% filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.01466608 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>% filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.001867533 secs

start_time <- Sys.time()
configs <- userDataSet %>% filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.001763344 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
# Time difference of 1.333298 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.01505065 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
# Time difference of 3.878917 mins
`
We have saved more than 50% here! Looking at the details, though, collecting the data cost us a lot of time. That said, the selects are faster in sparklyr than in data.table; we used 4 vCPUs, so there seems to be an advantage in parallelizing the computation. There is almost no difference in writing the data, also given that data.table's fwrite has been highly optimized. Edit: as one commenter notes below, you would probably not collect the whole dataset and rather let Spark write the CSV, but I have not done so in order to keep the approaches comparable.
If we used only one core for sparklyr (which doesn't make much sense, given that even every MacBook today has 4 cores), how long would it take?
`
start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 4.651707 mins

start_time <- Sys.time()
userDataSet <- userDataSet %>% filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.003172636 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>% filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.002816916 secs

start_time <- Sys.time()
configs <- userDataSet %>% filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.002915621 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
# Time difference of 4.487081 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.01447606 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
# Time difference of 8.345677 mins
`
The selects are a bit slower, albeit not noticeably so. sparklyr is much slower, though, when it comes to reading large files and collecting data with only one core. Having said that, as mentioned above, there is no reason to use only one core.
However, there is still a good reason to use data.table: as you can see in the config section of the sparklyr code, a huge chunk of memory had to be assigned to the collector and the driver, simply because the computation or the collection will throw errors if not enough memory is available. Finding out how much memory should be allocated to which component is difficult, and getting it wrong means restarting R and running the code again and again. data.table is the better choice here, since no configuration is required whatsoever. It is also amazing how fast data.table still is using only one core, compared to sparklyr using 4. On the other hand, running the same code on my MacBook Air with 8 GB RAM and 4 cores, data.table had not managed to read the file within 30 minutes, whereas sparklyr (using 3 of the 4 cores) processed everything in less than 8 minutes.
While I personally find dplyr a bit easier to learn, data.table has won me over, too.
The first two parts were about what data science is and why WDF/IDF values are very likely to have little to do with what happens under the hood at Google. In this part, we go one step further, because we look at whether there are correlations between ranking signals and the position. In the lecture, I had shown this using the example of a search query and, in view of the time available, I had dealt with it rather briefly. Here I can go into depth. However, we will first only look at each individual ranking signal in relation to the position, not the possible effect of the ranking signals on each other.
Since my presentation caused some colleagues to gasp and prompted some "interesting" statements, one point had probably been lost. I had expressly said, and I repeat it here, that I am not claiming that this data shows how the ranking actually works. Anyone who has ever dealt with lawyers or statisticians knows that they are reluctant to be pinned down to definitive statements. After all, we usually do not know the total population and therefore have to draw conclusions about it from a small sample; who would be so reckless as to let themselves be pinned down on that? Hence all the complexity with confidence levels, confidence intervals, etc.
The following statements refer to a sample of 5,000 search queries. That sounds like a lot, but we don’t know if these searches correspond to the total population of all searches. So the results are for the sample, and I’m always willing to repeat that for other searches if those searches are made available to me.
Other problems with this approach: We have access to a few ranking signals, but not all of them, and the few signals we do have are also partly inaccurate. Of the more than 200 ranking signals, we have:
So we lack signals such as
In summary, we only have a fraction of the data, and some of it is not even accurate. And my calculations are also based on search queries that we don’t know if they are representative.
My favorite example of the fatal belief in correlations is the statistical correlation between Microsoft Internet Explorer's market share and the murder rate in the U.S. between 2006 and 2011. While it may be funny to claim a connection here (and this regularly draws laughs in lectures), the fact is that a statistical relationship, which we call a correlation, does not have to be a real causal connection. Correlation does not mean cause and effect. Worse still, in statistics we don't even know in which direction the statistical relationship runs; in this example, whether Internet Explorer's market share led to more murders, or whether the murders led to Internet Explorer being used afterwards to cover the perpetrators' tracks.
Of course, the connections are clear in some situations: If I spend more money on AdWords, then I may get more conversions. And if we examine the statistical relationship between ranking signals, then it is likely that more backlinks lead to a better position, even though of course a better position can ensure that more website owners find interesting content and link to it… But we don’t know whether, for example, the individual signals can influence each other, and we usually only look at the top 10. As described in the previous part, this is a kind of Jurassic Park, where we don’t have the whole picture.
Every analysis begins with a description of the data. For the 5,000 search queries, we got more than 50,000 search results, but we take out the results that point to Google services for the time being, because we don’t know whether they are ranked according to the normal ranking factors. There are 48,837 URLs left, spread across 14,099 hosts.

Lately I’ve been working more and more with dplyr, which allows piping as under Linux/UNIX; this functionality was borrowed from the magrittr package, and it makes the code incredibly clear. In the example in the figure, after the commented-out line, I pipe my dataframe google_serps to group_by(), which groups by host; I pipe the result of this step to summarise(), which then calculates the frequencies per host; and finally I pipe it to arrange(), which sorts the result in descending order. I write the result to hostFrequency, and because I want to see the result immediately in my R notebook, I put the whole expression in parentheses so that the result is not only written to the hostFrequency dataframe but also printed immediately. Every time I do something like this with dplyr, I’m a little happy. And if you have really large data sets, then you do the same with sparklyr :).
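Written out, the pipeline described above looks roughly like this (a sketch; the dataframe and the grouping column host are taken from the text, the count column name is an assumption):

```r
library(dplyr)

# Group the SERP data by host, count results per host, sort descending;
# the outer parentheses print the result as well as assigning it.
(hostFrequency <- google_serps %>%
  group_by(host) %>%
  summarise(count = n()) %>%
  arrange(desc(count)))
```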
But back to the topic: So we see here that few hosts rank very frequently, and that means conversely that many hosts rank only once. No surprise here.
The speed data for each host is very easy to get, because Google offers the PageSpeed Insights API for this, and kindly there is also an R package for it. With more than 10,000 hosts the query takes a while, and you are not allowed to make more than 25,000 requests per day, nor more than 100 (?) requests per 100 seconds. I just let it run, and after a day my R session crashed and all the data was lost. Not pretty, but there is a workaround: write the dataframe to the hard disk after each request.
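That workaround can be sketched like this; get_pagespeed_score() is a hypothetical wrapper around whichever PageSpeed client you use, and hosts stands for the vector of hostnames:

```r
results <- data.frame(host = character(), SPEED = numeric())

for (h in hosts) {
  # get_pagespeed_score() is a hypothetical wrapper around the PageSpeed
  # Insights API; an error yields NA instead of aborting the whole run
  score <- tryCatch(get_pagespeed_score(h), error = function(e) NA)
  results <- rbind(results, data.frame(host = h, SPEED = score))
  # persist after every request so a crash does not lose a day of data
  write.csv(results, "pagespeed_progress.csv", row.names = FALSE)
  Sys.sleep(1.5)  # stay well under the per-100-seconds quota
}
```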
But let’s take a closer look at the data now. Here’s a histogram of the distribution of the speed values of 14,008 hosts (so I got a PageSpeed value for 99.4% of the hosts):

We see that most hosts make it over the 50 points, and summary gives us the following numbers:
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   55.00   70.00   66.69   82.00  100.00
```
It’s nice to see how misleading the average can be. And now let’s plot position versus speed:

As we see, we don’t see anything. Let’s check this again in more detail:
```
> cor.test(google_serps$position, google_serps$SPEED)

	Pearson's product-moment correlation

data:  google_serps$position and google_serps$SPEED
t = -5.6294, df = 48675, p-value = 1.818e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03438350 -0.01662789
sample estimates:
        cor
-0.02550771
```
It looks like there is practically no correlation between PageSpeed and position in the top 10: the coefficient of -0.026 is negligible, even if statistically significant (and we don’t know whether there might be one in the top 100; it could be that places 11 to 100 have worse PageSpeed values). But it is also not unlikely that we are dealing with a hygiene factor: if you rank well, you gain no advantage; if you rank badly, you are punished. It’s a bit like showering: if you have showered, no one notices, but if you haven’t, it’s noticeable. Or, in a nutshell (that’s how I should have put it at the SEO Campixx): a slow website is like not having showered. However, we also see hosts that rank despite a dreadful speed. But if you search for “ganter shoes”, then http://ganter-shoes.com/de-at/ is probably the best result, even if the page takes 30 seconds to load for me.
We should also keep in mind that the PageSpeed API performs a real-time measurement… maybe we just caught a bad moment? You would actually have to measure the PageSpeed several times and create an average from it.
What we also get very easily in terms of data is the distinction between whether a host uses https or not. While some see the use of secure protocols as a weighty ranking factor, more moderate voices like Sistrix see the use of SSL as a weak ranking factor. In this dataset, 70% of all URLs have an https. But does that also mean that these pages rank better?
We are dealing here with a special form of calculation, because we are trying to determine the relationship between a continuous variable (the position) and a dichotomous variable (SSL yes/no). We convert the two variants of the protocol into numbers, http becomes a 0, https becomes a 1 (see screenshot below in the secure and secure2 columns).
The determination of the point-biserial coefficient is a special case of the Pearson correlation coefficient; normally in R we would just type cor.test(x,y), but for this an additional package is loaded, which supports this special form. In fact, the values of the two tests hardly differ, and cor.test also provides me with the p-value.
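With the 0/1 coding from the screenshot, the standard Pearson test already yields the point-biserial coefficient. A sketch (the secure2 and position columns are from the text; the url column and the ltm call are assumptions):

```r
# http -> 0, https -> 1, as in the secure2 column from the screenshot
google_serps$secure2 <- ifelse(grepl("^https://", google_serps$url), 1, 0)

# Pearson correlation with a 0/1 variable = point-biserial correlation;
# cor.test() also provides the p-value
cor.test(google_serps$position, google_serps$secure2)

# the ltm package offers a dedicated function for comparison:
# ltm::biserial.cor(google_serps$position, google_serps$secure2)
```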

As we can see, we see nothing, or almost nothing: with a correlation coefficient of -0.045067, we can rule out the possibility that https had a meaningful impact on the ranking in the top 10. Does this mean that we should all do without https again? No, because it’s better for the user. And as soon as browsers indicate even more clearly that a connection is not secure, users are likely to abandon a site more quickly. Not to mention that we only looked at the top 10 here. It could be that places 11 to 1000 are mainly occupied by sites without SSL; then the results could look different again.
Perhaps SSL as a ranking factor is also a reverse hygiene factor: Google would like to see SSL, but since some important pages may not yet have it, it is not weighted heavily. Just as you might want your dream partner to have showered, but if he or she suddenly stands in front of you, it doesn’t matter because you’re so in love. In the long run, of course, this does not go well. And that’s how it will be with SSL.
Let’s move on to the next ranking signal, the age of a domain, even though it is said that the age of a domain doesn’t matter. Here we face a first challenge: how do I find out the age of a domain as automatically and, above all, as reliably as possible? Sistrix offers the age of a domain, but not that of a host (the difference is explained in the script for my SEO seminar at the HAW). Nevertheless, Sistrix has the advantage that the API is very lean and fast. However, Sistrix does not find an age for 2,912 of the domains (not hosts); with 13,226 unique domains, that is 22% without a domain age. Jurassic Park sends its regards (if you don’t understand this allusion, please read the second part about data science and SEO). Nevertheless, let’s take a look at the distribution:

We see a slightly right-skewed distribution, i.e. the higher frequencies sit on the left, with the older domains (no, I don’t have it mirror-inverted; that really is what it’s called). Younger domains seem to have fewer chances here. However, it is also possible that Sistrix simply does not have younger domains in its database and that the almost 3,000 missing domains would be more likely to sit on the right.
Can we at least see a correlation here? Let’s plot the data first:

Again, we don’t see anything, and if we let the correlation be calculated, then this is confirmed:
```
> cor.test(google_serps$position, as.numeric(google_serps$year))

	Pearson's product-moment correlation

data:  google_serps$position and as.numeric(google_serps$year)
t = 1.1235, df = 44386, p-value = 0.2612
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.003970486  0.014634746
sample estimates:
        cor
0.005332591
```
Not only is there no correlation; the result is not even statistically significant (p = 0.2612). However, it should be taken with a grain of salt because, as mentioned above, our selection may simply not be representative: probably not every domain had the same chance of getting into Sistrix’s domain-age database. So we would have to supplement this data and possibly also verify the data from Sistrix (I have not been able to find out where Sistrix gets it from); unfortunately, I have not been able to identify a pattern either, because sometimes one source shows older data, sometimes another. In principle, you would have to take all data sources and then always use the oldest date. Unfortunately, most sources are not so easy to scrape. So we have not only missing but also partly incorrect data. And that is not an atypical problem in the field of data science.
Since I’ll soon have my 2,000 words together (and I know the correlation between word count and “read to the end of the article” for my blog), I’ll look at the next ranking factors in the next blog post. Important: we found no evidence here that supposedly very important ranking factors actually have an influence. But that doesn’t mean this is really true across the board:
Example:
```
> cor.test(google_serps$position[google_serps$keyword == "acne vitamin a"],
           google_serps$secure2[google_serps$keyword == "acne vitamin a"])

	Pearson's product-moment correlation

data:  google_serps$position[google_serps$keyword == "acne vitamin a"] and google_serps$secure2[google_serps$keyword == "acne vitamin a"]
t = -4.6188, df = 8, p-value = 0.001713
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9645284 -0.4819678
sample estimates:
        cor
-0.8528029
```
Even though the confidence interval here is very large, it is still in the range of a medium to strong correlation for this query “acne vitamin a” (it correlates negatively because the positions go up from 1 to 10 or further). It is therefore also important to identify the segments or “regions” where certain signals have an effect. More on this in the next part about data science and SEO.
After I explained in the first part what data science is and what already exists in this area on the subject of SEO, here is the second part, where we take a closer look at what effect the linguistic processing of a document by a search engine has on SEO concepts such as keyword density, TF/IDF and WDF/IDF. Since I showed live code at the SEO Campixx, I offer everything for download here, which makes following the examples even more vivid. By the way, this is also possible without installing R; here you can find the complete code with explanations and results.
(Please skip if you don’t want to recreate this yourself in R)
I recommend using RStudio in addition to R, because handling the data is a bit easier for newbies (and for professionals as well). R is available at the R Project, RStudio on rstudio.com. First install R, then RStudio.
In the ZIP file there are two files, a notebook and a CSV file with a small text corpus. I refer to the notebook from time to time in this text, but the notebook can also be worked through in this way. Important: Please do not read the CSV file with the import button, because then a library will be loaded, which will nullify the functionality of another library.
The notebook has the great advantage that both my description, the program code and the result can be seen in the same document.

To execute the program code, simply click on the green arrow in the upper right corner of a code chunk, and off it goes.
(Please skip if the concepts are clear!)
There are people who can explain TF/IDF (Term Frequency/Inverse Document Frequency) or WDF/IDF (Within Document Frequency/Inverse Document Frequency) better than me; on WDF/IDF, Karl has distinguished himself with an incredibly good article (in which, by the way, he already said that “actually no small or medium-sized provider of analysis tools can offer such a calculation for a large number of users” ;-)).
A simplistic explanation is that the term frequency (TF) is the frequency of a term in a document, while the Inverse Document Frequency (IDF) measures the importance of a term in relation to all documents in a corpus in which the term occurs (corpus is the linguists’ term for a collection of documents; this is not the same as an index).
Put simply (though still simplified), the TF counts the number of occurrences of a term, which is then normalized, usually by dividing that number by the number of all words in the document (this definition can be found in the book “Modern Information Retrieval” by Baeza-Yates et al., the bible of search engine builders). But there are other weightings of the TF, and WDF is actually nothing more than such a different weighting, because here the term frequency is dampened with a logarithm to base 2.
In the IDF, the number of all documents in the corpus is divided by the number of documents in which the term appears, and the result is then provided with a logarithm to base 2. The Term Frequency or the Word Document Frequency is then multiplied by the Inverse Document Frequency, and we have TF/IDF (or, if we used WDF instead of TF, WDF/IDF). The question that arises to me as an ex-employee of several search engines, however, is what exactly is a term here, because behind the scenes the documents are eagerly tinkered with. More on this in the next section “Crash Course Stemming…”.
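The three weightings just described can be put into a few lines of R (a simplification following the definitions above; the numbers in the example call are made up):

```r
# Plain term frequency, normalized by document length
tf  <- function(n_term, n_words) n_term / n_words

# WDF: log2-dampened term frequency, normalized by the log of document length
wdf <- function(n_term, n_words) log2(n_term + 1) / log2(n_words)

# IDF: log2 of (number of documents / number of documents containing the term)
idf <- function(n_docs, n_docs_with_term) log2(n_docs / n_docs_with_term)

# TF/IDF and WDF/IDF for a term occurring 5 times in a 1,000-word document,
# in a corpus of 100 documents of which 10 contain the term
tf(5, 1000)  * idf(100, 10)
wdf(5, 1000) * idf(100, 10)
```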

Everyone has had the experience that you search for a term and then get a results page where the term can also be found in a modified form. This is because, for example, stemming is carried out, in which words are reduced to their word stem. Because not only in German is conjugated and declined, other languages also change words depending on the person, tense, etc. In order to increase the recall, not only the exact term is searched for, but also variants of this term (I did not use the term “recall” in the lecture, it describes in the Information Retrieval how many suitable documents are found for a search query). For example, for the search query “Data scientist with experience”, documents are also found that contain the terms “data scientist” and “experience” (see screenshot on the left).
In the lecture, I stemmed live, with the SnowballC stemmer. This is not necessarily the best stemmer, but it gives an impression of how stemming works. In the R notebook for this article, the stemmer is slightly expanded, because unfortunately SnowballC only ever stems the last word, so it has been wrapped in a function that can then handle a whole sentence:
```
> stem_text("This is a great post")
[1] "This is a great post"
> stem_text("Hope you've read this far.")
[1] "Hopefully you have read this far."
```
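The wrapper can look roughly like this (a sketch; the original function from the notebook may differ):

```r
library(SnowballC)

# wordStem() stems individual words; applied naively to a whole sentence
# it only touches the last token, so we split, stem each word, and re-join.
stem_text <- function(text, language = "english") {
  words <- unlist(strsplit(text, "\\s+"))
  paste(wordStem(words, language = language), collapse = " ")
}

stem_text("hoping that stemming works")
```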
But these are not just variants of a term. We Germans in particular are world champions at inventing new words by “compounding” them with other words; it is not for nothing that these neologisms are called compounds. An example of the processing of compounds can be found in the screenshot on the right, where the country triangle becomes the border triangle. At first it sounds quite simple: you just separate terms that could stand as single words and then index them. However, it is not quite that simple, because not every compound may be separated; split apart, it can take on a different meaning. Think, for example, of the German “Druckabfall” (pressure drop), which, split into its parts, could also be read as “Druck” (pressure) and “Abfall” (waste).

The example of the “data scientist with experience” also illustrates another method of search engines: Terms do not have to be together. They can be further apart, which can sometimes be extremely annoying for the searcher if one of the terms in the search query is in a completely different context. The proximity, i.e. the proximity of the terms, can be a signal for the relevance of the terms present in the document for the search query. Google offers Proximity as a feature, but it is not clear how Proximity will be used as a ranking signal. And this only means the textual proximity, not yet the semantic proximity. Of course, there is much more processing in lexical analysis, apart from stop word removal etc. But for now, this is just a small insight.
So we see three points here that most SEO tools cannot do: stemming, proximity and decompounding. When the tools talk about keyword density, TF/IDF or WDF/IDF, it is usually based on exact matches and not on the variants that a search engine processes. Most tools do not make this clear; the ever-popular Yoast SEO plugin for WordPress, for example, can only use the exact match and calculates a keyword density from it (number of occurrences of the term in relation to all words in the document). But ryte.com, for example, says:
In addition, the formula WDF*IDF alone does not take into account that search terms can also occur more frequently in a paragraph, that stemming rules could apply or that a text works more with synonyms.
So as long as SEO tools can’t do that, we can’t assume that these tools give us “real” values, and that’s exactly what dear Karl Kratz has already said. These are values that were calculated on the basis of an exact match, whereas search engines use a different basis. Maybe it doesn’t matter at all, because everyone only uses the SEO tools and optimizes for them, we’ll take a look at that in the next part. But there are other reasons why the tools don’t give us the whole view, and we’ll take a closer look at them in the next section.
Now, the definition of IDF already shows why a tool that spits out TF/IDF has a small problem: it doesn’t know how many documents containing the term are in the Google corpus, unless this number is also “scraped” from the results. And even then we are dealing with estimates; it is not for nothing that the search results always say “Approximately 92,800 results”. Instead, most tools use either the top 10, the top 100, or maybe even their own small index to calculate the IDF. In addition, we also need the number of ALL documents in the Google index (or all documents of a language area, which Karl Kratz drew my attention to again). According to Google, this is 130 trillion documents. So, very much simplified, we would have to calculate like this (I’ll use TF/IDF so the logarithm doesn’t scare everyone off, but the principle is the same):
TF/IDF = (Frequency of the term in the document/Number of words in the document)*log(130 trillion/”Approximately x results”),
where x is the number that Google displays per search result. Then we would have one number per document, but we still don’t know what TF/IDF or WDF/IDF values the documents have that are not among the top 10 or top 100 results examined. It could be that there is a document in 967th place with better values. We only see our excerpt and assume that this excerpt explains the world to us.
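Put into code, the back-of-the-envelope calculation looks like this (all inputs in the example call are illustrative, including the scraped result count):

```r
# Rough TF/IDF estimate from scraped numbers, following the formula above
tf_idf_estimate <- function(n_term, n_words, results_estimate,
                            index_size = 130e12) {
  (n_term / n_words) * log2(index_size / results_estimate)
}

# a term occurring 5 times in a 1,200-word page,
# with Google reporting "Approximately 92,800 results"
tf_idf_estimate(5, 1200, 92800)
```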
And this is where chaos theory comes into play: Anyone who has seen Jurassic Park (or even read the book) may remember the chaos theorist, played by Jeff Goldblum in the film. Chaos theory plays a major role in Jurassic Park, because it is about the fact that complex systems can exhibit behavior that is difficult to predict. And so large parts of the park are monitored by video cameras, except for 3% of the area, and this is exactly where the female dinosaurs reproduce. Because they can change their sex (which frogs, for example, can also do). Transferred to TF/IDF and WDF/IDF, this means: we don’t see 97% of the area, but less than 1% (the top 10 or top 100 of the search results) and don’t know what’s lying dormant in the rest of our search results page. Nevertheless, we try to predict something on the basis of this small part.
Does this mean that TF/IDF or WDF/IDF are nonsense? No, so far I have only shown that these values do not necessarily have anything to do with what values a search engine has internally. And this is not even new information, but already documented by Karl and some tool providers. Therefore, in the next part, we will take a closer look at whether or not we can find a correlation between TF/IDF or WDF/IDF and position on the search results page.
In the enclosed R notebook, I have chosen an example to illustrate this, which (hopefully) reminds us all of school: I built a small corpus of Goethe poems (at least here I have no copyright problems; I was not quite so sure about search results). A little more than 100 poems, one of which I still know by heart after 30 years. In this wonderful little corpus, I first normalize all words by lowercasing everything, removing numbers, removing periods, commas, etc., and removing stop words.
There is a library tm in R with which you can calculate TF/IDF as well as TF (normalized)/IDF, but, scandal (!!), no WDF/IDF. Maybe I’ll build a package for it myself. For illustrative purposes, though, I simply built all the variants myself and put them next to each other, so you can see exactly what I did in my code. Let’s take a look at the data for the unstemmed version:
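The normalization steps and the tm weighting can be sketched like this (poems stands for a character vector with one Goethe poem per element, read from the CSV in the ZIP):

```r
library(tm)

# poems: character vector with one poem per element (from the enclosed CSV)
corpus <- VCorpus(VectorSource(poems))
corpus <- tm_map(corpus, content_transformer(tolower))   # lowercase everything
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("german"))

# tm ships TF/IDF (and normalized TF/IDF) weightings, but no WDF/IDF
tdm <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(tdm[1:5, 1:5])
```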

We can see here that the “ranking” would change if we sorted by TF/IDF or TF (normalized)/IDF. So there is a difference between WDF/IDF and the “classical” methods. And now let’s take a look at the data for the stemmed version:

We see that two documents have cheated their way into the top 10, and also that we suddenly have 22 instead of 20 documents. This is logical, because two or more terms with different surface forms may now have collapsed into one stem. But we also see very quickly that all the numbers have changed, because we now have a different basis. In other words, whatever SEOs read from WDF/IDF, the values are most likely not what actually happens at Google. And again: this is not news! Karl Kratz has already said this, and some tools also say it in their glossaries. After my lecture, however, it seemed as if I had said that God does not exist.
And perhaps the approximation alone is good enough. We’ll look at that in the next parts on data science and SEO.
Phew. No. Not really. Just because you’re working with R doesn’t mean you’re doing data science. But at least we warmed up a bit by really understanding how our tool actually works.
On 1.3.2018 I gave a lecture on the topic of data science and SEO at the SEO Campixx, and since there were some discussions afterwards :-), I will describe the contents in more detail here, in several parts. This part is first of all about what data science is and what already exists on the topic.
“The sexiest job of the 21st century” is rather dull on closer inspection, because most of the time is spent acquiring and cleaning data and using it to build models. It’s coding, it’s math, it’s statistics, and with larger amounts of data it’s also a lot of knowledge about where to wire up which instances, for example on Amazon Web Services or the Google Cloud Platform. A globally agreed definition of data science does not exist to my knowledge, but I would define data science as the intersection of programming, statistics and domain knowledge. These are not new topics, but what is new is that we have much more data, much faster processors, cheap cloud processing and many development libraries. For the statistics language and development environment R used here, libraries exist for almost every purpose; somewhere there was someone who faced the same problem and built a solution for it. What is also new is that more and more companies sense that you can do something with data; after all, Spotify uses data to know what music you might also like, and Google knows when you should set off if you want to get to work on time.
Unfortunately, the data hype (which will be followed by a healthy understanding of what is possible after a plateau of disappointment) is countered by relatively few people who feel at home in all three disciplines (plus cloud computing). Which in turn leads to the fact that these data scientist unicorns are sometimes offered unreasonable sums and 1000s of courses are offered on Udemy & Co that are supposed to provide you with the necessary knowledge.
A real problem with data science, however, is that not only knowledge in several areas is necessary, but also the understanding that data is used to solve a problem. I can deal with algorithms and data all day long, for me it’s like a kind of meditation and relaxation. In fact, sometimes I feel like playing with Lego, but at the end of the day, it’s all about solving problems. Not only collecting data, but also extracting the right information from it and then taking the right action (the holy trinity of data). And here is the challenge that often enough people just say, here is data, make something out of it. Therefore, it is an art for the data scientist to understand exactly what the problem actually is and to translate this into code.
In addition, many people have bad memories of math. Accordingly, the audience’s willingness to consume slides with lots of numbers and formulas tends to be on the lower end of the scale. That’s why I also worked with smaller examples in the lecture, which everyone should be able to understand well.
What kind of topics am I working on? Very different. Classification. Clustering. Personalization. Chatbots. But also analyses of slightly larger data volumes of 25 million rows of Analytics data and more, which have to be processed in a few minutes. All kinds of things.
On the search engine side, there is already a lot. When I was still at Ask, we had already worked with Support Vector Machines, for example, to create the ranking for the queries where the pages had almost no backlinks. Even then, there was a dynamic ranking. The topic recognition of most search engines is based on machine learning. RankBrain will be based on machine learning. So it’s not a new topic for search engines.
On the other side, that of the SEOs, the topic still seems to be relatively fresh. Search Engine Land says that every search marketer can think of themselves as a data scientist. I’m not sure I would subscribe to that, because most search marketers I know don’t build their own models; as a rule, they use tools that do it for them. On SEMrush you can find a collection of ideas, though more for SEA. Remi Bacha is exciting, although I haven’t seen any data from him yet. Keyword Hero have come up with something pretty cool by using deep learning to identify the organic keywords that are no longer passed along since the switch to https. Otherwise, I haven’t seen much on the subject. So we see that we are at the very beginning.
What would we like to have?
Back to the question of what problem I actually want to solve with my work. In an ideal world, SEO naturally wants to be able to re-engineer the Google algorithm. However, this is unlikely, because of the more than 200 ranking signals, only a few are available to us. What we can do, however, is try to build models with the signals we have, and possibly create smaller tools. And that’s exactly what the next part is about
This is the final part of the series on SEO with R and AWS. In the first part, we prepared the AWS instance with RStudio, in the second part we carried out a small SEO analysis, and in the third part, we created a visibility index and an “actionable reporting”. This part is about how it is too exhausting for even the most hardened data scientist to run the individual scripts through RStudio every day. SEO monitoring should therefore run via an appealing interface.
With Shiny, the developers of RStudio have provided a framework that enables interaction with data with just a few lines of code. Fortunately, Shiny is included in the AMI that we installed in Part 1 of this series.
The code is divided into two parts, the server code and the UI code; both parts can go into a single file, app.R, which is placed in a special folder and can then be run as an app from RStudio. The server part uses the individual functions that we built in the first parts of this series (which is why you won’t find all of the code here).
```r
server <- function(input, output, session) {
  sites <- unique(results$website)

  output$moreControls <- renderUI({
    selectInput("moreControls", "Select Website", choices = sites)
  })

  output$sviPlot <- renderPlot({
    # visibility index plot from part 3 (code omitted here)
  })

  output$actions <- renderTable({
    # table of the largest ranking losses from part 3 (code omitted here)
  })
}
```
In the UI part, the interface is simply assembled:
```r
ui <- fluidPage(
  title = "Tom's SEO Tool",
  uiOutput("moreControls"),
  hr(),
  plotOutput("sviPlot"),
  tableOutput("actions")
)
```
And finally, the app is started:
```r
shinyApp(ui = ui, server = server)
```
That’s it. Plain and simple:

Not necessarily pretty for a customer, but perfectly sufficient to build a reporting system for yourself. In the upper left corner, I select the website for which I want the reporting, and after the scripts have run, the visibility index plot and the table with the largest ranking losses appear. Creating this dashboard took just 20 minutes (ok, to be fair, I had built other dashboards with Shiny before, so I had some experience).
In my current setup, I just used the code I had used so far, but that also means that it takes a while until something can be seen. It would be better if the data were automatically evaluated every day and the dashboard then simply retrieved the aggregated data. It’s on my to-do list.
Then it would be nice if you could click on an entry in the table and then analyze it further. It’s also on my to-do list.
A connection to Google Analytics would be great.
Not every keyword is interesting and should be included in the report.
All good points that I could tackle in time, and above all good material for further posts. And I’m always open to more ideas!
With a little brainpower, little code, and the help of Amazon Web Services and R, we built automated and free SEO reporting. Of course, Sistrix & Co offer even more functions. This approach is primarily about making better use of the data from the Webmaster Console, which is available free of charge. Because let’s be honest: We usually have enough data. We just don’t make much of it most of the time.
This approach has another advantage: The data from the Search Console is gone after 90 days. You can’t look any further back. Here, however, we are building up an archive and can also look at longer-term developments.