Update RStudio Server


Louis Aslett’s AMIs are useful for quickly spinning up an RStudio Server on an AWS EC2 instance. However, these AMIs do not always contain the latest versions of R and RStudio. The following two blocks of commands update RStudio Server and R, respectively:

sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.442-amd64.deb
sudo gdebi rstudio-server-1.1.442-amd64.deb

echo "deb http://cran.stat.ucla.edu/bin/linux/ubuntu `lsb_release -sc`/" | sudo tee --append /etc/apt/sources.list.d/cran.list<br /> sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9<br /> sudo apt update<br /> apt list --upgradable

SEO monitoring with R, AWS, and Shiny


This is the final part of the series on SEO with R and AWS. In the first part, we prepared the AWS instance with RStudio, in the second part we carried out a small SEO analysis, and in the third part we created a visibility index and actionable reporting. Running the individual scripts in RStudio every day, however, is too tedious even for the most hardened data scientist, so this part is about giving the SEO monitoring an appealing interface.

GUI development with Shiny

With Shiny, the developers of RStudio have provided a framework that enables interaction with data with just a few lines of code. Fortunately, Shiny is included in the AMI that we installed in Part 1 of this series.

The code is divided into two parts, the server code and the UI code; both can go into a single file, app.R, which is placed in a dedicated folder and run as an app from RStudio. The server part uses the individual functions we wrote in the earlier parts of this series (which is why not all of the code is shown here).

`
server <- function(input, output, session) {

  # get data from the db and create the dropdown items
  sites <- unique(results$website)
  output$moreControls <- renderUI({
    selectInput("moreControls", "Select Website", choices = sites)
  })

  output$sviPlot <- renderPlot({
    # visibility index code
  })

  output$actions <- renderTable({
    # actionable data code
  })
}
`

In the UI part, the interface is simply assembled:

ui <- fluidPage(
  title = "Tom's SEO Tool",
  uiOutput("moreControls"),
  hr(),
  plotOutput("sviPlot"),
  tableOutput("actions")
)

And finally, the app is started:

`
# run the application
shinyApp(ui = ui, server = server)
`

That’s it. Plain and simple:

Not necessarily pretty enough for a customer, but perfectly sufficient as a reporting system for yourself. In the upper left corner, I select the website I want the reporting for, and once the scripts have run, the visibility index plot and the table with the largest ranking losses appear. Creating this dashboard took just 20 minutes (ok, to be fair, I had built other dashboards with Shiny before, so I had some experience).

Next steps

In my current setup, I just used the code I had used so far, but that also means that it takes a while until something can be seen. It would be better if the data were automatically evaluated every day and the dashboard then simply retrieved the aggregated data. It’s on my to-do list.

Then it would be nice if you could click on an entry in the table and then analyze it further. It’s also on my to-do list.

A connection to Google Analytics would be great.

Not every keyword is interesting and should be included in the report.

All good points that I can tackle over time, and above all good material for further posts. And I’m always open to more ideas.

Result

With a little brainpower, little code, and the help of Amazon Web Services and R, we built automated and free SEO reporting. Of course, Sistrix & Co offer even more functions. This approach is primarily about making better use of the data from the Webmaster Console, which is available free of charge. Because let’s be honest: We usually have enough data. We just don’t make much of it most of the time.

This approach has another advantage: The data from the Search Console is gone after 90 days. You can’t look any further back. Here, however, we are building up an archive and can also look at longer-term developments.

A custom visibility index with R and AWS


The third episode on search engine optimization with R and AWS is about creating your own visibility index to get an aggregated overview of the ranking of many keywords. In the first part, we looked at how to automatically pull data from the Webmaster Console with R and an AWS Free Tier EC2 instance, and the second part was about initial analyses based on click-through rates on positions.

What does a visibility index do?

To do this, let’s first look at what a visibility index is supposed to accomplish. The Sistrix visibility index is independent of seasonal effects, which is quite charming, because even in summer you can tell whether winterreifen.de (winter tires) stands any chance of ranking well. This can be achieved, for example, by using the average number of searches from the AdWords Keyword Planner. It’s just a shame that this tool only spits out reasonably useful values if you spend enough budget in Google AdWords, so that approach falls flat for us, because we want to keep our monitoring as cheap as possible, ideally free of charge.

Sistrix has the disadvantage that a) despite its low price it is still too expensive for the students who attend my SEO courses and b) my little niche topics are not always present in the Sistrix keyword database. Sistrix and Co. are especially useful when you want to compare a site with other sites (ideally from the same industry with the same portfolio). An abstract number like a visibility index of 2 is otherwise fairly meaningless. It only becomes meaningful if I can relate it to other websites and/or use it to track the evolution of my own website’s rankings over time. Even then the number itself is hard to interpret, because what unit is it measured in? If I lose 2 kilos, the unit is clear. But losing 0.002 visibility index points? How many visitors does that correspond to?

We want to build an index that allows us to see whether our ranking changes over time across a large number of keywords. How our market competitors are developing can only be seen by scraping Google, and that is not allowed.

Visibility index based on the Webmaster Console

Obviously, it is better to rank 3rd for a search term that is searched 100 times a day than for one that is searched only twice a day, so the number of searches should play a role in our visibility index. Since we have ruled out the AdWords Keyword Planner for the reasons mentioned above, the only source left is the impressions from the Webmaster Console. Once we are on the first page and far enough up (I’m not sure whether an impression is counted if you rank 10th but were never in the viewable area), we should be able to use the number of impressions from the Webmaster Console, even on a daily basis!

A small sanity check for the keyword “Scalable Capital Experiences” (AdWords estimate / actual data from the Webmaster Console):

  • 2400 / 1795 (September, but only half a month)
  • 5400 / 5438 (October)
  • 1000 / 1789 (November)

For September and October it looks good; only November is a bit odd, with almost 80% more impressions than there were supposedly searches. Something must have happened in September/October that suddenly caused so many searches for Scalable Capital; in fact, this was also visible in the traffic on my site. We can’t fully clarify this discrepancy and accept that the AdWords numbers aren’t perfect either.

The following figures illustrate how different the visibility indices are depending on the weighting:

  • In the simplest model, only 11 minus position is calculated; everything beyond position 10 gets 1 point. The values of all results are then summed per day. This model has the disadvantage that I can climb very quickly even if I only rank 1st for terms that are searched for once a month.
  • In the second model, the same procedure is used, except that the value is additionally multiplied by the impressions.
  • In the third model, the average CTR per SERP position from the second part of this series is multiplied by the impressions (all three variants are sketched in code below).
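
To make the three weightings concrete, here is a minimal sketch, not the original code of the series. It assumes that results is the daily Webmaster Console data frame from part 1 (columns Date, query, page, position, impressions) and that ctr_model is a small lookup table with one expected CTR per rounded position (columns roundedPosition, expected_ctr), as derived in part 2:

`
library(dplyr)

# Assumptions: 'results' holds the daily Webmaster Console data,
# 'ctr_model' maps a rounded position to an expected CTR.
svi <- results %>%
  mutate(roundedPosition = round(position),
         points = ifelse(roundedPosition > 10, 1, 11 - roundedPosition)) %>%
  left_join(ctr_model, by = "roundedPosition") %>%
  group_by(Date) %>%
  summarise(
    svi_simple   = sum(points),                                   # model 1: position points only
    svi_weighted = sum(points * impressions),                     # model 2: points x impressions
    svi_ctr      = sum(expected_ctr * impressions, na.rm = TRUE)  # model 3: expected CTR x impressions
  )
`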

If you now look at the actual traffic, you can see that the third model already tracks it very closely. The spikes in the real traffic aren’t quite as pronounced as in the index, and in the end I don’t get quite as much traffic, but that may be because the actual CTR is below the expected CTR.

Alternative approach with the Webmaster Console

If you look at the plots, however, it becomes clear that this approach with daily impressions makes little sense. If the curve goes down, it doesn’t necessarily mean that I need to act; maybe people are simply searching less for this topic and my rankings haven’t changed at all (usually you only write about what works, but I also find the failures interesting, because you can learn a lot from them :-)). This is exactly why Sistrix presumably factors out seasonal fluctuations.

Alternatively, you could simply average all impression data of a keyword/landing page pair and use that average in the calculation, again weighted with the expected CTR per position. The nice thing about this approach is that seasonal or temporary fluctuations balance each other out. Plotted, it looks like this:
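
A hedged sketch of this variant, building on the assumed results and ctr_model objects from the previous sketch:

`
library(dplyr)

# Average impressions per keyword/landing-page pair as a season-independent weight,
# multiplied by the expected CTR of the daily (rounded) position.
avg_impressions <- results %>%
  group_by(query, page) %>%
  summarise(avg_imp = mean(impressions), .groups = "drop")

svi_stable <- results %>%
  mutate(roundedPosition = round(position)) %>%
  left_join(ctr_model, by = "roundedPosition") %>%
  left_join(avg_impressions, by = c("query", "page")) %>%
  group_by(Date) %>%
  summarise(svi = sum(expected_ctr * avg_imp, na.rm = TRUE))
`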

This plot looks very similar to the first plot, but that doesn’t mean that it always has to be that way. But when I look at the Sistrix values (even if I’m on a very low level), it looks very similar.

From data to action relevance

Now we have a plot that shows us the weighted development of our rankings, but what do we do with it? It’s not really actionable yet. It only gets interesting when we also look at which rankings change the most and thus influence our visibility index. To do this, we first take the minimum (i.e. best, since lower is better) ranking for each keyword/landing page pair and then the current ranking. Finally, we calculate the delta and sort by it:
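
A minimal sketch of this calculation, again assuming the results data frame described above:

`
library(dplyr)

# Best historical position vs. the most recent position per keyword/landing-page pair,
# sorted by the delta (how many positions were lost).
action_list <- results %>%
  group_by(query, page) %>%
  summarise(best_pos    = min(position),
            current_pos = position[which.max(Date)],
            .groups = "drop") %>%
  mutate(delta = current_pos - best_pos) %>%
  arrange(desc(delta))
`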

The higher the delta, the greater the loss of ranking positions, and the greater the need for action, provided the keyword is actually interesting. In my example, I certainly wouldn’t mind ranking for “seo monitoring”; after all, the articles from this series are relevant to it. You could now weight by the impressions or by the visibility index we chose earlier:
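
For example, continuing the sketches above (action_list and avg_impressions are the hypothetical objects defined earlier):

`
library(dplyr)

# Weight the delta by the average impressions so that losses on
# high-volume keywords rise to the top of the list.
action_list_weighted <- action_list %>%
  left_join(avg_impressions, by = c("query", "page")) %>%
  mutate(weighted_delta = delta * avg_imp) %>%
  arrange(desc(weighted_delta))
`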

This looks more promising: sorted by “actionability”, there are indeed some search queries at the top that I find quite interesting. And now you could combine this with the data from the second part and build a dashboard… more on that in the fourth part.

Actionable SEO Reporting with R and AWS


The first part was about creating automated SEO monitoring with R and an (initially free) AWS instance. The Webmaster Console is queried via API every day, and the data is written to a database. Of course, collecting data alone doesn’t help; we should do something with it, and my mantra, which every student hears several times a day in my courses, is “Data – Information – Action”. In most treatises, knowledge takes the place of action, because processing information is what creates knowledge in the first place. In data analysis or data science, however, it is more often a matter of not only knowing something, but doing something with that knowledge, ideally knowing what to do. The reason most reports are not read is usually that no action can be derived from them. 5 more likes this week. So what? What do I do differently tomorrow? So we don’t just want to build SEO reporting, we want to create SEO reporting that tells us where to act and what to do. “Actionable” is the term for this, and there is no really nice German translation. “Action-relevant”? It’s somehow not the same. So let’s live with the English term for now.

The value of the Webmaster Console API

Let’s first take a look at the data we get from the Webmaster Console. There is a special feature here that we don’t get in the web interface: there we see either the search queries or the pages that users landed on after clicking a search result, but not both at the same time. Via the API we can form pairs of keyword and landing page, which is real added value. A landing page can have several keywords, and vice versa, if several pages of one host rank for a keyword. We use these pairs as a unique identifier for our evaluation, for example by concatenating both and computing an MD5 hash (the close observer will notice that the same combination seems to occur twice on one day, but that is only because one version of the URL ends in /amp/, which is not visible in the table).

If we haven’t already done this when writing to the table, we’ll do it now:

library(digest)
i <- 1
newResults <- results
newResults["Hash"] <- NA
for (i in i:nrow(newResults)) {
  newResults$Hash[i] <- digest(paste(newResults$query[i], newResults$page[i], sep = ""))
  i <- i + 1
}

Of course, this could be done more elegantly with apply, but we are currently in hacker mode, not in nice-programming mode; a sketch of the apply variant follows below.
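
For reference, the apply variant might look like this (a sketch, assuming the same results data frame with query and page columns):

`
library(digest)

# Hash every query/page pair without an explicit loop.
newResults <- results
newResults$Hash <- apply(results[, c("query", "page")], 1,
                         function(row) digest(paste0(row["query"], row["page"])))
`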

Another special feature is that we probably wouldn’t bother looking at the data for each individual day in the interface. Doing so is extremely helpful, though, because the default setting in the Webmaster Console is the last 28 days, for which an average is calculated for each value. Those who know me better will now stare sheepishly at the floor, because I say the same thing here again and again: the average, specifically the arithmetic mean, is the enemy of statistics, because this type of average hides the swings. More on that another time, when I put the script for the data analysis course online. The point here is that in the interface’s arithmetic mean I can see that a keyword held a certain position over the last 28 days, but the per-day range is much more interesting, because with daily data I can map trends far more accurately. Last but not least, the Webmaster Console also gives us impressions and keywords that cannot be found in the databases of the common tools. Sistrix, as much as I’ve grown fond of it, doesn’t find me for “Cookidoo” or “Scalable Capital”. Sure, I could feed Sistrix my Webmaster Console data, but unfortunately I’m not allowed to do that for every project.
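
As a tiny illustration of that point (column names and the keyword are assumptions, not the original code), the daily positions of a single keyword reveal a range that the 28-day arithmetic mean hides:

`
# Daily positions of one keyword vs. the single averaged number the interface shows.
daily <- results[results$query == "scalable capital", c("Date", "position")]
summary(daily$position)  # min/max and quartiles reveal the swings
mean(daily$position)     # the one number a 28-day average would report
`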

Since we query the data every day, we can now walk through the table and pull all values for a given identifier, so that we get every observation for a keyword/landing page combination and can plot it: the time axis on the x-axis, the position on the y-axis. Two little R tricks can be seen here. R normally plots the y-axis from the lowest value upwards, so we reverse it, and we limit it to exactly the range that interests us, namely not all positions from 1 to 100 but only the range in which we actually ranked. The plot requires only a few lines of code:

maxValue <- max(currentQuery$position)
minValue <- min(currentQuery$position)
x <- minValue:maxValue                  # only the position range we actually ranked in
query <- currentQuery$query[1]
plot(currentQuery$Date, currentQuery$position,
     ylim = rev(range(x)))              # reverse the y-axis so position 1 is at the top
title(main = query)

We see the development of the ranking for a keyword on our plot, but it is not really “actionable” yet. Let’s take a look at the CTR next to it:
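
A matching sketch for the CTR plot (column names assumed to follow the position plot above):

`
# CTR of the same keyword/landing-page pair over time, to compare against the position plot.
plot(currentQuery$Date, currentQuery$ctr, type = "l",
     xlab = "Date", ylab = "CTR")
title(main = paste(query, "- CTR"))
`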

The better the position, the higher the click-through rate. That’s obvious. But in these two plots you can see that first the click-through rate went down and then the position. Not that click-through rate is the only ranking factor, but a poor click-through rate on a result points to suboptimal perceived relevance, and no search engine wants its results to be perceived as less relevant. So a look at the title and description would be a good recommendation for action. But how do we actually know what a good CTR is for a given position? We can, for example, plot the CTR against the position for our own rankings:

And we could compare this with the figures from Advanced Web Ranking, which creates such plots from the Webmaster Console data of its many customers. Every country and industry is different, and the CTR depends on the SERP layout, i.e. whether other elements influence the CTR. But even from our own plot, you can see that certain CTRs are suboptimal for certain positions. So “all” that is needed is a report that determines which keyword/landing page combinations are below average.

If we take a closer look at the plot of CTRs per position, we see a few unusual things. On the one hand, there are a few results where I always have a 100% CTR, no matter what the position is. On the other hand, there is a lot of noise between positions 1, 2, 3 and so on. The latter is easy to explain: as described above, the API gives us average positions with decimal places, so we only have to round to get actual positions. The 100% CTR mainly affects results with few impressions. If, for example, all results with fewer than 10 impressions per day are filtered out, the picture looks different:
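
The two clean-up steps could look like this (a sketch; the column names are assumed to match the searchConsoleR output):

`
# Round the average positions and drop low-impression rows.
cleaned <- results
cleaned$roundedPosition <- round(cleaned$position)
cleaned <- cleaned[cleaned$impressions >= 10, ]
`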

And lo and behold, I don’t have that many number 1 placements with more than a homeopathic dose of impressions. But if I squint a little, I can see a line. And indeed, if we calculate the summary statistics per position (here with summary), we see a non-linearly descending curve in the mean:

Position 1:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2143  0.3971  0.4929  0.4828  0.5786  0.7059
Position 2:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.2667  0.4118  0.3744  0.5000  0.7692
Position 3:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1176  0.1818  0.2217  0.3205  0.5769
Position 4:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.08333  0.18182  0.17266  0.26667  0.45454
Position 5:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1240  0.1579  0.1584  0.2053  0.3200
Position 6:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.06977  0.11765  0.12223  0.16667  0.30769
Position 7:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.05043  0.09091  0.09246  0.13229  0.22222
Position 8:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.00000  0.03880  0.04594  0.08052  0.19048
Position 9:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.00000  0.00000  0.01412  0.01205  0.16514
Position 10:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 0.000000  0.000000  0.000000  0.010284  0.004045  0.093023

In the next step, we “only” have to evaluate which results lie below a certain CTR limit. Fortunately, statistics gives us a measure of the variation around a mean: the standard deviation. It won’t be explained here (but it is in the script for the data analysis course). On the way to the standard deviation, let’s first look at the distribution of the CTRs; as you can see, it resembles a normal distribution. Now we calculate the standard deviation per position with a clear conscience:
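
One way to do this, as a sketch using the cleaned data frame assumed above:

`
# Standard deviation of the CTR for each rounded position.
tapply(cleaned$ctr, cleaned$roundedPosition, sd)
`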

[1] 0.139202
[2] 0.1691641
[3] 0.1405702
[4] 0.1116699
[5] 0.07492808
[6] 0.07420478
[7] 0.05702427
[8] 0.05028635
[9] 0.03001044
[10] 0.02134183

So for position 1, we can calculate mean minus standard deviation; any result with a click-through rate below that threshold deserves attention, here

0.4828 - 0.139202 = 0.3436019

All results with a rounded position of 1 and a CTR below roughly 0.34 therefore deserve attention. In my case, these are the search queries for scalable capital experience and cookidoo: on some days their CTR is above average, but on other days it drops below this threshold. This is then repeated for each position.
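
A minimal sketch of that report, carrying over the assumed cleaned data frame and column names from the sketches above:

`
library(dplyr)

# Flag keyword/landing-page results whose CTR lies more than one standard
# deviation below the mean CTR of their rounded position.
ctr_stats <- cleaned %>%
  group_by(roundedPosition) %>%
  summarise(mean_ctr = mean(ctr), sd_ctr = sd(ctr), .groups = "drop")

needs_attention <- cleaned %>%
  left_join(ctr_stats, by = "roundedPosition") %>%
  filter(ctr < mean_ctr - sd_ctr) %>%
  arrange(roundedPosition, ctr)
`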

Too much data?

We now have a problem, though: my blog is still a relatively small site. What do we do with a site that has a lot of keywords and a lot of rankings? We’ll get to that next time, by creating our own visibility index.

Free and automated SEO monitoring with R and AWS


Slowly, R is making its way into the SEO world, and while R may be a bit confusing at first (functional rather than procedural programming), you can build cool stuff with just a few lines of code. As an example, we’ll build free SEO monitoring here. Of course it can’t keep up with Sistrix and Co., but if you only want to track your own rankings, it is a great and, above all, free solution.

Let’s start with the infrastructure, we only need three components:

  • One (free) EC2 instance
  • A Google Webmaster Account
  • A Google API service account

Amazon offers EC2 instances in the “free tier”; after 12 months a fee is due, but it is in the homeopathic range. The t2.micro instance is relatively underpowered with its one vCPU, 1 GB RAM and 30 GB SSD, but for our purposes it is perfectly sufficient. R is of course not installed out of the box, BUT Louis Aslett offers free AMIs (Amazon Machine Images) with RStudio Server already pre-configured. Matt explains very well how to use these AMIs to set up your own RStudio instance on AWS. All this takes at most 15 minutes, and you have your own free RStudio Server machine in the AWS cloud. Large calculations are not possible with it, but once you get a taste for it, you quickly find yourself spinning up an instance with plenty of memory for bigger computing tasks. One click, a few euros a day, and you have a whole cluster with 16 processors and 128 GB of RAM to yourself. But I digress.

In the next step, we use Mark Edmondson’s R package searchConsoleR, the cornerstone of our SEO monitoring. In Mark’s example, he simply writes the data to disk, but we prefer to write it to a database (how to install MySQL on our newly acquired EC2 instance is described here; note that you only have the user “ubuntu”, i.e. you have to do everything with sudo; alternatively, you can book an RDS instance for a fee). In order to access the Webmaster Console data from a server, a service account is required. This is not quite so easy to set up, but that would go beyond the scope of this article. It is important that the email address of the service account is added as a full user in the Webmaster Console. And here is the code:

library(jsonlite)
library(googleAuthR)
library(searchConsoleR)
library(RMySQL)

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/webmasters")
gar_auth_service(
  json_file = "/home/rstudio/XXXXXXX.json",
  scope = "https://www.googleapis.com/auth/webmasters"
)

We’ll get the data from 5 days ago, then it’s definitely available in the Webmaster Console:

delay <- 5
start <- Sys.Date() - delay
end <- Sys.Date() - delay

Here is the query:

website <- "XXXXX"
download_dimensions <- c('query', 'page')
type <- c('web')
sc_data <- search_analytics(siteURL = website,
                            startDate = start,
                            endDate = end,
                            dimensions = download_dimensions,
                            searchType = type)

We add the date and the website (in case we query several websites):

sc_data[, 7] <- website
sc_data[, 8] <- start
colnames(sc_data)[7] <- "Website"
colnames(sc_data)[8] <- "Date"

Now we write the dataframe to the database (we already added the DB and the table earlier):

mydb = dbConnect(MySQL(), user = 'XXXXX', password = 'XXXXX', host = 'XXXXX')
dbSendQuery(mydb, "USE XXXXX")
dbWriteTable(mydb, value = sc_data, name = "rankings", append = TRUE, row.names = FALSE)
dbDisconnect(mydb)

Now all we need is a cron job that performs this query every day. To do this, I first use a small shell script, which then calls the R script:

#!/bin/bash
cd /home/rstudio/SearchConsole/
Rscript /home/rstudio/SearchConsole/rankings.R

Then the cron job is set up; mine runs every day at 10:30, which is 11:30 for me, because the instance is in a different time zone:

sudo crontab -e

30 10 * * * /home/rstudio/SearchConsole/rankings.sh 2>&1 | /usr/bin/logger -t rstudio

Important: add a newline after the last line, otherwise cron will complain. And that’s it, the automated and free SEO monitoring is ready! In one of the next posts, I will show how the data can then be evaluated.