The first part was about building automated SEO monitoring with R and an initially free AWS instance: the Webmaster Console is queried every day via its API, and the data is written to a database. Of course, collecting data alone doesn’t help; we should do something with it. My mantra, which every student hears several times a day in my courses, is “Data – Information – Action”. In most treatments, knowledge takes the place of action, because processing information is what creates knowledge in the first place. In data analysis or data science, however, it is usually not just about knowing something, but about doing something with that knowledge, ideally knowing exactly what to do. The reason most reports are never read is that no action can be derived from them. Five more likes this week. So what? What do I do differently tomorrow? So we don’t just want to build SEO reporting, we want SEO reporting that tells us where to act and what to do. “Actionable” is the buzzword in modern German marketing speak, and there is no really nice German translation for it. “Action-relevant”? It’s not quite the same. So let’s live with the English term for the time being.
The value of the Webmaster Console API
Let’s first take a look at the data we got from the Webmaster Console. There is already a special feature here that we don’t get in the interface: there we see either the search queries or the pages that users landed on after clicking a search result, but not both at the same time. Via the API we can form pairs of keyword and landing page, which is real added value. A landing page can have several keywords, and, conversely, a keyword can have several landing pages if several pages from one host rank for it. We use these pairs as a unique identifier for our evaluation, for example by concatenating both and turning the result into an MD5 hash (the close observer will notice that the same combination seems to occur twice in one day, but that is only because one version of the URL has an /amp/ appended to it, which cannot be seen in the table).
If we haven’t already done this when writing to the table, we’ll do it now:
library(digest)<br /> i <- 1<br /> newResults <- results<br /> newResults["Hash"] <- NA<br /> for (i in i:nrow(newResults)) {<br /> newResults$Hash[i] <- digest(paste(newResults$query[i],newResults$page[i],sep=""))<br /> i <- i+1<br /> }
Of course, this can be done more elegantly with apply, but we are currently in hacker mode, not in nice-programming mode.
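For the record, a minimal sketch of the apply-style variant, assuming the same results data frame with query and page columns as above:

library(digest)

# Apply-style variant: paste the keyword/landing page pair per row and hash it (MD5 by default)
newResults <- results
newResults$Hash <- mapply(function(q, p) digest(paste0(q, p)),
                          newResults$query, newResults$page)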
Another special feature is that we probably wouldn’t bother to look at the data for each individual day in the interface. Doing so is extremely helpful, though, because the default setting in the Webmaster Console is the last 28 days, and every value is averaged over that period. Those who know me better will now look sheepishly at the floor, because I keep repeating the same thing: the average, specifically the arithmetic mean, is the enemy of statistics, because this kind of average hides the swings. More about that another time, when I put the script for the data analysis course online. The point here is that the arithmetic mean in the interface only tells me that a keyword held a certain position over the last 28 days; the range per day is much more interesting, because if I pull the data per day, I can map trends much more accurately. Last but not least, the Webmaster Console also gives us impressions and keywords that cannot be found in the databases of the common tools. Sistrix, as fond of it as I have grown, does not find me for “Cookidoo” or “Scalable Capital”. Sure, I could feed Sistrix my Webmaster Console data, but unfortunately I’m not allowed to do that for every project.
Since we query the data every day, we can now walk through the table and pull all rows for the identifier we created, so that we get every value for a keyword/landing page combination and can plot it. On the x-axis we have time, on the y-axis the position. Two little R tricks can be seen here: R usually plots the y-axis from the lowest value up to the highest, so we reverse it, and we restrict the axis to exactly the range that interests us, not all positions from 1 to 100 but only the range in which we actually ranked. The plot requires only a few lines of code:
maxValue <- max(currentQuery$position)<br /> minValue <- min(currentQuery$position)<br /> x <- minValue:maxValue<br /> query <- currentQuery$query[1]<br /> plot(df$Date,df$Position,ylim = rev(range(x)))<br /> title(main=query)
We see the development of the ranking for a keyword on our plot, but it is not really “actionable” yet. Let’s take a look at the CTR next to it:
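A minimal sketch for this second plot, assuming the same currentQuery data frame also carries the ctr column from the API:

# CTR over time for the same keyword/landing page combination
plot(currentQuery$date, currentQuery$ctr, ylab = "CTR")
title(main = query)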
The better the position, the higher the click-through rate. That’s obvious. But these two plots show that the click-through rate went down first, and then the position followed. Click-through rate is not the only ranking factor, but a poor click-through rate on a result points to suboptimal perceived relevance, and no search engine wants its results to be perceived as less relevant. So a look at the title and description would be a good recommendation for action. But how do we actually know what a good CTR is for a given position? For a start, we can plot it from our own rankings:
And we can compare this with the figures from Advanced Web Ranking, which creates such plots from the Webmaster Console data of its large customer base. Every country and industry is different, and the CTR also depends on the layout of the SERP, that is, whether other elements are present that influence the CTR. But even from our own plot you can see that certain CTRs are suboptimal for certain positions. So “only” a report would have to be created to determine which keyword/landing page combinations are below average.
If we take a closer look at the plot of CTR per position, we see a few unusual things. On the one hand, there are a few results where I always have 100% CTR, no matter what position. On the other hand, there is a lot of noise between positions 1, 2, 3 and so on. The latter is easy to explain: as described above, the API gives us average positions with decimal places, so we just need to round to get actual positions. The 100% CTR mainly affects results with very few impressions. If, for example, we filter out all results that had fewer than 10 impressions per day, the picture looks different:
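Both cleanups are quick to do; a minimal sketch, assuming the daily data sits in a data frame called results with the position, impressions and ctr columns from the API:

# round the average positions to whole positions ...
results$roundedPosition <- round(results$position)
# ... and drop everything with fewer than 10 impressions per day
filteredResults <- results[results$impressions >= 10, ]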
And lo and behold, I don’t have that many number 1 placements with more than a homeopathic dose of impressions. But if I squint a little, I can see a line. And indeed, if we calculate the summary statistics per position (here with summary), a non-linearly descending curve emerges in the mean:

Position   Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
1          0.2143    0.3971    0.4929    0.4828    0.5786    0.7059
2          0.0000    0.2667    0.4118    0.3744    0.5000    0.7692
3          0.0000    0.1176    0.1818    0.2217    0.3205    0.5769
4          0.00000   0.08333   0.18182   0.17266   0.26667   0.45454
5          0.0000    0.1240    0.1579    0.1584    0.2053    0.3200
6          0.00000   0.06977   0.11765   0.12223   0.16667   0.30769
7          0.00000   0.05043   0.09091   0.09246   0.13229   0.22222
8          0.00000   0.00000   0.03880   0.04594   0.08052   0.19048
9          0.00000   0.00000   0.00000   0.01412   0.01205   0.16514
10         0.000000  0.000000  0.000000  0.010284  0.004045  0.093023
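These per-position summaries can be reproduced with tapply, for example; again just a sketch based on the filteredResults data frame from the previous step:

# six-number summary of the CTR for each rounded position
ctrSummaries <- tapply(filteredResults$ctr, filteredResults$roundedPosition, summary)
ctrSummaries[["1"]]   # summary for position 1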
In the next step, we “only” have to evaluate which results fall below a certain CTR limit. Fortunately, statistics gives us a measure of the variation around a mean: the standard deviation. I won’t explain it here (that is in the script for the data analysis course), but on the way to the standard deviation, let’s first look at the distribution of the CTRs. It looks quite like a normal distribution, so with a clear conscience we now calculate the standard deviation per position:
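A minimal sketch of both steps, the histogram of the CTR distribution and the standard deviation per position, again based on filteredResults; printing ctrSd gives the values below:

# distribution of the CTRs (roughly bell-shaped in my data)
hist(filteredResults$ctr, main = "Distribution of CTR", xlab = "CTR")

# standard deviation of the CTR per rounded position
ctrSd <- tapply(filteredResults$ctr, filteredResults$roundedPosition, sd)
ctrSd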
<br /> [1] 0.139202<br /> [2] 0.1691641<br /> [3] 0.1405702<br /> [4] 0.1116699<br /> [5] 0.07492808<br /> [6] 0.07420478<br /> [7] 0.05702427<br /> [8] 0.05028635<br /> [9] 0.03001044<br /> [10] 0.02134183
So, for position 1, we can calculate mean minus standard deviation to get the threshold below which a result’s click-through rate deserves attention, here
0.4828 - 0.139202 = 0.3436019
All results with a rounded position of 1 and a CTR below 0.34… deserve attention. In my case, these are the search queries for scalable capital experience and cookidoo: on some days their CTR is above this threshold, but on other days it falls below it. This is then repeated for each position.
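Put together as a small report, this could look roughly like the following sketch, which reuses the per-position means and standard deviations from above and the query and page columns of filteredResults:

# threshold per position: mean CTR minus one standard deviation
ctrMean <- tapply(filteredResults$ctr, filteredResults$roundedPosition, mean)
ctrThreshold <- ctrMean - ctrSd

# flag every observation whose CTR falls below the threshold for its position
posKey <- as.character(filteredResults$roundedPosition)
needsAttention <- filteredResults[filteredResults$ctr < ctrThreshold[posKey], ]

# the keyword/landing page combinations that deserve a look at title and description
unique(needsAttention[, c("query", "page")])

From here it is a small step to write these rows back into the database or send them out as a daily report.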
Too much data?
We now have a problem, though: my blog is still a relatively small site. What do we do with a site that has many keywords and many rankings? We’ll get to that next time, when we build our own visibility index.