A custom visibility index with R and AWS


The third episode on search engine optimization with R and AWS is about creating your own visibility index to get an aggregated overview of the rankings of many keywords. In the first part, we looked at how to automatically pull data from the Webmaster Console with R and an AWS Free Tier EC2 instance, and the second part was about initial analyses based on click-through rates per position.

What does a visibility index do?

To do this, let's first look at what a visibility index is supposed to do. With Sistrix, the visibility index is independent of seasonal effects, which is quite charming, because even in summer you know whether winterreifen.de (winter tires) stands a chance in the rankings. This can be achieved, for example, by using the average number of searches from the AdWords Keyword Planner. It's just a shame that this tool only spits out reasonably useful values if you spend enough budget in Google AdWords. This approach therefore falls flat for us, because we want to keep our monitoring as cheap as possible, at best free of charge.

Sistrix has the disadvantage that a) despite its low price it is still too expensive for the students who attend my SEO courses and b) my little niche topics are not always in the Sistrix keyword database. Sistrix and Co. are especially exciting if you want to compare a site with other sites (ideally from the same industry with the same portfolio). An abstract number like a visibility index of 2 is otherwise pretty meaningless. This number only makes sense if I can relate it to other websites and/or track the evolution of my own website's rankings over time. Even then, the number itself is hard to interpret, because what unit is it measured in? If I lose 2 kilos, the unit is clear. But losing 0.002 visibility points? How many visitors is that?

We want to build an index that allows us to see whether our ranking changes over time across a large number of keywords. How our market competitors are developing can only be seen by scraping Google, and that is not allowed.

Visibility index based on the Webmaster Console

Obviously, it is better to rank 3rd for a search term that is searched 100 times a day than for one that is searched only twice a day, so the number of searches should play a role in our visibility index. We ruled out the search volume from the AdWords Keyword Planner for the reasons mentioned above, so the only source left is the impressions from the Webmaster Console. Once we're on the first page and far enough up (I'm not sure whether an impression is counted if you rank 10th and were never in the viewable area), we can use the number of impressions from the Webmaster Console, even on a daily basis!

A small health check for the keyword “Scalable Capital Experiences” (AdWords / real data from the Webmaster Console):

  • 2400 / 1795 (September, but only half a month)
  • 5400 / 5438 (October)
  • 1000 / 1789 (November)

For September and October it looks good; only November is a bit strange, with almost 80% more impressions than there were supposedly searches. Something must have happened in September/October that suddenly brought Scalable Capital so many searches; in fact, this was also visible in the traffic on my site. We can't fully clarify this discrepancy and accept that the AdWords numbers aren't perfect either.

The following figures illustrate how different the visibility indices are depending on the weighting:

  • In the simplest model, only 11 minus position is calculated; everything beyond position 10 gets 1 point. The values of all results are added up for each day. This model has the disadvantage that the index can climb quickly even if I only rank 1st for terms that are searched once a month.
  • In the second model, the same procedure is used, except that the value is multiplied by the impressions.
  • In the third model, the average CTR per SERP position from the second part of this series is multiplied by the impressions (a sketch of all three models follows below).
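A minimal sketch of all three variants, assuming a data frame results with the date, position and impressions columns as delivered by the Webmaster Console API, plus a hypothetical vector ctrPerPosition holding the average CTR for positions 1 to 10 from the second part:

points <- ifelse(results$position > 10, 1, 11 - round(results$position))
results$model1 <- points
results$model2 <- points * results$impressions
# positions beyond 10 are clamped to the CTR of position 10 here
results$model3 <- ctrPerPosition[pmax(1, pmin(round(results$position), 10))] * results$impressions
# one index value per day and model
index <- aggregate(cbind(model1, model2, model3) ~ date, data = results, FUN = sum)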

If you now look at the actual traffic, you can see that the third model already tracks it quite closely. The spikes in the real traffic aren't quite as strong as in the index, and in the end I get less traffic than the index suggests, but that may be because my actual CTR is below the expected CTR.

Alternative approach with the Webmaster Console

If you look at the plots, however, it becomes clear that this approach with daily impressions makes little sense. If the curve goes down, it doesn't necessarily mean that I need to act: maybe people are simply searching less for this topic and my rankings haven't changed at all (usually you only write about what works, but I find the failures exciting too, because you can learn a lot from them :-)). This is probably exactly why Sistrix calculates out seasonal fluctuations.

Alternatively, you could simply average all impression data of a keyword-landing-page pair and calculate with that average, again weighted with the CTR per position. The good thing about this approach is that seasonal or temporary fluctuations balance each other out. Plotted, it looks like this:
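Under the same assumptions, a sketch of this variant; the mean impressions per pair (identified by the Hash column from the second part) replace the daily impressions:

results$meanImpressions <- ave(results$impressions, results$Hash, FUN = mean)
results$model4 <- ctrPerPosition[pmax(1, pmin(round(results$position), 10))] * results$meanImpressions
index2 <- aggregate(model4 ~ date, data = results, FUN = sum)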

This plot looks very similar to the first one, but it doesn't always have to be that way. When I compare it with the Sistrix values (even if mine are at a very low level), it looks very similar.

From data to action relevance

Now we have a plot that shows us the weighted development of our rankings, but what do we do with it? It's not really "actionable" yet. It only gets exciting when we also look at which rankings change the most and influence our visibility index. To do this, we first take the minimum ranking for each keyword-landing-page pair (minimum because lower is better, and lower than position 1 is not possible) and then the current ranking. Finally, we calculate the delta and sort by it:
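A sketch, assuming the data frame results with date, position and the Hash identifier per keyword-landing-page pair:

best <- aggregate(position ~ Hash, data = results, FUN = min)
names(best)[2] <- "bestPosition"
# take the rankings of the most recent day as the current state
current <- results[results$date == max(results$date), c("Hash", "query", "page", "position")]
names(current)[4] <- "currentPosition"
deltas <- merge(current, best, by = "Hash")
deltas$delta <- deltas$currentPosition - deltas$bestPosition
deltas <- deltas[order(-deltas$delta), ]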

The higher the delta, the greater the loss of ranking positions, and the greater the need for action, provided the keyword is actually interesting. In my example, I wouldn't mind ranking for "seo monitoring"; after all, the articles in this series are relevant to it. You could now weight by the impressions or by the visibility index we chose earlier:
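For example, weighted by the mean impressions per pair, as a sketch:

meanImpr <- aggregate(impressions ~ Hash, data = results, FUN = mean)
deltas <- merge(deltas, meanImpr, by = "Hash")
# impressions here is the mean impressions per pair from the aggregate above
deltas$actionability <- deltas$delta * deltas$impressions
deltas <- deltas[order(-deltas$actionability), ]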

This looks more exciting: in fact, sorted by "actionability", some search queries at the top are ones I find quite interesting. You could now combine this with the data from the second part and build a dashboard… more on this in the fourth part.

Actionable SEO Reporting with R and AWS


The first part was about how to create automated SEO monitoring with R and an initially free AWS instance. The Webmaster Console is queried every day via API, and the data is written to a database. Of course, collecting data alone doesn't help; we should do something with it, and my mantra, which every student hears several times a day in my courses, is "Data – Information – Action". In most treatises, knowledge takes the place of action, because processing information is what creates knowledge in the first place. In data analysis or data science, however, the point is more often not only to know something, but also to do something with that knowledge, ideally knowing exactly what to do.

The reason most reports are not read is usually that no action can be derived from them. Five more likes this week. So what? What do I do differently tomorrow? So we don't just want to build SEO reporting, we want SEO reporting that tells us where to act and what to do. "Actionable" is the keyword, and German has no really nice translation for it. "Action-relevant"? It's somehow not the same. So let's live with the English term for the time being.

The value of the Webmaster Console API

Let's first take a look at the data we get from the Webmaster Console. There is a special feature here that we don't get in the interface: there we see either the search queries or the pages that users reached after clicking on a search result, but not both at the same time. This is real added value, because we can form keyword-landing-page pairs. A landing page can have several keywords, and vice versa, if several pages from one host rank for a keyword. We use these pairs as a unique identifier in our evaluation, for example by concatenating both and computing an MD5 hash of the result (the close observer will notice that the same combination appears to occur twice in one day, but that is only because one version of the URL has /amp/ appended, which is not visible in the table).

If we haven’t already done this when writing to the table, we’ll do it now:

library(digest)
newResults <- results
newResults["Hash"] <- NA
for (i in 1:nrow(newResults)) {
  newResults$Hash[i] <- digest(paste(newResults$query[i], newResults$page[i], sep = ""))
}

Of course, this can be done more elegantly with apply, but we are currently in hacker mode, not in nice-programming mode.
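For the record, a sketch of the apply variant, which produces the same hashes without the explicit loop:

newResults$Hash <- apply(newResults[, c("query", "page")], 1,
                         function(row) digest(paste(row, collapse = "")))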

Another special feature is that we probably wouldn't bother to look at the data for each individual day in the interface. Doing so is extremely helpful, though, because the default setting in the Webmaster Console is the last 28 days, and an average is calculated for each value. Those who know me better will now look at the ground in embarrassment, because I say the same thing here again and again: the average, specifically the arithmetic mean, is the enemy of statistics, because this type of average does not show us the swings. More about that another time, when I put the script for the data analysis course online. The point here is that in the interface's arithmetic mean I can see that a keyword held a certain position over the last 28 days, but the range per day is much more interesting, because if I pull the data per day, I can map trends much more accurately. Last but not least, the Webmaster Console also gives us impressions and keywords that cannot be found in the databases of the common tools. Sistrix, as much as I've grown fond of it, doesn't find me for "Cookidoo" or "Scalable Capital". Sure, I could feed Sistrix my Webmaster Console data, but unfortunately I'm not allowed to do that for every project.

Since we query the data every day, we can now walk through the table and collect the values for each identifier, so that we get all values for a keyword-landing-page combination and can plot them: on the x-axis the time history, on the y-axis the position. Two little R tricks can be seen here: R usually plots the y-axis from low to high, so we reverse it, and we restrict the axis to exactly the range that interests us, not all positions from 1 to 100, but only the range we actually ranked in. The plot requires only a few lines of code:

maxValue <- max(currentQuery$position)
minValue <- min(currentQuery$position)
x <- minValue:maxValue
query <- currentQuery$query[1]
# reversing the range puts position 1 at the top of the y-axis
plot(currentQuery$date, currentQuery$position, ylim = rev(range(x)))
title(main = query)

We see the development of the ranking for a keyword on our plot, but it is not really “actionable” yet. Let’s take a look at the CTR next to it:

The better the position, the higher the click-through rate. That's obvious. But in these two plots you can see that first the click-through rate went down and then the position. Not that the click-through rate is the only ranking factor, but a poor click-through rate on a result testifies to suboptimal perceived relevance, and no search engine wants its results to be perceived as less relevant. So a look at the title and description would be a good recommendation for action. But how do we know what a good CTR for a position actually is? For a start, we can plot our own rankings:

And we can compare this with the results of Advanced Web Ranking, which creates plots from the Webmaster Console data of its large customer base. Every country and industry is different, and the CTR depends on the SERP itself, i.e. whether other elements are present that influence it. But even from the plot alone you can see that certain CTRs are suboptimal for certain results. So "all" that is needed is a report that determines which keyword-landing-page combinations are below average.

If we take a closer look at the plot of CTR per position, we see a few unusual things. On the one hand, there are a few results where I have a 100% CTR no matter what position I'm in. On the other hand, there is a lot of noise between positions 1, 2, 3 and so on. The latter is easy to explain: the API gives us average positions with decimal places, as described above, so we only have to round to get actual positions. The 100% CTR mainly affects results with few impressions. If, for example, all results with fewer than 10 impressions per day are filtered out, the picture looks different:
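Both adjustments in a small sketch, assuming the data frame results with the impressions, ctr and position columns as delivered by the API:

filtered <- results[results$impressions >= 10, ]
filtered$roundedPosition <- round(filtered$position)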

And lo and behold, I don't have that many number 1 placements with more than a homeopathic dose of impressions. But if I squint a little, I can see a line. And indeed, if we calculate the mean values (here with summary), a non-linearly descending curve shows up in the means:

Position   Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
1          0.2143    0.3971    0.4929    0.4828    0.5786    0.7059
2          0.0000    0.2667    0.4118    0.3744    0.5000    0.7692
3          0.0000    0.1176    0.1818    0.2217    0.3205    0.5769
4          0.00000   0.08333   0.18182   0.17266   0.26667   0.45454
5          0.0000    0.1240    0.1579    0.1584    0.2053    0.3200
6          0.00000   0.06977   0.11765   0.12223   0.16667   0.30769
7          0.00000   0.05043   0.09091   0.09246   0.13229   0.22222
8          0.00000   0.00000   0.03880   0.04594   0.08052   0.19048
9          0.00000   0.00000   0.00000   0.01412   0.01205   0.16514
10         0.000000  0.000000  0.000000  0.010284  0.004045  0.093023

In the next step, we "only" have to evaluate which results lie below a certain CTR limit. Fortunately, statistics offers a measure of the variation around a mean: the standard deviation. I won't explain it here (but it's in the script for the data analysis course). On the way to the standard deviation, though, let's look at the distribution of the CTRs. As you can see, it looks like a normal distribution, so we can calculate the standard deviation per position with a clear conscience:
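A sketch of this step, based on the filtered data frame with the roundedPosition column from above:

ctrSD <- sapply(1:10, function(p) sd(filtered$ctr[filtered$roundedPosition == p]))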

Position   Standard deviation
1          0.139202
2          0.1691641
3          0.1405702
4          0.1116699
5          0.07492808
6          0.07420478
7          0.05702427
8          0.05028635
9          0.03001044
10         0.02134183

So for position 1 we can calculate: mean minus standard deviation gives the CTR threshold below which a result deserves attention, here

0.4828 - 0.139202 = 0.3436019

All results with a rounded position of 1 and a CTR below this threshold of roughly 0.34 deserve attention. In my case, these are the search queries for scalable capital experience and cookidoo. On some days their CTRs are above the threshold, on others below. This is then repeated for each position.
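The same for all positions at once, as a sketch; ctrMean is assumed to be the vector of mean CTRs per rounded position, analogous to ctrSD above:

ctrMean <- sapply(1:10, function(p) mean(filtered$ctr[filtered$roundedPosition == p]))
threshold <- ctrMean - ctrSD
# flag all results whose CTR is below the threshold for their position
flagged <- filtered[filtered$roundedPosition <= 10 &
                    filtered$ctr < threshold[filtered$roundedPosition], ]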

Too much data?

We now have a problem: my blog is still a relatively small site. What do we do with a site that has many keywords and many rankings? We'll get to that next time, when we create our own visibility index.

mv: argument list too long – processing millions of files


Due to my forgetfulness, a cron job had accumulated more than 3 million files in a directory, which I now wanted to process. So that my script doesn't chew on them for days (more than 60 GB of data!), the files had to be divided into smaller chunks. Unfortunately, mv couldn't handle that and complained with "argument list too long". The remedy is a combination of command line commands:

find folder1/ -name '2017-07*' -exec mv {} folder2/ \;
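If mv supports the -t option (GNU coreutils), find can also batch the arguments itself instead of spawning one process per file, which is considerably faster for millions of files:

find folder1/ -name '2017-07*' -exec mv -t folder2/ {} +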

Clustering with Google Analytics and R


Some questions are not so easy, or even impossible, to answer with the Google Analytics user interface (this also applies to Adobe Analytics, Piwik, etc.). While Google Analytics offers powerful and easy-to-use functionality for manually creating and comparing segments or personas based on devices, acquisition channels, or browsers, as soon as it goes beyond these standard segments or to combinations of multiple dimensions, the effort becomes complex. Often enough, people simply "poke" at the data and hope to find something valuable. This is exactly where the advantages of combining Google Analytics and R come into play. One way to connect the two is the R package googleAnalyticsR by Mark Edmondson, which is used as an example in this article.

Segmentation, Clustering and Classification

Before we get practical, let's briefly explain the difference between segmentation, clustering and classification. Segmentation attempts to divide customers or users into groups that differ by some characteristic, be it an interest, an access channel, the marketing campaign through which a customer came, etc. Clustering attempts to identify such groups and the dimensions or features that distinguish them automatically, whereas classification attempts to predict which group a customer or user belongs to. Classification is a good example of supervised machine learning, clustering of unsupervised machine learning. Something meaningful doesn't always come out of either, hence the frequent use of the word "attempts".

This example is about clustering, i.e. identifying groups based on structures in the data using machine learning algorithms. Hierarchical clustering identifies groups based on their similarity: it starts with a separate cluster for each data point and merges the most similar clusters at each subsequent level until all clusters are merged (see the dendrogram). A great algorithm, but one that ideally needs numerical data as input, because similarity is calculated as a distance. DAISY could also handle non-numerical data, but that would go too far for a first step.

For our example, we'll use a different approach: we simply want to find out whether clusters can be formed based on the content or products viewed on a page. My suspicion is that visitors to this website who look at roller derby photos are very likely not interested in the articles about fintechs. But maybe readers of the Google Analytics articles are interested in fintechs, so they could be offered articles on that topic. We know this from Amazon ("Frequently bought together"), and most of the time the suggestions are useful. Of course, I can also look at the user flow report in Google Analytics (with the few pages on my homepage, that is more than enough), but as soon as there are more pages or products, this report doesn't get us any further. This is where Association Rules, also known as Market Basket Analysis, help us (I even once worked with one of their creators, Tomasz Imielinski, at Ask.com). This machine learning approach tries to identify interesting relationships between variables in large data sets.

In general, the choice of algorithm depends first and foremost on the business problem to be solved. This is an extremely important point: what question do we actually want to answer? In this case, I can only answer which pages ended up in a "shopping cart" together, based on the usage by website visitors. Of course, I could also calculate a "similarity" automatically based on the content.

Keep your eyes open when interpreting the results

But first of all, a big warning: what users look at depends to a large extent on what the navigation and other elements on the page look like. If, for example, a recommendation system such as YARPP is already integrated into the site, then the probability is higher that some pages are accessed together with others due to this plugin alone. Even small things on a page can lead to connections being seen that aren't really there, be it just a small icon that attracts the user's attention and tempts them to click.

It should also be pointed out that viewing a page is not as strong a signal as buying a product. Just because someone opens a page doesn't mean that they stayed there long and read the text. You could account for this, for example, by measuring the time spent on each page and only allowing combinations where "enough" time was spent, but time-on-page in web analysis systems is unfortunately often unusable.

Preconditions

Clustering with Google Analytics and R requires analytics data at the user level, which means that users of a standard installation of the free version of Google Analytics cannot easily perform clustering. The standard version has the new user explorer, which can be used to look at the behavior of individual anonymized users, but this data cannot be downloaded. Only buyers of the Premium variant get access to user-level data out of the box; the free version offers aggregated data only. However, there is a small hack to change this. Otherwise, here is another argument for Piwik, at least as a second system: besides access to the raw data, Piwik's advantage is that the data is available immediately, and there is an R package for Piwik as well. In this article, however, the solution with Google Analytics and R is to be shown.

Connect Google Analytics and R to googleAnalyticsR

Now it's time to get down to business. I won't explain how to install an R package; for what we are planning, the googleAnalyticsR and googleAuthR packages are needed on the one hand, and the arules package on the other. The R help is your friend.

We first load the two packages and then log in with a Google account (a browser window opens for this):

library("googleAuthR")<l g ibrary("googleAnalyticsR")<a_auth()

By the way, if you are already logged in and want to switch to another account, you enter

ga_auth(new_user = TRUE)

and log in with the other account. Next, let’s get the list of our accounts and properties that are connected to this Google account:

my_accounts <- ga_account_list()

In this data frame, we look in the viewName column for the view we want to use and find the corresponding viewId in the same row. This viewId is elementary, because we will use it to retrieve the data of our view. We store the viewId in a variable:

ga_id <- XXXXXX

where XXXXXX stands for the viewId. What data is available to us now? Simply enter

meta <- google_analytics_meta()

and all dimensions and metrics are available in the Data Frame meta. We are particularly interested in Custom Dimension 1, where the User Level ID is stored, and ga:pagePath:

gadata <- google_analytics(id = ga_id,
                           start = "2017-04-20", end = "2017-05-06",
                           dimensions = c("ga:dimension1", "ga:pagePath"),
                           max = 2000)

In my data frame gadata, I now have the data described above plus two more columns (sessions and bounceRate), because GA won't output the dimensions without metrics. The data needs to be transformed so that it looks something like this:

user id 1, page 1, page 2
user id 2, page 1
user id 3, page 3

This is done with the code

i <- split(gadata$pagePath,gadata$dimension1)

We now have every “transaction” of a user on one line. This is the input needed for our algorithm.

Market Basket Analysis

Now is the time to load the R package arules and modify the data for the algorithm:

library(arules)
txn <- as(i, "transactions")

When calling the algorithm, the parameters sup for support and conf for confidence are passed; in the following example, we say that we want rules that apply in at least 0.1% of cases and that have a confidence of at least 0.1%. That sounds awfully little at first, but remember that we are dealing with a "shopping cart" into which many different combinations can be "placed", depending on the size of the online shop or website. We want to cover as many transactions as possible with the rules, but the more possible combinations there are, the lower the support value we will have to accept.

basket_rules <- apriori(txn, parameter = list(sup = 0.001, conf = 0.001, target="rules"))

This can take up quite a lot of processor time and RAM, and depending on the configuration, R aborts with an error message. One way to rein in the algorithm is to raise the sup and/or conf parameters. If everything went well, the identified rules can be examined:

inspect(head(sort(basket_rules, by="lift"),20))

This shows the top 20 rules, sorted by lift. The lift tells us how much more often the items occur together than would be expected if they were independent, which corresponds to a lift of 1. My output looks like this (not for this website):

Rule two, for example, states that a small proportion of transactions involve a user looking at an article about communication methods and, with a confidence of 30%, also reading the article about communication technology. This combination has an extremely high lift of 521, so we can assume that it occurs far more often than it would by mere coincidence.

Summary Google Analytics and R

This small example shows the power of R and its packages. With just a few lines of code, data can be imported directly from Google Analytics into R and used for complex analyses that would not have been possible in the user interface.

However, we also see here how important it is to have data at the user level, although strictly speaking we don't even have user-level data, but rather browser-level data, due to the lack of cross-device tracking. A next step could therefore be to compute the identified rules again on segmented data in order to identify differences.

Linear regression: How much should a used SLR camera cost?


Since the Canon 5D Mark IV has just been released, the 5D Mark III should become affordable. I was advised to pay at most €1,500 for a camera with a maximum of 30,000 shutter actuations, but if you look at the cameras on offer on eBay and in the relevant forums, prices seem to be considerably higher. So what is a fair price? With sufficient data, this can be determined by regression.

First of all, let's assume that the more shutter actuations a camera has, the cheaper it becomes. 150,000 actuations is the expected life of the shutter on the 5D Mark III; a new shutter including labor costs about €450 (not verified). But first, let's take a look at the data. I have collected the prices and actuation counts of almost 30 offers from the various platforms; this does not take into account the camera's external condition or the accessories included. This data is read into R and examined for its characteristics:
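A sketch with an assumed file name and assumed column names (price in euros, actuations as the shutter count):

cameras <- read.csv("5d-mark-iii-offers.csv")
summary(cameras)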

We have an average price of €1,666 and an average of 85,891 actuations. That's a far cry from €1,500 with 30,000 actuations. The median doesn't look much better either. Now we run the regression: we build a model with the function lm and look at the details with summary(MODELNAME):
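The model itself is a one-liner, again with the assumed column names:

model <- lm(price ~ actuations, data = cameras)
summary(model)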

From this we can already form a function:

Price = -0.001749x + 1816, where x is the number of actuations. So a camera with 30,000 actuations should cost €1,763.53, a camera with 100,000 actuations still €1,641. Let's take a look at the plot:
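The plot, sketched with the same assumed names, draws the offers and the fitted regression line:

plot(cameras$actuations, cameras$price)
abline(model)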

As you can see, we have some outliers (not explained by great accessories or recent maintenance, by the way) that somewhat "distort" the model. I had already removed a camera with nearly 400,000 actuations and a price of still €1,000.

Unfortunately, the number of actuations cannot be read from the portals automatically, because it is always entered manually by the seller. Otherwise, you could build a nice tool for every camera model.