Data Science meets SEO, Part 2


After explaining in the first part what data science is and what already exists at the intersection of data science and SEO, this second part takes a closer look at what the linguistic processing of a document by a search engine means for SEO concepts such as keyword density, TF/IDF and WDF/IDF. Since I showed live code at SEO Campixx, I offer everything for download here, which makes following the examples much easier. By the way, this also works without installing R; here you can find the complete code with explanations and results.

For those who want to recreate it in R

(Please skip if you don’t want to recreate this yourself in R)

I recommend using RStudio in addition to R, because handling the data is a bit easier for newbies (and for professionals as well). R is available from the R Project, RStudio on rstudio.com. First install R, then RStudio.

In the ZIP file there are two files, a notebook and a CSV file with a small text corpus. I refer to the notebook from time to time in this text, but the notebook can also be worked through in this way. Important: Please do not read the CSV file with the import button, because then a library will be loaded, which will nullify the functionality of another library.

The notebook has the great advantage that both my description, the program code and the result can be seen in the same document.

To execute the program code, simply click on the green arrow in the upper right corner of a code chunk, and it runs.

What is TF/IDF or WDF/IDF?

(Please skip if the concepts are clear!)

There are people who can explain TF/IDF (Term Frequency/Inverse Document Frequency) or WDF/IDF (Within Document Frequency/Inverse Document Frequency) better than I can; for WDF/IDF, Karl has distinguished himself with an incredibly good article (in which, by the way, he already said that "actually no small or medium-sized provider of analysis tools can offer such a calculation for a large number of users …" ;-)).

A simplistic explanation is that the term frequency (TF) is the frequency of a term in a document, while the Inverse Document Frequency (IDF) measures the importance of a term in relation to all documents in a corpus in which the term occurs (corpus is the linguists’ term for a collection of documents; this is not the same as an index).

Put simply (but still too complicated): for the TF, the number of occurrences of a term is counted and then normalized, usually by dividing that number by the number of all words in the document (this definition can be found in the book "Modern Information Retrieval" by Baeza-Yates et al., the bible of search engine builders). But there are other weightings of the term frequency, and WDF is actually nothing more than such a different weighting, in which the term frequency is dampened with a logarithm to base 2.

For the IDF, the number of all documents in the corpus is divided by the number of documents in which the term appears, and the result is then also put through a logarithm to base 2. The Term Frequency or the Within Document Frequency is then multiplied by the Inverse Document Frequency, and we have TF/IDF (or, if we used WDF instead of TF, WDF/IDF). The question that arises for me as an ex-employee of several search engines, however, is what exactly counts as a term here, because behind the scenes the documents are heavily processed. More on this in the next section, "Crash Course Stemming…".
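To make these definitions tangible, here is a toy calculation in R with made-up numbers; the WDF shown is the commonly cited log2-dampened variant, which may differ in detail from what individual tools implement:

`
# Toy numbers, purely for illustration
freq_in_doc    <- 12     # occurrences of the term in the document
words_in_doc   <- 1500   # all words in the document
docs_in_corpus <- 1000   # all documents in the corpus
docs_with_term <- 80     # documents in which the term occurs

tf  <- freq_in_doc / words_in_doc                      # normalized term frequency
wdf <- log2(freq_in_doc + 1) / log2(words_in_doc + 1)  # log2-dampened variant (WDF)
idf <- log2(docs_in_corpus / docs_with_term)

tf * idf   # TF/IDF
wdf * idf  # WDF/IDF
`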

Crash Course Stemming, Composite and Proximity

Everyone has had the experience of searching for a term and then getting a results page where the term can also be found in a modified form. This is because, for example, stemming is carried out, in which words are reduced to their word stem. Not only German conjugates and declines; other languages also change words depending on person, tense, etc. In order to increase recall, not only the exact term is searched for but also variants of it (I did not use the term "recall" in the lecture; in information retrieval it describes how many of the relevant documents are found for a search query). For example, for the search query "Data scientist with experience", documents are also found that contain the terms "data scientist" and "experience" (see screenshot on the left).

In the lecture, I stemmed live with the SnowballC stemmer. It is not necessarily the best stemmer, but it gives an impression of how stemming works. In the R notebook for this article, the stemmer is slightly expanded, because unfortunately SnowballC only ever stems the last word of a string, so it has been wrapped in a function that can handle a whole sentence:

`
> stem_text("This is a great post")
[1] "This is a great post"
> stem_text("Hope you've read this far.")
[1] "Hopefully you have read this far."
`
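The notebook contains the actual function; a minimal sketch of such a wrapper, assuming the SnowballC package, could look like this:

`
library(SnowballC)

# Split a sentence into words, stem each word, and reassemble the sentence
stem_text <- function(text, language = "english") {
  words   <- unlist(strsplit(text, " "))
  stemmed <- wordStem(words, language = language)
  paste(stemmed, collapse = " ")
}

stem_text("hoping you have read this far")
`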

But these are not just variants of a term. We Germans in particular are world champions at inventing new words by composing them from other words; it is not for nothing that these new creations are called compounds. An example of the processing of compounds can be found in the screenshot on the right, where the country triangle becomes the border triangle. At first it sounds quite simple: you split off the parts that could stand as words on their own and index them as well. However, it is not quite that simple, because not every compound may be split; separated, it can take on a different meaning. Think, for example, of the German "Druckabfall" (pressure drop), where "Abfall" could also be read as waste, so the compound could be misinterpreted as pressure plus garbage.

The example of the "data scientist with experience" also illustrates another method used by search engines: terms do not have to stand next to each other. They can be further apart, which can sometimes be extremely annoying for the searcher if one of the terms in the search query appears in a completely different context. Proximity, i.e. how close the terms are to each other in the document, can thus be a signal for how relevant the terms in the document are to the search query. Google offers proximity as a feature, but it is not clear how proximity is used as a ranking signal. And this only refers to textual proximity, not yet semantic proximity. Of course, there is much more processing in lexical analysis, stop word removal and so on, but for now this is just a small insight.

So we see three things here that most SEO tools cannot do: stemming, proximity and decompounding. When tools talk about keyword density, TF/IDF or WDF/IDF, the calculation is usually based on exact matches and not on the variants that a search engine processes. Most tools do not make this clear; the ever-popular Yoast SEO plugin for WordPress, for example, only works with the exact match and calculates a keyword density from it (occurrences of the term in relation to all words in the document). But ryte.com, for example, says:

In addition, the formula WDF*IDF alone does not take into account that search terms can also occur more frequently in a paragraph, that stemming rules could apply or that a text works more with synonyms.

So as long as SEO tools can’t do that, we can’t assume that these tools give us “real” values, and that’s exactly what dear Karl Kratz has already said. These are values that were calculated on the basis of an exact match, whereas search engines use a different basis. Maybe it doesn’t matter at all, because everyone only uses the SEO tools and optimizes for them, we’ll take a look at that in the next part. But there are other reasons why the tools don’t give us the whole view, and we’ll take a closer look at them in the next section.

Can we measure TF/IDF or WDF/IDF properly at all?

Now, the definition of IDF already says why a tool that spits out TF/IDF has a small problem: It doesn’t know how many documents with the term are in the Google corpus, unless this number is also “scraped” from the results. And here we are dealing more with estimates. It is not for nothing that the search results always say “Approximately 92,800 results”. Instead, most tools use either the top 10, the top 100, or maybe even their own small index to calculate the IDF. In addition, we also need the number of ALL documents in the Google index (or all documents of a language area, which Karl Kratz drew my attention to again). According to Google, this is 130 trillion documents. So, in a very simplified way, we would have to calculate like this (I’ll take TF/IDF, so that the logarithm doesn’t scare everyone off, but the principle is the same):

TF/IDF = (Frequency of the term in the document/Number of words in the document)*log(130 trillion/”Approximately x results”),

where x is the number that Google displays per search result. So then we would have one number per document, but we do not know what TF/IDF or WDF/IDF values the documents have that are not among the top 10 or top 100 results we examined. It could be that a document in 967th place has better values. We only see our excerpt and assume that this excerpt explains the world to us.

And this is where chaos theory comes into play: anyone who has seen Jurassic Park (or even read the book) may remember the chaos theorist, played by Jeff Goldblum in the film. Chaos theory plays a major role in Jurassic Park, because the story is about complex systems exhibiting behavior that is difficult to predict. Large parts of the park are monitored by video cameras, except for 3% of the area, and this is exactly where the female dinosaurs reproduce, because they can change their sex (which frogs, for example, can also do). Transferred to TF/IDF and WDF/IDF, this means: it is not just 3% of the area that we cannot see; we see less than 1% (the top 10 or top 100 of the search results) and do not know what is lying dormant in the rest of the results. Nevertheless, we try to predict something on the basis of this small part.

Does this mean that TF/IDF or WDF/IDF are nonsense? No, so far I have only shown that these values do not necessarily have anything to do with what values a search engine has internally. And this is not even new information, but already documented by Karl and some tool providers. Therefore, in the next part, we will take a closer look at whether or not we can find a correlation between TF/IDF or WDF/IDF and position on the search results page.

Say it with data or it didn’t happen

In the enclosed R notebook, I have chosen an example to illustrate this which (hopefully) reminds us all of school: I built a small corpus of Goethe poems (at least here I have no copyright problems; I was not quite so sure about using search result pages). A little more than 100 poems, one of which I still know by heart after 30 years. In this wonderful little corpus, I first normalize all the words by lower-casing them, removing numbers, removing periods, commas, etc., and removing stop words.
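Not the notebook's exact code, but a minimal sketch of how this normalization and the weightings discussed above could be computed with the tm package, assuming the poems are available as a character vector `poems`:

`
library(tm)

# Normalize the corpus as described above
corpus <- VCorpus(VectorSource(poems))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("german"))

# Term frequencies per document (terms in rows, poems in columns)
m <- as.matrix(TermDocumentMatrix(corpus))

tf_norm <- sweep(m, 2, colSums(m), "/")                      # TF, length-normalized
wdf     <- sweep(log2(m + 1), 2, log2(colSums(m) + 1), "/")  # WDF (log2-dampened)
idf     <- log2(ncol(m) / rowSums(m > 0))                    # IDF with log base 2

tf_idf  <- tf_norm * idf
wdf_idf <- wdf * idf
`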

There is a library tm in R with which you can calculate TF/IDF as well as TF (normalized)/IDF, but, scandal (!!), nothing for WDF/IDF. Maybe I will build a package for it myself. For illustrative purposes, however, I simply built all the variants myself and put them next to each other, so you can see in my code exactly what I did. Let's take a look at the data for the unstemmed version:

We can see here that the "ranking" would change if we sorted by TF/IDF or TF (normalized)/IDF instead, so there is a difference between WDF/IDF and the "classical" weightings. And now let's take a look at the data for the stemmed version:

We see that two documents have sneaked into the top 10, and also that we suddenly have 22 instead of 20 documents. This is logical, because two or more different word forms may now have been reduced to the same stem, so more documents contain the term. But we also see very quickly that all the numbers have changed, because we now have a different basis. In other words, whatever SEOs read from WDF/IDF values, they are most likely not what actually happens at Google. And again: this is not news! Karl Kratz has already said it, and some tools also say it in their glossaries. After my lecture, however, it seemed as if I had said that God does not exist.

And perhaps the approximation alone is good enough. We will look at that in the next parts on data science and SEO.

Was that data science?

Phew. No. Not really. Just because you’re working with R doesn’t mean you’re doing data science. But at least we warmed up a bit by really understanding how our tool actually works.

Data Science meets SEO, Part 1


On 1.3.2018 I gave a lecture on the topic of data science and SEO at the SEO Campixx, and since there were some discussions afterwards :-), I will describe the contents in more detail here, in several parts. This part is first of all about what data science is and what already exists on the topic.

What exactly is data science?

"The sexiest job of the 21st century" is rather dull on closer inspection, because most of the time is spent acquiring and cleaning data and using it to build models. It's coding, it's math, it's statistics, and with larger amounts of data it's also a lot of knowledge about how to wire up instances with each other, for example on Amazon Web Services or the Google Cloud Platform. A global-galactic definition of data science does not exist to my knowledge, but I would define data science as the intersection of

  • Data Mining
  • Statistics and
  • Machine Learning

These are not new topics, but what is new is that we have much more data, much faster processors, cheap cloud computing and many development libraries. For the statistics language and development environment R used here, libraries exist for almost every purpose; somewhere there was someone who faced the same problem and built a solution for it. What is also new is that more and more companies sense that you can do something with data; after all, Spotify uses data to know what music you might also like, and Google knows when you should set off if you want to get to work on time.

Unfortunately, the data hype (which, after the trough of disillusionment, will be followed by a healthy understanding of what is possible) is met by relatively few people who feel at home in all three disciplines (plus cloud computing). This in turn leads to these data scientist unicorns sometimes being offered unreasonable sums, and to thousands of courses on Udemy & Co that promise to provide the necessary knowledge.

A real problem with data science, however, is that not only knowledge in several areas is necessary, but also the understanding that data is used to solve a problem. I can deal with algorithms and data all day long, for me it’s like a kind of meditation and relaxation. In fact, sometimes I feel like playing with Lego, but at the end of the day, it’s all about solving problems. Not only collecting data, but also extracting the right information from it and then taking the right action (the holy trinity of data). And here is the challenge that often enough people just say, here is data, make something out of it. Therefore, it is an art for the data scientist to understand exactly what the problem actually is and to translate this into code.

In addition, many people have bad memories of math. Accordingly, the audience’s willingness to consume slides with lots of numbers and formulas tends to be on the lower end of the scale. That’s why I also worked with smaller examples in the lecture, which everyone should be able to understand well.

What kind of topics am I working on? Very different. Classification. Clustering. Personalization. Chatbots. But also analyses of slightly larger data volumes of 25 million rows of Analytics data and more, which have to be processed in a few minutes. All kinds of things.

What is already there?

On the search engine side, there is already a lot. When I was still at Ask, we were already working with Support Vector Machines, for example, to create the ranking for queries where the pages had almost no backlinks. Even then, there was dynamic ranking. The topic detection of most search engines is based on machine learning. RankBrain is said to be based on machine learning. So it is not a new topic for search engines.

On the other side, that of the SEOs, the topic still seems to be relatively fresh. Search Engine Land says that every search marketer can think of themselves as a data scientist. I'm not sure I would subscribe to that, because most search marketers I know don't build their own models; as a rule, they use tools that do it for them. On SEMrush you can find a collection of ideas, but more for SEA. Remi Bacha's work is exciting, although I haven't seen any data from him yet. Keyword Hero have come up with something pretty cool by using deep learning to identify the organic keywords that are no longer passed along since the switch to https. Otherwise, I haven't seen much on the subject. So we see that we are at the very beginning.

What would we like to have?

Back to the question of what problem I actually want to solve with my work. In an ideal world, an SEO naturally wants to be able to reverse-engineer the Google algorithm. However, this is unlikely, because of the more than 200 ranking signals, only a few are available to us. What we can do, however, is try to build models with the signals we have, and possibly create smaller tools. And that's exactly what the next part is about.

SEO monitoring with R, AWS, and Shiny


This is the final part of the series on SEO with R and AWS. In the first part, we prepared the AWS instance with RStudio, in the second part we carried out a small SEO analysis, and in the third part, we created a visibility index and an “actionable reporting”. This part is about how it is too exhausting for even the most hardened data scientist to run the individual scripts through RStudio every day. SEO monitoring should therefore run via an appealing interface.

GUI development with Shiny

With Shiny, the developers of RStudio have provided a framework that enables interaction with data with just a few lines of code. Fortunately, Shiny is included in the AMI that we installed in Part 1 of this series.

The code is divided into two parts, the server code and the UI code; both parts can be put into a single file, which is then saved as app.R in a special folder in RStudio. The server part uses the individual functions that we used in the first parts of this series (which is why you won't find all the code here).

`
server <- function(input, output, session) {

  # get data from db and create dropdown items
  sites <- unique(results$website)
  output$moreControls <- renderUI({
    selectInput("moreControls", "Select Website", choices = sites)
  })

  output$sviPlot <- renderPlot({
    # Visibility Index Code
  })

  output$actions <- renderTable({
    # Actionable Data Code
  })
}
`

In the UI part, the interface is simply assembled:

`
ui <- fluidPage(
  title = "Tom's SEO Tool",
  uiOutput("moreControls"),
  hr(),
  plotOutput("sviPlot"),
  tableOutput("actions")
)
`

And finally, the app is started:

`
# Run the application
shinyApp(ui = ui, server = server)
`

That's it. Plain and simple:

Not necessarily pretty enough for a customer, but perfectly sufficient as a reporting system for yourself. In the upper left corner, I select the website for which I want the report, and after the scripts have run, the visibility index plot and the table with the largest ranking losses appear. Creating this dashboard took just 20 minutes (ok, to be fair, I had built other dashboards with Shiny before, so I had some experience).

Next steps

In my current setup, I just used the code I had used so far, but that also means that it takes a while until something can be seen. It would be better if the data were automatically evaluated every day and the dashboard then simply retrieved the aggregated data. It’s on my to-do list.

Then it would be nice if you could click on an entry in the table and then analyze it further. It’s also on my to-do list.

A connection to Google Analytics would be great.

Not every keyword is interesting and should be included in the report.

All good points that I could tackle in time, and above all good material for further posts. And I'm always open to more ideas.

Result

With a little brainpower, little code, and the help of Amazon Web Services and R, we built automated and free SEO reporting. Of course, Sistrix & Co offer even more functions. This approach is primarily about making better use of the data from the Webmaster Console, which is available free of charge. Because let’s be honest: We usually have enough data. We just don’t make much of it most of the time.

This approach has another advantage: The data from the Search Console is gone after 90 days. You can’t look any further back. Here, however, we are building up an archive and can also look at longer-term developments.

A custom visibility index with R and AWS


The third episode on search engine optimization with R and AWS is about creating your own visibility index to get an aggregated overview of the ranking of many keywords. In the first part, we looked at how to automatically pull data from the Webmaster Console with R and an AWS Free Tier EC2 instance, and the second part was about initial analyses based on click-through rates on positions.

What does a visibility index do?

To do this, let's first look at what a visibility index is supposed to do. With Sistrix, the visibility index is independent of seasonal effects, which is quite charming, because it means you can already tell in summer whether winterreifen.de (winter tires) stands a chance. This can be achieved, for example, by using the average number of searches from the AdWords Keyword Planner. It's just a shame that this tool only spits out reasonably useful values if you spend enough budget in Google AdWords. That approach therefore falls flat, because we want to keep our monitoring as cheap as possible, at best free of charge.

Sistrix has the disadvantage that a) despite its low price it is still too expensive for the students who attend my SEO courses and b) my little niche projects are not always included in the Sistrix keyword database. Sistrix and co. are especially useful if you want to compare a site with other sites (ideally from the same industry with the same portfolio). An abstract number like a visibility index of 2 is otherwise pretty meaningless; it only makes sense if I can relate it to other websites and/or use it to track how my own website's rankings develop over time. On its own the number is still not meaningful, because what unit is it measured in? If I lose 2 kilos, the unit is clear. But losing 0.002 visibility index points? How many visitors is that?

We want to build an index that allows us to see whether our ranking changes over time across a large number of keywords. How our market competitors are developing can only be seen by scraping Google, and that is not allowed.

Visibility index based on the Webmaster Console

Obviously, it is better to rank 3rd for a search term that is searched 100 times a day than for a search term that is searched for only twice a day. So the number of searches should play a role in our visibility index. We have ruled out the search volume from the AdWords Keyword Planner for the reasons mentioned above, so the only source left is the impressions from the Webmaster Console. Once we are on the first page and far enough up (I'm not sure whether an impression is counted if you rank 10th and were never scrolled into view), we can use the number of impressions from the Webmaster Console, even on a daily basis!

A small health check for the keyword “Scalable Capital Experiences” (AdWords / real data from the Webmaster Console):

  • 2400 / 1795 (September, but only half a month)
  • 5400 / 5438 (October)
  • 1000 / 1789 (November)

For September and October it looks good; only in November is it a bit strange that I had almost 80% more impressions than there were supposedly searches. Something must have happened in September/October so that Scalable Capital suddenly had so many searches; in fact, this was also visible in the traffic on my site. We can't fully clarify this point and accept that the AdWords numbers are not perfect either.

The following figures illustrate how different the visibility indices are depending on the weighting:

  • In the simplest model, only 11 minus position is calculated; everything beyond position 10 gets 1 point. The values for all results are added up for each day. This model has the disadvantage that I can climb very quickly even if I only rank 1st for terms that are searched for once a month.
  • In the second model, the same procedure is chosen, except that here the value is multiplied by the impressions
  • In the third model, the average CTR on the SERP from the second part of this series is multiplied by the impressions.
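These are not the original scripts, but a rough sketch of the three weightings, assuming a data frame `results` with one row per keyword/landing-page/day (columns position, impressions, Date) and a hypothetical `expected_ctr()` lookup that returns the average CTR for a rounded position:

`
# Points per result: 11 minus position, everything beyond position 10 gets 1 point
results$points <- ifelse(results$position > 10, 1, 11 - results$position)

results$model1 <- results$points                                              # position only
results$model2 <- results$points * results$impressions                        # weighted by impressions
results$model3 <- expected_ctr(round(results$position)) * results$impressions # expected CTR x impressions

svi <- aggregate(cbind(model1, model2, model3) ~ Date, data = results, FUN = sum)
`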

If you now look at the actual traffic, you can see that the 3rd model is already very close to the traffic. The spikes in real traffic aren’t quite as strong as in the index, and in the end, I don’t get as much traffic, but that may be because the actual CTR is below the expected CTR.

Alternative approach with the Webmaster Console

If you look at the plots, however, it becomes clear that this approach with impressions on a daily basis makes little sense. Because if the curve goes down, it doesn’t mean that I can do anything, because maybe there is just less searching for this topic and my rankings haven’t changed at all (usually you only write about what works, but I also find the failures exciting, because you can learn a lot from them :-)). This is exactly why Sistrix will probably also calculate out seasonal fluctuations.

Alternatively, you could simply average all impression data of a keyword/landing-page pair and calculate with that average instead, again weighted with the CTR per position. The good thing about this approach is that seasonal or temporary fluctuations balance each other out. Plotted, it looks like this:

This plot looks very similar to the first plot, but that doesn’t mean that it always has to be that way. But when I look at the Sistrix values (even if I’m on a very low level), it looks very similar.

From data to action relevance

Now we have a plot that shows us the weighted development of our rankings, but what do we do with it? It's not really "actionable" yet. It only gets interesting when we also look at which rankings change the most and have an influence on our visibility index. To do this, we first take the best ranking for each keyword/landing-page pair (the minimum, because lower numbers are better and better than 1st place is not possible) and then the current ranking. Finally, we calculate the delta and sort by it:
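A sketch of this delta calculation, assuming the `results` data frame from the earlier parts of this series with the columns query, page, position, Date and the Hash identifier per keyword/landing-page pair:

`
# Best (= lowest) position ever seen per keyword/landing-page pair
best <- aggregate(position ~ Hash + query + page, data = results, FUN = min)
names(best)[names(best) == "position"] <- "best_position"

# Current position = position on the most recent day
current <- results[results$Date == max(results$Date), c("Hash", "position")]
names(current)[names(current) == "position"] <- "current_position"

actions <- merge(best, current, by = "Hash")
actions$delta <- actions$current_position - actions$best_position
actions <- actions[order(-actions$delta), ]   # biggest losses first
`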

The higher the delta, the greater the loss of ranking positions, and the greater the need for action, provided that the keyword is really interesting. In my example, I wouldn't mind ranking for "seo monitoring"; after all, the articles from this series are relevant for it. You could now weight by the impressions or by the visibility index we chose earlier:

This looks more exciting: In fact, there are some search queries at the top (sorted by “Actionability”) that I find quite interesting. And now you could combine that with the data from the second part and build a dashboard… more on this in the fourth part

Actionable SEO Reporting with R and AWS


The first part was about how to create automated SEO monitoring with R and an initially free AWS instance. The Webmaster Console is queried every day via API, and the data is written to a database. Of course, collecting data alone doesn't help; we should do something with it, and my mantra, which every student hears several times a day in my courses, is "Data – Information – Action". In most treatises, knowledge takes the place of action, because processing information is what creates knowledge in the first place. In the field of data analysis or data science, however, it is more often a matter of not only knowing something, but also doing something with that knowledge, ideally knowing what to do. The reason most reports are not read is usually that no action can be derived from them. 5 more likes this week. So what? What do I do differently tomorrow? So we don't just want to build SEO reporting, we want to create SEO reporting that tells us where to do something and what to do. "Actionable" is the buzzword, and there is no really nice German translation for it. "Action-relevant"? It's somehow not the same. So let's live with the English term for the time being.

The value of the Webmaster Console API

Let’s first take a look at the data we got from the Webmaster Console. There is already a special feature to be seen here that we don’t get in the interface. There we get either the search queries or the pages that the users came to after clicking on a search result, but not both at the same time. This is already an added value, because we can form pairs of keyword and landing page, so to speak. A landing page can have several keywords, and vice versa, by the way, if several pages from one host rank for a keyword. We use these pairs for our evaluation to have a unique identifier, for example by putting both together and making an MD5 hash out of them (the close observer will notice that the same combination occurs twice in one day, but it only looks like that, because one version of the URL has a /amp/ after it, but this is not to be seen in the table).

If we haven’t already done this when writing to the table, we’ll do it now:

`
library(digest)
i <- 1
newResults <- results
newResults["Hash"] <- NA
for (i in i:nrow(newResults)) {
  newResults$Hash[i] <- digest(paste(newResults$query[i], newResults$page[i], sep = ""))
  i <- i + 1
}
`

Of course, this can be done even more nicely with apply, but we are currently in hacker mode, not in nice-programming mode.
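For instance, a loop-free variant (not the original code, just a possible alternative) could look like this:

`
# Hash each query/page pair row-wise with apply instead of a loop
newResults$Hash <- apply(newResults[, c("query", "page")], 1,
                         function(row) digest(paste(row, collapse = "")))
`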

Another special feature is that we get the data for each individual day, something we would probably never bother to look at in the interface. This is extremely helpful, because the default setting in the Webmaster Console is the last 28 days, and an average is calculated for each value. Those who know me better will now look embarrassed at the ground, because I say the same thing here again and again: the average, and indeed the arithmetic mean, is the enemy of statistics, because this type of average does not show us the swings. More about that another time, when I put the script for the data analysis course online. The point here is that in the arithmetic mean of the interface, I can see that a keyword has held a certain position over the last 28 days, but in fact the range per day is much more interesting, because if I pull the data per day, I can map trends more accurately. Last but not least, the Webmaster Console also gives us impressions and keywords that cannot be found in the databases of the common tools. Sistrix, as much as I've grown fond of it, can't find me for "Cookidoo" or "Scalable Capital". Sure, I could provide Sistrix with my Webmaster Console data, but unfortunately I'm not allowed to do that for every project.

Since we query the data every day, we can now walk through the table and pull the values for the identifier we created, so that we get all the values for a keyword/landing-page combination and can plot them. On the x-axis we have time, on the y-axis the position. Two little R tricks can be seen here, by the way: R usually plots the y-axis from the lowest to the highest value, so we flip it, and we restrict the axis to exactly the area that is interesting for us, namely not all positions from 1 to 100 but only the range in which we actually ranked. The plot requires only a few lines of code:

`
maxValue <- max(currentQuery$position)
minValue <- min(currentQuery$position)
x <- minValue:maxValue
query <- currentQuery$query[1]
plot(df$Date, df$Position, ylim = rev(range(x)))
title(main = query)
`

We see the development of the ranking for a keyword on our plot, but it is not really “actionable” yet. Let’s take a look at the CTR next to it:

The higher the position, the higher the click-through rate. That’s obvious. But in these two plots you can see that first the click-through rate went down and then the position. Not that click-through rate is the only ranking factor, but a poor click-through rate on a result testifies to suboptimal perceived relevance, and no search engine wants the results to be perceived as less relevant. So a look at the title and description would be a good recommendation for action. But how do we actually know what a good CTR is for a position? For example, we can take a plot from our own rankings:

And we could compare this with the results of Advanced Web Ranking, which creates such plots from the Webmaster Console data of its large customer base. Every country and industry is different, and the CTR depends on the SERP itself, for instance on whether other elements are present that influence the CTR. But even from the plot alone, you can see that certain CTRs are suboptimal for certain results. So "only" a report would have to be created that determines which keyword/landing-page combinations are below average.

If we take a closer look at the plot with the CTRs per position, we see a few unusual things. On the one hand, there are always a few results where I always have 100% CTR no matter what position I have. And then there is a lot of noise between positions 1, 2, 3 and so on. The latter is very easy to explain, because the API gives us average positions with decimal places, as described above. So we would only have to round to get actual positions. The 100% CTR mainly affects results with few impressions. For example, if all results that had less than 10 impressions per day are filtered out, then the picture looks different:

And lo and behold, I don't have that many number 1 placements with more than one homeopathic dose of impressions. But if I squinted my eyes a little, I could see a line. And indeed, if we calculate the mean values (here with summary), then I see a non-linear descending curve in the mean:

`
Position 1:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2143  0.3971  0.4929  0.4828  0.5786  0.7059
Position 2:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.2667  0.4118  0.3744  0.5000  0.7692
Position 3:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1176  0.1818  0.2217  0.3205  0.5769
Position 4:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.08333  0.18182  0.17266  0.26667  0.45454
Position 5:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1240  0.1579  0.1584  0.2053  0.3200
Position 6:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.06977  0.11765  0.12223  0.16667  0.30769
Position 7:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.05043  0.09091  0.09246  0.13229  0.22222
Position 8:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.00000  0.03880  0.04594  0.08052  0.19048
Position 9:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.00000  0.00000  0.01412  0.01205  0.16514
Position 10:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 0.000000  0.000000  0.000000  0.010284  0.004045  0.093023
`

In the next step, we “only” have to evaluate which result is below a certain limit of the CTR. Fortunately, there is a measure in statistics that helps us identify the variation around a mean, the standard deviation. This is not explained here now (but in the script for the data analysis course). But on the way to standard deviation, let’s look at an overview of the distribution of CTRs. As you can see, this looks like a normal distribution. Now we calculate the standard deviation per position with a clear conscience:
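A sketch of how this could be done, assuming a data frame `ctr_data` with the columns position and ctr, with results below 10 impressions already filtered out:

`
# Mean and standard deviation of the CTR per rounded position
aggregate(ctr ~ round(position), data = ctr_data,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
`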

`
 [1] 0.139202
 [2] 0.1691641
 [3] 0.1405702
 [4] 0.1116699
 [5] 0.07492808
 [6] 0.07420478
 [7] 0.05702427
 [8] 0.05028635
 [9] 0.03001044
[10] 0.02134183
`

So, for position 1, we could calculate: mean minus standard deviation equals the threshold below which a result's click-through rate deserves attention, here

0.4828 - 0.139202 = 0.3436019

All results with a rounded position of 1 and a CTR below roughly 0.34 deserve attention. In my case, these are the search queries for scalable capital experience and cookidoo. While the CTR is above average on some days, on other days it falls below. This is then repeated for each position.

Too much data?

We now have a problem. Because my blog is still a relatively small site. What do we do with a page with a lot of keywords and a lot of rankings? We’ll get to that next time by creating our own visibility index.

Free and automated SEO monitoring with R and AWS


Slowly, R is making its way into the SEO world, and while R may be a bit confusing at first (functional instead of procedural programming), you can build cool stuff with just a few lines of code. A free SEO monitoring setup will serve as an example here; of course it can't keep up with Sistrix and co., but if you only want to track your own rankings, it is a great and, above all, free solution.

Let’s start with the infrastructure, we only need three components:

  • One (free) EC2 instance
  • A Google Webmaster Account
  • A Google API service account

Amazon offers EC2 instances in the "free tier"; after 12 months a fee is due, but it is in the homeopathic range. The t2.micro instance is fairly underpowered with its one vCPU, 1 GB RAM and 30 GB SSD, but for our purposes it is perfectly sufficient. R is of course not included out of the box, BUT Luis Aslett offers free AMIs (Amazon Machine Images) where RStudio Server is already pre-configured. Matt explains very well how to use these AMIs to set up your own RStudio instance on AWS. All this takes a maximum of 15 minutes, and you have your own free RStudio Server machine in the AWS cloud. Large calculations are not possible with it, but once you get a taste for it, you quickly find yourself in the situation where an instance with a lot of memory is used for larger computing tasks. One click, a few euros a day, and you have a whole cluster with 16 processors and 128 GB of RAM to yourself. But I digress.

In the next step, we’ll take Mark Edmonson’s R package searchConsoleR. This is the elementary cornerstone of our SEO monitoring. In Mark’s example, he simply writes the data to disk, but we prefer to write the data to a database (how to install MySQL on our newly acquired EC2 instance is here. Please note that you only have one user “ubuntu”, i.e. you have to do everything with sudo; otherwise you can also book an RDS instance for a fee). In order to access the Webmaster Console data from a server, a service account is required. This is not quite so easy to set up, but that would go beyond the scope of this article. It is important that the email address of the service account is entered as a full user in the Webmaster Console. And here is the code:

`
library(jsonlite)
library(googleAuthR)
library(searchConsoleR)
library(RMySQL)

options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/webmasters")

gar_auth_service(
  json_file = "/home/rstudio/XXXXXXX.json",
  scope = "https://www.googleapis.com/auth/webmasters"
)
`

We’ll get the data from 5 days ago, then it’s definitely available in the Webmaster Console:

`
delay <- 5
start <- Sys.Date() - delay
end <- Sys.Date() - delay
`

Here is the query:

`
website <- "XXXXX"
download_dimensions <- c('query', 'page')
type <- c('web')

sc_data <- search_analytics(siteURL = website,
                            startDate = start,
                            endDate = end,
                            dimensions = download_dimensions,
                            searchType = type)
`

We add a date and the website (in case we query several websites):

`
sc_data[, 7] <- website
sc_data[, 8] <- start
colnames(sc_data)[7] <- "Website"
colnames(sc_data)[8] <- "Date"
`

Now we write the dataframe to the database (we already added the DB and the table earlier):

`
mydb = dbConnect(MySQL(), user = 'XXXXX', password = 'XXXXX', host = 'XXXXX')
dbSendQuery(mydb, "USE XXXXX")
dbWriteTable(mydb, value = sc_data, name = "rankings", append = TRUE, row.names = FALSE)
dbDisconnect(mydb)
`

Now all we need is a cron job that performs this query every day. To do this, I first use a small shell script, which then calls the R script:

`
cd /home/rstudio/SearchConsole/
Rscript /home/rstudio/SearchConsole/rankings.R
`

Then the cron job is set up with sudo crontab -e. My cron job starts every day at 10:30, or rather 11:30, because the instance is in a different time zone:

30 10 * * * /home/rstudio/SearchConsole/rankings.sh 2>&1 | /usr/bin/logger -t rstudio

Important: Newline after the last line, otherwise cron will complain. The automatic and free SEO monitoring is ready! In one of the next posts, I will show how the data can then be evaluated.

mv: argument list too long – processing millions of files


Due to my forgetfulness, a cron job had accumulated more than 3 million files in a directory, which I now wanted to process. So that my script doesn't chew on them for days (more than 60 GB of data!), the files should be split into smaller chunks. Unfortunately, mv couldn't handle it and complained with "argument list too long". The remedy is a combination of command line commands:

find folder1/ -name '2017-07*' -exec mv {} folder2 \;

Clustering with Google Analytics and R


Some questions are not so easy or even impossible to answer with the Google Analytics user interface (this also applies to Adobe Analytics, Piwik, etc.). While Google Analytics offers powerful and easy-to-use functionality to manually create and compare segments or personas based on devices, acquisition channels, or browsers, once it goes beyond these standard segments or to combinations of multiple dimensions, the effort becomes complex. Often enough, people simply “poke” at the data and hope that they will find something valuable. This is exactly where the advantages of combining Google Analytics and R come into play. One way to connect Google Analytics and R is the R package googleAnalyticsR by Mark Edmonson, which is used as an example in this article.

Segmentation, Clustering and Classification

Before we get practical, let's briefly explain the difference between segmentation, clustering and classification. Segmentation attempts to divide customers or users into groups that differ based on characteristics, be it an interest, an access channel, the marketing campaign through which a customer came, etc. Clustering attempts to identify such groups and the dimensions or features that distinguish them automatically, whereas classification attempts to predict which group a customer or user belongs to. Classification is a good example of supervised machine learning, clustering of unsupervised machine learning. Something meaningful does not always come out of it, hence the frequent use of the word "attempts".

This example is about clustering, i.e. identifying groups based on structures in the data using machine learning algorithms. Hierarchical clustering identifies groups based on their similarity, starting with a separate cluster for each data point and grouping together more clusters based on their similarity at each subsequent level, until all clusters are merged (see the dendrogram). A great algorithm that ideally needs numerical data as input, because similarity is calculated as distance here. Although DAISY could also be used with non-numerical data, this goes too far for the first step.

For our example, we’ll use a different approach, we just want to find out if clusters can be formed based on the content or products viewed on a page. So I suspect that the visitors of this website who look at roller derby photos are very likely not interested in the articles about fintechs. But maybe Google Analytics article readers are interested in fintechs, so they could be offered articles on the topic. We know this from Amazon (“Often bought together”), and most of the time the suggestions are useful. Of course, I can also look at the user flow report in Google Analytics (with the few pages on my homepage, this is more than enough), but as soon as there are more pages or products, we don’t get any further with this report. This is where Association Rules, also known as Market Basket Analysis, help us (I even worked with one of the creators, Tomas Imielinski, on Ask.com once). This machine learning-based approach tries to identify interesting relationships between variables in large data sets.

In general, the selection of an algorithm also and above all depends first and foremost on the business problem that is to be solved. This is an extremely important point: What question do we actually want to answer? In this case, I can only answer the question of which pages have ended up in a “shopping cart” based on the use by website visitors. Of course, I could also calculate a “similarity” automatically based on the content.

Keep your eyes open when interpreting the results

But first of all, a big warning: What users look at depends to a large extent on what the navigation or other elements on the page look like. If, for example, a recommendation system such as YARPP is already integrated into the site, then the probability is higher that some pages are more likely to be accessed together with other pages due to this plugin alone. But even small things on a page can lead to connections being seen that are not actually there, even if it is just a small icon that attracts the attention of the user and seduces them to click.

It should also be pointed out that viewing a page is not quite as strong a signal as buying a product. Just because someone calls up a page doesn't mean that the person stayed on it for a long time and read the text. You could account for this, for example, by measuring the time spent on the respective page and only allowing combinations where "enough" time was spent, but the time-on-page figures in web analytics systems are unfortunately often unusable.

Preconditions

Clustering with Google Analytics and R requires user-level analytics data, which means that users of a standard installation of the free version of Google Analytics cannot easily perform clustering. The standard version has the new user explorer, which can be used to look at the behavior of individual anonymized users, but this data is not downloadable. Only the buyers of the Pro variant can access user-level data a priori, in the free version only aggregated data is offered. However, there is a small hack that can be used to change this, otherwise here is another argument for Piwik, at least as a second system. In addition to access to the raw data, the advantage of Piwik is that the data is immediately available. And there is also an R package for Piwik. But in this article, the solution with Google Analytics and R is to be shown.

Connect Google Analytics and R to googleAnalyticsR

Now it's time to get down to business. I won't explain how to install an R package here; for what we are planning to do, we need the googleAnalyticsR and googleAuthR packages on the one hand, and the arules package on the other. The R help is your friend.

We first load the two packages and then log in with a Google account (a browser window opens for this):

library("googleAuthR")<l g ibrary("googleAnalyticsR")<a_auth()

By the way, if you are already logged in and want to switch to another account, you enter

ga_auth(new_user = TRUE)

and log in with the other account. Next, let’s get the list of our accounts and properties that are connected to this Google account:

my_accounts <- ga_account_list()

In this data frame, we look in the viewName column for the view we want to use and find the corresponding viewId in that row. This viewId is elementary, because we will use it to retrieve the data of our view. We store the viewId in a variable:

ga_id=XXXXXX

where XXXXXX stands for the viewId. What data is available to us now? Simply call

meta <- google_analytics_meta()

and all dimensions and metrics are available in the Data Frame meta. We are particularly interested in Custom Dimension 1, where the User Level ID is stored, and ga:pagePath:

`
gadata <- google_analytics(id = ga_id,
                           start = "2017-04-20", end = "2017-05-06",
                           dimensions = c("ga:dimension1", "ga:pagePath"),
                           max = 2000)
`

In my dataframe gadata, I now have the data as described above plus two more columns (sessions and bounceRate), because GA doesn’t output the dimensions to me without metrics. The data needs to be transformed so that it looks something like this:

`
user id 1, page 1, page 2
user id 2, page 1
user id 3, page 3
`

This is done with the code

i <- split(gadata$pagePath,gadata$dimension1)

We now have every “transaction” of a user on one line. This is the input needed for our algorithm.

Market Basket Analysis

Now is the time to load the R package arules and modify the data for the algorithm:

`
library(arules)
txn <- as(i, "transactions")
`

When calling the algorithm, the parameters sup for support and conf for confidence are used, i.e. in the following example we tell it that we want to have rules that are applicable in at least 0.1% of the cases and for which there is a confidence of 0.1%. That sounds damn little at first, but let’s imagine that we are dealing with a “shopping cart” in which many different combinations can be “placed”, depending on the size of an online shop or a website. Of course, we want to cover as many operations as possible with the rules, but the more possibilities there are, the more likely it is that we will have to go in with a low support value.

basket_rules <- apriori(txn, parameter = list(sup = 0.001, conf = 0.001, target="rules"))

This can take up quite a lot of processor time and RAM, and depending on the configuration, R ends the process with an error message. One way to limit the algorithm is to increase the size of the sup and/or conf parameters. If everything went well, the identified rules can be examined:

inspect(head(sort(basket_rules, by="lift"),20))

This shows the top 20 rules, sorted by lift. The lift tells us how much more often the items occur together than would be expected if they were independent; a lift of 1 means independence. The output for me looks like this (not for this website):

Rule two, for example, states that a small proportion of transactions involve a user looking at an article about communication methods and, with a confidence of 30%, also reading the article about communication technology. This association has an extremely high lift of 521, so we can assume that this combination occurs far more often than it would by chance.
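For reference, the three quality measures of the top rules can also be pulled out directly with arules; a quick sketch using the same rule object as above:

`
# support    = share of transactions containing all items of the rule
# confidence = share of transactions with the left-hand side that also contain the right-hand side
# lift       = confidence divided by the expected confidence under independence
quality(head(sort(basket_rules, by = "lift"), 20))
`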

Summary Google Analytics and R

This small example shows the power of R and the packages. With just a few lines of code, data can be imported directly from Google Analytics into R and used for complex analyses. This analysis would not have been possible in the user interface.

However, we also see here how important it is to have data at the user level, although it must be said that we don’t even have user-level data here, but rather browser-level data due to the lack of cross-device tracking. A next step could therefore be to create the identified rules again on the basis of segmented data to identify differences.

Linear regression: How much can a used SLR camera cost?


Since the Canon 5D Mark IV has just been released, the 5D Mark III should become more affordable. I was advised to pay at most €1,500 for a body with a maximum of 30,000 shutter actuations, but if you look at the cameras on offer on eBay and in the relevant forums, prices seem to be much higher. So what is a fair price? With enough data, this can be determined by regression.

First of all, let's assume that the more shutter actuations a camera has, the cheaper it becomes. 150,000 actuations is the expected life of the shutter on the 5D Mark III; a new shutter including labor costs about €450 (not verified). But first, let's take a look at what the data looks like. I picked out the prices and actuation counts of almost 30 offers from the various platforms, so this does not take into account how good the camera looks on the outside or what accessories are included. This data is read into R and examined:

We have an average price of €1,666 and an average of 85,891 actuations. That's a far cry from €1,500 with 30,000 actuations. The median doesn't look much better either. Now we run the regression: we build a regression model with the function lm and look at the details with summary(MODELNAME):
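A minimal sketch of this step, assuming the offers were read into a data frame `cameras` with columns price and actuations (the names are my assumption, not the original script):

`
# Linear model: price as a function of shutter actuations
model <- lm(price ~ actuations, data = cameras)
summary(model)

# Predicted fair price for a camera with 30,000 actuations
predict(model, newdata = data.frame(actuations = 30000))
`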

From this we can already form a function:

Price = -0.001749x + 1816, where x is the number of actuations. So a camera with 30,000 actuations should cost €1,763.53, and a camera with 100,000 actuations still about €1,641. Let's take a look at the plot:

As you can see, we have some outliers (by the way, not due to great accessories or maintenance etc.) that somewhat “distort” the model. I had already removed a camera with nearly 400,000 releases and a price of still €1,000.

Unfortunately, the shutter count cannot be read automatically from the portals, because it is always entered manually by the seller. Otherwise, you could build a nice tool for every camera model.