Connecting R to Google APIs with a Service Account


If you want to use R to access APIs automatically, authorization via the browser is not an option. The solution is a service account: with a service account and the associated JSON key file, an R program can access the Google Analytics API, the Google Search Console API, and all the other wonderful machine learning APIs as well. This short tutorial shows what needs to be done to connect to the Google Search Console.

First, you create a project if you don't have a suitable one yet, and then the appropriate APIs have to be enabled. After a short search, you will find the Google Search Console API, which only needs to be activated.

Click on IAM & admin and then on Service accounts (it is highlighted somewhat oddly in this screenshot):

Click Create Service Account:

Important: Give the e-mail address a meaningful name so that you can keep track of it later…

Browse Project is sufficient for this step:

Click on “Create Key” here:

Create a JSON key, download it and then put it in the desired RStudio directory.

Now the service account must be added as a user in the Search Console; the important thing is that it receives full permissions.

What’s really great about the Google Search Console API is that you can see the query and landing page at the same time, unlike in the GUI. By the way, I get the data every day and write it to a database, so I have a nice history that goes beyond the few months of Search Console.

Last but not least, I am also providing the R notebook with which I query the data; it is basically the code written by the author of the API package, Mark Edmondson, included here for the sake of completeness to show how the JSON file is used. There is a more elegant variant with R environment variables, but I don't know whether it works on Windows.
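For completeness, here is a minimal sketch of how the JSON key can be used with Mark Edmondson's googleAuthR and searchConsoleR packages; the file name, site URL and date range are placeholders, and the scope shown is the Search Console scope:

```r
library(googleAuthR)
library(searchConsoleR)

# Authenticate with the service account's JSON key instead of the browser flow
# (file name is a placeholder for the key downloaded above)
gar_auth_service(json_file = "service-account.json",
                 scope = "https://www.googleapis.com/auth/webmasters")

# Query and landing page in one call, which the GUI does not offer
gsc_data <- search_analytics(siteURL = "https://www.example.com/",
                             startDate = Sys.Date() - 93,
                             endDate = Sys.Date() - 3,
                             dimensions = c("query", "page"))
```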

Data-driven personas with association rules


I have already talked about personas elsewhere; this article is about the data-driven generation of personas. I stick to the definition of Cooper, the inventor of personas, and see a persona as a prototype for a group of users. This can also be interesting for marketing, because it lets you create needs- and experience-oriented communication, for example on a website. Personas are not target groups, but more on that elsewhere.

How do you create a data-driven persona?

I haven't found the perfect universal way to data-driven personas either. External data is not available for every topic, the original approach of 10-12 interviews is difficult, and internal data has the disadvantage that it only covers the people you already know, not those you might still want to reach. The truth lies in merging different data sources.

Data-driven persona meets web analytics

Web analytics data offers a lot of information about usage behavior, and depending on how a page is structured (for example, whether it is already geared to the different needs of different personas), it is possible to see to what extent the different user groups actually behave as expected. Alternatively, you can try to generate data-driven personas from the usage behavior on the website, with the restriction that users have to find the page first; it is therefore not certain that all relevant groups of people actually reach the page, and important personas may be overlooked. This article is about a special case of this automated persona generation from web analytics data that is exciting from an algorithmic point of view, and about the associated visualization. As is well known, everyone likes to report on successes; here is a case where the failure shows in which direction further work could go.

The experience from web mining is rarely associated with personas, although some research was done on this more than 10 years ago; for an overview, see, for example, Facca and Lanzi, "Mining interesting knowledge from weblogs: a survey", from 2004 (published in 2005). Whereas in the past it was mainly weblogs (not web blogs!) that were used, i.e. log files written by the server, today we have the opportunity to use much "better" data through Google Analytics & Co.

Reintroducing: Association Rules

But what exactly is better? In GA & Co. we can distinguish people from bots better (there are more of them than you think), returning visitors are recognized more reliably, devices are identified, and so on. The question is whether you absolutely need this additional data for basic data-driven personas. Association rules, which I have already written about in a post about clustering with Google Analytics and R and which are also mentioned by Facca and Lanzi, can already identify basic groups of users. (I had mentioned in the other article that I once worked for one of the creators of the algorithm, Tomasz Imielinski; I still have to tell an anecdote about him: in a meeting, he once said to me that you often think something is a low-hanging fruit, a quick win, but "Tom, often enough, the low hanging fruits are rotten". He has been right so many times.) The groups identify themselves through common behavior, for example the co-occurrence of page views. In R, this works wonderfully with the arules package and the Apriori algorithm it contains.
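A minimal sketch of what this looks like with arules, assuming the page views have been exported as clientId/pagePath pairs (the file name and column names are placeholders, not the exact setup used for the plots below):

```r
library(arules)

# One "transaction" per client: the set of pages that client has viewed
transactions <- read.transactions("pageviews.csv", format = "single",
                                  header = TRUE, sep = ",",
                                  cols = c("clientId", "pagePath"))

# Thresholds as used for the rules discussed below
rules <- apriori(transactions,
                 parameter = list(support = 0.01, confidence = 0.01, minlen = 2))
inspect(head(sort(rules, by = "lift"), 10))
```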

Data-driven personas with Google Analytics & Co.

As already mentioned in the earlier article: a standard installation of Google Analytics is not sufficient (it never is anyway). Either you have the 360 variant or you "hack" the free version ("hack" in the sense of "tinkering", not "being a criminal") and pull the data via the API. With Adobe Analytics, the data can be pulled from the data warehouse or via an API. Simply using Google Analytics as-is and deriving personas from it is therefore not possible with this approach. You also have to think about which GA dimension, in addition to the Client ID, is best used to represent transactions. This can vary greatly from website to website. And if you want to be very thorough, a page view alone may not be enough of a signal.

However, this post is first of all about the visualization and about the limitations the Apriori approach has for the automated generation of data-driven personas. For the visualization, I work with the arulesViz package. The resulting graphics are not easy to interpret, as I have experienced at the HAW but also with colleagues. Below we see the visualization of association rules obtained from the data of this site, using the GA dimension pagePathLevel1 (which in my case unfortunately also corresponds to an article title). One thing stands out here: I can really only identify two groups, and that's pretty poor.

What exactly do we see here? We see that users who are on the homepage also go to the Courses section and vice versa. The lift is high here, the support not so much. And then we see users moving between my four articles about Scalable Capital, with roughly the same low lift but different levels of support. Lift is the factor by which the co-occurrence of two items is higher than their probable occurrence if they were independent of each other. Support is the frequency. Support was defined at 0.01 when creating the association rules, and confidence was also defined at 0.01. For details, see my first article.
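For reference, the standard definitions behind these measures (A and B are item sets, here sets of pages; N is the number of transactions, here users):

$$
\operatorname{supp}(A \Rightarrow B) = \frac{|\{T : A \cup B \subseteq T\}|}{N}, \qquad
\operatorname{conf}(A \Rightarrow B) = \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)}, \qquad
\operatorname{lift}(A \Rightarrow B) = \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)\,\operatorname{supp}(B)}
$$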

But why don’t I see any other pages here? My article about Google Trends is a very frequently read article, as is the one about the Thermomix or AirBnB. So it’s not because there aren’t more user groups. The disadvantage of this approach is simply that users have to have visited more than one page for a rule to arise here at all. And since some users come via a Google search and apparently have no interest in a second article, because their need for information may already be satisfied or because I don’t advertise it well enough, apparently only students and those interested in Scalable Capital can be identified here in these rules.

Ways out of the a priori dilemma?

So far, I’ve identified three ways to solve this dilemma, and all of them require extra work:

  • I test whether I can get users to view more than one page through a better relevant offer, for example with Google Optimize, and if successful, I get better data.
  • I use the a priori data only as a base and merge it with other data (also very nice, but I won’t cover it here)
  • I lower the support and confidence.

In my opinion, the first approach is the most appealing, but it requires time and brains, and there is no guarantee that anything will come of it. The last approach is problematic, because we are then dealing with cases that occur less frequently and are therefore not necessarily reliable. With a support of 0.005, the visualization looks different:
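A sketch of the re-run with the lowered threshold (the transactions object and the arulesViz graph view are as in the earlier sketch):

```r
library(arulesViz)

# Same data, lower support as described above
rules_low <- apriori(transactions,
                     parameter = list(support = 0.005, confidence = 0.01, minlen = 2))
plot(rules_low, method = "graph")
```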

But again I have the problem that the individual pages do not appear. It seems to be extremely rare that someone moves from the Google Trends article to another article, so lowering the support value didn't help. From experience, I can say that this problem appears more or less strongly on most sites I see, but it always appears in some form. The annoying thing is that if you can already read good personas out of the rules, you are inclined not to look at the rest, even though it could be very large in scope.

We also see another problem in the graphic: the users in the right-hand strand do not have to be the same from arrow to arrow. In other words, it is not guaranteed that visitors who look at photography pages and courses will also look at the publications, even if the visualization suggests it. Just because A co-occurs with B and B with C does not mean that A co-occurs with C! To solve this, the association rules in the visualization would need an additional exclusionary marking. That does not exist yet and would be a task for the future.

Result

The path via association rules is exciting for the creation of data-driven personas with Google Analytics or other web analytics tools. However, it will usually not be sufficient at the moment, because a) the problem of one-page visitors is not solved here, b) the rules do not provide sufficient information about different groups that only partially overlap, and c) they can only say something about the groups that are already on the site anyway. I'm currently working on a) and b) on the side, and I'm always happy about thoughts from outside.

Update RStudio Server


Louis Aslett's AMIs are useful for quickly starting an RStudio Server on an AWS EC2 instance. However, these AMIs do not always contain the latest version of R or RStudio. The following commands help to update both RStudio and R:

sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.442-amd64.deb
sudo gdebi rstudio-server-1.1.442-amd64.deb

echo "deb http://cran.stat.ucla.edu/bin/linux/ubuntu `lsb_release -sc`/" | sudo tee --append /etc/apt/sources.list.d/cran.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt update
apt list --upgradable
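Listing the upgradable packages does not install anything yet; a final upgrade step is still needed (a sketch, and the exact package names may differ depending on the distribution):

```
sudo apt upgrade
# or, to upgrade only R itself:
sudo apt-get install --only-upgrade r-base r-base-core r-base-dev
```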

Sistrix traffic vs. Google AdWords keyword planner


If you read along here often, you know that Sistrix is one of my absolute favorite tools (I'll brazenly link to it as the best SEO tool), if only because of the lean API, the absolutely lovable Johannes with his really clever blog posts, and the calm confidence with which the Toolbox wins you over again and again. Of course, all the other tools are great too, but Sistrix is something like my first great tool love, which you can't or don't want to banish from your SEO memory. And even if the following data might scratch the paint, it didn't leave a real dent in my preference for Sistrix.

What problem am I trying to solve?

But enough of the adulation. What is this about? As already described in the post about keywordtools.io and the inaccuracies of the Google AdWords Keyword Planner data mentioned there in passing, it is a challenge to get reliable data about the search volume of keywords. And if you still believe that Google Trends provides absolute numbers, well… For this purpose, Sistrix offers a traffic index from 0 to 100, calculated on the basis of various data sources, which is supposed to result in higher accuracy. But how accurate are these numbers? Along the way, I also want to show why box plots are a wonderful way to visualize data.

The database and first plots with data from Sistrix and Google

The basis here is 4,491 search queries from a sample for which I have both the Sistrix and the Google AdWords Keyword Planner data. By the way, it's not the first sample I've pulled, and the data looks about the same everywhere, so it's not down to my sample. Let's first look at the raw data:

As we can see, you could draw a curve into this plot, but the relation doesn’t seem to be linear. But maybe we only have a distorted picture here because of the outlier? Let’s take a look at the plot without the giant outlier:

Maybe we still have too many outliers here, let’s just take those under a search volume of 100,000 per month:

In fact, we see a tendency upwards to the right, though not a clear line (I didn't run a regression analysis). But we also see that at a traffic value of 5 there are search volumes that exceed those at the index values of 10, 15, 20, 25 and 30, and sometimes even at 50, so we see the curve again:

The median ignores the outliers within the smaller values:

So if we look at the median data, we see a correct trend at least for the higher values, with the exception of the value for the Sistrix traffic value of 65 or 70. However, the variation around these values is very different when plotting the standard deviations for each Sistrix traffic value:

We don't see a pattern in the spread. It is not the case that the dispersion increases with a higher index value (which is what you would expect); in fact, it is already higher at the index value of 5 than at 10, and so on. We see the highest dispersion at the value of 60.

All-in-one: box plots

Because box plots are simply a wonderful thing, here is one on top of all that:

Here the axes are swapped (because it was not really easy to see with the Sistrix data on the x-axis). The box shows where 50% of the data is located: with a search volume of 390, for example, 50% of the data lies between the Sistrix values of 5 and 25; the median is indicated by the line in the box and is 15. The boxes grow at first, then vary in size again; a smaller box indicates a lower dispersion. At some data points we see small circles that R has flagged as outliers, especially in the low search volumes. Almost everything we plotted above is visualized here in a single plot. Box plots are simply wonderful.
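For reference, a sketch of how such a box plot can be produced, assuming a data frame keywords with the Keyword Planner volume and the Sistrix index (the data frame and column names are assumptions):

```r
library(ggplot2)

# Search volume on the x-axis, Sistrix index on the y-axis, as in the plot above
ggplot(keywords, aes(x = factor(searchVolume), y = sistrixTraffic)) +
  geom_boxplot() +
  labs(x = "Keyword Planner search volume", y = "Sistrix traffic index")
```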

What do I do with this data now?

Does this mean that the traffic data in Sistrix is unusable? No, it doesn't. As described in the introduction, the Keyword Planner data is not always correct either, so nothing is known for certain. If you regard the Keyword Planner data as the ultimate truth, you won't be satisfied with the Sistrix data. It would be helpful if there were more transparency about where exactly the data comes from. Obviously, linked Search Console data would be very helpful, as it shows real impressions. My recommendation is to look at several data sources and to examine the overlaps and the deviations separately. This is unsatisfactory, because it is not automatic. But "a fool with a tool is still a fool".

Comments (since February 2020 the comment function has been removed from my blog):

Hanns says

  1. May 2018 at 21:18 Hello, thank you very much for the interesting analysis. Have you ever tried the new traffic numbers in the SISTRIX Toolbox? This also gives you absolute numbers and not index values. To do this, simply activate the new SERP view in the SISTRIX Labs. Information can be found here (https://www.sistrix.de/news/nur-6-prozent-aller-google-klicks-gehen-auf-adwords-anzeigen/) and here (https://www.sistrix.de/changelog/listen-funktion-jetzt-mit-traffic-und-organischen-klick-daten/)

Tom Alby says

  1. May 2018 at 10:58 I actually hadn't seen that before. Thanks for the hint. But these are ranges, not truly absolute numbers. Still very cool, though.

Martin says

  1. April 2019 at 13:33 Hi, I read your post and tried to understand it, but I can't quite figure it out. Sistrix is cool, yes, but unfortunately I'm not sure how reliable the data is.

I actually don’t understand how this is supposed to work technically. How is Sistrix supposed to get the search queries that run through Google for each keyword? It’s not as if Google informs Sistrix briefly with every request.

The only thing I can think of is that they pull the data for each keyword from the Ads Planner. But… to present this as "own search volume" without any indication of where the data comes from would strike me as grossly negligent.

Where could they still get data from?

Tom says

  1. April 2019 at 20:39 Hello Martin,

the answer is not a simple 1 or 0, as the article also shows. You can't rely on the Ads Planner data either. Sistrix also gets data from customers who have linked their Search Console data there, since that shows a page's impressions for a keyword. But of course, this is not available for every keyword. And that's where the inaccuracies come from.

Best regards

Tom

Data Science meets SEO, Part 5


This is the last part of this series on search engine optimization/SEO and data science, based on my presentation at SEO Campixx. I converted the data and the code into an HTML document via Knit, which makes my notebook, including the data, reproducible. There are a few more analyses in the notebook, and I have documented everything in English, as this is not only interesting for Germans. So if you want to read all the results in one document (without the TF/IDF, WDF/IDF or stemming examples), please take a look at the Data Science & SEO notebook.

Let’s go back to the age of domains and their importance for SEO

First of all, an addendum: We lacked the age for some domains, and I have now received this data from another source. In our sample, most of the domains were older, and my concern was that the missing domains might be much younger and therefore the average age would be wrongly pulled down. Almost 20% of the domains were missing.

In fact, the missing domains are younger. While the median for our incomplete data set was 2001, it is 2011 for the missing domains. If you merge the data, however, the median is still 2001; only the mean has changed, from 2002 to 2003. Thus, the amount of missing data was not so large that this opposing trend would have had a major impact. Of course, one could now argue that this other source simply has completely different numbers, but this could not be confirmed in a spot check of domains for which an age was already available. And if you look at the plot of the relationship between position on the search results page and the age of a domain, we haven't found anything new:

Box plots are a wonderful thing, because they show an incredible amount about the data at a glance. The box shows where 50% of the data lies, the thick line in the middle is the median, and the width of the box is proportional to the square root of the sample size. Even after several beers, there is no pattern to be seen here, except that the boxes are all at about the same height. Google had already said that the age of a domain does not matter.

Longer text = better position?

Another myth, and the great thing about this myth is that we can clear it up relatively easily, because we can crawl the data ourselves. By the way, R is not so great for crawling; there is the rvest package, but if you really only want the content, nothing comes close to Python's Beautiful Soup. Conveniently, you can also run Python inside RStudio notebooks. Only the actual text is taken here, not the source code; however, navigation elements and footers are included, although we can assume that Google is able to extract the actual content. The following plot shows the relationship between content length and position:

As we can see, we don't see anything, except for an interesting outlier with more than 140,000 words in a document (http://katalog.premio-tuning.de/), which ranked 3rd for the keyword "tuning catalogs". Otherwise, no correlation can be observed, so a general statement such as "more text = better position" cannot be derived. The median word count is 273, the mean 690. Just a reminder: we are only looking at the top 10 here. I would actually be very interested in how the colleagues from Backlinko arrived at 1,890 words for the average first-place document. They looked at far more search results (although what does "1 million search results" mean? Exactly that, i.e. about 100,000 search result pages, i.e. the results for about 100,000 search queries?), but they do not reveal which average they used. As you can see from my numbers, there is a big difference between the median and the mean, the arithmetic mean being what most people call the average. It's not for nothing that I always say that the average is the enemy of statistics. Or maybe texts are simply longer in the USA? But since neither the numbers nor the methods are made available to us… well, at some point I learned that you have to publish both the data and the evaluation software along with your results so that everything is really reproducible.
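As an aside, if you want to stay in R for the crawling step mentioned above, here is a rough sketch with rvest (rvest ≥ 1.0 assumed; like the approach described above, it grabs the whole visible body text, navigation and footer included):

```r
library(rvest)

page <- read_html("https://www.example.com/")
body_text <- html_text2(html_element(page, "body"))
word_count <- length(strsplit(body_text, "\\s+")[[1]])
```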

Is there really nothing to see at all?

In this final part, I have added more signals, including TF/IDF and WDF/IDF. And as you can see in the correlation matrix, we don't have a correlation anywhere. In the last part, however, we had already seen that this does not apply to every keyword. In the histogram of the correlation coefficients, we saw both positive and negative correlations, but without p-values. If you only look at the correlation coefficients where p < 0.05, the picture looks different again:
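Such a filtered histogram can be computed roughly as follows; this is a sketch that assumes the google_serps dataframe used elsewhere in this series and takes backlinks_log as an example signal:

```r
library(dplyr)

keyword_cors <- google_serps %>%
  group_by(keyword) %>%
  filter(n_distinct(backlinks_log) > 1) %>%  # cor.test() needs variation within the group
  summarise(r = cor.test(position, backlinks_log)$estimate,
            p = cor.test(position, backlinks_log)$p.value)

# Keep only the coefficients with p < 0.05 and plot them
hist(keyword_cors$r[keyword_cors$p < 0.05])
```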

So we have keywords where the backlinks matter, and we also have keywords where the other signals matter. If we can draw one conclusion from the keyword set, it’s that there’s no one-size-fits-all rule. As already stated in the last part, we need the above correlation matrix for each keyword. And that’s exactly what is exciting, because we can look at how the ranking signals behave for each keyword individually or perhaps a topic.

And so you can see for the keyword “player update” (as hash 002849692a74103fa4f867b43ac3b088 in the data in the notebook) that some signals are more prominent, see the figure on the left. Can you now be sure that you now know exactly how the ranking works for this keyword? No, you can’t (especially since we haven’t calculated the p-values here yet). But if we look at several keywords from the same “region” (i.e. similar values in this signal), then there could actually be something in them.

What about WDF/IDF now?

Unfortunately, nothing here either. And that was probably the biggest point of contention at SEO Campixx. In this example, I'm only using the exact match for now, i.e. I look for exactly the keyword that was entered in the text. Of course, we could go further and also match the individual components of the keyword, but to reduce the complexity, let's just look at the exact match.

There is no clear pattern here, and there are no correlations. Very few observations even manage a p-value below 0.05 together with a correlation coefficient of more than 0.1. In this keyword set, it cannot be confirmed that WDF/IDF contributes anything, at least not for exact match. Neither does TF/IDF. I didn't even bother with keyword density.

Reporting Data

The last part of my presentation from the SEO Campixx was a short summary of my series of articles about SEO reporting with R and AWS (especially the part about action-relevant analyses and reporting).

Result

Once again the most important points:

  • My sample is quite small, so it may not be representative of the total population of all searches. So the claim is not that what I have written here really applies to everything; the goal is to show how to approach the topic from a data science perspective. For some topics, however, it is hard to deny that the data situation is sufficient, such as the correlation between document length and position.
  • My statements apply to German search results. The average document length may be different in other countries, but I doubt it.
  • The data used for the calculations is not necessarily reliable. The backlink data is most likely incomplete, and what Google & Co. make of text is not completely transparent either. However, most tools out there don't even use standard procedures like stemming, so it would first have to be demonstrated that working with such WDF/IDF tools, exciting as it is, actually changes anything.
  • The typical SEO statements cannot be proven for all keywords with the help of this sample, but this should not come as a surprise, because the ranking algorithm is dynamic. This means:
    • Speed is not a ranking factor, at most as a hygiene factor, and even that we can’t prove here
    • HTTPS is not yet a ranking factor.
    • Surprisingly, backlinks do not always correlate, but this may be due to the data basis
    • We have to look at what the ranking signals look like for each keyword.

The emotional reactions of some colleagues are not incomprehensible, because after all, some tools are paid for dearly (there were also tool operators in my presentation, one of whom got carried away with the remark that you can tell I haven't worked as an SEO for a long time). It's a bit like going to a Christian and saying that his Jesus, unfortunately, never existed. But that is not what I said. I have only said that I cannot verify the effect of common practices on the basis of my data set. Many SEOs whom I appreciate very much have told me that, for example, WDF/IDF works for them. In medicine it is said that "he who heals is right", and at the end of the day it is the result that counts, even if it has been proven that homeopathy does not help. And perhaps the good results of these SEOs only come about because they also do many other things right, but then attribute the success to WDF/IDF.

But what interests me as a data person is reproducibility. In which cases does WDF/IDF work and when does it not? I would like to add that I have no commercial interest in calling any way good or bad, because I don’t sell a tool (let’s see, maybe I’ll build one someday) and I don’t earn my money with SEO. In other words: I pretty much don’t give a s*** what comes out of here. The probability that I will succumb to confirmation bias because I am only looking for facts that support my opinion is extremely low. I’m only interested in the truth in a post-factual world. And unlike the investigation of Backlinko, for example, I provide my data and code so that everyone can understand it. This is complexity, and many try to avoid complexity and are looking for simple answers. But there are no easy answers to difficult questions, even if that is much more attractive to people. My recommendation: Don’t believe any statistics that don’t make the data and methods comprehensible. I hope that all critics will also disclose the data and their software. This is not about vanity.

The Donohue–Levitt hypothesis is a good example for me: the New York Police Department's zero-tolerance approach in the 1990s was praised for significantly reducing crime. This is still a widespread opinion today. Donohue and Levitt examined the figures and came to a different conclusion, namely that this was a spurious correlation: in reality, the legalization of abortion meant that many would-be young offenders were simply never born, which then became noticeable in the 1990s. Of course, this was attacked, then confirmed again, and then someone also found that the removal of lead from gasoline played a part in the reduction of juvenile delinquency (the lead-crime hypothesis). However, these are more complex models. "More police truncheons equals less crime" is easier to understand and is therefore still defended (and maybe there is a bit of truth to it?). But here, too, those who find a model more appealing will mainly look at the data that confirm that opinion.

I could have investigated a lot more, but as I said, I do this on the side. I'm already keen on getting more data on the topic, but for now, other mountains of data are waiting here. The next step would then be a larger data set and machine learning to identify patterns more precisely.

Data Science meets SEO, Part 4


Now another month has passed, and I still haven't written everything down. However, this is also because I have acquired even more data in the last few weeks, so that I now have a data set that I can share and that is not customer-specific.

Why does data validation and cleansing take so long?

According to the rule of thumb, 80% of the time is spent on validating and cleaning the data, and I would add one more task, namely transforming the data. Data is rarely available in a form in which it can be used immediately.

But one thing at a time. For this part, I wanted to add backlink data as well as crawl data, but Google's backlink data is only available for your own domain, and if you use tools like Sistrix, the API queries cost money or credits. Using Sistrix as an example, you pay 25 credits for a backlink query (links.overview), so with 50,000 credits per week you can query the links for 2,000 URLs. However, I can only use the credits that were not spent on other tools at the end of the week, so I would need more than 7 weeks for the 14,099 unique hosts that I generated in the last part from the 5,000 searches. By then I will have 1,000 other projects going on and will have forgotten what code I wrote here, so I took a sample based on 500 searches that I randomly drew from my 5,000 searches. Unfortunately, the ratio of unique hosts to all URLs was not as favorable here as in the overall set: 2,597 unique hosts had to be queried.

Unfortunately, the Sistrix API also threw a spanner in the works here, because for over 250 URLs I got answers that my script had not handled properly, e.g.

{"method":[["links.overview"]], "answer":[{"total":[{"num":903634.75}], "hosts":[{"num":21491.504628108}], "domains":[{"num":16439.602383232}], "networks":[{"num":5979.5586911669}], "class_c":[{"num":9905.3625179945}]}], "credits":[{"used":25}]}

My script had expected integer values (fractions of a backlink don't exist, in my opinion) and then simply wrote nothing at all to the dataframe when Sistrix returned a number that was not an integer. But even if it had caught that, the number I see here has nothing to do with the number I see in the web interface, and there are also strange numbers there from time to time (see screenshot). Is that 197,520 backlinks or 19,752,000? Please don't get me wrong, Sistrix is one of my favorite tools, but such things drive me crazy, and R does not make it easy either. There was no way around it: I had to go through the data first and add some of it manually (!!). And how difficult R can sometimes be becomes apparent when you want to add missing values to existing data without setting the existing values in a column to NA. My inelegant solution to the transformation (which took me 2 hours):

test <- merge(sample_2018_04_02, backlinks_2, by = "host", all.x = TRUE)
test <- cbind(test, "backlinks_raw" = with(test, ifelse(is.na(total.y), total.x, total.y)))

We had already seen last time that we were missing data on the age of some domains, but there is also the official statement from Google that the age of a domain has no influence on the ranking. However, the old domains were in the majority, so one could possibly conclude that newer domains have less chance of getting into the index or into the top 10, even if age doesn't matter for the position within the top 10. However, we did not make this statement explicitly, because it could be that the missing age values in my data set are exactly the younger domains. So that remains to be found out, but no longer as part of this series. Then I might as well examine the top 100.

In summary: It always looks very simple, but collecting, transforming and cleaning the data simply costs a lot of time and energy.

So what can we do with this data?

First of all, let’s see if we can see something just by plotting the individual variables in relation to each other. The variables we have here now are:

  • Position
  • https yes/no
  • year (age of the domain)
  • Speed
  • IPs
  • SearchVolume (spelling error in the data: "SwarchVolume")
  • Number of backlinks for the host
  • Number of backlinks for the host, logarithmized
  • Number of results for a search query

There are still a few variables missing, but let's start with these. As we can see, we see next to nothing. So let's take a look at the bare numbers again:

Here we see a little more: for example, a (sometimes very weak) correlation between https and backlinks_log, year and backlinks_log, speed and year, backlinks_raw and ip, etc. But why take the logarithm of the backlinks at all? This is illustrated by the following example:

If we look at the distribution of the frequencies of the backlinks in the histogram, we see a high bar on the far left and not much else. No wonder, because in the search results we have hosts like YouTube with a nine-digit number of backlinks, while most hosts have far, far fewer backlinks. If we use a logarithm instead, i.e. if we "compress" the scale a bit, then the histogram looks quite different:
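A sketch of that transformation, assuming a dataframe dataset with a backlinks_raw column (names as in the variable list above; the exact base of the logarithm does not matter much for the shape):

```r
# Raw counts: a few hosts with nine-digit link counts dominate the histogram
hist(dataset$backlinks_raw)

# Log-transformed ("compressed"); +1 so hosts with zero backlinks don't drop out
dataset$backlinks_log <- log10(dataset$backlinks_raw + 1)
hist(dataset$backlinks_log)
```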

We see here that many hosts are somewhere in the middle, some stand out with few links (that's the bar at 0), and few hosts have a lot of backlinks. Also interesting is the question of whether the number of backlinks of the individual search hits is comparable across search queries. The answer is no, as the following histogram (also logarithmized) shows:

I calculated the average number of backlinks for each keyword and then put it, logarithmized, into the histogram (without a logarithm, the histogram would look like the unlogarithmized one before it). And as we can see, there are areas where the search results come from hosts with few backlinks, most of the search results are in the middle, and for very few search results we have an enormous number of backlinks. Again, the average is not really a great measure. But at least we see that we have different "regions".

Now that we’ve clarified why certain data is logarithmized, let’s take a closer look at what the correlations look like, starting with age and backlinks:

If we squint a little, we get the impression of a line; it almost looks as if there is a correlation between the age of a domain and its backlinks. Let's test that:

Pearson’s product-moment correlation
data: dataset$year and dataset$backlinks_log
t = -24.146, df = 4286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3721183 -0.3194161
sample estimates:
cor
-0.3460401

That looks good, and it's not completely surprising: the longer a domain exists, the more time it has had to collect links. While a correlation alone doesn't tell us which way the effect runs, it's unlikely that a domain gets older the more backlinks it receives.

Let’s take a look at that again for the combination of age and speed:

Pearson’s product-moment correlation
data: dataset$year and dataset$SPEED
t = 13.129, df = 4356, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1663720 0.2234958
sample estimates:
cor
0.1950994

It is interesting here that the correlation coefficient is positive, i.e. the older a domain is, the slower it is.

But what about the relationship between backlinks and position? Good question, because as already discussed in the last part, it does not hold for every keyword. Let's compute the correlation between backlinks and position for each keyword and then throw the resulting correlation coefficients onto a histogram:

Clearly, we have some keywords here whose ranking hosts have at least a weak if not moderate correlation with the number of backlinks. This means that we would have to look at the composition of the ranking for each keyword individually. And since we already know that the ranking is dynamic, we see it a little more clearly here.

Unfortunately, there is no correlation between the average number of backlinks of the ranked pages and the correlation between the backlinks of the ranked hosts and position. We can see in the screenshot on the left that this is a very mixed picture.

What can be the reason for this? For example, that I only have the backlink data for the hosts, not for the respective landing pages. Of course, that would be even better, and ideally I could then also look at the composition of the individual backlink factors. In view of my credit poverty, however, this is not possible at the moment. And here we have a typical problem in the field of data science: we know that the data is out there, but we can't get to it. Nevertheless, this approach already offers enormous advantages: I can now look at the composition of the current ranking for each keyword individually and act accordingly. In the example on the left, I see that I need a lot of backlinks for the host for "privacy", but in my dataset (not in the screenshot), my host needs few backlinks for searches like "poems early retirement". So we need exactly this correlation matrix for each keyword instead of an overall view as above.

In the next part (http://tom.alby.de/data-science-meets-seo-teil-5/), we will get more data (starting with TF/IDF and WDF/IDF).

Comments (since February 2020 the comment function has been removed from my blog):

Steffen Blankenbach says

  1. April 2018 at 15:34 Hello Tom, very good and exciting article. Once again, it becomes clear how much effort you have to put into getting insights, only to find out that there are many other influencing factors that could not be taken into account in the calculation. This will probably always remain a problem of data evaluation. Personally, I also assume that thanks to RankBrain, ranking factors even differ at the keyword level and therefore such a mass evaluation is not expedient for companies. A case for a specific industry would be exciting. I look forward to more articles from you.


R: dplyr/sparklyr vs data.table Performance


In their 2017 book "R for Data Science", Grolemund and Wickham state that data.table is recommended over dplyr when regularly working with larger datasets (10 to 100 GB). Having started with Wickham's sparklyr (R's interface to Spark using the dplyr dialect), I was wondering how much faster data.table actually is. This is not the most professional benchmark, given that I just compare system time before and after each step, but it gives an indication of the advantages and disadvantages of each approach.

My work involves dealing with larger files almost every day, and for this test I have used a 16 GB CSV file with 88,956,866 rows and 7 columns. After reading the file, I will do a few simple operations on the data and then write the result back to disk. The test is performed on an AWS EC2 m4.2xlarge instance with 32 GB of RAM and 8 vCPUs, of which we will use 4. Let's start with data.table:

library(data.table)
overallStart_time <- Sys.time()
start_time <- Sys.time()
userDataSetDT <- fread("/home/tom/huge-file.csv")
Read 88956065 rows and 7 (of 7) columns from 15.969 GB file in 00:02:52
end_time <- Sys.time()
end_time - start_time
Time difference of 6.507585 mins

I have no idea why fread says it only needed 2:52 minutes; there were no other CPU-hungry processes running or processes that had a huge impact on IO.

start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(Timestamp)]
end_time <- Sys.time()
end_time - start_time
Time difference of 39.44712 secs

start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(URL)]
end_time <- Sys.time()
end_time - start_time
Time difference of 38.62926 secs

start_time <- Sys.time()
configs <- userDataSetDT[configSection == "Select Engine"]
end_time <- Sys.time()
end_time - start_time
Time difference of 2.412425 secs

start_time <- Sys.time()
fwrite(configs, file = "configsDT.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 0.07708573 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
Time difference of 8.341271 mins

data.table uses only one vCPU (one core) by default, but it consumes more virtual memory (43 GB instead of the 13 GB used by the R/sparklyr combination). We could use packages such as parallel, but in fact data.table is a bit more complex with respect to parallelization.

Now, let’s look at sparklyr:

library(sparklyr)
library(dplyr)

overallStart_time <- Sys.time()
conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
conf$sparklyr.cores.local <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$`sparklyr.shell.executor-memory` <- "16G"
conf$spark.yarn.executor.memoryOverhead <- "4g"
conf$spark.memory.fraction <- 0.9
conf$spark.driver.maxResultSize <- "8G"
sc <- spark_connect(master = "local",
                    version = "2.2.1",
                    config = conf)

start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 2.432772 mins

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.01466608 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.001867533 secs

start_time <- Sys.time()
configs <- userDataSet %>%
  filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
Time difference of 0.001763344 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
Time difference of 1.333298 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 0.01505065 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
Time difference of 3.878917 mins

We have saved more than 50% here! However, looking at the details, we see that collecting the data cost a lot of time, while the filters are faster on sparklyr than on data.table (keep in mind that dplyr verbs on a Spark dataframe are evaluated lazily, so the real work only happens once collect() is called). We used 4 vCPUs for this, so there seems to be an advantage in parallelizing the computation; there is almost no difference in writing the data, also given that data.table's fwrite has been highly optimized. Edit: As one commenter said below, you would probably not collect the whole dataset and would rather let Spark write the CSV, but I have not done so in order to keep the approaches comparable.

If we used only one core for sparklyr (which doesn’t make any sense because even every Macbook today has 4 cores), how long would it take then?

start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 4.651707 mins

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.003172636 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.002816916 secs

start_time <- Sys.time()
configs <- userDataSet %>%
  filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
Time difference of 0.002915621 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
Time difference of 4.487081 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 0.01447606 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
Time difference of 8.345677 mins

The selects are a bit slower, albeit not noticeably. sparklyr is much slower, though, when it comes to reading large files and collecting data with only one core. Having said that, as mentioned above, there is no reason to use only one core.

However, there is still a good reason to use data.table: as you can see in the config section of the sparklyr code, a huge chunk of memory had to be assigned to the collector and the driver, simply because the computation or the collection will throw errors if there is not enough memory available. Finding out how much memory should be allocated to which component is difficult, and not allocating the right amount will result in restarting R and running the code again and again, making data.table the better choice since no configuration is required whatsoever. In addition, it is amazing how fast data.table still is using only one core, compared to sparklyr using 4 cores. On the other hand, running the same code on my MacBook Air with 8 GB RAM and 4 cores, data.table had not managed to read the file within 30 minutes, whilst sparklyr (using 3 of the 4 cores) managed to process everything in less than 8 minutes.

While I personally find dplyr a bit easier to learn, data.table has won me over, too.

R is like smoking


"Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it."

François Pinard

Data Science meets SEO, Part 3


The first two parts were about what data science is and why WDF/IDF values very likely have little to do with what happens under the hood at Google. In this part, we go one step further and look at whether there are correlations between ranking signals and position. In the talk, I showed this using the example of a single search query and, given the time available, dealt with it rather briefly. Here I can go into more depth. However, we will first only look at each individual ranking signal in relation to the position, not at the possible effects of the ranking signals on each other.

Risks and Side Effects of Data Science

Since my presentation caused some colleagues to gasp and prompted some "interesting" statements, one point had probably been lost. I had expressly said, and I repeat it here, that I am not claiming that one can conclude from this data that the ranking works this way. Anyone who has ever had to deal with lawyers or statisticians knows that they are reluctant to be pinned down to definitive statements. After all, we usually do not know the total population and therefore have to draw conclusions about the population from a small sample; who would be so reckless as to let themselves be pinned down to that? Hence all the complexity with confidence levels, confidence intervals, etc.

The following statements refer to a sample of 5,000 search queries. That sounds like a lot, but we don’t know if these searches correspond to the total population of all searches. So the results are for the sample, and I’m always willing to repeat that for other searches if those searches are made available to me.

Other problems with this approach: We have access to a few ranking signals, but not all of them, and the few signals we do have are also partly inaccurate. Of the more than 200 ranking signals, we have:

  • Domain age: inaccurate, as the individual sources contradict each other and there is no pattern to be seen as to whether one source generally outputs younger values than the other
  • Backlinks: here we have the values that are output from tools, and they usually don’t have a complete overview of all backlinks
  • Anchor Texts: Since we can’t assume that the backlinks are complete, we can’t expect to have all the anchor texts
  • Text Matches: We can identify exact matches, but we saw in the previous part that search engines don't restrict themselves to exact matches
  • Text length
  • Speed
  • HTTPS versus HTTP
  • Readability

So we lack signals such as

  • User Signals
  • Quality Score
  • Domain History
  • RankBrain
  • and much more

In summary, we only have a fraction of the data, and some of it is not even accurate. And my calculations are also based on search queries that we don’t know if they are representative.

A word about correlations

My favorite example of the fatal belief in correlations is the statistical correlation between Microsoft Internet Explorer's market share and the murder rate in the U.S. between 2006 and 2011. While it may be funny to claim that there is a connection here (and this regularly gets laughs in talks), the fact is that a statistical relationship, which is what we call a correlation, does not have to be a real connection. Correlation does not imply cause and effect. Worse still, in statistics we don't even know in which direction the statistical relationship runs: in this example, whether Internet Explorer's market share led to more murders, or whether the murders led to Internet Explorer being used afterwards to cover the tracks.

Of course, in some situations the connections are clear: if I spend more money on AdWords, then I may get more conversions. And if we examine the statistical relationship between ranking signals and position, it is plausible that more backlinks lead to a better position, even though a better position can of course also lead to more website owners finding the content interesting and linking to it… But we don't know whether, for example, the individual signals influence each other, and we usually only look at the top 10. As described in the previous part, this is a kind of Jurassic Park where we don't have the whole picture.

A little descriptive statistics

Every analysis begins with a description of the data. For the 5,000 search queries, we got more than 50,000 search results, but we take out the results that point to Google services for the time being, because we don’t know whether they are ranked according to the normal ranking factors. There are 48,837 URLs left, spread across 14,099 hosts.

Lately I've been working more and more with dplyr, which allows piping as under Linux/UNIX; this functionality was borrowed from the magrittr package, and it makes the code incredibly clear. In the example in the figure, after the commented-out line, I pipe my dataframe google_serps to group_by(), which groups it by host, then pipe the result to summarise(), which calculates the frequencies per host, and finally to arrange(), which sorts the result in descending order. I write the result to hostFrequency, and because I want to see the result immediately in my R notebook, I put the whole expression in parentheses so that the result is not only written to the hostFrequency dataframe but also printed immediately. Every time I do something like this with dplyr, I'm a little happy. And if you have really large data sets, you do the same with sparklyr :).
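The pipeline described here looks roughly like this (a sketch; dataframe and column names follow the text):

```r
library(dplyr)

(hostFrequency <- google_serps %>%
    group_by(host) %>%
    summarise(count = n()) %>%
    arrange(desc(count)))
```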

But back to the topic: So we see here that few hosts rank very frequently, and that means conversely that many hosts rank only once. No surprise here.

To warm up: Speed as a ranking factor

The speed data for each host is very easy to get, because Google offers the PageSpeed Insights API for this, and kindly there is also an R package for it. With more than 10,000 hosts, the query takes a while, and you are not allowed to make more than 25,000 requests per day, nor more than 100 (?) requests per 100 seconds. I just let it run, and after a day my R session crashed and all the data was lost. Not pretty, but there is a workaround: write the dataframe to disk after each request.
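A sketch of that workaround: get_pagespeed() stands in for whichever PageSpeed API wrapper is used, and hosts is the vector of hosts to be queried, both hypothetical names:

```r
for (h in hosts) {
  result <- data.frame(host = h, SPEED = get_pagespeed(h))
  # Append each result to disk immediately so a crash doesn't lose everything
  write.table(result, "pagespeed.csv", sep = ",", append = TRUE,
              col.names = !file.exists("pagespeed.csv"), row.names = FALSE)
  Sys.sleep(1)  # stay below the requests-per-100-seconds limit
}
```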

But let’s take a closer look at the data now. Here’s a histogram of the distribution of the speed values of 14,008 hosts (so I got a PageSpeed value for 99.4% of the hosts):

We see that most hosts make it over the 50 points, and summary gives us the following numbers:

Min. 1st Qu. Median   Mean 3rd Qu.   Max.
0.00   55.00  70.00  66.69   82.00 100.00

It's nice to see how misleading the average can be. And now let's plot position versus speed:

As we see, we don’t see anything. Let’s check this again in more detail:

cor.test(google_serps$position, google_serps$SPEED)

Pearson's product-moment correlation

data: google_serps$position and google_serps$SPEED
t = -5.6294, df = 48675, p-value = 1.818e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.03438350 -0.01662789
sample estimates:
cor
-0.02550771

It looks like there is no correlation between PageSpeed and position in the top 10 (we don't know whether there might be one in the top 100; it could be that places 11 to 100 have worse PageSpeed values). But it is also not unlikely that we are dealing with a hygiene factor: if you rank well, you have no advantage; if your page is slow, you are punished. It's a bit like taking a shower: if you have showered, no one notices, but if you haven't, it's noticeable. Or, in a nutshell (that's how I should have put it at SEO Campixx): a slow website is like not having showered. However, we also see hosts that rank despite a dreadful speed. But if you search for "ganter shoes", then http://ganter-shoes.com/de-at/ is probably the best result, even if the page takes 30 seconds to load for me.

We should also keep in mind that the PageSpeed API performs a real-time measurement… maybe we just caught a bad moment? You would actually have to measure the PageSpeed several times and create an average from it.

SSL as a ranking factor

What we also get very easily in terms of data is the distinction between whether a host uses https or not. While some see the use of secure protocols as a weighty ranking factor, more moderate voices like Sistrix see the use of SSL as a weak ranking factor. In this dataset, 70% of all URLs have an https. But does that also mean that these pages rank better?

We are dealing here with a special form of calculation, because we are trying to determine the relationship between a continuous variable (the position) and a dichotomous variable (SSL yes/no). We convert the two variants of the protocol into numbers: http becomes 0, https becomes 1 (see the secure and secure2 columns in the screenshot below).

Determining the point-biserial coefficient is a special case of the Pearson correlation; normally in R we would just type cor.test(x, y), but here an additional package is loaded that supports this special form. In fact, the values of the two tests hardly differ, and cor.test also provides the p-value.
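The package is not named in the text; one option is ltm, whose biserial.cor() computes the point-biserial coefficient. A sketch, assuming a url column for deriving the secure2 flag described above:

```r
library(ltm)

# https coded as 1, http as 0 (the secure2 column from the screenshot)
google_serps$secure2 <- ifelse(grepl("^https://", google_serps$url), 1, 0)

# Point-biserial coefficient (a special case of Pearson's r) ...
biserial.cor(google_serps$position, google_serps$secure2)

# ... and plain cor.test() on the 0/1 coding, which gives almost the same value plus a p-value
cor.test(google_serps$position, google_serps$secure2)
```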

As we can see, we see nothing or almost nothing: with a correlation coefficient of -0.045067, there is no sign that https has had a noticeable impact on the position within the top 10 in this sample. Does this mean that we should all do without https again? No, because it's better for the user. And as soon as browsers indicate even more clearly that a connection is not secure, users will say goodbye to a site even more quickly. Not to mention that we only looked at the top 10 here. It could be that places 11 to 1000 are mainly occupied by sites without SSL, and then the results could look different again.

Perhaps SSL is also a reverse hygiene factor as a ranking signal: Google would like to see SSL, but since some important pages may not yet have it, the signal is not applied strictly. Just as you might want your dream partner to shower, but if he or she suddenly stands in front of you, it doesn't matter because you're so in love. In the long run, of course, this does not go well. And that's how it will be with SSL.

Age of a domain as a ranking factor

Let's move on to the next ranking signal, the age of a domain, even though Google says the age of a domain doesn't matter. Here we face a first challenge: how do I find out the age of a domain as automatically and, above all, as reliably as possible? Sistrix offers the age of a domain, but not that of a host (the difference is explained in the script for my SEO seminar at the HAW). Nevertheless, Sistrix has the advantage that its API is very lean and fast. However, Sistrix does not find an age for 2,912 of the domains (not hosts); with 13,226 unique domains, that is 22% without a domain age. Jurassic Park sends its regards (if you don't understand this allusion, please read the second part on data science and SEO). Nevertheless, let's take a look at the distribution:

We see a slightly right-skewed distribution, i.e. a higher frequency on the left, where the older domains are (and no, that is not mirror-inverted; that's just what it's called). Younger domains seem to have less of a chance here. However, it is also possible that Sistrix simply does not have many younger domains in its database and that the almost 3,000 missing domains would be more likely to sit on the right.

Can we at least see a correlation here? Let’s plot the data first:

Again, we don’t see anything, and if we let the correlation be calculated, then this is confirmed:

cor.test(google_serps$position, as.numeric(google_serps$year))

Pearson's product-moment correlation

data: google_serps$position and as.numeric(google_serps$year)
t = 1.1235, df = 44386, p-value = 0.2612
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.003970486 0.014634746
sample estimates:
cor
0.005332591

Not only is there no correlation, the p-value is also far above 0.05, so the result is not even statistically significant. In any case, the result should be taken with a grain of salt, because, as mentioned above, our selection may simply not be representative: probably not every domain had the same chance of getting into Sistrix's domain-age database. So we would have to supplement this data and possibly also verify the data from Sistrix (I haven't been able to find out where Sistrix gets it from); unfortunately, I have not been able to identify a pattern either, because sometimes one source shows older data, sometimes another. In principle, you would have to take all data sources and then always use the oldest date. Unfortunately, most sources are not easy to scrape. So we have not only missing but also partly incorrect data. And this is not an atypical problem in the field of data science.

Intermediate thoughts

Since I'll soon have 2,000 words together (and I know the correlation between word count and "read to the end of the article" for my blog), I'll look at the next ranking factors in the next blog post. Important: we found no evidence here that supposedly very important ranking factors actually have an influence. But that doesn't mean this is really true:

  • We have some missing and incorrect data
  • We may not have a valid sample
    • And Google's ranking is dynamic, which means that for some queries certain signals correlate with position, while for others they don't. The sledgehammer method we used here is certainly not the way forward.

Example:

cor.test(google_serps$position[google_serps$keyword == "acne vitamin a"], google_serps$secure2[google_serps$keyword == "acne vitamin a"])

Pearson's product-moment correlation

data: google_serps$position[google_serps$keyword == "acne vitamin a"] and google_serps$secure2[google_serps$keyword == "acne vitamin a"]
t = -4.6188, df = 8, p-value = 0.001713
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9645284 -0.4819678
sample estimates:
cor
-0.8528029

Even though the confidence interval here is very wide, it is still in the range of a medium to strong correlation for the query "acne vitamin a" (the correlation is negative because the position numbers increase from 1 to 10 and beyond, i.e. a larger number means a worse position). It is therefore important to identify the segments or "regions" in which certain signals have an effect. More on this in the next part on data science and SEO.