Clustering with Google Analytics and R


Some questions are difficult or even impossible to answer with the Google Analytics user interface (the same applies to Adobe Analytics, Piwik, etc.). While Google Analytics offers powerful and easy-to-use functionality to manually create and compare segments or personas based on devices, acquisition channels, or browsers, once it goes beyond these standard segments or to combinations of multiple dimensions, the effort becomes complex. Often enough, people simply “poke around” in the data and hope to find something valuable. This is exactly where the advantages of combining Google Analytics and R come into play. One way to connect Google Analytics and R is the R package googleAnalyticsR by Mark Edmonson, which is used as an example in this article.

Segmentation, Clustering and Classification

Before we get into the practical part, let’s briefly explain the difference between segmentation, clustering, and classification. Segmentation attempts to divide customers or users into groups that differ by some characteristic, whether it’s an interest, an access channel, the marketing campaign through which a customer came, etc. Clustering attempts to automatically identify such groups and the dimensions or features that distinguish them, whereas classification attempts to predict which group a customer or user belongs to. Classification is a good example of supervised machine learning; clustering is a good example of unsupervised machine learning. These attempts do not always produce something meaningful, hence the frequent use of the word “attempts”.

This example is about clustering, i.e. identifying groups based on structures in the data using machine learning algorithms. Hierarchical clustering identifies groups based on their similarity: it starts with a separate cluster for each data point and merges the most similar clusters at each subsequent level, until all clusters have been combined (see a dendrogram for an illustration). It is a great algorithm, but it ideally needs numerical data as input, because similarity is calculated here as a distance. Although DAISY could also handle non-numerical data, that goes too far for a first step.

For our example, we’ll use a different approach: we just want to find out whether clusters can be formed based on the content or products viewed on a page. I suspect that visitors of this website who look at roller derby photos are very likely not interested in the articles about fintechs. But maybe readers of the Google Analytics articles are interested in fintechs, so they could be offered articles on that topic. We know this from Amazon (“Often bought together”), and most of the time the suggestions are useful. Of course, I can also look at the user flow report in Google Analytics (with the few pages on my homepage, that is more than enough), but as soon as there are more pages or products, this report doesn’t get us any further. This is where association rules, also known as market basket analysis, help us (I even once worked with one of its creators, Tomas Imielinski, at Ask.com). This machine learning approach tries to identify interesting relationships between variables in large data sets.

In general, the choice of algorithm depends first and foremost on the business problem to be solved. This is an extremely important point: what question do we actually want to answer? In this case, I can only answer the question of which pages ended up in a “shopping cart” together, based on the behavior of website visitors. Of course, I could also compute a “similarity” automatically based on the content itself.

Keep your eyes open when interpreting the results

But first of all, a big warning: what users look at depends to a large extent on what the navigation and other elements on the page look like. If, for example, a recommendation plugin such as YARPP is already integrated into the site, then the probability is higher that some pages are accessed together with certain other pages due to this plugin alone. But even small things on a page can lead to connections being seen that are not actually there, even if it is just a small icon that attracts the user’s attention and tempts them to click.

It should also be pointed out that viewing a page is not quite as strong a signal as buying a product. Just because someone calls up a page doesn’t mean that the person stayed on it for long and read the text. You could account for this, for example, by measuring the time spent on each page and only allowing combinations where “enough” time was spent, but time-on-page in web analytics systems is unfortunately often unusable.
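If the time-on-page figures were trustworthy, such a filter could look roughly like this. This is a minimal sketch with made-up data, column names, and threshold; it is not part of the article’s actual code:

```r
# Toy page-view data; timeOnPage is in seconds and entirely made up
views <- data.frame(
  user       = c("u1", "u1", "u2", "u3"),
  pagePath   = c("/a", "/b", "/a", "/c"),
  timeOnPage = c(5, 120, 45, 200),
  stringsAsFactors = FALSE
)

# Keep only views where "enough" time (here: at least 30 seconds) was spent
relevant <- views[views$timeOnPage >= 30, ]
```

The view of “/a” by “u1” (5 seconds) would be dropped, while the three longer views survive. Whether 30 seconds is a sensible threshold depends entirely on the content.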

Preconditions

Clustering with Google Analytics and R requires user-level data, which means that users of a standard installation of the free version of Google Analytics cannot easily perform clustering. The free version has the new User Explorer, which can be used to look at the behavior of individual anonymized users, but this data cannot be downloaded. Only buyers of the paid 360 version get access to user-level data out of the box; the free version only offers aggregated data. However, there is a small hack that can change this. Otherwise, here is another argument for Piwik, at least as a second system: in addition to access to the raw data, the advantage of Piwik is that the data is available immediately. And there is an R package for Piwik as well. In this article, however, the solution with Google Analytics and R is to be shown.

Connecting Google Analytics and R with googleAnalyticsR

Now it’s time to get down to business. I won’t explain how to install an R package here; for what we are planning to do, we need the googleAnalyticsR and googleAuthR packages on the one hand and the arules package on the other. The R help is your friend.

We first load the two packages and then log in with a Google account (a browser window opens for this):

library("googleAuthR")
library("googleAnalyticsR")
ga_auth()

By the way, if you are already logged in and want to switch to another account, you enter

ga_auth(new_user = TRUE)

and log in with the other account. Next, let’s get the list of our accounts and properties that are connected to this Google account:

my_accounts <- ga_account_list()

In this data frame, we look for the view we want to use in the viewName column and find the corresponding viewId in the same row. This viewId is essential, because we will use it to retrieve the data of our view. We store the viewId in a variable:

ga_id <- XXXXXX

where XXXXXX stands for the viewId. What data is available to us now? Simply enter

meta <- google_analytics_meta()

and all dimensions and metrics are available in the data frame meta. We are particularly interested in Custom Dimension 1, where the user-level ID is stored, and in ga:pagePath:

gadata <- google_analytics(id = ga_id,
                           start = "2017-04-20", end = "2017-05-06",
                           dimensions = c("ga:dimension1", "ga:pagePath"),
                           max = 2000)

In my data frame gadata, I now have the data described above plus two more columns (sessions and bounceRate), because GA doesn’t output dimensions without metrics. The data needs to be transformed so that it looks something like this:

user id 1, page 1, page 2
user id 2, page 1
user id 3, page 3

This is done with the code

i <- split(gadata$pagePath, gadata$dimension1)

We now have every “transaction” of a user on one line. This is the input needed for our algorithm.
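To make this transformation tangible, here is a minimal sketch with made-up users and pages; the names toy and toy_i are used only for this illustration, the real data comes from the API call above:

```r
# Made-up miniature version of the gadata frame
toy <- data.frame(
  dimension1 = c("user1", "user1", "user2", "user3"),
  pagePath   = c("/page1", "/page2", "/page1", "/page3"),
  stringsAsFactors = FALSE
)

# split() produces one list element per user,
# containing the pages that user viewed
toy_i <- split(toy$pagePath, toy$dimension1)
# toy_i[["user1"]] now contains "/page1" and "/page2"
```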

Market Basket Analysis

Now is the time to load the R package arules and modify the data for the algorithm:

library(arules)
txn <- as(i, "transactions")

When calling the algorithm, the parameters sup for support and conf for confidence are used; in the following example, we specify that we want rules that apply in at least 0.1% of cases and have a confidence of at least 0.1%. That sounds awfully little at first, but imagine that we are dealing with a “shopping cart” into which many different combinations can be “placed”, depending on the size of an online shop or a website. Of course, we want to cover as many transactions as possible with the rules, but the more possible combinations there are, the more likely it is that we will have to start with a low support value.

basket_rules <- apriori(txn, parameter = list(sup = 0.001, conf = 0.001, target="rules"))

This can take up quite a lot of CPU time and RAM, and depending on the configuration, R may abort the process with an error message. One way to rein in the algorithm is to increase the sup and/or conf values. If everything went well, the identified rules can be examined:

inspect(head(sort(basket_rules, by="lift"),20))

This shows the top 20 rules, sorted by lift. The lift tells us how much more often the items occur together than they would if they were independent, which corresponds to a lift of 1. The output for me looks like this (not for this website):

Rule two, for example, states that a small proportion of transactions involve a user looking at an article about communication methods and, with a confidence of 30%, also reading the article about communication technology. This combination has an extremely high lift of 521, so we can assume that it occurs far more often than it would by chance.
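The relationship between support, confidence, and lift can be illustrated with a small back-of-the-envelope calculation. The numbers are made up and have nothing to do with the output above:

```r
# Toy numbers: 1000 transactions, 20 contain page A,
# 15 contain page B, 10 contain both A and B
n   <- 1000
nA  <- 20
nB  <- 15
nAB <- 10

support    <- nAB / n               # share of all transactions with A and B: 0.01
confidence <- nAB / nA              # P(B given A): 0.5
lift       <- confidence / (nB / n) # how much more often B occurs given A
                                    # than under independence: about 33.3
```

A lift of 1 would mean A and B co-occur exactly as often as chance predicts; anything well above 1 suggests a real association.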

Summary: Google Analytics and R

This small example shows the power of R and the packages. With just a few lines of code, data can be imported directly from Google Analytics into R and used for complex analyses. This analysis would not have been possible in the user interface.

However, we also see here how important it is to have data at the user level, although strictly speaking we don’t even have user-level data here, but rather browser-level data, due to the lack of cross-device tracking. A next step could therefore be to build the identified rules again on the basis of segmented data in order to identify differences.

The optimal tracking concept, or the sailing trip without a destination


How often have I heard the sentence “Let’s just track everything, we can think about what we actually need later. But of course the tracking concept can already be written!”

Let’s imagine we want to go on a trip in a sailboat and say: “I don’t know where we want to go, so let’s just take everything we could need for all eventualities.” Our boat would sink before the trip has even begun. We wouldn’t know whether we need to take water and canned food for a day or for several weeks, whether we need winter clothes or summer clothes, and so on. But to be on the safe side, we simply buy out the entire sailing supply store; we will surely need some of it. And now we have more load than the ship can bear.

Likewise, you can’t track everything that might be needed. Or maybe you can, but that would not only be very expensive, it would also make the website virtually unusable for users. More on that later. The bad news for all those looking for a simple solution to a difficult question: a tracking concept requires a lot of brainpower. If you skip this thinking, in most cases you collect useless data and burn time and money. Just as we have to think about what to take with us on the sailing trip, depending on the destination.

No tracking concept without clear goals

First of all, there is no way around defining goals, SMART goals, i.e. what by when, etc. For example, 100,000 new customers in a quarter or €500,000 in sales in a quarter. That is our destination. KPIs tell us where we are on the way to this goal, similar to a nautical chart on which we determine our position with navigation instruments and adjust the route if we have strayed from it.

If I realize that I probably won’t reach my goal of 100,000 new customers, then I want to know which levers I need to pull so that I can take corrective action. Or at least I would like to understand why this is so. Maybe I have to look for another goal because my actual goal doesn’t make sense at the moment. If I see that there is a storm in front of my destination port, there may be another port, and from there we may still be able to reach our actual destination later. If I don’t reach the sales target because the return rate is higher than expected, I want to understand the cause. And I won’t identify it with a standard implementation of Google Analytics.

All data and the information derived from it serve only one purpose: we want to understand what action we can derive from the data. If a piece of information is merely interesting but offers no actionable relevance, then the data has very likely been collected unnecessarily. At sea, I’m not interested in the weather forecast from two days ago. Nevertheless, such data ends up in reports; after all, you have it, it must be good for something, we’ll figure that out later. In the same way, we sail across the sea with our overloaded boat more badly than well and tell ourselves that we will need the stuff at some point, we just have to get into the right situation first.

On the impossibility of being prepared for everything

Space is limited on a boat, and all material has to find its place. This also applies to a tracking tool. For a shop, a connection to a CRM would certainly be interesting, so that the customer lifetime value etc. can be determined. Most likely, you will also want to work with custom dimensions in Google Analytics, so that data from the CRM can be used in Analytics for segmentation.

But how am I supposed to know which custom dimensions to define if I don’t even know whether and which ones I will need later? Especially since the number of custom dimensions is limited? Custom dimensions are a fundamental decision, similar to a modification to the boat that cannot be undone, because a custom dimension can no longer be deleted.

Every event is a small program that creates load

Each piece of material has weight and changes the sailing characteristics of a boat, to the point of overloading it. And of course, you can use a tracking tool to trigger an event in the browser every second to see how long a user has been doing what on a page. But firing events means running small programs in the browser, and a lot of load is good for neither the browser nor the user. One of them will give up; the only question is which gives up first.

So a tracking concept can only really be written once the goals and KPIs are clear. Unfortunately, defining them is exhausting work. The good thing is that once this task is done, an actionable reporting dashboard can be built as well. Numbers are then no longer reported just because they can be reported, but because they provide added value. Most dashboards, however, are far from that. And so most sailboats are sailed by whim, gut feeling, and sight, except that in online marketing we don’t put our lives at risk.

Of course, you can make a stopover in a harbor later along the route and adjust the provisions, equipment, and boat because you realize that it doesn’t work this way. But then I have lost not only time but also a lot of money. The same applies to the tracking concept: if I don’t think about it upfront, I have invested a lot of time and money in an enormously complex implementation without being able to use any of it the way I actually need it.

What is the standard for tracking?

“And what if we just do what everyone else does? There must be some standards.” The comparison with the sailing trip fits here too: what does the average sailing trip look like? I have hardly ever seen two tracking concepts that are the same, even within the same industry. And no two sailing trips are the same either, because every boat is a little different, every crew is different, and so on.

Those who want to avoid defining the destination just want to set off to signal movement, but will notice at sea at the latest that they won’t get through. Or they hope that no one notices. At some point, however, someone will notice that no one is really interested in the numbers, because they are completely irrelevant.

If you don’t know the port you want to sail to, no wind is the right one. (Seneca)