Data-driven personas with association rules


I have already talked about personas elsewhere; this article is about the data-driven generation of personas. I stick to the definition of Cooper, the inventor of the persona concept, and see a persona as a prototype for a group of users. This can also be interesting for marketing, because it allows needs- and experience-oriented communication, for example on a website. Personas are not target groups, but more on that elsewhere.

How do you create a data-driven persona?

I haven’t found the perfect universal way to create data-driven personas either. External data is not available for every topic, the original approach of 10-12 interviews is difficult to carry out, and internal data has the disadvantage that it only covers the people you already know, not those you might still want to reach. The truth lies in merging different data sources.

Data-driven persona meets web analytics

Web analytics data captures a lot of usage behavior, and depending on how a site is structured (for example, whether it is already geared to the different needs of different personas), it is possible to see to what extent the different user groups actually behave as expected. Alternatively, you can try to generate data-driven personas from the usage behavior on the website itself, with the restriction that users have to find the site first: it is not certain that all relevant groups of people actually reach it, so important personas may be overlooked. This article is about a special case of this automated persona generation from web analytics data that is exciting from an algorithmic point of view, and about the associated visualization. As is well known, everyone likes to report on successes; here is a case where the failure shows in which direction further work could go.

The experience from web mining is rarely associated with personas, although some research was done on this more than 10 years ago; for an overview, see, for example, Facca and Lanzi, “Mining interesting knowledge from weblogs: a survey”, from 2004 (published in 2005). Whereas in the past it was mainly weblogs (not web blogs!) that were used, i.e. log files written by the server, today we have the opportunity to use much “better” data through Google Analytics & Co.

Reintroducing: Association Rules

But what exactly is better? In GA & Co. we can better distinguish people from bots (of which there are more than you might think), returning visitors are recognized more reliably, devices are identified, and so on. The question is whether you really need this additional data for basic data-driven personas. Association rules, which I have already written about in a post about clustering with Google Analytics and R and which are also mentioned by Facca and Lanzi, can already identify basic groups of users. (I mentioned in that article that I once worked for one of the creators of the algorithm, Tomasz Imielinski; I still have to tell an anecdote about him: in a meeting, he once said to me that you often think something is a low hanging fruit, a quick win, but, “Tom, often enough, the low hanging fruits are rotten.” He has been right so many times.) The groups identify themselves through common behavior, for example the co-occurrence of page views. In R, this works wonderfully with the arules package and the apriori algorithm it contains.
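As a minimal sketch of the idea (the page paths and “transactions” below are invented toy data, not taken from any real site): each element of the list is the set of pages one user has viewed, and apriori mines rules from their co-occurrence.

library(arules)

# Toy "transactions": each element is the set of pages one user viewed (invented data)
pageviews <- list(
  c("/home", "/courses"),
  c("/home", "/courses", "/publications"),
  c("/scalable-capital-1", "/scalable-capital-2"),
  c("/scalable-capital-2", "/scalable-capital-3")
)

txn_toy <- as(pageviews, "transactions")

# Mine association rules from the co-occurring page views
toy_rules <- apriori(txn_toy, parameter = list(supp = 0.25, conf = 0.5, target = "rules"))
inspect(sort(toy_rules, by = "lift"))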

Data-driven personas with Google Analytics & Co.

As already mentioned in the earlier article: a standard installation of Google Analytics is not sufficient (it never is anyway). Either you have the 360 variant, or you “hack” the free version (“hack” in the sense of “tinkering”, not “being a criminal”) and pull the data via the API. With Adobe Analytics, the data can be pulled from the Data Warehouse or via an API. Simply using Google Analytics out of the box and deriving personas from it is therefore not possible with this approach. You also have to think about which GA dimension, alongside the Client ID, is best used to represent transactions. This can vary greatly from website to website. And if you want to be really thorough, a pageview alone may not be enough of a signal.

However, this article is first of all about the visualization and about the limitations the Apriori approach has for the automated generation of data-driven personas. For the visualization, I work with the package arulesViz. The resulting graphics are not easy to interpret, as I have experienced at the HAW, but also with colleagues. Below we see the visualization of association rules obtained from the data of this site, using the GA dimension pagePathLevel1 (which, in my case, is unfortunately sometimes just an article title). One thing stands out: I can actually only identify two groups here, and that’s pretty poor.
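For reference, a sketch of how such graphics can be produced with arulesViz, assuming the mined rules are stored in an object called rules (the object name is an assumption):

library(arulesViz)

plot(rules)                    # default scatter plot: support vs. confidence, shaded by lift
plot(rules, method = "graph")  # graph-based view: items as nodes, rules as connections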

What exactly do we see here? We see that users who are on the homepage also go to the Courses section and vice versa. The lift is high here, the support less so. And then we see users moving between my four articles about Scalable Capital, with roughly the same low lift but different levels of support. Lift is the factor by which the observed co-occurrence of two items exceeds the co-occurrence that would be expected if they were independent of each other. Support is the relative frequency. Support was set to 0.01 when creating the association rules, and confidence was also set to 0.01. For details, see my first article.
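To make the three measures concrete, here is a tiny calculation with invented numbers (they are not taken from the data of this site):

# Invented example values
supp_A  <- 0.02   # share of transactions containing page A
supp_B  <- 0.03   # share of transactions containing page B
supp_AB <- 0.012  # share of transactions containing both A and B

confidence <- supp_AB / supp_A            # 0.6: 60% of A viewers also view B
lift       <- supp_AB / (supp_A * supp_B) # 20: co-occurrence is 20 times the independent expectation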

But why don’t I see any other pages here? My article about Google Trends is read very frequently, as are the ones about the Thermomix or AirBnB. So it’s not because there aren’t more user groups. The disadvantage of this approach is simply that users have to have visited more than one page for a rule to arise at all. And since some users come via a Google search and apparently have no interest in a second article, because their need for information is already satisfied or because I don’t promote my other articles well enough, apparently only students and those interested in Scalable Capital can be identified in these rules.

Ways out of the Apriori dilemma?

So far, I’ve identified three ways to solve this dilemma, and all of them require extra work:

  • I test whether I can get users to view more than one page through a better, more relevant offer, for example with Google Optimize; if that works, I get better data.
  • I use the Apriori rules only as a basis and merge them with other data (also very nice, but I won’t cover it here).
  • I lower the support and confidence thresholds.

In my opinion, the first approach is the most elegant, but it requires time and brains, and there is no guarantee that anything will come of it. The last approach is unpleasant, because we are then dealing with cases that occur less frequently and are therefore not necessarily reliable. With a support of 0.005, the visualization looks different:
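Such a re-run could be sketched like this, assuming the page views have already been converted into a transactions object txn as in the code shown further below in the clustering article; confidence stays at 0.01:

# Lower the support threshold from 0.01 to 0.005 and mine the rules again
basket_rules_low <- apriori(txn, parameter = list(sup = 0.005, conf = 0.01, target = "rules"))

# Visualize again for comparison with the first graph
library(arulesViz)
plot(basket_rules_low, method = "graph")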

But again I have the problem that the individual pages do not appear. It seems to be extremely rare that someone moves from the Google Trends article to another article, so lowering the support value didn’t help. From experience, I can say that this problem shows up more or less strongly on most sites I get to see, but it always shows up in some form. The tricky part is that if you can already read good personas out of the rules, you are more inclined not to look at the rest, even though it could be very large in scope.

We also see another problem in the graphic: the users in the right-hand strand do not have to be the same from arrow to arrow. In other words, it is not guaranteed that visitors who look at photography pages and courses will also look at the publications, even if it looks like that in the visualization. Just because A and B as well as B and C co-occur, it does not follow that A and C do! To solve this, the association rules in the visualization would need an explicit marking for such cases. That does not exist yet and would be a task for the future.
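One way to check this by hand is to look for the chained rule explicitly. A sketch, assuming the rules are stored in an object called rules and using placeholder page names:

# Does a rule exist whose left-hand side contains BOTH pages and whose
# right-hand side is the publications page? (page names are placeholders)
chained <- subset(rules, subset = lhs %ain% c("/photography/", "/courses/") &
                                  rhs %in% "/publications/")
inspect(chained)   # if this is empty, the chain suggested by the graph is not backed by a rule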

Result

The path via association rules is exciting for the creation of data-driven personas with Google Analytics or other web analytics tools. However, it will usually not be sufficient on its own at the moment, because a) the problem of one-page visitors is not solved here, b) the rules do not provide sufficient information about different groups that merely overlap, and c) the approach can only say something about the groups that are already on the site anyway. I’m currently working on a) and b) on the side; I’m always happy about thoughts from outside.

Clustering with Google Analytics and R


Some questions are not so easy, or even impossible, to answer with the Google Analytics user interface (this also applies to Adobe Analytics, Piwik, etc.). While Google Analytics offers powerful and easy-to-use functionality to manually create and compare segments or personas based on devices, acquisition channels, or browsers, once you go beyond these standard segments or want combinations of multiple dimensions, the effort becomes considerable. Often enough, people simply “poke around” in the data and hope that they will find something valuable. This is exactly where the advantages of combining Google Analytics and R come into play. One way to connect Google Analytics and R is the R package googleAnalyticsR by Mark Edmondson, which is used as an example in this article.

Segmentation, Clustering and Classification

Before we get practical, let’s briefly explain the difference between segmentation, clustering, and classification. Segmentation attempts to divide customers or users into groups that differ based on characteristics, whether it’s an interest, an access channel, the marketing campaign through which a customer came, etc. Clustering attempts to automatically identify such groups and the dimensions or features that distinguish them, whereas classification attempts to predict which group a customer or user belongs to. Classification is a good example of supervised machine learning; clustering is a good example of unsupervised machine learning. Something meaningful does not always come out of it, hence the frequent use of the word “attempts”.

This example is about clustering, i.e. identifying groups based on structures in the data using machine learning algorithms. Hierarchical clustering identifies groups based on their similarity: it starts with a separate cluster for each data point and, at each subsequent level, merges the most similar clusters until all clusters are joined (see the dendrogram). A great algorithm that ideally needs numerical data as input, because similarity is calculated here as a distance. Although DAISY could also be used with non-numerical data, that would go too far for a first step.
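As a small illustration of the hierarchical approach, here is a sketch with toy data (the numbers are invented and not yet related to Google Analytics):

# Toy data: one row per user, numeric features only
user_metrics <- data.frame(sessions  = c(1, 2, 8, 9, 1, 7),
                           pageviews = c(2, 3, 20, 25, 1, 18))

d  <- dist(scale(user_metrics))       # Euclidean distance on standardized values
hc <- hclust(d, method = "ward.D2")   # agglomerative (bottom-up) clustering

plot(hc)                     # dendrogram: every data point starts as its own cluster
groups <- cutree(hc, k = 2)  # cut the tree into, for example, two groups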

For our example, we’ll use a different approach: we just want to find out whether clusters can be formed based on the content or products viewed on a site. I suspect that the visitors of this website who look at roller derby photos are very likely not interested in the articles about fintechs. But maybe readers of the Google Analytics articles are interested in fintechs, so they could be offered articles on that topic. We know this from Amazon (“Frequently bought together”), and most of the time the suggestions are useful. Of course, I can also look at the user flow report in Google Analytics (with the few pages on my homepage, that is more than enough), but as soon as there are more pages or products, this report doesn’t get us any further. This is where association rules, also known as market basket analysis, help us (I even worked with one of the creators, Tomasz Imielinski, at Ask.com once). This machine-learning-based approach tries to identify interesting relationships between variables in large data sets.

In general, the selection of an algorithm depends first and foremost on the business problem that is to be solved. This is an extremely important point: what question do we actually want to answer? In this case, I can only answer the question of which pages end up together in a “shopping cart”, based on the usage by website visitors. Of course, I could also compute a “similarity” automatically based on the content.

Keep your eyes open when interpreting the results

But first of all, a big warning: what users look at depends to a large extent on what the navigation and other elements on the page look like. If, for example, a recommendation system such as YARPP is already integrated into the site, then the probability is higher that some pages are accessed together with other pages simply because of this plugin. Even small things on a page can lead to connections being seen that are not actually there, be it just a small icon that attracts the user’s attention and entices them to click.

It should also be pointed out that viewing a page is not quite as strong a signal as buying a product. Just because someone calls up a page doesn’t mean that the person stayed on that page for long and actually read the text. You could account for this, for example, by measuring the time spent on the respective page and only allowing the combinations where “enough” time has been spent, but the time-on-page figures in web analytics systems are unfortunately often unusable.
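If the time on page were reliable, the filtering could be sketched like this; gadata corresponds to the query shown further below, extended by the metric ga:timeOnPage, and the 60-second threshold is an arbitrary assumption:

# Assumed: 'gadata' was pulled as shown later in this article, with the
# additional metric ga:timeOnPage, so a timeOnPage column is available
gadata_read <- gadata[gadata$timeOnPage >= 60, ]   # keep only views of at least 60 seconds

# Build the per-user "transactions" only from these filtered page views
i_read <- split(gadata_read$pagePath, gadata_read$dimension1)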

Preconditions

Clustering with Google Analytics and R requires analytics data at the user level, which means that users of a standard installation of the free version of Google Analytics cannot easily perform clustering. The standard version has the new User Explorer, which can be used to look at the behavior of individual anonymized users, but this data cannot be downloaded. Only buyers of the 360 variant get access to user-level data out of the box; in the free version, only aggregated data is offered. However, there is a small hack that can be used to change this; otherwise, here is another argument for Piwik, at least as a second system. In addition to access to the raw data, the advantage of Piwik is that the data is available immediately. And there is also an R package for Piwik. But in this article, the solution with Google Analytics and R is to be shown.

Connect Google Analytics and R to googleAnalyticsR

Now it’s time to get down to business. I won’t explain how to install an R package; for what we are planning to do, the googleAnalyticsR and googleAuthR packages are needed on the one hand, and the arules package on the other. The R help is your friend.

We first load the two packages and then log in with a Google account (a browser window opens for this):

library("googleAuthR")<l g ibrary("googleAnalyticsR")<a_auth()

By the way, if you are already logged in and want to switch to another account, you enter

ga_auth(new_user = TRUE)

and log in with the other account. Next, let’s get the list of our accounts and properties that are connected to this Google account:

my_accounts <- ga_account_list()

In this data frame, we look in the viewName column for the view we want to use and find the corresponding viewId in the same row. This viewId is essential, because we will use it to retrieve the data of our view. We store the viewId in a variable:

ga_id=XXXXXX

where XXXXXX stands for the viewId. What data are available to us now? Simply enter

meta <- google_analytics_meta()

and all dimensions and metrics are available in the data frame meta. We are particularly interested in Custom Dimension 1, where the user-level ID is stored, and in ga:pagePath:

gadata <- google_analytics(id = ga_id,
                           start = "2017-04-20", end = "2017-05-06",
                           dimensions = c("ga:dimension1", "ga:pagePath"),
                           max = 2000)

In my data frame gadata, I now have the data as described above plus two more columns (sessions and bounceRate), because GA doesn’t output dimensions without metrics. The data needs to be transformed so that it looks something like this:

user id 1, page 1, page 2
user id 2, page 1
user id 3, page 3

This is done with the code

i <- split(gadata$pagePath,gadata$dimension1)

We now have every “transaction” of a user on one line. This is the input needed for our algorithm.

Market Basket Analysis

Now is the time to load the R package arules and modify the data for the algorithm:

library(arules)
txn <- as(i, "transactions")

When calling the algorithm, the parameters sup for support and conf for confidence are used; in the following example, we specify that we want rules that apply in at least 0.1% of the cases and that have a confidence of at least 0.1%. That sounds awfully low at first, but let’s remember that we are dealing with a “shopping cart” into which many different combinations can be “placed”, depending on the size of an online shop or website. Of course, we want to cover as many transactions as possible with the rules, but the more possibilities there are, the more likely it is that we will have to start with a low support value.

basket_rules <- apriori(txn, parameter = list(sup = 0.001, conf = 0.001, target="rules"))

This can take up quite a lot of processor time and RAM, and depending on the configuration, R aborts the process with an error message. One way to rein in the algorithm is to increase the sup and/or conf parameters. If everything went well, the identified rules can be examined:

inspect(head(sort(basket_rules, by="lift"),20))

This shows the top 20 rules, sorted by lift. The lift tells us how much more often the items occur together than would be expected if they were independent of each other, which corresponds to a lift of 1. The output for me looks like this (not for this website):

Rule two, for example, states that a small proportion of transactions involve a user looking at an article about communication methods and, with a confidence of 30%, also reading the article about communication technology. This combination has an extremely high lift of 521, so we can assume that it occurs far more often than it would if the two pages were viewed independently of each other.

Summary Google Analytics and R

This small example shows the power of R and the packages. With just a few lines of code, data can be imported directly from Google Analytics into R and used for complex analyses. This analysis would not have been possible in the user interface.

However, we also see here how important it is to have data at the user level, although it must be said that we don’t even really have user-level data here, but rather browser-level data, due to the lack of cross-device tracking. A next step could therefore be to create the identified rules again on the basis of segmented data in order to identify differences.
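A sketch of what this next step could look like, reusing the query pattern from above; the deviceCategory dimension and the exact call are assumptions, not tested code:

# Pull the page views together with an additional segmentation dimension
gadata_seg <- google_analytics(id = ga_id,
                               start = "2017-04-20", end = "2017-05-06",
                               dimensions = c("ga:dimension1", "ga:pagePath", "ga:deviceCategory"),
                               max = 2000)

# Mine one rule set per device category and compare them afterwards
rules_by_device <- lapply(split(gadata_seg, gadata_seg$deviceCategory), function(seg) {
  txn_seg <- as(split(seg$pagePath, seg$dimension1), "transactions")
  apriori(txn_seg, parameter = list(sup = 0.001, conf = 0.001, target = "rules"))
})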