Data-driven personas with association rules


I have already talked about personas elsewhere; this article is about generating personas from data. I stick to the definition of Cooper, the inventor of personas, and see a persona as a prototype for a group of users. This can also be interesting for marketing because, after all, you can use personas to create needs- and experience-oriented communication, for example on a website. Personas are not target groups, but more on that elsewhere.

How do you create a data-driven persona?

I haven’t found the perfect universal way for data-driven personas either. External data is not available for all topics, the original approach of 10-12 interviews is difficult, and internal data has the disadvantage that it only contains the data of those you already know, not those you might still want to reach. The truth lies in merging different data sources.

Data-driven persona meets web analytics

Web analytics data offers a lot of usage behavior, and depending on how a page is structured (for example, whether it is already geared to the different needs of different personas), it is possible to understand the extent to which the different user groups actually behave as expected. Or you try to generate data-driven personas from the usage behavior on the website. All of this comes with the restriction that users have to find the page first, so it is not certain that all groups of people actually reach the page, and important personas may therefore be overlooked. This article is about a special case of this automated persona generation from web analytics data, which is exciting both from an algorithmic point of view and because of the associated visualization. Everyone likes to report on successes; here is a case where the failure shows in which direction further work could go.

The experiences from web mining are rarely associated with personas, although some research was done on it more than 10 years ago; for an overview, see, for example, Facca and Lanzi, Mining interesting knowledge from weblogs: a survey, from 2004 (published in 2005). Whereas in the past it was mainly weblogs (not web blogs!) that were used, i.e. log files written by the server, today we have the opportunity to use much “better” data through Google Analytics & Co.

Reintroducing: Association Rules

But what exactly is better? In GA & Co. we can better distinguish people from bots (of which there are more than you think), returning visitors are recognized more reliably, devices are identified, and so on. The question is whether you absolutely have to use the additional data for basic data-driven personas. Association rules, which I have already written about in a post about clustering with Google Analytics and R and which are also mentioned by Facca and Lanzi, can already identify basic groups of users. (I had already mentioned in the other article that I once worked for one of the creators of the algo, Tomasz Imielinski, but I still have to tell an anecdote about him: in a meeting, he once said to me that you often think something is a low-hanging fruit, a quick success, but, “Tom, often enough, the low hanging fruits are rotten”. He has been right so many times.) The groups are identified by shared behavior, for example the co-occurrence of page views. In R, this works wonderfully with the arules package and the apriori algo it contains.

Data-driven personas with Google Analytics & Co.

As already mentioned in the earlier article: A standard installation of Google Analytics is not sufficient (it never is anyway). Either you have the 360 variant or you “hack” the free version (“hack” in the sense of “tinkering”, not “being a criminal”) and pull the data via API. With Adobe Analytics, the data can be pulled from the data warehouse or via an API. Simply using Google Analytics and drawing personas from it is therefore not possible with this approach. You also have to think about which field from GA is best used alongside the Client ID to represent transactions. This can vary greatly from website to website. And if you want to be very clever, then a PageView alone may not be signal enough.
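To make the idea of "transactions" a bit more tangible, here is a minimal sketch in plain Python (not the arules package in R that I actually use). It assumes the exported data arrives as rows of Client ID and pagePathLevel1; the rows, the Client IDs and the helper name build_transactions are made up for illustration.

```python
from collections import defaultdict

# Hypothetical export rows: (client_id, pagePathLevel1). All values are invented.
hits = [
    ("c1", "/"),
    ("c1", "/courses/"),
    ("c2", "/google-trends/"),
    ("c3", "/"),
    ("c3", "/courses/"),
    ("c3", "/"),  # a repeated view of the same page
]

def build_transactions(rows):
    """Group page views by client ID; each client becomes one transaction (a set of pages)."""
    baskets = defaultdict(set)
    for client_id, page in rows:
        baskets[client_id].add(page)  # a set ignores repeated views of a page
    return dict(baskets)

transactions = build_transactions(hits)
for client_id, pages in transactions.items():
    print(client_id, sorted(pages))
```

Using a set per client means that only the co-occurrence of pages counts, not how often or in which order they were viewed. Whether that is the right choice depends on the website.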

However, this is first of all about visualization and about the limitations the Apriori approach has for the automated generation of data-driven personas. For the visualization, I work with the package arulesViz. The resulting graphics are not easy to interpret, as I have experienced at the HAW, but also with colleagues. Below we see the visualization of association rules obtained from the data of this page, with the GA field pagePathLevel1 (which on my site is unfortunately also the article title). One thing stands out here: I can actually only identify two groups, and that’s pretty poor.

What exactly do we see here? We see that users who are on the homepage also go to the Courses section and vice versa. The lift is high here, the support not so much. And then we see users moving between my four articles about Scalable Capital, with roughly the same low lift but different levels of support. Lift is the factor by which the co-occurrence of two items is higher than would be expected if they were independent of each other. Support is the relative frequency with which the items occur together. The minimum support was set to 0.01 when creating the association rules, and the minimum confidence was also set to 0.01. For details, see my first article.
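For those who want to see the three measures once by hand, here is a small sketch in plain Python. It is not what arules does internally, only the textbook definitions for a rule with one page on each side; the transactions and the function name rule_metrics are invented for illustration.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs in t)
    n_rhs = sum(1 for t in transactions if rhs in t)
    n_both = sum(1 for t in transactions if lhs in t and rhs in t)
    if n == 0 or n_lhs == 0 or n_rhs == 0:
        raise ValueError("both pages must occur in at least one transaction")
    support = n_both / n              # share of transactions containing both pages
    confidence = n_both / n_lhs       # share of lhs transactions that also contain rhs
    lift = confidence / (n_rhs / n)   # confidence relative to the base rate of rhs
    return support, confidence, lift

# Invented transactions: one set of pages per user.
transactions = [
    {"/", "/courses/"},
    {"/", "/courses/"},
    {"/", "/thermomix/"},
    {"/thermomix/"},
    {"/google-trends/"},
    {"/google-trends/"},
    {"/google-trends/"},
    {"/google-trends/"},
    {"/airbnb/"},
    {"/airbnb/"},
]

support, confidence, lift = rule_metrics(transactions, "/", "/courses/")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

With this toy data, the rule from the homepage to the Courses section has a support of 0.2, a confidence of about 0.67 and a lift of about 3.33: two of ten users saw both pages, and seeing the homepage makes the Courses section a bit more than three times as likely as its base rate.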

But why don’t I see any other pages here? My article about Google Trends is a very frequently read article, as is the one about the Thermomix or AirBnB. So it’s not because there aren’t more user groups. The disadvantage of this approach is simply that users have to have visited more than one page for a rule to arise here at all. And since some users come via a Google search and apparently have no interest in a second article, because their need for information may already be satisfied or because I don’t advertise it well enough, apparently only students and those interested in Scalable Capital can be identified here in these rules.
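The effect can be reproduced with a few lines of plain Python (again only a sketch with invented data, not my actual numbers): pages that are only ever viewed on their own never end up in a pair, no matter how often they are read.

```python
from collections import Counter
from itertools import combinations

# Invented transactions: most readers of the popular articles see only that one page.
transactions = [
    {"/", "/courses/"},
    {"/", "/courses/"},
    {"/google-trends/"},
    {"/google-trends/"},
    {"/google-trends/"},
    {"/google-trends/"},
    {"/thermomix/"},
    {"/thermomix/"},
]

# How often is each page viewed at all?
page_counts = Counter(page for t in transactions for page in t)

# How often does each pair of pages occur together in one transaction?
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):  # single-page transactions yield no pairs
        pair_counts[pair] += 1

pages_in_pairs = {page for pair in pair_counts for page in pair}
invisible = {page: n for page, n in page_counts.items() if page not in pages_in_pairs}

print("pairs:", dict(pair_counts))
print("frequently read, but invisible in the rules:", invisible)
```

In this toy data the Google Trends article is the most-read page, yet it cannot appear in any rule with two or more items.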

Ways out of the Apriori dilemma?

So far, I’ve identified three ways to solve this dilemma, and all of them require extra work:

  • I test whether I can get users to view more than one page through a better relevant offer, for example with Google Optimize, and if successful, I get better data.
  • I use the Apriori results only as a base and merge them with other data (also very nice, but I won’t cover it here).
  • I lower the support and confidence.

The nicest is the first approach, in my opinion, but it requires time and brains. And there is no guarantee that anything will come of it. The last approach is unpleasant, because we are then dealing with cases that occur less frequently and are therefore not necessarily reliable. With a support of 0.005, the visualization looks different:

But again I have the problem that the individual pages do not appear. So it seems to be extremely rare that someone moves from the Google Trends article to another article, and lowering the support value didn’t help. From experience, I can say that this problem appears more or less strongly on most websites I otherwise work with, but it always appears somehow. The stupid thing is that if you can already read good personas from the rules, you are more inclined not to look at the rest, even if that rest could be very large.
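To get a feeling for why lower thresholds become unreliable, it helps to translate the minimum support into an absolute number of users. A tiny sketch in plain Python; the 2000 transactions are a made-up figure, not the real number for this site, and min_absolute_count is a hypothetical helper.

```python
import math

def min_absolute_count(min_support, n_transactions):
    """Smallest number of transactions a rule must cover to reach min_support."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(min_support * n_transactions, 9))

n_transactions = 2000  # hypothetical, not the real figure for this site

for threshold in (0.01, 0.005):
    needed = min_absolute_count(threshold, n_transactions)
    print(f"min support {threshold}: a rule needs at least {needed} transactions")
```

With these assumed numbers, halving the support from 0.01 to 0.005 means that a rule may rest on 10 users instead of 20.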

We also see another problem in the graphic: the users in the right strand do not have to be the same from arrow to arrow. In other words, there is no guarantee that visitors who look at photography pages and courses will also look at the publications, even if it looks like that in the visualization. The inference “if A and B, and B and C, then A and C” does not hold here! To solve this, the association rules in the visualization would need some kind of exclusionary marking. It does not exist yet and would be a task for the future.
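A deliberately extreme toy example in plain Python shows the point. The transactions and the helper co_occurrence are invented; the page names only follow the strand described above.

```python
# Invented transactions: two separate groups that both visit the courses.
transactions = [
    {"/photography/", "/courses/"},
    {"/photography/", "/courses/"},
    {"/courses/", "/publications/"},
    {"/courses/", "/publications/"},
]

def co_occurrence(transactions, a, b):
    """Count the transactions that contain both page a and page b."""
    return sum(1 for t in transactions if a in t and b in t)

photo_courses = co_occurrence(transactions, "/photography/", "/courses/")
courses_pubs = co_occurrence(transactions, "/courses/", "/publications/")
photo_pubs = co_occurrence(transactions, "/photography/", "/publications/")

print(photo_courses, courses_pubs, photo_pubs)
```

A graph of the resulting rules would connect photography with courses and courses with publications, although not a single user in this data saw both photography and publications.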

Result

The path via association rules is exciting for the creation of data-driven personas with Google Analytics or other web analysis tools. However, it will usually not be sufficient at the moment, because a) the problem of one-page visitors is not solved, b) the rules do not provide sufficient information about different groups that merely overlap, and c) the approach can only say something about those groups that are already on the site anyway. I’m currently working on a) and b) on the side, and I’m always happy about thoughts from outside.