Row Names in R

Some datasets in R use row names, such as the built-in mtcars dataset. While convenient, row names can be suboptimal when sorting data, for instance, by car brands:

To convert row names into a column using the Tidyverse, the rownames_to_column() function is used:

library(tidyverse)

# Load mtcars and move the row names into a column
mtcars_tidy <- mtcars %>%
  rownames_to_column(var = "car_name")

And this is the result:
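With the car names in a regular column, the sorting problem from the introduction goes away; a quick sketch using dplyr's arrange(), which is loaded with the tidyverse:

```r
library(tidyverse)

# Move the row names into a column, then sort by it
mtcars_tidy <- mtcars %>%
  rownames_to_column(var = "car_name") %>%
  arrange(car_name)

head(mtcars_tidy$car_name, 3)
# "AMC Javelin" "Cadillac Fleetwood" "Camaro Z28"
```

Had we sorted the original mtcars instead, the row names would simply have stayed glued to their rows without being usable as a sort key.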

Visualizing overlaps of ETFs in an UpSet diagram

Today, two topics I find particularly exciting come together: data analysis and visualization, and finance. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. However, it’s equally fascinating to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (NL0011683594) and an MSCI World ETF (IE00B4L5Y983).

The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before it can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new type of visualization that I’ve used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:

The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:

Yes, it takes a bit of effort, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of the datasets, with the Vanguard FTSE All-World High Dividend Yield having the most holdings—over 2,000. On the right-hand side, we see the overlaps. The point at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] has 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 titles that no other ETF contains. In the third column, we see that these two ETFs share 486 titles that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn’t have thought that the Vanguard contains so many stocks that the MSCI World does not.

The VanEck allegedly has one stock that no other ETF contains, but that’s not accurate; that entry was just cash. Otherwise, 81 of its 100 titles are also included in the MSCI World. All of its titles are included in the Vanguard.

It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.
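For anyone who wants to try this themselves: a minimal sketch of how such an UpSet diagram can be built with the UpSetR package. The holdings lists below are invented placeholders; in the real notebook they come from the providers' downloads.

```r
library(UpSetR)

# Invented mini "holdings" lists -- stand-ins for the real ETF downloads
etfs <- list(
  vanguard = c("AAPL", "MSFT", "JNJ", "PG", "KO"),
  ishares  = c("MSFT", "JNJ", "VZ"),
  vaneck   = c("JNJ", "PG"),
  msci     = c("AAPL", "MSFT", "JNJ", "XOM")
)

# fromList() turns the lists into the binary membership matrix UpSetR expects
upset(fromList(etfs), order.by = "freq")
```

The resulting plot shows set sizes on the left and intersection sizes as vertical bars, just like the diagram discussed above.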

ggplot2 and the New Pipe

Why doesn’t this code work?

mtcars |> ggplot(., aes(x = mpg, y = hp)) + geom_point()

The problem lies in the dot (.) passed to ggplot. The dot placeholder belongs to the magrittr pipe (%>%), which dplyr re-exports; the base R pipe (|>) does not know it, so R cannot find an object called . and the call fails. With the magrittr pipe the code works, and the dot can simply be dropped, because the pipe already passes mtcars as the first argument:

library(ggplot2)
library(dplyr)

mtcars %>%
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

Alternatively, you can keep the base pipe and pass the data explicitly via its placeholder, the underscore (_), available since R 4.2:

library(ggplot2)

mtcars |>
  ggplot(data = _, aes(x = mpg, y = hp)) +
  geom_point()

Here, the underscore (_) represents the data piped in from mtcars; unlike the magrittr dot, the base pipe's placeholder must be supplied as a named argument, in this case data.

Apple MacBook Pro M1 Max – Is it worth it for Machine Learning?


Another new MacBook? Didn’t I just buy the Air? Yes, it still has warranty, so it makes even more sense to sell it. I’m a big fan of the Air form factor, and I’ve never quite warmed up to the Pro models. However, the limitation of 16GB of RAM in the MacBook Air was hard to accept at the time, but there were no other alternatives. So, on the evening when the new MacBook Pros with M1 Pro and M1 Max were announced, I immediately ordered one – a 14″ MacBook Pro M1 Max with 10 cores, 24 GPU cores, a 16-core Neural Engine, 64 GB of RAM (!!!), and a 2TB drive. My MacBook Air has 16 GB of RAM and the first M1 chip with 8 cores.

Why 64 GB of RAM?

I regularly work with large datasets, ranging from 10 to 50 GB. But even a 2 GB file can cause issues, depending on what kind of data transformations and computations you perform. Over time, using a computer with little RAM becomes frustrating. While a local installation of Apache Spark helps me utilize multiple cores simultaneously, the lack of RAM is always a limiting factor. For the less technically inclined among my readers: Data is loaded from the hard drive into the RAM, and the speed of the hard drive determines how fast this happens because even an SSD is slower than RAM.

However, if there isn’t enough RAM, for example, if I try to load a 20 GB file into 16 GB of RAM, the operating system starts swapping objects from the RAM to the hard drive. This means data is moved back and forth between the RAM and the hard drive, but the hard drive now serves as slower “RAM.” Writing and reading data from the hard drive simultaneously doesn’t speed up the process either. Plus, there’s the overhead, because the program that needs the RAM doesn’t move objects itself—the operating system does. And the operating system also needs RAM. So, if the operating system is constantly moving objects around, it also consumes CPU time. In short, too little RAM means everything slows down.
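A quick way to get a feeling for this in R, before loading the really big files, is to check how much RAM an object actually occupies; utils::object.size is the base tool for this (the numbers are approximate):

```r
# Two columns of one million doubles each -> roughly 2e6 * 8 bytes
x <- data.frame(a = runif(1e6), b = runif(1e6))
format(object.size(x), units = "MB")  # about 15 MB
```

Transformations often create temporary copies, so the working-set size can be a multiple of the raw file size, which is how even a 2 GB file can get a 16 GB machine into swapping.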

At one point, I considered building a cluster myself. There are some good guides online about how to do this with inexpensive Raspberry Pis. It can look cool, too. But I have little time. I might still do this at some point, if only to try it out. Just for the math: 8 Raspberry Pis with 8 GB of RAM plus accessories would probably cost me close to €1,000. Plus, I’d have to learn a lot of new things. So, putting it off isn’t the same as giving up.

How did I test it?

To clarify, I primarily program in R, a statistical programming language. Here, I have two scenarios:

  • An R script running on a single core (not parallelized).
  • An R script that’s parallelized and can thus run on a cluster.

For the cluster, I use Apache Spark, which works excellently locally. For those less familiar with the tech: with Spark, I can create a cluster where computational tasks are divided and sent to individual nodes for processing. This allows for parallel processing. I can either build a cluster with multiple computers (which requires sending the data over the network), or I can install the cluster locally and use the cores of my CPU as the nodes. A local installation has the huge advantage of no network latency.

For those who want to learn more about R and Spark, here is the link to my book on R and Data Science!

For the first test, a script without parallelization, I use a famous dataset from the history of search engines, the AOL data. It contains 36,389,575 rows, just under 2 GB. Many generations of my students have worked with this dataset. In this script, the search queries are broken down, the number of terms per query is calculated, and correlations are computed. Of course, this could all be parallelized, but here, we’re just using one core.

For the second test, I use a nearly 20 GB dataset from Common Crawl (150 million rows and 4 columns) and compare it with data from Wikipedia, just under 2 GB. Here, I use the previously mentioned Apache Spark. My M1 Max has 10 cores, and even though I could use all of them, I’ll leave one core for the operating system, so we’ll only use 9 cores. To compare with the M1 in my MacBook Air, we’ll also run a test where the M1 Max uses the same number of cores as the Air.
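I won't reproduce the whole notebook here, but as a hedged sketch of what such a local setup can look like with the sparklyr package (assuming Spark is installed locally; the path and table name are placeholders):

```r
library(sparklyr)

# Local "cluster": the CPU cores act as nodes, 9 of the 10 cores in use
sc <- spark_connect(master = "local[9]")

# Reading a folder of small partitions lets the nodes read independently
cc <- spark_read_csv(sc, name = "common_crawl", path = "data/common_crawl/")

spark_disconnect(sc)
```

The master string "local[9]" is how Spark is told to use nine local cores, matching the test setup described above.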

How do I measure? There are several ways to measure, but I choose the simplest one: I look at what time my script starts and when it ends, then calculate the difference. It’s not precise, but we’ll see later that the measurement errors don’t really matter.
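In R, this "simplest way" boils down to something like the following; the actual workload is stubbed out with Sys.sleep here:

```r
start_time <- Sys.time()

# ... the real script would run here; a 2-second stand-in:
Sys.sleep(2)

end_time <- Sys.time()
runtime <- difftime(end_time, start_time, units = "mins")
round(as.numeric(runtime), 2)
```

For runtimes measured in minutes, the second or two of imprecision this method introduces is irrelevant.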

Results: Is it worth it?

It depends. The first test is somewhat disappointing. The larger RAM doesn't seem to make much of a difference here, even though modified versions of the AOL dataset are created and held in memory. The old M1 completes the script in 57.8 minutes, while the M1 Max takes 42.5 minutes. The data is probably loaded into RAM a bit faster thanks to the faster SSD, but that difference is only a few seconds; the rest seems to come from the CPU. At this price, though, the M1 Max doesn't justify itself (it's twice as expensive as the MacBook Air).

Things get more interesting when I use the same number of cores on both sides for a cluster and then use Spark. The differences are drastic: 52 minutes for the old M1 with 16 GB of RAM, 5.4 minutes for the new M1 Max with 64 GB of RAM. The “old” M1, with its limited RAM, takes many minutes just to load the large dataset, while the new M1 Max with 64 GB handles it in under 1 minute. By the way, I’m not loading a simple CSV file here but rather a folder full of small partitions, so the nodes can read the data independently. It’s not the case that the nodes are getting in each other’s way when loading the large file.

Materials for Web Analytics Wednesday, April 8, 2020


It’s great that you were part of the first virtual Web Analytics Wednesday. Here are the promised links:

All links marked with + are affiliate links

Connect R through Service User with Google APIs


If you want to use R to access APIs automatically, then authorization via browser is not an option. The solution is called Service User: With a Service User and the associated JSON file, an R program can access the Google Analytics API, the Google Search Console API, but also all the other wonderful machine learning APIs. This short tutorial shows what needs to be done to connect to the Google Search Console.

First, you create a project if you don't have a suitable one yet, and then the appropriate APIs have to be enabled. After a short search, you will find the Google Search Console API, which only needs to be enabled.

Click on IAM & admin and then click on Service accounts (it's highlighted somewhat oddly in this screenshot):

Click Create Service Account:

Important: Give the e-mail address a meaningful name, so that you can keep track of it later…

Browse Project is sufficient for this step:

Click on “Create Key” here:

Create a JSON key, download it and then put it in the desired RStudio directory.

Now the service user must be added as a user in the Search Console; the important thing is that it is granted full permissions.

What’s really great about the Google Search Console API is that you can see the query and landing page at the same time, unlike in the GUI. By the way, I get the data every day and write it to a database, so I have a nice history that goes beyond the few months of Search Console.
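A hedged sketch of such a pull with Mark Edmondson's googleAuthR and searchConsoleR packages; the JSON path and site URL are placeholders, and the calls obviously only work with a valid service-user key that has been added to the property:

```r
library(googleAuthR)
library(searchConsoleR)

# Authenticate with the service user's JSON key (placeholder path)
gar_auth_service(
  json_file = "service-user.json",
  scope = "https://www.googleapis.com/auth/webmasters"
)

# Query and landing page in one request -- not possible side by side in the GUI
gsc_data <- search_analytics(
  siteURL    = "https://example.com/",
  startDate  = Sys.Date() - 4,
  endDate    = Sys.Date() - 3,
  dimensions = c("query", "page")
)
```

Run daily (e.g. via cron) and appended to a database, this builds the long history that the Search Console GUI itself does not keep.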

Last but not least, here is the R notebook I use to query the data; it is essentially the code written by the author of the R package, Mark Edmondson, included here for completeness to show how the JSON file is wired in. There is a more elegant variant using R environment variables, but I don't know whether it works on Windows.

Data-driven personas with association rules


I have already talked about personas elsewhere; this article is about the data-driven generation of personas. I stick to the definition of Cooper, the inventor of personas, and see a persona as a prototype for a group of users. This can also be interesting for marketing, because it lets you create needs- and experience-oriented communication, for example on a website. Personas are not target groups, but more on that elsewhere.

How do you create a data-driven persona?

I haven’t found the perfect universal way for data-driven personas either. External data is not available for all topics, the original approach of 10-12 interviews is difficult, and internal data has the disadvantage that it only contains the data of those you already know, not those you might still want to reach. The truth lies in merging different data sources.

Data-driven persona meets web analytics

Web analytics data captures a lot of usage behavior, and depending on how a site is structured (for example, whether it is already geared to the needs of different personas), it can show to what extent the different user groups actually behave as expected. Alternatively, you can try to generate data-driven personas from the usage behavior on the site itself. All of this comes with a caveat: users have to find the page first, so it is not certain that all relevant groups ever reach it, and important personas may be overlooked. This article is about a special case of this automated persona generation from web analytics data, one that is exciting from an algorithmic point of view, and about the associated visualization. Everyone likes to report successes; here is a failure that shows in which direction further work could go.

The experiences from web mining are rarely associated with personas, although some research was done on it more than 10 years ago; for an overview, see, for example, Facca and Lanzi, Mining interesting knowledge from weblogs: a survey, from 2004 (published in 2005). Whereas in the past it was mainly weblogs (not web blogs!) that were used, i.e. log files written by the server, today we have the opportunity to use much "better" data through Google Analytics & Co.

Reintroducing: Association Rules

But what exactly is better? In GA & Co., we can better distinguish people from bots (of which there are more than you think), returning visitors are recognized more reliably, devices are identified, and so on. The question is whether you really need this additional data for basic data-driven personas. Association rules, which I have already written about in a post about clustering with Google Analytics and R and which are also mentioned by Facca and Lanzi, can already identify basic groups of users. (I mentioned in the other article that I once worked for one of the creators of the algorithm, Tomasz Imielinski. One anecdote about him: in a meeting, he once said to me that you often think something is a low-hanging fruit, a quick win, but, "Tom, often enough, the low hanging fruits are rotten." He has been right so many times.) The groups identify themselves through common behavior, for example the co-occurrence of page views. In R, this works wonderfully with the arules package and the apriori algorithm it contains.
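As a small, self-contained illustration of this with the arules package; the page names and sessions are invented stand-ins for the real analytics transactions:

```r
library(arules)

# Each "transaction" is the set of pages one visitor viewed
sessions <- list(
  c("home", "courses"),
  c("home", "courses"),
  c("scalable-capital-1", "scalable-capital-2"),
  c("scalable-capital-1", "scalable-capital-2"),
  c("google-trends")
)
trans <- as(sessions, "transactions")

# Low thresholds, as in the post: support and confidence at 0.01
rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.01, minlen = 2))
inspect(sort(rules, by = "lift"))
```

Even in this toy example, the one-page "google-trends" session produces no rule at all, which foreshadows the central problem discussed below.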

Data-driven personas with Google Analytics & Co.

As already mentioned in the earlier article: a standard installation of Google Analytics is not sufficient (it never is anyway). Either you have the 360 variant, or you "hack" the free version ("hack" in the sense of "tinkering", not "being a criminal") and pull the data via the API. With Adobe Analytics, the data can be pulled from the data warehouse or via an API. Simply switching on Google Analytics and deriving personas from it is therefore not possible with this approach. You also have to think about which GA dimension, next to the Client ID, best represents a transaction. This can vary greatly from website to website. And if you want to be very thorough, a page view alone may not be enough of a signal.

However, this post is first of all about the visualization and about the limitations of the apriori approach for the automated generation of data-driven personas. For the visualization, I work with the arulesViz package. The resulting graphics are not easy to interpret, as I have experienced at the HAW, but also with colleagues. Below we see the visualization of association rules obtained from the data of this page, with the GA dimension pagePathLevel1 (which unfortunately also matches one of my article titles). One thing stands out: I can really only identify two groups here, and that's pretty poor.

What exactly do we see here? We see that users who are on the homepage also go to the Courses section and vice versa. The lift is high here, the support less so. And then we see users moving between my four articles about Scalable Capital, with roughly the same low lift but different levels of support. Lift is the factor by which the co-occurrence of two items exceeds the co-occurrence that would be expected if they were independent of each other. Support is the frequency. Support was set to 0.01 when creating the association rules, as was confidence. For details, see my first article.
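A worked mini-example of the lift definition (the numbers are made up): if A occurs in 20 of 100 sessions, B in 10, and both together in 8, then:

```r
supp_A  <- 20 / 100   # support of A
supp_B  <- 10 / 100   # support of B
supp_AB <-  8 / 100   # support of A and B together

lift <- supp_AB / (supp_A * supp_B)
lift  # 4: A and B co-occur 4x more often than independence would predict
```

A lift of 1 would mean the two pages are visited independently of each other; values well above 1 are what make a rule interesting.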

But why don’t I see any other pages here? My article about Google Trends is a very frequently read article, as is the one about the Thermomix or AirBnB. So it’s not because there aren’t more user groups. The disadvantage of this approach is simply that users have to have visited more than one page for a rule to arise here at all. And since some users come via a Google search and apparently have no interest in a second article, because their need for information may already be satisfied or because I don’t advertise it well enough, apparently only students and those interested in Scalable Capital can be identified here in these rules.

Ways out of the apriori dilemma?

So far, I’ve identified three ways to solve this dilemma, and all of them require extra work:

  • I test whether I can get users to view more than one page through a better relevant offer, for example with Google Optimize, and if successful, I get better data.
  • I use the apriori data only as a base and merge it with other data (also very nice, but I won't cover it here).
  • I lower the support and confidence.

The first approach is the most elegant, in my opinion, but it requires time and brains. And there is no guarantee that anything will come of it. The last approach is unpleasant, because we are then dealing with cases that occur less frequently and are therefore not necessarily reliable. With a support of 0.005, the visualization looks different:

But again I have the problem that the individual pages do not appear. It seems to be extremely rare that someone moves from the Google Trends article to another article, so lowering the support value didn't help. From experience, I can say that this problem shows up more or less strongly on most sites I see, but it always shows up somehow. The tricky part is that once you can read a few good personas out of the rules, you are inclined not to look at the rest, even though that rest could be very large.

We also see another problem in the graphic: the users in the right-hand strand are not necessarily the same from arrow to arrow. In other words, it does not follow that visitors who look at photography pages and courses also look at the publications, even if the visualization suggests it. A→B and B→C do not imply A→C here! To solve this, the association rules in the visualization would need an exclusionary marking. That does not exist yet and would be a task for the future.

Result

The path via association rules is exciting for the creation of data-driven personas with Google Analytics or other web analytics tools. However, it will usually not be sufficient at the moment, because a) the problem of one-page visitors is not solved here, b) the rules do not provide sufficient information about different groups that merely overlap, and c) it can only say something about those groups that are already on the site anyway. I'm currently working on a) and b) on the side; I'm always happy to hear thoughts from outside.