Visualizing overlaps of ETFs in an UpSet diagram

Today, two topics I find particularly exciting come together: data analysis and visualization, and finance. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. However, it’s equally fascinating to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS ETF (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (ISIN: NL0011683594) and an MSCI World ETF (ISIN: IE00B4L5Y983).

The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before the holdings can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new chart type that I’ve already used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:

The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:

Yes, it takes a bit of effort to read, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of each dataset, with the Vanguard FTSE All-World High Dividend Yield having the most holdings, over 2,000. On the right-hand side, we see the overlaps. The single dot at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] has 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 holdings that no other ETF contains. In the third column, we see that these two ETFs share 486 holdings that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn’t have thought that the Vanguard contains so many stocks that the MSCI World does not.

The VanEck supposedly has one holding that no other ETF contains, but that’s not accurate; that entry was just cash. Otherwise, 81 of its 100 holdings are also included in the MSCI World, and all of its holdings are included in the Vanguard.

It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.
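For anyone who wants to recreate this kind of chart, here is a minimal sketch using the UpSetR package. It assumes the cleaned holdings of each ETF are already available as character vectors of ISINs; the object names are placeholders, not the ones from my notebook.

library(UpSetR)

# One character vector of ISINs per ETF (placeholder objects)
holdings <- list(
  "Vanguard FTSE All-World High Dividend Yield" = vanguard_isins,
  "iShares STOXX Global Select Dividend 100" = ishares_select_div_isins,
  "VanEck Developed Markets Dividend Leaders" = vaneck_isins,
  "iShares Core MSCI World" = msci_world_isins
)

# fromList() builds the binary membership matrix that upset() expects;
# order.by = "freq" sorts the intersection bars by size, as in the chart above
upset(fromList(holdings), nsets = 4, order.by = "freq")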

New tool for a dividend strategy

Some online tools let you estimate how much dividend income is likely to come your way. extraETF, for example, provides a tool that projects your dividends based on an assumed growth rate (CAGR), a given number of years, and asset gains.

What I haven’t seen so far is a tool that, starting from a portfolio, calculates the dividend growth based on an assumed CAGR and dividend yield, while also factoring in taxes. That’s exactly the kind of tool I’ve created.
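The core of the calculation is simple. The sketch below is a simplified illustration, not my actual tool: it assumes a fixed dividend yield on the starting portfolio, a constant dividend CAGR, and a flat tax rate of 26.375% (the German withholding tax including the solidarity surcharge, ignoring allowances and church tax).

# Hypothetical helper, for illustration only
project_dividends <- function(portfolio_value, dividend_yield, cagr,
                              years, tax_rate = 0.26375) {
  gross <- portfolio_value * dividend_yield * (1 + cagr)^(0:(years - 1))
  data.frame(
    year  = seq_len(years),
    gross = gross,
    net   = gross * (1 - tax_rate)   # dividends after flat tax
  )
}

# Example: 100,000 euro portfolio, 3% dividend yield, 5% dividend growth, 10 years
project_dividends(portfolio_value = 100000, dividend_yield = 0.03,
                  cagr = 0.05, years = 10)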

ggplot2 and the New Pipe

Why doesn’t this code work?

mtcars |> ggplot(., aes(x = mpg, y = hp)) + geom_point()

The problem with the code above is not ggplot2 itself but the dot (.) passed as the first argument of ggplot(). The base R pipe (|>) simply inserts the left-hand side as the first argument of the call on its right and does not understand the dot placeholder; that convention belongs to the magrittr pipe (%>%), which dplyr re-exports. The dot isn’t even needed here, because the pipe already hands mtcars to ggplot()’s first argument, data. With the magrittr pipe, the usual pattern looks like this:

library(ggplot2)
library(dplyr)

mtcars %>%
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

Alternatively, the base pipe works just as well, provided the dot is dropped; |> passes mtcars into ggplot() as the data argument automatically:

library(ggplot2)

mtcars |>
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

If you do want an explicit placeholder, the dot (.) only works with %>%, where it stands for the data piped in from mtcars. Since R 4.2, the base pipe has its own placeholder, the underscore (_), which has to be supplied as a named argument.
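For completeness, here is a quick sketch of both placeholder variants; the underscore version assumes R 4.2 or newer.

library(ggplot2)
library(dplyr)

# magrittr pipe: the dot stands for the piped-in data
mtcars %>%
  ggplot(data = ., mapping = aes(x = mpg, y = hp)) +
  geom_point()

# base pipe: the underscore placeholder must be used as a named argument
mtcars |>
  ggplot(data = _, mapping = aes(x = mpg, y = hp)) +
  geom_point()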

From WordPress to Hugo and back again

Three years ago, for the 15th anniversary of this blog, I moved from WordPress to Hugo. Super-fast pages, everything in R: in principle a cool setup. In practice, though, it wasn’t that great. I always needed an R environment, which I didn’t always have. Git drove me crazy at times. And some problems were simply impossible to troubleshoot. So now I’ve moved back. Maybe the search rankings I lost after the move will come back as well.

Apple MacBook Pro M1 Max – Is it worth it for Machine Learning?


Another new MacBook? Didn’t I just buy the Air? Yes, and it’s still under warranty, so it makes even more sense to sell it now. I’m a big fan of the Air form factor and have never quite warmed up to the Pro models. However, the 16 GB RAM limit of the MacBook Air was hard to accept at the time, and there was no alternative. So, on the evening the new MacBook Pros with M1 Pro and M1 Max were announced, I immediately ordered one: a 14″ MacBook Pro M1 Max with 10 CPU cores, 24 GPU cores, a 16-core Neural Engine, 64 GB of RAM (!!!), and a 2 TB drive. My MacBook Air has 16 GB of RAM and the first M1 chip with 8 cores.

Why 64 GB of RAM?

I regularly work with large datasets, ranging from 10 to 50 GB. But even a 2 GB file can cause issues, depending on the transformations and computations you run on it. Over time, working on a machine with too little RAM becomes frustrating. A local installation of Apache Spark lets me use several cores at once, but the lack of RAM remains the limiting factor. For the less technically inclined among my readers: data is loaded from the drive into RAM, and the speed of the drive determines how fast that happens; even a fast SSD is far slower than RAM.

However, if there isn’t enough RAM, for example when I try to load a 20 GB file into 16 GB of RAM, the operating system starts swapping objects from RAM to the drive. Data is then shuffled back and forth between RAM and the drive, with the drive acting as slow substitute “RAM”. Having to read and write the drive at the same time slows things down further. On top of that comes the overhead: the program that needs the RAM doesn’t move the objects itself, the operating system does, and the operating system needs RAM and CPU time of its own for all that shuffling. In short, too little RAM and everything slows down.

At one point, I considered building a cluster myself. There are some good guides online on how to do this with inexpensive Raspberry Pis, and it can look pretty cool, too. But I have little time. I might still do it at some point, if only to try it out. Just to do the math: 8 Raspberry Pis with 8 GB of RAM each, plus accessories, would probably cost me close to €1,000, and I’d have to learn a lot of new things on top. So it’s postponed, not abandoned.

How did I test it?

To clarify, I primarily program in R, a statistical programming language. Here, I have two scenarios:

  • An R script running on a single core (not parallelized).
  • An R script that’s parallelized and can thus run on a cluster.

For the cluster, I use Apache Spark, which works excellently locally. For those less familiar with the tech: with Spark, I can create a cluster in which computational tasks are split up and sent to individual nodes for processing, which allows parallel processing. I can either build a cluster out of multiple computers (which means sending the data over the network), or I can install the cluster locally and use the cores of my CPU as the nodes. A local installation has the huge advantage of no network latency.
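A minimal sketch of such a local setup with sparklyr; the core count and driver memory are example values, not my exact configuration.

library(sparklyr)
library(dplyr)

# Local "cluster": use 9 of the 10 cores and give the driver generous memory
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "48g"

sc <- spark_connect(master = "local[9]", config = config)

# Tables copied or read into this connection can then be wrangled with
# dplyr verbs, and only the (small) results are collected back into R.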

For those who want to learn more about R and Spark, here is the link to my book on R and Data Science!

For the first test, a script without parallelization, I use a famous dataset from the history of search engines, the AOL data. It contains 36,389,575 rows, just under 2 GB. Many generations of my students have worked with this dataset. In this script, the search queries are broken down, the number of terms per query is calculated, and correlations are computed. Of course, this could all be parallelized, but here, we’re just using one core.
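The rough shape of that single-core script looks like this; it is a simplified sketch, the file name is a placeholder, and the correlation shown is only illustrative, not my exact analysis.

library(readr)
library(dplyr)

# The AOL logs are tab-separated; the "Query" column holds the search phrase
aol <- read_tsv("aol_searchlog.tsv")   # placeholder file name

aol <- aol |>
  mutate(
    n_terms = lengths(strsplit(Query, "\\s+")),  # number of terms per query
    n_chars = nchar(Query)                       # query length in characters
  )

# Example correlation: do longer queries also contain more terms?
cor(aol$n_terms, aol$n_chars, use = "complete.obs")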

For the second test, I use a nearly 20 GB dataset from Common Crawl (150 million rows and 4 columns) and compare it with data from Wikipedia, just under 2 GB. Here, I use the previously mentioned Apache Spark. My M1 Max has 10 cores, and even though I could use all of them, I’ll leave one core for the operating system, so we’ll only use 9 cores. To compare with the M1 in my MacBook Air, we’ll also run a test where the M1 Max uses the same number of cores as the Air.

How do I measure? There are more rigorous approaches, but I choose the simplest one: I note when the script starts and when it ends and take the difference. That isn’t precise, but as we’ll see, the measurement error doesn’t really matter at these magnitudes.
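In R, this amounts to little more than two calls to Sys.time():

start_time <- Sys.time()

# ... the actual analysis runs here ...

end_time <- Sys.time()
difftime(end_time, start_time, units = "mins")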

Results: Is it worth it?

It depends. The first test is somewhat disappointing. The larger RAM doesn’t seem to make much of a difference here, even though several transformed copies of the AOL dataset are created in memory. The old M1 completes the script in 57.8 minutes, the M1 Max in 42.5 minutes. The data is probably loaded into RAM a bit faster thanks to the faster SSD, but that difference amounts to only a few seconds; the rest seems to come from the CPU. At this price, though, the M1 Max doesn’t justify itself for this workload (it’s twice as expensive as the MacBook Air).

Things get more interesting when I use the same number of cores on both machines and run the job on Spark. The differences are drastic: 52 minutes for the old M1 with 16 GB of RAM, 5.4 minutes for the new M1 Max with 64 GB of RAM. The “old” M1, with its limited RAM, needs many minutes just to load the large dataset, while the new M1 Max with 64 GB handles it in under a minute. By the way, I’m not loading one big CSV file here but a folder full of small partitions, so the nodes can read the data independently; it’s not the case that the nodes get in each other’s way reading a single large file.
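Reading such a partitioned folder with sparklyr looks roughly like this; the path and table name are placeholders. Pointing spark_read_csv() at a directory lets the workers pick up the individual partition files in parallel instead of queuing behind a single reader.

library(sparklyr)

# sc is the local connection from above
common_crawl <- spark_read_csv(
  sc,
  name   = "common_crawl",
  path   = "data/common_crawl_parts/",  # folder of small CSV partitions
  memory = TRUE                         # cache the table in (ample) RAM
)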