Visualizing overlaps of ETFs in an UpSet diagram

Today, two topics I find particularly exciting come together: data analysis and visualization on the one hand, and finance on the other. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. It is equally fascinating, however, to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS ETF (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (NL0011683594) and an MSCI World ETF (IE00B4L5Y983).

The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before it can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new type of visualization that I’ve used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:

The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:

Yes, it takes a bit of effort to read, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of each dataset, with the Vanguard FTSE All-World High Dividend Yield having the most holdings (over 2,000). On the right-hand side, we see the overlaps. The dot at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] holds 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 stocks that no other ETF contains. In the third column, we see that these two ETFs share 486 stocks that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn't have thought that the Vanguard contains so many stocks that the MSCI World does not.

The VanEck allegedly has one holding that no other ETF contains, but that's not accurate: that entry was just cash. Otherwise, 81 of its 100 holdings are also included in the MSCI World, and all of its holdings are included in the Vanguard.
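For anyone who wants to build such a diagram themselves, here is a minimal sketch with the UpSetR package; the ticker vectors are hypothetical stand-ins for the cleaned holdings lists:

library(UpSetR)

# Hypothetical mini versions of the four cleaned holdings lists
holdings <- list(
  vanguard_ftse_hdy  = c("AAPL", "MSFT", "JNJ", "XOM", "T"),
  ishares_msci_world = c("AAPL", "MSFT", "NVDA", "JNJ"),
  ishares_stoxx_gsd  = c("XOM", "T"),
  vaneck_div_leaders = c("XOM", "JNJ", "T")
)

# fromList() converts the lists into the binary membership matrix UpSetR expects
upset(fromList(holdings), nsets = 4, order.by = "freq")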

It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.

ggplot2 and the New Pipe

Why doesn’t this code work?

mtcars |> ggplot(., aes(x = mpg, y = hp)) + geom_point()

The problem with the code above lies not in ggplot2 itself but in the dot placeholder (.). The base R pipe (|>) does not support the magrittr-style dot; it simply inserts the left-hand side as the first argument of the call on its right. The magrittr pipe (%>%), which dplyr re-exports, does support the dot, so the snippet works once you switch pipes. Here is the correct usage:

library(ggplot2)
library(dplyr)

mtcars %>%
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

Alternatively, you can keep the native pipe and simply drop the dot, since |> passes the left-hand side into the first argument (data) anyway:

library(ggplot2)

mtcars |>
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

Here, mtcars is piped directly into the first argument of ggplot(), which happens to be data, so no placeholder is needed at all.
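Since R 4.2, the native pipe also has its own placeholder, the underscore, which must be bound to a named argument. A minimal sketch:

library(ggplot2)

# The underscore placeholder (R >= 4.2) must be passed to a named argument
mtcars |>
  ggplot(data = _, aes(x = mpg, y = hp)) +
  geom_point()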

The Digital Analytics Association is history – and no one cares.

It was a bit surprising. I had recently been emailing with Jim Sterne about the German branch, and the DAA had also contributed a foreword to my web analytics book. It's a bit of a shame.

For those who don't know: the DAA was previously the WAA, the Web Analytics Association, and it created the most widely used definition of web analytics. That definition has long since disappeared from the website, though most researchers, who copy quotes from other papers anyway, didn't seem to care.

But how can such an organization shut down, despite the importance of data? One reason could be that many companies have installed Google Analytics & Co. but never actually use the data. My latest paper, which unfortunately isn't public yet, found that most users don't even realize that embedding the GA code alone isn't enough to become data-driven. And perhaps the DAA itself is partly to blame for never managing to make its relevance clear.

I had only been a member out of nostalgia in recent years. I had used my student status to lower the membership fees a bit.

The website is already no longer accessible.

Eternal November: Will Mastodon Suffer the Same Fate as Usenet?


Mastodon and the Fediverse had maintained a niche existence for many years until they were thrust into the spotlight by Musk’s acquisition of Twitter and the ensuing turbulence. Since then, the Mastodon community has not been growing like a hockey stick, as it’s called in investor jargon, but like a rocket. This is a big win for those who champion open-source principles. However, this rapid growth might also become a curse, and for several reasons.


1 Year of Not Buying: September Report


September was essentially a good month. The only new purchase was a pair of fingerless gloves: it sometimes gets a bit chilly in the office, but I didn't want to turn on the heating just yet.

Then there was the Braun Atelier investment, which I had already written about and am still very pleased with.

However, there's also an order I placed in September that won't arrive until December: the Kindle Scribe, which I might swap in for my reMarkable 2. Is the purchase necessary? Certainly not. I could print any article I want or need to read, and use a paper notebook. Can I work better and faster with paper tablets than with paper? Definitely. I have already described in the article what I hope to achieve with the Scribe. If it doesn't meet my expectations, it will go back. Since I use my reMarkable multiple times a day, its cost per use is very low. In the end, it's about considering beforehand whether a technology actually improves something, or whether it just serves blind consumption.

How to change the language in R


R often comes in a wonderful mix of languages, as seen in the screenshot:

The simplest way to change that:

Sys.setenv(LANG = "de")

Then the R session will be in German. But only for this one session! If you want to change the language permanently, you can specify a preference in the .Renviron file (the file must be created if it doesn’t already exist). To do this, open the terminal and type:

vi .Renviron

Press "i" to enter insert mode and type the following:

LANG="de"

Then press ESC and type ":wq!" (without the quotes) to save and quit. Restart R, and it will stay in German.
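If you'd rather avoid vi, the same line can be appended from within R; a minimal sketch, assuming .Renviron lives in your home directory:

# Append the language preference to ~/.Renviron (creates the file if missing)
cat('LANG="de"\n', file = file.path(Sys.getenv("HOME"), ".Renviron"), append = TRUE)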

Apple MacBook Pro M1 Max – Is it worth it for Machine Learning?


Another new MacBook? Didn't I just buy the Air? Yes, but it's still under warranty, so it makes even more sense to sell it now. I'm a big fan of the Air form factor and have never quite warmed up to the Pro models. However, the 16 GB RAM ceiling of the MacBook Air was hard to accept at the time; there were simply no alternatives. So, on the evening the new MacBook Pros with M1 Pro and M1 Max were announced, I immediately ordered one: a 14″ MacBook Pro M1 Max with 10 CPU cores, 24 GPU cores, a 16-core Neural Engine, 64 GB of RAM (!!!), and a 2 TB drive. My MacBook Air has 16 GB of RAM and the first M1 chip with 8 cores.

Why 64 GB of RAM?

I regularly work with large datasets, ranging from 10 to 50 GB. But even a 2 GB file can cause issues, depending on what kind of data transformations and computations you perform. Over time, using a computer with little RAM becomes frustrating. While a local installation of Apache Spark helps me utilize multiple cores simultaneously, the lack of RAM is always a limiting factor. For the less technically inclined among my readers: Data is loaded from the hard drive into the RAM, and the speed of the hard drive determines how fast this happens because even an SSD is slower than RAM.

However, if there isn’t enough RAM, for example, if I try to load a 20 GB file into 16 GB of RAM, the operating system starts swapping objects from the RAM to the hard drive. This means data is moved back and forth between the RAM and the hard drive, but the hard drive now serves as slower “RAM.” Writing and reading data from the hard drive simultaneously doesn’t speed up the process either. Plus, there’s the overhead, because the program that needs the RAM doesn’t move objects itself—the operating system does. And the operating system also needs RAM. So, if the operating system is constantly moving objects around, it also consumes CPU time. In short, too little RAM means everything slows down.

At one point, I considered building a cluster myself. There are some good guides online on how to do this with inexpensive Raspberry Pis, and it can look cool, too. But I have little time. I might still do it at some point, if only to try it out. Just to do the math: 8 Raspberry Pis with 8 GB of RAM each, plus accessories, would probably cost me close to €1,000. Plus, I'd have to learn a lot of new things. So: postponed, not abandoned.

How did I test it?

To clarify, I primarily program in R, a statistical programming language. Here, I have two scenarios:

  • An R script running on a single core (not parallelized).
  • An R script that’s parallelized and can thus run on a cluster.

For the cluster, I use Apache Spark, which works excellently locally. For those less familiar with the tech: with Spark, I can create a cluster in which computational tasks are split up and sent to individual nodes for processing, which allows for parallel execution. I can either build a cluster from multiple computers (which requires sending the data over the network), or I can install the cluster locally and use the cores of my CPU as the nodes. A local installation has the huge advantage of no network latency.
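For the curious, a minimal sketch of such a local setup with the sparklyr package; the memory value is an assumption and should be adapted to your machine:

library(sparklyr)

# Connect to a local Spark instance that uses 9 of the 10 CPU cores as workers
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "48G"  # assumption: leave RAM headroom for the OS

sc <- spark_connect(master = "local[9]", config = config)

# ... run dplyr verbs against the tables registered with sc ...

spark_disconnect(sc)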

For those who want to learn more about R and Spark, here is the link to my book on R and Data Science!

For the first test, a script without parallelization, I use a famous dataset from the history of search engines, the AOL data. It contains 36,389,575 rows, just under 2 GB. Many generations of my students have worked with this dataset. In this script, the search queries are broken down, the number of terms per query is calculated, and correlations are computed. Of course, this could all be parallelized, but here, we’re just using one core.
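To give a flavor of this single-core work, a minimal sketch with hypothetical data (the real script does considerably more):

library(dplyr)
library(stringr)

# Hypothetical stand-in for the 36-million-row AOL file
aol <- data.frame(query = c("car insurance", "weather", "best pizza near me"))

aol |>
  mutate(
    num_terms = str_count(query, "\\S+"),  # terms per query
    query_len = nchar(query)               # characters per query
  ) |>
  summarise(correlation = cor(num_terms, query_len))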

For the second test, I use a nearly 20 GB dataset from Common Crawl (150 million rows and 4 columns) and compare it with data from Wikipedia, just under 2 GB. Here, I use the previously mentioned Apache Spark. My M1 Max has 10 cores, and even though I could use all of them, I’ll leave one core for the operating system, so we’ll only use 9 cores. To compare with the M1 in my MacBook Air, we’ll also run a test where the M1 Max uses the same number of cores as the Air.

How do I measure? There are several ways to measure, but I choose the simplest one: I look at what time my script starts and when it ends, then calculate the difference. It’s not precise, but we’ll see later that the measurement errors don’t really matter.
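In R, this can be as simple as the following sketch (the script name is hypothetical):

# Wall-clock timing: note the start, run the work, take the difference
start <- Sys.time()
source("aol_analysis.R")  # hypothetical name for the analysis script
Sys.time() - start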

Results: Is it worth it?

It depends. The first test is somewhat disappointing. The larger RAM doesn’t seem to make much of a difference here, even though mutations of the AOL dataset are created and loaded into memory. The old M1 completes the script in 57.8 minutes, while the M1 Max takes 42.5 minutes. The data are probably loaded into RAM a bit faster thanks to the faster SSDs, but the difference is only a few seconds. The rest seems to come from the CPU. But for this price, the M1 Max doesn’t justify itself (it’s twice as expensive as the MacBook Air).

Things get more interesting when I use the same number of cores on both sides for a cluster and then use Spark. The differences are drastic: 52 minutes for the old M1 with 16 GB of RAM, 5.4 minutes for the new M1 Max with 64 GB of RAM. The “old” M1, with its limited RAM, takes many minutes just to load the large dataset, while the new M1 Max with 64 GB handles it in under 1 minute. By the way, I’m not loading a simple CSV file here but rather a folder full of small partitions, so the nodes can read the data independently. It’s not the case that the nodes are getting in each other’s way when loading the large file.
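In sparklyr, reading such a folder of partitions looks roughly like this; the path and table name are hypothetical:

# Point Spark at a directory of partition files rather than one monolithic CSV,
# so the nodes can read their shares independently
cc <- spark_read_csv(sc, name = "common_crawl", path = "data/common_crawl_parts/")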

When “Free” eventually turns into subscriptions: tado and reMarkable


On October 13, 2021, reMarkable announced that the previously free cloud service would be limited and that the really interesting features would be put behind a subscription for new users. I had suspected this would happen, just as I had with tado. tado announced a subscription in August 2018 but backtracked for its early customers. While I had to purchase the new app for about 20 euros to use the new features, at least I don't have to pay any subscription fees.

With both companies, I wondered why they didn't include a subscription model from the start, because in both cases it was clear that costs would grow as more users accessed the servers. For reMarkable, the costs are even higher, since they offer 8 GB of cloud storage. It should have been obvious from the beginning that a subscription would have to be introduced at some point to offset the costs of a growing user base. Did both companies avoid a subscription because they thought it might deter buyers? Aren't the first customers usually early adopters who are less price-sensitive?

I sold my reMarkable a few months ago, not because of the impending subscription model, but because I simply want fewer gadgets, and it didn't fit into my workflow. At the end of the day, the reMarkable is a niche product, because in a time when distraction is either sought out or finds us anyway, the desire for focus exists only in a small number of users. Even though I think it's a great product, I don't believe it will ever be widely adopted by the masses.

Digital Minimalism


I’ve been blogging about minimalism for 15 years. Digital minimalism is another form of conscious consumption. And apparently, the topic strikes a chord, because otherwise, Der Spiegel wouldn’t have locked an interview about it behind their paywall (“How to Break Free from Your Smartphone”):

Cal Newport’s Digital Minimalism contrasts with his other bestseller Deep Work in that it doesn’t focus on work and productivity, but rather on our entire lives and the impact technology has on them.

Daniel Levitin's The Organized Mind showed how easily our brains are distracted, while Newport presents the other side: the attention economy, which primarily benefits those who can successfully market our time to advertisers. Anyone who thinks this is a modern phenomenon is mistaken; it started with the penny press of the 1830s, when a newspaper's customers were no longer its readers but its advertisers.

According to Newport, the unconscious use of social media leads to exhaustion, anxiety, depression, and, above all, a waste of life’s time. Arguments such as social media helping us stay in touch with friends and family are countered by the point that this is not high-quality interaction—more “connection” than “conversation,” as Sherry Turkle distinguishes. When relationships are less digital (or when digital communication is used only to facilitate traditional communication), they are actually strengthened. The time we spend on Facebook and similar platforms is not only spent on lower-quality conversations but also on mindless scrolling through updates that give us the illusion of connection while leaving us feeling lonely.

However, Newport doesn’t advocate for completely abandoning technology. Instead, he encourages us to adopt a different attitude toward it, and he even paraphrases Dieter Rams’ phrase “Less, but better.” This leads to what he calls a digital declutter. Ironically, it was Steve Jobs, who was focused on mindfulness, who ensured that we carry around with us the symbol of constant connectivity in the form of the iPhone. His original goal was simply to have one device for both phone and iPod. The fact that the iPhone could also access the internet wasn’t even mentioned until late in the original keynote. We weren’t prepared for it. And suddenly, there was an app for everything:


We didn’t have time to think about what we truly wanted to get out of these new technologies (and even if we did think about it, like I did back then when I didn’t want a Blackberry, we later found too many reasons why a Blackberry might actually be a good idea). And since then, the door has been wide open for those who want to shape new habits in us:

Nir Eyal's bestseller Hooked describes this mechanism in great detail. To be fair, Eyal also wrote the antidote to it. However, Eyal's approach is not as elegant as Newport's; it deals more with the symptoms, even though it sometimes touches on the causes. Where Eyal says that you can take back your time, Newport suggests first considering what you want to fill it with. The mechanisms both describe, however, are the same.

Every post for which we might get a like or retweet is, for us, the same as using a slot machine—it triggers a dopamine release. The goal is to regain our autonomy and join the attention resistance, as Newport puts it. Digital minimalism, for him, is:

A philosophy of technology use where you focus your online time on a small number of carefully selected and optimized activities that strongly support what matters to you, while ignoring everything else. (my translation)

To this, he quotes Thoreau in Walden:

The cost of a thing is the amount of what I will call life which is required to be exchanged for it, immediately or in the long run.

This, of course, applies not only to social media & co. When someone buys a new sports car, they also need to consider how much life energy goes into working for that car and whether it’s worth it to drive a sports car in exchange for that time. The profit gained from something must be weighed against the cost of the life energy needed to obtain it. Conversely, technology should be considered optional, as long as its temporary absence doesn’t cause the collapse of one’s (work) life. Can I no longer live a meaningful life without an app, or does the app simply provide some added value that I could get elsewhere? This is in stark contrast to FOMO. To truly understand which technologies are genuinely valuable, Newport suggests a 30-day break.

As I write this, I am on the 8th day of my digital break. Newport makes this break seem so appealing that I started it before I had even finished the book. While reading, I paused my Facebook profile and deactivated my Twitter account (dangerous, because after 30 days it is permanently deleted). Instagram is gone, as are Telegram and WhatsApp. All these channels through which people could reach me had been annoying me for a while anyway. Then I even uninstalled my email app. Make Time suggested these exact steps already, but for Newport it's not about dogmatically banning all these apps; it's about understanding what you truly miss.

In fact, I only bought my last phone because I wanted the best camera and didn’t want to carry around another camera. I enjoy listening to music. And occasionally, I like to make calls. So, here’s what my home screen looks like now:

Of the slot machines, only Signal, Apple Messages, and Safari (as a browser) are still installed. Everything else is either essential (banking, for example, is no longer possible without a phone) or helpful (e.g., the Corona app). Additionally, I've now made Do Not Disturb the default. Only my favorites can still call me; their messages won't come through either. However, this isn't just about distraction.

Solitude (here better understood as seclusion rather than loneliness) requires the ability to remain undisturbed and not have to react to everything: a freedom from input from other minds. Seclusion demands that we come to terms with ourselves when we are alone, which is what allows for deep thinking. Our desire for social interaction must therefore be complemented by periods of solitude. How much this topic occupied me as far back as 14 years ago is shown in this blog post from 2007.

But Newport goes even further. It’s not just about stopping certain activities; as mentioned earlier, you must also consider what to do with the time and attention you’ve gained, so you don’t fall into a void. The smartphone allows us to quickly escape such moments. Most systems already track how often that happens:

(On that day, I made several changes to the configuration, which is why the value was so high.)

Newport suggests writing letters to oneself, engaging in real conversations (instead of chatting, liking, or commenting), joining an offline group, or doing something non-digital with your hands. I'm out when it comes to crafting, but at least I can play instruments. The point is to deliver your best work, quoting Gary Rogowski:

Leave good evidence of yourself. Do good work.

Not using Facebook and the like should therefore not be seen as a sign that you’ve become some kind of eccentric. It should be viewed as a bold act of resistance against the attention economy. This has become increasingly difficult, as we now carry a full-fledged computer with us at all times and actively have to seek ways to limit its possibilities. Still, my reMarkable is one of my favorite work tools. I can’t check emails on it or quickly look something up. And more and more, I’m only using that device when I sit on the couch in the evening.

Furthermore, Newport’s approach means that you need to carefully examine where you still want to gather information. I reactivated Twitter after a week, but unfollowed almost everyone because I only want to follow those who truly provide valuable content. And that’s a very small number. A kind of information diet, so to speak.

I’m not at the point yet where I want to have a Light Phone or leave my smartphone at home most of the time (the Light Phone isn’t available in Germany yet). For now, the camera in my phone is still too important to me (I used to always carry a Fujifilm X100, whose battery was always dead at the critical moments). But after just a few days of digital break, I can already hardly imagine going back to Facebook or spending the first minutes of my day reading through feeds.