Today, two topics I find particularly exciting come together: data analysis and visualization, and finance. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. However, it’s equally fascinating to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (NL0011683594) and an MSCI World ETF (IE00B4L5Y983).
The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before it can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new type of visualization that I’ve used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:
The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:
Yes, it takes a bit of effort, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of the datasets, with the Vanguard FTSE All-World High Dividend Yield having the most holdings (over 2,000). On the right-hand side, we see the overlaps. The dot at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] holds 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 stocks that no other ETF contains. In the third column, we see that these two ETFs share 486 stocks that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn’t have thought that the Vanguard contains so many stocks that the MSCI World does not.
The VanEck allegedly has one stock that no other ETF contains, but that’s not accurate; that entry was just cash. Otherwise, 81 of its 100 stocks are also included in the MSCI World, and all of them are included in the Vanguard.
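For anyone who wants to build such a diagram themselves: here is a minimal sketch with the UpSetR package (my notebook may do this differently), assuming each ETF’s holdings have already been reduced to a vector of ISINs. The ISINs and list names below are just placeholders.

library(UpSetR)

# Placeholder data: in practice, each vector holds the ISINs read from the provider's holdings file
holdings <- list(
  Vanguard_FTSE_All_World_High_Div = c("US0378331005", "US5949181045", "US4581401001"),
  iShares_Select_Dividend_100      = c("US0378331005", "US88160R1014"),
  VanEck_Dividend_Leaders          = c("US0378331005", "US5949181045"),
  iShares_Core_MSCI_World          = c("US0378331005", "US5949181045", "US88160R1014")
)

# fromList() turns the list of ISIN vectors into the 0/1 membership matrix that upset() expects
upset(fromList(holdings), nsets = length(holdings), order.by = "freq")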
It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.
Some tools online offer the ability to see how many dividends are likely to come your way. For example, extraETF provides a tool where you can see what the dividends might look like based on an assumed growth rate (CAGR), a certain number of years, and asset gains.
What I haven’t seen so far is a tool that, starting from a portfolio, calculates the dividend growth based on an assumed CAGR and dividend yield, while also factoring in taxes. That’s exactly the kind of tool I’ve created.
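Just to give an idea of the arithmetic behind such a tool, here is a small sketch; the function name and all input values are merely examples (the German flat tax of roughly 26.375% serves as the tax rate), and reinvestment of the dividends is ignored.

# Sketch: projected net dividends per year for an assumed CAGR, yield, and tax rate
project_dividends <- function(portfolio_value, dividend_yield, cagr, tax_rate, years) {
  sapply(seq_len(years), function(y) {
    value_in_year  <- portfolio_value * (1 + cagr)^(y - 1)  # portfolio grows with the assumed CAGR
    gross_dividend <- value_in_year * dividend_yield        # dividends before tax
    gross_dividend * (1 - tax_rate)                         # net of the flat tax
  })
}

# Example: 100,000 EUR portfolio, 3% yield, 5% CAGR, ~26.375% flat tax, 10 years
project_dividends(100000, 0.03, 0.05, 0.26375, 10)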
The problem with the code above lies in the use of the pipe operator (|>) right before ggplot. ggplot2 does not play nicely with the native R pipe (|>) as it is used here. It does, however, work seamlessly with the magrittr pipe (%>%), which dplyr re-exports. Here is the correct usage:
library(ggplot2)
library(dplyr)

mtcars %>%
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()
Alternatively, the data must be explicitly passed to ggplot, as shown here:
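ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point()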
Three years ago, for the 15th anniversary of this blog, I moved from WordPress to Hugo. Super-fast pages, everything in R, a cool thing in principle. But in practice, it wasn’t that great. I always needed an R environment, which I didn’t always have. Git drove me crazy at times. And some problems were simply impossible to troubleshoot. So now I’ve moved back again. Maybe the rankings I lost after the move will come back as well.
Another new MacBook? Didn’t I just buy the Air? Yes, but it’s still under warranty, so it makes even more sense to sell it. I’m a big fan of the Air form factor and have never quite warmed up to the Pro models. The 16 GB RAM limit of the MacBook Air was hard to accept at the time, but there was no alternative. So, on the evening the new MacBook Pros with M1 Pro and M1 Max were announced, I immediately ordered one: a 14″ MacBook Pro M1 Max with 10 CPU cores, 24 GPU cores, a 16-core Neural Engine, 64 GB of RAM (!!!), and a 2 TB drive. My MacBook Air has 16 GB of RAM and the first M1 chip with 8 cores.
Why 64 GB of RAM?
I regularly work with large datasets, ranging from 10 to 50 GB. But even a 2 GB file can cause issues, depending on what kind of data transformations and computations you perform. Over time, using a computer with little RAM becomes frustrating. While a local installation of Apache Spark helps me utilize multiple cores simultaneously, the lack of RAM is always a limiting factor. For the less technically inclined among my readers: Data is loaded from the hard drive into the RAM, and the speed of the hard drive determines how fast this happens because even an SSD is slower than RAM.
However, if there isn’t enough RAM, for example, if I try to load a 20 GB file into 16 GB of RAM, the operating system starts swapping objects from the RAM to the hard drive. This means data is moved back and forth between the RAM and the hard drive, but the hard drive now serves as slower “RAM.” Writing and reading data from the hard drive simultaneously doesn’t speed up the process either. Plus, there’s the overhead, because the program that needs the RAM doesn’t move objects itself—the operating system does. And the operating system also needs RAM. So, if the operating system is constantly moving objects around, it also consumes CPU time. In short, too little RAM means everything slows down.
At one point, I considered building a cluster myself. There are some good guides online about how to do this with inexpensive Raspberry Pis. It can look cool, too. But I have little time. I might still do this at some point, if only to try it out. Just for the math: 8 Raspberry Pis with 8 GB of RAM plus accessories would probably cost me close to €1,000. Plus, I’d have to learn a lot of new things. So, putting it off isn’t the same as giving up.
My benchmark consists of two scripts:
An R script running on a single core (not parallelized).
An R script that’s parallelized and can thus run on a cluster.
For the cluster, I use Apache Spark, which works excellently locally. For those less familiar with the tech: with Spark, I can create a cluster in which computational tasks are split up and sent to individual nodes for processing, which allows parallel processing. I can either build a cluster from multiple computers (which requires sending the data over the network), or I can install Spark locally and use the cores of my CPU as the nodes. A local installation has the huge advantage of no network latency.
For the first test, a script without parallelization, I use a famous dataset from the history of search engines, the AOL data. It contains 36,389,575 rows, just under 2 GB. Many generations of my students have worked with this dataset. In this script, the search queries are split into their terms, the number of terms per query is calculated, and correlations are computed. Of course, this could all be parallelized, but here we’re just using one core.
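To give an impression of the kind of operations involved (this is not the original script), here is a sketch with dplyr and stringr; the column names are assumptions based on the AOL file format, and the tiny placeholder table stands in for the real 36 million rows.

library(dplyr)
library(stringr)

# Placeholder instead of the real AOL file; column names are assumptions
aol <- tibble::tibble(
  Query    = c("new york hotels", "cheap flights", "weather berlin"),
  ItemRank = c(1, 3, 2)
)

aol %>%
  mutate(n_terms = str_count(Query, "\\S+")) %>%                  # number of terms per query
  summarise(correlation = cor(n_terms, ItemRank, use = "complete.obs"))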
For the second test, I use a nearly 20 GB dataset from Common Crawl (150 million rows and 4 columns) and compare it with data from Wikipedia, just under 2 GB. Here, I use the previously mentioned Apache Spark. My M1 Max has 10 cores, and even though I could use all of them, I’ll leave one core for the operating system, so we’ll only use 9 cores. To compare with the M1 in my MacBook Air, we’ll also run a test where the M1 Max uses the same number of cores as the Air.
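For the curious, here is a sketch of how such a local cluster can be configured with sparklyr (my actual setup may differ in the details; the memory value and the path are placeholders):

library(sparklyr)

conf <- spark_config()
conf$`sparklyr.cores.local` <- 9              # use 9 of the 10 cores, leaving one for the OS
conf$`sparklyr.shell.driver-memory` <- "48G"  # example value; adjust to the available RAM

sc <- spark_connect(master = "local", config = conf)

# Read a folder of partitions so the nodes can pick them up independently (path is a placeholder)
cc <- spark_read_csv(sc, name = "common_crawl", path = "data/common_crawl_parts/")

spark_disconnect(sc)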
How do I measure? There are several ways to measure, but I choose the simplest one: I look at what time my script starts and when it ends, then calculate the difference. It’s not precise, but we’ll see later that the measurement errors don’t really matter.
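In R this boils down to something like:

start <- Sys.time()
# ... the actual analysis runs here ...
end <- Sys.time()
difftime(end, start, units = "mins")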
Results: Is it worth it?
It depends. The first test is somewhat disappointing. The larger RAM doesn’t seem to make much of a difference here, even though modified copies of the AOL dataset are created and held in memory. The old M1 completes the script in 57.8 minutes, while the M1 Max takes 42.5 minutes. The data is probably loaded into RAM a bit faster thanks to the faster SSD, but the difference is only a few seconds; the rest seems to come from the CPU. At this price, the M1 Max doesn’t justify itself (it’s twice as expensive as the MacBook Air).
Things get more interesting when I use the same number of cores on both sides for a cluster and then use Spark. The differences are drastic: 52 minutes for the old M1 with 16 GB of RAM, 5.4 minutes for the new M1 Max with 64 GB of RAM. The “old” M1, with its limited RAM, takes many minutes just to load the large dataset, while the new M1 Max with 64 GB handles it in under 1 minute. By the way, I’m not loading a simple CSV file here but rather a folder full of small partitions, so the nodes can read the data independently. It’s not the case that the nodes are getting in each other’s way when loading the large file.
A general book on the basics of data analysis with R; a hefty tome, intended more as a general introduction. But I would always recommend Hadley’s book.
As a general introduction to statistics, I recommend The Art of Statistics or Naked Statistics, both very good and entertaining books (not only for statisticians).
For the 15th anniversary of this blog, there is not only a redesign, but also new technology under the hood:
Blogdown makes it possible to create a Hugo-based website with R. So I can write my blog in my favorite language and no longer have to think about how to get my R code or graphics into WordPress every time.
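The basic workflow looks roughly like this (theme and post title are just placeholders):

library(blogdown)

# Create a new Hugo site once; the theme is only an example
new_site(theme = "yihui/hugo-lithium")

# Write a new post as R Markdown and preview the whole site locally
new_post("A new post", ext = ".Rmd")
serve_site()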
The output is pure HTML. On the one hand, this removes the dependency on databases, WordPress plugins, and so on; on the other hand, and this is extremely important for me, it enables super-fast delivery of content (see the screenshot of Google PageSpeed Insights below, taken during tests with my preview server). I no longer get mails because someone tried to break into my server or because the database server needs too much memory.
The workflow is almost fully automatic: I write my texts in RStudio, commit a version to GitHub, and from there it is automatically deployed to my web server. If I mess something up, I simply go back to an earlier commit. Continuous deployment is probably what this is called nowadays.
In 2005 I built the first version of this blog with Movable Type. Even back then it produced static HTML pages, which did not have to be generated in real time when a page was accessed. I don’t remember when or why I switched to WordPress; maybe because there were more plugins available. I was on WordPress for at least 10 years and got enormously annoyed more than once.
The switch to Hugo was not fully automated. The WordPress to Hugo Exporter was the only one of the popular options that more or less worked for me. You need as much free disk space as the blog currently occupies on disk and in the database, because everything is written out again as flat files. The error messages the script spits out are not helpful in identifying this rather predictable problem. On top of that, not all pages were converted correctly, so I had to touch up almost every page again.
Going through the old posts, I noticed what has changed since 2005:
People and blogs that I miss (it feels like there are hardly any real blogs left),
Topics that either still interest me today or that make me wonder how they could ever have interested me at all,
and lots of links to external sites that are simply dead, even though the sites still exist.
The web is no more static than we are, and these 15 years are a nice documentation of my own different stages.
If you want to use R to access APIs automatically, authorization via the browser is not an option. The solution is a service account: with a service account and the associated JSON file, an R program can access the Google Analytics API, the Google Search Console API, and all the other wonderful machine learning APIs as well. This short tutorial shows what needs to be done to connect to the Google Search Console.
First, you create a project if you don’t have a suitable one yet, and then the appropriate APIs have to be enabled. After a short search, you will find the Google Search Console API, which just needs to be activated.
Click on IAM & admin and then on Service accounts (it’s highlighted a bit oddly in this screenshot):
Click Create Service Account:
Important: give the e-mail address a meaningful name so that you can keep track of it later…
The “Browse Project” role is sufficient for this step:
Click on “Create Key” here:
Create a JSON key, download it and then put it in the desired RStudio directory.
Now the service account must be added as a user in the Search Console; the important thing is that it is given full permissions.
What’s really great about the Google Search Console API is that you can see the query and the landing page at the same time, unlike in the GUI. By the way, I fetch the data every day and write it to a database, so I have a nice history that goes beyond the few months that Search Console itself retains.
Last but not least, I also provide the R notebook with which I query the data; it’s basically the code written by Mark Edmondson, the author of the searchConsoleR package, included here for the sake of completeness to show how the JSON file is used. There is a more elegant variant with R environment variables, but I don’t know whether it works on Windows.
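A minimal sketch of what that looks like (site URL and file name are placeholders; depending on the package versions, the authentication call may differ):

library(googleAuthR)
library(searchConsoleR)

# Authenticate with the service account's JSON key instead of the browser flow
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/webmasters")
gar_auth_service(json_file = "my-service-account.json")   # file name is a placeholder

# Query and landing page in one table, which the GUI does not offer
gsc_data <- search_analytics(
  siteURL    = "https://www.example.com",
  startDate  = Sys.Date() - 4,
  endDate    = Sys.Date() - 3,
  dimensions = c("query", "page")
)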