Which Visualization for Which Data?

Communicating data, information, and the insights derived from them is a key skill. Data visualizations should help the audience grasp concepts more quickly, making it essential to choose the type of visualization that conveys the message most effectively. Even though Microsoft Excel might suggest a pie chart, it’s often not the best option, as you can see on the left!

At university and in my job, I constantly work with data visualizations. To spare everyone’s nerves, I’ve created a handy overview, inspired by the work of A. Abela:

I update the overview continuously. If you’re interested, feel free to sign up for my newsletter and get instant access to the overview (plus a monthly update).

Row Names in R

Some datasets in R use row names, such as the built-in mtcars dataset. While convenient, row names can be suboptimal when sorting data, for instance, by car brands:

To convert row names into a column using the Tidyverse, use the rownames_to_column() function from the tibble package:

library(tidyverse)

# Load mtcars and move the row names into a column
mtcars_tidy <- mtcars %>%
  rownames_to_column(var = "car_name")

And this is the result:
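
Once the names live in a regular column, sorting works like for any other variable. Here is a minimal sketch, assuming the tibble and dplyr packages are installed. The brand column, derived from the first word of car_name, is my own illustrative assumption and not part of mtcars:

```r
library(tibble)
library(dplyr)

mtcars_sorted <- mtcars %>%
  rownames_to_column(var = "car_name") %>%
  # Hypothetical helper column: treat the first word of the name as the brand
  mutate(brand = sub(" .*$", "", car_name)) %>%
  arrange(brand, car_name)

# Show the first rows, now ordered by brand and then by car name
head(mtcars_sorted[, c("brand", "car_name", "mpg")])
```

With row names alone, the same sort would need a detour through rownames(); as a column, the names can also be filtered, grouped, and joined.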

Visualizing overlaps of ETFs in an UpSet diagram

Today, two topics I find particularly exciting come together: data analysis and visualization, and finance. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. However, it’s equally fascinating to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (NL0011683594) and an MSCI World ETF (IE00B4L5Y983).

The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before it can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new type of visualization that I’ve used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:

The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:

Yes, it takes a bit of effort, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of the datasets, with the Vanguard FTSE All-World High Dividend Yield having the most holdings, over 2,000. On the right-hand side, we see the overlaps. The dot at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] has 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 stocks that no other ETF contains. In the third column, we see that these two ETFs share 486 stocks that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn’t have thought that the Vanguard contains so many stocks that the MSCI World does not.

The VanEck appears to have one stock that no other ETF contains, but that’s not accurate; that entry was just cash. Otherwise, 81 of its 100 stocks are also included in the MSCI World. All of its stocks are included in the Vanguard.
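
To make the reading of the diagram concrete, here is a minimal sketch in base R of what an UpSet diagram counts: the size of each set (the bars on the left) and the number of elements belonging to exactly one combination of sets (the vertical bars). The holdings below are made-up placeholder letters, not the real ETF data, and this is not the code from my notebook:

```r
# Made-up holdings for illustration only; not the real ETF data
holdings <- list(
  vanguard   = c("A", "B", "C", "D", "E"),
  msci_world = c("B", "C", "F"),
  vaneck     = c("C", "D")
)

# One row per stock, one logical column per ETF
all_stocks <- sort(unique(unlist(holdings)))
membership <- sapply(holdings, function(x) all_stocks %in% x)
rownames(membership) <- all_stocks

# Set sizes: the horizontal bars on the far left of the diagram
set_sizes <- colSums(membership)

# Exact combination of ETFs each stock belongs to, e.g. "vanguard & vaneck"
combo <- apply(membership, 1, function(row) {
  paste(names(row)[row], collapse = " & ")
})

# Exclusive intersection sizes: the vertical bars of the diagram
exclusive_counts <- sort(table(combo), decreasing = TRUE)
print(set_sizes)
print(exclusive_counts)

# With the UpSetR package installed, the same list can be plotted with:
# UpSetR::upset(UpSetR::fromList(holdings), order.by = "freq")
```

The important detail is that every stock is counted in exactly one combination, which is why the bar for a single ETF only shows the stocks that no other ETF holds.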

It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.

New tool for a dividend strategy.

Some online tools let you estimate how much dividend income is likely to come your way. For example, extraETF provides a tool where you can see what the dividends might look like based on an assumed growth rate (CAGR), a certain number of years, and asset gains.

What I haven’t seen so far is a tool that, starting from a portfolio, calculates the dividend growth based on an assumed CAGR and dividend yield, while also factoring in taxes. That’s exactly the kind of tool I’ve created.
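
The basic arithmetic behind such a projection can be sketched in a few lines of R. This is not the code of my tool, only a simplified illustration under stated assumptions: the function name project_dividends and all its parameters are hypothetical, dividends are not reinvested, the dividend grows at a constant rate, and taxes are modeled as a single flat rate without allowances.

```r
# Simplified sketch: project gross and net dividends per year.
# Assumptions (all hypothetical): constant dividend growth, no reinvestment,
# one flat tax rate, no tax-free allowance.
project_dividends <- function(portfolio_value, dividend_yield,
                              dividend_cagr, years, tax_rate) {
  year <- seq_len(years)
  # Year 1 pays portfolio_value * dividend_yield; each later year grows by the CAGR
  gross <- portfolio_value * dividend_yield * (1 + dividend_cagr)^(year - 1)
  net <- gross * (1 - tax_rate)
  data.frame(year = year, gross = gross, net = net)
}

# Example values chosen purely for illustration
projection <- project_dividends(
  portfolio_value = 100000,
  dividend_yield  = 0.03,
  dividend_cagr   = 0.05,
  years           = 10,
  tax_rate        = 0.25
)
print(projection)
```

A realistic calculation would also have to consider reinvested dividends, additional savings, and country-specific tax rules.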

ggplot2 and the New Pipe

Why doesn’t this code work?

mtcars |> ggplot(., aes(x = mpg, y = hp)) + geom_point()

The problem is not the pipe itself but the dot (.). The native R pipe (|>) always inserts the left-hand side as the first argument and does not know the dot placeholder, so the code above is evaluated as ggplot(mtcars, ., aes(x = mpg, y = hp)) and fails because R cannot find an object named ".". The dot is a feature of the magrittr pipe (%>%), which dplyr re-exports. With %>%, the code works, with or without the dot:

library(ggplot2)
library(dplyr)

mtcars %>%
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

Alternatively, stay with the native pipe. The simplest fix is to drop the dot, because the data is passed as the first argument anyway. If you want to pass the data explicitly, use the underscore placeholder (_), available since R 4.2.0:

library(ggplot2)

mtcars |>
  ggplot(data = _, aes(x = mpg, y = hp)) +
  geom_point()

Here, the underscore (_) represents the data being piped from mtcars into ggplot. Unlike the magrittr dot, it must be passed as a named argument, here data.
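
For completeness, here is a minimal sketch of two further ways to write this with the native pipe, assuming R 4.1 or later and an installed ggplot2. Because |> inserts the left-hand side as the first argument, no placeholder is needed at all; and if you prefer to give the data an explicit name, an anonymous function does the job. The object names p_first_arg and p_lambda are only for illustration:

```r
library(ggplot2)

# Variant 1: no placeholder; mtcars becomes the first argument (data)
p_first_arg <- mtcars |>
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

# Variant 2: anonymous function, so the piped data gets an explicit name (d)
p_lambda <- mtcars |>
  (\(d) ggplot(d, aes(x = mpg, y = hp)))() +
  geom_point()

# Both objects are ordinary ggplot objects; print one to draw the plot
print(p_first_arg)
```

In both variants the pipe binds more tightly than +, so the layers are added to the finished ggplot() call.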

The Digital Analytics Association is history – and no one cares.

It was a bit surprising. I had recently exchanged emails with Jim Sterne about the German branch. The DAA had also contributed a foreword to my web analytics book. It’s a bit of a shame.

For those who don’t know: The DAA was previously the WAA, the Web Analytics Association, and it created the most widely used definition of web analytics. Although that definition has long been missing from the website, most researchers who copy quotes from other papers didn’t seem to care.

But how is it possible that such an organization, despite the importance of data, is shutting down? One reason could be that many have installed Google Analytics & Co., but the data is not actually being used. My last paper, which unfortunately isn’t public yet, found that most users don’t even realize that embedding the GA code alone isn’t enough to work in a data-driven way. And maybe the DAA itself is partly to blame, because it didn’t manage to make its relevance clear.

I had only been a member out of nostalgia in recent years. I had used my student status to lower the membership fees a bit.

The website is already no longer accessible.

From WordPress to Hugo and back again

Three years ago, for the 15th anniversary of this blog, I moved from WordPress to Hugo. Super-fast pages, everything in R, actually a cool thing. But in reality, it wasn’t that great. I always needed an R environment, which I didn’t always have. Git drove me crazy at times. And some problems were just impossible to troubleshoot. So now I’ve moved back again. Maybe the rankings I lost after the move will return as well.

Artificial Intelligence (AI), Large Language Models (LLMs), Data Science, Machine Learning, Data Mining, and Statistics: What’s the difference?

The terms Artificial Intelligence (AI), Machine Learning, Data Science, Data Mining, Statistics, and Large Language Models (LLMs) are often used interchangeably or misunderstood. Clearly differentiating between these concepts helps you navigate discussions and make informed decisions in data-driven contexts.

Artificial Intelligence (AI)

AI encompasses techniques and algorithms that enable computers to perform tasks traditionally requiring human intelligence, such as reasoning, decision-making, and pattern recognition.

Machine Learning (ML)

ML is a subset of AI where systems learn from data to improve decision-making or predictions without explicit programming. Applications include recommendation engines, fraud detection, and image recognition.

Data Science

Data Science is an interdisciplinary field combining scientific methods, processes, and systems to extract actionable insights from data. It integrates domain expertise, statistical techniques, and data analysis skills to make informed business decisions.

Data Mining

Data Mining involves exploring large datasets to discover meaningful patterns, correlations, or trends. Common applications include customer segmentation, market basket analysis, and anomaly detection.

Statistics

Statistics forms the mathematical basis for Data Science and Machine Learning. It includes methods for collecting, analyzing, interpreting, and presenting data, ensuring rigorous analysis and reliable results.

Large Language Models (LLMs)

Large Language Models are specialized, advanced Machine Learning models that process and generate natural language text. They excel at tasks such as content summarization, text generation, language translation, and interactive dialogue (e.g., ChatGPT).

The Connection Between These Terms:

  • Artificial Intelligence is the overarching goal of creating systems that simulate human intelligence.
  • Machine Learning is a key approach to achieving AI through data-driven learning.
  • Data Science covers the broader methodology of turning data into actionable insights.
  • Data Mining focuses specifically on finding meaningful patterns in large datasets.
  • Statistics underpins these fields, providing the mathematical rigor needed for trustworthy analysis.
  • Large Language Models are an advanced application of Machine Learning, focusing on language understanding and generation.

Why clarity matters

While “Data Science” has dominated conversations in recent years, many discussions have now shifted towards AI and especially Large Language Models. However, even with the buzz around AI, it’s important to remember that successful projects often rely heavily on foundational Data Science and robust statistical methods. Clearly distinguishing these concepts allows you to harness the full potential of data-driven solutions and avoid common misconceptions.