Can you combine column charts and cumulative lines?

It’s not something I would have thought of myself, but a German magazine recently tried it – and ever since, I’ve been wondering whether this is a clever idea or actually a very bad one. ChatGPT kindly translated the chart for me; the translation isn’t entirely accurate, but that’s not the point. What matters here is the visualization approach. In short, the data is about partnerships between the public sector and private investors.

A quick reminder: the purpose of data visualization is to make a subject easier to understand. Ideally, it also carries an intention – something that sharpens our perspective, changes our opinion, or even nudges us toward action. So, what does this visualization achieve?
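For readers who want to experiment with the chart type themselves, here is a minimal ggplot2 sketch of the column-plus-cumulative-line idea. The data, labels, and scaling are invented for illustration and have nothing to do with the magazine's actual figures:

```r
library(ggplot2)
library(dplyr)

# Invented yearly values for illustration only
df <- data.frame(
  year  = 2015:2023,
  value = c(3, 5, 4, 7, 6, 9, 8, 11, 10)
)

df <- df %>% mutate(cumulative = cumsum(value))

ggplot(df, aes(x = year)) +
  geom_col(aes(y = value), fill = "steelblue") +
  # Cumulative line, rescaled to fit the bar range
  geom_line(aes(y = cumulative / max(cumulative) * max(value)),
            color = "darkred", linewidth = 1) +
  scale_y_continuous(
    name = "Yearly value",
    # Secondary axis maps the rescaled line back to cumulative units
    sec.axis = sec_axis(~ . / max(df$value) * max(df$cumulative),
                        name = "Cumulative")
  ) +
  theme_minimal()
```

The rescaling trick is needed because ggplot2 secondary axes are purely visual transformations of the primary axis; whether two scales on one chart help or mislead the reader is exactly the question the magazine's chart raises.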

Continue reading

Which Visualization for Which Data?

Communicating data, information, and the insights derived from them is a key skill. Data visualizations should help the audience grasp concepts more quickly, making it essential to choose the type of visualization that conveys the message most effectively. Even though Microsoft Excel might suggest a pie chart, it’s often not the best option, as you can see on the left!

At university and in my job, I constantly work with data visualizations. To spare everyone's nerves, I've created a handy overview, inspired by the work of A. Abela:

I keep the overview continuously updated. If you're interested, feel free to sign up for my newsletter and get instant access to the overview (plus a monthly update).

Row Names in R

Some datasets in R use row names, such as the built-in mtcars dataset. While convenient, row names can be suboptimal when sorting data, for instance, by car brands:

To convert row names into a regular column, the Tidyverse (specifically the tibble package) provides the rownames_to_column() function:

library(tidyverse)

# load mtcars and move the row names into a column
mtcars_tidy <- mtcars %>%
  rownames_to_column(var = "car_name")

And this is the result:
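As a quick check that this solves the sorting problem mentioned above, the new column can be sorted like any other (a sketch repeating the step from the post):

```r
library(tidyverse)

# Move the row names of mtcars into a regular column
mtcars_tidy <- mtcars %>%
  rownames_to_column(var = "car_name")

# Sorting by car name now behaves like any other column
mtcars_sorted <- mtcars_tidy %>%
  arrange(car_name)

head(mtcars_sorted$car_name, 3)
```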

Visualizing overlaps of ETFs in an UpSet diagram

Today, two topics I find particularly exciting come together: data analysis and visualization, and finance. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. However, it’s equally fascinating to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (NL0011683594) and an MSCI World ETF (IE00B4L5Y983).

The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before it can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new type of visualization that I’ve used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:

The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:

Yes, it takes a bit of effort, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of the datasets, with the Vanguard FTSE All-World High Dividend Yield having the most holdings—over 2,000. On the right-hand side, we see the overlaps. The dot at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] holds 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 stocks that no other ETF contains. In the third column, we see that these two ETFs share 486 stocks that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn't have thought that the Vanguard contains so many stocks that the MSCI World does not.

The VanEck allegedly has one stock that no other ETF contains, but that's not accurate; that entry was just cash. Otherwise, 81 of its 100 stocks are also included in the MSCI World, and all 100 are included in the Vanguard.
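For anyone who wants to reproduce this type of chart, here is a minimal sketch using the UpSetR package. The four lists contain invented tickers purely for illustration—they are not the actual ETF holdings, which come from the providers' downloads described above:

```r
library(UpSetR)

# Invented mini holdings lists -- stand-ins for the real constituent files
holdings <- list(
  vanguard_ftse_hd   = c("AAPL", "MSFT", "JNJ", "PG", "KO", "T"),
  ishares_stoxx_div  = c("JNJ", "PG", "T"),
  vaneck_dev_div     = c("JNJ", "KO", "T"),
  ishares_msci_world = c("AAPL", "MSFT", "JNJ", "PG")
)

# fromList() turns the lists into the binary membership matrix UpSetR expects:
# one row per ticker, one column per ETF
upset(fromList(holdings), order.by = "freq")
```

With the real holdings, the only extra work is reducing each provider's file to a clean vector of identifiers (ISINs are safer than tickers) before building the list.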

It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.

Export from ING depot: CSV is not the same as CSV

The blogger Depotstudent Dominik has already provided a good overview of how to export data from an ING depot via the ExtraETF workaround. However, not every tool can handle the CSV export properly. DivvyDiary, for example, immediately recognized the relevant columns, but the balances didn't match. The reason is that CSV files can vary significantly, as can the data within them. Sometimes columns aren't separated by a comma but by a semicolon. And while the difference between 1,000.00 and 1.000,00 might seem minor to us, for DivvyDiary a 1000 turned into a 1, because the thousands separator was treated as a decimal point.

The solution: As much as I dislike working with Excel, if you open the CSV file in Excel and then save it again as a CSV, even DivvyDiary (and many other tools) can handle it.
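For R users, the Excel round-trip can be avoided by telling the import function about the German number format. Here is a sketch using the readr package; the file and column names are made up for the demo:

```r
library(readr)

# A tiny demo file in the German CSV style: semicolon separator,
# dot as thousands separator, comma as decimal mark
writeLines(c("Name;Betrag", "Fonds A;1.000,00"), "demo_depot.csv")

# locale() declares the German conventions; col_number() strips the
# grouping mark so "1.000,00" is read as 1000, not 1
depot <- read_delim(
  "demo_depot.csv",
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  col_types = cols(Name = col_character(), Betrag = col_number())
)

depot$Betrag  # 1000
```

Base R's read.csv2() also assumes the semicolon separator and decimal comma, but it does not strip the thousands separator, so values like 1.000,00 would still need cleaning afterwards.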

ggplot2 and the New Pipe

Why doesn’t this code work?

mtcars |> ggplot(., aes(x = mpg, y = hp)) + geom_point()

The problem with the code above lies in the dot (.) passed to ggplot. The dot placeholder is a feature of the magrittr pipe (%>%), which dplyr re-exports; the native R pipe (|>) does not support it. With the magrittr pipe, the code works as intended:

library(ggplot2)
library(dplyr)

mtcars %>%
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

Alternatively, you can keep the native pipe and simply drop the dot, since the piped data becomes the first argument of ggplot():

library(ggplot2)

mtcars |>
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()

If you do want an explicit placeholder, the native pipe uses the underscore (_), available since R 4.2, and it must be passed to a named argument: ggplot(data = _, aes(x = mpg, y = hp)). The dot (.) only works with the magrittr pipe.

The Digital Analytics Association is history – and no one cares.

It was a bit of a surprise. I had recently emailed with Jim Sterne about the German branch, and the DAA had contributed a foreword to my web analytics book. It's a bit of a shame.

For those who don’t know: the DAA was previously the WAA, the Web Analytics Association, and it created the most widely used definition of web analytics. The definition has long since disappeared from the website, though that never seemed to bother the many researchers who copy the quote from other papers.

But how can such an organization shut down despite the importance of data? One reason could be that many companies have installed Google Analytics & Co. but never actually use the data. In my last paper, which unfortunately isn't public yet, most participants didn't even realize that embedding the GA code alone isn't enough to work in a data-driven way. And maybe the DAA itself is partly to blame for failing to make its relevance clear.

I had only been a member out of nostalgia in recent years. I had used my student status to lower the membership fees a bit.

The website is already no longer accessible.

Estateguru: First project partially recovered + Exploratory Data Analysis


In fact, the first project has now been partially recovered, with €262.72 of the €500 returned.

Otherwise, not much has changed since my last update. However, I've decided not to invest any further. I withdrew 2/5 of my investment amount, and while the remaining amount is still significant, it no longer makes up too large a portion of my portfolio. I also put a stop to reinvestment, at least as well as I could. Unfortunately, Estateguru doesn't let you cap your total investment so that interest payments remain available for withdrawal; you can only specify that a certain amount should stay invested, which isn't ideal. For example, if I had invested €10,000 and wanted to set aside €100 each month, but then received a repayment of €300, I would quickly end up with more than €10,000 invested, even though that's not what I wanted. Support wasn't particularly helpful here.

Estateguru does offer the option to download your portfolio data, which allows me to take a closer look at what actually went wrong:

data %>%
  group_by(Country, Status) %>%
  ggplot(., aes(fill = Status, x = as.factor(Country))) +
  geom_bar() +
  theme_minimal() +
  xlab("Land")

As you can see here, I had fewer projects in Germany than in Estonia, for example, but most of the projects in Germany have defaulted. That’s quite alarming. It looks even worse when you look at the actual amounts.

data %>%
  group_by(Country, Status) %>%
  ggplot(., aes(fill = Status, y = `Initial Principal`, x = as.factor(Country))) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  xlab("Land") +
  ylab("Darlehensbetrag in Euro")

Is there a correlation between the interest rate and the “defaulted” status, meaning did I take on riskier loans in my “greed” that were characterized by higher interest rates? Let’s first visualize several variables:

data %>%
  ggplot(., aes(x = `Initial Principal`, y = `Interest Rate`, color = factor(Status))) +
  geom_point() +
  facet_grid(rows = vars(Country)) +
  theme_minimal()

We definitely see the one outlier where I invested €2,500 at around 11%. It also seems that defaults are primarily associated with higher interest rates, except for the German projects, where I have defaults across the board. However, it doesn’t quite fit, because in some projects, you could invest over multiple stages:

data %>%
  mutate(`Loan Code` = str_remove(`Loan Code`, "-.*")) %>%
  group_by(`Loan Code`) %>%
  mutate(principal_complete = sum(`Initial Principal`), median_interest = median(`Interest Rate`)) %>%
  select(`Loan Code`, Status, Country, median_interest, principal_complete) %>%
  arrange(`Loan Code`) %>%
  unique() %>%
  ggplot(., aes(x = principal_complete, y= median_interest, color = factor(Status))) +
  geom_point() +
  facet_grid(rows = vars(Country)) +
  theme_minimal()

Apparently, I was particularly bold in Germany, thinking that loans there were safer, and as a result I repeatedly exceeded the limit of €500 per project that I had set for myself. The next step would be to test whether the difference is statistically significant. But let's start in a different way first:

data %>%
  filter(Status == "Repaid" | Status == "In Default") %>%
  group_by(Status) %>%
  summarize(mean_interest = mean(`Interest Rate`), median_interest = median(`Interest Rate`))
## # A tibble: 2 × 3
##   Status     mean_interest median_interest
##   <chr>              <dbl>           <dbl>
## 1 In Default          10.7            10.5
## 2 Repaid              10.0            10

The Shapiro-Wilk test helps us check the normality of the data.

repaid <- filter(data, Status == "Repaid")$`Interest Rate`
in_default <- filter(data, Status == "In Default")$`Interest Rate`

shapiro_test_repaid <- shapiro.test(repaid)

shapiro_test_in_default <- shapiro.test(in_default)

cat("P-value for Repaid group:", shapiro_test_repaid$p.value, "\n")


## P-value for Repaid group: 1.143358e-08

cat("P-value for In Default group:", shapiro_test_in_default$p.value, "\n")

## P-value for In Default group: 6.078673e-05

The p-values are significant (below 0.05), indicating that the data is not normally distributed. Therefore, a non-parametric test, the Mann-Whitney U test, is used to compare the interest rates of the two groups.

wilcox_test <- wilcox.test(repaid, in_default, alternative = "two.sided")
cat("P-value for Mann-Whitney U test:", wilcox_test$p.value, "\n")

## P-value for Mann-Whitney U test: 6.66547e-08

The p-value is below 0.05, indicating a significant difference in interest rates between repaid and defaulted loans. This analysis covered the entire portfolio. Now, how does it look by country?

countries <- unique(data$Country)

# Function to analyze each country
analyze_country <- function(country) {
  cat("Analyse für", country, ":\n")

  # Filter data by country and status
  data_df <- data %>%
    filter(Country == country) %>%
    filter(Status %in% c("Repaid", "In Default"))

  # Check if there is enough data for both categories
  if (nrow(data_df) > 0 & length(unique(data_df$Status)) > 1) {
    repaid <- data_df %>% filter(Status == "Repaid") %>% select(`Interest Rate`) %>% unlist()
    in_default <- data_df %>% filter(Status == "In Default") %>% select(`Interest Rate`) %>% unlist()
    test <- wilcox.test(repaid, in_default, exact = FALSE)

    cat("Mann-Whitney U Test-Ergebnis: W =", test$statistic, ", p-value =", test$p.value, "\n\n")
  } else {
    cat("Nicht genug Daten für die Analyse.\n\n")
  }
}

# Analyze each country
for (country in countries) {
  analyze_country(country)
}

## Analyse für Estonia :
## Mann-Whitney U Test-Ergebnis: W = 77 , p-value = 0.02871484
##
## Analyse für Germany :
## Mann-Whitney U Test-Ergebnis: W = 101 , p-value = 0.5058534
##
## Analyse für Lithuania :
## Mann-Whitney U Test-Ergebnis: W = 224.5 , p-value = 3.126943e-06
##
## Analyse für Finland :
## Mann-Whitney U Test-Ergebnis: W = 54 , p-value = 0.8649381
##
## Analyse für Spain :
## Nicht genug Daten für die Analyse.
##
## Analyse für Portugal :
## Nicht genug Daten für die Analyse.
##
## Analyse für Latvia :
## Mann-Whitney U Test-Ergebnis: W = 12 , p-value = 0.04383209

In fact, the difference in Germany is not significant. So, it turns out I wasn't as greedy as I thought after all 🙂

Now, what if I had only invested €50 in each loan instead of sometimes investing much more? How would I be doing today?
data$fantasy <- 50

data %>%
  mutate(`Loan Code` = str_remove(`Loan Code`, "-.*")) %>%
  group_by(`Loan Code`) %>%
  mutate(principal_complete = sum(`Initial Principal`), median_interest = median(`Interest Rate`)) %>%
  select(`Loan Code`, Status, Country, median_interest, principal_complete, fantasy) %>%
  arrange(`Loan Code`) %>%
  unique() %>%
  group_by(Status) %>%
  summarize(fifty_only = sum(fantasy), real_numbers = sum(principal_complete)) %>%
  mutate(percentages_fifty = 100 * (fifty_only / sum(fifty_only)), percentages_real = 100 * (real_numbers / sum(real_numbers))) %>%
  select(-fifty_only, -real_numbers)

## # A tibble: 6 × 3
##   Status              percentages_fifty percentages_real
##   <chr>                           <dbl>            <dbl>
## 1 Fully Recovered                 2.33             1.25
## 2 Funded                         34.6             30.7
## 3 In Default                     16.3             27.1
## 4 Late                            1.66             3.91
## 5 Partially Recovered             0.332            0.899
## 6 Repaid                         44.9             36.1

It's very clear here that my strategy of putting more money into certain projects didn't work out well. It would have been better to invest more evenly and diversify. That's exactly what I'm doing differently now.

The Zettelkasten as a Knowledge Management System and Beyond


There are a few topics that have been on my mind for decades, one of which is how to store and retrieve information and knowledge in a meaningful way—essentially, a Memex. In the 90s, I was a fan of Apple’s HyperCard, and since then, I’ve tried many things, but nothing has proven reliable. Data formats have disappeared (HyperCard’s among them), and even though the technical possibilities today are much broader than back then, I don’t have the impression that they have solved the fundamental problem. There is no quick fix that reads articles and books for you and also makes you understand them. I don’t trust summary services like Blinkist, not to mention that they don’t help with scientific articles either.

In recent months, I’ve turned again to a system that I had used in the 90s but abandoned in favor of HyperCard: the Zettelkasten. Not just any box with index cards, but a Zettelkasten following the Luhmannian principle. Niklas Luhmann was one of the most significant sociologists of the 20th century, with an incredible number of academic publications to his name. His Zettelkasten, which actually consists of several boxes, contains about 90,000 slips and is being digitized at Bielefeld University. Luhmann attributed his productivity to this Zettelkasten system, and after several months using it, I can understand why. So how does the Zettelkasten work?

First of all, there are a few fundamental principles:

  • The quality of a paper depends on what has already been written. In an ideal world, the thoughts in the Zettelkasten are already formulated well enough to be directly incorporated. The goal of the Zettelkasten, therefore, is to generate insights that are worth publishing.
  • Everything you read is processed and placed into the Zettelkasten. Rather than following a strict plan (as is often recommended when writing academic papers), everything goes into the Zettelkasten and is, if possible, linked together right away. Luhmann himself always wrote his slips with the thought in mind of how they would fit into the existing notes. But new ideas can also emerge from the Zettelkasten at any time.
  • Writing about everything instead of starting directly with a hypothesis also has the advantage of reducing exposure to confirmation bias, which causes one to ignore anything that might challenge the hypothesis.
  • Highlighting in a text is useless; the learning effect is zero. According to Luhmann, thinking cannot happen without writing. Handwritten notes are preferable to those typed on a computer, as they tend to capture the essence better, thus facilitating understanding.
  • The Zettelkasten is not an archive, nor is it an idea graveyard like Moleskine notebooks.
  • There is no information hierarchy; thoughts can simply be inserted into Luhmann’s system wherever they fit. For example, if you have a note numbered “1.2.1” and another “1.2.2,” you can just insert a “1.2.1a” if a thought is missing there. There’s no need to agonize in advance over the perfect structure.
  • Self-discipline is more important than IQ. A smart working environment ensures that you don’t face resistance from the start.
  • Daily work with the Zettelkasten leads to a new KPI for knowledge workers: How many notes are created each day!

How exactly does the Zettelkasten work?

  • When reading a text, “Literature Notes” are created, which contain only your own thoughts about the text, written in your own words.
  • These then lead to “Permanent Notes,” which are added to the Zettelkasten. The Literature Notes are discarded (this is also why I sent back my Scribe, since the Literature Notes could not be created with it).
  • Additionally, there are “Fleeting Notes,” which contain all the ideas you have.

The Zettelkasten system is highly minimalist, with no fancy notebooks or software tools. The reduction, or even restriction, to what is essential also stimulates creativity and thinking (see Stokes 2001 and Rheinberger 1997). Of course, there are software solutions like The Archive, and they would have advantages for me: I don’t always have my Zettelkasten with me. On the other hand, I’ve learned that when I’m sitting in front of a computer or iPad, I tend to get distracted. Therefore, I now carry index cards with me.

Here is an interesting video about Luhmann’s Zettelkasten, where he explains it himself starting at minute 37:26:

And here is the researcher who is now delving into the Zettelkasten: