Visualizing overlaps of ETFs in an UpSet diagram

Today, two topics I find particularly exciting come together: data analysis and visualization, and finance. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. However, it’s equally fascinating to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (NL0011683594) and an MSCI World ETF (IE00B4L5Y983).

The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before it can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new type of visualization that I’ve used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:

The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:

Yes, it takes a bit of effort, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of the datasets, with the Vanguard FTSE All-World High Dividend Yield having the most holdings—over 2,000. On the right-hand side, we see the overlaps. The point at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] has 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 titles that no other ETF contains. In the third column, we see that these two ETFs share 486 titles that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn’t have thought that the Vanguard contains so many stocks that the MSCI World does not.

The VanEck allegedly has one stock that no other ETF contains, but that’s not accurate; that entry was just cash. Otherwise, 81 of its 100 titles are also included in the MSCI World. All of its titles are included in the Vanguard.

It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.

Dividend Strategies: Missed Opportunities?

Disclaimer: This is not financial advice or a recommendation!

The article When Chasing More Dividends Leaves You With Less from the Wall Street Journal by Jason Zweig (who, by the way, wrote the commentary for The Intelligent Investor) sheds light on the appeal and associated risks of dividend strategies. Investors who focus on high dividend yields often hope for a steady income stream, especially in times of low interest rates. However, as the article points out, chasing high dividends can ultimately reduce long-term returns. The problem arises when investors blindly flock to funds that offer exceptionally high dividend yields.

Continue reading “Dividend Strategies: Missed Opportunities?”

Export from ING depot: CSV is not the same as CSV

Depot student Dominik has already provided a good overview of how to export data from the ING depot via the ExtraETF workaround. However, not every tool can handle the CSV export properly. For example, DivvyDiary immediately recognized the relevant columns, but the balances didn’t match. The reason for this is that CSV files can vary significantly, as can the data within them. Sometimes, columns aren’t separated by a comma but by a semicolon. And while the difference between 1,000.00 and 1.000,00 might seem minor to us, for DivvyDiary, a 1000 turned into a 1 because the thousands separator was treated as a decimal point.

The solution: As much as I dislike working with Excel, if you open the CSV file in Excel and then save it again as a CSV, even DivvyDiary (and many other tools) can handle it.

The advantage of ETFs with domicile in Ireland… sometimes.

Disclaimer: This is not financial advice! No warranty.

When selecting ETFs, various factors come into play, including tax considerations. In the last article, we discussed what partial exemption means. However, the tax differences between ETFs with different domiciles and their holdings in US stocks are also interesting. This article focuses on two specific and popular ETFs, and even though both contain US stocks, that doesn’t necessarily mean that the ETF domiciled in Ireland will deliver higher returns.

Continue reading “The advantage of ETFs with domicile in Ireland… sometimes.”

New tool for a dividend strategy.

Some tools online offer the ability to see how many dividends are likely to come your way. For example, extraETF provides a tool where you can see what the dividends might look like based on an assumed growth rate (CAGR), a certain number of years, and asset gains.

What I haven’t seen so far is a tool that, starting from a portfolio, calculates the dividend growth based on an assumed CAGR and dividend yield, while also factoring in taxes. That’s exactly the kind of tool I’ve created.

Are neobanks really such a good idea?

I’m a big fan of sub-accounts to keep budgets for different categories well separated. To do this, I’ve tried a few different banks. bunq didn’t work reliably and had very unfriendly support. I used to really like N26, but what I didn’t find very funny was that when downgrading from a more expensive to a cheaper plan, you can’t keep the IBANs for the sub-accounts. Then I found vivid. I hated the app’s color scheme from the start, but the features were okay, especially since I could link different virtual credit cards to different accounts. Unfortunately, the support wasn’t particularly good here either. And now, existing accounts are being closed because vivid is parting ways with Solaris Bank. New IBANs again. So, I’m moving on, or rather, going back to ING. They don’t have sub-accounts quite the way I need them, but I’ll figure out a different way to manage my budgets.

The fact is that with every neobank, you have to pay for the really interesting features, and in return, you don’t always get great support. The ING account might not be as sleek as those of the neobanks, but it’s free, and the support is usually good. I’ll keep my vivid account after upgrading it, but I won’t pay for it. Sometimes, the boring and old-fashioned offerings turn out to be not such a bad idea in the medium to long term.

Cool new features at Scalable Capital


I have been a customer of Scalable Capital for seven years, initially with the rather disappointing RoboAdvisor, but for the past two years, I’ve been a satisfied customer of the Prime Broker.

This week, I discovered new features, initially part of them in the desktop app, but everything new that is offered can already be seen in the web app. The new feature is called Insights and is based on data from BlackRock. For this to work, portfolio data is anonymized and sent to BlackRock. First of all, there is a Portfolio Check, where the diversification of the portfolio is assessed.

However, it doesn’t assess the weighting. After seeing this graphic, I invested a bit in a bond, and just like that, the ring was full. Next, there’s a crash simulation, where you can choose from several scenarios. In this simulation, for example, global stock markets drop by 10%, although I previously thought a crash would only be considered if the drop is 30% or more.

In such a scenario, my portfolio would suffer more than the benchmark.

The next new feature is the analysis of the portfolio allocation in terms of sectors, regions, etc.

For me, it looks like I have generally overweighted financial services, but in reality, this is only the case at first glance. Due to the dividend drivers Hercules and Ares Capital, it appears larger than it actually is. Then, there is also this wonderful analysis of the types of companies:

I’ve mostly seen something like this on ExtraETF. And then, for me, the most exciting part – the dividends, as I follow a dividend strategy:

I would like to see a projection into the future here, like on ExtraETF, although on that platform, you can only see the next few months within the same calendar year. But this way, I can better assess whether I’m on the right track or not. What I don’t like about ExtraETF is the shaky connection to Scalable. Sometimes it works, sometimes it doesn’t. That’s why I’m no longer a subscriber there.

Scalable takes away the market potential of tools like ExtraETF, getQuin, and DivvyDiary when they continue adding such features. For me, however, this is even more of a reason to stay with Scalable. These features could certainly become even smarter and support investors like me, who don’t want to use an expensive robo-advisor, even more when selecting investments.

Estateguru: First project partially collected + Exploratory Data Analysis


In fact, the first project has now been partially collected, with €262.72 of the €500 returned.

Otherwise, not much has changed since my last update. However, I decided to invest again. I withdrew 2/5 of my investment amount, and while the remaining amount is still significant, it no longer makes up too large a portion of my portfolio. I also put a stop to the reinvestment, at least as best as I could. Unfortunately, at Estateguru, you can’t specify that you want to invest only a certain amount, so that the interest is always available. Instead, you can only specify that you want to keep a certain amount invested, which isn’t ideal. For example, if I had invested €10,000 and wanted to reserve €100 each month, but then received a payment of €300, I would quickly end up investing more than €10,000, even though that’s not what I wanted. The support wasn’t particularly helpful in this case.

Estateguru does offer the option to download your portfolio data, which allows me to take a closer look at what actually went wrong:

data %>%
group_by(Country, Status) %>%
ggplot(., aes(fill = Status, x = as.factor(Country))) +
geom_bar() +
theme_minimal() +
xlab("Land")

As you can see here, I had fewer projects in Germany than in Estonia, for example, but most of the projects in Germany have defaulted. That’s quite alarming. It looks even worse when you look at the actual amounts.

data %>%
group_by(Country, Status) %>%
ggplot(., aes(fill = Status, y = `Initial Principal`, x = as.factor(Country))) +
geom_bar(stat = "identity") +
theme_minimal() +
xlab("Land") +
ylab("Darlehensbetrag in Euro")

Is there a correlation between the interest rate and the “defaulted” status, meaning did I take on riskier loans in my “greed” that were characterized by higher interest rates? Let’s first visualize several variables:

data %>%
ggplot(., aes(x = `Initial Principal`, y= `Interest Rate`, color = factor(Status))) +
geom_point() +
facet_grid(rows = vars(Country)) +
theme_minimal()

We definitely see the one outlier where I invested €2,500 at around 11%. It also seems that defaults are primarily associated with higher interest rates, except for the German projects, where I have defaults across the board. However, it doesn’t quite fit, because in some projects, you could invest over multiple stages:

data %>%
  mutate(`Loan Code` = str_remove(`Loan Code`, "-.*")) %>%
  group_by(`Loan Code`) %>%
  mutate(principal_complete = sum(`Initial Principal`), median_interest = median(`Interest Rate`)) %>%
  select(`Loan Code`, Status, Country, median_interest, principal_complete) %>%
  arrange(`Loan Code`) %>%
  unique() %>%
  ggplot(., aes(x = principal_complete, y= median_interest, color = factor(Status))) +
  geom_point() +
  facet_grid(rows = vars(Country)) +
  theme_minimal()

Apparently, I was particularly bold in Germany, thinking that loans there were safer, and as a result, I repeatedly exceeded the limit of €500 per project that I had set for myself. Calculating a statistically significant difference would be the task. But let’s start in a different way first:

data %>%
  filter(Status == "Repaid" | Status == "In Default") %>%
  group_by(Status) %>%
  summarize(mean_interest = mean(`Interest Rate`), median_interest = median(`Interest Rate`))
## # A tibble: 2 × 3
##   Status     mean_interest median_interest
##   <chr>              <dbl>           <dbl>
## 1 In Default          10.7            10.5
## 2 Repaid              10.0            10

The Shapiro-Wilk test helps us check the normality of the data.

repaid <- filter(data, Status == "Repaid")
In fact, the first project has now been partially collected, with €262.72 of the €500 returned.

Estateguru: Project partially recovered.
Otherwise, not much has changed since my last update. However, I’ve decided to invest again. I withdrew 2/5 of my investment amount, and while the remaining amount is still substantial, it no longer makes up too large a portion of my portfolio. Additionally, I’ve put a stop to the reinvestment, at least as best as I could. Unfortunately, with Estateguru, you can’t specify that you only want to invest a certain amount so that the interest is always available. Instead, you can specify that you want to keep a certain amount invested, which isn’t ideal. For example, if I had invested €10,000 and wanted to reserve €100 every month, but then received a payment of €300, I would quickly end up investing more than €10,000, even though I didn’t want that. The support wasn’t very helpful in this regard.

Estateguru offers the option to download your portfolio data, which allows me to take a closer look at what actually went wrong:

data %>%
group_by(Country, Status) %>%
ggplot(., aes(fill = Status, x = as.factor(Country))) +
geom_bar() +
theme_minimal() +
xlab("Land")

As you can clearly see here, I had fewer projects in Germany than in Estonia, for example, but most of the projects in Germany have defaulted. That's quite alarming. It looks even worse when you look at the actual amounts.


data %>%
group_by(Country, Status) %>%
ggplot(., aes(fill = Status, y = `Initial Principal`, x = as.factor(Country))) +
geom_bar(stat = "identity") +
theme_minimal() +
xlab("Land") +
ylab("Darlehensbetrag in Euro")

Is there a correlation between the interest rate and the "defaulted" status, meaning did I take on riskier loans in my "greed" that were characterized by higher interest rates? Let's first visualize several variables:

data %>%
ggplot(., aes(x = `Initial Principal`, y= `Interest Rate`, color = factor(Status))) +
geom_point() +
facet_grid(rows = vars(Country)) +
theme_minimal()

We definitely see the one outlier where I invested €2,500 at around 11%. It also appears that defaults are primarily associated with higher interest rates, except for the German projects, where I have defaults across the board. However, this doesn't quite fit, because in some projects, you could invest over multiple stages:

data %>%
mutate(`Loan Code` = str_remove(`Loan Code`, "-.*")) %>%
group_by(`Loan Code`) %>%
mutate(principal_complete = sum(`Initial Principal`), median_interest = median(`Interest Rate`)) %>%
select(`Loan Code`, Status, Country, median_interest, principal_complete) %>%
arrange(`Loan Code`) %>%
unique() %>%
ggplot(., aes(x = principal_complete, y= median_interest, color = factor(Status))) +
geom_point() +
facet_grid(rows = vars(Country)) +
theme_minimal()

Apparently, I was particularly bold in Germany, thinking that loans there were safer, and as a result, I repeatedly exceeded the €500 limit I had set for myself per project. Calculating a statistically significant difference would be the task. But let's start in a different way first:

data %>%
filter(Status == "Repaid" | Status == "In Default") %>%
group_by(Status) %>%
summarize(mean_interest = mean(`Interest Rate`), median_interest = median(`Interest Rate`))

## # A tibble: 2 × 3
## Status mean_interest median_interest
## <chr> <dbl> <dbl>
## 1 In Default 10.7 10.5
## 2 Repaid 10.0 10

The Shapiro-Wilk test helps us check the normality of the data.

Interest Rate`in_default <- filter(data, Status == "In Default")

In fact, the first project has now been partially recovered, with €262.72 of the €500 returned.

Estateguru: Project partially recovered.

Otherwise, not much has changed since my last update. However, I’ve decided to invest again. I withdrew 2/5 of my investment, and while the remaining amount is still significant, it no longer makes up too large a portion of my portfolio. Additionally, I’ve put a stop to the reinvestment, at least as best as I could. Unfortunately, with Estateguru, you can’t specify that you only want to invest a certain amount, so that the interest is always available. Instead, you can only set a certain amount to be always invested, which isn’t ideal. For example, if I had invested €10,000 and wanted to reserve €100 every month, but then received a payment of €300, I would quickly end up investing more than €10,000, even though that’s not what I wanted. The support wasn’t very helpful in this case.

Estateguru offers the option to download your portfolio data, which allows me to take a closer look at what actually went wrong:

data %>%
group_by(Country, Status) %>%
ggplot(., aes(fill = Status, x = as.factor(Country))) +
geom_bar() +
theme_minimal() +
xlab("Land")

As you can clearly see here, I had fewer projects in Germany than in Estonia, for example, but most of the projects in Germany have defaulted. That’s quite alarming. It looks even worse when you look at the actual amounts.

data %>%
group_by(Country, Status) %>%
ggplot(., aes(fill = Status, y = `Initial Principal`, x = as.factor(Country))) +
geom_bar(stat = "identity") +
theme_minimal() +
xlab("Land") +
ylab("Darlehensbetrag in Euro")

Is there a correlation between the interest rate and the "defaulted" status, meaning did I take on riskier loans in my "greed" that were characterized by higher interest rates? Let's first visualize several variables:

data %>%
ggplot(., aes(x = `Initial Principal`, y= `Interest Rate`, color = factor(Status))) +
geom_point() +
facet_grid(rows = vars(Country)) +
theme_minimal()

We definitely see the one outlier where I invested €2,500 at around 11%. It also seems that defaults are primarily associated with higher interest rates, except for the German projects, where I have defaults across the board. However, this doesn’t quite fit, because in some projects, you could invest over multiple stages:

data %>%
mutate(`Loan Code` = str_remove(`Loan Code`, "-.*")) %>%
group_by(`Loan Code`) %>%
mutate(principal_complete = sum(`Initial Principal`), median_interest = median(`Interest Rate`)) %>%
select(`Loan Code`, Status, Country, median_interest, principal_complete) %>%
arrange(`Loan Code`) %>%
unique() %>%
ggplot(., aes(x = principal_complete, y= median_interest, color = factor(Status))) +
geom_point() +
facet_grid(rows = vars(Country)) +
theme_minimal()

Apparently, I was particularly bold in Germany, thinking that loans there were safer, and as a result, I repeatedly exceeded the €500 limit I had set for myself per project. Calculating a statistically significant difference would be the task. But let's start in a different way first:

data %>%
filter(Status == "Repaid" | Status == "In Default") %>%
group_by(Status) %>%
summarize(mean_interest = mean(`Interest Rate`), median_interest = median(`Interest Rate`))

## # A tibble: 2 × 3
## Status mean_interest median_interest
## <chr> <dbl> <dbl>
## 1 In Default 10.7 10.5
## 2 Repaid 10.0 10

The Shapiro-Wilk test helps us check the normality of the data.


Interest Rate`shapiro_test_repaid <- shapiro.test(repaid)

shapiro_test_in_default <- shapiro.test(in_default)

cat("P-value for Repaid group:", shapiro_test_repaid$p.value, "\n")


## P-value for Repaid group: 1.143358e-08

cat("P-value for In Default group:", shapiro_test_in_default$p.value, "\n")

## P-value for In Default group: 6.078673e-05

The p-values are significant (below 0.05), indicating that the data is not normally distributed. Therefore, the Mann-Whitney U test is used, a non-parametric test, to compare the interest rates of the two groups.
wilcox_test <- wilcox.test(repaid, in_default, alternative = "two.sided")
cat("P-value for Mann-Whitney U test:", wilcox_test$p.value, "\n")

## P-value for Mann-Whitney U test: 6.66547e-08

The p-value is significant, meaning it's below 0.05, indicating that there is a significant difference in interest rates between repaid and defaulted loans. This analysis was done across the entire portfolio. Now, how does this look by country?

countries <- unique(data$Country)

# Function to analyze each country
analyze_country <- function(country) {
cat("Analyse für", country, ":\n")

# Filter data by country and status
data_df <- data %>% filter(Country == country) %>% filter(Status %in% c("Repaid", "In Default"))

# Check if there is enough data for both categories
if (nrow(data_df) > 0 & length(unique(data_df$Status)) > 1) {

repaid <- data_df %>% filter(Status == "Repaid") %>% select(`Interest Rate`) %>% unlist()
in_default <- data_df %>% filter(Status == "In Default") %>% select(`Interest Rate`) %>% unlist()
test <- wilcox.test(repaid, in_default, exact = FALSE)

cat("Mann-Whitney U Test-Ergebnis: W =", test$statistic, ", p-value =", test$p.value, "\n\n")
} else {
cat("Nicht genug Daten für die Analyse.\n\n")
}
}

# Analyze each country
for (country in countries) {
analyze_country(country)
}

## Analyse für Estonia :
## Mann-Whitney U Test-Ergebnis: W = 77 , p-value = 0.02871484
##
## Analyse für Germany :
## Mann-Whitney U Test-Ergebnis: W = 101 , p-value = 0.5058534
##
## Analyse für Lithuania :
## Mann-Whitney U Test-Ergebnis: W = 224.5 , p-value = 3.126943e-06
##
## Analyse für Finland :
## Mann-Whitney U Test-Ergebnis: W = 54 , p-value = 0.8649381
##
## Analyse für Spain :
## Nicht genug Daten für die Analyse.
##
## Analyse für Portugal :
## Nicht genug Daten für die Analyse.
##
## Analyse für Latvia :
## Mann-Whitney U Test-Ergebnis: W = 12 , p-value = 0.04383209

In fact, the difference in Germany is not significant. So, it turns out I wasn't as greedy as I thought after all 🙂

Now, what if I had only invested €50 in each loan instead of sometimes investing much more? How would I be doing today?
data$fantasy = 50
data %>%
mutate(`Loan Code` = str_remove(`Loan Code`, "-.*")) %>%
group_by(`Loan Code`) %>%
mutate(principal_complete = sum(`Initial Principal`), median_interest = median(`Interest Rate`)) %>%
select(`Loan Code`, Status, Country, median_interest, principal_complete, fantasy) %>%
arrange(`Loan Code`) %>%
unique() %>%
group_by(Status) %>%
summarize(fifty_only = sum(fantasy), real_numbers = sum(principal_complete)) %>%
mutate(percentages_fifty = 100 * (fifty_only / sum(fifty_only)), percentages_real = 100 * (real_numbers / sum(real_numbers))) %>%
select(-fifty_only, -real_numbers)

## # A tibble: 6 × 3
## Status percentages_fifty percentages_real
## <chr> <dbl> <dbl>
## 1 Fully Recovered 2.33 1.25
## 2 Funded 34.6 30.7
## 3 In Default 16.3 27.1
## 4 Late 1.66 3.91
## 5 Partially Recovered 0.332 0.899
## 6 Repaid 44.9 36.1
It's very clear here that my strategy of spending more on certain projects didn't work out well. It would have been better to invest more evenly and diversify my investments. That's exactly what I'm doing differently now.