The BMI Is Not a Health Measure – It’s a Statistical Relic

I recently read Jordan Ellenberg’s How Not to Be Wrong: The Power of Mathematical Thinking. In it, he describes the pitfalls of linear regression, including the example of the paper “Will all Americans become overweight or obese?” According to this paper, by 2048 all Americans will be overweight or obese. This sounds dramatic – but it is statistically nonsensical. The underlying regression extrapolates a straight line and ignores the fact that as the number of overweight people increases, fewer and fewer slim people remain who could “convert.” Just a few years after the paper’s publication, it became clear that the increase in obesity does not follow a linear path but a logistic one – it flattens out, because a share of a finite population cannot grow without limit.
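To make the extrapolation problem tangible, here is a minimal sketch in R. The numbers are synthetic and generated by me for illustration; they are not the data from the paper. The sketch fits a straight line to a share and compares it with a straight line fitted on the logit scale, which is one simple way to get an S-shaped curve that levels off (here the ceiling is assumed to be 100%, in reality it may well be lower).

```r
# Synthetic prevalence series for illustration only (not the paper's data),
# generated from a logistic curve so the underlying process levels off.
years <- seq(1970, 2005, by = 5)
prevalence <- plogis(-60 + 0.03 * years)

# Naive approach: straight line through the shares, extrapolated to 2100
linear_fit <- lm(prevalence ~ years)
linear_2100 <- unname(predict(linear_fit, newdata = data.frame(years = 2100)))

# Alternative: straight line on the logit scale, transformed back to a share
logit_share <- qlogis(prevalence)
logit_fit <- lm(logit_share ~ years)
logistic_2100 <- unname(plogis(predict(logit_fit, newdata = data.frame(years = 2100))))

linear_2100    # exceeds 1, i.e. more than 100% of the population
logistic_2100  # stays below 1
```

The linear model happily predicts a share above 100%, while the logit-scale fit approaches its ceiling without ever crossing it.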

What Ellenberg does not mention: even the underlying metric, the Body Mass Index (BMI), is problematic. It appears in health records, insurance forms, and medical consultations – as if it were a precise health indicator. Yet it is not.

BMI as a proxy variable

To be fair: the BMI was never intended as a diagnostic tool. Adolphe Quetelet developed it in the 1830s as a statistical descriptor for populations – not for individual assessment. For decades, it was mainly used in epidemiology. It only entered clinical practice in the 1970s and 1980s, when simplicity became more important than precision.

Used correctly, the BMI is a proxy variable: not exact, but stable enough to identify patterns at the population level. And indeed, large datasets show that the risk of cardiovascular disease increases with rising BMI – but not linearly. One of the largest meta-analyses, Flegal et al. (2013, JAMA), evaluated 97 studies involving 2.88 million people. The result: compared with normal weight, overweight (BMI 25–29.9) was associated with lower all-cause mortality (Hazard Ratio 0.94, i.e. roughly 6% lower). Obesity overall (BMI ≥30) was associated with higher mortality (Hazard Ratio 1.18), but that increase came from BMI ≥35 (Hazard Ratio 1.29); for BMI 30–34.9 the difference was not statistically significant (Hazard Ratio 0.95). In other words: the oft-cited formula “the higher the BMI, the sicker you are” is simply wrong.

What BMI does not measure

Statistically speaking, the BMI is a noisy signal with high variance and poor construct validity. It measures mass, not health. It does not distinguish fat from muscle, nor visceral from subcutaneous fat, nor does it account for age or ethnicity. For example, Asian populations have a higher body fat percentage at the same BMI than European populations, while Black populations tend to have more muscle mass at the same BMI. A one-size-fits-all threshold makes no epidemiological sense.
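A small sketch of why “mass, not health” matters. The BMI is weight in kilograms divided by height in metres squared. The two people below are hypothetical, and the function names bmi() and bmi_category() are my own; the cut-offs are the usual ones (18.5, 25, 30).

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Usual categories; intervals are closed on the left, e.g. [25, 30) is "overweight"
bmi_category <- function(x) {
  cut(x,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("underweight", "normal", "overweight", "obese"),
      right = FALSE)
}

# Two hypothetical people with identical height and weight
people <- data.frame(
  person = c("strength athlete", "sedentary office worker"),
  weight_kg = c(90, 90),
  height_m = c(1.80, 1.80)
)
people$bmi <- bmi(people$weight_kg, people$height_m)
people$category <- bmi_category(people$bmi)
people
```

Both rows end up with the same BMI and the same label, although their body composition could hardly be more different. The formula simply has no input that could tell them apart.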

Fitness matters more than weight

Barry et al. (2014) found that fit obese individuals live longer on average than unfit normal-weight individuals. Tarp et al. (2021) confirmed: among men with high fitness, obesity was not a significant mortality factor. This means: fitness has a protective effect – but not an unlimited one.

At the same time, other studies show that high fitness does not fully neutralize the effects of severe obesity – some studies find a 117% elevated mortality risk. Healthy habits improve outcomes, but they do not cancel the effects of extreme adiposity.

Confounders

A significant portion of the correlation between high BMI and health risks arises because people with obesity tend to eat less healthily, exercise less, and more often face socioeconomic burdens. Diet: people with obesity consume, on average, more highly processed foods, sugar, and saturated fats – a pattern that independently leads to metabolic disorders. Physical inactivity: independently associated with cardiovascular risk. Socioeconomic factors: poverty increases the risk of obesity and, at the same time, limits access to healthcare.

This does not mean that adiposity has no independent effect – above a certain threshold, biological mechanisms such as chronic inflammation, insulin resistance, and hormonal disruption clearly play a role. The global data consistently show a connection (Heath 2022; Global BMI Mortality Collaboration 2023). Nevertheless: in most cases adiposity is dangerous less because of the body fat itself than because it usually reflects an unhealthy lifestyle and a disturbed metabolism. Beyond a certain point, however, it also becomes a biological cause in its own right.

A classic statistical error

The most common mistake in dealing with BMI is confusing population and individual. What holds true on average for a group does not necessarily apply to an individual – this is the ecological fallacy. The BMI works for groups, not for individuals: it provides useful trends, but no diagnoses.
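The gap between group and individual can be simulated in a few lines. This is a sketch with purely synthetic data; the variable names exposure and outcome are placeholders I made up, not real BMI or mortality figures. The group averages rise together almost perfectly, while an individual's exposure says very little about that individual's outcome.

```r
set.seed(42)
n_groups <- 5
n_per_group <- 1000

group <- rep(seq_len(n_groups), each = n_per_group)

# Group averages rise together (1 to 5), but individuals scatter widely around them
sim <- data.frame(
  group = group,
  exposure = group + rnorm(n_groups * n_per_group, sd = 5),
  outcome = group + rnorm(n_groups * n_per_group, sd = 5)
)

# Correlation between the five group averages
group_means <- aggregate(cbind(exposure, outcome) ~ group, data = sim, FUN = mean)
cor_groups <- cor(group_means$exposure, group_means$outcome)

# Correlation across all individuals
cor_individuals <- cor(sim$exposure, sim$outcome)

cor_groups       # close to 1
cor_individuals  # close to 0
```

Anyone who reads the group-level correlation as a statement about a single person is making exactly the mistake described above.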

The BMI is a prime example of a variable that can be statistically significant but conceptually weak. Or to put it differently: the BMI endures because it is convenient – not because it is good.

Stacked Area Chart: Visualizing Investments and Credit Defaults

This was more of a small programming exercise – partly because I wanted to see how things had actually developed as the number of projects going into collection increased. The data can be downloaded relatively easily from the website, but it needs to be transformed. The challenge is that the data is available per project, whereas for the stacked area chart we need to transform it so that it is available per month.

library(tidyverse)
library(lubridate)

today <- Sys.Date()

# Prepare time window
# (`portfolio` is the per-project export from the website, already read into a data frame)
base <- portfolio %>%
  transmute(
    `Loan Code`,
    start_date = as.Date(`Date Funded`),
    # Loans that have not been repaid yet run until today
    end_date = if_else(is.na(`Repaid Date`), today, as.Date(`Repaid Date`)),
    maturity_date = as.Date(`Maturity Date`),
    invested = `Invested amount`,
    status_raw = Status,
    days_late = `Days late`
  )
# Calculate start of default
base2 <- base %>%
  mutate(
    late_since_raw = if_else(!is.na(days_late), today - days(days_late), as.Date(NA)),
    late_since = if_else(late_since_raw < maturity_date, maturity_date, late_since_raw),
    late_since = if_else(is.na(late_since) & status_raw == "Late", maturity_date, late_since),
    is_late = !is.na(late_since)
  )
# Create monthly sequence
months_seq <- seq(
  floor_date(min(base2$start_date), "month"),
  floor_date(today, "month"),
  by = "month"
)

# Expand: one row per loan per month
expanded <- base2 %>%
  rowwise() %>%
  mutate(month = list(months_seq[months_seq >= floor_date(start_date, "month") &
                                   months_seq <= floor_date(end_date, "month")])) %>%
  ungroup() %>%  # drop the rowwise grouping before unnesting
  unnest(month) %>%
  mutate(
    status = case_when(
      # From the month in which the loan became late, count it as "late"
      is_late & month >= floor_date(late_since, "month") ~ "late",
      TRUE ~ "active"
    )
  )
# Aggregate per month and status
monthly <- expanded %>%
  group_by(month, status) %>%
  summarise(volume = sum(invested), .groups = "drop")

# Plot
ggplot(monthly, aes(x = month, y = volume, fill = status)) +
  geom_area(position = "stack") +
  scale_fill_manual(values = c("active" = "#4CAF50", "late" = "#F44336")) +
  scale_y_continuous(labels = scales::dollar_format(prefix = "€", big.mark = ".", decimal.mark = ",")) +
  labs(
    title = "Development of estateguru Portfolio",
    x = NULL, y = "Invested Volume (€)", fill = "Status"
  ) +
  theme_minimal()

As the plot clearly shows, the projects started going off the rails at the end of 2022/beginning of 2023; the €2,500 project mentioned in the previous article was overdue in January 2023. Since then, the total amount in collection has continued to grow. At the same time, you can see that I kept withdrawing money from estateguru whenever possible (each withdrawal incurs a fee, so I wait until a certain amount has accumulated).
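If you want to try the per-project to per-month reshaping without the export, here is a minimal sketch with a toy portfolio. The two loans, their dates, and the helper names month_floor() and expand_loan() are invented for illustration, and I use base R only so that it runs without any packages.

```r
# Toy portfolio: two made-up loans, one of which becomes late in March 2022
toy <- data.frame(
  loan = c("A", "B"),
  start_date = as.Date(c("2022-01-15", "2022-02-10")),
  end_date = as.Date(c("2022-03-20", "2022-04-05")),
  late_since = as.Date(c(NA, "2022-03-01")),
  invested = c(100, 50)
)

# First day of the month for a Date
month_floor <- function(d) as.Date(format(d, "%Y-%m-01"))

# One row per month in which loan i was running
expand_loan <- function(i) {
  row <- toy[i, ]
  months <- seq(month_floor(row$start_date), month_floor(row$end_date), by = "month")
  late <- !is.na(row$late_since) & months >= month_floor(row$late_since)
  data.frame(loan = row$loan, month = months,
             status = ifelse(late, "late", "active"),
             invested = row$invested)
}

expanded_toy <- do.call(rbind, lapply(seq_len(nrow(toy)), expand_loan))

# Sum the invested volume per month and status
monthly_toy <- aggregate(invested ~ month + status, data = expanded_toy, FUN = sum)
monthly_toy
```

The result has the shape a stacked area chart needs: one row per month and status, with the volume summed across loans.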

Website Analyzer

Claude Code allows me to build applications very quickly – including ones I had always wanted to create but had kept on my to-do list for years given the time they would require. In this case, things went a step further: I first asked Google Gemini to create a SWOT analysis of Screaming Frog, and then to write a prompt that would allow me to build a better app using Claude Code. Not all features are implemented yet. But the result is a native macOS app for crawling, analyzing, and monitoring websites: the Website Analyzer. Download available on request 🙂

Here is what the app does:

Crawling

  • Recursive crawling of a website starting from a start URL
  • Configurable parallelism (1–20 workers), depth (1–50), rate limit and timeout
  • Selectable user agent (Safari, Chrome, Googlebot, custom)
  • robots.txt compliance (optional)
  • Automatic HTTPS upgrading
  • HEAD requests for resources (images, CSS, JS, fonts, media) instead of full download
  • Detection of lazy-loading images (data-src, data-lazy-src, picture, srcset)

Link Checking

  • All internal and external links are checked (HTTP status code)
  • Status classification: OK, Redirect, Dead, Timeout, Error
  • Real redirects vs. trivial ones (http to https, www, trailing slash) are distinguished
  • Embedded resources (images, CSS, JS, fonts, iFrames) from CDNs are treated as internal
  • Grouped display: same target URL from multiple source pages

Results

  • Tabular overview of all crawled pages
  • Columns: URL, Content-Type, HTTP status, size, response time, depth, indexable
  • Search, sorting, filter by Content-Type
  • Color-coded status codes and response times

Insights & Analytics

  • Site Health Score (0–100) as a visual gauge
  • HTTP status code distribution (bar chart)
  • Link status distribution (OK / Dead / Redirect / Timeout / Error)
  • Page structure by depth
  • Content-Type distribution
  • Page Speed: average, median, P90, fastest/slowest pages (top 8)
  • Dead link hotspots: pages with the most dead links
  • Link graph: interactive force-directed visualization of page connections (pan, zoom, hover tooltips)

Export

  • CSV export for pages and links (with file dialog)

Technology

  • SwiftUI + macOS native
  • GRDB.swift (SQLite) for persistent storage
  • Fuzi (libxml2) for HTML parsing
  • Swift Concurrency (async/await, Actors) for thread-safe crawling
  • SwiftUI Canvas (Metal-backed) for link graph rendering

Using Large Language Models Locally with R

When it comes to Large Language Models (LLMs), there are moments when I ask myself: Do I really need to send my data halfway across the globe to OpenAI just to get a summary? With Ollama, there is now a standard for running models like Llama 3 or Mistral locally. And the best part: These models can be controlled directly from R. This not only saves money but also solves the data privacy problem quite elegantly. This article is about how to best implement this in R. Spoiler: There isn’t just one way, but (at least) two very good packages with completely different philosophies.


Can you combine column charts and cumulative lines?

It’s not something I would have thought of myself, but a German magazine recently tried it – and ever since, I’ve been wondering whether this is a clever idea or actually a very bad one. ChatGPT kindly translated the chart for me; the translation isn’t entirely accurate, but that’s not the point. What matters here is the visualization approach. In short, the data is about partnerships between the public sector and private investors.

A quick reminder: the purpose of data visualization is to make a subject easier to understand. Ideally, it also carries an intention – something that sharpens our perspective, changes our opinion, or even nudges us toward action. So, what does this visualization achieve?


Which Visualization for Which Data?

Communicating data, information, and the insights derived from them is a key skill. Data visualizations should help the audience grasp concepts more quickly, making it essential to choose the type of visualization that conveys the message most effectively. Even though Microsoft Excel might suggest a pie chart, it’s often not the best option, as you can see on the left!

At university and in my job, I constantly work with data visualizations. To save everyone’s nerves, I’ve created a handy overview, inspired by the work of A. Abela:

I update the overview continuously. If you’re interested, feel free to sign up for my newsletter and get instant access to the overview (plus a monthly update).

Row Names in R

Some datasets in R use row names, such as the built-in mtcars dataset. While convenient, row names can be suboptimal when sorting data, for instance, by car brands:

To convert row names into a column using the Tidyverse, the rownames_to_column() function is used:

library(tidyverse)

# Load mtcars and move the row names into a column
mtcars_tidy <- mtcars %>%
  rownames_to_column(var = "car_name")
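
If you would rather not load the Tidyverse for this, base R can do the same thing. A minimal sketch (the name mtcars_base is my own choice):

```r
# Base R alternative: put the row names into a regular column
# and reset the row names to plain row numbers
mtcars_base <- data.frame(car_name = rownames(mtcars), mtcars, row.names = NULL)

# Sorting by car name is now an ordinary column operation
head(mtcars_base[order(mtcars_base$car_name), c("car_name", "mpg")], 3)
```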

And this is the result:

Visualizing overlaps of ETFs in an UpSet diagram

Today, two topics I find particularly exciting come together: data analysis and visualization, and finance. Choosing the right ETFs is a topic that fills countless web pages and financial magazine articles. However, it’s equally fascinating to explore the overlaps between ETFs. Previously, I compared the Vanguard FTSE All-World High Dividend Yield UCITS ETF USD Distributing (ISIN: IE00B8GKDB10) and the iShares STOXX Global Select Dividend 100 UCITS (ISIN: DE000A0F5UH1). I also analyzed the performance of these two alongside the VanEck Morningstar Developed Markets Dividend Leaders ETF (NL0011683594) and an MSCI World ETF (IE00B4L5Y983).

The holdings included in an ETF can be downloaded from the respective provider’s website; I performed this download on October 5. The data requires significant transformation before it can be compared. My R-based notebook detailing this process can be found [here]. For the visualization, I chose an UpSet diagram, a relatively new type of visualization that I’ve used in a paper and another project. While Venn diagrams are commonly used for visualizing overlaps between datasets, they become unwieldy with more than 3 or 4 datasets. This challenge is clearly illustrated in examples like this:

The size of the circles, for example, does not necessarily reflect the size of the datasets. An UpSet diagram is entirely different:

Yes, it takes a bit of effort, but it shows much more clearly how the datasets relate to one another. On the far left, we see the size of the datasets, with the Vanguard FTSE All-World High Dividend Yield having the most holdings, over 2,000. On the right-hand side, we see the overlaps. The point at the very bottom beneath the tallest vertical bar indicates that the Vanguard FTSE […] has 1,376 stocks that no other ETF includes. Similarly, the iShares Core MSCI World has 757 stocks that no other ETF contains. In the third column, we see that these two ETFs share 486 stocks that the other two ETFs do not include. I find that quite fascinating. For example, I wouldn’t have thought that the Vanguard contains so many stocks that the MSCI World does not.

The VanEck appears to have one stock that no other ETF contains, but that’s not accurate; that entry was just cash. Otherwise, 81 of its 100 holdings are also included in the MSCI World. All of its holdings are included in the Vanguard.
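
The numbers in an UpSet diagram are exclusive intersections: every stock is counted exactly once, under the exact combination of ETFs that hold it. Here is a minimal sketch of that counting in base R, with made-up ETF names and tickers rather than the real holdings from my notebook:

```r
# Made-up holdings lists (placeholders, not real ETF data)
holdings <- list(
  etf_a = c("AAA", "BBB", "CCC", "DDD"),
  etf_b = c("BBB", "CCC", "EEE"),
  etf_c = c("CCC", "FFF")
)

all_stocks <- sort(unique(unlist(holdings)))

# Membership matrix: one row per stock, one column per ETF
membership <- sapply(holdings, function(h) all_stocks %in% h)
rownames(membership) <- all_stocks

# Label each stock with the exact combination of ETFs that contain it
combo <- apply(membership, 1, function(m) paste(names(holdings)[m], collapse = " & "))

# Size of each exclusive intersection, largest first
overlap_sizes <- sort(table(combo), decreasing = TRUE)
overlap_sizes
```

These counts correspond to the vertical bars in the diagram, while the column sums of the membership matrix correspond to the set sizes on the left. A package such as UpSetR can draw the chart from a list like this; which package you use is a matter of taste.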

It would now be interesting to see how the weightings align. However, that’s an additional dimension that would likely be difficult to represent in an UpSet diagram. Still, it’s necessary to take a closer look at this because the overlaps might result in unintended overweighting of certain stocks. That would be a topic for the next blog post.

Export from ING depot: CSV is not the same as CSV

Depotstudent Dominik has already provided a good overview of how to export data from the ING depot via the ExtraETF workaround. However, not every tool can handle the CSV export properly. For example, DivvyDiary immediately recognized the relevant columns, but the balances didn’t match. The reason for this is that CSV files can vary significantly, as can the data within them. Sometimes, columns aren’t separated by a comma but by a semicolon. And while the difference between 1,000.00 and 1.000,00 might seem minor to us, for DivvyDiary, a 1000 turned into a 1 because the thousands separator was treated as a decimal point.

The solution: As much as I dislike working with Excel, if you open the CSV file in Excel and then save it again as a CSV, even DivvyDiary (and many other tools) can handle it.
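
If you would rather skip the Excel round-trip, the same clean-up can be sketched in R. The two rows below are made up and the helper parse_grouped_number() is my own; the actual ING/ExtraETF export has different columns, so treat this as an illustration of the separator problem only.

```r
# Made-up sample in the "German" CSV flavour: semicolon as column separator,
# dot as thousands separator, comma as decimal mark
csv_text <- paste(
  "Name;Value",
  "Fund A;1.000,00",
  "Fund B;250,50",
  sep = "\n"
)

# Read everything as text first so nothing gets misinterpreted
raw <- read.csv(text = csv_text, sep = ";", colClasses = "character")

# What goes wrong in a naive parse: the grouping dot is read as a decimal point
naive <- as.numeric("1.000")  # 1, not 1000

# Remove the thousands separators, then turn the decimal comma into a dot
parse_grouped_number <- function(x) {
  without_grouping <- gsub(".", "", x, fixed = TRUE)
  as.numeric(sub(",", ".", without_grouping, fixed = TRUE))
}

raw$Value <- parse_grouped_number(raw$Value)
raw
```

After that, write.csv(raw, "clean.csv", row.names = FALSE) produces a comma-separated file with plain decimal points, which is the flavour most tools expect.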