data > opinion

Tom Alby

Using R to analyze and plot Corona Data

2020-03-18



I am not updating the numbers here anymore. Please look at my dashboard written in Shiny if you want to look at current data!

Last update:

## [1] "2020-11-02 09:59:13 UTC"

Why?

This R notebook is inspired by a blog article by Jim Frost, published on his highly recommended statistics blog. Jim included only the countries he was interested in, so Germany, my home country, was missing, at least in the first version of his article. After reading his article, I asked myself: what does his analysis look like for Germany? And how do you create such an analysis yourself?

A few important remarks:

We will first look at the main code and then at the results with a bit more code. Feel free to skip the code if you are not interested in R, but you will definitely miss something :)

Where to get the tools

This is what you need: run

install.packages("tidyverse")

to install the tidyverse. I am also using R Notebooks, an excellent way to walk you through my thoughts, my code, and the results.

By the way, this whole website is created using R notebooks in conjunction with Hugo.

How to code it

As mentioned before, I am using the tidyverse library, as it makes it very easy to transform data with just a few verbs.

library(tidyverse)

The data can be found on GitHub, in a data repository provided by the Johns Hopkins CSSE; you need to click the Raw button on GitHub to get the correct URL of the CSV data:

corona_data <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

We are only interested in a few countries, and we need to do some extra transformation, as there are several rows for China and the US that need to be summed up (Update: the format changed at the end of March, but the code still works):

data <- corona_data %>%
  filter(`Country/Region` == "Germany" | `Country/Region` == "Italy" |  `Country/Region` == "China"  |  `Country/Region` == "US" | `Country/Region` == "France"  | `Country/Region` == "New Zealand") %>%
  select(-`Province/State`) %>%
  select(-Lat,-Long) %>%
  group_by(`Country/Region`) %>%
  summarise_each(list(sum))
## Warning: `summarise_each_()` is deprecated as of dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
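
As the warning suggests, newer dplyr versions prefer across(). A minimal sketch of an equivalent aggregation, assuming dplyr 1.0 or later (using %in% instead of the chained conditions is my own simplification):

data <- corona_data %>%
  filter(`Country/Region` %in% c("Germany", "Italy", "China", "US", "France", "New Zealand")) %>%
  select(-`Province/State`, -Lat, -Long) %>%
  group_by(`Country/Region`) %>%
  summarise(across(everything(), sum)) # sums every remaining (date) column per country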

In order to plot the data, we need to swap columns and rows; unfortunately, the transpose drops our headers, so we save the country names first and re-use them as column names:

n <- data$`Country/Region` # save the country names from the first column
data <- as.data.frame(t(data[,-1])) # use t() to transpose, dropping the first column
colnames(data) <- n # add the names as column names
data <- tibble::rownames_to_column(data, "Day") # convert the rownames to a column
str(data) # Check the data types
## 'data.frame':    285 obs. of  7 variables:
##  $ Day        : chr  "1/22/20" "1/23/20" "1/24/20" "1/25/20" ...
##  $ China      : num  548 643 920 1406 2075 ...
##  $ France     : num  0 0 2 3 3 3 4 5 5 5 ...
##  $ Germany    : num  0 0 0 0 0 1 4 4 4 5 ...
##  $ Italy      : num  0 0 0 0 0 0 0 0 0 2 ...
##  $ New Zealand: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ US         : num  1 1 2 2 5 5 5 6 6 8 ...

As we can see, the date needs to be transformed to a date data type:

data <- data %>%
  mutate(Day = as.Date(Day,"%m/%d/%y"))

We use reshape2 to “melt” the data into long format; ggplot2 needs the data in a very specific shape, with each data point in its own row.

library(reshape2)
d <- melt(data, id.vars="Day")
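
If you prefer to stay within the tidyverse, tidyr’s pivot_longer() should give the same long format (a sketch, assuming tidyr 1.0 or later; note that melt() returns variable as a factor while pivot_longer() returns a character column, which makes no difference for the plots below):

d <- data %>%
  pivot_longer(-Day, names_to = "variable", values_to = "value") # one row per country and day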

The Results I - Reported Infections

Infection data has to be taken with a grain of salt:

ggplot(d, aes(Day,value, col=variable)) + 
  geom_line() +
  ggtitle("Infections beginning with China's reports") + 
  xlab("Date") + 
  ylab("Infections")

This is not what Jim plotted. Basically, he used the day when reported infections reached 20 people as day one. This was a bit tricky in R at first, and I had to “stackoverflow” a solution here:

d2 <- d %>%
  filter(value > 19) %>%
  group_by(variable) %>%
  mutate(id = row_number())

ggplot(d2, aes(id,value, col=variable)) + 
  geom_line() +
  ggtitle("Infections counted from 20th Case") + 
  xlab("Date") + 
  ylab("Infections")

I cannot reproduce the US data that Jim had, though. When I asked him on his blog what the reasons might be, he replied that he had not included the people who were on the cruise ship. Since March 23rd, the data no longer includes the US passengers on the cruise ships either, but we still do not get the graph that Jim had. Jim has not disclosed his code, so we don’t know how he arrived at his graph. That’s why these notebooks using Exploratory Data Analysis are so useful: you can see my thoughts, my code, and my results; everything is transparent to you.

We could also challenge why Jim chose 20 as a threshold. Obviously, the plot conveys more urgency with 20 as a threshold, but there is no particular reason for 20; it could just as well be 17 or 21. What would the data look like with the 1st case as the threshold? This would make a lot of sense, as it shows the development from the very first case:

d3 <- d %>%
  filter(value > 0) %>%
  group_by(variable) %>%
  mutate(id = row_number())

# plot
ggplot(d3, aes(id,value, col=variable)) + 
  geom_line() +
  ggtitle("Infections counted from first case") + 
  xlab("Date") + 
  ylab("Infections")

Some people asked me to also look at the data on a logarithmic scale. This means that the y-axis no longer uses the same distance from 1,000 to 2,000 to 3,000; instead, the values on the axis are log-transformed. For example,

\(\log_2 16 = 4\) because \(2^4 = 16\).
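
In R, this is simply the log2() function:

log2(16)
## [1] 4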

ggplot(d3, aes(id,value, col=variable)) + 
  geom_line() +
  scale_y_continuous(trans = 'log2') +
  ggtitle("Infections counted from first reported case in each country, log2 scale on y axis") + 
  xlab("Date") + 
  ylab("Infections")

This shows a slightly different picture.

Some commenters wanted to see how many tests were performed, and this is a tricky one, as the data provided by ourworldindata.org differs in how tests are performed and counted:

In Germany, for example, the high number of tests could be a result of one person being tested several times. Let’s look at the data anyway. While it is possible to download it automatically, I have downloaded a copy. The current data is from April 6th, 2020.

library(readxl)
tests <- read_excel("covid-testing-06-Apr.xlsx")
(my_tests <- tests %>%
  filter(Entity == "France - units unclear" | Entity == "Germany - samples tested" | Entity == "Italy - units unclear" | Entity == "South Korea - cases tested" | Entity == "United States - specimens tested (CDC)" | Entity == "United States - inconsistent units (COVID Tracking Project)") %>%
   select(Entity,Date, `Cumulative total`,`Cumulative total per million`))
## # A tibble: 6 x 4
##   Entity               Date                `Cumulative tota… `Cumulative total …
##   <chr>                <dttm>                          <dbl>               <dbl>
## 1 France - units uncl… 2020-03-31 00:00:00            224254               3412.
## 2 Germany - samples t… 2020-03-29 00:00:00            918460              11127.
## 3 Italy - units uncle… 2020-04-06 00:00:00            721732              12205.
## 4 South Korea - cases… 2020-04-06 00:00:00            466804               9063.
## 5 United States - inc… 2020-04-05 00:00:00           1762032               5316.
## 6 United States - spe… 2020-03-28 00:00:00            163660                494.

Data is not available for every day; instead, the data source includes the most recent date for which test data was available. As a consequence, the data differs from the data above.

library(ggrepel)
ggplot(my_tests, aes(x = Date, y = `Cumulative total`)) +
    geom_point(aes(color = Entity,  size = `Cumulative total per million`)) +
    geom_text_repel(aes(label=Entity)) + # Repel takes care of putting the labels close to the data points but without leaving the grid
  #scale_y_continuous(trans = 'log2') +
  labs(title="Performed Tests on Day of Year, Size reflects tests performed per million", x ="Date", y = "Tests performed") +
  theme(legend.position = "none")

To do: some of the labels overlap with the data points; this needs to be fixed in one of the next releases.

A few more thoughts: Results II - Deaths compared to Population

As discussed above, infections might not be a good indicator of the current spread, as countries have different approaches to when people are tested. We can get the number of deaths from the same data source, the Johns Hopkins CSSE data repository:

# Get the data
corona_deaths <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")

Let’s use the same approach as above but beginning with cases reported in China:

# Transform the data and remove the Diamond Princess
data_deaths <- corona_deaths %>%
  filter(`Country/Region` == "Germany" | `Country/Region` == "Italy" |  `Country/Region` == "China"  |  `Country/Region` == "US" | `Country/Region` == "France" | `Country/Region` == "New Zealand") %>%
  filter(`Province/State` !="Diamond Princess" | is.na(`Province/State`)) %>%
   select(-`Province/State`) %>%
  select(-Lat,-Long) %>%
  group_by(`Country/Region`) %>%
  summarise_each(list(sum))

# Swap rows and columns and convert date
n <- data_deaths$`Country/Region`
data_deaths <- as.data.frame(t(data_deaths[,-1]))
colnames(data_deaths) <- n
data_deaths <- tibble::rownames_to_column(data_deaths, "Day")
data_deaths <- data_deaths %>%
  mutate(Day = as.Date(Day,"%m/%d/%y"))

# Melt the data
d_deaths <- melt(data_deaths, id.vars="Day")

# plot
ggplot(d_deaths, aes(Day,value, col=variable)) + 
  geom_line() +
  ggtitle("Deaths per Country") + 
  xlab("Date") + 
  ylab("Deaths")

Italy surpassed China in the number of deaths even though it had fewer reported infections until late March. As the data for Germany, France, and the US is difficult to tell apart, let’s put the plot on a log scale.

ggplot(d_deaths, aes(Day,value, col=variable)) + 
  geom_line() +
  scale_y_continuous(trans = 'log2') +
  ggtitle("Deaths per Country, log2 Scale on y axis") + 
  xlab("Date") + 
  ylab("Deaths")
## Warning: Transformation introduced infinite values in continuous y-axis

Another discussion that came up on Facebook is how similar growth curves are, e.g. how many days Germany is behind Italy. Let’s have a look:

deaths_from_one <- d_deaths %>%
  filter(value > 0) %>%
  group_by(variable) %>%
  mutate(id = row_number())

Now plot the data:

ggplot(deaths_from_one, aes(id,value, col=variable)) + 
  geom_line() +
  ggtitle("Deaths per Country from the 1st day of a Death") + 
  xlab("Days since 1st death") + 
  ylab("Deaths")

In one of the earlier versions of this notebook, I had used the ratio of deaths to infected people. Many commenters, however, said that this is unfair due to different approaches to testing. The Washington Post states that the death rate might be inflated “because a considerable number of people likely have or had the virus but were not diagnosed because their symptoms were too mild to see a doctor.” Another way to analyze it is to compare deaths to population data. This dataset contains data up to 2020:

population <- read_delim("population.csv",  "\t", escape_double = FALSE, trim_ws = TRUE)

Again, we need to filter the data as the data set has data for every country and every year since 1960:

(population_subset <- population %>%
  filter(`Country` == "China" | `Country` == "France" | `Country` == "Germany" | `Country` == "Italy" | `Country` == "United States" | `Country` == "New Zealand" | `Country` == "Israel") %>%
  select(`Country`,Population) %>%
  mutate(`Country` = replace(`Country`, `Country` == "United States","US")) %>%
  rename(variable = `Country`) %>%
  rename(population = Population)
 )
## # A tibble: 7 x 2
##   variable    population
##   <chr>            <dbl>
## 1 China       1439323776
## 2 US           331002651
## 3 Germany       83783942
## 4 France        65273511
## 5 Italy         60461826
## 6 Israel         8655535
## 7 New Zealand    4822233

We compute deaths per 100,000 people of population:

deaths_vs_population <- d_deaths %>%
  left_join(population_subset) %>%
  mutate(value = (value/population)*100000)

Let’s plot:

  # plot
ggplot(deaths_vs_population, aes(Day,value, col=variable)) + 
  geom_line() +
  ggtitle("Corona-related Deaths per 100.000 people per Country") + 
  xlab("Date") + 
  ylab("Number of Deaths per 100.000")

Germany is very close to the US curve, which is why it is barely visible.

  # plot
ggplot(deaths_vs_population, aes(Day,value, col=variable)) + 
  #scale_x_continuous(trans = 'log2') +
  scale_y_continuous(trans = 'log2') +
  geom_line() +
  ggtitle("Corona-related Deaths per 100.000 people per Country, log2 y-Axis") + 
  xlab("Date") + 
  ylab("Number of Deaths per 100.000")
## Warning: Transformation introduced infinite values in continuous y-axis

The disadvantage of this approach is that even deaths are measured differently. Not every deceased person is examined to determine whether the death was caused by the Corona virus, at least not in every country. People might not show any symptoms, die of Corona, and still not be reported as such.

Comparing Corona to Influenza: Results III

Data was taken manually (!) from RKI’s weekly reports. Data is delayed by at least one week, so please take it with a grain of salt. Influenza data has been updated on March 26th. One thing to keep in mind is that influenza is actually a collection of different viruses, e.g. Influenza A(H1N1)pdm09, Influenza A(H3N2) and Influenza B.

grippe <- read_delim("grippe.csv", 
    ";", escape_double = FALSE, col_types = cols(date = col_date(format = "%d.%m.%y"), 
        infected = col_integer()), trim_ws = TRUE)

Let’s look at infections first:

flu_vs_influenca_inf <- data %>%
  select(Day,Germany) %>%
  full_join(grippe, c("Day" = "date")) %>%
  filter(Day > "2018-12-31") %>%
  select(-deaths) %>%
  rename(Corona = Germany) %>%
  rename(Influenza = infected) 
# melt the data
comparison_inf <- melt(flu_vs_influenca_inf, id.vars="Day")

# remove NAs
comparison_inf <- comparison_inf %>%
  filter(!is.na(value))

# plot
ggplot(comparison_inf, aes(Day,value, col=variable)) + 
  geom_line()  +
  ggtitle("Influenza versus Corona Infections 2019/2020") + 
  xlab("Date") + 
  ylab("Accumulated Number of Infections")

What if we looked at a full cycle of Influenza, e.g. 2017/2018, a really tough season? Now it becomes tricky: we have two different timeframes, but we do not want to plot them against calendar dates, as we want to compare them directly. Unfortunately, there is another issue: Influenza data is provided on a weekly basis but Corona data on a daily basis. We also want to avoid that the beginning of an Influenza season, which typically starts at the end of a year, ends up at the end of our graph instead of at the beginning. As a consequence, we do a little hack: we convert days into week numbers and give the weeks before New Year a negative week number. We also need to take the last value of the Corona data in each week.
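
To illustrate the week trick in isolation, here is a small sketch with made-up dates: lubridate::week() returns the week of the year, and subtracting 53 from every week number above 38 pushes the late-year weeks onto negative numbers, so they sort before January.

example_days <- as.Date(c("2019-12-23", "2020-01-06", "2020-03-16")) # made-up dates
weeks <- lubridate::week(example_days)  # 51  1  11
ifelse(weeks > 38, weeks - 53, weeks)   # -2  1  11 -> late December sorts before January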

data %>%
  select(Day,Germany) %>%
  full_join(grippe, c("Day" = "date")) %>%
  select(-deaths) %>%
  rename(Corona = Germany) %>%
  rename(Influenza = infected) %>%
  mutate(Influenza17_18 = case_when(Day < "2018-12-31" ~ Influenza)) %>%
  mutate(Influenza19_20 = case_when(Day > "2018-12-31" ~ Influenza)) %>%
  select(-Influenza) %>%
  melt(., id.vars="Day") %>%
  filter(!is.na(value)) %>%
  mutate(week = lubridate::week((Day))) %>%
  mutate(week = if_else(week > 38,week-53,week)) %>%
  group_by(week, variable) %>%
  summarize(value = max(value)) %>%
  ggplot(aes(week,value, col=variable)) + 
  geom_line()  +
  ggtitle("Two Influenza Seasons compared to Corona") + 
  xlab("Week") + 
  ylab("Accumulated Number of reported Infections")
## `summarise()` regrouping output by 'week' (override with `.groups` argument)

Obviously, this is not the best way to display the comparison. Depending on the day of the week when this script runs, the Corona curve may seem to flatten, but only because the last week is not complete yet. Also, as said above, infections may not be the best indicator.

Now, let’s look at deaths.

data_deaths %>%
  select(Day,Germany) %>%
  full_join(grippe, c("Day" = "date")) %>%
  select(-infected) %>%
  rename(Corona = Germany) %>%
  rename(Influenza = deaths) %>%
  mutate(Influenza17_18 = case_when(Day < "2018-12-31" ~ Influenza)) %>%
  mutate(Influenza19_20 = case_when(Day > "2018-12-31" ~ Influenza)) %>%
  select(-Influenza) %>%
  melt(., id.vars="Day") %>%
  filter(!is.na(value)) %>%
  filter(value != 0) %>%
  mutate(week = lubridate::week((Day))) %>%
  mutate(week = if_else(week > 38,week-53,week)) %>%
  group_by(week, variable) %>%
  summarize(value = max(value)) %>%
  ggplot(aes(week,value, col=variable)) + 
  geom_line()  +
  ggtitle("Deaths: Two Influenza Seasons compared to Corona in Germany") + 
  xlab("Week") + 
  ylab("Accumulated Number of reported Deaths")
## `summarise()` regrouping output by 'week' (override with `.groups` argument)

Comparing infections to deaths, the “death curve” seems to grow faster. Even here, however, we need to be careful, as the actual number of influenza deaths might simply be higher. It is usually estimated via excess mortality, i.e. measuring how many more people have died compared to a timeframe with a low number of influenza cases, and these numbers will only be available in a few months.

Other comments referred to the fact that there is some immunity against influenza in the population, whereas there is none against Corona, and that this leads to a higher death rate for Corona. But that is exactly the point: there is no immunity, and as a consequence there will be more deaths. Daily or weekly data from the Spanish flu is unfortunately not available. But looking at a tougher influenza season might be a reasonable comparison, because those seasons are tough precisely when the virus has mutated in a way that the vaccine does not cover well.

Another point to keep in mind is that the restrictions imposed on the population also have an impact on the current influenza season, as the RKI emphasized.

Finally, a log2 plot:

data_deaths %>%
  select(Day,Germany) %>%
  full_join(grippe, c("Day" = "date")) %>%
  select(-infected) %>%
  rename(Corona = Germany) %>%
  rename(Influenza = deaths) %>%
  mutate(Influenza17_18 = case_when(Day < "2018-12-31" ~ Influenza)) %>%
  mutate(Influenza19_20 = case_when(Day > "2018-12-31" ~ Influenza)) %>%
  select(-Influenza) %>%
  melt(., id.vars="Day") %>%
  filter(!is.na(value)) %>%
  filter(value != 0) %>%
  mutate(week = lubridate::week((Day))) %>%
  mutate(week = if_else(week > 38,week-53,week)) %>%
  group_by(week, variable) %>%
  summarize(value = max(value)) %>%
  ggplot(aes(week,value, col=variable)) + 
  geom_line()  +
  scale_y_continuous(trans = 'log2') +
  ggtitle("Deaths: Two Influenza Seasons compared to Corona in Germany, y axis log2") + 
  xlab("Week") + 
  ylab("Accumulated Number of reported Deaths")
## `summarise()` regrouping output by 'week' (override with `.groups` argument)

Obviously, the Corona line is increasing at a higher rate (the last week may not be complete).

Even more thoughts: Results IV - Pure Numbers

What are the countries with the most infections? First, we need to rename the last column, because a new column is added with each new day and we want the script to work even when the column names change:

new_data <- corona_data %>%
  rename(last_day = rev(names(corona_data))[1])
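
On recent dplyr/tidyselect versions, the same rename can presumably be written a bit more compactly with the last_col() helper; a sketch:

new_data <- corona_data %>%
  rename(last_day = last_col()) # last_col() selects the rightmost column, i.e. the newest day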

Now, it is a simple data transformation:

new_data %>%
  group_by(`Country/Region`) %>%
  select(-`Province/State`) %>%
  select(-Lat,-Long) %>%
  summarise_each(list(sum)) %>%
  select(`Country/Region`,last_day) %>%
  arrange(desc(last_day))
## # A tibble: 190 x 2
##    `Country/Region` last_day
##    <chr>               <dbl>
##  1 US                9206975
##  2 India             8229313
##  3 Brazil            5545705
##  4 Russia            1624648
##  5 France            1458999
##  6 Spain             1185678
##  7 Argentina         1173533
##  8 Colombia          1082767
##  9 United Kingdom    1038054
## 10 Mexico             929392
## # … with 180 more rows

Let’s do the same for deaths, renaming the last column first:

new_death_data <- corona_deaths %>%
  rename(last_day = rev(names(corona_deaths))[1])
new_death_data %>%
  group_by(`Country/Region`) %>%
  select(-`Province/State`) %>%
  select(-Lat,-Long) %>%
  summarise_each(list(sum)) %>%
  select(`Country/Region`,last_day) %>%
  arrange(desc(last_day))
## # A tibble: 190 x 2
##    `Country/Region` last_day
##    <chr>               <dbl>
##  1 US                 230995
##  2 Brazil             160074
##  3 India              122607
##  4 Mexico              91895
##  5 United Kingdom      46807
##  6 Italy               38826
##  7 France              37057
##  8 Spain               35878
##  9 Iran                35298
## 10 Peru                34476
## # … with 180 more rows