R: dplyr/sparklyr vs data.table Performance


In their 2017 book "R for Data Science", Grolemund and Wickham state that data.table is recommended over dplyr when working with larger datasets (10 to 100 GB) on a regular basis. Having started with Wickham's sparklyr (R's interface to Spark using the dplyr dialect), I was wondering how much faster data.table actually is. This is not the most professional benchmark, given that I just compare system time before and after each step, but it gives an indication of the advantages and disadvantages of each approach.
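The timing below always follows the same Sys.time() sandwich; a small helper like the following (hypothetical, not what I actually ran) expresses the pattern more compactly if you want to reproduce it:

`
# Hypothetical helper wrapping the Sys.time() sandwich used throughout this post.
time_step <- function(expr) {
  start_time <- Sys.time()
  result <- force(expr)            # evaluate the expression inside the timed section
  print(Sys.time() - start_time)   # prints e.g. "Time difference of ... secs"
  invisible(result)
}

# Usage sketch: configs <- time_step(userDataSetDT[!is.na(Timestamp)])
`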

My work involves dealing with larger files almost every day, and for this test I have used a 16 GB CSV file with 88,956,866 rows and 7 columns. After reading the file, I will do a few simple operations on the data and then write the result back to disk. The test is performed on an AWS EC2 m4.2xlarge instance with 32 GB of RAM and 8 vCPUs, of which we will use 4. Let's start with data.table:

`
library(data.table)

overallStart_time <- Sys.time()

start_time <- Sys.time()
userDataSetDT <- fread("/home/tom/huge-file.csv")
# Read 88956065 rows and 7 (of 7) columns from 15.969 GB file in 00:02:52
end_time <- Sys.time()
end_time - start_time
# Time difference of 6.507585 mins
`

I have no idea why fread says it only needed 2:52 minutes; there were no other CPU-hungry processes running, nor any processes with a huge impact on I/O.
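One way to dig into this, which I did not do here, would be to wrap the read in base R's system.time() and turn on fread's verbose output; that separates CPU time from elapsed wall-clock time and shows where fread itself thinks the time went:

`
# Untimed sketch: system.time() reports user/system CPU time and elapsed
# wall-clock time separately; verbose = TRUE makes fread print its own
# per-phase timings.
timing <- system.time(
  userDataSetDT <- fread("/home/tom/huge-file.csv", verbose = TRUE)
)
print(timing)
`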

`
start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(Timestamp)]
end_time <- Sys.time()
end_time - start_time
# Time difference of 39.44712 secs

start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(URL)]
end_time <- Sys.time()
end_time - start_time
# Time difference of 38.62926 secs

start_time <- Sys.time()
configs <- userDataSetDT[configSection == "Select Engine"]
end_time <- Sys.time()
end_time - start_time
# Time difference of 2.412425 secs

start_time <- Sys.time()
fwrite(configs, file = "configsDT.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.07708573 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
# Time difference of 8.341271 mins
`
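As an aside, the two NA filters and the string comparison could also be combined into a single data.table expression, avoiding two intermediate copies of the roughly 89-million-row table; a sketch I did not time separately:

`
# Combined filter in one pass over the table (untimed sketch).
configs <- userDataSetDT[!is.na(Timestamp) & !is.na(URL) &
                           configSection == "Select Engine"]
fwrite(configs, file = "configsDT.csv")
`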

data.table uses only one vCPU (i.e. one core) by default, but it consumes more virtual memory (43 GB versus the 13 GB used by the R/sparklyr combination). We could use packages such as parallel, but parallelizing data.table code is a bit more involved; see the sketch below.
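For completeness, newer data.table releases expose internal (OpenMP-based) parallelism for fread, fwrite and some grouped operations via setDTthreads(); a minimal sketch, assuming a version that supports it (this was not used in the benchmark above):

`
library(data.table)

# Ask data.table to use up to 4 threads for its internally parallelised
# operations; getDTthreads() reports the current setting.
setDTthreads(4)
getDTthreads()
`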

Now, let’s look at sparklyr:

`
library(sparklyr)
library(dplyr)

overallStart_time <- Sys.time()

conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
conf$sparklyr.cores.local <- 4
conf[["sparklyr.shell.driver-memory"]] <- "8G"
conf[["sparklyr.shell.executor-memory"]] <- "16G"
conf$spark.yarn.executor.memoryOverhead <- "4g"
conf$spark.memory.fraction <- 0.9
conf$spark.driver.maxResultSize <- "8G"

sc <- spark_connect(master = "local",
                    version = "2.2.1",
                    config = conf)

start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 2.432772 mins

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.01466608 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.001867533 secs

start_time <- Sys.time()
configs <- userDataSet %>%
  filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.001763344 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
# Time difference of 1.333298 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.01505065 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
# Time difference of 3.878917 mins
`

We have saved more than 50% here! However, looking at the details, we see that collecting the data cost us a lot of time. Having said that, the filter steps are faster with sparklyr than with data.table (keeping in mind that Spark evaluates them lazily, so the real work only happens at collect time). We used 4 vCPUs for this, so there seems to be an advantage in parallelizing the computation, while there is almost no difference in writing the data, not least because data.table's fwrite is highly optimized as well. Edit: As one commenter pointed out below, you would probably not collect the whole dataset into R and rather let Spark write the CSV, but I have not done so in order to keep the approaches comparable; a sketch of that variant follows below.
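For reference, letting Spark write the result itself could look roughly like this; spark_write_csv() writes from the executors into a directory of part files instead of pulling everything into the R session (untimed sketch, illustrative path):

`
# Untimed sketch: write the filtered result from Spark directly to disk.
# Note that Spark produces a directory of part files, not a single CSV.
configs %>%
  spark_write_csv(path = "/home/tom/configs_spark", header = TRUE, mode = "overwrite")
`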

If we used only one core for sparklyr (which doesn't make much sense, since even every MacBook today has 4 cores), how long would it take?
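The only change compared to the run above is the local core count in the configuration; roughly, assuming we reconnect with the same settings otherwise:

`
# Assumed reconfiguration for the single-core run (same config otherwise).
spark_disconnect(sc)
conf$sparklyr.cores.local <- 1
sc <- spark_connect(master = "local", version = "2.2.1", config = conf)
`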

`
start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 4.651707 mins

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.003172636 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.002816916 secs

start_time <- Sys.time()
configs <- userDataSet %>%
  filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.002915621 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
# Time difference of 4.487081 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
# Time difference of 0.01447606 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
# Time difference of 8.345677 mins
`

The filters are a bit slower, albeit not noticeably so. sparklyr is much slower, though, when it comes to reading large files and collecting data with only one core. Having said that, as mentioned above, there is no reason to use only one core.

However, there is still a good reason to use data.table: as you can see in the configuration section of the sparklyr code, a large chunk of memory had to be assigned to the driver (which receives the collected result) and the executor, simply because the computation or the collection will throw errors if not enough memory is available. Finding out how much memory should be allocated to which component is difficult, and getting it wrong means restarting R and running the code again and again, which makes data.table the better choice since no configuration is required whatsoever. In addition, it is impressive how fast data.table still is on a single core compared to sparklyr on 4 cores. On the other hand, running the same code on my MacBook Air with 8 GB of RAM and 4 cores, data.table had not managed to read the file within 30 minutes, whilst sparklyr (using 3 of the 4 cores) processed everything in less than 8 minutes.
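For what it's worth, a scaled-down configuration for a machine like the 8 GB MacBook Air might look roughly like the following; the values are illustrative guesses, not the exact settings from that run:

`
# Illustrative, scaled-down config for an 8 GB laptop (values are guesses,
# not the exact settings used on the MacBook Air).
conf <- spark_config()
conf$sparklyr.cores.local <- 3
conf[["sparklyr.shell.driver-memory"]] <- "4G"
conf$spark.driver.maxResultSize <- "2G"
sc <- spark_connect(master = "local", version = "2.2.1", config = conf)
`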

While I personally find dplyr a bit easier to learn, data.table has won me over, too.