R: dplyr/sparklyr vs data.table Performance


In their 2017 book “R for Data Science”, Grolemund and Wickham recommend data.table over dplyr for anyone working with larger datasets (10 to 100 GB) on a regular basis. Having started with Wickham’s sparklyr (R’s interface to Spark that uses the dplyr dialect), I was wondering how much faster data.table actually is. This is not the most rigorous benchmark, given that I simply compare system time before and after each step, but it gives an indication of the advantages and disadvantages of each approach.

My work involves dealing with larger files almost every day, and for this test I have used a 16 GB CSV file with 88,956,866 rows and 7 columns. After reading the file, I will perform a few simple operations on the data and then write the result back to disk. The test is performed on an AWS EC2 m4.2xlarge instance with 32 GB of RAM and 8 vCPUs, of which we will use 4. Let’s start with data.table:

```
library(data.table)
overallStart_time <- Sys.time()
start_time <- Sys.time()
userDataSetDT <- fread("/home/tom/huge-file.csv")
Read 88956065 rows and 7 (of 7) columns from 15.969 GB file in 00:02:52
end_time <- Sys.time()
end_time - start_time
Time difference of 6.507585 mins
```

I have no idea why fread reports that it only needed 2:52 minutes; there were no other CPU-hungry or I/O-heavy processes running at the time.
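If you want to separate CPU time from wall-clock time for a single call, base R’s system.time() can be wrapped around the read as an alternative to the Sys.time() bookkeeping; a minimal sketch:

```
# Alternative timing: reports user/system CPU time and elapsed wall-clock time in one call
timing <- system.time(userDataSetDT <- fread("/home/tom/huge-file.csv"))
print(timing)
```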

```
start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(Timestamp)]
end_time <- Sys.time()
end_time - start_time
Time difference of 39.44712 secs

start_time <- Sys.time()
userDataSetDT <- userDataSetDT[!is.na(URL)]
end_time <- Sys.time()
end_time - start_time
Time difference of 38.62926 secs

start_time <- Sys.time()
configs <- userDataSetDT[configSection == "Select Engine"]
end_time <- Sys.time()
end_time - start_time
Time difference of 2.412425 secs

start_time <- Sys.time()
fwrite(configs, file = "configsDT.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 0.07708573 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
Time difference of 8.341271 mins
```
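As a side note, the three filters above could also be combined into a single data.table expression so that the intermediate results are never materialised; a minimal, untimed sketch using the same columns:

```
# One-pass variant of the three filters above (not part of the timed run)
configs <- userDataSetDT[!is.na(Timestamp) & !is.na(URL) & configSection == "Select Engine"]
fwrite(configs, file = "configsDT.csv")
```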

By default, data.table used only one vCPU (i.e. one core) here, but it consumed more virtual memory (43 GB versus the 13 GB used by the R/sparklyr combination). We could reach for packages such as parallel, but parallelizing data.table code is a bit more involved.
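For what it is worth, more recent data.table versions parallelize some internal operations (such as fread, fwrite and ordering) via OpenMP, and the thread count can be inspected and changed directly; a minimal sketch, not part of the benchmark above:

```
library(data.table)

# How many OpenMP threads data.table is currently allowed to use
getDTthreads()

# Let data.table's parallelized internals use up to 4 threads
setDTthreads(4)
```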

Now, let’s look at sparklyr:

```
library(sparklyr)
library(dplyr)

overallStart_time <- Sys.time()

conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
conf$sparklyr.cores.local <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$`sparklyr.shell.executor-memory` <- "16G"
conf$spark.yarn.executor.memoryOverhead <- "4g"
conf$spark.memory.fraction <- 0.9
conf$spark.driver.maxResultSize <- "8G"
sc <- spark_connect(master = "local",
                    version = "2.2.1",
                    config = conf)

start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 2.432772 mins

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.01466608 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.001867533 secs

start_time <- Sys.time()
configs <- userDataSet %>%
  filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
Time difference of 0.001763344 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
Time difference of 1.333298 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 0.01505065 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
Time difference of 3.878917 mins
```

We have saved more than 50% of the overall runtime here! However, looking at the details, we see that collecting the data cost us a lot of time, while the filters themselves are much faster in sparklyr than in data.table. We used 4 vCPUs for this, so there seems to be a clear advantage in parallelizing the computation; in writing the data there is almost no difference, and data.table’s fread and fwrite are highly optimized to begin with. Edit: As one commenter points out below, you would normally not collect the whole dataset into R but rather let Spark write the CSV itself; I have not done so here, to keep the two approaches comparable.
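For completeness, letting Spark write the result itself, as the commenter suggests, would look roughly like this; a minimal sketch, assuming the same configs transformation as above (note that spark_write_csv produces a directory of part files rather than a single CSV):

```
# Skip collect() and let the Spark workers write the filtered result directly
spark_write_csv(configs, path = "/home/tom/configs-spark-csv")
```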

If we used only one core for sparklyr (which doesn’t make much sense, since even today’s MacBooks have 4 cores), how long would it take then?
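The only change compared to the 4-core run is the local core count; a minimal sketch of the reconnect, with all other spark_config() settings as above:

```
# Reconnect with a single local core; everything else stays as configured above
conf$sparklyr.cores.local <- 1
sc <- spark_connect(master = "local", version = "2.2.1", config = conf)
```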

```
start_time <- Sys.time()
userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 4.651707 mins

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(Timestamp))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.003172636 secs

start_time <- Sys.time()
userDataSet <- userDataSet %>%
  filter(!is.na(URL))
end_time <- Sys.time()
end_time - start_time
Time difference of 0.002816916 secs

start_time <- Sys.time()
configs <- userDataSet %>%
  filter(configSection == "Select Engine")
end_time <- Sys.time()
end_time - start_time
Time difference of 0.002915621 secs

start_time <- Sys.time()
collected <- collect(configs)
end_time <- Sys.time()
end_time - start_time
Time difference of 4.487081 mins

start_time <- Sys.time()
write.csv(collected, file = "configs.csv", row.names = FALSE)
end_time <- Sys.time()
end_time - start_time
Time difference of 0.01447606 secs

overallEnd_time <- Sys.time()
overallEnd_time - overallStart_time
Time difference of 8.345677 mins
```

The filters are a bit slower, although not noticeably so. With only one core, however, sparklyr is much slower at reading the large file and collecting the data. Having said that, as mentioned above, there is no reason to use only one core.

However, there is still a good reason to use data.table: as you can see in the sparklyr configuration above, a large chunk of memory had to be assigned to the executor and the driver, simply because the computation or the collection will throw errors if not enough memory is available. Finding out how much memory to allocate to which component is difficult, and getting it wrong means restarting R and running the code again and again. That makes data.table the easier choice, since it requires no configuration whatsoever. It is also remarkable how fast data.table still is on a single core compared to sparklyr using 4 cores. On the other hand, running the same code on my MacBook Air with 8 GB of RAM and 4 cores, data.table had not managed to read the file after 30 minutes, whereas sparklyr (using 3 of the 4 cores) processed everything in less than 8 minutes.

While I personally find dplyr a bit easier to learn, data.table has won me over, too.

One Reply to “R: dplyr/sparklyr vs data.table Performance”

  1. This comparison is a bit flawed. It is not comparing data.table to dplyr, but rather native computation in R to distributed computation using Spark. Moreover, it performs a very simple set of operations, namely reading, filtering and writing.

    If we look at dplyr and data.table by themselves, there is not a lot of difference performance-wise here (although fread and fwrite are quite fast compared to base read.csv and readr::read_csv). Essentially this uses the R-native implementation of is.na and then subsets. But even here you can get a decent improvement by instead using the data.table implementation:

    na.omit(data, cols = c("Timestamp", "URL"))

    But the true strength of data.table comes when you are using grouped operations, joins and the update-by-reference mechanics (which allow you to avoid copying memory and to do some things very efficiently; see the short sketch below this comment).

    If we then look at Spark instead, you need to understand that it is not that “filtering is blazingly fast, but collecting takes time”. With Spark you do not really work with the data; you define a transformation. Only when it is needed is this transformation optimized by the Spark engine and executed. So when you filter for something, this does not perform an operation but defines a (delayed) transformation, which might not even be executed if it is never actually needed. The actual work is done once you collect the data or once you tell Spark to write the data to disk.

    What you need to keep in mind with Spark is that workers will work on portions of the data. Now, some operations are very straightforward in this setting, such as reading and writing data and filtering, which happens to be what you used. None of these require you to have a specific subset of the data.

    Other operations such as joins require the same join columns to be present on the same worker, which means that before a join the data is redistributed (shuffled) across workers, making joins very expensive.

    So you’ll note that one thing data.table does really well (that is, joins) is quite a pain in Spark.
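To make the commenter’s point about data.table a bit more concrete, here is a minimal, hypothetical sketch of update by reference, a grouped operation and a keyed join; the tables and columns are made up and are not part of the benchmark above:

```
library(data.table)

# Hypothetical tables, purely for illustration
visits <- data.table(user = c("a", "b", "a", "c"), hits = c(3, 5, 2, 7))
lookup <- data.table(user = c("a", "b", "c"), country = c("DE", "US", "FR"))

# Update by reference: adds a column in place, without copying the whole table
visits[, log_hits := log(hits)]

# Grouped operation: total hits per user
totals <- visits[, .(total_hits = sum(hits)), by = user]

# Keyed join: lookup[totals] joins the two tables on the shared key column
setkey(totals, user)
setkey(lookup, user)
joined <- lookup[totals]
```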
