R: dplyr/sparklyr vs data.table Performance

In their 2017 book “R for Data Science“, Grolemund and Wickham state that data.table is recommended instead of dplyr when working with larger datasets (10 to 100 Gb) on a regular basis. Having started with Wickhams sparklyr (R’s interface to Spark using the dplyr dialect), I was wondering how much faster data.table actually is. This is not the most professional benchmark given that I just compare system time before and after the script ran but it gives an indication of the advantages and disadvantages of each approach.

My work includes dealing with larger files almost every day, and for this test, I have used a 16 GB CSV file with 88.956.866 rows and 7 columns. After reading the file, I will do a few simple operations with that data and then write the result back to disk. The test is performed on an AWS EC2 m4.2xlarge instance with 32 GB of RAM and 8 vCPUs of which we will use 4. Let’s start with data.table:

> library(data.table)
> overallStart_time <- Sys.time()
> start_time <- Sys.time()
> userDataSetDT <- fread("/home/tom/huge-file.csv")
Read 88956065 rows and 7 (of 7) columns from 15.969 GB file in 00:02:52
> end_time <- Sys.time()
> end_time - start_time
Time difference of 6.507585 mins

I have no idea why fread says it only needed 2:52 minutes; there were no other CPU-hungry processes running or processes that had a huge impact on IO.

> start_time <- Sys.time()
> userDataSetDT <- userDataSetDT[!is.na(Timestamp)]
> end_time <- Sys.time()
> end_time - start_time
Time difference of 39.44712 secs
>
> start_time <- Sys.time()
> userDataSetDT <- userDataSetDT[!is.na(URL)]
> end_time <- Sys.time()
> end_time - start_time
Time difference of 38.62926 secs
>
> start_time <- Sys.time()
> configs <- userDataSetDT[configSection == "Select Engine"]
> end_time <- Sys.time()
> end_time - start_time
Time difference of 2.412425 secs
>
> start_time <- Sys.time()
> fwrite(configs,file="configsDT.csv", row.names = FALSE)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.07708573 secs
>
> overallEnd_time <- Sys.time()
> overallEnd_time - overallStart_time
Time difference of 8.341271 mins

data.table uses only one vCPU or one core respectively by default but consumes more virtual memory (43GB instead of 13GB being used by the R/sparklyr combination). We could use packages such as the parallel one but in fact, data.table is a bit more complex with respect to parallelization.

Now, let’s look at sparklyr:

> overallStart_time <- Sys.time()
> conf <- spark_config()
> conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
> conf$`sparklyr.cores.local` <- 4
> conf$`sparklyr.shell.driver-memory` <- "8G"
> conf$`sparklyr.shell.executor-memory` <- "16G"
> conf$spark.yarn.executor.memoryOverhead <- "4g"
> conf$spark.memory.fraction <- 0.9
> conf$spark.driver.maxResultSize <- "8G"
> sc <- spark_connect(master = "local",
+ version = "2.2.1",
+ config = conf)
>
> start_time <- Sys.time()
> userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 2.432772 mins
>
> start_time <- Sys.time()
> userDataSet <- userDataSet %>%
+ filter(!is.na(Timestamp))
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.01466608 secs
>
> start_time <- Sys.time()
> userDataSet <- userDataSet %>%
+ filter(!is.na(URL))
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.001867533 secs
>
> start_time <- Sys.time()
> configs <- userDataSet %>%
+ filter(configSection == "Select Engine")
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.001763344 secs
>
> start_time <- Sys.time()
> collected <- collect(configs)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 1.333298 mins
>
> start_time <- Sys.time()
> write.csv(collected, file="configs.csv", row.names = FALSE)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.01505065 secs
>
> overallEnd_time <- Sys.time()
> overallEnd_time - overallStart_time
Time difference of 3.878917 mins

We have saved more than 50% here! However, looking at the details, we see that collecting the data has cost us a lot of time. Having said that, doing the selects is faster on sparklyr compared to data.table. We have used 4 vCPUs for this, so there seems to be an advantage in parallelizing computing the data, there is almost no difference in writing the data, also given that data.table’s fread has been highly optimized. Edit: As one commenter said below, you would probably not collect the whole dataset and rather let Spark write the CSV but I have not done to make the approach more comparable.

If we used only one core for sparklyr (which doesn’t make any sense because even every Macbook today has 4 cores), how long would it take then?

> start_time <- Sys.time()
> userDataSet <- spark_read_csv(sc, "country", "/home/tom/huge-file.csv", memory = FALSE)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 4.651707 mins
>
> start_time <- Sys.time()
> userDataSet <- userDataSet %>%
+ filter(!is.na(Timestamp))
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.003172636 secs
>
> start_time <- Sys.time()
> userDataSet <- userDataSet %>%
+ filter(!is.na(URL))
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.002816916 secs
>
> start_time <- Sys.time()
> configs <- userDataSet %>%
+ filter(configSection == "Select Engine")
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.002915621 secs
>
> start_time <- Sys.time()
> collected <- collect(configs)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 4.487081 mins
>
> start_time <- Sys.time()
> write.csv(collected, file="configs.csv", row.names = FALSE)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.01447606 secs
>
> overallEnd_time <- Sys.time()
> overallEnd_time - overallStart_time
Time difference of 8.345677 mins

The selects are a bit slower albeit not noticable. sparklyr is much slower though when it comes to reading large files and collectiong data with only one core. Having said that, as mentioned above, there is no reason to use only one core.

However, there is still a good reason to use data.table: As you can see in the config file of the sparklyr code, a huge chunk of memory had to be assigned to the collector and the driver, simply because the computation or the collection will throw errors if there is not enough memory available. Finding out how much memory should be allocated to what component is difficult, and not allocating the right amount of memory will result in restarting R and running the code again and again, making data.table the better choice since no configuration is required whatsoever. In addition, it is amazing how fast data.table still is using one core only compared to sparklyr using 4 cores. On the contrary, running the same code on my Macbook Air with 8 GB RAM and 4 cores, data.table had not managed to read the file in 30 minutes whilst sparkly (using 3 of the 4 cores) managed to get everything processed in less than 8 minutes.

While I personally find dplyr a bit more easy to learn, data.table has caught me, too.

2 Replies to “R: dplyr/sparklyr vs data.table Performance”

  1. Spark’s selects, filters, etc, are all _lazy_! The reason that the collect is slow here is probably because that is where it’s actually executing the calculations. However, there’s also a considerable cost of sending the entire selected set back to R.

    Use Spark to write the output csv, rather than transferring to R and writing via R — that should save you some time.

    I might also suggest a grouping and summary operation, as that is often what is really needed. Try with data.table, and with Spark, doing a collect() on the summary results?

  2. Thanks for the comment. Of course, it is a bit unfair to compare data.table against lazy-loading Spark. What I am basically interested in is the question when is it worth to wait for data.table results because the overhead of setting up Spark and Sparklyr and collecting data is expensive as well. I agree that collecting is not necessary when I write csv to file directly from Spark but I would still have to repartition data so that spark_write_csv writes one csv file and not several. Also, often you just need the results of an operation and not the whole data set; I guess that’s where you would collect rather than let Spark write a CSV that is than picked up by Base R. Again, this is just a simplified comparison, and I never work like that. It has just been done for comparison.

Leave a Reply

Your email address will not be published. Required fields are marked *