2 Setting up Spark on your laptop

While the great benefits of Spark come from accessing it through a cluster of computers, there are two key reasons to install it on your laptop:

- to test your code on a small data set, for learning and experimenting. Spark does not care how you connect to it, so if your code works on your laptop it will also work on a cluster.
- to perform analyses that you cannot run in plain R on your laptop because of memory limitations. You are often given large data sets that simply do not fit in R memory, especially if you are loading multiple files and need to link them.

One benefit of Spark is that it allows you to read, join, subset and process such files to create smaller files that fit in memory and are ready for analysis. Another benefit of Spark is its machine learning library, which allows you to perform some analyses directly on large data sets.
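To make the "join, subset, then bring into R" idea concrete, here is a minimal sketch. It assumes Spark is already installed locally (see the rest of this chapter) and that the dplyr package is available. The two small data frames and their column names (patient_id, year, cost, age) are made up for illustration; with real data you would typically read the files with spark_read_csv() instead of copy_to().

```r
library(sparklyr)
library(dplyr)

# Connect to the Spark installation on this laptop
sc <- spark_connect(master = "local")

# Hypothetical toy data standing in for two large files that need linking
visits <- data.frame(patient_id = c(1, 2, 2, 3),
                     year       = c(2018, 2019, 2019, 2019),
                     cost       = c(10, 20, 30, 40))
patients <- data.frame(patient_id = c(1, 2, 3),
                       age        = c(34, 51, 67))

# Copy both tables into Spark
visits_tbl   <- copy_to(sc, visits,   name = "visits",   overwrite = TRUE)
patients_tbl <- copy_to(sc, patients, name = "patients", overwrite = TRUE)

# The join and the subset are executed by Spark, not in R memory
small <- visits_tbl %>%
  inner_join(patients_tbl, by = "patient_id") %>%
  filter(year == 2019) %>%
  select(patient_id, age, cost)

# collect() brings the reduced result into R as an ordinary data frame
small_df <- collect(small)

spark_disconnect(sc)
```

Only the final collect() moves data into R, so the intermediate tables can be much larger than what R itself could hold.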

Installing Spark is easy, and the process is described in detail in Luraschi, Kuo, and Ruiz (2019). It does require Java, but most laptops already have Java installed, so we skip this step. First you need to install and load the sparklyr package:

install.packages("sparklyr")
library(sparklyr)

and then install Spark as follows:

spark_install("2.3")

Here “2.3” refers to the particular version of Spark to install. I chose 2.3 because it is the version used in Luraschi, Kuo, and Ruiz (2019), but newer versions are available. You can check which versions are available as follows:

spark_available_versions()

##   spark
## 1   1.6
## 2   2.0
## 3   2.1
## 4   2.2
## 5   2.3
## 6   2.4
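Once the installation has finished, a quick check along the following lines can confirm that everything is in place. This is only a sketch: it assumes version 2.3 was installed as above, and the Java test only looks for a java executable on the PATH, so it may warn even when Java is available through other means (for example the JAVA_HOME environment variable).

```r
library(sparklyr)

# Spark needs Java: an empty string means no `java` executable was found on the PATH
java_path <- Sys.which("java")
if (!nzchar(java_path)) {
  warning("No java executable found on the PATH; Spark may fail to start")
}

# List the Spark versions that are installed on this machine
installed <- spark_installed_versions()
print(installed)

# Start a local Spark session with the version installed above
sc <- spark_connect(master = "local", version = "2.3")

# Ask the running session which Spark version it is using
running_version <- spark_version(sc)
print(running_version)

# Close the session when done
spark_disconnect(sc)
```

If spark_connect() returns without an error and the version is printed, the local installation is usable.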

Note

When I installed Spark I got an error message that instructed me to download a file from the web and place it in the Spark directory. After that I had no problems.

Note

In my case I found it better to run RStudio as an administrator, since some processes running behind Spark sometimes seem to need administrative rights.

Page built: 2020-01-21