Once you have copied a table into Spark, there are several ways of working with the resulting R object:
- you can use operations from the dplyr package, fully supported by magrittr pipes. This is very easy if you are already familiar with these packages and with the pipe syntax. Otherwise there is a bit of learning to do, but there is so much material available about these packages that it does not take much to get going.
- you can use SQL operations, supported by the DBI package and by sparklyr itself. This is an excellent alternative for SQL fans and it appears to be well supported.
- you can use operations provided by sparklyr itself. The package comes with about 40 operations for defining and extracting features. Some are quite low level, such as data scaling, and some are more sophisticated, such as a full Word2Vec encoder.
- you can bring the Spark table into R memory and operate on it as a usual tibble or data frame. This is often the last step, especially if you are using Spark to extract descriptive statistics from a very large data set, or to clean or subset a large data set and produce a much smaller one.
We will explore these options in the following sections.
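To give a quick flavour before the detailed sections, here is a minimal sketch of the four approaches. It assumes a local Spark connection and R's built-in mtcars data set, neither of which comes from this page; the particular columns and the choice of ft_standard_scaler are purely illustrative.

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Assumed setup: a local Spark instance and the built-in mtcars data set
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# 1. dplyr verbs, composed with magrittr pipes, translated to Spark SQL
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# 2. The same query written as plain SQL through DBI
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")

# 3. sparklyr feature transformers (here, assembling columns and scaling them)
mtcars_tbl %>%
  ft_vector_assembler(input_cols = c("mpg", "wt"), output_col = "features") %>%
  ft_standard_scaler(input_col = "features", output_col = "features_scaled")

# 4. Pull a (usually much smaller) result back into R memory as a tibble
local_summary <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```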