7 User-defined functions

When using dplyr in R we are free to use, in the body of calls to mutate(), select(), summarise(), etc., any function we want, including functions we defined ourselves. This is not the case in sparklyr, where we are limited to a set of predefined “registered” functions. The registered functions include some basic R functions (the usual arithmetic/logical operators, as.character/integer/numeric, substr, …) and the Hive functions listed on this page: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF.
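
For instance, a pipeline built only from registered functions is translated entirely to Spark SQL, and show_query() shows the translation. The following sketch is illustrative: the connection sc and the toy contents of dat are assumptions, not part of the original example.

library(sparklyr)
library(dplyr)

sc = spark_connect(master = "local")
# A toy stand-in for dat: individual ids together with their state
dat = copy_to(sc,
              data.frame(id = 1:6,
                         state = c("NY", "NY", "CA", "CA", "TX", "TX")),
              "dat", overwrite = TRUE)

# substr() is a registered R function and upper() is a Hive function,
# so dplyr can translate the whole pipeline to Spark SQL:
dat %>% mutate(initial = upper(substr(state, 1, 1))) %>% show_query()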

It may happen that the function one needs is not a registered function, and in that case a workaround is necessary. The simplest workaround uses the function spark_apply(), which we now explain.

For this example we consider the (not very interesting) problem of selecting the id of one individual at random from each state. If dat were a standard tibble we would get the ids of the sample as follows:

s = dat %>% group_by(state) %>% summarise(id = sample(id, 1))

and produce a tibble that associates an individual id to each state. If we do this in Spark, however, we get a Java error saying: Undefined function: ‘sample’. This function is neither a registered temporary function nor a permanent function registered in the database ‘default’. All this is saying is that we cannot use sample() inside summarise() because it is not “registered”. We would get a similar message if we used a function that we defined ourselves.

A solution to this problem is to use the function spark_apply() as follows:

f = function(e){
  # e contains the rows of one group as an ordinary R data frame
  set.seed(1)      # fixed seed, so repeated runs pick the same individuals
  sample(e$id, 1)  # draw one id at random from the group
}
state = dat %>% spark_apply(f, columns = c(id = "integer"), group_by = "state") %>% compute("tmp")
state

The syntax is pretty self-explanatory: spark_apply() takes the user-defined function f as its first argument, applies it to each of the groups defined by group_by (each group is passed to f as an ordinary R data frame), and stores the result in a table whose columns have the given names and types. Declaring the types of the columns is supposed to make the processing more efficient. Notice that I called compute() to force an evaluation once and for all. A few observations:

  • If you do not pass the argument columns you still get the same result, but the column storing the id gets automatically labeled “result” (the sketch after this list demonstrates this).

  • Looking at the actual code it seems that the argument columns can be replaced by names. Indeed, if you google spark_apply() many examples use names instead of columns.

  • Looking at the documentation it seems that spark_apply() can be used as we are using it here, but it is really intended for distributing R code across different partitions, which may explain some of its quirks.

  • The sparklyr documentation for spark_apply() is quite vague. Some more information can be found at the following link: https://spark.rstudio.com/guides/distributed-r/
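
To illustrate the first observation, here is the same call without the columns argument; this is a sketch under the same assumptions as before (connection sc, table dat, function f):

# Without `columns` the output column is labeled "result" automatically.
# (Per the second observation, older examples pass the same information
# through the `names` argument instead of `columns`.)
dat %>% spark_apply(f, group_by = "state") %>% collect()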

Issue: spark_apply() is extremely slow. This is a known issue (see for example https://jozef.io/r201-spark-r-1/), which suggests that it is always better to use registered functions unless one is desperate and has a lot of time (or computing power) to spare.
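
In fact, for this particular task one can stay within registered functions: rand() is a Hive function, and a grouped filter on min() is translated by dplyr into a windowed aggregate. This alternative is a sketch, not part of the original example:

# One random id per state without spark_apply(): attach a random number
# to each row and keep, within each state, the row where it is smallest.
s = dat %>%
  group_by(state) %>%
  mutate(r = rand()) %>%
  filter(r == min(r, na.rm = TRUE)) %>%
  select(state, id)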

Issue: So far I have not found a list of the R functions that are registered.
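
A partial workaround for this last issue, assuming a dbplyr version that still exports sql_translate_env() (in dbplyr 2.0 it was superseded by sql_translation()): the translation environment attached to the connection lists the R functions that dplyr knows how to translate for Spark. Anything not listed is passed through to Spark SQL verbatim, which is also why the Hive functions work.

library(dbplyr)
env = sql_translate_env(sc)  # sc: the Spark connection from above
ls(env$scalar)               # scalar functions (as.numeric, substr, ...)
ls(env$aggregate)            # aggregate functions (mean, sum, ...)
ls(env$window)               # window functions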
