In this example we estimate a simple linear regression of cost against individual characteristics. Before proceeding it is good to do some housekeeping and get rid of objects we no longer need:
### we want to use the merged file we produced above
dat = merged
### we can also remove from Spark the tables we do not need anymore:
src_tbls(sc)
db_drop_table(sc, "state")
db_drop_table(sc, "submerge_table")
rm(submerge, merged)
Notice that we did not delete merged_table from Spark: since we defined dat as merged, and merged is just a link to merged_table, we still need that table in order to do anything with dat.
Since we do not want to use age as a continuous variable, it is good to define some age groups, which we can do using the sparklyr function ft_bucketizer(), similar to the R cut() function. However ft_bucketizer(), unlike cut(), produces an integer representing the index of a bucket. Fortunately Spark “remembers” how the bucketed variable was created and provides the function ft_index_to_string() to convert that index variable to proper descriptive strings. Therefore we run the following to create the categorical variable agecat (when you use ft_bucketizer() followed by ft_index_to_string(), remember to delete the intermediate index column if you are not going to need it anymore!):
dat = dat %>%
  ### assign each age to one of the intervals defined by splits
  ft_bucketizer(input_col="age", output_col="agecatnum",
                splits=c(0,18,35,45,55,65,75,85,100)) %>%
  ### convert the numeric bucket index into a descriptive label
  ft_index_to_string(input_col="agecatnum", output_col="agecat") %>%
  ### drop the intermediate index column
  select(-agecatnum)
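As a quick check we can tabulate the new variable; count() is translated to Spark SQL by dplyr, so only the small table of counts is brought back to R:
### tabulate agecat in Spark and collect only the counts
dat %>% count(agecat) %>% collect()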
Sparklyr is pretty consistent in its naming conventions: all the functions used to manipulate variables and create new ones begin with ft_, which stands for feature transformation.
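Assuming sparklyr has been attached with library(sparklyr), the full list of feature transformers can be displayed by matching the ft_ prefix:
### list all feature-transformation functions exported by sparklyr
ls("package:sparklyr", pattern = "^ft_")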
The sparklyr package has a command for linear regression, ml_linear_regression(), that can perform both standard (optionally weighted) linear regression and penalized (elastic net) regression, with either a squared-error or a Huber loss function. Here we show the simplest option, traditional unweighted OLS.
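A minimal sketch of the fitting call, assuming cost is the response and the covariates are those appearing in the coefficient table below:
### fit a traditional unweighted OLS regression in Spark
### (formula assumed from the coefficient table below)
lin = dat %>%
  ml_linear_regression(cost ~ agecat + gender + state + smoking + bmi)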
Results of the linear regression can be accessed in different ways. sparklyr provides broom-style functions such as tidy() and glance() that summarize some of the results:
lin %>% tidy() %>%
  kable(digits=c(0,0,0,1,3), format.args = list(big.mark=","),
        caption="**Linear regression coefficients**. Notice that Spark picked the last age category as the reference one, which explains the negative signs. It does not appear that there is an obvious way to pick a different reference category other than an appropriate renaming of the variables.")
Table 1: Linear regression coefficients. Notice that Spark picked the last age category as the reference one, which explains the negative signs. It does not appear that there is an obvious way to pick a different reference category other than an appropriate renaming of the variables.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2,271 | 7 | 321.9 | 0.000 |
agecat_18.0, 35.0 | -1,378 | 3 | -451.8 | 0.000 |
agecat_35.0, 45.0 | -1,270 | 3 | -404.4 | 0.000 |
agecat_45.0, 55.0 | -984 | 3 | -312.3 | 0.000 |
agecat_55.0, 65.0 | -466 | 3 | -146.8 | 0.000 |
agecat_65.0, 75.0 | -14 | 3 | -4.3 | 0.000 |
agecat_75.0, 85.0 | 261 | 3 | 74.9 | 0.000 |
gender_F | -119 | 1 | -122.7 | 0.000 |
state_Aboda | -4 | 5 | -0.9 | 0.390 |
state_Sintbu | -4 | 5 | -0.8 | 0.403 |
state_Isnor | -2 | 5 | -0.5 | 0.626 |
state_Itsware | -4 | 5 | -0.8 | 0.442 |
state_Haivismal | -7 | 5 | -1.4 | 0.166 |
state_Blitzbar | -10 | 6 | -1.6 | 0.101 |
state_Morgenor | -7 | 6 | -1.2 | 0.249 |
smoking_Never smoked | -962 | 1 | -696.7 | 0.000 |
smoking_Ex-smoker | -416 | 2 | -274.7 | 0.000 |
bmi_Overweight | 358 | 4 | 87.2 | 0.000 |
bmi_Normal | -31 | 4 | -7.6 | 0.000 |
bmi_Obese | 1,173 | 4 | 284.3 | 0.000 |
lin %>% glance() %>%
  kable(digits=c(0,0,0,2,0), format.args = list(big.mark=","),
        caption="**Linear regression statistics**")
Table 2: Linear regression statistics
explained.variance | mean.absolute.error | mean.squared.error | r.squared | root.mean.squared.error |
---|---|---|---|---|
822,071 | 774 | 2,211,964 | 0.27 | 1,487 |
Additional quantities can be retrieved directly from lin or with the summary function ml_summary().
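The listing of available elements below was presumably produced with names():
### list the elements of the fitted model object
names(lin)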
## [1] "pipeline_model" "formula" "dataset" "pipeline" "model" "label_col" "features_col" "feature_names" "response" "coefficients" "summary"
## LinearRegressionTrainingSummary
## Access the following via `$` or `ml_summary()`.
## - coefficient_standard_errors()
## - deviance_residuals()
## - explained_variance
## - features_col
## - label_col
## - mean_absolute_error
## - mean_squared_error
## - num_instances()
## - p_values()
## - prediction_col
## - predictions
## - r2
## - residuals()
## - root_mean_squared_error
## - t_values()
## - objective_history
## - total_iterations
## - degrees_of_freedom
## - r2adj
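Individual quantities can then be extracted either by naming the metric in ml_summary() or with $; the elements shown with parentheses in the listing are methods and must be called. For example:
### extract a single metric by name
ml_summary(lin, "r2adj")
### equivalent access via `$`
lin$summary$r2adj
### elements listed with parentheses are methods and must be called
lin$summary$p_values()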