In this example we estimate a simple linear regression of cost against individual characteristics. Before proceeding it is good to do some housekeeping and get rid of objects we no longer need:
### we want to use the merged file we produced above
dat = merged
### we can also remove from Spark the tables we do not need anymore:
src_tbls(sc)
db_drop_table(sc, "state")
db_drop_table(sc, "submerge_table")
rm(submerge, merged)
Notice that we did not delete merged_table from Spark: since we defined dat as merged, and merged is just a link to merged_table, we still need that table in order to do anything with dat.
Since we do not want to use age as a continuous variable, it is good to define some age groups, which we can do using the sparklyr function ft_bucketizer(), similar to the R cut() function. However ft_bucketizer(), unlike cut(), produces an integer representing the index of a bucket. Fortunately Spark “remembers” how the bucketed variable was created and provides the function ft_index_to_string() to convert that index variable to proper descriptive strings. Therefore we run the following to create the categorical variable agecat (when you use ft_bucketizer() followed by ft_index_to_string(), remember to delete the intermediate index column if you are not going to need it anymore!):
dat = dat %>%
  ### assign each age to one of the intervals defined by splits
  ft_bucketizer(input_col="age", output_col="agecatnum",
                splits=c(0,18,35,45,55,65,75,85,100)) %>%
  ### convert the numeric bucket index into a descriptive label
  ft_index_to_string(input_col="agecatnum", output_col="agecat") %>%
  ### drop the intermediate index column
  select(-agecatnum)
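As a quick check we can tabulate the new variable; count() is translated to Spark SQL by dplyr, so only the small table of counts is brought back to R:
### tabulate agecat in Spark and collect only the counts
dat %>% count(agecat) %>% collect()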
Sparklyr is pretty consistent in its naming conventions: all the functions used to manipulate variables and create new ones begin with ft_, which stands for feature transformation.
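Assuming sparklyr has been attached with library(sparklyr), the full list of feature transformers can be displayed by matching the ft_ prefix:
### list all feature-transformation functions exported by sparklyr
ls("package:sparklyr", pattern = "^ft_")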
The sparklyr package has a command for linear regression, ml_linear_regression(), that can perform both standard (optionally weighted) linear regression and penalized (elastic net) regression, with either a squared-error or a Huber loss function. Here we show the simplest option, traditional unweighted OLS.
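A minimal sketch of the fitting call, assuming cost is the response and the covariates are those appearing in the coefficient table below:
### fit a traditional unweighted OLS regression in Spark
### (formula assumed from the coefficient table below)
lin = dat %>%
  ml_linear_regression(cost ~ agecat + gender + state + smoking + bmi)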
Results of the linear regression can be accessed in different ways. sparklyr provides broom-style functions such as tidy() and glance() that summarize some of the results:
lin %>% tidy() %>%
  kable(digits=c(0,0,0,1,3), format.args = list(big.mark=","),
        caption="**Linear regression coefficients**. Notice that Spark picked the last age category as the reference one, which explains the negative signs. It does not appear that there is an obvious way to pick a different reference category other than an appropriate renaming of the variables.")
Table 1: Linear regression coefficients. Notice that Spark picked the last age category as the reference one, which explains the negative signs. It does not appear that there is an obvious way to pick a different reference category other than an appropriate renaming of the variables.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2,271 | 7 | 321.9 | 0.000 |
agecat_18.0, 35.0 | -1,378 | 3 | -451.8 | 0.000 |
agecat_35.0, 45.0 | -1,270 | 3 | -404.4 | 0.000 |
agecat_45.0, 55.0 | -984 | 3 | -312.3 | 0.000 |
agecat_55.0, 65.0 | -466 | 3 | -146.8 | 0.000 |
agecat_65.0, 75.0 | -14 | 3 | -4.3 | 0.000 |
agecat_75.0, 85.0 | 261 | 3 | 74.9 | 0.000 |
gender_F | -119 | 1 | -122.7 | 0.000 |
state_Aboda | -4 | 5 | -0.9 | 0.390 |
state_Sintbu | -4 | 5 | -0.8 | 0.403 |
state_Isnor | -2 | 5 | -0.5 | 0.626 |
state_Itsware | -4 | 5 | -0.8 | 0.442 |
state_Haivismal | -7 | 5 | -1.4 | 0.166 |
state_Blitzbar | -10 | 6 | -1.6 | 0.101 |
state_Morgenor | -7 | 6 | -1.2 | 0.249 |
smoking_Never smoked | -962 | 1 | -696.7 | 0.000 |
smoking_Ex-smoker | -416 | 2 | -274.7 | 0.000 |
bmi_Overweight | 358 | 4 | 87.2 | 0.000 |
bmi_Normal | -31 | 4 | -7.6 | 0.000 |
bmi_Obese | 1,173 | 4 | 284.3 | 0.000 |
lin %>% glance() %>%
  kable(digits=c(0,0,0,2,0), format.args = list(big.mark=","),
        caption="**Linear regression statistics**")
Table 2: Linear regression statistics
explained.variance | mean.absolute.error | mean.squared.error | r.squared | root.mean.squared.error |
---|---|---|---|---|
822,071 | 774 | 2,211,964 | 0.27 | 1,487 |
Additional quantities can be retrieved directly from lin or with the summary function ml_summary().
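The listing of available elements below was presumably produced with names():
### list the elements of the fitted model object
names(lin)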
## [1] "pipeline_model" "formula" "dataset" "pipeline" "model" "label_col" "features_col" "feature_names" "response" "coefficients" "summary"
## LinearRegressionTrainingSummary
## Access the following via `$` or `ml_summary()`.
## - coefficient_standard_errors()
## - deviance_residuals()
## - explained_variance
## - features_col
## - label_col
## - mean_absolute_error
## - mean_squared_error
## - num_instances()
## - p_values()
## - prediction_col
## - predictions
## - r2
## - residuals()
## - root_mean_squared_error
## - t_values()
## - objective_history
## - total_iterations
## - degrees_of_freedom
## - r2adj
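Individual quantities can then be extracted either by naming the metric in ml_summary() or with $; the elements shown with parentheses in the listing are methods and must be called. For example:
### extract a single metric by name
ml_summary(lin, "r2adj")
### equivalent access via `$`
lin$summary$r2adj
### elements listed with parentheses are methods and must be called
lin$summary$p_values()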