Hello. The purpose of this post is to share and learn the basics of modeling: a simple “cookbook” showing how to apply several types of models (glmnet and random forest in this case) to the same train/test splits using the caret package, and how to evaluate and compare their performance. Some sort of reference. We’re going to predict breast cancer diagnosis (malignant “M” or benign “B”) from Kaggle’s breast cancer data set. Let’s get started :)
# load libraries
library(tidyverse)
library(glmnet)
library(ranger)
library(caret)
library(caTools)
library(e1071)
Data wrangling
Load the data and remove redundant variables.
# load data
df <- read_csv("../../static/data/breast_data.csv")
# glimpse raw data
df %>% glimpse
## Observations: 569
## Variables: 33
## $ id <dbl> 842302, 842517, 84300903, 84348301, 84...
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M"...
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290...
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15....
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10,...
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0,...
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0....
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0....
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0....
## $ `concave points_mean` <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0....
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809...
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0....
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572...
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813...
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.2...
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27...
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110...
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580...
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0....
## $ `concave points_se` <dbl> 0.015870, 0.013400, 0.020580, 0.018670...
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0....
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208...
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15....
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23....
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20,...
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0,...
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374...
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050...
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0....
## $ `concave points_worst` <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0....
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364...
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0....
## $ X33 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
# remove redundant columns
df <- df %>%
select(-id, -X33)
# convert diagnosis as factor
df$diagnosis <- as.factor(df$diagnosis)
Summary of data
# proportion of benign/malignant
table(df$diagnosis)
##
## B M
## 357 212
prop.table(table(df$diagnosis))
##
## B M
## 0.6274165 0.3725835
# are there any missing values?
any(is.na(df))
## [1] FALSE
In total:
- 569 observations
- 357 benign (~63%)
- 212 malignant (~37%)
- 30 predictor variables
- no missing values
Models
Create common validation indices and trainControl
Start by dividing the predictors (features) and the response (class) into separate variables df_x and df_y.
# subset predictors variables
df_x <- df %>% select(-diagnosis)
# subset response variable
df_y <- df$diagnosis
In order to achieve a fair comparison of the models, we have to train and test them on the same train/test splits. createFolds seems to be a convenient way to achieve such splits. The outcome (indices) is used later as the index parameter of a common trainControl object for all tested models.
# create indices for each fold
my_folds <- createFolds(df_y, k = 10)
my_folds now contains the indices of the held-out (validation) samples of each fold. In addition, the class distribution (benign/malignant) is preserved in each fold. One caveat: caret's index parameter expects the training indices of each resample, so passing these held-out folds means each model is in fact trained on a single small fold (~57 samples, visible in the sample sizes printed below) and validated on the remaining ~512. createFolds(df_y, k = 10, returnTrain = TRUE) would give conventional 10-fold cross-validation instead; for the purpose of comparing models on identical splits, either approach works.
# structure of "my_fold"
my_folds %>% glimpse
## List of 10
## $ Fold01: int [1:58] 4 26 28 37 40 59 65 71 80 81 ...
## $ Fold02: int [1:57] 2 7 10 14 19 22 36 39 46 54 ...
## $ Fold03: int [1:57] 35 43 51 53 55 70 79 96 99 108 ...
## $ Fold04: int [1:58] 6 12 13 23 27 50 78 82 84 100 ...
## $ Fold05: int [1:57] 8 24 33 41 77 85 104 105 110 116 ...
## $ Fold06: int [1:56] 18 31 45 62 64 69 74 83 93 97 ...
## $ Fold07: int [1:57] 15 17 20 29 38 42 47 56 63 66 ...
## $ Fold08: int [1:56] 1 5 16 30 44 49 61 68 76 90 ...
## $ Fold09: int [1:56] 11 32 48 58 73 75 102 103 107 117 ...
## $ Fold10: int [1:57] 3 9 21 25 34 52 57 60 72 86 ...
# distribution of class is preserved in each fold
my_folds %>%
map_df(~prop.table(table(df_y[.])))
## # A tibble: 2 x 10
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.621 0.632 0.632 0.621 0.632 0.625 0.632 0.625 0.625 0.632
## 2 0.379 0.368 0.368 0.379 0.368 0.375 0.368 0.375 0.375 0.368
# create trainControl object
my_trainControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = FALSE,
savePredictions = TRUE,
index = my_folds)
glmnet model
# train glmnet model
model_glmnet <- train(x = df_x, y = df_y,
method = "glmnet",
metric = "ROC",
tuneLength = 3,
trControl = my_trainControl)
# print model
model_glmnet
## glmnet
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 58, 57, 57, 58, 57, 56, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0.10 0.0007673665 0.9873814 0.9623324 0.9140590
## 0.10 0.0076736649 0.9897332 0.9694936 0.9156186
## 0.10 0.0767366489 0.9905157 0.9844382 0.8930752
## 0.55 0.0007673665 0.9868221 0.9617103 0.9135271
## 0.55 0.0076736649 0.9890875 0.9713599 0.9140397
## 0.55 0.0767366489 0.9874516 0.9850612 0.8600744
## 1.00 0.0007673665 0.9848363 0.9585892 0.9172031
## 1.00 0.0076736649 0.9867256 0.9654389 0.9077432
## 1.00 0.0767366489 0.9836090 0.9844372 0.8417470
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda
## = 0.07673665.
Some remarks:
- ROC = area under the ROC curve
- Sens = sensitivity = true positive rate = proportion of positives that are correctly identified as such
- Spec = specificity = true negative rate = proportion of negatives that are correctly identified as such
Note that caret's twoClassSummary treats the first factor level ("B") as the positive class, so Sens here is the detection rate of benign cases and Spec of malignant ones.
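Because we set savePredictions = TRUE in my_trainControl, caret keeps the held-out predictions, so we can cross-check these numbers with a confusion matrix ourselves. A minimal sketch (the filter just keeps the predictions made with the best tuning parameters):
# held-out predictions for the best tuning parameters
best <- model_glmnet$bestTune
preds <- model_glmnet$pred %>%
  filter(alpha == best$alpha, lambda == best$lambda)
# confusion matrix; "B" (the first factor level) is the positive class
confusionMatrix(preds$pred, preds$obs)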
Another approach to find the model with the highest ROC:
model_glmnet$results %>%
as_tibble %>%
arrange(desc(ROC))
## # A tibble: 9 x 8
## alpha lambda ROC Sens Spec ROCSD SensSD SpecSD
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 0.0767 0.991 0.984 0.893 0.00364 0.0136 0.0495
## 2 0.1 0.00767 0.990 0.969 0.916 0.00311 0.0273 0.0439
## 3 0.55 0.00767 0.989 0.971 0.914 0.00287 0.0272 0.0475
## 4 0.55 0.0767 0.987 0.985 0.860 0.00556 0.0117 0.0711
## 5 0.1 0.000767 0.987 0.962 0.914 0.00548 0.0365 0.0462
## 6 0.55 0.000767 0.987 0.962 0.914 0.00570 0.0356 0.0491
## 7 1 0.00767 0.987 0.965 0.908 0.00559 0.0300 0.0589
## 8 1 0.000767 0.985 0.959 0.917 0.00682 0.0389 0.0441
## 9 1 0.0767 0.984 0.984 0.842 0.00818 0.0162 0.0800
# plot ROCs
plot(model_glmnet)
By default, the tuning parameters alpha and lambda get three candidate values each (tuneLength = 3 in the train function). Let’s try 20.
# train glmnet model with tuneLength = 20
model_glmnet <- train(x = df_x, y = df_y,
method = "glmnet",
metric = "ROC",
tuneLength = 20,
trControl = my_trainControl)
model_glmnet$results %>%
as_tibble %>%
arrange(desc(ROC))
## # A tibble: 400 x 8
## alpha lambda ROC Sens Spec ROCSD SensSD SpecSD
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 0.0356 0.991 0.979 0.911 0.00277 0.0186 0.0461
## 2 0.1 0.0230 0.991 0.976 0.915 0.00251 0.0232 0.0450
## 3 0.1 0.0552 0.991 0.983 0.900 0.00331 0.0143 0.0485
## 4 0.147 0.0356 0.991 0.979 0.909 0.00283 0.0182 0.0497
## 5 0.147 0.0230 0.991 0.975 0.912 0.00257 0.0227 0.0497
## 6 0.195 0.0356 0.990 0.979 0.907 0.00285 0.0183 0.0500
## 7 0.195 0.0230 0.990 0.976 0.911 0.00263 0.0208 0.0483
## 8 0.147 0.0552 0.990 0.982 0.897 0.00338 0.0146 0.0504
## 9 0.1 0.0856 0.990 0.985 0.889 0.00378 0.0130 0.0514
## 10 0.1 0.0148 0.990 0.972 0.914 0.00252 0.0260 0.0456
## # ... with 390 more rows
The value of ~0.990 seems to be the peak we can get without additional tuning.
# plot ROCs
plot(model_glmnet)
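If we wanted to squeeze out a bit more, we could pass train an explicit grid around the best region via the tuneGrid parameter instead of relying on tuneLength. A quick sketch, with the grid values chosen purely for illustration:
# illustrative: zoom the search around the best (alpha, lambda) found above
my_grid <- expand.grid(alpha = seq(0.05, 0.25, by = 0.05),
                       lambda = seq(0.01, 0.06, by = 0.01))
model_glmnet_zoom <- train(x = df_x, y = df_y,
                           method = "glmnet",
                           metric = "ROC",
                           tuneGrid = my_grid,
                           trControl = my_trainControl)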
randomForest model
# train randomForest model
model_randomForest <- train(x = df_x, y = df_y,
method = "ranger",
metric = "ROC",
importance = "permutation",
tuneLength = 10,
trControl = my_trainControl)
# print model
model_randomForest
## Random Forest
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 58, 57, 57, 58, 57, 56, ...
## Resampling results across tuning parameters:
##
## mtry splitrule ROC Sens Spec
## 2 gini 0.9860817 0.9673265 0.8810444
## 2 extratrees 0.9889330 0.9807057 0.8826068
## 5 gini 0.9837646 0.9567394 0.8857729
## 5 extratrees 0.9882140 0.9726128 0.8841885
## 8 gini 0.9818464 0.9511329 0.8863048
## 8 extratrees 0.9878671 0.9688851 0.8873354
## 11 gini 0.9810034 0.9439755 0.8852439
## 11 extratrees 0.9875739 0.9654535 0.8873409
## 14 gini 0.9803360 0.9427323 0.8836649
## 14 extratrees 0.9880191 0.9663880 0.8899559
## 17 gini 0.9805221 0.9355634 0.8841967
## 17 extratrees 0.9870110 0.9617190 0.8941582
## 20 gini 0.9799474 0.9340077 0.8836677
## 20 extratrees 0.9867063 0.9635882 0.8910058
## 23 gini 0.9795562 0.9333827 0.8815789
## 23 extratrees 0.9868048 0.9614075 0.8910058
## 26 gini 0.9788892 0.9312049 0.8847148
## 26 extratrees 0.9857579 0.9598537 0.8909975
## 30 gini 0.9785235 0.9321366 0.8794820
## 30 extratrees 0.9861731 0.9604739 0.8873354
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule =
## extratrees and min.node.size = 1.
# plot model
plot(model_randomForest)
model_randomForest$results %>%
as_tibble %>%
arrange(desc(ROC))
## # A tibble: 20 x 9
## mtry min.node.size splitrule ROC Sens Spec ROCSD SensSD SpecSD
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 1 extratrees 0.989 0.981 0.883 0.00290 0.0140 0.0540
## 2 5 1 extratrees 0.988 0.973 0.884 0.00271 0.0152 0.0631
## 3 14 1 extratrees 0.988 0.966 0.890 0.00254 0.0155 0.0652
## 4 8 1 extratrees 0.988 0.969 0.887 0.00293 0.0161 0.0675
## 5 11 1 extratrees 0.988 0.965 0.887 0.00326 0.0170 0.0647
## 6 17 1 extratrees 0.987 0.962 0.894 0.00320 0.0179 0.0684
## 7 23 1 extratrees 0.987 0.961 0.891 0.00332 0.0182 0.0652
## 8 20 1 extratrees 0.987 0.964 0.891 0.00330 0.0137 0.0652
## 9 30 1 extratrees 0.986 0.960 0.887 0.00321 0.0188 0.0670
## 10 2 1 gini 0.986 0.967 0.881 0.00317 0.0148 0.0626
## 11 26 1 extratrees 0.986 0.960 0.891 0.00388 0.0187 0.0703
## 12 5 1 gini 0.984 0.957 0.886 0.00387 0.0154 0.0704
## 13 8 1 gini 0.982 0.951 0.886 0.00463 0.0153 0.0672
## 14 11 1 gini 0.981 0.944 0.885 0.00513 0.0187 0.0690
## 15 17 1 gini 0.981 0.936 0.884 0.00524 0.0311 0.0715
## 16 14 1 gini 0.980 0.943 0.884 0.00535 0.0194 0.0730
## 17 20 1 gini 0.980 0.934 0.884 0.00583 0.0316 0.0724
## 18 23 1 gini 0.980 0.933 0.882 0.00612 0.0333 0.0712
## 19 26 1 gini 0.979 0.931 0.885 0.00558 0.0330 0.0721
## 20 30 1 gini 0.979 0.932 0.879 0.00589 0.0331 0.0703
Now we can compare the top 10 most important variables of each model:
plot(varImp(model_glmnet, scale = FALSE), top = 10, main = "glmnet")
plot(varImp(model_randomForest, scale = FALSE), top = 10, main = "randomForest")
Surprisingly, the top variables differ quite a bit between the two models. One possible explanation: many of the predictors are strongly correlated (radius, perimeter and area, for instance), and glmnet importance is based on coefficient magnitudes, so correlated features compete for weight, while permutation importance in ranger spreads the credit across them. Any other hint is welcome :)
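To look at this more directly, we can put both rankings side by side in one table. A minimal sketch, assuming both importance tables expose caret's usual Overall column for these model types:
# illustrative: tabulate both importance rankings next to each other
imp_glmnet <- varImp(model_glmnet, scale = FALSE)$importance %>%
  rownames_to_column("variable") %>%
  rename(glmnet = Overall)
imp_rf <- varImp(model_randomForest, scale = FALSE)$importance %>%
  rownames_to_column("variable") %>%
  rename(randomForest = Overall)
left_join(imp_glmnet, imp_rf, by = "variable") %>%
  arrange(desc(glmnet))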
Compare glmnet and randomForest performance
model_list <- list(glmnet = model_glmnet, randomForest = model_randomForest)
resamples <- resamples(model_list)
# plot the comparison
bwplot(resamples, metric = "ROC")
# plot the comparison per each fold
xyplot(resamples, metric = "ROC")
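Beyond eyeballing the plots, caret can quantify the gap as well. A short sketch using the diff method for resamples objects, which runs paired tests on the fold-wise ROC values:
# paired comparison of the resampled ROC values
roc_diff <- diff(resamples, metric = "ROC")
summary(roc_diff)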
Conclusion
- glmnet seems to fit this data slightly better than randomForest
- even without advanced tuning, a pretty decent level of performance can be achieved
- open point: the difference in variable importance between the two models
Thanks for reading. Sure, there is a lot of room for improvements and tweaks. Any kind of feedback is appreciated.