Hello. The purpose of this post is to share and learn the basics of modeling: a simple “cookbook” showing how to apply several types of models (glmnet and random forest in this case) to the same train/test splits using the caret package, and how to evaluate and compare their performance. Some sort of reference. We’re going to predict breast cancer diagnosis (malignant “M” or benign “B”) from Kaggle’s breast cancer data set. Let’s get started :)
# load libraries
library(tidyverse)
library(glmnet)
library(ranger)
library(caret)
library(caTools)
library(e1071)
Data wrangling
Load the data and remove redundant variables.
# load data
df <- read_csv("../../static/data/breast_data.csv")
# glimpse raw data
df %>% glimpse
## Observations: 569
## Variables: 33
## $ id <dbl> 842302, 842517, 84300903, 84348301, 84...
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M"...
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290...
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15....
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10,...
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0,...
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0....
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0....
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0....
## $ `concave points_mean` <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0....
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809...
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0....
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572...
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813...
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.2...
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27...
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110...
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580...
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0....
## $ `concave points_se` <dbl> 0.015870, 0.013400, 0.020580, 0.018670...
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0....
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208...
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15....
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23....
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20,...
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0,...
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374...
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050...
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0....
## $ `concave points_worst` <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0....
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364...
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0....
## $ X33 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
# remove redundant columns
df <- df %>%
select(-id, -X33)
# convert diagnosis as factor
df$diagnosis <- as.factor(df$diagnosis)
Summary of data
# proportion of benign/malignant
table(df$diagnosis)
##
## B M
## 357 212
prop.table(table(df$diagnosis))
##
## B M
## 0.6274165 0.3725835
# are there any missing values?
any(is.na(df))
## [1] FALSE
In total:
- 569 observations
- 357 benign (~63%)
- 212 malignant (~37%)
- 30 predictor variables
- no missing values
Models
Create common validation indices and trainControl
Start by dividing the predictors (features) and the response (class) into separate variables df_x and df_y.
# subset predictors variables
df_x <- df %>% select(-diagnosis)
# subset response variable
df_y <- df$diagnosis
In order to achieve a fair comparison of the models, we have to train and test them on the same train/test splits. createFolds seems to be a convenient way to achieve such splits. The outcome (indices) is used later as the index parameter of a common trainControl object for all tested models.
# create indices for each fold
my_folds <- createFolds(df_y, k = 10)
my_folds now contains the indices of the held-out (validation) samples of each fold. In addition, the class distribution (benign/malignant) is preserved in each fold. One caveat: caret's index parameter expects the training indices of each resample, so passing these held-out folds means each model is in fact trained on a single small fold (~57 samples, visible in the sample sizes printed below) and validated on the remaining ~512. createFolds(df_y, k = 10, returnTrain = TRUE) would give conventional 10-fold cross-validation instead; for the purpose of comparing models on identical splits, either approach works.
# structure of "my_fold"
my_folds %>% glimpse
## List of 10
## $ Fold01: int [1:58] 4 26 28 37 40 59 65 71 80 81 ...
## $ Fold02: int [1:57] 2 7 10 14 19 22 36 39 46 54 ...
## $ Fold03: int [1:57] 35 43 51 53 55 70 79 96 99 108 ...
## $ Fold04: int [1:58] 6 12 13 23 27 50 78 82 84 100 ...
## $ Fold05: int [1:57] 8 24 33 41 77 85 104 105 110 116 ...
## $ Fold06: int [1:56] 18 31 45 62 64 69 74 83 93 97 ...
## $ Fold07: int [1:57] 15 17 20 29 38 42 47 56 63 66 ...
## $ Fold08: int [1:56] 1 5 16 30 44 49 61 68 76 90 ...
## $ Fold09: int [1:56] 11 32 48 58 73 75 102 103 107 117 ...
## $ Fold10: int [1:57] 3 9 21 25 34 52 57 60 72 86 ...
# distribution of class is preserved in each fold
my_folds %>%
map_df(~prop.table(table(df_y[.])))
## # A tibble: 2 x 10
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.621 0.632 0.632 0.621 0.632 0.625 0.632 0.625 0.625 0.632
## 2 0.379 0.368 0.368 0.379 0.368 0.375 0.368 0.375 0.375 0.368
# create trainControl object
my_trainControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = FALSE,
savePredictions = TRUE,
index = my_folds)
glmnet model
# train glmnet model
model_glmnet <- train(x = df_x, y = df_y,
method = "glmnet",
metric = "ROC",
tuneLength = 3,
trControl = my_trainControl)
# print model
model_glmnet
## glmnet
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 58, 57, 57, 58, 57, 56, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0.10 0.0007673665 0.9873814 0.9623324 0.9140590
## 0.10 0.0076736649 0.9897332 0.9694936 0.9156186
## 0.10 0.0767366489 0.9905157 0.9844382 0.8930752
## 0.55 0.0007673665 0.9868221 0.9617103 0.9135271
## 0.55 0.0076736649 0.9890875 0.9713599 0.9140397
## 0.55 0.0767366489 0.9874516 0.9850612 0.8600744
## 1.00 0.0007673665 0.9848363 0.9585892 0.9172031
## 1.00 0.0076736649 0.9867256 0.9654389 0.9077432
## 1.00 0.0767366489 0.9836090 0.9844372 0.8417470
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda
## = 0.07673665.
Some remarks:
- ROC = area under the ROC curve
- Sens = sensitivity = true positive rate = proportion of positives that are correctly identified as such
- Spec = specificity = true negative rate = proportion of negatives that are correctly identified as such
Note that caret's twoClassSummary treats the first factor level ("B") as the positive class, so Sens here is the detection rate of benign cases and Spec of malignant ones.
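Because we set savePredictions = TRUE in my_trainControl, caret keeps the held-out predictions, so we can cross-check these numbers with a confusion matrix ourselves. A minimal sketch (the filter just keeps the predictions made with the best tuning parameters):
# held-out predictions for the best tuning parameters
best <- model_glmnet$bestTune
preds <- model_glmnet$pred %>%
  filter(alpha == best$alpha, lambda == best$lambda)
# confusion matrix; "B" (the first factor level) is the positive class
confusionMatrix(preds$pred, preds$obs)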
Another approach to find the model with the highest ROC:
model_glmnet$results %>%
as_tibble %>%
arrange(desc(ROC))
## # A tibble: 9 x 8
## alpha lambda ROC Sens Spec ROCSD SensSD SpecSD
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 0.0767 0.991 0.984 0.893 0.00364 0.0136 0.0495
## 2 0.1 0.00767 0.990 0.969 0.916 0.00311 0.0273 0.0439
## 3 0.55 0.00767 0.989 0.971 0.914 0.00287 0.0272 0.0475
## 4 0.55 0.0767 0.987 0.985 0.860 0.00556 0.0117 0.0711
## 5 0.1 0.000767 0.987 0.962 0.914 0.00548 0.0365 0.0462
## 6 0.55 0.000767 0.987 0.962 0.914 0.00570 0.0356 0.0491
## 7 1 0.00767 0.987 0.965 0.908 0.00559 0.0300 0.0589
## 8 1 0.000767 0.985 0.959 0.917 0.00682 0.0389 0.0441
## 9 1 0.0767 0.984 0.984 0.842 0.00818 0.0162 0.0800
# plot ROCs
plot(model_glmnet)
By default, the tuning parameters alpha and lambda get three candidate values each (tuneLength = 3 in the train function). Let’s try 20.
# train glmnet model with tuneLength = 20
model_glmnet <- train(x = df_x, y = df_y,
method = "glmnet",
metric = "ROC",
tuneLength = 20,
trControl = my_trainControl)
model_glmnet$results %>%
as_tibble %>%
arrange(desc(ROC))
## # A tibble: 400 x 8
## alpha lambda ROC Sens Spec ROCSD SensSD SpecSD
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 0.0356 0.991 0.979 0.911 0.00277 0.0186 0.0461
## 2 0.1 0.0230 0.991 0.976 0.915 0.00251 0.0232 0.0450
## 3 0.1 0.0552 0.991 0.983 0.900 0.00331 0.0143 0.0485
## 4 0.147 0.0356 0.991 0.979 0.909 0.00283 0.0182 0.0497
## 5 0.147 0.0230 0.991 0.975 0.912 0.00257 0.0227 0.0497
## 6 0.195 0.0356 0.990 0.979 0.907 0.00285 0.0183 0.0500
## 7 0.195 0.0230 0.990 0.976 0.911 0.00263 0.0208 0.0483
## 8 0.147 0.0552 0.990 0.982 0.897 0.00338 0.0146 0.0504
## 9 0.1 0.0856 0.990 0.985 0.889 0.00378 0.0130 0.0514
## 10 0.1 0.0148 0.990 0.972 0.914 0.00252 0.0260 0.0456
## # ... with 390 more rows
The value of ~0.990 seems to be the peak we can get without additional tuning.
# plot ROCs
plot(model_glmnet)
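If we wanted to squeeze out a bit more, we could pass train an explicit grid around the best region via the tuneGrid parameter instead of relying on tuneLength. A quick sketch, with the grid values chosen purely for illustration:
# illustrative: zoom the search around the best (alpha, lambda) found above
my_grid <- expand.grid(alpha = seq(0.05, 0.25, by = 0.05),
                       lambda = seq(0.01, 0.06, by = 0.01))
model_glmnet_zoom <- train(x = df_x, y = df_y,
                           method = "glmnet",
                           metric = "ROC",
                           tuneGrid = my_grid,
                           trControl = my_trainControl)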
randomForest model
# train randomForest model
model_randomForest <- train(x = df_x, y = df_y,
method = "ranger",
metric = "ROC",
importance = "permutation",
tuneLength = 10,
trControl = my_trainControl)
# print model
model_randomForest
## Random Forest
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Bootstrapped (10 reps)
## Summary of sample sizes: 58, 57, 57, 58, 57, 56, ...
## Resampling results across tuning parameters:
##
## mtry splitrule ROC Sens Spec
## 2 gini 0.9860817 0.9673265 0.8810444
## 2 extratrees 0.9889330 0.9807057 0.8826068
## 5 gini 0.9837646 0.9567394 0.8857729
## 5 extratrees 0.9882140 0.9726128 0.8841885
## 8 gini 0.9818464 0.9511329 0.8863048
## 8 extratrees 0.9878671 0.9688851 0.8873354
## 11 gini 0.9810034 0.9439755 0.8852439
## 11 extratrees 0.9875739 0.9654535 0.8873409
## 14 gini 0.9803360 0.9427323 0.8836649
## 14 extratrees 0.9880191 0.9663880 0.8899559
## 17 gini 0.9805221 0.9355634 0.8841967
## 17 extratrees 0.9870110 0.9617190 0.8941582
## 20 gini 0.9799474 0.9340077 0.8836677
## 20 extratrees 0.9867063 0.9635882 0.8910058
## 23 gini 0.9795562 0.9333827 0.8815789
## 23 extratrees 0.9868048 0.9614075 0.8910058
## 26 gini 0.9788892 0.9312049 0.8847148
## 26 extratrees 0.9857579 0.9598537 0.8909975
## 30 gini 0.9785235 0.9321366 0.8794820
## 30 extratrees 0.9861731 0.9604739 0.8873354
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule =
## extratrees and min.node.size = 1.
# plot model
plot(model_randomForest)
model_randomForest$results %>%
as_tibble %>%
arrange(desc(ROC))
## # A tibble: 20 x 9
## mtry min.node.size splitrule ROC Sens Spec ROCSD SensSD SpecSD
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 1 extratrees 0.989 0.981 0.883 0.00290 0.0140 0.0540
## 2 5 1 extratrees 0.988 0.973 0.884 0.00271 0.0152 0.0631
## 3 14 1 extratrees 0.988 0.966 0.890 0.00254 0.0155 0.0652
## 4 8 1 extratrees 0.988 0.969 0.887 0.00293 0.0161 0.0675
## 5 11 1 extratrees 0.988 0.965 0.887 0.00326 0.0170 0.0647
## 6 17 1 extratrees 0.987 0.962 0.894 0.00320 0.0179 0.0684
## 7 23 1 extratrees 0.987 0.961 0.891 0.00332 0.0182 0.0652
## 8 20 1 extratrees 0.987 0.964 0.891 0.00330 0.0137 0.0652
## 9 30 1 extratrees 0.986 0.960 0.887 0.00321 0.0188 0.0670
## 10 2 1 gini 0.986 0.967 0.881 0.00317 0.0148 0.0626
## 11 26 1 extratrees 0.986 0.960 0.891 0.00388 0.0187 0.0703
## 12 5 1 gini 0.984 0.957 0.886 0.00387 0.0154 0.0704
## 13 8 1 gini 0.982 0.951 0.886 0.00463 0.0153 0.0672
## 14 11 1 gini 0.981 0.944 0.885 0.00513 0.0187 0.0690
## 15 17 1 gini 0.981 0.936 0.884 0.00524 0.0311 0.0715
## 16 14 1 gini 0.980 0.943 0.884 0.00535 0.0194 0.0730
## 17 20 1 gini 0.980 0.934 0.884 0.00583 0.0316 0.0724
## 18 23 1 gini 0.980 0.933 0.882 0.00612 0.0333 0.0712
## 19 26 1 gini 0.979 0.931 0.885 0.00558 0.0330 0.0721
## 20 30 1 gini 0.979 0.932 0.879 0.00589 0.0331 0.0703
Now we can compare the top 10 most important variables of each model:
plot(varImp(model_glmnet, scale = FALSE), top = 10, main = "glmnet")
plot(varImp(model_randomForest, scale = FALSE), top = 10, main = "randomForest")
Surprisingly, the top variables differ quite a bit between the two models. One possible explanation: many of the predictors are strongly correlated (radius, perimeter and area, for instance), and glmnet importance is based on coefficient magnitudes, so correlated features compete for weight, while permutation importance in ranger spreads the credit across them. Any other hint is welcome :)
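To look at this more directly, we can put both rankings side by side in one table. A minimal sketch, assuming both importance tables expose caret's usual Overall column for these model types:
# illustrative: tabulate both importance rankings next to each other
imp_glmnet <- varImp(model_glmnet, scale = FALSE)$importance %>%
  rownames_to_column("variable") %>%
  rename(glmnet = Overall)
imp_rf <- varImp(model_randomForest, scale = FALSE)$importance %>%
  rownames_to_column("variable") %>%
  rename(randomForest = Overall)
left_join(imp_glmnet, imp_rf, by = "variable") %>%
  arrange(desc(glmnet))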
Compare glmnet and randomForest performance
model_list <- list(glmnet = model_glmnet, randomForest = model_randomForest)
resamples <- resamples(model_list)
# plot the comparison
bwplot(resamples, metric = "ROC")
# plot the comparison per each fold
xyplot(resamples, metric = "ROC")
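Beyond eyeballing the plots, caret can quantify the gap as well. A short sketch using the diff method for resamples objects, which runs paired tests on the fold-wise ROC values:
# paired comparison of the resampled ROC values
roc_diff <- diff(resamples, metric = "ROC")
summary(roc_diff)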
Conclusion
- glmnet seems to fit this data slightly better than randomForest
- even without advanced tuning, a pretty decent level of performance can be achieved
- open point: the difference in variable importance between the two models
Thanks for reading. Sure, there is a lot of room for improvements and tweaks. Any kind of feedback is appreciated.