Vignette written in 2017, small update in 2022.
This is an example of how we can use sparse Partial Least Squares (sPLS) for modelling. The model is implemented in the mixOmics package (http://mixomics.org/). The package has some cross-validation functionality, but it is nowhere near as rich as what the caret package offers. Luckily, mixOmics can be added to the caret ecosystem to take advantage of all these nice features. This small utility package does just that, and can be installed from GitHub using

devtools::install_github("jonathanth/mixOmicsCaret")

Let's look at how to use it. First, we load some required packages.
First, we load some required packages.
library(mlbench)
data(Sonar)
library(doMC)
library(caret)
library(pROC)
library(mixOmics)
library(tidyverse)
library(mixOmicsCaret) # This package
options(width = 90)
Then we can register doMC for the built-in multithreading in caret. A common choice is the number of threads minus one; on a quad-core with hyperthreading (MacBook Pro i7), that's 7.
registerDoMC(cores = 7)
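If you'd rather not hard-code the core count, a small portable sketch (using the parallel package that ships with R) derives it instead:

# Sketch: one worker per core, leaving a core free for the rest of the system
registerDoMC(cores = max(1, parallel::detectCores() - 1))

Note that doMC relies on forking and is therefore unavailable on Windows; doParallel is the usual substitute there.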
We start by examining the dummy data we'll be using. It's called Sonar and comes from the mlbench package.
head(Sonar)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609 0.1582
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918 0.6552
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333 0.7060
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881 0.1992
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152 0.3952
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988 0.4250
## V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24
## 1 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071 0.4328 0.5550
## 2 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052 0.3957 0.3914
## 3 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737 0.4293 0.3648
## 4 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690 0.5556 0.4846
## 5 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292 0.5730 0.5399
## 6 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985 0.5890 0.2872
## V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36
## 1 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507
## 2 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182
## 3 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043
## 4 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012
## 5 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181
## 6 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856
## V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
## 1 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343
## 2 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742
## 3 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744
## 4 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576
## 5 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085
## 6 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494
## V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60
## 1 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## Class
## 1 R
## 2 R
## 3 R
## 4 R
## 5 R
## 6 R
Sonar$Class
## [1] R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R
## [43] R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R
## [85] R R R R R R R R R R R R R M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
## [127] M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
## [169] M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
## Levels: M R
dim(Sonar)
## [1] 208 61
Alright, 60 predictors, 208 observations, 2 classes. Let’s partition the data into a training set and a test set.
set.seed(123)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing  <- Sonar[-inTraining,]
Now we will set up a cross-validation to see how many predictors we should include. First, let's set up the options: 5 repeats of 10-fold CV. We will keep the resamples and the predictions, and allow it to run in parallel. We will turn off verbose mode for now. We save this as a trainControl object, which is just a set of options for caret.
repCV10 <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5,
                        returnResamp = "all",
                        savePredictions = "all",
                        allowParallel = TRUE,
                        verboseIter = FALSE)
Now for the CV loop itself. This will take some time. We have asked it to model Class == "M" as a numeric 0/1, which is fine for spls, but caret throws a warning that we can safely ignore. We will try 1 to 5 components, with a range of keepX values; this is the number of predictors allowed per component. Note the fixX argument, which is left empty for now. It allows us to manually specify the number of predictors per component.
set.seed(123)
SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training,
                  method = get_mixOmics_spls(),
                  preProc = c("center", "scale"),
                  tuneGrid = expand.grid(ncomp = 1:5,
                                         keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60),
                                         keepY = 1),
                  trControl = repCV10,
                  fixX = c())
## Warning in train.default(x, y, weights = w, ...): You are trying to do regression and your
## outcome only has two possible values Are you trying to do classification? If so, use a 2
## level factor as your outcome column.
Let’s inspect the results!
SonarPLS
## sparse PLS (mixOmics)
##
## 157 samples
## 60 predictor
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 141, 141, 141, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## ncomp keepX RMSE Rsquared MAE
## 1 1 0.4428918 0.2778487 0.4032228
## 1 2 0.4426901 0.2794088 0.3996691
## 1 4 0.4379582 0.2921246 0.3911516
## 1 8 0.4373480 0.2935882 0.3880929
## 1 16 0.4317514 0.3069779 0.3752652
## 1 25 0.4220693 0.3306699 0.3612679
## 1 40 0.4183901 0.3332180 0.3555149
## 1 50 0.4207146 0.3233099 0.3591693
## 1 60 0.4217330 0.3188414 0.3625658
## 2 1 0.4540927 0.2273804 0.4005623
## 2 2 0.4532759 0.2294312 0.3947365
## 2 4 0.4337976 0.2929344 0.3728697
## 2 8 0.4202289 0.3317215 0.3599165
## 2 16 0.4234048 0.3220425 0.3623934
## 2 25 0.4147862 0.3491279 0.3565259
## 2 40 0.4020496 0.3835529 0.3494071
## 2 50 0.4027940 0.3799771 0.3511427
## 2 60 0.4050114 0.3732625 0.3542073
## 3 1 0.4366237 0.2818716 0.3787097
## 3 2 0.4376513 0.2825221 0.3748253
## 3 4 0.4304967 0.3022681 0.3662344
## 3 8 0.4280860 0.3085806 0.3669898
## 3 16 0.4284122 0.3129407 0.3697654
## 3 25 0.4174701 0.3470650 0.3599210
## 3 40 0.4167302 0.3509187 0.3592221
## 3 50 0.4177837 0.3501879 0.3592736
## 3 60 0.4157594 0.3549846 0.3569877
## 4 1 0.4469014 0.2671418 0.3874149
## 4 2 0.4582922 0.2459357 0.3920854
## 4 4 0.4522034 0.2516641 0.3842528
## 4 8 0.4351378 0.3012492 0.3724988
## 4 16 0.4298474 0.3252323 0.3702491
## 4 25 0.4244027 0.3374143 0.3632728
## 4 40 0.4212382 0.3514348 0.3601897
## 4 50 0.4223672 0.3486500 0.3610030
## 4 60 0.4233832 0.3474482 0.3609421
## 5 1 0.4607586 0.2318728 0.3965671
## 5 2 0.4713468 0.2205850 0.4006259
## 5 4 0.4547183 0.2565166 0.3868815
## 5 8 0.4343501 0.3081907 0.3700465
## 5 16 0.4291110 0.3339493 0.3669521
## 5 25 0.4328510 0.3302838 0.3654943
## 5 40 0.4331672 0.3414393 0.3657326
## 5 50 0.4351893 0.3359962 0.3687523
## 5 60 0.4323212 0.3391029 0.3683673
##
## Tuning parameter 'keepY' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were ncomp = 2, keepX = 40 and keepY = 1.
ggplot(SonarPLS, metric = "RMSE")
ggplot(SonarPLS, metric = "Rsquared")
We can also manually extract the predictions, calculate AUC, and plot it:
SonarPLS$pred %>%
  separate(Resample, c("Fold", "Rep")) %>%
group_by(ncomp, keepX, Rep) %>%
summarize(auc = pROC::auc(predictor = pred, obs, direction = "<") %>% as.numeric) %>%
ggplot(aes(x = factor(keepX), y = auc)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = factor(ncomp)), alpha = 0.2, show.legend = FALSE) +
geom_hline(yintercept = 0.5) +
facet_grid(. ~ ncomp) +
scale_color_brewer(palette = "Set1", name = NULL) +
theme_bw() + theme(strip.background = element_blank()) +
xlab("Number of variables")
Looks like the first component saturates at 25 predictors. Let's use the fixX parameter to force component 1 to have 25 predictors and let the following components try out all combinations. This allows us to flexibly set the number of predictors for each component individually. Let's re-run the cross-validation with this option.
set.seed(123)
SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training,
                  method = get_mixOmics_spls(),
                  preProc = c("center", "scale"),
                  tuneGrid = expand.grid(ncomp = 1:5,
                                         keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60),
                                         keepY = 1),
                  trControl = repCV10,
                  fixX = c(25))
ggplot(SonarPLS, metric = "Rsquared")
Alright, we can do a little bit better if we add a second component with 60 variables. But after that, the model starts overfitting: the CV error increases and the Rsquared decreases.
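To confirm this from the numbers rather than the plot, one quick sketch (assuming dplyr >= 1.0 for slice_min) pulls the best row per component count from the results table that caret stores on the train object:

# Sketch: best mean RMSE for each number of components
SonarPLS$results %>%
  group_by(ncomp) %>%
  slice_min(RMSE, n = 1) %>%
  select(ncomp, keepX, RMSE, Rsquared)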
Looks like we have found our final model! Let's retrieve the predicted values, which we can use for unbiased downstream analyses. We want an average across the repeats, to ensure we didn't just get lucky with the fold divisions. Since we have a test set, we can also look at the test set performance of the final model, using the predict function on the SonarPLS object with the testing dataframe as new data.
# Retrieve the predictions from the optimal model
bestSet <- SonarPLS %>% get_best_predictions %>% group_by(Rep)
head(bestSet)
## # A tibble: 6 × 8
## # Groups: Rep [1]
## pred obs rowIndex ncomp keepX keepY Fold Rep
## <dbl> <dbl> <int> <int> <dbl> <dbl> <chr> <chr>
## 1 0.362 0 1 2 60 1 Fold02 Rep1
## 2 1.10 0 2 2 60 1 Fold08 Rep1
## 3 0.332 0 3 2 60 1 Fold05 Rep1
## 4 0.619 0 4 2 60 1 Fold01 Rep1
## 5 0.326 0 5 2 60 1 Fold04 Rep1
## 6 0.585 0 6 2 60 1 Fold09 Rep1
bestSet %>%
  summarize(auc = pROC::auc(obs, pred, direction = "<") %>% as.numeric) %>%
arrange(auc)
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## # A tibble: 5 × 2
## Rep auc
## <chr> <dbl>
## 1 Rep3 0.859
## 2 Rep1 0.860
## 3 Rep4 0.862
## 4 Rep5 0.862
## 5 Rep2 0.868
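The repeats agree closely. An alternative to picking a single repeat (as we do next with Rep3) is to average each sample's out-of-fold prediction across the five repeats; a minimal sketch, using the columns shown in bestSet above:

# Sketch: mean prediction per sample across repeats, then one overall AUC
avgPreds <- bestSet %>%
  group_by(rowIndex, obs) %>%
  summarize(pred = mean(pred), .groups = "drop")
pROC::auc(avgPreds$obs, avgPreds$pred, direction = "<")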
savedPreds <- SonarPLS %>% get_best_predictions %>% filter(Rep == "Rep3")
head(savedPreds)
## pred obs rowIndex ncomp keepX keepY Fold Rep
## 1 0.2950255 0 1 2 60 1 Fold06 Rep3
## 2 1.2395565 0 2 2 60 1 Fold03 Rep3
## 3 0.2266207 0 3 2 60 1 Fold01 Rep3
## 4 0.6671375 0 4 2 60 1 Fold01 Rep3
## 5 0.3880590 0 5 2 60 1 Fold01 Rep3
## 6 0.5428043 0 6 2 60 1 Fold05 Rep3
# Best CV predictions
cvauc <- auc(obs ~ pred, data = savedPreds, direction = "<")
## Setting levels: control = 0, case = 1
qplot(ifelse(obs == 1, "M", "R"), pred, data = savedPreds, geom = c("boxplot", "jitter")) +
xlab("Training class") +
ylab("Training predictions") +
ggtitle(paste0("CV AUC = ", round(cvauc, 3)))
# Test set predictions
testPreds <- predict(SonarPLS, testing)
testauc <- auc(predictor = testPreds, testing$Class, direction = ">")
## Setting levels: control = M, case = R
qplot(testing$Class, testPreds, geom = c("boxplot", "jitter")) +
xlab("Testing class") +
ylab("Testing predictions") +
ggtitle(paste0("Test AUC = ", round(cvauc, 3)))
We can also run e.g. a logistic regression with these predictions, or use them for other downstream stuff.
# Training (CV)
glm(obs ~ pred, data = savedPreds, family = binomial) %>% summary
##
## Call:
## glm(formula = obs ~ pred, family = binomial, data = savedPreds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9731 -0.7594 0.1373 0.7380 2.1659
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7578 0.5074 -5.435 5.47e-08 ***
## pred 5.7804 0.9754 5.926 3.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 216.88 on 156 degrees of freedom
## Residual deviance: 151.66 on 155 degrees of freedom
## AIC: 155.66
##
## Number of Fisher Scoring iterations: 5
# Test
glm(testing$Class == "M" ~ testPreds, family = binomial) %>% summary
##
## Call:
## glm(formula = testing$Class == "M" ~ testPreds, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8312 -0.9711 0.2897 0.8381 1.7565
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0826 0.7679 -2.712 0.00669 **
## testPreds 4.5750 1.4825 3.086 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 70.524 on 50 degrees of freedom
## Residual deviance: 56.768 on 49 degrees of freedom
## AIC: 60.768
##
## Number of Fisher Scoring iterations: 4
As you can see, the CV performance tends to be quite close to the test set performance in this setup.
Now we would like to get some information on the variables included in the model. Luckily, we can easily extract the loadings, either from the full training model or from our favourite CV run.
# Retrieve the loadings from the full training model
SonarPLS %>% get_loadings %>% head # Returns a data.frame for easy plotting
## var comp loading sd xy
## 1 V2 comp1 0.01778986 NA x
## 2 V4 comp1 0.10139233 NA x
## 3 V5 comp1 0.05297409 NA x
## 4 V9 comp1 0.22065636 NA x
## 5 V10 comp1 0.24919542 NA x
## 6 V11 comp1 0.49892869 NA x
%>% get_loadings("CV", rep = "Rep3") %>%
SonarPLS ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) +
facet_wrap(~ comp, scales = "free_x") +
geom_errorbar() +
geom_bar(stat = "identity") +
ggtitle("CV, median rep (Rep3)")
%>% get_loadings("finalModel") %>%
SonarPLS ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) +
facet_wrap(~ comp, scales = "free_x") +
geom_errorbar() +
geom_bar(stat = "identity") +
ggtitle("Full model")
Hope this proves useful for you! Read the documentation on the methods with ?get_mixOmics_spls, ?get_loadings and ?get_best_predictions.