Vignette written in 2017, small update in 2022.
This is an example of how we can use sparse Partial Least Squares (sPLS) for modelling. The model is implemented in the mixOmics package (http://mixomics.org/). The package has some cross-validation functionality, but it is nowhere near as rich as what the caret package offers. Luckily, mixOmics can be added to the caret ecosystem to take advantage of all these nice features. This small utility package does just that, and can be installed from GitHub using

devtools::install_github("jonathanth/mixOmicsCaret")

Let's look at how to use it. First, we load some required packages.
First, we load some required packages.
library(mlbench)
data(Sonar)
library(doMC)
library(caret)
library(pROC)
library(mixOmics)
library(tidyverse)
library(mixOmicsCaret) # This package
options(width = 90)
Then we can register doMC for the built-in multithreading in caret. A common choice is the number of threads minus one; on a quad-core with hyperthreading (MacBook Pro i7), that's 7.
registerDoMC(cores = 7)
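If you'd rather not hard-code the core count, a small portable sketch (using the parallel package that ships with R) derives it instead:

# Sketch: one worker per core, leaving a core free for the rest of the system
registerDoMC(cores = max(1, parallel::detectCores() - 1))

Note that doMC relies on forking and is therefore unavailable on Windows; doParallel is the usual substitute there.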
We start by examining the dummy data we'll be using. It's called Sonar and comes from the mlbench package.
head(Sonar)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609 0.1582
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918 0.6552
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333 0.7060
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881 0.1992
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152 0.3952
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988 0.4250
## V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24
## 1 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071 0.4328 0.5550
## 2 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052 0.3957 0.3914
## 3 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737 0.4293 0.3648
## 4 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690 0.5556 0.4846
## 5 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292 0.5730 0.5399
## 6 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985 0.5890 0.2872
## V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36
## 1 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121 0.7547 0.8537 0.8507
## 2 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984 0.2341 0.1306 0.4182
## 3 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862 0.2709 0.4232 0.3043
## 4 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120 0.3497 0.3953 0.3012
## 5 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881 0.0981 0.1951 0.4181
## 6 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299 0.2198 0.1407 0.2856
## V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
## 1 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256 0.2641 0.1386 0.1051 0.1343
## 2 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628 0.0621 0.0203 0.0530 0.0742
## 3 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222 0.2111 0.0176 0.1348 0.0744
## 4 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202 0.4295 0.3654 0.2655 0.1576
## 5 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841 0.0692 0.0528 0.0357 0.0085
## 6 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238 0.1192 0.1089 0.0623 0.0494
## V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60
## 1 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072 0.0167 0.0180 0.0084 0.0090 0.0032
## 2 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094 0.0191 0.0140 0.0049 0.0052 0.0044
## 3 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180 0.0244 0.0316 0.0164 0.0095 0.0078
## 4 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085 0.0073 0.0050 0.0044 0.0040 0.0117
## 5 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110 0.0015 0.0072 0.0048 0.0107 0.0094
## 6 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013 0.0089 0.0057 0.0027 0.0051 0.0062
## Class
## 1 R
## 2 R
## 3 R
## 4 R
## 5 R
## 6 R
Sonar$Class
## [1] R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R
## [43] R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R
## [85] R R R R R R R R R R R R R M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
## [127] M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
## [169] M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M
## Levels: M R
dim(Sonar)
## [1] 208 61
Alright, 60 predictors, 208 observations, 2 classes. Let’s partition the data into a training set and a test set.
set.seed(123)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing  <- Sonar[-inTraining,]
Now we will set up a cross-validation to see how many predictors we should include. First, let's set up the options: 5 repeats of 10-fold CV. We will keep the resamples and the predictions, and allow it to run in parallel. We will turn off verbose mode for now. We save this as a trainControl object, which is just a set of options for caret.
repCV10 <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 5,
                        returnResamp = "all",
                        savePredictions = "all",
                        allowParallel = TRUE,
                        verboseIter = FALSE)
Now for the CV loop itself. This will take some time. We have asked it to model Class == "M" as a numeric 0/1, which is fine for spls, but caret throws a warning that we can safely ignore. We will try 1 to 5 components, with a range of keepX values; this is the number of predictors allowed per component. Note the fixX argument, which is left empty for now. It allows us to manually specify the number of predictors per component.
set.seed(123)
SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training,
                  method = get_mixOmics_spls(),
                  preProc = c("center", "scale"),
                  tuneGrid = expand.grid(ncomp = 1:5,
                                         keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60),
                                         keepY = 1),
                  trControl = repCV10,
                  fixX = c())
## Warning in train.default(x, y, weights = w, ...): You are trying to do regression and your
## outcome only has two possible values Are you trying to do classification? If so, use a 2
## level factor as your outcome column.
Let’s inspect the results!
SonarPLS
## sparse PLS (mixOmics)
##
## 157 samples
## 60 predictor
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 141, 141, 141, 142, 141, 142, ...
## Resampling results across tuning parameters:
##
## ncomp keepX RMSE Rsquared MAE
## 1 1 0.4428918 0.2778487 0.4032228
## 1 2 0.4426901 0.2794088 0.3996691
## 1 4 0.4379582 0.2921246 0.3911516
## 1 8 0.4373480 0.2935882 0.3880929
## 1 16 0.4317514 0.3069779 0.3752652
## 1 25 0.4220693 0.3306699 0.3612679
## 1 40 0.4183901 0.3332180 0.3555149
## 1 50 0.4207146 0.3233099 0.3591693
## 1 60 0.4217330 0.3188414 0.3625658
## 2 1 0.4540927 0.2273804 0.4005623
## 2 2 0.4532759 0.2294312 0.3947365
## 2 4 0.4337976 0.2929344 0.3728697
## 2 8 0.4202289 0.3317215 0.3599165
## 2 16 0.4234048 0.3220425 0.3623934
## 2 25 0.4147862 0.3491279 0.3565259
## 2 40 0.4020496 0.3835529 0.3494071
## 2 50 0.4027940 0.3799771 0.3511427
## 2 60 0.4050114 0.3732625 0.3542073
## 3 1 0.4366237 0.2818716 0.3787097
## 3 2 0.4376513 0.2825221 0.3748253
## 3 4 0.4304967 0.3022681 0.3662344
## 3 8 0.4280860 0.3085806 0.3669898
## 3 16 0.4284122 0.3129407 0.3697654
## 3 25 0.4174701 0.3470650 0.3599210
## 3 40 0.4167302 0.3509187 0.3592221
## 3 50 0.4177837 0.3501879 0.3592736
## 3 60 0.4157594 0.3549846 0.3569877
## 4 1 0.4469014 0.2671418 0.3874149
## 4 2 0.4582922 0.2459357 0.3920854
## 4 4 0.4522034 0.2516641 0.3842528
## 4 8 0.4351378 0.3012492 0.3724988
## 4 16 0.4298474 0.3252323 0.3702491
## 4 25 0.4244027 0.3374143 0.3632728
## 4 40 0.4212382 0.3514348 0.3601897
## 4 50 0.4223672 0.3486500 0.3610030
## 4 60 0.4233832 0.3474482 0.3609421
## 5 1 0.4607586 0.2318728 0.3965671
## 5 2 0.4713468 0.2205850 0.4006259
## 5 4 0.4547183 0.2565166 0.3868815
## 5 8 0.4343501 0.3081907 0.3700465
## 5 16 0.4291110 0.3339493 0.3669521
## 5 25 0.4328510 0.3302838 0.3654943
## 5 40 0.4331672 0.3414393 0.3657326
## 5 50 0.4351893 0.3359962 0.3687523
## 5 60 0.4323212 0.3391029 0.3683673
##
## Tuning parameter 'keepY' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were ncomp = 2, keepX = 40 and keepY = 1.
ggplot(SonarPLS, metric = "RMSE")
ggplot(SonarPLS, metric = "Rsquared")
We can also manually extract the predictions, calculate AUC, and plot it:
SonarPLS$pred %>%
  separate(Resample, c("Fold", "Rep")) %>%
group_by(ncomp, keepX, Rep) %>%
summarize(auc = pROC::auc(predictor = pred, obs, direction = "<") %>% as.numeric) %>%
ggplot(aes(x = factor(keepX), y = auc)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = factor(ncomp)), alpha = 0.2, show.legend = FALSE) +
geom_hline(yintercept = 0.5) +
facet_grid(. ~ ncomp) +
scale_color_brewer(palette = "Set1", name = NULL) +
theme_bw() + theme(strip.background = element_blank()) +
xlab("Number of variables")
Looks like the first component saturates at 25 predictors. Let's use the fixX parameter to force component 1 to have 25 predictors and let the following components try out all combinations. This allows us to flexibly set the number of predictors for each component individually. Let's re-run the cross-validation with this option.
set.seed(123)
SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training,
                  method = get_mixOmics_spls(),
                  preProc = c("center", "scale"),
                  tuneGrid = expand.grid(ncomp = 1:5,
                                         keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60),
                                         keepY = 1),
                  trControl = repCV10,
                  fixX = c(25))
ggplot(SonarPLS, metric = "Rsquared")
Alright, we can do a little bit better if we add a second component with 60 variables. But after that, the model starts overfitting: the CV error increases and the Rsquared decreases.
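To confirm this from the numbers rather than the plot, one quick sketch (assuming dplyr >= 1.0 for slice_min) pulls the best row per component count from the results table that caret stores on the train object:

# Sketch: best mean RMSE for each number of components
SonarPLS$results %>%
  group_by(ncomp) %>%
  slice_min(RMSE, n = 1) %>%
  select(ncomp, keepX, RMSE, Rsquared)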
Looks like we have found our final model! Let's retrieve the predicted values, which we can use for unbiased downstream analyses. We want an average across the repeats, to ensure we didn't just get lucky with the fold divisions. Since we have a test set, we can also look at the test set performance of the final model, using the predict function on the SonarPLS object with the testing dataframe as new data.
# Retrieve the predictions from the optimal model
bestSet <- SonarPLS %>% get_best_predictions %>% group_by(Rep)
head(bestSet)
## # A tibble: 6 × 8
## # Groups: Rep [1]
## pred obs rowIndex ncomp keepX keepY Fold Rep
## <dbl> <dbl> <int> <int> <dbl> <dbl> <chr> <chr>
## 1 0.362 0 1 2 60 1 Fold02 Rep1
## 2 1.10 0 2 2 60 1 Fold08 Rep1
## 3 0.332 0 3 2 60 1 Fold05 Rep1
## 4 0.619 0 4 2 60 1 Fold01 Rep1
## 5 0.326 0 5 2 60 1 Fold04 Rep1
## 6 0.585 0 6 2 60 1 Fold09 Rep1
bestSet %>%
  summarize(auc = pROC::auc(obs, pred, direction = "<") %>% as.numeric) %>%
arrange(auc)
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## Setting levels: control = 0, case = 1
## # A tibble: 5 × 2
## Rep auc
## <chr> <dbl>
## 1 Rep3 0.859
## 2 Rep1 0.860
## 3 Rep4 0.862
## 4 Rep5 0.862
## 5 Rep2 0.868
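The repeats agree closely. An alternative to picking a single repeat (as we do next with Rep3) is to average each sample's out-of-fold prediction across the five repeats; a minimal sketch, using the columns shown in bestSet above:

# Sketch: mean prediction per sample across repeats, then one overall AUC
avgPreds <- bestSet %>%
  group_by(rowIndex, obs) %>%
  summarize(pred = mean(pred), .groups = "drop")
pROC::auc(avgPreds$obs, avgPreds$pred, direction = "<")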
savedPreds <- SonarPLS %>% get_best_predictions %>% filter(Rep == "Rep3")
head(savedPreds)
## pred obs rowIndex ncomp keepX keepY Fold Rep
## 1 0.2950255 0 1 2 60 1 Fold06 Rep3
## 2 1.2395565 0 2 2 60 1 Fold03 Rep3
## 3 0.2266207 0 3 2 60 1 Fold01 Rep3
## 4 0.6671375 0 4 2 60 1 Fold01 Rep3
## 5 0.3880590 0 5 2 60 1 Fold01 Rep3
## 6 0.5428043 0 6 2 60 1 Fold05 Rep3
# Best CV predictions
cvauc <- auc(obs ~ pred, data = savedPreds, direction = "<")
## Setting levels: control = 0, case = 1
qplot(ifelse(obs == 1, "M", "R"), pred, data = savedPreds, geom = c("boxplot", "jitter")) +
xlab("Training class") +
ylab("Training predictions") +
ggtitle(paste0("CV AUC = ", round(cvauc, 3)))
# Test set predictions
testPreds <- predict(SonarPLS, testing)
testauc <- auc(predictor = testPreds, testing$Class, direction = ">")
## Setting levels: control = M, case = R
qplot(testing$Class, testPreds, geom = c("boxplot", "jitter")) +
xlab("Testing class") +
ylab("Testing predictions") +
ggtitle(paste0("Test AUC = ", round(cvauc, 3)))
We can also run e.g. a logistic regression with these predictions, or use them for other downstream stuff.
# Training (CV)
glm(obs ~ pred, data = savedPreds, family = binomial) %>% summary
##
## Call:
## glm(formula = obs ~ pred, family = binomial, data = savedPreds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9731 -0.7594 0.1373 0.7380 2.1659
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7578 0.5074 -5.435 5.47e-08 ***
## pred 5.7804 0.9754 5.926 3.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 216.88 on 156 degrees of freedom
## Residual deviance: 151.66 on 155 degrees of freedom
## AIC: 155.66
##
## Number of Fisher Scoring iterations: 5
# Test
glm(testing$Class == "M" ~ testPreds, family = binomial) %>% summary
##
## Call:
## glm(formula = testing$Class == "M" ~ testPreds, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8312 -0.9711 0.2897 0.8381 1.7565
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0826 0.7679 -2.712 0.00669 **
## testPreds 4.5750 1.4825 3.086 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 70.524 on 50 degrees of freedom
## Residual deviance: 56.768 on 49 degrees of freedom
## AIC: 60.768
##
## Number of Fisher Scoring iterations: 4
As you can see, the CV performance tends to be quite close to the test set performance in this setup.
Now we would like to get some information on the variables included in the model. Luckily, we can easily extract the loadings, either from the full training model or from our favourite CV run.
# Retrieve the loadings from the full training model
SonarPLS %>% get_loadings %>% head # Returns a data.frame for easy plotting
## var comp loading sd xy
## 1 V2 comp1 0.01778986 NA x
## 2 V4 comp1 0.10139233 NA x
## 3 V5 comp1 0.05297409 NA x
## 4 V9 comp1 0.22065636 NA x
## 5 V10 comp1 0.24919542 NA x
## 6 V11 comp1 0.49892869 NA x
%>% get_loadings("CV", rep = "Rep3") %>%
SonarPLS ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) +
facet_wrap(~ comp, scales = "free_x") +
geom_errorbar() +
geom_bar(stat = "identity") +
ggtitle("CV, median rep (Rep3)")
%>% get_loadings("finalModel") %>%
SonarPLS ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) +
facet_wrap(~ comp, scales = "free_x") +
geom_errorbar() +
geom_bar(stat = "identity") +
ggtitle("Full model")
Hope this proves useful for you! Read the documentation on the methods with ?get_mixOmics_spls, ?get_loadings and ?get_best_predictions.