<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>variable selection | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/category/variable-selection/</link>
      <atom:link href="https://tengkuhanis.netlify.app/category/variable-selection/index.xml" rel="self" type="application/rss+xml" />
    <description>variable selection</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Sat, 08 Jan 2022 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>variable selection</title>
      <link>https://tengkuhanis.netlify.app/category/variable-selection/</link>
    </image>
    
    <item>
      <title>A short note on variable selection</title>
      <link>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</link>
      <pubDate>Sat, 08 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;variable-selection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variable selection&lt;/h2&gt;
&lt;p&gt;Variable or feature selection is an important step in both machine learning and statistical analysis. This post is geared more towards the machine learning side. Certain machine learning models such as support vector machines (SVM) and neural networks do not handle irrelevant predictors very well, whereas models such as linear and logistic regression do not handle correlated predictors very well. Thus, careful selection of the variables helps mitigate these issues and can further improve predictive performance.&lt;/p&gt;
&lt;p&gt;There are three types of approaches in variable selection:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Intrinsic (or built-in feature selection)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Intrinsic feature selection is feature selection embedded in the algorithm itself. Some examples include:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Tree- and rule-based models - decision trees, random forests, etc.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Multivariate adaptive regression splines (MARS)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Regularization methods such as the least absolute shrinkage and selection operator (LASSO or L1)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The advantages of this approach are that it is fast and computationally efficient. However, the subset of variables selected is model dependent.&lt;/p&gt;
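&lt;p&gt;As a minimal sketch of intrinsic selection via regularization (assuming the &lt;code&gt;glmnet&lt;/code&gt; package is installed; &lt;code&gt;mtcars&lt;/code&gt; is used purely for illustration), a LASSO fit keeps only the predictors with non-zero coefficients:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(glmnet)

# LASSO (alpha = 1); cv.glmnet picks the penalty by cross-validation
x &amp;lt;- as.matrix(mtcars[, -1]) # predictors
y &amp;lt;- mtcars$mpg              # outcome

set.seed(123)
cv_fit &amp;lt;- cv.glmnet(x, y, alpha = 1)

# predictors with non-zero coefficients at lambda.1se are the selected ones
coefs &amp;lt;- as.matrix(coef(cv_fit, s = &amp;quot;lambda.1se&amp;quot;))
setdiff(rownames(coefs)[coefs[, 1] != 0], &amp;quot;(Intercept)&amp;quot;)&lt;/code&gt;&lt;/pre&gt;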
&lt;p&gt;&lt;strong&gt;2. Filter&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the filter approach we determine variable importance outside the model, usually for each variable separately (though not necessarily). An example is the univariate filter: if the outcome has two categories, we can use a t-test to assess the numerical predictors, and variables with a significant p-value or a large t-statistic are chosen.&lt;/p&gt;
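&lt;p&gt;A minimal sketch of such a filter (assuming &lt;code&gt;mtcars&lt;/code&gt; with the binary &lt;code&gt;am&lt;/code&gt; column as the outcome; the 0.05 cut-off is an illustrative choice):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# univariate filter: t-test each numeric predictor against a binary outcome
outcome &amp;lt;- factor(mtcars$am)
predictors &amp;lt;- mtcars[, c(&amp;quot;mpg&amp;quot;, &amp;quot;disp&amp;quot;, &amp;quot;hp&amp;quot;, &amp;quot;drat&amp;quot;, &amp;quot;wt&amp;quot;, &amp;quot;qsec&amp;quot;)]

p_values &amp;lt;- sapply(predictors, function(x) t.test(x ~ outcome)$p.value)

# keep the predictors below the cut-off
names(p_values)[p_values &amp;lt; 0.05]&lt;/code&gt;&lt;/pre&gt;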
&lt;p&gt;This approach is very simple and fast. However, the subset of variables selected using a filtering criterion such as statistical significance may not give the best predictive performance. Additionally, this approach is prone to over-selecting predictors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Wrapper&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are two types of wrapper approaches:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A greedy algorithm directs its search towards whatever yields the best immediate benefit at each step. For this reason, it cannot escape local minima. In Figure 1 below, we can think of a local minimum as a locally best set of predictors and the global minimum as the globally best set of predictors.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;img.png&#34; alt=&#34;Local minima and global minima&#34; width=&#34;576&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Local minima and global minima
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;An example of this approach is recursive feature elimination, or backward selection. The main weakness of this greedy approach is that the selected subset of features may not have the best predictive performance.&lt;/p&gt;
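&lt;p&gt;For instance, recursive feature elimination is available in the &lt;code&gt;caret&lt;/code&gt; package via &lt;code&gt;rfe()&lt;/code&gt;; a minimal sketch (assuming random forest as the inner model and &lt;code&gt;mtcars&lt;/code&gt; for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)

# recursive feature elimination with random forest, resampled by 5-fold CV
set.seed(123)
rfe_ctrl &amp;lt;- rfeControl(functions = rfFuncs, method = &amp;quot;cv&amp;quot;, number = 5)
rfe_fit &amp;lt;- rfe(x = mtcars[, -1], y = mtcars$mpg,
               sizes = c(2, 4, 6, 8), # subset sizes to try
               rfeControl = rfe_ctrl)
predictors(rfe_fit) # the selected subset&lt;/code&gt;&lt;/pre&gt;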
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Non-greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Examples of this approach are simulated annealing and the &lt;a href=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/&#34;&gt;genetic algorithm&lt;/a&gt;. Both of these algorithms incorporate randomness in their search; hence, they are classified as non-greedy wrappers. Due to this randomness, they can escape a local minimum (see Figure 1 above).&lt;/p&gt;
&lt;p&gt;The wrapper type has the best chance of finding the globally best predictors. However, it is computationally expensive and has a tendency to overfit (some packages like &lt;code&gt;caret&lt;/code&gt; use resampling to mitigate this issue).&lt;/p&gt;
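&lt;p&gt;As a hedged sketch, &lt;code&gt;caret&lt;/code&gt; also implements simulated annealing for feature selection via &lt;code&gt;safs()&lt;/code&gt;, which mirrors the &lt;code&gt;gafs()&lt;/code&gt; interface (&lt;code&gt;mtcars&lt;/code&gt; and the small number of iterations are purely for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)

# simulated annealing wrapper; the external CV guards against overfitting
set.seed(123)
sa_ctrl &amp;lt;- safsControl(functions = rfSA, method = &amp;quot;cv&amp;quot;, number = 5)
sa_fit &amp;lt;- safs(x = mtcars[, -1], y = mtcars$mpg,
               iters = 10, # annealing iterations, kept small here
               safsControl = sa_ctrl)
sa_fit$optVariables # the selected subset&lt;/code&gt;&lt;/pre&gt;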
&lt;/div&gt;
&lt;div id=&#34;suggested-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Suggested approach&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://bookdown.org/max/FES/&#34;&gt;Kuhn &amp;amp; Johnson (2019)&lt;/a&gt; suggested this approach:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Start with an intrinsic approach&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, do a wrapper approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If a linear intrinsic approach performs better - proceed to a wrapper method with a linear model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a non-linear intrinsic approach performs better - proceed to a wrapper method with a non-linear model&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If several approaches select a large number of predictors, it may not be feasible to reduce the number of features&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&#34; class=&#34;uri&#34;&gt;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://topepo.github.io/caret/feature-selection-overview.html&#34; class=&#34;uri&#34;&gt;http://topepo.github.io/caret/feature-selection-overview.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Stepwise selection after multiple imputation</title>
      <link>https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/</link>
      <pubDate>Tue, 04 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;some-note&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some note&lt;/h2&gt;
&lt;p&gt;I have previously written two posts about multiple imputation using the &lt;code&gt;mice&lt;/code&gt; package:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/&#34;&gt;A short note on multiple imputation&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/&#34;&gt;Variable selection for imputation model in {mice}&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is probably my last post about multiple imputation using the &lt;code&gt;mice&lt;/code&gt; package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;stepwise-selection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Stepwise selection&lt;/h2&gt;
&lt;p&gt;The general steps in the &lt;code&gt;mice&lt;/code&gt; package are (a minimal sketch follows the list):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;mice()&lt;/code&gt; - impute the NAs&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with()&lt;/code&gt; - run the analysis (lm, glm, etc)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pool()&lt;/code&gt; - pool the results&lt;/li&gt;
&lt;/ol&gt;
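&lt;p&gt;A minimal sketch of these three steps (using the &lt;code&gt;nhanes&lt;/code&gt; data bundled with &lt;code&gt;mice&lt;/code&gt;; the linear model is just for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)

# 1. impute the NAs, 2. analyse each imputed dataset, 3. pool the results
imp &amp;lt;- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)
fit &amp;lt;- with(imp, lm(chl ~ bmi + age))
summary(pool(fit))&lt;/code&gt;&lt;/pre&gt;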
&lt;p&gt;For backward and forward selection, we can do it manually after pooling the results in step 3, but we cannot do this for stepwise selection.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://books.google.com.my/books/about/Development_Implementation_and_Evaluatio.html?id=-Y0TywAACAAJ&amp;amp;redir_esc=y&#34;&gt;Brand (1999)&lt;/a&gt; proposed this solution:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Perform stepwise selection separately on each imputed dataset&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Fit a preliminary model that contains all variables present in at least half of the models from step 1&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Apply backward elimination on the variables in the preliminary model (variables are removed one by one if p &amp;gt; 0.05)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Repeat step 3 until all variables have p values &amp;lt; 0.05&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So, we are going to apply this solution, using the multivariate Wald test (&lt;code&gt;D1()&lt;/code&gt; in the &lt;code&gt;mice&lt;/code&gt; package) for model comparison instead of the pooled likelihood ratio p-value.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create missing data. We are going to use the famous &lt;code&gt;mtcars&lt;/code&gt; dataset, which is already available in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat &amp;lt;- 
  mtcars %&amp;gt;% 
  mutate(across(c(vs, am), as.factor)) %&amp;gt;% 
  select(-mpg) %&amp;gt;% 
  missForest::prodNA(0.1) %&amp;gt;% 
  bind_cols(mpg = mtcars$mpg)
summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       cyl             disp             hp             drat      
##  Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:4.000   1st Qu.:120.7   1st Qu.:103.0   1st Qu.:3.150  
##  Median :6.000   Median :225.0   Median :123.0   Median :3.715  
##  Mean   :6.148   Mean   :232.8   Mean   :147.4   Mean   :3.642  
##  3rd Qu.:8.000   3rd Qu.:334.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
##  NA&amp;#39;s   :5       NA&amp;#39;s   :1       NA&amp;#39;s   :4       NA&amp;#39;s   :2      
##        wt             qsec          vs        am          gear     
##  Min.   :1.513   Min.   :14.50   0   :17   0   :18   Min.   :3.00  
##  1st Qu.:2.429   1st Qu.:16.88   1   :11   1   :10   1st Qu.:3.00  
##  Median :3.203   Median :17.51   NA&amp;#39;s: 4   NA&amp;#39;s: 4   Median :4.00  
##  Mean   :3.112   Mean   :17.75                       Mean   :3.71  
##  3rd Qu.:3.533   3rd Qu.:18.83                       3rd Qu.:4.00  
##  Max.   :5.424   Max.   :22.90                       Max.   :5.00  
##  NA&amp;#39;s   :4       NA&amp;#39;s   :2                           NA&amp;#39;s   :1     
##       carb            mpg       
##  Min.   :1.000   Min.   :10.40  
##  1st Qu.:2.000   1st Qu.:15.43  
##  Median :2.000   Median :19.20  
##  Mean   :2.667   Mean   :20.09  
##  3rd Qu.:4.000   3rd Qu.:22.80  
##  Max.   :6.000   Max.   :33.90  
##  NA&amp;#39;s   :5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;mice()&lt;/code&gt; on missing data with 10 imputed datasets (&lt;code&gt;m = 10&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;datImp &amp;lt;- mice(dat, m = 10, printFlag = F, seed = 123)
datImp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  10 
## Imputation methods:
##      cyl     disp       hp     drat       wt     qsec       vs       am 
##    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot; &amp;quot;logreg&amp;quot; &amp;quot;logreg&amp;quot; 
##     gear     carb      mpg 
##    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;       &amp;quot;&amp;quot; 
## PredictorMatrix:
##      cyl disp hp drat wt qsec vs am gear carb mpg
## cyl    0    1  1    1  1    1  1  1    1    1   1
## disp   1    0  1    1  1    1  1  1    1    1   1
## hp     1    1  0    1  1    1  1  1    1    1   1
## drat   1    1  1    0  1    1  1  1    1    1   1
## wt     1    1  1    1  0    1  1  1    1    1   1
## qsec   1    1  1    1  1    0  1  1    1    1   1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run stepwise selection on each imputed dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sc &amp;lt;- list(upper = ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, 
           lower = ~ 1)
exp &amp;lt;- expression(f1 &amp;lt;- lm(mpg ~ 1),
                  f2 &amp;lt;- step(f1, scope = sc, trace = 0))
fit &amp;lt;- with(datImp, exp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we count how many times each variable is selected across the models by stepwise selection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit$analyses %&amp;gt;% 
  map(formula) %&amp;gt;% #get the formula
  map(terms) %&amp;gt;% #get the terms
  map(labels) %&amp;gt;% #get the name of variables
  unlist() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## .
##   am carb  cyl disp drat   hp qsec   vs   wt 
##    7    5    3    2    4    5    3    4    7&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to select:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;am&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;carb&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;hp&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;wt&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These variables appear in at least half of the models. We have 10 imputed datasets, and thus 10 models. Next, we fit a preliminary model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full1 &amp;lt;- with(datImp, lm(mpg ~ am + carb + hp + wt))
pool(fit_full1) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.33683070 3.30280913 10.093478 15.81838 2.688191e-08
## 2         am1  3.06689135 1.94363342  1.577917 13.06329 1.384846e-01
## 3        carb -0.64791214 0.65564816 -0.988201 11.64959 3.431353e-01
## 4          hp -0.03414274 0.01159828 -2.943777 20.47239 7.895170e-03
## 5          wt -2.39586280 1.22218829 -1.960306 13.54830 7.085513e-02&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We exclude the carb variable in the next model as it has the largest non-significant p-value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full2 &amp;lt;- with(datImp, lm(mpg ~ am + hp + wt))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we compare the two models using the multivariate Wald test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;D1(fit_full1, fit_full2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    test statistic df1     df2 dfcom   p.value       riv
##  1 ~~ 2 0.9765411   1 9.21378    27 0.3482934 0.6935655&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since p &amp;gt; 0.05, we opt for the simpler model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pool(fit_full2) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.75666324 3.30083213 10.226713 16.87762 1.195383e-08
## 2         am1  2.50264907 1.79966590  1.390619 15.31418 1.842201e-01
## 3          hp -0.03950216 0.01162689 -3.397482 17.65719 3.280147e-03
## 4          wt -2.75412354 1.15870950 -2.376889 15.03403 3.116779e-02&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the am variable now has the largest non-significant p-value. So, we exclude it in the next model and compare the two latest models using the multivariate Wald test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full3 &amp;lt;- with(datImp, lm(mpg ~ hp + wt))
D1(fit_full2, fit_full3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    test statistic df1      df2 dfcom   p.value       riv
##  1 ~~ 2   1.93382   1 12.90982    28 0.1878483 0.4392918&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, p &amp;gt; 0.05, so we opt for the simpler model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pool(fit_full3) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 37.50546490 1.91102857 19.625800 23.65472 4.440892e-16
## 2          hp -0.03263534 0.01042989 -3.129021 21.20234 5.031751e-03
## 3          wt -3.92792051 0.75157304 -5.226266 19.78033 4.238231e-05&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are no non-significant variables left in the model. Thus, this is our final model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gtsummary::tbl_regression(fit_full3)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;ybehlmrayy&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;hp&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.03&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.05, -0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.005&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;wt&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-3.9&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-5.5, -2.4&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;4&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/sec-stepwise.html&#34; class=&#34;uri&#34;&gt;https://stefvanbuuren.name/fimd/sec-stepwise.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Variable selection using genetic algorithm</title>
      <link>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</link>
      <pubDate>Sun, 02 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;The genetic algorithm is inspired by natural selection, the process by which the fittest individuals are selected to reproduce. This algorithm has been used in optimization and search problems, and it can also be used for variable selection.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/ga_fig.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Genetic algorithm - gene, chromosome, population, crossover (upper right), offspring (lower right)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;First, let’s go into a few terms related to genetic algorithm theory.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Population - a set of chromosomes&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Chromosome - a subset of variables (also known as an individual in some references)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Gene - a variable or feature&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Fitness function - gives a fitness score to each chromosome and guides the selection&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Selection - a process to select two chromosomes known as the parents&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Crossover - a process in which the parents generate offspring (illustrated in the picture above, on the upper right)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Mutation - the process by which a gene in the chromosome is randomly flipped to 1 or 0&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/mutation.png&#34; width=&#34;250&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Mutation&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So, the basic flow of the genetic algorithm is:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;The algorithm starts with an initial population, often randomly generated&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a successive generation by selecting a portion of the current population (the selection is guided by the fitness function) - this involves selection -&amp;gt; crossover -&amp;gt; mutation&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The algorithm terminates if certain predetermined criteria are met such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A solution satisfies the minimum criteria&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;A fixed number of generations is reached&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Successive iterations no longer produce a better result&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;There is the &lt;code&gt;GA&lt;/code&gt; package in R, with which we can implement the genetic algorithm a bit more manually by specifying our own fitness function. However, I think it is easier to use the genetic algorithm implemented in the &lt;code&gt;caret&lt;/code&gt; package for variable selection.&lt;/p&gt;
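&lt;p&gt;For completeness, here is a hedged sketch of the manual route with the &lt;code&gt;GA&lt;/code&gt; package; the negative-AIC fitness function below is an illustrative assumption, not what &lt;code&gt;caret&lt;/code&gt; does internally:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(GA)

# binary GA: each bit decides whether a mtcars predictor enters an lm for mpg;
# fitness is negative AIC because ga() maximizes its fitness function
fitness_fn &amp;lt;- function(bits) {
  if (sum(bits) == 0) return(-1e10) # an empty model gets the worst fitness
  vars &amp;lt;- names(mtcars)[-1][bits == 1]
  -AIC(lm(reformulate(vars, response = &amp;quot;mpg&amp;quot;), data = mtcars))
}

set.seed(123)
ga_fit &amp;lt;- ga(type = &amp;quot;binary&amp;quot;, fitness = fitness_fn,
             nBits = ncol(mtcars) - 1, maxiter = 50, popSize = 30)
names(mtcars)[-1][ga_fit@solution[1, ] == 1] # the selected variables&lt;/code&gt;&lt;/pre&gt;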
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)
library(tidyverse)
library(rsample)
library(recipes)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat &amp;lt;- 
  mtcars %&amp;gt;% 
  mutate(across(c(vs, am), as.factor),
         am = fct_recode(am, auto = &amp;quot;0&amp;quot;, man = &amp;quot;1&amp;quot;))
str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels &amp;quot;0&amp;quot;,&amp;quot;1&amp;quot;: 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels &amp;quot;auto&amp;quot;,&amp;quot;man&amp;quot;: 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this, we are going to use random forest (&lt;code&gt;rfGA&lt;/code&gt;). Other options are bagged trees (&lt;code&gt;treebagGA&lt;/code&gt;) and &lt;code&gt;caretGA&lt;/code&gt;. We are able to use any other method in &lt;code&gt;caret&lt;/code&gt; if we use &lt;code&gt;caretGA&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-am), 
              y = dat$am,
              iters = 5,
              gafsControl = ga_ctrl)
rf_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     qsec (60%), wt (60%), disp (40%), gear (40%), vs (40%)
##   * on average, 3.2 variables were selected (min = 1, max = 7)
## 
## In the final search using the entire training set:
##    * 7 features selected at iteration 3 including:
##      cyl, hp, drat, qsec, vs ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9429      0.8831&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal features/variables:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;  &amp;quot;hp&amp;quot;   &amp;quot;drat&amp;quot; &amp;quot;qsec&amp;quot; &amp;quot;vs&amp;quot;   &amp;quot;gear&amp;quot; &amp;quot;carb&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the time taken for the random forest approach.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$times&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $everything
##    user  system elapsed 
##   51.22    1.25   52.92&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default the algorithm searches for a set of variables that minimizes RMSE for a numerical outcome and maximizes accuracy for a categorical outcome. Also, genetic algorithms tend to overfit, which is why the implementation in &lt;code&gt;caret&lt;/code&gt; reports both internal and external performance. With the 5-fold cross-validation used above, five genetic algorithm runs are performed separately: in each run, four folds are used for the genetic algorithm and the held-out fold for external performance evaluation.&lt;/p&gt;
&lt;p&gt;Let’s try variable selection using a linear regression model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
lm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run lm
set.seed(123)
lm_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-mpg), 
              y = dat$mpg,
              iters = 5,
              gafsControl = lm_ga_ctrl,
              # below is the option for `train`
              method = &amp;quot;lm&amp;quot;,
              trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = F))
lm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared, MAE
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (100%), hp (80%), carb (60%), cyl (60%), am (40%)
##   * on average, 4.4 variables were selected (min = 4, max = 5)
## 
## In the final search using the entire training set:
##    * 5 features selected at iteration 5 including:
##      cyl, disp, hp, wt, qsec  
##    * external performance at this iteration is
## 
##        RMSE    Rsquared         MAE 
##      3.3434      0.7624      2.6037&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, let’s see how to integrate this into a machine learning workflow using a recipe (from the &lt;code&gt;recipes&lt;/code&gt; package) together with data splitting from &lt;code&gt;rsample&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;First, we split the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat_split &amp;lt;- initial_split(dat)
dat_train &amp;lt;- training(dat_split)
dat_test &amp;lt;- testing(dat_split)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We specify two recipes, one for a numerical outcome and one for a categorical outcome.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Numerical
rec_num &amp;lt;- 
  recipe(mpg ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())

# Categorical
rec_cat &amp;lt;- 
  recipe(am ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We run random forest with the numerical-outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
rf_ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga2 &amp;lt;- 
  gafs(rec_num,
       data = dat_train,
       iters = 5, 
       gafsControl = rf_ga_ctrl) 
rf_ga2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     cyl (80%), disp (80%), hp (80%), wt (80%), carb (60%)
##   * on average, 4.8 variables were selected (min = 2, max = 9)
## 
## In the final search using the entire training set:
##    * 6 features selected at iteration 5 including:
##      cyl, disp, hp, wt, gear ... 
##    * external performance at this iteration is
## 
##       RMSE   Rsquared        MAE 
##      2.830      0.928      2.408&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga2$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;wt&amp;quot;    &amp;quot;gear&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s try running SVM with the categorical-outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
svm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run SVM
set.seed(123)
svm_ga &amp;lt;- 
  gafs(rec_cat,
       data = dat_train,
       iters = 5, 
       gafsControl = svm_ga_ctrl,
       # below is the options to `train` for caretGA
       method = &amp;quot;svmRadial&amp;quot;, #SVM with Radial Basis Function Kernel
       trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = T))
svm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (80%), qsec (60%), vs_X1 (60%), carb (40%), disp (40%)
##   * on average, 4 variables were selected (min = 3, max = 6)
## 
## In the final search using the entire training set:
##    * 9 features selected at iteration 2 including:
##      mpg, cyl, disp, hp, drat ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9200      0.8571&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svm_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;mpg&amp;quot;   &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;drat&amp;quot;  &amp;quot;wt&amp;quot;    &amp;quot;qsec&amp;quot;  &amp;quot;carb&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Although the genetic algorithm seems quite good for variable selection, its main limitation, I would say, is the computational time. However, if we have a lot of variables or features to reduce, using the genetic algorithm despite the long computational time seems worthwhile to me.&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&#34; class=&#34;uri&#34;&gt;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
