tidymodels | Tengku Hanis

Using UMAP preprocessing for image classification

Wed, 16 Mar 2022 00:00:00 +0000

UMAP

Uniform manifold approximation and projection (UMAP) is a dimension reduction technique: it projects a set of features into a lower-dimensional space. UMAP can be used as a supervised technique, in which we supply a label or an outcome, or as an unsupervised one. Those interested in how UMAP works in detail can refer to this reference. For those who prefer a simpler or shorter version, I recommend a YouTube video by Joshua Starmer.
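
To make the supervised versus unsupervised distinction concrete, here is a minimal sketch that calls the uwot package directly (an assumption for illustration; the post itself goes through embed::step_umap()). The iris data and the object names are placeholders, not part of the original analysis.

```r
library(uwot)

set.seed(123)
x <- as.matrix(iris[, 1:4])  # four numeric features

# Unsupervised: only the features are used
emb_unsup <- umap(x, n_neighbors = 15, n_components = 2)

# Supervised: the outcome is passed through y and guides the projection
emb_sup <- umap(x, n_neighbors = 15, n_components = 2, y = iris$Species)

# Both results have one row per observation and two columns
dim(emb_unsup)
dim(emb_sup)
```

In both cases the four original features are replaced by two new coordinates per observation.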

Example in R

We are going to see how to apply UMAP for image preprocessing and then classify the images using kNN and naive Bayes.

These are the packages that we need.

library(keras) #for data and reshape to tabular format
library(tidymodels)
library(embed) #for umap
library(discrim) #for naive bayes model

We are going to use the famous MNIST dataset, which contains images of handwritten digits from 0 to 9. This dataset is available in the keras package.

mnist_data <- dataset_mnist()

## Loaded Tensorflow version 2.2.0

image_data <- mnist_data$train$x
image_labels <- mnist_data$train$y
image_data %>% dim()

## [1] 60000    28    28

For example, this is the second image in the data.

image_data[2, 1:28, 1:28] %>% 
  t() %>% 
  image(col = gray.colors(256))

Next, we are going to change the images into a tabular data frame format. We are going to limit the data to the first 10,000 images out of the total 60,000 images.

# Reformat to tabular format
image_data <- array_reshape(image_data, dim = c(60000, 28*28))
image_data %>% dim()

## [1] 60000   784

image_data <- image_data[1:10000,]
image_labels <- image_labels[1:10000]

# Reformat to data frame
full_data <- 
  data.frame(image_data) %>% 
  bind_cols(label = image_labels) %>% 
  mutate(label = as.factor(label))
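
To see what the reshaping step does, here is a small base R sketch on a toy array. array_reshape() fills in row-major (C-style) order by default, so each 28 x 28 image becomes one row of 784 pixels read row by row. The 2 x 2 "images" below are made up for illustration.

```r
# Two toy "images" of 2 x 2 pixels, stored as [image, row, column]
arr <- array(0, dim = c(2, 2, 2))
arr[1, , ] <- matrix(1:4, nrow = 2, byrow = TRUE)
arr[2, , ] <- matrix(5:8, nrow = 2, byrow = TRUE)

# Flatten every image row by row: transpose, then read column-wise
flat <- t(apply(arr, 1, function(img) as.vector(t(img))))

dim(flat)   # one row per image, one column per pixel
flat[1, ]   # pixels of the first image in row-major order
```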

Then, we are going to split the data and, for the sake of simplicity, create only a 3-fold cross-validation set.

# Split data
set.seed(123)
ind <- initial_split(full_data)
data_train <- training(ind)  
data_test <- testing(ind)

# 3-fold CV
set.seed(123)
data_cv <- vfold_cv(data_train, v = 3)

For the recipe specification, we are going to center and scale all the predictors after creating the new variables with step_umap(). Notice that in step_umap() we supply the outcome and tune the number of components (num_comp).

rec <- 
  recipe(label ~ ., data = data_train) %>% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = tune()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())
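
Before tuning, it can help to look at what step_umap() produces. The sketch below fixes num_comp at 2 and uses iris as a small stand-in dataset (an assumption for speed; it is not the MNIST data). The object names are placeholders.

```r
library(tidymodels)
library(embed)

set.seed(123)
umap_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_umap(all_predictors(), outcome = vars(Species), num_comp = 2)

# prep() estimates the UMAP embedding, bake(new_data = NULL) returns the
# processed training data
umap_features <-
  umap_rec %>%
  prep() %>%
  bake(new_data = NULL)

head(umap_features)
```

The four original predictors are replaced by two UMAP columns, and the outcome column is kept.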

We create a base workflow.

wf <- 
  workflow() %>% 
  add_recipe(rec)

We are going to use two models as classifiers:

kNN
Naive Bayes

For each classifier, we are going to create a regular grid of the parameters to be tuned and then run a regular grid search.

For kNN.

# knn model
knn_mod <- 
  nearest_neighbor(neighbors = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")

# knn grid
knn_grid <- grid_regular(neighbors(), num_comp(range = c(2, 8)), levels = 3)

# Tune grid search
knn_tune <- 
  tune_grid(
  wf %>% add_model(knn_mod),
  resamples = data_cv,
  grid = knn_grid, 
  control = control_grid(verbose = FALSE)
)
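
As a quick check of how much work this grid implies, the sketch below rebuilds the same grid with dials and counts the candidates. With 3 levels for each of the 2 parameters we get 9 candidates, and with 3 folds that means 27 model fits. The name knn_grid_sketch is a placeholder.

```r
library(dials)

knn_grid_sketch <- grid_regular(neighbors(), num_comp(range = c(2, 8)), levels = 3)

# Number of candidate combinations
nrow(knn_grid_sketch)

# Number of model fits with 3-fold cross-validation
nrow(knn_grid_sketch) * 3
```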

For naive Bayes.

# nb model
nb_mod <- 
  naive_Bayes(smoothness = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("naivebayes")

# nb grid
nb_grid <- grid_regular(smoothness(), num_comp(range = c(2, 10)), levels = 3)

# Tune grid search
nb_tune <- 
  tune_grid(
    wf %>% add_model(nb_mod),
    resamples = data_cv,
    grid = nb_grid, 
    control = control_grid(verbose = FALSE)
  )

Let’s see the tuning performance of our models.

# knn model
knn_tune %>% 
  show_best(metric = "roc_auc")

## # A tibble: 5 x 8
##   neighbors num_comp .metric .estimator  mean     n  std_err .config            
##       <int>    <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>              
## 1        10        8 roc_auc hand_till  0.961     3 0.000268 Preprocessor3_Mode~
## 2        10        5 roc_auc hand_till  0.961     3 0.000421 Preprocessor2_Mode~
## 3         5        8 roc_auc hand_till  0.959     3 0.000757 Preprocessor3_Mode~
## 4        10        2 roc_auc hand_till  0.959     3 0.000737 Preprocessor1_Mode~
## 5         5        5 roc_auc hand_till  0.958     3 0.000740 Preprocessor2_Mode~

knn_tune %>% 
  show_best(metric = "accuracy")

## # A tibble: 5 x 8
##   neighbors num_comp .metric  .estimator  mean     n std_err .config            
##       <int>    <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>              
## 1        10        8 accuracy multiclass 0.914     3 0.00104 Preprocessor3_Mode~
## 2         5        8 accuracy multiclass 0.913     3 0.00315 Preprocessor3_Mode~
## 3        10        5 accuracy multiclass 0.912     3 0.00114 Preprocessor2_Mode~
## 4         5        5 accuracy multiclass 0.91      3 0.00139 Preprocessor2_Mode~
## 5        10        2 accuracy multiclass 0.910     3 0.00175 Preprocessor1_Mode~

# nb model
nb_tune %>% 
  show_best(metric = "roc_auc")

## # A tibble: 5 x 8
##   smoothness num_comp .metric .estimator  mean     n  std_err .config           
##        <dbl>    <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>             
## 1        1.5       10 roc_auc hand_till  0.971     3 0.000400 Preprocessor3_Mod~
## 2        1.5        6 roc_auc hand_till  0.971     3 0.000997 Preprocessor2_Mod~
## 3        1         10 roc_auc hand_till  0.971     3 0.000634 Preprocessor3_Mod~
## 4        1          6 roc_auc hand_till  0.970     3 0.00124  Preprocessor2_Mod~
## 5        0.5       10 roc_auc hand_till  0.969     3 0.000808 Preprocessor3_Mod~

nb_tune %>% 
  show_best(metric = "accuracy")

## # A tibble: 5 x 8
##   smoothness num_comp .metric  .estimator  mean     n  std_err .config          
##        <dbl>    <int> <chr>    <chr>      <dbl> <int>    <dbl> <chr>            
## 1        1         10 accuracy multiclass 0.913     3 0.000481 Preprocessor3_Mo~
## 2        1.5       10 accuracy multiclass 0.913     3 0.000267 Preprocessor3_Mo~
## 3        0.5       10 accuracy multiclass 0.912     3 0.000462 Preprocessor3_Mo~
## 4        1.5        6 accuracy multiclass 0.911     3 0.00135  Preprocessor2_Mo~
## 5        1          6 accuracy multiclass 0.910     3 0.00157  Preprocessor2_Mo~

Next, we are going to select the best model from the tuned parameters and finalise our model using last_fit().

For the kNN model.

# Finalize
knn_best <- knn_tune %>% select_best(metric = "roc_auc")
knn_rec <- 
  recipe(label ~ ., data = data_train) %>% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = knn_best$num_comp) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())

knn_wf <- 
  workflow() %>% 
  add_recipe(knn_rec) %>% 
  add_model(knn_mod) %>% 
  finalize_workflow(knn_best) 

# Last fit
knn_lastfit <- 
  knn_wf %>% 
  last_fit(ind)

For the naive Bayes model.

# Finalize
nb_best <- nb_tune %>% select_best(metric = "roc_auc")
nb_rec <- 
  recipe(label ~ ., data = data_train) %>% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = nb_best$num_comp) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())

nb_wf <- 
  workflow() %>% 
  add_recipe(nb_rec) %>% 
  add_model(nb_mod) %>% 
  finalize_workflow(nb_best) 

# Last fit
nb_lastfit <- 
  nb_wf %>% 
  last_fit(ind)
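
As a side note, the recipe does not have to be rebuilt by hand: finalize_workflow() fills in every tune() placeholder it finds, in the recipe as well as in the model. Here is a minimal sketch, assuming a PCA step on iris as a fast stand-in for step_umap() and made-up "best" values; all object names are placeholders.

```r
library(tidymodels)

rec_sketch <-
  recipe(Species ~ ., data = iris) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())

mod_sketch <-
  nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")

wf_sketch <-
  workflow() %>%
  add_recipe(rec_sketch) %>%
  add_model(mod_sketch)

# Stand-in for the result of select_best(): one row, one column per parameter
best_sketch <- tibble(num_comp = 2L, neighbors = 5L)

# Both the recipe parameter and the model parameter are filled in
final_sketch <- finalize_workflow(wf_sketch, best_sketch)
final_sketch
```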

Let’s see the model performance on the testing data.

knn_lastfit %>% 
  collect_metrics() %>% 
  mutate(model = "knn") %>% 
  dplyr::bind_rows(nb_lastfit %>% 
                     collect_metrics() %>% 
                     mutate(model = "nb")) %>% 
  select(-.config)

## # A tibble: 4 x 4
##   .metric  .estimator .estimate model
##   <chr>    <chr>          <dbl> <chr>
## 1 accuracy multiclass     0.938 knn  
## 2 roc_auc  hand_till      0.971 knn  
## 3 accuracy multiclass     0.936 nb   
## 4 roc_auc  hand_till      0.980 nb

These are the confusion matrices.

knn_lastfit %>% 
  collect_predictions() %>%
  conf_mat(label, .pred_class) %>% 
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix - kNN")

nb_lastfit %>% 
  collect_predictions() %>%
  conf_mat(label, .pred_class) %>% 
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix - naive bayes")

Lastly, we can compare the ROC plots for each class.

knn_lastfit %>% 
  collect_predictions() %>%
  mutate(id = "knn") %>% 
  bind_rows(
    nb_lastfit %>% 
      collect_predictions() %>% 
      mutate(id = "nb")
            ) %>% 
  group_by(id) %>% 
  roc_curve(label, .pred_0:.pred_9) %>% 
  autoplot()

Conclusion

I believe UMAP works quite well and can be used as a preprocessing step in image classification. We were able to get a pretty good performance in this post. I believe that with a more rigorous parameter tuning approach, the performance could be better still.

Explore data using PCA

Wed, 09 Feb 2022 00:00:00 +0000

Principal component analysis (PCA)

PCA is a dimension reduction technique. If we have a large number of predictors, instead of using all of them for modelling or other analysis, we can compress the information from the variables into a new set of variables. These new variables are known as components or principal components (PCs). We then have a smaller number of variables that still contain the information from the original variables.

PCA is usually used for datasets with a large number of features or predictors, such as genomic data. Additionally, PCA is a good pre-processing option if you have correlated variables or a multicollinearity issue in the model. We can also use PCA to explore the data and get a better understanding of it.

Those who want to study the theoretical side of PCA can read further at this link. In this post, we are going to focus more on the coding part in the machine learning framework (using the tidymodels package).
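
Before moving to tidymodels, here is a minimal base R sketch of the compression idea using prcomp() on the built-in USArrests data (a stand-in dataset, not the one used below). The component scores are just the scaled data multiplied by the loadings, and keeping the first few columns gives the smaller set of variables.

```r
x <- scale(USArrests)      # center and scale the four variables
pca <- prcomp(x)

loadings <- pca$rotation   # one column of weights per component
scores <- pca$x            # the new variables (PC1, PC2, ...)

# The scores are the scaled data projected onto the loadings
manual_scores <- x %*% loadings
all.equal(unname(manual_scores), unname(scores))

# Keep only the first two components as a compressed version of the data
compressed <- scores[, 1:2]
head(compressed)
```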

Example in R

These are the packages that we are going to use.

library(tidymodels)
library(tidyverse)
library(mlbench) #data

We are going to use the diabetes dataset. The outcome is binary: positive = diabetes and negative = non-diabetes/healthy. All other variables are numerical.

data("PimaIndiansDiabetes")
glimpse(PimaIndiansDiabetes)

## Rows: 768
## Columns: 9
## $ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
## $ glucose  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
## $ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
## $ triceps  <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
## $ insulin  <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
## $ mass     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
## $ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
## $ age      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
## $ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~

We are going to split the data and extract the training dataset. We explore only the training set since we are working in a machine learning framework.

set.seed(1)

ind <- initial_split(PimaIndiansDiabetes)
dat_train <- training(ind)

We create a recipe and apply normalization and PCA techniques. Then, we prep it.

# Recipe
pca_rec <- 
  recipe(diabetes ~ ., data = dat_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())

# Prep
pca_prep <- prep(pca_rec)

We can extract the PCA results using tidy(). The 2 refers to the second step of the recipe (step_pca()), and type = "coef" indicates that we want the loadings. So, the values in the data are the loadings.

pca_tidied <- tidy(pca_prep, 2, type = "coef")
pca_tidied

## # A tibble: 64 x 4
##    terms     value component id       
##    <chr>     <dbl> <chr>     <chr>    
##  1 pregnant  0.107 PC1       pca_JtuLZ
##  2 glucose   0.357 PC1       pca_JtuLZ
##  3 pressure  0.330 PC1       pca_JtuLZ
##  4 triceps   0.460 PC1       pca_JtuLZ
##  5 insulin   0.466 PC1       pca_JtuLZ
##  6 mass      0.447 PC1       pca_JtuLZ
##  7 pedigree  0.315 PC1       pca_JtuLZ
##  8 age       0.158 PC1       pca_JtuLZ
##  9 pregnant -0.597 PC2       pca_JtuLZ
## 10 glucose  -0.192 PC2       pca_JtuLZ
## # ... with 54 more rows

So, basically the loadings indicate how much each variable contributes to each component (PC). A large loading (positive or negative) indicates a strong relationship between the variables and the related components. The sign indicates a negative or positive correlation between the variables and components.

We can further visualise these loadings.

pca_tidied %>% 
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ component) +
  ylab("") +
  xlab("Loadings") + 
  theme_minimal()

Besides the loadings, we can also get the variance information. The variance of each component (or PC) measures how much of the variability in the data that particular component explains. For example, PC1 explains 26.2% of the variance in the data.

pca_tidied2 <- tidy(pca_prep, 2, type = "variance")

pca_tidied2 %>% 
  pivot_wider(names_from = component, values_from = value, names_prefix = "PC") %>% 
  select(-id) %>% 
  mutate_if(is.numeric, round, digits = 1) %>% 
  kableExtra::kable("simple")

terms                           PC1    PC2    PC3    PC4    PC5    PC6    PC7     PC8
----------------------------  -----  -----  -----  -----  -----  -----  -----  ------
variance                        2.1    1.7    1.0    0.8    0.8    0.7    0.5     0.4
cumulative variance             2.1    3.8    4.9    5.7    6.5    7.2    7.6     8.0
percent variance               26.2   21.5   12.9   10.6    9.9    8.5    5.7     4.7
cumulative percent variance    26.2   47.7   60.7   71.2   81.1   89.6   95.3   100.0
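
The rows of this table are related in a simple way, which the following base R sketch reproduces with prcomp() on the built-in USArrests data (a stand-in dataset). After normalization the total variance equals the number of variables, so the percent variance is each component's variance divided by that total.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

variance <- pca$sdev^2                             # variance of each component
percent_variance <- 100 * variance / sum(variance)
cumulative_percent <- cumsum(percent_variance)

# Total variance equals the number of (normalized) variables
sum(variance)

round(rbind(variance, percent_variance, cumulative_percent), 1)
```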

Next, we can visualise PC1 and PC2 in a scatter plot and see how each variable influences both PCs. First, we need to extract the loadings and convert them into a wide format to get the arrow coordinates for the scatter plot.

pca_tidied3 <- 
  pca_tidied %>% 
  filter(component %in% c("PC1", "PC2")) %>% 
  select(-id) %>% 
  pivot_wider(names_from = component, values_from = value)
pca_tidied3

## # A tibble: 8 x 3
##   terms      PC1    PC2
##   <chr>    <dbl>  <dbl>
## 1 pregnant 0.107 -0.597
## 2 glucose  0.357 -0.192
## 3 pressure 0.330 -0.234
## 4 triceps  0.460  0.279
## 5 insulin  0.466  0.200
## 6 mass     0.447  0.121
## 7 pedigree 0.315  0.110
## 8 age      0.158 -0.638

Now, we can make a scatter plot using the training set data (juice(pca_prep)) and the loadings data (pca_tidied3). Also, we are going to add the percentage of variance for PC1 and PC2 to the axis labels.

juice(pca_prep) %>% 
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = diabetes, shape = diabetes), size = 2, alpha = 0.6) +
  geom_segment(data = pca_tidied3, 
               aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), 
               arrow = arrow(length = unit(1/2, "picas")),
               color = "blue") +
  annotate("text", 
           x = pca_tidied3$PC1 * 5.2, 
           y = pca_tidied3$PC2 * 5.2, 
           label = pca_tidied3$terms) +
  theme_minimal() +
  xlab("PC1 (26.2%)") +
  ylab("PC2 (21.5%)")

So, from this scatter plot we learn that:

(triceps, insulin, pedigree and mass), (glucose and pressure) and (pregnant and age) are correlated as their lines are close to each other
As PC1 and PC2 increase, triceps, insulin, pedigree and mass also increase
As PC2 decreases, pregnant and age increase


Hyperparameter tuning in tidymodels

Sun, 05 Sep 2021 00:00:00 +0000

This post will not go into much detail on each approach to hyperparameter tuning. It mainly aims to summarize a few things that I have studied over the last couple of days. Generally, there are two approaches to hyperparameter tuning in tidymodels.

Grid search:
– Regular grid search
– Random grid search
Iterative search:
– Bayesian optimization
– Simulated annealing

Grid search

So, in grid search, we provide the combination of parameters and the algorithm will go through each combination of parameters. There are two types of grid search:

Regular grid search
– The algorithm will go through every combination of parameters.

grid_regular(mtry(c(1, 13)), 
             trees(), 
             min_n(),
             levels = 3) # how many from each parameter

## # A tibble: 27 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     1     1     2
##  2     7     1     2
##  3    13     1     2
##  4     1  1000     2
##  5     7  1000     2
##  6    13  1000     2
##  7     1  2000     2
##  8     7  2000     2
##  9    13  2000     2
## 10     1     1    21
## # ... with 17 more rows

Random grid search
– The algorithm will randomly select a number of parameter combinations instead of going through each of them.

grid_random(mtry(c(1, 13)),
            trees(), 
            min_n(), 
            size = 100) # size of parameters combination

## # A tibble: 100 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     5  1216    40
##  2     8  1374    13
##  3     9   859    39
##  4     6   282    12
##  5     2  1210     9
##  6     8  1828    39
##  7    11   550    14
##  8    13  1157    32
##  9     5   282     6
## 10    10  1018    28
## # ... with 90 more rows

By default (when the grid is given as a number rather than as a data frame), tidymodels uses a space-filling design so that the combinations of parameters are roughly “equidistant” from each other.
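
A space-filling grid can also be built explicitly. The sketch below uses grid_latin_hypercube() from dials with the same three parameters as above; note that newer dials versions may steer users towards grid_space_filling() instead, so treat the exact function choice as an assumption. The name lhs_grid is a placeholder.

```r
library(dials)

set.seed(2021)
lhs_grid <- grid_latin_hypercube(mtry(c(1, 13)),
                                 trees(),
                                 min_n(),
                                 size = 10) # at most 10 combinations
lhs_grid
```

Compared with a random grid of the same size, the points are spread more evenly over the parameter ranges.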

Iterative search

In iterative search, we need to specify some initial parameters/values to start the search.

Bayesian optimization
– This algorithm/function searches for the next best combination of parameters based on the results from the previous combinations of parameters (the prior).
Simulated annealing
– Generally, this algorithm works in a relatively similar way to Bayesian optimization.
– However, as the figure below illustrates, this algorithm is able to explore worse combinations of parameters for a short period (to get past the barrier of a local search) in order to find the best combination of parameters (the global minimum).

Further details on both iterative search methods above can be found here. As both iterative methods need starting parameter values, we can actually combine them with any of the grid search methods.
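
The idea of temporarily accepting a worse result can be sketched in a few lines of base R. This is a generic textbook version of simulated annealing on a made-up one-dimensional function, not the algorithm used inside finetune; the function f, the cooling schedule and all settings are assumptions for illustration.

```r
set.seed(2021)

# Toy objective with several local minima (hypothetical)
f <- function(x) x^2 + 10 * sin(3 * x)

anneal <- function(f, start, n_iter = 2000, temp0 = 5, step = 0.5) {
  current <- start
  best <- start
  for (i in seq_len(n_iter)) {
    temp <- temp0 * (1 - i / (n_iter + 1))      # temperature cools towards zero
    candidate <- current + rnorm(1, sd = step)  # propose a nearby value
    delta <- f(candidate) - f(current)
    # Always accept improvements; accept worse values with a probability
    # that shrinks as the temperature drops
    if (delta < 0 || runif(1) < exp(-delta / temp)) {
      current <- candidate
    }
    if (f(current) < f(best)) best <- current
  }
  best
}

best_x <- anneal(f, start = 4)
c(x = best_x, value = f(best_x))
```

Early on, the high temperature lets the search climb out of a local minimum; later, it behaves almost like a purely greedy local search.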

Other methods

By default, if we do not supply any combination of parameters, tidymodels will pick 10 combinations of parameters from the default range of values of the model. Additionally, we can set this number to another value as shown below:

tune_grid(
  rf_wf, # workflow to tune (rf_wf is defined later in this post)
  resamples = dat_cv, # cross validation data set
  grid = 20,  # 20 combinations of parameters
  control = control, # some control parameters
  metrics = metrics # some metrics parameters (roc_auc, etc)
  )

There are two other special cases of grid search: tune_race_anova() and tune_race_win_loss(). Both methods are supposed to be a more efficient way of doing grid search. In general, both methods evaluate the tuning parameters on a small initial set of resamples. The combinations of parameters with the worst performance are eliminated, which makes the grid search more efficient. The main difference between the two methods is how the worst combinations of parameters are evaluated and eliminated.
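
The elimination idea can be illustrated with a toy simulation in base R. This is only a sketch of the racing principle using paired t-tests on simulated scores; it is not the ANOVA model that tune_race_anova() actually fits, and the candidate names, "true" AUC values and the 0.05 cut-off are all made up.

```r
set.seed(2021)

n_folds <- 10
true_auc <- c(a = 0.70, b = 0.69, c = 0.60, d = 0.55)  # hypothetical candidates

# Simulated AUC per fold (rows) and candidate (columns)
scores <- sapply(true_auc, function(m) rnorm(n_folds, mean = m, sd = 0.02))

alive <- names(true_auc)
for (k in 3:n_folds) {                       # start after a burn-in of 3 folds
  seen <- scores[1:k, alive, drop = FALSE]
  best <- names(which.max(colMeans(seen)))
  for (cand in setdiff(alive, best)) {
    # Is the current best significantly better than this candidate?
    p <- t.test(seen[, best], seen[, cand],
                paired = TRUE, alternative = "greater")$p.value
    if (p < 0.05) alive <- setdiff(alive, cand)  # drop it from later folds
  }
}

alive  # candidates that survived all folds
```

Candidates that are dropped early are not evaluated on the remaining folds, which is where the saving in computation comes from.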

R code

Load the packages.

# Packages
library(tidyverse)
library(tidymodels)
library(finetune)

We will only use a small chunk of the data for ease of computation.

# Data
data(income, package = "kernlab")

# Make data smaller for computation
set.seed(2021)
income2 <- 
  income %>% 
  filter(INCOME == "[75.000-" | INCOME == "[50.000-75.000)") %>% 
  slice_sample(n = 600) %>% 
  mutate(INCOME = fct_drop(INCOME), 
         INCOME = fct_recode(INCOME, 
                             rich = "[75.000-",
                             less_rich = "[50.000-75.000)"), 
         INCOME = factor(INCOME, ordered = F)) %>% 
  mutate(across(-INCOME, fct_drop))

# Summary of data
glimpse(income2)

## Rows: 600
## Columns: 14
## $ INCOME         <fct> less_rich, rich, rich, rich, less_rich, rich, rich, les~
## $ SEX            <fct> F, M, F, M, F, F, F, M, F, M, M, M, F, F, F, F, M, M, M~
## $ MARITAL.STATUS <fct> Married, Married, Married, Single, Single, NA, Married,~
## $ AGE            <ord> 35-44, 25-34, 45-54, 18-24, 18-24, 14-17, 25-34, 25-34,~
## $ EDUCATION      <ord> 1 to 3 years of college, Grad Study, College graduate, ~
## $ OCCUPATION     <fct> "Professional/Managerial", "Professional/Managerial", "~
## $ AREA           <ord> 10+ years, 7-10 years, 10+ years, -1 year, 4-6 years, 7~
## $ DUAL.INCOMES   <fct> Yes, Yes, Yes, Not Married, Not Married, Not Married, N~
## $ HOUSEHOLD.SIZE <ord> Five, Two, Four, Two, Four, Two, Three, Two, Five, One,~
## $ UNDER18        <ord> Three, None, None, None, None, None, One, None, Three, ~
## $ HOUSEHOLDER    <fct> Own, Own, Own, Rent, Family, Own, Own, Rent, Own, Own, ~
## $ HOME.TYPE      <fct> House, House, House, House, House, Apartment, House, Ho~
## $ ETHNIC.CLASS   <fct> White, White, White, White, White, White, White, White,~
## $ LANGUAGE       <fct> English, English, English, English, English, NA, Englis~

# Outcome variable
table(income2$INCOME)

## 
## less_rich      rich 
##       362       238

# Missing data
DataExplorer::plot_missing(income)

Split the data and create a 10-fold cross validation.

set.seed(2021)
dat_index <- initial_split(income2, strata = INCOME)
dat_train <- training(dat_index)
dat_test <- testing(dat_index)

## CV
set.seed(2021)
dat_cv <- vfold_cv(dat_train, v = 10, repeats = 1, strata = INCOME)

We are going to impute the NAs with the mode value since all the variables are categorical.

# Recipe
dat_rec <- 
  recipe(INCOME ~ ., data = dat_train) %>% 
  step_impute_mode(all_predictors()) %>% 
  step_ordinalscore(AGE, EDUCATION, AREA, HOUSEHOLD.SIZE, UNDER18)

# Model
rf_mod <- 
  rand_forest(mtry = tune(),
              trees = tune(),
              min_n = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

# Workflow
rf_wf <- 
  workflow() %>% 
  add_recipe(dat_rec) %>% 
  add_model(rf_mod)
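
To show what the two recipe steps do conceptually, here is a base R sketch on made-up vectors. It illustrates the idea only and is not how recipes implements the steps internally; the values and object names are hypothetical.

```r
# Toy data: a nominal factor with one NA and an ordered factor
home <- factor(c("House", "House", NA, "Apartment", "House"))
age <- factor(c("18-24", "35-44", "25-34", "18-24", "35-44"),
              levels = c("18-24", "25-34", "35-44"), ordered = TRUE)

# Mode imputation: replace NA with the most frequent level
mode_level <- names(which.max(table(home)))
home_imputed <- home
home_imputed[is.na(home_imputed)] <- mode_level

# Ordinal scoring: ordered levels become the integer scores 1, 2, 3
age_score <- as.integer(age)

home_imputed
age_score
```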

Parameters for grid search

# Regular grid
reg_grid <- grid_regular(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         levels = 3)

# Random grid
rand_grid <- grid_random(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         size = 100)

Tune the models using regular grid search. We are going to use the doParallel library for parallel processing.

ctrl <- control_grid(save_pred = TRUE,
                     extract = extract_model)
measure <- metric_set(roc_auc)  

# Parallel for regular grid
library(doParallel)

# Create a cluster object and then register: 
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_regular <- 
  rf_wf %>% 
  tune_grid(
    resamples = dat_cv, 
    grid = reg_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)

Result for regular grid search:

autoplot(tune_regular)

show_best(tune_regular)

## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     7  1000    21 roc_auc binary     0.690    10  0.0148 Preprocessor1_Model14
## 2     7  1000    40 roc_auc binary     0.689    10  0.0179 Preprocessor1_Model23
## 3     7  2000    40 roc_auc binary     0.689    10  0.0178 Preprocessor1_Model26
## 4     7  1000     2 roc_auc binary     0.688    10  0.0173 Preprocessor1_Model05
## 5     7  2000    21 roc_auc binary     0.688    10  0.0159 Preprocessor1_Model17

Tune models using random grid search.

# Parallel for random grid
# Create a cluster object and then register: 
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_random <- 
  rf_wf %>% 
  tune_grid(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)

Result for random grid search:

autoplot(tune_random)

show_best(tune_random)

## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_Model0~
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_Model0~
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_Model0~
## 4     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_Model0~
## 5     6   160    31 roc_auc binary     0.692    10  0.0172 Preprocessor1_Model0~

Random grid search has a slightly better result. Let’s use this random search result as a base for the iterative search. First, we limit the parameter ranges based on the plot from the random grid search.

rf_param <- 
  rf_wf %>% 
  parameters() %>% 
  update(mtry = mtry(c(5, 13)), 
         trees = trees(c(1, 500)), 
         min_n = min_n(c(5, 30)))

Now we do a Bayesian optimization.

# Parallel for bayesian optimization
# Create a cluster object and then register: 
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
bayes_tune <-  
  rf_wf %>% 
  tune_bayes(    
    resamples = dat_cv,
    param_info = rf_param,
    iter = 60,
    initial = tune_random, # result from random grid search        
    control = control_bayes(no_improve = 30, verbose = TRUE, save_pred = TRUE), 
    metrics = measure)

stopCluster(cl)

Result for Bayesian optimization.

autoplot(bayes_tune, "performance")

show_best(bayes_tune)

## # A tibble: 5 x 10
##    mtry trees min_n .metric .estimator  mean     n std_err .config         .iter
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           <int>
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_~     0
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_~     0
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_~     0
## 4     6   189    15 roc_auc binary     0.693    10  0.0153 Iter1               1
## 5     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_~     0

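Bayesian optimization does not actually improve on the random grid search here: the best candidate in the table above (iteration 0, AUC 0.694) comes from the initial random grid, and the best new candidate found by the search (Iter1) reaches 0.693. I will not do a simulated annealing approach since I get an error, though I am not sure why.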

Lastly, we do a race ANOVA.

# Parallel for race anova
# Create a cluster object and then register: 
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_efficient <- 
  rf_wf %>% 
  tune_race_anova(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = control_race(verbose_elim = TRUE, save_pred = TRUE), 
    metrics = measure)

stopCluster(cl)

We get a relatively similar result to random grid search but with faster computation.

autoplot(tune_efficient)

show_best(tune_efficient)

## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     5  1425     5 roc_auc binary     0.695    10  0.0161 Preprocessor1_Model0~
## 2    11   406     2 roc_auc binary     0.694    10  0.0183 Preprocessor1_Model0~
## 3     6   631     3 roc_auc binary     0.692    10  0.0171 Preprocessor1_Model0~
## 4     7  1264     4 roc_auc binary     0.692    10  0.0159 Preprocessor1_Model0~
## 5     9  1264     3 roc_auc binary     0.692    10  0.0188 Preprocessor1_Model0~

We can also compare the ROC curves of all approaches. All approaches look more or less similar.


# regular grid
rf_reg <- 
  tune_regular %>% 
  select_best(metric = "roc_auc")

reg_auc <- 
  tune_regular %>% 
  collect_predictions(parameters = rf_reg) %>% 
  roc_curve(INCOME, .pred_less_rich) %>% 
  mutate(model = "regular_grid")

# random grid
rf_rand <- 
  tune_random %>% 
  select_best(metric = "roc_auc")

rand_auc <- 
  tune_random %>% 
  collect_predictions(parameters = rf_rand) %>% 
  roc_curve(INCOME, .pred_less_rich) %>% 
  mutate(model = "random_grid")

# bayes
rf_bayes <- 
  bayes_tune %>% 
  select_best(metric = "roc_auc")

bayes_auc <- 
  bayes_tune %>% 
  collect_predictions(parameters = rf_bayes) %>% 
  roc_curve(INCOME, .pred_less_rich) %>% 
  mutate(model = "bayes")

# race_anova
rf_eff <- 
  tune_efficient %>% 
  select_best(metric = "roc_auc")

eff_auc <- 
  tune_efficient %>% 
  collect_predictions(parameters = rf_eff) %>%
  roc_curve(INCOME, .pred_less_rich) %>% 
  mutate(model = "race_anova")

# Compare ROC between all tuning approaches
bind_rows(reg_auc, rand_auc, bayes_auc, eff_auc) %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = "plasma", end = .6) +
  theme_bw()

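Finally, we fit our best model (selected from the Bayesian optimization results, which include the initial random grid) to the training data and evaluate it on the testing data.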

# Finalize workflow
best_rf <-
  select_best(bayes_tune, metric = "roc_auc")

final_wf <- 
  rf_wf %>% 
  finalize_workflow(best_rf)
final_wf

## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_impute_mode()
## * step_ordinalscore()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 4
##   trees = 1016
##   min_n = 4
## 
## Computational engine: ranger

# Last fit
test_fit <- 
  final_wf %>%
  last_fit(dat_index) 

# Evaluation metrics 
test_fit %>%
  collect_metrics()

## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.583 Preprocessor1_Model1
## 2 roc_auc  binary         0.611 Preprocessor1_Model1

test_fit %>%
  collect_predictions() %>% 
  roc_curve(INCOME, .pred_less_rich) %>% 
  autoplot()

Conclusion

The result is not that good. Our AUC is quite low. However, we only used about 8% of the overall data. Nonetheless, the aim of this post is to give an overview of hyperparameter tuning in tidymodels.

Additionally, there are two other functions for constructing parameter grids that I did not cover in this post: grid_max_entropy() and grid_latin_hypercube(). There are not many resources explaining these functions (or at least I did not find them); however, for those interested, a good start will be the tidymodels website.

References:
https://www.tmwr.org/grid-search.html
https://www.tmwr.org/iterative-search.html
https://oliviergimenez.github.io/learning-machine-learning/#
https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7