<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Machine Learning | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/category/machine-learning/</link>
      <atom:link href="https://tengkuhanis.netlify.app/category/machine-learning/index.xml" rel="self" type="application/rss+xml" />
    <description>Machine Learning</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Wed, 16 Mar 2022 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>Machine Learning</title>
      <link>https://tengkuhanis.netlify.app/category/machine-learning/</link>
    </image>
    
    <item>
      <title>Using UMAP preprocessing for image classification</title>
      <link>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</link>
      <pubDate>Wed, 16 Mar 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;umap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;UMAP&lt;/h2&gt;
&lt;p&gt;Uniform manifold approximation and projection, or UMAP for short, is a dimension reduction technique. Basically, UMAP projects a set of features into a lower-dimensional space. UMAP can be used as a supervised technique, in which we supply a label or an outcome, or as an unsupervised one. Those interested in the details of how UMAP works can refer to this &lt;a href=&#34;https://umap-learn.readthedocs.io/en/latest/how_umap_works.html&#34;&gt;reference&lt;/a&gt;. For those who prefer a simpler and shorter explanation, I recommend a &lt;a href=&#34;https://www.youtube.com/watch?v=eN0wFzBA4Sc&amp;amp;list=WL&amp;amp;index=2&#34;&gt;YouTube video by Joshua Starmer&lt;/a&gt;.&lt;/p&gt;
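&lt;p&gt;As a rough sketch of the difference between the supervised and unsupervised modes, the code below calls the &lt;code&gt;uwot&lt;/code&gt; package directly on the built-in &lt;code&gt;iris&lt;/code&gt; data. This is only an illustration and not part of the example that follows; the dataset and the object names are chosen here for convenience.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(uwot)

set.seed(123)
x &amp;lt;- iris[, 1:4]

# Unsupervised: only the features are used
emb_unsup &amp;lt;- umap(x, n_components = 2)

# Supervised: the class label guides the embedding
emb_sup &amp;lt;- umap(x, n_components = 2, y = iris$Species)

# Both embeddings have one row per observation and two columns
dim(emb_unsup)
dim(emb_sup)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In both cases the four original features are projected into two new ones; the supervised version additionally tries to keep observations of the same class close together.&lt;/p&gt;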
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;We are going to see how to apply UMAP as an image preprocessing step and then classify the images using kNN and naive Bayes.&lt;/p&gt;
&lt;p&gt;These are the packages that we need.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(keras) #for data and reshape to tabular format
library(tidymodels)
library(embed) #for umap
library(discrim) #for naive bayes model&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use the famous MNIST dataset, which contains images of handwritten digits from 0 to 9. This dataset is available in the &lt;code&gt;keras&lt;/code&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mnist_data &amp;lt;- dataset_mnist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loaded Tensorflow version 2.2.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- mnist_data$train$x
image_labels &amp;lt;- mnist_data$train$y
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000    28    28&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, this is the second image in the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data[2, 1:28, 1:28] %&amp;gt;% 
  t() %&amp;gt;% 
  image(col = gray.colors(256))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, we are going to change the images into a tabular data frame format. We are going to limit the data to the first 10,000 images out of the total 60,000 images.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Reformat to tabular format
image_data &amp;lt;- array_reshape(image_data, dim = c(60000, 28*28))
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000   784&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- image_data[1:10000,]
image_labels &amp;lt;- image_labels[1:10000]

# Reformat to data frame
full_data &amp;lt;- 
  data.frame(image_data) %&amp;gt;% 
  bind_cols(label = image_labels) %&amp;gt;% 
  mutate(label = as.factor(label))&lt;/code&gt;&lt;/pre&gt;
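&lt;p&gt;To see what the reshaping step does, here is a small base R sketch on a toy array. It assumes the default row-major (C-style) ordering of &lt;code&gt;array_reshape()&lt;/code&gt;, so each image is flattened row by row; the toy array and object names are made up for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A toy image stack: 2 images of 2 x 2 pixels
x &amp;lt;- array(1:8, dim = c(2, 2, 2))

# Flatten each image row by row, one image per row of the result
flat &amp;lt;- t(apply(x, 1, function(img) as.vector(t(img))))

dim(flat)
flat[1, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same idea turns each 28 x 28 MNIST image into a single row of 784 pixel values.&lt;/p&gt;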
&lt;p&gt;Then, we are going to split the data and create 3-fold cross-validation sets for the sake of simplicity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Split data
set.seed(123)
ind &amp;lt;- initial_split(full_data)
data_train &amp;lt;- training(ind)  
data_test &amp;lt;- testing(ind)

# 3-fold CV
set.seed(123)
data_cv &amp;lt;- vfold_cv(data_train, v = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the recipe specification, we are going to center and scale all the predictors after creating the new variables using &lt;code&gt;step_umap()&lt;/code&gt;. Notice that in &lt;code&gt;step_umap()&lt;/code&gt; we supply the outcome and we tune the number of components (&lt;code&gt;num_comp&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = tune()) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a base workflow.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(rec) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use two models as classifiers:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;kNN&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Naive bayes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each classifier, we are going to create a regular grid of the parameters to be tuned and then run a grid search.&lt;/p&gt;
&lt;p&gt;For kNN.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_mod &amp;lt;- 
  nearest_neighbor(neighbors = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;kknn&amp;quot;)

# knn grid
knn_grid &amp;lt;- grid_regular(neighbors(), num_comp(range = c(2, 8)), levels = 3)

# Tune grid search
knn_tune &amp;lt;- 
  tune_grid(
  wf %&amp;gt;% add_model(knn_mod),
  resamples = data_cv,
  grid = knn_grid, 
  control = control_grid(verbose = FALSE)
)&lt;/code&gt;&lt;/pre&gt;
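&lt;p&gt;To make the idea of a regular grid concrete, here is a small base R sketch. The candidate values below are assumed for illustration (three evenly spaced values per parameter); in the actual code they are generated by &lt;code&gt;grid_regular()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Candidate values for each parameter (assumed for illustration)
neighbors_vals &amp;lt;- c(1, 5, 10)
num_comp_vals  &amp;lt;- c(2, 5, 8)

# A regular grid is every combination of the candidate values
toy_grid &amp;lt;- expand.grid(neighbors = neighbors_vals, num_comp = num_comp_vals)

nrow(toy_grid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With two parameters and three levels each, the grid search fits 3 x 3 = 9 candidate combinations on every resample.&lt;/p&gt;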
&lt;p&gt;For naive Bayes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_mod &amp;lt;- 
  naive_Bayes(smoothness = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;naivebayes&amp;quot;)

# nb grid
nb_grid &amp;lt;- grid_regular(smoothness(), num_comp(range = c(2, 10)), levels = 3)

# Tune grid search
nb_tune &amp;lt;- 
  tune_grid(
    wf %&amp;gt;% add_model(nb_mod),
    resamples = data_cv,
    grid = nb_grid, 
    control = control_grid(verbose = FALSE)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the tuning performance of our models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_tune %&amp;gt;% 
  show_best(metric = &amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric .estimator  mean     n  std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 roc_auc hand_till  0.961     3 0.000268 Preprocessor3_Mode~
## 2        10        5 roc_auc hand_till  0.961     3 0.000421 Preprocessor2_Mode~
## 3         5        8 roc_auc hand_till  0.959     3 0.000757 Preprocessor3_Mode~
## 4        10        2 roc_auc hand_till  0.959     3 0.000737 Preprocessor1_Mode~
## 5         5        5 roc_auc hand_till  0.958     3 0.000740 Preprocessor2_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_tune %&amp;gt;% 
  show_best(metric = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric  .estimator  mean     n std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 accuracy multiclass 0.914     3 0.00104 Preprocessor3_Mode~
## 2         5        8 accuracy multiclass 0.913     3 0.00315 Preprocessor3_Mode~
## 3        10        5 accuracy multiclass 0.912     3 0.00114 Preprocessor2_Mode~
## 4         5        5 accuracy multiclass 0.91      3 0.00139 Preprocessor2_Mode~
## 5        10        2 accuracy multiclass 0.910     3 0.00175 Preprocessor1_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_tune %&amp;gt;% 
  show_best(metric = &amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric .estimator  mean     n  std_err .config           
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             
## 1        1.5       10 roc_auc hand_till  0.971     3 0.000400 Preprocessor3_Mod~
## 2        1.5        6 roc_auc hand_till  0.971     3 0.000997 Preprocessor2_Mod~
## 3        1         10 roc_auc hand_till  0.971     3 0.000634 Preprocessor3_Mod~
## 4        1          6 roc_auc hand_till  0.970     3 0.00124  Preprocessor2_Mod~
## 5        0.5       10 roc_auc hand_till  0.969     3 0.000808 Preprocessor3_Mod~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_tune %&amp;gt;% 
  show_best(metric = &amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric  .estimator  mean     n  std_err .config          
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;            
## 1        1         10 accuracy multiclass 0.913     3 0.000481 Preprocessor3_Mo~
## 2        1.5       10 accuracy multiclass 0.913     3 0.000267 Preprocessor3_Mo~
## 3        0.5       10 accuracy multiclass 0.912     3 0.000462 Preprocessor3_Mo~
## 4        1.5        6 accuracy multiclass 0.911     3 0.00135  Preprocessor2_Mo~
## 5        1          6 accuracy multiclass 0.910     3 0.00157  Preprocessor2_Mo~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we are going to select the best model from the tuned parameters and finalise our model using &lt;code&gt;last_fit()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For the kNN model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
knn_best &amp;lt;- knn_tune %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
knn_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = knn_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

knn_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(knn_rec) %&amp;gt;% 
  add_model(knn_mod) %&amp;gt;% 
  finalize_workflow(knn_best) 

# Last fit
knn_lastfit &amp;lt;- 
  knn_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the naive Bayes model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
nb_best &amp;lt;- nb_tune %&amp;gt;% select_best(metric = &amp;quot;roc_auc&amp;quot;)
nb_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = nb_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

nb_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(nb_rec) %&amp;gt;% 
  add_model(nb_mod) %&amp;gt;% 
  finalize_workflow(nb_best) 

# Last fit
nb_lastfit &amp;lt;- 
  nb_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the model performance on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_metrics() %&amp;gt;% 
  mutate(model = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  dplyr::bind_rows(nb_lastfit %&amp;gt;% 
                     collect_metrics() %&amp;gt;% 
                     mutate(model = &amp;quot;nb&amp;quot;)) %&amp;gt;% 
  select(-.config)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 4
##   .metric  .estimator .estimate model
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;
## 1 accuracy multiclass     0.938 knn  
## 2 roc_auc  hand_till      0.971 knn  
## 3 accuracy multiclass     0.936 nb   
## 4 roc_auc  hand_till      0.980 nb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the confusion matrices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - kNN&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - naive bayes&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
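&lt;p&gt;For readers who want to see how a confusion matrix relates to accuracy, here is a minimal base R sketch on made-up labels. The vectors below are hypothetical and are not taken from the models above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical true and predicted labels for six cases
truth &amp;lt;- factor(c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;, &amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;2&amp;quot;), levels = c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;))
pred  &amp;lt;- factor(c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;1&amp;quot;), levels = c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;))

# Rows are predictions, columns are the true labels
cm &amp;lt;- table(Prediction = pred, Truth = truth)
cm

# Accuracy is the share of cases on the diagonal
accuracy &amp;lt;- sum(diag(cm)) / sum(cm)
accuracy&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The off-diagonal cells show which digits get confused with each other, which is the information the heatmaps above display.&lt;/p&gt;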
&lt;p&gt;Lastly, we can compare the ROC plots for each class.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  mutate(id = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  bind_rows(
    nb_lastfit %&amp;gt;% 
      collect_predictions() %&amp;gt;% 
      mutate(id = &amp;quot;nb&amp;quot;)
            ) %&amp;gt;% 
  group_by(id) %&amp;gt;% 
  roc_curve(label, .pred_0:.pred_9) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I believe UMAP works quite well and can be used as a preprocessing step in image classification. We were able to get pretty good performance in this post. I believe that with a more rigorous parameter tuning approach, the performance would likely improve further.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Explore data using PCA</title>
      <link>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</link>
      <pubDate>Wed, 09 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;principal-component-analysis-pca&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Principal component analysis (PCA)&lt;/h2&gt;
&lt;p&gt;PCA is a dimension reduction technique. If we have a large number of predictors, instead of using all of them for modelling or other analyses, we can compress the information from the variables into a new set of variables. These new variables are known as components or principal components (PCs). So, we end up with a smaller number of variables which contain much of the information from the original variables.&lt;/p&gt;
&lt;p&gt;PCA is usually used for datasets with a large number of features or predictors, such as genomic data. Additionally, PCA is a good pre-processing option if you have correlated variables or a multicollinearity issue in the model. We can also use PCA to explore the data and understand it better.&lt;/p&gt;
&lt;p&gt;Those who want to study the theoretical side of PCA can read more at this &lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34;&gt;link&lt;/a&gt;. In this post, we are going to focus on the coding part in a machine learning framework (using the &lt;code&gt;tidymodels&lt;/code&gt; package).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;These are the packages that we are going to use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidymodels)
library(tidyverse)
library(mlbench) #data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use the diabetes dataset. The outcome is binary: positive = diabetes and negative = non-diabetes/healthy. All other variables are numerical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;PimaIndiansDiabetes&amp;quot;)
glimpse(PimaIndiansDiabetes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 768
## Columns: 9
## $ pregnant &amp;lt;dbl&amp;gt; 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
## $ glucose  &amp;lt;dbl&amp;gt; 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
## $ pressure &amp;lt;dbl&amp;gt; 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
## $ triceps  &amp;lt;dbl&amp;gt; 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
## $ insulin  &amp;lt;dbl&amp;gt; 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
## $ mass     &amp;lt;dbl&amp;gt; 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
## $ pedigree &amp;lt;dbl&amp;gt; 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
## $ age      &amp;lt;dbl&amp;gt; 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
## $ diabetes &amp;lt;fct&amp;gt; pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to split the data and extract the training dataset. We explore only the training set since we are working in a machine learning framework.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

ind &amp;lt;- initial_split(PimaIndiansDiabetes)
dat_train &amp;lt;- training(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a recipe and apply normalization and PCA techniques. Then, we prep it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
pca_rec &amp;lt;- 
  recipe(diabetes ~ ., data = dat_train) %&amp;gt;% 
  step_normalize(all_numeric_predictors()) %&amp;gt;% 
  step_pca(all_numeric_predictors())

# Prep
pca_prep &amp;lt;- prep(pca_rec)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can extract the PCA results using &lt;code&gt;tidy()&lt;/code&gt;. Setting &lt;code&gt;type = &#34;coef&#34;&lt;/code&gt; indicates that we want the loadings, so the values in the resulting data are the loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;coef&amp;quot;)
pca_tidied&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 64 x 4
##    terms     value component id       
##    &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;    
##  1 pregnant  0.107 PC1       pca_JtuLZ
##  2 glucose   0.357 PC1       pca_JtuLZ
##  3 pressure  0.330 PC1       pca_JtuLZ
##  4 triceps   0.460 PC1       pca_JtuLZ
##  5 insulin   0.466 PC1       pca_JtuLZ
##  6 mass      0.447 PC1       pca_JtuLZ
##  7 pedigree  0.315 PC1       pca_JtuLZ
##  8 age       0.158 PC1       pca_JtuLZ
##  9 pregnant -0.597 PC2       pca_JtuLZ
## 10 glucose  -0.192 PC2       pca_JtuLZ
## # ... with 54 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, basically the loadings indicate how much each variable contributes to each component (PC). A large loading (positive or negative) indicates a strong relationship between the variables and the related components. The sign indicates a negative or positive correlation between the variables and components.&lt;/p&gt;
&lt;p&gt;We can further visualise these loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied %&amp;gt;% 
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ component) +
  ylab(&amp;quot;&amp;quot;) +
  xlab(&amp;quot;Loadings&amp;quot;) + 
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
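&lt;p&gt;For comparison, the loadings can also be obtained with base R. The sketch below uses &lt;code&gt;prcomp()&lt;/code&gt; on a few columns of the built-in &lt;code&gt;mtcars&lt;/code&gt; data as a stand-in (an assumption made here to keep the example self-contained, not the diabetes data used in this post).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Built-in data used as a stand-in; any numeric data frame would do
x &amp;lt;- mtcars[, c(&amp;quot;mpg&amp;quot;, &amp;quot;disp&amp;quot;, &amp;quot;hp&amp;quot;, &amp;quot;wt&amp;quot;)]

# Center and scale the variables, then run PCA
pca &amp;lt;- prcomp(x, center = TRUE, scale. = TRUE)

# Loadings: one row per variable, one column per component
pca$rotation

# The squared loadings of each component sum to one
colSums(pca$rotation^2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the squared loadings within a component sum to one, a loading close to 1 or -1 means that a single variable dominates that component.&lt;/p&gt;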
&lt;p&gt;Besides the loadings, we can also get the variance information. The variance of each component (or PC) measures how much of the variability in the data that component explains. For example, PC1 explains 26.2% of the variance in the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied2 &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;variance&amp;quot;)

pca_tidied2 %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value, names_prefix = &amp;quot;PC&amp;quot;) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  mutate_if(is.numeric, round, digits = 1) %&amp;gt;% 
  kableExtra::kable(&amp;quot;simple&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;terms&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC3&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC4&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC5&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC6&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC7&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;60.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;81.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;89.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;95.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;100.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
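&lt;p&gt;The rows of this table are related by simple arithmetic: the percent variance is each component&#39;s variance divided by the total variance. A minimal base R sketch, again using a few columns of the built-in &lt;code&gt;mtcars&lt;/code&gt; data as a stand-in, shows the calculation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x &amp;lt;- mtcars[, c(&amp;quot;mpg&amp;quot;, &amp;quot;disp&amp;quot;, &amp;quot;hp&amp;quot;, &amp;quot;wt&amp;quot;)]
pca &amp;lt;- prcomp(x, center = TRUE, scale. = TRUE)

# Variance of each component is its squared standard deviation
variance &amp;lt;- pca$sdev^2

# Share of the total variance, and its running total
percent_variance &amp;lt;- 100 * variance / sum(variance)
cumulative_percent &amp;lt;- cumsum(percent_variance)

round(rbind(variance, percent_variance, cumulative_percent), 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With normalized predictors the total variance equals the number of predictors, which is why the cumulative variance in the table above ends at 8.0 and PC1&#39;s variance of about 2.1 corresponds to roughly 26% of the total.&lt;/p&gt;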
&lt;p&gt;Next, we can visualise PC1 and PC2 in a scatter plot and see how each variable influences both PCs. First, we need to extract the loadings and convert them into a wide format to get the arrow coordinates for the scatter plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied3 &amp;lt;- 
  pca_tidied %&amp;gt;% 
  filter(component %in% c(&amp;quot;PC1&amp;quot;, &amp;quot;PC2&amp;quot;)) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value)
pca_tidied3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 8 x 3
##   terms      PC1    PC2
##   &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
## 1 pregnant 0.107 -0.597
## 2 glucose  0.357 -0.192
## 3 pressure 0.330 -0.234
## 4 triceps  0.460  0.279
## 5 insulin  0.466  0.200
## 6 mass     0.447  0.121
## 7 pedigree 0.315  0.110
## 8 age      0.158 -0.638&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we can make a scatter plot using the training set data (&lt;code&gt;juice(pca_prep)&lt;/code&gt;) and the loadings data (&lt;code&gt;pca_tidied3&lt;/code&gt;). Also, we are going to add the percentage of variance for PC1 and PC2 to the axis labels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;juice(pca_prep) %&amp;gt;% 
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = diabetes, shape = diabetes), size = 2, alpha = 0.6) +
  geom_segment(data = pca_tidied3, 
               aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), 
               arrow = arrow(length = unit(1/2, &amp;quot;picas&amp;quot;)),
               color = &amp;quot;blue&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, 
           x = pca_tidied3$PC1 * 5.2, 
           y = pca_tidied3$PC2 * 5.2, 
           label = pca_tidied3$terms) +
  theme_minimal() +
  xlab(&amp;quot;PC1 (26.2%)&amp;quot;) +
  ylab(&amp;quot;PC2 (21.5%)&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from this scatter plot we learn that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;(triceps, insulin, pedigree and mass), (glucose and pressure) and (pregnant and age) are correlated as their lines are close to each other&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As PC1 and PC2 increase, triceps, insulin, pedigree and mass also increase&lt;/li&gt;
&lt;li&gt;As PC2 decreases, pregnant and age increase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34; class=&#34;uri&#34;&gt;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://juliasilge.com/blog/cocktail-recipes-umap/&#34; class=&#34;uri&#34;&gt;https://juliasilge.com/blog/cocktail-recipes-umap/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A short note on variable selection</title>
      <link>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</link>
      <pubDate>Sat, 08 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;variable-selection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variable selection&lt;/h2&gt;
&lt;p&gt;Variable or feature selection is one of the important steps in both machine learning and statistical analysis. This post is geared more to the machine learning side. Certain machine learning models such as support vector machines (SVM) and neural networks do not handle irrelevant predictors very well, whereas models such as linear and logistic regression do not handle correlated predictors very well. Thus, careful selection of the variables will help mitigate these issues and further improve the predictive performance.&lt;/p&gt;
&lt;p&gt;There are three types of approaches in variable selection:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Intrinsic (or built-in feature selection)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An intrinsic feature selection is a feature selection embedded in the algorithm. Some examples include:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Tree-and-rule-based model - decision tree, random forest, etc&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Multivariate adaptive regression spline (MARS)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Regularization method such as least absolute shrinkage and selection operator (LASSO or L1)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The advantages of this type of approach are that it is fast and computationally efficient. However, the best variables selected by this approach are model dependent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Filter&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the filter approach, we determine the importance of each variable, usually separately though not necessarily. An example of this approach is a univariate filter. If the outcome has two categories, we can use a t-test to assess the numerical predictors. Variables with a significant p-value or a large t-statistic will be chosen.&lt;/p&gt;
&lt;p&gt;This approach is very simple and fast. However, the best subset of variables selected using some filtering criteria such as statistical significance may not reflect the best predictive performance of the model. Additionally, this approach is prone to over-selection of the predictors.&lt;/p&gt;
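&lt;p&gt;A univariate t-test filter can be sketched in a few lines of base R. The data below are simulated purely for illustration, and the 0.05 threshold is an assumed cut-off rather than a recommendation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
n &amp;lt;- 50

# Simulated data: x1 differs between the two groups, x2 is pure noise
toy &amp;lt;- data.frame(
  outcome = factor(rep(c(&amp;quot;neg&amp;quot;, &amp;quot;pos&amp;quot;), each = n)),
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  x2 = rnorm(2 * n)
)

# One t-test per predictor, each assessed on its own
p_values &amp;lt;- sapply(toy[c(&amp;quot;x1&amp;quot;, &amp;quot;x2&amp;quot;)], function(x) t.test(x ~ toy$outcome)$p.value)
p_values

# Keep the predictors below the chosen significance threshold
names(p_values)[p_values &amp;lt; 0.05]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that each predictor is judged in isolation here, so the filter says nothing about how the selected variables perform together in a model.&lt;/p&gt;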
&lt;p&gt;&lt;strong&gt;3. Wrapper&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are two types of wrapper approaches:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A greedy algorithm directs the search towards whichever option gives the best immediate benefit at each step. For this reason, the approach can get stuck in a local minimum. In Figure 1 below, we can think of a local minimum as a locally best set of predictors and the global minimum as the globally best set of predictors.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;img.png&#34; alt=&#34;Local minima and global minima&#34; width=&#34;576&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Local minima and global minima
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;An example of this approach is recursive feature elimination or backward selection. The main weakness of this greedy approach is that the selected subset of features may not have the best predictive performance.&lt;/p&gt;
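&lt;p&gt;As a small illustration of a greedy search, the sketch below runs backward selection with base R&#39;s &lt;code&gt;step()&lt;/code&gt; on the built-in &lt;code&gt;mtcars&lt;/code&gt; data. Note the assumptions: &lt;code&gt;step()&lt;/code&gt; compares models by AIC rather than by resampled predictive performance, so this is a simplified stand-in for the wrapper methods discussed here, and the dataset is chosen only for convenience.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Start from a model with every predictor
full_model &amp;lt;- lm(mpg ~ ., data = mtcars)

# Drop one predictor at a time for as long as AIC keeps improving
reduced_model &amp;lt;- step(full_model, direction = &amp;quot;backward&amp;quot;, trace = 0)

# Predictors that survive the search
formula(reduced_model)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At each step only the single best removal is considered and a dropped predictor is never reconsidered, which is exactly why the search can settle on a locally best subset.&lt;/p&gt;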
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Non-greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Examples of this approach are simulated annealing and the &lt;a href=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/&#34;&gt;genetic algorithm&lt;/a&gt;. Both of these algorithms incorporate randomness in their search; hence, they are classified as non-greedy wrappers. Due to this randomness, they can escape a local minimum (see Figure 1 above).&lt;/p&gt;
&lt;p&gt;The wrapper type has the best chance to find the globally best predictors. However, this approach is computationally expensive. Not to mention, this approach has a tendency to overfit (some packages like &lt;code&gt;caret&lt;/code&gt; use resampling to avoid this issue).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;suggested-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Suggested approach&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://bookdown.org/max/FES/&#34;&gt;Kuhn &amp;amp; Johnson (2019)&lt;/a&gt; suggested this approach:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Start with an intrinsic approach&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, do a wrapper approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If a linear intrinsic approach has a better performance - proceed to wrapper method with a linear model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If non-linear intrinsic approach has a better performance - proceed to wrapper method with a non-linear model&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If several approaches select a large number of predictors, it may not be feasible to reduce the number of features&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&#34; class=&#34;uri&#34;&gt;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://topepo.github.io/caret/feature-selection-overview.html&#34; class=&#34;uri&#34;&gt;http://topepo.github.io/caret/feature-selection-overview.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Variable selection using genetic algorithm</title>
      <link>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</link>
      <pubDate>Sun, 02 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;The genetic algorithm is inspired by natural selection, the process by which the fittest individuals are selected to reproduce. It has been used in optimization and search problems, and it can also be used for variable selection.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/ga_fig.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Genetic algorithm - gene, chromosome, population, crossover (upper right), offspring (lower right)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;First, let’s go into a few terms related to genetic algorithm theory.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Population - a set of chromosomes&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Chromosome - a subset of variables (called an individual in some references)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Gene - a variable or feature&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Fitness function - gives a fitness score to each chromosome and guides the selection&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Selection - the process of choosing two chromosomes, known as parents&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Crossover - the process of generating offspring from the parents (illustrated in the picture above, on the upper right side)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Mutation - the process by which a gene in the chromosome is randomly flipped between 1 and 0&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/mutation.png&#34; width=&#34;250&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Mutation&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So, the basic flow of a genetic algorithm is:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;The algorithm starts with an initial population, often randomly generated&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a successive generation by selecting a portion of the existing population (the selection is guided by the fitness function) - this includes selection -&amp;gt; crossover -&amp;gt; mutation&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The algorithm terminates if certain predetermined criteria are met such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The solution satisfies the minimum criteria&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;A fixed number of generations is reached&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Successive iterations no longer produce a better result&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
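&lt;p&gt;The flow above can be sketched in a few lines of base R. This is a toy example, not how &lt;code&gt;caret&lt;/code&gt; or &lt;code&gt;GA&lt;/code&gt; implement it: the fitness function is made up (it pretends that only the first three genes are useful), tournament selection and single-point crossover are choices made for illustration, and elitism (keeping the current best chromosome) is an extra that is not part of the list above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
n_genes &amp;lt;- 10    # number of candidate variables
pop_size &amp;lt;- 20   # chromosomes per generation
n_gen &amp;lt;- 15      # termination rule: fixed number of generations

# Hypothetical fitness: genes 1 to 3 are useful, every other gene costs 0.2
fitness &amp;lt;- function(chrom) sum(chrom[1:3]) - 0.2 * sum(chrom[-(1:3)])

# Initial population: one row per chromosome, 1 = variable included
pop &amp;lt;- matrix(rbinom(pop_size * n_genes, 1, 0.5), nrow = pop_size)
init_best &amp;lt;- max(apply(pop, 1, fitness))

for (g in seq_len(n_gen)) {
  fit &amp;lt;- apply(pop, 1, fitness)
  new_pop &amp;lt;- pop
  new_pop[1, ] &amp;lt;- pop[which.max(fit), ]  # elitism: keep the current best
  for (k in 2:pop_size) {
    # Selection: each parent is the fitter of two randomly drawn chromosomes
    a &amp;lt;- sample(pop_size, 2)
    b &amp;lt;- sample(pop_size, 2)
    p1 &amp;lt;- pop[a[which.max(fit[a])], ]
    p2 &amp;lt;- pop[b[which.max(fit[b])], ]
    # Crossover: head of parent 1 joined to the tail of parent 2
    cut_point &amp;lt;- sample(n_genes - 1, 1)
    child &amp;lt;- c(p1[1:cut_point], p2[(cut_point + 1):n_genes])
    # Mutation: flip each gene with probability 0.05
    flip &amp;lt;- runif(n_genes) &amp;lt; 0.05
    child[flip] &amp;lt;- 1 - child[flip]
    new_pop[k, ] &amp;lt;- child
  }
  pop &amp;lt;- new_pop
}

final_fit &amp;lt;- apply(pop, 1, fitness)
best &amp;lt;- pop[which.max(final_fit), ]
best            # selected genes (1 = keep)
max(final_fit)  # never below init_best because of elitism&lt;/code&gt;&lt;/pre&gt;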
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;GA&lt;/code&gt; package in R lets us implement the genetic algorithm more manually, including specifying our own fitness function. However, I think it is easier to use the genetic algorithm implemented in the &lt;code&gt;caret&lt;/code&gt; package for variable selection.&lt;/p&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)
library(tidyverse)
library(rsample)
library(recipes)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat &amp;lt;- 
  mtcars %&amp;gt;% 
  mutate(across(c(vs, am), as.factor),
         am = fct_recode(am, auto = &amp;quot;0&amp;quot;, man = &amp;quot;1&amp;quot;))
str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels &amp;quot;0&amp;quot;,&amp;quot;1&amp;quot;: 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels &amp;quot;auto&amp;quot;,&amp;quot;man&amp;quot;: 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this, we are going to use random forest (&lt;code&gt;rfGA&lt;/code&gt;). Other options are bagged tree (&lt;code&gt;treebagGA&lt;/code&gt;) and &lt;code&gt;caretGA&lt;/code&gt;. With &lt;code&gt;caretGA&lt;/code&gt;, we can use other methods available in &lt;code&gt;caret&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-am), 
              y = dat$am,
              iters = 5,
              gafsControl = ga_ctrl)
rf_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     qsec (60%), wt (60%), disp (40%), gear (40%), vs (40%)
##   * on average, 3.2 variables were selected (min = 1, max = 7)
## 
## In the final search using the entire training set:
##    * 7 features selected at iteration 3 including:
##      cyl, hp, drat, qsec, vs ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9429      0.8831&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal features/variables:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;  &amp;quot;hp&amp;quot;   &amp;quot;drat&amp;quot; &amp;quot;qsec&amp;quot; &amp;quot;vs&amp;quot;   &amp;quot;gear&amp;quot; &amp;quot;carb&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the time taken for random forest approach.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$times&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $everything
##    user  system elapsed 
##   51.22    1.25   52.92&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, the algorithm looks for a set of variables that minimizes RMSE for a numerical outcome and maximizes accuracy for a categorical outcome. Also, the genetic algorithm tends to overfit, which is why the implementation in &lt;code&gt;caret&lt;/code&gt; reports both internal and external performance. For example, with 10-fold cross-validation, 10 genetic algorithms are run separately. In each run, nine of the folds are used for the genetic algorithm search and the remaining fold is held out for external performance evaluation.&lt;/p&gt;
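&lt;p&gt;As a rough illustration of the external resampling described above, the base R sketch below only builds the fold assignment and shows which rows each run would search on and which it would hold out. It does not run any genetic algorithm, and the names &lt;code&gt;k&lt;/code&gt; and &lt;code&gt;fold_id&lt;/code&gt; are hypothetical, not part of &lt;code&gt;caret&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
k &amp;lt;- 10
# Randomly assign each row of mtcars to one of k folds
fold_id &amp;lt;- sample(rep(seq_len(k), length.out = nrow(mtcars)))

for (i in seq_len(k)) {
  search_rows &amp;lt;- which(fold_id != i)    # rows available to the search
  external_rows &amp;lt;- which(fold_id == i)  # rows held out for external evaluation
  cat(&amp;quot;Run&amp;quot;, i, &amp;quot;: search on&amp;quot;, length(search_rows),
      &amp;quot;rows, evaluate on&amp;quot;, length(external_rows), &amp;quot;rows\n&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;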
&lt;p&gt;Let’s try variable selection using a linear regression model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
lm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run lm
set.seed(123)
lm_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-mpg), 
              y = dat$mpg,
              iters = 5,
              gafsControl = lm_ga_ctrl,
              # below is the option for `train`
              method = &amp;quot;lm&amp;quot;,
              trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = F))
lm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared, MAE
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (100%), hp (80%), carb (60%), cyl (60%), am (40%)
##   * on average, 4.4 variables were selected (min = 4, max = 5)
## 
## In the final search using the entire training set:
##    * 5 features selected at iteration 5 including:
##      cyl, disp, hp, wt, qsec  
##    * external performance at this iteration is
## 
##        RMSE    Rsquared         MAE 
##      3.3434      0.7624      2.6037&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, let’s see how to integrate this into a machine learning workflow using &lt;code&gt;recipes&lt;/code&gt;, with &lt;code&gt;rsample&lt;/code&gt; for the data split.&lt;/p&gt;
&lt;p&gt;First, we split the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat_split &amp;lt;- initial_split(dat)
dat_train &amp;lt;- training(dat_split)
dat_test &amp;lt;- testing(dat_split)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We specify two recipes, one for the numerical outcome and one for the categorical outcome.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Numerical
rec_num &amp;lt;- 
  recipe(mpg ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())

# Categorical
rec_cat &amp;lt;- 
  recipe(am ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We run random forest with the numerical outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
rf_ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga2 &amp;lt;- 
  gafs(rec_num,
       data = dat_train,
       iters = 5, 
       gafsControl = rf_ga_ctrl) 
rf_ga2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     cyl (80%), disp (80%), hp (80%), wt (80%), carb (60%)
##   * on average, 4.8 variables were selected (min = 2, max = 9)
## 
## In the final search using the entire training set:
##    * 6 features selected at iteration 5 including:
##      cyl, disp, hp, wt, gear ... 
##    * external performance at this iteration is
## 
##       RMSE   Rsquared        MAE 
##      2.830      0.928      2.408&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga2$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;wt&amp;quot;    &amp;quot;gear&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s try running SVM with the categorical outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
svm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run SVM
set.seed(123)
svm_ga &amp;lt;- 
  gafs(rec_cat,
       data = dat_train,
       iters = 5, 
       gafsControl = svm_ga_ctrl,
       # below is the options to `train` for caretGA
       method = &amp;quot;svmRadial&amp;quot;, #SVM with Radial Basis Function Kernel
       trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = T))
svm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (80%), qsec (60%), vs_X1 (60%), carb (40%), disp (40%)
##   * on average, 4 variables were selected (min = 3, max = 6)
## 
## In the final search using the entire training set:
##    * 9 features selected at iteration 2 including:
##      mpg, cyl, disp, hp, drat ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9200      0.8571&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svm_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;mpg&amp;quot;   &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;drat&amp;quot;  &amp;quot;wt&amp;quot;    &amp;quot;qsec&amp;quot;  &amp;quot;carb&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Although the genetic algorithm seems quite good for variable selection, its main limitation, I would say, is the computational time. However, if we have a lot of variables or features to reduce, using the genetic algorithm despite the long computational time seems beneficial to me.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&#34; class=&#34;uri&#34;&gt;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Hyperparameter tuning in tidymodels</title>
      <link>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</link>
      <pubDate>Sun, 05 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;This post will not go into much detail on each approach to hyperparameter tuning. It mainly aims to summarize a few things that I have studied over the last couple of days.
Generally, there are two approaches to hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Grid search:&lt;br /&gt;
– Regular grid search&lt;br /&gt;
– Random grid search&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Iterative search:&lt;br /&gt;
– Bayesian optimization&lt;br /&gt;
– Simulated annealing&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;grid-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Grid search&lt;/h2&gt;
&lt;p&gt;In grid search, we provide a set of parameter combinations and the algorithm evaluates each of them. There are two types of grid search:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regular grid search&lt;br /&gt;
– The algorithm will go through every combination of the parameter values.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_regular(mtry(c(1, 13)), 
             trees(), 
             min_n(),
             levels = 3) # how many from each parameter&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 27 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     1     1     2
##  2     7     1     2
##  3    13     1     2
##  4     1  1000     2
##  5     7  1000     2
##  6    13  1000     2
##  7     1  2000     2
##  8     7  2000     2
##  9    13  2000     2
## 10     1     1    21
## # ... with 17 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Random grid search&lt;br /&gt;
– The algorithm will randomly select a number of parameter combinations instead of going through each of them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_random(mtry(c(1, 13)),
            trees(), 
            min_n(), 
            size = 100) # size of parameters combination&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 100 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     5  1216    40
##  2     8  1374    13
##  3     9   859    39
##  4     6   282    12
##  5     2  1210     9
##  6     8  1828    39
##  7    11   550    14
##  8    13  1157    32
##  9     5   282     6
## 10    10  1018    28
## # ... with 90 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, tidymodels uses a space-filling design to make sure the parameter combinations are roughly “equidistant” from each other.&lt;/p&gt;
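&lt;p&gt;To make the difference between the two grid types concrete, here is a minimal base R sketch that does not use tidymodels. &lt;code&gt;expand.grid()&lt;/code&gt; builds every combination of the chosen levels, as a regular grid does, while &lt;code&gt;sample()&lt;/code&gt; draws each parameter at random. The three levels per parameter are copied from the outputs above; the integer ranges used for sampling are assumptions for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Regular grid: 3 levels for each of 3 parameters gives 3^3 combinations
reg &amp;lt;- expand.grid(mtry = c(1, 7, 13),
                   trees = c(1, 1000, 2000),
                   min_n = c(2, 21, 40))
nrow(reg)

# Random grid: draw each parameter independently from its (assumed) range
set.seed(2021)
rand &amp;lt;- data.frame(mtry = sample(1:13, 100, replace = TRUE),
                   trees = sample(1:2000, 100, replace = TRUE),
                   min_n = sample(2:40, 100, replace = TRUE))
head(rand)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The regular grid grows multiplicatively with the number of parameters and levels, whereas the size of the random grid is whatever we ask for.&lt;/p&gt;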
&lt;/div&gt;
&lt;div id=&#34;iterative-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Iterative search&lt;/h2&gt;
&lt;p&gt;In iterative search, we need to specify some initial parameters/values to start the search.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bayesian optimization&lt;br /&gt;
– This algorithm searches for the next best combination of parameters based on the results of the previous combinations (the prior).&lt;/li&gt;
&lt;li&gt;Simulated annealing&lt;br /&gt;
– Generally, this algorithm works in a broadly similar way to Bayesian optimization.&lt;br /&gt;
– However, as the figure below illustrates, this algorithm is able to accept a worse combination of parameters for a short while (crossing the barrier of a local search) in order to find the best combination of parameters (the global minimum).
&lt;img src=&#34;images/sim-anneal.png&#34; alt=&#34;Simulated annealing&#34; /&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further details on both iterative methods can be found &lt;a href=&#34;https://www.tmwr.org/iterative-search.html#iterative-search&#34;&gt;here&lt;/a&gt;. Since both iterative methods need starting parameters, we can combine them with any of the grid search methods to supply those starting values.&lt;/p&gt;
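&lt;p&gt;The idea of temporarily accepting a worse result can be sketched in base R on a toy one-dimensional function. This is only an illustration of the acceptance rule, not the tidymodels implementation: the function &lt;code&gt;f()&lt;/code&gt;, the starting point, the step size and the cooling rate are all made up for this example. The function has a local minimum near x = 1.1 and a lower, global minimum near x = -1.3, and the search starts next to the local one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
# Toy objective with a local minimum (x near 1.1) and a global one (x near -1.3)
f &amp;lt;- function(x) x^4 - 3 * x^2 + x

x &amp;lt;- 1.2      # start close to the local minimum
best &amp;lt;- x     # best point seen so far
temp &amp;lt;- 2     # initial temperature

for (i in 1:2000) {
  cand &amp;lt;- x + rnorm(1, sd = 0.3)   # propose a nearby point
  delta &amp;lt;- f(cand) - f(x)
  # Always accept an improvement; accept a worse point with
  # probability exp(-delta / temp), which shrinks as temp cools
  if (delta &amp;lt; 0 || runif(1) &amp;lt; exp(-delta / temp)) x &amp;lt;- cand
  if (f(x) &amp;lt; f(best)) best &amp;lt;- x
  temp &amp;lt;- temp * 0.995             # cooling schedule
}

c(best = best, value = f(best))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Early on, when the temperature is high, worse moves are accepted often, which is what lets the search climb over the barrier between the two minima. As the temperature drops, the search behaves more and more like a greedy one.&lt;/p&gt;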
&lt;/div&gt;
&lt;div id=&#34;other-methods&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Other methods&lt;/h2&gt;
&lt;p&gt;By default, if we do not supply any combination of parameters, tidymodels will randomly pick 10 combinations of parameters from the default range of values for the model. We can change this number, as shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_grid(
  resamples = dat_cv, # cross validation data set
  grid = 20,  # 20 combinations of parameters
  control = control, # some control parameters
  metrics = metrics # some metrics parameters (roc_auc, etc)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are two more special cases of grid search: &lt;code&gt;tune_race_anova()&lt;/code&gt; and &lt;code&gt;tune_race_win_loss()&lt;/code&gt;. Both are intended to be more efficient ways of doing a grid search. In general, both methods evaluate the parameter combinations on a small initial set of resamples, then eliminate the combinations with the worst performance, so less time is spent on unpromising candidates. The main difference between the two methods is how the worst combinations are identified and eliminated.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-codes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R codes&lt;/h2&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(tidymodels)
library(finetune)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will only use a small chunk of the data for ease of computation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Data
data(income, package = &amp;quot;kernlab&amp;quot;)

# Make data smaller for computation
set.seed(2021)
income2 &amp;lt;- 
  income %&amp;gt;% 
  filter(INCOME == &amp;quot;[75.000-&amp;quot; | INCOME == &amp;quot;[50.000-75.000)&amp;quot;) %&amp;gt;% 
  slice_sample(n = 600) %&amp;gt;% 
  mutate(INCOME = fct_drop(INCOME), 
         INCOME = fct_recode(INCOME, 
                             rich = &amp;quot;[75.000-&amp;quot;,
                             less_rich = &amp;quot;[50.000-75.000)&amp;quot;), 
         INCOME = factor(INCOME, ordered = F)) %&amp;gt;% 
  mutate(across(-INCOME, fct_drop))

# Summary of data
glimpse(income2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 600
## Columns: 14
## $ INCOME         &amp;lt;fct&amp;gt; less_rich, rich, rich, rich, less_rich, rich, rich, les~
## $ SEX            &amp;lt;fct&amp;gt; F, M, F, M, F, F, F, M, F, M, M, M, F, F, F, F, M, M, M~
## $ MARITAL.STATUS &amp;lt;fct&amp;gt; Married, Married, Married, Single, Single, NA, Married,~
## $ AGE            &amp;lt;ord&amp;gt; 35-44, 25-34, 45-54, 18-24, 18-24, 14-17, 25-34, 25-34,~
## $ EDUCATION      &amp;lt;ord&amp;gt; 1 to 3 years of college, Grad Study, College graduate, ~
## $ OCCUPATION     &amp;lt;fct&amp;gt; &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;~
## $ AREA           &amp;lt;ord&amp;gt; 10+ years, 7-10 years, 10+ years, -1 year, 4-6 years, 7~
## $ DUAL.INCOMES   &amp;lt;fct&amp;gt; Yes, Yes, Yes, Not Married, Not Married, Not Married, N~
## $ HOUSEHOLD.SIZE &amp;lt;ord&amp;gt; Five, Two, Four, Two, Four, Two, Three, Two, Five, One,~
## $ UNDER18        &amp;lt;ord&amp;gt; Three, None, None, None, None, None, One, None, Three, ~
## $ HOUSEHOLDER    &amp;lt;fct&amp;gt; Own, Own, Own, Rent, Family, Own, Own, Rent, Own, Own, ~
## $ HOME.TYPE      &amp;lt;fct&amp;gt; House, House, House, House, House, Apartment, House, Ho~
## $ ETHNIC.CLASS   &amp;lt;fct&amp;gt; White, White, White, White, White, White, White, White,~
## $ LANGUAGE       &amp;lt;fct&amp;gt; English, English, English, English, English, NA, Englis~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Outcome variable
table(income2$INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## less_rich      rich 
##       362       238&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Missing data
DataExplorer::plot_missing(income)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Split the data and create a 10-fold cross validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
dat_index &amp;lt;- initial_split(income2, strata = INCOME)
dat_train &amp;lt;- training(dat_index)
dat_test &amp;lt;- testing(dat_index)

## CV
set.seed(2021)
dat_cv &amp;lt;- vfold_cv(dat_train, v = 10, repeats = 1, strata = INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to impute the NAs with the mode, since all the variables are categorical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
dat_rec &amp;lt;- 
  recipe(INCOME ~ ., data = dat_train) %&amp;gt;% 
  step_impute_mode(all_predictors()) %&amp;gt;% 
  step_ordinalscore(AGE, EDUCATION, AREA, HOUSEHOLD.SIZE, UNDER18)

# Model
rf_mod &amp;lt;- 
  rand_forest(mtry = tune(),
              trees = tune(),
              min_n = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;)

# Workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(dat_rec) %&amp;gt;% 
  add_model(rf_mod)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parameters for grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Regular grid
reg_grid &amp;lt;- grid_regular(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         levels = 3)

# Random grid
rand_grid &amp;lt;- grid_random(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         size = 100)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune the models using regular grid search. We are going to use the &lt;code&gt;doParallel&lt;/code&gt; library for parallel processing.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ctrl &amp;lt;- control_grid(save_pred = T,
                        extract = extract_model)
measure &amp;lt;- metric_set(roc_auc)  

# Parallel for regular grid
library(doParallel)

# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_regular &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = reg_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for regular grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     7  1000    21 roc_auc binary     0.690    10  0.0148 Preprocessor1_Model14
## 2     7  1000    40 roc_auc binary     0.689    10  0.0179 Preprocessor1_Model23
## 3     7  2000    40 roc_auc binary     0.689    10  0.0178 Preprocessor1_Model26
## 4     7  1000     2 roc_auc binary     0.688    10  0.0173 Preprocessor1_Model05
## 5     7  2000    21 roc_auc binary     0.688    10  0.0159 Preprocessor1_Model17&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune models using random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for random grid
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_random &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for random grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_Model0~
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_Model0~
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_Model0~
## 4     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_Model0~
## 5     6   160    31 roc_auc binary     0.692    10  0.0172 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Random grid search gives a slightly better result. Let’s use this random search result as the starting point for the iterative search. First, we narrow the parameter ranges based on the plot from the random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_param &amp;lt;- 
  rf_wf %&amp;gt;% 
  parameters() %&amp;gt;% 
  update(mtry = mtry(c(5, 13)), 
         trees = trees(c(1, 500)), 
         min_n = min_n(c(5, 30)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we run Bayesian optimization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for bayesian optimization
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
bayes_tune &amp;lt;-  
  rf_wf %&amp;gt;% 
  tune_bayes(    
    resamples = dat_cv,
    param_info = rf_param,
    iter = 60,
    initial = tune_random, # result from random grid search        
    control = control_bayes(no_improve = 30, verbose = T, save_pred = T), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for Bayesian optimization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(bayes_tune, &amp;quot;performance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(bayes_tune)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 10
##    mtry trees min_n .metric .estimator  mean     n std_err .config         .iter
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;int&amp;gt;
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_~     0
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_~     0
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_~     0
## 4     6   189    15 roc_auc binary     0.693    10  0.0153 Iter1               1
## 5     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_~     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get a slightly better result from Bayesian optimization. I will skip the simulated annealing approach because it gives me an error, and I am not sure why.&lt;/p&gt;
&lt;p&gt;Lastly, we use the racing approach with ANOVA (&lt;code&gt;tune_race_anova()&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for race anova
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_efficient &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_race_anova(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = control_race(verbose_elim = TRUE, save_pred = TRUE), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is similar to that of the random grid search, but the computation is faster.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     5  1425     5 roc_auc binary     0.695    10  0.0161 Preprocessor1_Model0~
## 2    11   406     2 roc_auc binary     0.694    10  0.0183 Preprocessor1_Model0~
## 3     6   631     3 roc_auc binary     0.692    10  0.0171 Preprocessor1_Model0~
## 4     7  1264     4 roc_auc binary     0.692    10  0.0159 Preprocessor1_Model0~
## 5     9  1264     3 roc_auc binary     0.692    10  0.0188 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also compare the ROC curves of all approaches. They all look more or less similar.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# regular grid
rf_reg &amp;lt;- 
  tune_regular %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

reg_auc &amp;lt;- 
  tune_regular %&amp;gt;% 
  collect_predictions(parameters = rf_reg) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;regular_grid&amp;quot;)

# random grid
rf_rand &amp;lt;- 
  tune_random %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

rand_auc &amp;lt;- 
  tune_random %&amp;gt;% 
  collect_predictions(parameters = rf_rand) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;random_grid&amp;quot;)

# bayes
rf_bayes &amp;lt;- 
  bayes_tune %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

bayes_auc &amp;lt;- 
  bayes_tune %&amp;gt;% 
  collect_predictions(parameters = rf_bayes) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;bayes&amp;quot;)

# race_anova
rf_eff &amp;lt;- 
  tune_efficient %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

eff_auc &amp;lt;- 
  tune_efficient %&amp;gt;% 
  collect_predictions(parameters = rf_eff) %&amp;gt;%
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;race_anova&amp;quot;)

# Compare ROC curves across all tuning approaches
bind_rows(reg_auc, rand_auc, bayes_auc, eff_auc) %&amp;gt;% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = &amp;quot;plasma&amp;quot;, end = .6) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we fit our best model (from Bayesian optimization) on the training data and evaluate it on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize workflow
best_rf &amp;lt;-
  select_best(bayes_tune, metric = &amp;quot;roc_auc&amp;quot;)

final_wf &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_rf)
final_wf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_impute_mode()
## * step_ordinalscore()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 4
##   trees = 1016
##   min_n = 4
## 
## Computational engine: ranger&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Last fit
test_fit &amp;lt;- 
  final_wf %&amp;gt;%
  last_fit(dat_index) 

# Evaluation metrics 
test_fit %&amp;gt;%
  collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary         0.583 Preprocessor1_Model1
## 2 roc_auc  binary         0.611 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_fit %&amp;gt;%
  collect_predictions() %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-22-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The result is not that good: the AUC on the testing data (0.611) is noticeably lower than the cross-validated AUC (about 0.69). However, we used only about 8% of the overall data. In any case, the aim of this post is to give an overview of hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;p&gt;Additionally, there are two other functions for constructing parameter grids that I did not cover in this post: &lt;code&gt;grid_max_entropy()&lt;/code&gt; and &lt;code&gt;grid_latin_hypercube()&lt;/code&gt;. There are not many resources explaining these functions (or at least I did not find any); however, for those interested, a good place to start is the tidymodels &lt;a href=&#34;https://dials.tidymodels.org/reference/grid_max_entropy.html&#34;&gt;website&lt;/a&gt;.&lt;/p&gt;
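&lt;p&gt;For orientation, here is a minimal sketch of how these two functions can be called. The parameter ranges are assumptions chosen for illustration (an upper bound of 13 for &lt;code&gt;mtry&lt;/code&gt;), and the object names &lt;code&gt;lh_grid&lt;/code&gt; and &lt;code&gt;me_grid&lt;/code&gt; are hypothetical. I believe recent releases of dials mark both functions as deprecated in favour of &lt;code&gt;grid_space_filling()&lt;/code&gt;, so a deprecation warning is possible depending on the installed version.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dials)

set.seed(2021)

# Latin hypercube design with 10 candidate combinations
# (mtry range is an assumption for illustration)
lh_grid &amp;lt;- grid_latin_hypercube(
  mtry(range = c(1, 13)),
  trees(),
  min_n(),
  size = 10
)

# Maximum entropy design over the same parameters
me_grid &amp;lt;- grid_max_entropy(
  mtry(range = c(1, 13)),
  trees(),
  min_n(),
  size = 10
)

lh_grid&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Either grid could then be passed to the &lt;code&gt;grid&lt;/code&gt; argument of &lt;code&gt;tune_grid()&lt;/code&gt;, in the same way as the random grid used earlier.&lt;/p&gt;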
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/grid-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/grid-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/iterative-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/iterative-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://oliviergimenez.github.io/learning-machine-learning/#&#34; class=&#34;uri&#34;&gt;https://oliviergimenez.github.io/learning-machine-learning/#&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Handling imbalanced data</title>
      <link>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</link>
      <pubDate>Fri, 14 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;overview&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;Imbalanced data occurs when the classes of a categorical outcome variable are unequally distributed. It can arise for several reasons, such as a biased sampling method or measurement errors. However, the imbalance may also be an inherent characteristic of the data. For example, in a predictive model for a rare disease, the imbalance is expected.&lt;/p&gt;
&lt;p&gt;Generally, there are two types of imbalance problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slight imbalance: the imbalance is small, like 4:6&lt;/li&gt;
&lt;li&gt;Severe imbalance: the imbalance is large, like 1:100 or more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Slight imbalance is usually not a concern, while severe imbalance requires more specialised methods to build a predictive model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-problem&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;What’s the problem with imbalanced data?&lt;br /&gt;
Firstly, a predictive model trained on imbalanced data is biased towards the majority class. The minority class becomes harder to predict because there are few observations from this class, so the detection rate for the minority class will be very low.
Secondly, accuracy is not a good measure in this case. We may get a good accuracy, but in reality the accuracy does not reflect the unequal distribution of the data. This is known as the &lt;a href=&#34;https://en.wikipedia.org/wiki/Accuracy_paradox&#34;&gt;accuracy paradox&lt;/a&gt;. Imagine that 90% of the data belong to the majority class, while the remaining 10% belong to the minority class. Just by predicting every observation as the majority class, the model can easily get 90% accuracy.&lt;/p&gt;
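&lt;p&gt;To make the accuracy paradox concrete, here is a minimal base R sketch. The 90/10 labels are simulated purely for illustration, and the object names (&lt;code&gt;truth&lt;/code&gt;, &lt;code&gt;pred&lt;/code&gt;) are my own; they are not part of the analysis in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Simulated outcome: 90 majority cases and 10 minority cases
truth &amp;lt;- factor(rep(c(&amp;quot;majority&amp;quot;, &amp;quot;minority&amp;quot;), times = c(90, 10)))

# A &amp;quot;model&amp;quot; that always predicts the majority class
pred &amp;lt;- factor(rep(&amp;quot;majority&amp;quot;, 100), levels = levels(truth))

# Proportion of correct predictions overall
accuracy &amp;lt;- mean(pred == truth)

# Proportion of minority cases that are detected
minority_recall &amp;lt;- mean(pred[truth == &amp;quot;minority&amp;quot;] == &amp;quot;minority&amp;quot;)

c(accuracy = accuracy, minority_recall = minority_recall)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The accuracy is 0.9 even though not a single minority case is detected.&lt;/p&gt;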
&lt;/div&gt;
&lt;div id=&#34;handling-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Handling approach&lt;/h2&gt;
&lt;p&gt;The easiest approach is to collect more data, though this may not be practical in all situations. Fortunately, there are a few machine learning techniques available to tackle this problem.&lt;/p&gt;
&lt;p&gt;Here is a summary of the resampling techniques available in the &lt;code&gt;themis&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;method-themis.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Over-sampling is preferred when the dataset is small. Under-sampling can be used when the dataset is large, though it may lead to loss of information. Additionally, ensemble techniques such as random forest are said to be able to handle imbalanced data, though some references/blogs say otherwise.&lt;/p&gt;
&lt;p&gt;So, we are going to compare four over-sampling techniques (upsample, SMOTE, ADASYN, and ROSE) and three under-sampling techniques (downsample, NearMiss, and Tomek). The base model is a decision tree, which will be used for all the techniques. For the sake of simplicity, the decision trees are not going to be extensively tuned. Additionally, a random forest is also going to be included in the comparison.&lt;/p&gt;
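&lt;p&gt;As a rough illustration of what random over-sampling and under-sampling do, here is a base R sketch on toy data. It is not how &lt;code&gt;themis&lt;/code&gt; implements these steps; the data frame &lt;code&gt;toy&lt;/code&gt; and all other object names are hypothetical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)

# Toy data: 20 majority rows and 5 minority rows
toy &amp;lt;- data.frame(
  x = rnorm(25),
  class = factor(rep(c(&amp;quot;major&amp;quot;, &amp;quot;minor&amp;quot;), times = c(20, 5)))
)

idx_major &amp;lt;- which(toy$class == &amp;quot;major&amp;quot;)
idx_minor &amp;lt;- which(toy$class == &amp;quot;minor&amp;quot;)

# Random over-sampling: draw minority rows with replacement
# until both classes have the same size
up_idx &amp;lt;- c(idx_major,
            sample(idx_minor, length(idx_major), replace = TRUE))
toy_up &amp;lt;- toy[up_idx, ]

# Random under-sampling: keep a random subset of the majority rows
# that matches the size of the minority class
down_idx &amp;lt;- c(sample(idx_major, length(idx_minor)), idx_minor)
toy_down &amp;lt;- toy[down_idx, ]

table(toy_up$class)
table(toy_down$class)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Over-sampling leaves 20 rows in each class, whereas under-sampling leaves only 5 in each, which is why under-sampling can lose information.&lt;/p&gt;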
&lt;p&gt;The dataset is from &lt;a href=&#34;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&#34;&gt;here&lt;/a&gt;. This is a summary of the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see from the summary, the variable &lt;code&gt;admit&lt;/code&gt; is moderately imbalanced, with a ratio of roughly 1:2 (127 versus 273 observations).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(df, aes(admit)) + 
  geom_bar() +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/barplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Below is the code for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(magrittr)
library(tidymodels)
library(themis)

# Data
df &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&amp;quot;)

# Split data
set.seed(1234)
df_split &amp;lt;- initial_split(df)
df_train &amp;lt;- training(df_split)
df_test &amp;lt;- testing(df_split)

# 1) Decision tree ----

# Recipe
dt_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank)

df_train_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)
  
df_test_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
dt_mod &amp;lt;- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %&amp;gt;% 
  set_engine(&amp;quot;rpart&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
dt_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune &amp;lt;- 
  dt_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- dt_tune %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final &amp;lt;- 
  dt_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
dt_train &amp;lt;- 
  dt_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred = .pred_class)

# 2) Oversampling ----
## step_upsample() ----

# Recipe
up_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_upsample(admit,
                seed = 1234)

df_train_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_up &amp;lt;- vfold_cv(df_train_up)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_up &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  tune_grid(resamples = df_cv_up,
            metrics = metric_set(accuracy))

## Select best model
best_tune_up &amp;lt;- dt_tune_up %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  finalize_workflow(best_tune_up)

# Fit on train data
dt_train_up &amp;lt;- 
  dt_wf_final_up %&amp;gt;% 
  fit(data = df_train_up)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_up, new_data = df_test_rec_up)) %&amp;gt;% 
  rename(pred_up = .pred_class)

## step_smote() ----

# Recipe
smote_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_smote(admit, 
             seed = 1234)

df_train_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_smote &amp;lt;- vfold_cv(df_train_smote)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_smote &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  tune_grid(resamples = df_cv_smote,
            metrics = metric_set(accuracy))

## Select best model
best_tune_smote &amp;lt;- dt_tune_smote %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  finalize_workflow(best_tune_smote)

# Fit on train data
dt_train_smote &amp;lt;- 
  dt_wf_final_smote %&amp;gt;% 
  fit(data = df_train_smote)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_smote, new_data = df_test_rec_smote)) %&amp;gt;% 
  rename(pred_smote = .pred_class)

## step_rose() ----

# Recipe
rose_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_rose(admit, 
             seed = 1234)

df_train_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_rose &amp;lt;- vfold_cv(df_train_rose)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_rose &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  tune_grid(resamples = df_cv_rose,
            metrics = metric_set(accuracy))

## Select best model
best_tune_rose &amp;lt;- dt_tune_rose %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  finalize_workflow(best_tune_rose)

# Fit on train data
dt_train_rose &amp;lt;- 
  dt_wf_final_rose %&amp;gt;% 
  fit(data = df_train_rose)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_rose, new_data = df_test_rec_rose)) %&amp;gt;% 
  rename(pred_rose = .pred_class)

## step_adasyn() ----

# Recipe
adasyn_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_adasyn(admit, 
            seed = 1234)

df_train_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_adasyn &amp;lt;- vfold_cv(df_train_adasyn)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_adasyn &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  tune_grid(resamples = df_cv_adasyn,
            metrics = metric_set(accuracy))

## Select best model
best_tune_adasyn &amp;lt;- dt_tune_adasyn %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  finalize_workflow(best_tune_adasyn)

# Fit on train data
dt_train_adasyn &amp;lt;- 
  dt_wf_final_adasyn %&amp;gt;% 
  fit(data = df_train_adasyn)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_adasyn, new_data = df_test_rec_adasyn)) %&amp;gt;% 
  rename(pred_adasyn = .pred_class)

# 3) Undersampling ----
## step_downsample() ----

# Recipe
down_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_downsample(admit,
                seed = 1234)

df_train_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_down &amp;lt;- vfold_cv(df_train_down)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_down &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  tune_grid(resamples = df_cv_down,
            metrics = metric_set(accuracy))

## Select best model
best_tune_down &amp;lt;- dt_tune_down %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  finalize_workflow(best_tune_down)

# Fit on train data
dt_train_down &amp;lt;- 
  dt_wf_final_down %&amp;gt;% 
  fit(data = df_train_down)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_down, new_data = df_test_rec_down)) %&amp;gt;% 
  rename(pred_down = .pred_class)

## step_nearmiss() ----

# Recipe
nearmiss_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_nearmiss(admit,
                  seed = 1234)

df_train_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_nearmiss &amp;lt;- vfold_cv(df_train_nearmiss)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_nearmiss &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  tune_grid(resamples = df_cv_nearmiss,
            metrics = metric_set(accuracy))

## Select best model
best_tune_nearmiss &amp;lt;- dt_tune_nearmiss %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  finalize_workflow(best_tune_nearmiss)

# Fit on train data
dt_train_nearmiss &amp;lt;- 
  dt_wf_final_nearmiss %&amp;gt;% 
  fit(data = df_train_nearmiss)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_nearmiss, new_data = df_test_rec_nearmiss)) %&amp;gt;% 
  rename(pred_nearmiss = .pred_class)

## step_tomek() ----

# Recipe
tomek_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_tomek(admit,
                  seed = 1234)

df_train_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_tomek &amp;lt;- vfold_cv(df_train_tomek)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_tomek &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  tune_grid(resamples = df_cv_tomek,
            metrics = metric_set(accuracy))

## Select best model
best_tune_tomek &amp;lt;- dt_tune_tomek %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  finalize_workflow(best_tune_tomek)

# Fit on train data
dt_train_tomek &amp;lt;- 
  dt_wf_final_tomek %&amp;gt;% 
  fit(data = df_train_tomek)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_tomek, new_data = df_test_rec_tomek)) %&amp;gt;% 
  rename(pred_tomek = .pred_class)

# 4) Ensemble approach: random forest ----

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
rf_mod &amp;lt;- rand_forest(
 mtry = tune(),
 trees = tune(),
 min_n = tune()
 ) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(rf_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
rf_tune &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- rf_tune %&amp;gt;% select_best(metric = &amp;quot;accuracy&amp;quot;)

## Finalize workflow
rf_wf_final &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
rf_train &amp;lt;- 
  rf_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Predict on test data and store the predictions
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(rf_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred_rf = .pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;Now, let’s get the accuracy, sensitivity, specificity, and &lt;a href=&#34;https://en.wikipedia.org/wiki/Matthews_correlation_coefficient#Advantages_of_MCC_over_accuracy_and_F1_score&#34;&gt;Matthews correlation coefficient (MCC)&lt;/a&gt; for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all measurements
df_test$admit %&amp;lt;&amp;gt;% as_factor()
pred_col &amp;lt;- colnames(df_test)[5:13]
result &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
sensi &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
specif &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
mathew &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)

for (i in seq_along(pred_col)) {
  # accuracy
  result[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    accuracy(admit, df_test[,pred_col[i]])
  
  # sensitivity
  sensi[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    sensitivity(admit, df_test[,pred_col[i]])
  
  # specificity
  specif[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    specificity(admit, df_test[,pred_col[i]])
  
  # MCC
  mathew[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    mcc(admit, df_test[,pred_col[i]])
}

## Turn into dataframe
result  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;)) %&amp;gt;% 
  rename(model = name, 
         accuracy = .estimate) %&amp;gt;% 
  select(model, accuracy) %&amp;gt;% 
  mutate(model = factor(model,labels = 
                          c(
                            &amp;quot;1&amp;quot; = &amp;quot;base&amp;quot;,
                            &amp;quot;2&amp;quot; = &amp;quot;upsample&amp;quot;,
                            &amp;quot;3&amp;quot; = &amp;quot;smote&amp;quot;,
                            &amp;quot;4&amp;quot; = &amp;quot;rose&amp;quot;,
                            &amp;quot;5&amp;quot; = &amp;quot;adasyn&amp;quot;,
                            &amp;quot;6&amp;quot; = &amp;quot;downsample&amp;quot;,
                            &amp;quot;7&amp;quot; = &amp;quot;nearmiss&amp;quot;,
                            &amp;quot;8&amp;quot; = &amp;quot;tomek&amp;quot;,
                            &amp;quot;9&amp;quot; = &amp;quot;random_forest&amp;quot;
                            )
                        ))

sensi  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

specif %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

mathew %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

result %&amp;lt;&amp;gt;% 
  bind_cols(sensitive = sensi$.estimate, specific = specif$.estimate, mathew = mathew$.estimate)

# Plot the result
result %&amp;gt;% 
  pivot_longer(cols = 2:5, names_to = &amp;quot;measure&amp;quot;) %&amp;gt;% 
  ggplot(aes(x = model, y = value, fill = measure)) +
  geom_bar(position = &amp;quot;dodge&amp;quot;, stat = &amp;quot;identity&amp;quot;) +
  theme_bw() +
  coord_flip() +
  geom_text(aes(label = paste0(round(value*100, digits = 1), &amp;quot;%&amp;quot;)), 
            position = position_dodge(0.9), vjust = 0.3, size = 2.7, hjust = -0.1) +
  labs(title = &amp;quot;Comparison of unbalanced data techniques&amp;quot;, 
       x = &amp;quot;Techniques&amp;quot;, 
       y = &amp;quot;Performance&amp;quot;) +
  scale_fill_discrete(name = &amp;quot;Metrics:&amp;quot;,
                      labels = c(&amp;quot;Accuracy&amp;quot;, &amp;quot;MCC&amp;quot;, &amp;quot;Sensitivity&amp;quot;, &amp;quot;Specificity&amp;quot;)) +
  theme(legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/summary-measure2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see from the plot above that the base model (decision tree) clearly has a low detection rate for the minority class (specificity). All methods are able to increase the specificity, at the cost of accuracy and sensitivity. As mentioned earlier, accuracy is not a good metric for this kind of model (i.e., the accuracy paradox). MCC, on the other hand, takes into account all four values of the confusion matrix: true positives, false positives, true negatives, and false negatives. Hence, MCC is more informative than accuracy (and than the F score, which is not included in the plot for the sake of simplicity).&lt;/p&gt;
&lt;p&gt;Based on MCC, specificity, and sensitivity, the downsample approach probably gives the most balanced model. However, this does not mean that the downsample technique is the best, as I believe each technique behaves differently from one dataset to another.&lt;/p&gt;
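&lt;p&gt;To illustrate how MCC uses all four cells of the confusion matrix, here is a small base R sketch of the standard formula. The counts are hypothetical and are not taken from the models in this post; in the analysis itself the value comes from &lt;code&gt;mcc()&lt;/code&gt; in yardstick.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical confusion matrix counts (not from the models above)
tp &amp;lt;- 60  # true positives
fp &amp;lt;- 25  # false positives
tn &amp;lt;- 5   # true negatives
fn &amp;lt;- 10  # false negatives

# Accuracy only uses the correctly classified cases
acc &amp;lt;- (tp + tn) / (tp + fp + tn + fn)

# MCC uses all four cells of the confusion matrix
mcc_value &amp;lt;- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(c(accuracy = acc, mcc = mcc_value), 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these counts the accuracy is 0.65, while the MCC is close to zero (about 0.03), because most of the negative cases are misclassified.&lt;/p&gt;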
&lt;p&gt;References:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://themis.tidymodels.org/reference/index.html&#34; class=&#34;uri&#34;&gt;https://themis.tidymodels.org/reference/index.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&#34; class=&#34;uri&#34;&gt;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&#34; class=&#34;uri&#34;&gt;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
