<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>R | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/category/r/</link>
      <atom:link href="https://tengkuhanis.netlify.app/category/r/index.xml" rel="self" type="application/rss+xml" />
    <description>R</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Wed, 22 Feb 2023 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>R</title>
      <link>https://tengkuhanis.netlify.app/category/r/</link>
    </image>
    
    <item>
      <title>Mapping the states in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/</link>
      <pubDate>Wed, 22 Feb 2023 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/</guid>
      <description>


&lt;p&gt;I have written two blog posts about making map in R:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;Making maps with R (my first attempt ever!)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/&#34;&gt;My first interactive map with {leaflet}&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post is sort of a continuation to the &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;first blog post&lt;/a&gt;. I have shown how to plot a coordinate to a map in that post specifically for Malaysia.&lt;/p&gt;
&lt;p&gt;However, using the two approaches from the previous blog post, we cannot restrict the plot to a certain state in Malaysia. At least, I was unable to find out how to do that after googling around. But we can plot the Bornean or Peninsular side of Malaysia using the two approaches.&lt;/p&gt;
&lt;div id=&#34;plot-the-peninsular-of-malaysia-not-the-best-way&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot Peninsular Malaysia (not the best way)&lt;/h2&gt;
&lt;p&gt;Load the necessary packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rworldmap) 
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we get the data. The data is about desa clinics (klinik desa, rural community clinics) in Malaysia.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)
head(clinicDesa)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   id facilities_id                     name              address postcode
## 1  1    KD01010019  KLINIK DESA ASSAM BUBOK     Jalan Batu Pahat    86400
## 2  2    KD01010020   KLINIK DESA BATU PUTIH    Jalan Behor Temak    83000
## 3  3    KD01010021      KLINIK DESA BEROLEH    Jalan Parit Besar    83300
## 4  4    KD01010022        KLINIK DESA BINDU Jalan Tongkang Pecah    83010
## 5  5    KD01010023 KLINIK DESA KAMPUNG BARU   Jalan Parit Kemang    83710
## 6  6    KD01010024 KLINIK DESA KANGKAR BARU      Jalan Meng Seng    85400
##             city   district  state tel fax website email image latitude
## 1     Ayer Hitam Batu Pahat Johor       NA      NA    NA    NA 1.933330
## 2          Bagan Batu Pahat Johor       NA      NA    NA    NA 1.889100
## 3     Sri Gading Batu Pahat Johor       NA      NA    NA    NA 1.877890
## 4 Tongkang Pecah Batu Pahat Johor       NA      NA    NA    NA 1.901515
## 5    Parit Yaani Batu Pahat Johor       NA      NA    NA    NA 1.905120
## 6      Yong Peng Batu Pahat Johor       NA      NA    NA    NA 2.065310
##   longitude likes rating status
## 1  103.1167     0      0    NEW
## 2  102.8778     0      0    NEW
## 3  102.9858     0      0    NEW
## 4  102.9665     0      0    NEW
## 5  103.0372     0      0    NEW
## 6  103.1248     0      0    NEW&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we plot the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(clinicDesa, aes(longitude, latitude)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Two points have clearly erroneous coordinates (their longitude is far outside Malaysia), so we remove them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinicDesa2 &amp;lt;- clinicDesa %&amp;gt;% filter(longitude &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, plot the updated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(clinicDesa2, aes(longitude, latitude)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the plot, we can see that the left cluster consists of the coordinates in Peninsular Malaysia. So, we can limit our plot to longitude &amp;lt; 105 and longitude &amp;gt; 97.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get base map
global &amp;lt;- map_data(&amp;quot;world&amp;quot;) 

# Plot
ggplot() + 
  geom_polygon(data = global %&amp;gt;% filter(region == &amp;quot;Malaysia&amp;quot;), aes(x=long, y = lat, group = group), 
               fill = &amp;quot;gray85&amp;quot;) + 
  coord_fixed(1.3) +
  geom_point(data = clinicDesa2, aes(x = longitude, y = latitude)) +
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in the peninsular of Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) +
  xlim(97, 105) #limit overall map to peninsular of Malaysia&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I am not going to re-explain the code above and below as I have explained it in &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;the previous blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This approach also works with &lt;code&gt;rworldmap&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get base map
world &amp;lt;- getMap(resolution = &amp;quot;low&amp;quot;)
msia &amp;lt;- world[world@data$ADMIN == &amp;quot;Malaysia&amp;quot;, ]

# Plot
ggplot() +
  geom_polygon(data = msia, aes(x = long, y = lat, group = group), fill = NA, colour = &amp;quot;black&amp;quot;) +
  geom_point(data = clinicDesa2, aes(x = longitude, y = latitude)) +
  coord_quickmap() + 
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in the peninsular of Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) +
  xlim(97, 105) #limit overall map to peninsular of Malaysia&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, using the two approaches, we can plot the Bornean and Peninsular sides of Malaysia. But, at least to my knowledge, we cannot apply these approaches if we want to restrict the plot to a certain state in Malaysia.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;plot-the-states-in-malaysia&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot the states in Malaysia&lt;/h2&gt;
&lt;p&gt;Load the necessary package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(geodata)
library(tidyterra)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time we are going to use the &lt;code&gt;geodata&lt;/code&gt; package; &lt;code&gt;tidyterra&lt;/code&gt; is used to supplement ggplot2. First, let’s limit the data to desa clinics in Terengganu only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic_trg &amp;lt;- 
  clinicDesa %&amp;gt;% 
  filter(state == &amp;quot;Terengganu&amp;quot;) %&amp;gt;% 
  dplyr::select(latitude, longitude) 
head(clinic_trg)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   latitude longitude
## 1  5.48533  102.4914
## 2  5.81578  102.5778
## 3  5.70886  102.4892
## 4  5.75722  102.5303
## 5  5.67444  102.6289
## 6  5.69875  102.5430&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we get the map from the &lt;code&gt;geodata&lt;/code&gt; package with the boundaries at the district level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Malaysia &amp;lt;- gadm(country = &amp;quot;MYS&amp;quot;, level = 2, path=tempdir())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can use the below information to limit the map to Terengganu state only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Malaysia$NAME_1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   [1] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;          
##   [5] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;          
##   [9] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [13] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [17] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [21] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [25] &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [29] &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [33] &amp;quot;Kuala Lumpur&amp;quot;    &amp;quot;Labuan&amp;quot;          &amp;quot;Melaka&amp;quot;          &amp;quot;Melaka&amp;quot;         
##  [37] &amp;quot;Melaka&amp;quot;          &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot;
##  [41] &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot;
##  [45] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;         
##  [49] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;         
##  [53] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Perak&amp;quot;          
##  [57] &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;          
##  [61] &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;          
##  [65] &amp;quot;Perak&amp;quot;           &amp;quot;Perlis&amp;quot;          &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;   
##  [69] &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Putrajaya&amp;quot;      
##  [73] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [77] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [81] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [85] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [89] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [93] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [97] &amp;quot;Sabah&amp;quot;           &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [101] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [105] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [109] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [113] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [117] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [121] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [125] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [129] &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;       
## [133] &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;       
## [137] &amp;quot;Selangor&amp;quot;        &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;      
## [141] &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, this is the plot for Terengganu.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Trg &amp;lt;- Malaysia[138:144,]
plot(Trg)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
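&lt;p&gt;Hardcoded row indices like &lt;code&gt;138:144&lt;/code&gt; can break when GADM updates its data. As a sketch of a more robust alternative, we can subset by the state name instead (note that GADM spells the state &lt;code&gt;Trengganu&lt;/code&gt;, without the first &lt;code&gt;e&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Subset by name rather than by row position
Trg2 &amp;lt;- Malaysia[Malaysia$NAME_1 == &amp;quot;Trengganu&amp;quot;, ]
plot(Trg2)&lt;/code&gt;&lt;/pre&gt;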
&lt;p&gt;We are going to plot this with ggplot2, stacking the map layer with the coordinate layer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() +
  geom_spatvector(data = Trg, color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_minimal() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;geom_spatvector&lt;/code&gt; is from the &lt;code&gt;tidyterra&lt;/code&gt; package. Alternatively, we can plot using &lt;code&gt;geom_sf&lt;/code&gt;, but we need to convert the &lt;code&gt;SpatVector&lt;/code&gt; data into an &lt;code&gt;sf&lt;/code&gt; object using &lt;code&gt;sf::st_as_sf&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = sf::st_as_sf(Trg)) +
  geom_sf(color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_minimal() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Both approaches produce the same plot.&lt;/p&gt;
&lt;p&gt;We can further add district labels to the plot. For example, using &lt;code&gt;geom_sf&lt;/code&gt;, we can stack it with a &lt;code&gt;geom_sf_label&lt;/code&gt; layer. We can also use &lt;code&gt;theme_void&lt;/code&gt; to remove the background and the map axes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = sf::st_as_sf(Trg)) +
  geom_sf(color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_sf_label(aes(label = NAME_2)) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_void() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Using UMAP preprocessing for image classification</title>
      <link>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</link>
      <pubDate>Wed, 16 Mar 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;umap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;UMAP&lt;/h2&gt;
&lt;p&gt;Uniform manifold approximation and projection, or UMAP for short, is a dimension reduction technique. Basically, UMAP projects a set of features into a smaller space. UMAP can be a supervised technique, in which we supply a label or an outcome, or an unsupervised one. Those interested in the details of how UMAP works can refer to this &lt;a href=&#34;https://umap-learn.readthedocs.io/en/latest/how_umap_works.html&#34;&gt;reference&lt;/a&gt;. For those who prefer a simpler or shorter version, I recommend a &lt;a href=&#34;https://www.youtube.com/watch?v=eN0wFzBA4Sc&amp;amp;list=WL&amp;amp;index=2&#34;&gt;YouTube video by Joshua Starmer&lt;/a&gt;.&lt;/p&gt;
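&lt;p&gt;As a minimal sketch (using the &lt;code&gt;uwot&lt;/code&gt; package, which is not used in the rest of this post), the supervised and unsupervised variants differ only in whether we pass the labels via &lt;code&gt;y&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(uwot)

set.seed(123)
# Unsupervised: project the 4 iris measurements into 2 dimensions
emb_unsup &amp;lt;- umap(iris[, 1:4], n_components = 2)

# Supervised: the species labels guide the embedding
emb_sup &amp;lt;- umap(iris[, 1:4], y = iris$Species, n_components = 2)

dim(emb_unsup) # 150 rows, 2 columns&lt;/code&gt;&lt;/pre&gt;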
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;We are going to see how to apply UMAP as an image-preprocessing step and then classify the images using kNN and naive Bayes.&lt;/p&gt;
&lt;p&gt;These are the packages that we need.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(keras) #for data and reshape to tabular format
library(tidymodels)
library(embed) #for umap
library(discrim) #for naive bayes model&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use the famous MNIST dataset, which contains handwritten digits from 0 to 9. The dataset is available in the &lt;code&gt;keras&lt;/code&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mnist_data &amp;lt;- dataset_mnist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loaded Tensorflow version 2.2.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- mnist_data$train$x
image_labels &amp;lt;- mnist_data$train$y
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000    28    28&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, this is the image in the second row.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data[2, 1:28, 1:28] %&amp;gt;% 
  t() %&amp;gt;% 
  image(col = gray.colors(256))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, we are going to reshape the images into a tabular data frame format. We are going to limit the data to the first 10,000 rows or images out of the total 60,000 images.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Reformat to tabular format
image_data &amp;lt;- array_reshape(image_data, dim = c(60000, 28*28))
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000   784&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- image_data[1:10000,]
image_labels &amp;lt;- image_labels[1:10000]

# Reformat to data frame
full_data &amp;lt;- 
  data.frame(image_data) %&amp;gt;% 
  bind_cols(label = image_labels) %&amp;gt;% 
  mutate(label = as.factor(label))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we are going to split the data and create 3-fold cross-validation sets for the sake of simplicity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Split data
set.seed(123)
ind &amp;lt;- initial_split(full_data)
data_train &amp;lt;- training(ind)  
data_test &amp;lt;- testing(ind)

# 3-fold CV
set.seed(123)
data_cv &amp;lt;- vfold_cv(data_train, v = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the recipe specification, we are going to center and scale all the predictors after creating the new variables using &lt;code&gt;step_umap()&lt;/code&gt;. Notice that in &lt;code&gt;step_umap()&lt;/code&gt; we supply the outcome and tune the number of components (&lt;code&gt;num_comp&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = tune()) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a base workflow.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(rec) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use two models as classifiers:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;kNN&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Naive Bayes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each classifier, we are going to create a regular grid of the parameters to be tuned and then run a grid search.&lt;/p&gt;
&lt;p&gt;For kNN.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_mod &amp;lt;- 
  nearest_neighbor(neighbors = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;kknn&amp;quot;)

# knn grid
knn_grid &amp;lt;- grid_regular(neighbors(), num_comp(range = c(2, 8)), levels = 3)

# Tune grid search
knn_tune &amp;lt;- 
  tune_grid(
  wf %&amp;gt;% add_model(knn_mod),
  resamples = data_cv,
  grid = knn_grid, 
  control = control_grid(verbose = F)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For naive Bayes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_mod &amp;lt;- 
  naive_Bayes(smoothness = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;naivebayes&amp;quot;)

# nb grid
nb_grid &amp;lt;- grid_regular(smoothness(), num_comp(range = c(2, 10)), levels = 3)

# Tune grid search
nb_tune &amp;lt;- 
  tune_grid(
    wf %&amp;gt;% add_model(nb_mod),
    resamples = data_cv,
    grid = nb_grid, 
    control = control_grid(verbose = F)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the tuning performance of our models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_tune %&amp;gt;% 
  show_best(&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric .estimator  mean     n  std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 roc_auc hand_till  0.961     3 0.000268 Preprocessor3_Mode~
## 2        10        5 roc_auc hand_till  0.961     3 0.000421 Preprocessor2_Mode~
## 3         5        8 roc_auc hand_till  0.959     3 0.000757 Preprocessor3_Mode~
## 4        10        2 roc_auc hand_till  0.959     3 0.000737 Preprocessor1_Mode~
## 5         5        5 roc_auc hand_till  0.958     3 0.000740 Preprocessor2_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_tune %&amp;gt;% 
  show_best(&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric  .estimator  mean     n std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 accuracy multiclass 0.914     3 0.00104 Preprocessor3_Mode~
## 2         5        8 accuracy multiclass 0.913     3 0.00315 Preprocessor3_Mode~
## 3        10        5 accuracy multiclass 0.912     3 0.00114 Preprocessor2_Mode~
## 4         5        5 accuracy multiclass 0.91      3 0.00139 Preprocessor2_Mode~
## 5        10        2 accuracy multiclass 0.910     3 0.00175 Preprocessor1_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_tune %&amp;gt;% 
  show_best(&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric .estimator  mean     n  std_err .config           
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             
## 1        1.5       10 roc_auc hand_till  0.971     3 0.000400 Preprocessor3_Mod~
## 2        1.5        6 roc_auc hand_till  0.971     3 0.000997 Preprocessor2_Mod~
## 3        1         10 roc_auc hand_till  0.971     3 0.000634 Preprocessor3_Mod~
## 4        1          6 roc_auc hand_till  0.970     3 0.00124  Preprocessor2_Mod~
## 5        0.5       10 roc_auc hand_till  0.969     3 0.000808 Preprocessor3_Mod~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_tune %&amp;gt;% 
  show_best(&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric  .estimator  mean     n  std_err .config          
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;            
## 1        1         10 accuracy multiclass 0.913     3 0.000481 Preprocessor3_Mo~
## 2        1.5       10 accuracy multiclass 0.913     3 0.000267 Preprocessor3_Mo~
## 3        0.5       10 accuracy multiclass 0.912     3 0.000462 Preprocessor3_Mo~
## 4        1.5        6 accuracy multiclass 0.911     3 0.00135  Preprocessor2_Mo~
## 5        1          6 accuracy multiclass 0.910     3 0.00157  Preprocessor2_Mo~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we are going to select the best model from the tuned parameters and finalise our model using &lt;code&gt;last_fit()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For the kNN model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
knn_best &amp;lt;- knn_tune %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
knn_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = knn_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

knn_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(knn_rec) %&amp;gt;% 
  add_model(knn_mod) %&amp;gt;% 
  finalize_workflow(knn_best) 

# Last fit
knn_lastfit &amp;lt;- 
  knn_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the naive Bayes model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
nb_best &amp;lt;- nb_tune %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
nb_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = nb_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

nb_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(nb_rec) %&amp;gt;% 
  add_model(nb_mod) %&amp;gt;% 
  finalize_workflow(nb_best) 

# Last fit
nb_lastfit &amp;lt;- 
  nb_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the model performance on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_metrics() %&amp;gt;% 
  mutate(model = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  dplyr::bind_rows(nb_lastfit %&amp;gt;% 
                     collect_metrics() %&amp;gt;% 
                     mutate(model = &amp;quot;nb&amp;quot;)) %&amp;gt;% 
  select(-.config)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 4
##   .metric  .estimator .estimate model
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;
## 1 accuracy multiclass     0.938 knn  
## 2 roc_auc  hand_till      0.971 knn  
## 3 accuracy multiclass     0.936 nb   
## 4 roc_auc  hand_till      0.980 nb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the confusion matrices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - kNN&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - naive bayes&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can compare the ROC plots for each class.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  mutate(id = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  bind_rows(
    nb_lastfit %&amp;gt;% 
      collect_predictions() %&amp;gt;% 
      mutate(id = &amp;quot;nb&amp;quot;)
            ) %&amp;gt;% 
  group_by(id) %&amp;gt;% 
  roc_curve(label, .pred_0:.pred_9) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I believe UMAP is quite good and can be used as a preprocessing step in image classification. We were able to get a pretty good performance result in this post, and I believe that with a more rigorous parameter tuning approach, the performance would be even better.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Fitted vs predict in R</title>
      <link>https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/</link>
      <pubDate>Sun, 09 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;There are two functions in R that seem almost similar yet are different:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fitted()&lt;/code&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;predict()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;First, let’s prepare some data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(dplyr)

# Data
set.seed(123)
dat &amp;lt;- 
  iris %&amp;gt;% 
  mutate(twoGp = sample(c(&amp;quot;Gp1&amp;quot;, &amp;quot;Gp2&amp;quot;), 150, replace = T), #create two group factor
         twoGp = as.factor(twoGp))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   twoGp   
##  setosa    :50   Gp1:76  
##  versicolor:50   Gp2:74  
##  virginica :50           
##                          
##                          
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fitted()&lt;/code&gt; is used to get the predicted values or &lt;span class=&#34;math inline&#34;&gt;\(\hat{y}\)&lt;/span&gt; based on the data used to fit the model. Let’s see this with a logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logR &amp;lt;- glm(twoGp ~ ., family = binomial(), data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the fitted values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fitted(logR) %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For &lt;code&gt;predict()&lt;/code&gt;, we have three types:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Response&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Link - default&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Terms&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If no new data is supplied to &lt;code&gt;predict()&lt;/code&gt;, it will use the original data used to fit the model.&lt;/p&gt;
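&lt;p&gt;As a quick side note (this example is mine, not part of the original comparison), supplying a &lt;code&gt;newdata&lt;/code&gt; argument makes &lt;code&gt;predict()&lt;/code&gt; score those rows instead:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Predict only for the first three rows of the data
predict(logR, newdata = dat[1:3, ], type = &amp;quot;response&amp;quot;)&lt;/code&gt;&lt;/pre&gt;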
&lt;p&gt;&lt;strong&gt;1. Response&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;response&lt;/code&gt; type is identical to &lt;code&gt;fitted()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;response&amp;quot;) %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can confirm this as below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all.equal(fitted(logR), predict(logR, type = &amp;quot;response&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thus, &lt;code&gt;fitted()&lt;/code&gt; and &lt;code&gt;predict(type = &#34;response&#34;)&lt;/code&gt; give us predicted probabilities on the scale of the response variable. The first value can be interpreted as: the probability of Gp2 (Gp1 being the reference group) for the first observation is 0.41.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Link&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt; gives us predicted probabilities on the logit scale or log odds prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;link&amp;quot;) %&amp;gt;% head() #similar to predict(logR)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, the log odds prediction of Gp2 for the first observation is -0.37. We can get the same values if we apply a &lt;a href=&#34;https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function&#34;&gt;link function&lt;/a&gt; to the fitted values.&lt;/p&gt;
&lt;p&gt;The link function for logistic regression is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
ln(\frac{\mu}{1 - \mu})
\]&lt;/span&gt;
So, we apply this link function to the fitted values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logOddsProb &amp;lt;- log(fitted(logR) / (1 - fitted(logR))) 
head(logOddsProb)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further confirm this as we did previously.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all.equal(logOddsProb, predict(logR, type = &amp;quot;link&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we can conclude that &lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt; gives us the fitted values on the scale of the link function (the log odds).&lt;/p&gt;
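&lt;p&gt;As a quick check (my addition), applying the inverse link - the logistic function, &lt;code&gt;plogis()&lt;/code&gt; in base R - to the link-scale predictions should recover the fitted probabilities:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The inverse logit maps log odds back to probabilities
all.equal(plogis(predict(logR, type = &amp;quot;link&amp;quot;)), fitted(logR))&lt;/code&gt;&lt;/pre&gt;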
&lt;p&gt;&lt;strong&gt;3. Terms&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we have &lt;code&gt;predict(type = &#34;terms&#34;)&lt;/code&gt;. This type gives us a matrix of fitted values for each variable and each observation in the model, on the scale of the linear predictor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;terms&amp;quot;) %&amp;gt;% head() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1   0.07988782  0.28070682    0.4819893  -0.2736677 -0.9178543
## 2   0.10138230 -0.03635661    0.4819893  -0.2736677 -0.9178543
## 3   0.12287679  0.09046877    0.5024299  -0.2736677 -0.9178543
## 4   0.13362403  0.02705608    0.4615487  -0.2736677 -0.9178543
## 5   0.09063506  0.34411951    0.4819893  -0.2736677 -0.9178543
## 6   0.04764610  0.53435757    0.4206675  -0.2188976 -0.9178543&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, if we add up the values of the first observation and the constant (or intercept), we will get the same value as the log odds prediction (&lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predTerm &amp;lt;- predict(logR, type = &amp;quot;terms&amp;quot;)
sum(predTerm[1, ], attr(predTerm, &amp;quot;constant&amp;quot;)) #add up the first observation and the constant&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logOddsProb[1]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1 
## -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These values are also the same as those we get by calculating manually using the coefficients from &lt;code&gt;summary()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
LogOdds(Gp2) = \beta_0 + \beta_1(Sepal.Length) + \beta_2(Sepal.Width) + \\
\beta_3(Petal.Length) + \beta_4(Petal.Width) + \beta_5(Species)
\]&lt;/span&gt;
So, this is the value we get for the first observation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coef(logR)[1] + coef(logR)[2]*dat$Sepal.Length[1] + coef(logR)[3]*dat$Sepal.Width[1] + coef(logR)[4]*dat$Petal.Length[1] + coef(logR)[5]*dat$Petal.Width[1] + 0 #setosa species&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept) 
##   -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, in &lt;code&gt;predict(type = &#34;terms&#34;)&lt;/code&gt; the values are &lt;a href=&#34;https://www.statology.org/center-data-in-r/&#34;&gt;centered&lt;/a&gt;, thus we get different values for the constant/intercept and for &lt;span class=&#34;math inline&#34;&gt;\(\beta_1(Sepal.Length)\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\beta_2(Sepal.Width)\)&lt;/span&gt;, and so on. For example, the intercept values from the two approaches are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Intercept/constant from predict(type = &amp;quot;terms&amp;quot;)
attr(predTerm, &amp;quot;constant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.02537694&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Intercept/constant from summary()
coef(logR)[1]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept) 
##   -1.814251&lt;/code&gt;&lt;/pre&gt;
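&lt;p&gt;One way to see where this constant comes from (a sanity check of my own): since each centered term averages to zero over the data, the constant should equal the mean of the link-scale predictions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Centered terms have mean zero, so the constant is the mean linear predictor
mean(predict(logR, type = &amp;quot;link&amp;quot;))&lt;/code&gt;&lt;/pre&gt;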
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/a/12201502/11215767&#34; class=&#34;uri&#34;&gt;https://stackoverflow.com/a/12201502/11215767&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/a/47854088/11215767&#34; class=&#34;uri&#34;&gt;https://stackoverflow.com/a/47854088/11215767&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>My first interactive map with {leaflet}</title>
      <link>https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/</link>
      <pubDate>Sun, 28 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/pymjs/pym.v1.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/widgetframe-binding/widgetframe.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have tried creating a map with ggplot2 &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;previously&lt;/a&gt;. In this post, I will try to create an interactive map using &lt;code&gt;leaflet&lt;/code&gt; package in R.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(tidygeocoder)
library(leaflet)
library(htmltools)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, I’m going to use clinic location data for Malaysia. I have already uploaded this data to my &lt;a href=&#34;https://github.com/tengku-hanis/clinic-data&#34;&gt;GitHub repo&lt;/a&gt;. I will skip the explanation of the pre-processing part, as it is the same pre-processing as in my &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;previous post&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Read the data
clinic1m &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinic1m.csv&amp;quot;)
clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;details&gt;
&lt;summary&gt;
Show code for pre-processing
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the missing coordinate based on postal codes
clinic1m2 &amp;lt;- 
  clinic1m %&amp;gt;%
  mutate(country = &amp;quot;malaysia&amp;quot;) %&amp;gt;% 
  select(name, postcode, country) %&amp;gt;% 
  mutate(postcode = ifelse(nchar(postcode) == 4, paste0(0, postcode), postcode)) %&amp;gt;%
  geocode(postalcode = postcode, country = country, method = &amp;quot;osm&amp;quot;)

# Add coordinate from external sources for the still missing coordinates
add_coord &amp;lt;- 
  read.table(header = T, text = &amp;quot;
postal_code    latitude   longitude
16070            6.0334    102.3499
26060            3.6228    102.3926
90700            5.8456    118.0571
26060            3.6228    102.3926&amp;quot;)

# Fill in the added coordinates, then drop clinics still missing one
clinic1m2 &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(lat = ifelse(postcode %in% add_coord$postal_code, add_coord$latitude, lat), 
         long = ifelse(postcode %in% add_coord$postal_code, add_coord$longitude, long)) %&amp;gt;% 
  drop_na() #drop 2 clinic1m

# Bind the 2 data
all_clinic &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(Type = &amp;quot;1Malaysia&amp;quot;) %&amp;gt;% 
  select(name, Type, lat, long) %&amp;gt;% 
  bind_rows(clinicDesa %&amp;gt;% 
              mutate(Type = &amp;quot;Desa&amp;quot;, 
                     lat = latitude, 
                     long = longitude) %&amp;gt;% 
              select(name, Type, lat, long)) %&amp;gt;% 
  mutate(name = str_to_title(name))&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;First, we are going to plot the coordinates to see if there is anything strange.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(all_clinic, aes(long, lat, color = Type)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, we are going to remove the two isolated points as seen from the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic2 &amp;lt;- all_clinic %&amp;gt;% filter(long &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have our data ready, we can supply it to &lt;code&gt;leaflet&lt;/code&gt;. We can choose the type of map from &lt;code&gt;addProviderTiles()&lt;/code&gt;; some providers need an API key, but the one we choose here does not. We supply the longitude and latitude of our data to &lt;code&gt;addCircleMarkers()&lt;/code&gt;, and use &lt;code&gt;clusterOptions&lt;/code&gt; to cluster our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaflet(all_clinic2) %&amp;gt;% 
  addProviderTiles(providers$Stamen.Watercolor) %&amp;gt;%
  addProviderTiles(providers$Stamen.TerrainLabels) %&amp;gt;%
  addCircleMarkers(~long, ~lat, 
                   clusterOptions = markerClusterOptions())&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;index.en_files/figure-html//widgets/widget_unnamed-chunk-7.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Next, we can add a label.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;labels &amp;lt;- 
  sprintf(&amp;quot;&amp;lt;strong&amp;gt;%s&amp;lt;/strong&amp;gt;&amp;quot;, all_clinic2$name) %&amp;gt;% #use the filtered data to match the map
  lapply(htmltools::HTML)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we can add a mini map to our map. Here, I change the type of map to a more appropriate one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaflet(all_clinic2) %&amp;gt;% 
  addProviderTiles(providers$OpenStreetMap) %&amp;gt;%
  addCircleMarkers(~long, ~lat, popup = ~labels, # popup add the label
                   clusterOptions = markerClusterOptions()) %&amp;gt;% 
    # add a mini map
  addMiniMap(tiles = providers$OpenStreetMap, zoomLevelOffset = -3)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-2&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;index.en_files/figure-html//widgets/widget_unnamed-chunk-10.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Notice that the coordinates look more accurate as compared to the map I created with &lt;code&gt;ggplot2&lt;/code&gt; previously.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://lauriebaker.rbind.io/post/where_work/&#34; class=&#34;uri&#34;&gt;https://lauriebaker.rbind.io/post/where_work/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://laurielbaker.github.io/DSCA_leaflet_mapping_in_r/&#34; class=&#34;uri&#34;&gt;https://laurielbaker.github.io/DSCA_leaflet_mapping_in_r/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Variable selection for imputation model in {mice}</title>
      <link>https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/</link>
      <pubDate>Mon, 22 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;some-note&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some note&lt;/h2&gt;
&lt;p&gt;I have written a &lt;a href=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/&#34;&gt;short post&lt;/a&gt; about missing data and multiple imputation in &lt;code&gt;mice&lt;/code&gt; package previously. This post will add to that previous post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;imputation-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Imputation model&lt;/h2&gt;
&lt;p&gt;The imputation model is the model that we use for our imputation approach. There is another term, the complete-data model, which is the model that we want to fit after we impute the missing values (i.e., the complete-data model is the final model).&lt;/p&gt;
&lt;p&gt;Generally, we need to include as many relevant variables as possible in the imputation model. However, this general advice may not be very efficient, as we may face multicollinearity and computational issues if we include too many predictors. As a rule of thumb, the number of included variables should be no more than 15-20. &lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34;&gt;van Buuren &lt;em&gt;et al.&lt;/em&gt; (2011)&lt;/a&gt; mentioned that the increase in explained variance in linear regression is negligible after 15 variables are included.&lt;/p&gt;
&lt;p&gt;There are 4 steps suggested by &lt;a href=&#34;https://stefvanbuuren.name/publications/Flexible%20multivariate%20-%20TNO99054%201999.pdf&#34;&gt;van Buuren &lt;em&gt;et al.&lt;/em&gt; (1999)&lt;/a&gt; for variable selection in the case of big data:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Include all variables that appear in the complete-data model (final model)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This may include the interaction terms as well (passive imputation can be used to specify the interaction terms in &lt;code&gt;mice&lt;/code&gt; package)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include variable that have influence on the occurrence of the missing data&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be assessed by a correlation matrix between NAs variables and non-NAs variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include variable that explain a considerable amount of variance&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be crudely assessed by a correlation matrix between NAs variables and non-NAs variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remove variable that have too many missing values within the subgroup of incomplete cases&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be assessed by a proportion of usable cases (PUC) - how many cases with missing data in a certain variable have an observed values on the predictor variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All these steps should be done on the key variables only. There is another more efficient yet laborious approach suggested by &lt;a href=&#34;https://stefvanbuuren.name/publications/Flexible%20multiple%20-%20TNO99045%201999.pdf&#34;&gt;Oudshoorn &lt;em&gt;et al.&lt;/em&gt; (1999)&lt;/a&gt;, which takes into account the important predictors of the predictors themselves. We are going to focus on the four steps above and will not cover the latter approach in this post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-codes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R codes&lt;/h2&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)
library(corrplot)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA&amp;#39;s   :37       NA&amp;#39;s   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 2 variables, Ozone and Solar.R, with missing values or NAs. We can further explore the pattern of missingness.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;md.pattern(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##     Wind Temp Month Day Solar.R Ozone   
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 2 rows with NAs in both Ozone and Solar.R, 35 rows with NAs only in Ozone, and 5 rows with NAs only in Solar.R. Next, we can check the correlation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(airquality, use = &amp;quot;pairwise.complete.obs&amp;quot;) |&amp;gt;
  corrplot(method = &amp;quot;number&amp;quot;, type = &amp;quot;upper&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The correlations of Ozone-Temp and Ozone-Wind are the highest. Now, let’s do a correlation between the NAs variable and non-NAs variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(y = airquality, x = !is.na(airquality), use = &amp;quot;pairwise.complete.obs&amp;quot;) |&amp;gt;
  round(digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R  Wind Temp Month   Day
## Ozone      NA   -0.02 -0.05 0.00  0.26 -0.05
## Solar.R     0      NA  0.06 0.11  0.11  0.17
## Wind       NA      NA    NA   NA    NA    NA
## Temp       NA      NA    NA   NA    NA    NA
## Month      NA      NA    NA   NA    NA    NA
## Day        NA      NA    NA   NA    NA    NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can ignore the warnings and the NAs, as only Ozone and Solar.R have missing values. So, the highest correlation is 0.26 for Month-Ozone - the correlation between the values of Month and the indicator of whether Ozone is observed. In this correlation matrix, the row variables are the missingness indicators (from &lt;code&gt;!is.na()&lt;/code&gt;) and the column variables are the observed values.&lt;/p&gt;
&lt;p&gt;Lastly, we can calculate the PUC (proportion of usable cases) ‘manually’. &lt;code&gt;md.pairs()&lt;/code&gt; here calculates the number of observations per variable pair.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;var_pair &amp;lt;- md.pairs(airquality)
round(var_pair$mr / (var_pair$mr + var_pair$mm), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone   0.000   0.946    1    1     1   1
## Solar.R 0.714   0.000    1    1     1   1
## Wind      NaN     NaN  NaN  NaN   NaN NaN
## Temp      NaN     NaN  NaN  NaN   NaN NaN
## Month     NaN     NaN  NaN  NaN   NaN NaN
## Day       NaN     NaN  NaN  NaN   NaN NaN&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A low PUC value indicates there is little information in the predictor to impute the target NA variable. NaN is shown where the variables have no missing values. The row variables are the target variables to be imputed, and the column variables are the predictors in the imputation model. We can see that to impute Solar.R (on the row), Ozone has a little less information (0.714) compared to Wind, Temp, and Day. The diagonal elements will always be 0 or NaN. So, from here we can drop predictors with, say, 0 PUC, as they contain no information to help impute the target NA variable.&lt;/p&gt;
&lt;p&gt;Actually, we have a nice function from &lt;code&gt;mice&lt;/code&gt; that can do what we ‘manually’ did just now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;quickpred(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, the column variables are the predictors, and the row variables are the target NA variables. The above matrix is known as a predictor matrix, which is going to be used in the imputation model. A 1 denotes that a variable is included as a predictor, and a 0 that it is excluded. The two main arguments in &lt;code&gt;quickpred()&lt;/code&gt; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;mincor - if any of the absolute correlations from the two correlation matrices that we computed earlier exceeds 0.1 (the default), the predictor will be included in the predictor matrix&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;minpuc - the default minimum value for the PUC is 0, so predictors are retained even if they have no information to help the imputation model&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
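&lt;p&gt;As a small sketch (the threshold values here are arbitrary, chosen only for illustration), raising both arguments makes the selection stricter:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Require an absolute correlation above 0.2 and a PUC above 0.5
quickpred(airquality, mincor = 0.2, minpuc = 0.5)&lt;/code&gt;&lt;/pre&gt;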
&lt;p&gt;Notice that the variable Day is excluded from the predictors of Ozone. Its correlation values are 0 and -0.05 in the first and second correlation matrices, respectively, which do not exceed the default threshold of 0.1; that is why Day is excluded. We can observe a similar situation for the variable Wind, which is excluded from the predictors of Solar.R (the correlation coefficients are -0.06 and 0.06). The negative (-) sign does not matter as we actually evaluate the absolute values.&lt;/p&gt;
&lt;p&gt;Intuitively, we can change these two arguments as we see fit to do a variable selection for imputation model. Once we finalise our variable selection, we can do the multiple imputation using &lt;code&gt;mice()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalised variable selection
var_sel &amp;lt;- quickpred(airquality)
var_sel&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Impute
imp &amp;lt;- mice(airquality, m = 5, predictorMatrix = var_sel, printFlag = F)
imp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day 
##   &amp;quot;pmm&amp;quot;   &amp;quot;pmm&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot; 
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that &lt;code&gt;mice()&lt;/code&gt; uses the predictor matrix that we provide.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34; class=&#34;uri&#34;&gt;https://www.jstatsoft.org/article/view/v045i03&lt;/a&gt; - paper written by Stef van Buuren (a bit outdated in terms of code, but runnable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/&#34; class=&#34;uri&#34;&gt;https://stefvanbuuren.name/fimd/&lt;/a&gt; - online book written by Stef van Buuren (See chapter 6.3.2 and 9.1.6)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Making maps with R (my first attempt ever!)</title>
      <link>https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/</link>
      <pubDate>Fri, 12 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;As written in the title of the post, this is my first try ever at making a map with R. I found a great dataset on the distribution of clinics in Malaysia. The two types of clinics that we have here are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Klinik 1Malaysia (1Malaysia clinic)&lt;/li&gt;
&lt;li&gt;Klinik Desa (Desa clinic)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Originally, these were two separate datasets. Both can be downloaded from &lt;a href=&#34;https://www.data.gov.my/data/ms_MY/group/pemetaan&#34;&gt;here&lt;/a&gt;. I have also uploaded them to my &lt;a href=&#34;https://github.com/tengku-hanis/clinic-data&#34;&gt;GitHub repo&lt;/a&gt; for those interested. The Klinik Desa data has latitude and longitude information, but the Klinik 1Malaysia data does not.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rworldmap) #to get a Malaysia map
library(tidyverse)
library(tidygeocoder) #to get latitude and longitude&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinic1m.csv&amp;quot;)
clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we need to get latitude and longitude information for the Klinik 1Malaysia data. We are going to retrieve the coordinates based on the postal code, though this is not very accurate. We can use &lt;code&gt;tidygeocoder&lt;/code&gt; for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m2 &amp;lt;- 
  clinic1m %&amp;gt;%
  mutate(country = &amp;quot;malaysia&amp;quot;) %&amp;gt;% 
  select(name, postcode, country) %&amp;gt;% 
  mutate(postcode = ifelse(nchar(postcode) == 4, paste0(0, postcode), postcode)) %&amp;gt;%
  geocode(postalcode = postcode, country = country, method = &amp;quot;osm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
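As an aside on the `paste0(0, postcode)` line above: Malaysian postal codes have five digits, and reading the CSV as numeric drops any leading zero. A minimal base R sketch of the same padding idea, with invented postcodes (not from the clinic data):

```r
# Invented example postcodes read as numbers, so leading zeros are lost
postcode <- c(5050, 43000, 1000)

# Pad every code to 5 characters with leading zeros
# (formatC is a base R alternative to the ifelse()/paste0() fix above)
padded <- formatC(postcode, width = 5, format = "d", flag = "0")
padded
# → "05050" "43000" "01000"
```

Unlike the `ifelse()` version, which only handles 4-character codes, this also repairs codes that lost more than one leading zero.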
&lt;p&gt;Checking the data further, we notice that 5 clinics have no coordinate information.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m2 %&amp;gt;% filter(is.na(lat) | is.na(long))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 5
##   name                                     postcode country    lat  long
##   &amp;lt;chr&amp;gt;                                    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Klinik 1 Malaysia Bandar Lela            90700    malaysia    NA    NA
## 2 Klinik 1 Malaysia Batu Melintang         17250    malaysia    NA    NA
## 3 Klinik 1 Malaysia Cakerapurnama          45010    malaysia    NA    NA
## 4 Klinik 1 Malaysia Jelawat                16070    malaysia    NA    NA
## 5 Klinik 1 Malaysia Taman Kempadang Makmur 26060    malaysia    NA    NA&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;some-data-pre-processing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some data pre-processing&lt;/h2&gt;
&lt;p&gt;After some googling, I found this &lt;a href=&#34;https://www.listendata.com/2020/11/zip-code-to-latitude-and-longitude.html&#34;&gt;data&lt;/a&gt;, which gives coordinates based on the postal code. So, we are going to fill in the missing coordinates using this online data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;add_coord &amp;lt;- 
  read.table(header = T, text = &amp;quot;
postal_code    latitude   longitude
16070            6.0334    102.3499
26060            3.6228    102.3926
90700            5.8456    118.0571
26060            3.6228    102.3926&amp;quot;)

clinic1m2 &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(lat = ifelse(is.na(lat), add_coord$latitude[match(postcode, add_coord$postal_code)], lat), 
         long = ifelse(is.na(long), add_coord$longitude[match(postcode, add_coord$postal_code)], long)) %&amp;gt;% 
  drop_na() #fill NAs by matching each postcode, then drop 2 clinic1m&lt;/code&gt;&lt;/pre&gt;
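One way to fill coordinates from a lookup table row by row is base R's `match()`, which lines each postcode up with its own row in the table. A minimal sketch with invented values (not the clinic data):

```r
# Invented lookup table of postcode -> latitude
lookup <- data.frame(postal_code = c("16070", "90700"),
                     latitude    = c(6.0334, 5.8456))

lat      <- c(NA, 3.1, NA)                 # observed latitudes, two missing
postcode <- c("16070", "50000", "99999")   # postcodes of the same rows

# match() gives, for each postcode, its row index in the lookup table (NA if absent)
idx        <- match(postcode, lookup$postal_code)
lat_filled <- ifelse(is.na(lat), lookup$latitude[idx], lat)
lat_filled  # 6.0334, then the untouched 3.1, then NA (no match for "99999")
```

Rows whose postcode is not in the lookup table stay `NA` and can then be dropped, as in the post.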
&lt;p&gt;Even after adding in the missing coordinates, we are still missing 2 of them. So, we are going to drop those 2 clinics. Next, we combine both datasets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(Type = &amp;quot;1Malaysia&amp;quot;) %&amp;gt;% 
  select(Type, lat, long) %&amp;gt;% 
  bind_rows(clinicDesa %&amp;gt;% 
              mutate(Type = &amp;quot;Desa&amp;quot;, 
                     lat = latitude, 
                     long = longitude) %&amp;gt;% 
              select(Type, lat, long))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s try plotting the data first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(all_clinic, aes(long, lat, color = Type)) +
  geom_point() +
  theme_minimal() #should remove the two isolated points&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have 2 isolated points from Klinik Desa data. We will drop these 2 points as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic2 &amp;lt;- all_clinic %&amp;gt;% filter(long &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;plotting-the-map&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plotting the map&lt;/h2&gt;
&lt;p&gt;There are 2 ways to plot our data on a map of Malaysia that we are going to cover in this post.&lt;/p&gt;
&lt;div id=&#34;map-from-ggplot2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1) map from &lt;code&gt;ggplot2&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;First, we need to get the map.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;global &amp;lt;- map_data(&amp;quot;world&amp;quot;) #get map&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have retrieved the map, we need to filter the region to Malaysia. The rest of the code is the &lt;code&gt;ggplot2&lt;/code&gt; functions as we know them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + 
  geom_polygon(data = global %&amp;gt;% filter(region == &amp;quot;Malaysia&amp;quot;), aes(x=long, y = lat, group = group), 
               fill = &amp;quot;gray85&amp;quot;) + 
  coord_fixed(1.3) +
  geom_point(data = all_clinic2, aes(x = long, y = lat, group = Type, color = Type, shape = Type)) +
  theme_void() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Klinik 1Malaysia dan Klinik Desa di Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data dikemaskini: Klinik 1Malaysia - 16 Mac 2021, Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;))), 
       color = &amp;quot;Jenis klinik:&amp;quot;, 
       shape = &amp;quot;Jenis klinik:&amp;quot;) +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = &amp;quot;bottom&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;map-from-rworldmap&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2) map from &lt;code&gt;rworldmap&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The flow is similar: we get the map first, then restrict it to the Malaysia region.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;world &amp;lt;- getMap(resolution = &amp;quot;low&amp;quot;) #get map
msia &amp;lt;- world[world@data$ADMIN == &amp;quot;Malaysia&amp;quot;, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rest of the code is similar to the first approach, but we are going to change the theme a bit.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() +
  geom_polygon(data = msia, aes(x = long, y = lat, group = group), fill = NA, colour = &amp;quot;black&amp;quot;) +
  geom_point(data = all_clinic2, aes(x = long, y = lat, group = Type, color = Type, shape = Type)) +
  coord_quickmap() + 
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Klinik 1Malaysia dan Klinik Desa di Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data dikemaskini: Klinik 1Malaysia - 16 Mac 2021, Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;))), 
       color = &amp;quot;Jenis klinik:&amp;quot;, 
       shape = &amp;quot;Jenis klinik:&amp;quot;) +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The coordinates that we have are not as accurate as they should be, or maybe there is something I missed along the way. As we can see, we have clinics in the ocean; as far as I know, we Malaysians are not that advanced yet. Also, notice that we are severely lacking clinics in Sarawak, assuming our data is correct.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Some COVID-19 plots for Southeast Asian countries</title>
      <link>https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/</link>
      <pubDate>Wed, 10 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Recently, I found a GitHub &lt;a href=&#34;https://github.com/owid/covid-19-data/tree/master/public/data&#34;&gt;repo&lt;/a&gt; containing a global COVID-19 dataset. I thought, why not try to do some plotting for Southeast Asian countries. So, I downloaded the data and limited the data to Southeast Asian countries only (Brunei, Indonesia, Malaysia, Philippines, Singapore, Thailand and Vietnam). I have uploaded this restricted data to my GitHub &lt;a href=&#34;https://github.com/tengku-hanis/data-owid-covid&#34;&gt;repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are not going to do anything fancy, just some visualisations.&lt;/p&gt;
&lt;p&gt;Let’s begin by reading the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
covid_sea &amp;lt;- read_csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/data-owid-covid/main/covid_sea.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to compare the Southeast Asian countries in terms of:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Daily cases&lt;/li&gt;
&lt;li&gt;Daily deaths&lt;/li&gt;
&lt;li&gt;Daily tests&lt;/li&gt;
&lt;li&gt;Daily vaccinations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Before that, we need to write a function, as all the items above share the same plotting code except for the y axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot &amp;lt;- function(var1, lab_title, yaxis_lab, span = 0.14){
  covid_sea %&amp;gt;% 
    select(date, location, {{var1}}) %&amp;gt;% 
    drop_na() %&amp;gt;% 
    ggplot(aes(date, {{var1}}, color = location)) +
    geom_smooth(se = F, span = span) + #use the span argument instead of a hard-coded value
    geom_point(aes(color = location), alpha = 0.2) +
    geom_line(aes(color = location), alpha = 0.2, linetype = &amp;quot;dashed&amp;quot;) +
    labs(title = {{lab_title}}) +
    ylab({{yaxis_lab}}) +
    xlab(&amp;quot;Date&amp;quot;) +
    theme_minimal() 
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;var1&lt;/code&gt; is the variable that we want to compare, &lt;code&gt;lab_title&lt;/code&gt; is the plot title, &lt;code&gt;yaxis_lab&lt;/code&gt; is the label on the y axis, and &lt;code&gt;span&lt;/code&gt; controls how smooth the smoothed line should be.&lt;/p&gt;
&lt;div id=&#34;daily-cases&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily cases&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_cases, &amp;quot;Daily cases for southeast Asian countries&amp;quot;, &amp;quot;Daily cases&amp;quot;, span = 0.8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We cannot compare the raw frequencies, as big countries like Indonesia are expected to have a higher number of daily cases. A smoothed line, though very basic, may indicate a simple trend. Thailand, Malaysia, the Philippines and Indonesia seem to have a decreasing trend of cases. On the other hand, the daily cases in Vietnam seem to be starting to increase. Singapore had a more stable trend of cases, though a higher number of cases was observed in the latest period. Lastly, Brunei had too few cases for us to see any sort of trend at the scale of a between-country comparison.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-deaths&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily deaths&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_deaths, &amp;quot;Daily deaths for southeast Asian countries&amp;quot;, &amp;quot;Daily deaths&amp;quot;, span = 0.8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The Philippines and Indonesia seem to have started a slightly increasing trend. Other countries look okay.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-tests&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily tests&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_tests, &amp;quot;Daily tests for southeast Asian countries&amp;quot;, &amp;quot;Daily tests&amp;quot;, span = 0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The daily tests plot looks a bit weird for Vietnam. The daily tests below zero are actually not available (I am not sure whether no tests were done in that period or the values are just missing), hence the weird-looking plot for Vietnam. Data for Brunei and Thailand are not available. Malaysia seems to be quite aggressive in COVID-19 testing, even on par with Indonesia. Also, Vietnam seems to be very aggressive in the latest period, probably to cover the lack of COVID-19 testing previously.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-vaccinations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily vaccinations&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_vaccinations, &amp;quot;Daily vaccinations for southeast Asian countries&amp;quot;, &amp;quot;Daily vaccinations&amp;quot;, span = 0.9)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Malaysia and Singapore had quite similar distributions. Vietnam, the Philippines, Thailand and Indonesia were also quite similar to each other, in that they had a series of waves in the rate of vaccinations, though the waves for Thailand are less obvious. Again, the numbers in Brunei were too small for us to see any trend or distribution at this scale.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;malaysia-situation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Malaysia situation&lt;/h2&gt;
&lt;p&gt;Let’s do a plot specific to Malaysia. We are going to scale the numbers so that we are able to compare the trends and distributions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covid_sea %&amp;gt;% 
  filter(location == &amp;quot;Malaysia&amp;quot;) %&amp;gt;% 
  mutate(new_cases = scale(new_cases), 
         new_deaths = scale(new_deaths), 
         new_tests = scale(new_tests), 
         new_vaccinations = scale(new_vaccinations)) %&amp;gt;% 
  ggplot(aes(date)) +
  geom_line(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), alpha = 0.3) +
  geom_smooth(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), se = F, span = 0.6) +
  labs(title = &amp;quot;Situation in Malaysia&amp;quot;) +
  ylab(&amp;quot;Scaled Frequency&amp;quot;) +
  xlab(&amp;quot;Date&amp;quot;) +
  guides(color = guide_legend(&amp;quot;Items&amp;quot;)) +
  scale_color_discrete(labels = c(&amp;quot;Daily cases&amp;quot;, &amp;quot;Daily deaths&amp;quot;, &amp;quot;Daily tests&amp;quot;, &amp;quot;Daily vaccinations&amp;quot;)) +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, as the number of vaccinations increased up to a certain threshold, the numbers of daily cases and daily deaths started to decrease. The daily testing also decreased, as COVID-19 testing in Malaysia is done on suspected cases and their contacts rather than by mass testing.&lt;/p&gt;
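As a design note on the plot above: the repeated `geom_line()`/`geom_point()`/`geom_smooth()` calls can be avoided by reshaping the data to long format first, so one call of each geom handles all the series via the colour aesthetic. A base R sketch of the reshape, with invented values and only two of the four columns shown:

```r
# Toy wide data in the shape of the COVID dataset (values invented)
dat <- data.frame(
  date       = as.Date("2021-01-01") + 0:2,
  new_cases  = c(10, 20, 30),
  new_deaths = c(1, 2, 3)
)

# Stack the value columns into one column, with an 'item' label per row
long <- reshape(
  dat,
  direction = "long",
  varying   = c("new_cases", "new_deaths"),
  v.names   = "value",
  timevar   = "item",
  times     = c("new_cases", "new_deaths")
)

# Each date now appears once per item; in ggplot2 this becomes
# aes(date, value, colour = item) with a single geom_line()/geom_smooth()
head(long)
```

Since the tidyverse is already loaded in this post, `tidyr::pivot_longer()` would do the same reshape more idiomatically.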
&lt;p&gt;&lt;em&gt;Disclaimer: Please take anything written here with a massive grain of salt.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Data source:
&lt;a href=&#34;https://github.com/owid/covid-19-data/tree/master/public/data&#34; class=&#34;uri&#34;&gt;https://github.com/owid/covid-19-data/tree/master/public/data&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Extract a table from a pdf</title>
      <link>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</link>
      <pubDate>Mon, 01 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In a couple of days, I am going to conduct a pre-conference workshop for the Malaysian &lt;a href=&#34;https://www.r-conference.com/&#34;&gt;R conference 2021&lt;/a&gt;. Some of the data that I am going to use for this workshop is available as a table in a pdf. Hence, this post is about how I got that particular table from the pdf into R for further analysis.&lt;/p&gt;
&lt;p&gt;So, this is the table we are going to extract.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;images/table.png&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;extracting-a-table-from-pdf&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting a table from pdf&lt;/h2&gt;
&lt;p&gt;We are going to use the &lt;code&gt;tabulizer&lt;/code&gt; package for this. However, not every pdf works with this package. In our case it works, but the result needs further preprocessing.&lt;/p&gt;
&lt;p&gt;Load the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tabulizer)
library(dplyr)
library(stringr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read a table from a pdf.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table &amp;lt;- extract_tables(&amp;quot;https://static-content.springer.com/esm/art%3A10.1038%2Fs41440-021-00720-3/MediaObjects/41440_2021_720_MOESM1_ESM.pdf&amp;quot;, 
                          pages = 17, 
                          output = &amp;quot;data.frame&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, this is the extracted table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table[[1]] %&amp;gt;% head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                X     X.1     X.2     X.3  X.4     X.5 X.6     X.7  X.8
## 1                                                                     
## 2                                                                     
## 3    Ahmed, 2019 Unclear Unclear Unclear High Unclear Low Unclear High
## 4                                                                     
## 5   Badrov, 2013 Unclear    High    High High Unclear Low Unclear High
## 6   Baross, 2012 Unclear Unclear    High High Unclear Low Unclear High
## 7   Baross, 2013 Unclear Unclear    High High Unclear Low Unclear High
## 8  Carlson, 2016     Low    High    High  Low Unclear Low     Low High
## 9  Correia, 2020     Low     Low     Low High Unclear Low     Low High
## 10                                                                    
##                              X.9
## 1      1- selection bias: random
## 2            sequence generation
## 3  2- selection bias: allocation
## 4                    concealment
## 5                               
## 6   3- reporting bias: selective
## 7                      reporting
## 8                               
## 9  4- Performance bias: blinding
## 10  (participants and personnel)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, a few preprocessing steps are needed:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Remove column X.9 - this column is supposed to be part of the header&lt;/li&gt;
&lt;li&gt;Rename the headers based on column X.9&lt;/li&gt;
&lt;li&gt;Remove the space in the author names - “Ahmed,2019” instead of “Ahmed, 2019”&lt;/li&gt;
&lt;li&gt;Remove empty rows&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob &amp;lt;- 
  raw_table[[1]] %&amp;gt;% 
  select(-X.9) %&amp;gt;%  
  rename(Study = X, 
         Random.sequence.generation. = X.1, 
         Allocation.concealment. = X.2,
         Selective.reporting. = X.3,
         Blinding.of.participants.and.personnel. = X.4, 
         Blinding.of.outcome.assessment = X.5, 
         Incomplete.outcome.data = X.6, 
         Other.sources.of.bias. = X.7, 
         Overall = X.8) %&amp;gt;% 
  as_tibble() %&amp;gt;% 
  mutate(Study = str_replace_all(Study, &amp;quot; &amp;quot;, &amp;quot;&amp;quot;)) %&amp;gt;% 
  mutate(id_del = str_match(Study, &amp;quot;.&amp;quot;)) %&amp;gt;% 
  filter(!is.na(id_del)) %&amp;gt;% 
  select(-id_del)&lt;/code&gt;&lt;/pre&gt;
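For reference, the empty-row filter at the end (the `id_del` helper) relies on `str_match()` returning `NA` for an empty string. Base R's `nzchar()` expresses the same idea more directly; a sketch with invented rows:

```r
# Toy data frame with the kind of blank rows the PDF extraction produces
df <- data.frame(Study   = c("Ahmed,2019", "", "Badrov,2013", ""),
                 Overall = c("High", "", "High", ""))

# Keep only rows whose Study cell is a non-empty string
df_clean <- df[nzchar(df$Study), ]
df_clean$Study
# → "Ahmed,2019" "Badrov,2013"
```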
&lt;p&gt;Finally, our data is ready.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          Study Random.sequence.generation. Allocation.concealment.
## 1   Ahmed,2019                     Unclear                 Unclear
## 2  Badrov,2013                     Unclear                    High
## 3  Baross,2012                     Unclear                 Unclear
## 4  Baross,2013                     Unclear                 Unclear
## 5 Carlson,2016                         Low                    High
##   Selective.reporting. Blinding.of.participants.and.personnel.
## 1              Unclear                                    High
## 2                 High                                    High
## 3                 High                                    High
## 4                 High                                    High
## 5                 High                                     Low
##   Blinding.of.outcome.assessment Incomplete.outcome.data Other.sources.of.bias.
## 1                        Unclear                     Low                Unclear
## 2                        Unclear                     Low                Unclear
## 3                        Unclear                     Low                Unclear
## 4                        Unclear                     Low                Unclear
## 5                        Unclear                     Low                    Low
##   Overall
## 1    High
## 2    High
## 3    High
## 4    High
## 5    High&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A short note on multiple imputation</title>
      <link>https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/</link>
      <pubDate>Fri, 29 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Missing data is quite challenging to deal with. Deleting it may be the easiest solution, but not necessarily the best one. Missing data can be categorised into 3 types (&lt;a href=&#34;https://www.jstor.org/stable/2335739&#34;&gt;Rubin, 1976&lt;/a&gt;):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;MCAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing Completely At Random&lt;/li&gt;
&lt;li&gt;Example: some of the observations are missing because the records were lost during a flood&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing At Random&lt;/li&gt;
&lt;li&gt;Example: the income variable is missing for some participants who refuse to give their salary information, which they deem very personal&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MNAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing Not At Random&lt;/li&gt;
&lt;li&gt;Example: the weight variable is missing for morbidly obese participants because the scale is unable to weigh them&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
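The three mechanisms can be mimicked in a small simulation, which also makes the definitions concrete. A base R sketch (variable names and probabilities are invented for illustration):

```r
set.seed(1)
n      <- 1000
age    <- rnorm(n, mean = 40, sd = 10)
income <- rnorm(n, mean = 3000, sd = 500)

# MCAR: every value has the same 10% chance of going missing
income_mcar <- ifelse(runif(n) < 0.1, NA, income)

# MAR: missingness depends on another *observed* variable (older -> more refusals)
income_mar <- ifelse(runif(n) < plogis((age - 40) / 10), NA, income)

# MNAR: missingness depends on the unobserved value itself (high earners withhold)
income_mnar <- ifelse(income > 3500, NA, income)

c(MCAR = mean(is.na(income_mcar)),
  MAR  = mean(is.na(income_mar)),
  MNAR = mean(is.na(income_mnar)))
```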
&lt;p&gt;Out of the 3 types above, the most problematic is MNAR, though there exist methods to deal with this type. For example, the &lt;a href=&#34;https://cran.r-project.org/web/packages/miceMNAR/miceMNAR.pdf&#34;&gt;miceMNAR&lt;/a&gt; package in R.&lt;/p&gt;
&lt;p&gt;There are several approaches to handling missing data:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Listwise-deletion&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Best approach if the amount of missingness is very small&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using mean/median/mode imputation&lt;/li&gt;
&lt;li&gt;This approach is not advisable as it leads to bias due to the reduced variance, though the mean is not affected&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Single imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The simple imputation above is considered a single imputation as well&lt;/li&gt;
&lt;li&gt;This approach ignores the uncertainty of the imputation and almost always underestimates the variance&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A bit more advanced, and it covers the limitations of the single imputation approach&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
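The claim under point 2, that mean imputation preserves the mean but shrinks the variance, is easy to demonstrate. A small base R sketch on simulated data:

```r
set.seed(42)
x      <- rnorm(200, mean = 50, sd = 10)
x_miss <- x
x_miss[sample(200, 60)] <- NA            # knock out 30% of values (MCAR)

# Replace every NA with the mean of the observed values
x_imp <- ifelse(is.na(x_miss), mean(x_miss, na.rm = TRUE), x_miss)

# The mean is (essentially) unchanged, but the spread is underestimated
round(c(mean_obs = mean(x_miss, na.rm = TRUE), mean_imp = mean(x_imp),
        sd_obs   = sd(x_miss, na.rm = TRUE),   sd_imp   = sd(x_imp)), 2)
```

The imputed values form a spike at the mean, which is exactly the reduced-variance bias mentioned above.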
&lt;p&gt;However, the main assumption of any imputation method is that the missingness is MCAR or MAR.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;multiple-imputation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Multiple imputation&lt;/h2&gt;
&lt;p&gt;In short, there are 2 approaches to multiple imputation implemented by R packages:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Joint modeling (JM) or joint multivariate normal distribution multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The main assumption for this method is that the observed data follows a multivariate normal distribution&lt;/li&gt;
&lt;li&gt;A violation of this assumption produces incorrect values, though a slight violation is still okay&lt;/li&gt;
&lt;li&gt;Some packages that implement this method: &lt;code&gt;Amelia&lt;/code&gt; and &lt;code&gt;norm&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fully conditional specification (FCS) or conditional multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Also known as multivariate imputation by chained equation (MICE)&lt;/li&gt;
&lt;li&gt;This approach is more flexible, as a distribution is assumed for each variable rather than for the whole dataset&lt;/li&gt;
&lt;li&gt;Some packages that implement this method: &lt;code&gt;mice&lt;/code&gt; and &lt;code&gt;mi&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example&lt;/h2&gt;
&lt;p&gt;In &lt;code&gt;mice&lt;/code&gt; package, the general steps are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;mice()&lt;/code&gt; - impute the NAs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with()&lt;/code&gt; - run the analysis (lm, glm, etc)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pool()&lt;/code&gt; - pool the results&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;Screenshot%202021-11-20%20145517.png&#34; alt=&#34;Main steps in mice package.&#34; width=&#34;90%&#34; height=&#34;90%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Main steps in mice package.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(mice)
library(VIM)
#library(missForest) #we only want the prodNA() function from this package
library(naniar)
library(niceFunction) #install from github (https://github.com/tengku-hanis/niceFunction)
library(dplyr)
library(gtsummary)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to introduce some NAs randomly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat &amp;lt;- iris %&amp;gt;% 
  select(-Sepal.Length)%&amp;gt;% 
  missForest::prodNA(0.2) %&amp;gt;%  # randomly insert 20% NAs
  mutate(Sepal.Length = iris$Sepal.Length)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explore the NAs and the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naniar::miss_var_summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 3
##   variable     n_miss pct_miss
##   &amp;lt;chr&amp;gt;         &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 Petal.Length     38     25.3
## 2 Sepal.Width      33     22  
## 3 Species          28     18.7
## 4 Petal.Width      21     14  
## 5 Sepal.Length      0      0&lt;/code&gt;&lt;/pre&gt;
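For reference, the same per-variable counts can be obtained in base R with `colSums()`/`colMeans()` over `is.na()`; a quick sketch on a toy data frame:

```r
# Toy data frame with some NAs
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), d = 1:3)

n_miss   <- colSums(is.na(df))        # number of NAs per column
pct_miss <- 100 * colMeans(is.na(df)) # percentage of NAs per column

rbind(n_miss, pct_miss)
```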
&lt;p&gt;Some references recommend removing variables with more than 50% NAs. Here, we purposely introduced 20% NAs into our data.&lt;/p&gt;
&lt;p&gt;As a guideline, we can test whether our NAs are MCAR.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naniar::mcar_test(dat) #p &amp;gt; 0.05, MCAR is indicated&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;            &amp;lt;int&amp;gt;
## 1      38.8    40   0.522               14&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next step is to evaluate the pattern of missingness in our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;md.pattern(dat, rotate.names = T, plot = T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##    Sepal.Length Petal.Width Species Sepal.Width Petal.Length    
## 64            1           1       1           1            1   0
## 21            1           1       1           1            0   1
## 15            1           1       1           0            1   1
## 3             1           1       1           0            0   2
## 14            1           1       0           1            1   1
## 4             1           1       0           1            0   2
## 6             1           1       0           0            1   2
## 2             1           1       0           0            0   3
## 7             1           0       1           1            1   1
## 6             1           0       1           1            0   2
## 4             1           0       1           0            1   2
## 2             1           0       1           0            0   3
## 1             1           0       0           1            1   2
## 1             1           0       0           0            1   3
##               0          21      28          33           38 120&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggr(dat, prop = F, numbers = T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have 13 patterns of NAs (the numbers on the right) in our data. These two functions work well with a small dataset, but with a larger dataset (and many more NA patterns), it is probably quite difficult to assess the patterns visually.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;matrixplot()&lt;/code&gt; is probably more appropriate for a larger dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;matrixplot(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In terms of the missingness pattern, we can also assess whether the distribution of NAs in Sepal.Width depends on the variable Sepal.Length.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;niceFunction::histNA_byVar(dat, Sepal.Width, Sepal.Length)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, the distribution and range of the histograms for the NAs (TRUE) and non-NAs (FALSE) are quite similar. This may indicate that Sepal.Width is at least MAR. However, ideally we should do this for each pair of numerical variables before jumping to any conclusion.&lt;/p&gt;
&lt;p&gt;Another good thing to assess is the correlation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Data with 1 = NAs, 0 = non-NAs
x &amp;lt;- as.data.frame(abs(is.na(dat))) %&amp;gt;% 
  dplyr::select(-Sepal.Length) # keep only the variables with NAs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Firstly, the correlation between the variables with missing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(x) %&amp;gt;% 
  corrplot::corrplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There is no high correlation among the variables with NAs. Secondly, let’s see the correlation between the NAs in a variable and the observed values of the other variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(dat %&amp;gt;% mutate(Species = as.numeric(Species)), x, use = &amp;quot;pairwise.complete.obs&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##               Sepal.Width Petal.Length  Petal.Width     Species
## Sepal.Width            NA  0.049158733 -0.065917718  0.09948263
## Petal.Length  0.042075695           NA -0.004572405 -0.17265919
## Petal.Width   0.096195805 -0.003320601           NA -0.11024288
## Species       0.045849046 -0.104143925 -0.081055707          NA
## Sepal.Length -0.006435044 -0.052871701 -0.091024799 -0.08527514&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, there is no high correlation. To interpret this correlation matrix: the rows are the observed variables and the columns represent the missingness. For example, values of Sepal.Width are slightly more likely to be missing for observations with a high value of Petal.Width (though a correlation of about 0.1 means this relationship is negligible).&lt;/p&gt;
&lt;p&gt;Now, we can do multiple imputation. These are the methods in the &lt;code&gt;mice&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;methods(mice)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] mice.impute.2l.bin       mice.impute.2l.lmer      mice.impute.2l.norm     
##  [4] mice.impute.2l.pan       mice.impute.2lonly.mean  mice.impute.2lonly.norm 
##  [7] mice.impute.2lonly.pmm   mice.impute.cart         mice.impute.jomoImpute  
## [10] mice.impute.lda          mice.impute.logreg       mice.impute.logreg.boot 
## [13] mice.impute.mean         mice.impute.midastouch   mice.impute.mnar.logreg 
## [16] mice.impute.mnar.norm    mice.impute.norm         mice.impute.norm.boot   
## [19] mice.impute.norm.nob     mice.impute.norm.predict mice.impute.panImpute   
## [22] mice.impute.passive      mice.impute.pmm          mice.impute.polr        
## [25] mice.impute.polyreg      mice.impute.quadratic    mice.impute.rf          
## [28] mice.impute.ri           mice.impute.sample       mice.mids               
## [31] mice.theme              
## see &amp;#39;?methods&amp;#39; for accessing help and source code&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, mice uses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pmm (predictive mean matching) for numeric data&lt;/li&gt;
&lt;li&gt;logreg (logistic regression imputation) for binary data, factor with 2 levels&lt;/li&gt;
&lt;li&gt;polyreg (polytomous regression imputation) for unordered categorical data (factor &amp;gt; 2 levels)&lt;/li&gt;
&lt;li&gt;polr (proportional odds model) for ordered categorical data (factor with &amp;gt; 2 levels)&lt;/li&gt;
&lt;/ul&gt;
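&lt;p&gt;A small sketch of how to inspect (and, if needed, override) these defaults before imputing; &lt;code&gt;make.method()&lt;/code&gt; returns the method &lt;code&gt;mice&lt;/code&gt; would assign to each variable:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;meth &amp;lt;- make.method(dat)  # default method per variable
meth
# To override, edit the vector and pass it to mice(), e.g.:
# meth[&amp;quot;Sepal.Width&amp;quot;] &amp;lt;- &amp;quot;norm&amp;quot;
# mice(dat, method = meth, printFlag = FALSE)&lt;/code&gt;&lt;/pre&gt;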
&lt;p&gt;Let’s run the &lt;code&gt;mice()&lt;/code&gt; function on our data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp &amp;lt;- mice(dat, m = 5, seed=1234, maxit = 5, printFlag = F) 
imp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  Sepal.Width Petal.Length  Petal.Width      Species Sepal.Length 
##        &amp;quot;pmm&amp;quot;        &amp;quot;pmm&amp;quot;        &amp;quot;pmm&amp;quot;    &amp;quot;polyreg&amp;quot;           &amp;quot;&amp;quot; 
## PredictorMatrix:
##              Sepal.Width Petal.Length Petal.Width Species Sepal.Length
## Sepal.Width            0            1           1       1            1
## Petal.Length           1            0           1       1            1
## Petal.Width            1            1           0       1            1
## Species                1            1           1       0            1
## Sepal.Length           1            1           1       1            0&lt;/code&gt;&lt;/pre&gt;
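&lt;p&gt;As a side note, if we ever need a single completed dataset (for a quick check, say), it can be extracted with &lt;code&gt;complete()&lt;/code&gt;; a minimal sketch reusing the &lt;code&gt;imp&lt;/code&gt; object above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;completed1 &amp;lt;- complete(imp, 1)  # the first of the m = 5 completed datasets
anyNA(completed1)               # should be FALSE, as all variables were imputed&lt;/code&gt;&lt;/pre&gt;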
&lt;p&gt;Next, we can do some diagnostic assessment of the imputed data. These are the imputed values for Sepal.Width (one column per imputation):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp$imp$Sepal.Width %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      1   2   3   4   5
## 5  3.4 3.4 4.1 3.1 3.5
## 13 3.2 3.1 3.2 3.6 3.1
## 14 3.1 3.2 2.9 3.4 3.0
## 23 3.6 3.2 3.0 3.8 3.1
## 26 4.1 3.0 3.1 3.5 3.0
## 34 3.4 3.7 3.7 3.4 4.4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One important thing to check is convergence. We are going to increase the number of iterations for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp_conv &amp;lt;- mice.mids(imp, maxit = 30, printFlag = F)
plot(imp_conv)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-16-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The lines in the plot should be intermingled, and no obvious trend should be observed. Our plots above indicate convergence.&lt;/p&gt;
&lt;p&gt;We can also compare the density plots of the imputed and the observed data. Blue is the observed data and red is the imputed data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densityplot(imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can further assess the variable Sepal.Width in each imputed dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densityplot(imp, ~ Sepal.Width | .imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-18-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can assess the strip plot. The imputed observations (red) should not be distributed too far from the observed data (blue).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;stripplot(imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, once we finish the diagnostic checking, we could actually go back and change the imputation method for Sepal.Width, since its distribution changes quite noticeably across imputations. But we are not going to do that; instead, we are going to proceed to the analysis.&lt;/p&gt;
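&lt;p&gt;For completeness, a minimal sketch of how the method for Sepal.Width could be changed and the imputation re-run (reusing the setup above; &amp;quot;norm&amp;quot; is just one possible alternative to pmm):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;meth &amp;lt;- imp$method              # methods used in the imputation above
meth[&amp;quot;Sepal.Width&amp;quot;] &amp;lt;- &amp;quot;norm&amp;quot;   # e.g. Bayesian linear regression instead of pmm
imp2 &amp;lt;- mice(dat, method = meth, m = 5, seed = 1234, maxit = 5, printFlag = FALSE)
densityplot(imp2, ~ Sepal.Width)&lt;/code&gt;&lt;/pre&gt;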
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# run regression
fit &amp;lt;- with(imp, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species))
# pool all imputed set
pooled &amp;lt;- pool(fit) 
summary(pooled)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                term   estimate  std.error statistic       df      p.value
## 1       (Intercept)  2.2008307 0.34577321  6.364954 29.02484 5.859560e-07
## 2       Sepal.Width  0.5233500 0.09717217  5.385801 50.89918 1.854832e-06
## 3      Petal.Length  0.7409159 0.09020153  8.214006 12.73722 1.921415e-06
## 4       Petal.Width -0.3623895 0.18562168 -1.952301 22.34517 6.354332e-02
## 5 Speciesversicolor -0.3891112 0.28166528 -1.381467 15.07547 1.872683e-01
## 6  Speciesvirginica -0.5237106 0.42629920 -1.228505 10.82804 2.452897e-01&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we have the original dataset without the NAs, we are going to compare the three models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mimpute &amp;lt;- 
  fit %&amp;gt;% 
  tbl_regression() #with mice

noimpute &amp;lt;- 
  dat %&amp;gt;% 
  lm(Sepal.Length ~ ., data = .) %&amp;gt;% 
  tbl_regression() #w/o mice

original &amp;lt;- 
  iris %&amp;gt;% 
  lm(Sepal.Length ~ ., data = .) %&amp;gt;% 
  tbl_regression() #original data

tbl_merge(
  tbls = list(mimpute, noimpute, original), 
  tab_spanner = c(&amp;quot;With MICE&amp;quot;, &amp;quot;Without MICE&amp;quot;, &amp;quot;Original data&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;kofvwjwgme&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#kofvwjwgme .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#kofvwjwgme .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#kofvwjwgme .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#kofvwjwgme .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 6px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#kofvwjwgme .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#kofvwjwgme .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#kofvwjwgme .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#kofvwjwgme .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#kofvwjwgme .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#kofvwjwgme .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#kofvwjwgme .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#kofvwjwgme .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#kofvwjwgme .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#kofvwjwgme .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#kofvwjwgme .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#kofvwjwgme .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#kofvwjwgme .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#kofvwjwgme .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#kofvwjwgme .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#kofvwjwgme .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#kofvwjwgme .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#kofvwjwgme .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#kofvwjwgme .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#kofvwjwgme .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#kofvwjwgme .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#kofvwjwgme .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#kofvwjwgme .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#kofvwjwgme .gt_left {
  text-align: left;
}

#kofvwjwgme .gt_center {
  text-align: center;
}

#kofvwjwgme .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#kofvwjwgme .gt_font_normal {
  font-weight: normal;
}

#kofvwjwgme .gt_font_bold {
  font-weight: bold;
}

#kofvwjwgme .gt_font_italic {
  font-style: italic;
}

#kofvwjwgme .gt_super {
  font-size: 65%;
}

#kofvwjwgme .gt_footnote_marks {
  font-style: italic;
  font-weight: normal;
  font-size: 65%;
}
&lt;/style&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;With MICE&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Without MICE&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Original data&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.52&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.33, 0.72&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.48&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.17, 0.79&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.50&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.33, 0.67&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.74&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.55, 0.94&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.71&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.51, 0.90&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.83&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.69, 1.0&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.75, 0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.064&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.35&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.85, 0.14&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.32&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.61, -0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.039&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Species&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;setosa&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;versicolor&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.39&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.0, 0.21&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.1, 0.30&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.3&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.72&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.2, -0.25&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;virginica&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.52&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.5, 0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.5, 0.63&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.4&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.0&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.7, -0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;10&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;There is a difference in the results between the original dataset (no NAs) and the mice-imputed data. Exploring other imputation methods would probably produce a better result.&lt;/p&gt;
&lt;p&gt;There is a lot more that is not covered in this post, for example &lt;a href=&#34;https://www.gerkovink.com/miceVignettes/Passive_Post_processing/Passive_imputation_post_processing.html&#34;&gt;passive imputation and post-processing&lt;/a&gt;. In fact, there is a series of &lt;a href=&#34;https://github.com/amices/mice#vignettes&#34;&gt;vignettes&lt;/a&gt; written by Gerko Vink and Stef van Buuren (the authors of &lt;code&gt;mice&lt;/code&gt;) which provides a good, though quite advanced, tutorial on using &lt;code&gt;mice&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Suggested online books (though I have not studied either of them in depth yet):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/&#34;&gt;Flexible imputation of missing data&lt;/a&gt; by Stef van Buuren&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/mwheymans/bookmi/&#34;&gt;Applied missing data analysis with SPSS and (R)Studio&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;References for this post:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;http://www.cs.uni.edu/~jacobson/4772/week11/R_in_Action.pdf&#34;&gt;R in Action, Data analysis and graphics with R&lt;/a&gt; (Chapter 15)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://data.library.virginia.edu/getting-started-with-multiple-imputation-in-r/&#34; class=&#34;uri&#34;&gt;https://data.library.virginia.edu/getting-started-with-multiple-imputation-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/&#34; class=&#34;uri&#34;&gt;https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34;&gt;mice: Multivariate Imputation by Chained Equations in R&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>COVID-19 vaccine interest in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/</link>
      <pubDate>Sun, 17 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;We are going to do a basic Google Trends search using the &lt;code&gt;gtrendsR&lt;/code&gt; package and do some plotting with &lt;code&gt;ggplot2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(gtrendsR)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the &lt;code&gt;gtrends()&lt;/code&gt; function to search for our keywords of interest (i.e., the vaccine types). So far, only &lt;a href=&#34;https://covidnow.moh.gov.my/vaccinations/&#34;&gt;4 types of vaccines&lt;/a&gt; have been used in Malaysia.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine &amp;lt;- gtrends(c(&amp;quot;pfizer&amp;quot;, &amp;quot;astrazeneca&amp;quot;, &amp;quot;sinovac&amp;quot;, &amp;quot;cansino&amp;quot;), geo = &amp;quot;MY&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, plot our keywords.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(vaccine)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;
It’s probably better if we filter the dates to when the COVID-19 pandemic started, around March 2020.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_over_time %&amp;gt;% 
  group_by(keyword) %&amp;gt;% 
  filter(hits != &amp;quot;&amp;lt;1&amp;quot; &amp;amp; date &amp;gt; as.Date(&amp;quot;2020-03-01&amp;quot;)) %&amp;gt;% 
  mutate(hits = as.numeric(hits), 
         date = as.Date(date)) %&amp;gt;% 
  ggplot() + 
  geom_line(aes(x = date, y = hits, color = keyword), size = 0.8) +
  theme_minimal() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;Date&amp;quot;) +
  scale_x_date(date_breaks = &amp;quot;4 month&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, the AstraZeneca vaccine draws the highest interest, probably due to the infamous blood-clotting issue. Next, we can also get the search interest by state.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_by_region %&amp;gt;% 
  group_by(location) %&amp;gt;% 
  ggplot(aes(location, hits, fill = keyword)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia by states&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can plot the search keywords by city.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_by_city %&amp;gt;% 
  group_by(location) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  ggplot(aes(location, hits, fill = keyword)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia by cities&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;gtrendsR&lt;/code&gt;, with just a few plots, is certainly very useful if we want to gauge interest in certain issues in the community.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Wordcloud of COVID-19 research in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/</link>
      <pubDate>Sat, 11 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/wordcloud.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/wordcloud2.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/hover.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2-binding/wordcloud2.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Let’s see how much COVID-19 research has been done in Malaysia. In this analysis, we are going to use the &lt;a href=&#34;https://www.scopus.com/search/form.uri?display=basic&amp;amp;zone=header&amp;amp;origin=#basic&#34;&gt;Scopus database&lt;/a&gt; to access the relevant papers, focusing on 4 specific parts of each paper:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Title&lt;/li&gt;
&lt;li&gt;Abstract&lt;/li&gt;
&lt;li&gt;Author’s keywords&lt;/li&gt;
&lt;li&gt;Scopus’s keywords&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&#34;images/sample%20paper.png&#34; alt=&#34;Sample of paper&#34; /&gt;
Above is a sample paper showing the sections that we are going to use in our analysis. The Scopus keywords are generated by the Scopus database, so they do not appear on the paper itself.&lt;/p&gt;
&lt;p&gt;So, the analysis will be applied separately to these 4 parts of the papers. Also, we are going to use &lt;code&gt;map()&lt;/code&gt; (equivalent to a loop) since the flow of the analysis is the same for each part.&lt;/p&gt;
&lt;p&gt;Load the related packages. The main package is &lt;code&gt;quanteda&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(patchwork)
library(wordcloud2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have uploaded the data that I downloaded from the Scopus database to my GitHub.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Read data from GitHub repo
df &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/scopus-data/main/covid-malaysia.csv&amp;quot;) %&amp;gt;% 
  janitor::clean_names() %&amp;gt;% 
  rename(title = i_title)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we need to tokenize the text. In other words, we break the sentences down into individual words (tokens).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Tokenize
tok_list &amp;lt;- 
  df %&amp;gt;% 
  select(title, abstract, author_keywords, index_keywords) %&amp;gt;% 
  map(tokens, 
      remove_punct = T, 
      remove_numbers = T,               
      remove_symbols = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we remove words that are not meaningful, such as ‘a’, ‘the’, etc. These words are known as stop words.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Remove stop words
nostop_toks &amp;lt;- 
  tok_list %&amp;gt;% 
  map(tokens_select, 
      c(tidytext::stop_words$word, stopwords(&amp;quot;en&amp;quot;)), 
      selection = &amp;quot;remove&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we create a document feature matrix (DFM). Basically, a DFM is a matrix that represents the frequency of each word (feature) in each document (in our case, each paper or manuscript). Another name for a DFM is a document-term matrix (DTM); &lt;code&gt;quanteda&lt;/code&gt; uses the term DFM, while some other packages use DTM.&lt;/p&gt;
&lt;p&gt;Additionally, we also apply term frequency-inverse document frequency (TF-IDF) weighting. In scientific papers, words such as ‘determine’, ‘conclusion’, ‘introduction’, etc. are very frequent, and these words are not meaningful either. Instead of removing them manually one by one, we use TF-IDF. TF-IDF basically downweights words that appear in most of the documents, so we are left with only the relevant or important words.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create DFM and apply tf_idf
covid_dfm_list &amp;lt;- 
  nostop_toks %&amp;gt;% 
  map(dfm) %&amp;gt;% 
  map(dfm_tfidf)&lt;/code&gt;&lt;/pre&gt;
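&lt;p&gt;To see what the TF-IDF weighting does, here is a minimal sketch on a hypothetical three-document corpus (not part of the actual analysis): a word that appears in every document gets a weight of zero, while words confined to fewer documents keep a positive weight.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(quanteda)

# Hypothetical mini-corpus for illustration only
toy &amp;lt;- c(doc1 = &amp;quot;covid vaccine study&amp;quot;,
         doc2 = &amp;quot;covid vaccine trial&amp;quot;,
         doc3 = &amp;quot;covid lockdown policy&amp;quot;)

toy_dfm &amp;lt;- dfm(tokens(toy))
dfm_tfidf(toy_dfm)
# &amp;quot;covid&amp;quot; occurs in all 3 documents, so its weight is log10(3/3) = 0;
# a word occurring in a single document gets log10(3/1), about 0.48&lt;/code&gt;&lt;/pre&gt;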
Once we have our weighted DFMs, we can plot the most relevant terms based on TF-IDF.
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot top features
A &amp;lt;- 
  covid_dfm_list$title %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;blueviolet&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the title&amp;quot;)

B &amp;lt;- 
  covid_dfm_list$abstract %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;darkolivegreen3&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the abstract&amp;quot;)

C &amp;lt;- 
  covid_dfm_list$author_keywords %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;deepskyblue2&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the author&amp;#39;s keywords&amp;quot;)

D &amp;lt;- 
  covid_dfm_list$index_keywords %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;aquamarine2&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the Scopus&amp;#39;s keywords&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;These are the plots of the most relevant terms in COVID-19 research in Malaysia.
&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-2.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-3.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-4.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;wordcloud&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wordcloud&lt;/h2&gt;
&lt;p&gt;Finally, we can make our wordcloud, but first we need to convert our DFMs into frequency data frames. We are also going to round the TF-IDF values and limit each cloud to the top 1000 terms only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covid_wc &amp;lt;- 
  covid_dfm_list %&amp;gt;% 
  map(textstat_frequency, force = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Actually, &lt;code&gt;quanteda&lt;/code&gt; itself is able to produce a wordcloud. However, the wordcloud from &lt;code&gt;wordcloud2&lt;/code&gt; is interactive, and we can see the TF-IDF value if we click on a word.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$title %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-9&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;word&#34;:[&#34;pandemic&#34;,&#34;malaysia&#34;,&#34;ð&#34;,&#34;impact&#34;,&#34;health&#34;,&#34;study&#34;,&#34;patients&#34;,&#34;sars-cov-2&#34;,&#34;learning&#34;,&#34;review&#34;,&#34;analysis&#34;,&#34;students&#34;,&#34;malaysian&#34;,&#34;coronavirus&#34;,&#34;control&#34;,&#34;online&#34;,&#34;global&#34;,&#34;outbreak&#34;,&#34;social&#34;,&#34;risk&#34;,&#34;healthcare&#34;,&#34;model&#34;,&#34;covid-19&#34;,&#34;challenges&#34;,&#34;education&#34;,&#34;disease&#34;,&#34;workers&#34;,&#34;lockdown&#34;,&#34;factors&#34;,&#34;countries&#34;,&#34;survey&#34;,&#34;mental&#34;,&#34;response&#34;,&#34;potential&#34;,&#34;psychological&#34;,&#34;management&#34;,&#34;system&#34;,&#34;perspective&#34;,&#34;movement&#34;,&#34;public&#34;,&#34;clinical&#34;,&#34;role&#34;,&#34;medical&#34;,&#34;infection&#34;,&#34;â&#34;,&#34;detection&#34;,&#34;cross-sectional&#34;,&#34;university&#34;,&#34;implications&#34;,&#34;impacts&#34;,&#34;systematic&#34;,&#34;transmission&#34;,&#34;covid&#34;,&#34;care&#34;,&#34;development&#34;,&#34;evidence&#34;,&#34;effect&#34;,&#34;meta-analysis&#34;,&#34;experience&#34;,&#34;teaching&#34;,&#34;based&#34;,&#34;treatment&#34;,&#34;knowledge&#34;,&#34;practice&#34;,&#34;approach&#34;,&#34;economic&#34;,&#34;strategies&#34;,&#34;effects&#34;,&#34;ñ&#34;,&#34;quality&#34;,&#34;media&#34;,&#34;measures&#34;,&#34;indonesia&#34;,&#34;asia&#34;,&#34;amid&#34;,&#34;vaccine&#34;,&#34;pakistan&#34;,&#34;mortality&#34;,&#34;application&#34;,&#34;tourism&#34;,&#34;images&#34;,&#34;intention&#34;,&#34;digital&#34;,&#34;coping&#34;,&#34;anxiety&#34;,&#34;spread&#34;,&#34;islamic&#34;,&#34;epidemic&#34;,&#34;data&#34;,&#34;people&#34;,&#34;stress&#34;,&#34;screening&#34;,&#34;future&#34;,&#34;performance&#34;,&#34;post-covid-19&#34;,&#34;perception&#34;,&#34;crisis&#34;,&#34;machine&#34;,&#34;industry&#34;,&#34;era&#34;,&#34;hospital&#34;,&#34;adults&#
34;,&#34;assessment&#34;,&#34;critical&#34;,&#34;services&#34;,&#34;international&#34;,&#34;time&#34;,&#34;financial&#34;,&#34;rapid&#34;,&#34;depression&#34;,&#34;food&#34;,&#34;bangladesh&#34;,&#34;sustainable&#34;,&#34;stock&#34;,&#34;understanding&#34;,&#34;deep&#34;,&#34;virtual&#34;,&#34;molecular&#34;,&#34;x-ray&#34;,&#34;patient&#34;,&#34;prevention&#34;,&#34;asian&#34;,&#34;chest&#34;,&#34;responses&#34;,&#34;comparison&#34;,&#34;acute&#34;,&#34;effectiveness&#34;,&#34;perspectives&#34;,&#34;population&#34;,&#34;dynamics&#34;,&#34;e-learning&#34;,&#34;policy&#34;,&#34;awareness&#34;,&#34;security&#34;,&#34;inhibitors&#34;,&#34;framework&#34;,&#34;cancer&#34;,&#34;severe&#34;,&#34;network&#34;,&#34;perceptions&#34;,&#34;prediction&#34;,&#34;relationship&#34;,&#34;association&#34;,&#34;technology&#34;,&#34;china&#34;,&#34;recovery&#34;,&#34;mco&#34;,&#34;drugs&#34;,&#34;lessons&#34;,&#34;women&#34;,&#34;air&#34;,&#34;perceived&#34;,&#34;research&#34;,&#34;neural&#34;,&#34;behavior&#34;,&#34;level&#34;,&#34;life&#34;,&#34;ct&#34;,&#34;exploring&#34;,&#34;information&#34;,&#34;therapy&#34;,&#34;overview&#34;,&#34;distress&#34;,&#34;environmental&#34;,&#34;monitoring&#34;,&#34;emergency&#34;,&#34;southeast&#34;,&#34;strategy&#34;,&#34;market&#34;,&#34;environment&#34;,&#34;syndrome&#34;,&#34;status&#34;,&#34;community&#34;,&#34;children&#34;,&#34;preventive&#34;,&#34;prevalence&#34;,&#34;diagnosis&#34;,&#34;adoption&#34;,&#34;sector&#34;,&#34;approaches&#34;,&#34;threat&#34;,&#34;modelling&#34;,&#34;preliminary&#34;,&#34;managing&#34;,&#34;characteristics&#34;,&#34;region&#34;,&#34;post&#34;,&#34;practices&#34;,&#34;due&#34;,&#34;recommendations&#34;,&#34;waste&#34;,&#34;respiratory&#34;,&#34;preparedness&#34;,&#34;human&#34;,&#34;support&#34;,&#34;fear&#34;,&#34;amidst&#34;,&#34;therapeutic&#34;,&#34;drug&#34;,&#34;resilience&#34;,&#34;current&#34;,&#34;travel&#34;,&#34;severity&#34;,&#34;attitude&#34;,&#34;vaccines&#34;,&#34;effective&#34;,&#34;methods&#34;,&#
34;insights&#34;,&#34;news&#34;,&#34;economy&#34;,&#34;acceptance&#34;,&#34;systems&#34;,&#34;vaccination&#34;,&#34;dental&#34;,&#34;infections&#34;,&#34;image&#34;,&#34;academic&#34;,&#34;physical&#34;,&#34;activity&#34;,&#34;report&#34;,&#34;influence&#34;,&#34;implementation&#34;,&#34;diagnostic&#34;,&#34;protease&#34;,&#34;diseases&#34;,&#34;immune&#34;,&#34;outcomes&#34;,&#34;well-being&#34;,&#34;surgery&#34;,&#34;period&#34;,&#34;home&#34;,&#34;evaluation&#34;,&#34;experiences&#34;,&#34;developing&#34;,&#34;studentsâ&#34;,&#34;energy&#34;,&#34;professionals&#34;,&#34;ðµð&#34;,&#34;sharing&#34;,&#34;opportunities&#34;,&#34;behaviour&#34;,&#34;method&#34;,&#34;sars&#34;,&#34;virus&#34;,&#34;design&#34;,&#34;qualitative&#34;,&#34;predictors&#34;,&#34;government&#34;,&#34;related&#34;,&#34;student&#34;,&#34;engagement&#34;,&#34;national&#34;,&#34;infected&#34;,&#34;pneumonia&#34;,&#34;editor&#34;,&#34;pharmacy&#34;,&#34;handling&#34;,&#34;models&#34;,&#34;safe&#34;,&#34;india&#34;,&#34;consequences&#34;,&#34;society&#34;,&#34;emerging&#34;,&#34;nexus&#34;,&#34;hydroxychloroquine&#34;,&#34;adverse&#34;,&#34;classification&#34;,&#34;testing&#34;,&#34;safety&#34;,&#34;moderating&#34;,&#34;psychosocial&#34;,&#34;findings&#34;,&#34;symptoms&#34;,&#34;trials&#34;,&#34;issues&#34;,&#34;readiness&#34;,&#34;distance&#34;,&#34;influencing&#34;,&#34;protection&#34;,&#34;mechanisms&#34;,&#34;training&#34;,&#34;surgical&#34;,&#34;techniques&#34;,&#34;letter&#34;,&#34;de&#34;,&#34;fake&#34;,&#34;en&#34;,&#34;reality&#34;,&#34;di&#34;,&#34;hospitals&#34;,&#34;investigation&#34;,&#34;price&#34;,&#34;oral&#34;,&#34;communication&#34;,&#34;country&#34;,&#34;context&#34;,&#34;markets&#34;,&#34;interventions&#34;,&#34;spike&#34;,&#34;diabetes&#34;,&#34;asia-pacific&#34;,&#34;identification&#34;,&#34;assessing&#34;,&#34;science&#34;,&#34;technologies&#34;,&#34;iot&#34;,&#34;educational&#34;,&#34;malaysiaâ&#34;,&#34;trends&#34;,&#34;combating&#34;,&#34;satisfaction&#34;,&#34;sustainabi
lity&#34;,&#34;pandemik&#34;,&#34;service&#34;,&#34;negative&#34;,&#34;considerations&#34;,&#34;action&#34;,&#34;integrated&#34;,&#34;pattern&#34;,&#34;literature&#34;,&#34;la&#34;,&#34;position&#34;,&#34;sleep&#34;,&#34;protein&#34;,&#34;hybrid&#34;,&#34;covidâ&#34;,&#34;theory&#34;,&#34;responsibility&#34;,&#34;call&#34;,&#34;middle-income&#34;,&#34;tertiary&#34;,&#34;comparative&#34;,&#34;pacific&#34;,&#34;sabah&#34;,&#34;mitigating&#34;,&#34;events&#34;,&#34;nationwide&#34;,&#34;protective&#34;,&#34;secondary&#34;,&#34;delivery&#34;,&#34;age&#34;,&#34;phase&#34;,&#34;confirmed&#34;,&#34;hospitalized&#34;,&#34;saudi&#34;,&#34;main&#34;,&#34;mathematical&#34;,&#34;business&#34;,&#34;uk&#34;,&#34;concerns&#34;,&#34;natural&#34;,&#34;mass&#34;,&#34;telemedicine&#34;,&#34;scenario&#34;,&#34;wave&#34;,&#34;daily&#34;,&#34;frontline&#34;,&#34;normal&#34;,&#34;medicine&#34;,&#34;applications&#34;,&#34;learned&#34;,&#34;quarantine&#34;,&#34;forecasting&#34;,&#34;roles&#34;,&#34;predicting&#34;,&#34;modeling&#34;,&#34;building&#34;,&#34;ace2&#34;,&#34;attitudes&#34;,&#34;antiviral&#34;,&#34;mitigation&#34;,&#34;simulation&#34;,&#34;world&#34;,&#34;reduce&#34;,&#34;dynamic&#34;,&#34;manifestations&#34;,&#34;rate&#34;,&#34;ðµ&#34;,&#34;mobile&#34;,&#34;endoscopy&#34;,&#34;genome&#34;,&#34;type&#34;,&#34;spatial&#34;,&#34;marketing&#34;,&#34;major&#34;,&#34;structural&#34;,&#34;fuzzy&#34;,&#34;empirical&#34;,&#34;aftermath&#34;,&#34;combat&#34;,&#34;behavioural&#34;,&#34;outpatient&#34;,&#34;affect&#34;,&#34;suspected&#34;,&#34;cardiovascular&#34;,&#34;products&#34;,&#34;policies&#34;,&#34;destination&#34;,&#34;examining&#34;,&#34;smart&#34;,&#34;properties&#34;,&#34;bangladeshi&#34;,&#34;arabia&#34;,&#34;artificial&#34;,&#34;randomized&#34;,&#34;storm&#34;,&#34;providers&#34;,&#34;key&#34;,&#34;death&#34;,&#34;platform&#34;,&#34;barriers&#34;,&#34;school&#34;,&#34;dan&#34;,&#34;pulmonary&#34;,&#34;province&#34;,&#34;protocol&#34;,&#34;activities&#34;,&#34;supply&#34;,&#34;i
ndex&#34;,&#34;scale&#34;,&#34;systemic&#34;,&#34;times&#34;,&#34;consensus&#34;,&#34;contact&#34;,&#34;medicines&#34;,&#34;computational&#34;,&#34;hiv&#34;,&#34;laboratory&#34;,&#34;corporate&#34;,&#34;narrative&#34;,&#34;link&#34;,&#34;convolutional&#34;,&#34;confinement&#34;,&#34;usage&#34;,&#34;statements&#34;,&#34;body&#34;,&#34;cloud&#34;,&#34;influenza&#34;,&#34;continuity&#34;,&#34;disaster&#34;,&#34;nursing&#34;,&#34;aerosol&#34;,&#34;levels&#34;,&#34;religious&#34;,&#34;silent&#34;,&#34;proteins&#34;,&#34;scoping&#34;,&#34;united&#34;,&#34;kingdom&#34;,&#34;staff&#34;,&#34;correction&#34;,&#34;solidarity&#34;,&#34;web-based&#34;,&#34;versus&#34;,&#34;chinese&#34;,&#34;covid-19â&#34;,&#34;success&#34;,&#34;mutation&#34;,&#34;individuals&#34;,&#34;addressing&#34;,&#34;detect&#34;,&#34;uncertainty&#34;,&#34;silico&#34;,&#34;residents&#34;,&#34;compliance&#34;,&#34;test&#34;,&#34;construction&#34;,&#34;docking&#34;,&#34;database&#34;,&#34;climate&#34;,&#34;animal&#34;,&#34;automatic&#34;,&#34;features&#34;,&#34;conditions&#34;,&#34;efficient&#34;,&#34;networks&#34;,&#34;cerebral&#34;,&#34;venous&#34;,&#34;cytokine&#34;,&#34;teachers&#34;,&#34;undergoing&#34;,&#34;elective&#34;,&#34;district&#34;,&#34;income&#34;,&#34;results&#34;,&#34;plant&#34;,&#34;surveillance&#34;,&#34;pandemics&#34;,&#34;detected&#34;,&#34;asymptomatic&#34;,&#34;male&#34;,&#34;goals&#34;,&#34;innovation&#34;,&#34;receptor&#34;,&#34;intervention&#34;,&#34;employee&#34;,&#34;wellbeing&#34;,&#34;nigeria&#34;,&#34;resources&#34;,&#34;studies&#34;,&#34;mediating&#34;,&#34;aspects&#34;,&#34;innovative&#34;,&#34;green&#34;,&#34;inflammatory&#34;,&#34;practical&#34;,&#34;setting&#34;,&#34;immunity&#34;,&#34;local&#34;,&#34;injury&#34;,&#34;transformation&#34;,&#34;participation&#34;,&#34;collaboration&#34;,&#34;positive&#34;,&#34;banking&#34;,&#34;cluster&#34;,&#34;remdesivir&#34;,&#34;motivation&#34;,&#34;statement&#34;,&#34;pakistani&#34;,&#34;neurosurgical&#34;,&#34;sensing&#34;,&#34;stability
&#34;,&#34;equipment&#34;,&#34;estimation&#34;,&#34;distancing&#34;,&#34;host&#34;,&#34;remote&#34;,&#34;alternative&#34;,&#34;guidelines&#34;,&#34;improving&#34;,&#34;asean&#34;,&#34;war&#34;,&#34;migrant&#34;,&#34;patterns&#34;,&#34;google&#34;,&#34;comprehensive&#34;,&#34;share&#34;,&#34;stroke&#34;,&#34;isolation&#34;,&#34;collective&#34;,&#34;hand&#34;,&#34;promote&#34;,&#34;reproductive&#34;,&#34;longitudinal&#34;,&#34;concurrent&#34;,&#34;sectors&#34;,&#34;detecting&#34;,&#34;matter&#34;,&#34;pm2.5&#34;,&#34;change&#34;,&#34;interactive&#34;,&#34;antibody&#34;,&#34;cells&#34;,&#34;option&#34;,&#34;reported&#34;,&#34;plastic&#34;,&#34;opportunity&#34;,&#34;middle&#34;,&#34;algorithms&#34;,&#34;fresh&#34;,&#34;saliva&#34;,&#34;influences&#34;,&#34;distribution&#34;,&#34;gender&#34;,&#34;project&#34;,&#34;estimating&#34;,&#34;descriptive&#34;,&#34;mining&#34;,&#34;emotion&#34;,&#34;opinion&#34;,&#34;content&#34;,&#34;humans&#34;,&#34;traditional&#34;,&#34;copd&#34;,&#34;thrombosis&#34;,&#34;industrial&#34;,&#34;angiotensin&#34;,&#34;convalescent&#34;,&#34;plasma&#34;,&#34;fractal-fractional&#34;,&#34;cohort&#34;,&#34;mask&#34;,&#34;emotional&#34;,&#34;midst&#34;,&#34;hesitancy&#34;,&#34;density&#34;,&#34;urban&#34;,&#34;obese&#34;,&#34;iran&#34;,&#34;smes&#34;,&#34;initiatives&#34;,&#34;guide&#34;,&#34;small-scale&#34;,&#34;thailand&#34;,&#34;italy&#34;,&#34;essential&#34;,&#34;trend&#34;,&#34;treat&#34;,&#34;scenarios&#34;,&#34;efficacy&#34;,&#34;measure&#34;,&#34;scan&#34;,&#34;nurses&#34;,&#34;geriatric&#34;,&#34;cell&#34;,&#34;city&#34;,&#34;preventing&#34;,&#34;algorithm&#34;,&#34;renin-angiotensin&#34;,&#34;optimal&#34;,&#34;facilities&#34;,&#34;self-efficacy&#34;,&#34;proposed&#34;,&#34;return&#34;,&#34;eating&#34;,&#34;growth&#34;,&#34;short-term&#34;,&#34;lung&#34;,&#34;situation&#34;,&#34;tool&#34;,&#34;bibliometric&#34;,&#34;wake&#34;,&#34;binding&#34;,&#34;analisis&#34;,&#34;disorder&#34;,&#34;rheumatic&#34;,&#34;restriction&#34;,&#34;curve&#34;,&#34
;tracing&#34;,&#34;institution&#34;,&#34;update&#34;,&#34;actions&#34;,&#34;conspiracy&#34;,&#34;theories&#34;,&#34;hypertension&#34;,&#34;intelligence&#34;,&#34;semasa&#34;,&#34;dalam&#34;,&#34;direct&#34;,&#34;paediatric&#34;,&#34;gastroenterology&#34;,&#34;platforms&#34;,&#34;induced&#34;,&#34;yemen&#34;,&#34;targeted&#34;,&#34;topsis&#34;,&#34;risks&#34;,&#34;predictive&#34;,&#34;stressors&#34;,&#34;therapeutics&#34;,&#34;resistance&#34;,&#34;aid&#34;,&#34;leadership&#34;,&#34;spectrum&#34;,&#34;variations&#34;,&#34;architecture&#34;,&#34;pollution&#34;,&#34;box&#34;,&#34;responding&#34;,&#34;efforts&#34;,&#34;volatility&#34;,&#34;silver&#34;,&#34;poor&#34;,&#34;conceptual&#34;,&#34;healthy&#34;,&#34;violence&#34;,&#34;viral&#34;,&#34;identify&#34;,&#34;chain&#34;,&#34;frailty&#34;,&#34;thromboembolism&#34;,&#34;klang&#34;,&#34;valley&#34;,&#34;persistent&#34;,&#34;penang&#34;,&#34;reactions&#34;,&#34;asthma&#34;,&#34;unprecedented&#34;,&#34;leading&#34;,&#34;trajectory&#34;,&#34;engineering&#34;,&#34;airway&#34;,&#34;kuala&#34;,&#34;lumpur&#34;,&#34;selangor&#34;,&#34;feature&#34;,&#34;fight&#34;,&#34;africa&#34;,&#34;pandemicâ&#34;,&#34;belief&#34;,&#34;thinking&#34;,&#34;targets&#34;,&#34;exposure&#34;,&#34;meteorological&#34;,&#34;robust&#34;,&#34;reducing&#34;,&#34;willingness&#34;,&#34;classroom&#34;,&#34;iraq&#34;,&#34;ã&#34;,&#34;solutions&#34;,&#34;malay&#34;,&#34;emergence&#34;,&#34;common&#34;,&#34;clustering&#34;,&#34;ñƒñ&#34;,&#34;counselling&#34;,&#34;inhaler&#34;,&#34;lower&#34;,&#34;radiotherapy&#34;,&#34;cruise&#34;,&#34;gynecological&#34;,&#34;lupus&#34;,&#34;brand&#34;,&#34;fasting&#34;,&#34;music&#34;,&#34;pandemije&#34;,&#34;gen&#34;,&#34;shocks&#34;,&#34;vitamin&#34;,&#34;liver&#34;,&#34;proposal&#34;,&#34;upper&#34;,&#34;universal&#34;,&#34;coverage&#34;,&#34;multiple&#34;,&#34;populations&#34;,&#34;covid-19-related&#34;,&#34;function&#34;,&#34;search&#34;,&#34;types&#34;,&#34;universities&#34;,&#34;communications&#34;,&#34;sri&#34;,&#34;s
exual&#34;,&#34;sedentary&#34;,&#34;ventilation&#34;,&#34;private&#34;,&#34;lives&#34;,&#34;scans&#34;,&#34;dexamethasone&#34;,&#34;aged&#34;,&#34;dataset&#34;,&#34;nanomaterials&#34;,&#34;cities&#34;,&#34;wavelet-based&#34;,&#34;cov-2&#34;,&#34;reduction&#34;,&#34;consumers&#34;,&#34;buying&#34;,&#34;survivors&#34;,&#34;factor&#34;,&#34;personal&#34;,&#34;limited&#34;,&#34;lifestyle&#34;,&#34;admitted&#34;,&#34;masks&#34;,&#34;ongoing&#34;,&#34;past&#34;,&#34;panic&#34;,&#34;rising&#34;,&#34;infrastructure&#34;,&#34;anti-sars-cov-2&#34;,&#34;peptides&#34;,&#34;mrna&#34;,&#34;affected&#34;,&#34;administration&#34;,&#34;kits&#34;,&#34;projects&#34;,&#34;phytochemicals&#34;,&#34;large-scale&#34;,&#34;restrictions&#34;,&#34;sentiment&#34;,&#34;strains&#34;,&#34;co2&#34;,&#34;options&#34;,&#34;provide&#34;,&#34;solution&#34;,&#34;supporting&#34;,&#34;circular&#34;,&#34;prophylaxis&#34;,&#34;low&#34;,&#34;outbreaks&#34;,&#34;inhibitor&#34;,&#34;controlled&#34;,&#34;combined&#34;,&#34;modulation&#34;,&#34;sir&#34;,&#34;derivative&#34;,&#34;wastewater&#34;,&#34;tocilizumab&#34;,&#34;commentary&#34;,&#34;planning&#34;,&#34;perioperative&#34;,&#34;illness&#34;,&#34;college&#34;,&#34;importance&#34;,&#34;agricultural&#34;,&#34;costs&#34;,&#34;enhancing&#34;,&#34;socioeconomic&#34;,&#34;entrepreneurs&#34;,&#34;peninsular&#34;,&#34;australian&#34;,&#34;norms&#34;,&#34;paradigm&#34;,&#34;entrepreneurial&#34;,&#34;contagion&#34;,&#34;integration&#34;,&#34;disinfectant&#34;,&#34;terhadap&#34;,&#34;pengalaman&#34;,&#34;improved&#34;,&#34;road&#34;,&#34;fluid&#34;,&#34;sociodemographic&#34;,&#34;process&#34;,&#34;borneo&#34;,&#34;lineage&#34;,&#34;epidemiological&#34;,&#34;weight&#34;,&#34;controlling&#34;,&#34;reproduction&#34;,&#34;nursesâ&#34;,&#34;aquatic&#34;,&#34;chains&#34;,&#34;palliative&#34;,&#34;technique&#34;,&#34;europe&#34;,&#34;burnout&#34;,&#34;cross&#34;,&#34;sectional&#34;,&#34;emergencies&#34;,&#34;preadmission&#34;,&#34;disorders&#34;,&#34;repurposing&#34;,&#34;r
evolution&#34;,&#34;vulnerable&#34;,&#34;affecting&#34;,&#34;examination&#34;,&#34;implementing&#34;,&#34;mixed-method&#34;,&#34;statistical&#34;,&#34;dentists&#34;,&#34;pregnancy&#34;,&#34;progression&#34;,&#34;users&#34;,&#34;focused&#34;,&#34;qt&#34;,&#34;caring&#34;,&#34;precautionary&#34;,&#34;nigerian&#34;,&#34;tools&#34;,&#34;drives&#34;,&#34;free&#34;,&#34;agenda&#34;,&#34;prevent&#34;,&#34;female&#34;,&#34;metabolism&#34;,&#34;nasopharyngeal&#34;,&#34;converting&#34;,&#34;enzyme&#34;,&#34;manage&#34;,&#34;institutional&#34;,&#34;synthetic&#34;,&#34;routine&#34;,&#34;myths&#34;,&#34;polymerase&#34;,&#34;genetic&#34;,&#34;herd&#34;,&#34;wuhan&#34;,&#34;special&#34;,&#34;kidney&#34;,&#34;sample&#34;,&#34;possibly&#34;,&#34;isolated&#34;,&#34;quarantined&#34;,&#34;augmented&#34;,&#34;e-commerce&#34;,&#34;antibiotics&#34;,&#34;recurrent&#34;,&#34;mini-review&#34;,&#34;battling&#34;,&#34;internet&#34;,&#34;programme&#34;,&#34;fever&#34;,&#34;adult&#34;,&#34;advance&#34;,&#34;schools&#34;,&#34;immunomodulatory&#34;,&#34;dysfunction&#34;,&#34;infectious&#34;,&#34;english&#34;,&#34;inquiry&#34;,&#34;singapore&#34;,&#34;firms&#34;,&#34;postgraduate&#34;,&#34;instant&#34;,&#34;burden&#34;,&#34;australia&#34;,&#34;indicators&#34;,&#34;intentions&#34;,&#34;homes&#34;,&#34;interactions&#34;,&#34;indonesian&#34;,&#34;aquaculture&#34;,&#34;heparin&#34;,&#34;concentration&#34;,&#34;reporting&#34;,&#34;insecurity&#34;,&#34;happiness&#34;,&#34;massive&#34;,&#34;nutrition&#34;,&#34;automated&#34;,&#34;america&#34;,&#34;journal&#34;,&#34;urology&#34;,&#34;observational&#34;,&#34;dominant&#34;,&#34;methodology&#34;,&#34;determinants&#34;,&#34;domestic&#34;,&#34;blood&#34;,&#34;referral&#34;,&#34;temporal&#34;,&#34;communities&#34;,&#34;waves&#34;,&#34;availability&#34;,&#34;selected&#34;,&#34;regions&#34;,&#34;biological&#34;,&#34;benefits&#34;,&#34;finance&#34;,&#34;health-care&#34;,&#34;logistics&#34;,&#34;deployment&#34;,&#34;geographical&#34;,&#34;critically&#34;,&#34;ill&#3
4;,&#34;south-east&#34;,&#34;mini&#34;,&#34;requirements&#34;,&#34;exploration&#34;,&#34;statins&#34;,&#34;regulatory&#34;,&#34;faculty&#34;,&#34;correlation&#34;,&#34;rights&#34;,&#34;weather&#34;,&#34;visual&#34;,&#34;battle&#34;,&#34;extraction&#34;,&#34;respond&#34;,&#34;lining&#34;,&#34;robotic&#34;,&#34;generated&#34;,&#34;esl&#34;,&#34;electronic&#34;,&#34;compounds&#34;,&#34;orientation&#34;,&#34;pasaran&#34;,&#34;faced&#34;,&#34;na&#34;,&#34;zinc&#34;,&#34;putative&#34;,&#34;ethical&#34;,&#34;disinfection&#34;,&#34;apps&#34;,&#34;mindfulness&#34;,&#34;malaysia&#39;s&#34;,&#34;cascade&#34;,&#34;orthopaedic&#34;,&#34;contemporary&#34;,&#34;sarawak&#34;,&#34;manifestation&#34;,&#34;assay&#34;,&#34;worldwide&#34;,&#34;target&#34;,&#34;chloroquine&#34;,&#34;pharmacologic&#34;,&#34;agents&#34;,&#34;cycle&#34;,&#34;south&#34;,&#34;corticosteroids&#34;,&#34;corona&#34;,&#34;mediated&#34;,&#34;neurological&#34;,&#34;reverse&#34;,&#34;transcription&#34;,&#34;amplification&#34;,&#34;prophylactic&#34;,&#34;reference&#34;,&#34;multicenter&#34;,&#34;azithromycin&#34;,&#34;pharmacotherapeutic&#34;,&#34;receiving&#34;,&#34;al-quran&#34;,&#34;expert&#34;,&#34;plan&#34;],&#34;freq&#34;:[259,198,193,144,143,135,134,132,128,122,113,100,99,93,90,90,88,84,80,80,79,76,76,75,74,74,71,70,69,68,68,68,68,68,68,68,67,67,66,65,63,63,63,62,61,61,61,58,58,57,57,56,56,55,53,53,53,53,52,51,51,49,49,49,49,49,49,49,48,48,48,47,47,45,45,44,44,44,44,44,43,43,42,42,41,41,40,40,40,40,40,40,40,40,38,38,38,38,38,37,37,37,37,37,37,37,37,36,36,36,36,35,35,35,35,35,34,34,34,34,34,34,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,31,31,31,31,31,31,31,31,31,31,31,31,30,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,28,28,28,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,26,25,25,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,23,23,23,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,20,20,20,20,20,20,20,20,20
,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,19,19,19,19,19,19,19,19,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,16,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,14,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,11,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8],&#34;fontFamily&#34;:&#34;Segoe 
UI&#34;,&#34;fontWeight&#34;:&#34;bold&#34;,&#34;color&#34;:&#34;random-dark&#34;,&#34;minSize&#34;:0,&#34;weightFactor&#34;:0.694980694980695,&#34;backgroundColor&#34;:&#34;white&#34;,&#34;gridSize&#34;:0,&#34;minRotation&#34;:-0.785398163397448,&#34;maxRotation&#34;:0.785398163397448,&#34;shuffle&#34;:true,&#34;rotateRatio&#34;:0.4,&#34;shape&#34;:&#34;circle&#34;,&#34;ellipticity&#34;:0.65,&#34;figBase64&#34;:null,&#34;hover&#34;:null},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Top 1000 terms extracted from the title
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$abstract %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-10&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-2&#34;&gt;{&#34;x&#34;:{&#34;word&#34;:[&#34;patients&#34;,&#34;students&#34;,&#34;learning&#34;,&#34;study&#34;,&#34;health&#34;,&#34;covid-19&#34;,&#34;sars-cov-2&#34;,&#34;pandemic&#34;,&#34;anxiety&#34;,&#34;online&#34;,&#34;malaysia&#34;,&#34;social&#34;,&#34;data&#34;,&#34;disease&#34;,&#34;model&#34;,&#34;countries&#34;,&#34;mco&#34;,&#34;research&#34;,&#34;coronavirus&#34;,&#34;impact&#34;,&#34;healthcare&#34;,&#34;control&#34;,&#34;risk&#34;,&#34;results&#34;,&#34;analysis&#34;,&#34;virus&#34;,&#34;ci&#34;,&#34;infection&#34;,&#34;clinical&#34;,&#34;care&#34;,&#34;education&#34;,&#34;knowledge&#34;,&#34;stress&#34;,&#34;measures&#34;,&#34;public&#34;,&#34;depression&#34;,&#34;spread&#34;,&#34;significant&#34;,&#34;system&#34;,&#34;outbreak&#34;,&#34;information&#34;,&#34;psychological&#34;,&#34;factors&#34;,&#34;world&#34;,&#34;studies&#34;,&#34;findings&#34;,&#34;reported&#34;,&#34;respondents&#34;,&#34;media&#34;,&#34;global&#34;,&#34;medical&#34;,&#34;vaccine&#34;,&#34;lockdown&#34;,&#34;paper&#34;,&#34;participants&#34;,&#34;transmission&#34;,&#34;review&#34;,&#34;positive&#34;,&#34;mental&#34;,&#34;people&#34;,&#34;based&#34;,&#34;economic&#34;,&#34;severe&#34;,&#34;symptoms&#34;,&#34;management&#34;,&#34;methods&#34;,&#34;survey&#34;,&#34;level&#34;,&#34;current&#34;,&#34;treatment&#34;,&#34;food&#34;,&#34;respiratory&#34;,&#34;perceived&#34;,&#34;due&#34;,&#34;development&#34;,&#34;performance&#34;,&#34;total&#34;,&#34;mortality&#34;,&#34;quality&#34;,&#34;challenges&#34;,&#34;rights&#34;,&#34;government&#34;,&#34;including&#34;,&#34;teaching&#34;,&#34;strategies&#34;,&#34;conducted&#34;,&#34;services&#34;,&#34;future&#34;,&#34;time&#34;,&#34;human&#34;,&#34;approach&#34;,&#34;workers&#34;,&#34;method&#34;,&#34;malaysian&#34;,&#34;reserved&#34;,&#34;crisis&#34;,&#34;significantly&#34;,&#34;effects&#34;,&#34;support&#34;,&#34;tourism&#34;,&#34;period&#34;,&#34;proposed&#34;,&#34;movem
ent&#34;,&#34;design&#34;,&#34;effect&#34;,&#34;university&#34;,&#34;potential&#34;,&#34;compared&#34;,&#34;rate&#34;,&#34;abstract&#34;,&#34;affected&#34;,&#34;found&#34;,&#34;score&#34;,&#34;effective&#34;,&#34;financial&#34;,&#34;limited&#34;,&#34;march&#34;,&#34;author&#34;,&#34;provide&#34;,&#34;population&#34;,&#34;acute&#34;,&#34;technology&#34;,&#34;china&#34;,&#34;levels&#34;,&#34;intention&#34;,&#34;air&#34;,&#34;activities&#34;,&#34;detection&#34;,&#34;syndrome&#34;,&#34;caused&#34;,&#34;aims&#34;,&#34;related&#34;,&#34;negative&#34;,&#34;impacts&#34;,&#34;literature&#34;,&#34;physical&#34;,&#34;questionnaire&#34;,&#34;waste&#34;,&#34;practice&#34;,&#34;response&#34;,&#34;market&#34;,&#34;viral&#34;,&#34;infected&#34;,&#34;article&#34;,&#34;relationship&#34;,&#34;authors&#34;,&#34;vaccines&#34;,&#34;increased&#34;,&#34;conclusion&#34;,&#34;international&#34;,&#34;e-learning&#34;,&#34;worldwide&#34;,&#34;practices&#34;,&#34;patient&#34;,&#34;role&#34;,&#34;images&#34;,&#34;test&#34;,&#34;age&#34;,&#34;industry&#34;,&#34;islamic&#34;,&#34;responses&#34;,&#34;prevalence&#34;,&#34;community&#34;,&#34;implications&#34;,&#34;perception&#34;,&#34;process&#34;,&#34;background&#34;,&#34;cancer&#34;,&#34;published&#34;,&#34;coping&#34;,&#34;digital&#34;,&#34;distancing&#34;,&#34;drugs&#34;,&#34;developed&#34;,&#34;systems&#34;,&#34;country&#34;,&#34;evidence&#34;,&#34;sector&#34;,&#34;business&#34;,&#34;science&#34;,&#34;individuals&#34;,&#34;experience&#34;,&#34;identify&#34;,&#34;association&#34;,&#34;policy&#34;,&#34;terms&#34;,&#34;hospital&#34;,&#34;included&#34;,&#34;reduce&#34;,&#34;daily&#34;,&#34;purpose&#34;,&#34;increase&#34;,&#34;identified&#34;,&#34;understanding&#34;,&#34;children&#34;,&#34;life&#34;,&#34;implementation&#34;,&#34;society&#34;,&#34;â&#34;,&#34;situation&#34;,&#34;stock&#34;,&#34;models&#34;,&#34;collected&#34;,&#34;home&#34;,&#34;access&#34;,&#34;trials&#34;,&#34;accuracy&#34;,&#34;prevention&#34;,&#34;april&#34;,&#34;main&#34;,&#34;lo
wer&#34;,&#34;investigate&#34;,&#34;epidemic&#34;,&#34;travel&#34;,&#34;observed&#34;,&#34;diseases&#34;,&#34;attitude&#34;,&#34;technologies&#34;,&#34;revealed&#34;,&#34;image&#34;,&#34;personal&#34;,&#34;aimed&#34;,&#34;distress&#34;,&#34;environment&#34;,&#34;infections&#34;,&#34;drug&#34;,&#34;diagnosis&#34;,&#34;result&#34;,&#34;nature&#34;,&#34;scale&#34;,&#34;confirmed&#34;,&#34;preventive&#34;,&#34;issues&#34;,&#34;acceptance&#34;,&#34;communication&#34;,&#34;objective&#34;,&#34;safety&#34;,&#34;resources&#34;,&#34;switzerland&#34;,&#34;virtual&#34;,&#34;major&#34;,&#34;testing&#34;,&#34;outcomes&#34;,&#34;performed&#34;,&#34;variables&#34;,&#34;articles&#34;,&#34;phase&#34;,&#34;developing&#34;,&#34;framework&#34;,&#34;provided&#34;,&#34;key&#34;,&#34;techniques&#34;,&#34;deaths&#34;,&#34;considered&#34;,&#34;critical&#34;,&#34;features&#34;,&#34;rapid&#34;,&#34;low&#34;,&#34;screening&#34;,&#34;women&#34;,&#34;universities&#34;,&#34;cross-sectional&#34;,&#34;springer&#34;,&#34;status&#34;,&#34;asia&#34;,&#34;environmental&#34;,&#34;emergency&#34;,&#34;majority&#34;,&#34;activity&#34;,&#34;hcws&#34;,&#34;indonesia&#34;,&#34;institutions&#34;,&#34;severity&#34;,&#34;policies&#34;,&#34;protein&#34;,&#34;assess&#34;,&#34;fear&#34;,&#34;days&#34;,&#34;death&#34;,&#34;licensee&#34;,&#34;security&#34;,&#34;news&#34;,&#34;protective&#34;,&#34;change&#34;,&#34;examine&#34;,&#34;mdpi&#34;,&#34;basel&#34;,&#34;national&#34;,&#34;contact&#34;,&#34;multiple&#34;,&#34;obtained&#34;,&#34;conditions&#34;,&#34;aim&#34;,&#34;distribution&#34;,&#34;selected&#34;,&#34;local&#34;,&#34;conclusions&#34;,&#34;prevent&#34;,&#34;behaviour&#34;,&#34;elsevier&#34;,&#34;sample&#34;,&#34;dental&#34;,&#34;staff&#34;,&#34;quarantine&#34;,&#34;regression&#34;,&#34;ensure&#34;,&#34;hospitals&#34;,&#34;engagement&#34;,&#34;addition&#34;,&#34;income&#34;,&#34;reduction&#34;,&#34;specific&#34;,&#34;training&#34;,&#34;interventions&#34;,&#34;lack&#34;,&#34;restrictions&#34;,&#34;effectiveness&
#34;,&#34;characteristics&#34;,&#34;satisfaction&#34;,&#34;student&#34;,&#34;improve&#34;,&#34;application&#34;,&#34;methodology&#34;,&#34;stroke&#34;,&#34;moderate&#34;,&#34;december&#34;,&#34;al&#34;,&#34;ppe&#34;,&#34;criteria&#34;,&#34;network&#34;,&#34;region&#34;,&#34;organization&#34;,&#34;importance&#34;,&#34;growth&#34;,&#34;energy&#34;,&#34;pakistan&#34;,&#34;teachers&#34;,&#34;sustainable&#34;,&#34;systematic&#34;,&#34;scores&#34;,&#34;recommendations&#34;,&#34;academic&#34;,&#34;economy&#34;,&#34;million&#34;,&#34;guidelines&#34;,&#34;approaches&#34;,&#34;strategy&#34;,&#34;implemented&#34;,&#34;vaccination&#34;,&#34;recent&#34;,&#34;evaluate&#34;,&#34;adults&#34;,&#34;family&#34;,&#34;similar&#34;,&#34;distributed&#34;,&#34;equipment&#34;,&#34;recovery&#34;,&#34;correlation&#34;,&#34;reduced&#34;,&#34;binding&#34;,&#34;body&#34;,&#34;sharing&#34;,&#34;researchers&#34;,&#34;wuhan&#34;,&#34;machine&#34;,&#34;analyzed&#34;,&#34;immune&#34;,&#34;assessment&#34;,&#34;awareness&#34;,&#34;x-ray&#34;,&#34;religious&#34;,&#34;sensitivity&#34;,&#34;globally&#34;,&#34;explore&#34;,&#34;ongoing&#34;,&#34;led&#34;,&#34;behavior&#34;,&#34;publishing&#34;,&#34;direct&#34;,&#34;procedures&#34;,&#34;theory&#34;,&#34;rates&#34;,&#34;determine&#34;,&#34;develop&#34;,&#34;resilience&#34;,&#34;bangladesh&#34;,&#34;exposure&#34;,&#34;context&#34;,&#34;factor&#34;,&#34;focus&#34;,&#34;understand&#34;,&#34;pneumonia&#34;,&#34;questions&#34;,&#34;readiness&#34;,&#34;well-being&#34;,&#34;î&#34;,&#34;deep&#34;,&#34;relevant&#34;,&#34;required&#34;,&#34;asian&#34;,&#34;usage&#34;,&#34;compounds&#34;,&#34;differences&#34;,&#34;odds&#34;,&#34;january&#34;,&#34;copyright&#34;,&#34;threat&#34;,&#34;covid&#34;,&#34;aspects&#34;,&#34;chest&#34;,&#34;essential&#34;,&#34;google&#34;,&#34;delivery&#34;,&#34;sustainability&#34;,&#34;sectors&#34;,&#34;affect&#34;,&#34;monitoring&#34;,&#34;primary&#34;,&#34;qualitative&#34;,&#34;ratio&#34;,&#34;discussed&#34;,&#34;ct&#34;,&#34;hand&#34;,&#34;
antiviral&#34;,&#34;common&#34;,&#34;infectious&#34;,&#34;perceptions&#34;,&#34;existing&#34;,&#34;suggest&#34;,&#34;normal&#34;,&#34;sampling&#34;,&#34;professionals&#34;,&#34;universiti&#34;,&#34;concern&#34;,&#34;solution&#34;,&#34;search&#34;,&#34;individual&#34;,&#34;carried&#34;,&#34;standard&#34;,&#34;concerns&#34;,&#34;assessed&#34;,&#34;tested&#34;,&#34;lives&#34;,&#34;attitudes&#34;,&#34;facilities&#34;,&#34;rapidly&#34;,&#34;quantitative&#34;,&#34;service&#34;,&#34;preparedness&#34;,&#34;influence&#34;,&#34;limitations&#34;,&#34;presence&#34;,&#34;confidence&#34;,&#34;internet&#34;,&#34;parameters&#34;,&#34;applications&#34;,&#34;index&#34;,&#34;report&#34;,&#34;databases&#34;,&#34;ace2&#34;,&#34;analyses&#34;,&#34;efforts&#34;,&#34;increasing&#34;,&#34;type&#34;,&#34;previous&#34;,&#34;markets&#34;,&#34;inhibitors&#34;,&#34;unprecedented&#34;,&#34;applied&#34;,&#34;highly&#34;,&#34;reports&#34;,&#34;products&#34;,&#34;wave&#34;,&#34;structural&#34;,&#34;compliance&#34;,&#34;surgery&#34;,&#34;crucial&#34;,&#34;received&#34;,&#34;faced&#34;,&#34;source&#34;,&#34;âˆ&#34;,&#34;production&#34;,&#34;adequate&#34;,&#34;providing&#34;,&#34;impacted&#34;,&#34;enhance&#34;,&#34;illness&#34;,&#34;therapy&#34;,&#34;intervention&#34;,&#34;educational&#34;,&#34;governments&#34;,&#34;dataset&#34;,&#34;technique&#34;,&#34;vulnerable&#34;,&#34;spreading&#34;,&#34;hydroxychloroquine&#34;,&#34;platform&#34;,&#34;molecular&#34;,&#34;experiences&#34;,&#34;day&#34;,&#34;gender&#34;,&#34;practical&#34;,&#34;demonstrated&#34;,&#34;laboratory&#34;,&#34;faculty&#34;,&#34;traditional&#34;,&#34;attention&#34;,&#34;address&#34;,&#34;lead&#34;,&#34;analyze&#34;,&#34;surveillance&#34;,&#34;studentsâ&#34;,&#34;recommended&#34;,&#34;fake&#34;,&#34;prediction&#34;,&#34;set&#34;,&#34;range&#34;,&#34;dynamics&#34;,&#34;mild&#34;,&#34;examined&#34;,&#34;measure&#34;,&#34;investigated&#34;,&#34;platforms&#34;,&#34;poor&#34;,&#34;increases&#34;,&#34;pm2.5&#34;,&#34;classification&#34;,&#34;ch
ain&#34;,&#34;diagnostic&#34;,&#34;pooled&#34;,&#34;trend&#34;,&#34;items&#34;,&#34;involved&#34;,&#34;emotional&#34;,&#34;mobile&#34;,&#34;samples&#34;,&#34;objectives&#34;,&#34;nurses&#34;,&#34;tools&#34;,&#34;mass&#34;,&#34;size&#34;,&#34;saliva&#34;,&#34;qtc&#34;,&#34;predict&#34;,&#34;form&#34;,&#34;uk&#34;,&#34;play&#34;,&#34;average&#34;,&#34;male&#34;,&#34;adverse&#34;,&#34;tests&#34;,&#34;statistical&#34;,&#34;classes&#34;,&#34;therapeutic&#34;,&#34;licence&#34;,&#34;taylor&#34;,&#34;francis&#34;,&#34;telemedicine&#34;,&#34;wellbeing&#34;,&#34;predicted&#34;,&#34;evaluation&#34;,&#34;types&#34;,&#34;june&#34;,&#34;shown&#34;,&#34;active&#34;,&#34;employees&#34;,&#34;challenge&#34;,&#34;employed&#34;,&#34;trading&#34;,&#34;india&#34;,&#34;mitigate&#34;,&#34;past&#34;,&#34;patterns&#34;,&#34;analysed&#34;,&#34;affecting&#34;,&#34;motivation&#34;,&#34;consequences&#34;,&#34;reality&#34;,&#34;private&#34;,&#34;providers&#34;,&#34;opportunities&#34;,&#34;issue&#34;,&#34;managing&#34;,&#34;medicine&#34;,&#34;experienced&#34;,&#34;confinement&#34;,&#34;meta-analysis&#34;,&#34;include&#34;,&#34;causing&#34;,&#34;supply&#34;,&#34;scientific&#34;,&#34;outcome&#34;,&#34;events&#34;,&#34;descriptive&#34;,&#34;authorities&#34;,&#34;consumers&#34;,&#34;potentially&#34;,&#34;introduction&#34;,&#34;organizations&#34;,&#34;risks&#34;,&#34;interviews&#34;,&#34;secondary&#34;,&#34;demand&#34;,&#34;sd&#34;,&#34;times&#34;,&#34;declared&#34;,&#34;rna&#34;,&#34;tool&#34;,&#34;construction&#34;,&#34;users&#34;,&#34;apps&#34;,&#34;resulted&#34;,&#34;sleep&#34;,&#34;stability&#34;,&#34;completed&#34;,&#34;safe&#34;,&#34;burden&#34;,&#34;action&#34;,&#34;fever&#34;,&#34;leading&#34;,&#34;hygiene&#34;,&#34;networks&#34;,&#34;alternative&#34;,&#34;sources&#34;,&#34;outbreaks&#34;,&#34;female&#34;,&#34;suggested&#34;,&#34;adjusted&#34;,&#34;estimated&#34;,&#34;llc&#34;,&#34;finally&#34;,&#34;specifically&#34;,&#34;i.e&#34;,&#34;months&#34;,&#34;marketing&#34;,&#34;emerged&#34;,&#34;ori
ginal&#34;,&#34;disorders&#34;,&#34;mechanisms&#34;,&#34;ieee&#34;,&#34;long-term&#34;,&#34;solutions&#34;,&#34;pollution&#34;,&#34;distance&#34;,&#34;decision&#34;,&#34;globe&#34;,&#34;benefits&#34;,&#34;pattern&#34;,&#34;temperature&#34;,&#34;lung&#34;,&#34;ict&#34;,&#34;iot&#34;,&#34;policymakers&#34;,&#34;chronic&#34;,&#34;reproduction&#34;,&#34;central&#34;,&#34;recently&#34;,&#34;insights&#34;,&#34;aor&#34;,&#34;algorithm&#34;,&#34;southeast&#34;,&#34;difference&#34;,&#34;isolation&#34;,&#34;sars&#34;,&#34;assay&#34;,&#34;describe&#34;,&#34;emergence&#34;,&#34;content&#34;,&#34;protection&#34;,&#34;skills&#34;,&#34;manage&#34;,&#34;healthy&#34;,&#34;originality&#34;,&#34;adopted&#34;,&#34;barriers&#34;,&#34;designed&#34;,&#34;february&#34;,&#34;mitigation&#34;,&#34;communities&#34;,&#34;asymptomatic&#34;,&#34;comprehensive&#34;,&#34;oral&#34;,&#34;highlights&#34;,&#34;trust&#34;,&#34;males&#34;,&#34;continue&#34;,&#34;affects&#34;,&#34;humans&#34;,&#34;engineering&#34;,&#34;person&#34;,&#34;prices&#34;,&#34;fast&#34;,&#34;pubmed&#34;,&#34;english&#34;,&#34;medium&#34;,&#34;spike&#34;,&#34;close&#34;,&#34;prior&#34;,&#34;reporting&#34;,&#34;target&#34;,&#34;regions&#34;,&#34;short&#34;,&#34;discuss&#34;,&#34;penerbit&#34;,&#34;africa&#34;,&#34;host&#34;,&#34;neural&#34;,&#34;predictors&#34;,&#34;basis&#34;,&#34;females&#34;,&#34;school&#34;,&#34;natural&#34;,&#34;institute&#34;,&#34;week&#34;,&#34;genome&#34;,&#34;emerging&#34;,&#34;pandemics&#34;,&#34;east&#34;,&#34;aerosol&#34;,&#34;suspected&#34;,&#34;conventional&#34;,&#34;specificity&#34;,&#34;achieve&#34;,&#34;odl&#34;,&#34;unique&#34;,&#34;version&#34;,&#34;detected&#34;,&#34;investment&#34;,&#34;guide&#34;,&#34;disruption&#34;,&#34;press&#34;,&#34;median&#34;,&#34;burnout&#34;,&#34;availability&#34;,&#34;collection&#34;,&#34;emerald&#34;,&#34;informa&#34;,&#34;e.g&#34;,&#34;basic&#34;,&#34;city&#34;,&#34;ability&#34;,&#34;require&#34;,&#34;planning&#34;,&#34;creative&#34;,&#34;capacity&#34;,&#34;cells&#
34;,&#34;cell&#34;,&#34;independent&#34;,&#34;wiley&#34;,&#34;dentists&#34;,&#34;epidemiological&#34;,&#34;self-efficacy&#34;,&#34;effectively&#34;,&#34;price&#34;,&#34;values&#34;,&#34;proteins&#34;,&#34;behavioural&#34;,&#34;surgical&#34;,&#34;journal&#34;,&#34;facing&#34;,&#34;protease&#34;,&#34;medicines&#34;,&#34;coverage&#34;,&#34;field&#34;,&#34;reducing&#34;,&#34;oil&#34;,&#34;condition&#34;,&#34;evaluated&#34;,&#34;real-time&#34;,&#34;contribute&#34;,&#34;highlight&#34;,&#34;receiving&#34;,&#34;partial&#34;,&#34;efficacy&#34;,&#34;living&#34;,&#34;proper&#34;,&#34;remdesivir&#34;,&#34;remains&#34;,&#34;studied&#34;,&#34;statistics&#34;,&#34;reviewed&#34;,&#34;cognitive&#34;,&#34;hypertension&#34;,&#34;sites&#34;,&#34;aged&#34;,&#34;fight&#34;,&#34;exclusive&#34;,&#34;inclusion&#34;,&#34;immunity&#34;,&#34;complications&#34;,&#34;banking&#34;,&#34;additional&#34;,&#34;detect&#34;,&#34;building&#34;,&#34;consumption&#34;,&#34;assist&#34;,&#34;efficient&#34;,&#34;improvement&#34;,&#34;coronaviruses&#34;,&#34;incidence&#34;,&#34;united&#34;,&#34;job&#34;,&#34;estimate&#34;,&#34;properties&#34;,&#34;no2&#34;,&#34;behavioral&#34;,&#34;expected&#34;,&#34;improved&#34;,&#34;doctors&#34;,&#34;destination&#34;,&#34;modelling&#34;,&#34;electronic&#34;,&#34;involving&#34;,&#34;simulation&#34;,&#34;morbidity&#34;,&#34;chinese&#34;,&#34;database&#34;,&#34;initial&#34;,&#34;widely&#34;,&#34;face-to-face&#34;,&#34;integrated&#34;,&#34;diabetes&#34;,&#34;logistic&#34;,&#34;decrease&#34;,&#34;versus&#34;,&#34;interval&#34;,&#34;web&#34;,&#34;demographic&#34;,&#34;phases&#34;,&#34;date&#34;,&#34;influenza&#34;,&#34;achieved&#34;,&#34;actions&#34;,&#34;additionally&#34;,&#34;tracing&#34;,&#34;smart&#34;,&#34;europe&#34;,&#34;contagious&#34;,&#34;middle&#34;,&#34;perspective&#34;,&#34;climate&#34;,&#34;swab&#34;,&#34;language&#34;,&#34;linear&#34;,&#34;treat&#34;,&#34;interaction&#34;,&#34;interactions&#34;,&#34;zakat&#34;,&#34;health-care&#34;,&#34;eating&#34;,&#34;resulting&#
34;,&#34;relationships&#34;,&#34;questionnaires&#34;,&#34;discussion&#34;,&#34;license&#34;,&#34;spatial&#34;,&#34;ministry&#34;,&#34;vital&#34;,&#34;discusses&#34;,&#34;recorded&#34;,&#34;usefulness&#34;,&#34;programs&#34;,&#34;american&#34;,&#34;depressive&#34;,&#34;materials&#34;,&#34;strong&#34;,&#34;modified&#34;,&#34;cost&#34;,&#34;explored&#34;,&#34;random&#34;,&#34;weeks&#34;,&#34;adoption&#34;,&#34;comorbidities&#34;,&#34;ml&#34;,&#34;algorithms&#34;,&#34;comparison&#34;,&#34;improving&#34;,&#34;established&#34;,&#34;structure&#34;,&#34;function&#34;,&#34;physicians&#34;,&#34;degree&#34;,&#34;entry&#34;,&#34;highlighted&#34;,&#34;reaction&#34;,&#34;singapore&#34;,&#34;infrastructure&#34;,&#34;complex&#34;,&#34;suitable&#34;,&#34;section&#34;,&#34;demonstrate&#34;,&#34;sciences&#34;,&#34;challenging&#34;,&#34;generated&#34;,&#34;decreased&#34;,&#34;south&#34;,&#34;masks&#34;,&#34;selection&#34;,&#34;hajj&#34;,&#34;rt-pcr&#34;,&#34;single&#34;,&#34;proportion&#34;,&#34;b.v&#34;,&#34;equation&#34;,&#34;ai&#34;,&#34;series&#34;,&#34;feature&#34;,&#34;mechanism&#34;,&#34;believed&#34;,&#34;accurate&#34;,&#34;scopus&#34;,&#34;reliability&#34;,&#34;pâ&#34;,&#34;lt&#34;,&#34;measured&#34;,&#34;history&#34;,&#34;finding&#34;,&#34;liver&#34;,&#34;companies&#34;,&#34;sars-cov&#34;,&#34;nigeria&#34;,&#34;receptor&#34;,&#34;frontline&#34;,&#34;combination&#34;,&#34;software&#34;,&#34;determined&#34;,&#34;urgent&#34;,&#34;returns&#34;,&#34;participated&#34;,&#34;acid&#34;,&#34;post-covid-19&#34;,&#34;psychosocial&#34;,&#34;successful&#34;,&#34;empirical&#34;,&#34;directly&#34;,&#34;real&#34;,&#34;spss&#34;,&#34;duration&#34;,&#34;plan&#34;,&#34;beginning&#34;,&#34;covid-19-related&#34;,&#34;success&#34;,&#34;blood&#34;,&#34;personnel&#34;,&#34;remain&#34;,&#34;imposed&#34;,&#34;created&#34;,&#34;examines&#34;,&#34;requires&#34;,&#34;called&#34;,&#34;commons&#34;,&#34;focused&#34;,&#34;final&#34;,&#34;water&#34;,&#34;periods&#34;,&#34;advanced&#34;,&#34;utilized&#34;,&#34
;addressing&#34;,&#34;damage&#34;,&#34;citizens&#34;,&#34;plasma&#34;,&#34;amount&#34;,&#34;influenced&#34;,&#34;curve&#34;,&#34;curb&#34;,&#34;discovered&#34;,&#34;icu&#34;,&#34;treatments&#34;,&#34;nations&#34;,&#34;optimal&#34;,&#34;urological&#34;,&#34;neurosurgical&#34;,&#34;rf-ssa&#34;,&#34;trends&#34;,&#34;frequency&#34;,&#34;italy&#34;,&#34;stakeholders&#34;,&#34;reliable&#34;,&#34;handling&#34;,&#34;enhanced&#34;,&#34;august&#34;,&#34;agents&#34;,&#34;viruses&#34;,&#34;perspectives&#34;,&#34;emotion&#34;,&#34;negatively&#34;,&#34;introduced&#34;,&#34;analyse&#34;,&#34;residents&#34;,&#34;adult&#34;,&#34;corona&#34;,&#34;approved&#34;,&#34;nursing&#34;,&#34;thinking&#34;,&#34;forced&#34;,&#34;complete&#34;,&#34;suggests&#34;,&#34;populations&#34;,&#34;emissions&#34;,&#34;economies&#34;],&#34;freq&#34;:[634,569,561,539,526,483,481,447,443,441,433,417,396,394,388,374,367,355,353,352,343,336,335,334,332,331,330,329,322,321,319,314,311,310,302,300,299,299,296,295,294,293,292,290,290,288,288,285,284,282,280,279,279,276,273,272,271,271,269,268,268,265,264,261,260,258,257,257,252,252,252,250,248,248,248,248,242,241,241,241,237,232,232,231,230,230,230,228,228,227,226,226,225,225,224,224,223,221,221,219,215,215,214,214,214,213,211,211,211,210,209,208,206,204,204,202,199,199,198,197,196,196,196,196,195,195,194,194,194,192,192,192,191,190,190,190,189,189,187,186,186,185,185,185,183,182,179,179,178,178,177,176,176,176,175,175,172,172,172,171,169,169,169,168,168,167,167,165,165,165,164,164,164,162,162,161,161,161,160,160,159,159,159,159,158,158,157,157,157,156,155,155,154,154,154,154,154,153,153,153,152,152,152,152,152,151,151,151,151,150,150,149,149,149,149,148,148,148,148,148,147,147,147,146,146,145,144,144,144,143,142,142,142,142,142,141,141,141,141,140,140,140,140,140,139,139,139,139,138,138,138,138,138,137,137,137,137,137,136,136,136,135,135,134,134,133,133,133,133,133,132,132,131,131,131,131,130,130,130,129,128,128,128,128,127,127,127,127,126,126,126,126,125,125,12
5,125,125,124,124,124,123,123,123,123,122,122,122,121,121,121,120,120,120,120,120,119,119,119,119,119,118,118,118,118,118,118,118,117,116,116,116,116,116,115,115,115,115,115,115,115,114,114,114,114,114,114,113,113,113,113,112,112,112,112,112,112,111,111,111,111,111,111,111,111,111,110,110,110,110,110,110,109,109,109,109,109,109,108,108,108,108,107,107,107,107,107,107,107,107,106,106,106,105,105,105,105,105,105,105,105,105,105,105,105,105,104,104,104,104,104,104,103,103,103,102,102,102,102,102,102,102,102,101,101,101,101,101,101,101,101,101,101,101,101,101,100,99,99,99,99,99,99,99,98,98,98,98,98,98,98,98,97,97,97,97,97,97,97,97,97,96,96,96,95,95,95,95,95,95,95,95,95,95,95,94,94,94,94,94,94,94,94,94,94,94,94,94,94,94,94,93,93,93,93,93,93,93,92,92,92,92,92,92,92,92,91,91,91,91,91,91,91,91,91,91,90,90,90,90,90,90,90,90,90,90,90,90,89,89,89,89,89,89,88,88,88,88,88,88,88,88,88,88,88,88,87,87,87,87,87,87,87,87,86,86,86,86,86,86,86,86,86,86,86,86,86,85,85,85,85,85,85,85,84,84,84,84,84,84,84,84,84,84,84,84,84,83,83,83,83,83,83,83,83,82,82,82,82,82,82,82,82,82,81,81,81,81,81,81,81,81,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,79,79,79,79,79,79,79,79,78,78,78,78,78,78,78,78,77,77,77,77,77,77,77,77,77,77,77,76,76,76,76,76,76,76,76,76,76,76,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,75,74,74,74,74,74,74,74,74,74,74,74,74,73,73,73,73,73,73,73,73,73,73,73,72,72,72,72,72,72,72,72,72,72,72,72,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,71,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,70,69,69,69,69,69,69,69,69,69,69,69,69,69,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68,68,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,67,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,66,65,65,65,65,65,65,65,65,65,65,65,65,65,65,65,65,65,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,63,63,63,63,63,63,63,63,63,63,63,63,63,63,62,62,62,62,62,62,62,62,62,62,62,62,62,62,62,62,62,62,62,62,62,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,61,60,60,
60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,59,59,59,59,59,59,59,59,59,59,59,59,59,59,59,59,59,59,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,57,56,56,56,56,56,56],&#34;fontFamily&#34;:&#34;Segoe UI&#34;,&#34;fontWeight&#34;:&#34;bold&#34;,&#34;color&#34;:&#34;random-dark&#34;,&#34;minSize&#34;:0,&#34;weightFactor&#34;:0.28391167192429,&#34;backgroundColor&#34;:&#34;white&#34;,&#34;gridSize&#34;:0,&#34;minRotation&#34;:-0.785398163397448,&#34;maxRotation&#34;:0.785398163397448,&#34;shuffle&#34;:true,&#34;rotateRatio&#34;:0.4,&#34;shape&#34;:&#34;circle&#34;,&#34;ellipticity&#34;:0.65,&#34;figBase64&#34;:null,&#34;hover&#34;:null},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Top 1000 terms extracted from the abstract
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$author_keywords %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-11&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-3&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-3&#34;&gt;{&#34;x&#34;:{&#34;word&#34;:[&#34;covid-19&#34;,&#34;pandemic&#34;,&#34;learning&#34;,&#34;coronavirus&#34;,&#34;health&#34;,&#34;sars-cov-2&#34;,&#34;malaysia&#34;,&#34;social&#34;,&#34;online&#34;,&#34;education&#34;,&#34;disease&#34;,&#34;analysis&#34;,&#34;control&#34;,&#34;anxiety&#34;,&#34;technology&#34;,&#34;students&#34;,&#34;teaching&#34;,&#34;mental&#34;,&#34;movement&#34;,&#34;model&#34;,&#34;public&#34;,&#34;management&#34;,&#34;media&#34;,&#34;stress&#34;,&#34;healthcare&#34;,&#34;lockdown&#34;,&#34;machine&#34;,&#34;psychological&#34;,&#34;quality&#34;,&#34;risk&#34;,&#34;medical&#34;,&#34;food&#34;,&#34;policy&#34;,&#34;system&#34;,&#34;depression&#34;,&#34;vaccine&#34;,&#34;respiratory&#34;,&#34;care&#34;,&#34;university&#34;,&#34;impact&#34;,&#34;clinical&#34;,&#34;deep&#34;,&#34;knowledge&#34;,&#34;economic&#34;,&#34;virus&#34;,&#34;diseases&#34;,&#34;tourism&#34;,&#34;digital&#34;,&#34;neural&#34;,&#34;network&#34;,&#34;theory&#34;,&#34;waste&#34;,&#34;development&#34;,&#34;islamic&#34;,&#34;image&#34;,&#34;epidemic&#34;,&#34;e-learning&#34;,&#34;mortality&#34;,&#34;performance&#34;,&#34;infectious&#34;,&#34;sustainable&#34;,&#34;workers&#34;,&#34;covid&#34;,&#34;syndrome&#34;,&#34;artificial&#34;,&#34;intention&#34;,&#34;antiviral&#34;,&#34;drug&#34;,&#34;asia&#34;,&#34;transmission&#34;,&#34;practice&#34;,&#34;infection&#34;,&#34;global&#34;,&#34;index&#34;,&#34;perception&#34;,&#34;air&#34;,&#34;security&#34;,&#34;acceptance&#34;,&#34;medicine&#34;,&#34;stock&#34;,&#34;distance&#34;,&#34;distancing&#34;,&#34;information&#34;,&#34;pneumonia&#34;,&#34;communication&#34;,&#34;resilience&#34;,&#34;screening&#34;,&#34;acute&#34;,&#34;perceived&#34;,&#34;attitude&#34;,&#34;virtual&#34;,&#34;outbreak&#34;,&#34;sars&#34;,&#34;sustainability&#34;,&#34;data&#34;,&#34;crisis&#34;,&#34;pollution&#34;,&#34;community&#34;,&#34;measures&#34;,&#34;fear&#34;,&#34;review&#34;,&#34;pr
otective&#34;,&#34;support&#34;,&#34;behavior&#34;,&#34;emergency&#34;,&#34;response&#34;,&#34;financial&#34;,&#34;student&#34;,&#34;systems&#34;,&#34;epidemiology&#34;,&#34;life&#34;,&#34;quarantine&#34;,&#34;prevention&#34;,&#34;industry&#34;,&#34;intelligence&#34;,&#34;forecasting&#34;,&#34;smart&#34;,&#34;coping&#34;,&#34;pandemics&#34;,&#34;drugs&#34;,&#34;x-ray&#34;,&#34;travel&#34;,&#34;detection&#34;,&#34;survey&#34;,&#34;cancer&#34;,&#34;monitoring&#34;,&#34;bangladesh&#34;,&#34;hydroxychloroquine&#34;,&#34;physical&#34;,&#34;pakistan&#34;,&#34;immunity&#34;,&#34;molecular&#34;,&#34;decision&#34;,&#34;equipment&#34;,&#34;services&#34;,&#34;news&#34;,&#34;distress&#34;,&#34;price&#34;,&#34;service&#34;,&#34;factors&#34;,&#34;chest&#34;,&#34;destination&#34;,&#34;human&#34;,&#34;satisfaction&#34;,&#34;research&#34;,&#34;hospital&#34;,&#34;regression&#34;,&#34;diagnosis&#34;,&#34;vaccines&#34;,&#34;modelling&#34;,&#34;simulation&#34;,&#34;behaviour&#34;,&#34;covid19&#34;,&#34;mco&#34;,&#34;networks&#34;,&#34;transfer&#34;,&#34;supply&#34;,&#34;chain&#34;,&#34;environmental&#34;,&#34;therapy&#34;,&#34;corona&#34;,&#34;rapid&#34;,&#34;government&#34;,&#34;personal&#34;,&#34;stability&#34;,&#34;southeast&#34;,&#34;home&#34;,&#34;market&#34;,&#34;internet&#34;,&#34;motivation&#34;,&#34;fake&#34;,&#34;optimization&#34;,&#34;neurosurgery&#34;,&#34;events&#34;,&#34;energy&#34;,&#34;surgery&#34;,&#34;children&#34;,&#34;change&#34;,&#34;hiv&#34;,&#34;assessment&#34;,&#34;safety&#34;,&#34;population&#34;,&#34;strategy&#34;,&#34;protection&#34;,&#34;treatment&#34;,&#34;international&#34;,&#34;immune&#34;,&#34;activity&#34;,&#34;environment&#34;,&#34;vaccination&#34;,&#34;docking&#34;,&#34;2019-ncov&#34;,&#34;women&#34;,&#34;engagement&#34;,&#34;severe&#34;,&#34;challenges&#34;,&#34;diabetes&#34;,&#34;mobile&#34;,&#34;project&#34;,&#34;spike&#34;,&#34;markets&#34;,&#34;adverse&#34;,&#34;convolutional&#34;,&#34;perceptions&#34;,&#34;protein&#34;,&#34;protease&#34;,&#34;sha
ring&#34;,&#34;plasma&#34;,&#34;storm&#34;,&#34;disorders&#34;,&#34;educational&#34;,&#34;volatility&#34;,&#34;oil&#34;,&#34;countries&#34;,&#34;ct&#34;,&#34;reproduction&#34;,&#34;climate&#34;,&#34;impacts&#34;,&#34;well-being&#34;,&#34;mass&#34;,&#34;preventive&#34;,&#34;mathematical&#34;,&#34;readiness&#34;,&#34;worker&#34;,&#34;people&#34;,&#34;policies&#34;,&#34;school&#34;,&#34;indonesia&#34;,&#34;diagnostic&#34;,&#34;resources&#34;,&#34;nigeria&#34;,&#34;images&#34;,&#34;ace2&#34;,&#34;approach&#34;,&#34;systematic&#34;,&#34;infections&#34;,&#34;nutrition&#34;,&#34;rna&#34;,&#34;cytokine&#34;,&#34;business&#34;,&#34;fuzzy&#34;,&#34;pharmacy&#34;,&#34;prediction&#34;,&#34;pharmacists&#34;,&#34;classification&#34;,&#34;academic&#34;,&#34;adults&#34;,&#34;china&#34;,&#34;wavelet&#34;,&#34;economy&#34;,&#34;test&#34;,&#34;intervention&#34;,&#34;saudi&#34;,&#34;arabia&#34;,&#34;herd&#34;,&#34;critical&#34;,&#34;method&#34;,&#34;computed&#34;,&#34;tomography&#34;,&#34;viral&#34;,&#34;dental&#34;,&#34;awareness&#34;,&#34;preparedness&#34;,&#34;status&#34;,&#34;application&#34;,&#34;trend&#34;,&#34;forest&#34;,&#34;finance&#34;,&#34;rate&#34;,&#34;testing&#34;,&#34;strategies&#34;,&#34;africa&#34;,&#34;google&#34;,&#34;brand&#34;,&#34;iran&#34;,&#34;literacy&#34;,&#34;loss&#34;,&#34;nervous&#34;,&#34;hand&#34;,&#34;hygiene&#34;,&#34;cytokines&#34;,&#34;training&#34;,&#34;therapeutics&#34;,&#34;body&#34;,&#34;indices&#34;,&#34;construction&#34;,&#34;stroke&#34;,&#34;convalescent&#34;,&#34;ict&#34;,&#34;modeling&#34;,&#34;patients&#34;,&#34;reality&#34;,&#34;responsibility&#34;,&#34;equity&#34;,&#34;gold&#34;,&#34;misinformation&#34;,&#34;pedagogy&#34;,&#34;meta-analysis&#34;,&#34;thinking&#34;,&#34;recovery&#34;,&#34;experience&#34;,&#34;hesitancy&#34;,&#34;sir&#34;,&#34;malaysian&#34;,&#34;multi-criteria&#34;,&#34;industrial&#34;,&#34;delivery&#34;,&#34;derivative&#34;,&#34;numerical&#34;,&#34;national&#34;,&#34;cov&#34;,&#34;interaction&#34;,&#34;world&#34;,&#34;rep
urposing&#34;,&#34;innovation&#34;,&#34;biomarkers&#34;,&#34;vector&#34;,&#34;wellbeing&#34;,&#34;angiotensin-converting&#34;,&#34;enzyme&#34;,&#34;contact&#34;,&#34;growth&#34;,&#34;pathology&#34;,&#34;mining&#34;,&#34;chloroquine&#34;,&#34;design&#34;,&#34;smes&#34;,&#34;telemedicine&#34;,&#34;characteristics&#34;,&#34;oral&#34;,&#34;influenza&#34;,&#34;employee&#34;,&#34;fractional&#34;,&#34;algorithm&#34;,&#34;revolution&#34;,&#34;otolaryngology&#34;,&#34;angiotensin&#34;,&#34;leadership&#34;,&#34;privacy&#34;,&#34;iomt&#34;,&#34;country&#34;,&#34;convolution&#34;,&#34;staff&#34;,&#34;isolation&#34;,&#34;consumption&#34;,&#34;cardiovascular&#34;,&#34;sars-cov2&#34;,&#34;aids&#34;,&#34;sensitivity&#34;,&#34;generation&#34;,&#34;laboratory&#34;,&#34;imaging&#34;,&#34;basic&#34;,&#34;sentiment&#34;,&#34;psychosocial&#34;,&#34;feature&#34;,&#34;cnn&#34;,&#34;pulmonary&#34;,&#34;stimulus&#34;,&#34;inflammatory&#34;,&#34;anaesthesia&#34;,&#34;religious&#34;,&#34;secondary&#34;,&#34;evolution&#34;,&#34;inhibitors&#34;,&#34;factor&#34;,&#34;aid&#34;,&#34;normal&#34;,&#34;remdesivir&#34;,&#34;spectrum&#34;,&#34;random&#34;,&#34;study&#34;,&#34;receptor&#34;,&#34;sciences&#34;,&#34;multiple&#34;,&#34;symptoms&#34;,&#34;banking&#34;,&#34;nanoparticles&#34;,&#34;curve&#34;,&#34;practices&#34;,&#34;action&#34;,&#34;topsis&#34;,&#34;english&#34;,&#34;asean&#34;,&#34;corporate&#34;,&#34;conservation&#34;,&#34;outcome&#34;,&#34;urology&#34;,&#34;carbon&#34;,&#34;remote&#34;,&#34;cross-sectional&#34;,&#34;seir&#34;,&#34;entropy&#34;,&#34;opportunity&#34;,&#34;hearing&#34;,&#34;visual&#34;,&#34;spillover&#34;,&#34;type&#34;,&#34;mask&#34;,&#34;airline&#34;,&#34;wastewater&#34;,&#34;covidâ&#34;,&#34;trade&#34;,&#34;science&#34;,&#34;perspective&#34;,&#34;frailty&#34;,&#34;app&#34;,&#34;psychology&#34;,&#34;discourse&#34;,&#34;customer&#34;,&#34;organizational&#34;,&#34;text&#34;,&#34;music&#34;,&#34;belief&#34;,&#34;sequence&#34;,&#34;middle-income&#34;,&#34;bitcoin&#34;,&#34;inve
stment&#34;,&#34;male&#34;,&#34;panel&#34;,&#34;circular&#34;,&#34;cell&#34;,&#34;wave&#34;,&#34;gender&#34;,&#34;plastic&#34;,&#34;cloud&#34;,&#34;severity&#34;,&#34;uncertainty&#34;,&#34;real-time&#34;,&#34;infrastructure&#34;,&#34;sensors&#34;,&#34;electronic&#34;,&#34;content&#34;,&#34;co2&#34;,&#34;collaboration&#34;,&#34;chronic&#34;,&#34;venous&#34;,&#34;immunotherapy&#34;,&#34;fractal-fractional&#34;,&#34;fisheries&#34;,&#34;aerosols&#34;,&#34;trial&#34;,&#34;occupational&#34;,&#34;density&#34;,&#34;income&#34;,&#34;entrepreneurial&#34;,&#34;surveillance&#34;,&#34;institutions&#34;,&#34;natural&#34;,&#34;vulnerability&#34;,&#34;linear&#34;,&#34;structural&#34;,&#34;engineering&#34;,&#34;mindfulness&#34;,&#34;features&#34;,&#34;mitigation&#34;,&#34;reproductive&#34;,&#34;aquaculture&#34;,&#34;biosensor&#34;,&#34;asia-pacific&#34;,&#34;adaptive&#34;,&#34;disinfection&#34;,&#34;eating&#34;,&#34;methods&#34;,&#34;iot&#34;,&#34;favipiravir&#34;,&#34;recurrent&#34;,&#34;singular&#34;,&#34;vulnerable&#34;,&#34;integration&#34;,&#34;interventions&#34;,&#34;k-nearest&#34;,&#34;vision&#34;,&#34;economics&#34;,&#34;pregnancy&#34;,&#34;fatality&#34;,&#34;tracing&#34;,&#34;guidelines&#34;,&#34;models&#34;,&#34;conspiracy&#34;,&#34;nasopharyngeal&#34;,&#34;augmented&#34;,&#34;antibody&#34;,&#34;results&#34;,&#34;mutation&#34;,&#34;health-care&#34;,&#34;shopping&#34;,&#34;lifestyle&#34;,&#34;resistance&#34;,&#34;attitudes&#34;,&#34;availability&#34;,&#34;emissions&#34;,&#34;arima&#34;,&#34;medicines&#34;,&#34;family&#34;,&#34;returns&#34;,&#34;behavioural&#34;,&#34;governance&#34;,&#34;tool&#34;,&#34;language&#34;,&#34;clinic&#34;,&#34;qualitative&#34;,&#34;orientation&#34;,&#34;insecurity&#34;,&#34;emerging&#34;,&#34;zoonotic&#34;,&#34;logistic&#34;,&#34;sars-cov&#34;,&#34;zinc&#34;,&#34;rises&#34;,&#34;cov-2&#34;,&#34;cognitive&#34;,&#34;experiences&#34;,&#34;teachers&#34;,&#34;classroom&#34;,&#34;correlation&#34;,&#34;dynamic&#34;,&#34;ann&#34;,&#34;professionals&#34;,&
#34;geographical&#34;,&#34;thromboembolism&#34;,&#34;coronaviruses&#34;,&#34;dentistry&#34;,&#34;endoscopy&#34;,&#34;clustering&#34;,&#34;proteomics&#34;,&#34;sources&#34;,&#34;computing&#34;,&#34;tree&#34;,&#34;planning&#34;,&#34;poverty&#34;,&#34;sales&#34;,&#34;jakarta&#34;,&#34;efficiency&#34;,&#34;operator&#34;,&#34;systemic&#34;,&#34;gut&#34;,&#34;chart&#34;,&#34;water&#34;,&#34;exercise&#34;,&#34;gastrointestinal&#34;,&#34;ecological&#34;,&#34;binary&#34;,&#34;selection&#34;,&#34;form&#34;,&#34;rights&#34;,&#34;wildlife&#34;,&#34;continuity&#34;,&#34;green&#34;,&#34;upper&#34;,&#34;pacific&#34;,&#34;function&#34;,&#34;infertility&#34;,&#34;semen&#34;,&#34;concern&#34;,&#34;collaborative&#34;,&#34;azithromycin&#34;,&#34;buying&#34;,&#34;usage&#34;,&#34;surface&#34;,&#34;product&#34;,&#34;scarcity&#34;,&#34;adoption&#34;,&#34;sequencing&#34;,&#34;accessibility&#34;,&#34;mers-cov&#34;,&#34;biomarker&#34;,&#34;processing&#34;,&#34;aviation&#34;,&#34;rt-pcr&#34;,&#34;polymerase&#34;,&#34;variants&#34;,&#34;adolescents&#34;,&#34;cycle&#34;,&#34;panic&#34;,&#34;module&#34;,&#34;domestic&#34;,&#34;emission&#34;,&#34;pm2.5&#34;,&#34;lightweight&#34;,&#34;non-pharmaceutical&#34;,&#34;dynamics&#34;,&#34;twitter&#34;,&#34;opinion&#34;,&#34;animal&#34;,&#34;cerebral&#34;,&#34;thrombosis&#34;,&#34;tools&#34;,&#34;enterprises&#34;,&#34;sector&#34;,&#34;existence&#34;,&#34;adams-bashforth&#34;,&#34;ab&#34;,&#34;package&#34;,&#34;bias&#34;,&#34;emotional&#34;,&#34;city&#34;,&#34;illness&#34;,&#34;law&#34;,&#34;window&#34;,&#34;decision-making&#34;,&#34;process&#34;,&#34;droplets&#34;,&#34;scan&#34;,&#34;cells&#34;,&#34;inflammation&#34;,&#34;physics&#34;,&#34;biosensors&#34;,&#34;segmentation&#34;,&#34;airway&#34;,&#34;sedentary&#34;,&#34;weight&#34;,&#34;active&#34;,&#34;matrix&#34;,&#34;optimal&#34;,&#34;nurses&#34;,&#34;chains&#34;,&#34;palliative&#34;,&#34;capacity&#34;,&#34;techniques&#34;,&#34;coherence&#34;,&#34;burnout&#34;,&#34;evaluation&#34;,&#34;devices&#34;,&#34;
surveys&#34;,&#34;questionnaires&#34;,&#34;psychiatry&#34;,&#34;organization&#34;,&#34;ivermectin&#34;,&#34;robot&#34;,&#34;goals&#34;,&#34;sdg&#34;,&#34;behavioral&#34;,&#34;rational&#34;,&#34;phytochemicals&#34;,&#34;neighbor&#34;,&#34;matter&#34;,&#34;blood&#34;,&#34;nucleocapsid&#34;,&#34;workplace&#34;,&#34;inhibitor&#34;,&#34;consensus&#34;,&#34;particulate&#34;,&#34;limited&#34;,&#34;migrant&#34;,&#34;silver&#34;,&#34;job&#34;,&#34;pay&#34;,&#34;marketing&#34;,&#34;technologies&#34;,&#34;lstm&#34;,&#34;institution&#34;,&#34;chinese&#34;,&#34;facilities&#34;,&#34;saliva&#34;,&#34;dysfunction&#34;,&#34;hypertension&#34;,&#34;thailand&#34;,&#34;coverage&#34;,&#34;immunoassay&#34;,&#34;targeted&#34;,&#34;agents&#34;,&#34;lung&#34;,&#34;genetic&#34;,&#34;yemen&#34;,&#34;parameters&#34;,&#34;integrated&#34;,&#34;addiction&#34;,&#34;opioid&#34;,&#34;substance&#34;,&#34;stressors&#34;,&#34;apps&#34;,&#34;behaviors&#34;,&#34;effectiveness&#34;,&#34;trust&#34;,&#34;sarawak&#34;,&#34;tb&#34;,&#34;stigma&#34;,&#34;taiwan&#34;,&#34;mhealth&#34;,&#34;utaut2&#34;,&#34;emotion&#34;,&#34;spatial&#34;,&#34;pollutants&#34;,&#34;rural&#34;,&#34;singapore&#34;,&#34;exchange&#34;,&#34;utaut&#34;,&#34;principles&#34;,&#34;humanitarian&#34;,&#34;disaster&#34;,&#34;fiscal&#34;,&#34;barriers&#34;,&#34;self-efficacy&#34;,&#34;pattern&#34;,&#34;relationship&#34;,&#34;trials&#34;,&#34;employment&#34;,&#34;inclusion&#34;,&#34;contagion&#34;,&#34;asthma&#34;,&#34;happiness&#34;,&#34;alternative&#34;,&#34;death&#34;,&#34;ppe&#34;,&#34;condition&#34;,&#34;mcdm&#34;,&#34;violence&#34;,&#34;simulations&#34;,&#34;temporal&#34;,&#34;users&#34;,&#34;coronavirus-2&#34;,&#34;shortages&#34;,&#34;india&#34;,&#34;literature&#34;,&#34;logistics&#34;,&#34;d-dimer&#34;,&#34;fintech&#34;,&#34;sabah&#34;,&#34;unemployment&#34;,&#34;issues&#34;,&#34;ground-glass&#34;,&#34;regulatory&#34;,&#34;exposure&#34;,&#34;spread&#34;,&#34;forecast&#34;,&#34;hospitality&#34;,&#34;willingness&#34;,&#34;agency&#34;,&#34;
x-rays&#34;,&#34;esl&#34;,&#34;pathogenesis&#34;,&#34;low&#34;,&#34;studentâ&#34;,&#34;ethical&#34;,&#34;consumer&#34;,&#34;injury&#34;,&#34;delay&#34;,&#34;aerosol&#34;,&#34;efficacy&#34;,&#34;habits&#34;,&#34;gamification&#34;,&#34;resource&#34;,&#34;local&#34;,&#34;monetary&#34;,&#34;oxidative&#34;,&#34;scale&#34;,&#34;ncov&#34;,&#34;tract&#34;,&#34;ethics&#34;,&#34;fractal&#34;,&#34;complexity&#34;,&#34;covid-&#34;,&#34;studies&#34;,&#34;mellitus&#34;,&#34;divide&#34;,&#34;wallet&#34;,&#34;software&#34;,&#34;disinfectant&#34;,&#34;mosque&#34;,&#34;post-acute&#34;,&#34;graph&#34;,&#34;health-promoting&#34;,&#34;structures&#34;,&#34;cruise&#34;,&#34;haematology&#34;,&#34;t-cell&#34;,&#34;kidney&#34;,&#34;aedes&#34;,&#34;microbiome&#34;,&#34;aec&#34;,&#34;anatomy&#34;,&#34;framing&#34;,&#34;atrial&#34;,&#34;kit&#34;,&#34;absolute&#34;,&#34;shrinkage&#34;,&#34;lasso&#34;,&#34;ultrasound&#34;,&#34;fracture&#34;,&#34;expectancy&#34;,&#34;peptides&#34;,&#34;website&#34;,&#34;liver&#34;,&#34;corticosteroid&#34;,&#34;motility&#34;,&#34;space&#34;,&#34;referees&#34;,&#34;structure&#34;,&#34;publication&#34;,&#34;ensemble&#34;,&#34;commodities&#34;,&#34;sexual&#34;,&#34;sri&#34;,&#34;lanka&#34;,&#34;behaviours&#34;,&#34;outdoors&#34;,&#34;play&#34;,&#34;cytomegalovirus&#34;,&#34;ministry&#34;,&#34;private&#34;,&#34;fcv-19s&#34;,&#34;infodemiology&#34;,&#34;diploma&#34;,&#34;production&#34;,&#34;pls-sem&#34;,&#34;conventional&#34;,&#34;outpatient&#34;,&#34;entry&#34;,&#34;hajj&#34;,&#34;graphene&#34;,&#34;value-added&#34;,&#34;purchasing&#34;,&#34;cost&#34;,&#34;platform&#34;,&#34;epilepsy&#34;,&#34;turkey&#34;,&#34;japan&#34;,&#34;nanomaterials&#34;,&#34;fossil&#34;,&#34;fuel&#34;,&#34;peptide&#34;,&#34;bioinformatics&#34;,&#34;descriptive&#34;,&#34;child&#34;,&#34;thoracic&#34;,&#34;communications&#34;,&#34;reliability&#34;,&#34;validity&#34;,&#34;antigen&#34;,&#34;foreign&#34;,&#34;main&#34;,&#34;restrictions&#34;,&#34;repair&#34;,&#34;success&#34;,&#34;projects&#34;,&#3
4;coagulopathy&#34;,&#34;immunomodulatory&#34;,&#34;obstructive&#34;,&#34;immunomodulation&#34;,&#34;procedures&#34;,&#34;department&#34;,&#34;spacer&#34;,&#34;loneliness&#34;,&#34;bayesian&#34;,&#34;size&#34;,&#34;reactions&#34;,&#34;crowding&#34;,&#34;culture&#34;,&#34;disinfectants&#34;,&#34;effects&#34;,&#34;entrepreneurs&#34;,&#34;coastal&#34;,&#34;therapeutic&#34;,&#34;biomedical&#34;,&#34;balance&#34;,&#34;cross-cultural&#34;,&#34;empathy&#34;,&#34;individualism&#34;,&#34;power&#34;,&#34;multilevel&#34;,&#34;equation&#34;,&#34;paediatric&#34;,&#34;responses&#34;,&#34;equilibrium&#34;,&#34;adaptation&#34;,&#34;relations&#34;,&#34;electrochemical&#34;,&#34;region&#34;,&#34;seafood&#34;,&#34;prices&#34;,&#34;handwashing&#34;,&#34;planned&#34;,&#34;mixed&#34;,&#34;taste&#34;,&#34;road&#34;,&#34;transport&#34;,&#34;technical&#34;,&#34;video&#34;,&#34;geofencing&#34;,&#34;location&#34;,&#34;tracking&#34;,&#34;andrology&#34;,&#34;adult&#34;,&#34;capitalists&#34;,&#34;handling&#34;,&#34;mutual&#34;,&#34;assistance&#34;,&#34;advantages&#34;,&#34;eigentriples&#34;,&#34;length&#34;,&#34;fourth&#34;,&#34;adolescent&#34;,&#34;long-covid&#34;,&#34;phenoconversion&#34;,&#34;correlations&#34;,&#34;non-rational&#34;,&#34;self-isolation&#34;,&#34;synthetic&#34;,&#34;affect&#34;,&#34;bibliometric&#34;,&#34;optical&#34;,&#34;zoonosis&#34;,&#34;blended&#34;,&#34;cluster&#34;,&#34;nsp15&#34;,&#34;phase&#34;,&#34;prognostic&#34;,&#34;indicators&#34;,&#34;post&#34;,&#34;dentists&#34;,&#34;goal&#34;,&#34;promotion&#34;,&#34;set&#34;,&#34;guidance&#34;,&#34;nanomedicine&#34;,&#34;reinfection&#34;,&#34;sirs&#34;,&#34;qtc&#34;,&#34;prolongation&#34;,&#34;commitment&#34;,&#34;trends&#34;,&#34;recognition&#34;,&#34;measurement&#34;,&#34;firm&#34;,&#34;intelligent&#34;,&#34;marine&#34;,&#34;agricultural&#34;,&#34;regulation&#34;,&#34;anti-covid-19&#34;,&#34;traditional&#34;,&#34;e-government&#34;,&#34;reasoned&#34;,&#34;theories&#34;,&#34;b40&#34;,&#34;household&#34;,&#34;disposal&#34;,&#3
4;metabolic&#34;,&#34;swab&#34;,&#34;medicinal&#34;,&#34;plants&#34;,&#34;converting&#34;,&#34;turnover&#34;,&#34;lmics&#34;,&#34;quantitative&#34;,&#34;binding&#34;,&#34;comparative&#34;,&#34;flavonoid&#34;,&#34;electrocardiogram&#34;,&#34;prolonged&#34;,&#34;susceptibility&#34;,&#34;dining&#34;,&#34;experiencescape&#34;,&#34;female&#34;,&#34;travelers&#34;,&#34;compliance&#34;,&#34;wuhan&#34;,&#34;mpro&#34;,&#34;layer&#34;,&#34;myanmar&#34;,&#34;prioritisation&#34;,&#34;serological&#34;,&#34;lupus&#34;,&#34;erythematosus&#34;,&#34;harm&#34;,&#34;reduction&#34;,&#34;agonist&#34;,&#34;disorder&#34;,&#34;borneo&#34;,&#34;sociodemographic&#34;,&#34;wellness&#34;,&#34;migration&#34;,&#34;database&#34;,&#34;arm&#34;,&#34;arduino&#34;,&#34;nano&#34;,&#34;confinement&#34;,&#34;kap&#34;,&#34;asymptomatic&#34;,&#34;mann-kendall&#34;,&#34;rf&#34;,&#34;ssa&#34;,&#34;anthropogenic&#34;,&#34;aquatic&#34;,&#34;tuberculosis&#34;,&#34;peritraumatic&#34;,&#34;operation&#34;,&#34;blockchain&#34;,&#34;integrity&#34;,&#34;particle&#34;,&#34;swarm&#34;,&#34;domain&#34;,&#34;oropharyngeal&#34;,&#34;smell&#34;,&#34;capital&#34;,&#34;sensorineural&#34;,&#34;s-o-r&#34;,&#34;renal&#34;,&#34;failure&#34;,&#34;transfusion&#34;],&#34;freq&#34;:[225,191,189,179,175,166,119,114,114,103,94,83,77,74,72,70,68,66,65,65,63,63,61,58,58,57,57,56,56,56,55,55,54,54,53,52,50,49,49,49,49,49,48,47,47,46,46,45,45,45,45,45,44,44,44,43,43,43,42,42,41,41,41,41,38,38,37,37,37,37,37,36,36,36,35,35,35,34,34,34,34,34,33,33,33,33,33,33,33,32,32,32,32,32,32,32,32,31,31,31,31,31,31,31,31,31,30,30,30,29,29,29,29,29,29,29,28,28,27,27,27,27,27,27,27,26,26,26,26,26,26,26,26,26,26,25,25,25,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,23,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,21,21,21,21,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,17,17,17,17,17,17,17,17,17,17,17,17,17,
17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,16,16,16,16,16,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,14,14,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,12,12,12,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6],&#34;fontFamily&#34;:&#34;Segoe 
UI&#34;,&#34;fontWeight&#34;:&#34;bold&#34;,&#34;color&#34;:&#34;random-dark&#34;,&#34;minSize&#34;:0,&#34;weightFactor&#34;:0.8,&#34;backgroundColor&#34;:&#34;white&#34;,&#34;gridSize&#34;:0,&#34;minRotation&#34;:-0.785398163397448,&#34;maxRotation&#34;:0.785398163397448,&#34;shuffle&#34;:true,&#34;rotateRatio&#34;:0.4,&#34;shape&#34;:&#34;circle&#34;,&#34;ellipticity&#34;:0.65,&#34;figBase64&#34;:null,&#34;hover&#34;:null},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Top 1000 terms extracted from the author’s keywords
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$index_keywords %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-12&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-4&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-4&#34;&gt;{&#34;x&#34;:{&#34;word&#34;:[&#34;health&#34;,&#34;disease&#34;,&#34;coronavirus&#34;,&#34;care&#34;,&#34;virus&#34;,&#34;adult&#34;,&#34;drug&#34;,&#34;study&#34;,&#34;aged&#34;,&#34;pneumonia&#34;,&#34;infection&#34;,&#34;male&#34;,&#34;female&#34;,&#34;malaysia&#34;,&#34;betacoronavirus&#34;,&#34;covid-19&#34;,&#34;risk&#34;,&#34;pandemic&#34;,&#34;control&#34;,&#34;human&#34;,&#34;syndrome&#34;,&#34;respiratory&#34;,&#34;clinical&#34;,&#34;humans&#34;,&#34;analysis&#34;,&#34;social&#34;,&#34;middle&#34;,&#34;article&#34;,&#34;protein&#34;,&#34;sars-cov-2&#34;,&#34;learning&#34;,&#34;acute&#34;,&#34;pandemics&#34;,&#34;personnel&#34;,&#34;viral&#34;,&#34;cross-sectional&#34;,&#34;mental&#34;,&#34;transmission&#34;,&#34;agent&#34;,&#34;patient&#34;,&#34;severe&#34;,&#34;management&#34;,&#34;hospital&#34;,&#34;factor&#34;,&#34;system&#34;,&#34;infections&#34;,&#34;waste&#34;,&#34;blood&#34;,&#34;medical&#34;,&#34;anxiety&#34;,&#34;education&#34;,&#34;therapy&#34;,&#34;reaction&#34;,&#34;public&#34;,&#34;priority&#34;,&#34;journal&#34;,&#34;chain&#34;,&#34;mortality&#34;,&#34;review&#34;,&#34;stress&#34;,&#34;questionnaire&#34;,&#34;assessment&#34;,&#34;polymerase&#34;,&#34;prevention&#34;,&#34;studies&#34;,&#34;angiotensin&#34;,&#34;child&#34;,&#34;epidemiology&#34;,&#34;air&#34;,&#34;epidemic&#34;,&#34;time&#34;,&#34;adolescent&#34;,&#34;controlled&#34;,&#34;practice&#34;,&#34;enzyme&#34;,&#34;acid&#34;,&#34;procedures&#34;,&#34;cell&#34;,&#34;diagnosis&#34;,&#34;interleukin&#34;,&#34;tomography&#34;,&#34;quality&#34;,&#34;receptor&#34;,&#34;organization&#34;,&#34;severity&#34;,&#34;quarantine&#34;,&#34;lung&#34;,&#34;letter&#34;,&#34;china&#34;,&#34;depression&#34;,&#34;nonhuman&#34;,&#34;outcome&#34;,&#34;vaccine&#34;,&#34;surveys&#34;,&#34;asia&#34;,&#34;global&#34;,&#34;behavior&#34;,&#34;data&#34;,&#34;pollution&#34;,&#34;factors&#34;,&#34;research&#34;,&#34;environmental&#34;,&#34
;surgery&#34;,&#34;rate&#34;,&#34;topic&#34;,&#34;binding&#34;,&#34;virology&#34;,&#34;isolation&#34;,&#34;computer&#34;,&#34;communicable&#34;,&#34;attitude&#34;,&#34;major&#34;,&#34;diagnostic&#34;,&#34;information&#34;,&#34;psychology&#34;,&#34;disorder&#34;,&#34;hydroxychloroquine&#34;,&#34;decision&#34;,&#34;reverse&#34;,&#34;treatment&#34;,&#34;occupational&#34;,&#34;monitoring&#34;,&#34;psychological&#34;,&#34;scale&#34;,&#34;activity&#34;,&#34;economic&#34;,&#34;systems&#34;,&#34;safety&#34;,&#34;screening&#34;,&#34;equipment&#34;,&#34;protective&#34;,&#34;transcription&#34;,&#34;service&#34;,&#34;detection&#34;,&#34;complication&#34;,&#34;prevalence&#34;,&#34;media&#34;,&#34;united&#34;,&#34;cancer&#34;,&#34;policy&#34;,&#34;inhibitor&#34;,&#34;immunity&#34;,&#34;antiviral&#34;,&#34;food&#34;,&#34;pakistan&#34;,&#34;trial&#34;,&#34;questionnaires&#34;,&#34;diabetes&#34;,&#34;emergency&#34;,&#34;molecular&#34;,&#34;rna&#34;,&#34;testing&#34;,&#34;mellitus&#34;,&#34;computed&#34;,&#34;student&#34;,&#34;immune&#34;,&#34;population&#34;,&#34;comorbidity&#34;,&#34;cytokine&#34;,&#34;artificial&#34;,&#34;status&#34;,&#34;model&#34;,&#34;agents&#34;,&#34;neural&#34;,&#34;machine&#34;,&#34;fever&#34;,&#34;antibody&#34;,&#34;laboratory&#34;,&#34;sensitivity&#34;,&#34;viruses&#34;,&#34;vaccination&#34;,&#34;age&#34;,&#34;perception&#34;,&#34;distress&#34;,&#34;brain&#34;,&#34;x-ray&#34;,&#34;image&#34;,&#34;intensive&#34;,&#34;retrospective&#34;,&#34;test&#34;,&#34;hypertension&#34;,&#34;converting&#34;,&#34;diseases&#34;,&#34;government&#34;,&#34;world&#34;,&#34;spike&#34;,&#34;immunoglobulin&#34;,&#34;chronic&#34;,&#34;delivery&#34;,&#34;failure&#34;,&#34;techniques&#34;,&#34;mass&#34;,&#34;cost&#34;,&#34;survey&#34;,&#34;development&#34;,&#34;networks&#34;,&#34;hospitalization&#34;,&#34;guideline&#34;,&#34;ventilation&#34;,&#34;income&#34;,&#34;distancing&#34;,&#34;kidney&#34;,&#34;support&#34;,&#34;response&#34;,&#34;liver&#34;,&#34;internet&#34;,&#34;effect&#34;
,&#34;knowledge&#34;,&#34;impact&#34;,&#34;lopinavir&#34;,&#34;tract&#34;,&#34;real&#34;,&#34;interferon&#34;,&#34;ritonavir&#34;,&#34;azithromycin&#34;,&#34;radiography&#34;,&#34;personal&#34;,&#34;online&#34;,&#34;life&#34;,&#34;dipeptidyl&#34;,&#34;antivirus&#34;,&#34;students&#34;,&#34;asymptomatic&#34;,&#34;genetic&#34;,&#34;east&#34;,&#34;exposure&#34;,&#34;body&#34;,&#34;thrombosis&#34;,&#34;lymphocyte&#34;,&#34;hand&#34;,&#34;telemedicine&#34;,&#34;deep&#34;,&#34;specificity&#34;,&#34;purification&#34;,&#34;international&#34;,&#34;teaching&#34;,&#34;illness&#34;,&#34;elderly&#34;,&#34;level&#34;,&#34;sars&#34;,&#34;coughing&#34;,&#34;imaging&#34;,&#34;examination&#34;,&#34;sex&#34;,&#34;assisted&#34;,&#34;remdesivir&#34;,&#34;systematic&#34;,&#34;planning&#34;,&#34;bangladesh&#34;,&#34;university&#34;,&#34;sleep&#34;,&#34;cardiovascular&#34;,&#34;europe&#34;,&#34;contact&#34;,&#34;medicine&#34;,&#34;infectious&#34;,&#34;heart&#34;,&#34;gene&#34;,&#34;glycoprotein&#34;,&#34;metabolism&#34;,&#34;communication&#34;,&#34;models&#34;,&#34;tumor&#34;,&#34;africa&#34;,&#34;chloroquine&#34;,&#34;carboxypeptidase&#34;,&#34;thorax&#34;,&#34;indonesia&#34;,&#34;country&#34;,&#34;mutation&#34;,&#34;physical&#34;,&#34;immunodeficiency&#34;,&#34;unit&#34;,&#34;elective&#34;,&#34;feature&#34;,&#34;animal&#34;,&#34;influenza&#34;,&#34;insulin&#34;,&#34;energy&#34;,&#34;oxygen&#34;,&#34;industry&#34;,&#34;distance&#34;,&#34;incidence&#34;,&#34;injury&#34;,&#34;sequence&#34;,&#34;meta&#34;,&#34;extract&#34;,&#34;association&#34;,&#34;community&#34;,&#34;surgical&#34;,&#34;structure&#34;,&#34;vaccines&#34;,&#34;classification&#34;,&#34;forecasting&#34;,&#34;services&#34;,&#34;thromboembolism&#34;,&#34;report&#34;,&#34;corticosteroid&#34;,&#34;design&#34;,&#34;comparative&#34;,&#34;genetics&#34;,&#34;economics&#34;,&#34;death&#34;,&#34;antagonist&#34;,&#34;gastrointestinal&#34;,&#34;distribution&#34;,&#34;low&#34;,&#34;particulate&#34;,&#34;matter&#34;,&#34;index&#34;,&#34;inte
raction&#34;,&#34;unclassified&#34;,&#34;accuracy&#34;,&#34;venous&#34;,&#34;cohort&#34;,&#34;simulation&#34;,&#34;pressure&#34;,&#34;symptom&#34;,&#34;biological&#34;,&#34;technology&#34;,&#34;aerosol&#34;,&#34;note&#34;,&#34;physiology&#34;,&#34;coronaviruses&#34;,&#34;movement&#34;,&#34;network&#34;,&#34;statistical&#34;,&#34;heparin&#34;,&#34;hepatitis&#34;,&#34;environment&#34;,&#34;derivative&#34;,&#34;pregnancy&#34;,&#34;asthma&#34;,&#34;dynamics&#34;,&#34;physician&#34;,&#34;efficacy&#34;,&#34;lockdown&#34;,&#34;pathogenicity&#34;,&#34;weight&#34;,&#34;randomized&#34;,&#34;training&#34;,&#34;necrosis&#34;,&#34;home&#34;,&#34;method&#34;,&#34;dyspnea&#34;,&#34;immunology&#34;,&#34;genome&#34;,&#34;educational&#34;,&#34;sustainable&#34;,&#34;newborn&#34;,&#34;animals&#34;,&#34;plasma&#34;,&#34;release&#34;,&#34;antigen&#34;,&#34;change&#34;,&#34;facility&#34;,&#34;surveillance&#34;,&#34;obesity&#34;,&#34;mobile&#34;,&#34;intelligence&#34;,&#34;aspect&#34;,&#34;water&#34;,&#34;construction&#34;,&#34;function&#34;,&#34;access&#34;,&#34;singapore&#34;,&#34;industrial&#34;,&#34;adverse&#34;,&#34;replication&#34;,&#34;vitamin&#34;,&#34;algorithm&#34;,&#34;fear&#34;,&#34;cerebrovascular&#34;,&#34;plastic&#34;,&#34;washing&#34;,&#34;follow&#34;,&#34;swab&#34;,&#34;approach&#34;,&#34;expression&#34;,&#34;throat&#34;,&#34;multiple&#34;,&#34;disorders&#34;,&#34;process&#34;,&#34;methods&#34;,&#34;endoscopy&#34;,&#34;immunization&#34;,&#34;antibiotic&#34;,&#34;load&#34;,&#34;infant&#34;,&#34;nasopharynx&#34;,&#34;developing&#34;,&#34;count&#34;,&#34;satisfaction&#34;,&#34;kingdom&#34;,&#34;outbreaks&#34;,&#34;pathophysiology&#34;,&#34;coping&#34;,&#34;fatigue&#34;,&#34;nucleic&#34;,&#34;disinfection&#34;,&#34;pain&#34;,&#34;performance&#34;,&#34;prediction&#34;,&#34;compliance&#34;,&#34;awareness&#34;,&#34;angiotensin-converting&#34;,&#34;inhibitors&#34;,&#34;qualitative&#34;,&#34;critical&#34;,&#34;professional&#34;,&#34;tocilizumab&#34;,&#34;difference&#34;,&#34;reduct
ion&#34;,&#34;multicenter&#34;,&#34;anticoagulant&#34;,&#34;prognosis&#34;,&#34;geographic&#34;,&#34;engineering&#34;,&#34;north&#34;,&#34;diarrhea&#34;,&#34;algorithms&#34;,&#34;amino&#34;,&#34;innate&#34;,&#34;production&#34;,&#34;coronary&#34;,&#34;pharmacy&#34;,&#34;dioxide&#34;,&#34;dependent&#34;,&#34;phylogeny&#34;,&#34;e-learning&#34;,&#34;malaysian&#34;,&#34;reactive&#34;,&#34;socioeconomics&#34;,&#34;favipiravir&#34;,&#34;school&#34;,&#34;computing&#34;,&#34;activities&#34;,&#34;nucleocapsid&#34;,&#34;prospective&#34;,&#34;repositioning&#34;,&#34;cooperation&#34;,&#34;dna&#34;,&#34;preschool&#34;,&#34;processing&#34;,&#34;nasopharyngeal&#34;,&#34;amplification&#34;,&#34;consensus&#34;,&#34;nitrogen&#34;,&#34;variation&#34;,&#34;entry&#34;,&#34;south&#34;,&#34;job&#34;,&#34;carbon&#34;,&#34;transfusion&#34;,&#34;digital&#34;,&#34;adaptive&#34;,&#34;mathematical&#34;,&#34;nose&#34;,&#34;loss&#34;,&#34;procedure&#34;,&#34;effectiveness&#34;,&#34;culture&#34;,&#34;supply&#34;,&#34;proteins&#34;,&#34;travel&#34;,&#34;participation&#34;,&#34;tuberculosis&#34;,&#34;america&#34;,&#34;alanine&#34;,&#34;attitudes&#34;,&#34;chemistry&#34;,&#34;demography&#34;,&#34;fatality&#34;,&#34;dexamethasone&#34;,&#34;antiinflammatory&#34;,&#34;convolutional&#34;,&#34;hospitals&#34;,&#34;spread&#34;,&#34;technique&#34;,&#34;workforce&#34;,&#34;workplace&#34;,&#34;security&#34;,&#34;administration&#34;,&#34;clustering&#34;,&#34;aminotransferase&#34;,&#34;phase&#34;,&#34;resilience&#34;,&#34;tertiary&#34;,&#34;cross&#34;,&#34;based&#34;,&#34;quantitative&#34;,&#34;storm&#34;,&#34;chinese&#34;,&#34;measurement&#34;,&#34;handling&#34;,&#34;obstructive&#34;,&#34;reproduction&#34;,&#34;spatial&#34;,&#34;southeast&#34;,&#34;sector&#34;,&#34;correlation&#34;,&#34;local&#34;,&#34;pathology&#34;,&#34;financial&#34;,&#34;neoplasm&#34;,&#34;inflammation&#34;,&#34;nucleotide&#34;,&#34;italy&#34;,&#34;enhancement&#34;,&#34;sequencing&#34;,&#34;hemorrhage&#34;,&#34;intelligent&#34;,&#34;headac
he&#34;,&#34;pyrolysis&#34;,&#34;interview&#34;,&#34;protease&#34;,&#34;plant&#34;,&#34;survival&#34;,&#34;clotting&#34;,&#34;regression&#34;,&#34;anesthesia&#34;,&#34;preoperative&#34;,&#34;sustainability&#34;,&#34;poverty&#34;,&#34;temperature&#34;,&#34;artery&#34;,&#34;nurse&#34;,&#34;burden&#34;,&#34;hearing&#34;,&#34;literature&#34;,&#34;software&#34;,&#34;socioeconomic&#34;,&#34;applications&#34;,&#34;deficiency&#34;,&#34;ward&#34;,&#34;household&#34;,&#34;signal&#34;,&#34;glucose&#34;,&#34;zinc&#34;,&#34;neurosurgery&#34;,&#34;admission&#34;,&#34;patterns&#34;,&#34;event&#34;,&#34;exacerbation&#34;,&#34;morbidity&#34;,&#34;application&#34;,&#34;kinase&#34;,&#34;predictive&#34;,&#34;docking&#34;,&#34;site&#34;,&#34;biology&#34;,&#34;complement&#34;,&#34;proteinase&#34;,&#34;taiwan&#34;,&#34;cluster&#34;,&#34;employment&#34;,&#34;well-being&#34;,&#34;devices&#34;,&#34;gamma&#34;,&#34;hygiene&#34;,&#34;evaluation&#34;,&#34;type&#34;,&#34;worker&#34;,&#34;neoplasms&#34;,&#34;methodology&#34;,&#34;mechanism&#34;,&#34;assay&#34;,&#34;nursing&#34;,&#34;oil&#34;,&#34;utilization&#34;,&#34;scoring&#34;,&#34;effects&#34;,&#34;ribavirin&#34;,&#34;dimer&#34;,&#34;contamination&#34;,&#34;tourism&#34;,&#34;theory&#34;,&#34;epidemiological&#34;,&#34;transport&#34;,&#34;basic&#34;,&#34;disposal&#34;,&#34;specimen&#34;,&#34;stroke&#34;,&#34;chest&#34;,&#34;renin&#34;,&#34;family&#34;,&#34;vomiting&#34;,&#34;rating&#34;,&#34;host&#34;,&#34;tissue&#34;,&#34;length&#34;,&#34;stay&#34;,&#34;india&#34;,&#34;staff&#34;,&#34;videoconferencing&#34;,&#34;infarction&#34;,&#34;infertility&#34;,&#34;metformin&#34;,&#34;bilirubin&#34;,&#34;myalgia&#34;,&#34;resource&#34;,&#34;conditions&#34;,&#34;wellbeing&#34;,&#34;editorial&#34;,&#34;density&#34;,&#34;qt&#34;,&#34;resistance&#34;,&#34;longitudinal&#34;,&#34;smoking&#34;,&#34;sexual&#34;,&#34;sinus&#34;,&#34;turkey&#34;,&#34;malaria&#34;,&#34;fibrinolytic&#34;,&#34;thailand&#34;,&#34;antibodies&#34;,&#34;dose&#34;,&#34;intervention&#34;,
&#34;organ&#34;,&#34;frailty&#34;,&#34;fuzzy&#34;,&#34;vector&#34;,&#34;computerized&#34;,&#34;aldosterone&#34;,&#34;extraction&#34;,&#34;shedding&#34;,&#34;coronavirinae&#34;,&#34;radiation&#34;,&#34;intake&#34;,&#34;validity&#34;,&#34;intubation&#34;,&#34;climate&#34;,&#34;experiment&#34;,&#34;hyperglycemia&#34;,&#34;workload&#34;,&#34;crowding&#34;,&#34;mixed&#34;,&#34;reductase&#34;,&#34;islam&#34;,&#34;exercise&#34;,&#34;oropharynx&#34;,&#34;rural&#34;,&#34;postoperative&#34;,&#34;accident&#34;,&#34;spatiotemporal&#34;,&#34;primary&#34;,&#34;single&#34;,&#34;observational&#34;,&#34;nausea&#34;,&#34;nonstructural&#34;,&#34;domestic&#34;,&#34;pacific&#34;,&#34;monoclonal&#34;,&#34;japan&#34;,&#34;passive&#34;,&#34;republic&#34;,&#34;adenosine&#34;,&#34;hemorrhagic&#34;,&#34;immunosuppressive&#34;,&#34;ambulatory&#34;,&#34;healthcare&#34;,&#34;acceptance&#34;,&#34;short&#34;,&#34;emotional&#34;,&#34;gender&#34;,&#34;interactions&#34;,&#34;immunomodulation&#34;,&#34;trend&#34;,&#34;macrophage&#34;,&#34;modeling&#34;,&#34;center&#34;,&#34;asian&#34;,&#34;smear&#34;,&#34;critically&#34;,&#34;ill&#34;,&#34;strategy&#34;,&#34;standard&#34;,&#34;outpatient&#34;,&#34;real-time&#34;,&#34;anosmia&#34;,&#34;rhinorrhea&#34;,&#34;physicians&#34;,&#34;guidelines&#34;,&#34;tests&#34;,&#34;behavioral&#34;,&#34;cerebral&#34;,&#34;remote&#34;,&#34;functional&#34;,&#34;brazil&#34;,&#34;herd&#34;,&#34;embolism&#34;,&#34;iran&#34;,&#34;mining&#34;,&#34;sperm&#34;,&#34;marketing&#34;,&#34;epilepsy&#34;,&#34;immunoassay&#34;,&#34;biomarkers&#34;,&#34;promotion&#34;,&#34;emotion&#34;,&#34;period&#34;,&#34;ratio&#34;,&#34;testis&#34;,&#34;fractional&#34;,&#34;spain&#34;,&#34;antihypertensive&#34;,&#34;bioinformatics&#34;,&#34;carrier&#34;,&#34;1beta&#34;,&#34;physiological&#34;,&#34;immunotherapy&#34;,&#34;vein&#34;,&#34;west&#34;,&#34;protection&#34;,&#34;score&#34;,&#34;methotrexate&#34;,&#34;predisposition&#34;,&#34;atmospheric&#34;,&#34;consumption&#34;,&#34;secondary&#34;,&#34;evide
nce&#34;,&#34;serine&#34;,&#34;muscle&#34;,&#34;sore&#34;,&#34;oseltamivir&#34;,&#34;alpha&#34;,&#34;beta&#34;,&#34;lactate&#34;,&#34;dehydrogenase&#34;,&#34;arterial&#34;,&#34;spectrometry&#34;,&#34;glycemic&#34;,&#34;burnout&#34;,&#34;operating&#34;,&#34;adrenal&#34;,&#34;hemoglobin&#34;,&#34;dissemination&#34;,&#34;spectroscopy&#34;,&#34;taste&#34;,&#34;aortic&#34;,&#34;selection&#34;,&#34;nervous&#34;,&#34;dizziness&#34;,&#34;differential&#34;,&#34;science&#34;,&#34;western&#34;,&#34;concept&#34;,&#34;intention&#34;,&#34;database&#34;,&#34;activation&#34;,&#34;structural&#34;,&#34;reproducibility&#34;,&#34;ncov&#34;,&#34;laryngoscopy&#34;,&#34;nanoparticle&#34;,&#34;bayes&#34;,&#34;theorem&#34;,&#34;rheumatic&#34;,&#34;nanomedicine&#34;,&#34;competence&#34;,&#34;frail&#34;,&#34;violence&#34;,&#34;ethnic&#34;,&#34;concentration&#34;,&#34;preventive&#34;,&#34;referral&#34;,&#34;fatty&#34;,&#34;lavage&#34;,&#34;transfer&#34;,&#34;lifestyle&#34;,&#34;philippines&#34;,&#34;sneezing&#34;,&#34;antimalarial&#34;,&#34;pollutant&#34;,&#34;motivation&#34;,&#34;urban&#34;,&#34;wuhan&#34;,&#34;pharmacist&#34;,&#34;fluid&#34;,&#34;bulgaria&#34;,&#34;deafness&#34;,&#34;valve&#34;,&#34;publication&#34;,&#34;universities&#34;,&#34;infectivity&#34;,&#34;linked&#34;,&#34;immunosorbent&#34;,&#34;search&#34;,&#34;measures&#34;,&#34;adaptation&#34;,&#34;structured&#34;,&#34;trials&#34;,&#34;size&#34;,&#34;product&#34;,&#34;experience&#34;,&#34;inventory&#34;,&#34;marker&#34;,&#34;methylprednisolone&#34;,&#34;religion&#34;,&#34;azathioprine&#34;,&#34;regulation&#34;,&#34;inflammatory&#34;,&#34;virtual&#34;,&#34;consultation&#34;,&#34;medium&#34;,&#34;patient-to-professional&#34;,&#34;medication&#34;,&#34;recycling&#34;,&#34;square&#34;,&#34;palliative&#34;,&#34;maternal&#34;,&#34;essential&#34;,&#34;ischemia&#34;,&#34;lupus&#34;,&#34;combination&#34;,&#34;esophagus&#34;,&#34;workflow&#34;,&#34;urology&#34;,&#34;oral&#34;,&#34;language&#34;,&#34;line&#34;,&#34;cycle&#34;,&#34;emission&
#34;,&#34;disaster&#34;,&#34;department&#34;,&#34;series&#34;,&#34;history&#34;,&#34;natural&#34;,&#34;numerical&#34;,&#34;gas&#34;,&#34;particle&#34;,&#34;australia&#34;,&#34;transduction&#34;,&#34;feeding&#34;,&#34;oxidative&#34;,&#34;immunocompromised&#34;,&#34;falciparum&#34;,&#34;electrochemical&#34;,&#34;degradation&#34;,&#34;incineration&#34;,&#34;morbid&#34;,&#34;vertigo&#34;,&#34;abuse&#34;,&#34;endoscopic&#34;,&#34;solid&#34;,&#34;hiv&#34;,&#34;chains&#34;,&#34;vulnerable&#34;,&#34;error&#34;,&#34;uncertainty&#34;,&#34;viet&#34;,&#34;nam&#34;,&#34;sulfur&#34;,&#34;convolution&#34;,&#34;recombinant&#34;,&#34;parameters&#34;,&#34;histogram&#34;,&#34;organizational&#34;,&#34;program&#34;,&#34;saudi&#34;,&#34;arabia&#34;,&#34;commerce&#34;,&#34;sampling&#34;,&#34;occupation&#34;,&#34;glucocorticoid&#34;,&#34;risks&#34;,&#34;iot&#34;,&#34;teleconsultation&#34;,&#34;serology&#34;,&#34;tools&#34;,&#34;sites&#34;,&#34;growth&#34;,&#34;capacity&#34;,&#34;peptide&#34;,&#34;bronchoscopy&#34;,&#34;results&#34;,&#34;endotracheal&#34;,&#34;availability&#34;,&#34;shortage&#34;,&#34;mask&#34;,&#34;countries&#34;,&#34;electronic&#34;,&#34;placebo&#34;,&#34;pathogenesis&#34;,&#34;renin-angiotensin&#34;,&#34;platforms&#34;,&#34;affinity&#34;,&#34;newcastle-ottawa&#34;,&#34;crisis&#34;,&#34;chemical&#34;,&#34;validation&#34;,&#34;stigma&#34;,&#34;bleeding&#34;,&#34;conceptual&#34;,&#34;pathogen&#34;,&#34;computational&#34;,&#34;shop&#34;,&#34;peptidyl-dipeptidase&#34;,&#34;malignant&#34;,&#34;anticoagulants&#34;,&#34;pharmaceutical&#34;,&#34;sanitizer&#34;,&#34;municipal&#34;,&#34;radiology&#34;,&#34;architecture&#34;,&#34;nepal&#34;,&#34;erythematosus&#34;,&#34;intestine&#34;,&#34;toxicity&#34;,&#34;microbiology&#34;,&#34;tenofovir&#34;,&#34;linear&#34;,&#34;ireland&#34;,&#34;genital&#34;,&#34;yemen&#34;,&#34;immunologic&#34;,&#34;nigeria&#34;,&#34;folic&#34;,&#34;private&#34;,&#34;biosensing&#34;,&#34;korea&#34;,&#34;iv&#34;,&#34;dental&#34;,&#34;obstruction&#34;,&#34;monox
ide&#34;,&#34;politics&#34;,&#34;humoral&#34;,&#34;systemic&#34;,&#34;dietary&#34;,&#34;germany&#34;,&#34;toll&#34;,&#34;business&#34;,&#34;cardiac&#34;,&#34;sample&#34;,&#34;dengue&#34;,&#34;transplantation&#34;,&#34;creatinine&#34;,&#34;entropy&#34;,&#34;ii&#34;,&#34;impedance&#34;,&#34;mapping&#34;,&#34;power&#34;,&#34;fracture&#34;,&#34;people&#34;,&#34;thoracic&#34;,&#34;partial&#34;,&#34;sensing&#34;,&#34;framework&#34;,&#34;cognitive&#34;,&#34;insurance&#34;,&#34;person&#34;,&#34;psychosocial&#34;,&#34;logistic&#34;,&#34;cultural&#34;,&#34;inhibition&#34;,&#34;nearest&#34;,&#34;myanmar&#34;,&#34;membrane&#34;,&#34;strategic&#34;,&#34;condition&#34;,&#34;content&#34;,&#34;traffic&#34;,&#34;virulence&#34;,&#34;diet&#34;,&#34;cd4&#34;,&#34;tachycardia&#34;,&#34;ferritin&#34;,&#34;respiration&#34;,&#34;comparison&#34;,&#34;storage&#34;,&#34;oxygenation&#34;,&#34;cd8&#34;,&#34;networking&#34;,&#34;market&#34;,&#34;automatic&#34;,&#34;current&#34;,&#34;coefficient&#34;,&#34;ethnicity&#34;,&#34;transmissions&#34;,&#34;mitigation&#34;,&#34;interpersonal&#34;,&#34;misinformation&#34;,&#34;oxides&#34;,&#34;emissions&#34;,&#34;pulmonary&#34;,&#34;baricitinib&#34;,&#34;convalescent&#34;,&#34;erythrocyte&#34;,&#34;metabolic&#34;,&#34;role&#34;,&#34;protocol&#34;,&#34;sectors&#34;,&#34;antiinfective&#34;,&#34;umifenovir&#34;,&#34;resources&#34;,&#34;monocyte&#34;,&#34;phenomena&#34;,&#34;southeastern&#34;,&#34;society&#34;,&#34;shock&#34;],&#34;freq&#34;:[674,530,524,418,377,371,363,362,359,349,335,318,311,310,309,305,295,290,290,287,273,264,262,260,255,250,246,242,237,236,228,223,220,219,218,217,209,201,197,197,194,189,182,181,177,176,174,170,165,164,162,160,159,159,156,156,156,155,155,155,154,153,153,152,147,146,145,144,141,139,137,137,136,136,135,134,133,133,131,130,129,127,127,126,125,124,123,123,123,123,121,120,120,118,115,113,113,112,111,110,109,109,109,108,108,108,108,106,106,106,105,105,105,105,105,104,102,102,102,102,101,100,100,100,100,100,99,98,98,97,97,97,97,96
,96,95,95,95,94,94,94,92,92,92,92,91,91,91,90,89,89,88,88,88,88,88,88,87,86,86,86,86,86,86,85,85,85,84,84,82,82,82,82,81,81,80,80,80,80,79,79,78,78,78,77,77,77,77,77,76,76,76,76,76,75,75,74,73,73,72,71,71,71,70,70,70,70,69,68,68,68,68,67,67,67,67,67,66,66,65,65,65,65,65,65,65,65,65,64,64,64,64,64,63,63,63,63,62,62,62,62,62,62,62,62,62,61,61,61,61,61,61,61,61,61,61,61,61,60,60,59,59,59,59,58,57,57,57,57,57,56,56,56,56,56,56,56,55,55,54,54,54,54,54,54,54,54,53,53,53,53,53,52,52,52,52,52,52,51,51,51,51,51,50,50,50,50,50,50,50,50,49,49,49,49,49,49,49,49,49,48,48,48,48,48,48,48,48,48,48,47,47,47,47,47,47,47,47,46,46,46,45,45,45,45,45,45,45,45,45,45,45,45,45,45,44,44,44,44,44,44,44,44,44,44,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,42,42,42,42,42,42,42,42,42,42,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,40,40,40,40,40,40,40,40,40,39,39,39,39,39,39,39,39,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,37,37,37,37,37,37,37,37,36,36,36,36,36,36,36,36,36,36,36,36,36,36,35,35,35,35,35,35,35,35,35,35,35,35,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,33,33,33,33,33,33,33,33,33,33,33,33,33,33,33,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,30,30,30,30,30,30,30,30,30,30,30,30,30,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,26,26,26,26,26,26,26,26,26,26,26,26,26,26,25,25,25,25,25,25,25,25,25,25,25,25,25,25,25,25,25,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,21,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,2
0,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17],&#34;fontFamily&#34;:&#34;Segoe UI&#34;,&#34;fontWeight&#34;:&#34;bold&#34;,&#34;color&#34;:&#34;random-dark&#34;,&#34;minSize&#34;:0,&#34;weightFactor&#34;:0.267062314540059,&#34;backgroundColor&#34;:&#34;white&#34;,&#34;gridSize&#34;:0,&#34;minRotation&#34;:-0.785398163397448,&#34;maxRotation&#34;:0.785398163397448,&#34;shuffle&#34;:true,&#34;rotateRatio&#34;:0.4,&#34;shape&#34;:&#34;circle&#34;,&#34;ellipticity&#34;:0.65,&#34;figBase64&#34;:null,&#34;hover&#34;:null},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Top 1000 terms extracted from the Scopus’s keywords
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There are some weird symbols in the plot and the wordcloud; it would be better to remove them. However, I am too lazy to do so, so I will leave them as they are 😃.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;These are some of the exploratory text analyses that can be done. The relevant terms may provide some insight into the current COVID-19 research in Malaysia. However, they by no means fully reflect our current COVID-19 research.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Hyperparameter tuning in tidymodels</title>
      <link>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</link>
      <pubDate>Sun, 05 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;This post will not go into much detail on each approach to hyperparameter tuning. It mainly aims to summarize a few things that I studied over the last couple of days.
Generally, there are two approaches to hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Grid search:&lt;br /&gt;
– Regular grid search&lt;br /&gt;
– Random grid search&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Iterative search:&lt;br /&gt;
– Bayesian optimization&lt;br /&gt;
– Simulated annealing&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;grid-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Grid search&lt;/h2&gt;
&lt;p&gt;In grid search, we provide a set of parameter combinations and the algorithm evaluates them. There are two types of grid search:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regular grid search&lt;br /&gt;
– The algorithm will go through every combination of parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_regular(mtry(c(1, 13)), 
             trees(), 
             min_n(),
             levels = 3) # how many from each parameter&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 27 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     1     1     2
##  2     7     1     2
##  3    13     1     2
##  4     1  1000     2
##  5     7  1000     2
##  6    13  1000     2
##  7     1  2000     2
##  8     7  2000     2
##  9    13  2000     2
## 10     1     1    21
## # ... with 17 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Random grid search&lt;br /&gt;
– The algorithm will randomly select a number of parameter combinations instead of going through each of them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_random(mtry(c(1, 13)),
            trees(), 
            min_n(), 
            size = 100) # size of parameters combination&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 100 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     5  1216    40
##  2     8  1374    13
##  3     9   859    39
##  4     6   282    12
##  5     2  1210     9
##  6     8  1828    39
##  7    11   550    14
##  8    13  1157    32
##  9     5   282     6
## 10    10  1018    28
## # ... with 90 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, tidymodels uses a space-filling design to make sure the parameter combinations are roughly equidistant from each other.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;iterative-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Iterative search&lt;/h2&gt;
&lt;p&gt;In iterative search, we need to specify some initial parameters/values to start the search.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bayesian optimization&lt;br /&gt;
– This algorithm will search for the next best parameter combination based on the previous combinations (the prior).&lt;/li&gt;
&lt;li&gt;Simulated annealing&lt;br /&gt;
– Generally, this algorithm works similarly to Bayesian optimization.&lt;br /&gt;
– However, as the figure below illustrates, this algorithm is able to temporarily accept worse parameter combinations (escaping the barrier of a local search) in order to find the best combination (the global minimum).
&lt;img src=&#34;images/sim-anneal.png&#34; alt=&#34;Simulated annealing&#34; /&gt;&lt;/li&gt;
&lt;/ul&gt;
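&lt;p&gt;To make the idea concrete, here is a toy sketch of the acceptance rule behind simulated annealing (the numbers are made up and this is not tidymodels code): a worse result can still be accepted with a probability that shrinks as the search “cools down”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration of the simulated annealing acceptance rule
# (hypothetical numbers, not tidymodels internals)
current_loss &amp;lt;- 0.30  # loss of the current parameter combination
new_loss     &amp;lt;- 0.35  # a worse candidate combination
temperature  &amp;lt;- 0.10  # decreases over iterations (&amp;quot;cooling&amp;quot;)

# Probability of accepting the worse candidate
accept_prob &amp;lt;- exp(-(new_loss - current_loss) / temperature)
accept_prob # about 0.61 here; as the temperature drops, this shrinks toward 0&lt;/code&gt;&lt;/pre&gt;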
&lt;p&gt;Further details on iterative search and the two methods above can be found &lt;a href=&#34;https://www.tmwr.org/iterative-search.html#iterative-search&#34;&gt;here&lt;/a&gt;. Since both iterative methods need starting parameters, we can combine them with either of the grid search methods.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-methods&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Other methods&lt;/h2&gt;
&lt;p&gt;By default, if we do not supply any parameter combinations, tidymodels will randomly pick 10 combinations from the model’s default range of values. Additionally, we can set this value to something else as shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_grid(
  resamples = dat_cv, # cross validation data set
  grid = 20,  # 20 combinations of parameters
  control = control, # some control parameters
  metrics = metrics # some metrics parameters (roc_auc, etc)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are also two special cases of grid search: &lt;code&gt;tune_race_anova()&lt;/code&gt; and &lt;code&gt;tune_race_win_loss()&lt;/code&gt;. Both of these methods are supposed to be more efficient versions of grid search. In general, both evaluate the tuning parameters on a small initial set of resamples, and the parameter combinations with the worst performance are eliminated, which makes the grid search more efficient. The main difference between the two methods is how the worst parameter combinations are evaluated and eliminated.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-codes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R codes&lt;/h2&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(tidymodels)
library(finetune)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will only use a small chunk of the data for ease of computation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Data
data(income, package = &amp;quot;kernlab&amp;quot;)

# Make data smaller for computation
set.seed(2021)
income2 &amp;lt;- 
  income %&amp;gt;% 
  filter(INCOME == &amp;quot;[75.000-&amp;quot; | INCOME == &amp;quot;[50.000-75.000)&amp;quot;) %&amp;gt;% 
  slice_sample(n = 600) %&amp;gt;% 
  mutate(INCOME = fct_drop(INCOME), 
         INCOME = fct_recode(INCOME, 
                             rich = &amp;quot;[75.000-&amp;quot;,
                             less_rich = &amp;quot;[50.000-75.000)&amp;quot;), 
         INCOME = factor(INCOME, ordered = F)) %&amp;gt;% 
  mutate(across(-INCOME, fct_drop))

# Summary of data
glimpse(income2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 600
## Columns: 14
## $ INCOME         &amp;lt;fct&amp;gt; less_rich, rich, rich, rich, less_rich, rich, rich, les~
## $ SEX            &amp;lt;fct&amp;gt; F, M, F, M, F, F, F, M, F, M, M, M, F, F, F, F, M, M, M~
## $ MARITAL.STATUS &amp;lt;fct&amp;gt; Married, Married, Married, Single, Single, NA, Married,~
## $ AGE            &amp;lt;ord&amp;gt; 35-44, 25-34, 45-54, 18-24, 18-24, 14-17, 25-34, 25-34,~
## $ EDUCATION      &amp;lt;ord&amp;gt; 1 to 3 years of college, Grad Study, College graduate, ~
## $ OCCUPATION     &amp;lt;fct&amp;gt; &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;~
## $ AREA           &amp;lt;ord&amp;gt; 10+ years, 7-10 years, 10+ years, -1 year, 4-6 years, 7~
## $ DUAL.INCOMES   &amp;lt;fct&amp;gt; Yes, Yes, Yes, Not Married, Not Married, Not Married, N~
## $ HOUSEHOLD.SIZE &amp;lt;ord&amp;gt; Five, Two, Four, Two, Four, Two, Three, Two, Five, One,~
## $ UNDER18        &amp;lt;ord&amp;gt; Three, None, None, None, None, None, One, None, Three, ~
## $ HOUSEHOLDER    &amp;lt;fct&amp;gt; Own, Own, Own, Rent, Family, Own, Own, Rent, Own, Own, ~
## $ HOME.TYPE      &amp;lt;fct&amp;gt; House, House, House, House, House, Apartment, House, Ho~
## $ ETHNIC.CLASS   &amp;lt;fct&amp;gt; White, White, White, White, White, White, White, White,~
## $ LANGUAGE       &amp;lt;fct&amp;gt; English, English, English, English, English, NA, Englis~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Outcome variable
table(income2$INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## less_rich      rich 
##       362       238&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Missing data
DataExplorer::plot_missing(income)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Split the data and create a 10-fold cross-validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
dat_index &amp;lt;- initial_split(income2, strata = INCOME)
dat_train &amp;lt;- training(dat_index)
dat_test &amp;lt;- testing(dat_index)

## CV
set.seed(2021)
dat_cv &amp;lt;- vfold_cv(dat_train, v = 10, repeats = 1, strata = INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to impute the NAs with the mode since all the variables are categorical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
dat_rec &amp;lt;- 
  recipe(INCOME ~ ., data = dat_train) %&amp;gt;% 
  step_impute_mode(all_predictors()) %&amp;gt;% 
  step_ordinalscore(AGE, EDUCATION, AREA, HOUSEHOLD.SIZE, UNDER18)

# Model
rf_mod &amp;lt;- 
  rand_forest(mtry = tune(),
              trees = tune(),
              min_n = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;)

# Workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(dat_rec) %&amp;gt;% 
  add_model(rf_mod)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parameters for grid search&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Regular grid
reg_grid &amp;lt;- grid_regular(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         levels = 3)

# Random grid
rand_grid &amp;lt;- grid_random(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         size = 100)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune the models using regular grid search. We are going to use the &lt;code&gt;doParallel&lt;/code&gt; package for parallel processing.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ctrl &amp;lt;- control_grid(save_pred = T,
                        extract = extract_model)
measure &amp;lt;- metric_set(roc_auc)  

# Parallel for regular grid
library(doParallel)

# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_regular &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = reg_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for regular grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     7  1000    21 roc_auc binary     0.690    10  0.0148 Preprocessor1_Model14
## 2     7  1000    40 roc_auc binary     0.689    10  0.0179 Preprocessor1_Model23
## 3     7  2000    40 roc_auc binary     0.689    10  0.0178 Preprocessor1_Model26
## 4     7  1000     2 roc_auc binary     0.688    10  0.0173 Preprocessor1_Model05
## 5     7  2000    21 roc_auc binary     0.688    10  0.0159 Preprocessor1_Model17&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune models using random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for random grid
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_random &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for random grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_Model0~
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_Model0~
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_Model0~
## 4     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_Model0~
## 5     6   160    31 roc_auc binary     0.692    10  0.0172 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Random grid search has a slightly better result. Let’s use this random search result as a base for iterative search. First, we limit the parameters based on the plot from the random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_param &amp;lt;- 
  rf_wf %&amp;gt;% 
  parameters() %&amp;gt;% 
  update(mtry = mtry(c(5, 13)), 
         trees = trees(c(1, 500)), 
         min_n = min_n(c(5, 30)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we do Bayesian optimization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for bayesian optimization
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
bayes_tune &amp;lt;-  
  rf_wf %&amp;gt;% 
  tune_bayes(    
    resamples = dat_cv,
    param_info = rf_param,
    iter = 60,
    initial = tune_random, # result from random grid search        
    control = control_bayes(no_improve = 30, verbose = T, save_pred = T), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for Bayesian optimization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(bayes_tune, &amp;quot;performance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(bayes_tune)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 10
##    mtry trees min_n .metric .estimator  mean     n std_err .config         .iter
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;int&amp;gt;
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_~     0
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_~     0
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_~     0
## 4     6   189    15 roc_auc binary     0.693    10  0.0153 Iter1               1
## 5     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_~     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get a slightly better result from Bayesian optimization. I will not do a simulated annealing approach since I got an error, though I am not sure why.&lt;/p&gt;
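&lt;p&gt;For reference, the call follows the same pattern as &lt;code&gt;tune_bayes()&lt;/code&gt;; the following is an untested sketch using &lt;code&gt;tune_sim_anneal()&lt;/code&gt; and &lt;code&gt;control_sim_anneal()&lt;/code&gt; from the finetune package, mirroring the arguments used earlier:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Untested sketch of simulated annealing tuning with {finetune}
set.seed(2021)
sa_tune &amp;lt;-
  rf_wf %&amp;gt;%
  tune_sim_anneal(
    resamples = dat_cv,
    param_info = rf_param,
    iter = 60,
    initial = tune_random, # result from random grid search
    control = control_sim_anneal(no_improve = 30, verbose = T, save_pred = T),
    metrics = measure)&lt;/code&gt;&lt;/pre&gt;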
&lt;p&gt;Lastly, we do a race ANOVA.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for race anova
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_efficient &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_race_anova(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = control_race(verbose_elim = T, save_pred = T), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get a result relatively similar to random grid search, but with faster computation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     5  1425     5 roc_auc binary     0.695    10  0.0161 Preprocessor1_Model0~
## 2    11   406     2 roc_auc binary     0.694    10  0.0183 Preprocessor1_Model0~
## 3     6   631     3 roc_auc binary     0.692    10  0.0171 Preprocessor1_Model0~
## 4     7  1264     4 roc_auc binary     0.692    10  0.0159 Preprocessor1_Model0~
## 5     9  1264     3 roc_auc binary     0.692    10  0.0188 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
We can also compare the ROCs of all approaches. All approaches look more or less similar.
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# regular grid
rf_reg &amp;lt;- 
  tune_regular %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

reg_auc &amp;lt;- 
  tune_regular %&amp;gt;% 
  collect_predictions(parameters = rf_reg) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;regular_grid&amp;quot;)

# random grid
rf_rand &amp;lt;- 
  tune_random %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

rand_auc &amp;lt;- 
  tune_random %&amp;gt;% 
  collect_predictions(parameters = rf_rand) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;random_grid&amp;quot;)

# bayes
rf_bayes &amp;lt;- 
  bayes_tune %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

bayes_auc &amp;lt;- 
  bayes_tune %&amp;gt;% 
  collect_predictions(parameters = rf_bayes) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;bayes&amp;quot;)

# race_anova
rf_eff &amp;lt;- 
  tune_efficient %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

eff_auc &amp;lt;- 
  tune_efficient %&amp;gt;% 
  collect_predictions(parameters = rf_eff) %&amp;gt;%
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;race_anova&amp;quot;)

# Compare ROC between all tuning approach
bind_rows(reg_auc, rand_auc, bayes_auc, eff_auc) %&amp;gt;% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = &amp;quot;plasma&amp;quot;, end = .6) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we fit our best model (from Bayesian optimization) to the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize workflow
best_rf &amp;lt;-
  select_best(bayes_tune, &amp;quot;roc_auc&amp;quot;)

final_wf &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_rf)
final_wf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_impute_mode()
## * step_ordinalscore()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 4
##   trees = 1016
##   min_n = 4
## 
## Computational engine: ranger&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Last fit
test_fit &amp;lt;- 
  final_wf %&amp;gt;%
  last_fit(dat_index) 

# Evaluation metrics 
test_fit %&amp;gt;%
  collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary         0.583 Preprocessor1_Model1
## 2 roc_auc  binary         0.611 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_fit %&amp;gt;%
  collect_predictions() %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-22-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The result is not that good; our AUC is quite low. However, we used only about 8% of the overall data. Nonetheless, the aim of this post was to give an overview of hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;p&gt;Additionally, there are two other functions to construct parameter grids that I did not cover in this post: &lt;code&gt;grid_max_entropy()&lt;/code&gt; and &lt;code&gt;grid_latin_hypercube()&lt;/code&gt;. There are not many resources explaining these functions (or at least I did not find any); however, for those interested, a good start is the tidymodels &lt;a href=&#34;https://dials.tidymodels.org/reference/grid_max_entropy.html&#34;&gt;website&lt;/a&gt;.&lt;/p&gt;
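&lt;p&gt;As a starting point, both functions can be called in the same way as the grid functions used earlier (a quick sketch; the &lt;code&gt;size&lt;/code&gt; value here is arbitrary):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Both return a tibble of parameter combinations, like grid_random()
grid_max_entropy(mtry(c(1, 13)), trees(), min_n(), size = 20)
grid_latin_hypercube(mtry(c(1, 13)), trees(), min_n(), size = 20)&lt;/code&gt;&lt;/pre&gt;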
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/grid-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/grid-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/iterative-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/iterative-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://oliviergimenez.github.io/learning-machine-learning/#&#34; class=&#34;uri&#34;&gt;https://oliviergimenez.github.io/learning-machine-learning/#&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Data exploration in R</title>
      <link>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</link>
      <pubDate>Sun, 22 Aug 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;These are some of the packages that I find useful for data exploration. Basically, this post serves more as a note for my future reference. I will list packages (and some awesome functions from each package) rather than specific functions. Further, base R and the tidyverse packages will not be specifically included in this list.&lt;/p&gt;
&lt;p&gt;Load supporting packages&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data we are going to use is from the dlookr package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 299
## Columns: 13
## $ age               &amp;lt;int&amp;gt; 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 62, 45, ~
## $ anaemia           &amp;lt;fct&amp;gt; No, No, No, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, N~
## $ cpk_enzyme        &amp;lt;dbl&amp;gt; 582, 7861, 146, 111, 160, 47, 246, 315, 157, 123, 81~
## $ diabetes          &amp;lt;fct&amp;gt; No, No, No, No, Yes, No, No, Yes, No, No, No, No, No~
## $ ejection_fraction &amp;lt;dbl&amp;gt; 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 25, 30, ~
## $ hblood_pressure   &amp;lt;fct&amp;gt; Yes, No, No, No, No, Yes, No, No, No, Yes, Yes, Yes,~
## $ platelets         &amp;lt;dbl&amp;gt; 265000, 263358, 162000, 210000, 327000, 204000, 1270~
## $ creatinine        &amp;lt;dbl&amp;gt; 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.10, 1.50~
## $ sodium            &amp;lt;dbl&amp;gt; 130, 136, 129, 137, 116, 132, 137, 131, 138, 133, 13~
## $ sex               &amp;lt;fct&amp;gt; Male, Male, Male, Male, Female, Male, Male, Male, Fe~
## $ smoking           &amp;lt;fct&amp;gt; No, No, Yes, No, No, Yes, No, Yes, No, Yes, Yes, Yes~
## $ time              &amp;lt;int&amp;gt; 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11, 11, 12~
## $ death_event       &amp;lt;fct&amp;gt; Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will create a few NAs in our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
heartfailure[sample(seq(nrow(heartfailure)), 20), &amp;quot;age&amp;quot;] &amp;lt;- NA
heartfailure[sample(seq(nrow(heartfailure)), 10), &amp;quot;sex&amp;quot;] &amp;lt;- NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;1) dataMaid&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dataMaid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One of the most useful functions in dataMaid is &lt;code&gt;makeDataReport()&lt;/code&gt;, which generates a report on the data. By default it produces a PDF, but other output options such as Word and HTML are also available.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;makeDataReport(heartfailure, replace = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is an example of the output in &lt;a href=&#34;https://tengkuhanis.netlify.app/files/dataMaid_heartfailure.pdf&#34;&gt;pdf&lt;/a&gt;.&lt;/p&gt;
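&lt;p&gt;For instance, to request an HTML report instead, we can use the &lt;code&gt;output&lt;/code&gt; argument. This is only a sketch; I have tested the default PDF output myself:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Request an HTML report instead of the default pdf
makeDataReport(heartfailure, output = &amp;quot;html&amp;quot;, replace = T)&lt;/code&gt;&lt;/pre&gt;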
&lt;p&gt;&lt;strong&gt;2) DataExplorer&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(DataExplorer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;General visualization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_intro()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since we have missing data, we can further visualize it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% profile_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              feature num_missing pct_missing
## 1                age          20  0.06688963
## 2            anaemia           0  0.00000000
## 3         cpk_enzyme           0  0.00000000
## 4           diabetes           0  0.00000000
## 5  ejection_fraction           0  0.00000000
## 6    hblood_pressure           0  0.00000000
## 7          platelets           0  0.00000000
## 8         creatinine           0  0.00000000
## 9             sodium           0  0.00000000
## 10               sex          10  0.03344482
## 11           smoking           0  0.00000000
## 12              time           0  0.00000000
## 13       death_event           0  0.00000000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also do a correlation plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  plot_correlation()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, I think the correlation plot from the corrplot package is cleaner and easier to read. Here is a plot from corrplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(corrplot)

heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  cor() %&amp;gt;% 
  corrplot(type = &amp;quot;upper&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we can get an overall HTML report from the DataExplorer package using the &lt;code&gt;create_report()&lt;/code&gt; function.&lt;/p&gt;
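&lt;p&gt;A minimal call looks like this (a sketch; by default the report is written as an html file in the working directory, and the response variable argument is optional):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Generate the overall EDA report; death_event is used as the response here
create_report(heartfailure, y = &amp;quot;death_event&amp;quot;)&lt;/code&gt;&lt;/pre&gt;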
&lt;p&gt;&lt;strong&gt;3) dlookr&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dlookr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can assess the normality of the data using this package. The code below will plot normality for all numeric variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, for the sake of simplicity, we will run it for only one variable in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality(age)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can also get a correlation matrix plot from this package, and there is no need to remove the NAs or select the numeric variables before running the function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_correlate()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, dlookr can produce an overall report of the data exploration in PDF (and other formats as well). This report is quite comprehensive; have a &lt;a href=&#34;https://tengkuhanis.netlify.app/files/EDA_Paged_Report.pdf&#34;&gt;look&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  eda_paged_report(target = &amp;quot;death_event&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4) skimr&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The skimr package, especially its &lt;code&gt;skim()&lt;/code&gt; function, did not display correctly in blogdown. Hence, I have included a screenshot of the result that we would typically see in the R console.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(skimr)
skim(heartfailure) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;images/black.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from skimr we get an overview that includes histograms for the numerical data as well.&lt;/p&gt;
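&lt;p&gt;If the inline histograms are what breaks the rendering, skimr also provides &lt;code&gt;skim_without_charts()&lt;/code&gt;, which returns the same summary minus the tiny histograms. This may be a workaround, though I have not tested it with blogdown:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Same summary as skim(), but without the spark histograms
skim_without_charts(heartfailure)&lt;/code&gt;&lt;/pre&gt;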
&lt;p&gt;&lt;strong&gt;5) outliertree&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This package identifies outliers using a decision tree. I will not go into detail about the approach; those interested can read &lt;a href=&#34;https://arxiv.org/abs/2001.00636&#34;&gt;further&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(outliertree)
outlier.tree(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Reporting top 2 outliers [out of 2 found]
## 
## row [251] - suspicious column: [creatinine] - suspicious value: [0.50]
##  distribution: 96.000% &amp;gt;= 0.70 - [mean: 1.35] - [sd: 1.22] - [norm. obs: 24]
##  given:
##      [cpk_enzyme] &amp;gt; [1610.00] (value: 2522.00)
## 
## 
## row [32] - suspicious column: [cpk_enzyme] - suspicious value: [23.00]
##  distribution: 98.958% &amp;gt;= 47.00 - [mean: 677.01] - [sd: 1321.86] - [norm. obs: 95]
##  given:
##      [death_event] = [Yes]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Outlier Tree model
##  Numeric variables: 7
##  Categorical variables: 6
## 
## Consists of 369 clusters, spread across 48 tree branches&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further explore the detected outliers using a histogram and a boxplot. Let’s do so for the variable creatinine.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# histogram
hist(heartfailure$creatinine, breaks = 50, col = &amp;quot;navy&amp;quot;,
     xlab = &amp;quot;Creatinine&amp;quot;, 
     main = &amp;quot;Creatinine level&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# boxplot
boxplot(heartfailure$creatinine)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Probably in the future I will delve into outlier detection and related R packages in more detail. If I ever write a post about it, I will link it here.&lt;/p&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;These are some useful packages that I have found. I may edit this post in the future to add more data exploration packages. Furthermore, there are Shiny apps for data exploration as well, though I think it is better to stick with a coded approach in data analysis/exploration. Thus, I did not explore those apps in this post. Another thing to remember is to set the variable types accordingly prior to the data exploration.&lt;/p&gt;
&lt;p&gt;Hope this is useful!&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://github.com/ekstroem/dataMaid&#34; class=&#34;uri&#34;&gt;https://github.com/ekstroem/dataMaid&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&#34; class=&#34;uri&#34;&gt;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A summary of forcats package</title>
      <link>https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/</link>
      <pubDate>Tue, 18 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;img src=&#34;forcats_logo.png&#34; width=&#34;30%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I just watched a &lt;a href=&#34;https://youtu.be/qWYgNjnHNWI&#34;&gt;youtube video by Andrew Couch&lt;/a&gt; about his commonly used functions in the readr, stringr, and forcats packages. Although I have used the forcats package before, I realised that I have not fully utilised all of its functions.&lt;/p&gt;
&lt;p&gt;So, in this post, I have summarised the main forcats functions that I find useful in my day-to-day R coding. Basically, it is more of a note to myself.&lt;/p&gt;
&lt;div id=&#34;main-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Main functions&lt;/h2&gt;
&lt;p&gt;We will use the &lt;a href=&#34;https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars&#34;&gt;mtcars data&lt;/a&gt; to demonstrate each function. forcats is part of the tidyverse, so it will load once we load the tidyverse packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
glimpse(mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 32
## Columns: 11
## $ mpg  &amp;lt;dbl&amp;gt; 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  &amp;lt;dbl&amp;gt; 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp &amp;lt;dbl&amp;gt; 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   &amp;lt;dbl&amp;gt; 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat &amp;lt;dbl&amp;gt; 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   &amp;lt;dbl&amp;gt; 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec &amp;lt;dbl&amp;gt; 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   &amp;lt;dbl&amp;gt; 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear &amp;lt;dbl&amp;gt; 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb &amp;lt;dbl&amp;gt; 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 9 forcats functions that I think are very useful.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;factor()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;factor()&lt;/code&gt; changes a variable’s type into a factor or categorical type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars$carb &amp;lt;- factor(mtcars$carb)
glimpse(mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 32
## Columns: 11
## $ mpg  &amp;lt;dbl&amp;gt; 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  &amp;lt;dbl&amp;gt; 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp &amp;lt;dbl&amp;gt; 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   &amp;lt;dbl&amp;gt; 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat &amp;lt;dbl&amp;gt; 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   &amp;lt;dbl&amp;gt; 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec &amp;lt;dbl&amp;gt; 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   &amp;lt;dbl&amp;gt; 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear &amp;lt;dbl&amp;gt; 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb &amp;lt;fct&amp;gt; 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_inorder()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function sorts factor levels based on the order of appearance in the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_inorder(mtcars$carb) # levels based on the order of appearance&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 1 2 3 6 8&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_infreq()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function sorts factor levels based on the frequency of values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_count(mtcars$carb) # this is forcats function as well, count factor level&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_infreq(mtcars$carb) # levels based on the frequency values&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 2 4 1 3 6 8&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_relevel()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function can be used to change the order manually.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_relevel(mtcars$carb, c(&amp;quot;8&amp;quot;, &amp;quot;6&amp;quot;, &amp;quot;4&amp;quot;, &amp;quot;3&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;1&amp;quot;)) # manually changed new levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 8 6 4 3 2 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_relevel()&lt;/code&gt; can also be used to move a single factor level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_relevel(mtcars$carb, &amp;quot;8&amp;quot;, after = 2) # change level 8 to the third place&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 1 2 8 3 4 6&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_reorder()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function changes the order based on another variable. Let’s change the levels of the variable carb based on the values of the variable disp.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_reorder(mtcars$carb, mtcars$disp, .fun = sum, .desc = TRUE) # new level based on disp value&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 2 1 3 8 6&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars %&amp;gt;% 
  group_by(carb) %&amp;gt;% 
  summarise(sum_disp = sum(disp)) %&amp;gt;% 
  arrange(desc(sum_disp)) # this is basically what we do with fct_reorder() above&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   carb  sum_disp
##   &amp;lt;fct&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 4        3088.
## 2 2        2082.
## 3 1         940.
## 4 3         827.
## 5 8         301 
## 6 6         145&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additionally, &lt;code&gt;fct_reorder()&lt;/code&gt; can be used when plotting.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Original plot
ggplot(mtcars, aes(x = carb, y = disp)) +
  geom_col()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot with changed levels
mtcars %&amp;gt;% 
  mutate(carb = fct_reorder(carb, disp, .fun = sum, .desc = TRUE)) %&amp;gt;% 
  ggplot(aes(x = carb, y = disp)) +
  geom_col()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;ol start=&#34;6&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_lump()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function lumps factor levels into a combined level. There are 5 variants of this function:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;fct_lump()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_min()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_n()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_lowfreq()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The remaining variant is &lt;code&gt;fct_lump_prop()&lt;/code&gt;. It is not in the examples below, as I have not found it useful, at least in my current R coding routine.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;fct_lump()&lt;/code&gt; automatically lumps the small-frequency factor groups into one group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_count(mtcars$carb) # this is forcats function as well, count factor level&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_lump(mtcars$carb) %&amp;gt;% fct_count() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 4        10
## 4 Other     5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_min()&lt;/code&gt; lumps factor groups that appear fewer times than the given minimum into one group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_min(mtcars$carb, min = 2)) # group 6 and 8 lump into one group&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     1     2     3     4 Other 
##     7    10     3    10     2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_n()&lt;/code&gt; lumps all levels except the &lt;em&gt;n&lt;/em&gt; most frequent factor groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_n(mtcars$carb, n = 2)) # 2 frequent group only, others in one group&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     2     4 Other 
##    10    10    12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_lowfreq()&lt;/code&gt; lumps the least frequent groups into one group, while ensuring that the lumped group is still the smallest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_lowfreq(mtcars$carb, other_level = &amp;quot;low&amp;quot;)) # group low is still the smallest&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   1   2   4 low 
##   7  10  10   5&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;7&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_other()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;fct_other()&lt;/code&gt; is much like &lt;code&gt;fct_lump()&lt;/code&gt;, except that we manually choose which factor groups are combined.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_other(mtcars$carb, keep = c(&amp;quot;8&amp;quot;, &amp;quot;6&amp;quot;))) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     6     8 Other 
##     1     1    30&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;8&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_recode()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function is used to rename or relabel factor groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_recode(mtcars$carb, hanis = &amp;quot;8&amp;quot;)) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     1     2     3     4     6 hanis 
##     7    10     3    10     1     1&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;9&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_relabel()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;fct_relabel()&lt;/code&gt; is extremely useful if we want to rename quite a number of factor groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(mtcars$carb) # original groups&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  1  2  3  4  6  8 
##  7 10  3 10  1  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_relabel(mtcars$carb, ~ c(&amp;quot;abu&amp;quot;, &amp;quot;ali&amp;quot;, &amp;quot;chong&amp;quot;, &amp;quot;siti&amp;quot;, &amp;quot;krish&amp;quot;, &amp;quot;lee&amp;quot;))) # new named groups&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   abu   ali chong  siti krish   lee 
##     7    10     3    10     1     1&lt;/code&gt;&lt;/pre&gt;
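&lt;p&gt;Since &lt;code&gt;fct_relabel()&lt;/code&gt; accepts a function, we can also relabel all groups programmatically instead of typing out a vector of names. A small sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_relabel(mtcars$carb, ~ paste0(&amp;quot;carb_&amp;quot;, .x))) # add a prefix to every level&lt;/code&gt;&lt;/pre&gt;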
&lt;p&gt;Reference:&lt;br /&gt;
&lt;a href=&#34;https://forcats.tidyverse.org/index.html&#34; class=&#34;uri&#34;&gt;https://forcats.tidyverse.org/index.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Handling imbalanced data</title>
      <link>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</link>
      <pubDate>Fri, 14 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;overview&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;Imbalanced data happens when there is an unequal distribution of classes within a categorical outcome variable. Imbalanced data occurs for several reasons, such as a biased sampling method or measurement errors. However, the imbalance may also be an inherent characteristic of the data; for example, in a rare-disease predictive model the imbalance is expected.&lt;/p&gt;
&lt;p&gt;Generally, there are two types of imbalance problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slight imbalance: the imbalance is small, like 4:6&lt;/li&gt;
&lt;li&gt;Severe imbalance: the imbalance is large, like 1:100 or more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Slightly imbalanced cases are usually not a concern, while severely imbalanced cases require a more specialised method to build a predictive model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-problem&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;What’s the problem with imbalanced data?&lt;br /&gt;
Firstly, a predictive model built on imbalanced data is biased towards the majority class. The minority class becomes harder to predict, as there are few data points from this class. So, the detection rate for the minority class will be very low.
Secondly, accuracy is not a good measure in this case. We may get a good accuracy, but in reality the accuracy does not reflect the unequal distribution of the data. This is known as the &lt;a href=&#34;https://en.wikipedia.org/wiki/Accuracy_paradox&#34;&gt;accuracy paradox&lt;/a&gt;. Imagine we have 90% of the data belonging to the majority class, while the remaining 10% belong to the minority class. Just by predicting every observation as the majority class, the model can easily get 90% accuracy.&lt;/p&gt;
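&lt;p&gt;A quick simulation in base R illustrates this paradox (the 90:10 split here is made up for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Made-up 90:10 outcome to illustrate the accuracy paradox
outcome &amp;lt;- factor(c(rep(&amp;quot;No&amp;quot;, 90), rep(&amp;quot;Yes&amp;quot;, 10)))

# A useless model that always predicts the majority class
pred &amp;lt;- factor(rep(&amp;quot;No&amp;quot;, 100), levels = levels(outcome))

mean(pred == outcome) # 0.9 accuracy, yet not a single &amp;quot;Yes&amp;quot; detected&lt;/code&gt;&lt;/pre&gt;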
&lt;/div&gt;
&lt;div id=&#34;handling-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Handling approach&lt;/h2&gt;
&lt;p&gt;The easiest approach is to collect more data, though this may not be practical in every situation. Fortunately, there are a few machine learning techniques available to tackle this problem.&lt;/p&gt;
&lt;p&gt;Here is a summary of resampling techniques available in &lt;code&gt;themis&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;method-themis.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The over-sampling approach is preferred when the dataset is small. The under-sampling approach can be used when the dataset is large, though it may lead to a loss of information. Additionally, an ensemble technique such as random forest is said to be able to model imbalanced data, though some references/blogs say otherwise.&lt;/p&gt;
&lt;p&gt;So, we are going to compare four over-sampling techniques (upsample, SMOTE, ADASYN, and ROSE) and three under-sampling techniques (downsample, nearmiss, and tomek). The base model, a decision tree, will be used with all the techniques. For the sake of simplicity, the decision trees will not be extensively hyperparameter-tuned. Additionally, a random forest will also be included in the comparison.&lt;/p&gt;
&lt;p&gt;The dataset is from &lt;a href=&#34;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&#34;&gt;here&lt;/a&gt;. This is a summary of the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see from the summary, the variable admit is moderately imbalanced, with roughly a 1:2 ratio (127 vs 273).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(df, aes(admit)) + 
  geom_bar() +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/barplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Below is the code for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(magrittr)
library(tidymodels)
library(themis)

# Data
df &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&amp;quot;)

# Split data
set.seed(1234)
df_split &amp;lt;- initial_split(df)
df_train &amp;lt;- training(df_split)
df_test &amp;lt;- testing(df_split)

# 1) Decision tree ----

# Recipe
dt_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank)

df_train_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)
  
df_test_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
dt_mod &amp;lt;- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %&amp;gt;% 
  set_engine(&amp;quot;rpart&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
dt_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune &amp;lt;- 
  dt_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- dt_tune %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final &amp;lt;- 
  dt_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
dt_train &amp;lt;- 
  dt_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred = .pred_class)

# 2) Oversampling ----
## step_upsample() ----

# Recipe
up_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_upsample(admit,
                seed = 1234)

df_train_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_up &amp;lt;- vfold_cv(df_train_up)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_up &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  tune_grid(resamples = df_cv_up,
            metrics = metric_set(accuracy))

## Select best model
best_tune_up &amp;lt;- dt_tune_up %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  finalize_workflow(best_tune_up)

# Fit on train data
dt_train_up &amp;lt;- 
  dt_wf_final_up %&amp;gt;% 
  fit(data = df_train_up)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_up, new_data = df_test_rec_up)) %&amp;gt;% 
  rename(pred_up = .pred_class)

## step_smote() ----

# Recipe
smote_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_smote(admit, 
             seed = 1234)

df_train_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_smote &amp;lt;- vfold_cv(df_train_smote)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_smote &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  tune_grid(resamples = df_cv_smote,
            metrics = metric_set(accuracy))

## Select best model
best_tune_smote &amp;lt;- dt_tune_smote %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  finalize_workflow(best_tune_smote)

# Fit on train data
dt_train_smote &amp;lt;- 
  dt_wf_final_smote %&amp;gt;% 
  fit(data = df_train_smote)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_smote, new_data = df_test_rec_smote)) %&amp;gt;% 
  rename(pred_smote = .pred_class)

## step_rose() ----

# Recipe
rose_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_rose(admit, 
             seed = 1234)

df_train_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_rose &amp;lt;- vfold_cv(df_train_rose)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_rose &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  tune_grid(resamples = df_cv_rose,
            metrics = metric_set(accuracy))

## Select best model
best_tune_rose &amp;lt;- dt_tune_rose %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  finalize_workflow(best_tune_rose)

# Fit on train data
dt_train_rose &amp;lt;- 
  dt_wf_final_rose %&amp;gt;% 
  fit(data = df_train_rose)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_rose, new_data = df_test_rec_rose)) %&amp;gt;% 
  rename(pred_rose = .pred_class)

## step_adasyn() ----

# Recipe
adasyn_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_adasyn(admit, 
            seed = 1234)

df_train_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_adasyn &amp;lt;- vfold_cv(df_train_adasyn)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_adasyn &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  tune_grid(resamples = df_cv_adasyn,
            metrics = metric_set(accuracy))

## Select best model
best_tune_adasyn &amp;lt;- dt_tune_adasyn %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  finalize_workflow(best_tune_adasyn)

# Fit on train data
dt_train_adasyn &amp;lt;- 
  dt_wf_final_adasyn %&amp;gt;% 
  fit(data = df_train_adasyn)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_adasyn, new_data = df_test_rec_adasyn)) %&amp;gt;% 
  rename(pred_adasyn = .pred_class)

# 3) Undersampling ----
## step_downsample() ----

# Recipe
down_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_downsample(admit,
                seed = 1234)

df_train_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_down &amp;lt;- vfold_cv(df_train_down)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_down &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  tune_grid(resamples = df_cv_down,
            metrics = metric_set(accuracy))

## Select best model
best_tune_down &amp;lt;- dt_tune_down %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  finalize_workflow(best_tune_down)

# Fit on train data
dt_train_down &amp;lt;- 
  dt_wf_final_down %&amp;gt;% 
  fit(data = df_train_down)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_down, new_data = df_test_rec_down)) %&amp;gt;% 
  rename(pred_down = .pred_class)

## step_nearmiss() ----

# Recipe
nearmiss_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_nearmiss(admit)

df_train_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_nearmiss &amp;lt;- vfold_cv(df_train_nearmiss)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_nearmiss &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  tune_grid(resamples = df_cv_nearmiss,
            metrics = metric_set(accuracy))

## Select best model
best_tune_nearmiss &amp;lt;- dt_tune_nearmiss %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  finalize_workflow(best_tune_nearmiss)

# Fit on train data
dt_train_nearmiss &amp;lt;- 
  dt_wf_final_nearmiss %&amp;gt;% 
  fit(data = df_train_nearmiss)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_nearmiss, new_data = df_test_rec_nearmiss)) %&amp;gt;% 
  rename(pred_nearmiss = .pred_class)

## step_tomek() ----

# Recipe
tomek_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_tomek(admit)

df_train_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_tomek &amp;lt;- vfold_cv(df_train_tomek)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_tomek &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  tune_grid(resamples = df_cv_tomek,
            metrics = metric_set(accuracy))

## Select best model
best_tune_tomek &amp;lt;- dt_tune_tomek %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  finalize_workflow(best_tune_tomek)

# Fit on train data
dt_train_tomek &amp;lt;- 
  dt_wf_final_tomek %&amp;gt;% 
  fit(data = df_train_tomek)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_tomek, new_data = df_test_rec_tomek)) %&amp;gt;% 
  rename(pred_tomek = .pred_class)

# 4) Ensemble approach: random forest ----

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
rf_mod &amp;lt;- rand_forest(
 mtry = tune(),
 trees = tune(),
 min_n = tune()
 ) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(rf_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
rf_tune &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- rf_tune %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
rf_wf_final &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
rf_train &amp;lt;- 
  rf_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(rf_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred_rf = .pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;Now, let’s get the accuracy, sensitivity, specificity, and &lt;a href=&#34;https://en.wikipedia.org/wiki/Matthews_correlation_coefficient#Advantages_of_MCC_over_accuracy_and_F1_score&#34;&gt;Matthews Correlation Coefficient (MCC)&lt;/a&gt; for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all measurements
df_test$admit %&amp;lt;&amp;gt;% as_factor()
pred_col &amp;lt;- colnames(df_test)[5:13]
result &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
sensi &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
specif &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
mathew &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)

for (i in seq_along(pred_col)) {
  # accuracy
  result[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    accuracy(admit, df_test[,pred_col[i]])
  
  # sensitivity
  sensi[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    sensitivity(admit, df_test[,pred_col[i]])
  
  # specificity
  specif[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    specificity(admit, df_test[,pred_col[i]])
  
  # MCC
  mathew[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    mcc(admit, df_test[,pred_col[i]])
}

## Turn into dataframe
result  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;)) %&amp;gt;% 
  rename(model = name, 
         accuracy = .estimate) %&amp;gt;% 
  select(model, accuracy) %&amp;gt;% 
  mutate(model = factor(model,labels = 
                          c(
                            &amp;quot;1&amp;quot; = &amp;quot;base&amp;quot;,
                            &amp;quot;2&amp;quot; = &amp;quot;upsample&amp;quot;,
                            &amp;quot;3&amp;quot; = &amp;quot;smote&amp;quot;,
                            &amp;quot;4&amp;quot; = &amp;quot;rose&amp;quot;,
                            &amp;quot;5&amp;quot; = &amp;quot;adasyn&amp;quot;,
                            &amp;quot;6&amp;quot; = &amp;quot;downsample&amp;quot;,
                            &amp;quot;7&amp;quot; = &amp;quot;nearmiss&amp;quot;,
                            &amp;quot;8&amp;quot; = &amp;quot;tomek&amp;quot;,
                            &amp;quot;9&amp;quot; = &amp;quot;random_forest&amp;quot;
                            )
                        ))

sensi  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

specif %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

mathew %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

result %&amp;lt;&amp;gt;% 
  bind_cols(sensitive = sensi$.estimate, specific = specif$.estimate, mathew = mathew$.estimate)

# Plot the result
result %&amp;gt;% 
  pivot_longer(cols = 2:5, names_to = &amp;quot;measure&amp;quot;) %&amp;gt;% 
  ggplot(aes(x = model, y = value, fill = measure)) +
  geom_bar(position = &amp;quot;dodge&amp;quot;, stat = &amp;quot;identity&amp;quot;) +
  theme_bw() +
  coord_flip() +
  geom_text(aes(label = paste0(round(value*100, digits = 1), &amp;quot;%&amp;quot;)), 
            position = position_dodge(0.9), vjust = 0.3, size = 2.7, hjust = -0.1) +
  labs(title = &amp;quot;Comparison of unbalanced data techniques&amp;quot;, 
       x = &amp;quot;Techniques&amp;quot;, 
       y = &amp;quot;Performance&amp;quot;) +
  scale_fill_discrete(name = &amp;quot;Metrics:&amp;quot;,
                      labels = c(&amp;quot;Accuracy&amp;quot;, &amp;quot;MCC&amp;quot;, &amp;quot;Sensitivity&amp;quot;, &amp;quot;Specificity&amp;quot;)) +
  theme(legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/summary-measure2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see from the plot above that the base model (decision tree) clearly has a low detection rate for the minority class (low specificity). All of the methods are able to increase the specificity, while sacrificing some accuracy and sensitivity. As mentioned earlier, accuracy is not a good metric for this kind of model (i.e., the accuracy paradox). MCC, on the other hand, takes into account all four cells of the confusion matrix: true positives, false positives, true negatives, and false negatives. Hence, MCC is more informative than accuracy (and than the F score, which has not been included in the plot for the sake of simplicity).&lt;/p&gt;
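&lt;p&gt;To make the contrast concrete, here is a toy example (the confusion-matrix numbers are made up purely for illustration, not taken from the models above) of MCC versus accuracy on an imbalanced problem:&lt;/p&gt;

```r
# Toy confusion matrix for an imbalanced problem (illustrative numbers only):
# 90 majority-class cases, 10 minority-class cases.
tp <- 2;  fn <- 8   # minority class: only 2 of 10 detected
tn <- 85; fp <- 5   # majority class: 85 of 90 correct

accuracy <- (tp + tn) / (tp + tn + fp + fn)
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

round(accuracy, 2)  # 0.87 -- looks good despite missing most of the minority class
round(mcc, 2)       # 0.17 -- reflects the poor minority-class detection
```

A high accuracy here is driven almost entirely by the majority class, while MCC stays low because the minority class is mostly missed.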
&lt;p&gt;Based on MCC, specificity, and sensitivity, the downsample approach probably gives the most balanced model. However, this does not mean that downsampling is the best technique overall, as I believe each technique behaves differently from one dataset to another.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://themis.tidymodels.org/reference/index.html&#34; class=&#34;uri&#34;&gt;https://themis.tidymodels.org/reference/index.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&#34; class=&#34;uri&#34;&gt;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&#34; class=&#34;uri&#34;&gt;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exponentially Weighted Average in Deep Learning</title>
      <link>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</link>
      <pubDate>Sun, 09 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have been reading about loss functions and optimisers in deep learning for the last couple of days, when I stumbled upon the term Exponentially Weighted Average (EWA). So, in this post I aim to explain my understanding of EWA.&lt;/p&gt;
&lt;div id=&#34;overview-of-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview of EWA&lt;/h2&gt;
&lt;p&gt;EWA is an important concept in deep learning and has been used in several optimisers to smooth out the noise in the data.&lt;/p&gt;
&lt;p&gt;Let’s see the formula for EWA:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;formula.png&#34; width=&#34;60%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the smoothed value at point &lt;em&gt;t&lt;/em&gt;, while &lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the data point at point &lt;em&gt;t&lt;/em&gt;. &lt;em&gt;B&lt;/em&gt; here is a hyperparameter that we need to tune in our network. So, the choice of &lt;em&gt;B&lt;/em&gt; determines how many data points we effectively average over when computing &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt;, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;beta.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
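&lt;p&gt;As a quick side illustration (my own sketch, not from the references), the recursion &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;B&lt;/em&gt;·&lt;em&gt;V&lt;sub&gt;t−1&lt;/sub&gt;&lt;/em&gt; + (1 − &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; can be computed directly in R:&lt;/p&gt;

```r
# My own sketch: the EWA recursion in plain R.
ewa <- function(s, beta, v0 = 0) {
  v <- numeric(length(s))
  v_prev <- v0
  for (t in seq_along(s)) {
    v_prev <- beta * v_prev + (1 - beta) * s[t]  # V_t = B*V_(t-1) + (1-B)*S_t
    v[t] <- v_prev
  }
  v
}

s <- c(10, 12, 9, 11)
ewa(s, beta = 0.9)  # heavily smoothed, moves slowly towards the data
ewa(s, beta = 0.1)  # barely smoothed, tracks the data closely
```

A large &lt;em&gt;B&lt;/em&gt; produces a smooth, slow-moving average; a small &lt;em&gt;B&lt;/em&gt; follows the raw data almost exactly.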
&lt;/div&gt;
&lt;div id=&#34;ewa-in-deep-learnings-optimiser&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;EWA in deep learnings’ optimiser&lt;/h2&gt;
&lt;p&gt;So, some of the optimisers that adopt the approach of EWA are (red box indicates the EWA part in each formula):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Stochastic gradient descent (SGD) with momentum&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The issue with SGD is the presence of noise while searching for the global minimum. So, SGD with momentum integrates the EWA, which reduces this noise and helps the network converge faster.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;SGD-momentum2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
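&lt;p&gt;As a minimal sketch (my own toy example, not the reference’s notation), the velocity term in SGD with momentum is an EWA of the past gradients:&lt;/p&gt;

```r
# Sketch of SGD with momentum minimising f(w) = w^2 (gradient: 2w).
# The velocity v is an exponentially weighted average of past gradients.
grad <- function(w) 2 * w

w <- 5; v <- 0
beta <- 0.9; lr <- 0.1
for (step in 1:100) {
  v <- beta * v + (1 - beta) * grad(w)  # EWA of gradients
  w <- w - lr * v                       # parameter update
}
w  # close to the minimum at 0
```

Because the velocity averages over past gradients, short-term oscillations in the gradient largely cancel out while the consistent downhill direction accumulates.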
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive delta (Adadelta) and Root Mean Square Propagation (RMSprop)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Adadelta and RMSprop were proposed in an attempt to solve the diminishing learning rate issue of the adaptive gradient (Adagrad) optimiser. The use of EWA in both optimisers helps achieve this. The two optimisers have quite similar formulas; attached below is the formula for Adadelta.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;adadelta2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive moment estimation (ADAM)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ADAM basically combines SGD with momentum and Adadelta. As shown earlier, both of these optimisers use EWA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;more-details-on-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;More details on EWA&lt;/h2&gt;
&lt;p&gt;Now, let’s go back to EWA. Here is an example of calculating EWA:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq1.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Keep in mind that &lt;em&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt; is the latest time point, preceded by &lt;em&gt;t&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;t&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt;, respectively. So, if we want to calculate &lt;em&gt;V&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq2.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, if we want to see how varying the value of &lt;em&gt;B&lt;/em&gt; changes the weights across the equation (while the values of &lt;em&gt;a&lt;sub&gt;1&lt;/sub&gt;…a&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt; remain constant), we can do so in R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) 

func &amp;lt;- function(b) (1 - b) * b^((20:1) - 1)
beta &amp;lt;- seq(0.1, 0.9, by=0.2)

dat &amp;lt;- t(sapply(beta, func)) %&amp;gt;% 
  as.data.frame()
colnames(dat)[1:20] &amp;lt;- 1:20

dat %&amp;gt;%  
  mutate(beta = as_factor(beta)) %&amp;gt;%
  pivot_longer(cols = 1:20, names_to = &amp;quot;data_point&amp;quot;, values_to = &amp;quot;weight&amp;quot;) %&amp;gt;% 
  ggplot(aes(x=as.numeric(data_point), y=weight, color=beta)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:20) +
  labs(title = &amp;quot;Change of Exponentially Weighted Average function&amp;quot;, 
       subtitle = &amp;quot;Time at t20 is the recent time, and t1 is the initial time&amp;quot;) +
  scale_colour_discrete(&amp;quot;Beta:&amp;quot;) +
  xlab(&amp;quot;Time(t)&amp;quot;) +
  ylab(&amp;quot;Weights/Coefficients&amp;quot;) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that time at t&lt;sub&gt;20&lt;/sub&gt; is the recent time, and t&lt;sub&gt;1&lt;/sub&gt; is the initial time. Thus, two main points from the above plot are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The EWA function acts in a decaying manner.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As beta, &lt;em&gt;B&lt;/em&gt;, increases, the weights spread over more past data points, so less weight is placed on the most recent point (which always receives a weight of 1 − &lt;em&gt;B&lt;/em&gt;).&lt;/li&gt;
&lt;/ol&gt;
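&lt;p&gt;A quick numeric check of this (my own addition): from the recursion, the most recent data point always enters &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; with a weight of exactly 1 − &lt;em&gt;B&lt;/em&gt;:&lt;/p&gt;

```r
# From V_t = beta * V_(t-1) + (1 - beta) * S_t, the latest point S_t
# enters the average with weight (1 - beta).
beta <- c(0.1, 0.5, 0.9)
recent_weight <- 1 - beta
recent_weight  # 0.9 0.5 0.1: the larger beta is, the less the latest point counts
```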
&lt;p&gt;&lt;em&gt;Side note: I have tried to do the plot in plotly, not sure why it did not work&lt;/em&gt; 😕&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
1) &lt;a href=&#34;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&lt;/a&gt; (all the equations are from this reference)&lt;br /&gt;
2) &lt;a href=&#34;https://youtu.be/NxTFlzBjS-4&#34; class=&#34;uri&#34;&gt;https://youtu.be/NxTFlzBjS-4&lt;/a&gt;&lt;br /&gt;
3) &lt;a href=&#34;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&#34; class=&#34;uri&#34;&gt;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Base R vs tidyverse</title>
      <link>https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/</link>
      <pubDate>Tue, 04 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;First of all, this write-up is meant for beginners in R.&lt;/p&gt;
&lt;p&gt;Things can be done in many ways in R. In fact, R is very flexible in this regard compared to other statistical software. Basic tasks such as selecting a column, slicing rows, or filtering data based on certain conditions can be done using base R functions. However, all of these can also be done using the tidyverse approach.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;Tidyverse&lt;/a&gt; is basically a collection of packages that can be loaded with a single line of code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The tidyverse is developed by the RStudio team, pioneered by &lt;a href=&#34;http://hadley.nz/&#34;&gt;Hadley Wickham&lt;/a&gt;, which means these packages will be continuously maintained and updated.&lt;/p&gt;
&lt;p&gt;So, without further ado, here are comparisons between the two approaches for some very basic tasks:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Select or deselect a column and a row&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[1:5, c(&amp;quot;Sepal.Length&amp;quot;, &amp;quot;Sepal.Width&amp;quot;)]
iris[1:5,c(1,2)] # similar to above
iris[1:5, -1]

# Tidyverse
iris %&amp;gt;% 
  select(Sepal.Length, Sepal.Width) %&amp;gt;% 
  slice(1:5)
iris %&amp;gt;% 
  select(-Sepal.Length) %&amp;gt;% 
  slice(1:5)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Filter based on condition&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[iris$Species == &amp;quot;setosa&amp;quot;, ]

# Tidyverse
iris %&amp;gt;% 
  filter(Species == &amp;quot;setosa&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Mutate a new variable (transmute, by contrast, keeps only the newly created variables)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris$SL_minus10 &amp;lt;- iris$Sepal.Length - 10

# Tidyverse
iris %&amp;gt;% 
  mutate(SL_minus10 = Sepal.Length - 10)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Sort variable&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[order(-iris$Sepal.Width),]

# Tidyverse
iris %&amp;gt;% 
  arrange(desc(Sepal.Width))&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Group by (and get mean for variable Sepal.Width)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not really base R
doBy::summaryBy(Sepal.Width~Species, iris, FUN = mean) 

# Tidyverse
iris %&amp;gt;% 
  group_by(Species) %&amp;gt;% 
  summarise(mean_SW = mean(Sepal.Width))&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;6&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Rename variable&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
colnames(iris)[5] &amp;lt;- &amp;quot;hanis&amp;quot;

# Tidyverse
iris %&amp;gt;% 
  rename(hanis = Species)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, that’s it. Overall, the tidyverse gives clarity in understanding the code, as it reads from left to right. In contrast, base R code reads from the inside out, especially for more complicated code.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Loop vs apply in R</title>
      <link>https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/</link>
      <pubDate>Tue, 04 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have heard quite a few times that the apply functions are faster than loops in R. Loops are said to be inefficient, though in certain situations a loop is the only way.&lt;/p&gt;
&lt;p&gt;Let’s compare a &lt;code&gt;for&lt;/code&gt; loop and an apply function in R.&lt;/p&gt;
&lt;p&gt;First, create a very big fake dataset: a list of vectors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
xlist &amp;lt;- list(col1 = rnorm(10000000), 
              col2 = rnorm(10000000),
              col3 = rnorm(100000000),
              col4 = rnorm(1000000)) # this will take a few seconds&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, calculate the mean of each vector using a &lt;code&gt;for&lt;/code&gt; loop.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ptm &amp;lt;- proc.time() #-- start the clock

mean_loop &amp;lt;- vector(&amp;quot;list&amp;quot;, 0) # place holder for a value
for (i in seq_along(xlist)) {
  mean_loop[[i]] &amp;lt;- mean(xlist[[i]])
}

proc.time() - ptm #-- stop the clock (time in seconds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    user  system elapsed 
##    0.38    0.00    0.37&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, using &lt;code&gt;lapply()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ptm &amp;lt;- proc.time() #-- start the clock

mean_apply &amp;lt;- lapply(xlist, mean)

proc.time() - ptm #-- stop the clock&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    user  system elapsed 
##    0.34    0.00    0.35&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, &lt;code&gt;lapply()&lt;/code&gt; is a little bit faster. With a very big dataset and a more complicated task, &lt;code&gt;lapply()&lt;/code&gt; is probably the right choice, but for a &amp;quot;normal&amp;quot;-sized dataset, the choice between the two probably does not make much difference.&lt;/p&gt;
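&lt;p&gt;A side note (my own suggestion): a single &lt;code&gt;proc.time()&lt;/code&gt; run is noisy, so for quick checks base R’s &lt;code&gt;system.time()&lt;/code&gt; is a convenient wrapper, and it is worth confirming that both approaches return identical results (a sketch on a smaller list so it runs quickly):&lt;/p&gt;

```r
# Sketch: time both approaches with system.time() and confirm equal results.
set.seed(2021)
xs <- list(a = rnorm(1e5), b = rnorm(1e5), c = rnorm(1e5))

t_loop <- system.time({
  mean_loop <- vector("list", length(xs))
  for (i in seq_along(xs)) mean_loop[[i]] <- mean(xs[[i]])
})["elapsed"]

t_apply <- system.time(mean_apply <- lapply(xs, mean))["elapsed"]

# Whichever is faster on a given run, the computed means are the same.
identical(unname(unlist(mean_loop)), unname(unlist(mean_apply)))  # TRUE
```

For serious benchmarking, running each expression many times and comparing the distributions of timings is more reliable than any single measurement.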
</description>
    </item>
    
  </channel>
</rss>
