<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/post/</link>
      <atom:link href="https://tengkuhanis.netlify.app/post/index.xml" rel="self" type="application/rss+xml" />
    <description>Posts</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Wed, 07 Aug 2024 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>Posts</title>
      <link>https://tengkuhanis.netlify.app/post/</link>
    </image>
    
    <item>
      <title>Basic plotting with Matplotlib and Seaborn</title>
      <link>https://tengkuhanis.netlify.app/post/basic-plotting-in-python/</link>
      <pubDate>Wed, 07 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/basic-plotting-in-python/</guid>
      <description>


&lt;p&gt;This post is a continuation of my previous post about Python. For those interested:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/&#34;&gt;Basic data wrangling with Python&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Basic plotting with matplotlib and seaborn&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Comparison of ggplot in R versus in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are several packages or libraries available in Python for plotting and visualization. However, the most commonly used package is &lt;a href=&#34;https://matplotlib.org/&#34;&gt;matplotlib&lt;/a&gt;. This package is quite extensive and can often be quite complicated to use. Thus, the &lt;a href=&#34;https://seaborn.pydata.org/&#34;&gt;seaborn&lt;/a&gt; package is an alternative and a complement to matplotlib. Seaborn is built on top of matplotlib and provides higher-level functionality compared to matplotlib.&lt;/p&gt;
&lt;p&gt;So, in this blog post, let us compare several basic plots using both packages.&lt;/p&gt;
&lt;div id=&#34;load-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load packages&lt;/h2&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;load-dataset&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load dataset&lt;/h2&gt;
&lt;p&gt;We are going to use the &lt;a href=&#34;https://www.kaggle.com/datasets/arshid/dat-flower-dataset&#34;&gt;iris dataset&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;dat = sns.load_dataset(&amp;#39;iris&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can then take a quick look at the first few rows of this dataset.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;dat.head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;histogram&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Histogram&lt;/h2&gt;
&lt;p&gt;Let’s plot the histogram using matplotlib first.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.hist(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Notice that this histogram does not have any labels. So, to add labels, we need to do it manually.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.hist(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.xlabel(&amp;#39;Sepal length&amp;#39;) #x-axis label
plt.ylabel(&amp;#39;Frequency&amp;#39;) #y-axis label
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-5-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, using seaborn, the label is extracted from the variable name, which is pretty convenient.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.histplot(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-6-5.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Let’s say we want to plot the histogram separately for each species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;species = [&amp;#39;setosa&amp;#39;, &amp;#39;versicolor&amp;#39;, &amp;#39;virginica&amp;#39;]

for i in species:
    subset = dat[dat[&amp;#39;species&amp;#39;] == i]
    plt.hist(subset[&amp;#39;sepal_length&amp;#39;], label = i)

plt.legend(loc = &amp;#39;upper right&amp;#39;)
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Frequency&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-7-7.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The code above is quite long. In seaborn, the same histogram can be generated much more easily.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.histplot(x = &amp;#39;sepal_length&amp;#39;, hue = &amp;#39;species&amp;#39;, data = dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-8-9.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;boxplot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Boxplot&lt;/h2&gt;
&lt;p&gt;First, let’s do a boxplot using matplotlib.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;bp = plt.boxplot(dat[&amp;#39;sepal_length&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-9-11.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we want to do a boxplot grouped by another variable, the code becomes a bit complicated, especially for beginners.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;species = dat.groupby(&amp;#39;species&amp;#39;)
setosa = species.get_group(&amp;#39;setosa&amp;#39;)[&amp;#39;sepal_length&amp;#39;]
versicolor = species.get_group(&amp;#39;versicolor&amp;#39;)[&amp;#39;sepal_length&amp;#39;]
virginica = species.get_group(&amp;#39;virginica&amp;#39;)[&amp;#39;sepal_length&amp;#39;]

bp = plt.boxplot([setosa, versicolor, virginica], labels = [&amp;#39;setosa&amp;#39;, &amp;#39;versicolor&amp;#39;, &amp;#39;virginica&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-10-13.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Both plots above are quite easy to do in seaborn. Below is the code for the basic boxplot.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.boxplot(dat[&amp;#39;sepal_length&amp;#39;])
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-11-15.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, plotting &lt;code&gt;sepal_length&lt;/code&gt; by &lt;code&gt;species&lt;/code&gt; is pretty straightforward in seaborn.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.boxplot(y=&amp;#39;sepal_length&amp;#39;, hue=&amp;#39;species&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-12-17.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;scatter-plot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scatter plot&lt;/h2&gt;
&lt;p&gt;Lastly, let’s see the scatter plot using matplotlib.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.scatter(x=dat[&amp;#39;sepal_length&amp;#39;], y=dat[&amp;#39;sepal_width&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Sepal width&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-13-19.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can further extend this plot by categorising it into different species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Define the species to colors mapping
species_to_color = {&amp;#39;setosa&amp;#39;: &amp;#39;blue&amp;#39;, &amp;#39;versicolor&amp;#39;: &amp;#39;green&amp;#39;, &amp;#39;virginica&amp;#39;: &amp;#39;red&amp;#39;}
colors = dat[&amp;#39;species&amp;#39;].map(species_to_color)

# Create the scatter plot
plt.scatter(x=dat[&amp;#39;sepal_length&amp;#39;], y=dat[&amp;#39;sepal_width&amp;#39;], c=colors)
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Sepal width&amp;#39;)
plt.legend(handles=[plt.Line2D([0], [0], marker=&amp;#39;o&amp;#39;, color=&amp;#39;w&amp;#39;, markerfacecolor=color, markersize=10, label=species) for species, color in species_to_color.items()], title=&amp;#39;Species&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-14-21.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now, let’s see the seaborn package. This is the basic scatter plot.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.scatterplot(x=&amp;#39;sepal_length&amp;#39;, y=&amp;#39;sepal_width&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-15-23.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Extending this plot by categorising it into the different species is actually quite simple in seaborn.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.scatterplot(x=&amp;#39;sepal_length&amp;#39;, y=&amp;#39;sepal_width&amp;#39;, hue=&amp;#39;species&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-16-25.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, matplotlib and seaborn complement each other well. Seaborn is an excellent choice for quick and standard plots, thanks to its high-level interface. On the other hand, matplotlib offers a more extensive range of customization options and is ideal for creating complex and detailed visualizations. Ultimately, choosing between matplotlib and seaborn depends on the specific requirements of the visualization task.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Basic data wrangling with Python</title>
      <link>https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/</link>
      <pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/</guid>
      <description>


&lt;p&gt;&lt;img src=&#34;images/img2.jpeg&#34; style=&#34;width:60.0%;height:40.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Python is one of the most popular programming languages. In this post, I will demonstrate how to do basic data wrangling with Python. This is going to be the first of a (hopefully) short series of posts related to Python. My plan is to cover these topics:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Basic data wrangling with Python&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Basic plotting with matplotlib and seaborn&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Comparison of ggplot in R versus in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once I finish writing any of these topics, I will link it in the list above.&lt;/p&gt;
&lt;p&gt;So, let’s start.&lt;/p&gt;
&lt;div id=&#34;loading-necessary-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading necessary packages&lt;/h2&gt;
&lt;p&gt;Before loading the packages, you need to install them. Basically, there are two ways to install Python packages: with the pip command or with the conda command. I will skip this part, but you can refer to &lt;a href=&#34;https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-packages&#34;&gt;this link to install packages using the pip command&lt;/a&gt; or &lt;a href=&#34;https://conda.io/projects/conda/en/latest/user-guide/concepts/installing-with-conda.html#installing-with-conda&#34;&gt;this link to install packages using the conda command&lt;/a&gt;. For those who have both R and Python on their machine, I suggest using the conda command.&lt;/p&gt;
&lt;p&gt;Let’s load the required packages.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import numpy as np 
import pandas as pd
from seaborn import load_dataset&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All the functions from each package can be accessed through the alias, the abbreviated name above. For example, functions in the &lt;code&gt;pandas&lt;/code&gt; package can be accessed through &lt;code&gt;pd&lt;/code&gt; or, to be specific, the &lt;code&gt;pd.&lt;/code&gt; prefix. You will see this many times throughout this blog post, so do not worry much about it; I am sure you will get the gist of it once you see it later on. In practice, you don’t actually have to use &lt;code&gt;pd&lt;/code&gt; for &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;np&lt;/code&gt; for &lt;code&gt;numpy&lt;/code&gt;, but this is a convention or standard practice widely adopted in the Python community.&lt;/p&gt;
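&lt;p&gt;As a quick illustration (a tiny made-up snippet, not part of the analysis below), the two calls here are equivalent:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import pandas
import pandas as pd

# Both lines call the same function; pd is simply a shorter alias for pandas
s1 = pandas.Series([1, 2, 3])
s2 = pd.Series([1, 2, 3])&lt;/code&gt;&lt;/pre&gt;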
&lt;/div&gt;
&lt;div id=&#34;load-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load the data&lt;/h2&gt;
&lt;p&gt;We are going to use the &lt;a href=&#34;https://archive.ics.uci.edu/dataset/53/iris&#34;&gt;iris dataset&lt;/a&gt;. This dataset is readily available in the seaborn package.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris = load_dataset(&amp;#39;iris&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we load the data, we should check the variable types.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.dtypes&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length    float64
## sepal_width     float64
## petal_length    float64
## petal_width     float64
## species          object
## dtype: object&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The species variable should, by right, be a categorical variable. So, we can use &lt;code&gt;Categorical()&lt;/code&gt; from &lt;code&gt;pandas&lt;/code&gt; to change it from an object type to a category. The &lt;code&gt;pd.&lt;/code&gt; here means we access the function from the &lt;code&gt;pandas&lt;/code&gt; package, as I explained previously.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;] = pd.Categorical(iris[&amp;#39;species&amp;#39;])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we check the variable types again, we can see that the species variable is now a category.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.dtypes&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length     float64
## sepal_width      float64
## petal_length     float64
## petal_width      float64
## species         category
## dtype: object&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can also see the data. Let’s see the first 10 rows.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          3.9           1.7          0.4  setosa
## 6           4.6          3.4           1.4          0.3  setosa
## 7           5.0          3.4           1.5          0.2  setosa
## 8           4.4          2.9           1.4          0.2  setosa
## 9           4.9          3.1           1.5          0.1  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;slicing-and-indexing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Slicing and indexing&lt;/h2&gt;
&lt;p&gt;To see a specific column, we can index it as below. Notice that the row numbers start at 0, as opposed to R (if you have used R previously), in which the row numbers start at 1.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_length&amp;#39;][0:10]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, we can also index as below to get the first 10 rows of the sepal_length variable.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_length&amp;#39;][:10]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, to access the first 5 rows, we can do as below.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[0:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also use the &lt;code&gt;iloc()&lt;/code&gt; and &lt;code&gt;loc()&lt;/code&gt; functions. The main difference between the two is that &lt;code&gt;iloc()&lt;/code&gt; accepts only integer positions, while &lt;code&gt;loc()&lt;/code&gt; accepts labels such as column names.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.iloc[0:2, 0:3] #rows, then columns&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length
## 0           5.1          3.5           1.4
## 1           4.9          3.0           1.4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.loc[0:2, [&amp;#39;sepal_length&amp;#39;, &amp;#39;species&amp;#39;]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length species
## 0           5.1  setosa
## 1           4.9  setosa
## 2           4.7  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Subsequently, we can also slice according to a logical condition. Below, we select the values of petal_length that are above 6.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;ind = iris[&amp;#39;petal_length&amp;#39;] &amp;gt; 6
iris[&amp;#39;petal_length&amp;#39;][ind]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 105    6.6
## 107    6.3
## 109    6.1
## 117    6.7
## 118    6.9
## 122    6.7
## 130    6.1
## 131    6.4
## 135    6.1
## Name: petal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s say we want our data to include only the setosa species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;ind = iris[&amp;#39;species&amp;#39;] == &amp;#39;setosa&amp;#39;
iris.loc[ind, :].head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we know about slicing and indexing, we can use this knowledge to change certain values. For example, below we change:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rows 1, 2, 3, and 4 of sepal_length to NA values&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;row 6 of species and sepal_width to NA values&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.loc[0:3, &amp;#39;sepal_length&amp;#39;] = np.nan 
iris.iloc[5, [1, 4]] = np.nan&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the result.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.head(6)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           NaN          3.5           1.4          0.2  setosa
## 1           NaN          3.0           1.4          0.2  setosa
## 2           NaN          3.2           1.3          0.2  setosa
## 3           NaN          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          NaN           1.7          0.4     NaN&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;missing-values&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Missing values&lt;/h2&gt;
&lt;p&gt;If we want to see whether we have any missing values in our data, we can use the &lt;code&gt;isnull()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().any().any() #For overall&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## True&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().any() #Check for each column&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length     True
## sepal_width      True
## petal_length    False
## petal_width     False
## species          True
## dtype: bool&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further calculate how many missing values we have.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().sum()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length    4
## sepal_width     1
## petal_length    0
## petal_width     0
## species         1
## dtype: int64&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;descriptive-statistics&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Descriptive statistics&lt;/h2&gt;
&lt;p&gt;To get basic descriptive statistics, we can use the &lt;code&gt;describe()&lt;/code&gt; function. Below, we additionally use &lt;code&gt;round()&lt;/code&gt; to round the results to the nearest whole number.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.describe().round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        sepal_length  sepal_width  petal_length  petal_width
## count         146.0        149.0         150.0        150.0
## mean            6.0          3.0           4.0          1.0
## std             1.0          0.0           2.0          1.0
## min             4.0          2.0           1.0          0.0
## 25%             5.0          3.0           2.0          0.0
## 50%             6.0          3.0           4.0          1.0
## 75%             6.0          3.0           5.0          2.0
## max             8.0          4.0           7.0          2.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that the results above only include the numerical variables. So, to get the results for categorical variables as well, we need to add &lt;code&gt;include = &amp;#39;all&amp;#39;&lt;/code&gt; as below.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.describe(include = &amp;#39;all&amp;#39;).round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         sepal_length  sepal_width  petal_length  petal_width     species
## count          146.0        149.0         150.0        150.0         149
## unique           NaN          NaN           NaN          NaN           3
## top              NaN          NaN           NaN          NaN  versicolor
## freq             NaN          NaN           NaN          NaN          50
## mean             6.0          3.0           4.0          1.0         NaN
## std              1.0          0.0           2.0          1.0         NaN
## min              4.0          2.0           1.0          0.0         NaN
## 25%              5.0          3.0           2.0          0.0         NaN
## 50%              6.0          3.0           4.0          1.0         NaN
## 75%              6.0          3.0           5.0          2.0         NaN
## max              8.0          4.0           7.0          2.0         NaN&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, we can also count the unique values of the categorical variable. &lt;code&gt;value_counts()&lt;/code&gt; only counts the non-missing values.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;].value_counts()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## species
## versicolor    50
## virginica     50
## setosa        49
## Name: count, dtype: int64&lt;/code&gt;&lt;/pre&gt;
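&lt;p&gt;If we want the missing values to be counted as well, &lt;code&gt;value_counts()&lt;/code&gt; accepts a &lt;code&gt;dropna&lt;/code&gt; argument (a small sketch; the output is not shown here):&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;].value_counts(dropna=False) #include the count of missing values as well&lt;/code&gt;&lt;/pre&gt;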
&lt;p&gt;Similarly, for a numerical variable, we can also compute each statistic manually. For example, to calculate the mean, we can use &lt;code&gt;mean()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_width&amp;#39;].mean().round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 3.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it. These are the basics of handling a dataset in Python. With this knowledge, I hope you feel ready to dive in and explore more on your own.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What makes data &#34;good enough&#34; for a statistical analysis?</title>
      <link>https://tengkuhanis.netlify.app/post/what-makes-data-good-enough-for-a-statistical-analysis/</link>
      <pubDate>Thu, 29 Feb 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/what-makes-data-good-enough-for-a-statistical-analysis/</guid>
      <description>


&lt;p&gt;&lt;img src=&#34;images/_34facba1-9993-41d5-ae25-c0cab57f2184.jpg&#34; style=&#34;width:60.0%;height:40.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A few days ago, someone asked me to help her with a data analysis. However, the data that she gave me was so messy that it was impossible to run the analysis unless serious data cleaning was done first.&lt;/p&gt;
&lt;p&gt;So, I have been thinking about a general guideline for deciding when data is good enough to run a statistical analysis on.&lt;/p&gt;
&lt;p&gt;First things first, what is the basic format of “good enough” data?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Each row represents an observation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. For a categorical variable, make sure the levels are standardised.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for a gender variable, make sure to have only “male” and “female” instead of “male”, “female”, “men”, and “women”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. For a numerical variable, make sure the values are numeric and do not contain any text.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for a height variable, do not put “1.68m” or “1.68 meter”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Also for a numerical variable, make sure the values are on the same scale.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for a weight variable, do not mix weights in grams and kilograms. If you want to use grams, use them consistently throughout the data, or at least throughout the variable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Do not use symbols in your data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, do not use “X” for no and “/” for yes. A short code sketch after the example tables below illustrates how points 2, 3, and 5 might be cleaned up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;6. The data should be individual data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Individual data means that each row contains information about a single sample or observation. Each observation in the dataset represents a single entity or unit (e.g., a person, a transaction, a product) and includes all relevant attributes or variables for that unit. Individual data allows for detailed analysis at the level of individual observations. Here is an example of individual data:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;Id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Age&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Obese&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Aggregated data, on the other hand, combines multiple individual observations into summary statistics or groups. Instead of representing individual units, aggregated data presents information at a higher level of abstraction, such as groups, categories, or intervals. This aggregation typically involves summarizing data using functions like sums, averages, counts, or percentages. Here is an example of aggregated data based on the individual data previously:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Age&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Obese&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
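&lt;p&gt;To make points 2, 3, and 5 a bit more concrete, here is a small, hypothetical pandas sketch (the column names and values are made up purely for illustration):&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import pandas as pd

# Hypothetical messy data
df = pd.DataFrame({
    &amp;#39;gender&amp;#39;: [&amp;#39;male&amp;#39;, &amp;#39;women&amp;#39;, &amp;#39;men&amp;#39;, &amp;#39;female&amp;#39;],
    &amp;#39;height&amp;#39;: [&amp;#39;1.68m&amp;#39;, &amp;#39;1.75&amp;#39;, &amp;#39;1.60 meter&amp;#39;, &amp;#39;1.80&amp;#39;],
    &amp;#39;smoker&amp;#39;: [&amp;#39;/&amp;#39;, &amp;#39;X&amp;#39;, &amp;#39;/&amp;#39;, &amp;#39;X&amp;#39;],
})

# Point 2: standardise the levels of a categorical variable
df[&amp;#39;gender&amp;#39;] = df[&amp;#39;gender&amp;#39;].replace({&amp;#39;men&amp;#39;: &amp;#39;male&amp;#39;, &amp;#39;women&amp;#39;: &amp;#39;female&amp;#39;})

# Point 3: strip text so a numerical variable contains only numbers
df[&amp;#39;height&amp;#39;] = df[&amp;#39;height&amp;#39;].str.replace(r&amp;#39;[^0-9.]&amp;#39;, &amp;#39;&amp;#39;, regex=True).astype(float)

# Point 5: replace symbols with explicit labels
df[&amp;#39;smoker&amp;#39;] = df[&amp;#39;smoker&amp;#39;].replace({&amp;#39;/&amp;#39;: &amp;#39;yes&amp;#39;, &amp;#39;X&amp;#39;: &amp;#39;no&amp;#39;})
print(df)&lt;/code&gt;&lt;/pre&gt;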
&lt;p&gt;I think the above six points are the basics of building a good enough dataset for a statistical analysis. While we are at it, let’s go through the two main formats of a dataset. These formats matter most when you have a repeated-measures study design, whereby each participant has several values/measurements/responses at several time points.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;Wide format&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the wide format, all the responses of each participant are in a single row. For example, below is the data on the time taken, in seconds, by two participants to answer three questions. As we can see, each row contains the time taken to answer all three questions.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;ID&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;Long format&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the long format (also known as the tidy format in the R community), each response at each time point of each participant is in its own row. Using the same data as before, below is the data in the long format. As we can see, the data is arranged so that each row represents the time taken by one participant to answer one question.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;ID&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Time&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
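&lt;p&gt;Converting between the two formats does not have to be done by hand. As a minimal sketch using pandas (with the column names taken from the example tables above), the wide data can be reshaped into the long format like this:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import pandas as pd

# The wide-format data: one row per participant
wide = pd.DataFrame({
    &amp;#39;ID&amp;#39;: [1, 2],
    &amp;#39;Question1&amp;#39;: [5, 8],
    &amp;#39;Question2&amp;#39;: [10, 20],
    &amp;#39;Question3&amp;#39;: [50, 40],
})

# Reshape to the long format: one row per participant per question
long = wide.melt(id_vars=&amp;#39;ID&amp;#39;, var_name=&amp;#39;Question&amp;#39;, value_name=&amp;#39;Time&amp;#39;)

# Keep only the question number and sort by participant
long[&amp;#39;Question&amp;#39;] = long[&amp;#39;Question&amp;#39;].str.replace(&amp;#39;Question&amp;#39;, &amp;#39;&amp;#39;).astype(int)
long = long.sort_values([&amp;#39;ID&amp;#39;, &amp;#39;Question&amp;#39;]).reset_index(drop=True)
print(long)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In R, the equivalent reshaping can be done with functions such as &lt;code&gt;pivot_longer()&lt;/code&gt; from the tidyr package.&lt;/p&gt;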
</description>
    </item>
    
    <item>
      <title>Mapping the states in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/</link>
      <pubDate>Wed, 22 Feb 2023 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/</guid>
      <description>


&lt;p&gt;I have written two blog posts about making map in R:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;Making maps with R (my first attempt ever!)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/&#34;&gt;My first interactive map with {leaflet}&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post is sort of a continuation of the &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;first blog post&lt;/a&gt;. In that post, I showed how to plot coordinates on a map, specifically for Malaysia.&lt;/p&gt;
&lt;p&gt;However, using the two approaches in the previous blog post, we cannot plot the coordinates for a particular state in Malaysia. At least, I was unable to find how to do that after googling around. But we can plot the Borneo or Peninsular side of Malaysia using the two approaches.&lt;/p&gt;
&lt;div id=&#34;plot-the-peninsular-of-malaysia-not-the-best-way&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot the peninsular of Malaysia (not the best way)&lt;/h2&gt;
&lt;p&gt;Load the necessary packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rworldmap) 
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we get the data. The data is about desa clinics (klinik desa) in Malaysia.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)
head(clinicDesa)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   id facilities_id                     name              address postcode
## 1  1    KD01010019  KLINIK DESA ASSAM BUBOK     Jalan Batu Pahat    86400
## 2  2    KD01010020   KLINIK DESA BATU PUTIH    Jalan Behor Temak    83000
## 3  3    KD01010021      KLINIK DESA BEROLEH    Jalan Parit Besar    83300
## 4  4    KD01010022        KLINIK DESA BINDU Jalan Tongkang Pecah    83010
## 5  5    KD01010023 KLINIK DESA KAMPUNG BARU   Jalan Parit Kemang    83710
## 6  6    KD01010024 KLINIK DESA KANGKAR BARU      Jalan Meng Seng    85400
##             city   district  state tel fax website email image latitude
## 1     Ayer Hitam Batu Pahat Johor       NA      NA    NA    NA 1.933330
## 2          Bagan Batu Pahat Johor       NA      NA    NA    NA 1.889100
## 3     Sri Gading Batu Pahat Johor       NA      NA    NA    NA 1.877890
## 4 Tongkang Pecah Batu Pahat Johor       NA      NA    NA    NA 1.901515
## 5    Parit Yaani Batu Pahat Johor       NA      NA    NA    NA 1.905120
## 6      Yong Peng Batu Pahat Johor       NA      NA    NA    NA 2.065310
##   longitude likes rating status
## 1  103.1167     0      0    NEW
## 2  102.8778     0      0    NEW
## 3  102.9858     0      0    NEW
## 4  102.9665     0      0    NEW
## 5  103.0372     0      0    NEW
## 6  103.1248     0      0    NEW&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s plot the data first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(clinicDesa, aes(longitude, latitude)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, we remove the two outlying points (those with a longitude below 25).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinicDesa2 &amp;lt;- clinicDesa %&amp;gt;% filter(longitude &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, plot the updated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(clinicDesa2, aes(longitude, latitude)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the plot, we already know that the left side consists of the coordinates in Peninsular Malaysia. So, we can limit our plot by restricting the longitude to &amp;gt; 97 and &amp;lt; 105.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get base map
global &amp;lt;- map_data(&amp;quot;world&amp;quot;) 

# Plot
ggplot() + 
  geom_polygon(data = global %&amp;gt;% filter(region == &amp;quot;Malaysia&amp;quot;), aes(x=long, y = lat, group = group), 
               fill = &amp;quot;gray85&amp;quot;) + 
  coord_fixed(1.3) +
  geom_point(data = clinicDesa2, aes(x = longitude, y = latitude)) +
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in the peninsular of Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) +
  xlim(97, 105) #limit overall map to peninsular of Malaysia&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I am not going to re-explain the code above and below as I have explained it in &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;the previous blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This approach also works with &lt;code&gt;rworldmap&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get base map
world &amp;lt;- getMap(resolution = &amp;quot;low&amp;quot;)
msia &amp;lt;- world[world@data$ADMIN == &amp;quot;Malaysia&amp;quot;, ]

# Plot
ggplot() +
  geom_polygon(data = msia, aes(x = long, y = lat, group = group), fill = NA, colour = &amp;quot;black&amp;quot;) +
  geom_point(data = clinicDesa2, aes(x = longitude, y = latitude)) +
  coord_quickmap() + 
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in the peninsular of Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) +
  xlim(97, 105) #limit overall map to peninsular of Malaysia&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, using the two approaches we can plot the Borneo and Peninsular sides of Malaysia. But, at least to my knowledge, we cannot apply these approaches if we want to plot coordinates for a particular state in Malaysia.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;plot-the-states-in-malaysia&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot the states in Malaysia&lt;/h2&gt;
&lt;p&gt;Load the necessary package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(geodata)
library(tidyterra)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, we are going to use the &lt;code&gt;geodata&lt;/code&gt; package. &lt;code&gt;tidyterra&lt;/code&gt; is used to supplement ggplot. First, let’s limit the data to desa clinics in Terengganu only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic_trg &amp;lt;- 
  clinicDesa %&amp;gt;% 
  filter(state == &amp;quot;Terengganu&amp;quot;) %&amp;gt;% 
  dplyr::select(latitude, longitude) 
head(clinic_trg)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   latitude longitude
## 1  5.48533  102.4914
## 2  5.81578  102.5778
## 3  5.70886  102.4892
## 4  5.75722  102.5303
## 5  5.67444  102.6289
## 6  5.69875  102.5430&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we get the map from the &lt;code&gt;geodata&lt;/code&gt; package with the boundaries at the district level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Malaysia &amp;lt;- gadm(country = &amp;quot;MYS&amp;quot;, level = 2, path=tempdir())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can use the below information to limit the map to Terengganu state only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Malaysia$NAME_1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   [1] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;          
##   [5] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;          
##   [9] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [13] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [17] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [21] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [25] &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [29] &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [33] &amp;quot;Kuala Lumpur&amp;quot;    &amp;quot;Labuan&amp;quot;          &amp;quot;Melaka&amp;quot;          &amp;quot;Melaka&amp;quot;         
##  [37] &amp;quot;Melaka&amp;quot;          &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot;
##  [41] &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot;
##  [45] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;         
##  [49] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;         
##  [53] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Perak&amp;quot;          
##  [57] &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;          
##  [61] &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;          
##  [65] &amp;quot;Perak&amp;quot;           &amp;quot;Perlis&amp;quot;          &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;   
##  [69] &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Putrajaya&amp;quot;      
##  [73] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [77] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [81] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [85] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [89] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [93] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [97] &amp;quot;Sabah&amp;quot;           &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [101] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [105] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [109] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [113] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [117] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [121] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [125] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [129] &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;       
## [133] &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;       
## [137] &amp;quot;Selangor&amp;quot;        &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;      
## [141] &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, this is the plot for Terengganu.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Trg &amp;lt;- Malaysia[138:144,]
plot(Trg)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We are going to map this in ggplot and stack the map layer with the coordinate layer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() +
  geom_spatvector(data = Trg, color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_minimal() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;geom_spatvector&lt;/code&gt; is from the &lt;code&gt;tidyterra&lt;/code&gt; package. Alternatively, we can plot using &lt;code&gt;geom_sf&lt;/code&gt;, but we need to convert the &lt;code&gt;SpatVector&lt;/code&gt; data into an &lt;code&gt;sf&lt;/code&gt; object using &lt;code&gt;sf::st_as_sf&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = sf::st_as_sf(Trg)) +
  geom_sf(color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_minimal() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Both approaches produce the same plot.&lt;/p&gt;
&lt;p&gt;We can further add district labels to the plots. For example, using &lt;code&gt;geom_sf&lt;/code&gt;, we can stack it with a &lt;code&gt;geom_sf_label&lt;/code&gt; layer. We can also use &lt;code&gt;theme_void&lt;/code&gt; to remove the background and the map axes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = sf::st_as_sf(Trg)) +
  geom_sf(color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_sf_label(aes(label = NAME_2)) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_void() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Visualising augmented images in Keras</title>
      <link>https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/</link>
      <pubDate>Wed, 28 Dec 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/</guid>
      <description>


&lt;div id=&#34;data-augmentation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data augmentation&lt;/h2&gt;
&lt;p&gt;Data augmentation has been used in deep learning for many reasons. One of the reasons is to reduce overfitting and make the model more robust. Data augmentation can be done relatively easily with the &lt;code&gt;keras&lt;/code&gt; package in R. However, I have not found any resources on how to visualise the augmented images in R, only in Python. Visualising the augmented images can be quite useful to get an idea of how they look. So, this post covers a simple way to do this in R.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R code&lt;/h2&gt;
&lt;p&gt;Let’s load the &lt;code&gt;keras&lt;/code&gt; library.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(keras)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;keras&amp;#39; was built under R version 4.2.2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we load the image from the internet.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r_logo &amp;lt;- 
  get_file(&amp;quot;img&amp;quot;, &amp;quot;https://ih1.redbubble.net/image.522493300.6771/st,small,507x507-pad,600x600,f8f8f8.jpg&amp;quot;) %&amp;gt;% 
  image_load()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our image right now is 600 x 600 x 3. The 3 at the end is because the image is coloured (RGB channels).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r_logo$size&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## [1] 600
## 
## [[2]]
## [1] 600&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, we need to change the image into an array with the dimension of 1 x 600 x 600 x 3. The number 1 indicates we have only one image.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r_logo &amp;lt;- 
  r_logo %&amp;gt;% 
  image_to_array() %&amp;gt;% 
  array_reshape(c(1, dim(.)))
dim(r_logo)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]   1 600 600   3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have the correct dimensions, we can specify the parameters for the data augmentation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;augment_params &amp;lt;- image_data_generator(horizontal_flip = T, 
                                       vertical_flip = T,
                                       rotation_range = 0.5,
                                       zoom_range = 0.5,
                                       fill_mode = &amp;quot;reflect&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I am not going to go into the details of the parameters. For those interested, the &lt;a href=&#34;https://tensorflow.rstudio.com/reference/keras/image_data_generator&#34;&gt;TensorFlow for R website&lt;/a&gt; explains them very well.&lt;/p&gt;
&lt;p&gt;Next, we can generate batches of augmented data at random. This function, however, will only run once we fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;img_gen &amp;lt;- flow_images_from_data(r_logo,
                                 generator = augment_params, 
                                 batch_size = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we can plot the images. First, this is our original image.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;img_gen$x [1,,,] %&amp;gt;% 
  as.raster(max = 255) %&amp;gt;% 
  as.array() %&amp;gt;% 
  plot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now, we are going to loop the augmentation process and generate six augmented images. The &lt;code&gt;set.seed()&lt;/code&gt; is for reproducibility.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
par(mfrow = c(3, 2), mar = c(1, 0, 1, 0))

for (i in 1:6) {
  IMG &amp;lt;- img_gen$`next`()
  IMG[1,,,] %&amp;gt;% as.raster(max = 255) %&amp;gt;% as.array() %&amp;gt;% plot()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I believe this is quite useful to get a sense of how your data is augmented. Consequently, this may help in selecting the parameters for the data augmentation.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Using UMAP preprocessing for image classification</title>
      <link>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</link>
      <pubDate>Wed, 16 Mar 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;umap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;UMAP&lt;/h2&gt;
&lt;p&gt;Uniform manifold approximation and projection, or UMAP for short, is a dimension reduction technique. Basically, UMAP projects a set of features into a smaller space. UMAP can be a supervised technique, in which we supply a label or an outcome, or an unsupervised one. Those interested in the details of how UMAP works can refer to this &lt;a href=&#34;https://umap-learn.readthedocs.io/en/latest/how_umap_works.html&#34;&gt;reference&lt;/a&gt;. For those who prefer a much simpler or shorter version, I recommend a &lt;a href=&#34;https://www.youtube.com/watch?v=eN0wFzBA4Sc&amp;amp;list=WL&amp;amp;index=2&#34;&gt;YouTube video by Joshua Starmer&lt;/a&gt;.&lt;/p&gt;
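&lt;p&gt;To make the supervised versus unsupervised distinction concrete, here is a minimal sketch (using the built-in &lt;code&gt;iris&lt;/code&gt; data purely as an illustration) with &lt;code&gt;step_umap()&lt;/code&gt; from the &lt;code&gt;embed&lt;/code&gt; package, the same function we use below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(recipes)
library(embed)

# Unsupervised UMAP: the embedding is built from the predictors only
rec_unsup &amp;lt;- recipe(Species ~ ., data = iris) %&amp;gt;% 
  step_umap(all_predictors(), num_comp = 2)

# Supervised UMAP: the outcome guides the embedding
rec_sup &amp;lt;- recipe(Species ~ ., data = iris) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(Species), num_comp = 2)&lt;/code&gt;&lt;/pre&gt;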
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;We are going to see how to apply the UMAP technique for image preprocessing and then classify the images using kNN and naive Bayes.&lt;/p&gt;
&lt;p&gt;These are the packages that we need.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(keras) #for data and reshape to tabular format
library(tidymodels)
library(embed) #for umap
library(discrim) #for naive bayes model&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use the famous MNIST dataset. This dataset contains handwritten digits from 0 to 9 and is available in the &lt;code&gt;keras&lt;/code&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mnist_data &amp;lt;- dataset_mnist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loaded Tensorflow version 2.2.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- mnist_data$train$x
image_labels &amp;lt;- mnist_data$train$y
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000    28    28&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, this is the image in the second row.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data[2, 1:28, 1:28] %&amp;gt;% 
  t() %&amp;gt;% 
  image(col = gray.colors(256))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, we are going to change the images into a tabular data frame format. We are going to limit the data to the first 10,000 rows or images out of the total 60,000 images.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Reformat to tabular format
image_data &amp;lt;- array_reshape(image_data, dim = c(60000, 28*28))
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000   784&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- image_data[1:10000,]
image_labels &amp;lt;- image_labels[1:10000]

# Reformat to data frame
full_data &amp;lt;- 
  data.frame(image_data) %&amp;gt;% 
  bind_cols(label = image_labels) %&amp;gt;% 
  mutate(label = as.factor(label))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we are going to split the data and create a 3-fold cross-validation set for the sake of simplicity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Split data
set.seed(123)
ind &amp;lt;- initial_split(full_data)
data_train &amp;lt;- training(ind)  
data_test &amp;lt;- testing(ind)

# 3-fold CV
set.seed(123)
data_cv &amp;lt;- vfold_cv(data_train, v = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the recipe specification, we are going to center and scale all the predictors after creating the new variables using &lt;code&gt;step_umap()&lt;/code&gt;. Notice that in &lt;code&gt;step_umap()&lt;/code&gt; we supply the outcome and tune the number of components (&lt;code&gt;num_comp&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = tune()) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a base workflow.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(rec) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use two models as classifiers:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;kNN&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Naive bayes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each classifier, we are going to create a regular grid of parameters to be tuned and then run a regular grid search.&lt;/p&gt;
&lt;p&gt;For kNN.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_mod &amp;lt;- 
  nearest_neighbor(neighbors = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;kknn&amp;quot;)

# knn grid
knn_grid &amp;lt;- grid_regular(neighbors(), num_comp(range = c(2, 8)), levels = 3)

# Tune grid search
knn_tune &amp;lt;- 
  tune_grid(
  wf %&amp;gt;% add_model(knn_mod),
  resamples = data_cv,
  grid = knn_grid, 
  control = control_grid(verbose = F)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For naive bayes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_mod &amp;lt;- 
  naive_Bayes(smoothness = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;naivebayes&amp;quot;)

# nb grid
nb_grid &amp;lt;- grid_regular(smoothness(), num_comp(range = c(2, 10)), levels = 3)

# Tune grid search
nb_tune &amp;lt;- 
  tune_grid(
    wf %&amp;gt;% add_model(nb_mod),
    resamples = data_cv,
    grid = nb_grid, 
    control = control_grid(verbose = F)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the tuning performance of our models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_tune %&amp;gt;% 
  show_best(&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric .estimator  mean     n  std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 roc_auc hand_till  0.961     3 0.000268 Preprocessor3_Mode~
## 2        10        5 roc_auc hand_till  0.961     3 0.000421 Preprocessor2_Mode~
## 3         5        8 roc_auc hand_till  0.959     3 0.000757 Preprocessor3_Mode~
## 4        10        2 roc_auc hand_till  0.959     3 0.000737 Preprocessor1_Mode~
## 5         5        5 roc_auc hand_till  0.958     3 0.000740 Preprocessor2_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_tune %&amp;gt;% 
  show_best(&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric  .estimator  mean     n std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 accuracy multiclass 0.914     3 0.00104 Preprocessor3_Mode~
## 2         5        8 accuracy multiclass 0.913     3 0.00315 Preprocessor3_Mode~
## 3        10        5 accuracy multiclass 0.912     3 0.00114 Preprocessor2_Mode~
## 4         5        5 accuracy multiclass 0.91      3 0.00139 Preprocessor2_Mode~
## 5        10        2 accuracy multiclass 0.910     3 0.00175 Preprocessor1_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_tune %&amp;gt;% 
  show_best(&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric .estimator  mean     n  std_err .config           
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             
## 1        1.5       10 roc_auc hand_till  0.971     3 0.000400 Preprocessor3_Mod~
## 2        1.5        6 roc_auc hand_till  0.971     3 0.000997 Preprocessor2_Mod~
## 3        1         10 roc_auc hand_till  0.971     3 0.000634 Preprocessor3_Mod~
## 4        1          6 roc_auc hand_till  0.970     3 0.00124  Preprocessor2_Mod~
## 5        0.5       10 roc_auc hand_till  0.969     3 0.000808 Preprocessor3_Mod~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_tune %&amp;gt;% 
  show_best(&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric  .estimator  mean     n  std_err .config          
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;            
## 1        1         10 accuracy multiclass 0.913     3 0.000481 Preprocessor3_Mo~
## 2        1.5       10 accuracy multiclass 0.913     3 0.000267 Preprocessor3_Mo~
## 3        0.5       10 accuracy multiclass 0.912     3 0.000462 Preprocessor3_Mo~
## 4        1.5        6 accuracy multiclass 0.911     3 0.00135  Preprocessor2_Mo~
## 5        1          6 accuracy multiclass 0.910     3 0.00157  Preprocessor2_Mo~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we are going to select the best model from the tuned parameters and finalise it using &lt;code&gt;last_fit()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For the kNN model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
knn_best &amp;lt;- knn_tune %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
knn_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = knn_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

knn_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(knn_rec) %&amp;gt;% 
  add_model(knn_mod) %&amp;gt;% 
  finalize_workflow(knn_best) 

# Last fit
knn_lastfit &amp;lt;- 
  knn_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the naive Bayes model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
nb_best &amp;lt;- nb_tune %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
nb_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = nb_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

nb_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(nb_rec) %&amp;gt;% 
  add_model(nb_mod) %&amp;gt;% 
  finalize_workflow(nb_best) 

# Last fit
nb_lastfit &amp;lt;- 
  nb_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the model performance on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_metrics() %&amp;gt;% 
  mutate(model = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  dplyr::bind_rows(nb_lastfit %&amp;gt;% 
                     collect_metrics() %&amp;gt;% 
                     mutate(model = &amp;quot;nb&amp;quot;)) %&amp;gt;% 
  select(-.config)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 4
##   .metric  .estimator .estimate model
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;
## 1 accuracy multiclass     0.938 knn  
## 2 roc_auc  hand_till      0.971 knn  
## 3 accuracy multiclass     0.936 nb   
## 4 roc_auc  hand_till      0.980 nb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the confusion matrices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - kNN&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - naive bayes&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can compare the ROC plots for each class.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  mutate(id = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  bind_rows(
    nb_lastfit %&amp;gt;% 
      collect_predictions() %&amp;gt;% 
      mutate(id = &amp;quot;nb&amp;quot;)
            ) %&amp;gt;% 
  group_by(id) %&amp;gt;% 
  roc_curve(label, .pred_0:.pred_9) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I believe UMAP is quite good and can be used as a preprocessing step in image classification. We were able to get pretty good performance in this post. I believe that with a more rigorous parameter tuning approach, the results would be even better.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Explore data using PCA</title>
      <link>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</link>
      <pubDate>Wed, 09 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;principal-component-analysis-pca&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Principal component analysis (PCA)&lt;/h2&gt;
&lt;p&gt;PCA is a dimension reduction technique. So, if we have a large number of predictors, instead of using all of them for modelling or other analyses, we can compress the information from these variables into a new, smaller set of variables. These new variables are known as components or principal components (PCs). So, we end up with a smaller number of variables that retain most of the information from the original variables.&lt;/p&gt;
&lt;p&gt;PCA is usually used for datasets with a large number of features or predictors, such as genomic data. Additionally, PCA is a good pre-processing option if you have correlated variables or a multicollinearity issue in the model. We can also use PCA to explore the data and gain a better understanding of it.&lt;/p&gt;
&lt;p&gt;Those who want to study the theoretical side of PCA can read further at this &lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34;&gt;link&lt;/a&gt;. In this post, we are going to focus more on the coding part within a machine learning framework (using the &lt;code&gt;tidymodels&lt;/code&gt; package).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;These are the packages that we are going to use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidymodels)
library(tidyverse)
library(mlbench) #data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use the Pima Indians diabetes dataset. The outcome is binary: positive = diabetes and negative = non-diabetic/healthy. All the other variables are numerical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;PimaIndiansDiabetes&amp;quot;)
glimpse(PimaIndiansDiabetes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 768
## Columns: 9
## $ pregnant &amp;lt;dbl&amp;gt; 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
## $ glucose  &amp;lt;dbl&amp;gt; 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
## $ pressure &amp;lt;dbl&amp;gt; 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
## $ triceps  &amp;lt;dbl&amp;gt; 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
## $ insulin  &amp;lt;dbl&amp;gt; 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
## $ mass     &amp;lt;dbl&amp;gt; 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
## $ pedigree &amp;lt;dbl&amp;gt; 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
## $ age      &amp;lt;dbl&amp;gt; 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
## $ diabetes &amp;lt;fct&amp;gt; pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to split the data and extract the training dataset. We will explore only the training set since we are doing this within a machine learning framework.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

ind &amp;lt;- initial_split(PimaIndiansDiabetes)
dat_train &amp;lt;- training(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a recipe and apply the normalization and PCA steps. Then, we prep it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
pca_rec &amp;lt;- 
  recipe(diabetes ~ ., data = dat_train) %&amp;gt;% 
  step_normalize(all_numeric_predictors()) %&amp;gt;% 
  step_pca(all_numeric_predictors())

# Prep
pca_prep &amp;lt;- prep(pca_rec)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, we can extract the PCA results using &lt;code&gt;tidy()&lt;/code&gt;. &lt;code&gt;type = &#34;coef&#34;&lt;/code&gt; indicates that we want the loading values, so the values in the data are the loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;coef&amp;quot;)
pca_tidied&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 64 x 4
##    terms     value component id       
##    &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;    
##  1 pregnant  0.107 PC1       pca_JtuLZ
##  2 glucose   0.357 PC1       pca_JtuLZ
##  3 pressure  0.330 PC1       pca_JtuLZ
##  4 triceps   0.460 PC1       pca_JtuLZ
##  5 insulin   0.466 PC1       pca_JtuLZ
##  6 mass      0.447 PC1       pca_JtuLZ
##  7 pedigree  0.315 PC1       pca_JtuLZ
##  8 age       0.158 PC1       pca_JtuLZ
##  9 pregnant -0.597 PC2       pca_JtuLZ
## 10 glucose  -0.192 PC2       pca_JtuLZ
## # ... with 54 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, basically the loadings indicate how much each variable contributes to each component (PC). A large loading (positive or negative) indicates a strong relationship between the variables and the related components. The sign indicates a negative or positive correlation between the variables and components.&lt;/p&gt;
&lt;p&gt;We can further visualise these loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied %&amp;gt;% 
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = F) +
  facet_wrap(~ component) +
  ylab(&amp;quot;&amp;quot;) +
  xlab(&amp;quot;Loadings&amp;quot;) + 
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Besides the loadings, we can also get the variance information. The variance of each component (PC) measures how much that particular component explains the variability in the data. For example, PC1 explains 26.2% of the variance in the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied2 &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;variance&amp;quot;)

pca_tidied2 %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value, names_prefix = &amp;quot;PC&amp;quot;) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  mutate_if(is.numeric, round, digits = 1) %&amp;gt;% 
  kableExtra::kable(&amp;quot;simple&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;terms&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC3&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC4&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC5&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC6&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC7&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;60.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;81.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;89.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;95.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;100.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
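&lt;p&gt;As a quick sanity check of the percent variance row (a small sketch reusing &lt;code&gt;pca_tidied2&lt;/code&gt; from above): since all eight predictors were normalized, the total variance is 8, and the percentage for each PC is simply its variance divided by that total.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Percent variance recomputed from the variance values (e.g. PC1: 2.1/8 = ~26.2%)
pca_tidied2 %&amp;gt;% 
  filter(terms == &amp;quot;variance&amp;quot;) %&amp;gt;% 
  mutate(percent = value / sum(value) * 100)&lt;/code&gt;&lt;/pre&gt;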
&lt;p&gt;Next, we can visualise PC1 and PC2 in a scatter plot and see how each variable influences both PCs. First, we need to extract the loadings and convert them into a wide format for the arrow coordinates in the scatter plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied3 &amp;lt;- 
  pca_tidied %&amp;gt;% 
  filter(component %in% c(&amp;quot;PC1&amp;quot;, &amp;quot;PC2&amp;quot;)) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value)
pca_tidied3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 8 x 3
##   terms      PC1    PC2
##   &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
## 1 pregnant 0.107 -0.597
## 2 glucose  0.357 -0.192
## 3 pressure 0.330 -0.234
## 4 triceps  0.460  0.279
## 5 insulin  0.466  0.200
## 6 mass     0.447  0.121
## 7 pedigree 0.315  0.110
## 8 age      0.158 -0.638&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we can make a scatter plot using the training set data (&lt;code&gt;juice(pca_prep)&lt;/code&gt;) and the loadings data (&lt;code&gt;pca_tidied3&lt;/code&gt;). Also, we are going to add the percentage of variance for PC1 and PC2 to the axis labels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;juice(pca_prep) %&amp;gt;% 
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = diabetes, shape = diabetes), size = 2, alpha = 0.6) +
  geom_segment(data = pca_tidied3, 
               aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), 
               arrow = arrow(length = unit(1/2, &amp;quot;picas&amp;quot;)),
               color = &amp;quot;blue&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, 
           x = pca_tidied3$PC1 * 5.2, 
           y = pca_tidied3$PC2 * 5.2, 
           label = pca_tidied3$terms) +
  theme_minimal() +
  xlab(&amp;quot;PC1 (26.2%)&amp;quot;) +
  ylab(&amp;quot;PC2 (21.5%)&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from this scatter plot we learn that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;(triceps, insulin, pedigree and mass), (glucose and pressure) and (pregnant and age) are correlated as their lines are close to each other&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As PC1 and PC2 increase, triceps, insulin, pedigree and mass also increase&lt;/li&gt;
&lt;li&gt;As PC2 decreases, pregnant and age increase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34; class=&#34;uri&#34;&gt;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://juliasilge.com/blog/cocktail-recipes-umap/&#34; class=&#34;uri&#34;&gt;https://juliasilge.com/blog/cocktail-recipes-umap/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Fitted vs predict in R</title>
      <link>https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/</link>
      <pubDate>Sun, 09 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;There are two functions in R that seem almost similar yet are different:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fitted()&lt;/code&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;predict()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;First, let’s prepare some data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(dplyr)

# Data
set.seed(123)
dat &amp;lt;- 
  iris %&amp;gt;% 
  mutate(twoGp = sample(c(&amp;quot;Gp1&amp;quot;, &amp;quot;Gp2&amp;quot;), 150, replace = T), #create two group factor
         twoGp = as.factor(twoGp))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   twoGp   
##  setosa    :50   Gp1:76  
##  versicolor:50   Gp2:74  
##  virginica :50           
##                          
##                          
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fitted()&lt;/code&gt; is used to get the predicted values, or &lt;span class=&#34;math inline&#34;&gt;\(\hat{y}\)&lt;/span&gt;, based on the data used to fit the model. Let’s see this with a logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logR &amp;lt;- glm(twoGp ~ ., family = binomial(), data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the fitted values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fitted(logR) %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For &lt;code&gt;predict()&lt;/code&gt;, we have three types:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Response&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Link - default&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Terms&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If no new data is supplied to &lt;code&gt;predict()&lt;/code&gt;, it will use the original data used to fit the model.&lt;/p&gt;
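&lt;p&gt;One practical difference worth keeping in mind (a small illustration; here we simply reuse the first six rows of the original data as if they were new): &lt;code&gt;predict()&lt;/code&gt; accepts a &lt;code&gt;newdata&lt;/code&gt; argument, whereas &lt;code&gt;fitted()&lt;/code&gt; always returns values for the data used to fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Predicted probabilities for &amp;quot;new&amp;quot; observations supplied via newdata
predict(logR, newdata = head(dat), type = &amp;quot;response&amp;quot;)&lt;/code&gt;&lt;/pre&gt;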
&lt;p&gt;&lt;strong&gt;1. Response&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The type &lt;code&gt;&#34;response&#34;&lt;/code&gt; gives values identical to &lt;code&gt;fitted()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;response&amp;quot;) %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can confirm this as below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all.equal(fitted(logR), predict(logR, type = &amp;quot;response&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thus, &lt;code&gt;fitted()&lt;/code&gt; and &lt;code&gt;predict(type = &#34;response&#34;)&lt;/code&gt; give us predicted probabilities on the scale of the response variable. The first value can be interpreted as follows: the probability of Gp2 (Gp1 is the reference group) for the first observation is 0.41.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Link&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt; gives us predictions on the scale of the linear predictor, that is, the logit or log-odds scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;link&amp;quot;) %&amp;gt;% head() #similar to predict(logR)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, the log odds prediction of Gp2 for the first observation is -0.37. Hence, we can get the same values if we apply a &lt;a href=&#34;https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function&#34;&gt;link function&lt;/a&gt; to the fitted values.&lt;/p&gt;
&lt;p&gt;The link function for logistic regression is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
ln(\frac{\mu}{1 - \mu})
\]&lt;/span&gt;
So, we apply this link function to the fitted values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logOddsProb &amp;lt;- log(fitted(logR) / (1 - fitted(logR))) 
head(logOddsProb)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further confirm this as we did previously.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all.equal(logOddsProb, predict(logR, type = &amp;quot;link&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we can conclude that &lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt; gives us the fitted values after the link function has been applied, i.e. the log odds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Terms&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we have &lt;code&gt;predict(type = &#34;terms&#34;)&lt;/code&gt;. This type gives us a matrix with the contribution of each variable to the linear predictor for each observation, on the scale of the linear predictor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;terms&amp;quot;) %&amp;gt;% head() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1   0.07988782  0.28070682    0.4819893  -0.2736677 -0.9178543
## 2   0.10138230 -0.03635661    0.4819893  -0.2736677 -0.9178543
## 3   0.12287679  0.09046877    0.5024299  -0.2736677 -0.9178543
## 4   0.13362403  0.02705608    0.4615487  -0.2736677 -0.9178543
## 5   0.09063506  0.34411951    0.4819893  -0.2736677 -0.9178543
## 6   0.04764610  0.53435757    0.4206675  -0.2188976 -0.9178543&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, if we add up the values of the first observation and the constant (or intercept), we will get the same value as the log odds prediction (&lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predTerm &amp;lt;- predict(logR, type = &amp;quot;terms&amp;quot;)
sum(predTerm[1, ], attr(predTerm, &amp;quot;constant&amp;quot;)) #add up the first observation and the constant&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logOddsProb[1]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1 
## -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These values are also the same as what we get if we calculate them manually using the coefficients from &lt;code&gt;summary()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
LogOdds(Gp2) = \beta_0 + \beta_1(Sepal.Length) + \beta_2(Sepal.Width) + \beta_3(Petal.Length) + \beta_4(Petal.Width) + \beta_5(Species)
\]&lt;/span&gt;
So, this is the value we get for the first observation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coef(logR)[1] + coef(logR)[2]*dat$Sepal.Length[1] + coef(logR)[3]*dat$Sepal.Width[1] + coef(logR)[4]*dat$Petal.Length[1] + coef(logR)[5]*dat$Petal.Width[1] + 0 #setosa species&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept) 
##   -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, in &lt;code&gt;predict(type = &#34;terms&#34;)&lt;/code&gt; the values are &lt;a href=&#34;https://www.statology.org/center-data-in-r/&#34;&gt;centered&lt;/a&gt;, thus we have different values for the constant/intercept and for &lt;span class=&#34;math inline&#34;&gt;\(\beta_1(Sepal.Length)\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\beta_2(Sepal.Width)\)&lt;/span&gt; and so on. For example, the intercept values from the two approaches are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Intercept/constant from predict(type = &amp;quot;terms&amp;quot;)
attr(predTerm, &amp;quot;constant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.02537694&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Intercept/constant from summary()
coef(logR)[1]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept) 
##   -1.814251&lt;/code&gt;&lt;/pre&gt;
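&lt;p&gt;For a single numeric predictor, this centering means the &lt;code&gt;type = &#34;terms&#34;&lt;/code&gt; value is just the coefficient multiplied by that predictor’s deviation from its mean. A quick check (a small sketch; it should reproduce the Sepal.Length entry of the first row in the matrix above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# coef * (x - mean(x)) gives the centered &amp;quot;terms&amp;quot; contribution
coef(logR)[&amp;quot;Sepal.Length&amp;quot;] * (dat$Sepal.Length[1] - mean(dat$Sepal.Length))
predTerm[1, &amp;quot;Sepal.Length&amp;quot;]&lt;/code&gt;&lt;/pre&gt;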
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/a/12201502/11215767&#34; class=&#34;uri&#34;&gt;https://stackoverflow.com/a/12201502/11215767&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/a/47854088/11215767&#34; class=&#34;uri&#34;&gt;https://stackoverflow.com/a/47854088/11215767&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>A short note on variable selection</title>
      <link>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</link>
      <pubDate>Sat, 08 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;variable-selection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variable selection&lt;/h2&gt;
&lt;p&gt;Variable or feature selection is one of the important steps in both machine learning and statistical analysis. This post is geared more towards the machine learning side. Certain machine learning models, such as support vector machines (SVM) and neural networks, do not handle irrelevant predictors very well, whereas models such as linear and logistic regression do not handle correlated predictors very well. Thus, careful selection of the variables helps mitigate these issues and further improves predictive performance.&lt;/p&gt;
&lt;p&gt;There are three types of approaches in variable selection:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Intrinsic (or built-in feature selection)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Intrinsic feature selection is feature selection embedded in the algorithm itself. Some examples include:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Tree- and rule-based models - decision trees, random forests, etc.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Multivariate adaptive regression spline (MARS)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Regularization method such as least absolute shrinkage and selection operator (LASSO or L1)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The advantages of this approach are that it is fast and computationally efficient. However, the variables selected by this approach are model dependent.&lt;/p&gt;
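&lt;p&gt;As a small illustration of the regularization route mentioned above (a sketch only; &lt;code&gt;mtcars&lt;/code&gt; is used purely as an example), LASSO shrinks some coefficients exactly to zero, and the predictors that survive are the selected ones.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(glmnet)

x &amp;lt;- as.matrix(mtcars[, -1]) #predictors
y &amp;lt;- mtcars$mpg              #outcome

set.seed(123)
cv_fit &amp;lt;- cv.glmnet(x, y, alpha = 1) #alpha = 1 is the LASSO penalty
coef(cv_fit, s = &amp;quot;lambda.1se&amp;quot;)       #non-zero rows are the selected predictors&lt;/code&gt;&lt;/pre&gt;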
&lt;p&gt;&lt;strong&gt;2. Filter&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the filter approach, we determine variable importance outside the model, usually for each variable separately (though not necessarily). An example of this approach is a univariate filter: if the outcome has two categories, we can use a t-test to assess the numerical predictors, and variables with a significant p-value or a large t-statistic are chosen.&lt;/p&gt;
&lt;p&gt;This approach is very simple and fast. However, the subset of variables selected using a filtering criterion such as statistical significance may not give the best predictive performance of the model. Additionally, this approach is prone to over-selection of predictors.&lt;/p&gt;
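&lt;p&gt;As a small illustration of a univariate filter (a sketch only, using the built-in &lt;code&gt;iris&lt;/code&gt; data restricted to two species so the outcome is binary):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Keep numeric predictors whose t-test p-value is below 0.05
dat2 &amp;lt;- droplevels(subset(iris, Species != &amp;quot;setosa&amp;quot;))

p_vals &amp;lt;- sapply(dat2[, 1:4], function(x) t.test(x ~ dat2$Species)$p.value)
names(p_vals)[p_vals &amp;lt; 0.05] #selected predictors&lt;/code&gt;&lt;/pre&gt;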
&lt;p&gt;&lt;strong&gt;3. Wrapper&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are two types of wrapper approaches:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A greedy approach or algorithm directs its search path towards whatever gives the best immediate benefit at each step. For this reason, this approach cannot escape local minima. In Figure 1 below, we can think of the local minimum as representing a locally best set of predictors and the global minimum as the globally best set of predictors.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;img.png&#34; alt=&#34;Local minima and global minima&#34; width=&#34;576&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Local minima and global minima
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;An example of this approach is recursive feature elimination, or backward selection. The main weakness of the greedy approach is that the subset of features it identifies may not have the best predictive performance.&lt;/p&gt;
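&lt;p&gt;A minimal recursive feature elimination sketch with the &lt;code&gt;caret&lt;/code&gt; package could look like this (an assumed setup, not taken from the original post; &lt;code&gt;rfFuncs&lt;/code&gt; uses a random forest to rank the predictors):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)

set.seed(123)
ctrl &amp;lt;- rfeControl(functions = rfFuncs, method = &amp;quot;cv&amp;quot;, number = 5)
rfe_fit &amp;lt;- rfe(x = mtcars[, -1], y = mtcars$mpg,
               sizes = c(2, 4, 6), rfeControl = ctrl)
predictors(rfe_fit) #the selected subset of features&lt;/code&gt;&lt;/pre&gt;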
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Non-greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Examples of this approach are simulated annealing and the &lt;a href=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/&#34;&gt;genetic algorithm&lt;/a&gt;. Both of these algorithms incorporate randomness in their search; hence, they are classified as non-greedy wrappers. Due to this randomness, they can escape local minima (see Figure 1 above).&lt;/p&gt;
&lt;p&gt;The wrapper type has the best chance of finding the globally best predictors. However, this approach is computationally expensive. Not to mention, it has a tendency to overfit (some packages, like &lt;code&gt;caret&lt;/code&gt;, use resampling to mitigate this issue).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;suggested-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Suggested approach&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://bookdown.org/max/FES/&#34;&gt;Kuhn &amp;amp; Johnson (2019)&lt;/a&gt; suggested this approach:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Start with an intrinsic approach&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, do a wrapper approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If a linear intrinsic approach has a better performance - proceed to wrapper method with a linear model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If non-linear intrinsic approach has a better performance - proceed to wrapper method with a non-linear model&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If several approaches select a large number of predictors, it may not be feasible to reduce the number of features&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&#34; class=&#34;uri&#34;&gt;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://topepo.github.io/caret/feature-selection-overview.html&#34; class=&#34;uri&#34;&gt;http://topepo.github.io/caret/feature-selection-overview.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Stepwise selection after multiple imputation</title>
      <link>https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/</link>
      <pubDate>Tue, 04 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;some-note&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some note&lt;/h2&gt;
&lt;p&gt;I have written two posts previously about multiple imputation using the &lt;code&gt;mice&lt;/code&gt; package:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/&#34;&gt;A short note on multiple imputation&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/&#34;&gt;Variable selection for imputation model in {mice}&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post is probably my last post about multiple imputation using the &lt;code&gt;mice&lt;/code&gt; package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;stepwise-selection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Stepwise selection&lt;/h2&gt;
&lt;p&gt;The general steps in the &lt;code&gt;mice&lt;/code&gt; package are (a short sketch follows the list):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;mice()&lt;/code&gt; - impute the NAs&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with()&lt;/code&gt; - run the analysis (lm, glm, etc)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pool()&lt;/code&gt; - pool the results&lt;/li&gt;
&lt;/ol&gt;
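&lt;p&gt;Schematically, the three steps look like this (a minimal sketch using the toy &lt;code&gt;nhanes&lt;/code&gt; dataset shipped with &lt;code&gt;mice&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)

imp &amp;lt;- mice(nhanes, m = 5, printFlag = FALSE, seed = 1) #1. impute the NAs
fit &amp;lt;- with(imp, lm(chl ~ age + bmi))                   #2. run the analysis
pool(fit)                                               #3. pool the results&lt;/code&gt;&lt;/pre&gt;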
&lt;p&gt;For backward and forward selection, we can do it manually after pooling the results in step 3, but we cannot do this for stepwise selection.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://books.google.com.my/books/about/Development_Implementation_and_Evaluatio.html?id=-Y0TywAACAAJ&amp;amp;redir_esc=y&#34;&gt;Brand (1999)&lt;/a&gt; proposed this solution:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Perform stepwise selection separately on each imputed dataset&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Fit a preliminary model that contains all variables that present in at least half of the models in the step 1&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Apply backward elimination to the variables in the preliminary model (variables with p &amp;gt; 0.05 are removed one at a time)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Repeat step 3 until all variables have p values &amp;lt; 0.05&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So, we are going to apply this solution and use the multivariate Wald test (&lt;code&gt;D1()&lt;/code&gt; in the &lt;code&gt;mice&lt;/code&gt; package) for model comparison instead of the pooled likelihood ratio p-value.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create missing data. We are going to use the famous &lt;code&gt;mtcars&lt;/code&gt; dataset, which is already available in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat &amp;lt;- 
  mtcars %&amp;gt;% 
  mutate(across(c(vs, am), as.factor)) %&amp;gt;% 
  select(-mpg) %&amp;gt;% 
  missForest::prodNA(0.1) %&amp;gt;% 
  bind_cols(mpg = mtcars$mpg)
summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       cyl             disp             hp             drat      
##  Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:4.000   1st Qu.:120.7   1st Qu.:103.0   1st Qu.:3.150  
##  Median :6.000   Median :225.0   Median :123.0   Median :3.715  
##  Mean   :6.148   Mean   :232.8   Mean   :147.4   Mean   :3.642  
##  3rd Qu.:8.000   3rd Qu.:334.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
##  NA&amp;#39;s   :5       NA&amp;#39;s   :1       NA&amp;#39;s   :4       NA&amp;#39;s   :2      
##        wt             qsec          vs        am          gear     
##  Min.   :1.513   Min.   :14.50   0   :17   0   :18   Min.   :3.00  
##  1st Qu.:2.429   1st Qu.:16.88   1   :11   1   :10   1st Qu.:3.00  
##  Median :3.203   Median :17.51   NA&amp;#39;s: 4   NA&amp;#39;s: 4   Median :4.00  
##  Mean   :3.112   Mean   :17.75                       Mean   :3.71  
##  3rd Qu.:3.533   3rd Qu.:18.83                       3rd Qu.:4.00  
##  Max.   :5.424   Max.   :22.90                       Max.   :5.00  
##  NA&amp;#39;s   :4       NA&amp;#39;s   :2                           NA&amp;#39;s   :1     
##       carb            mpg       
##  Min.   :1.000   Min.   :10.40  
##  1st Qu.:2.000   1st Qu.:15.43  
##  Median :2.000   Median :19.20  
##  Mean   :2.667   Mean   :20.09  
##  3rd Qu.:4.000   3rd Qu.:22.80  
##  Max.   :6.000   Max.   :33.90  
##  NA&amp;#39;s   :5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;mice()&lt;/code&gt; on missing data with 10 imputed datasets (&lt;code&gt;m = 10&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;datImp &amp;lt;- mice(dat, m = 10, printFlag = F, seed = 123)
datImp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  10 
## Imputation methods:
##      cyl     disp       hp     drat       wt     qsec       vs       am 
##    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot; &amp;quot;logreg&amp;quot; &amp;quot;logreg&amp;quot; 
##     gear     carb      mpg 
##    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;       &amp;quot;&amp;quot; 
## PredictorMatrix:
##      cyl disp hp drat wt qsec vs am gear carb mpg
## cyl    0    1  1    1  1    1  1  1    1    1   1
## disp   1    0  1    1  1    1  1  1    1    1   1
## hp     1    1  0    1  1    1  1  1    1    1   1
## drat   1    1  1    0  1    1  1  1    1    1   1
## wt     1    1  1    1  0    1  1  1    1    1   1
## qsec   1    1  1    1  1    0  1  1    1    1   1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run stepwise selection on each imputed dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sc &amp;lt;- list(upper = ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, 
           lower = ~ 1)
exp &amp;lt;- expression(f1 &amp;lt;- lm(mpg ~ 1),
                  f2 &amp;lt;- step(f1, scope = sc, trace = 0))
fit &amp;lt;- with(datImp, exp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we count how many times each variable was selected across the models by the stepwise selection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit$analyses %&amp;gt;% 
  map(formula) %&amp;gt;% #get the formula
  map(terms) %&amp;gt;% #get the terms
  map(labels) %&amp;gt;% #get the name of variables
  unlist() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## .
##   am carb  cyl disp drat   hp qsec   vs   wt 
##    7    5    3    2    4    5    3    4    7&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to select:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;am&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;carb&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;hp&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;wt&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These variables appear in at least half of the models. We have 10 imputed datasets and, thus, 10 models. Next, we fit a preliminary model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full1 &amp;lt;- with(datImp, lm(mpg ~ am + carb + hp + wt))
pool(fit_full1) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.33683070 3.30280913 10.093478 15.81838 2.688191e-08
## 2         am1  3.06689135 1.94363342  1.577917 13.06329 1.384846e-01
## 3        carb -0.64791214 0.65564816 -0.988201 11.64959 3.431353e-01
## 4          hp -0.03414274 0.01159828 -2.943777 20.47239 7.895170e-03
## 5          wt -2.39586280 1.22218829 -1.960306 13.54830 7.085513e-02&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We exclude the carb variable in the next model as it has the largest non-significant p-value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full2 &amp;lt;- with(datImp, lm(mpg ~ am + hp + wt))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we compare the two models using the multivariate Wald test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;D1(fit_full1, fit_full2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    test statistic df1     df2 dfcom   p.value       riv
##  1 ~~ 2 0.9765411   1 9.21378    27 0.3482934 0.6935655&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The p-value is &amp;gt; 0.05, so we opt for the simpler model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pool(fit_full2) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.75666324 3.30083213 10.226713 16.87762 1.195383e-08
## 2         am1  2.50264907 1.79966590  1.390619 15.31418 1.842201e-01
## 3          hp -0.03950216 0.01162689 -3.397482 17.65719 3.280147e-03
## 4          wt -2.75412354 1.15870950 -2.376889 15.03403 3.116779e-02&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the am variable has the largest non-significant p-value. So, we exclude this variable in the next model and compare the two latest models using the multivariate Wald test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full3 &amp;lt;- with(datImp, lm(mpg ~ hp + wt))
D1(fit_full2, fit_full3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    test statistic df1      df2 dfcom   p.value       riv
##  1 ~~ 2   1.93382   1 12.90982    28 0.1878483 0.4392918&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, we opt for the simpler model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pool(fit_full3) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 37.50546490 1.91102857 19.625800 23.65472 4.440892e-16
## 2          hp -0.03263534 0.01042989 -3.129021 21.20234 5.031751e-03
## 3          wt -3.92792051 0.75157304 -5.226266 19.78033 4.238231e-05&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is no non-significant variable in the model anymore. Thus, this is our final model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gtsummary::tbl_regression(fit_full3)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;ybehlmrayy&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;hp&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.03&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.05, -0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.005&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;wt&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-3.9&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-5.5, -2.4&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;4&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/sec-stepwise.html&#34; class=&#34;uri&#34;&gt;https://stefvanbuuren.name/fimd/sec-stepwise.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Variable selection using genetic algorithm</title>
      <link>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</link>
      <pubDate>Sun, 02 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;The genetic algorithm is inspired by natural selection, the process by which the fittest individuals are selected to reproduce. This algorithm has been used for optimization and search problems, and it can also be used for variable selection.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/ga_fig.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Genetic algorithm - gene, chromosome, population, crossover (upper right), offspring (lower right)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;First, let’s go into a few terms related to genetic algorithm theory.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Population - a set of chromosomes&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Chromosome - a subset of variables (also known as an individual in some references)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Gene - a variable or feature&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Fitness function - gives a fitness score to each chromosome and guides the selection&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Selection - a process to select two chromosomes, known as the parents&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Crossover - a process in which the parents generate offspring (illustrated in the picture above, on the upper right side)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Mutation - the process by which a gene in the chromosome is randomly flipped to 1 or 0&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/mutation.png&#34; width=&#34;250&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Mutation&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So, the basic flow of genetic algorithm:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;The algorithm starts with an initial population, often randomly generated&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create successive generations by selecting a portion of the current population (the selection is guided by the fitness function) - this includes selection -&amp;gt; crossover -&amp;gt; mutation&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The algorithm terminates if certain predetermined criteria are met such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A solution satisfies the minimum criteria&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;A fixed number of generations is reached&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Successive iterations no longer produce better results&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;There is a &lt;code&gt;GA&lt;/code&gt; package in R with which we can implement the genetic algorithm a bit more manually and specify our own fitness function. However, I think it is easier to use the genetic algorithm implemented in the &lt;code&gt;caret&lt;/code&gt; package for variable selection.&lt;/p&gt;
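&lt;p&gt;For illustration only, here is a minimal sketch of that more manual route using the &lt;code&gt;GA&lt;/code&gt; package. The fitness function (negative AIC of a linear model for mpg) and the settings below are my own choices for this sketch, not part of the workflow in the rest of this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(GA)

# Each chromosome is a 0/1 vector indicating which columns of x are used;
# the fitness is the negative AIC of a linear model for mpg (ga() maximises)
x &amp;lt;- mtcars[, -1]
y &amp;lt;- mtcars$mpg

fitness_fn &amp;lt;- function(chromosome) {
  if (sum(chromosome) == 0) return(-1e10) # heavy penalty for an empty variable set
  dat_sub &amp;lt;- data.frame(mpg = y, x[, chromosome == 1, drop = FALSE])
  -AIC(lm(mpg ~ ., data = dat_sub))
}

set.seed(123)
ga_res &amp;lt;- ga(type = &amp;quot;binary&amp;quot;, fitness = fitness_fn,
             nBits = ncol(x), popSize = 50, maxiter = 20, run = 10)

# Variables selected in the best chromosome
colnames(x)[ga_res@solution[1, ] == 1]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That said, for the rest of this post we stick with the &lt;code&gt;caret&lt;/code&gt; implementation.&lt;/p&gt;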
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)
library(tidyverse)
library(rsample)
library(recipes)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat &amp;lt;- 
  mtcars %&amp;gt;% 
  mutate(across(c(vs, am), as.factor),
         am = fct_recode(am, auto = &amp;quot;0&amp;quot;, man = &amp;quot;1&amp;quot;))
str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels &amp;quot;0&amp;quot;,&amp;quot;1&amp;quot;: 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels &amp;quot;auto&amp;quot;,&amp;quot;man&amp;quot;: 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this, we are going to use random forest (&lt;code&gt;rfGA&lt;/code&gt;). Other options are bagged trees (&lt;code&gt;treebagGA&lt;/code&gt;) and &lt;code&gt;caretGA&lt;/code&gt;; with &lt;code&gt;caretGA&lt;/code&gt; we can use any other method available in &lt;code&gt;caret&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-am), 
              y = dat$am,
              iters = 5,
              gafsControl = ga_ctrl)
rf_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     qsec (60%), wt (60%), disp (40%), gear (40%), vs (40%)
##   * on average, 3.2 variables were selected (min = 1, max = 7)
## 
## In the final search using the entire training set:
##    * 7 features selected at iteration 3 including:
##      cyl, hp, drat, qsec, vs ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9429      0.8831&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal features/variables:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;  &amp;quot;hp&amp;quot;   &amp;quot;drat&amp;quot; &amp;quot;qsec&amp;quot; &amp;quot;vs&amp;quot;   &amp;quot;gear&amp;quot; &amp;quot;carb&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the time taken for the random forest approach.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$times&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $everything
##    user  system elapsed 
##   51.22    1.25   52.92&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default the algorithm will find a set of variables that minimises RMSE for a numerical outcome and maximises accuracy for a categorical outcome. Also, genetic algorithms tend to overfit, which is why the implementation in &lt;code&gt;caret&lt;/code&gt; reports both internal and external performance. With the 5-fold cross-validation used here, five genetic algorithms are run separately: in each, four folds are used for the genetic algorithm search and the held-out fold is used for external performance evaluation.&lt;/p&gt;
&lt;p&gt;Let’s try variable selection using a linear regression model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
lm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run lm
set.seed(123)
lm_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-mpg), 
              y = dat$mpg,
              iters = 5,
              gafsControl = lm_ga_ctrl,
              # below is the option for `train`
              method = &amp;quot;lm&amp;quot;,
              trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = F))
lm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared, MAE
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (100%), hp (80%), carb (60%), cyl (60%), am (40%)
##   * on average, 4.4 variables were selected (min = 4, max = 5)
## 
## In the final search using the entire training set:
##    * 5 features selected at iteration 5 including:
##      cyl, disp, hp, wt, qsec  
##    * external performance at this iteration is
## 
##        RMSE    Rsquared         MAE 
##      3.3434      0.7624      2.6037&lt;/code&gt;&lt;/pre&gt;
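&lt;p&gt;As a small follow-up sketch (not part of the original workflow), we could refit an ordinary linear model on the variables chosen by the genetic algorithm, using the &lt;code&gt;optVariables&lt;/code&gt; element of the &lt;code&gt;gafs&lt;/code&gt; object:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Refit a plain lm on the selected variables for closer inspection
sel_vars &amp;lt;- lm_ga$optVariables
fit_sel &amp;lt;- lm(reformulate(sel_vars, response = &amp;quot;mpg&amp;quot;), data = dat)
summary(fit_sel)&lt;/code&gt;&lt;/pre&gt;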
&lt;p&gt;Now, let’s see how to integrate this into a machine learning workflow using a recipe from the &lt;code&gt;recipes&lt;/code&gt; package and data splitting from &lt;code&gt;rsample&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;First, we split the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat_split &amp;lt;-initial_split(dat)
dat_train &amp;lt;- training(dat_split)
dat_test &amp;lt;- testing(dat_split)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We specify two recipes: one for the numerical outcome and one for the categorical outcome.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Numerical
rec_num &amp;lt;- 
  recipe(mpg ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())

# Categorical
rec_cat &amp;lt;- 
  recipe(am ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We run random forest with the numerical outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
rf_ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga2 &amp;lt;- 
  gafs(rec_num,
       data = dat_train,
       iters = 5, 
       gafsControl = rf_ga_ctrl) 
rf_ga2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     cyl (80%), disp (80%), hp (80%), wt (80%), carb (60%)
##   * on average, 4.8 variables were selected (min = 2, max = 9)
## 
## In the final search using the entire training set:
##    * 6 features selected at iteration 5 including:
##      cyl, disp, hp, wt, gear ... 
##    * external performance at this iteration is
## 
##       RMSE   Rsquared        MAE 
##      2.830      0.928      2.408&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga2$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;wt&amp;quot;    &amp;quot;gear&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s try running an SVM with the categorical outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
svm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run SVM
set.seed(123)
svm_ga &amp;lt;- 
  gafs(rec_cat,
       data = dat_train,
       iters = 5, 
       gafsControl = svm_ga_ctrl,
       # below is the options to `train` for caretGA
       method = &amp;quot;svmRadial&amp;quot;, #SVM with Radial Basis Function Kernel
       trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = T))
svm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (80%), qsec (60%), vs_X1 (60%), carb (40%), disp (40%)
##   * on average, 4 variables were selected (min = 3, max = 6)
## 
## In the final search using the entire training set:
##    * 9 features selected at iteration 2 including:
##      mpg, cyl, disp, hp, drat ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9200      0.8571&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svm_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;mpg&amp;quot;   &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;drat&amp;quot;  &amp;quot;wt&amp;quot;    &amp;quot;qsec&amp;quot;  &amp;quot;carb&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Although the genetic algorithm seems quite good for variable selection, its main limitation, I would say, is the computational time. However, if we have a lot of variables or features to reduce, using the genetic algorithm despite the long computational time seems beneficial to me.&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&#34; class=&#34;uri&#34;&gt;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>My first interactive map with {leaflet}</title>
      <link>https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/</link>
      <pubDate>Sun, 28 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/pymjs/pym.v1.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/widgetframe-binding/widgetframe.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have tried creating a map with ggplot2 &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;previously&lt;/a&gt;. In this post, I will try to create an interactive map using &lt;code&gt;leaflet&lt;/code&gt; package in R.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(tidygeocoder)
library(leaflet)
library(htmltools)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, I’m going to use data on clinic locations in Malaysia. I have already uploaded this data to my &lt;a href=&#34;https://github.com/tengku-hanis/clinic-data&#34;&gt;GitHub repo&lt;/a&gt;. I will skip the explanation of the pre-processing part, as it is the same pre-processing as in my &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;previous post&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Read the data
clinic1m &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinic1m.csv&amp;quot;)
clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;details&gt;
&lt;summary&gt;
Show code for pre-processing
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the missing coordinate based on postal codes
clinic1m2 &amp;lt;- 
  clinic1m %&amp;gt;%
  mutate(country = &amp;quot;malaysia&amp;quot;) %&amp;gt;% 
  select(name, postcode, country) %&amp;gt;% 
  mutate(postcode = ifelse(nchar(postcode) == 4, paste0(0, postcode), postcode)) %&amp;gt;%
  geocode(postalcode = postcode, country = country, method = &amp;quot;osm&amp;quot;)

# Add coordinate from external sources for the still missing coordinates
add_coord &amp;lt;- 
  read.table(header = T, text = &amp;quot;
postal_code    latitude   longitude
16070            6.0334    102.3499
26060            3.6228    102.3926
90700            5.8456    118.0571
26060            3.6228    102.3926&amp;quot;)

# Drop clinics with the still missing coordinate
clinic1m2 &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(lat = ifelse(postcode %in% add_coord$postal_code, add_coord$latitude, lat), 
         long = ifelse(postcode %in% add_coord$postal_code, add_coord$longitude, long)) %&amp;gt;% 
  drop_na() #drop 2 clinic1m

# Bind the 2 data
all_clinic &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(Type = &amp;quot;1Malaysia&amp;quot;) %&amp;gt;% 
  select(name, Type, lat, long) %&amp;gt;% 
  bind_rows(clinicDesa %&amp;gt;% 
              mutate(Type = &amp;quot;Desa&amp;quot;, 
                     lat = latitude, 
                     long = longitude) %&amp;gt;% 
              select(name, Type, lat, long)) %&amp;gt;% 
  mutate(name = str_to_title(name))&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;First, we are going to plot the coordinates to see if there is anything strange.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(all_clinic, aes(long, lat, color = Type)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, we are going to remove the two isolated points as seen from the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic2 &amp;lt;- all_clinic %&amp;gt;% filter(long &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have our data ready, we can supply it to &lt;code&gt;leaflet&lt;/code&gt;. We can choose the type of base map with &lt;code&gt;addProviderTiles()&lt;/code&gt;; some providers need an API key, but the one we choose here does not. We supply the longitude and latitude of our data to &lt;code&gt;addCircleMarkers()&lt;/code&gt;, and use &lt;code&gt;clusterOptions&lt;/code&gt; to cluster the markers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaflet(all_clinic2) %&amp;gt;% 
  addProviderTiles(providers$Stamen.Watercolor) %&amp;gt;%
  addProviderTiles(providers$Stamen.TerrainLabels) %&amp;gt;%
  addCircleMarkers(~long, ~lat, 
                   clusterOptions = markerClusterOptions())&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;index.en_files/figure-html//widgets/widget_unnamed-chunk-7.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Next, we can add a label.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;labels &amp;lt;- 
  sprintf(&amp;quot;&amp;lt;strong&amp;gt;%s&amp;lt;/strong&amp;gt;&amp;quot;, all_clinic2$name) %&amp;gt;% #use the filtered data so the labels match the markers
  lapply(htmltools::HTML)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we can add a mini map to our map. Here, I change the base map to a more appropriate one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaflet(all_clinic2) %&amp;gt;% 
  addProviderTiles(providers$OpenStreetMap) %&amp;gt;%
  addCircleMarkers(~long, ~lat, popup = ~labels, # popup add the label
                   clusterOptions = markerClusterOptions()) %&amp;gt;% 
    # add a mini map
  addMiniMap(tiles = providers$OpenStreetMap, zoomLevelOffset = -3)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-2&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;index.en_files/figure-html//widgets/widget_unnamed-chunk-10.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Notice that the coordinates look more accurate as compared to the map I created with &lt;code&gt;ggplot2&lt;/code&gt; previously.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://lauriebaker.rbind.io/post/where_work/&#34; class=&#34;uri&#34;&gt;https://lauriebaker.rbind.io/post/where_work/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://laurielbaker.github.io/DSCA_leaflet_mapping_in_r/&#34; class=&#34;uri&#34;&gt;https://laurielbaker.github.io/DSCA_leaflet_mapping_in_r/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Variable selection for imputation model in {mice}</title>
      <link>https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/</link>
      <pubDate>Mon, 22 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;some-note&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some note&lt;/h2&gt;
&lt;p&gt;I have written a &lt;a href=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/&#34;&gt;short post&lt;/a&gt; about missing data and multiple imputation in &lt;code&gt;mice&lt;/code&gt; package previously. This post will add to that previous post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;imputation-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Imputation model&lt;/h2&gt;
&lt;p&gt;The imputation model is the model that we use for our imputation approach. A related term is the complete-data model, which is the model that we want to fit after we impute the missing values (i.e., the complete-data model is the final model).&lt;/p&gt;
&lt;p&gt;Generally, we need to include as many relevant variables as possible in the imputation model. However, this general advice may not be very efficient, as we may run into multicollinearity and computational issues if we include too many predictors. As a rule of thumb, the number of included variables should be no more than 15-20. &lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34;&gt;van Buuren &lt;em&gt;et al.&lt;/em&gt; (2011)&lt;/a&gt; mentioned that the increase in explained variance in linear regression is negligible after 15 variables are included.&lt;/p&gt;
&lt;p&gt;There are 4 steps suggested by &lt;a href=&#34;https://stefvanbuuren.name/publications/Flexible%20multivariate%20-%20TNO99054%201999.pdf&#34;&gt;van Buuren &lt;em&gt;et al.&lt;/em&gt; (1999)&lt;/a&gt; for variable selection in the case of big data:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Include all variables that appear in the complete-data model (final model)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This may include the interaction terms as well (passive imputation can be used to specify the interaction terms in &lt;code&gt;mice&lt;/code&gt; package)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include variables that influence the occurrence of the missing data&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be assessed by a correlation matrix between NAs variables and non-NAs variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include variables that explain a considerable amount of variance&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be crudely assessed by a correlation matrix between NAs variables and non-NAs variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remove variables that have too many missing values within the subgroup of incomplete cases&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be assessed by the proportion of usable cases (PUC) - how many cases with missing data on a certain variable have observed values on the predictor variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All these steps should be done on the key variables only. There is another, more efficient yet laborious, approach suggested by &lt;a href=&#34;https://stefvanbuuren.name/publications/Flexible%20multiple%20-%20TNO99045%201999.pdf&#34;&gt;Oudshoorn &lt;em&gt;et al.&lt;/em&gt; (1999)&lt;/a&gt;, which takes into account the important predictors of the predictors. We are going to focus on the four steps above and not cover the latter approach in this post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-codes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R codes&lt;/h2&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)
library(corrplot)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA&amp;#39;s   :37       NA&amp;#39;s   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 2 variables, Ozone and Solar.R, with missing values (NAs). We can further explore the pattern of missingness.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;md.pattern(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##     Wind Temp Month Day Solar.R Ozone   
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 2 rows with NAs in both Ozone and Solar.R, 35 rows with NAs only in Ozone, and 5 rows with NAs only in Solar.R. Next, we can check the correlation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(airquality, use = &amp;quot;pairwise.complete.obs&amp;quot;) |&amp;gt;
  corrplot(method = &amp;quot;number&amp;quot;, type = &amp;quot;upper&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The correlations of Ozone-Temp and Ozone-Wind are the highest. Now, let’s compute the correlation between the missingness indicators and the observed values of the variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(y = airquality, x = !is.na(airquality), use = &amp;quot;pairwise.complete.obs&amp;quot;) |&amp;gt;
  round(digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R  Wind Temp Month   Day
## Ozone      NA   -0.02 -0.05 0.00  0.26 -0.05
## Solar.R     0      NA  0.06 0.11  0.11  0.17
## Wind       NA      NA    NA   NA    NA    NA
## Temp       NA      NA    NA   NA    NA    NA
## Month      NA      NA    NA   NA    NA    NA
## Day        NA      NA    NA   NA    NA    NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can ignore the warnings and the NAs, as only Ozone and Solar.R have missing values. The highest correlation is 0.26, for the Ozone-Month pair - the correlation between the indicator of whether Ozone is observed and the Month values. In this correlation matrix the row variables are the indicators of whether a value is observed (from &lt;code&gt;!is.na()&lt;/code&gt;) and the column variables are the observed values.&lt;/p&gt;
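&lt;p&gt;To make this concrete, the 0.26 entry can be reproduced directly (a small check, not in the original post):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Correlation between the indicator of Ozone being observed and the Month values
round(cor(!is.na(airquality$Ozone), airquality$Month), digits = 2)&lt;/code&gt;&lt;/pre&gt;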
&lt;p&gt;Lastly, we can calculate the PUC (proportion of usable cases) ‘manually’. &lt;code&gt;md.pairs()&lt;/code&gt; counts the number of observations for each variable pair, according to whether each member of the pair is observed or missing.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;var_pair &amp;lt;- md.pairs(airquality)
round(var_pair$mr / (var_pair$mr + var_pair$mm), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone   0.000   0.946    1    1     1   1
## Solar.R 0.714   0.000    1    1     1   1
## Wind      NaN     NaN  NaN  NaN   NaN NaN
## Temp      NaN     NaN  NaN  NaN   NaN NaN
## Month     NaN     NaN  NaN  NaN   NaN NaN
## Day       NaN     NaN  NaN  NaN   NaN NaN&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A low PUC value indicates that the predictor carries little information for imputing the target variable. NaN is shown where the variables have no missing values. The row variables are the target variables to be imputed, and the column variables are the predictors in the imputation model. We can see that to impute Solar.R (on the row), Ozone carries a little less information (0.714) compared to Wind, Temp, and Day. The diagonal elements will always be 0 or NaN. So, from here we can drop predictors with, say, a PUC of 0, as they contain no information to help impute the target variable.&lt;/p&gt;
&lt;p&gt;Actually, we have a nice function from &lt;code&gt;mice&lt;/code&gt; that can do what we ‘manually’ did just now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;quickpred(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, the column variables are the predictors, and the row variables are the target variables with NAs. The matrix above is known as the predictor matrix, which is going to be used in the imputation model. A 1 denotes a variable included as a predictor, and a 0 a variable that is excluded. The two main arguments in &lt;code&gt;quickpred()&lt;/code&gt; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;mincor - if any of the absolute correlations from the two correlation matrices that we computed earlier is above 0.1 (the default), the predictor will be included in the predictor matrix&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;minpuc - the default minimum PUC is 0, so by default predictors are retained even if they have no information to help the imputation model&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice that variable Day is excluded from the predictors of Ozone. The correlation values are 0 and -0.05 from the first and second correlation matrices, respectively, which do not exceed the default threshold of 0.1. That’s why variable Day is excluded. We can observe a similar situation for variable Wind, which is excluded from the predictors of Solar.R (the correlation coefficients are -0.06 and 0.06). The negative (-) sign does not matter, as we actually evaluate the absolute values.&lt;/p&gt;
&lt;p&gt;Intuitively, we can change these two arguments as we see fit to do variable selection for the imputation model. Once we finalise our variable selection, we can do the multiple imputation using &lt;code&gt;mice()&lt;/code&gt;.&lt;/p&gt;
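&lt;p&gt;For illustration, this is what a stricter selection could look like (a sketch only; the thresholds are arbitrary, and we keep the defaults for the finalised selection below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Require a higher correlation and a minimum proportion of usable cases
quickpred(airquality, mincor = 0.2, minpuc = 0.25)&lt;/code&gt;&lt;/pre&gt;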
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalised variable selection
var_sel &amp;lt;- quickpred(airquality)
var_sel&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Impute
imp &amp;lt;- mice(airquality, m = 5, predictorMatrix = var_sel, printFlag = F)
imp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day 
##   &amp;quot;pmm&amp;quot;   &amp;quot;pmm&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot; 
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that &lt;code&gt;mice()&lt;/code&gt; uses the predictor matrix that we provide.&lt;/p&gt;
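&lt;p&gt;To close the loop with the complete-data model mentioned at the start, here is a minimal sketch (not part of the original post) of fitting a model on each imputed dataset and pooling the results, using the standard &lt;code&gt;mice&lt;/code&gt; workflow; the choice of predictors here is arbitrary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Fit the complete-data model on each of the 5 imputed datasets and pool
fit &amp;lt;- with(imp, lm(Ozone ~ Wind + Temp + Solar.R))
summary(pool(fit))&lt;/code&gt;&lt;/pre&gt;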
&lt;p&gt;References:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34; class=&#34;uri&#34;&gt;https://www.jstatsoft.org/article/view/v045i03&lt;/a&gt; - paper written by Staf van Buuren (a bit outdated in terms of codes, but runnable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/&#34; class=&#34;uri&#34;&gt;https://stefvanbuuren.name/fimd/&lt;/a&gt; - online book written by Stef van Buuren (See chapter 6.3.2 and 9.1.6)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Making maps with R (my first attempt ever!)</title>
      <link>https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/</link>
      <pubDate>Fri, 12 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;As written in the title of the post, this is my first ever try at making a map with R. I found a great dataset on the distribution of clinics in Malaysia. The two types of clinics that we have here are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Klinik 1Malaysia (1Malaysia clinic)&lt;/li&gt;
&lt;li&gt;Klinik Desa (Desa clinic)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Originally, these are two separate datasets. Both can be downloaded from &lt;a href=&#34;https://www.data.gov.my/data/ms_MY/group/pemetaan&#34;&gt;here&lt;/a&gt;. Also, I have uploaded the data to my &lt;a href=&#34;https://github.com/tengku-hanis/clinic-data&#34;&gt;GitHub repo&lt;/a&gt; for those interested. The Klinik Desa data have latitude and longitude information, but the Klinik 1Malaysia data do not.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rworldmap) #to get a Malaysia map
library(tidyverse)
library(tidygeocoder) #to get latitude and logitude&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinic1m.csv&amp;quot;)
clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we need to get latitude and longitude information for the Klinik 1Malaysia data. So, we are going to retrieve the coordinates based on the postal code, though this is not very accurate. We can use &lt;code&gt;tidygeocoder&lt;/code&gt; for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m2 &amp;lt;- 
  clinic1m %&amp;gt;%
  mutate(country = &amp;quot;malaysia&amp;quot;) %&amp;gt;% 
  select(name, postcode, country) %&amp;gt;% 
  mutate(postcode = ifelse(nchar(postcode) == 4, paste0(0, postcode), postcode)) %&amp;gt;%
  geocode(postalcode = postcode, country = country, method = &amp;quot;osm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Checking the data further, we notice that 5 clinics have no coordinate information.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m2 %&amp;gt;% filter(is.na(lat) | is.na(long))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 5
##   name                                     postcode country    lat  long
##   &amp;lt;chr&amp;gt;                                    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Klinik 1 Malaysia Bandar Lela            90700    malaysia    NA    NA
## 2 Klinik 1 Malaysia Batu Melintang         17250    malaysia    NA    NA
## 3 Klinik 1 Malaysia Cakerapurnama          45010    malaysia    NA    NA
## 4 Klinik 1 Malaysia Jelawat                16070    malaysia    NA    NA
## 5 Klinik 1 Malaysia Taman Kempadang Makmur 26060    malaysia    NA    NA&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;some-data-pre-processing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some data pre-processing&lt;/h2&gt;
&lt;p&gt;After some time googling, I found this &lt;a href=&#34;https://www.listendata.com/2020/11/zip-code-to-latitude-and-longitude.html&#34;&gt;data&lt;/a&gt;, which gives coordinates based on the postal code. So, we are going to fill in the missing coordinates based on this online data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;add_coord &amp;lt;- 
  read.table(header = T, text = &amp;quot;
postal_code    latitude   longitude
16070            6.0334    102.3499
26060            3.6228    102.3926
90700            5.8456    118.0571
26060            3.6228    102.3926&amp;quot;)

clinic1m2 &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(lat = ifelse(postcode %in% add_coord$postal_code, add_coord$latitude, lat), 
         long = ifelse(postcode %in% add_coord$postal_code, add_coord$longitude, long)) %&amp;gt;% 
  drop_na() #drop 2 clinic1m&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even after adding in the missing coordinates, we are still missing 2 coordinates. So, we are going to drop those 2 clinics. Next, we combine both datasets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(Type = &amp;quot;1Malaysia&amp;quot;) %&amp;gt;% 
  select(Type, lat, long) %&amp;gt;% 
  bind_rows(clinicDesa %&amp;gt;% 
              mutate(Type = &amp;quot;Desa&amp;quot;, 
                     lat = latitude, 
                     long = longitude) %&amp;gt;% 
              select(Type, lat, long))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s try plotting the data first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(all_clinic, aes(long, lat, color = Type)) +
  geom_point() +
  theme_minimal() #should remove the isolated two data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have 2 isolated points from Klinik Desa data. We will drop these 2 points as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic2 &amp;lt;- all_clinic %&amp;gt;% filter(long &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;plotting-the-map&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plotting the map&lt;/h2&gt;
&lt;p&gt;There are 2 ways to plot our data on a map of Malaysia that we are going to cover in this post.&lt;/p&gt;
&lt;div id=&#34;map-from-ggplot2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1) map from &lt;code&gt;ggplot2&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;First, we need to get the map.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;global &amp;lt;- map_data(&amp;quot;world&amp;quot;) #get map&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have retrieved the map, we need to filter the region to Malaysia. The rest of the code is the &lt;code&gt;ggplot2&lt;/code&gt; functions as we know them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + 
  geom_polygon(data = global %&amp;gt;% filter(region == &amp;quot;Malaysia&amp;quot;), aes(x=long, y = lat, group = group), 
               fill = &amp;quot;gray85&amp;quot;) + 
  coord_fixed(1.3) +
  geom_point(data = all_clinic2, aes(x = long, y = lat, group = Type, color = Type, shape = Type)) +
  theme_void() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Klinik 1Malaysia dan Klinik Desa di Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data dikemaskini: Klinik 1Malaysia - 16 Mac 2021, Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;))), 
       color = &amp;quot;Jenis klinik:&amp;quot;, 
       shape = &amp;quot;Jenis klinik:&amp;quot;) +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = &amp;quot;bottom&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;map-from-rworldmap&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2) map from &lt;code&gt;rworldmap&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The flow is similar: we need to get the map first, then restrict it to the Malaysia region.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;world &amp;lt;- getMap(resolution = &amp;quot;low&amp;quot;) #get map
msia &amp;lt;- world[world@data$ADMIN == &amp;quot;Malaysia&amp;quot;, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rest of the code is similar to the first approach, but we are going to change the theme a bit.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() +
  geom_polygon(data = msia, aes(x = long, y = lat, group = group), fill = NA, colour = &amp;quot;black&amp;quot;) +
  geom_point(data = all_clinic2, aes(x = long, y = lat, group = Type, color = Type, shape = Type)) +
  coord_quickmap() + 
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Klinik 1Malaysia dan Klinik Desa di Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data dikemaskini: Klinik 1Malaysia - 16 Mac 2021, Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;))), 
       color = &amp;quot;Jenis klinik:&amp;quot;, 
       shape = &amp;quot;Jenis klinik:&amp;quot;) +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The coordinates that we have are not as accurate as they should be, or maybe there is something wrong that I missed along the way. As we can see, we have clinics in the ocean. As far as I know, we Malaysians are not that advanced yet. Also, notice that we are severely lacking clinics in Sarawak, assuming our data is correct.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Some COVID-19 plots for Southeast Asian countries</title>
      <link>https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/</link>
      <pubDate>Wed, 10 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Recently, I found a GitHub &lt;a href=&#34;https://github.com/owid/covid-19-data/tree/master/public/data&#34;&gt;repo&lt;/a&gt; containing a global COVID-19 dataset. I thought, why not try to do some plotting for Southeast Asian countries. So, I downloaded the data and limited the data to Southeast Asian countries only (Brunei, Indonesia, Malaysia, Philippines, Singapore, Thailand and Vietnam). I have uploaded this restricted data to my GitHub &lt;a href=&#34;https://github.com/tengku-hanis/data-owid-covid&#34;&gt;repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are not going to do anything fancy, just some visualisations.&lt;/p&gt;
&lt;p&gt;Let’s begin by reading the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
covid_sea &amp;lt;- read_csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/data-owid-covid/main/covid_sea.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to compare the Southeast Asian countries in terms of:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Daily cases&lt;/li&gt;
&lt;li&gt;Daily deaths&lt;/li&gt;
&lt;li&gt;Daily tests&lt;/li&gt;
&lt;li&gt;Daily vaccinations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Before that, we need to write a function, as all the above items share the same plotting structure and differ only in the variable on the y axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot &amp;lt;- function(var1, lab_title, yaxis_lab, span = 0.14){
  covid_sea %&amp;gt;% 
    select(date, location, {{var1}}) %&amp;gt;% #{{ }} lets us pass an unquoted column name
    drop_na() %&amp;gt;% 
    ggplot(aes(date, {{var1}}, color = location)) +
    geom_smooth(se = F, span = span) + #pass the span argument through (was hard-coded as 0.14)
    geom_point(aes(color = location), alpha = 0.2) +
    geom_line(aes(color = location), alpha = 0.2, linetype = &amp;quot;dashed&amp;quot;) +
    labs(title = {{lab_title}}) +
    ylab({{yaxis_lab}}) +
    xlab(&amp;quot;Date&amp;quot;) +
    theme_minimal() 
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, var1 is the variable that we want to compare, lab_title is the plot title, yaxis_lab is the label on the y axis, and span controls how smooth the smoothed line is (it is passed to &lt;code&gt;geom_smooth()&lt;/code&gt;).&lt;/p&gt;
&lt;div id=&#34;daily-cases&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily cases&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_cases, &amp;quot;Daily cases for southeast Asian countries&amp;quot;, &amp;quot;Daily cases&amp;quot;, span = 0.8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We cannot compare the raw counts, as big countries like Indonesia are expected to have a higher number of daily cases. A smoothed line, though very basic, may indicate a simple trend. Thailand, Malaysia, the Philippines and Indonesia seem to have a decreasing trend of cases. On the other hand, the daily cases in Vietnam seem to be starting to increase. Singapore had a more stable trend of cases, though a higher number of cases was observed in the latest period. Lastly, Brunei had too few cases for us to see any sort of trend at the scale of this between-country comparison.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-deaths&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily deaths&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_deaths, &amp;quot;Daily deaths for southeast Asian countries&amp;quot;, &amp;quot;Daily deaths&amp;quot;, span = 0.8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The Philippines and Indonesia seem to show the start of a slightly increasing trend in deaths. The other countries look stable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-tests&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily tests&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_tests, &amp;quot;Daily tests for southeast Asian countries&amp;quot;, &amp;quot;Daily tests&amp;quot;, span = 0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The daily tests plot looks a bit odd for Vietnam. Daily test counts are actually missing for much of the period (it is unclear whether no tests were done or the values simply were not recorded), hence the odd-looking smoothed line for Vietnam. Data for Brunei and Thailand are not available at all. Malaysia seems to be quite aggressive in COVID-19 testing, on par with Indonesia. Vietnam also seems to have tested very aggressively in the latest period, probably to make up for the lack of testing earlier.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-vaccinations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily vaccinations&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_vaccinations, &amp;quot;Daily vaccinations for southeast Asian countries&amp;quot;, &amp;quot;Daily vaccinations&amp;quot;, span = 0.9)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Malaysia and Singapore had quite similar distributions. Vietnam, the Philippines, Thailand and Indonesia are also fairly similar in that they show a series of waves in the rate of vaccinations, although the waves are less obvious for Thailand. Again, the numbers for Brunei are too small for us to see any trend or distribution at this scale.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;malaysia-situation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Malaysia situation&lt;/h2&gt;
&lt;p&gt;Let’s do a plot specific to Malaysia. We are going to scale the numbers so that we can compare the trends and distributions of the different items on a common scale.&lt;/p&gt;
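&lt;p&gt;As a quick reminder, &lt;code&gt;scale()&lt;/code&gt; centres a variable to mean 0 and standard deviation 1, which is what makes items measured on very different scales comparable:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# scale() standardises the values: (x - mean(x)) / sd(x)
scale(c(10, 20, 30))  # gives -1, 0, 1&lt;/code&gt;&lt;/pre&gt;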
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covid_sea %&amp;gt;% 
  filter(location == &amp;quot;Malaysia&amp;quot;) %&amp;gt;% 
  mutate(new_cases = scale(new_cases), 
         new_deaths = scale(new_deaths), 
         new_tests = scale(new_tests), 
         new_vaccinations = scale(new_vaccinations)) %&amp;gt;% 
  ggplot(aes(date)) +
  geom_line(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), alpha = 0.3) +
  geom_smooth(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), se = F, span = 0.6) +
  labs(title = &amp;quot;Situation in Malaysia&amp;quot;) +
  ylab(&amp;quot;Scaled Frequency&amp;quot;) +
  xlab(&amp;quot;Date&amp;quot;) +
  guides(color = guide_legend(&amp;quot;Items&amp;quot;)) +
  scale_color_discrete(labels = c(&amp;quot;Daily cases&amp;quot;, &amp;quot;Daily deaths&amp;quot;, &amp;quot;Daily tests&amp;quot;, &amp;quot;Daily vaccinations&amp;quot;)) +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, once the number of vaccinations increased past a certain threshold, the number of daily cases and daily deaths started to decrease. The daily testing decreased as well, since COVID-19 testing in Malaysia is done on suspected cases and their contacts rather than through mass testing.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: Please take anything written here with a massive grain of salt.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Data source:
&lt;a href=&#34;https://github.com/owid/covid-19-data/tree/master/public/data&#34; class=&#34;uri&#34;&gt;https://github.com/owid/covid-19-data/tree/master/public/data&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Extract a table from a pdf</title>
      <link>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</link>
      <pubDate>Mon, 01 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In a couple of days, I am going to conduct a pre-conference workshop for the Malaysian &lt;a href=&#34;https://www.r-conference.com/&#34;&gt;R conference 2021&lt;/a&gt;. Some of the data that I am going to use for this workshop is available as a table in a pdf. Hence, this post is about how I got that particular table from the pdf into R for further analysis.&lt;/p&gt;
&lt;p&gt;So, this is the table we are going to extract.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;images/table.png&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;extracting-a-table-from-pdf&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting a table from pdf&lt;/h2&gt;
&lt;p&gt;We are going to use the &lt;code&gt;tabulizer&lt;/code&gt; package for this. However, not every pdf works with this package. In our case, it works but needs further preprocessing.&lt;/p&gt;
&lt;p&gt;Load the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tabulizer)
library(dplyr)
library(stringr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read a table from a pdf.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table &amp;lt;- extract_tables(&amp;quot;https://static-content.springer.com/esm/art%3A10.1038%2Fs41440-021-00720-3/MediaObjects/41440_2021_720_MOESM1_ESM.pdf&amp;quot;, 
                          pages = 17, 
                          output = &amp;quot;data.frame&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, this is the extracted table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table[[1]] %&amp;gt;% head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                X     X.1     X.2     X.3  X.4     X.5 X.6     X.7  X.8
## 1                                                                     
## 2                                                                     
## 3    Ahmed, 2019 Unclear Unclear Unclear High Unclear Low Unclear High
## 4                                                                     
## 5   Badrov, 2013 Unclear    High    High High Unclear Low Unclear High
## 6   Baross, 2012 Unclear Unclear    High High Unclear Low Unclear High
## 7   Baross, 2013 Unclear Unclear    High High Unclear Low Unclear High
## 8  Carlson, 2016     Low    High    High  Low Unclear Low     Low High
## 9  Correia, 2020     Low     Low     Low High Unclear Low     Low High
## 10                                                                    
##                              X.9
## 1      1- selection bias: random
## 2            sequence generation
## 3  2- selection bias: allocation
## 4                    concealment
## 5                               
## 6   3- reporting bias: selective
## 7                      reporting
## 8                               
## 9  4- Performance bias: blinding
## 10  (participants and personnel)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, a few preprocessing steps are needed:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Remove column X.9 - this column is actually supposed to be part of the header&lt;/li&gt;
&lt;li&gt;Rename the headers based on column X.9&lt;/li&gt;
&lt;li&gt;Remove the space within the author names - “Ahmed,2019” instead of “Ahmed, 2019”&lt;/li&gt;
&lt;li&gt;Remove empty rows&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob &amp;lt;- 
  raw_table[[1]] %&amp;gt;% 
  select(-X.9) %&amp;gt;%  
  rename(Study = X, 
         Random.sequence.generation. = X.1, 
         Allocation.concealment. = X.2,
         Selective.reporting. = X.3,
         Blinding.of.participants.and.personnel. = X.4, 
         Blinding.of.outcome.assessment = X.5, 
         Incomplete.outcome.data = X.6, 
         Other.sources.of.bias. = X.7, 
         Overall = X.8) %&amp;gt;% 
  as_tibble() %&amp;gt;% 
  mutate(Study = str_replace_all(Study, &amp;quot; &amp;quot;, &amp;quot;&amp;quot;)) %&amp;gt;% 
  mutate(id_del = str_match(Study, &amp;quot;.&amp;quot;)) %&amp;gt;% 
  filter(!is.na(id_del)) %&amp;gt;% 
  select(-id_del)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, our data is ready.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          Study Random.sequence.generation. Allocation.concealment.
## 1   Ahmed,2019                     Unclear                 Unclear
## 2  Badrov,2013                     Unclear                    High
## 3  Baross,2012                     Unclear                 Unclear
## 4  Baross,2013                     Unclear                 Unclear
## 5 Carlson,2016                         Low                    High
##   Selective.reporting. Blinding.of.participants.and.personnel.
## 1              Unclear                                    High
## 2                 High                                    High
## 3                 High                                    High
## 4                 High                                    High
## 5                 High                                     Low
##   Blinding.of.outcome.assessment Incomplete.outcome.data Other.sources.of.bias.
## 1                        Unclear                     Low                Unclear
## 2                        Unclear                     Low                Unclear
## 3                        Unclear                     Low                Unclear
## 4                        Unclear                     Low                Unclear
## 5                        Unclear                     Low                    Low
##   Overall
## 1    High
## 2    High
## 3    High
## 4    High
## 5    High&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A short note on multiple imputation</title>
      <link>https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/</link>
      <pubDate>Fri, 29 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Missing data is quite challenging to deal with. Deleting it may be the easiest solution, but not necessarily the best one. Missing data can be categorised into 3 types (&lt;a href=&#34;https://www.jstor.org/stable/2335739&#34;&gt;Rubin, 1976&lt;/a&gt;):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;MCAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing Completely At Random&lt;/li&gt;
&lt;li&gt;Example: some observations are missing because records were lost during a flood&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing At Random&lt;/li&gt;
&lt;li&gt;Example: the income variable is missing because some participants refuse to give their salary, which they deem to be very personal information&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MNAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing Not At Random&lt;/li&gt;
&lt;li&gt;Example: the weight variable is missing for morbidly obese participants because the scale cannot weigh them&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Out of the 3 types above, the most problematic is MNAR, though methods exist to deal with this type, for example the &lt;a href=&#34;https://cran.r-project.org/web/packages/miceMNAR/miceMNAR.pdf&#34;&gt;miceMNAR&lt;/a&gt; package in R.&lt;/p&gt;
&lt;p&gt;There are several approaches in handling missing data:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Listwise-deletion&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Best approach if the amount of missingness is very small&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using mean/median/mode imputation&lt;/li&gt;
&lt;li&gt;This approach is not advisable as it introduces bias by reducing the variance, even though the mean is unaffected&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Single imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple imputation above is considered as single imputation as well&lt;/li&gt;
&lt;li&gt;This approach ignores the uncertainty of the imputation and almost always underestimates the variance&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A bit more advanced; it addresses the limitations of the single imputation approach&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;However, the main assumption for any of these imputation methods is that the missingness is MCAR or MAR.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;multiple-imputation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Multiple imputation&lt;/h2&gt;
&lt;p&gt;In short, there are 2 approaches to multiple imputation implemented by R packages:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Joint modeling (JM) or joint multivariate normal distribution multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The main assumption for this method is that the observed data follows a multivariate normal distribution&lt;/li&gt;
&lt;li&gt;A violation of this assumption produces incorrect values, though a slight violation is still okay&lt;/li&gt;
&lt;li&gt;Some packages that implement this method: &lt;code&gt;Amelia&lt;/code&gt; and &lt;code&gt;norm&lt;/code&gt; (a minimal sketch follows this list)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fully conditional specification (FCS) or conditional multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Also known as multivariate imputation by chained equation (MICE)&lt;/li&gt;
&lt;li&gt;This approach is more flexible, as a distribution is assumed for each variable rather than for the whole dataset&lt;/li&gt;
&lt;li&gt;Some packages that implement this method: &lt;code&gt;mice&lt;/code&gt; and &lt;code&gt;mi&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
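&lt;p&gt;To give a flavour of the JM approach, here is a minimal sketch using &lt;code&gt;Amelia&lt;/code&gt;. This is only an illustration on the built-in airquality dataset (which has NAs in Ozone and Solar.R); it is not used anywhere else in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch of joint-modeling multiple imputation (illustration only)
library(Amelia)
a_out &amp;lt;- amelia(airquality, m = 5)  # 5 imputed datasets under a joint multivariate normal model
summary(a_out)&lt;/code&gt;&lt;/pre&gt;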
&lt;/div&gt;
&lt;div id=&#34;example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example&lt;/h2&gt;
&lt;p&gt;In the &lt;code&gt;mice&lt;/code&gt; package, the general steps are as follows (a compact code sketch comes after Figure 1):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;mice()&lt;/code&gt; - impute the NAs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with()&lt;/code&gt; - run the analysis (lm, glm, etc)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pool()&lt;/code&gt; - pool the results&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;Screenshot%202021-11-20%20145517.png&#34; alt=&#34;Main steps in mice package.&#34; width=&#34;90%&#34; height=&#34;90%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Main steps in mice package.
&lt;/p&gt;
&lt;/div&gt;
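&lt;p&gt;Mapped onto code, the three steps look roughly like this. This is only a compact sketch; df, y, x1 and x2 below are placeholder names, and the real example with our data comes later in the post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Compact sketch of the mice workflow (placeholder names, not run here)
imp &amp;lt;- mice(df, m = 5, seed = 1)      # 1) impute the NAs, giving m imputed datasets
fit &amp;lt;- with(imp, lm(y ~ x1 + x2))     # 2) run the analysis on each imputed dataset
pooled &amp;lt;- pool(fit)                   # 3) pool the results using Rubin&amp;#39;s rules
summary(pooled)&lt;/code&gt;&lt;/pre&gt;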
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(mice)
library(VIM)
# library(missForest) # we only need prodNA() from this package, so we call it via missForest:: below
library(naniar)
library(niceFunction) #install from github (https://github.com/tengku-hanis/niceFunction)
library(dplyr)
library(gtsummary)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to introduce some NAs randomly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat &amp;lt;- iris %&amp;gt;% 
  select(-Sepal.Length)%&amp;gt;% 
  missForest::prodNA(0.2) %&amp;gt;%  # randomly insert 20% NAs
  mutate(Sepal.Length = iris$Sepal.Length)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explore the NAs and the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naniar::miss_var_summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 3
##   variable     n_miss pct_miss
##   &amp;lt;chr&amp;gt;         &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 Petal.Length     38     25.3
## 2 Sepal.Width      33     22  
## 3 Species          28     18.7
## 4 Petal.Width      21     14  
## 5 Sepal.Length      0      0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some references recommend removing variables with more than 50% NAs. Here, we purposely introduced 20% NAs into our data.&lt;/p&gt;
&lt;p&gt;As a guideline, we can test whether our NAs are MCAR.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naniar::mcar_test(dat) #p &amp;gt; 0.05, MCAR is indicated&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;            &amp;lt;int&amp;gt;
## 1      38.8    40   0.522               14&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to evaluate the pattern of missingness in our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;md.pattern(dat, rotate.names = T, plot = T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##    Sepal.Length Petal.Width Species Sepal.Width Petal.Length    
## 64            1           1       1           1            1   0
## 21            1           1       1           1            0   1
## 15            1           1       1           0            1   1
## 3             1           1       1           0            0   2
## 14            1           1       0           1            1   1
## 4             1           1       0           1            0   2
## 6             1           1       0           0            1   2
## 2             1           1       0           0            0   3
## 7             1           0       1           1            1   1
## 6             1           0       1           1            0   2
## 4             1           0       1           0            1   2
## 2             1           0       1           0            0   3
## 1             1           0       0           1            1   2
## 1             1           0       0           0            1   3
##               0          21      28          33           38 120&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggr(dat, prop = F, numbers = T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have 13 patterns of NAs in our data (in addition to the complete-case pattern in the first row). These 2 functions work well with a small dataset, but with a larger dataset (and many more patterns of NAs), it is probably quite difficult to assess the patterns this way.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;matrixplot()&lt;/code&gt; is probably more appropriate for a larger dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;matrixplot(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In terms of the missingness pattern, we can also assess whether the distribution of NAs in Sepal.Width depends on the variable Sepal.Length.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;niceFunction::histNA_byVar(dat, Sepal.Width, Sepal.Length)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, the distribution and range of the histograms for the NAs (True) and non-NAs (False) are quite similar. This may indicate that Sepal.Width is at least MAR. However, strictly speaking we should do this for each pair of numerical variables before jumping to any conclusion, as sketched below.&lt;/p&gt;
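&lt;p&gt;For completeness, a sketch of how the same check could be repeated for the other variables with NAs, mirroring the call above (this assumes histNA_byVar() accepts any pair of columns in the same way):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Repeat the same check for the remaining variables with NAs against Sepal.Length
niceFunction::histNA_byVar(dat, Petal.Length, Sepal.Length)
niceFunction::histNA_byVar(dat, Petal.Width, Sepal.Length)&lt;/code&gt;&lt;/pre&gt;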
&lt;p&gt;Another good thing to assess is the correlation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Data with 1 = NAs, 0 = non-NAs
x &amp;lt;- as.data.frame(abs(is.na(dat))) %&amp;gt;% 
  dplyr::select(-Sepal.Length) #pick variable with NAs only&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Firstly, the correlation between the variables with missing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(x) %&amp;gt;% 
  corrplot::corrplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There is no high correlation among the missingness indicators. Secondly, let’s see the correlation between the NAs in a variable and the observed values of the other variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(dat %&amp;gt;% mutate(Species = as.numeric(Species)), x, use = &amp;quot;pairwise.complete.obs&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##               Sepal.Width Petal.Length  Petal.Width     Species
## Sepal.Width            NA  0.049158733 -0.065917718  0.09948263
## Petal.Length  0.042075695           NA -0.004572405 -0.17265919
## Petal.Width   0.096195805 -0.003320601           NA -0.11024288
## Species       0.045849046 -0.104143925 -0.081055707          NA
## Sepal.Length -0.006435044 -0.052871701 -0.091024799 -0.08527514&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, there is no high correlation. To interpret this correlation matrix: the rows are the observed variables and the columns represent the missingness. For example, missing values of Sepal.Width are slightly more likely for observations with a high value of Petal.Width, although an r of about 0.1 means this association is negligible.&lt;/p&gt;
&lt;p&gt;Now, we can do multiple imputation. These are the methods in the &lt;code&gt;mice&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;methods(mice)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] mice.impute.2l.bin       mice.impute.2l.lmer      mice.impute.2l.norm     
##  [4] mice.impute.2l.pan       mice.impute.2lonly.mean  mice.impute.2lonly.norm 
##  [7] mice.impute.2lonly.pmm   mice.impute.cart         mice.impute.jomoImpute  
## [10] mice.impute.lda          mice.impute.logreg       mice.impute.logreg.boot 
## [13] mice.impute.mean         mice.impute.midastouch   mice.impute.mnar.logreg 
## [16] mice.impute.mnar.norm    mice.impute.norm         mice.impute.norm.boot   
## [19] mice.impute.norm.nob     mice.impute.norm.predict mice.impute.panImpute   
## [22] mice.impute.passive      mice.impute.pmm          mice.impute.polr        
## [25] mice.impute.polyreg      mice.impute.quadratic    mice.impute.rf          
## [28] mice.impute.ri           mice.impute.sample       mice.mids               
## [31] mice.theme              
## see &amp;#39;?methods&amp;#39; for accessing help and source code&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, mice uses the following methods (a sketch showing how to override them follows this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pmm (predictive mean matching) for numeric data&lt;/li&gt;
&lt;li&gt;logreg (logistic regression imputation) for binary data, factor with 2 levels&lt;/li&gt;
&lt;li&gt;polyreg (polytomous regression imputation) for unordered categorical data (factor &amp;gt; 2 levels)&lt;/li&gt;
&lt;li&gt;polr (proportional odds model) for ordered, &amp;gt; 2 levels&lt;/li&gt;
&lt;/ul&gt;
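&lt;p&gt;These defaults can be inspected and overridden. A minimal sketch, using &lt;code&gt;norm&lt;/code&gt; (Bayesian linear regression) as an example replacement for pmm:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the default methods chosen for our data, then override one of them
meth &amp;lt;- make.method(dat)
meth[&amp;quot;Sepal.Width&amp;quot;] &amp;lt;- &amp;quot;norm&amp;quot;  # Bayesian linear regression instead of pmm
imp_norm &amp;lt;- mice(dat, method = meth, m = 5, seed = 1234, printFlag = F)&lt;/code&gt;&lt;/pre&gt;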
&lt;p&gt;Let’s run the mice function on our data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp &amp;lt;- mice(dat, m = 5, seed=1234, maxit = 5, printFlag = F) 
imp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  Sepal.Width Petal.Length  Petal.Width      Species Sepal.Length 
##        &amp;quot;pmm&amp;quot;        &amp;quot;pmm&amp;quot;        &amp;quot;pmm&amp;quot;    &amp;quot;polyreg&amp;quot;           &amp;quot;&amp;quot; 
## PredictorMatrix:
##              Sepal.Width Petal.Length Petal.Width Species Sepal.Length
## Sepal.Width            0            1           1       1            1
## Petal.Length           1            0           1       1            1
## Petal.Width            1            1           0       1            1
## Species                1            1           1       0            1
## Sepal.Length           1            1           1       1            0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can do some diagnostic assessment on the imputed data. These are the imputed values for Sepal.Width, with one column per imputed dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp$imp$Sepal.Width %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      1   2   3   4   5
## 5  3.4 3.4 4.1 3.1 3.5
## 13 3.2 3.1 3.2 3.6 3.1
## 14 3.1 3.2 2.9 3.4 3.0
## 23 3.6 3.2 3.0 3.8 3.1
## 26 4.1 3.0 3.1 3.5 3.0
## 34 3.4 3.7 3.7 3.4 4.4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One important thing to check is convergence. We are going to increase the number of iterations for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp_conv &amp;lt;- mice.mids(imp, maxit = 30, printFlag = F)
plot(imp_conv)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-16-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The lines in the plot should be intermingled, with no obvious trend. Our plots above indicate convergence.&lt;/p&gt;
&lt;p&gt;We can also compare the density plots of the imputed and observed data. Blue is the observed data and red is the imputed data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densityplot(imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can further assess the variable Sepal.Width.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densityplot(imp, ~ Sepal.Width | .imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-18-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can assess the strip plot. The imputed observations (red) should not be distributed too far from the observed data (blue).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;stripplot(imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, once we finish the diagnostic checking, we could go back and change the imputation method for Sepal.Width, since its distribution differs noticeably across imputations. But we are not going to do that here; instead we will proceed to the analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# run regression
fit &amp;lt;- with(imp, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species))
# pool all imputed set
pooled &amp;lt;- pool(fit) 
summary(pooled)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                term   estimate  std.error statistic       df      p.value
## 1       (Intercept)  2.2008307 0.34577321  6.364954 29.02484 5.859560e-07
## 2       Sepal.Width  0.5233500 0.09717217  5.385801 50.89918 1.854832e-06
## 3      Petal.Length  0.7409159 0.09020153  8.214006 12.73722 1.921415e-06
## 4       Petal.Width -0.3623895 0.18562168 -1.952301 22.34517 6.354332e-02
## 5 Speciesversicolor -0.3891112 0.28166528 -1.381467 15.07547 1.872683e-01
## 6  Speciesvirginica -0.5237106 0.42629920 -1.228505 10.82804 2.452897e-01&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we have the original dataset without the NAs, we are going to compare the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mimpute &amp;lt;- 
  fit %&amp;gt;% 
  tbl_regression() #with mice

noimpute &amp;lt;- 
  dat %&amp;gt;% 
  lm(Sepal.Length ~ ., data = .) %&amp;gt;% 
  tbl_regression() #w/o mice

original &amp;lt;- 
  iris %&amp;gt;% 
  lm(Sepal.Length ~ ., data = .) %&amp;gt;% 
  tbl_regression() #original data

tbl_merge(
  tbls = list(mimpute, noimpute, original), 
  tab_spanner = c(&amp;quot;With MICE&amp;quot;, &amp;quot;Without MICE&amp;quot;, &amp;quot;Original data&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;kofvwjwgme&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;With MICE&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Without MICE&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Original data&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.52&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.33, 0.72&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.48&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.17, 0.79&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.50&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.33, 0.67&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.74&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.55, 0.94&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.71&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.51, 0.90&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.83&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.69, 1.0&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.75, 0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.064&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.35&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.85, 0.14&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.32&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.61, -0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.039&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Species&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;setosa&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;versicolor&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.39&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.0, 0.21&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.1, 0.30&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.3&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.72&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.2, -0.25&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;virginica&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.52&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.5, 0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.5, 0.63&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.4&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.0&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.7, -0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;10&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;There is a difference in the results between the original dataset (no NAs) and the mice-imputed data. Exploring other imputation methods would probably produce a better result.&lt;/p&gt;
&lt;p&gt;There is a lot more that is not covered in this post, for example &lt;a href=&#34;https://www.gerkovink.com/miceVignettes/Passive_Post_processing/Passive_imputation_post_processing.html&#34;&gt;passive imputation and post-processing&lt;/a&gt;. In fact, there is a series of &lt;a href=&#34;https://github.com/amices/mice#vignettes&#34;&gt;vignettes&lt;/a&gt; written by Gerko Vink and Stef van Buuren (both authors of &lt;code&gt;mice&lt;/code&gt;) which provides a good, though quite advanced, tutorial on using &lt;code&gt;mice&lt;/code&gt;.&lt;/p&gt;
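&lt;p&gt;Just to give a flavour of passive imputation, here is a minimal sketch; d, wgt, hgt and bmi are placeholder names, and in practice the predictor matrix should also be adjusted so that bmi is not used to impute wgt and hgt.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch of passive imputation: bmi is derived from the imputed wgt and hgt
meth &amp;lt;- make.method(d)
meth[&amp;quot;bmi&amp;quot;] &amp;lt;- &amp;quot;~ I(wgt / (hgt / 100)^2)&amp;quot;
imp_passive &amp;lt;- mice(d, method = meth, printFlag = F)&lt;/code&gt;&lt;/pre&gt;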
&lt;p&gt;Suggested online books (though I have not really studied either of them yet):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/&#34;&gt;Flexible imputation of missing data&lt;/a&gt; by Stef van Buuren&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/mwheymans/bookmi/&#34;&gt;Applied missing data analysis with SPSS and (R)Studio&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;References for this post:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;http://www.cs.uni.edu/~jacobson/4772/week11/R_in_Action.pdf&#34;&gt;R in Action, Data analysis and graphics with R&lt;/a&gt; (Chapter 15)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://data.library.virginia.edu/getting-started-with-multiple-imputation-in-r/&#34; class=&#34;uri&#34;&gt;https://data.library.virginia.edu/getting-started-with-multiple-imputation-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/&#34; class=&#34;uri&#34;&gt;https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34;&gt;mice: Multivariate Imputation by Chained Equations in R&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>COVID-19 vaccine interest in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/</link>
      <pubDate>Sun, 17 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;We are going to do a basic Google Trends search using the &lt;code&gt;gtrendsR&lt;/code&gt; package and do some plotting with &lt;code&gt;ggplot2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(gtrendsR)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the &lt;code&gt;gtrends()&lt;/code&gt; function to search for our keywords of interest (i.e., the vaccine types). So far, only &lt;a href=&#34;https://covidnow.moh.gov.my/vaccinations/&#34;&gt;4 types of vaccines&lt;/a&gt; have been used in Malaysia.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine &amp;lt;- gtrends(c(&amp;quot;pfizer&amp;quot;, &amp;quot;astrazeneca&amp;quot;, &amp;quot;sinovac&amp;quot;, &amp;quot;cansino&amp;quot;), geo = &amp;quot;MY&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, plot our keywords.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(vaccine)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;
It’s probably better to filter the dates to when the COVID-19 pandemic started, which is around March 2020.&lt;/p&gt;
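&lt;p&gt;As a side note, the date range could also be restricted at query time through the time argument of &lt;code&gt;gtrends()&lt;/code&gt;; a minimal sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Restrict the date range in the query itself (start and end dates)
vaccine_2020 &amp;lt;- gtrends(c(&amp;quot;pfizer&amp;quot;, &amp;quot;astrazeneca&amp;quot;, &amp;quot;sinovac&amp;quot;, &amp;quot;cansino&amp;quot;), 
                        geo = &amp;quot;MY&amp;quot;, time = &amp;quot;2020-03-01 2021-10-17&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we instead filter the data we have already downloaded.&lt;/p&gt;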
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_over_time %&amp;gt;% 
  group_by(keyword) %&amp;gt;% 
  filter(hits != &amp;quot;&amp;lt;1&amp;quot; &amp;amp; date &amp;gt; as.Date(&amp;quot;2020-03-01&amp;quot;)) %&amp;gt;% 
  mutate(hits = as.numeric(hits), 
         date = as.Date(date)) %&amp;gt;% 
  ggplot() + 
  geom_line(aes(x = date, y = hits, color = keyword), size = 0.8) +
  theme_minimal() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;Date&amp;quot;) +
  scale_x_date(date_breaks = &amp;quot;4 month&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, the AstraZeneca vaccine attracts a high level of interest, probably due to the infamous blood-clotting issue. Next, we can also plot the search interest by state.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_by_region %&amp;gt;% 
  group_by(location) %&amp;gt;% 
  ggplot(aes(location, hits, fill = keyword)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia by states&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can plot the search interest by city.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_by_city %&amp;gt;% 
  group_by(location) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  ggplot(aes(location, hits, fill = keyword)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia by cities&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;gtrendsR&lt;/code&gt;, with just a few plots, is certainly very useful if we want to gauge interest in certain issues in the community.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Wordcloud of COVID-19 research in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/</link>
      <pubDate>Sat, 11 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/wordcloud.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/wordcloud2.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/hover.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2-binding/wordcloud2.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Let’s see how much COVID-19 research has been done in Malaysia. In this analysis, we are going to use the &lt;a href=&#34;https://www.scopus.com/search/form.uri?display=basic&amp;amp;zone=header&amp;amp;origin=#basic&#34;&gt;Scopus database&lt;/a&gt; to access the relevant papers, and we are going to use 4 specific parts of each scientific paper:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Title&lt;/li&gt;
&lt;li&gt;Abstract&lt;/li&gt;
&lt;li&gt;Author’s keywords&lt;/li&gt;
&lt;li&gt;Scopus’s keywords&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&#34;images/sample%20paper.png&#34; alt=&#34;Sample of paper&#34; /&gt;
Above is a sample paper showing the sections that we are going to use in our analysis. The Scopus keywords are generated by the Scopus database, so they are not available on the paper itself.&lt;/p&gt;
&lt;p&gt;So, the analysis will be applied separately to these 4 parts of the papers. Also, we are going to use map (equivalent to a loop) since the flow of the analysis is the same for each part.&lt;/p&gt;
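&lt;p&gt;As a quick reminder of how &lt;code&gt;map()&lt;/code&gt; works: it applies a function to every element of a list and returns a list.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# map() applies a function to each element of a list and returns a list
purrr::map(list(a = 1:3, b = 4:6), sum)  # returns a list: a = 6, b = 15&lt;/code&gt;&lt;/pre&gt;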
&lt;p&gt;Load the related packages. The main package is &lt;code&gt;quanteda&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(patchwork)
library(wordcloud2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have uploaded the data that I downloaded from the Scopus database into my GitHub.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Read data from GitHub repo
df &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/scopus-data/main/covid-malaysia.csv&amp;quot;) %&amp;gt;% 
  janitor::clean_names() %&amp;gt;% 
  rename(title =i_title)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we need to tokenize the text. In other words, we break the sentences down into words (tokens).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Tokenize
tok_list &amp;lt;- 
  df %&amp;gt;% 
  select(title, abstract, author_keywords, index_keywords) %&amp;gt;% 
  map(tokens, 
      remove_punct = T, 
      remove_numbers = T,               
      remove_symbols = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we remove words that are not meaningful, such as ‘a’, ‘the’, etc. These words are known as stop words.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Remove stop words
nostop_toks &amp;lt;- 
  tok_list %&amp;gt;% 
  map(tokens_select, 
      c(tidytext::stop_words$word, stopwords(&amp;quot;en&amp;quot;)), 
      selection = &amp;quot;remove&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we create a document-feature matrix (DFM). Basically, a DFM is a matrix that represents the frequency of each word (feature) in each document (in our case, a paper or manuscript). Another name for a DFM is a document-term matrix (DTM); &lt;code&gt;quanteda&lt;/code&gt; uses the term DFM, while some other packages use DTM.&lt;/p&gt;
&lt;p&gt;Additionally, we apply the term frequency-inverse document frequency (TF-IDF) weighting. In scientific papers, words such as ‘determine’, ‘conclusion’, ‘introduction’, etc. are very frequent but not particularly meaningful. Instead of removing them manually one by one, we use TF-IDF, which down-weights words that appear in almost every document, so only the relevant or important words stand out. Roughly, a term’s weight is its count in a document multiplied by log(number of documents / number of documents containing the term), so a term that appears in every document gets a weight of zero. A toy illustration follows the next code chunk.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create DFM and apply tf_idf
covid_dfm_list &amp;lt;- 
  nostop_toks %&amp;gt;% 
  map(dfm) %&amp;gt;% 
  map(dfm_tfidf)&lt;/code&gt;&lt;/pre&gt;
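&lt;p&gt;To see what TF-IDF does on a small scale, here is a toy illustration with three made-up one-line documents (not from our Scopus data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration: a term that appears in every document gets a tf-idf weight of 0
toy &amp;lt;- dfm(tokens(c(d1 = &amp;quot;covid vaccine study&amp;quot;,
                    d2 = &amp;quot;covid lockdown study&amp;quot;,
                    d3 = &amp;quot;covid vaccine&amp;quot;)))
dfm_tfidf(toy)  # &amp;quot;covid&amp;quot; occurs in all 3 documents, so its weight is 0&lt;/code&gt;&lt;/pre&gt;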
Once we have our weighted DFM, we can plot the most relevant terms based on TF-IDF.
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot top features
A &amp;lt;- 
  covid_dfm_list$title %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;blueviolet&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the title&amp;quot;)

B &amp;lt;- 
  covid_dfm_list$abstract %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;darkolivegreen3&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the abstract&amp;quot;)

C &amp;lt;- 
  covid_dfm_list$author_keywords %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;deepskyblue2&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the author&amp;#39;s keywords&amp;quot;)

D &amp;lt;- 
  covid_dfm_list$index_keywords %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;aquamarine2&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the Scopus&amp;#39;s keywords&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;These are the plots of the most relevant terms in COVID-19 research in Malaysia.
&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-2.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-3.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-4.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
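&lt;p&gt;Since &lt;code&gt;patchwork&lt;/code&gt; is loaded, the four plots can also be arranged into a single 2 x 2 figure if preferred; a minimal sketch (not necessarily how the figures above were produced):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Combine the four plots into a 2 x 2 grid using patchwork
(A + B) / (C + D)&lt;/code&gt;&lt;/pre&gt;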
&lt;div id=&#34;wordcloud&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wordcloud&lt;/h2&gt;
&lt;p&gt;Finally, we can make our wordcloud, but we need to convert our DFM to a frequency data frame first. Also, we are going to round the TF-IDF values and limit the wordcloud to the top 1000 terms only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covid_wc &amp;lt;- 
  covid_dfm_list %&amp;gt;% 
  map(textstat_frequency, force = T)&lt;/code&gt;&lt;/pre&gt;
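&lt;p&gt;Each element of &lt;code&gt;covid_wc&lt;/code&gt; is now a data frame with one row per feature, along with its frequency, rank, and document frequency; a quick look at the one for the titles:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the frequency data frame for the titles
head(covid_wc$title)&lt;/code&gt;&lt;/pre&gt;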
&lt;p&gt;Actually, &lt;code&gt;quanteda&lt;/code&gt; itself is able to produce a wordcloud. However, the wordcloud from &lt;code&gt;wordcloud2&lt;/code&gt; is interactive, and we can see the TF-IDF value of each word by clicking on it.&lt;/p&gt;
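&lt;p&gt;For comparison, the static version from &lt;code&gt;quanteda.textplots&lt;/code&gt; would look roughly like this (a minimal sketch), before we move on to &lt;code&gt;wordcloud2&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A static wordcloud straight from the weighted DFM
textplot_wordcloud(covid_dfm_list$title, max_words = 100)&lt;/code&gt;&lt;/pre&gt;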
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$title %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-9&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Top 1000 terms extracted from the title
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$abstract %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-10&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Top 1000 terms extracted from the abstract
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$author_keywords %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-11&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-3&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-3&#34;&gt;{&#34;x&#34;:{&#34;word&#34;:[&#34;covid-19&#34;,&#34;pandemic&#34;,&#34;learning&#34;,&#34;coronavirus&#34;,&#34;health&#34;,&#34;sars-cov-2&#34;,&#34;malaysia&#34;,&#34;social&#34;,&#34;online&#34;,&#34;education&#34;,&#34;disease&#34;,&#34;analysis&#34;,&#34;control&#34;,&#34;anxiety&#34;,&#34;technology&#34;,&#34;students&#34;,&#34;teaching&#34;,&#34;mental&#34;,&#34;movement&#34;,&#34;model&#34;,&#34;public&#34;,&#34;management&#34;,&#34;media&#34;,&#34;stress&#34;,&#34;healthcare&#34;,&#34;lockdown&#34;,&#34;machine&#34;,&#34;psychological&#34;,&#34;quality&#34;,&#34;risk&#34;,&#34;medical&#34;,&#34;food&#34;,&#34;policy&#34;,&#34;system&#34;,&#34;depression&#34;,&#34;vaccine&#34;,&#34;respiratory&#34;,&#34;care&#34;,&#34;university&#34;,&#34;impact&#34;,&#34;clinical&#34;,&#34;deep&#34;,&#34;knowledge&#34;,&#34;economic&#34;,&#34;virus&#34;,&#34;diseases&#34;,&#34;tourism&#34;,&#34;digital&#34;,&#34;neural&#34;,&#34;network&#34;,&#34;theory&#34;,&#34;waste&#34;,&#34;development&#34;,&#34;islamic&#34;,&#34;image&#34;,&#34;epidemic&#34;,&#34;e-learning&#34;,&#34;mortality&#34;,&#34;performance&#34;,&#34;infectious&#34;,&#34;sustainable&#34;,&#34;workers&#34;,&#34;covid&#34;,&#34;syndrome&#34;,&#34;artificial&#34;,&#34;intention&#34;,&#34;antiviral&#34;,&#34;drug&#34;,&#34;asia&#34;,&#34;transmission&#34;,&#34;practice&#34;,&#34;infection&#34;,&#34;global&#34;,&#34;index&#34;,&#34;perception&#34;,&#34;air&#34;,&#34;security&#34;,&#34;acceptance&#34;,&#34;medicine&#34;,&#34;stock&#34;,&#34;distance&#34;,&#34;distancing&#34;,&#34;information&#34;,&#34;pneumonia&#34;,&#34;communication&#34;,&#34;resilience&#34;,&#34;screening&#34;,&#34;acute&#34;,&#34;perceived&#34;,&#34;attitude&#34;,&#34;virtual&#34;,&#34;outbreak&#34;,&#34;sars&#34;,&#34;sustainability&#34;,&#34;data&#34;,&#34;crisis&#34;,&#34;pollution&#34;,&#34;community&#34;,&#34;measures&#34;,&#34;fear&#34;,&#34;review&#34;,&#34;protective&#34;,&#34;support&#34;,&#34;behavior&#34;,&#34;emergency&#34;,&#34;response&#34;,&#34;financial&#34;,&#34;student&#34;,&#34;systems&#34;,&#34;epidemiology&#34;,&#34;life&#34;,&#34;quarantine&#34;,&#34;prevention&#34;,&#34;industry&#34;,&#34;intelligence&#34;,&#34;forecasting&#34;,&#34;smart&#34;,&#34;coping&#34;,&#34;pandemics&#34;,&#34;drugs&#34;,&#34;x-ray&#34;,&#34;travel&#34;,&#34;detection&#34;,&#34;survey&#34;,&#34;cancer&#34;,&#34;monitoring&#34;,&#34;bangladesh&#34;,&#34;hydroxychloroquine&#34;,&#34;physical&#34;,&#34;pakistan&#34;,&#34;immunity&#34;,&#34;molecular&#34;,&#34;decision&#34;,&#34;equipment&#34;,&#34;services&#34;,&#34;news&#34;,&#34;distress&#34;,&#34;price&#34;,&#34;service&#34;,&#34;factors&#34;,&#34;chest&#34;,&#34;destination&#34;,&#34;human&#34;,&#34;satisfaction&#34;,&#34;research&#34;,&#34;hospital&#34;,&#34;regression&#34;,&#34;diagnosis&#34;,&#34;vaccines&#34;,&#34;modelling&#34;,&#34;simulation&#34;,&#34;behaviour&#34;,&#34;covid19&#34;,&#34;mco&#34;,&#34;networks&#34;,&#34;transfer&#34;,&#34;supply&#34;,&#34;chain&#34;,&#34;environmental&#34;,&#34;therapy&#34;,&#34;corona&#34;,&#34;rapid&#34;,&#34;government&#34;,&#34;personal&#34;,&#34;stability&#34;,&#34;southeast&#34;,&#34;home&#34;,&#34;market&#34;,&#34;internet&#34;,&#34;motivation&#34;,&#34;fake&#34;,&#34;optimization&#34;,&#34;neurosurgery&#34;,&#34;events&#34;,&#34;energy&#34;,&#34;surgery&#34;,&#34;children&#34;,&#34;change&#34;,&#34;hiv&#34;,&#34;assessment&#34;,&#34;safety&#34;,&#34;population&#34;,&#34;strategy&#34;,&#34;prote
&lt;/script&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Top 1000 terms extracted from the author’s keywords
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$index_keywords %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-12&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-4&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Top 1000 terms extracted from the Scopus index keywords
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There are some weird symbols in the plot and the wordcloud; it would be better to remove them. However, I am too lazy to do that, so I will leave them 😃.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;These are some of the exploratory text analyses that can be done. The relevant terms may provide some insight into the current COVID-19 research in Malaysia. However, they by no means fully reflect our current COVID-19 research.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Hyperparameter tuning in tidymodels</title>
      <link>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</link>
      <pubDate>Sun, 05 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;This post will not go into much detail on each approach to hyperparameter tuning. It mainly aims to summarize a few things that I have studied over the last couple of days.
Generally, there are two approaches to hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Grid search:&lt;br /&gt;
– Regular grid search&lt;br /&gt;
– Random grid search&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Iterative search:&lt;br /&gt;
– Bayesian optimization&lt;br /&gt;
– Simulated annealing&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;grid-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Grid search&lt;/h2&gt;
&lt;p&gt;So, in grid search, we provide a set of parameter combinations and the algorithm evaluates each of them. There are two types of grid search:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regular grid search&lt;br /&gt;
– The algorithm will go through every combination of parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_regular(mtry(c(1, 13)), 
             trees(), 
             min_n(),
             levels = 3) # how many from each parameter&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 27 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     1     1     2
##  2     7     1     2
##  3    13     1     2
##  4     1  1000     2
##  5     7  1000     2
##  6    13  1000     2
##  7     1  2000     2
##  8     7  2000     2
##  9    13  2000     2
## 10     1     1    21
## # ... with 17 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Random grid search&lt;br /&gt;
– The algorithm will randomly select a number of parameter combinations instead of going through each of them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_random(mtry(c(1, 13)),
            trees(), 
            min_n(), 
            size = 100) # size of parameters combination&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 100 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     5  1216    40
##  2     8  1374    13
##  3     9   859    39
##  4     6   282    12
##  5     2  1210     9
##  6     8  1828    39
##  7    11   550    14
##  8    13  1157    32
##  9     5   282     6
## 10    10  1018    28
## # ... with 90 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, tidymodels uses a space-filling design to make sure the parameter combinations are roughly equidistant from each other.&lt;/p&gt;
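&lt;p&gt;As a side note, the dials package (loaded with tidymodels) also exposes space-filling designs directly. Below is a minimal, untested sketch using &lt;code&gt;grid_latin_hypercube()&lt;/code&gt; with the same parameter ranges as above; the size of 27 is just an illustrative choice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dials) # loaded automatically with tidymodels

# A Latin hypercube design spreads the candidates across the parameter space
grid_latin_hypercube(mtry(c(1, 13)), 
                     trees(), 
                     min_n(), 
                     size = 27) # number of candidate combinations&lt;/code&gt;&lt;/pre&gt;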
&lt;/div&gt;
&lt;div id=&#34;iterative-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Iterative search&lt;/h2&gt;
&lt;p&gt;In iterative search, we need to specify some initial parameters/values to start the search.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bayesian optimization&lt;br /&gt;
– This algorithm/function will search for the next best combination of parameters based on the previous combinations of parameters (the prior).&lt;/li&gt;
&lt;li&gt;Simulated annealing&lt;br /&gt;
– Generally, this algorithm works relatively similarly to Bayesian optimization.&lt;br /&gt;
– However, as the figure below illustrates, this algorithm is able to explore worse combinations of parameters for a short while (crossing the barrier of a local search) in order to find the best combination of parameters (the global minimum).
&lt;img src=&#34;images/sim-anneal.png&#34; alt=&#34;Simulated annealing&#34; /&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further details on iterative search and both methods above can be found &lt;a href=&#34;https://www.tmwr.org/iterative-search.html#iterative-search&#34;&gt;here&lt;/a&gt;. Since both iterative methods need starting parameters, we can actually combine them with any of the grid search methods.&lt;/p&gt;
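&lt;p&gt;For illustration, a simulated annealing search with the finetune package would look roughly like the sketch below. It is not run in this post; the object names are placeholders and the iteration and control settings are arbitrary choices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(finetune)

# Rough sketch only; wflow, dat_cv, and grid_results are placeholder objects
tune_sim_anneal(
  object = wflow,         # a tuneable workflow
  resamples = dat_cv,     # cross-validation folds
  iter = 30,              # number of search iterations
  initial = grid_results, # e.g. a previous grid search result
  control = control_sim_anneal(no_improve = 15, verbose = TRUE)
  )&lt;/code&gt;&lt;/pre&gt;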
&lt;/div&gt;
&lt;div id=&#34;other-methods&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Other methods&lt;/h2&gt;
&lt;p&gt;By default, if we do not supply any combinations of parameters, tidymodels will pick 10 combinations of parameters from the default range of values for the model. Additionally, we can set this to another value as shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_grid(
  resamples = dat_cv, # cross validation data set
  grid = 20,  # 20 combinations of parameters
  control = control, # some control parameters
  metrics = metrics # some metrics parameters (roc_auc, etc)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are other special cases of grid search: &lt;code&gt;tune_race_anova()&lt;/code&gt; and &lt;code&gt;tune_race_win_loss()&lt;/code&gt;. Both of these methods are supposed to be more efficient versions of grid search. In general, both methods evaluate the tuning parameters on a small initial set of resamples, and the combinations of parameters with the worst performance are eliminated, which makes the grid search more efficient. The main difference between the two methods is how the worst combinations of parameters are evaluated and eliminated.&lt;/p&gt;
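&lt;p&gt;Only &lt;code&gt;tune_race_anova()&lt;/code&gt; is demonstrated in the R codes below, so here is a minimal, untested sketch of what the win/loss variant might look like. The arguments mirror &lt;code&gt;tune_race_anova()&lt;/code&gt; and the object names are placeholders.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(finetune)

# Rough sketch only; wflow, dat_cv, and rand_grid are placeholder objects
tune_race_win_loss(
  wflow,                # a tuneable workflow
  resamples = dat_cv,   # cross-validation folds
  grid = rand_grid,     # a grid of candidate parameters
  control = control_race(verbose_elim = TRUE, save_pred = TRUE), 
  metrics = metric_set(roc_auc)
  )&lt;/code&gt;&lt;/pre&gt;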
&lt;/div&gt;
&lt;div id=&#34;r-codes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R codes&lt;/h2&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(tidymodels)
library(finetune)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will only use a small chunk of the data for ease of computation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Data
data(income, package = &amp;quot;kernlab&amp;quot;)

# Make data smaller for computation
set.seed(2021)
income2 &amp;lt;- 
  income %&amp;gt;% 
  filter(INCOME == &amp;quot;[75.000-&amp;quot; | INCOME == &amp;quot;[50.000-75.000)&amp;quot;) %&amp;gt;% 
  slice_sample(n = 600) %&amp;gt;% 
  mutate(INCOME = fct_drop(INCOME), 
         INCOME = fct_recode(INCOME, 
                             rich = &amp;quot;[75.000-&amp;quot;,
                             less_rich = &amp;quot;[50.000-75.000)&amp;quot;), 
         INCOME = factor(INCOME, ordered = F)) %&amp;gt;% 
  mutate(across(-INCOME, fct_drop))

# Summary of data
glimpse(income2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 600
## Columns: 14
## $ INCOME         &amp;lt;fct&amp;gt; less_rich, rich, rich, rich, less_rich, rich, rich, les~
## $ SEX            &amp;lt;fct&amp;gt; F, M, F, M, F, F, F, M, F, M, M, M, F, F, F, F, M, M, M~
## $ MARITAL.STATUS &amp;lt;fct&amp;gt; Married, Married, Married, Single, Single, NA, Married,~
## $ AGE            &amp;lt;ord&amp;gt; 35-44, 25-34, 45-54, 18-24, 18-24, 14-17, 25-34, 25-34,~
## $ EDUCATION      &amp;lt;ord&amp;gt; 1 to 3 years of college, Grad Study, College graduate, ~
## $ OCCUPATION     &amp;lt;fct&amp;gt; &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;~
## $ AREA           &amp;lt;ord&amp;gt; 10+ years, 7-10 years, 10+ years, -1 year, 4-6 years, 7~
## $ DUAL.INCOMES   &amp;lt;fct&amp;gt; Yes, Yes, Yes, Not Married, Not Married, Not Married, N~
## $ HOUSEHOLD.SIZE &amp;lt;ord&amp;gt; Five, Two, Four, Two, Four, Two, Three, Two, Five, One,~
## $ UNDER18        &amp;lt;ord&amp;gt; Three, None, None, None, None, None, One, None, Three, ~
## $ HOUSEHOLDER    &amp;lt;fct&amp;gt; Own, Own, Own, Rent, Family, Own, Own, Rent, Own, Own, ~
## $ HOME.TYPE      &amp;lt;fct&amp;gt; House, House, House, House, House, Apartment, House, Ho~
## $ ETHNIC.CLASS   &amp;lt;fct&amp;gt; White, White, White, White, White, White, White, White,~
## $ LANGUAGE       &amp;lt;fct&amp;gt; English, English, English, English, English, NA, Englis~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Outcome variable
table(income2$INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## less_rich      rich 
##       362       238&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Missing data
DataExplorer::plot_missing(income)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Split the data and create a 10-fold cross-validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
dat_index &amp;lt;- initial_split(income2, strata = INCOME)
dat_train &amp;lt;- training(dat_index)
dat_test &amp;lt;- testing(dat_index)

## CV
set.seed(2021)
dat_cv &amp;lt;- vfold_cv(dat_train, v = 10, repeats = 1, strata = INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to impute the NAs with the mode since all the variables are categorical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
dat_rec &amp;lt;- 
  recipe(INCOME ~ ., data = dat_train) %&amp;gt;% 
  step_impute_mode(all_predictors()) %&amp;gt;% 
  step_ordinalscore(AGE, EDUCATION, AREA, HOUSEHOLD.SIZE, UNDER18)

# Model
rf_mod &amp;lt;- 
  rand_forest(mtry = tune(),
              trees = tune(),
              min_n = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;)

# Workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(dat_rec) %&amp;gt;% 
  add_model(rf_mod)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parameters for the grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Regular grid
reg_grid &amp;lt;- grid_regular(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         levels = 3)

# Random grid
rand_grid &amp;lt;- grid_random(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         size = 100)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune the models using regular grid search. We are going to use the &lt;code&gt;doParallel&lt;/code&gt; library for parallel processing.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ctrl &amp;lt;- control_grid(save_pred = T,
                        extract = extract_model)
measure &amp;lt;- metric_set(roc_auc)  

# Parallel for regular grid
library(doParallel)

# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_regular &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = reg_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for regular grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     7  1000    21 roc_auc binary     0.690    10  0.0148 Preprocessor1_Model14
## 2     7  1000    40 roc_auc binary     0.689    10  0.0179 Preprocessor1_Model23
## 3     7  2000    40 roc_auc binary     0.689    10  0.0178 Preprocessor1_Model26
## 4     7  1000     2 roc_auc binary     0.688    10  0.0173 Preprocessor1_Model05
## 5     7  2000    21 roc_auc binary     0.688    10  0.0159 Preprocessor1_Model17&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune the models using random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for random grid
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_random &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for random grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_Model0~
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_Model0~
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_Model0~
## 4     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_Model0~
## 5     6   160    31 roc_auc binary     0.692    10  0.0172 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Random grid search has a slightly better result. Let’s use this random search result as a base for the iterative search. First, we limit the parameter ranges based on the plot from the random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_param &amp;lt;- 
  rf_wf %&amp;gt;% 
  parameters() %&amp;gt;% 
  update(mtry = mtry(c(5, 13)), 
         trees = trees(c(1, 500)), 
         min_n = min_n(c(5, 30)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we do Bayesian optimization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for bayesian optimization
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
bayes_tune &amp;lt;-  
  rf_wf %&amp;gt;% 
  tune_bayes(    
    resamples = dat_cv,
    param_info = rf_param,
    iter = 60,
    initial = tune_random, # result from random grid search        
    control = control_bayes(no_improve = 30, verbose = T, save_pred = T), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for Bayesian optimization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(bayes_tune, &amp;quot;performance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(bayes_tune)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 10
##    mtry trees min_n .metric .estimator  mean     n std_err .config         .iter
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;int&amp;gt;
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_~     0
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_~     0
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_~     0
## 4     6   189    15 roc_auc binary     0.693    10  0.0153 Iter1               1
## 5     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_~     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get a slightly better result from Bayesian optimization. I will not do the simulated annealing approach since I got an error, though I am not sure why.&lt;/p&gt;
&lt;p&gt;Lastly, we do a race ANOVA.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for race anova
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_efficient &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_race_anova(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = control_race(verbose_elim = T, save_pred = T), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get a relatively similar result to random grid search but with faster computation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     5  1425     5 roc_auc binary     0.695    10  0.0161 Preprocessor1_Model0~
## 2    11   406     2 roc_auc binary     0.694    10  0.0183 Preprocessor1_Model0~
## 3     6   631     3 roc_auc binary     0.692    10  0.0171 Preprocessor1_Model0~
## 4     7  1264     4 roc_auc binary     0.692    10  0.0159 Preprocessor1_Model0~
## 5     9  1264     3 roc_auc binary     0.692    10  0.0188 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also compare the ROC curves of all approaches; they all look more or less similar.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# regular grid
rf_reg &amp;lt;- 
  tune_regular %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

reg_auc &amp;lt;- 
  tune_regular %&amp;gt;% 
  collect_predictions(parameters = rf_reg) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;regular_grid&amp;quot;)

# random grid
rf_rand &amp;lt;- 
  tune_random %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

rand_auc &amp;lt;- 
  tune_random %&amp;gt;% 
  collect_predictions(parameters = rf_rand) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;random_grid&amp;quot;)

# bayes
rf_bayes &amp;lt;- 
  bayes_tune %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

bayes_auc &amp;lt;- 
  bayes_tune %&amp;gt;% 
  collect_predictions(parameters = rf_bayes) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;bayes&amp;quot;)

# race_anova
rf_eff &amp;lt;- 
  tune_efficient %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

eff_auc &amp;lt;- 
  tune_efficient %&amp;gt;% 
  collect_predictions(parameters = rf_eff) %&amp;gt;%
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;race_anova&amp;quot;)

# Compare ROC between all tuning approach
bind_rows(reg_auc, rand_auc, bayes_auc, eff_auc) %&amp;gt;% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = &amp;quot;plasma&amp;quot;, end = .6) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we fit our best model (from the Bayesian optimization) to the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize workflow
best_rf &amp;lt;-
  select_best(bayes_tune, &amp;quot;roc_auc&amp;quot;)

final_wf &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_rf)
final_wf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_impute_mode()
## * step_ordinalscore()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 4
##   trees = 1016
##   min_n = 4
## 
## Computational engine: ranger&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Last fit
test_fit &amp;lt;- 
  final_wf %&amp;gt;%
  last_fit(dat_index) 

# Evaluation metrics 
test_fit %&amp;gt;%
  collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary         0.583 Preprocessor1_Model1
## 2 roc_auc  binary         0.611 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_fit %&amp;gt;%
  collect_predictions() %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-22-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The result is not that good; our AUC is quite low. However, we used only about 8% of the overall data. Nonetheless, the aim of this post is to give an overview of hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;p&gt;Additionally, there are another two functions for constructing parameter grids that I did not cover in this post: &lt;code&gt;grid_max_entropy()&lt;/code&gt; and &lt;code&gt;grid_latin_hypercube()&lt;/code&gt;. There are not many resources explaining these functions (or at least I did not find them); for those interested, a good start is the tidymodels &lt;a href=&#34;https://dials.tidymodels.org/reference/grid_max_entropy.html&#34;&gt;website&lt;/a&gt;. A brief sketch of both is shown below.&lt;/p&gt;
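&lt;p&gt;As a minimal sketch (assuming the &lt;code&gt;rf_param&lt;/code&gt; parameter set defined earlier), both functions take a parameter set and a grid size, and the resulting grid can be passed to &lt;code&gt;tune_grid()&lt;/code&gt; via its &lt;code&gt;grid&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not run: space-filling grids from the dials package
set.seed(2021)
me_grid &amp;lt;- grid_max_entropy(rf_param, size = 20)
lh_grid &amp;lt;- grid_latin_hypercube(rf_param, size = 20)&lt;/code&gt;&lt;/pre&gt;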
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/grid-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/grid-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/iterative-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/iterative-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://oliviergimenez.github.io/learning-machine-learning/#&#34; class=&#34;uri&#34;&gt;https://oliviergimenez.github.io/learning-machine-learning/#&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Data exploration in R</title>
      <link>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</link>
      <pubDate>Sun, 22 Aug 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;These are some of the packages that I find useful for data exploration. Basically, this post serves more as a note to myself for future reference. I will list packages (and some awesome functions from each package) rather than specific functions. Base R and the tidyverse packages are not specifically included in this list.&lt;/p&gt;
&lt;p&gt;Load supporting packages&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data we are going to use comes from the dlookr package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dlookr) # the heartfailure data comes from dlookr
glimpse(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 299
## Columns: 13
## $ age               &amp;lt;int&amp;gt; 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 62, 45, ~
## $ anaemia           &amp;lt;fct&amp;gt; No, No, No, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, N~
## $ cpk_enzyme        &amp;lt;dbl&amp;gt; 582, 7861, 146, 111, 160, 47, 246, 315, 157, 123, 81~
## $ diabetes          &amp;lt;fct&amp;gt; No, No, No, No, Yes, No, No, Yes, No, No, No, No, No~
## $ ejection_fraction &amp;lt;dbl&amp;gt; 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 25, 30, ~
## $ hblood_pressure   &amp;lt;fct&amp;gt; Yes, No, No, No, No, Yes, No, No, No, Yes, Yes, Yes,~
## $ platelets         &amp;lt;dbl&amp;gt; 265000, 263358, 162000, 210000, 327000, 204000, 1270~
## $ creatinine        &amp;lt;dbl&amp;gt; 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.10, 1.50~
## $ sodium            &amp;lt;dbl&amp;gt; 130, 136, 129, 137, 116, 132, 137, 131, 138, 133, 13~
## $ sex               &amp;lt;fct&amp;gt; Male, Male, Male, Male, Female, Male, Male, Male, Fe~
## $ smoking           &amp;lt;fct&amp;gt; No, No, Yes, No, No, Yes, No, Yes, No, Yes, Yes, Yes~
## $ time              &amp;lt;int&amp;gt; 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11, 11, 12~
## $ death_event       &amp;lt;fct&amp;gt; Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will create a few NAs in our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
heartfailure[sample(seq(nrow(heartfailure)), 20), &amp;quot;age&amp;quot;] &amp;lt;- NA
heartfailure[sample(seq(nrow(heartfailure)), 10), &amp;quot;sex&amp;quot;] &amp;lt;- NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;1) dataMaid&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dataMaid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One of the most useful functions in dataMaid is &lt;code&gt;makeDataReport()&lt;/code&gt;, which generates a report on the data. By default it produces a PDF, but other output formats such as Word and HTML are also available.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;makeDataReport(heartfailure, replace = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is an example of the output in &lt;a href=&#34;https://tengkuhanis.netlify.app/files/dataMaid_heartfailure.pdf&#34;&gt;pdf&lt;/a&gt;.&lt;/p&gt;
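&lt;p&gt;If a PDF is not what we want, we can ask for another format. Below is a small sketch assuming the &lt;code&gt;output&lt;/code&gt; argument of &lt;code&gt;makeDataReport()&lt;/code&gt; accepts &lt;code&gt;html&lt;/code&gt; (see the dataMaid documentation for the exact options).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not run: HTML report instead of the default PDF
makeDataReport(heartfailure, output = &amp;quot;html&amp;quot;, replace = TRUE)&lt;/code&gt;&lt;/pre&gt;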
&lt;p&gt;&lt;strong&gt;2) DataExplorer&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(DataExplorer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;General visualization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_intro()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since we have missing data, we can further visualize it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% profile_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              feature num_missing pct_missing
## 1                age          20  0.06688963
## 2            anaemia           0  0.00000000
## 3         cpk_enzyme           0  0.00000000
## 4           diabetes           0  0.00000000
## 5  ejection_fraction           0  0.00000000
## 6    hblood_pressure           0  0.00000000
## 7          platelets           0  0.00000000
## 8         creatinine           0  0.00000000
## 9             sodium           0  0.00000000
## 10               sex          10  0.03344482
## 11           smoking           0  0.00000000
## 12              time           0  0.00000000
## 13       death_event           0  0.00000000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also create a correlation plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  plot_correlation()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, I think the correlation plot from the corrplot package is cleaner. Here is a plot from corrplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(corrplot)

heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  cor() %&amp;gt;% 
  corrplot(type = &amp;quot;upper&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we can get an overall HTML report from the DataExplorer package using the function &lt;code&gt;create_report()&lt;/code&gt;, as sketched below.&lt;/p&gt;
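&lt;p&gt;A minimal sketch of that call is shown below; the &lt;code&gt;y&lt;/code&gt; argument (the response variable) is optional.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not run: overall HTML report from DataExplorer
heartfailure %&amp;gt;% create_report(y = &amp;quot;death_event&amp;quot;)&lt;/code&gt;&lt;/pre&gt;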
&lt;p&gt;&lt;strong&gt;3) dlookr&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dlookr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can assess the normality of the data using this package. The code below plots normality checks for all numeric variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, for the sake of simplicity in this post, we will run it for only one variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality(age)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can also get a correlation matrix plot from this package, and there is no need to remove the NAs or filter the numeric variables before running the function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_correlate()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, dlookr can produce an overall data exploration report in PDF (and other formats as well). The report is quite comprehensive; have a &lt;a href=&#34;https://tengkuhanis.netlify.app/files/EDA_Paged_Report.pdf&#34;&gt;look&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  eda_paged_report(target = &amp;quot;death_event&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4) skimr&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The skimr package, especially its &lt;code&gt;skim()&lt;/code&gt; function, does not display correctly with blogdown. Hence, I have included a screenshot of the output that we would typically see in the R console.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(skimr)
skim(heartfailure) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;images/black.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from skimr we get an overview that includes histograms for the numeric variables as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5) outliertree&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This package identifies outliers using a decision tree approach. I will not go into detail about the approach here, but those who want to read further can see the &lt;a href=&#34;https://arxiv.org/abs/2001.00636&#34;&gt;paper&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(outliertree)
outlier.tree(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Reporting top 2 outliers [out of 2 found]
## 
## row [251] - suspicious column: [creatinine] - suspicious value: [0.50]
##  distribution: 96.000% &amp;gt;= 0.70 - [mean: 1.35] - [sd: 1.22] - [norm. obs: 24]
##  given:
##      [cpk_enzyme] &amp;gt; [1610.00] (value: 2522.00)
## 
## 
## row [32] - suspicious column: [cpk_enzyme] - suspicious value: [23.00]
##  distribution: 98.958% &amp;gt;= 47.00 - [mean: 677.01] - [sd: 1321.86] - [norm. obs: 95]
##  given:
##      [death_event] = [Yes]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Outlier Tree model
##  Numeric variables: 7
##  Categorical variables: 6
## 
## Consists of 369 clusters, spread across 48 tree branches&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further explore the detected outliers using a histogram and a boxplot. Let’s do this for the variable creatinine.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# histogram
hist(heartfailure$creatinine, breaks = 50, col = &amp;quot;navy&amp;quot;,
     xlab = &amp;quot;Creatinine&amp;quot;, 
     main = &amp;quot;Creatinine level&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# boxplot
boxplot(heartfailure$creatinine)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I will probably delve into more detail about outlier detection and related R packages in the future. If I ever write a post about it, I will link it here.&lt;/p&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;These are some useful packages that I have found. I may edit this post in the future to add more data exploration packages. There are also Shiny apps for data exploration, though I think it is better to stick with a coded approach in data analysis and exploration, so I did not cover those apps in this post. Another thing to remember is to set the variable types accordingly prior to the data exploration, as sketched below.&lt;/p&gt;
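&lt;p&gt;For example, a quick way to convert columns to the intended type before exploring (here, assuming character columns that should be factors):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Convert all character columns to factors before exploring
heartfailure %&amp;gt;% mutate(across(where(is.character), as.factor))&lt;/code&gt;&lt;/pre&gt;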
&lt;p&gt;Hope this is useful!&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://github.com/ekstroem/dataMaid&#34; class=&#34;uri&#34;&gt;https://github.com/ekstroem/dataMaid&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&#34; class=&#34;uri&#34;&gt;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A summary of forcats package</title>
      <link>https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/</link>
      <pubDate>Tue, 18 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;img src=&#34;forcats_logo.png&#34; width=&#34;30%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I just watched a &lt;a href=&#34;https://youtu.be/qWYgNjnHNWI&#34;&gt;YouTube video by Andrew Couch&lt;/a&gt; about his commonly used functions in the readr, stringr, and forcats packages. Although I have used the forcats package before, I realised that I have not fully utilised all of its functions.&lt;/p&gt;
&lt;p&gt;So, in this post, I have summarised the main forcats functions that I find useful in my day-to-day R coding. Basically, it is more like a note to myself.&lt;/p&gt;
&lt;div id=&#34;main-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Main functions&lt;/h2&gt;
&lt;p&gt;We will use the &lt;a href=&#34;https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars&#34;&gt;mtcars data&lt;/a&gt; to demonstrate each function. forcats is part of the tidyverse packages, so it loads once we load the tidyverse.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
glimpse(mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 32
## Columns: 11
## $ mpg  &amp;lt;dbl&amp;gt; 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  &amp;lt;dbl&amp;gt; 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp &amp;lt;dbl&amp;gt; 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   &amp;lt;dbl&amp;gt; 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat &amp;lt;dbl&amp;gt; 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   &amp;lt;dbl&amp;gt; 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec &amp;lt;dbl&amp;gt; 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   &amp;lt;dbl&amp;gt; 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear &amp;lt;dbl&amp;gt; 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb &amp;lt;dbl&amp;gt; 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 9 forcats functions that I find very useful.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;factor()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;factor()&lt;/code&gt; changes a variable’s type into a factor (categorical) type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars$carb &amp;lt;- factor(mtcars$carb)
glimpse(mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 32
## Columns: 11
## $ mpg  &amp;lt;dbl&amp;gt; 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  &amp;lt;dbl&amp;gt; 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp &amp;lt;dbl&amp;gt; 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   &amp;lt;dbl&amp;gt; 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat &amp;lt;dbl&amp;gt; 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   &amp;lt;dbl&amp;gt; 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec &amp;lt;dbl&amp;gt; 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   &amp;lt;dbl&amp;gt; 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear &amp;lt;dbl&amp;gt; 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb &amp;lt;fct&amp;gt; 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_inorder()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function sorts factor levels based on the order of appearance in the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_inorder(mtcars$carb) # levels based on the order of appearance&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 1 2 3 6 8&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_infreq()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function sorts factor levels based on the frequency of values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_count(mtcars$carb) # this is forcats function as well, count factor level&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_infreq(mtcars$carb) # levels based on the frequency values&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 2 4 1 3 6 8&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_relevel()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function can be used to change the order manually.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_relevel(mtcars$carb, c(&amp;quot;8&amp;quot;, &amp;quot;6&amp;quot;, &amp;quot;4&amp;quot;, &amp;quot;3&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;1&amp;quot;)) # manually changed new levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 8 6 4 3 2 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_relevel()&lt;/code&gt; can also be used to move a single factor level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_relevel(mtcars$carb, &amp;quot;8&amp;quot;, after = 2) # change level 8 to the third place&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 1 2 8 3 4 6&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_reorder()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function changes the order based on another variable. Let’s reorder the levels of carb based on the values of the variable disp.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_reorder(mtcars$carb, mtcars$disp, .fun = sum, .desc = TRUE) # new level based on disp value&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 2 1 3 8 6&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars %&amp;gt;% 
  group_by(carb) %&amp;gt;% 
  summarise(sum_disp = sum(disp)) %&amp;gt;% 
  arrange(desc(sum_disp)) # this is basically what we do with fct_reorder() above&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   carb  sum_disp
##   &amp;lt;fct&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 4        3088.
## 2 2        2082.
## 3 1         940.
## 4 3         827.
## 5 8         301 
## 6 6         145&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additionally, &lt;code&gt;fct_reorder()&lt;/code&gt; can be used with plotting as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Original plot
ggplot(mtcars, aes(x = carb, y = disp)) +
  geom_col()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot with changed levels
mtcars %&amp;gt;% 
  mutate(carb = fct_reorder(carb, disp, .fun = sum, .desc = TRUE)) %&amp;gt;% 
  ggplot(aes(x = carb, y = disp)) +
  geom_col()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;ol start=&#34;6&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_lump()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function lumps infrequent factor levels into an &lt;code&gt;Other&lt;/code&gt; level. There are 5 variants of this function:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;fct_lump()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_min()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_n()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_lowfreq()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The remaining variant is &lt;code&gt;fct_lump_prop()&lt;/code&gt;. It is not part of the main examples below as I do not find it that useful in my current R coding routine, but a quick sketch of it is shown next.&lt;/p&gt;
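&lt;p&gt;For reference, here is a quick sketch of what it does: levels whose proportion of observations falls below the given threshold are lumped together.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Levels making up less than 10% of observations are lumped into one group
table(fct_lump_prop(mtcars$carb, prop = 0.1))&lt;/code&gt;&lt;/pre&gt;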
&lt;p&gt;&lt;code&gt;fct_lump()&lt;/code&gt; automatically lumps the low-frequency factor levels into one group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_count(mtcars$carb) # this is forcats function as well, count factor level&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_lump(mtcars$carb) %&amp;gt;% fct_count() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 4        10
## 4 Other     5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_min()&lt;/code&gt; lumps factor levels that appear fewer times than the given minimum into one group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_min(mtcars$carb, min = 2)) # group 6 and 8 lump into one group&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     1     2     3     4 Other 
##     7    10     3    10     2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_n()&lt;/code&gt; lumps all levels except the &lt;em&gt;n&lt;/em&gt; most frequent ones.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_n(mtcars$carb, n = 2)) # 2 frequent group only, others in one group&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     2     4 Other 
##    10    10    12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_lowfreq()&lt;/code&gt; lumps the least frequent levels into one group, while ensuring that this lumped group is still the smallest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_lowfreq(mtcars$carb, other_level = &amp;quot;low&amp;quot;)) # group low is still the smallest&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   1   2   4 low 
##   7  10  10   5&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;7&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_other()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;fct_other()&lt;/code&gt; is much like &lt;code&gt;fct_lump()&lt;/code&gt;, except that we manually choose which factor levels to keep; the rest are combined.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_other(mtcars$carb, keep = c(&amp;quot;8&amp;quot;, &amp;quot;6&amp;quot;))) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     6     8 Other 
##     1     1    30&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;8&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_recode()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function is used to rename or relabel factor levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_recode(mtcars$carb, hanis = &amp;quot;8&amp;quot;)) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     1     2     3     4     6 hanis 
##     7    10     3    10     1     1&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;9&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_relabel()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;fct_relabel()&lt;/code&gt; applies a function to the labels, which is extremely useful if we want to rename quite a number of factor levels at once.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(mtcars$carb) # original groups&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  1  2  3  4  6  8 
##  7 10  3 10  1  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_relabel(mtcars$carb, ~ c(&amp;quot;abu&amp;quot;, &amp;quot;ali&amp;quot;, &amp;quot;chong&amp;quot;, &amp;quot;siti&amp;quot;, &amp;quot;krish&amp;quot;, &amp;quot;lee&amp;quot;))) # new named groups&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   abu   ali chong  siti krish   lee 
##     7    10     3    10     1     1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference:&lt;br /&gt;
&lt;a href=&#34;https://forcats.tidyverse.org/index.html&#34; class=&#34;uri&#34;&gt;https://forcats.tidyverse.org/index.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Handling imbalanced data</title>
      <link>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</link>
      <pubDate>Fri, 14 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;overview&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;Imbalanced data happens when there is an unequal distribution of classes within a categorical outcome variable. Imbalanced data occurs for several reasons, such as a biased sampling method or measurement errors. However, the imbalance may also be an inherent characteristic of the data; for example, in a predictive model for a rare disease the imbalance is expected.&lt;/p&gt;
&lt;p&gt;Generally, there are two types of imbalance problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slight imbalance: the imbalance is small, like 4:6&lt;/li&gt;
&lt;li&gt;Severe imbalance: the imbalance is large, like 1:100 or more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A slight imbalance is usually not a concern, while a severe imbalance requires a more specialised method to build a predictive model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-problem&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;What’s the problem with imbalanced data?&lt;br /&gt;
Firstly, a predictive model trained on imbalanced data is biased towards the majority class. The minority class becomes harder to predict as there are few data points from this class, so the detection rate for the minority class will be very low.
Secondly, accuracy is not a good measure in this case. We may get a good accuracy, but in reality the accuracy does not reflect the unequal distribution of the data. This is known as the &lt;a href=&#34;https://en.wikipedia.org/wiki/Accuracy_paradox&#34;&gt;accuracy paradox&lt;/a&gt;. Imagine 90% of the data belong to the majority class, while the remaining 10% belong to the minority class. Just by predicting every observation as the majority class, the model easily gets 90% accuracy, as shown in the sketch below.&lt;/p&gt;
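&lt;p&gt;To make the accuracy paradox concrete, here is a tiny sketch in base R (the 90:10 split is made up for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration of the accuracy paradox
truth &amp;lt;- factor(c(rep(&amp;quot;majority&amp;quot;, 90), rep(&amp;quot;minority&amp;quot;, 10)))
pred  &amp;lt;- factor(rep(&amp;quot;majority&amp;quot;, 100), levels = levels(truth))

mean(pred == truth) # 0.9 accuracy, yet the minority class is never detected
table(truth, pred)&lt;/code&gt;&lt;/pre&gt;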
&lt;/div&gt;
&lt;div id=&#34;handling-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Handling approach&lt;/h2&gt;
&lt;p&gt;The easiest approach is to collect more data, though this may not be practical in every situation. Fortunately, there are a few machine learning techniques available to tackle this problem.&lt;/p&gt;
&lt;p&gt;Here is a summary of resampling techniques available in &lt;code&gt;themis&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;method-themis.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The over-sampling approach is preferred when the dataset is small. The under-sampling approach can be used when the dataset is large, though it may lead to a loss of information. Additionally, ensemble techniques such as random forest are said to be able to model imbalanced data, though some references/blogs say otherwise.&lt;/p&gt;
&lt;p&gt;So, we are going to compare four over-sampling techniques (upsample, SMOTE, ADASYN, and ROSE) and three under-sampling techniques (downsample, nearmiss, and tomek). The base model is a decision tree, which will be used with all the techniques. For the sake of simplicity, the decision trees are not extensively tuned. Additionally, a random forest is also included in the comparison.&lt;/p&gt;
&lt;p&gt;The dataset is from &lt;a href=&#34;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&#34;&gt;here&lt;/a&gt;. This is a summary of the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see from the summary, the variable admit is moderately imbalanced, with roughly a 1:2 ratio (127 vs 273).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(df, aes(admit)) + 
  geom_bar() +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/barplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Below is the code for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(magrittr)
library(tidymodels)
library(themis)

# Data
df &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&amp;quot;)

# Split data
set.seed(1234)
df_split &amp;lt;- initial_split(df)
df_train &amp;lt;- training(df_split)
df_test &amp;lt;- testing(df_split)

# 1) Decision tree ----

# Recipe
dt_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank)

df_train_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)
  
df_test_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
dt_mod &amp;lt;- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %&amp;gt;% 
  set_engine(&amp;quot;rpart&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
dt_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune &amp;lt;- 
  dt_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- dt_tune %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final &amp;lt;- 
  dt_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
dt_train &amp;lt;- 
  dt_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred = .pred_class)

# 2) Oversampling ----
## step_upsample() ----

# Recipe
up_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_upsample(admit,
                seed = 1234)

df_train_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_up &amp;lt;- vfold_cv(df_train_up)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_up &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  tune_grid(resamples = df_cv_up,
            metrics = metric_set(accuracy))

## Select best model
best_tune_up &amp;lt;- dt_tune_up %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  finalize_workflow(best_tune_up)

# Fit on train data
dt_train_up &amp;lt;- 
  dt_wf_final_up %&amp;gt;% 
  fit(data = df_train_up)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_up, new_data = df_test_rec_up)) %&amp;gt;% 
  rename(pred_up = .pred_class)

## step_smote() ----

# Recipe
smote_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_smote(admit, 
             seed = 1234)

df_train_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_smote &amp;lt;- vfold_cv(df_train_smote)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_smote &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  tune_grid(resamples = df_cv_smote,
            metrics = metric_set(accuracy))

## Select best model
best_tune_smote &amp;lt;- dt_tune_smote %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  finalize_workflow(best_tune_smote)

# Fit on train data
dt_train_smote &amp;lt;- 
  dt_wf_final_smote %&amp;gt;% 
  fit(data = df_train_smote)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_smote, new_data = df_test_rec_smote)) %&amp;gt;% 
  rename(pred_smote = .pred_class)

## step_rose() ----

# Recipe
rose_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_rose(admit, 
             seed = 1234)

df_train_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_rose &amp;lt;- vfold_cv(df_train_rose)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_rose &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  tune_grid(resamples = df_cv_rose,
            metrics = metric_set(accuracy))

## Select best model
best_tune_rose &amp;lt;- dt_tune_rose %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  finalize_workflow(best_tune_rose)

# Fit on train data
dt_train_rose &amp;lt;- 
  dt_wf_final_rose %&amp;gt;% 
  fit(data = df_train_rose)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_rose, new_data = df_test_rec_rose)) %&amp;gt;% 
  rename(pred_rose = .pred_class)

## step_adasyn() ----

# Recipe
adasyn_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_adasyn(admit, 
            seed = 1234)

df_train_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_adasyn &amp;lt;- vfold_cv(df_train_adasyn)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_adasyn &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  tune_grid(resamples = df_cv_adasyn,
            metrics = metric_set(accuracy))

## Select best model
best_tune_adasyn &amp;lt;- dt_tune_adasyn %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  finalize_workflow(best_tune_adasyn)

# Fit on train data
dt_train_adasyn &amp;lt;- 
  dt_wf_final_adasyn %&amp;gt;% 
  fit(data = df_train_adasyn)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_adasyn, new_data = df_test_rec_adasyn)) %&amp;gt;% 
  rename(pred_adasyn = .pred_class)

# 3) Undersampling ----
## step_downsample() ----

# Recipe
down_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_downsample(admit,
                seed = 1234)

df_train_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_down &amp;lt;- vfold_cv(df_train_down)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_down &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  tune_grid(resamples = df_cv_down,
            metrics = metric_set(accuracy))

## Select best model
best_tune_down &amp;lt;- dt_tune_down %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  finalize_workflow(best_tune_down)

# Fit on train data
dt_train_down &amp;lt;- 
  dt_wf_final_down %&amp;gt;% 
  fit(data = df_train_down)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_down, new_data = df_test_rec_down)) %&amp;gt;% 
  rename(pred_down = .pred_class)

## step_nearmiss() ----

# Recipe
nearmiss_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_nearmiss(admit,
                  seed = 1234)

df_train_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_nearmiss &amp;lt;- vfold_cv(df_train_nearmiss)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_nearmiss &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  tune_grid(resamples = df_cv_nearmiss,
            metrics = metric_set(accuracy))

## Select best model
best_tune_nearmiss &amp;lt;- dt_tune_nearmiss %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  finalize_workflow(best_tune_nearmiss)

# Fit on train data
dt_train_nearmiss &amp;lt;- 
  dt_wf_final_nearmiss %&amp;gt;% 
  fit(data = df_train_nearmiss)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_nearmiss, new_data = df_test_rec_nearmiss)) %&amp;gt;% 
  rename(pred_nearmiss = .pred_class)

## step_tomek() ----

# Recipe
tomek_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_tomek(admit,
                  seed = 1234)

df_train_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_tomek &amp;lt;- vfold_cv(df_train_tomek)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_tomek &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  tune_grid(resamples = df_cv_tomek,
            metrics = metric_set(accuracy))

## Select best model
best_tune_tomek &amp;lt;- dt_tune_tomek %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  finalize_workflow(best_tune_tomek)

# Fit on train data
dt_train_tomek &amp;lt;- 
  dt_wf_final_tomek %&amp;gt;% 
  fit(data = df_train_tomek)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_tomek, new_data = df_test_rec_tomek)) %&amp;gt;% 
  rename(pred_tomek = .pred_class)

# 4) Ensemble approach: random forest ----

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
rf_mod &amp;lt;- rand_forest(
 mtry = tune(),
 trees = tune(),
 min_n = tune()
 ) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(rf_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
rf_tune &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- rf_tune %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
rf_wf_final &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
rf_train &amp;lt;- 
  rf_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(rf_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred_rf = .pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;Now, let’s get the accuracy, sensitivity, specificity, and &lt;a href=&#34;https://en.wikipedia.org/wiki/Matthews_correlation_coefficient#Advantages_of_MCC_over_accuracy_and_F1_score&#34;&gt;Matthews correlation coefficient (MCC)&lt;/a&gt; for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all measurements
df_test$admit %&amp;lt;&amp;gt;% as_factor()
pred_col &amp;lt;- colnames(df_test)[5:13]
result &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
sensi &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
specif &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
mathew &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)

for (i in seq_along(pred_col)) {
  # accuracy
  result[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    accuracy(admit, df_test[,pred_col[i]])
  
  # sensitivity
  sensi[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    sensitivity(admit, df_test[,pred_col[i]])
  
  # specificity
  specif[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    specificity(admit, df_test[,pred_col[i]])
  
  # MCC
  mathew[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    mcc(admit, df_test[,pred_col[i]])
}

## Turn into dataframe
result  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;)) %&amp;gt;% 
  rename(model = name, 
         accuracy = .estimate) %&amp;gt;% 
  select(model, accuracy) %&amp;gt;% 
  mutate(model = factor(model,labels = 
                          c(
                            &amp;quot;1&amp;quot; = &amp;quot;base&amp;quot;,
                            &amp;quot;2&amp;quot; = &amp;quot;upsample&amp;quot;,
                            &amp;quot;3&amp;quot; = &amp;quot;smote&amp;quot;,
                            &amp;quot;4&amp;quot; = &amp;quot;rose&amp;quot;,
                            &amp;quot;5&amp;quot; = &amp;quot;adasyn&amp;quot;,
                            &amp;quot;6&amp;quot; = &amp;quot;downsample&amp;quot;,
                            &amp;quot;7&amp;quot; = &amp;quot;nearmiss&amp;quot;,
                            &amp;quot;8&amp;quot; = &amp;quot;tomek&amp;quot;,
                            &amp;quot;9&amp;quot; = &amp;quot;random_forest&amp;quot;
                            )
                        ))

sensi  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

specif %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

mathew %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

result %&amp;lt;&amp;gt;% 
  bind_cols(sensitive = sensi$.estimate, specific = specif$.estimate, mathew = mathew$.estimate)

# Plot the result
result %&amp;gt;% 
  pivot_longer(cols = 2:5, names_to = &amp;quot;measure&amp;quot;) %&amp;gt;% 
  ggplot(aes(x = model, y = value, fill = measure)) +
  geom_bar(position = &amp;quot;dodge&amp;quot;, stat = &amp;quot;identity&amp;quot;) +
  theme_bw() +
  coord_flip() +
  geom_text(aes(label = paste0(round(value*100, digits = 1), &amp;quot;%&amp;quot;)), 
            position = position_dodge(0.9), vjust = 0.3, size = 2.7, hjust = -0.1) +
  labs(title = &amp;quot;Comparison of unbalanced data techniques&amp;quot;, 
       x = &amp;quot;Techniques&amp;quot;, 
       y = &amp;quot;Performance&amp;quot;) +
  scale_fill_discrete(name = &amp;quot;Metrics:&amp;quot;,
                      labels = c(&amp;quot;Accuracy&amp;quot;, &amp;quot;MCC&amp;quot;, &amp;quot;Sensitivity&amp;quot;, &amp;quot;Specificity&amp;quot;)) +
  theme(legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/summary-measure2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see from the above plot that the base model (decision tree) clearly has a low detection rate for the minority class (specificity). All methods are able to increase the specificity, while sacrificing some accuracy and sensitivity. As mentioned earlier, accuracy is not a good metric for this kind of model (the accuracy paradox). MCC, on the other hand, takes into account all values of the confusion matrix: true positives, false positives, true negatives, and false negatives. Hence, MCC is more informative than accuracy (and the F score, which is not included in the plot for the sake of simplicity).&lt;/p&gt;
&lt;p&gt;Based on MCC, specificity, and sensitivity, the downsample approach probably gives the most balanced model. However, this does not mean the downsample technique is the best overall, as I believe each technique behaves differently from one dataset to another.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://themis.tidymodels.org/reference/index.html&#34; class=&#34;uri&#34;&gt;https://themis.tidymodels.org/reference/index.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&#34; class=&#34;uri&#34;&gt;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&#34; class=&#34;uri&#34;&gt;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exponentially Weighted Average in Deep Learning</title>
      <link>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</link>
      <pubDate>Sun, 09 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have been reading about loss functions and optimisers in deep learning for the last couple of days when I stumbled upon the term Exponentially Weighted Average (EWA). So, in this post I aim to explain my understanding of EWA.&lt;/p&gt;
&lt;div id=&#34;overview-of-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview of EWA&lt;/h2&gt;
&lt;p&gt;EWA is basically an important concept in deep learning and has been used in several optimisers to smooth out the noise in the data.&lt;/p&gt;
&lt;p&gt;Let’s see the formula for EWA:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;formula.png&#34; width=&#34;60%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the smoothed value at point &lt;em&gt;t&lt;/em&gt;, while &lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the data point at point &lt;em&gt;t&lt;/em&gt;. &lt;em&gt;B&lt;/em&gt; here is a hyperparameter that we need to tune in our network. So, the choice of &lt;em&gt;B&lt;/em&gt; determines over how many data points we effectively average when computing &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt;, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;beta.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
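&lt;p&gt;To make the recursion concrete, below is a minimal sketch in R of the update &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;B&lt;/em&gt; &lt;em&gt;V&lt;sub&gt;t-1&lt;/sub&gt;&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;) &lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt;. The data points and the choice of &lt;em&gt;B&lt;/em&gt; = 0.9 are made up for illustration only:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;s &amp;lt;- c(10, 12, 9, 14, 13, 15, 11)  # made-up data points
b &amp;lt;- 0.9                           # the hyperparameter B

v &amp;lt;- numeric(length(s))
v[1] &amp;lt;- (1 - b) * s[1]             # starting from V0 = 0
for (t in 2:length(s)) {
  v[t] &amp;lt;- b * v[t - 1] + (1 - b) * s[t]
}
v  # the smoothed series&lt;/code&gt;&lt;/pre&gt;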
&lt;/div&gt;
&lt;div id=&#34;ewa-in-deep-learnings-optimiser&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;EWA in deep learning optimisers&lt;/h2&gt;
&lt;p&gt;So, some of the optimisers that adopt EWA are listed below (the red box indicates the EWA part in each formula):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Stochastic gradient descent (SGD) with momentum&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The issue with SGD is the presence of noise while searching for the global minimum. So, SGD with momentum integrates EWA, which reduces this noise and helps the network converge faster.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;SGD-momentum2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
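&lt;p&gt;As a rough sketch of this idea (assuming the EWA form of momentum, where the velocity is an exponentially weighted average of past gradients; the toy objective, learning rate, and starting point are my own assumptions for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grad &amp;lt;- function(w) 2 * w  # gradient of the toy objective f(w) = w^2

w  &amp;lt;- 5     # starting parameter value
v  &amp;lt;- 0     # velocity (EWA of the gradients)
b  &amp;lt;- 0.9   # momentum term, the B in EWA
lr &amp;lt;- 0.1   # learning rate

for (step in 1:200) {
  v &amp;lt;- b * v + (1 - b) * grad(w)  # smooth the gradient with EWA
  w &amp;lt;- w - lr * v                 # parameter update
}
w  # w has moved towards the minimum at 0&lt;/code&gt;&lt;/pre&gt;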
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive delta (Adadelta) and Root Mean Square Propagation (RMSprop)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Adadelta and RMSprop were proposed in an attempt to solve the diminishing learning rate issue of the adaptive gradient (Adagrad) optimiser. The use of EWA in both optimisers actually helps to achieve this. Both optimisers have quite similar formulas, but attached below is the formula for Adadelta.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;adadelta2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive moment estimation (ADAM)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ADAM basically combines SGD with momentum and Adadelta. As shown earlier, both of those optimisers use EWA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;more-details-on-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;More details on EWA&lt;/h2&gt;
&lt;p&gt;Now, let’s go back to EWA. Here is an example of the EWA calculation:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq1.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Keep in mind that &lt;em&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt; is the latest time point, followed by &lt;em&gt;t&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;t&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt;, respectively. So, if we want to calculate &lt;em&gt;V&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq2.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, if we want to vary the value of &lt;em&gt;B&lt;/em&gt; across the equation (while the values of &lt;em&gt;a&lt;sub&gt;1&lt;/sub&gt;…a&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt; remain constant), we can do so in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) 

func &amp;lt;- function(b) (1 - b) * b^((20:1) - 1)
beta &amp;lt;- seq(0.1, 0.9, by=0.2)

dat &amp;lt;- t(sapply(beta, func)) %&amp;gt;% 
  as.data.frame()
colnames(dat)[1:20] &amp;lt;- 1:20

dat %&amp;gt;%  
  mutate(beta = as_factor(beta)) %&amp;gt;%
  pivot_longer(cols = 1:20, names_to = &amp;quot;data_point&amp;quot;, values_to = &amp;quot;weight&amp;quot;) %&amp;gt;% 
  ggplot(aes(x=as.numeric(data_point), y=weight, color=beta)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:20) +
  labs(title = &amp;quot;Change of Exponentially Weighted Average function&amp;quot;, 
       subtitle = &amp;quot;Time at t20 is the recent time, and t1 is the initial time&amp;quot;) +
  scale_colour_discrete(&amp;quot;Beta:&amp;quot;) +
  xlab(&amp;quot;Time(t)&amp;quot;) +
  ylab(&amp;quot;Weights/Coefficients&amp;quot;) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that time at t&lt;sub&gt;20&lt;/sub&gt; is the recent time, and t&lt;sub&gt;1&lt;/sub&gt; is the initial time. Thus, two main points from the above plot are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The EWA weights decay exponentially as we move back in time.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As beta, &lt;em&gt;B&lt;/em&gt;, increases, the weights are spread over more of the past data points (roughly the last 1/(1 - &lt;em&gt;B&lt;/em&gt;) points), so the smoothed value changes more slowly and the most recent data point gets relatively less weight.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Side note: I tried to make the plot in plotly, but I am not sure why it did not work&lt;/em&gt; 😕&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
1) &lt;a href=&#34;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&lt;/a&gt; (all the equations are from this reference)&lt;br /&gt;
2) &lt;a href=&#34;https://youtu.be/NxTFlzBjS-4&#34; class=&#34;uri&#34;&gt;https://youtu.be/NxTFlzBjS-4&lt;/a&gt;&lt;br /&gt;
3) &lt;a href=&#34;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&#34; class=&#34;uri&#34;&gt;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Base R vs tidyverse</title>
      <link>https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/</link>
      <pubDate>Tue, 04 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;First of all, this write-up is meant for beginners in R.&lt;/p&gt;
&lt;p&gt;Things can be done in many ways in R. In fact, R is very flexible in this regard compared to other statistical software. Basic things such as selecting a column, slicing rows, or filtering data based on a certain condition can be done using base R functions. However, all these things can also be done using a tidyverse approach.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;Tidyverse&lt;/a&gt; is basically a collection of packages that can be loaded with a single line of code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tidyverse is developed by the “RStudio people”, pioneered by &lt;a href=&#34;http://hadley.nz/&#34;&gt;Hadley Wickham&lt;/a&gt;, which means that these packages will be continuously maintained and updated.&lt;/p&gt;
&lt;p&gt;So, without further ado, here are comparisons between these two approaches for some very basic tasks:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Select or deselect a column and a row&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[1:5, c(&amp;quot;Sepal.Length&amp;quot;, &amp;quot;Sepal.Width&amp;quot;)]
iris[1:5,c(1,2)] # similar to above
iris[1:5, -1]

# Tidyverse
iris %&amp;gt;% 
  select(Sepal.Length, Sepal.Width) %&amp;gt;% 
  slice(1:5)
iris %&amp;gt;% 
  select(-Sepal.Length) %&amp;gt;% 
  slice(1:5)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Filter based on condition&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[iris$Species == &amp;quot;setosa&amp;quot;, ]

# Tidyverse
iris %&amp;gt;% 
  filter(Species == &amp;quot;setosa&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Mutate a variable (transmute keeps only the newly created variables)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris$SL_minus10 &amp;lt;- iris$Sepal.Length - 10

# Tidyverse
iris %&amp;gt;% 
  mutate(SL_minus10 = Sepal.Length - 10)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Sort variable&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[order(-iris$Sepal.Width),]

# Tidyverse
iris %&amp;gt;% 
  arrange(desc(Sepal.Width))&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Group by (and get mean for variable Sepal.Width)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not really base R
doBy::summaryBy(Sepal.Width~Species, iris, FUN = mean) 

# Tidyverse
iris %&amp;gt;% 
  group_by(Species) %&amp;gt;% 
  summarise(mean_SW = mean(Sepal.Width))&lt;/code&gt;&lt;/pre&gt;
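&lt;p&gt;For a purely base R alternative, &lt;code&gt;aggregate()&lt;/code&gt; can produce the same summary; a quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)&lt;/code&gt;&lt;/pre&gt;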
&lt;ol start=&#34;6&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Rename variable&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
colnames(iris)[5] &amp;lt;- &amp;quot;hanis&amp;quot;

# Tidyverse (new_name = old_name)
iris %&amp;gt;% 
  rename(hanis = Species)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, that’s it. Overall, tidyverse gives clarity when reading the code, as it reads from left to right. In contrast, the base R approach reads from the inside out, especially for more complicated code.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Loop vs apply in R</title>
      <link>https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/</link>
      <pubDate>Tue, 04 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have heard quite a few times that the apply functions are faster than loops in R. Loops are said to be inefficient, though in certain situations a loop is the only way.&lt;/p&gt;
&lt;p&gt;Let’s compare a loop and an apply function in R.&lt;/p&gt;
&lt;p&gt;First, make a very big fake dataset containing a list of vectors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
xlist &amp;lt;- list(col1 = rnorm(10000000), 
              col2 = rnorm(10000000),
              col3 = rnorm(100000000),
              col4 = rnorm(1000000)) # this will take a few seconds&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, calculate the mean of each vector using a &lt;code&gt;for&lt;/code&gt; loop.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ptm &amp;lt;- proc.time() #-- start the clock

mean_loop &amp;lt;- vector(&amp;quot;list&amp;quot;, length(xlist)) # pre-allocate a placeholder for the results
for (i in seq_along(xlist)) {
  mean_loop[[i]] &amp;lt;- mean(xlist[[i]])
}

proc.time() - ptm #-- stop the clock (time in seconds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    user  system elapsed 
##    0.38    0.00    0.37&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, using the &lt;code&gt;lapply()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ptm &amp;lt;- proc.time() #-- start the clock

mean_apply &amp;lt;- lapply(xlist, mean)

proc.time() - ptm #-- stop the clock&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    user  system elapsed 
##    0.34    0.00    0.35&lt;/code&gt;&lt;/pre&gt;
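&lt;p&gt;As a quick sanity check (a small sketch of my own), we can confirm that both approaches produce the same means:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# lapply() keeps the element names, so drop them before comparing
identical(unlist(mean_loop), unname(unlist(mean_apply)))  # should return TRUE&lt;/code&gt;&lt;/pre&gt;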
&lt;p&gt;So, &lt;code&gt;lapply()&lt;/code&gt; is a little bit faster. Obviously, with a very big dataset and a more complicated objective, &lt;code&gt;lapply()&lt;/code&gt; is the right choice, but for a “normal”-sized dataset, the choice between the two probably does not make much difference.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How many Malaysian should be vaccinated to get herd immunity from COVID-19?</title>
      <link>https://tengkuhanis.netlify.app/post/how-many-malaysian-should-be-vaccinated-to-get-herd-immunity-from-covid-19/</link>
      <pubDate>Mon, 07 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/how-many-malaysian-should-be-vaccinated-to-get-herd-immunity-from-covid-19/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/how-many-malaysian-should-be-vaccinated-to-get-herd-immunity-from-covid-19/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Recently I read an &lt;a href=&#34;https://codeblue.galencentre.org/2020/11/27/malaysia-buying-pfizers-ultra-cold-covid-19-vaccine-posing-major-distribution-issues/#:~:text=According%20to%20BioSpace%2C%20the%20Covid,price%20sold%20to%20the%20US&#34;&gt;article&lt;/a&gt; stating that the Malaysian government has made a deal with Pfizer for 6.4 million Malaysians to be vaccinated. So, I am wondering what the minimum number of people to be vaccinated should be.&lt;/p&gt;
&lt;p&gt;I have also come across this interesting &lt;a href=&#34;https://www.cebm.net/covid-19/when-will-it-be-over-an-introduction-to-viral-reproduction-numbers-r0-and-re/&#34;&gt;article&lt;/a&gt;, which explains how we can calculate the minimum proportion of people to be vaccinated to achieve herd immunity based on R naught (R&lt;sub&gt;0&lt;/sub&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R naught (R&lt;sub&gt;0&lt;/sub&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The basic idea of R&lt;sub&gt;0&lt;/sub&gt; or basic reproduction number is quite simple. It describes how many secondary infections will derive from the first case. I think Figure 1 below describes this idea very well.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;r0.png&#34; alt=&#34;Basic idea of R~0~(image from https://www.atrainceu.com/content/3-basic-reproduction-number-r-naught)&#34; width=&#34;60%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Basic idea of R&lt;sub&gt;0&lt;/sub&gt;(image from &lt;a href=&#34;https://www.atrainceu.com/content/3-basic-reproduction-number-r-naught&#34; class=&#34;uri&#34;&gt;https://www.atrainceu.com/content/3-basic-reproduction-number-r-naught&lt;/a&gt;)
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So, R&lt;sub&gt;0&lt;/sub&gt; can be affected by a few factors, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;proportion of susceptible people at the initial outbreak&lt;/li&gt;
&lt;li&gt;infectiousness of the virus or the disease&lt;/li&gt;
&lt;li&gt;rate of recovery or death&lt;/li&gt;
&lt;li&gt;and a few other factors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When R&lt;sub&gt;0&lt;/sub&gt; is above 1, the spread of the disease will increase, while an R&lt;sub&gt;0&lt;/sub&gt; below 1 indicates the spread of the disease will decrease and eventually die out.&lt;/p&gt;
&lt;p&gt;However, I noticed that quite a few parties, including KKM (Ministry of Health, Malaysia), have used the term R&lt;sub&gt;0&lt;/sub&gt; in their reports instead of R&lt;sub&gt;e&lt;/sub&gt; or R&lt;sub&gt;t&lt;/sub&gt;, which is the effective reproduction number or time-varying reproduction number. R&lt;sub&gt;0&lt;/sub&gt; refers to the initial reproduction number at the beginning of the outbreak. The “naught” or “zero” in R naught (R&lt;sub&gt;0&lt;/sub&gt;) refers to a population that has zero immunity to the disease.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Herd immunity&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Herd immunity is said to occur when a significant proportion of the population is immunized. Subsequently, those who are susceptible (not immunized) will be protected.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How many should be vaccinated&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;So, back to the initial topic. We can use the formula below to answer this question.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[P_i &amp;gt; 1 - \frac{1}{R_0}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;P&lt;sub&gt;i&lt;/sub&gt; refers to the proportion of the population that should be immunized or, in this case, vaccinated.&lt;/p&gt;
&lt;p&gt;So, after googling, I found one calculation by my lecturer in the Biostat Unit, USM, &lt;a href=&#34;https://wnarifin.github.io/&#34;&gt;Dr Wan Arifin&lt;/a&gt;, and his colleague. The R&lt;sub&gt;0&lt;/sub&gt; based on his &lt;a href=&#34;https://wnarifin.github.io/covid-19-malaysia-sir/&#34;&gt;calculation&lt;/a&gt; is 2.673. I also found another &lt;a href=&#34;https://codeblue.galencentre.org/2020/04/10/mco-slashed-malaysia-covid-19-infection-rate-by-over-three-times/&#34;&gt;article&lt;/a&gt; reporting that the R&lt;sub&gt;0&lt;/sub&gt; was 3.55 in March, according to KKM.&lt;/p&gt;
&lt;p&gt;Malaysia’s population is estimated at &lt;a href=&#34;https://www.dosm.gov.my/v1/index.php?r=column/cthemeByCat&amp;amp;cat=155&amp;amp;bul_id=OVByWjg5YkQ3MWFZRTN5bDJiaEVhZz09&amp;amp;menu_id=L0pheU43NWJwRWVSZklWdzQ4TlhUUT09&#34;&gt;32.7 million&lt;/a&gt; by the Department of Statistics, Malaysia (DOSM). So, using the formula above, about 63% to 72% of the Malaysian population should be vaccinated, which translates to about 20.6 to 23.5 million people.&lt;/p&gt;
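&lt;p&gt;A minimal sketch of this calculation in R, using the two R&lt;sub&gt;0&lt;/sub&gt; estimates and the population figure quoted above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r0 &amp;lt;- c(2.673, 3.55)   # R0 estimates quoted above
population &amp;lt;- 32.7e6   # DOSM population estimate

p_i &amp;lt;- 1 - 1 / r0      # minimum proportion to vaccinate (about 63% and 72%)
p_i * population       # minimum number of people to vaccinate&lt;/code&gt;&lt;/pre&gt;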
&lt;p&gt;The deal that the Malaysian government made with Pfizer is far from enough, but of course, this is a very good and quick decision. We also have other vaccines like Moderna’s vaccine coming up.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: This is just my opinion. Please take it with a massive grain of salt.&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
