<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/post/</link>
      <atom:link href="https://tengkuhanis.netlify.app/post/index.xml" rel="self" type="application/rss+xml" />
    <description>Posts</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Wed, 07 Aug 2024 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>Posts</title>
      <link>https://tengkuhanis.netlify.app/post/</link>
    </image>
    
    <item>
      <title>Basic plotting with Matplotlib and Seaborn</title>
      <link>https://tengkuhanis.netlify.app/post/basic-plotting-in-python/</link>
      <pubDate>Wed, 07 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/basic-plotting-in-python/</guid>
      <description>


&lt;p&gt;This post is a continuation of my previous post about Python. For those interested:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/&#34;&gt;Basic data wrangling with Python&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Basic plotting with matplotlib and seaborn&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Comparison of ggplot in R versus in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are several packages or libraries available in Python for plotting and visualization. However, the most commonly used package is &lt;a href=&#34;https://matplotlib.org/&#34;&gt;matplotlib&lt;/a&gt;. This package is quite extensive and can often be quite complicated to use. Thus, the &lt;a href=&#34;https://seaborn.pydata.org/&#34;&gt;seaborn&lt;/a&gt; package is an alternative and a complement to matplotlib. Seaborn is built on top of matplotlib and provides higher-level functionality compared to matplotlib.&lt;/p&gt;
&lt;p&gt;So, in this blog post, let us compare several basic plots using both packages.&lt;/p&gt;
&lt;div id=&#34;load-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load packages&lt;/h2&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;load-dataset&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load dataset&lt;/h2&gt;
&lt;p&gt;We are going to use the &lt;a href=&#34;https://www.kaggle.com/datasets/arshid/dat-flower-dataset&#34;&gt;iris dataset&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;dat = sns.load_dataset(&amp;#39;iris&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can then take a quick look at the first few rows of this dataset.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;dat.head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;histogram&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Histogram&lt;/h2&gt;
&lt;p&gt;Let’s plot the histogram using matplotlib first.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.hist(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Notice that this histogram does not have any labels. So, to add labels, we need to do it manually.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.hist(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.xlabel(&amp;#39;Sepal length&amp;#39;) #x-axis label
plt.ylabel(&amp;#39;Frequency&amp;#39;) #y-axis label
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-5-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, using seaborn, the label is extracted from the variable name, which is pretty convenient.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.histplot(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-6-5.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Let’s say we want to plot the histogram separately for each species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;species = [&amp;#39;setosa&amp;#39;, &amp;#39;versicolor&amp;#39;, &amp;#39;virginica&amp;#39;]

for i in species:
    subset = dat[dat[&amp;#39;species&amp;#39;] == i]
    plt.hist(subset[&amp;#39;sepal_length&amp;#39;], label = i)

plt.legend(loc = &amp;#39;upper right&amp;#39;)
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Frequency&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-7-7.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The code above is quite long. In seaborn, the same histogram can be generated much more easily.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.histplot(x = &amp;#39;sepal_length&amp;#39;, hue = &amp;#39;species&amp;#39;, data = dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-8-9.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;boxplot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Boxplot&lt;/h2&gt;
&lt;p&gt;First, let’s do a boxplot using matplotlib.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;bp = plt.boxplot(dat[&amp;#39;sepal_length&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-9-11.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we want to do a boxplot grouped by another variable, the code becomes a bit complicated, especially for beginners.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;species = dat.groupby(&amp;#39;species&amp;#39;)
setosa = species.get_group(&amp;#39;setosa&amp;#39;)[&amp;#39;sepal_length&amp;#39;]
versicolor = species.get_group(&amp;#39;versicolor&amp;#39;)[&amp;#39;sepal_length&amp;#39;]
virginica = species.get_group(&amp;#39;virginica&amp;#39;)[&amp;#39;sepal_length&amp;#39;]

bp = plt.boxplot([setosa, versicolor, virginica], labels = [&amp;#39;setosa&amp;#39;, &amp;#39;versicolor&amp;#39;, &amp;#39;virginica&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-10-13.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Both plots above are quite easy to do in seaborn. Below is the code for the basic boxplot.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.boxplot(dat[&amp;#39;sepal_length&amp;#39;])
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-11-15.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, plotting &lt;code&gt;sepal_length&lt;/code&gt; by &lt;code&gt;species&lt;/code&gt; is pretty straightforward in seaborn.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.boxplot(y=&amp;#39;sepal_length&amp;#39;, hue=&amp;#39;species&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-12-17.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;scatter-plot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scatter plot&lt;/h2&gt;
&lt;p&gt;Lastly, let’s see the scatter plot using matplotlib.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.scatter(x=dat[&amp;#39;sepal_length&amp;#39;], y=dat[&amp;#39;sepal_width&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Sepal width&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-13-19.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can further extend this plot by categorising it into different species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Define the species to colors mapping
species_to_color = {&amp;#39;setosa&amp;#39;: &amp;#39;blue&amp;#39;, &amp;#39;versicolor&amp;#39;: &amp;#39;green&amp;#39;, &amp;#39;virginica&amp;#39;: &amp;#39;red&amp;#39;}
colors = dat[&amp;#39;species&amp;#39;].map(species_to_color)

# Create the scatter plot
plt.scatter(x=dat[&amp;#39;sepal_length&amp;#39;], y=dat[&amp;#39;sepal_width&amp;#39;], c=colors)
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Sepal width&amp;#39;)
plt.legend(handles=[plt.Line2D([0], [0], marker=&amp;#39;o&amp;#39;, color=&amp;#39;w&amp;#39;, markerfacecolor=color, markersize=10, label=species) for species, color in species_to_color.items()], title=&amp;#39;Species&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-14-21.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now, let’s see the seaborn package. This is the basic scatter plot.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.scatterplot(x=&amp;#39;sepal_length&amp;#39;, y=&amp;#39;sepal_width&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-15-23.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Extending this plot by categorising it into the different species is actually quite simple in seaborn.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.scatterplot(x=&amp;#39;sepal_length&amp;#39;, y=&amp;#39;sepal_width&amp;#39;, hue=&amp;#39;species&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-16-25.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, matplotlib and seaborn complement each other well. Seaborn is an excellent choice for quick and standard plots, thanks to its high-level interface. On the other hand, matplotlib offers a more extensive range of customization options and is ideal for creating complex and detailed visualizations. Ultimately, choosing between matplotlib and seaborn depends on the specific requirements of the visualization task.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Basic data wrangling with Python</title>
      <link>https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/</link>
      <pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/</guid>
      <description>


&lt;p&gt;&lt;img src=&#34;images/img2.jpeg&#34; style=&#34;width:60.0%;height:40.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Python is one of the most popular programming languages. In this post, I will demonstrate how to do basic data wrangling with Python. This is going to be the first of a (hopefully) short series of posts related to Python. My plan is to cover these topics:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Basic data wrangling with Python&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Basic plotting with matplotlib and seaborn&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Comparison of ggplot in R versus in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once I finish writing any of these topics, I will link it in the list above.&lt;/p&gt;
&lt;p&gt;So, let’s start.&lt;/p&gt;
&lt;div id=&#34;loading-necessary-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading necessary packages&lt;/h2&gt;
&lt;p&gt;Before loading the packages, you need to install them. Basically, there are two ways to install Python packages: with the pip command or with the conda command. I will skip this part, but you can refer to &lt;a href=&#34;https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-packages&#34;&gt;this link to install packages using the pip command&lt;/a&gt; or &lt;a href=&#34;https://conda.io/projects/conda/en/latest/user-guide/concepts/installing-with-conda.html#installing-with-conda&#34;&gt;this link to install packages using the conda command&lt;/a&gt;. For those who have both R and Python on their machine, I suggest using the conda command.&lt;/p&gt;
&lt;p&gt;Let’s load the required packages.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import numpy as np 
import pandas as pd
from seaborn import load_dataset&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All the functions from each package can be accessed through the alias, the abbreviated name above. For example, functions in the &lt;code&gt;pandas&lt;/code&gt; package can be accessed through &lt;code&gt;pd&lt;/code&gt; or, to be specific, the &lt;code&gt;pd.&lt;/code&gt; prefix. You will see this many times throughout this blog post, so do not worry much about it; I am sure you will get the gist of it once you see it later on. In practice, you don’t actually have to use &lt;code&gt;pd&lt;/code&gt; for &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;np&lt;/code&gt; for &lt;code&gt;numpy&lt;/code&gt;, but this is a convention or standard practice widely adopted in the Python community.&lt;/p&gt;
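&lt;p&gt;As a quick illustration (a tiny made-up snippet, not part of the analysis below), the two calls here are equivalent:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import pandas
import pandas as pd

# Both lines call the same function; pd is simply a shorter alias for pandas
s1 = pandas.Series([1, 2, 3])
s2 = pd.Series([1, 2, 3])&lt;/code&gt;&lt;/pre&gt;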
&lt;/div&gt;
&lt;div id=&#34;load-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load the data&lt;/h2&gt;
&lt;p&gt;We are going to use the &lt;a href=&#34;https://archive.ics.uci.edu/dataset/53/iris&#34;&gt;iris dataset&lt;/a&gt;. This dataset is readily available in the seaborn package.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris = load_dataset(&amp;#39;iris&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we load the data, we should check the variable types.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.dtypes&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length    float64
## sepal_width     float64
## petal_length    float64
## petal_width     float64
## species          object
## dtype: object&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The species variable should, by right, be a categorical variable. So, we can use &lt;code&gt;Categorical()&lt;/code&gt; from &lt;code&gt;pandas&lt;/code&gt; to change it from an object type to a category. The &lt;code&gt;pd.&lt;/code&gt; here means we access the function from the &lt;code&gt;pandas&lt;/code&gt; package, as I explained previously.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;] = pd.Categorical(iris[&amp;#39;species&amp;#39;])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we check the variable types again, we can see that the species variable is now a category.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.dtypes&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length     float64
## sepal_width      float64
## petal_length     float64
## petal_width      float64
## species         category
## dtype: object&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can also see the data. Let’s see the first 10 rows.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          3.9           1.7          0.4  setosa
## 6           4.6          3.4           1.4          0.3  setosa
## 7           5.0          3.4           1.5          0.2  setosa
## 8           4.4          2.9           1.4          0.2  setosa
## 9           4.9          3.1           1.5          0.1  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;slicing-and-indexing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Slicing and indexing&lt;/h2&gt;
&lt;p&gt;To see a specific column, we can index it as below. Notice that the row numbers start at 0, as opposed to R (if you have used R previously), in which the row numbers start at 1.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_length&amp;#39;][0:10]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, we can also index as below to get the first 10 rows of the sepal_length variable.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_length&amp;#39;][:10]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, to access the first 5 rows, we can do as below.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[0:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also use the &lt;code&gt;iloc()&lt;/code&gt; and &lt;code&gt;loc()&lt;/code&gt; functions. The main difference between the two is that &lt;code&gt;iloc()&lt;/code&gt; accepts only integer positions, while &lt;code&gt;loc()&lt;/code&gt; accepts labels such as column names.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.iloc[0:2, 0:3] #rows, then columns&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length
## 0           5.1          3.5           1.4
## 1           4.9          3.0           1.4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.loc[0:2, [&amp;#39;sepal_length&amp;#39;, &amp;#39;species&amp;#39;]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length species
## 0           5.1  setosa
## 1           4.9  setosa
## 2           4.7  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Subsequently, we can also slice according to a logical condition. Below, we select the values of petal_length that are above 6.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;ind = iris[&amp;#39;petal_length&amp;#39;] &amp;gt; 6
iris[&amp;#39;petal_length&amp;#39;][ind]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 105    6.6
## 107    6.3
## 109    6.1
## 117    6.7
## 118    6.9
## 122    6.7
## 130    6.1
## 131    6.4
## 135    6.1
## Name: petal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s say we want our data to include only the setosa species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;ind = iris[&amp;#39;species&amp;#39;] == &amp;#39;setosa&amp;#39;
iris.loc[ind, :].head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we know about slicing and indexing, we can use this knowledge to change certain values. For example, below we change:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rows 1, 2, 3, and 4 of sepal_length to NA values&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;row 6 of species and sepal_width to NA values&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.loc[0:3, &amp;#39;sepal_length&amp;#39;] = np.nan 
iris.iloc[5, [1, 4]] = np.nan&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the result.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.head(6)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           NaN          3.5           1.4          0.2  setosa
## 1           NaN          3.0           1.4          0.2  setosa
## 2           NaN          3.2           1.3          0.2  setosa
## 3           NaN          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          NaN           1.7          0.4     NaN&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;missing-values&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Missing values&lt;/h2&gt;
&lt;p&gt;If we want to see whether we have any missing values in our data, we can use the &lt;code&gt;isnull()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().any().any() #For overall&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## True&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().any() #Check for each column&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length     True
## sepal_width      True
## petal_length    False
## petal_width     False
## species          True
## dtype: bool&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further calculate how many missing values we have.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().sum()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length    4
## sepal_width     1
## petal_length    0
## petal_width     0
## species         1
## dtype: int64&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;descriptive-statistics&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Descriptive statistics&lt;/h2&gt;
&lt;p&gt;To get basic descriptive statistics, we can use the &lt;code&gt;describe()&lt;/code&gt; function. Below, we additionally use &lt;code&gt;round()&lt;/code&gt; to round the results to the nearest whole number.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.describe().round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        sepal_length  sepal_width  petal_length  petal_width
## count         146.0        149.0         150.0        150.0
## mean            6.0          3.0           4.0          1.0
## std             1.0          0.0           2.0          1.0
## min             4.0          2.0           1.0          0.0
## 25%             5.0          3.0           2.0          0.0
## 50%             6.0          3.0           4.0          1.0
## 75%             6.0          3.0           5.0          2.0
## max             8.0          4.0           7.0          2.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that the results above only include the numerical variables. So, to get the results for categorical variables as well, we need to add &lt;code&gt;include = &amp;#39;all&amp;#39;&lt;/code&gt; as below.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.describe(include = &amp;#39;all&amp;#39;).round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         sepal_length  sepal_width  petal_length  petal_width     species
## count          146.0        149.0         150.0        150.0         149
## unique           NaN          NaN           NaN          NaN           3
## top              NaN          NaN           NaN          NaN  versicolor
## freq             NaN          NaN           NaN          NaN          50
## mean             6.0          3.0           4.0          1.0         NaN
## std              1.0          0.0           2.0          1.0         NaN
## min              4.0          2.0           1.0          0.0         NaN
## 25%              5.0          3.0           2.0          0.0         NaN
## 50%              6.0          3.0           4.0          1.0         NaN
## 75%              6.0          3.0           5.0          2.0         NaN
## max              8.0          4.0           7.0          2.0         NaN&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, we can also count the unique values of the categorical variable. &lt;code&gt;value_counts()&lt;/code&gt; only counts the non-missing values.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;].value_counts()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## species
## versicolor    50
## virginica     50
## setosa        49
## Name: count, dtype: int64&lt;/code&gt;&lt;/pre&gt;
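&lt;p&gt;If we want the missing values to be counted as well, &lt;code&gt;value_counts()&lt;/code&gt; accepts a &lt;code&gt;dropna&lt;/code&gt; argument (a small sketch; the output is not shown here):&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;].value_counts(dropna=False) #include the count of missing values as well&lt;/code&gt;&lt;/pre&gt;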
&lt;p&gt;Similarly, for a numerical variable, we can also compute each statistic manually. For example, to calculate the mean, we can use &lt;code&gt;mean()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_width&amp;#39;].mean().round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 3.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it. These are the basics of handling a dataset in Python. With this knowledge, I hope you feel ready to dive in and explore more on your own.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What makes data &#34;good enough&#34; for a statistical analysis?</title>
      <link>https://tengkuhanis.netlify.app/post/what-makes-data-good-enough-for-a-statistical-analysis/</link>
      <pubDate>Thu, 29 Feb 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/what-makes-data-good-enough-for-a-statistical-analysis/</guid>
      <description>


&lt;p&gt;&lt;img src=&#34;images/_34facba1-9993-41d5-ae25-c0cab57f2184.jpg&#34; style=&#34;width:60.0%;height:40.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A few days ago, someone asked me to help her with a data analysis. However, the data that she gave me was so messy that it was impossible to run the analysis unless serious data cleaning was done first.&lt;/p&gt;
&lt;p&gt;So, I have been thinking about a general guideline for deciding when data is good enough to run a statistical analysis on.&lt;/p&gt;
&lt;p&gt;First things first, what is the basic format of “good enough” data?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Each row represents an observation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. For a categorical variable, make sure the levels are standardised.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for a gender variable, make sure to have only “male” and “female” instead of “male”, “female”, “men”, and “women”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. For a numerical variable, make sure the values are numeric and do not contain any text.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for a height variable, do not put “1.68m” or “1.68 meter”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Also for a numerical variable, make sure the values are on the same scale.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for a weight variable, do not mix weights in grams and kilograms. If you want to use grams, use them consistently throughout the data, or at least throughout the variable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Do not use symbols in your data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, do not use “X” for no and “/” for yes. A short code sketch after the example tables below illustrates how points 2, 3, and 5 might be cleaned up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;6. The data should be individual data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Individual data means that each row contains information about a single sample or observation. Each observation in the dataset represents a single entity or unit (e.g., a person, a transaction, a product) and includes all relevant attributes or variables for that unit. Individual data allows for detailed analysis at the level of individual observations. Here is an example of individual data:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;Id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Age&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Obese&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Aggregated data, on the other hand, combines multiple individual observations into summary statistics or groups. Instead of representing individual units, aggregated data presents information at a higher level of abstraction, such as groups, categories, or intervals. This aggregation typically involves summarizing data using functions like sums, averages, counts, or percentages. Here is an example of aggregated data based on the individual data previously:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Age&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Obese&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
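&lt;p&gt;To make points 2, 3, and 5 a bit more concrete, here is a small, hypothetical pandas sketch (the column names and values are made up purely for illustration):&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import pandas as pd

# Hypothetical messy data
df = pd.DataFrame({
    &amp;#39;gender&amp;#39;: [&amp;#39;male&amp;#39;, &amp;#39;women&amp;#39;, &amp;#39;men&amp;#39;, &amp;#39;female&amp;#39;],
    &amp;#39;height&amp;#39;: [&amp;#39;1.68m&amp;#39;, &amp;#39;1.75&amp;#39;, &amp;#39;1.60 meter&amp;#39;, &amp;#39;1.80&amp;#39;],
    &amp;#39;smoker&amp;#39;: [&amp;#39;/&amp;#39;, &amp;#39;X&amp;#39;, &amp;#39;/&amp;#39;, &amp;#39;X&amp;#39;],
})

# Point 2: standardise the levels of a categorical variable
df[&amp;#39;gender&amp;#39;] = df[&amp;#39;gender&amp;#39;].replace({&amp;#39;men&amp;#39;: &amp;#39;male&amp;#39;, &amp;#39;women&amp;#39;: &amp;#39;female&amp;#39;})

# Point 3: strip text so a numerical variable contains only numbers
df[&amp;#39;height&amp;#39;] = df[&amp;#39;height&amp;#39;].str.replace(r&amp;#39;[^0-9.]&amp;#39;, &amp;#39;&amp;#39;, regex=True).astype(float)

# Point 5: replace symbols with explicit labels
df[&amp;#39;smoker&amp;#39;] = df[&amp;#39;smoker&amp;#39;].replace({&amp;#39;/&amp;#39;: &amp;#39;yes&amp;#39;, &amp;#39;X&amp;#39;: &amp;#39;no&amp;#39;})
print(df)&lt;/code&gt;&lt;/pre&gt;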
&lt;p&gt;I think the above six points are the basics of building a good enough dataset for a statistical analysis. While we are at it, let’s go through the two main formats of a dataset. These formats matter most when you have a repeated-measures study design, whereby each participant has several values/measurements/responses at several time points.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;Wide format&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the wide format, all the responses of each participant are in a single row. For example, below is the data on the time taken, in seconds, by two participants to answer three questions. As we can see, each row contains the time taken to answer all three questions.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;ID&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;Long format&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the long format (also known as the tidy format in the R community), each response at each time point of each participant is in its own row. Using the same data as before, below is the data in the long format. As we can see, the data is arranged so that each row represents the time taken by one participant to answer one question.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;ID&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Time&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
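&lt;p&gt;Converting between the two formats does not have to be done by hand. As a minimal sketch using pandas (with the column names taken from the example tables above), the wide data can be reshaped into the long format like this:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import pandas as pd

# The wide-format data: one row per participant
wide = pd.DataFrame({
    &amp;#39;ID&amp;#39;: [1, 2],
    &amp;#39;Question1&amp;#39;: [5, 8],
    &amp;#39;Question2&amp;#39;: [10, 20],
    &amp;#39;Question3&amp;#39;: [50, 40],
})

# Reshape to the long format: one row per participant per question
long = wide.melt(id_vars=&amp;#39;ID&amp;#39;, var_name=&amp;#39;Question&amp;#39;, value_name=&amp;#39;Time&amp;#39;)

# Keep only the question number and sort by participant
long[&amp;#39;Question&amp;#39;] = long[&amp;#39;Question&amp;#39;].str.replace(&amp;#39;Question&amp;#39;, &amp;#39;&amp;#39;).astype(int)
long = long.sort_values([&amp;#39;ID&amp;#39;, &amp;#39;Question&amp;#39;]).reset_index(drop=True)
print(long)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In R, the equivalent reshaping can be done with functions such as &lt;code&gt;pivot_longer()&lt;/code&gt; from the tidyr package.&lt;/p&gt;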
</description>
    </item>
    
    <item>
      <title>Mapping the states in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/</link>
      <pubDate>Wed, 22 Feb 2023 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/</guid>
      <description>


&lt;p&gt;I have written two blog posts about making map in R:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;Making maps with R (my first attempt ever!)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/&#34;&gt;My first interactive map with {leaflet}&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post is sort of a continuation of the &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;first blog post&lt;/a&gt;. In that post, I showed how to plot coordinates on a map, specifically for Malaysia.&lt;/p&gt;
&lt;p&gt;However, using the two approaches in the previous blog post, we cannot plot the coordinates for a particular state in Malaysia. At least, I was unable to find how to do that after googling around. But we can plot the Borneo or Peninsular side of Malaysia using the two approaches.&lt;/p&gt;
&lt;div id=&#34;plot-the-peninsular-of-malaysia-not-the-best-way&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot the peninsular of Malaysia (not the best way)&lt;/h2&gt;
&lt;p&gt;Load the necessary packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rworldmap) 
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we get the data. The data is about desa clinics (klinik desa) in Malaysia.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)
head(clinicDesa)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   id facilities_id                     name              address postcode
## 1  1    KD01010019  KLINIK DESA ASSAM BUBOK     Jalan Batu Pahat    86400
## 2  2    KD01010020   KLINIK DESA BATU PUTIH    Jalan Behor Temak    83000
## 3  3    KD01010021      KLINIK DESA BEROLEH    Jalan Parit Besar    83300
## 4  4    KD01010022        KLINIK DESA BINDU Jalan Tongkang Pecah    83010
## 5  5    KD01010023 KLINIK DESA KAMPUNG BARU   Jalan Parit Kemang    83710
## 6  6    KD01010024 KLINIK DESA KANGKAR BARU      Jalan Meng Seng    85400
##             city   district  state tel fax website email image latitude
## 1     Ayer Hitam Batu Pahat Johor       NA      NA    NA    NA 1.933330
## 2          Bagan Batu Pahat Johor       NA      NA    NA    NA 1.889100
## 3     Sri Gading Batu Pahat Johor       NA      NA    NA    NA 1.877890
## 4 Tongkang Pecah Batu Pahat Johor       NA      NA    NA    NA 1.901515
## 5    Parit Yaani Batu Pahat Johor       NA      NA    NA    NA 1.905120
## 6      Yong Peng Batu Pahat Johor       NA      NA    NA    NA 2.065310
##   longitude likes rating status
## 1  103.1167     0      0    NEW
## 2  102.8778     0      0    NEW
## 3  102.9858     0      0    NEW
## 4  102.9665     0      0    NEW
## 5  103.0372     0      0    NEW
## 6  103.1248     0      0    NEW&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s plot the data first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(clinicDesa, aes(longitude, latitude)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, we remove the two outlying points (those with a longitude below 25).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinicDesa2 &amp;lt;- clinicDesa %&amp;gt;% filter(longitude &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, plot the updated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(clinicDesa2, aes(longitude, latitude)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the plot, we already know that the left side consists of the coordinates in Peninsular Malaysia. So, we can limit our plot by restricting the longitude to &amp;gt; 97 and &amp;lt; 105.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get base map
global &amp;lt;- map_data(&amp;quot;world&amp;quot;) 

# Plot
ggplot() + 
  geom_polygon(data = global %&amp;gt;% filter(region == &amp;quot;Malaysia&amp;quot;), aes(x=long, y = lat, group = group), 
               fill = &amp;quot;gray85&amp;quot;) + 
  coord_fixed(1.3) +
  geom_point(data = clinicDesa2, aes(x = longitude, y = latitude)) +
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in the peninsular of Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) +
  xlim(97, 105) #limit overall map to peninsular of Malaysia&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I am not going to re-explain the code above and below as I have explained it in &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;the previous blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This approach also works with &lt;code&gt;rworldmap&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get base map
world &amp;lt;- getMap(resolution = &amp;quot;low&amp;quot;)
msia &amp;lt;- world[world@data$ADMIN == &amp;quot;Malaysia&amp;quot;, ]

# Plot
ggplot() +
  geom_polygon(data = msia, aes(x = long, y = lat, group = group), fill = NA, colour = &amp;quot;black&amp;quot;) +
  geom_point(data = clinicDesa2, aes(x = longitude, y = latitude)) +
  coord_quickmap() + 
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in the peninsular of Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) +
  xlim(97, 105) #limit overall map to peninsular of Malaysia&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, using the two approaches we can plot the Borneo and Peninsular sides of Malaysia. But, at least to my knowledge, we cannot apply these approaches if we want to plot coordinates for a particular state in Malaysia.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;plot-the-states-in-malaysia&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot the states in Malaysia&lt;/h2&gt;
&lt;p&gt;Load the necessary package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(geodata)
library(tidyterra)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see, we are going to use the &lt;code&gt;geodata&lt;/code&gt; package. &lt;code&gt;tidyterra&lt;/code&gt; is used to supplement ggplot. First, let’s limit the data to desa clinics in Terengganu only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic_trg &amp;lt;- 
  clinicDesa %&amp;gt;% 
  filter(state == &amp;quot;Terengganu&amp;quot;) %&amp;gt;% 
  dplyr::select(latitude, longitude) 
head(clinic_trg)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   latitude longitude
## 1  5.48533  102.4914
## 2  5.81578  102.5778
## 3  5.70886  102.4892
## 4  5.75722  102.5303
## 5  5.67444  102.6289
## 6  5.69875  102.5430&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we get the map from the &lt;code&gt;geodata&lt;/code&gt; package with the boundaries at the district level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Malaysia &amp;lt;- gadm(country = &amp;quot;MYS&amp;quot;, level = 2, path=tempdir())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can use the below information to limit the map to Terengganu state only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Malaysia$NAME_1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   [1] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;          
##   [5] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;          
##   [9] &amp;quot;Johor&amp;quot;           &amp;quot;Johor&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [13] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [17] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;          
##  [21] &amp;quot;Kedah&amp;quot;           &amp;quot;Kedah&amp;quot;           &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [25] &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [29] &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;        &amp;quot;Kelantan&amp;quot;       
##  [33] &amp;quot;Kuala Lumpur&amp;quot;    &amp;quot;Labuan&amp;quot;          &amp;quot;Melaka&amp;quot;          &amp;quot;Melaka&amp;quot;         
##  [37] &amp;quot;Melaka&amp;quot;          &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot;
##  [41] &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot; &amp;quot;Negeri Sembilan&amp;quot;
##  [45] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;         
##  [49] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;         
##  [53] &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Pahang&amp;quot;          &amp;quot;Perak&amp;quot;          
##  [57] &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;          
##  [61] &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;           &amp;quot;Perak&amp;quot;          
##  [65] &amp;quot;Perak&amp;quot;           &amp;quot;Perlis&amp;quot;          &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;   
##  [69] &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Pulau Pinang&amp;quot;    &amp;quot;Putrajaya&amp;quot;      
##  [73] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [77] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [81] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [85] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [89] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [93] &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;           &amp;quot;Sabah&amp;quot;          
##  [97] &amp;quot;Sabah&amp;quot;           &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [101] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [105] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [109] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [113] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [117] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [121] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [125] &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;         &amp;quot;Sarawak&amp;quot;        
## [129] &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;       
## [133] &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;        &amp;quot;Selangor&amp;quot;       
## [137] &amp;quot;Selangor&amp;quot;        &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;      
## [141] &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;       &amp;quot;Trengganu&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, this is the plot for Terengganu.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Trg &amp;lt;- Malaysia[138:144,]
plot(Trg)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We are going to map this in ggplot and stack the map layer with the coordinate layer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() +
  geom_spatvector(data = Trg, color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_minimal() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;geom_spatvector&lt;/code&gt; is from the &lt;code&gt;tidyterra&lt;/code&gt; package. Alternatively, we can plot using &lt;code&gt;geom_sf&lt;/code&gt;, but we need to convert the &lt;code&gt;SpatVector&lt;/code&gt; data into an &lt;code&gt;sf&lt;/code&gt; object using &lt;code&gt;sf::st_as_sf&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = sf::st_as_sf(Trg)) +
  geom_sf(color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_minimal() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Both approaches produce the same plot.&lt;/p&gt;
&lt;p&gt;We can further add district labels to the plots. For example, using &lt;code&gt;geom_sf&lt;/code&gt;, we can stack it with a &lt;code&gt;geom_sf_label&lt;/code&gt; layer. We can also use &lt;code&gt;theme_void&lt;/code&gt; to remove the background and the map axes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data = sf::st_as_sf(Trg)) +
  geom_sf(color = &amp;quot;grey&amp;quot;, fill = NA) +
  geom_sf_label(aes(label = NAME_2)) +
  geom_point(data = clinic_trg, aes(x = longitude, y = latitude, color = &amp;quot;red&amp;quot;)) +
  theme_void() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Desa clinic in Terengganu, Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data last updated: Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;)))) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/mapping-the-states-in-malaysia/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Visualising augmented images in Keras</title>
      <link>https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/</link>
      <pubDate>Wed, 28 Dec 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/</guid>
      <description>


&lt;div id=&#34;data-augmentation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data augmentation&lt;/h2&gt;
&lt;p&gt;Data augmentation has been used in deep learning for many reasons. One of the reasons is to reduce overfitting and make the model more robust. Data augmentation can be done relatively easily with the &lt;code&gt;keras&lt;/code&gt; package in R. However, I have not found any resources on how to visualise the augmented images in R, only in Python. Visualising the augmented images can be quite useful to get an idea of how they look. So, this post covers a simple way to do this in R.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R code&lt;/h2&gt;
&lt;p&gt;Let’s load the &lt;code&gt;keras&lt;/code&gt; library.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(keras)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;keras&amp;#39; was built under R version 4.2.2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we load the image from the internet.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r_logo &amp;lt;- 
  get_file(&amp;quot;img&amp;quot;, &amp;quot;https://ih1.redbubble.net/image.522493300.6771/st,small,507x507-pad,600x600,f8f8f8.jpg&amp;quot;) %&amp;gt;% 
  image_load()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our image right now is 600 x 600 x 3. The 3 at the end is because the image is coloured (RGB channels).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r_logo$size&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
## [1] 600
## 
## [[2]]
## [1] 600&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, we need to change the image into an array with the dimension of 1 x 600 x 600 x 3. The number 1 indicates we have only one image.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r_logo &amp;lt;- 
  r_logo %&amp;gt;% 
  image_to_array() %&amp;gt;% 
  array_reshape(c(1, dim(.)))
dim(r_logo)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]   1 600 600   3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have the correct dimensions, we can specify the parameters for the data augmentation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;augment_params &amp;lt;- image_data_generator(horizontal_flip = T, 
                                       vertical_flip = T,
                                       rotation_range = 0.5,
                                       zoom_range = 0.5,
                                       fill_mode = &amp;quot;reflect&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I am not going to go into the details of the parameters. For those interested, the &lt;a href=&#34;https://tensorflow.rstudio.com/reference/keras/image_data_generator&#34;&gt;TensorFlow for R website&lt;/a&gt; explains them very well.&lt;/p&gt;
&lt;p&gt;Next, we can generate batches of augmented data at random. This function, however, will only run once we fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;img_gen &amp;lt;- flow_images_from_data(r_logo,
                                 generator = augment_params, 
                                 batch_size = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we can plot the images. First, this is our original image.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;img_gen$x [1,,,] %&amp;gt;% 
  as.raster(max = 255) %&amp;gt;% 
  as.array() %&amp;gt;% 
  plot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now, we are going to loop the augmentation process and generate six augmented images. The &lt;code&gt;set.seed()&lt;/code&gt; is for reproducibility.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
par(mfrow = c(3, 2), mar = c(1, 0, 1, 0))

for (i in 1:6) {
  IMG &amp;lt;- img_gen$`next`()
  IMG[1,,,] %&amp;gt;% as.raster(max = 255) %&amp;gt;% as.array() %&amp;gt;% plot()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/visualising-augmented-images-in-keras/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I believe this is quite useful to get a sense of how your data is augmented. Consequently, this may help in selecting the parameters for the data augmentation.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Using UMAP preprocessing for image classification</title>
      <link>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</link>
      <pubDate>Wed, 16 Mar 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;umap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;UMAP&lt;/h2&gt;
&lt;p&gt;Uniform manifold approximation and projection, or UMAP for short, is a dimension reduction technique. Basically, UMAP projects a set of features into a smaller space. UMAP can be a supervised technique, in which we supply a label or an outcome, or an unsupervised one. Those interested in the details of how UMAP works can refer to this &lt;a href=&#34;https://umap-learn.readthedocs.io/en/latest/how_umap_works.html&#34;&gt;reference&lt;/a&gt;. For those who prefer a much simpler or shorter version, I recommend a &lt;a href=&#34;https://www.youtube.com/watch?v=eN0wFzBA4Sc&amp;amp;list=WL&amp;amp;index=2&#34;&gt;YouTube video by Joshua Starmer&lt;/a&gt;.&lt;/p&gt;
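&lt;p&gt;To make the supervised versus unsupervised distinction concrete, here is a minimal sketch (using the built-in &lt;code&gt;iris&lt;/code&gt; data purely as an illustration) with &lt;code&gt;step_umap()&lt;/code&gt; from the &lt;code&gt;embed&lt;/code&gt; package, the same function we use below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(recipes)
library(embed)

# Unsupervised UMAP: the embedding is built from the predictors only
rec_unsup &amp;lt;- recipe(Species ~ ., data = iris) %&amp;gt;% 
  step_umap(all_predictors(), num_comp = 2)

# Supervised UMAP: the outcome guides the embedding
rec_sup &amp;lt;- recipe(Species ~ ., data = iris) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(Species), num_comp = 2)&lt;/code&gt;&lt;/pre&gt;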
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;We are going to see how to apply the UMAP technique for image preprocessing and then classify the images using kNN and naive Bayes.&lt;/p&gt;
&lt;p&gt;These are the packages that we need.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(keras) #for data and reshape to tabular format
library(tidymodels)
library(embed) #for umap
library(discrim) #for naive bayes model&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use the famous MNIST dataset. This dataset contains handwritten digits from 0 to 9 and is available in the &lt;code&gt;keras&lt;/code&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mnist_data &amp;lt;- dataset_mnist()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loaded Tensorflow version 2.2.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- mnist_data$train$x
image_labels &amp;lt;- mnist_data$train$y
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000    28    28&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, this is the image in the second row.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data[2, 1:28, 1:28] %&amp;gt;% 
  t() %&amp;gt;% 
  image(col = gray.colors(256))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, we are going to change the images into a tabular data frame format. We are going to limit the data to the first 10,000 rows or images out of the total 60,000 images.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Reformat to tabular format
image_data &amp;lt;- array_reshape(image_data, dim = c(60000, 28*28))
image_data %&amp;gt;% dim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 60000   784&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;image_data &amp;lt;- image_data[1:10000,]
image_labels &amp;lt;- image_labels[1:10000]

# Reformat to data frame
full_data &amp;lt;- 
  data.frame(image_data) %&amp;gt;% 
  bind_cols(label = image_labels) %&amp;gt;% 
  mutate(label = as.factor(label))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we are going to split the data and create a 3-fold cross-validation set for the sake of simplicity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Split data
set.seed(123)
ind &amp;lt;- initial_split(full_data)
data_train &amp;lt;- training(ind)  
data_test &amp;lt;- testing(ind)

# 3-fold CV
set.seed(123)
data_cv &amp;lt;- vfold_cv(data_train, v = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the recipe specification, we are going to center and scale all the predictors after creating the new variables using &lt;code&gt;step_umap()&lt;/code&gt;. Notice that in &lt;code&gt;step_umap()&lt;/code&gt; we supply the outcome and tune the number of components (&lt;code&gt;num_comp&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = tune()) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a base workflow.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(rec) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use two models as classifiers:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;kNN&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Naive bayes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each classifier, we are going to create a regular grid of parameters to be tuned and then run a regular grid search.&lt;/p&gt;
&lt;p&gt;For kNN.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_mod &amp;lt;- 
  nearest_neighbor(neighbors = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;kknn&amp;quot;)

# knn grid
knn_grid &amp;lt;- grid_regular(neighbors(), num_comp(range = c(2, 8)), levels = 3)

# Tune grid search
knn_tune &amp;lt;- 
  tune_grid(
  wf %&amp;gt;% add_model(knn_mod),
  resamples = data_cv,
  grid = knn_grid, 
  control = control_grid(verbose = F)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For naive bayes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_mod &amp;lt;- 
  naive_Bayes(smoothness = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;naivebayes&amp;quot;)

# nb grid
nb_grid &amp;lt;- grid_regular(smoothness(), num_comp(range = c(2, 10)), levels = 3)

# Tune grid search
nb_tune &amp;lt;- 
  tune_grid(
    wf %&amp;gt;% add_model(nb_mod),
    resamples = data_cv,
    grid = nb_grid, 
    control = control_grid(verbose = F)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the tuning performance of our models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# knn model
knn_tune %&amp;gt;% 
  show_best(&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric .estimator  mean     n  std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 roc_auc hand_till  0.961     3 0.000268 Preprocessor3_Mode~
## 2        10        5 roc_auc hand_till  0.961     3 0.000421 Preprocessor2_Mode~
## 3         5        8 roc_auc hand_till  0.959     3 0.000757 Preprocessor3_Mode~
## 4        10        2 roc_auc hand_till  0.959     3 0.000737 Preprocessor1_Mode~
## 5         5        5 roc_auc hand_till  0.958     3 0.000740 Preprocessor2_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_tune %&amp;gt;% 
  show_best(&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   neighbors num_comp .metric  .estimator  mean     n std_err .config            
##       &amp;lt;int&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;              
## 1        10        8 accuracy multiclass 0.914     3 0.00104 Preprocessor3_Mode~
## 2         5        8 accuracy multiclass 0.913     3 0.00315 Preprocessor3_Mode~
## 3        10        5 accuracy multiclass 0.912     3 0.00114 Preprocessor2_Mode~
## 4         5        5 accuracy multiclass 0.91      3 0.00139 Preprocessor2_Mode~
## 5        10        2 accuracy multiclass 0.910     3 0.00175 Preprocessor1_Mode~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nb model
nb_tune %&amp;gt;% 
  show_best(&amp;quot;roc_auc&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric .estimator  mean     n  std_err .config           
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;             
## 1        1.5       10 roc_auc hand_till  0.971     3 0.000400 Preprocessor3_Mod~
## 2        1.5        6 roc_auc hand_till  0.971     3 0.000997 Preprocessor2_Mod~
## 3        1         10 roc_auc hand_till  0.971     3 0.000634 Preprocessor3_Mod~
## 4        1          6 roc_auc hand_till  0.970     3 0.00124  Preprocessor2_Mod~
## 5        0.5       10 roc_auc hand_till  0.969     3 0.000808 Preprocessor3_Mod~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_tune %&amp;gt;% 
  show_best(&amp;quot;accuracy&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 8
##   smoothness num_comp .metric  .estimator  mean     n  std_err .config          
##        &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;            
## 1        1         10 accuracy multiclass 0.913     3 0.000481 Preprocessor3_Mo~
## 2        1.5       10 accuracy multiclass 0.913     3 0.000267 Preprocessor3_Mo~
## 3        0.5       10 accuracy multiclass 0.912     3 0.000462 Preprocessor3_Mo~
## 4        1.5        6 accuracy multiclass 0.911     3 0.00135  Preprocessor2_Mo~
## 5        1          6 accuracy multiclass 0.910     3 0.00157  Preprocessor2_Mo~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we are going to select the best model from the tuned parameters and finalise it using &lt;code&gt;last_fit()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For the kNN model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
knn_best &amp;lt;- knn_tune %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
knn_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = knn_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

knn_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(knn_rec) %&amp;gt;% 
  add_model(knn_mod) %&amp;gt;% 
  finalize_workflow(knn_best) 

# Last fit
knn_lastfit &amp;lt;- 
  knn_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the naive Bayes model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize
nb_best &amp;lt;- nb_tune %&amp;gt;% select_best(&amp;quot;roc_auc&amp;quot;)
nb_rec &amp;lt;- 
  recipe(label ~ ., data = data_train) %&amp;gt;% 
  step_umap(all_predictors(), outcome = vars(label), num_comp = nb_best$num_comp) %&amp;gt;% 
  step_center(all_predictors()) %&amp;gt;% 
  step_scale(all_predictors())

nb_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(nb_rec) %&amp;gt;% 
  add_model(nb_mod) %&amp;gt;% 
  finalize_workflow(nb_best) 

# Last fit
nb_lastfit &amp;lt;- 
  nb_wf %&amp;gt;% 
  last_fit(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the model performance on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_metrics() %&amp;gt;% 
  mutate(model = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  dplyr::bind_rows(nb_lastfit %&amp;gt;% 
                     collect_metrics() %&amp;gt;% 
                     mutate(model = &amp;quot;nb&amp;quot;)) %&amp;gt;% 
  select(-.config)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 4
##   .metric  .estimator .estimate model
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;
## 1 accuracy multiclass     0.938 knn  
## 2 roc_auc  hand_till      0.971 knn  
## 3 accuracy multiclass     0.936 nb   
## 4 roc_auc  hand_till      0.980 nb&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the confusion matrices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - kNN&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nb_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  conf_mat(label, .pred_class) %&amp;gt;% 
  autoplot(type = &amp;quot;heatmap&amp;quot;) +
  labs(title = &amp;quot;Confusion matrix - naive bayes&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-14-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can compare the ROC plots for each class.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;knn_lastfit %&amp;gt;% 
  collect_predictions() %&amp;gt;%
  mutate(id = &amp;quot;knn&amp;quot;) %&amp;gt;% 
  bind_rows(
    nb_lastfit %&amp;gt;% 
      collect_predictions() %&amp;gt;% 
      mutate(id = &amp;quot;nb&amp;quot;)
            ) %&amp;gt;% 
  group_by(id) %&amp;gt;% 
  roc_curve(label, .pred_0:.pred_9) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/using-umap-preprocessing-for-image-classification/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I believe UMAP is quite good and can be used as a preprocessing step in image classification. We were able to get pretty good performance in this post. I believe that with a more rigorous parameter tuning approach, the results would be even better.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Explore data using PCA</title>
      <link>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</link>
      <pubDate>Wed, 09 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;principal-component-analysis-pca&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Principal component analysis (PCA)&lt;/h2&gt;
&lt;p&gt;PCA is a dimension reduction technique. So, if we have a large number of predictors, instead of using all of them for modelling or other analyses, we can compress the information from these variables into a new, smaller set of variables. These new variables are known as components or principal components (PCs). So, we end up with a smaller number of variables that retain most of the information from the original variables.&lt;/p&gt;
&lt;p&gt;PCA is usually used for datasets with a large number of features or predictors, such as genomic data. Additionally, PCA is a good pre-processing option if you have correlated variables or a multicollinearity issue in the model. We can also use PCA to explore the data and gain a better understanding of it.&lt;/p&gt;
&lt;p&gt;Those who want to study the theoretical side of PCA can read further at this &lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34;&gt;link&lt;/a&gt;. In this post, we are going to focus more on the coding part within a machine learning framework (using the &lt;code&gt;tidymodels&lt;/code&gt; package).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;These are the packages that we are going to use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidymodels)
library(tidyverse)
library(mlbench) #data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to use the Pima Indians diabetes dataset. The outcome is binary: positive = diabetes and negative = non-diabetic/healthy. All the other variables are numerical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;PimaIndiansDiabetes&amp;quot;)
glimpse(PimaIndiansDiabetes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 768
## Columns: 9
## $ pregnant &amp;lt;dbl&amp;gt; 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
## $ glucose  &amp;lt;dbl&amp;gt; 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
## $ pressure &amp;lt;dbl&amp;gt; 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
## $ triceps  &amp;lt;dbl&amp;gt; 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
## $ insulin  &amp;lt;dbl&amp;gt; 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
## $ mass     &amp;lt;dbl&amp;gt; 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
## $ pedigree &amp;lt;dbl&amp;gt; 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
## $ age      &amp;lt;dbl&amp;gt; 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
## $ diabetes &amp;lt;fct&amp;gt; pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to split the data and extract the training dataset. We will explore only the training set since we are doing this within a machine learning framework.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

ind &amp;lt;- initial_split(PimaIndiansDiabetes)
dat_train &amp;lt;- training(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a recipe and apply the normalization and PCA steps. Then, we prep it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
pca_rec &amp;lt;- 
  recipe(diabetes ~ ., data = dat_train) %&amp;gt;% 
  step_normalize(all_numeric_predictors()) %&amp;gt;% 
  step_pca(all_numeric_predictors())

# Prep
pca_prep &amp;lt;- prep(pca_rec)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, we can extract the PCA results using &lt;code&gt;tidy()&lt;/code&gt;. &lt;code&gt;type = &#34;coef&#34;&lt;/code&gt; indicates that we want the loading values, so the values in the data are the loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;coef&amp;quot;)
pca_tidied&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 64 x 4
##    terms     value component id       
##    &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;    
##  1 pregnant  0.107 PC1       pca_JtuLZ
##  2 glucose   0.357 PC1       pca_JtuLZ
##  3 pressure  0.330 PC1       pca_JtuLZ
##  4 triceps   0.460 PC1       pca_JtuLZ
##  5 insulin   0.466 PC1       pca_JtuLZ
##  6 mass      0.447 PC1       pca_JtuLZ
##  7 pedigree  0.315 PC1       pca_JtuLZ
##  8 age       0.158 PC1       pca_JtuLZ
##  9 pregnant -0.597 PC2       pca_JtuLZ
## 10 glucose  -0.192 PC2       pca_JtuLZ
## # ... with 54 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, basically the loadings indicate how much each variable contributes to each component (PC). A large loading (positive or negative) indicates a strong relationship between the variables and the related components. The sign indicates a negative or positive correlation between the variables and components.&lt;/p&gt;
&lt;p&gt;We can further visualise these loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied %&amp;gt;% 
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = F) +
  facet_wrap(~ component) +
  ylab(&amp;quot;&amp;quot;) +
  xlab(&amp;quot;Loadings&amp;quot;) + 
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Besides the loadings, we can also get the variance information. The variance of each component (PC) measures how much that particular component explains the variability in the data. For example, PC1 explains 26.2% of the variance in the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied2 &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;variance&amp;quot;)

pca_tidied2 %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value, names_prefix = &amp;quot;PC&amp;quot;) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  mutate_if(is.numeric, round, digits = 1) %&amp;gt;% 
  kableExtra::kable(&amp;quot;simple&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;terms&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC3&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC4&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC5&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC6&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC7&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;60.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;81.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;89.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;95.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;100.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
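&lt;p&gt;As a quick sanity check of the percent variance row (a small sketch reusing &lt;code&gt;pca_tidied2&lt;/code&gt; from above): since all eight predictors were normalized, the total variance is 8, and the percentage for each PC is simply its variance divided by that total.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Percent variance recomputed from the variance values (e.g. PC1: 2.1/8 = ~26.2%)
pca_tidied2 %&amp;gt;% 
  filter(terms == &amp;quot;variance&amp;quot;) %&amp;gt;% 
  mutate(percent = value / sum(value) * 100)&lt;/code&gt;&lt;/pre&gt;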
&lt;p&gt;Next, we can visualise PC1 and PC2 in a scatter plot and see how each variable influences both PCs. First, we need to extract the loadings and convert them into a wide format for the arrow coordinates in the scatter plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied3 &amp;lt;- 
  pca_tidied %&amp;gt;% 
  filter(component %in% c(&amp;quot;PC1&amp;quot;, &amp;quot;PC2&amp;quot;)) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value)
pca_tidied3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 8 x 3
##   terms      PC1    PC2
##   &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
## 1 pregnant 0.107 -0.597
## 2 glucose  0.357 -0.192
## 3 pressure 0.330 -0.234
## 4 triceps  0.460  0.279
## 5 insulin  0.466  0.200
## 6 mass     0.447  0.121
## 7 pedigree 0.315  0.110
## 8 age      0.158 -0.638&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we can make a scatter plot using the training set data (&lt;code&gt;juice(pca_prep)&lt;/code&gt;) and the loadings data (&lt;code&gt;pca_tidied3&lt;/code&gt;). Also, we are going to add the percentage of variance for PC1 and PC2 to the axis labels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;juice(pca_prep) %&amp;gt;% 
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = diabetes, shape = diabetes), size = 2, alpha = 0.6) +
  geom_segment(data = pca_tidied3, 
               aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), 
               arrow = arrow(length = unit(1/2, &amp;quot;picas&amp;quot;)),
               color = &amp;quot;blue&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, 
           x = pca_tidied3$PC1 * 5.2, 
           y = pca_tidied3$PC2 * 5.2, 
           label = pca_tidied3$terms) +
  theme_minimal() +
  xlab(&amp;quot;PC1 (26.2%)&amp;quot;) +
  ylab(&amp;quot;PC2 (21.5%)&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from this scatter plot we learn that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;(triceps, insulin, pedigree and mass), (glucose and pressure) and (pregnant and age) are correlated as their lines are close to each other&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As PC1 and PC2 increase, triceps, insulin, pedigree and mass also increase&lt;/li&gt;
&lt;li&gt;As PC2 decreases, pregnant and age increase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34; class=&#34;uri&#34;&gt;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://juliasilge.com/blog/cocktail-recipes-umap/&#34; class=&#34;uri&#34;&gt;https://juliasilge.com/blog/cocktail-recipes-umap/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Fitted vs predict in R</title>
      <link>https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/</link>
      <pubDate>Sun, 09 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/fitted-vs-predict-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;There are two functions in R that seem almost similar yet are different:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fitted()&lt;/code&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;predict()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;First, let’s prepare some data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(dplyr)

# Data
set.seed(123)
dat &amp;lt;- 
  iris %&amp;gt;% 
  mutate(twoGp = sample(c(&amp;quot;Gp1&amp;quot;, &amp;quot;Gp2&amp;quot;), 150, replace = T), #create two group factor
         twoGp = as.factor(twoGp))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   twoGp   
##  setosa    :50   Gp1:76  
##  versicolor:50   Gp2:74  
##  virginica :50           
##                          
##                          
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fitted()&lt;/code&gt; is used to get the predicted values, or &lt;span class=&#34;math inline&#34;&gt;\(\hat{y}\)&lt;/span&gt;, based on the data used to fit the model. Let’s see this with a logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logR &amp;lt;- glm(twoGp ~ ., family = binomial(), data = dat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are the fitted values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fitted(logR) %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For &lt;code&gt;predict()&lt;/code&gt;, we have three types:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Response&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Link - default&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Terms&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If no new data is supplied to &lt;code&gt;predict()&lt;/code&gt;, it will use the original data used to fit the model.&lt;/p&gt;
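&lt;p&gt;One practical difference worth keeping in mind (a small illustration; here we simply reuse the first six rows of the original data as if they were new): &lt;code&gt;predict()&lt;/code&gt; accepts a &lt;code&gt;newdata&lt;/code&gt; argument, whereas &lt;code&gt;fitted()&lt;/code&gt; always returns values for the data used to fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Predicted probabilities for &amp;quot;new&amp;quot; observations supplied via newdata
predict(logR, newdata = head(dat), type = &amp;quot;response&amp;quot;)&lt;/code&gt;&lt;/pre&gt;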
&lt;p&gt;&lt;strong&gt;1. Response&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The type &lt;code&gt;&#34;response&#34;&lt;/code&gt; gives values identical to &lt;code&gt;fitted()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;response&amp;quot;) %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can confirm this as below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all.equal(fitted(logR), predict(logR, type = &amp;quot;response&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Thus, &lt;code&gt;fitted()&lt;/code&gt; and &lt;code&gt;predict(type = &#34;response&#34;)&lt;/code&gt; give us predicted probabilities on the scale of the response variable. The first value can be interpreted as follows: the probability of Gp2 (Gp1 is the reference group) for the first observation is 0.41.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Link&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt; gives us predictions on the scale of the linear predictor, that is, the logit or log-odds scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;link&amp;quot;) %&amp;gt;% head() #similar to predict(logR)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, the log odds prediction of Gp2 for the first observation is -0.37. Hence, we can get the same values if we apply a &lt;a href=&#34;https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function&#34;&gt;link function&lt;/a&gt; to the fitted values.&lt;/p&gt;
&lt;p&gt;The link function for logistic regression is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
ln(\frac{\mu}{1 - \mu})
\]&lt;/span&gt;
So, we apply this link function to the fitted values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logOddsProb &amp;lt;- log(fitted(logR) / (1 - fitted(logR))) 
head(logOddsProb)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further confirm this as we did previously.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all.equal(logOddsProb, predict(logR, type = &amp;quot;link&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we can conclude that &lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt; gives us the fitted values after the link function has been applied, i.e. the log odds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Terms&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we have &lt;code&gt;predict(type = &#34;terms&#34;)&lt;/code&gt;. This type gives us a matrix with the contribution of each variable to the linear predictor for each observation, on the scale of the linear predictor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predict(logR, type = &amp;quot;terms&amp;quot;) %&amp;gt;% head() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1   0.07988782  0.28070682    0.4819893  -0.2736677 -0.9178543
## 2   0.10138230 -0.03635661    0.4819893  -0.2736677 -0.9178543
## 3   0.12287679  0.09046877    0.5024299  -0.2736677 -0.9178543
## 4   0.13362403  0.02705608    0.4615487  -0.2736677 -0.9178543
## 5   0.09063506  0.34411951    0.4819893  -0.2736677 -0.9178543
## 6   0.04764610  0.53435757    0.4206675  -0.2188976 -0.9178543&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, if we add up the values of the first observation and the constant (or intercept), we will get the same value as the log odds prediction (&lt;code&gt;predict(type = &#34;link&#34;)&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predTerm &amp;lt;- predict(logR, type = &amp;quot;terms&amp;quot;)
sum(predTerm[1, ], attr(predTerm, &amp;quot;constant&amp;quot;)) #add up the first observation and the constant&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;logOddsProb[1]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         1 
## -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These values are also the same as what we get if we calculate them manually using the coefficients from &lt;code&gt;summary()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
LogOdds(Gp2) = \beta_0 + \beta_1(Sepal.Length) + \beta_2(Sepal.Width) + \beta_3(Petal.Length) + \beta_4(Petal.Width) + \beta_5(Species)
\]&lt;/span&gt;
So, this is the value we get for the first observation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coef(logR)[1] + coef(logR)[2]*dat$Sepal.Length[1] + coef(logR)[3]*dat$Sepal.Width[1] + coef(logR)[4]*dat$Petal.Length[1] + coef(logR)[5]*dat$Petal.Width[1] + 0 #setosa species&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept) 
##   -0.374315&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, in &lt;code&gt;predict(type = &#34;terms&#34;)&lt;/code&gt; the values are &lt;a href=&#34;https://www.statology.org/center-data-in-r/&#34;&gt;centered&lt;/a&gt;, thus we have different values for the constant/intercept and for &lt;span class=&#34;math inline&#34;&gt;\(\beta_1(Sepal.Length)\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\beta_2(Sepal.Width)\)&lt;/span&gt; and so on. For example, the intercept values from the two approaches are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Intercept/constant from predict(type = &amp;quot;terms&amp;quot;)
attr(predTerm, &amp;quot;constant&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] -0.02537694&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Intercept/constant from summary()
coef(logR)[1]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## (Intercept) 
##   -1.814251&lt;/code&gt;&lt;/pre&gt;
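&lt;p&gt;For a single numeric predictor, this centering means the &lt;code&gt;type = &#34;terms&#34;&lt;/code&gt; value is just the coefficient multiplied by that predictor’s deviation from its mean. A quick check (a small sketch; it should reproduce the Sepal.Length entry of the first row in the matrix above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# coef * (x - mean(x)) gives the centered &amp;quot;terms&amp;quot; contribution
coef(logR)[&amp;quot;Sepal.Length&amp;quot;] * (dat$Sepal.Length[1] - mean(dat$Sepal.Length))
predTerm[1, &amp;quot;Sepal.Length&amp;quot;]&lt;/code&gt;&lt;/pre&gt;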
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/a/12201502/11215767&#34; class=&#34;uri&#34;&gt;https://stackoverflow.com/a/12201502/11215767&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stackoverflow.com/a/47854088/11215767&#34; class=&#34;uri&#34;&gt;https://stackoverflow.com/a/47854088/11215767&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>A short note on variable selection</title>
      <link>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</link>
      <pubDate>Sat, 08 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-variable-selection/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;variable-selection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variable selection&lt;/h2&gt;
&lt;p&gt;Variable or feature selection is one of the important steps in both machine learning and statistical analysis. This post is geared more towards the machine learning side. Certain machine learning models, such as support vector machines (SVM) and neural networks, do not handle irrelevant predictors very well, whereas models such as linear and logistic regression do not handle correlated predictors very well. Thus, careful selection of the variables helps mitigate these issues and further improves predictive performance.&lt;/p&gt;
&lt;p&gt;There are three types of approaches in variable selection:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Intrinsic (or built-in feature selection)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Intrinsic feature selection is feature selection embedded in the algorithm itself. Some examples include:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Tree- and rule-based models - decision trees, random forests, etc.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Multivariate adaptive regression spline (MARS)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Regularization method such as least absolute shrinkage and selection operator (LASSO or L1)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The advantages of this approach are that it is fast and computationally efficient. However, the variables selected by this approach are model dependent.&lt;/p&gt;
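&lt;p&gt;As a small illustration of the regularization route mentioned above (a sketch only; &lt;code&gt;mtcars&lt;/code&gt; is used purely as an example), LASSO shrinks some coefficients exactly to zero, and the predictors that survive are the selected ones.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(glmnet)

x &amp;lt;- as.matrix(mtcars[, -1]) #predictors
y &amp;lt;- mtcars$mpg              #outcome

set.seed(123)
cv_fit &amp;lt;- cv.glmnet(x, y, alpha = 1) #alpha = 1 is the LASSO penalty
coef(cv_fit, s = &amp;quot;lambda.1se&amp;quot;)       #non-zero rows are the selected predictors&lt;/code&gt;&lt;/pre&gt;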
&lt;p&gt;&lt;strong&gt;2. Filter&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the filter approach, we determine variable importance outside the model, usually for each variable separately (though not necessarily). An example of this approach is a univariate filter: if the outcome has two categories, we can use a t-test to assess the numerical predictors, and variables with a significant p-value or a large t-statistic are chosen.&lt;/p&gt;
&lt;p&gt;This approach is very simple and fast. However, the subset of variables selected using a filtering criterion such as statistical significance may not give the best predictive performance of the model. Additionally, this approach is prone to over-selection of predictors.&lt;/p&gt;
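&lt;p&gt;As a small illustration of a univariate filter (a sketch only, using the built-in &lt;code&gt;iris&lt;/code&gt; data restricted to two species so the outcome is binary):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Keep numeric predictors whose t-test p-value is below 0.05
dat2 &amp;lt;- droplevels(subset(iris, Species != &amp;quot;setosa&amp;quot;))

p_vals &amp;lt;- sapply(dat2[, 1:4], function(x) t.test(x ~ dat2$Species)$p.value)
names(p_vals)[p_vals &amp;lt; 0.05] #selected predictors&lt;/code&gt;&lt;/pre&gt;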
&lt;p&gt;&lt;strong&gt;3. Wrapper&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are two types of wrapper approaches:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A greedy approach or algorithm directs its search path towards whatever gives the best immediate benefit at each step. For this reason, this approach cannot escape local minima. In Figure 1 below, we can think of the local minimum as representing a locally best set of predictors and the global minimum as the globally best set of predictors.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;img.png&#34; alt=&#34;Local minima and global minima&#34; width=&#34;576&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Local minima and global minima
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;An example of this approach is recursive feature elimination, or backward selection. The main weakness of the greedy approach is that the subset of features it identifies may not have the best predictive performance.&lt;/p&gt;
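&lt;p&gt;A minimal recursive feature elimination sketch with the &lt;code&gt;caret&lt;/code&gt; package could look like this (an assumed setup, not taken from the original post; &lt;code&gt;rfFuncs&lt;/code&gt; uses a random forest to rank the predictors):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)

set.seed(123)
ctrl &amp;lt;- rfeControl(functions = rfFuncs, method = &amp;quot;cv&amp;quot;, number = 5)
rfe_fit &amp;lt;- rfe(x = mtcars[, -1], y = mtcars$mpg,
               sizes = c(2, 4, 6), rfeControl = ctrl)
predictors(rfe_fit) #the selected subset of features&lt;/code&gt;&lt;/pre&gt;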
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Non-greedy wrapper&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Examples of this approach are simulated annealing and the &lt;a href=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/&#34;&gt;genetic algorithm&lt;/a&gt;. Both of these algorithms incorporate randomness in their search; hence, they are classified as non-greedy wrappers. Due to this randomness, they can escape local minima (see Figure 1 above).&lt;/p&gt;
&lt;p&gt;The wrapper type has the best chance of finding the globally best predictors. However, this approach is computationally expensive. Not to mention, it has a tendency to overfit (some packages, like &lt;code&gt;caret&lt;/code&gt;, use resampling to mitigate this issue).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;suggested-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Suggested approach&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://bookdown.org/max/FES/&#34;&gt;Kuhn &amp;amp; Johnson (2019)&lt;/a&gt; suggested this approach:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Start with an intrinsic approach&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, do a wrapper approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If a linear intrinsic approach has a better performance - proceed to wrapper method with a linear model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If non-linear intrinsic approach has a better performance - proceed to wrapper method with a non-linear model&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If several approaches select a large number of predictors, it may not be feasible to reduce the number of features&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&#34; class=&#34;uri&#34;&gt;https://bookdown.org/max/FES/classes-of-feature-selection-methodologies.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://topepo.github.io/caret/feature-selection-overview.html&#34; class=&#34;uri&#34;&gt;http://topepo.github.io/caret/feature-selection-overview.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Stepwise selection after multiple imputation</title>
      <link>https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/</link>
      <pubDate>Tue, 04 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/stepwise-selection-after-multiple-imputation/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;some-note&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some note&lt;/h2&gt;
&lt;p&gt;I have written two posts previously about multiple imputation using the &lt;code&gt;mice&lt;/code&gt; package:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/&#34;&gt;A short note on multiple imputation&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/&#34;&gt;Variable selection for imputation model in {mice}&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post is probably my last post about multiple imputation using the &lt;code&gt;mice&lt;/code&gt; package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;stepwise-selection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Stepwise selection&lt;/h2&gt;
&lt;p&gt;The general steps in the &lt;code&gt;mice&lt;/code&gt; package are (a short sketch follows the list):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;mice()&lt;/code&gt; - impute the NAs&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with()&lt;/code&gt; - run the analysis (lm, glm, etc)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pool()&lt;/code&gt; - pool the results&lt;/li&gt;
&lt;/ol&gt;
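&lt;p&gt;Schematically, the three steps look like this (a minimal sketch using the toy &lt;code&gt;nhanes&lt;/code&gt; dataset shipped with &lt;code&gt;mice&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)

imp &amp;lt;- mice(nhanes, m = 5, printFlag = FALSE, seed = 1) #1. impute the NAs
fit &amp;lt;- with(imp, lm(chl ~ age + bmi))                   #2. run the analysis
pool(fit)                                               #3. pool the results&lt;/code&gt;&lt;/pre&gt;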
&lt;p&gt;For backward and forward selection, we can do it manually after pooling the results in step 3, but we cannot do this for stepwise selection.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://books.google.com.my/books/about/Development_Implementation_and_Evaluatio.html?id=-Y0TywAACAAJ&amp;amp;redir_esc=y&#34;&gt;Brand (1999)&lt;/a&gt; proposed this solution:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Perform stepwise selection separately on each imputed dataset&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Fit a preliminary model that contains all variables that present in at least half of the models in the step 1&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Apply backward elimination to the variables in the preliminary model (variables with p &amp;gt; 0.05 are removed one at a time)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Repeat step 3 until all variables have p values &amp;lt; 0.05&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So, we are going to apply this solution and use the multivariate Wald test (&lt;code&gt;D1()&lt;/code&gt; in the &lt;code&gt;mice&lt;/code&gt; package) for model comparison instead of the pooled likelihood ratio p-value.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create missing data. We are going to use the famous &lt;code&gt;mtcars&lt;/code&gt; dataset, which is already available in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat &amp;lt;- 
  mtcars %&amp;gt;% 
  mutate(across(c(vs, am), as.factor)) %&amp;gt;% 
  select(-mpg) %&amp;gt;% 
  missForest::prodNA(0.1) %&amp;gt;% 
  bind_cols(mpg = mtcars$mpg)
summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       cyl             disp             hp             drat      
##  Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:4.000   1st Qu.:120.7   1st Qu.:103.0   1st Qu.:3.150  
##  Median :6.000   Median :225.0   Median :123.0   Median :3.715  
##  Mean   :6.148   Mean   :232.8   Mean   :147.4   Mean   :3.642  
##  3rd Qu.:8.000   3rd Qu.:334.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
##  NA&amp;#39;s   :5       NA&amp;#39;s   :1       NA&amp;#39;s   :4       NA&amp;#39;s   :2      
##        wt             qsec          vs        am          gear     
##  Min.   :1.513   Min.   :14.50   0   :17   0   :18   Min.   :3.00  
##  1st Qu.:2.429   1st Qu.:16.88   1   :11   1   :10   1st Qu.:3.00  
##  Median :3.203   Median :17.51   NA&amp;#39;s: 4   NA&amp;#39;s: 4   Median :4.00  
##  Mean   :3.112   Mean   :17.75                       Mean   :3.71  
##  3rd Qu.:3.533   3rd Qu.:18.83                       3rd Qu.:4.00  
##  Max.   :5.424   Max.   :22.90                       Max.   :5.00  
##  NA&amp;#39;s   :4       NA&amp;#39;s   :2                           NA&amp;#39;s   :1     
##       carb            mpg       
##  Min.   :1.000   Min.   :10.40  
##  1st Qu.:2.000   1st Qu.:15.43  
##  Median :2.000   Median :19.20  
##  Mean   :2.667   Mean   :20.09  
##  3rd Qu.:4.000   3rd Qu.:22.80  
##  Max.   :6.000   Max.   :33.90  
##  NA&amp;#39;s   :5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;mice()&lt;/code&gt; on missing data with 10 imputed datasets (&lt;code&gt;m = 10&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;datImp &amp;lt;- mice(dat, m = 10, printFlag = F, seed = 123)
datImp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  10 
## Imputation methods:
##      cyl     disp       hp     drat       wt     qsec       vs       am 
##    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot; &amp;quot;logreg&amp;quot; &amp;quot;logreg&amp;quot; 
##     gear     carb      mpg 
##    &amp;quot;pmm&amp;quot;    &amp;quot;pmm&amp;quot;       &amp;quot;&amp;quot; 
## PredictorMatrix:
##      cyl disp hp drat wt qsec vs am gear carb mpg
## cyl    0    1  1    1  1    1  1  1    1    1   1
## disp   1    0  1    1  1    1  1  1    1    1   1
## hp     1    1  0    1  1    1  1  1    1    1   1
## drat   1    1  1    0  1    1  1  1    1    1   1
## wt     1    1  1    1  0    1  1  1    1    1   1
## qsec   1    1  1    1  1    0  1  1    1    1   1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run stepwise selection on each imputed dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sc &amp;lt;- list(upper = ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, 
           lower = ~ 1)
exp &amp;lt;- expression(f1 &amp;lt;- lm(mpg ~ 1),
                  f2 &amp;lt;- step(f1, scope = sc, trace = 0))
fit &amp;lt;- with(datImp, exp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we count how many times each variable was selected across the models by the stepwise selection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit$analyses %&amp;gt;% 
  map(formula) %&amp;gt;% #get the formula
  map(terms) %&amp;gt;% #get the terms
  map(labels) %&amp;gt;% #get the name of variables
  unlist() %&amp;gt;% 
  table()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## .
##   am carb  cyl disp drat   hp qsec   vs   wt 
##    7    5    3    2    4    5    3    4    7&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to select:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;am&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;carb&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;hp&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;wt&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These variables appear in at least half of the models. We have 10 imputed datasets and, thus, 10 models. Next, we fit a preliminary model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full1 &amp;lt;- with(datImp, lm(mpg ~ am + carb + hp + wt))
pool(fit_full1) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.33683070 3.30280913 10.093478 15.81838 2.688191e-08
## 2         am1  3.06689135 1.94363342  1.577917 13.06329 1.384846e-01
## 3        carb -0.64791214 0.65564816 -0.988201 11.64959 3.431353e-01
## 4          hp -0.03414274 0.01159828 -2.943777 20.47239 7.895170e-03
## 5          wt -2.39586280 1.22218829 -1.960306 13.54830 7.085513e-02&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We exclude the carb variable in the next model as it has the largest non-significant p-value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full2 &amp;lt;- with(datImp, lm(mpg ~ am + hp + wt))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we compare the two models using the multivariate Wald test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;D1(fit_full1, fit_full2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    test statistic df1     df2 dfcom   p.value       riv
##  1 ~~ 2 0.9765411   1 9.21378    27 0.3482934 0.6935655&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The p-value is &amp;gt; 0.05, so we opt for the simpler model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pool(fit_full2) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.75666324 3.30083213 10.226713 16.87762 1.195383e-08
## 2         am1  2.50264907 1.79966590  1.390619 15.31418 1.842201e-01
## 3          hp -0.03950216 0.01162689 -3.397482 17.65719 3.280147e-03
## 4          wt -2.75412354 1.15870950 -2.376889 15.03403 3.116779e-02&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the am variable has the largest non-significant p-value. So, we exclude this variable in the next model and compare the two latest models using the multivariate Wald test.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit_full3 &amp;lt;- with(datImp, lm(mpg ~ hp + wt))
D1(fit_full2, fit_full3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    test statistic df1      df2 dfcom   p.value       riv
##  1 ~~ 2   1.93382   1 12.90982    28 0.1878483 0.4392918&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, we opt for the simpler model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pool(fit_full3) %&amp;gt;% 
  summary()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 37.50546490 1.91102857 19.625800 23.65472 4.440892e-16
## 2          hp -0.03263534 0.01042989 -3.129021 21.20234 5.031751e-03
## 3          wt -3.92792051 0.75157304 -5.226266 19.78033 4.238231e-05&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is no non-significant variable in the model anymore. Thus, this is our final model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gtsummary::tbl_regression(fit_full3)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;ybehlmrayy&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;hp&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.03&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.05, -0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.005&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;wt&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-3.9&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-5.5, -2.4&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;4&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br&gt;&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/sec-stepwise.html&#34; class=&#34;uri&#34;&gt;https://stefvanbuuren.name/fimd/sec-stepwise.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Variable selection using genetic algorithm</title>
      <link>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</link>
      <pubDate>Sun, 02 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-using-genetic-algorithm/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;The genetic algorithm is inspired by natural selection, the process by which the fittest individuals are selected to reproduce. This algorithm has been used for optimization and search problems, and it can also be used for variable selection.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/ga_fig.png&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Genetic algorithm - gene, chromosome, population, crossover (upper right), offspring (lower right)&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;First, let’s go into a few terms related to genetic algorithm theory.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Population - a set of chromosomes&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Chromosome - a subset of variables (also known as an individual in some references)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Gene - a variable or feature&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Fitness function - gives a fitness score to each chromosome and guides the selection&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Selection - a process to select two chromosomes, known as the parents&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Crossover - a process in which the parents generate offspring (illustrated in the picture above, on the upper right side)&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Mutation - the process by which a gene in the chromosome is randomly flipped to 1 or 0&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;images/mutation.png&#34; width=&#34;250&#34; alt=&#34;&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Mutation&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So, the basic flow of genetic algorithm:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;The algorithm starts with an initial population, often randomly generated&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create successive generations by selecting a portion of the current population (the selection is guided by the fitness function) - this includes selection -&amp;gt; crossover -&amp;gt; mutation&lt;br /&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The algorithm terminates if certain predetermined criteria are met such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A solution satisfies the minimum criteria&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;A fixed number of generations is reached&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Successive iterations no longer produce better results&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;There is a &lt;code&gt;GA&lt;/code&gt; package in R with which we can implement the genetic algorithm a bit more manually and specify our own fitness function. However, I think it is easier to use the genetic algorithm implemented in the &lt;code&gt;caret&lt;/code&gt; package for variable selection.&lt;/p&gt;
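&lt;p&gt;For illustration only, here is a minimal sketch of that more manual route using the &lt;code&gt;GA&lt;/code&gt; package. The fitness function (negative AIC of a linear model for mpg) and the settings below are my own choices for this sketch, not part of the workflow in the rest of this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(GA)

# Each chromosome is a 0/1 vector indicating which columns of x are used;
# the fitness is the negative AIC of a linear model for mpg (ga() maximises)
x &amp;lt;- mtcars[, -1]
y &amp;lt;- mtcars$mpg

fitness_fn &amp;lt;- function(chromosome) {
  if (sum(chromosome) == 0) return(-1e10) # heavy penalty for an empty variable set
  dat_sub &amp;lt;- data.frame(mpg = y, x[, chromosome == 1, drop = FALSE])
  -AIC(lm(mpg ~ ., data = dat_sub))
}

set.seed(123)
ga_res &amp;lt;- ga(type = &amp;quot;binary&amp;quot;, fitness = fitness_fn,
             nBits = ncol(x), popSize = 50, maxiter = 20, run = 10)

# Variables selected in the best chromosome
colnames(x)[ga_res@solution[1, ] == 1]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That said, for the rest of this post we stick with the &lt;code&gt;caret&lt;/code&gt; implementation.&lt;/p&gt;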
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(caret)
library(tidyverse)
library(rsample)
library(recipes)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat &amp;lt;- 
  mtcars %&amp;gt;% 
  mutate(across(c(vs, am), as.factor),
         am = fct_recode(am, auto = &amp;quot;0&amp;quot;, man = &amp;quot;1&amp;quot;))
str(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels &amp;quot;0&amp;quot;,&amp;quot;1&amp;quot;: 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels &amp;quot;auto&amp;quot;,&amp;quot;man&amp;quot;: 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this, we are going to use random forest (&lt;code&gt;rfGA&lt;/code&gt;). Other options are bagged trees (&lt;code&gt;treebagGA&lt;/code&gt;) and &lt;code&gt;caretGA&lt;/code&gt;; with &lt;code&gt;caretGA&lt;/code&gt; we can use any other method available in &lt;code&gt;caret&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-am), 
              y = dat$am,
              iters = 5,
              gafsControl = ga_ctrl)
rf_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     qsec (60%), wt (60%), disp (40%), gear (40%), vs (40%)
##   * on average, 3.2 variables were selected (min = 1, max = 7)
## 
## In the final search using the entire training set:
##    * 7 features selected at iteration 3 including:
##      cyl, hp, drat, qsec, vs ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9429      0.8831&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal features/variables:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;  &amp;quot;hp&amp;quot;   &amp;quot;drat&amp;quot; &amp;quot;qsec&amp;quot; &amp;quot;vs&amp;quot;   &amp;quot;gear&amp;quot; &amp;quot;carb&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the time taken for the random forest approach.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga$times&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $everything
##    user  system elapsed 
##   51.22    1.25   52.92&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default the algorithm will find a set of variables that minimises RMSE for a numerical outcome and maximises accuracy for a categorical outcome. Also, genetic algorithms tend to overfit, which is why the implementation in &lt;code&gt;caret&lt;/code&gt; reports both internal and external performance. With the 5-fold cross-validation used here, five genetic algorithms are run separately: in each, four folds are used for the genetic algorithm search and the held-out fold is used for external performance evaluation.&lt;/p&gt;
&lt;p&gt;Let’s try variable selection using a linear regression model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
lm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run lm
set.seed(123)
lm_ga &amp;lt;- gafs(x = dat %&amp;gt;% select(-mpg), 
              y = dat$mpg,
              iters = 5,
              gafsControl = lm_ga_ctrl,
              # below is the option for `train`
              method = &amp;quot;lm&amp;quot;,
              trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = F))
lm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 32 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared, MAE
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (100%), hp (80%), carb (60%), cyl (60%), am (40%)
##   * on average, 4.4 variables were selected (min = 4, max = 5)
## 
## In the final search using the entire training set:
##    * 5 features selected at iteration 5 including:
##      cyl, disp, hp, wt, qsec  
##    * external performance at this iteration is
## 
##        RMSE    Rsquared         MAE 
##      3.3434      0.7624      2.6037&lt;/code&gt;&lt;/pre&gt;
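&lt;p&gt;As a small follow-up sketch (not part of the original workflow), we could refit an ordinary linear model on the variables chosen by the genetic algorithm, using the &lt;code&gt;optVariables&lt;/code&gt; element of the &lt;code&gt;gafs&lt;/code&gt; object:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Refit a plain lm on the selected variables for closer inspection
sel_vars &amp;lt;- lm_ga$optVariables
fit_sel &amp;lt;- lm(reformulate(sel_vars, response = &amp;quot;mpg&amp;quot;), data = dat)
summary(fit_sel)&lt;/code&gt;&lt;/pre&gt;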
&lt;p&gt;Now, let’s see how to integrate this into a machine learning workflow using a recipe from the &lt;code&gt;recipes&lt;/code&gt; package and data splitting from &lt;code&gt;rsample&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;First, we split the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat_split &amp;lt;-initial_split(dat)
dat_train &amp;lt;- training(dat_split)
dat_test &amp;lt;- testing(dat_split)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We specify two recipes: one for the numerical outcome and one for the categorical outcome.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Numerical
rec_num &amp;lt;- 
  recipe(mpg ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())

# Categorical
rec_cat &amp;lt;- 
  recipe(am ~., data = dat_train) %&amp;gt;% 
  step_center(all_numeric()) %&amp;gt;% 
  step_dummy(all_nominal_predictors())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We run random forest with the numerical outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
rf_ga_ctrl &amp;lt;- gafsControl(functions = rfGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run random forest
set.seed(123)
rf_ga2 &amp;lt;- 
  gafs(rec_num,
       data = dat_train,
       iters = 5, 
       gafsControl = rf_ga_ctrl) 
rf_ga2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: RMSE, Rsquared
## Subset selection driven to minimize internal RMSE 
## 
## External performance values: RMSE, Rsquared, MAE
## Best iteration chose by minimizing external RMSE 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     cyl (80%), disp (80%), hp (80%), wt (80%), carb (60%)
##   * on average, 4.8 variables were selected (min = 2, max = 9)
## 
## In the final search using the entire training set:
##    * 6 features selected at iteration 5 including:
##      cyl, disp, hp, wt, gear ... 
##    * external performance at this iteration is
## 
##       RMSE   Rsquared        MAE 
##      2.830      0.928      2.408&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_ga2$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;wt&amp;quot;    &amp;quot;gear&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s try running an SVM with the categorical outcome recipe.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# specify control
svm_ga_ctrl &amp;lt;- gafsControl(functions = caretGA, method = &amp;quot;cv&amp;quot;, number = 5)

# run SVM
set.seed(123)
svm_ga &amp;lt;- 
  gafs(rec_cat,
       data = dat_train,
       iters = 5, 
       gafsControl = svm_ga_ctrl,
       # below is the options to `train` for caretGA
       method = &amp;quot;svmRadial&amp;quot;, #SVM with Radial Basis Function Kernel
       trControl = trainControl(method = &amp;quot;cv&amp;quot;, allowParallel = T))
svm_ga&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Genetic Algorithm Feature Selection
## 
## 24 samples
## 10 predictors
## 2 classes: &amp;#39;auto&amp;#39;, &amp;#39;man&amp;#39; 
## 
## Maximum generations: 5 
## Population per generation: 50 
## Crossover probability: 0.8 
## Mutation probability: 0.1 
## Elitism: 0 
## 
## Internal performance values: Accuracy, Kappa
## Subset selection driven to maximize internal Accuracy 
## 
## External performance values: Accuracy, Kappa
## Best iteration chose by maximizing external Accuracy 
## External resampling method: Cross-Validated (5 fold) 
## 
## During resampling:
##   * the top 5 selected variables (out of a possible 10):
##     wt (80%), qsec (60%), vs_X1 (60%), carb (40%), disp (40%)
##   * on average, 4 variables were selected (min = 3, max = 6)
## 
## In the final search using the entire training set:
##    * 9 features selected at iteration 2 including:
##      mpg, cyl, disp, hp, drat ... 
##    * external performance at this iteration is
## 
##    Accuracy       Kappa 
##      0.9200      0.8571&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimal variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svm_ga$optVariables&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;mpg&amp;quot;   &amp;quot;cyl&amp;quot;   &amp;quot;disp&amp;quot;  &amp;quot;hp&amp;quot;    &amp;quot;drat&amp;quot;  &amp;quot;wt&amp;quot;    &amp;quot;qsec&amp;quot;  &amp;quot;carb&amp;quot;  &amp;quot;vs_X1&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Although the genetic algorithm seems quite good for variable selection, its main limitation, I would say, is the computational time. However, if we have a lot of variables or features to reduce, using the genetic algorithm despite the long computational time seems beneficial to me.&lt;/p&gt;
&lt;p&gt;Reference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&#34; class=&#34;uri&#34;&gt;https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html#ga&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/feature-selection-using-genetic-algorithms-in-r-3d9252f1aa66&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>My first interactive map with {leaflet}</title>
      <link>https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/</link>
      <pubDate>Sun, 28 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/pymjs/pym.v1.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/widgetframe-binding/widgetframe.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have tried creating a map with ggplot2 &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;previously&lt;/a&gt;. In this post, I will try to create an interactive map using &lt;code&gt;leaflet&lt;/code&gt; package in R.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(tidygeocoder)
library(leaflet)
library(htmltools)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, I’m going to use data on clinic locations in Malaysia. I have already uploaded this data to my &lt;a href=&#34;https://github.com/tengku-hanis/clinic-data&#34;&gt;GitHub repo&lt;/a&gt;. I will skip the explanation of the pre-processing part, as it is the same pre-processing as in my &lt;a href=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/&#34;&gt;previous post&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Read the data
clinic1m &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinic1m.csv&amp;quot;)
clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;details&gt;
&lt;summary&gt;
Show code for pre-processing
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the missing coordinate based on postal codes
clinic1m2 &amp;lt;- 
  clinic1m %&amp;gt;%
  mutate(country = &amp;quot;malaysia&amp;quot;) %&amp;gt;% 
  select(name, postcode, country) %&amp;gt;% 
  mutate(postcode = ifelse(nchar(postcode) == 4, paste0(0, postcode), postcode)) %&amp;gt;%
  geocode(postalcode = postcode, country = country, method = &amp;quot;osm&amp;quot;)

# Add coordinate from external sources for the still missing coordinates
add_coord &amp;lt;- 
  read.table(header = T, text = &amp;quot;
postal_code    latitude   longitude
16070            6.0334    102.3499
26060            3.6228    102.3926
90700            5.8456    118.0571
26060            3.6228    102.3926&amp;quot;)

# Drop clinics with the still missing coordinate
clinic1m2 &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(lat = ifelse(postcode %in% add_coord$postal_code, add_coord$latitude, lat), 
         long = ifelse(postcode %in% add_coord$postal_code, add_coord$longitude, long)) %&amp;gt;% 
  drop_na() #drop 2 clinic1m

# Bind the 2 data
all_clinic &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(Type = &amp;quot;1Malaysia&amp;quot;) %&amp;gt;% 
  select(name, Type, lat, long) %&amp;gt;% 
  bind_rows(clinicDesa %&amp;gt;% 
              mutate(Type = &amp;quot;Desa&amp;quot;, 
                     lat = latitude, 
                     long = longitude) %&amp;gt;% 
              select(name, Type, lat, long)) %&amp;gt;% 
  mutate(name = str_to_title(name))&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;First, we are going to plot the coordinates to see if there is anything strange.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(all_clinic, aes(long, lat, color = Type)) +
  geom_point() +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/my-first-interactive-map-with-leaflet/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, we are going to remove the two isolated points as seen from the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic2 &amp;lt;- all_clinic %&amp;gt;% filter(long &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have our data ready, we can supply it to &lt;code&gt;leaflet&lt;/code&gt;. We can choose the type of base map with &lt;code&gt;addProviderTiles()&lt;/code&gt;; some providers need an API key, but the one we choose here does not. We supply the longitude and latitude of our data to &lt;code&gt;addCircleMarkers()&lt;/code&gt;, and use &lt;code&gt;clusterOptions&lt;/code&gt; to cluster the markers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaflet(all_clinic2) %&amp;gt;% 
  addProviderTiles(providers$Stamen.Watercolor) %&amp;gt;%
  addProviderTiles(providers$Stamen.TerrainLabels) %&amp;gt;%
  addCircleMarkers(~long, ~lat, 
                   clusterOptions = markerClusterOptions())&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;index.en_files/figure-html//widgets/widget_unnamed-chunk-7.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Next, we can add a label.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;labels &amp;lt;- 
  sprintf(&amp;quot;&amp;lt;strong&amp;gt;%s&amp;lt;/strong&amp;gt;&amp;quot;, all_clinic2$name) %&amp;gt;% #use the filtered data so the labels match the markers
  lapply(htmltools::HTML)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we can add a mini map to our map. Here, I change the base map to a more appropriate one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaflet(all_clinic2) %&amp;gt;% 
  addProviderTiles(providers$OpenStreetMap) %&amp;gt;%
  addCircleMarkers(~long, ~lat, popup = ~labels, # popup add the label
                   clusterOptions = markerClusterOptions()) %&amp;gt;% 
    # add a mini map
  addMiniMap(tiles = providers$OpenStreetMap, zoomLevelOffset = -3)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-2&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;index.en_files/figure-html//widgets/widget_unnamed-chunk-10.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Notice that the coordinates look more accurate as compared to the map I created with &lt;code&gt;ggplot2&lt;/code&gt; previously.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://lauriebaker.rbind.io/post/where_work/&#34; class=&#34;uri&#34;&gt;https://lauriebaker.rbind.io/post/where_work/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://laurielbaker.github.io/DSCA_leaflet_mapping_in_r/&#34; class=&#34;uri&#34;&gt;https://laurielbaker.github.io/DSCA_leaflet_mapping_in_r/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Variable selection for imputation model in {mice}</title>
      <link>https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/</link>
      <pubDate>Mon, 22 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;some-note&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some note&lt;/h2&gt;
&lt;p&gt;I have written a &lt;a href=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/&#34;&gt;short post&lt;/a&gt; about missing data and multiple imputation in &lt;code&gt;mice&lt;/code&gt; package previously. This post will add to that previous post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;imputation-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Imputation model&lt;/h2&gt;
&lt;p&gt;The imputation model is the model that we use for our imputation approach. A related term is the complete-data model, which is the model that we want to fit after we impute the missing values (i.e., the complete-data model is the final model).&lt;/p&gt;
&lt;p&gt;Generally, we need to include as many relevant variables as possible in the imputation model. However, this general advice may not be very efficient, as we may run into multicollinearity and computational issues if we include too many predictors. As a rule of thumb, the number of included variables should be no more than 15-20. &lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34;&gt;van Buuren &lt;em&gt;et al.&lt;/em&gt; (2011)&lt;/a&gt; mentioned that the increase in explained variance in linear regression is negligible after 15 variables are included.&lt;/p&gt;
&lt;p&gt;There are 4 steps suggested by &lt;a href=&#34;https://stefvanbuuren.name/publications/Flexible%20multivariate%20-%20TNO99054%201999.pdf&#34;&gt;van Buuren &lt;em&gt;et al.&lt;/em&gt; (1999)&lt;/a&gt; for variable selection in the case of big data:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Include all variables that appear in the complete-data model (final model)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This may include the interaction terms as well (passive imputation can be used to specify the interaction terms in &lt;code&gt;mice&lt;/code&gt; package)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include variables that influence the occurrence of the missing data&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be assessed by a correlation matrix between NAs variables and non-NAs variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Include variables that explain a considerable amount of variance&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be crudely assessed by a correlation matrix between NAs variables and non-NAs variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remove variables that have too many missing values within the subgroup of incomplete cases&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This can be assessed by the proportion of usable cases (PUC) - how many cases with missing data on a certain variable have observed values on the predictor variables&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All these steps should be done on the key variables only. There is another, more efficient yet laborious, approach suggested by &lt;a href=&#34;https://stefvanbuuren.name/publications/Flexible%20multiple%20-%20TNO99045%201999.pdf&#34;&gt;Oudshoorn &lt;em&gt;et al.&lt;/em&gt; (1999)&lt;/a&gt;, which takes into account the important predictors of the predictors. We are going to focus on the four steps above and not cover the latter approach in this post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-codes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R codes&lt;/h2&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mice)
library(corrplot)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA&amp;#39;s   :37       NA&amp;#39;s   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 2 variables, Ozone and Solar.R, with missing values (NAs). We can further explore the pattern of missingness.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;md.pattern(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##     Wind Temp Month Day Solar.R Ozone   
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 2 rows with NAs in both Ozone and Solar.R, 35 rows with NAs only in Ozone, and 5 rows with NAs only in Solar.R. Next, we can check the correlation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(airquality, use = &amp;quot;pairwise.complete.obs&amp;quot;) |&amp;gt;
  corrplot(method = &amp;quot;number&amp;quot;, type = &amp;quot;upper&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/variable-selection-for-imputation-model-in-mice/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The correlations of Ozone-Temp and Ozone-Wind are the highest. Now, let’s compute the correlation between the missingness indicators and the observed values of the variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(y = airquality, x = !is.na(airquality), use = &amp;quot;pairwise.complete.obs&amp;quot;) |&amp;gt;
  round(digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R  Wind Temp Month   Day
## Ozone      NA   -0.02 -0.05 0.00  0.26 -0.05
## Solar.R     0      NA  0.06 0.11  0.11  0.17
## Wind       NA      NA    NA   NA    NA    NA
## Temp       NA      NA    NA   NA    NA    NA
## Month      NA      NA    NA   NA    NA    NA
## Day        NA      NA    NA   NA    NA    NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can ignore the warnings and the NAs, as only Ozone and Solar.R have missing values. The highest correlation is 0.26, for the Ozone-Month pair - the correlation between the indicator of whether Ozone is observed and the Month values. In this correlation matrix the row variables are the indicators of whether a value is observed (from &lt;code&gt;!is.na()&lt;/code&gt;) and the column variables are the observed values.&lt;/p&gt;
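&lt;p&gt;To make this concrete, the 0.26 entry can be reproduced directly (a small check, not in the original post):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Correlation between the indicator of Ozone being observed and the Month values
round(cor(!is.na(airquality$Ozone), airquality$Month), digits = 2)&lt;/code&gt;&lt;/pre&gt;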
&lt;p&gt;Lastly, we can calculate the PUC (proportion of usable cases) ‘manually’. &lt;code&gt;md.pairs()&lt;/code&gt; counts the number of observations for each variable pair, according to whether each member of the pair is observed or missing.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;var_pair &amp;lt;- md.pairs(airquality)
round(var_pair$mr / (var_pair$mr + var_pair$mm), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone   0.000   0.946    1    1     1   1
## Solar.R 0.714   0.000    1    1     1   1
## Wind      NaN     NaN  NaN  NaN   NaN NaN
## Temp      NaN     NaN  NaN  NaN   NaN NaN
## Month     NaN     NaN  NaN  NaN   NaN NaN
## Day       NaN     NaN  NaN  NaN   NaN NaN&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A low PUC value indicates that the predictor carries little information for imputing the target variable. NaN is shown where the variables have no missing values. The row variables are the target variables to be imputed, and the column variables are the predictors in the imputation model. We can see that to impute Solar.R (on the row), Ozone carries a little less information (0.714) compared to Wind, Temp, and Day. The diagonal elements will always be 0 or NaN. So, from here we can drop predictors with, say, a PUC of 0, as they contain no information to help impute the target variable.&lt;/p&gt;
&lt;p&gt;Actually, we have a nice function from &lt;code&gt;mice&lt;/code&gt; that can do what we ‘manually’ did just now.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;quickpred(airquality)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, the column variables are the predictors, and the row variables are the target variables with NAs. The matrix above is known as the predictor matrix, which is going to be used in the imputation model. A 1 denotes a variable included as a predictor, and a 0 a variable that is excluded. The two main arguments in &lt;code&gt;quickpred()&lt;/code&gt; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;mincor - if any of the absolute correlations from the two correlation matrices that we computed earlier is above 0.1 (the default), the predictor will be included in the predictor matrix&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;minpuc - the default minimum PUC is 0, so by default predictors are retained even if they have no information to help the imputation model&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice that variable Day is excluded from the predictors of Ozone. The correlation values are 0 and -0.05 from the first and second correlation matrices, respectively, which do not exceed the default threshold of 0.1. That’s why variable Day is excluded. We can observe a similar situation for variable Wind, which is excluded from the predictors of Solar.R (the correlation coefficients are -0.06 and 0.06). The negative (-) sign does not matter, as we actually evaluate the absolute values.&lt;/p&gt;
&lt;p&gt;Intuitively, we can change these two arguments as we see fit to do variable selection for the imputation model. Once we finalise our variable selection, we can do the multiple imputation using &lt;code&gt;mice()&lt;/code&gt;.&lt;/p&gt;
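&lt;p&gt;For illustration, this is what a stricter selection could look like (a sketch only; the thresholds are arbitrary, and we keep the defaults for the finalised selection below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Require a higher correlation and a minimum proportion of usable cases
quickpred(airquality, mincor = 0.2, minpuc = 0.25)&lt;/code&gt;&lt;/pre&gt;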
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalised variable selection
var_sel &amp;lt;- quickpred(airquality)
var_sel&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Impute
imp &amp;lt;- mice(airquality, m = 5, predictorMatrix = var_sel, printFlag = F)
imp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day 
##   &amp;quot;pmm&amp;quot;   &amp;quot;pmm&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot;      &amp;quot;&amp;quot; 
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that &lt;code&gt;mice()&lt;/code&gt; uses the predictor matrix that we provide.&lt;/p&gt;
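&lt;p&gt;To close the loop with the complete-data model mentioned at the start, here is a minimal sketch (not part of the original post) of fitting a model on each imputed dataset and pooling the results, using the standard &lt;code&gt;mice&lt;/code&gt; workflow; the choice of predictors here is arbitrary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Fit the complete-data model on each of the 5 imputed datasets and pool
fit &amp;lt;- with(imp, lm(Ozone ~ Wind + Temp + Solar.R))
summary(pool(fit))&lt;/code&gt;&lt;/pre&gt;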
&lt;p&gt;References:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34; class=&#34;uri&#34;&gt;https://www.jstatsoft.org/article/view/v045i03&lt;/a&gt; - paper written by Staf van Buuren (a bit outdated in terms of codes, but runnable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/&#34; class=&#34;uri&#34;&gt;https://stefvanbuuren.name/fimd/&lt;/a&gt; - online book written by Stef van Buuren (See chapter 6.3.2 and 9.1.6)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Making maps with R (my first attempt ever!)</title>
      <link>https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/</link>
      <pubDate>Fri, 12 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;As written in the title of the post, this is my first ever try at making a map with R. I found a great dataset on the distribution of clinics in Malaysia. The two types of clinics that we have here are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Klinik 1Malaysia (1Malaysia clinic)&lt;/li&gt;
&lt;li&gt;Klinik Desa (Desa clinic)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Originally, these are two separate datasets. Both can be downloaded from &lt;a href=&#34;https://www.data.gov.my/data/ms_MY/group/pemetaan&#34;&gt;here&lt;/a&gt;. Also, I have uploaded the data to my &lt;a href=&#34;https://github.com/tengku-hanis/clinic-data&#34;&gt;GitHub repo&lt;/a&gt; for those interested. The Klinik Desa data have latitude and longitude information, but the Klinik 1Malaysia data do not.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rworldmap) #to get a Malaysia map
library(tidyverse)
library(tidygeocoder) #to get latitude and logitude&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinic1m.csv&amp;quot;)
clinicDesa &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/clinic-data/main/clinicdesa.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we need to get latitude and longitude information for the Klinik 1Malaysia data. So, we are going to retrieve the coordinates based on the postal code, though this is not very accurate. We can use &lt;code&gt;tidygeocoder&lt;/code&gt; for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m2 &amp;lt;- 
  clinic1m %&amp;gt;%
  mutate(country = &amp;quot;malaysia&amp;quot;) %&amp;gt;% 
  select(name, postcode, country) %&amp;gt;% 
  mutate(postcode = ifelse(nchar(postcode) == 4, paste0(0, postcode), postcode)) %&amp;gt;%
  geocode(postalcode = postcode, country = country, method = &amp;quot;osm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Checking the data further, we notice that 5 clinics have no coordinate information.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clinic1m2 %&amp;gt;% filter(is.na(lat) | is.na(long))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 5
##   name                                     postcode country    lat  long
##   &amp;lt;chr&amp;gt;                                    &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Klinik 1 Malaysia Bandar Lela            90700    malaysia    NA    NA
## 2 Klinik 1 Malaysia Batu Melintang         17250    malaysia    NA    NA
## 3 Klinik 1 Malaysia Cakerapurnama          45010    malaysia    NA    NA
## 4 Klinik 1 Malaysia Jelawat                16070    malaysia    NA    NA
## 5 Klinik 1 Malaysia Taman Kempadang Makmur 26060    malaysia    NA    NA&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;some-data-pre-processing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Some data pre-processing&lt;/h2&gt;
&lt;p&gt;After some time googling, I found this &lt;a href=&#34;https://www.listendata.com/2020/11/zip-code-to-latitude-and-longitude.html&#34;&gt;data&lt;/a&gt;, which gives coordinates based on the postal code. So, we are going to fill in the missing coordinates based on this online data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;add_coord &amp;lt;- 
  read.table(header = T, text = &amp;quot;
postal_code    latitude   longitude
16070            6.0334    102.3499
26060            3.6228    102.3926
90700            5.8456    118.0571
26060            3.6228    102.3926&amp;quot;)

clinic1m2 &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(lat = ifelse(postcode %in% add_coord$postal_code, add_coord$latitude, lat), 
         long = ifelse(postcode %in% add_coord$postal_code, add_coord$longitude, long)) %&amp;gt;% 
  drop_na() #drop 2 clinic1m&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even after adding in the missing coordinates, we are still missing 2 coordinates. So, we are going to drop those 2 clinics. Next, we combine both datasets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic &amp;lt;- 
  clinic1m2 %&amp;gt;% 
  mutate(Type = &amp;quot;1Malaysia&amp;quot;) %&amp;gt;% 
  select(Type, lat, long) %&amp;gt;% 
  bind_rows(clinicDesa %&amp;gt;% 
              mutate(Type = &amp;quot;Desa&amp;quot;, 
                     lat = latitude, 
                     long = longitude) %&amp;gt;% 
              select(Type, lat, long))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s try plotting the data first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(all_clinic, aes(long, lat, color = Type)) +
  geom_point() +
  theme_minimal() #should remove the isolated two data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have 2 isolated points from Klinik Desa data. We will drop these 2 points as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_clinic2 &amp;lt;- all_clinic %&amp;gt;% filter(long &amp;gt; 25)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;plotting-the-map&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plotting the map&lt;/h2&gt;
&lt;p&gt;There are 2 ways to plot our data on a map of Malaysia that we are going to cover in this post.&lt;/p&gt;
&lt;div id=&#34;map-from-ggplot2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1) map from &lt;code&gt;ggplot2&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;First, we need to get the map.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;global &amp;lt;- map_data(&amp;quot;world&amp;quot;) #get map&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we have retrieved the map, we need to filter the region to Malaysia. The rest of the code is the &lt;code&gt;ggplot2&lt;/code&gt; functions as we know them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + 
  geom_polygon(data = global %&amp;gt;% filter(region == &amp;quot;Malaysia&amp;quot;), aes(x=long, y = lat, group = group), 
               fill = &amp;quot;gray85&amp;quot;) + 
  coord_fixed(1.3) +
  geom_point(data = all_clinic2, aes(x = long, y = lat, group = Type, color = Type, shape = Type)) +
  theme_void() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Klinik 1Malaysia dan Klinik Desa di Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data dikemaskini: Klinik 1Malaysia - 16 Mac 2021, Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;))), 
       color = &amp;quot;Jenis klinik:&amp;quot;, 
       shape = &amp;quot;Jenis klinik:&amp;quot;) +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = &amp;quot;bottom&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;map-from-rworldmap&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2) map from &lt;code&gt;rworldmap&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The flow is similar: we need to get the map first, then restrict it to the Malaysia region.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;world &amp;lt;- getMap(resolution = &amp;quot;low&amp;quot;) #get map
msia &amp;lt;- world[world@data$ADMIN == &amp;quot;Malaysia&amp;quot;, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rest of the code is similar to the first approach, but we are going to change the theme a bit.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() +
  geom_polygon(data = msia, aes(x = long, y = lat, group = group), fill = NA, colour = &amp;quot;black&amp;quot;) +
  geom_point(data = all_clinic2, aes(x = long, y = lat, group = Type, color = Type, shape = Type)) +
  coord_quickmap() + 
  theme_minimal() + 
  xlab(&amp;quot;Longitude&amp;quot;) +
  ylab(&amp;quot;Latitude&amp;quot;) +
  labs(title = &amp;quot;Klinik 1Malaysia dan Klinik Desa di Malaysia&amp;quot;, 
       subtitle = &amp;quot;(Data dikemaskini: Klinik 1Malaysia - 16 Mac 2021, Klinik Desa - 9 Mac 2021)&amp;quot;,
       caption = expression(paste(italic(&amp;quot;Sumber data: https://www.data.gov.my/data/ms_MY/group/pemetaan&amp;quot;))), 
       color = &amp;quot;Jenis klinik:&amp;quot;, 
       shape = &amp;quot;Jenis klinik:&amp;quot;) +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5), 
        legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/making-maps-with-r-my-first-attempt-ever/index.en_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The coordinates that we have are not as accurate as they should be, or maybe there is something wrong that I missed along the way. As we can see, we have clinics in the ocean. As far as I know, we Malaysians are not that advanced yet. Also, notice that we are severely lacking clinics in Sarawak, assuming our data is correct.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Some COVID-19 plots for Southeast Asian countries</title>
      <link>https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/</link>
      <pubDate>Wed, 10 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Recently, I found a GitHub &lt;a href=&#34;https://github.com/owid/covid-19-data/tree/master/public/data&#34;&gt;repo&lt;/a&gt; containing a global COVID-19 dataset. I thought, why not try to do some plotting for Southeast Asian countries. So, I downloaded the data and limited the data to Southeast Asian countries only (Brunei, Indonesia, Malaysia, Philippines, Singapore, Thailand and Vietnam). I have uploaded this restricted data to my GitHub &lt;a href=&#34;https://github.com/tengku-hanis/data-owid-covid&#34;&gt;repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are not going to do anything fancy, just some visualisations.&lt;/p&gt;
&lt;p&gt;Let’s begin by reading the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
covid_sea &amp;lt;- read_csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/data-owid-covid/main/covid_sea.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to compare the Southeast Asian countries in terms of:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Daily cases&lt;/li&gt;
&lt;li&gt;Daily deaths&lt;/li&gt;
&lt;li&gt;Daily tests&lt;/li&gt;
&lt;li&gt;Daily vaccinations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Before that, we need to write a function, as all the above items share the same plotting structure and differ only in the variable on the y axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot &amp;lt;- function(var1, lab_title, yaxis_lab, span = 0.14){
  covid_sea %&amp;gt;% 
    select(date, location, {{var1}}) %&amp;gt;% #{{ }} lets us pass an unquoted column name
    drop_na() %&amp;gt;% 
    ggplot(aes(date, {{var1}}, color = location)) +
    geom_smooth(se = F, span = span) + #pass the span argument through (was hard-coded as 0.14)
    geom_point(aes(color = location), alpha = 0.2) +
    geom_line(aes(color = location), alpha = 0.2, linetype = &amp;quot;dashed&amp;quot;) +
    labs(title = {{lab_title}}) +
    ylab({{yaxis_lab}}) +
    xlab(&amp;quot;Date&amp;quot;) +
    theme_minimal() 
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, var1 is the variable that we want to compare, lab_title is the plot title, yaxis_lab is the label on the y axis, and span controls how smooth the smoothed line is (it is passed to &lt;code&gt;geom_smooth()&lt;/code&gt;).&lt;/p&gt;
&lt;div id=&#34;daily-cases&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily cases&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_cases, &amp;quot;Daily cases for southeast Asian countries&amp;quot;, &amp;quot;Daily cases&amp;quot;, span = 0.8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We cannot compare the raw counts, as big countries like Indonesia are expected to have a higher number of daily cases. A smoothed line, though very basic, may indicate a simple trend. Thailand, Malaysia, the Philippines and Indonesia seem to have a decreasing trend of cases. On the other hand, the daily cases in Vietnam seem to be starting to increase. Singapore had a more stable trend of cases, though a higher number of cases was observed in the latest period. Lastly, Brunei had too few cases for us to see any sort of trend at the scale of this between-country comparison.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-deaths&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily deaths&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_deaths, &amp;quot;Daily deaths for southeast Asian countries&amp;quot;, &amp;quot;Daily deaths&amp;quot;, span = 0.8)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The Philippines and Indonesia seem to show the start of a slightly increasing trend in deaths. The other countries look stable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-tests&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily tests&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_tests, &amp;quot;Daily tests for southeast Asian countries&amp;quot;, &amp;quot;Daily tests&amp;quot;, span = 0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The daily tests plot looks a bit odd for Vietnam. Daily test counts are actually missing for much of the period (it is unclear whether no tests were done or the values simply were not recorded), hence the odd-looking smoothed line for Vietnam. Data for Brunei and Thailand are not available at all. Malaysia seems to be quite aggressive in COVID-19 testing, on par with Indonesia. Vietnam also seems to have tested very aggressively in the latest period, probably to make up for the lack of testing earlier.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;daily-vaccinations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Daily vaccinations&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;easy_plot(new_vaccinations, &amp;quot;Daily vaccinations for southeast Asian countries&amp;quot;, &amp;quot;Daily vaccinations&amp;quot;, span = 0.9)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Malaysia and Singapore had quite similar distributions. Vietnam, the Philippines, Thailand and Indonesia are also fairly similar in that they show a series of waves in the rate of vaccinations, although the waves are less obvious for Thailand. Again, the numbers for Brunei are too small for us to see any trend or distribution at this scale.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;malaysia-situation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Malaysia situation&lt;/h2&gt;
&lt;p&gt;Let’s do a plot specific to Malaysia. We are going to scale the numbers so that we can compare the trends and distributions of the different items on a common scale.&lt;/p&gt;
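&lt;p&gt;As a quick reminder, &lt;code&gt;scale()&lt;/code&gt; centres a variable to mean 0 and standard deviation 1, which is what makes items measured on very different scales comparable:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# scale() standardises the values: (x - mean(x)) / sd(x)
scale(c(10, 20, 30))  # gives -1, 0, 1&lt;/code&gt;&lt;/pre&gt;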
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covid_sea %&amp;gt;% 
  filter(location == &amp;quot;Malaysia&amp;quot;) %&amp;gt;% 
  mutate(new_cases = scale(new_cases), 
         new_deaths = scale(new_deaths), 
         new_tests = scale(new_tests), 
         new_vaccinations = scale(new_vaccinations)) %&amp;gt;% 
  ggplot(aes(date)) +
  geom_line(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), alpha = 0.3) +
  geom_line(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), alpha = 0.3) +
  geom_point(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), alpha = 0.3) +
  geom_smooth(aes(y = new_cases, color = &amp;quot;new_cases&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_deaths, color = &amp;quot;new_deaths&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_tests, color = &amp;quot;new_tests&amp;quot;), se = F, span = 0.3) +
  geom_smooth(aes(y = new_vaccinations, color = &amp;quot;new_vaccinations&amp;quot;), se = F, span = 0.6) +
  labs(title = &amp;quot;Situation in Malaysia&amp;quot;) +
  ylab(&amp;quot;Scaled Frequency&amp;quot;) +
  xlab(&amp;quot;Date&amp;quot;) +
  guides(color = guide_legend(&amp;quot;Items&amp;quot;)) +
  scale_color_discrete(labels = c(&amp;quot;Daily cases&amp;quot;, &amp;quot;Daily deaths&amp;quot;, &amp;quot;Daily tests&amp;quot;, &amp;quot;Daily vaccinations&amp;quot;)) +
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/some-covid-19-plots-for-southeast-asian-countries/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, once the number of vaccinations increased past a certain threshold, the number of daily cases and daily deaths started to decrease. The daily testing decreased as well, since COVID-19 testing in Malaysia is done on suspected cases and their contacts rather than through mass testing.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: Please take anything written here with a massive grain of salt.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Data source:
&lt;a href=&#34;https://github.com/owid/covid-19-data/tree/master/public/data&#34; class=&#34;uri&#34;&gt;https://github.com/owid/covid-19-data/tree/master/public/data&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Extract a table from a pdf</title>
      <link>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</link>
      <pubDate>Mon, 01 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In a couple of days, I am going to conduct a pre-conference workshop for the Malaysian &lt;a href=&#34;https://www.r-conference.com/&#34;&gt;R conference 2021&lt;/a&gt;. Some of the data that I am going to use for this workshop is available as a table in a pdf. Hence, this post is about how I got that particular table from the pdf into R for further analysis.&lt;/p&gt;
&lt;p&gt;So, this is the table we are going to extract.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;images/table.png&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;extracting-a-table-from-pdf&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting a table from pdf&lt;/h2&gt;
&lt;p&gt;We are going to use the &lt;code&gt;tabulizer&lt;/code&gt; package for this. However, not every pdf works with this package. In our case, it works but needs further preprocessing.&lt;/p&gt;
&lt;p&gt;Load the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tabulizer)
library(dplyr)
library(stringr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read a table from a pdf.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table &amp;lt;- extract_tables(&amp;quot;https://static-content.springer.com/esm/art%3A10.1038%2Fs41440-021-00720-3/MediaObjects/41440_2021_720_MOESM1_ESM.pdf&amp;quot;, 
                          pages = 17, 
                          output = &amp;quot;data.frame&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, this is the extracted table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table[[1]] %&amp;gt;% head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                X     X.1     X.2     X.3  X.4     X.5 X.6     X.7  X.8
## 1                                                                     
## 2                                                                     
## 3    Ahmed, 2019 Unclear Unclear Unclear High Unclear Low Unclear High
## 4                                                                     
## 5   Badrov, 2013 Unclear    High    High High Unclear Low Unclear High
## 6   Baross, 2012 Unclear Unclear    High High Unclear Low Unclear High
## 7   Baross, 2013 Unclear Unclear    High High Unclear Low Unclear High
## 8  Carlson, 2016     Low    High    High  Low Unclear Low     Low High
## 9  Correia, 2020     Low     Low     Low High Unclear Low     Low High
## 10                                                                    
##                              X.9
## 1      1- selection bias: random
## 2            sequence generation
## 3  2- selection bias: allocation
## 4                    concealment
## 5                               
## 6   3- reporting bias: selective
## 7                      reporting
## 8                               
## 9  4- Performance bias: blinding
## 10  (participants and personnel)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, a few preprocessing steps are needed:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Remove column X.9 - this column is actually supposed to be part of the header&lt;/li&gt;
&lt;li&gt;Rename the headers based on column X.9&lt;/li&gt;
&lt;li&gt;Remove the space within the author names - “Ahmed,2019” instead of “Ahmed, 2019”&lt;/li&gt;
&lt;li&gt;Remove empty rows&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob &amp;lt;- 
  raw_table[[1]] %&amp;gt;% 
  select(-X.9) %&amp;gt;%  
  rename(Study = X, 
         Random.sequence.generation. = X.1, 
         Allocation.concealment. = X.2,
         Selective.reporting. = X.3,
         Blinding.of.participants.and.personnel. = X.4, 
         Blinding.of.outcome.assessment = X.5, 
         Incomplete.outcome.data = X.6, 
         Other.sources.of.bias. = X.7, 
         Overall = X.8) %&amp;gt;% 
  as_tibble() %&amp;gt;% 
  mutate(Study = str_replace_all(Study, &amp;quot; &amp;quot;, &amp;quot;&amp;quot;)) %&amp;gt;% 
  mutate(id_del = str_match(Study, &amp;quot;.&amp;quot;)) %&amp;gt;% 
  filter(!is.na(id_del)) %&amp;gt;% 
  select(-id_del)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, our data is ready.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          Study Random.sequence.generation. Allocation.concealment.
## 1   Ahmed,2019                     Unclear                 Unclear
## 2  Badrov,2013                     Unclear                    High
## 3  Baross,2012                     Unclear                 Unclear
## 4  Baross,2013                     Unclear                 Unclear
## 5 Carlson,2016                         Low                    High
##   Selective.reporting. Blinding.of.participants.and.personnel.
## 1              Unclear                                    High
## 2                 High                                    High
## 3                 High                                    High
## 4                 High                                    High
## 5                 High                                     Low
##   Blinding.of.outcome.assessment Incomplete.outcome.data Other.sources.of.bias.
## 1                        Unclear                     Low                Unclear
## 2                        Unclear                     Low                Unclear
## 3                        Unclear                     Low                Unclear
## 4                        Unclear                     Low                Unclear
## 5                        Unclear                     Low                    Low
##   Overall
## 1    High
## 2    High
## 3    High
## 4    High
## 5    High&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A short note on multiple imputation</title>
      <link>https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/</link>
      <pubDate>Fri, 29 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Missing data is quite challenging to deal with. Deleting it may be the easiest solution, but not necessarily the best one. Missing data can be categorised into 3 types (&lt;a href=&#34;https://www.jstor.org/stable/2335739&#34;&gt;Rubin, 1976&lt;/a&gt;):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;MCAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing Completely At Random&lt;/li&gt;
&lt;li&gt;Example: some observations are missing because records were lost during a flood&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing At Random&lt;/li&gt;
&lt;li&gt;Example: the income variable is missing because some participants refuse to give their salary, which they deem to be very personal information&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MNAR&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing Not At Random&lt;/li&gt;
&lt;li&gt;Example: the weight variable is missing for morbidly obese participants because the scale cannot weigh them&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Out of the 3 types above, the most problematic is MNAR, though methods exist to deal with this type, for example the &lt;a href=&#34;https://cran.r-project.org/web/packages/miceMNAR/miceMNAR.pdf&#34;&gt;miceMNAR&lt;/a&gt; package in R.&lt;/p&gt;
&lt;p&gt;There are several approaches in handling missing data:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Listwise-deletion&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Best approach if the amount of missingness is very small&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using mean/median/mode imputation&lt;/li&gt;
&lt;li&gt;This approach is not advisable as it introduces bias by reducing the variance, even though the mean is unaffected&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Single imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple imputation above is considered as single imputation as well&lt;/li&gt;
&lt;li&gt;This approach ignores the uncertainty of the imputation and almost always underestimates the variance&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A bit more advanced; it addresses the limitations of the single imputation approach&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;However, the main assumption for any of these imputation methods is that the missingness is MCAR or MAR.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;multiple-imputation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Multiple imputation&lt;/h2&gt;
&lt;p&gt;In short, there are 2 approaches to multiple imputation implemented by R packages:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Joint modeling (JM) or joint multivariate normal distribution multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The main assumption for this method is that the observed data follows a multivariate normal distribution&lt;/li&gt;
&lt;li&gt;A violation of this assumption produces incorrect values, though a slight violation is still okay&lt;/li&gt;
&lt;li&gt;Some packages that implement this method: &lt;code&gt;Amelia&lt;/code&gt; and &lt;code&gt;norm&lt;/code&gt; (a minimal sketch follows this list)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fully conditional specification (FCS) or conditional multiple imputation&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Also known as multivariate imputation by chained equation (MICE)&lt;/li&gt;
&lt;li&gt;This approach is more flexible, as a distribution is assumed for each variable rather than for the whole dataset&lt;/li&gt;
&lt;li&gt;Some packages that implement this method: &lt;code&gt;mice&lt;/code&gt; and &lt;code&gt;mi&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
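&lt;p&gt;To give a flavour of the JM approach, here is a minimal sketch using &lt;code&gt;Amelia&lt;/code&gt;. This is only an illustration on the built-in airquality dataset (which has NAs in Ozone and Solar.R); it is not used anywhere else in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch of joint-modeling multiple imputation (illustration only)
library(Amelia)
a_out &amp;lt;- amelia(airquality, m = 5)  # 5 imputed datasets under a joint multivariate normal model
summary(a_out)&lt;/code&gt;&lt;/pre&gt;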
&lt;/div&gt;
&lt;div id=&#34;example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example&lt;/h2&gt;
&lt;p&gt;In the &lt;code&gt;mice&lt;/code&gt; package, the general steps are as follows (a compact code sketch comes after Figure 1):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;mice()&lt;/code&gt; - impute the NAs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;with()&lt;/code&gt; - run the analysis (lm, glm, etc)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pool()&lt;/code&gt; - pool the results&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;Screenshot%202021-11-20%20145517.png&#34; alt=&#34;Main steps in mice package.&#34; width=&#34;90%&#34; height=&#34;90%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Main steps in mice package.
&lt;/p&gt;
&lt;/div&gt;
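&lt;p&gt;Mapped onto code, the three steps look roughly like this. This is only a compact sketch; df, y, x1 and x2 below are placeholder names, and the real example with our data comes later in the post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Compact sketch of the mice workflow (placeholder names, not run here)
imp &amp;lt;- mice(df, m = 5, seed = 1)      # 1) impute the NAs, giving m imputed datasets
fit &amp;lt;- with(imp, lm(y ~ x1 + x2))     # 2) run the analysis on each imputed dataset
pooled &amp;lt;- pool(fit)                   # 3) pool the results using Rubin&amp;#39;s rules
summary(pooled)&lt;/code&gt;&lt;/pre&gt;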
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(mice)
library(VIM)
# library(missForest) # we only need prodNA() from this package, so we call it via missForest:: below
library(naniar)
library(niceFunction) #install from github (https://github.com/tengku-hanis/niceFunction)
library(dplyr)
library(gtsummary)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to introduce some NAs randomly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
dat &amp;lt;- iris %&amp;gt;% 
  select(-Sepal.Length)%&amp;gt;% 
  missForest::prodNA(0.2) %&amp;gt;%  # randomly insert 20% NAs
  mutate(Sepal.Length = iris$Sepal.Length)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explore the NAs and the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naniar::miss_var_summary(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 3
##   variable     n_miss pct_miss
##   &amp;lt;chr&amp;gt;         &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 Petal.Length     38     25.3
## 2 Sepal.Width      33     22  
## 3 Species          28     18.7
## 4 Petal.Width      21     14  
## 5 Sepal.Length      0      0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some references recommend removing variables with more than 50% NAs. Here, we purposely introduced 20% NAs into our data.&lt;/p&gt;
&lt;p&gt;As a guideline, we can test whether our NAs are MCAR.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naniar::mcar_test(dat) #p &amp;gt; 0.05, MCAR is indicated&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;            &amp;lt;int&amp;gt;
## 1      38.8    40   0.522               14&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to evaluate the pattern of missingness in our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;md.pattern(dat, rotate.names = T, plot = T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##    Sepal.Length Petal.Width Species Sepal.Width Petal.Length    
## 64            1           1       1           1            1   0
## 21            1           1       1           1            0   1
## 15            1           1       1           0            1   1
## 3             1           1       1           0            0   2
## 14            1           1       0           1            1   1
## 4             1           1       0           1            0   2
## 6             1           1       0           0            1   2
## 2             1           1       0           0            0   3
## 7             1           0       1           1            1   1
## 6             1           0       1           1            0   2
## 4             1           0       1           0            1   2
## 2             1           0       1           0            0   3
## 1             1           0       0           1            1   2
## 1             1           0       0           0            1   3
##               0          21      28          33           38 120&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;aggr(dat, prop = F, numbers = T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have 13 patterns of NAs in our data (in addition to the complete-case pattern in the first row). These 2 functions work well with a small dataset, but with a larger dataset (and many more patterns of NAs), it is probably quite difficult to assess the patterns this way.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;matrixplot()&lt;/code&gt; is probably more appropriate for a larger dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;matrixplot(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In terms of the missingness pattern, we can also assess whether the distribution of NAs in Sepal.Width depends on the variable Sepal.Length.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;niceFunction::histNA_byVar(dat, Sepal.Width, Sepal.Length)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, the distribution and range of the histograms for the NAs (True) and non-NAs (False) are quite similar. This may indicate that Sepal.Width is at least MAR. However, strictly speaking we should do this for each pair of numerical variables before jumping to any conclusion, as sketched below.&lt;/p&gt;
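&lt;p&gt;For completeness, a sketch of how the same check could be repeated for the other variables with NAs, mirroring the call above (this assumes histNA_byVar() accepts any pair of columns in the same way):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Repeat the same check for the remaining variables with NAs against Sepal.Length
niceFunction::histNA_byVar(dat, Petal.Length, Sepal.Length)
niceFunction::histNA_byVar(dat, Petal.Width, Sepal.Length)&lt;/code&gt;&lt;/pre&gt;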
&lt;p&gt;Another good thing to assess is the correlation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Data with 1 = NAs, 0 = non-NAs
x &amp;lt;- as.data.frame(abs(is.na(dat))) %&amp;gt;% 
  dplyr::select(-Sepal.Length) #pick variable with NAs only&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Firstly, the correlation between the variables with missing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(x) %&amp;gt;% 
  corrplot::corrplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There is no high correlation among the missingness indicators. Secondly, let’s see the correlation between the NAs in a variable and the observed values of the other variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(dat %&amp;gt;% mutate(Species = as.numeric(Species)), x, use = &amp;quot;pairwise.complete.obs&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##               Sepal.Width Petal.Length  Petal.Width     Species
## Sepal.Width            NA  0.049158733 -0.065917718  0.09948263
## Petal.Length  0.042075695           NA -0.004572405 -0.17265919
## Petal.Width   0.096195805 -0.003320601           NA -0.11024288
## Species       0.045849046 -0.104143925 -0.081055707          NA
## Sepal.Length -0.006435044 -0.052871701 -0.091024799 -0.08527514&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, there is no high correlation. To interpret this correlation matrix: the rows are the observed variables and the columns represent the missingness. For example, missing values of Sepal.Width are slightly more likely for observations with a high value of Petal.Width, although an r of about 0.1 means this association is negligible.&lt;/p&gt;
&lt;p&gt;Now, we can do multiple imputation. These are the methods in the &lt;code&gt;mice&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;methods(mice)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] mice.impute.2l.bin       mice.impute.2l.lmer      mice.impute.2l.norm     
##  [4] mice.impute.2l.pan       mice.impute.2lonly.mean  mice.impute.2lonly.norm 
##  [7] mice.impute.2lonly.pmm   mice.impute.cart         mice.impute.jomoImpute  
## [10] mice.impute.lda          mice.impute.logreg       mice.impute.logreg.boot 
## [13] mice.impute.mean         mice.impute.midastouch   mice.impute.mnar.logreg 
## [16] mice.impute.mnar.norm    mice.impute.norm         mice.impute.norm.boot   
## [19] mice.impute.norm.nob     mice.impute.norm.predict mice.impute.panImpute   
## [22] mice.impute.passive      mice.impute.pmm          mice.impute.polr        
## [25] mice.impute.polyreg      mice.impute.quadratic    mice.impute.rf          
## [28] mice.impute.ri           mice.impute.sample       mice.mids               
## [31] mice.theme              
## see &amp;#39;?methods&amp;#39; for accessing help and source code&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, mice uses the following methods (a sketch showing how to override them follows this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pmm (predictive mean matching) for numeric data&lt;/li&gt;
&lt;li&gt;logreg (logistic regression imputation) for binary data, factor with 2 levels&lt;/li&gt;
&lt;li&gt;polyreg (polytomous regression imputation) for unordered categorical data (factor &amp;gt; 2 levels)&lt;/li&gt;
&lt;li&gt;polr (proportional odds model) for ordered, &amp;gt; 2 levels&lt;/li&gt;
&lt;/ul&gt;
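&lt;p&gt;These defaults can be inspected and overridden. A minimal sketch, using &lt;code&gt;norm&lt;/code&gt; (Bayesian linear regression) as an example replacement for pmm:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the default methods chosen for our data, then override one of them
meth &amp;lt;- make.method(dat)
meth[&amp;quot;Sepal.Width&amp;quot;] &amp;lt;- &amp;quot;norm&amp;quot;  # Bayesian linear regression instead of pmm
imp_norm &amp;lt;- mice(dat, method = meth, m = 5, seed = 1234, printFlag = F)&lt;/code&gt;&lt;/pre&gt;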
&lt;p&gt;Let’s run the mice function on our data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp &amp;lt;- mice(dat, m = 5, seed=1234, maxit = 5, printFlag = F) 
imp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  Sepal.Width Petal.Length  Petal.Width      Species Sepal.Length 
##        &amp;quot;pmm&amp;quot;        &amp;quot;pmm&amp;quot;        &amp;quot;pmm&amp;quot;    &amp;quot;polyreg&amp;quot;           &amp;quot;&amp;quot; 
## PredictorMatrix:
##              Sepal.Width Petal.Length Petal.Width Species Sepal.Length
## Sepal.Width            0            1           1       1            1
## Petal.Length           1            0           1       1            1
## Petal.Width            1            1           0       1            1
## Species                1            1           1       0            1
## Sepal.Length           1            1           1       1            0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can do some diagnostic assessment on the imputed data. These are the imputed values for Sepal.Width, with one column per imputed dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp$imp$Sepal.Width %&amp;gt;% head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      1   2   3   4   5
## 5  3.4 3.4 4.1 3.1 3.5
## 13 3.2 3.1 3.2 3.6 3.1
## 14 3.1 3.2 2.9 3.4 3.0
## 23 3.6 3.2 3.0 3.8 3.1
## 26 4.1 3.0 3.1 3.5 3.0
## 34 3.4 3.7 3.7 3.4 4.4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One important thing to check is convergence. We are going to increase the number of iterations for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imp_conv &amp;lt;- mice.mids(imp, maxit = 30, printFlag = F)
plot(imp_conv)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-16-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The lines in the plot should be intermingled, with no obvious trend. Our plots above indicate convergence.&lt;/p&gt;
&lt;p&gt;We can also compare the density plots of the imputed and observed data. Blue is the observed data and red is the imputed data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densityplot(imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can further assess the variable Sepal.Width.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;densityplot(imp, ~ Sepal.Width | .imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-18-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can assess the strip plot. The imputed observations (red) should not be distributed too far from the observed data (blue).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;stripplot(imp)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-short-note-on-multiple-imputation/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, once we finish the diagnostic checking, we could go back and change the imputation method for Sepal.Width, since its distribution differs noticeably across imputations. But we are not going to do that here; instead we will proceed to the analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# run regression
fit &amp;lt;- with(imp, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species))
# pool all imputed set
pooled &amp;lt;- pool(fit) 
summary(pooled)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                term   estimate  std.error statistic       df      p.value
## 1       (Intercept)  2.2008307 0.34577321  6.364954 29.02484 5.859560e-07
## 2       Sepal.Width  0.5233500 0.09717217  5.385801 50.89918 1.854832e-06
## 3      Petal.Length  0.7409159 0.09020153  8.214006 12.73722 1.921415e-06
## 4       Petal.Width -0.3623895 0.18562168 -1.952301 22.34517 6.354332e-02
## 5 Speciesversicolor -0.3891112 0.28166528 -1.381467 15.07547 1.872683e-01
## 6  Speciesvirginica -0.5237106 0.42629920 -1.228505 10.82804 2.452897e-01&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we have the original dataset without the NAs, we are going to compare the results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mimpute &amp;lt;- 
  fit %&amp;gt;% 
  tbl_regression() #with mice

noimpute &amp;lt;- 
  dat %&amp;gt;% 
  lm(Sepal.Length ~ ., data = .) %&amp;gt;% 
  tbl_regression() #w/o mice

original &amp;lt;- 
  iris %&amp;gt;% 
  lm(Sepal.Length ~ ., data = .) %&amp;gt;% 
  tbl_regression() #original data

tbl_merge(
  tbls = list(mimpute, noimpute, original), 
  tab_spanner = c(&amp;quot;With MICE&amp;quot;, &amp;quot;Without MICE&amp;quot;, &amp;quot;Original data&amp;quot;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;kofvwjwgme&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;With MICE&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Without MICE&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;3&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Original data&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Beta&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;p-value&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Sepal.Width&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.52&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.33, 0.72&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.48&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.17, 0.79&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.50&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.33, 0.67&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Petal.Length&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.74&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.55, 0.94&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.71&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.51, 0.90&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.83&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.69, 1.0&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;0.001&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Petal.Width&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.75, 0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.064&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.35&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.85, 0.14&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.32&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.61, -0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.039&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;Species&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;setosa&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;—&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;versicolor&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.39&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.0, 0.21&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.1, 0.30&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.3&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.72&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.2, -0.25&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34; style=&#34;text-align: left; text-indent: 10px;&#34;&gt;virginica&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.52&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.5, 0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.2&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.42&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.5, 0.63&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.4&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.0&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-1.7, -0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.003&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;10&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;There is a difference in the results between the original dataset (no NAs) and the mice-imputed data. Exploring other imputation methods would probably produce a better result.&lt;/p&gt;
&lt;p&gt;There is a lot more that is not covered in this post, for example &lt;a href=&#34;https://www.gerkovink.com/miceVignettes/Passive_Post_processing/Passive_imputation_post_processing.html&#34;&gt;passive imputation and post-processing&lt;/a&gt;. In fact, there is a series of &lt;a href=&#34;https://github.com/amices/mice#vignettes&#34;&gt;vignettes&lt;/a&gt; written by Gerko Vink and Stef van Buuren (both authors of &lt;code&gt;mice&lt;/code&gt;) which provides a good, though quite advanced, tutorial on using &lt;code&gt;mice&lt;/code&gt;.&lt;/p&gt;
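&lt;p&gt;Just to give a flavour of passive imputation, here is a minimal sketch; d, wgt, hgt and bmi are placeholder names, and in practice the predictor matrix should also be adjusted so that bmi is not used to impute wgt and hgt.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch of passive imputation: bmi is derived from the imputed wgt and hgt
meth &amp;lt;- make.method(d)
meth[&amp;quot;bmi&amp;quot;] &amp;lt;- &amp;quot;~ I(wgt / (hgt / 100)^2)&amp;quot;
imp_passive &amp;lt;- mice(d, method = meth, printFlag = F)&lt;/code&gt;&lt;/pre&gt;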
&lt;p&gt;Suggested online books (though I have not really studied either of them yet):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://stefvanbuuren.name/fimd/&#34;&gt;Flexible imputation of missing data&lt;/a&gt; by Stef van Buuren&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bookdown.org/mwheymans/bookmi/&#34;&gt;Applied missing data analysis with SPSS and (R)Studio&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;References for this post:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;http://www.cs.uni.edu/~jacobson/4772/week11/R_in_Action.pdf&#34;&gt;R in Action, Data analysis and graphics with R&lt;/a&gt; (Chapter 15)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://data.library.virginia.edu/getting-started-with-multiple-imputation-in-r/&#34; class=&#34;uri&#34;&gt;https://data.library.virginia.edu/getting-started-with-multiple-imputation-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/&#34; class=&#34;uri&#34;&gt;https://stats.idre.ucla.edu/r/faq/how-do-i-perform-multiple-imputation-using-predictive-mean-matching-in-r/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.jstatsoft.org/article/view/v045i03&#34;&gt;mice: Multivariate Imputation by Chained Equations in R&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>COVID-19 vaccine interest in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/</link>
      <pubDate>Sun, 17 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;We are going to do a basic Google Trends search using the &lt;code&gt;gtrendsR&lt;/code&gt; package and do some plotting with &lt;code&gt;ggplot2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These are the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(gtrendsR)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the &lt;code&gt;gtrends()&lt;/code&gt; function to search for our keywords of interest (i.e., the vaccine types). So far, only &lt;a href=&#34;https://covidnow.moh.gov.my/vaccinations/&#34;&gt;4 types of vaccines&lt;/a&gt; have been used in Malaysia.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine &amp;lt;- gtrends(c(&amp;quot;pfizer&amp;quot;, &amp;quot;astrazeneca&amp;quot;, &amp;quot;sinovac&amp;quot;, &amp;quot;cansino&amp;quot;), geo = &amp;quot;MY&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, plot our keywords.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(vaccine)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt;
It’s probably better to filter the dates to when the COVID-19 pandemic started, which is around March 2020.&lt;/p&gt;
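&lt;p&gt;As a side note, the date range could also be restricted at query time through the time argument of &lt;code&gt;gtrends()&lt;/code&gt;; a minimal sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Restrict the date range in the query itself (start and end dates)
vaccine_2020 &amp;lt;- gtrends(c(&amp;quot;pfizer&amp;quot;, &amp;quot;astrazeneca&amp;quot;, &amp;quot;sinovac&amp;quot;, &amp;quot;cansino&amp;quot;), 
                        geo = &amp;quot;MY&amp;quot;, time = &amp;quot;2020-03-01 2021-10-17&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we instead filter the data we have already downloaded.&lt;/p&gt;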
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_over_time %&amp;gt;% 
  group_by(keyword) %&amp;gt;% 
  filter(hits != &amp;quot;&amp;lt;1&amp;quot; &amp;amp; date &amp;gt; as.Date(&amp;quot;2020-03-01&amp;quot;)) %&amp;gt;% 
  mutate(hits = as.numeric(hits), 
         date = as.Date(date)) %&amp;gt;% 
  ggplot() + 
  geom_line(aes(x = date, y = hits, color = keyword), size = 0.8) +
  theme_minimal() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;Date&amp;quot;) +
  scale_x_date(date_breaks = &amp;quot;4 month&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, the AstraZeneca vaccine attracts a high level of interest, probably due to the infamous blood-clotting issue. Next, we can also plot the search interest by state.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_by_region %&amp;gt;% 
  group_by(location) %&amp;gt;% 
  ggplot(aes(location, hits, fill = keyword)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia by states&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, we can plot the search interest by city.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;vaccine$interest_by_city %&amp;gt;% 
  group_by(location) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  ggplot(aes(location, hits, fill = keyword)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  scale_fill_viridis_d() +
  labs(title = &amp;quot;COVID-19 vaccine interest in Malaysia by cities&amp;quot;, y = &amp;quot;Search hits&amp;quot;, x = &amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/covid-19-vaccine-interest-in-malaysia/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;gtrendsR&lt;/code&gt;, with just a few plots, is certainly very useful if we want to gauge interest in certain issues in the community.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Wordcloud of COVID-19 research in Malaysia</title>
      <link>https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/</link>
      <pubDate>Sat, 11 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/wordcloud.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/wordcloud2.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2/hover.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/wordcloud2-binding/wordcloud2.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Let’s see how much COVID-19 research has been done in Malaysia. In this analysis, we are going to use the &lt;a href=&#34;https://www.scopus.com/search/form.uri?display=basic&amp;amp;zone=header&amp;amp;origin=#basic&#34;&gt;Scopus database&lt;/a&gt; to access the relevant papers, and we are going to use 4 specific parts of each scientific paper:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Title&lt;/li&gt;
&lt;li&gt;Abstract&lt;/li&gt;
&lt;li&gt;Author’s keywords&lt;/li&gt;
&lt;li&gt;Scopus’s keywords&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&#34;images/sample%20paper.png&#34; alt=&#34;Sample of paper&#34; /&gt;
Above is a sample paper showing the sections that we are going to use in our analysis. The Scopus keywords are generated by the Scopus database, so they are not available on the paper itself.&lt;/p&gt;
&lt;p&gt;So, the analysis will be applied separately to these 4 parts of the papers. Also, we are going to use map (equivalent to a loop) since the flow of the analysis is the same for each part.&lt;/p&gt;
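&lt;p&gt;As a quick reminder of how &lt;code&gt;map()&lt;/code&gt; works: it applies a function to every element of a list and returns a list.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# map() applies a function to each element of a list and returns a list
purrr::map(list(a = 1:3, b = 4:6), sum)  # returns a list: a = 6, b = 15&lt;/code&gt;&lt;/pre&gt;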
&lt;p&gt;Load the related packages. The main package is &lt;code&gt;quanteda&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(patchwork)
library(wordcloud2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have uploaded the data that I downloaded from the Scopus database into my GitHub.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Read data from GitHub repo
df &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/tengku-hanis/scopus-data/main/covid-malaysia.csv&amp;quot;) %&amp;gt;% 
  janitor::clean_names() %&amp;gt;% 
  rename(title =i_title)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, we need to tokenize the text. In other words, we break the sentences down into words (tokens).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Tokenize
tok_list &amp;lt;- 
  df %&amp;gt;% 
  select(title, abstract, author_keywords, index_keywords) %&amp;gt;% 
  map(tokens, 
      remove_punct = T, 
      remove_numbers = T,               
      remove_symbols = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we remove words that are not meaningful, such as ‘a’, ‘the’, etc. These words are known as stop words.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Remove stop words
nostop_toks &amp;lt;- 
  tok_list %&amp;gt;% 
  map(tokens_select, 
      c(tidytext::stop_words$word, stopwords(&amp;quot;en&amp;quot;)), 
      selection = &amp;quot;remove&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we create a document-feature matrix (DFM). Basically, a DFM is a matrix that represents the frequency of each word (feature) in each document (in our case, a paper or manuscript). Another name for a DFM is a document-term matrix (DTM); &lt;code&gt;quanteda&lt;/code&gt; uses the term DFM, while some other packages use DTM.&lt;/p&gt;
&lt;p&gt;Additionally, we apply the term frequency-inverse document frequency (TF-IDF) weighting. In scientific papers, words such as ‘determine’, ‘conclusion’, ‘introduction’, etc. are very frequent but not particularly meaningful. Instead of removing them manually one by one, we use TF-IDF, which down-weights words that appear in almost every document, so only the relevant or important words stand out. Roughly, a term’s weight is its count in a document multiplied by log(number of documents / number of documents containing the term), so a term that appears in every document gets a weight of zero. A toy illustration follows the next code chunk.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create DFM and apply tf_idf
covid_dfm_list &amp;lt;- 
  nostop_toks %&amp;gt;% 
  map(dfm) %&amp;gt;% 
  map(dfm_tfidf)&lt;/code&gt;&lt;/pre&gt;
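&lt;p&gt;To see what TF-IDF does on a small scale, here is a toy illustration with three made-up one-line documents (not from our Scopus data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration: a term that appears in every document gets a tf-idf weight of 0
toy &amp;lt;- dfm(tokens(c(d1 = &amp;quot;covid vaccine study&amp;quot;,
                    d2 = &amp;quot;covid lockdown study&amp;quot;,
                    d3 = &amp;quot;covid vaccine&amp;quot;)))
dfm_tfidf(toy)  # &amp;quot;covid&amp;quot; occurs in all 3 documents, so its weight is 0&lt;/code&gt;&lt;/pre&gt;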
Once we have our weighted DFM, we can plot the most relevant terms based on TF-IDF.
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot top features
A &amp;lt;- 
  covid_dfm_list$title %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;blueviolet&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the title&amp;quot;)

B &amp;lt;- 
  covid_dfm_list$abstract %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;darkolivegreen3&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the abstract&amp;quot;)

C &amp;lt;- 
  covid_dfm_list$author_keywords %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;deepskyblue2&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the author&amp;#39;s keywords&amp;quot;)

D &amp;lt;- 
  covid_dfm_list$index_keywords %&amp;gt;% 
  textstat_frequency(n = 15, force = T) %&amp;gt;% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(size = 4, colour = &amp;quot;aquamarine2&amp;quot;) +
  coord_flip() +
  labs(x = NULL, y = &amp;quot;Frequency (tf-idf)&amp;quot;) +
  theme_minimal() +
  labs(title = &amp;quot;Top relevant terms for covid research based on the Scopus&amp;#39;s keywords&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;These are the plots of the most relevant terms in COVID-19 research in Malaysia.
&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-2.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-3.png&#34; width=&#34;672&#34; /&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/wordcloud-of-covid-19-research-in-malaysia/index.en_files/figure-html/unnamed-chunk-7-4.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
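&lt;p&gt;Since &lt;code&gt;patchwork&lt;/code&gt; is loaded, the four plots can also be arranged into a single 2 x 2 figure if preferred; a minimal sketch (not necessarily how the figures above were produced):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Combine the four plots into a 2 x 2 grid using patchwork
(A + B) / (C + D)&lt;/code&gt;&lt;/pre&gt;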
&lt;div id=&#34;wordcloud&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wordcloud&lt;/h2&gt;
&lt;p&gt;Finally, we can make our wordcloud, but we need to convert our DFM to a frequency data frame first. Also, we are going to round the TF-IDF values and limit the wordcloud to the top 1000 terms only.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;covid_wc &amp;lt;- 
  covid_dfm_list %&amp;gt;% 
  map(textstat_frequency, force = T)&lt;/code&gt;&lt;/pre&gt;
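&lt;p&gt;Each element of &lt;code&gt;covid_wc&lt;/code&gt; is now a data frame with one row per feature, along with its frequency, rank, and document frequency; a quick look at the one for the titles:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the frequency data frame for the titles
head(covid_wc$title)&lt;/code&gt;&lt;/pre&gt;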
&lt;p&gt;Actually, &lt;code&gt;quanteda&lt;/code&gt; itself is able to produce a wordcloud. However, the wordcloud from &lt;code&gt;wordcloud2&lt;/code&gt; is interactive, and we can see the TF-IDF value of each word by clicking on it.&lt;/p&gt;
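&lt;p&gt;For comparison, the static version from &lt;code&gt;quanteda.textplots&lt;/code&gt; would look roughly like this (a minimal sketch), before we move on to &lt;code&gt;wordcloud2&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A static wordcloud straight from the weighted DFM
textplot_wordcloud(covid_dfm_list$title, max_words = 100)&lt;/code&gt;&lt;/pre&gt;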
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$title %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-9&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Top 1000 terms extracted from the title
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$abstract %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-10&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Top 1000 terms extracted from the abstract
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$author_keywords %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-11&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-3&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-3&#34;&gt;{&#34;x&#34;:{&#34;word&#34;:[&#34;covid-19&#34;,&#34;pandemic&#34;,&#34;learning&#34;,&#34;coronavirus&#34;,&#34;health&#34;,&#34;sars-cov-2&#34;,&#34;malaysia&#34;,&#34;social&#34;,&#34;online&#34;,&#34;education&#34;,&#34;disease&#34;,&#34;analysis&#34;,&#34;control&#34;,&#34;anxiety&#34;,&#34;technology&#34;,&#34;students&#34;,&#34;teaching&#34;,&#34;mental&#34;,&#34;movement&#34;,&#34;model&#34;,&#34;public&#34;,&#34;management&#34;,&#34;media&#34;,&#34;stress&#34;,&#34;healthcare&#34;,&#34;lockdown&#34;,&#34;machine&#34;,&#34;psychological&#34;,&#34;quality&#34;,&#34;risk&#34;,&#34;medical&#34;,&#34;food&#34;,&#34;policy&#34;,&#34;system&#34;,&#34;depression&#34;,&#34;vaccine&#34;,&#34;respiratory&#34;,&#34;care&#34;,&#34;university&#34;,&#34;impact&#34;,&#34;clinical&#34;,&#34;deep&#34;,&#34;knowledge&#34;,&#34;economic&#34;,&#34;virus&#34;,&#34;diseases&#34;,&#34;tourism&#34;,&#34;digital&#34;,&#34;neural&#34;,&#34;network&#34;,&#34;theory&#34;,&#34;waste&#34;,&#34;development&#34;,&#34;islamic&#34;,&#34;image&#34;,&#34;epidemic&#34;,&#34;e-learning&#34;,&#34;mortality&#34;,&#34;performance&#34;,&#34;infectious&#34;,&#34;sustainable&#34;,&#34;workers&#34;,&#34;covid&#34;,&#34;syndrome&#34;,&#34;artificial&#34;,&#34;intention&#34;,&#34;antiviral&#34;,&#34;drug&#34;,&#34;asia&#34;,&#34;transmission&#34;,&#34;practice&#34;,&#34;infection&#34;,&#34;global&#34;,&#34;index&#34;,&#34;perception&#34;,&#34;air&#34;,&#34;security&#34;,&#34;acceptance&#34;,&#34;medicine&#34;,&#34;stock&#34;,&#34;distance&#34;,&#34;distancing&#34;,&#34;information&#34;,&#34;pneumonia&#34;,&#34;communication&#34;,&#34;resilience&#34;,&#34;screening&#34;,&#34;acute&#34;,&#34;perceived&#34;,&#34;attitude&#34;,&#34;virtual&#34;,&#34;outbreak&#34;,&#34;sars&#34;,&#34;sustainability&#34;,&#34;data&#34;,&#34;crisis&#34;,&#34;pollution&#34;,&#34;community&#34;,&#34;measures&#34;,&#34;fear&#34;,&#34;review&#34;,&#34;protective&#34;,&#34;support&#34;,&#34;behavior&#34;,&#34;emergency&#34;,&#34;response&#34;,&#34;financial&#34;,&#34;student&#34;,&#34;systems&#34;,&#34;epidemiology&#34;,&#34;life&#34;,&#34;quarantine&#34;,&#34;prevention&#34;,&#34;industry&#34;,&#34;intelligence&#34;,&#34;forecasting&#34;,&#34;smart&#34;,&#34;coping&#34;,&#34;pandemics&#34;,&#34;drugs&#34;,&#34;x-ray&#34;,&#34;travel&#34;,&#34;detection&#34;,&#34;survey&#34;,&#34;cancer&#34;,&#34;monitoring&#34;,&#34;bangladesh&#34;,&#34;hydroxychloroquine&#34;,&#34;physical&#34;,&#34;pakistan&#34;,&#34;immunity&#34;,&#34;molecular&#34;,&#34;decision&#34;,&#34;equipment&#34;,&#34;services&#34;,&#34;news&#34;,&#34;distress&#34;,&#34;price&#34;,&#34;service&#34;,&#34;factors&#34;,&#34;chest&#34;,&#34;destination&#34;,&#34;human&#34;,&#34;satisfaction&#34;,&#34;research&#34;,&#34;hospital&#34;,&#34;regression&#34;,&#34;diagnosis&#34;,&#34;vaccines&#34;,&#34;modelling&#34;,&#34;simulation&#34;,&#34;behaviour&#34;,&#34;covid19&#34;,&#34;mco&#34;,&#34;networks&#34;,&#34;transfer&#34;,&#34;supply&#34;,&#34;chain&#34;,&#34;environmental&#34;,&#34;therapy&#34;,&#34;corona&#34;,&#34;rapid&#34;,&#34;government&#34;,&#34;personal&#34;,&#34;stability&#34;,&#34;southeast&#34;,&#34;home&#34;,&#34;market&#34;,&#34;internet&#34;,&#34;motivation&#34;,&#34;fake&#34;,&#34;optimization&#34;,&#34;neurosurgery&#34;,&#34;events&#34;,&#34;energy&#34;,&#34;surgery&#34;,&#34;children&#34;,&#34;change&#34;,&#34;hiv&#34;,&#34;assessment&#34;,&#34;safety&#34;,&#34;population&#34;,&#34;strategy&#34;,&#34;prote
&lt;/script&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Top 1000 terms extracted from the author’s keywords
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wordcloud2(covid_wc$index_keywords %&amp;gt;% 
             slice(1:1000) %&amp;gt;% 
             mutate(frequency = round(frequency)))&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-12&#34;&gt;&lt;/span&gt;
&lt;div id=&#34;htmlwidget-4&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;wordcloud2 html-widget&#34;&gt;&lt;/div&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Top 1000 terms extracted from the Scopus index keywords
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There are some weird symbols in the plot and the wordcloud; it would be better to remove them. However, I am too lazy to do that, so I will leave them 😃.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;These are some of the exploratory text analyses that can be done. The relevant terms may provide some insight into the current COVID-19 research in Malaysia. However, they by no means fully reflect our current COVID-19 research.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Hyperparameter tuning in tidymodels</title>
      <link>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</link>
      <pubDate>Sun, 05 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;This post will not go into much detail on each approach to hyperparameter tuning. It mainly aims to summarize a few things that I have studied over the last couple of days.
Generally, there are two approaches to hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Grid search:&lt;br /&gt;
– Regular grid search&lt;br /&gt;
– Random grid search&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Iterative search:&lt;br /&gt;
– Bayesian optimization&lt;br /&gt;
– Simulated annealing&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;grid-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Grid search&lt;/h2&gt;
&lt;p&gt;So, in grid search, we provide a set of parameter combinations and the algorithm evaluates each of them. There are two types of grid search:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regular grid search&lt;br /&gt;
– The algorithm will go through every combination of parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_regular(mtry(c(1, 13)), 
             trees(), 
             min_n(),
             levels = 3) # how many from each parameter&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 27 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     1     1     2
##  2     7     1     2
##  3    13     1     2
##  4     1  1000     2
##  5     7  1000     2
##  6    13  1000     2
##  7     1  2000     2
##  8     7  2000     2
##  9    13  2000     2
## 10     1     1    21
## # ... with 17 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Random grid search&lt;br /&gt;
– The algorithm will randomly select a number of parameter combinations instead of going through each of them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_random(mtry(c(1, 13)),
            trees(), 
            min_n(), 
            size = 100) # size of parameters combination&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 100 x 3
##     mtry trees min_n
##    &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1     5  1216    40
##  2     8  1374    13
##  3     9   859    39
##  4     6   282    12
##  5     2  1210     9
##  6     8  1828    39
##  7    11   550    14
##  8    13  1157    32
##  9     5   282     6
## 10    10  1018    28
## # ... with 90 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By default, tidymodels uses a space-filling design to make sure the parameter combinations are roughly equidistant from each other.&lt;/p&gt;
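&lt;p&gt;As a side note, the dials package (loaded with tidymodels) also exposes space-filling designs directly. Below is a minimal, untested sketch using &lt;code&gt;grid_latin_hypercube()&lt;/code&gt; with the same parameter ranges as above; the size of 27 is just an illustrative choice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dials) # loaded automatically with tidymodels

# A Latin hypercube design spreads the candidates across the parameter space
grid_latin_hypercube(mtry(c(1, 13)), 
                     trees(), 
                     min_n(), 
                     size = 27) # number of candidate combinations&lt;/code&gt;&lt;/pre&gt;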
&lt;/div&gt;
&lt;div id=&#34;iterative-search&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Iterative search&lt;/h2&gt;
&lt;p&gt;In iterative search, we need to specify some initial parameters/values to start the search.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bayesian optimization&lt;br /&gt;
– This algorithm/function will search for the next best combination of parameters based on the previous combinations of parameters (the prior).&lt;/li&gt;
&lt;li&gt;Simulated annealing&lt;br /&gt;
– Generally, this algorithm works relatively similarly to Bayesian optimization.&lt;br /&gt;
– However, as the figure below illustrates, this algorithm is able to explore worse combinations of parameters for a short while (crossing the barrier of a local search) in order to find the best combination of parameters (the global minimum).
&lt;img src=&#34;images/sim-anneal.png&#34; alt=&#34;Simulated annealing&#34; /&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further details on iterative search and both methods above can be found &lt;a href=&#34;https://www.tmwr.org/iterative-search.html#iterative-search&#34;&gt;here&lt;/a&gt;. Since both iterative methods need starting parameters, we can actually combine them with any of the grid search methods.&lt;/p&gt;
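&lt;p&gt;For illustration, a simulated annealing search with the finetune package would look roughly like the sketch below. It is not run in this post; the object names are placeholders and the iteration and control settings are arbitrary choices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(finetune)

# Rough sketch only; wflow, dat_cv, and grid_results are placeholder objects
tune_sim_anneal(
  object = wflow,         # a tuneable workflow
  resamples = dat_cv,     # cross-validation folds
  iter = 30,              # number of search iterations
  initial = grid_results, # e.g. a previous grid search result
  control = control_sim_anneal(no_improve = 15, verbose = TRUE)
  )&lt;/code&gt;&lt;/pre&gt;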
&lt;/div&gt;
&lt;div id=&#34;other-methods&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Other methods&lt;/h2&gt;
&lt;p&gt;By default, if we do not supply any combinations of parameters, tidymodels will pick 10 combinations of parameters from the default range of values for the model. Additionally, we can set this to another value as shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tune_grid(
  resamples = dat_cv, # cross validation data set
  grid = 20,  # 20 combinations of parameters
  control = control, # some control parameters
  metrics = metrics # some metrics parameters (roc_auc, etc)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are other special cases of grid search: &lt;code&gt;tune_race_anova()&lt;/code&gt; and &lt;code&gt;tune_race_win_loss()&lt;/code&gt;. Both of these methods are supposed to be more efficient versions of grid search. In general, both methods evaluate the tuning parameters on a small initial set of resamples, and the combinations of parameters with the worst performance are eliminated, which makes the grid search more efficient. The main difference between the two methods is how the worst combinations of parameters are evaluated and eliminated.&lt;/p&gt;
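&lt;p&gt;Only &lt;code&gt;tune_race_anova()&lt;/code&gt; is demonstrated in the R codes below, so here is a minimal, untested sketch of what the win/loss variant might look like. The arguments mirror &lt;code&gt;tune_race_anova()&lt;/code&gt; and the object names are placeholders.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(finetune)

# Rough sketch only; wflow, dat_cv, and rand_grid are placeholder objects
tune_race_win_loss(
  wflow,                # a tuneable workflow
  resamples = dat_cv,   # cross-validation folds
  grid = rand_grid,     # a grid of candidate parameters
  control = control_race(verbose_elim = TRUE, save_pred = TRUE), 
  metrics = metric_set(roc_auc)
  )&lt;/code&gt;&lt;/pre&gt;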
&lt;/div&gt;
&lt;div id=&#34;r-codes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;R codes&lt;/h2&gt;
&lt;p&gt;Load the packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(tidymodels)
library(finetune)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will only use a small chunk of the data for ease of computation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Data
data(income, package = &amp;quot;kernlab&amp;quot;)

# Make data smaller for computation
set.seed(2021)
income2 &amp;lt;- 
  income %&amp;gt;% 
  filter(INCOME == &amp;quot;[75.000-&amp;quot; | INCOME == &amp;quot;[50.000-75.000)&amp;quot;) %&amp;gt;% 
  slice_sample(n = 600) %&amp;gt;% 
  mutate(INCOME = fct_drop(INCOME), 
         INCOME = fct_recode(INCOME, 
                             rich = &amp;quot;[75.000-&amp;quot;,
                             less_rich = &amp;quot;[50.000-75.000)&amp;quot;), 
         INCOME = factor(INCOME, ordered = F)) %&amp;gt;% 
  mutate(across(-INCOME, fct_drop))

# Summary of data
glimpse(income2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 600
## Columns: 14
## $ INCOME         &amp;lt;fct&amp;gt; less_rich, rich, rich, rich, less_rich, rich, rich, les~
## $ SEX            &amp;lt;fct&amp;gt; F, M, F, M, F, F, F, M, F, M, M, M, F, F, F, F, M, M, M~
## $ MARITAL.STATUS &amp;lt;fct&amp;gt; Married, Married, Married, Single, Single, NA, Married,~
## $ AGE            &amp;lt;ord&amp;gt; 35-44, 25-34, 45-54, 18-24, 18-24, 14-17, 25-34, 25-34,~
## $ EDUCATION      &amp;lt;ord&amp;gt; 1 to 3 years of college, Grad Study, College graduate, ~
## $ OCCUPATION     &amp;lt;fct&amp;gt; &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;Professional/Managerial&amp;quot;, &amp;quot;~
## $ AREA           &amp;lt;ord&amp;gt; 10+ years, 7-10 years, 10+ years, -1 year, 4-6 years, 7~
## $ DUAL.INCOMES   &amp;lt;fct&amp;gt; Yes, Yes, Yes, Not Married, Not Married, Not Married, N~
## $ HOUSEHOLD.SIZE &amp;lt;ord&amp;gt; Five, Two, Four, Two, Four, Two, Three, Two, Five, One,~
## $ UNDER18        &amp;lt;ord&amp;gt; Three, None, None, None, None, None, One, None, Three, ~
## $ HOUSEHOLDER    &amp;lt;fct&amp;gt; Own, Own, Own, Rent, Family, Own, Own, Rent, Own, Own, ~
## $ HOME.TYPE      &amp;lt;fct&amp;gt; House, House, House, House, House, Apartment, House, Ho~
## $ ETHNIC.CLASS   &amp;lt;fct&amp;gt; White, White, White, White, White, White, White, White,~
## $ LANGUAGE       &amp;lt;fct&amp;gt; English, English, English, English, English, NA, Englis~&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Outcome variable
table(income2$INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## less_rich      rich 
##       362       238&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Missing data
DataExplorer::plot_missing(income)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Split the data and create a 10-fold cross-validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
dat_index &amp;lt;- initial_split(income2, strata = INCOME)
dat_train &amp;lt;- training(dat_index)
dat_test &amp;lt;- testing(dat_index)

## CV
set.seed(2021)
dat_cv &amp;lt;- vfold_cv(dat_train, v = 10, repeats = 1, strata = INCOME)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are going to impute the NAs with the mode since all the variables are categorical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
dat_rec &amp;lt;- 
  recipe(INCOME ~ ., data = dat_train) %&amp;gt;% 
  step_impute_mode(all_predictors()) %&amp;gt;% 
  step_ordinalscore(AGE, EDUCATION, AREA, HOUSEHOLD.SIZE, UNDER18)

# Model
rf_mod &amp;lt;- 
  rand_forest(mtry = tune(),
              trees = tune(),
              min_n = tune()) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;)

# Workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_recipe(dat_rec) %&amp;gt;% 
  add_model(rf_mod)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parameters for the grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Regular grid
reg_grid &amp;lt;- grid_regular(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         levels = 3)

# Random grid
rand_grid &amp;lt;- grid_random(mtry(c(1, 13)), 
                         trees(), 
                         min_n(), 
                         size = 100)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune the models using regular grid search. We are going to use the &lt;code&gt;doParallel&lt;/code&gt; library for parallel processing.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ctrl &amp;lt;- control_grid(save_pred = T,
                        extract = extract_model)
measure &amp;lt;- metric_set(roc_auc)  

# Parallel for regular grid
library(doParallel)

# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_regular &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = reg_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for regular grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_regular)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     7  1000    21 roc_auc binary     0.690    10  0.0148 Preprocessor1_Model14
## 2     7  1000    40 roc_auc binary     0.689    10  0.0179 Preprocessor1_Model23
## 3     7  2000    40 roc_auc binary     0.689    10  0.0178 Preprocessor1_Model26
## 4     7  1000     2 roc_auc binary     0.688    10  0.0173 Preprocessor1_Model05
## 5     7  2000    21 roc_auc binary     0.688    10  0.0159 Preprocessor1_Model17&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tune the models using random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for random grid
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_random &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = ctrl, 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for random grid search:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_random)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_Model0~
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_Model0~
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_Model0~
## 4     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_Model0~
## 5     6   160    31 roc_auc binary     0.692    10  0.0172 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Random grid search has a slightly better result. Let’s use this random search result as a base for the iterative search. First, we limit the parameter ranges based on the plot from the random grid search.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rf_param &amp;lt;- 
  rf_wf %&amp;gt;% 
  parameters() %&amp;gt;% 
  update(mtry = mtry(c(5, 13)), 
         trees = trees(c(1, 500)), 
         min_n = min_n(c(5, 30)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we do Bayesian optimization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for bayesian optimization
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
bayes_tune &amp;lt;-  
  rf_wf %&amp;gt;% 
  tune_bayes(    
    resamples = dat_cv,
    param_info = rf_param,
    iter = 60,
    initial = tune_random, # result from random grid search        
    control = control_bayes(no_improve = 30, verbose = T, save_pred = T), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Result for Bayesian optimization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(bayes_tune, &amp;quot;performance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(bayes_tune)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 10
##    mtry trees min_n .metric .estimator  mean     n std_err .config         .iter
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;           &amp;lt;int&amp;gt;
## 1     4  1016     4 roc_auc binary     0.694    10  0.0164 Preprocessor1_~     0
## 2     5  1360     3 roc_auc binary     0.693    10  0.0168 Preprocessor1_~     0
## 3     6   129    14 roc_auc binary     0.693    10  0.0164 Preprocessor1_~     0
## 4     6   189    15 roc_auc binary     0.693    10  0.0153 Iter1               1
## 5     5  1235     3 roc_auc binary     0.692    10  0.0168 Preprocessor1_~     0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get a slightly better result from Bayesian optimization. I will not do the simulated annealing approach since I got an error, though I am not sure why.&lt;/p&gt;
&lt;p&gt;Lastly, we do a race ANOVA.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Parallel for race anova
# Create a cluster object and then register: 
cl &amp;lt;- makePSOCKcluster(4)
registerDoParallel(cl)

# Run tune
set.seed(2021)
tune_efficient &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_race_anova(
    resamples = dat_cv, 
    grid = rand_grid,         
    control = control_race(verbose_elim = T, save_pred = T), 
    metrics = measure)

stopCluster(cl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get a relatively similar result to random grid search but with faster computation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;show_best(tune_efficient)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1     5  1425     5 roc_auc binary     0.695    10  0.0161 Preprocessor1_Model0~
## 2    11   406     2 roc_auc binary     0.694    10  0.0183 Preprocessor1_Model0~
## 3     6   631     3 roc_auc binary     0.692    10  0.0171 Preprocessor1_Model0~
## 4     7  1264     4 roc_auc binary     0.692    10  0.0159 Preprocessor1_Model0~
## 5     9  1264     3 roc_auc binary     0.692    10  0.0188 Preprocessor1_Model0~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also compare the ROC curves of all approaches; they all look more or less similar.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# regular grid
rf_reg &amp;lt;- 
  tune_regular %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

reg_auc &amp;lt;- 
  tune_regular %&amp;gt;% 
  collect_predictions(parameters = rf_reg) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;regular_grid&amp;quot;)

# random grid
rf_rand &amp;lt;- 
  tune_random %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

rand_auc &amp;lt;- 
  tune_random %&amp;gt;% 
  collect_predictions(parameters = rf_rand) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;random_grid&amp;quot;)

# bayes
rf_bayes &amp;lt;- 
  bayes_tune %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

bayes_auc &amp;lt;- 
  bayes_tune %&amp;gt;% 
  collect_predictions(parameters = rf_bayes) %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;bayes&amp;quot;)

# race_anova
rf_eff &amp;lt;- 
  tune_efficient %&amp;gt;% 
  select_best(metric = &amp;quot;roc_auc&amp;quot;)

eff_auc &amp;lt;- 
  tune_efficient %&amp;gt;% 
  collect_predictions(parameters = rf_eff) %&amp;gt;%
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  mutate(model = &amp;quot;race_anova&amp;quot;)

# Compare ROC between all tuning approach
bind_rows(reg_auc, rand_auc, bayes_auc, eff_auc) %&amp;gt;% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = &amp;quot;plasma&amp;quot;, end = .6) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we fit our best model (from the Bayesian optimization) to the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Finalize workflow
best_rf &amp;lt;-
  select_best(bayes_tune, &amp;quot;roc_auc&amp;quot;)

final_wf &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_rf)
final_wf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_impute_mode()
## * step_ordinalscore()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 4
##   trees = 1016
##   min_n = 4
## 
## Computational engine: ranger&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Last fit
test_fit &amp;lt;- 
  final_wf %&amp;gt;%
  last_fit(dat_index) 

# Evaluation metrics 
test_fit %&amp;gt;%
  collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary         0.583 Preprocessor1_Model1
## 2 roc_auc  binary         0.611 Preprocessor1_Model1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test_fit %&amp;gt;%
  collect_predictions() %&amp;gt;% 
  roc_curve(INCOME, .pred_less_rich) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/hyperparameter-tuning-in-tidymodels/index.en_files/figure-html/unnamed-chunk-22-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The result is not that good; our AUC is quite low. However, we used only about 8% of the overall data. Nonetheless, the aim of this post is to give an overview of hyperparameter tuning in tidymodels.&lt;/p&gt;
&lt;p&gt;Additionally, there are another two functions for constructing parameter grids that I did not cover in this post: &lt;code&gt;grid_max_entropy()&lt;/code&gt; and &lt;code&gt;grid_latin_hypercube()&lt;/code&gt;. There are not many resources explaining these functions (or at least I did not find them); for those interested, a good start is the tidymodels &lt;a href=&#34;https://dials.tidymodels.org/reference/grid_max_entropy.html&#34;&gt;website&lt;/a&gt;. A brief sketch of both is shown below.&lt;/p&gt;
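&lt;p&gt;As a minimal sketch (assuming the &lt;code&gt;rf_param&lt;/code&gt; parameter set defined earlier), both functions take a parameter set and a grid size, and the resulting grid can be passed to &lt;code&gt;tune_grid()&lt;/code&gt; via its &lt;code&gt;grid&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not run: space-filling grids from the dials package
set.seed(2021)
me_grid &amp;lt;- grid_max_entropy(rf_param, size = 20)
lh_grid &amp;lt;- grid_latin_hypercube(rf_param, size = 20)&lt;/code&gt;&lt;/pre&gt;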
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/grid-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/grid-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://www.tmwr.org/iterative-search.html&#34; class=&#34;uri&#34;&gt;https://www.tmwr.org/iterative-search.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://oliviergimenez.github.io/learning-machine-learning/#&#34; class=&#34;uri&#34;&gt;https://oliviergimenez.github.io/learning-machine-learning/#&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Data exploration in R</title>
      <link>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</link>
      <pubDate>Sun, 22 Aug 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;These are some of the packages that I find useful for data exploration. Basically, this post serves more as a note to myself for future reference. I will list packages (and some awesome functions from each package) rather than specific functions. Base R and the tidyverse packages are not specifically included in this list.&lt;/p&gt;
&lt;p&gt;Load supporting packages&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data we are going to use comes from the dlookr package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dlookr) # the heartfailure data comes from dlookr
glimpse(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 299
## Columns: 13
## $ age               &amp;lt;int&amp;gt; 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 62, 45, ~
## $ anaemia           &amp;lt;fct&amp;gt; No, No, No, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, N~
## $ cpk_enzyme        &amp;lt;dbl&amp;gt; 582, 7861, 146, 111, 160, 47, 246, 315, 157, 123, 81~
## $ diabetes          &amp;lt;fct&amp;gt; No, No, No, No, Yes, No, No, Yes, No, No, No, No, No~
## $ ejection_fraction &amp;lt;dbl&amp;gt; 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 25, 30, ~
## $ hblood_pressure   &amp;lt;fct&amp;gt; Yes, No, No, No, No, Yes, No, No, No, Yes, Yes, Yes,~
## $ platelets         &amp;lt;dbl&amp;gt; 265000, 263358, 162000, 210000, 327000, 204000, 1270~
## $ creatinine        &amp;lt;dbl&amp;gt; 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.10, 1.50~
## $ sodium            &amp;lt;dbl&amp;gt; 130, 136, 129, 137, 116, 132, 137, 131, 138, 133, 13~
## $ sex               &amp;lt;fct&amp;gt; Male, Male, Male, Male, Female, Male, Male, Male, Fe~
## $ smoking           &amp;lt;fct&amp;gt; No, No, Yes, No, No, Yes, No, Yes, No, Yes, Yes, Yes~
## $ time              &amp;lt;int&amp;gt; 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11, 11, 12~
## $ death_event       &amp;lt;fct&amp;gt; Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will create a few NAs in our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
heartfailure[sample(seq(nrow(heartfailure)), 20), &amp;quot;age&amp;quot;] &amp;lt;- NA
heartfailure[sample(seq(nrow(heartfailure)), 10), &amp;quot;sex&amp;quot;] &amp;lt;- NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;1) dataMaid&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dataMaid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One of the most useful functions in dataMaid is &lt;code&gt;makeDataReport()&lt;/code&gt;, which generates a report on the data. By default it produces a PDF, but other output formats such as Word and HTML are also available.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;makeDataReport(heartfailure, replace = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here is an example of the output in &lt;a href=&#34;https://tengkuhanis.netlify.app/files/dataMaid_heartfailure.pdf&#34;&gt;pdf&lt;/a&gt;.&lt;/p&gt;
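&lt;p&gt;If a PDF is not what we want, we can ask for another format. Below is a small sketch assuming the &lt;code&gt;output&lt;/code&gt; argument of &lt;code&gt;makeDataReport()&lt;/code&gt; accepts &lt;code&gt;html&lt;/code&gt; (see the dataMaid documentation for the exact options).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not run: HTML report instead of the default PDF
makeDataReport(heartfailure, output = &amp;quot;html&amp;quot;, replace = TRUE)&lt;/code&gt;&lt;/pre&gt;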
&lt;p&gt;&lt;strong&gt;2) DataExplorer&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(DataExplorer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;General visualization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_intro()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since we have missing data, we can further visualize it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% profile_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              feature num_missing pct_missing
## 1                age          20  0.06688963
## 2            anaemia           0  0.00000000
## 3         cpk_enzyme           0  0.00000000
## 4           diabetes           0  0.00000000
## 5  ejection_fraction           0  0.00000000
## 6    hblood_pressure           0  0.00000000
## 7          platelets           0  0.00000000
## 8         creatinine           0  0.00000000
## 9             sodium           0  0.00000000
## 10               sex          10  0.03344482
## 11           smoking           0  0.00000000
## 12              time           0  0.00000000
## 13       death_event           0  0.00000000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also create a correlation plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  plot_correlation()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, I think the correlation plot from the corrplot package is cleaner. Here is a plot from corrplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(corrplot)

heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  cor() %&amp;gt;% 
  corrplot(type = &amp;quot;upper&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we can get an overall HTML report from the DataExplorer package using the function &lt;code&gt;create_report()&lt;/code&gt;, as sketched below.&lt;/p&gt;
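&lt;p&gt;A minimal sketch of that call is shown below; the &lt;code&gt;y&lt;/code&gt; argument (the response variable) is optional.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not run: overall HTML report from DataExplorer
heartfailure %&amp;gt;% create_report(y = &amp;quot;death_event&amp;quot;)&lt;/code&gt;&lt;/pre&gt;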
&lt;p&gt;&lt;strong&gt;3) dlookr&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dlookr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can assess the normality of the data using this package. The code below plots normality checks for all numeric variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, for the sake of simplicity in this post, we will run it for only one variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality(age)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can also get a correlation matrix plot from this package, and there is no need to remove the NAs or filter the numeric variables before running the function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_correlate()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, dlookr can produce an overall data exploration report in PDF (and other formats as well). The report is quite comprehensive; have a &lt;a href=&#34;https://tengkuhanis.netlify.app/files/EDA_Paged_Report.pdf&#34;&gt;look&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  eda_paged_report(target = &amp;quot;death_event&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4) skimr&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The skimr package, especially its &lt;code&gt;skim()&lt;/code&gt; function, does not display correctly with blogdown. Hence, I have included a screenshot of the output that we would typically see in the R console.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(skimr)
skim(heartfailure) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;images/black.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from skimr we get an overview that includes histograms for the numeric variables as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5) outliertree&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This package identifies outliers using a decision tree approach. I will not go into detail about the approach here, but those who want to read further can see the &lt;a href=&#34;https://arxiv.org/abs/2001.00636&#34;&gt;paper&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(outliertree)
outlier.tree(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Reporting top 2 outliers [out of 2 found]
## 
## row [251] - suspicious column: [creatinine] - suspicious value: [0.50]
##  distribution: 96.000% &amp;gt;= 0.70 - [mean: 1.35] - [sd: 1.22] - [norm. obs: 24]
##  given:
##      [cpk_enzyme] &amp;gt; [1610.00] (value: 2522.00)
## 
## 
## row [32] - suspicious column: [cpk_enzyme] - suspicious value: [23.00]
##  distribution: 98.958% &amp;gt;= 47.00 - [mean: 677.01] - [sd: 1321.86] - [norm. obs: 95]
##  given:
##      [death_event] = [Yes]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Outlier Tree model
##  Numeric variables: 7
##  Categorical variables: 6
## 
## Consists of 369 clusters, spread across 48 tree branches&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further explore the detected outliers using a histogram and a boxplot. Let’s do this for the variable creatinine.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# histogram
hist(heartfailure$creatinine, breaks = 50, col = &amp;quot;navy&amp;quot;,
     xlab = &amp;quot;Creatinine&amp;quot;, 
     main = &amp;quot;Creatinine level&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# boxplot
boxplot(heartfailure$creatinine)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I will probably delve into more detail about outlier detection and related R packages in the future. If I ever write a post about it, I will link it here.&lt;/p&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;These are some useful packages that I have found. I may edit this post in the future to add more data exploration packages. There are also Shiny apps for data exploration, though I think it is better to stick with a coded approach in data analysis and exploration, so I did not cover those apps in this post. Another thing to remember is to set the variable types accordingly prior to the data exploration, as sketched below.&lt;/p&gt;
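&lt;p&gt;For example, a quick way to convert columns to the intended type before exploring (here, assuming character columns that should be factors):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Convert all character columns to factors before exploring
heartfailure %&amp;gt;% mutate(across(where(is.character), as.factor))&lt;/code&gt;&lt;/pre&gt;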
&lt;p&gt;Hope this is useful!&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://github.com/ekstroem/dataMaid&#34; class=&#34;uri&#34;&gt;https://github.com/ekstroem/dataMaid&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&#34; class=&#34;uri&#34;&gt;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A summary of forcats package</title>
      <link>https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/</link>
      <pubDate>Tue, 18 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;img src=&#34;forcats_logo.png&#34; width=&#34;30%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I just watched a &lt;a href=&#34;https://youtu.be/qWYgNjnHNWI&#34;&gt;YouTube video by Andrew Couch&lt;/a&gt; about his commonly used functions in the readr, stringr, and forcats packages. Although I have used the forcats package before, I realised that I have not fully utilised all of its functions.&lt;/p&gt;
&lt;p&gt;So, in this post, I have summarised the main forcats functions that I find useful in my day-to-day R coding. Basically, it is more like a note to myself.&lt;/p&gt;
&lt;div id=&#34;main-functions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Main functions&lt;/h2&gt;
&lt;p&gt;We will use the &lt;a href=&#34;https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars&#34;&gt;mtcars data&lt;/a&gt; to demonstrate each function. forcats is part of the tidyverse packages, so it loads once we load the tidyverse.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
glimpse(mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 32
## Columns: 11
## $ mpg  &amp;lt;dbl&amp;gt; 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  &amp;lt;dbl&amp;gt; 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp &amp;lt;dbl&amp;gt; 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   &amp;lt;dbl&amp;gt; 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat &amp;lt;dbl&amp;gt; 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   &amp;lt;dbl&amp;gt; 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec &amp;lt;dbl&amp;gt; 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   &amp;lt;dbl&amp;gt; 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear &amp;lt;dbl&amp;gt; 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb &amp;lt;dbl&amp;gt; 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are 9 forcats functions that I find very useful.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;factor()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;factor()&lt;/code&gt; changes a variable’s type into a factor (categorical) type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars$carb &amp;lt;- factor(mtcars$carb)
glimpse(mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 32
## Columns: 11
## $ mpg  &amp;lt;dbl&amp;gt; 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  &amp;lt;dbl&amp;gt; 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp &amp;lt;dbl&amp;gt; 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   &amp;lt;dbl&amp;gt; 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat &amp;lt;dbl&amp;gt; 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   &amp;lt;dbl&amp;gt; 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec &amp;lt;dbl&amp;gt; 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   &amp;lt;dbl&amp;gt; 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   &amp;lt;dbl&amp;gt; 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear &amp;lt;dbl&amp;gt; 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb &amp;lt;fct&amp;gt; 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_inorder()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function sorts factor levels based on the order of appearance in the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_inorder(mtcars$carb) # levels based on the order of appearance&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 1 2 3 6 8&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_infreq()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function sorts factor levels based on the frequency of values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_count(mtcars$carb) # this is forcats function as well, count factor level&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_infreq(mtcars$carb) # levels based on the frequency values&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 2 4 1 3 6 8&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_relevel()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function can be used to change the order manually.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_relevel(mtcars$carb, c(&amp;quot;8&amp;quot;, &amp;quot;6&amp;quot;, &amp;quot;4&amp;quot;, &amp;quot;3&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;1&amp;quot;)) # manually changed new levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 8 6 4 3 2 1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_relevel()&lt;/code&gt; can also be used to move a single factor level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_relevel(mtcars$carb, &amp;quot;8&amp;quot;, after = 2) # change level 8 to the third place&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 1 2 8 3 4 6&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_reorder()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function changes the order based on another variable. Let’s reorder the levels of carb based on the values of the variable disp.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;levels(mtcars$carb) # original levels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; &amp;quot;6&amp;quot; &amp;quot;8&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_reorder(mtcars$carb, mtcars$disp, .fun = sum, .desc = TRUE) # new level based on disp value&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 2 1 3 8 6&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtcars %&amp;gt;% 
  group_by(carb) %&amp;gt;% 
  summarise(sum_disp = sum(disp)) %&amp;gt;% 
  arrange(desc(sum_disp)) # this is basically what we do with fct_reorder() above&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   carb  sum_disp
##   &amp;lt;fct&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 4        3088.
## 2 2        2082.
## 3 1         940.
## 4 3         827.
## 5 8         301 
## 6 6         145&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additionally, &lt;code&gt;fct_reorder()&lt;/code&gt; can be used with plotting as well.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Original plot
ggplot(mtcars, aes(x = carb, y = disp)) +
  geom_col()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot with changed levels
mtcars %&amp;gt;% 
  mutate(carb = fct_reorder(carb, disp, .fun = sum, .desc = TRUE)) %&amp;gt;% 
  ggplot(aes(x = carb, y = disp)) +
  geom_col()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/a-summary-of-forcats-package/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;ol start=&#34;6&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_lump()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function lumps infrequent factor levels into an &lt;code&gt;Other&lt;/code&gt; level. There are 5 variants of this function:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;fct_lump()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_min()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_n()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fct_lump_lowfreq()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The remaining variant is &lt;code&gt;fct_lump_prop()&lt;/code&gt;. It is not part of the main examples below as I do not find it that useful in my current R coding routine, but a quick sketch of it is shown next.&lt;/p&gt;
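&lt;p&gt;For reference, here is a quick sketch of what it does: levels whose proportion of observations falls below the given threshold are lumped together.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Levels making up less than 10% of observations are lumped into one group
table(fct_lump_prop(mtcars$carb, prop = 0.1))&lt;/code&gt;&lt;/pre&gt;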
&lt;p&gt;&lt;code&gt;fct_lump()&lt;/code&gt; automatically lumps the low-frequency factor levels into one group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_count(mtcars$carb) # this is forcats function as well, count factor level&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fct_lump(mtcars$carb) %&amp;gt;% fct_count() &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 2
##   f         n
##   &amp;lt;fct&amp;gt; &amp;lt;int&amp;gt;
## 1 1         7
## 2 2        10
## 3 4        10
## 4 Other     5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_min()&lt;/code&gt; lumps factor levels that appear fewer times than the given minimum into one group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_min(mtcars$carb, min = 2)) # group 6 and 8 lump into one group&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     1     2     3     4 Other 
##     7    10     3    10     2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_n()&lt;/code&gt; lumps all levels except the &lt;em&gt;n&lt;/em&gt; most frequent ones.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_n(mtcars$carb, n = 2)) # 2 frequent group only, others in one group&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     2     4 Other 
##    10    10    12&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;fct_lump_lowfreq()&lt;/code&gt; lumps the least frequent levels into one group, while ensuring that this lumped group is still the smallest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_lump_lowfreq(mtcars$carb, other_level = &amp;quot;low&amp;quot;)) # group low is still the smallest&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   1   2   4 low 
##   7  10  10   5&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;7&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_other()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;fct_other()&lt;/code&gt; is much like &lt;code&gt;fct_lump()&lt;/code&gt;, except that we manually choose which factor levels to keep; the rest are combined.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_other(mtcars$carb, keep = c(&amp;quot;8&amp;quot;, &amp;quot;6&amp;quot;))) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     6     8 Other 
##     1     1    30&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;8&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_recode()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This function is used to rename or relabel factor levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_recode(mtcars$carb, hanis = &amp;quot;8&amp;quot;)) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##     1     2     3     4     6 hanis 
##     7    10     3    10     1     1&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;9&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;fct_relabel()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;fct_relabel()&lt;/code&gt; applies a function to the labels, which is extremely useful if we want to rename quite a number of factor levels at once.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(mtcars$carb) # original groups&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##  1  2  3  4  6  8 
##  7 10  3 10  1  1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(fct_relabel(mtcars$carb, ~ c(&amp;quot;abu&amp;quot;, &amp;quot;ali&amp;quot;, &amp;quot;chong&amp;quot;, &amp;quot;siti&amp;quot;, &amp;quot;krish&amp;quot;, &amp;quot;lee&amp;quot;))) # new named groups&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   abu   ali chong  siti krish   lee 
##     7    10     3    10     1     1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference:&lt;br /&gt;
&lt;a href=&#34;https://forcats.tidyverse.org/index.html&#34; class=&#34;uri&#34;&gt;https://forcats.tidyverse.org/index.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Handling imbalanced data</title>
      <link>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</link>
      <pubDate>Fri, 14 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/handling-imbalanced-data/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;overview&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;Imbalanced data happens when there is an unequal distribution of classes within a categorical outcome variable. Imbalanced data occurs for several reasons, such as a biased sampling method or measurement errors. However, the imbalance may also be an inherent characteristic of the data; for example, in a predictive model for a rare disease the imbalance is expected.&lt;/p&gt;
&lt;p&gt;Generally, there are two types of imbalance problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slight imbalance: the imbalance is small, like 4:6&lt;/li&gt;
&lt;li&gt;Severe imbalance: the imbalance is large, like 1:100 or more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A slight imbalance is usually not a concern, while a severe imbalance requires a more specialised method to build a predictive model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-problem&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;What’s the problem with imbalanced data?&lt;br /&gt;
Firstly, a predictive model trained on imbalanced data is biased towards the majority class. The minority class becomes harder to predict as there are few data points from this class, so the detection rate for the minority class will be very low.
Secondly, accuracy is not a good measure in this case. We may get a good accuracy, but in reality the accuracy does not reflect the unequal distribution of the data. This is known as the &lt;a href=&#34;https://en.wikipedia.org/wiki/Accuracy_paradox&#34;&gt;accuracy paradox&lt;/a&gt;. Imagine 90% of the data belong to the majority class, while the remaining 10% belong to the minority class. Just by predicting every observation as the majority class, the model easily gets 90% accuracy, as shown in the sketch below.&lt;/p&gt;
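&lt;p&gt;To make the accuracy paradox concrete, here is a tiny sketch in base R (the 90:10 split is made up for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration of the accuracy paradox
truth &amp;lt;- factor(c(rep(&amp;quot;majority&amp;quot;, 90), rep(&amp;quot;minority&amp;quot;, 10)))
pred  &amp;lt;- factor(rep(&amp;quot;majority&amp;quot;, 100), levels = levels(truth))

mean(pred == truth) # 0.9 accuracy, yet the minority class is never detected
table(truth, pred)&lt;/code&gt;&lt;/pre&gt;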
&lt;/div&gt;
&lt;div id=&#34;handling-approach&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Handling approach&lt;/h2&gt;
&lt;p&gt;The easiest approach is to collect more data, though this may not be practical in every situation. Fortunately, there are a few machine learning techniques available to tackle this problem.&lt;/p&gt;
&lt;p&gt;Here is a summary of resampling techniques available in &lt;code&gt;themis&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;method-themis.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The over-sampling approach is preferred when the dataset is small. The under-sampling approach can be used when the dataset is large, though it may lead to a loss of information. Additionally, ensemble techniques such as random forest are said to be able to model imbalanced data, though some references/blogs say otherwise.&lt;/p&gt;
&lt;p&gt;So, we are going to compare four over-sampling techniques (upsample, SMOTE, ADASYN, and ROSE) and three under-sampling techniques (downsample, nearmiss, and tomek). The base model is a decision tree, which will be used with all the techniques. For the sake of simplicity, the decision trees are not extensively tuned. Additionally, a random forest is also included in the comparison.&lt;/p&gt;
&lt;p&gt;The dataset is from &lt;a href=&#34;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&#34;&gt;here&lt;/a&gt;. This is a summary of the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we can see from the summary, the variable admit is moderately imbalanced, with roughly a 1:2 ratio (127 vs 273).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(df, aes(admit)) + 
  geom_bar() +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/barplot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Below is the code for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Packages
library(tidyverse)
library(magrittr)
library(tidymodels)
library(themis)

# Data
df &amp;lt;- read.csv(&amp;quot;https://raw.githubusercontent.com/finnstats/finnstats/main/binary.csv&amp;quot;)

# Split data
set.seed(1234)
df_split &amp;lt;- initial_split(df)
df_train &amp;lt;- training(df_split)
df_test &amp;lt;- testing(df_split)

# 1) Decision tree ----

# Recipe
dt_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank)

df_train_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)
  
df_test_rec &amp;lt;- 
  dt_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
dt_mod &amp;lt;- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %&amp;gt;% 
  set_engine(&amp;quot;rpart&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
dt_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune &amp;lt;- 
  dt_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- dt_tune %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final &amp;lt;- 
  dt_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
dt_train &amp;lt;- 
  dt_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred = .pred_class)

# 2) Oversampling ----
## step_upsample() ----

# Recipe
up_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_upsample(admit,
                seed = 1234)

df_train_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_up &amp;lt;- 
  up_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_up &amp;lt;- vfold_cv(df_train_up)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_up &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  tune_grid(resamples = df_cv_up,
            metrics = metric_set(accuracy))

## Select best model
best_tune_up &amp;lt;- dt_tune_up %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_up &amp;lt;- 
  dt_wf_up %&amp;gt;% 
  finalize_workflow(best_tune_up)

# Fit on train data
dt_train_up &amp;lt;- 
  dt_wf_final_up %&amp;gt;% 
  fit(data = df_train_up)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_up, new_data = df_test_rec_up)) %&amp;gt;% 
  rename(pred_up = .pred_class)

## step_smote() ----

# Recipe
smote_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_smote(admit, 
             seed = 1234)

df_train_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_smote &amp;lt;- 
  smote_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_smote &amp;lt;- vfold_cv(df_train_smote)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_smote &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  tune_grid(resamples = df_cv_smote,
            metrics = metric_set(accuracy))

## Select best model
best_tune_smote &amp;lt;- dt_tune_smote %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_smote &amp;lt;- 
  dt_wf_smote %&amp;gt;% 
  finalize_workflow(best_tune_smote)

# Fit on train data
dt_train_smote &amp;lt;- 
  dt_wf_final_smote %&amp;gt;% 
  fit(data = df_train_smote)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_smote, new_data = df_test_rec_smote)) %&amp;gt;% 
  rename(pred_smote = .pred_class)

## step_rose() ----

# Recipe
rose_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_rose(admit, 
             seed = 1234)

df_train_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_rose &amp;lt;- 
  rose_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_rose &amp;lt;- vfold_cv(df_train_rose)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_rose &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  tune_grid(resamples = df_cv_rose,
            metrics = metric_set(accuracy))

## Select best model
best_tune_rose &amp;lt;- dt_tune_rose %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_rose &amp;lt;- 
  dt_wf_rose %&amp;gt;% 
  finalize_workflow(best_tune_rose)

# Fit on train data
dt_train_rose &amp;lt;- 
  dt_wf_final_rose %&amp;gt;% 
  fit(data = df_train_rose)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_rose, new_data = df_test_rec_rose)) %&amp;gt;% 
  rename(pred_rose = .pred_class)

## step_adasyn() ----

# Recipe
adasyn_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_adasyn(admit, 
            seed = 1234)

df_train_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_adasyn &amp;lt;- 
  adasyn_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_adasyn &amp;lt;- vfold_cv(df_train_adasyn)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_adasyn &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  tune_grid(resamples = df_cv_adasyn,
            metrics = metric_set(accuracy))

## Select best model
best_tune_adasyn &amp;lt;- dt_tune_adasyn %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_adasyn &amp;lt;- 
  dt_wf_adasyn %&amp;gt;% 
  finalize_workflow(best_tune_adasyn)

# Fit on train data
dt_train_adasyn &amp;lt;- 
  dt_wf_final_adasyn %&amp;gt;% 
  fit(data = df_train_adasyn)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_adasyn, new_data = df_test_rec_adasyn)) %&amp;gt;% 
  rename(pred_adasyn = .pred_class)

# 3) Undersampling ----
## step_downsample() ----

# Recipe
down_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_downsample(admit,
                seed = 1234)

df_train_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_down &amp;lt;- 
  down_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_down &amp;lt;- vfold_cv(df_train_down)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_down &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  tune_grid(resamples = df_cv_down,
            metrics = metric_set(accuracy))

## Select best model
best_tune_down &amp;lt;- dt_tune_down %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_down &amp;lt;- 
  dt_wf_down %&amp;gt;% 
  finalize_workflow(best_tune_down)

# Fit on train data
dt_train_down &amp;lt;- 
  dt_wf_final_down %&amp;gt;% 
  fit(data = df_train_down)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_down, new_data = df_test_rec_down)) %&amp;gt;% 
  rename(pred_down = .pred_class)

## step_nearmiss() ----

# Recipe
nearmiss_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_nearmiss(admit,
                  seed = 1234)

df_train_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_nearmiss &amp;lt;- 
  nearmiss_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_nearmiss &amp;lt;- vfold_cv(df_train_nearmiss)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_nearmiss &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  tune_grid(resamples = df_cv_nearmiss,
            metrics = metric_set(accuracy))

## Select best model
best_tune_nearmiss &amp;lt;- dt_tune_nearmiss %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_nearmiss &amp;lt;- 
  dt_wf_nearmiss %&amp;gt;% 
  finalize_workflow(best_tune_nearmiss)

# Fit on train data
dt_train_nearmiss &amp;lt;- 
  dt_wf_final_nearmiss %&amp;gt;% 
  fit(data = df_train_nearmiss)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_nearmiss, new_data = df_test_rec_nearmiss)) %&amp;gt;% 
  rename(pred_nearmiss = .pred_class)

## step_tomek() ----

# Recipe
tomek_rec &amp;lt;- 
  recipe(admit ~., data = df_train) %&amp;gt;% 
  step_mutate_at(c(&amp;quot;admit&amp;quot;, &amp;quot;rank&amp;quot;), fn = as_factor) %&amp;gt;% 
  step_dummy(rank) %&amp;gt;% 
  step_tomek(admit,
                  seed = 1234)

df_train_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL)

df_test_rec_tomek &amp;lt;- 
  tomek_rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = df_test)

## 10-folds CV
set.seed(1234)
df_cv_tomek &amp;lt;- vfold_cv(df_train_tomek)

# Tune and finalize workflow
## Specify model
# same as before

## Specify workflow
dt_wf_tomek &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(dt_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
dt_tune_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  tune_grid(resamples = df_cv_tomek,
            metrics = metric_set(accuracy))

## Select best model
best_tune_tomek &amp;lt;- dt_tune_tomek %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
dt_wf_final_tomek &amp;lt;- 
  dt_wf_tomek %&amp;gt;% 
  finalize_workflow(best_tune_tomek)

# Fit on train data
dt_train_tomek &amp;lt;- 
  dt_wf_final_tomek %&amp;gt;% 
  fit(data = df_train_tomek)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(dt_train_tomek, new_data = df_test_rec_tomek)) %&amp;gt;% 
  rename(pred_tomek = .pred_class)

# 4) Ensemble approach: random forest ----

## 10-folds CV
set.seed(1234)
df_cv &amp;lt;- vfold_cv(df_train_rec)

# Tune and finalize workflow
## Specify model
rf_mod &amp;lt;- rand_forest(
 mtry = tune(),
 trees = tune(),
 min_n = tune()
 ) %&amp;gt;% 
  set_engine(&amp;quot;ranger&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)

## Specify workflow
rf_wf &amp;lt;- 
  workflow() %&amp;gt;% 
  add_model(rf_mod) %&amp;gt;% 
  add_formula(admit ~.)

## Tune model
set.seed(1234)
rf_tune &amp;lt;- 
  rf_wf %&amp;gt;% 
  tune_grid(resamples = df_cv,
            metrics = metric_set(accuracy))

## Select best model
best_tune &amp;lt;- rf_tune %&amp;gt;% select_best(&amp;quot;accuracy&amp;quot;)

## Finalize workflow
rf_wf_final &amp;lt;- 
  rf_wf %&amp;gt;% 
  finalize_workflow(best_tune)

# Fit on train data
rf_train &amp;lt;- 
  rf_wf_final %&amp;gt;% 
  fit(data = df_train_rec)

# Fit on test data and get accuracy
df_test  %&amp;lt;&amp;gt;%  
  bind_cols(predict(rf_train, new_data = df_test_rec)) %&amp;gt;% 
  rename(pred_rf = .pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;Now, let’s get the accuracy, sensitivity, specificity, and &lt;a href=&#34;https://en.wikipedia.org/wiki/Matthews_correlation_coefficient#Advantages_of_MCC_over_accuracy_and_F1_score&#34;&gt;Matthews correlation coefficient (MCC)&lt;/a&gt; for each model.&lt;/p&gt;
&lt;details&gt;
&lt;summary&gt;
Show code
&lt;/summary&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all measurements
df_test$admit %&amp;lt;&amp;gt;% as_factor()
pred_col &amp;lt;- colnames(df_test)[5:13]
result &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
sensi &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
specif &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)
mathew &amp;lt;- vector(&amp;quot;list&amp;quot;, 0)

for (i in seq_along(pred_col)) {
  # accuracy
  result[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    accuracy(admit, df_test[,pred_col[i]])
  
  # sensitivity
  sensi[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    sensitivity(admit, df_test[,pred_col[i]])
  
  # specificity
  specif[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    specificity(admit, df_test[,pred_col[i]])
  
  # MCC
  mathew[[i]] &amp;lt;-
    df_test %&amp;gt;% 
    mcc(admit, df_test[,pred_col[i]])
}

## Turn into dataframe
result  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;)) %&amp;gt;% 
  rename(model = name, 
         accuracy = .estimate) %&amp;gt;% 
  select(model, accuracy) %&amp;gt;% 
  mutate(model = factor(model,labels = 
                          c(
                            &amp;quot;1&amp;quot; = &amp;quot;base&amp;quot;,
                            &amp;quot;2&amp;quot; = &amp;quot;upsample&amp;quot;,
                            &amp;quot;3&amp;quot; = &amp;quot;smote&amp;quot;,
                            &amp;quot;4&amp;quot; = &amp;quot;rose&amp;quot;,
                            &amp;quot;5&amp;quot; = &amp;quot;adasyn&amp;quot;,
                            &amp;quot;6&amp;quot; = &amp;quot;downsample&amp;quot;,
                            &amp;quot;7&amp;quot; = &amp;quot;nearmiss&amp;quot;,
                            &amp;quot;8&amp;quot; = &amp;quot;tomek&amp;quot;,
                            &amp;quot;9&amp;quot; = &amp;quot;random_forest&amp;quot;
                            )
                        ))

sensi  %&amp;lt;&amp;gt;%  
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

specif %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

mathew %&amp;lt;&amp;gt;% 
  enframe() %&amp;gt;% 
  unnest(cols = c(&amp;quot;value&amp;quot;))

result %&amp;lt;&amp;gt;% 
  bind_cols(sensitive = sensi$.estimate, specific = specif$.estimate, mathew = mathew$.estimate)

# Plot the result
result %&amp;gt;% 
  pivot_longer(cols = 2:5, names_to = &amp;quot;measure&amp;quot;) %&amp;gt;% 
  ggplot(aes(x = model, y = value, fill = measure)) +
  geom_bar(position = &amp;quot;dodge&amp;quot;, stat = &amp;quot;identity&amp;quot;) +
  theme_bw() +
  coord_flip() +
  geom_text(aes(label = paste0(round(value*100, digits = 1), &amp;quot;%&amp;quot;)), 
            position = position_dodge(0.9), vjust = 0.3, size = 2.7, hjust = -0.1) +
  labs(title = &amp;quot;Comparison of unbalanced data techniques&amp;quot;, 
       x = &amp;quot;Techniques&amp;quot;, 
       y = &amp;quot;Performance&amp;quot;) +
  scale_fill_discrete(name = &amp;quot;Metrics:&amp;quot;,
                      labels = c(&amp;quot;Accuracy&amp;quot;, &amp;quot;MCC&amp;quot;, &amp;quot;Sensitivity&amp;quot;, &amp;quot;Specificity&amp;quot;)) +
  theme(legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/handling-imbalanced-data/index.en_files/figure-html/summary-measure2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see from the above plot that the base model (decision tree) clearly has a low detection rate for the minority class (specificity). All methods are able to increase the specificity, while sacrificing some accuracy and sensitivity. As mentioned earlier, accuracy is not a good metric for this kind of model (the accuracy paradox). MCC, on the other hand, takes into account all values of the confusion matrix: true positives, false positives, true negatives, and false negatives. Hence, MCC is more informative than accuracy (and the F score, which is not included in the plot for the sake of simplicity).&lt;/p&gt;
&lt;p&gt;Based on MCC, specificity, and sensitivity, the downsample approach probably gives the most balanced model. However, this does not mean the downsample technique is the best overall, as I believe each technique behaves differently from one dataset to another.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://themis.tidymodels.org/reference/index.html&#34; class=&#34;uri&#34;&gt;https://themis.tidymodels.org/reference/index.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&#34; class=&#34;uri&#34;&gt;https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&#34; class=&#34;uri&#34;&gt;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exponentially Weighted Average in Deep Learning</title>
      <link>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</link>
      <pubDate>Sun, 09 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have been reading about loss functions and optimisers in deep learning for the last couple of days when I stumbled upon the term Exponentially Weighted Average (EWA). So, in this post I aim to explain my understanding of EWA.&lt;/p&gt;
&lt;div id=&#34;overview-of-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview of EWA&lt;/h2&gt;
&lt;p&gt;EWA is basically an important concept in deep learning and has been used in several optimisers to smooth out the noise in the data.&lt;/p&gt;
&lt;p&gt;Let’s see the formula for EWA:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;formula.png&#34; width=&#34;60%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the smoothed value at point &lt;em&gt;t&lt;/em&gt;, while &lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the data point at point &lt;em&gt;t&lt;/em&gt;. &lt;em&gt;B&lt;/em&gt; here is a hyperparameter that we need to tune in our network. So, the choice of &lt;em&gt;B&lt;/em&gt; determines over how many data points we effectively average when computing &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt;, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;beta.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
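&lt;p&gt;To make the recursion concrete, below is a minimal sketch in R of the update &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;B&lt;/em&gt; &lt;em&gt;V&lt;sub&gt;t-1&lt;/sub&gt;&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;) &lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt;. The data points and the choice of &lt;em&gt;B&lt;/em&gt; = 0.9 are made up for illustration only:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;s &amp;lt;- c(10, 12, 9, 14, 13, 15, 11)  # made-up data points
b &amp;lt;- 0.9                           # the hyperparameter B

v &amp;lt;- numeric(length(s))
v[1] &amp;lt;- (1 - b) * s[1]             # starting from V0 = 0
for (t in 2:length(s)) {
  v[t] &amp;lt;- b * v[t - 1] + (1 - b) * s[t]
}
v  # the smoothed series&lt;/code&gt;&lt;/pre&gt;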
&lt;/div&gt;
&lt;div id=&#34;ewa-in-deep-learnings-optimiser&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;EWA in deep learning optimisers&lt;/h2&gt;
&lt;p&gt;So, some of the optimisers that adopt EWA are listed below (the red box indicates the EWA part in each formula):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Stochastic gradient descent (SGD) with momentum&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The issue with SGD is the presence of noise while searching for the global minimum. So, SGD with momentum integrates EWA, which reduces this noise and helps the network converge faster.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;SGD-momentum2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
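&lt;p&gt;As a rough sketch of this idea (assuming the EWA form of momentum, where the velocity is an exponentially weighted average of past gradients; the toy objective, learning rate, and starting point are my own assumptions for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grad &amp;lt;- function(w) 2 * w  # gradient of the toy objective f(w) = w^2

w  &amp;lt;- 5     # starting parameter value
v  &amp;lt;- 0     # velocity (EWA of the gradients)
b  &amp;lt;- 0.9   # momentum term, the B in EWA
lr &amp;lt;- 0.1   # learning rate

for (step in 1:200) {
  v &amp;lt;- b * v + (1 - b) * grad(w)  # smooth the gradient with EWA
  w &amp;lt;- w - lr * v                 # parameter update
}
w  # w has moved towards the minimum at 0&lt;/code&gt;&lt;/pre&gt;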
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive delta (Adadelta) and Root Mean Square Propagation (RMSprop)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Adadelta and RMSprop were proposed in an attempt to solve the diminishing learning rate issue of the adaptive gradient (Adagrad) optimiser. The use of EWA in both optimisers actually helps to achieve this. Both optimisers have quite similar formulas, but attached below is the formula for Adadelta.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;adadelta2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive moment estimation (ADAM)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ADAM basically combines SGD with momentum and Adadelta. As shown earlier, both of those optimisers use EWA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;more-details-on-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;More details on EWA&lt;/h2&gt;
&lt;p&gt;Now, let’s go back to EWA. Here is an example of the EWA calculation:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq1.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Keep in mind that &lt;em&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt; is the latest time point, followed by &lt;em&gt;t&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;t&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt;, respectively. So, if we want to calculate &lt;em&gt;V&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq2.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, if we want to vary the value of &lt;em&gt;B&lt;/em&gt; across the equation (while the values of &lt;em&gt;a&lt;sub&gt;1&lt;/sub&gt;…a&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt; remain constant), we can do so in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) 

func &amp;lt;- function(b) (1 - b) * b^((20:1) - 1)
beta &amp;lt;- seq(0.1, 0.9, by=0.2)

dat &amp;lt;- t(sapply(beta, func)) %&amp;gt;% 
  as.data.frame()
colnames(dat)[1:20] &amp;lt;- 1:20

dat %&amp;gt;%  
  mutate(beta = as_factor(beta)) %&amp;gt;%
  pivot_longer(cols = 1:20, names_to = &amp;quot;data_point&amp;quot;, values_to = &amp;quot;weight&amp;quot;) %&amp;gt;% 
  ggplot(aes(x=as.numeric(data_point), y=weight, color=beta)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:20) +
  labs(title = &amp;quot;Change of Exponentially Weighted Average function&amp;quot;, 
       subtitle = &amp;quot;Time at t20 is the recent time, and t1 is the initial time&amp;quot;) +
  scale_colour_discrete(&amp;quot;Beta:&amp;quot;) +
  xlab(&amp;quot;Time(t)&amp;quot;) +
  ylab(&amp;quot;Weights/Coefficients&amp;quot;) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that time at t&lt;sub&gt;20&lt;/sub&gt; is the recent time, and t&lt;sub&gt;1&lt;/sub&gt; is the initial time. Thus, two main points from the above plot are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The EWA weights decay exponentially as we move back in time.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As beta, &lt;em&gt;B&lt;/em&gt;, increases, the weights are spread over more of the past data points (roughly the last 1/(1 - &lt;em&gt;B&lt;/em&gt;) points), so the smoothed value changes more slowly and the most recent data point gets relatively less weight.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Side note: I tried to make the plot in plotly, but I am not sure why it did not work&lt;/em&gt; 😕&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
1) &lt;a href=&#34;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&lt;/a&gt; (all the equations are from this reference)&lt;br /&gt;
2) &lt;a href=&#34;https://youtu.be/NxTFlzBjS-4&#34; class=&#34;uri&#34;&gt;https://youtu.be/NxTFlzBjS-4&lt;/a&gt;&lt;br /&gt;
3) &lt;a href=&#34;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&#34; class=&#34;uri&#34;&gt;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Base R vs tidyverse</title>
      <link>https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/</link>
      <pubDate>Tue, 04 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/2021-05-04-base-r-vs-tidyverse/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;First of all, this write-up is meant for beginners in R.&lt;/p&gt;
&lt;p&gt;Things can be done in many ways in R. In fact, R is very flexible in this regard compared to other statistical software. Basic things such as selecting a column, slicing rows, or filtering data based on a certain condition can be done using base R functions. However, all these things can also be done using a tidyverse approach.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;Tidyverse&lt;/a&gt; is basically a collection of packages that can be loaded with a single line of code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tidyverse is developed by the “RStudio people”, pioneered by &lt;a href=&#34;http://hadley.nz/&#34;&gt;Hadley Wickham&lt;/a&gt;, which means that these packages will be continuously maintained and updated.&lt;/p&gt;
&lt;p&gt;So, without further ado, here are comparisons between these two approaches for some very basic tasks:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Select or deselect a column and a row&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[1:5, c(&amp;quot;Sepal.Length&amp;quot;, &amp;quot;Sepal.Width&amp;quot;)]
iris[1:5,c(1,2)] # similar to above
iris[1:5, -1]

# Tidyverse
iris %&amp;gt;% 
  select(Sepal.Length, Sepal.Width) %&amp;gt;% 
  slice(1:5)
iris %&amp;gt;% 
  select(-Sepal.Length) %&amp;gt;% 
  slice(1:5)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Filter based on condition&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[iris$Species == &amp;quot;setosa&amp;quot;, ]

# Tidyverse
iris %&amp;gt;% 
  filter(Species == &amp;quot;setosa&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Mutate a variable (transmute keeps only the newly created variables)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris$SL_minus10 &amp;lt;- iris$Sepal.Length - 10

# Tidyverse
iris %&amp;gt;% 
  mutate(SL_minus10 = Sepal.Length - 10)&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;4&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Sort variable&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
iris[order(-iris$Sepal.Width),]

# Tidyverse
iris %&amp;gt;% 
  arrange(desc(Sepal.Width))&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&#34;5&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Group by (and get mean for variable Sepal.Width)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Not really base R
doBy::summaryBy(Sepal.Width~Species, iris, FUN = mean) 

# Tidyverse
iris %&amp;gt;% 
  group_by(Species) %&amp;gt;% 
  summarise(mean_SW = mean(Sepal.Width))&lt;/code&gt;&lt;/pre&gt;
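&lt;p&gt;For a purely base R alternative, &lt;code&gt;aggregate()&lt;/code&gt; can produce the same summary; a quick sketch:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)&lt;/code&gt;&lt;/pre&gt;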
&lt;ol start=&#34;6&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Rename variable&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Base R
colnames(iris)[5] &amp;lt;- &amp;quot;hanis&amp;quot;

# Tidyverse (new_name = old_name)
iris %&amp;gt;% 
  rename(hanis = Species)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, that’s it. Overall, tidyverse gives clarity when reading the code, as it reads from left to right. In contrast, the base R approach reads from the inside out, especially for more complicated code.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Loop vs apply in R</title>
      <link>https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/</link>
      <pubDate>Tue, 04 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/loop-vs-apply-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I have heard quite a few times that the apply functions are faster than loops in R. Loops are said to be inefficient, though in certain situations a loop is the only way.&lt;/p&gt;
&lt;p&gt;Let’s compare a loop and an apply function in R.&lt;/p&gt;
&lt;p&gt;First, make a very big fake dataset containing a list of vectors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
xlist &amp;lt;- list(col1 = rnorm(10000000), 
              col2 = rnorm(10000000),
              col3 = rnorm(100000000),
              col4 = rnorm(1000000)) # this will take a few seconds&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, calculate the mean of each vector using a &lt;code&gt;for&lt;/code&gt; loop.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ptm &amp;lt;- proc.time() #-- start the clock

mean_loop &amp;lt;- vector(&amp;quot;list&amp;quot;, length(xlist)) # pre-allocate a placeholder for the results
for (i in seq_along(xlist)) {
  mean_loop[[i]] &amp;lt;- mean(xlist[[i]])
}

proc.time() - ptm #-- stop the clock (time in seconds)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    user  system elapsed 
##    0.38    0.00    0.37&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, using the &lt;code&gt;lapply()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ptm &amp;lt;- proc.time() #-- start the clock

mean_apply &amp;lt;- lapply(xlist, mean)

proc.time() - ptm #-- stop the clock&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    user  system elapsed 
##    0.34    0.00    0.35&lt;/code&gt;&lt;/pre&gt;
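&lt;p&gt;As a quick sanity check (a small sketch of my own), we can confirm that both approaches produce the same means:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# lapply() keeps the element names, so drop them before comparing
identical(unlist(mean_loop), unname(unlist(mean_apply)))  # should return TRUE&lt;/code&gt;&lt;/pre&gt;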
&lt;p&gt;So, &lt;code&gt;lapply()&lt;/code&gt; is a little bit faster. Obviously, with a very big dataset and a more complicated objective, &lt;code&gt;lapply()&lt;/code&gt; is the right choice, but for a “normal”-sized dataset, the choice between the two probably does not make much difference.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How many Malaysian should be vaccinated to get herd immunity from COVID-19?</title>
      <link>https://tengkuhanis.netlify.app/post/how-many-malaysian-should-be-vaccinated-to-get-herd-immunity-from-covid-19/</link>
      <pubDate>Mon, 07 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/how-many-malaysian-should-be-vaccinated-to-get-herd-immunity-from-covid-19/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/how-many-malaysian-should-be-vaccinated-to-get-herd-immunity-from-covid-19/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Recently I read an &lt;a href=&#34;https://codeblue.galencentre.org/2020/11/27/malaysia-buying-pfizers-ultra-cold-covid-19-vaccine-posing-major-distribution-issues/#:~:text=According%20to%20BioSpace%2C%20the%20Covid,price%20sold%20to%20the%20US&#34;&gt;article&lt;/a&gt; stating that the Malaysian government has made a deal with Pfizer for 6.4 million Malaysians to be vaccinated. So, I am wondering what the minimum number of people to be vaccinated should be.&lt;/p&gt;
&lt;p&gt;I have also come across this interesting &lt;a href=&#34;https://www.cebm.net/covid-19/when-will-it-be-over-an-introduction-to-viral-reproduction-numbers-r0-and-re/&#34;&gt;article&lt;/a&gt;, which explains how we can calculate the minimum proportion of people to be vaccinated to achieve herd immunity based on R naught (R&lt;sub&gt;0&lt;/sub&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R naught (R&lt;sub&gt;0&lt;/sub&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The basic idea of R&lt;sub&gt;0&lt;/sub&gt; or basic reproduction number is quite simple. It describes how many secondary infections will derive from the first case. I think Figure 1 below describes this idea very well.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:unnamed-chunk-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;r0.png&#34; alt=&#34;Basic idea of R~0~(image from https://www.atrainceu.com/content/3-basic-reproduction-number-r-naught)&#34; width=&#34;60%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Basic idea of R&lt;sub&gt;0&lt;/sub&gt;(image from &lt;a href=&#34;https://www.atrainceu.com/content/3-basic-reproduction-number-r-naught&#34; class=&#34;uri&#34;&gt;https://www.atrainceu.com/content/3-basic-reproduction-number-r-naught&lt;/a&gt;)
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So, R&lt;sub&gt;0&lt;/sub&gt; can be affected by a few factors, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;proportion of susceptible people at the initial outbreak&lt;/li&gt;
&lt;li&gt;infectiousness of the virus or the disease&lt;/li&gt;
&lt;li&gt;rate of recovery or death&lt;/li&gt;
&lt;li&gt;and a few other factors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When R&lt;sub&gt;0&lt;/sub&gt; is above 1, the spread of the disease will increase, while an R&lt;sub&gt;0&lt;/sub&gt; below 1 indicates the spread of the disease will decrease and eventually die out.&lt;/p&gt;
&lt;p&gt;However, I noticed that quite a few parties, including KKM (Ministry of Health, Malaysia), have used the term R&lt;sub&gt;0&lt;/sub&gt; in their reports instead of R&lt;sub&gt;e&lt;/sub&gt; or R&lt;sub&gt;t&lt;/sub&gt;, which is the effective reproduction number or time-varying reproduction number. R&lt;sub&gt;0&lt;/sub&gt; refers to the initial reproduction number at the beginning of the outbreak. The “naught” or “zero” in R naught (R&lt;sub&gt;0&lt;/sub&gt;) refers to a population that has zero immunity to the disease.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Herd immunity&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Herd immunity is said to occur when a significant proportion of the population is immunized. Subsequently, those who are susceptible (not immunized) will be protected.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How many should be vaccinated&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;So, back to the initial topic. We can use the formula below to answer this question.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[P_i &amp;gt; 1 - \frac{1}{R_0}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;P&lt;sub&gt;i&lt;/sub&gt; refers to the proportion of the population that should be immunized or, in this case, vaccinated.&lt;/p&gt;
&lt;p&gt;So, after googling, I found one calculation by my lecturer in the Biostat Unit, USM, &lt;a href=&#34;https://wnarifin.github.io/&#34;&gt;Dr Wan Arifin&lt;/a&gt;, and his colleague. The R&lt;sub&gt;0&lt;/sub&gt; based on his &lt;a href=&#34;https://wnarifin.github.io/covid-19-malaysia-sir/&#34;&gt;calculation&lt;/a&gt; is 2.673. I also found another &lt;a href=&#34;https://codeblue.galencentre.org/2020/04/10/mco-slashed-malaysia-covid-19-infection-rate-by-over-three-times/&#34;&gt;article&lt;/a&gt; reporting that the R&lt;sub&gt;0&lt;/sub&gt; was 3.55 in March, according to KKM.&lt;/p&gt;
&lt;p&gt;Malaysia’s population is estimated at &lt;a href=&#34;https://www.dosm.gov.my/v1/index.php?r=column/cthemeByCat&amp;amp;cat=155&amp;amp;bul_id=OVByWjg5YkQ3MWFZRTN5bDJiaEVhZz09&amp;amp;menu_id=L0pheU43NWJwRWVSZklWdzQ4TlhUUT09&#34;&gt;32.7 million&lt;/a&gt; by the Department of Statistics, Malaysia (DOSM). So, using the formula above, about 63% to 72% of the Malaysian population should be vaccinated, which translates to about 20.6 to 23.5 million people.&lt;/p&gt;
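&lt;p&gt;A minimal sketch of this calculation in R, using the two R&lt;sub&gt;0&lt;/sub&gt; estimates and the population figure quoted above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r0 &amp;lt;- c(2.673, 3.55)   # R0 estimates quoted above
population &amp;lt;- 32.7e6   # DOSM population estimate

p_i &amp;lt;- 1 - 1 / r0      # minimum proportion to vaccinate (about 63% and 72%)
p_i * population       # minimum number of people to vaccinate&lt;/code&gt;&lt;/pre&gt;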
&lt;p&gt;The deal that the Malaysian government made with Pfizer is far from enough, but of course, this is a very good and quick decision. We also have other vaccines like Moderna’s vaccine coming up.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: This is just my opinion. Please take it with a massive grain of salt.&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
