<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Data exploration | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/tag/data-exploration/</link>
      <atom:link href="https://tengkuhanis.netlify.app/tag/data-exploration/index.xml" rel="self" type="application/rss+xml" />
    <description>Data exploration</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Wed, 07 Aug 2024 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>Data exploration</title>
      <link>https://tengkuhanis.netlify.app/tag/data-exploration/</link>
    </image>
    
    <item>
      <title>Basic plotting with Matplotlib and Seaborn</title>
      <link>https://tengkuhanis.netlify.app/post/basic-plotting-in-python/</link>
      <pubDate>Wed, 07 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/basic-plotting-in-python/</guid>
      <description>


&lt;p&gt;This post is continuation of my previous post about Python. For those interested:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;a href=&#34;https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/&#34;&gt;Basic data wrangling with Python&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Basic plotting with matplotlib and seaborn&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Comparison of ggplot in R versus in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are several packages or libraries available in Python for plotting and visualization. However, the most commonly used package is &lt;a href=&#34;https://matplotlib.org/&#34;&gt;matplotlib&lt;/a&gt;. This package is quite extensive and often time can be quite complicated to use. Thus, &lt;a href=&#34;https://seaborn.pydata.org/&#34;&gt;seaborn&lt;/a&gt; package is another alternative and complementary to matplotlib. Seaborn is based on matplotlib and provides a high-level functionality compare to matplotlib.&lt;/p&gt;
&lt;p&gt;So, in this blog post, let us compare several basic plots using both packages.&lt;/p&gt;
&lt;div id=&#34;load-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load packages&lt;/h2&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;load-dataset&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load dataset&lt;/h2&gt;
&lt;p&gt;We going to use the &lt;a href=&#34;https://www.kaggle.com/datasets/arshid/dat-flower-dataset&#34;&gt;iris dataset&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;dat = sns.load_dataset(&amp;#39;iris&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further see the information on this dataset.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;dat.head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;histogram&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Histogram&lt;/h2&gt;
&lt;p&gt;Let’s plot the histogram using matplotlib first.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.hist(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Notice that this histogram does not has any label. So, to add a label, we need to do this manually.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.hist(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.xlabel(&amp;#39;Sepal length&amp;#39;) #x-axis label
plt.ylabel(&amp;#39;Frequency&amp;#39;) #y-axis label
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-5-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, using seaborn, the label is extracted from the variable name, which is pretty convenient.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.histplot(dat[&amp;#39;sepal_length&amp;#39;], bins=30)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-6-5.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Let’s say we want to plot the histogram according to different levels.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;species = [&amp;#39;setosa&amp;#39;, &amp;#39;versicolor&amp;#39;, &amp;#39;virginica&amp;#39;]

for i in species:
    subset = dat[dat[&amp;#39;species&amp;#39;] == i]
    plt.hist(subset[&amp;#39;sepal_length&amp;#39;], label = i)

plt.legend(loc = &amp;#39;upper right&amp;#39;)
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Frequency&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-7-7.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The codes above are quite long. In seaborn, the histogram above can be generated quite easily.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.histplot(x = &amp;#39;sepal_length&amp;#39;, hue = &amp;#39;species&amp;#39;, data = dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-8-9.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;boxplot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Boxplot&lt;/h2&gt;
&lt;p&gt;First, let’s do boxplot using matplotlib.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;bp = plt.boxplot(dat[&amp;#39;sepal_length&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-9-11.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we wanto to do boxplot according to other variable. The codes become a bit complicated especially for beginners.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;species = dat.groupby(&amp;#39;species&amp;#39;)
setosa = species.get_group(&amp;#39;setosa&amp;#39;)[&amp;#39;sepal_length&amp;#39;]
versicolor = species.get_group(&amp;#39;versicolor&amp;#39;)[&amp;#39;sepal_length&amp;#39;]
virginica = species.get_group(&amp;#39;virginica&amp;#39;)[&amp;#39;sepal_length&amp;#39;]

bp = plt.boxplot([setosa, versicolor, virginica], labels = [&amp;#39;setosa&amp;#39;, &amp;#39;versicolor&amp;#39;, &amp;#39;virginica&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-10-13.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Both plots above are quite easy to do in seaborn. Below are the codes for the basic histogram.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.boxplot(dat[&amp;#39;sepal_length&amp;#39;])
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-11-15.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Next, to plot &lt;code&gt;sepal_length&lt;/code&gt; based on &lt;code&gt;species&lt;/code&gt; is pretty much straightforward in seaborn.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.boxplot(y=&amp;#39;sepal_length&amp;#39;, hue=&amp;#39;species&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-12-17.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;scatter-plot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scatter plot&lt;/h2&gt;
&lt;p&gt;Lastly, let’s see the scatter plot using matplotlib.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;plt.scatter(x=dat[&amp;#39;sepal_length&amp;#39;], y=dat[&amp;#39;sepal_width&amp;#39;])
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Sepal width&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-13-19.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can further extend this plot by categorising it into different species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;# Define the species to colors mapping
species_to_color = {&amp;#39;setosa&amp;#39;: &amp;#39;blue&amp;#39;, &amp;#39;versicolor&amp;#39;: &amp;#39;green&amp;#39;, &amp;#39;virginica&amp;#39;: &amp;#39;red&amp;#39;}
colors = dat[&amp;#39;species&amp;#39;].map(species_to_color)

# Create the scatter plot
plt.scatter(x=dat[&amp;#39;sepal_length&amp;#39;], y=dat[&amp;#39;sepal_width&amp;#39;], c=colors)
plt.xlabel(&amp;#39;Sepal length&amp;#39;)
plt.ylabel(&amp;#39;Sepal width&amp;#39;)
plt.legend(handles=[plt.Line2D([0], [0], marker=&amp;#39;o&amp;#39;, color=&amp;#39;w&amp;#39;, markerfacecolor=color, markersize=10, label=species) for species, color in species_to_color.items()], title=&amp;#39;Species&amp;#39;)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-14-21.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now, let’s see the seaborn package. This is the basic scatter plot.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.scatterplot(x=&amp;#39;sepal_length&amp;#39;, y=&amp;#39;sepal_width&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-15-23.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To extend this plot by categorising it into different species in seaborn is actually quite simple.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;sns.scatterplot(x=&amp;#39;sepal_length&amp;#39;, y=&amp;#39;sepal_width&amp;#39;, hue=&amp;#39;species&amp;#39;, data=dat)
plt.show()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/basic-plotting-in-python/index.en_files/figure-html/unnamed-chunk-16-25.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, matplotlib and seaborn complement each other well. Seaborn is an excellent choice for quick and standard plots, thanks to its high-level interface. On the other hand, matplotlib offers a more extensive range of customization options and is ideal for creating complex and detailed visualizations. Ultimately, choosing between matplotlib and seaborn depends on the specific requirements of the visualization task.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Explore data using PCA</title>
      <link>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</link>
      <pubDate>Wed, 09 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/explore-data-using-pca/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;principal-component-analysis-pca&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Principal component analysis (PCA)&lt;/h2&gt;
&lt;p&gt;PCA is a dimension reduction techniques. So, if we have a large number of predictors, instead of using all the predictors for modelling or other analysis, we can compressed all the information from the variables and create a new set of variables. This new set of variables are known as components or principal component (PC). So, now we have a smaller number of variables which contain the information from the original variables.&lt;/p&gt;
&lt;p&gt;PCA usually used for a dataset with a large features or predictors like genomic data. Additionally, PCA is a good pre-processing option if you have a correlated variable or have a multicollinearity issue in the model. Also, we can use PCA for exploration of the data and have a better understanding of our data.&lt;/p&gt;
&lt;p&gt;For those who want to study the theoretical side of PCA can further read on this &lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34;&gt;link&lt;/a&gt;. We going to focus more on the coding part in the machine learning framework (using &lt;code&gt;tidymodels&lt;/code&gt; package) in this post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-in-r&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Example in R&lt;/h2&gt;
&lt;p&gt;These are the packages that we going to use.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidymodels)
library(tidyverse)
library(mlbench) #data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We going to use diabetes dataset. The outcome is binary; positive = diabetes and negative = non-diabetes/healthy. All other variables are numerical values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(&amp;quot;PimaIndiansDiabetes&amp;quot;)
glimpse(PimaIndiansDiabetes)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 768
## Columns: 9
## $ pregnant &amp;lt;dbl&amp;gt; 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
## $ glucose  &amp;lt;dbl&amp;gt; 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
## $ pressure &amp;lt;dbl&amp;gt; 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
## $ triceps  &amp;lt;dbl&amp;gt; 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
## $ insulin  &amp;lt;dbl&amp;gt; 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
## $ mass     &amp;lt;dbl&amp;gt; 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
## $ pedigree &amp;lt;dbl&amp;gt; 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
## $ age      &amp;lt;dbl&amp;gt; 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
## $ diabetes &amp;lt;fct&amp;gt; pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We going to split the data and extract the training dataset. We going to explore only the training set since we going to do this in a machine learning framework.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

ind &amp;lt;- initial_split(PimaIndiansDiabetes)
dat_train &amp;lt;- training(ind)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a recipe and apply normalization and PCA techniques. Then, we prep it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Recipe
pca_rec &amp;lt;- 
  recipe(diabetes ~ ., data = dat_train) %&amp;gt;% 
  step_normalize(all_numeric_predictors()) %&amp;gt;% 
  step_pca(all_numeric_predictors())

# Prep
pca_prep &amp;lt;- prep(pca_rec)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, we can extract the PCA data using &lt;code&gt;tidy()&lt;/code&gt;. &lt;code&gt;type = &#34;coef&#34;&lt;/code&gt; indicates that we want the loadings values. So, the values in the data are the loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;coef&amp;quot;)
pca_tidied&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 64 x 4
##    terms     value component id       
##    &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;    
##  1 pregnant  0.107 PC1       pca_JtuLZ
##  2 glucose   0.357 PC1       pca_JtuLZ
##  3 pressure  0.330 PC1       pca_JtuLZ
##  4 triceps   0.460 PC1       pca_JtuLZ
##  5 insulin   0.466 PC1       pca_JtuLZ
##  6 mass      0.447 PC1       pca_JtuLZ
##  7 pedigree  0.315 PC1       pca_JtuLZ
##  8 age       0.158 PC1       pca_JtuLZ
##  9 pregnant -0.597 PC2       pca_JtuLZ
## 10 glucose  -0.192 PC2       pca_JtuLZ
## # ... with 54 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, basically the loadings indicate how much each variable contributes to each component (PC). A large loading (positive or negative) indicates a strong relationship between the variables and the related components. The sign indicates a negative or positive correlation between the variables and components.&lt;/p&gt;
&lt;p&gt;We can further visualise these loadings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied %&amp;gt;% 
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = F) +
  facet_wrap(~ component) +
  ylab(&amp;quot;&amp;quot;) +
  xlab(&amp;quot;Loadings&amp;quot;) + 
  theme_minimal()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Besides the loadings, we can also get a variance information. Variance of each component (or PC) measures how much that particular component explains the variability in the data. For example, PC1 explain 26.2% variance in the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied2 &amp;lt;- tidy(pca_prep, 2, type = &amp;quot;variance&amp;quot;)

pca_tidied2 %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value, names_prefix = &amp;quot;PC&amp;quot;) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  mutate_if(is.numeric, round, digits = 1) %&amp;gt;% 
  kableExtra::kable(&amp;quot;simple&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;terms&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC3&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC4&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC5&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC6&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC7&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;PC8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.9&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cumulative percent variance&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;60.7&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;81.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;89.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;95.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;100.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Next, we can visualise PC1 and PC2 in a scatter plot and see how each variable influences both PCs. First, we need to extract the loadings and convert into a wide format for our arrow coordinate in the scatter plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca_tidied3 &amp;lt;- 
  pca_tidied %&amp;gt;% 
  filter(component %in% c(&amp;quot;PC1&amp;quot;, &amp;quot;PC2&amp;quot;)) %&amp;gt;% 
  select(-id) %&amp;gt;% 
  pivot_wider(names_from = component, values_from = value)
pca_tidied3&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 8 x 3
##   terms      PC1    PC2
##   &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
## 1 pregnant 0.107 -0.597
## 2 glucose  0.357 -0.192
## 3 pressure 0.330 -0.234
## 4 triceps  0.460  0.279
## 5 insulin  0.466  0.200
## 6 mass     0.447  0.121
## 7 pedigree 0.315  0.110
## 8 age      0.158 -0.638&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, we can make a scatter plot using training set data (&lt;code&gt;juice(pca_prep)&lt;/code&gt;) and the loadings data (&lt;code&gt;pca_tidied3&lt;/code&gt;). Also, we going to add percentage of variance for PC1 and PC2 in the axis labels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;juice(pca_prep) %&amp;gt;% 
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = diabetes, shape = diabetes), size = 2, alpha = 0.6) +
  geom_segment(data = pca_tidied3, 
               aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), 
               arrow = arrow(length = unit(1/2, &amp;quot;picas&amp;quot;)),
               color = &amp;quot;blue&amp;quot;) +
  annotate(&amp;quot;text&amp;quot;, 
           x = pca_tidied3$PC1 * 5.2, 
           y = pca_tidied3$PC2 * 5.2, 
           label = pca_tidied3$terms) +
  theme_minimal() +
  xlab(&amp;quot;PC1 (26.2%)&amp;quot;) +
  ylab(&amp;quot;PC2 (21.5%)&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/explore-data-using-pca/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from this scatter plot we learn that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;(triceps, insulin, pedigree and mass), (glucose and pressure) and (pregnant and age) are correlated as their lines are close to each other&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As PC1 and PC2 increase, triceps, insulin, pedigree and mass also increase&lt;/li&gt;
&lt;li&gt;As PC2 decreases, pregnant and age increase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&#34; class=&#34;uri&#34;&gt;http://strata.uga.edu/8370/lecturenotes/principalComponents.html&lt;/a&gt;&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://juliasilge.com/blog/cocktail-recipes-umap/&#34; class=&#34;uri&#34;&gt;https://juliasilge.com/blog/cocktail-recipes-umap/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Extract a table from a pdf</title>
      <link>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</link>
      <pubDate>Mon, 01 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/extract-a-table-from-a-pdf/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In a couple of days, I am going to conduct a pre-conference workshop for Malaysian &lt;a href=&#34;https://www.r-conference.com/&#34;&gt;R conference 2021&lt;/a&gt;. So, some of the data that I am going to use for this workshop is available in a table in pdf form. Hence, this post is about how I get that particular table from the pdf into R for further analysis.&lt;/p&gt;
&lt;p&gt;So, this is a table we going to extract.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;images/table.png&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;extracting-a-table-from-pdf&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting a table from pdf&lt;/h2&gt;
&lt;p&gt;We going to use &lt;code&gt;tabulizer&lt;/code&gt; package for this. However, not every pdf works with this package. In our case, it works but need further preprocessing.&lt;/p&gt;
&lt;p&gt;Load the required packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tabulizer)
library(dplyr)
library(stringr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read a table from a pdf.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table &amp;lt;- extract_tables(&amp;quot;https://static-content.springer.com/esm/art%3A10.1038%2Fs41440-021-00720-3/MediaObjects/41440_2021_720_MOESM1_ESM.pdf&amp;quot;, 
                          pages = 17, 
                          output = &amp;quot;data.frame&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, this is the extracted table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_table[[1]] %&amp;gt;% head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                X     X.1     X.2     X.3  X.4     X.5 X.6     X.7  X.8
## 1                                                                     
## 2                                                                     
## 3    Ahmed, 2019 Unclear Unclear Unclear High Unclear Low Unclear High
## 4                                                                     
## 5   Badrov, 2013 Unclear    High    High High Unclear Low Unclear High
## 6   Baross, 2012 Unclear Unclear    High High Unclear Low Unclear High
## 7   Baross, 2013 Unclear Unclear    High High Unclear Low Unclear High
## 8  Carlson, 2016     Low    High    High  Low Unclear Low     Low High
## 9  Correia, 2020     Low     Low     Low High Unclear Low     Low High
## 10                                                                    
##                              X.9
## 1      1- selection bias: random
## 2            sequence generation
## 3  2- selection bias: allocation
## 4                    concealment
## 5                               
## 6   3- reporting bias: selective
## 7                      reporting
## 8                               
## 9  4- Performance bias: blinding
## 10  (participants and personnel)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, a few preprocessing steps needed:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Remove column X.9 - this column supposed to be a header&lt;/li&gt;
&lt;li&gt;Rename a header based on column X.9&lt;/li&gt;
&lt;li&gt;Remove a space between the author name - “Ahmed,2019” instead of “Ahmed, 2019”&lt;/li&gt;
&lt;li&gt;Remove empty rows&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob &amp;lt;- 
  raw_table[[1]] %&amp;gt;% 
  select(-X.9) %&amp;gt;%  
  rename(Study = X, 
         Random.sequence.generation. = X.1, 
         Allocation.concealment. = X.2,
         Selective.reporting. = X.3,
         Blinding.of.participants.and.personnel. = X.4, 
         Blinding.of.outcome.assessment = X.5, 
         Incomplete.outcome.data = X.6, 
         Other.sources.of.bias. = X.7, 
         Overall = X.8) %&amp;gt;% 
  as_tibble() %&amp;gt;% 
  mutate(Study = str_replace_all(Study, &amp;quot; &amp;quot;, &amp;quot;&amp;quot;)) %&amp;gt;% 
  mutate(id_del = str_match(Study, &amp;quot;.&amp;quot;)) %&amp;gt;% 
  filter(!is.na(id_del)) %&amp;gt;% 
  select(-id_del)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, our data is ready.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;irt_rob&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##          Study Random.sequence.generation. Allocation.concealment.
## 1   Ahmed,2019                     Unclear                 Unclear
## 2  Badrov,2013                     Unclear                    High
## 3  Baross,2012                     Unclear                 Unclear
## 4  Baross,2013                     Unclear                 Unclear
## 5 Carlson,2016                         Low                    High
##   Selective.reporting. Blinding.of.participants.and.personnel.
## 1              Unclear                                    High
## 2                 High                                    High
## 3                 High                                    High
## 4                 High                                    High
## 5                 High                                     Low
##   Blinding.of.outcome.assessment Incomplete.outcome.data Other.sources.of.bias.
## 1                        Unclear                     Low                Unclear
## 2                        Unclear                     Low                Unclear
## 3                        Unclear                     Low                Unclear
## 4                        Unclear                     Low                Unclear
## 5                        Unclear                     Low                    Low
##   Overall
## 1    High
## 2    High
## 3    High
## 4    High
## 5    High&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Data exploration in R</title>
      <link>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</link>
      <pubDate>Sun, 22 Aug 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/data-exploration-in-r/</guid>
      <description>
&lt;script src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;These are some of the packages that I find useful for data exploration. Basically, this post serves more as my note for future reference. I will list out packages (and some awesome functions from that particular package) rather than specific functions. Further, base R and tidyverse packages will not be included specifically in this list.&lt;/p&gt;
&lt;p&gt;Load supporting packages&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data we are going to use is from dlookr package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 299
## Columns: 13
## $ age               &amp;lt;int&amp;gt; 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 62, 45, ~
## $ anaemia           &amp;lt;fct&amp;gt; No, No, No, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, N~
## $ cpk_enzyme        &amp;lt;dbl&amp;gt; 582, 7861, 146, 111, 160, 47, 246, 315, 157, 123, 81~
## $ diabetes          &amp;lt;fct&amp;gt; No, No, No, No, Yes, No, No, Yes, No, No, No, No, No~
## $ ejection_fraction &amp;lt;dbl&amp;gt; 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 25, 30, ~
## $ hblood_pressure   &amp;lt;fct&amp;gt; Yes, No, No, No, No, Yes, No, No, No, Yes, Yes, Yes,~
## $ platelets         &amp;lt;dbl&amp;gt; 265000, 263358, 162000, 210000, 327000, 204000, 1270~
## $ creatinine        &amp;lt;dbl&amp;gt; 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.10, 1.50~
## $ sodium            &amp;lt;dbl&amp;gt; 130, 136, 129, 137, 116, 132, 137, 131, 138, 133, 13~
## $ sex               &amp;lt;fct&amp;gt; Male, Male, Male, Male, Female, Male, Male, Male, Fe~
## $ smoking           &amp;lt;fct&amp;gt; No, No, Yes, No, No, Yes, No, Yes, No, Yes, Yes, Yes~
## $ time              &amp;lt;int&amp;gt; 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11, 11, 12~
## $ death_event       &amp;lt;fct&amp;gt; Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will create a few NAs in our data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(2021)
heartfailure[sample(seq(nrow(heartfailure)), 20), &amp;quot;age&amp;quot;] &amp;lt;- NA
heartfailure[sample(seq(nrow(heartfailure)), 10), &amp;quot;sex&amp;quot;] &amp;lt;- NA&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;1) dataMaid&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dataMaid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One of the very useful function in dataMaid is &lt;code&gt;makeDataReport()&lt;/code&gt; which give report on the data. By default it will give a pdf output, but other output options such as word and html are also available.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;makeDataReport(heartfailure, replace = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the output example in &lt;a href=&#34;https://tengkuhanis.netlify.app/files/dataMaid_heartfailure.pdf&#34;&gt;pdf&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2) DataExplorer&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(DataExplorer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;General visualization:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_intro()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since we have missing data, we can further visualize it:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% plot_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% profile_missing()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              feature num_missing pct_missing
## 1                age          20  0.06688963
## 2            anaemia           0  0.00000000
## 3         cpk_enzyme           0  0.00000000
## 4           diabetes           0  0.00000000
## 5  ejection_fraction           0  0.00000000
## 6    hblood_pressure           0  0.00000000
## 7          platelets           0  0.00000000
## 8         creatinine           0  0.00000000
## 9             sodium           0  0.00000000
## 10               sex          10  0.03344482
## 11           smoking           0  0.00000000
## 12              time           0  0.00000000
## 13       death_event           0  0.00000000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also do a correlation plot&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  plot_correlation()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, I do think correlation plot from corrplot packages gives a better and clean plot. Here is a plot from corrplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(corrplot)

heartfailure %&amp;gt;% 
  select_if(is.numeric) %&amp;gt;% 
  drop_na() %&amp;gt;% 
  cor() %&amp;gt;% 
  corrplot(type = &amp;quot;upper&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Finally, we can get an overall html report from DataExplorer package using the function &lt;code&gt;create_report()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3) dlookr&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dlookr)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can assess normality of the data using this package. The code below will plot normality for all numeric variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, for the sake of the simplicity in this post, we will run only for one variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_normality(age)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can also get a correlation matrix plot from this package, and no need to remove the NAs and filter the numeric variable before running the function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  plot_correlate()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, from dlookr we can get the overall report of the data exploration in pdf (and other formats as well). This report is quite comprehensive, have a &lt;a href=&#34;https://tengkuhanis.netlify.app/files/EDA_Paged_Report.pdf&#34;&gt;look&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;heartfailure %&amp;gt;% 
  eda_paged_report(target = &amp;quot;death_event&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4) skimr&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;skimr package, especially &lt;code&gt;skim()&lt;/code&gt; function did not display correctly when using the blogdown. Hence, I included the screenshot of the result that we will typically see in the R console.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(skimr)
skim(heartfailure) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;images/black.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So, from skimr we can get an overview that includes the histogram for numerical data as well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5) outliertree&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This package identify outlier using a decision tree. I will not go in detail about the approach, but for those who want to read &lt;a href=&#34;https://arxiv.org/abs/2001.00636&#34;&gt;further&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(outliertree)
outlier.tree(heartfailure)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Reporting top 2 outliers [out of 2 found]
## 
## row [251] - suspicious column: [creatinine] - suspicious value: [0.50]
##  distribution: 96.000% &amp;gt;= 0.70 - [mean: 1.35] - [sd: 1.22] - [norm. obs: 24]
##  given:
##      [cpk_enzyme] &amp;gt; [1610.00] (value: 2522.00)
## 
## 
## row [32] - suspicious column: [cpk_enzyme] - suspicious value: [23.00]
##  distribution: 98.958% &amp;gt;= 47.00 - [mean: 677.01] - [sd: 1321.86] - [norm. obs: 95]
##  given:
##      [death_event] = [Yes]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Outlier Tree model
##  Numeric variables: 7
##  Categorical variables: 6
## 
## Consists of 369 clusters, spread across 48 tree branches&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further explore the detected outliers using histogram and boxplot. Let’s do for variable creatinine.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# histogram
hist(heartfailure$creatinine, breaks = 50, col = &amp;quot;navy&amp;quot;,
     xlab = &amp;quot;Creatinine&amp;quot;, 
     main = &amp;quot;Creatinine level&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# boxplot
boxplot(heartfailure$creatinine)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/data-exploration-in-r/index.en_files/figure-html/unnamed-chunk-19-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Probably in the future I will delve into more detail about outlier detection and any awesome packages in R related to it. If I ever written any post about it, I will link it here.&lt;/p&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;These are some useful package that I find. I may edit this post in the future to add more additional data exploration package. Furthermore, there are shiny apps for data exploration as well, though I think it’s better to sticks with coded approach in data analysis/exploration. Thus, I did not explore those apps in this post. Another thing to remember is to set the variable type accordingly prior to the data exploration.&lt;/p&gt;
&lt;p&gt;Hope this is useful!&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
&lt;a href=&#34;https://github.com/ekstroem/dataMaid&#34; class=&#34;uri&#34;&gt;https://github.com/ekstroem/dataMaid&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&#34; class=&#34;uri&#34;&gt;https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html&lt;/a&gt;&lt;br /&gt;
&lt;a href=&#34;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
