Data exploration | Tengku Hanis

Basic plotting with Matplotlib and Seaborn

Wed, 07 Aug 2024 00:00:00 +0000

This post is a continuation of my previous posts about Python. For those interested:

Basic data wrangling with Python
Basic plotting with matplotlib and seaborn
Comparison of ggplot in R versus in Python

There are several packages (libraries) available in Python for plotting and visualization. The most commonly used is matplotlib. This package is quite extensive and can often be complicated to use. The seaborn package is therefore both an alternative and a complement to matplotlib: it is built on top of matplotlib and provides higher-level functionality.

So, in this blog post, let us compare several basic plots using both packages.

Load packages

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

Load dataset

We are going to use the iris dataset.

dat = sns.load_dataset('iris')

We can further see the information on this dataset.

dat.head(5)

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa

Histogram

Let’s plot the histogram using matplotlib first.

plt.hist(dat['sepal_length'], bins=30)
plt.show()

Notice that this histogram does not have any axis labels. To add them, we need to do it manually.

plt.hist(dat['sepal_length'], bins=30)
plt.xlabel('Sepal length') #x-axis label
plt.ylabel('Frequency') #y-axis label
plt.show()

However, using seaborn, the label is extracted from the variable name, which is pretty convenient.

sns.histplot(dat['sepal_length'], bins=30)
plt.show()

Let’s say we want to plot a separate histogram for each level of a grouping variable, here the species.

species = ['setosa', 'versicolor', 'virginica']

for i in species:
    subset = dat[dat['species'] == i]
    plt.hist(subset['sepal_length'], label = i)

plt.legend(loc = 'upper right')
plt.xlabel('Sepal length')
plt.ylabel('Frequency')
plt.show()

The code above is quite long. In seaborn, the same histogram can be generated quite easily.

sns.histplot(x = 'sepal_length', hue = 'species', data = dat)
plt.show()
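One detail worth noting in the matplotlib loop above: each call to plt.hist() chooses its own bin edges from the range of that subset, so the bars of the three species do not line up. Below is a minimal sketch of one way to align them. It assumes synthetic data in place of iris (the means and spreads are made up), and the helper name plot_grouped_hist is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for sepal_length by species (values are made up).
rng = np.random.default_rng(0)
groups = {
    'setosa': rng.normal(5.0, 0.35, 50),
    'versicolor': rng.normal(5.9, 0.50, 50),
    'virginica': rng.normal(6.6, 0.60, 50),
}

def plot_grouped_hist(groups, n_bins=15):
    # Compute one set of bin edges from the pooled data so the bars line up.
    pooled = np.concatenate(list(groups.values()))
    edges = np.histogram_bin_edges(pooled, bins=n_bins)
    fig, ax = plt.subplots()
    for name, values in groups.items():
        # alpha keeps overlapping bars visible
        ax.hist(values, bins=edges, alpha=0.5, label=name)
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Frequency')
    ax.legend(loc='upper right')
    return fig, ax, edges

fig, ax, edges = plot_grouped_hist(groups)
# In a script, finish with plt.show() as in the examples above.
```

Passing the same array of edges to every call is the important part; the number of bins and the transparency are a matter of taste.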

Boxplot

First, let’s draw a boxplot using matplotlib.

bp = plt.boxplot(dat['sepal_length'])
plt.xlabel('Sepal length')
plt.show()

If we want to draw a boxplot grouped by another variable, the code becomes a bit complicated, especially for beginners.

species = dat.groupby('species')
setosa = species.get_group('setosa')['sepal_length']
versicolor = species.get_group('versicolor')['sepal_length']
virginica = species.get_group('virginica')['sepal_length']

bp = plt.boxplot([setosa, versicolor, virginica], labels = ['setosa', 'versicolor', 'virginica'])
plt.xlabel('Species') #the x-axis shows the groups
plt.ylabel('Sepal length') #the measured values are on the y-axis
plt.show()

Both plots above are quite easy to do in seaborn. Below is the code for the basic boxplot.

sns.boxplot(dat['sepal_length'])
plt.show()

Next, plotting sepal_length by species is pretty straightforward in seaborn.

sns.boxplot(y='sepal_length', hue='species', data=dat)
plt.show()
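To read these boxplots it helps to know what is being drawn. With the default setting in matplotlib (whis=1.5), the box spans the first to the third quartile, the line inside is the median, and the whiskers extend to the most extreme points within 1.5 times the interquartile range from the box. Anything beyond that is drawn as an outlier. Here is a small sketch that computes the same quantities with NumPy on made-up values (the helper name box_stats is hypothetical):

```python
import numpy as np

def box_stats(values, whis=1.5):
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points outside them are treated as outliers.
    lo_fence = q1 - whis * iqr
    hi_fence = q3 + whis * iqr
    inside = values[(values >= lo_fence) & (values <= hi_fence)]
    outliers = values[(values < lo_fence) | (values > hi_fence)]
    return {
        'q1': q1, 'median': median, 'q3': q3,
        # Whiskers end at the most extreme points inside the fences.
        'whisker_low': inside.min(), 'whisker_high': inside.max(),
        'outliers': outliers.tolist(),
    }

stats = box_stats([4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 9.0])
print(stats)
```

In this made-up sample only the value 9.0 falls outside the fences, so it is the single point that would be drawn separately.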

Scatter plot

Lastly, let’s see the scatter plot using matplotlib.

plt.scatter(x=dat['sepal_length'], y=dat['sepal_width'])
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()

We can further extend this plot by categorising it into different species.

# Define the species to colors mapping
species_to_color = {'setosa': 'blue', 'versicolor': 'green', 'virginica': 'red'}
colors = dat['species'].map(species_to_color)

# Create the scatter plot
plt.scatter(x=dat['sepal_length'], y=dat['sepal_width'], c=colors)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10, label=species) for species, color in species_to_color.items()], title='Species')
plt.show()
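The manual legend handles above can be avoided. One possible alternative is to draw one scatter call per species and pass label=, so that legend() builds the entries itself. The sketch below uses a small made-up data frame rather than iris, and the helper name scatter_by_group is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small made-up data frame with the same column names as iris.
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})
species_to_color = {'setosa': 'blue', 'versicolor': 'green', 'virginica': 'red'}

def scatter_by_group(df, colors):
    fig, ax = plt.subplots()
    # One scatter call per species; label= feeds the legend automatically.
    for name, subset in df.groupby('species'):
        ax.scatter(subset['sepal_length'], subset['sepal_width'],
                   color=colors[name], label=name)
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.legend(title='Species')
    return fig, ax

fig, ax = scatter_by_group(toy, species_to_color)
# In a script, finish with plt.show() as in the examples above.
```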

Now, let’s see the seaborn package. This is the basic scatter plot.

sns.scatterplot(x='sepal_length', y='sepal_width', data=dat)
plt.show()

Extending this plot by colouring the points by species is actually quite simple in seaborn.

sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=dat)
plt.show()

Conclusion

In conclusion, matplotlib and seaborn complement each other well. Seaborn is an excellent choice for quick and standard plots, thanks to its high-level interface. On the other hand, matplotlib offers a more extensive range of customization options and is ideal for creating complex and detailed visualizations. Ultimately, choosing between matplotlib and seaborn depends on the specific requirements of the visualization task.

Explore data using PCA

Wed, 09 Feb 2022 00:00:00 +0000

Principal component analysis (PCA)

PCA is a dimension reduction technique. If we have a large number of predictors, instead of using all of them for modelling or other analysis, we can compress the information from these variables into a new set of variables. These new variables are known as components or principal components (PCs). We then work with a smaller number of variables that retain most of the information from the original ones.

PCA is usually used for datasets with a large number of features or predictors, such as genomic data. Additionally, PCA is a good pre-processing option if you have correlated variables or a multicollinearity issue in the model. We can also use PCA to explore the data and get a better understanding of it.

Those who want to study the theoretical side of PCA can read further at this link. In this post, we are going to focus on the coding part within a machine learning framework (using the tidymodels package).

Example in R

These are the packages that we are going to use.

library(tidymodels)
library(tidyverse)
library(mlbench) #data

We are going to use the diabetes dataset. The outcome is binary: positive (pos) = diabetes and negative (neg) = non-diabetes/healthy. All other variables are numerical.

data("PimaIndiansDiabetes")
glimpse(PimaIndiansDiabetes)

## Rows: 768
## Columns: 9
## $ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
## $ glucose  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
## $ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
## $ triceps  <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
## $ insulin  <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
## $ mass     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
## $ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
## $ age      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
## $ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~

We are going to split the data and extract the training dataset. We explore only the training set since we are doing this in a machine learning framework.

set.seed(1)

ind <- initial_split(PimaIndiansDiabetes)
dat_train <- training(ind)

We create a recipe and apply normalization and PCA techniques. Then, we prep it.

# Recipe
pca_rec <- 
  recipe(diabetes ~ ., data = dat_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())

# Prep
pca_prep <- prep(pca_rec)
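As a rough illustration of what the normalization step does once it is prepped: the means and standard deviations are estimated from the training data only and are then reused for any other data. Below is a minimal NumPy sketch with made-up numbers. The function names fit_normalizer and apply_normalizer are hypothetical and not part of any package.

```python
import numpy as np

def fit_normalizer(train):
    # Learn column means and sample standard deviations (ddof=1, like sd() in R).
    return train.mean(axis=0), train.std(axis=0, ddof=1)

def apply_normalizer(x, mean, sd):
    # Centre and scale with statistics that were learned elsewhere.
    return (x - mean) / sd

train = np.array([[148.0, 72.0], [85.0, 66.0], [183.0, 64.0], [89.0, 66.0]])
new = np.array([[137.0, 40.0]])

mean, sd = fit_normalizer(train)
train_z = apply_normalizer(train, mean, sd)
new_z = apply_normalizer(new, mean, sd)  # reuses the training statistics
```

Keeping the statistics separate from the data they are applied to is what prevents information from a test set leaking into the preprocessing.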

We can extract the PCA results using tidy(). The second argument, 2, selects the second step of the recipe (step_pca()), and type = "coef" indicates that we want the loadings. So, the values in the output are the loadings.

pca_tidied <- tidy(pca_prep, 2, type = "coef")
pca_tidied

## # A tibble: 64 x 4
##    terms     value component id       
##    <chr>     <dbl> <chr>     <chr>    
##  1 pregnant  0.107 PC1       pca_JtuLZ
##  2 glucose   0.357 PC1       pca_JtuLZ
##  3 pressure  0.330 PC1       pca_JtuLZ
##  4 triceps   0.460 PC1       pca_JtuLZ
##  5 insulin   0.466 PC1       pca_JtuLZ
##  6 mass      0.447 PC1       pca_JtuLZ
##  7 pedigree  0.315 PC1       pca_JtuLZ
##  8 age       0.158 PC1       pca_JtuLZ
##  9 pregnant -0.597 PC2       pca_JtuLZ
## 10 glucose  -0.192 PC2       pca_JtuLZ
## # ... with 54 more rows

So, basically the loadings indicate how much each variable contributes to each component (PC). A large loading (positive or negative) indicates a strong relationship between the variables and the related components. The sign indicates a negative or positive correlation between the variables and components.
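To make the link between loadings and components concrete, here is a minimal NumPy sketch of PCA on standardized data, using random numbers rather than the diabetes data. It is not necessarily how step_pca() works internally, just one standard way to obtain the same kind of quantities. Each column of loadings is a unit-length vector, and a component score is the weighted sum of the standardized variables using those loadings.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 4))
x[:, 1] += 0.8 * x[:, 0]              # make two columns correlated

# Standardize each column (mean 0, sd 1).
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# SVD of the standardized data: rows of vt are the principal directions.
u, s, vt = np.linalg.svd(z, full_matrices=False)
loadings = vt.T                        # shape (variables, components)
scores = z @ loadings                  # PC1, PC2, ... for every row
variances = s**2 / (z.shape[0] - 1)    # variance of each component

print(np.round(loadings[:, 0], 3))     # loadings of PC1
```

The sign of a whole component is arbitrary, so flipping every loading of a PC (and its scores) describes the same component.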

We can further visualise these loadings.

pca_tidied %>% 
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = F) +
  facet_wrap(~ component) +
  ylab("") +
  xlab("Loadings") + 
  theme_minimal()

Besides the loadings, we can also get the variance information. The variance of each component (PC) measures how much of the variability in the data that component explains. For example, PC1 explains 26.2% of the variance in the data.

pca_tidied2 <- tidy(pca_prep, 2, type = "variance")

pca_tidied2 %>% 
  pivot_wider(names_from = component, values_from = value, names_prefix = "PC") %>% 
  select(-id) %>% 
  mutate_if(is.numeric, round, digits = 1) %>% 
  kableExtra::kable("simple")

terms	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8
variance	2.1	1.7	1.0	0.8	0.8	0.7	0.5	0.4
cumulative variance	2.1	3.8	4.9	5.7	6.5	7.2	7.6	8.0
percent variance	26.2	21.5	12.9	10.6	9.9	8.5	5.7	4.7
cumulative percent variance	26.2	47.7	60.7	71.2	81.1	89.6	95.3	100.0
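
The rows of this table are linked by simple arithmetic: because the eight predictors were normalized, the total variance is 8, and each percentage is the variance of that component divided by the total. Here is a small Python sketch using the rounded variances from the table, so the results differ slightly from the percentages above, which were computed from unrounded values:

```python
from itertools import accumulate

# Rounded component variances copied from the table above.
variance = [2.1, 1.7, 1.0, 0.8, 0.8, 0.7, 0.5, 0.4]

total = sum(variance)                          # close to 8, the number of predictors
percent = [100 * v / total for v in variance]
cumulative_percent = list(accumulate(percent))

for i, (p, c) in enumerate(zip(percent, cumulative_percent), start=1):
    print(f"PC{i}: {p:.1f}% (cumulative {c:.1f}%)")
```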

Next, we can visualise PC1 and PC2 in a scatter plot and see how each variable influences both PCs. First, we need to extract the loadings and convert them into a wide format to use as the arrow coordinates in the scatter plot.

pca_tidied3 <- 
  pca_tidied %>% 
  filter(component %in% c("PC1", "PC2")) %>% 
  select(-id) %>% 
  pivot_wider(names_from = component, values_from = value)
pca_tidied3

## # A tibble: 8 x 3
##   terms      PC1    PC2
##   <chr>    <dbl>  <dbl>
## 1 pregnant 0.107 -0.597
## 2 glucose  0.357 -0.192
## 3 pressure 0.330 -0.234
## 4 triceps  0.460  0.279
## 5 insulin  0.466  0.200
## 6 mass     0.447  0.121
## 7 pedigree 0.315  0.110
## 8 age      0.158 -0.638

Now, we can make a scatter plot using the training set data (juice(pca_prep)) and the loadings data (pca_tidied3). We are also going to add the percentage of variance for PC1 and PC2 to the axis labels.

juice(pca_prep) %>% 
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = diabetes, shape = diabetes), size = 2, alpha = 0.6) +
  geom_segment(data = pca_tidied3, 
               aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), 
               arrow = arrow(length = unit(1/2, "picas")),
               color = "blue") +
  annotate("text", 
           x = pca_tidied3$PC1 * 5.2, 
           y = pca_tidied3$PC2 * 5.2, 
           label = pca_tidied3$terms) +
  theme_minimal() +
  xlab("PC1 (26.2%)") +
  ylab("PC2 (21.5%)")

So, from this scatter plot we learn that:

(triceps, insulin, pedigree and mass), (glucose and pressure) and (pregnant and age) form groups of correlated variables, as their arrows point in similar directions
As PC1 and PC2 increase, triceps, insulin, pedigree and mass also increase
As PC2 decreases, pregnant and age increase

Extract a table from a pdf

Mon, 01 Nov 2021 00:00:00 +0000

In a couple of days, I am going to conduct a pre-conference workshop for the Malaysian R Conference 2021. Some of the data that I am going to use for this workshop is available as a table in a pdf. Hence, this post is about how I got that table from the pdf into R for further analysis.

So, this is the table we are going to extract.

Extracting a table from pdf

We are going to use the tabulizer package for this. However, not every pdf works with this package. In our case it works, but the result needs further preprocessing.

Load the required packages.

library(tabulizer)
library(dplyr)
library(stringr)

Read a table from a pdf.

raw_table <- extract_tables("https://static-content.springer.com/esm/art%3A10.1038%2Fs41440-021-00720-3/MediaObjects/41440_2021_720_MOESM1_ESM.pdf", 
                          pages = 17, 
                          output = "data.frame")

So, this is the extracted table.

raw_table[[1]] %>% head(10)

##                X     X.1     X.2     X.3  X.4     X.5 X.6     X.7  X.8
## 1                                                                     
## 2                                                                     
## 3    Ahmed, 2019 Unclear Unclear Unclear High Unclear Low Unclear High
## 4                                                                     
## 5   Badrov, 2013 Unclear    High    High High Unclear Low Unclear High
## 6   Baross, 2012 Unclear Unclear    High High Unclear Low Unclear High
## 7   Baross, 2013 Unclear Unclear    High High Unclear Low Unclear High
## 8  Carlson, 2016     Low    High    High  Low Unclear Low     Low High
## 9  Correia, 2020     Low     Low     Low High Unclear Low     Low High
## 10                                                                    
##                              X.9
## 1      1- selection bias: random
## 2            sequence generation
## 3  2- selection bias: allocation
## 4                    concealment
## 5                               
## 6   3- reporting bias: selective
## 7                      reporting
## 8                               
## 9  4- Performance bias: blinding
## 10  (participants and personnel)

So, a few preprocessing steps are needed:

Remove column X.9 - its content was supposed to be the header
Rename the columns based on column X.9
Remove the space within the study label - “Ahmed,2019” instead of “Ahmed, 2019”
Remove empty rows

irt_rob <- 
  raw_table[[1]] %>% 
  select(-X.9) %>%  
  rename(Study = X, 
         Random.sequence.generation. = X.1, 
         Allocation.concealment. = X.2,
         Selective.reporting. = X.3,
         Blinding.of.participants.and.personnel. = X.4, 
         Blinding.of.outcome.assessment = X.5, 
         Incomplete.outcome.data = X.6, 
         Other.sources.of.bias. = X.7, 
         Overall = X.8) %>% 
  as_tibble() %>% 
  mutate(Study = str_replace_all(Study, " ", "")) %>% 
  filter(Study != "")  # drop the empty rows

Finally, our data is ready.

irt_rob

##          Study Random.sequence.generation. Allocation.concealment.
## 1   Ahmed,2019                     Unclear                 Unclear
## 2  Badrov,2013                     Unclear                    High
## 3  Baross,2012                     Unclear                 Unclear
## 4  Baross,2013                     Unclear                 Unclear
## 5 Carlson,2016                         Low                    High
##   Selective.reporting. Blinding.of.participants.and.personnel.
## 1              Unclear                                    High
## 2                 High                                    High
## 3                 High                                    High
## 4                 High                                    High
## 5                 High                                     Low
##   Blinding.of.outcome.assessment Incomplete.outcome.data Other.sources.of.bias.
## 1                        Unclear                     Low                Unclear
## 2                        Unclear                     Low                Unclear
## 3                        Unclear                     Low                Unclear
## 4                        Unclear                     Low                Unclear
## 5                        Unclear                     Low                    Low
##   Overall
## 1    High
## 2    High
## 3    High
## 4    High
## 5    High
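
For readers who work in Python, the same four cleaning steps can be sketched with pandas. This is only an illustration on a small hand-typed data frame that mimics the shape of the extracted table. It does not extract anything from the pdf, and only three of the columns are included.

```python
import pandas as pd

# Hand-typed stand-in for the raw extracted table (only a few columns).
raw = pd.DataFrame({
    'X':   ['', 'Ahmed, 2019', '', 'Badrov, 2013'],
    'X.1': ['', 'Unclear', '', 'Unclear'],
    'X.8': ['', 'High', '', 'High'],
    'X.9': ['1- selection bias: random', 'sequence generation',
            '2- selection bias: allocation', 'concealment'],
})

clean = (
    raw.drop(columns='X.9')                      # 1. drop the misplaced header column
       .rename(columns={'X': 'Study',            # 2. give the columns real names
                        'X.1': 'Random.sequence.generation.',
                        'X.8': 'Overall'})
       # 3. remove the space inside the study label
       .assign(Study=lambda d: d['Study'].str.replace(' ', '', regex=False))
       .loc[lambda d: d['Study'] != '']          # 4. drop the empty rows
       .reset_index(drop=True)
)
print(clean)
```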

Data exploration in R

Sun, 22 Aug 2021 00:00:00 +0000

These are some of the packages that I find useful for data exploration. Basically, this post serves more as my note for future reference. I will list out packages (and some awesome functions from that particular package) rather than specific functions. Further, base R and tidyverse packages will not be included specifically in this list.

Load supporting packages

library(tidyverse)

The data we are going to use is from the dlookr package:

data("heartfailure", package = "dlookr") #load the dataset without attaching dlookr yet
glimpse(heartfailure)

## Rows: 299
## Columns: 13
## $ age               <int> 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 62, 45, ~
## $ anaemia           <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, N~
## $ cpk_enzyme        <dbl> 582, 7861, 146, 111, 160, 47, 246, 315, 157, 123, 81~
## $ diabetes          <fct> No, No, No, No, Yes, No, No, Yes, No, No, No, No, No~
## $ ejection_fraction <dbl> 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 25, 30, ~
## $ hblood_pressure   <fct> Yes, No, No, No, No, Yes, No, No, No, Yes, Yes, Yes,~
## $ platelets         <dbl> 265000, 263358, 162000, 210000, 327000, 204000, 1270~
## $ creatinine        <dbl> 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.10, 1.50~
## $ sodium            <dbl> 130, 136, 129, 137, 116, 132, 137, 131, 138, 133, 13~
## $ sex               <fct> Male, Male, Male, Male, Female, Male, Male, Male, Fe~
## $ smoking           <fct> No, No, Yes, No, No, Yes, No, Yes, No, Yes, Yes, Yes~
## $ time              <int> 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11, 11, 12~
## $ death_event       <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~

We will create a few NAs in our data.

set.seed(2021)
heartfailure[sample(seq(nrow(heartfailure)), 20), "age"] <- NA
heartfailure[sample(seq(nrow(heartfailure)), 10), "sex"] <- NA

1) dataMaid

library(dataMaid)

One very useful function in dataMaid is makeDataReport(), which generates a report on the data. By default it produces a pdf, but other output formats such as word and html are also available.

makeDataReport(heartfailure, replace = T)

This is the output example in pdf.

2) DataExplorer

library(DataExplorer)

General visualization:

heartfailure %>% plot_intro()

Since we have missing data, we can further visualize it:

heartfailure %>% plot_missing()

heartfailure %>% profile_missing()

##              feature num_missing pct_missing
## 1                age          20  0.06688963
## 2            anaemia           0  0.00000000
## 3         cpk_enzyme           0  0.00000000
## 4           diabetes           0  0.00000000
## 5  ejection_fraction           0  0.00000000
## 6    hblood_pressure           0  0.00000000
## 7          platelets           0  0.00000000
## 8         creatinine           0  0.00000000
## 9             sodium           0  0.00000000
## 10               sex          10  0.03344482
## 11           smoking           0  0.00000000
## 12              time           0  0.00000000
## 13       death_event           0  0.00000000
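
Note that pct_missing is reported as a proportion rather than a percentage: 20 missing ages out of 299 rows gives 0.0669. Below is a minimal pandas sketch of the same kind of summary, on a made-up data frame with the same number of rows and missing values (the column contents are arbitrary):

```python
import numpy as np
import pandas as pd

n = 299
df = pd.DataFrame({
    'age': np.arange(n, dtype=float),
    'sex': ['Male'] * n,
    'time': np.arange(n),
})
df.loc[:19, 'age'] = np.nan       # rows 0 to 19: 20 missing values
df.loc[:9, 'sex'] = None          # rows 0 to 9: 10 missing values

profile = pd.DataFrame({
    'num_missing': df.isna().sum(),
    'pct_missing': df.isna().mean(),   # proportion of rows that are missing
})
print(profile)
```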

We can also do a correlation plot:

heartfailure %>% 
  select_if(is.numeric) %>% 
  drop_na() %>% 
  plot_correlation()

However, I think the correlation plot from the corrplot package is cleaner. Here is a plot from corrplot.

library(corrplot)

heartfailure %>% 
  select_if(is.numeric) %>% 
  drop_na() %>% 
  cor() %>% 
  corrplot(type = "upper")

Finally, we can get an overall html report from the DataExplorer package using the function create_report().

3) dlookr

library(dlookr)

We can assess the normality of the data using this package. The code below plots normality diagnostics for all numeric variables.

heartfailure %>% 
  plot_normality()

However, for the sake of simplicity in this post, we will run it for only one variable.

heartfailure %>% 
  plot_normality(age)

We can also get a correlation matrix plot from this package, with no need to remove the NAs or select the numeric variables before running the function.

heartfailure %>% 
  plot_correlate()

Lastly, from dlookr we can get an overall data exploration report in pdf (and other formats as well). This report is quite comprehensive; have a look.

heartfailure %>% 
  eda_paged_report(target = "death_event")

4) skimr

The output of the skimr package, especially the skim() function, did not display correctly when using blogdown. Hence, I included a screenshot of the result that we would typically see in the R console.

library(skimr)
skim(heartfailure)

So, from skimr we can get an overview that includes the histogram for numerical data as well.

5) outliertree

This package identifies outliers using decision trees. I will not go into detail about the approach here; those who want to read further can see the outliertree vignette listed in the references.

library(outliertree)
outlier.tree(heartfailure)

## Reporting top 2 outliers [out of 2 found]
## 
## row [251] - suspicious column: [creatinine] - suspicious value: [0.50]
##  distribution: 96.000% >= 0.70 - [mean: 1.35] - [sd: 1.22] - [norm. obs: 24]
##  given:
##      [cpk_enzyme] > [1610.00] (value: 2522.00)
## 
## 
## row [32] - suspicious column: [cpk_enzyme] - suspicious value: [23.00]
##  distribution: 98.958% >= 47.00 - [mean: 677.01] - [sd: 1321.86] - [norm. obs: 95]
##  given:
##      [death_event] = [Yes]

## Outlier Tree model
##  Numeric variables: 7
##  Categorical variables: 6
## 
## Consists of 369 clusters, spread across 48 tree branches

We can further explore the detected outliers using a histogram and a boxplot. Let’s do this for the variable creatinine.

# histogram
hist(heartfailure$creatinine, breaks = 50, col = "navy",
     xlab = "Creatinine", 
     main = "Creatinine level")

# boxplot
boxplot(heartfailure$creatinine)

Probably in the future I will go into more detail about outlier detection and any awesome R packages related to it. If I ever write a post about it, I will link it here.

Conclusion

These are some useful packages that I have found. I may edit this post in the future to add more data exploration packages. There are shiny apps for data exploration as well, though I think it’s better to stick with a coded approach in data analysis/exploration. Thus, I did not explore those apps in this post. Another thing to remember is to set the variable types correctly prior to the data exploration.

Hope this is useful!

References:
https://github.com/ekstroem/dataMaid
https://finnstats.com/index.php/2021/05/04/exploratory-data-analysis/
https://cran.r-project.org/web/packages/dlookr/vignettes/EDA.html
https://cran.r-project.org/web/packages/outliertree/vignettes/Introducing_OutlierTree.html