Python | Tengku Hanis

Basic plotting with Matplotlib and Seaborn

Wed, 07 Aug 2024 00:00:00 +0000

This post is a continuation of my previous post about Python. For those interested, the posts in this series are:

Basic data wrangling with Python
Basic plotting with matplotlib and seaborn
Comparison of ggplot in R versus in Python

Python has several packages for plotting and visualization, and matplotlib is the most commonly used. It is extensive, but it can often be complicated to use. Seaborn is both an alternative and a complement to matplotlib: it is built on top of matplotlib and provides a higher-level interface.

So, in this blog post, let us compare several basic plots using both packages.

Load packages

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

Load dataset

We are going to use the iris dataset.

dat = sns.load_dataset('iris')

We can take a quick look at the first few rows of this dataset.

dat.head(5)

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa

Histogram

Let’s plot the histogram using matplotlib first.

plt.hist(dat['sepal_length'], bins=30)
plt.show()

Notice that this histogram does not have any axis labels. To add them, we need to do so manually.

plt.hist(dat['sepal_length'], bins=30)
plt.xlabel('Sepal length') #x-axis label
plt.ylabel('Frequency') #y-axis label
plt.show()

With seaborn, however, the x-axis label is taken from the variable name, which is pretty convenient.

sns.histplot(dat['sepal_length'], bins=30)
plt.show()

Let’s say we want to plot a separate histogram for each species.

species = ['setosa', 'versicolor', 'virginica']

for i in species:
    subset = dat[dat['species'] == i]
    plt.hist(subset['sepal_length'], label = i)

plt.legend(loc = 'upper right')
plt.xlabel('Sepal length')
plt.ylabel('Frequency')
plt.show()

The code above is quite long. In seaborn, the same histogram can be generated quite easily.

sns.histplot(x = 'sepal_length', hue = 'species', data = dat)
plt.show()
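One detail of the matplotlib loop earlier in this section is worth a note: each call to plt.hist picks its own bins, and bars drawn later hide the ones drawn earlier. The sketch below is one possible way to handle this, by computing a single set of bin edges and making the bars semi-transparent. It uses a small made-up data frame (called toy here, not the real iris data) so that it runs on its own; the bin count and alpha value are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for iris (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'sepal_length': np.concatenate([
        rng.normal(5.0, 0.3, 50),
        rng.normal(5.9, 0.5, 50),
        rng.normal(6.6, 0.6, 50),
    ]),
    'species': np.repeat(['setosa', 'versicolor', 'virginica'], 50),
})

# One set of bin edges computed from the whole column, reused for every group
bin_edges = np.histogram_bin_edges(toy['sepal_length'], bins=15)

fig, ax = plt.subplots()
for name, subset in toy.groupby('species'):
    # alpha < 1 keeps overlapping bars visible
    ax.hist(subset['sepal_length'], bins=bin_edges, alpha=0.5, label=name)

ax.legend(loc='upper right')
ax.set_xlabel('Sepal length')
ax.set_ylabel('Frequency')
```

Call plt.show() afterwards to display the figure.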

Boxplot

First, let’s draw a boxplot using matplotlib.

bp = plt.boxplot(dat['sepal_length'])
plt.xlabel('Sepal length')
plt.show()

If we want to draw a boxplot for each level of another variable, the code becomes a bit more complicated, especially for beginners.

species = dat.groupby('species')
setosa = species.get_group('setosa')['sepal_length']
versicolor = species.get_group('versicolor')['sepal_length']
virginica = species.get_group('virginica')['sepal_length']

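# Note: from matplotlib 3.9 onwards, the 'labels' argument is renamed to 'tick_labels'
bp = plt.boxplot([setosa, versicolor, virginica], labels = ['setosa', 'versicolor', 'virginica'])
plt.xlabel('Species') #the groups are on the x-axis
plt.ylabel('Sepal length') #the values are on the y-axis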
plt.show()
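The grouped boxplot above names each species by hand. As a rough alternative, the groups can be collected in a loop over groupby(), which scales to any number of levels. This is only a sketch on a small made-up data frame (toy is a hypothetical name and the values are not the real iris data).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for iris (values are made up for illustration)
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    'species': ['setosa'] * 3 + ['versicolor'] * 3 + ['virginica'] * 3,
})

# Collect one array of values per species without naming each group by hand
groups = {name: grp['sepal_length'].to_numpy()
          for name, grp in toy.groupby('species')}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))

# Boxes are drawn at positions 1, 2, 3, ...; label those ticks with the group names
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel('Species')
ax.set_ylabel('Sepal length')
```

Setting the tick labels separately avoids relying on the labels argument of boxplot. Call plt.show() afterwards to display the figure.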

Both plots above are quite easy to do in seaborn. Below is the code for the basic boxplot.

sns.boxplot(dat['sepal_length'])
plt.show()

Next, plotting sepal_length by species is pretty straightforward in seaborn.

sns.boxplot(y='sepal_length', hue='species', data=dat)
plt.show()

Scatter plot

Lastly, let’s draw a scatter plot using matplotlib.

plt.scatter(x=dat['sepal_length'], y=dat['sepal_width'])
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()

We can further extend this plot by colouring the points according to species.

# Define the species to colors mapping
species_to_color = {'setosa': 'blue', 'versicolor': 'green', 'virginica': 'red'}
colors = dat['species'].map(species_to_color)

# Create the scatter plot
plt.scatter(x=dat['sepal_length'], y=dat['sepal_width'], c=colors)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10, label=species) for species, color in species_to_color.items()], title='Species')
plt.show()
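The legend in the code above is built by hand from Line2D objects. Another possible approach, sketched below on a small made-up data frame, is to call scatter once per species and pass label, so that matplotlib builds the legend entries itself. The names toy and species_to_color are used here for illustration only, and the values are not the real iris data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for iris (values are made up for illustration)
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor',
                'virginica', 'virginica'],
})
species_to_color = {'setosa': 'blue', 'versicolor': 'green', 'virginica': 'red'}

fig, ax = plt.subplots()
for name, grp in toy.groupby('species'):
    # One scatter call per species; 'label' becomes the legend entry
    ax.scatter(grp['sepal_length'], grp['sepal_width'],
               color=species_to_color[name], label=name)

ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
legend = ax.legend(title='Species')
```

Call plt.show() afterwards to display the figure.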

Now, let’s turn to seaborn. This is the basic scatter plot.

sns.scatterplot(x='sepal_length', y='sepal_width', data=dat)
plt.show()

Extending this plot to distinguish the different species is actually quite simple in seaborn.

sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=dat)
plt.show()
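Since seaborn is built on matplotlib, the two can be mixed in one plot. The sketch below assumes a small made-up data frame (toy, not the real iris data): sns.scatterplot returns a matplotlib Axes object, and the usual matplotlib methods can then be used to adjust labels and the title.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data standing in for iris (values are made up for illustration)
toy = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor',
                'virginica', 'virginica'],
})

# seaborn draws the plot and hands back the matplotlib Axes it drew on
ax = sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=toy)

# From here on, these are plain matplotlib calls
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')
ax.set_title('Sepal width against sepal length')
```

Call plt.show() afterwards to display the figure.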

Conclusion

In conclusion, matplotlib and seaborn complement each other well. Seaborn is an excellent choice for quick and standard plots, thanks to its high-level interface. On the other hand, matplotlib offers a more extensive range of customization options and is ideal for creating complex and detailed visualizations. Ultimately, choosing between matplotlib and seaborn depends on the specific requirements of the visualization task.

Basic data wrangling with Python

Thu, 18 Jul 2024 00:00:00 +0000

Python is one of the most popular programming languages. In this post, I will demonstrate basic data wrangling with Python. This is going to be one of a series of posts related to Python (hopefully). My plan is to cover these topics:

Basic data wrangling with Python
Basic plotting with matplotlib and seaborn
Comparison of ggplot in R versus in Python

Once I finish writing each topic, I will link it in the list above.

So, let’s start.

Loading necessary packages

Before loading the packages, you need to install them. There are two main ways to install Python packages: the pip command or the conda command. I will skip this part, but you can refer to this link to install the packages using the pip command or this link to install the packages using the conda command. For those who have both R and Python on their machine, I suggest using the conda command.

Let’s load the required packages.

import numpy as np 
import pandas as pd
from seaborn import load_dataset

All the functions from each package can be accessed through the alias (the abbreviated name) above. For example, functions in the pandas package can be accessed through pd, or to be specific, the pd. prefix. You will see this many times throughout this blog post, so do not worry too much about it. I am sure you will get the gist of it once you see it later on. You don’t strictly have to use pd for pandas and np for numpy, but this is a convention widely adopted in the Python community.
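To make the idea of an alias concrete, here is a minimal sketch. It only shows that the short name and the full package name refer to the same thing; the variable names s1 and s2 are made up for this illustration.

```python
import pandas
import pandas as pd
import numpy as np

# The alias and the full name point to the same module object
print(pd is pandas)          # True

# The same function, reached through either name
s1 = pd.Series([1, 2, 3])
s2 = pandas.Series([1, 2, 3])
print(s1.equals(s2))         # True

# The same idea applies to numpy and its usual alias np
print(np.mean([1, 2, 3]))    # 2.0
```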

Load the data

We are going to use the iris dataset. This dataset is readily available in the seaborn package.

iris = load_dataset('iris')

Once we load the data, we need to check the variable types.

iris.dtypes

## sepal_length    float64
## sepal_width     float64
## petal_length    float64
## petal_width     float64
## species          object
## dtype: object

The variable species is, by right, a categorical variable. So, we can use Categorical() from pandas to change it from the object type to the category type. The pd. prefix here means we access the function from the pandas package, as I explained previously.

iris['species'] = pd.Categorical(iris['species'])

If we check the variable type again, we can see the species variable is a category.

iris.dtypes

## sepal_length     float64
## sepal_width      float64
## petal_length     float64
## petal_width      float64
## species         category
## dtype: object
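As a side note, the same conversion can also be written with astype('category'). The sketch below compares the two on a small made-up column (toy is a hypothetical data frame, not the iris data) and shows how to list the categories that pandas inferred.

```python
import pandas as pd

# Toy column standing in for iris['species'] (made up for illustration)
toy = pd.DataFrame({'species': ['setosa', 'virginica', 'setosa', 'versicolor']})

via_categorical = pd.Categorical(toy['species'])
via_astype = toy['species'].astype('category')

print(via_astype.dtype)                    # category

# The inferred categories are the sorted unique values
print(list(via_astype.cat.categories))     # ['setosa', 'versicolor', 'virginica']
print(list(via_categorical.categories))    # ['setosa', 'versicolor', 'virginica']
```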

Next, we can also see the data. Let’s see the first 10 rows.

iris.head(10)

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          3.9           1.7          0.4  setosa
## 6           4.6          3.4           1.4          0.3  setosa
## 7           5.0          3.4           1.5          0.2  setosa
## 8           4.4          2.9           1.4          0.2  setosa
## 9           4.9          3.1           1.5          0.1  setosa

Slicing and indexing

To see a specific column, we can index as below. Notice that the row number starts at 0, as opposed to R (if you have used R previously), in which the row number starts at 1.

iris['sepal_length'][0:10]

## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64

Similarly, we can omit the starting position and index as below to get the first 10 rows of the sepal_length variable.

iris['sepal_length'][:10]

## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64

Next, to access the first 5 rows, we can do as below.

iris[0:5]

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa

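We can also use the iloc and loc indexers. The main difference between the two is that iloc selects by integer position, while loc selects by label (such as a column name or an index label). Note also that a slice in iloc excludes its end position, whereas a slice in loc includes its end label, which is why 0:2 returns two rows in the first example below and three rows in the second.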

iris.iloc[0:2, 0:3] #rows, then columns

##    sepal_length  sepal_width  petal_length
## 0           5.1          3.5           1.4
## 1           4.9          3.0           1.4

iris.loc[0:2, ['sepal_length', 'species']]

##    sepal_length species
## 0           5.1  setosa
## 1           4.9  setosa
## 2           4.7  setosa
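The difference between position and label is easier to see when the index labels are not 0, 1, 2 and so on. The sketch below uses a small made-up data frame (toy) whose index deliberately starts at 10; it is an illustration only, not part of the iris data.

```python
import pandas as pd

# Toy data (made up); the index labels deliberately differ from the positions
toy = pd.DataFrame(
    {'sepal_length': [5.1, 4.9, 4.7, 4.6], 'species': ['setosa'] * 4},
    index=[10, 11, 12, 13],
)

by_position = toy.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = toy.loc[10:12]     # labels 10, 11 and 12; the end label is included

print(len(by_position))       # 2
print(len(by_label))          # 3
```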

Subsequently, we can also slice according to a logical condition. Below, we select the values of the petal_length variable that are greater than 6.

ind = iris['petal_length'] > 6
iris['petal_length'][ind]

## 105    6.6
## 107    6.3
## 109    6.1
## 117    6.7
## 118    6.9
## 122    6.7
## 130    6.1
## 131    6.4
## 135    6.1
## Name: petal_length, dtype: float64

Let’s say we want our data to include only setosa species.

ind = iris['species'] == 'setosa'
iris.loc[ind, :].head()

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
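Logical conditions can also be combined. The sketch below, on a small made-up data frame (toy), uses & for "and" and | for "or"; each condition needs its own parentheses. The thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Toy data standing in for iris (values are made up for illustration)
toy = pd.DataFrame({
    'petal_length': [1.4, 1.7, 4.7, 6.1, 6.6],
    'species': ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica'],
})

# Both conditions must hold
both = (toy['species'] == 'virginica') & (toy['petal_length'] > 6.5)

# At least one condition must hold
either = (toy['species'] == 'setosa') | (toy['petal_length'] > 6)

print(toy.loc[both, :])
print(toy.loc[either, :])
```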

Once we know about slicing and indexing, we can use this knowledge to change certain values. For example, below we change:

rows 1, 2, 3, and 4 (index 0 to 3) of sepal_length to NA values
row 6 (index 5) of sepal_width and species to NA values

iris.loc[0:3, 'sepal_length'] = np.nan 
iris.iloc[5, [1, 4]] = np.nan

Let’s see the result.

iris.head(6)

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           NaN          3.5           1.4          0.2  setosa
## 1           NaN          3.0           1.4          0.2  setosa
## 2           NaN          3.2           1.3          0.2  setosa
## 3           NaN          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          NaN           1.7          0.4     NaN

Missing values

If we want to check whether we have any missing values in our data, we can use the isnull() function.

iris.isnull().any().any() #For overall

## True

iris.isnull().any() #Check for each column

## sepal_length     True
## sepal_width      True
## petal_length    False
## petal_width     False
## species          True
## dtype: bool

We can further count how many missing values we have in each column.

iris.isnull().sum()

## sepal_length    4
## sepal_width     1
## petal_length    0
## petal_width     0
## species         1
## dtype: int64
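Two further checks are sometimes handy: how many rows contain at least one missing value, and what share of each column is missing. The sketch below shows one way to get both, using a small made-up data frame (toy) rather than the modified iris data.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (made up for illustration)
toy = pd.DataFrame({
    'sepal_length': [np.nan, np.nan, 5.0, 5.4],
    'sepal_width': [3.5, 3.0, 3.6, np.nan],
})

# any(axis=1) checks across the columns of each row
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)      # 3

# The mean of True/False values gives the proportion missing per column
share_missing = toy.isnull().mean()
print(share_missing)

# isna() is another name for isnull() and gives the same result
print(toy.isna().equals(toy.isnull()))   # True
```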

Descriptive statistics

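To get basic descriptive statistics, we can use the describe() function. Below, we additionally use round() to round the results. Called without an argument, round() rounds to the nearest whole number, which is why every value below ends in .0; round(1) would keep one decimal place.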

iris.describe().round()

##        sepal_length  sepal_width  petal_length  petal_width
## count         146.0        149.0         150.0        150.0
## mean            6.0          3.0           4.0          1.0
## std             1.0          0.0           2.0          1.0
## min             4.0          2.0           1.0          0.0
## 25%             5.0          3.0           2.0          0.0
## 50%             6.0          3.0           4.0          1.0
## 75%             6.0          3.0           5.0          2.0
## max             8.0          4.0           7.0          2.0

Notice that the results above only include numerical variables. To get the results for categorical variables as well, we need to add include = 'all' as below.

iris.describe(include = 'all').round()

##         sepal_length  sepal_width  petal_length  petal_width     species
## count          146.0        149.0         150.0        150.0         149
## unique           NaN          NaN           NaN          NaN           3
## top              NaN          NaN           NaN          NaN  versicolor
## freq             NaN          NaN           NaN          NaN          50
## mean             6.0          3.0           4.0          1.0         NaN
## std              1.0          0.0           2.0          1.0         NaN
## min              4.0          2.0           1.0          0.0         NaN
## 25%              5.0          3.0           2.0          0.0         NaN
## 50%              6.0          3.0           4.0          1.0         NaN
## 75%              6.0          3.0           5.0          2.0         NaN
## max              8.0          4.0           7.0          2.0         NaN
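On the rounding used in the two calls above: round() with no argument rounds to whole numbers, while round(1) keeps one decimal place. A minimal sketch on a made-up column (toy is hypothetical, not the iris data):

```python
import pandas as pd

# Toy data standing in for iris (values are made up for illustration)
toy = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0]})

summary = toy.describe()

print(summary.round())     # rounded to whole numbers
print(summary.round(1))    # rounded to one decimal place

# The mean of the toy column is 4.86
print(summary.round().loc['mean', 'sepal_length'])    # 5.0
print(summary.round(1).loc['mean', 'sepal_length'])   # 4.9
```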

Alternatively, we can count how many times each unique value of the categorical variable occurs. By default, value_counts() counts only the non-missing values.

iris['species'].value_counts()

## species
## versicolor    50
## virginica     50
## setosa        49
## Name: count, dtype: int64
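If the missing values should be counted as well, value_counts() accepts dropna=False. A short sketch on a made-up series (species here is a toy example, not the iris column):

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (made up for illustration)
species = pd.Series(['setosa', 'setosa', 'versicolor', np.nan])

without_missing = species.value_counts()             # missing value left out
with_missing = species.value_counts(dropna=False)    # missing value counted too

print(without_missing)
print(with_missing)
```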

Similarly, for a numerical variable, we can calculate each statistic manually. For example, to calculate the mean, we can use mean().

iris['sepal_width'].mean().round()

## 3.0
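Several statistics can also be requested in one go with agg(). The sketch below is one possible way to do it, on a made-up series (widths is a toy example, not the iris column).

```python
import pandas as pd

# Toy values standing in for iris['sepal_width'] (made up for illustration)
widths = pd.Series([3.5, 3.0, 3.2, 3.1, 3.6])

# One statistic at a time
print(widths.mean())
print(widths.median())

# Several statistics at once; the result is a Series indexed by statistic name
stats = widths.agg(['mean', 'median', 'std'])
print(stats.round(2))
```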

That’s it. These are the basics of handling a dataset in Python. With this knowledge, I hope you feel ready to dive in and explore more on your own.