Basic data wrangling with Python

Thu, 18 Jul 2024 00:00:00 +0000

Python is one of the most popular programming languages. In this post, I will demonstrate how to do basic data wrangling with Python. This is going to be one in a series of posts related to Python (hopefully). My plan is to cover these topics:

Basic data wrangling with Python
Basic plotting with matplotlib and seaborn
Comparison of ggplot in R versus in Python

Once I finish writing a topic, I will link it in the list above.

So, let’s start.

Loading necessary packages

Before loading the packages, you need to install them. There are two ways to install Python packages: with the pip command or with the conda command. I will skip this part, but you can refer to this link to install packages using pip or this link to install packages using conda. For those who have both R and Python on their machine, I suggest using conda.

Let’s load the required packages.

import numpy as np 
import pandas as pd
from seaborn import load_dataset

All the functions from each package can be accessed through the alias (the abbreviated name) above. For example, functions in the pandas package can be accessed through pd, or to be specific, through the pd. prefix. You will see this many times throughout this blog post, so do not worry much about it. I am sure you will get the gist of it once you see it later on. In practice, you don’t actually need to use pd for pandas and np for numpy, but this is a convention widely adopted in the Python community.

Load the data

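We are going to use the iris dataset. This dataset is readily available through the seaborn package. Note that load_dataset() downloads the data the first time it is called, so an internet connection is needed.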

iris = load_dataset('iris')

Once we load the data, we need to check the variable types.

iris.dtypes

## sepal_length    float64
## sepal_width     float64
## petal_length    float64
## petal_width     float64
## species          object
## dtype: object

The species variable should really be a categorical variable. So, we can use Categorical() from pandas to change it from an object type to a category. The pd. prefix here means we access the function from the pandas package, as I explained previously.

iris['species'] = pd.Categorical(iris['species'])

If we check the variable type again, we can see the species variable is a category.

iris.dtypes

## sepal_length     float64
## sepal_width      float64
## petal_length     float64
## petal_width      float64
## species         category
## dtype: object
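As a side note, the same conversion can also be written with astype('category'). The sketch below uses a small made-up Series instead of the iris data, so the values are only illustrative.

```python
import pandas as pd

# Made-up species labels standing in for iris['species']
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

# astype('category') is an alternative to pd.Categorical()
as_category = species.astype("category")

print(as_category.dtype)
print(list(as_category.cat.categories))  # the distinct levels
```

Either way, the levels of a categorical variable can be inspected through the .cat accessor.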

Next, we can also see the data. Let’s see the first 10 rows.

iris.head(10)

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          3.9           1.7          0.4  setosa
## 6           4.6          3.4           1.4          0.3  setosa
## 7           5.0          3.4           1.5          0.2  setosa
## 8           4.4          2.9           1.4          0.2  setosa
## 9           4.9          3.1           1.5          0.1  setosa

Slicing and indexing

To see a specific column, we can index as below. Notice that the row number starts at 0, as opposed to R (if you have used R previously), in which the row number starts at 1.

iris['sepal_length'][0:10]

## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64

Similarly, we can omit the starting 0 and index as below to get the first 10 rows of the sepal_length variable.

iris['sepal_length'][:10]

## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64

Next, to access the first 5 rows, we can do as below.

iris[0:5]

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa

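We can also use the iloc and loc indexers. Both are used with square brackets rather than called like functions. The main difference between the two is that iloc selects by integer position, while loc selects by label (the index and column names). Note also that a slice in iloc excludes the end point, whereas a slice in loc includes it, which is why 0:2 returns two rows in the first example below and three rows in the second.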

iris.iloc[0:2, 0:3] #rows, then columns

##    sepal_length  sepal_width  petal_length
## 0           5.1          3.5           1.4
## 1           4.9          3.0           1.4

iris.loc[0:2, ['sepal_length', 'species']]

##    sepal_length species
## 0           5.1  setosa
## 1           4.9  setosa
## 2           4.7  setosa
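The difference between position and label is easier to see when the index is not made of numbers. Here is a minimal sketch on a made-up data frame with a string index (the values and the index labels are assumptions for illustration only).

```python
import pandas as pd

# A small hand-made frame; the index labels a-d are hypothetical
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7, 4.6], "species": ["setosa"] * 4},
    index=["a", "b", "c", "d"],
)

by_position = df.iloc[0:2]    # positions 0 and 1, the end point is excluded
by_label = df.loc["a":"c"]    # labels a, b and c, the end point is included

print(by_position)
print(by_label)
```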

Subsequently, we can also slice according to a logical condition. Below, we select the values of petal_length that are above 6.

ind = iris['petal_length'] > 6
iris['petal_length'][ind]

## 105    6.6
## 107    6.3
## 109    6.1
## 117    6.7
## 118    6.9
## 122    6.7
## 130    6.1
## 131    6.4
## 135    6.1
## Name: petal_length, dtype: float64

Let’s say we want our data to include only setosa species.

ind = iris['species'] == 'setosa'
iris.loc[ind, :].head()

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
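The two kinds of condition shown above can also be combined. The sketch below uses a made-up data frame; note that each condition needs its own parentheses and that & (and) or | (or) is used instead of the Python keywords.

```python
import pandas as pd

# Made-up rows, only for illustration
df = pd.DataFrame(
    {
        "petal_length": [1.4, 6.6, 5.1, 6.1],
        "species": ["setosa", "virginica", "versicolor", "virginica"],
    }
)

# Keep rows that are virginica AND have petal_length above 6
mask = (df["species"] == "virginica") & (df["petal_length"] > 6)
subset = df.loc[mask, :]

print(subset)
```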

Once we know about slicing and indexing, we can use this knowledge to change certain values. For example, below we change:

rows 1, 2, 3, and 4 (index 0 to 3) of sepal_length to NA values
row 6 (index 5) of species and sepal_width to NA values

iris.loc[0:3, 'sepal_length'] = np.nan 
iris.iloc[5, [1, 4]] = np.nan

Let’s see the result.

iris.head(6)

##    sepal_length  sepal_width  petal_length  petal_width species
## 0           NaN          3.5           1.4          0.2  setosa
## 1           NaN          3.0           1.4          0.2  setosa
## 2           NaN          3.2           1.3          0.2  setosa
## 3           NaN          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          NaN           1.7          0.4     NaN

Missing values

If we want to see whether we have any missing values in our data, we can use the isnull() function.

iris.isnull().any().any() #For overall

## True

iris.isnull().any() #Check for each column

## sepal_length     True
## sepal_width      True
## petal_length    False
## petal_width     False
## species          True
## dtype: bool

We can further count how many missing values we have in each column.

iris.isnull().sum()

## sepal_length    4
## sepal_width     1
## petal_length    0
## petal_width     0
## species         1
## dtype: int64
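Two small variations on the same idea are the share of missing values per column and the rows that contain at least one missing value. This is only a sketch on a made-up data frame (the columns a and b are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing values
df = pd.DataFrame(
    {"a": [np.nan, 1.0, 2.0, np.nan], "b": [1.0, 2.0, np.nan, 4.0]}
)

# The mean of True/False values gives the proportion of missing values
share_missing = df.isnull().mean()

# axis=1 checks across the columns of each row
rows_with_missing = df[df.isnull().any(axis=1)]

print(share_missing)
print(rows_with_missing)
```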

Descriptive statistics

To get basic descriptive statistics, we can use the describe() function. Below, we additionally use round() to round the results. Called without an argument, round() rounds to zero decimal places; to keep one decimal point, use round(1) instead.

iris.describe().round()

##        sepal_length  sepal_width  petal_length  petal_width
## count         146.0        149.0         150.0        150.0
## mean            6.0          3.0           4.0          1.0
## std             1.0          0.0           2.0          1.0
## min             4.0          2.0           1.0          0.0
## 25%             5.0          3.0           2.0          0.0
## 50%             6.0          3.0           4.0          1.0
## 75%             6.0          3.0           5.0          2.0
## max             8.0          4.0           7.0          2.0

Notice that the results above only include numerical variables. So, to get the results for categorical variables as well, we need to add include = 'all' as below.

iris.describe(include = 'all').round()

##         sepal_length  sepal_width  petal_length  petal_width     species
## count          146.0        149.0         150.0        150.0         149
## unique           NaN          NaN           NaN          NaN           3
## top              NaN          NaN           NaN          NaN  versicolor
## freq             NaN          NaN           NaN          NaN          50
## mean             6.0          3.0           4.0          1.0         NaN
## std              1.0          0.0           2.0          1.0         NaN
## min              4.0          2.0           1.0          0.0         NaN
## 25%              5.0          3.0           2.0          0.0         NaN
## 50%              6.0          3.0           4.0          1.0         NaN
## 75%              6.0          3.0           5.0          2.0         NaN
## max              8.0          4.0           7.0          2.0         NaN
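The round() calls above use the default of zero decimal places. As a small illustration of how the argument changes the output, here is a sketch on a made-up column x.

```python
import pandas as pd

# Made-up values, only for illustration
df = pd.DataFrame({"x": [1.0, 2.0, 4.0]})

whole = df.describe().round()         # zero decimal places
one_decimal = df.describe().round(1)  # one decimal place

print(whole.loc["mean", "x"])
print(one_decimal.loc["mean", "x"])
```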

Alternatively, we can also count how many times each level of the categorical variable appears. By default, value_counts() counts only the non-missing values.

iris['species'].value_counts()

## species
## versicolor    50
## virginica     50
## setosa        49
## Name: count, dtype: int64
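If the missing values should be counted too, value_counts() accepts dropna=False. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up labels with one missing value
species = pd.Series(["setosa", "versicolor", None, "setosa"])

counts_default = species.value_counts()          # missing value is left out
counts_all = species.value_counts(dropna=False)  # missing value gets its own row

print(counts_default)
print(counts_all)
```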

Similarly, for a numerical variable, we can also calculate each statistic manually. For example, to calculate the mean, we can use mean().

iris['sepal_width'].mean().round()

## 3.0
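One thing worth knowing when a column has missing values: mean() and similar functions skip them by default. The sketch below shows this on a made-up Series; skipna=False makes the missing value propagate instead.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
sepal_width = pd.Series([3.0, np.nan, 4.0])

mean_default = sepal_width.mean()             # missing value is skipped
mean_strict = sepal_width.mean(skipna=False)  # result is NaN
median_value = sepal_width.median()           # also skips the missing value

print(mean_default, mean_strict, median_value)
```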

That’s it. These are the basics of handling a dataset in Python. With this knowledge, I hope you feel ready to dive in and explore more on your own.

What makes data "good enough" for a statistical analysis?

Thu, 29 Feb 2024 00:00:00 +0000

A few days ago, someone asked me to help her with her data analysis. However, the data that she gave me was so bad that it was completely impossible to run the analysis unless serious data cleaning was done first.

So, I have been thinking about a general guideline for deciding whether data is good enough to run a statistical analysis on.

First things first: what is the basic format of “good enough” data?

1. Each row represents an observation.

2. For a categorical variable, make sure the levels are standardised.

For example, for a gender variable, make sure to have only “male” and “female” instead of “male”, “female”, “men”, and “women”.

3. For a numerical variable, make sure the values are numeric and do not contain any text.

For example, for a height variable, do not put “1.68m” or “1.68 meter”.

4. Also for a numerical variable, make sure all the values in the variable are on the same scale.

For example, for a weight variable, do not mix weights in grams and kilograms. If you want to use grams, use it consistently throughout the data, or at least throughout the variable.

5. Do not use symbols in your data.

For example, do not use “X” as no and “/” as yes.

6. The data should be individual data.

Individual data means that each row contains information about one sample or observation. Each observation in the dataset represents a single entity or unit (e.g., a person, a transaction, a product) and includes all relevant attributes or variables for that unit. Individual data allows for detailed analysis at the level of individual observations. Here is an example of individual data:

Id	Age	Obese
1	<50	yes
2	<50	yes
3	>50	no
4	>50	no
5	>50	yes
6	<50	yes

Aggregated data, on the other hand, combines multiple individual observations into summary statistics or groups. Instead of representing individual units, aggregated data presents information at a higher level of abstraction, such as groups, categories, or intervals. This aggregation typically involves summarizing data using functions like sums, averages, counts, or percentages. Here is an example of aggregated data based on the individual data above:

Age	Obese	Count
<50	yes	3
>50	no	2
<50	no	0
>50	yes	1
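For readers who use Python, points 2 to 6 above can be sketched in pandas as below. This is only a rough sketch: the messy data frame, its column names (including weight_unit and smoker) and its values are made up for illustration, and real data will usually need more checks than this. The last part rebuilds the aggregated counts from the individual Age and Obese data shown above.

```python
import pandas as pd

# Hypothetical messy data; column names and values are assumptions
raw = pd.DataFrame(
    {
        "gender": ["male", "Female", "men", "women ", "female", "male"],
        "height": ["1.68m", "1.70 meter", "1.75", "1.60", "1.82m", "1.55"],
        "weight": [70.0, 65000.0, 80.5, 58.0, 90000.0, 62.0],
        "weight_unit": ["kg", "g", "kg", "kg", "g", "kg"],
        "smoker": ["/", "X", "X", "/", "X", "X"],
    }
)
clean = pd.DataFrame(index=raw.index)

# Point 2: standardise the levels of a categorical variable
clean["gender"] = (
    raw["gender"].str.strip().str.lower().replace({"men": "male", "women": "female"})
)

# Point 3: keep only the numeric part of the height, then convert to a number
clean["height_m"] = pd.to_numeric(
    raw["height"].str.extract(r"(\d+\.?\d*)", expand=False), errors="coerce"
)

# Point 4: put every weight on the same scale (kilograms)
divisor = raw["weight_unit"].map({"kg": 1, "g": 1000})
clean["weight_kg"] = raw["weight"] / divisor

# Point 5: replace symbols with explicit labels
clean["smoker"] = raw["smoker"].map({"/": "yes", "X": "no"})

# Point 6: the individual data from the table above, then aggregated into counts
individual = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4, 5, 6],
        "Age": ["<50", "<50", ">50", ">50", ">50", "<50"],
        "Obese": ["yes", "yes", "no", "no", "yes", "yes"],
    }
)
counts = pd.crosstab(individual["Age"], individual["Obese"])

print(clean)
print(counts)
```

Going from individual data to aggregated data is a one-liner like this, but going the other way is generally not possible, which is why the individual form is the safer one to store.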

I think the above six points are the basics of building a good enough dataset for a statistical analysis. While we are at it, let’s go through the two main formats of a dataset. These formats are most relevant when you have a repeated measures study design, whereby each participant has several values/measurements/responses at several time points.

Wide format

In the wide format, all the responses of each participant are in a single row. For example, below are the times, in seconds, taken by two participants to answer three questions. As we can see, each row contains the time in seconds for all three questions.

ID	Question1	Question2	Question3
1	5	10	50
2	8	20	40

Long format

In the long format (also known as the tidy format in the R community), the response of each participant at each time point is in a single row. Using the same data as before, below is the long format. As we can see, the data is arranged so that each row represents the time taken by one participant to answer one question.

ID	Time	Question
1	5	1
1	10	2
1	50	3
2	8	1
2	20	2
2	40	3
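Converting between the two formats does not have to be done by hand. As a sketch in pandas, melt() turns the wide table above into the long one, and pivot() goes back again. The variable names wide, long_df and back are my own choices for this illustration.

```python
import pandas as pd

# The wide table from above
wide = pd.DataFrame(
    {"ID": [1, 2], "Question1": [5, 8], "Question2": [10, 20], "Question3": [50, 40]}
)

# Wide to long: one row per participant per question
long_df = wide.melt(id_vars="ID", var_name="Question", value_name="Time")

# Keep only the question number, then order the rows by participant
long_df["Question"] = (
    long_df["Question"].str.replace("Question", "", regex=False).astype(int)
)
long_df = long_df.sort_values(["ID", "Question"]).reset_index(drop=True)

# Long back to wide: one row per participant, one column per question
back = long_df.pivot(index="ID", columns="Question", values="Time")

print(long_df)
print(back)
```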