A summary of forcats package

Tue, 18 May 2021 00:00:00 +0000

I just watched a youtube video by Andrew Couch about his commonly used function in readr, stringr, and forcats packages. Although, I have used forcats package before, I realised that I have not fully utilised all of its function.

So, in this post, I have summarised main function of forcats that I find useful in my day-to-day R coding. Basically, more like a note to myself.

Main functions

We will use mtcars data to demonstrate each function. forcats is part of tiyverse packages. So, it will load, once we load the tidyverse packages.

library(tidyverse)
glimpse(mtcars)

## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~

There are 9 forcats functions that I think very useful.

factor()

factor() changes variable type into a factor or categorical type

mtcars$carb <- factor(mtcars$carb)
glimpse(mtcars)

## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb <fct> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~

fct_inorder()

This function sorts factor levels based on the order of appearance in the dataset.

levels(mtcars$carb) # original levels

## [1] "1" "2" "3" "4" "6" "8"

fct_inorder(mtcars$carb) # levels based on the order of appearance

##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 1 2 3 6 8

fct_infreq()

This function sorts factor levels based on the frequency of values.

fct_count(mtcars$carb) # this is forcats function as well, count factor level

## # A tibble: 6 x 2
##   f         n
##   <fct> <int>
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1

levels(mtcars$carb) # original levels

## [1] "1" "2" "3" "4" "6" "8"

fct_infreq(mtcars$carb) # levels based on the frequency values

##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 2 4 1 3 6 8

fct_relevel()

This function can be used to change the order manually.

levels(mtcars$carb) # original levels

## [1] "1" "2" "3" "4" "6" "8"

fct_relevel(mtcars$carb, c("8", "6", "4", "3", "2", "1")) # manually changed new levels

##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 8 6 4 3 2 1

fct_relevel() can also be used to change one factor level only.

levels(mtcars$carb) # original levels

## [1] "1" "2" "3" "4" "6" "8"

fct_relevel(mtcars$carb, "8", after = 2) # change level 8 to the third place

##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 1 2 8 3 4 6

fct_reorder()

This function changes the order based on another variable. Let’s change variable carb’s levels based on value of variable disp.

levels(mtcars$carb) # original levels

## [1] "1" "2" "3" "4" "6" "8"

fct_reorder(mtcars$carb, mtcars$disp, .fun = sum, .desc = TRUE) # new level based on disp value

##  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
## Levels: 4 2 1 3 8 6

mtcars %>% 
  group_by(carb) %>% 
  summarise(sum_disp = sum(disp)) %>% 
  arrange(desc(sum_disp)) # this is basically what we do with fct_reorder() above

## # A tibble: 6 x 2
##   carb  sum_disp
##   <fct>    <dbl>
## 1 4        3088.
## 2 2        2082.
## 3 1         940.
## 4 3         827.
## 5 8         301 
## 6 6         145

Additionally, fct_reorder() can be used with plotting as well.

# Original plot
ggplot(mtcars, aes(x = carb, y = disp)) +
  geom_col()

# Plot with changed levels
mtcars %>% 
  mutate(carb = fct_reorder(carb, disp, .fun = sum, .desc = TRUE)) %>% 
  ggplot(aes(x = carb, y = disp)) +
  geom_col()

fct_lump()

This function lumps factor levels into other factors. There are 5 variants of this function:

fct_lump()
fct_lump_min()
fct_lump_n()
fct_lump_lowfreq()

The remaining one variant is fct_lump_prop(). It is not in the example below as I do not find it useful at least for my current R coding routine.

fct_lump() automatically lump small frequency factor group into one group.

fct_count(mtcars$carb) # this is forcats function as well, count factor level

## # A tibble: 6 x 2
##   f         n
##   <fct> <int>
## 1 1         7
## 2 2        10
## 3 3         3
## 4 4        10
## 5 6         1
## 6 8         1

fct_lump(mtcars$carb) %>% fct_count()

## # A tibble: 4 x 2
##   f         n
##   <fct> <int>
## 1 1         7
## 2 2        10
## 3 4        10
## 4 Other     5

fct_lump_min() lump factor group into one group based on the given value.

table(fct_lump_min(mtcars$carb, min = 2)) # group 6 and 8 lump into one group

## 
##     1     2     3     4 Other 
##     7    10     3    10     2

fct_lump_n() lump all level except for the n most frequent factor groups.

table(fct_lump_n(mtcars$carb, n = 2)) # 2 frequent group only, others in one group

## 
##     2     4 Other 
##    10    10    12

fct_lump_lowfreq() lump small frequent groups into one group, while making sure that particular one group is still the smallest.

table(fct_lump_lowfreq(mtcars$carb, other_level = "low")) # group low is still the smallest

## 
##   1   2   4 low 
##   7  10  10   5

fct_other()

fct_other() is much like fct_lump(), except we manually choose which factor groups to be combined.

table(fct_other(mtcars$carb, keep = c("8", "6")))

## 
##     6     8 Other 
##     1     1    30

fct_recode()

This function is used to rename or relabel the factor group.

table(fct_recode(mtcars$carb, hanis = "8"))

## 
##     1     2     3     4     6 hanis 
##     7    10     3    10     1     1

fct_relabel()

fct_relabel() is extremely useful if we want to rename quite a number of factor groups.

table(mtcars$carb) # original groups

## 
##  1  2  3  4  6  8 
##  7 10  3 10  1  1

table(fct_relabel(mtcars$carb, ~ c("abu", "ali", "chong", "siti", "krish", "lee"))) # new named groups

## 
##   abu   ali chong  siti krish   lee 
##     7    10     3    10     1     1

Reference:
https://forcats.tidyverse.org/index.html

Base R vs tidyverse

Tue, 04 May 2021 00:00:00 +0000

First of all, this write up is mean for a beginner in R.

Things can be done in many ways in R. In facts, R has been very flexible in this regard compared to other statistical softwares. Basic things such as selecting a column, slicing a row, filtering a data based on certain condition can be done using a base R function. However, all these things can also be done using a tidyverse approach.

Tidyverse basically, a collection of packages that can be loaded in a line of function.

library(tidyverse)

Tidyverse is developed by “RStudio people” pioneered by Hadley Wickham, which means that these packages will be continuously maintained and updated.

So, without further ado, these are the comparisons between these two approaches for some very basic thingy:

Select or deselect a column and a row

# Base R
iris[1:5, c("Sepal.Length", "Sepal.Width")]
iris[1:5,c(1,2)] # similar to above
iris[1:5, -1]

# Tidyverse
iris %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  slice(1:5)
iris %>% 
  select(-Sepal.Length) %>% 
  slice(1:5)

Filter based on condition

# Base R
iris[iris$Species == "setosa", ]

# Tidyverse
iris %>% 
  filter(Species == "setosa")

Mutate (transmute replace the variable)

# Base R
iris$SL_minus10 <- iris$Sepal.Length - 10

# Tidyverse
iris %>% 
  mutate(SL_minus10 = Sepal.Length - 10)

Sort variable

# Base R
iris[order(-iris$Sepal.Width),]

# Tidyverse
iris %>% 
  arrange(desc(Sepal.Length))

Group by (and get mean for variable Sepal.Width)

# Not really base R
doBy::summaryBy(Sepal.Width~Species, iris, FUN = mean) 

# Tidyverse
iris %>% 
  group_by(Species) %>% 
  summarise(mean_SW = mean(Sepal.Width))

Rename variable

# Base R
colnames(iris)[6] <- "hanis"

# Tidyverse
iris %>% 
  rename(Species = hanis)

So, that’s it. Overall, tidyverse give a clarity in understanding the code as it reads from left to right. On the contrary, the base R approach reads from inside to outside, especially for a more complicated code.

tidyverse | Tengku Hanis

A summary of forcats package

Main functions

Base R vs tidyverse