base R | Tengku Hanis

Fitted vs predict in R

Sun, 09 Jan 2022 00:00:00 +0000

There are two functions in R that seems almost similar yet different:

fitted()
predict()

First let’s prepare some data first.

# Packages
library(dplyr)

# Data
set.seed(123)
dat <- 
  iris %>% 
  mutate(twoGp = sample(c("Gp1", "Gp2"), 150, replace = T), #create two group factor
         twoGp = as.factor(twoGp))

This is our data.

summary(dat)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species   twoGp   
##  setosa    :50   Gp1:76  
##  versicolor:50   Gp2:74  
##  virginica :50           
##                          
##                          
##

fitted() is used to get a predicted values or \(\hat{y}\) based on the data. Let’s see this on the logistic regression.

logR <- glm(twoGp ~ ., family = binomial(), data = dat)

These are the fitted values.

fitted(logR) %>% head()

##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198

For predict(), we have three types:

Response
Link - default
Terms

If no new data supplied to predict(), it will use the original data used to fit the model.

1. Response

The type response is identical to fitted().

predict(logR, type = "response") %>% head()

##         1         2         3         4         5         6 
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198

We can confirm this as below.

all.equal(fitted(logR), predict(logR, type = "response"))

## [1] TRUE

Thus, fitted() and predict(type = "response") give use predicted probabilities on the scale of the response variable. The first observation of this values can be interpreted as probability of Gp2 (Gp1 is a reference group) for first observation is 0.41.

2. Link

predict(type = "link") gives us predicted probabilities on the logit scale or log odds prediction.

predict(logR, type = "link") %>% head() #similar to predict(logR)

##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578

So, the log odds prediction of Gp2 for the first observation is -0.37. Hence, we can get the same values if we apply a link function to the fitted values.

The link function for logistic regression is:

\[ ln(\frac{\mu}{1 - \mu}) \] So, we apply this link function to the fitted values.

logOddsProb <- log(fitted(logR) / (1 - fitted(logR))) 
head(logOddsProb)

##          1          2          3          4          5          6 
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578

We can further confirm this as we did previously.

all.equal(logOddsProb, predict(logR, type = "link"))

## [1] TRUE

Also, we can conclude predict(type = "link") give use a fitted values before an application of link function (log odds).

3. Terms

Lastly, we have predict(type = "terms"). This type gives us a matrix of fitted values for each variable of each observation in the model on the scale of linear predictor.

predict(logR, type = "terms") %>% head()

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1   0.07988782  0.28070682    0.4819893  -0.2736677 -0.9178543
## 2   0.10138230 -0.03635661    0.4819893  -0.2736677 -0.9178543
## 3   0.12287679  0.09046877    0.5024299  -0.2736677 -0.9178543
## 4   0.13362403  0.02705608    0.4615487  -0.2736677 -0.9178543
## 5   0.09063506  0.34411951    0.4819893  -0.2736677 -0.9178543
## 6   0.04764610  0.53435757    0.4206675  -0.2188976 -0.9178543

So, if we add up the values of the first observation and the constant (or intercept), we will get the same values as the log odds prediction (predict(type = "link")).

predTerm <- predict(logR, type = "terms")
sum(predTerm[1, ], attr(predTerm, "constant")) #add up the first observation and the constant

## [1] -0.374315

logOddsProb[1]

##         1 
## -0.374315

Those values also similar to if we calculate manually using coefficient from the summary().

\[ LogOdds(Gp2) = \beta_0 + \beta_1(Sepal.Length) + \beta_2(Sepal.Width) + \] \[ \beta_3(Petal.Length) + \beta_4(Petal.Width) + \beta_5(Species) \] So, this is the values we get from the first observation.

coef(logR)[1] + coef(logR)[2]*dat$Sepal.Length[1] + coef(logR)[3]*dat$Sepal.Width[1] + coef(logR)[4]*dat$Petal.Length[1] + coef(logR)[5]*dat$Petal.Width[1] + 0 #setosa species

## (Intercept) 
##   -0.374315

However, in predict(type = "terms") the values is centered, thus we have a different values for constant/intercept and for \(\beta_1(Sepal.Length)\),\(\beta_2(Sepal.Width)\) and so on. For example, the values for intercept for both models are:

# Intercept/constant from predict(type = "terms")
attr(predTerm, "constant")

## [1] -0.02537694

# Intercept/constant from summary()
coef(logR)[1]

## (Intercept) 
##   -1.814251

References:

Base R vs tidyverse

Tue, 04 May 2021 00:00:00 +0000

First of all, this write up is mean for a beginner in R.

Things can be done in many ways in R. In facts, R has been very flexible in this regard compared to other statistical softwares. Basic things such as selecting a column, slicing a row, filtering a data based on certain condition can be done using a base R function. However, all these things can also be done using a tidyverse approach.

Tidyverse basically, a collection of packages that can be loaded in a line of function.

library(tidyverse)

Tidyverse is developed by “RStudio people” pioneered by Hadley Wickham, which means that these packages will be continuously maintained and updated.

So, without further ado, these are the comparisons between these two approaches for some very basic thingy:

Select or deselect a column and a row

# Base R
iris[1:5, c("Sepal.Length", "Sepal.Width")]
iris[1:5,c(1,2)] # similar to above
iris[1:5, -1]

# Tidyverse
iris %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  slice(1:5)
iris %>% 
  select(-Sepal.Length) %>% 
  slice(1:5)

Filter based on condition

# Base R
iris[iris$Species == "setosa", ]

# Tidyverse
iris %>% 
  filter(Species == "setosa")

Mutate (transmute replace the variable)

# Base R
iris$SL_minus10 <- iris$Sepal.Length - 10

# Tidyverse
iris %>% 
  mutate(SL_minus10 = Sepal.Length - 10)

Sort variable

# Base R
iris[order(-iris$Sepal.Width),]

# Tidyverse
iris %>% 
  arrange(desc(Sepal.Length))

Group by (and get mean for variable Sepal.Width)

# Not really base R
doBy::summaryBy(Sepal.Width~Species, iris, FUN = mean) 

# Tidyverse
iris %>% 
  group_by(Species) %>% 
  summarise(mean_SW = mean(Sepal.Width))

Rename variable

# Base R
colnames(iris)[6] <- "hanis"

# Tidyverse
iris %>% 
  rename(Species = hanis)

So, that’s it. Overall, tidyverse give a clarity in understanding the code as it reads from left to right. On the contrary, the base R approach reads from inside to outside, especially for a more complicated code.

Loop vs apply in R

Tue, 04 May 2021 00:00:00 +0000

I have heard quite a several times that apply function is faster than loop function in R. Loop function is said to be inefficient, though in certain situation loop is the only way.

Let’s compare between loop function and apply function in R.

First, make a very big fake data contain a list of vector.

set.seed(2021)
xlist <- list(col1 = rnorm(10000000), 
              col2 = rnorm(10000000),
              col3 = rnorm(100000000),
              col4 = rnorm(1000000)) # this will take a few seconds

Then, calculate the mean of each vector using for loop().

ptm <- proc.time() #-- start the clock

mean_loop <- vector("list", 0) # place holder for a value
for (i in seq_along(xlist)) {
  mean_loop[[i]] <- mean(xlist[[i]])
}

proc.time() - ptm #-- stop the clock (time in seconds)

##    user  system elapsed 
##    0.38    0.00    0.37

Now, using lapply() function.

ptm <- proc.time() #-- start the clock

mean_apply <- lapply(xlist, mean)

proc.time() - ptm #-- stop the clock

##    user  system elapsed 
##    0.34    0.00    0.35

So, lapply() is a little bit faster. Obviously, with a very big dataset and a more complicated objective, lapply() is the right choice, but for a “normal” size dataset, the use of any of the two functions probably do not make much different.