missing data | Tengku Hanis

Stepwise selection after multiple imputation

Tue, 04 Jan 2022 00:00:00 +0000

Some note

I have previously written two posts about multiple imputation using the mice package.

This post is probably my last post about multiple imputation using the mice package.

Stepwise selection

The general steps in mice package are:

mice() - impute the NAs
with() - run the analysis (lm, glm, etc)
pool() - pool the results

For backward and forward selection, we can do it manually after pooling the results in step 3, but we cannot do this for stepwise selection.

Brand (1999) proposed this solution:

Perform stepwise selection separately on each imputed dataset
Fit a preliminary model that contains all variables that are present in at least half of the models from step 1
Apply backward elimination to the variables in the preliminary model (variables are removed one at a time if p > 0.05)
Repeat step 3 until all variables have p values < 0.05
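Steps 3 and 4 above can be sketched as a loop. This is only a minimal illustration under stated assumptions: it uses the nhanes data shipped with mice instead of the data in this post, the starting predictors are hypothetical, and it drops variables based on the pooled p values from summary(pool()) rather than comparing models with D1(). It also assumes all predictors are numeric, so that coefficient names match variable names.

```r
library(mice)

# Assumption: nhanes (shipped with mice) stands in for the real data
imp_demo <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)

vars <- c("age", "hyp", "chl")   # hypothetical preliminary model
alpha <- 0.05

repeat {
  fml <- reformulate(vars, response = "bmi")
  # Fit the same model on every completed dataset, then pool the results
  fits <- lapply(mice::complete(imp_demo, "all"), function(d) lm(fml, data = d))
  est <- summary(pool(as.mira(fits)))
  est <- est[est$term != "(Intercept)", ]
  worst <- which.max(est$p.value)
  # Stop when every term is significant or only one predictor is left
  if (est$p.value[worst] < alpha || length(vars) == 1) break
  vars <- setdiff(vars, as.character(est$term[worst]))
}
vars
```

With factor predictors the coefficient names (for example am1) differ from the variable names, so the last line of the loop would need a mapping from term to variable.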

So, we are going to apply this solution and use the multivariate Wald test (D1() in the mice package) for model comparison instead of the pooled likelihood ratio p value.

Example in R

Load the packages.

library(mice)
library(tidyverse)

Create a dataset with missing values. We are going to use the famous mtcars dataset, which is already available in R.

set.seed(123)
dat <- 
  mtcars %>% 
  mutate(across(c(vs, am), as.factor)) %>% 
  select(-mpg) %>% 
  missForest::prodNA(0.1) %>% 
  bind_cols(mpg = mtcars$mpg)
summary(dat)

##       cyl             disp             hp             drat      
##  Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:4.000   1st Qu.:120.7   1st Qu.:103.0   1st Qu.:3.150  
##  Median :6.000   Median :225.0   Median :123.0   Median :3.715  
##  Mean   :6.148   Mean   :232.8   Mean   :147.4   Mean   :3.642  
##  3rd Qu.:8.000   3rd Qu.:334.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
##  NA's   :5       NA's   :1       NA's   :4       NA's   :2      
##        wt             qsec          vs        am          gear     
##  Min.   :1.513   Min.   :14.50   0   :17   0   :18   Min.   :3.00  
##  1st Qu.:2.429   1st Qu.:16.88   1   :11   1   :10   1st Qu.:3.00  
##  Median :3.203   Median :17.51   NA's: 4   NA's: 4   Median :4.00  
##  Mean   :3.112   Mean   :17.75                       Mean   :3.71  
##  3rd Qu.:3.533   3rd Qu.:18.83                       3rd Qu.:4.00  
##  Max.   :5.424   Max.   :22.90                       Max.   :5.00  
##  NA's   :4       NA's   :2                           NA's   :1     
##       carb            mpg       
##  Min.   :1.000   Min.   :10.40  
##  1st Qu.:2.000   1st Qu.:15.43  
##  Median :2.000   Median :19.20  
##  Mean   :2.667   Mean   :20.09  
##  3rd Qu.:4.000   3rd Qu.:22.80  
##  Max.   :6.000   Max.   :33.90  
##  NA's   :5

Run mice() on missing data with 10 imputed datasets (m = 10).

datImp <- mice(dat, m = 10, printFlag = F, seed = 123)
datImp

## Class: mids
## Number of multiple imputations:  10 
## Imputation methods:
##      cyl     disp       hp     drat       wt     qsec       vs       am 
##    "pmm"    "pmm"    "pmm"    "pmm"    "pmm"    "pmm" "logreg" "logreg" 
##     gear     carb      mpg 
##    "pmm"    "pmm"       "" 
## PredictorMatrix:
##      cyl disp hp drat wt qsec vs am gear carb mpg
## cyl    0    1  1    1  1    1  1  1    1    1   1
## disp   1    0  1    1  1    1  1  1    1    1   1
## hp     1    1  0    1  1    1  1  1    1    1   1
## drat   1    1  1    0  1    1  1  1    1    1   1
## wt     1    1  1    1  0    1  1  1    1    1   1
## qsec   1    1  1    1  1    0  1  1    1    1   1

Run stepwise selection on each imputed dataset.

sc <- list(upper = ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, 
           lower = ~ 1)
exp <- expression(f1 <- lm(mpg ~ 1),
                  f2 <- step(f1, scope = sc, trace = 0))
fit <- with(datImp, exp)

Next, we count how many times each variable was selected by stepwise selection across the models.

fit$analyses %>% 
  map(formula) %>% #get the formula
  map(terms) %>% #get the terms
  map(labels) %>% #get the name of variables
  unlist() %>% 
  table()

## .
##   am carb  cyl disp drat   hp qsec   vs   wt 
##    7    5    3    2    4    5    3    4    7
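The "at least half of the models" rule can also be applied programmatically. A small base R sketch, with the counts copied by hand from the table above (the names counts, m and keep are only for illustration):

```r
# Selection counts copied from the table above (10 imputed datasets)
counts <- c(am = 7, carb = 5, cyl = 3, disp = 2, drat = 4,
            hp = 5, qsec = 3, vs = 4, wt = 7)
m <- 10

# Keep variables selected in at least half of the m models
keep <- names(counts)[counts >= m / 2]
keep
```

Note that carb and hp sit exactly on the threshold of 5 out of 10, so using "more than half" instead would change the preliminary model.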

We are going to select:

am
carb
hp
wt

These variables appear in at least half of the models. We have 10 imputed datasets, so 10 models. Next, we fit a preliminary model.

fit_full1 <- with(datImp, lm(mpg ~ am + carb + hp + wt))
pool(fit_full1) %>% 
  summary()

##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.33683070 3.30280913 10.093478 15.81838 2.688191e-08
## 2         am1  3.06689135 1.94363342  1.577917 13.06329 1.384846e-01
## 3        carb -0.64791214 0.65564816 -0.988201 11.64959 3.431353e-01
## 4          hp -0.03414274 0.01159828 -2.943777 20.47239 7.895170e-03
## 5          wt -2.39586280 1.22218829 -1.960306 13.54830 7.085513e-02

We exclude the carb variable from the next model as it has the largest non-significant p value.

fit_full2 <- with(datImp, lm(mpg ~ am + hp + wt))

Next, we compare the two models using the multivariate Wald test.

D1(fit_full1, fit_full2)

##    test statistic df1     df2 dfcom   p.value       riv
##  1 ~~ 2 0.9765411   1 9.21378    27 0.3482934 0.6935655

The p value is > 0.05, so there is no evidence that carb improves the model, and we opt for the simpler model.

pool(fit_full2) %>% 
  summary()

##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 33.75666324 3.30083213 10.226713 16.87762 1.195383e-08
## 2         am1  2.50264907 1.79966590  1.390619 15.31418 1.842201e-01
## 3          hp -0.03950216 0.01162689 -3.397482 17.65719 3.280147e-03
## 4          wt -2.75412354 1.15870950 -2.376889 15.03403 3.116779e-02

We see that the am variable has the largest non-significant p value. So, we exclude this variable from the next model and compare the two latest models using the multivariate Wald test.

fit_full3 <- with(datImp, lm(mpg ~ hp + wt))
D1(fit_full2, fit_full3)

##    test statistic df1      df2 dfcom   p.value       riv
##  1 ~~ 2   1.93382   1 12.90982    28 0.1878483 0.4392918

Again, we opt for the simpler model.

pool(fit_full3) %>% 
  summary()

##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 37.50546490 1.91102857 19.625800 23.65472 4.440892e-16
## 2          hp -0.03263534 0.01042989 -3.129021 21.20234 5.031751e-03
## 3          wt -3.92792051 0.75157304 -5.226266 19.78033 4.238231e-05

There is no non-significant variable in the model anymore. Thus, this is our final model.

gtsummary::tbl_regression(fit_full3)

Characteristic	Beta	95% CI¹	p-value
hp	-0.03	-0.05, -0.01	0.005
wt	-3.9	-5.5, -2.4	<0.001
¹ CI = Confidence Interval

Reference:

https://stefvanbuuren.name/fimd/sec-stepwise.html

Variable selection for imputation model in {mice}

Mon, 22 Nov 2021 00:00:00 +0000

Some note

I have written a short post about missing data and multiple imputation in mice package previously. This post will add to that previous post.

Imputation model

The imputation model is the model that we use for our imputation approach. There is another term, the complete-data model. This is the model that we want to fit after we impute the missing values (i.e., the complete-data model is the final model).

Generally, we need to include as many relevant variables as possible in the imputation model. However, this general advice may not be very efficient, as we may run into multicollinearity and computational issues if we include too many predictors. As a rule of thumb, the number of included variables should be no more than 15-20. van Buuren et al. (2011) mentioned that the increase in explained variance in linear regression is negligible after 15 variables are included.

There are 4 steps suggested by van Buuren et al. (1999) for variable selection in the case of big data:

Include all variables that appear in the complete-data model (final model)
- This may include the interaction terms as well (passive imputation can be used to specify the interaction terms in mice package)
Include variables that influence the occurrence of the missing data
- This can be assessed by a correlation matrix between the missingness indicators of the variables with NAs and the observed values of the other variables
Include variables that explain a considerable amount of variance
- This can be crudely assessed by a correlation matrix between the variables with NAs and the other variables
Remove variables that have too many missing values within the subgroup of incomplete cases
- This can be assessed by the proportion of usable cases (PUC): how many cases with missing data in a certain variable have observed values on the predictor variables

All these steps should be done on the key variables only. There is another, more efficient yet laborious, approach suggested by Oudshoorn et al. (1999), which takes into account important predictors of the predictors. We are going to focus on the four steps above, and not cover the latter approach in this post.

R code

These are the required packages.

library(mice)
library(corrplot)

Our data.

summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

We have 2 variables with missing values (NAs): Ozone and Solar.R. We can further explore the missing data pattern.

md.pattern(airquality)

##     Wind Temp Month Day Solar.R Ozone   
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44
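The pattern counts can be cross-checked in base R by tabulating the missingness of the two incomplete variables. This is just a sketch; note that here 1 means missing, whereas md.pattern() prints 1 for observed.

```r
# TRUE = missing, for the two incomplete variables
miss <- is.na(airquality[, c("Ozone", "Solar.R")])

# Label each row by its missingness pattern (1 = missing, 0 = observed)
pattern <- paste0("Ozone=", as.integer(miss[, "Ozone"]),
                  ", Solar.R=", as.integer(miss[, "Solar.R"]))
tab <- table(pattern)
tab
```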

There are 2 rows with NAs in Ozone and Solar.R, 35 rows with NAs only in Ozone, and 5 rows with NAs only in Solar.R. Next, we can check the correlation.

cor(airquality, use = "pairwise.complete.obs") |>
  corrplot(method = "number", type = "upper")

The correlations of Ozone-Temp and Ozone-Wind are the highest. Now, let’s compute the correlation between the missingness of each variable and the observed values of the other variables.

cor(y = airquality, x = !is.na(airquality), use = "pairwise.complete.obs") |>
  round(digits = 2)

##         Ozone Solar.R  Wind Temp Month   Day
## Ozone      NA   -0.02 -0.05 0.00  0.26 -0.05
## Solar.R     0      NA  0.06 0.11  0.11  0.17
## Wind       NA      NA    NA   NA    NA    NA
## Temp       NA      NA    NA   NA    NA    NA
## Month      NA      NA    NA   NA    NA    NA
## Day        NA      NA    NA   NA    NA    NA

We can ignore the warnings and the NAs, as only Ozone and Solar.R have missing values. The highest correlation is 0.26, between the Ozone indicator and Month: rows where Ozone is observed tend to have a later Month than rows where Ozone is missing. The rows of the correlation matrix are the response indicators (TRUE = observed, since we passed !is.na()), and the columns are the variables with their observed values.
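To see what a single entry of this matrix means, the Ozone-Month value can be recomputed by hand. A minimal base R sketch (the name r_ozone is only for illustration):

```r
# Response indicator for Ozone: 1 = observed, 0 = missing
r_ozone <- as.numeric(!is.na(airquality$Ozone))

# Correlation between the indicator and the fully observed Month
r_month <- cor(r_ozone, airquality$Month)
round(r_month, 2)

# Mean Month for rows where Ozone is missing (0) vs observed (1)
month_means <- tapply(airquality$Month, r_ozone, mean)
month_means
```

A positive correlation here simply reflects that the mean Month is higher among rows with observed Ozone than among rows with missing Ozone.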

Lastly, we can ‘manually’ calculate the PUC (proportion of usable cases). md.pairs() counts the number of observations per variable pair for each combination of observed and missing.

var_pair <- md.pairs(airquality)
round(var_pair$mr / (var_pair$mr + var_pair$mm), digits = 3)

##         Ozone Solar.R Wind Temp Month Day
## Ozone   0.000   0.946    1    1     1   1
## Solar.R 0.714   0.000    1    1     1   1
## Wind      NaN     NaN  NaN  NaN   NaN NaN
## Temp      NaN     NaN  NaN  NaN   NaN NaN
## Month     NaN     NaN  NaN  NaN   NaN NaN
## Day       NaN     NaN  NaN  NaN   NaN NaN

A low PUC value indicates that there is little information in the predictor to impute the target variable. NaN is shown for variables that have no missing values. The row variables are the target variables to be imputed, and the column variables are the predictors in the imputation model. We can see that, to impute Solar.R (on the row), Ozone carries a little less information (0.714) compared to Wind, Temp, Month, and Day. The diagonal elements will always be 0 or NaN. So, from here we can drop predictors with, say, a PUC of 0, as they contain no information to help impute the target variable.
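For a single target-predictor pair, the PUC is just the share of rows with a missing target in which the predictor is observed. A small base R sketch (puc() is a hypothetical helper, not a mice function):

```r
# Proportion of usable cases for one target-predictor pair
puc <- function(data, target, predictor) {
  miss_target <- is.na(data[[target]])
  # Among rows where the target is missing, how often is the predictor observed?
  mean(!is.na(data[[predictor]][miss_target]))
}

round(puc(airquality, "Ozone", "Solar.R"), 3)   # 35 of 37 rows
round(puc(airquality, "Solar.R", "Ozone"), 3)   # 5 of 7 rows
round(puc(airquality, "Ozone", "Wind"), 3)      # Wind is always observed
```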

Actually, we have a nice function from mice that can do what we ‘manually’ did just now.

quickpred(airquality)

##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0

Again, the column variables are the predictors, and the row variables are the target variables with NAs. The above matrix is known as the predictor matrix, which is going to be used in the imputation model. 1 denotes a variable included as a predictor, and 0 denotes an excluded one. The two main arguments in quickpred() are:

mincor - a predictor is included in the predictor matrix if its absolute value in either of the two correlation matrices that we computed earlier is above 0.1 (default)
minpuc - the default value for PUC is 0, so predictors are retained even if they have no information to help the imputation model

Notice that the variable Day is excluded from the predictors of Ozone. The correlation values are 0 and -0.05 from the first and second correlation matrices, respectively, neither of which exceeds the default setting of 0.1 in absolute value. That’s why Day is excluded. We can observe a similar situation for the variable Wind, which is excluded from the predictors of Solar.R (the correlation coefficients are -0.06 and 0.06). The negative (-) sign does not matter, as we actually evaluate the absolute values.

Intuitively, we can change these two arguments as we see fit to do a variable selection for imputation model. Once we finalise our variable selection, we can do the multiple imputation using mice().
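For instance, stricter thresholds could be set as in the sketch below. The values 0.3 and 0.5 are arbitrary choices for illustration, not recommendations.

```r
library(mice)

# Default thresholds: mincor = 0.1, minpuc = 0
pred_default <- quickpred(airquality)

# Stricter, arbitrary thresholds: fewer predictors should survive
pred_strict <- quickpred(airquality, mincor = 0.3, minpuc = 0.5)
pred_strict

# Number of predictors kept per target variable under each setting
cbind(default = rowSums(pred_default), strict = rowSums(pred_strict))
```

Raising either threshold can only remove predictors, so the strict matrix is never larger than the default one in any cell.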

# Finalised variable selection
var_sel <- quickpred(airquality)
var_sel

##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0

# Impute
imp <- mice(airquality, m = 5, predictorMatrix = var_sel, printFlag = F)
imp

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day 
##   "pmm"   "pmm"      ""      ""      ""      "" 
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   0
## Solar.R     1       0    0    1     1   1
## Wind        0       0    0    0     0   0
## Temp        0       0    0    0     0   0
## Month       0       0    0    0     0   0
## Day         0       0    0    0     0   0

Notice that mice() uses the predictor matrix that we provide.

References:

https://www.jstatsoft.org/article/view/v045i03 - paper written by Stef van Buuren (a bit outdated in terms of code, but runnable)
https://stefvanbuuren.name/fimd/ - online book written by Stef van Buuren (see sections 6.3.2 and 9.1.6)

A short note on multiple imputation

Fri, 29 Oct 2021 00:00:00 +0000

Background

Missing data is quite challenging to deal with. Deleting it may be the easiest solution, but may not be the best solution. Missing data can be categorised into 3 types (Rubin, 1976):

MCAR
- Missing Completely At Random
- Example: some of the observations are missing due to the loss of records during a flood
MAR
- Missing At Random
- Example: the income variable is missing because some participants refuse to give their salary information, which they deem very personal
MNAR
- Missing Not At Random
- Example: the weight variable is missing for morbidly obese participants since the scale is unable to weigh them

Out of the 3 types above, the most problematic is MNAR, though methods exist to deal with this type, for example the miceMNAR package in R.
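The three mechanisms can be made concrete with a small simulation. This is only a sketch with made-up variables (age, income) and arbitrary missingness probabilities, not data from this post.

```r
set.seed(1)
n <- 1000
age <- rnorm(n, mean = 40, sd = 10)               # always observed
income <- 1000 + 50 * age + rnorm(n, sd = 300)    # will get NAs

# MCAR: every value has the same 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: the chance of missingness depends only on the observed age
income_mar <- ifelse(runif(n) < 0.4 * plogis((age - 40) / 5), NA, income)

# MNAR: missingness depends on income itself (the top 20% is missing)
income_mnar <- ifelse(income > quantile(income, 0.8), NA, income)

# Complete-case means under each mechanism
c(full = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar, na.rm = TRUE),
  mnar = mean(income_mnar, na.rm = TRUE))
```

Under MAR the complete-case mean can also be off, but the missingness is explained by the observed age, which is the information imputation can exploit. Under MNAR nothing observed explains it.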

There are several approaches in handling missing data:

Listwise-deletion
- The best approach if the amount of missingness is very small
Simple imputation
- Using mean/median/mode imputation
- This approach is not advisable as it leads to bias due to reduced variance, though the mean is not affected
Single imputation
- Simple imputation above is considered single imputation as well
- This approach ignores the uncertainty of the imputation and almost always underestimates the variance
Multiple imputation
- A bit advanced, and it addresses the limitations of the single imputation approach

However, the main assumption for any imputation method is that the missingness should be MCAR or MAR.
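To illustrate why mean imputation is discouraged in the list above, here is a small base R sketch using Ozone from the built-in airquality data (not the data in this post): the mean is preserved, but the variance shrinks.

```r
oz <- airquality$Ozone

# Replace every NA with the mean of the observed values
oz_mean_imp <- ifelse(is.na(oz), mean(oz, na.rm = TRUE), oz)

# The mean is unchanged ...
c(observed = mean(oz, na.rm = TRUE), imputed = mean(oz_mean_imp))

# ... but the variance is artificially reduced
c(observed = var(oz, na.rm = TRUE), imputed = var(oz_mean_imp))
```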

Multiple imputation

In short, there are 2 approaches of multiple imputation implemented by packages in R:

Joint modeling (JM) or joint multivariate normal distribution multiple imputation
- The main assumption for this method is that the observed data follows a multivariate normal distribution
- A violation of this assumption produces incorrect values, though a slight violation is still okay
- Some packages that implement this method: Amelia and norm
Fully conditional specification (FCS) or conditional multiple imputation
- Also known as multivariate imputation by chained equations (MICE)
- This approach is more flexible, as a distribution is assumed for each variable rather than for the whole dataset
- Some packages that implement this method: mice and mi
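The idea behind chained equations can be sketched in a few lines of base R. This is a deliberately simplified toy, not what mice does internally: it produces a single imputed dataset using linear regression plus residual noise and ignores the uncertainty in the regression parameters. The airquality subset and the object names are assumptions for illustration.

```r
set.seed(1)
toy <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
miss <- is.na(toy)                          # remember which cells were missing
targets <- names(which(colSums(miss) > 0))  # variables that need imputation

# Start by filling every NA with the column mean
for (v in targets) toy[miss[, v], v] <- mean(toy[[v]], na.rm = TRUE)

# Cycle through the incomplete variables, one conditional model per variable
for (iter in 1:10) {
  for (v in targets) {
    # Regress the target on all other variables, using rows where it was observed
    fit <- lm(reformulate(setdiff(names(toy), v), response = v),
              data = toy[!miss[, v], ])
    pred <- predict(fit, newdata = toy[miss[, v], ])
    # Add residual noise so the imputations do not all sit on the regression line
    toy[miss[, v], v] <- pred + rnorm(length(pred), sd = sigma(fit))
  }
}
summary(toy)
```

Each variable gets its own model, which is the flexibility mentioned above. A proper multiple imputation would repeat this m times with appropriate parameter draws.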

Example

In mice package, the general steps are:

mice() - impute the NAs
with() - run the analysis (lm, glm, etc)
pool() - pool the results

Figure 1: Main steps in mice package.
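The pooling in the last step follows Rubin's rules. For a single coefficient, the calculation can be sketched in base R as below. pool_scalar() is a hypothetical helper and the input numbers are made up; mice's pool() additionally computes degrees of freedom and other diagnostics.

```r
# Pool one coefficient across m imputed datasets using Rubin's rules
pool_scalar <- function(est, se) {
  m <- length(est)
  qbar <- mean(est)                  # pooled estimate
  ubar <- mean(se^2)                 # within-imputation variance
  b <- var(est)                      # between-imputation variance
  total <- ubar + (1 + 1 / m) * b    # total variance
  c(estimate = qbar, std.error = sqrt(total))
}

# Hypothetical estimates and standard errors from m = 3 imputed datasets
pooled_demo <- pool_scalar(est = c(1, 2, 3), se = c(0.5, 0.5, 0.5))
pooled_demo
```

The between-imputation term is what single imputation leaves out, which is why it tends to underestimate the variance.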

These are the required packages.

library(tidyverse)
library(mice)
library(VIM)
#library(missForest) we want to use prodNA() function from this package
library(naniar)
library(niceFunction) #install from github (https://github.com/tengku-hanis/niceFunction)
library(dplyr)
library(gtsummary)

We are going to produce some NAs randomly.

set.seed(123)
dat <- iris %>% 
  select(-Sepal.Length)%>% 
  missForest::prodNA(0.2) %>%  # randomly insert 20% NAs
  mutate(Sepal.Length = iris$Sepal.Length)

Explore the NAs and the data.

naniar::miss_var_summary(dat)

## # A tibble: 5 x 3
##   variable     n_miss pct_miss
##   <chr>         <int>    <dbl>
## 1 Petal.Length     38     25.3
## 2 Sepal.Width      33     22  
## 3 Species          28     18.7
## 4 Petal.Width      21     14  
## 5 Sepal.Length      0      0

Some references recommend removing variables with more than 50% NAs. Here, we purposely introduced only 20% NAs into our data.

As a guideline, we can check whether our NAs are MCAR.

naniar::mcar_test(dat) #p > 0.05, MCAR is indicated

## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      38.8    40   0.522               14

Next step is to evaluate the pattern of missingness in our data.

md.pattern(dat, rotate.names = T, plot = T)

##    Sepal.Length Petal.Width Species Sepal.Width Petal.Length    
## 64            1           1       1           1            1   0
## 21            1           1       1           1            0   1
## 15            1           1       1           0            1   1
## 3             1           1       1           0            0   2
## 14            1           1       0           1            1   1
## 4             1           1       0           1            0   2
## 6             1           1       0           0            1   2
## 2             1           1       0           0            0   3
## 7             1           0       1           1            1   1
## 6             1           0       1           1            0   2
## 4             1           0       1           0            1   2
## 2             1           0       1           0            0   3
## 1             1           0       0           1            1   2
## 1             1           0       0           0            1   3
##               0          21      28          33           38 120

aggr(dat, prop = F, numbers = T)

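We have 14 patterns in total, 13 of which contain NAs. In the md.pattern() output, the numbers on the left are the numbers of rows with each pattern, and the numbers on the right are the numbers of variables with NAs in that pattern. These 2 functions work well with a small dataset, but with a larger dataset (and a lot more patterns of NAs), it’s probably quite difficult to assess the pattern.

matrixplot() is probably more appropriate for a larger dataset.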

matrixplot(dat)

In terms of the missingness pattern, we can also assess whether the distribution of NAs in Sepal.Width depends on the variable Sepal.Length.

niceFunction::histNA_byVar(dat, Sepal.Width, Sepal.Length)

As we can see, the distribution and range of the histograms of the NAs (True) and non-NAs (False) are quite similar. Thus, this may indicate that Sepal.Width is at least MAR. However, by right we should do this for each pair of numerical variables before jumping to any conclusion.

Another good thing to assess is the correlation.

# Data with 1 = NAs, 0 = non-NAs
x <- as.data.frame(abs(is.na(dat))) %>% 
  dplyr::select(-Sepal.Length) #pick variable with NAs only

Firstly, the correlation between the variables with missing data.

cor(x) %>% 
  corrplot::corrplot()

There is no high correlation among the variables with NAs. Secondly, let’s see the correlation between the NAs in a variable and the observed values of the other variables.

cor(dat %>% mutate(Species = as.numeric(Species)), x, use = "pairwise.complete.obs")

##               Sepal.Width Petal.Length  Petal.Width     Species
## Sepal.Width            NA  0.049158733 -0.065917718  0.09948263
## Petal.Length  0.042075695           NA -0.004572405 -0.17265919
## Petal.Width   0.096195805 -0.003320601           NA -0.11024288
## Species       0.045849046 -0.104143925 -0.081055707          NA
## Sepal.Length -0.006435044 -0.052871701 -0.091024799 -0.08527514

Again, there is no high correlation. But, if we were to interpret this correlation matrix, the rows are the observed variables and the columns represent the missingness. For example, Sepal.Width is slightly more likely to be missing for observations with a high value of Petal.Width (though r = 0.10 indicates a very weak association).

Now, we can do multiple imputation. These are the methods in the mice package:

methods(mice)

##  [1] mice.impute.2l.bin       mice.impute.2l.lmer      mice.impute.2l.norm     
##  [4] mice.impute.2l.pan       mice.impute.2lonly.mean  mice.impute.2lonly.norm 
##  [7] mice.impute.2lonly.pmm   mice.impute.cart         mice.impute.jomoImpute  
## [10] mice.impute.lda          mice.impute.logreg       mice.impute.logreg.boot 
## [13] mice.impute.mean         mice.impute.midastouch   mice.impute.mnar.logreg 
## [16] mice.impute.mnar.norm    mice.impute.norm         mice.impute.norm.boot   
## [19] mice.impute.norm.nob     mice.impute.norm.predict mice.impute.panImpute   
## [22] mice.impute.passive      mice.impute.pmm          mice.impute.polr        
## [25] mice.impute.polyreg      mice.impute.quadratic    mice.impute.rf          
## [28] mice.impute.ri           mice.impute.sample       mice.mids               
## [31] mice.theme              
## see '?methods' for accessing help and source code

By default, mice uses:

pmm (predictive mean matching) for numeric data
logreg (logistic regression imputation) for binary data, factor with 2 levels
polyreg (polytomous regression imputation) for unordered categorical data (factor > 2 levels)
polr (proportional odds model) for ordered, > 2 levels
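These defaults can be inspected and overridden through the method argument. A minimal sketch, assuming the nhanes2 data shipped with mice rather than the data in this post, and with "norm" (Bayesian linear regression) chosen arbitrarily for illustration:

```r
library(mice)

# Dry run (maxit = 0) just to obtain the default method vector
ini <- mice(nhanes2, maxit = 0, printFlag = FALSE)
meth <- ini$method
meth

# Override the method for one variable and impute
meth["bmi"] <- "norm"
imp_norm <- mice(nhanes2, method = meth, m = 3, printFlag = FALSE, seed = 1)
imp_norm$method
```

Complete variables keep an empty method (""), since nothing needs to be imputed for them.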

Let’s run the mice function on our data:

imp <- mice(dat, m = 5, seed=1234, maxit = 5, printFlag = F) 
imp

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  Sepal.Width Petal.Length  Petal.Width      Species Sepal.Length 
##        "pmm"        "pmm"        "pmm"    "polyreg"           "" 
## PredictorMatrix:
##              Sepal.Width Petal.Length Petal.Width Species Sepal.Length
## Sepal.Width            0            1           1       1            1
## Petal.Length           1            0           1       1            1
## Petal.Width            1            1           0       1            1
## Species                1            1           1       0            1
## Sepal.Length           1            1           1       1            0

Next, we can do some diagnostic assessment on the imputed data. This is our imputed data.

imp$imp$Sepal.Width %>% head()

##      1   2   3   4   5
## 5  3.4 3.4 4.1 3.1 3.5
## 13 3.2 3.1 3.2 3.6 3.1
## 14 3.1 3.2 2.9 3.4 3.0
## 23 3.6 3.2 3.0 3.8 3.1
## 26 4.1 3.0 3.1 3.5 3.0
## 34 3.4 3.7 3.7 3.4 4.4
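Besides looking at the imputed values per variable, the completed datasets themselves can be extracted with complete(). A short sketch, using the nhanes data shipped with mice as a stand-in (the object names are only for illustration):

```r
library(mice)

imp_demo <- mice(nhanes, m = 3, printFlag = FALSE, seed = 1)

# First completed dataset: observed values plus the first set of imputations
first <- mice::complete(imp_demo, 1)
anyNA(first)

# All completed datasets stacked, with .imp marking the imputation number
long <- mice::complete(imp_demo, "long")
table(long$.imp)
```

The mice:: prefix avoids a clash with tidyr::complete() when tidyverse is loaded as well.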

One important thing to check is convergence. We are going to increase the number of iterations for this.

imp_conv <- mice.mids(imp, maxit = 30, printFlag = F)
plot(imp_conv)

The lines in the plot should be intermingled and no obvious trend should be observed. Our plot above indicates convergence.

We can also assess density plot of imputed data and the observed data. Blue color is the observed data and red color is the imputed data.

densityplot(imp)

We can further assess variable Sepal.Width.

densityplot(imp, ~ Sepal.Width | .imp)

Lastly, we can assess the strip plot. The imputed observations (red color) should not be distributed too far from the observed data (blue color).

stripplot(imp)

So, once we finish the diagnostic checking, we can actually go back and change the imputation method for Sepal.Width, since its distribution changes quite differently at each iteration. But we are not going to do that; instead, we are going to do the analysis.

# run regression
fit <- with(imp, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species))
# pool all imputed set
pooled <- pool(fit) 
summary(pooled)

##                term   estimate  std.error statistic       df      p.value
## 1       (Intercept)  2.2008307 0.34577321  6.364954 29.02484 5.859560e-07
## 2       Sepal.Width  0.5233500 0.09717217  5.385801 50.89918 1.854832e-06
## 3      Petal.Length  0.7409159 0.09020153  8.214006 12.73722 1.921415e-06
## 4       Petal.Width -0.3623895 0.18562168 -1.952301 22.34517 6.354332e-02
## 5 Speciesversicolor -0.3891112 0.28166528 -1.381467 15.07547 1.872683e-01
## 6  Speciesvirginica -0.5237106 0.42629920 -1.228505 10.82804 2.452897e-01

Since we have the original dataset without the NAs, we are going to compare them.

mimpute <- 
  fit %>% 
  tbl_regression() #with mice

noimpute <- 
  dat %>% 
  lm(Sepal.Length ~ ., data = .) %>% 
  tbl_regression() #w/o mice

original <- 
  iris %>% 
  lm(Sepal.Length ~ ., data = .) %>% 
  tbl_regression() #original data

tbl_merge(
  tbls = list(mimpute, noimpute, original), 
  tab_spanner = c("With MICE", "Without MICE", "Original data")
)

Characteristic	With MICE	Without MICE	Original data
Beta	95% CI¹	p-value	Beta	95% CI¹	p-value	Beta	95% CI¹	p-value
Sepal.Width	0.52	0.33, 0.72	<0.001	0.48	0.17, 0.79	0.003	0.50	0.33, 0.67	<0.001
Petal.Length	0.74	0.55, 0.94	<0.001	0.71	0.51, 0.90	<0.001	0.83	0.69, 1.0	<0.001
Petal.Width	-0.36	-0.75, 0.02	0.064	-0.35	-0.85, 0.14	0.2	-0.32	-0.61, -0.02	0.039
Species
setosa	—	—		—	—		—	—
versicolor	-0.39	-1.0, 0.21	0.2	-0.42	-1.1, 0.30	0.3	-0.72	-1.2, -0.25	0.003
virginica	-0.52	-1.5, 0.42	0.2	-0.42	-1.5, 0.63	0.4	-1.0	-1.7, -0.36	0.003
¹ CI = Confidence Interval

There is a difference in the results between the original dataset (no NAs) and the mice imputation. Exploring other imputation methods will probably produce a better result.

There is a lot more that is not covered in this post, for example passive imputation and post-processing. In fact, there is a series of vignettes written by Gerko Vink and Stef van Buuren (both are authors of mice) which provides a good tutorial on using mice, though quite advanced.

Suggested online books (though I have not really studied either of the books yet):

Flexible imputation of missing data by Stef van Buuren
Applied missing data analysis with SPSS and (R)Studio

References for this post: