Counting Missing Values (NA) in R

This post is also available in Spanish.

To check for missing values in R you might be tempted to use the equality operator == with your vector on one side and NA on the other. Don’t!

If you insist, you’ll get a useless results.

x <- c(1, 5, NA, 3, NA)
x == NA
## [1] NA NA NA NA NA

Instead use the is.na() function.

is.na(x)
## [1] FALSE FALSE  TRUE FALSE  TRUE

Armed with that knowledge let’s explore how to calculate some basic summary statistics about missing values in your data. First of all, to count the total number of NAs in a vector you can simply sum() up the result of is.na().

sum(is.na(x))
## [1] 2

Confused why you can sum TRUE and FALSE values? R automatically converts logical vectors to integer vectors when using arithmetic functions. In the process TRUE gets turned to 1 and FALSE gets converted to 0. Thus, sum(is.na(x)) gives you the total number of missing values in x.

To get the proportion of missing values you can proceed by dividing the result of the previous operation by the length of the input vector.

sum(is.na(x)) / length(x)
## [1] 0.4

Look careful at the code above. Does that “formula” look somehow familiar? Summing up all elements in a vector and dividing by the total numbers of elements, that’s calculating the arithmetic mean! So, instead of using sum() and length() we can simply use mean() to get the proportion of NAs in a vector.

mean(is.na(x))
## [1] 0.4

Enough of vectors, though, let’s look at counting missing values in a data frame. To illustrate the concepts let me first add some missing values to the mtcars dataset.

data(mtcars)
set.seed(1)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 3)] <- NA

The is.na() function is generic and has a method for data frames so you can directly pass it a data frame as input.

na <- is.na(mtcars)
na[1:15, 1:7]
##                      mpg   cyl  disp    hp  drat    wt  qsec
## Mazda RX4          FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## Mazda RX4 Wag      FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## Datsun 710         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Hornet 4 Drive     FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Hornet Sportabout  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Valiant            FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Duster 360         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 240D          FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 230           FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 280           FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 280C          FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## Merc 450SE         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 450SL         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 450SLC        FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Cadillac Fleetwood FALSE FALSE FALSE FALSE FALSE FALSE FALSE

As you can see the result is a matrix of logical values. Getting the total number of NAs then is simple because sum() works with matrices as well as vectors.

sum(is.na(mtcars))
## [1] 15

Arguably, though, the total number of missing values in a dataset is a rather crude measure. It gets much more interesting if we look at missing values across variables and records in the dataset. That enables detecting patterns that might inform future modeling decisions.

Counting NAs across either rows or columns can be achieved by using the apply() function. This function takes three arguments: X is the input matrix, MARGIN is an integer, and FUN is the function to apply to each row or column. MARGIN = 1 means to apply the function across rows and MARGIN = 2 across columns.

apply(X = is.na(mtcars), MARGIN = 1, FUN = sum)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##                   3                   3                   0                   0
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##                   0                   0                   0                   0
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##                   0                   0                   3                   0
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##                   0                   0                   0                   0
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##                   0                   0                   0                   0
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##                   0                   0                   3                   0
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##                   0                   0                   0                   0
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##                   3                   0                   0                   0
apply(X = is.na(mtcars), MARGIN = 2, FUN = sum)
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
##    0    0    0    5    0    0    5    0    5    0    0

If you want to get the proportion of missing values per row or column just change the FUN argument to mean.

apply(X = is.na(mtcars), MARGIN = 1, FUN = mean)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##           0.2727273           0.2727273           0.0000000           0.0000000
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##           0.0000000           0.0000000           0.0000000           0.0000000
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##           0.0000000           0.0000000           0.2727273           0.0000000
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##           0.0000000           0.0000000           0.0000000           0.0000000
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##           0.0000000           0.0000000           0.0000000           0.0000000
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##           0.0000000           0.0000000           0.2727273           0.0000000
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##           0.0000000           0.0000000           0.0000000           0.0000000
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##           0.2727273           0.0000000           0.0000000           0.0000000

To finish up this post, let’s have a quick look at how to visualize missing data. There are a lot of different packages for that purpose out there but I really like the {heatmaply} package which generates an interactive heatmap.

heatmaply::heatmaply_na(mtcars)

Neat, isn’t it?


Thomas Neitmann

base

3572 Words

2021-01-25 00:00 +0700

860d4e8 @ 2021-10-03