Counting Missing Values (NA) in R
This post is also available in Spanish.
To check for missing values in R
you might be tempted to use the equality operator ==
with your vector on one side and NA
on the other. Don’t!
If you insist, you’ll get a useless results.
x <- c(1, 5, NA, 3, NA)
x == NA
## [1] NA NA NA NA NA
Instead use the is.na()
function.
is.na(x)
## [1] FALSE FALSE TRUE FALSE TRUE
Armed with that knowledge let’s explore how to calculate some basic summary statistics about missing values in your data. First of all, to count the total number of NA
s in a vector you can simply sum()
up the result of is.na()
.
sum(is.na(x))
## [1] 2
Confused why you can sum TRUE
and FALSE
values? R automatically converts logical
vectors to integer
vectors when using arithmetic functions. In the process TRUE
gets turned to 1
and FALSE
gets converted to 0
. Thus, sum(is.na(x))
gives you the total number of missing values in x
.
To get the proportion of missing values you can proceed by dividing the result of the previous operation by the length of the input vector.
sum(is.na(x)) / length(x)
## [1] 0.4
Look careful at the code above. Does that “formula” look somehow familiar? Summing up all elements in a vector and dividing by the total numbers of elements, that’s calculating the arithmetic mean! So, instead of using sum()
and length()
we can simply use mean()
to get the proportion of NA
s in a vector.
mean(is.na(x))
## [1] 0.4
Enough of vectors, though, let’s look at counting missing values in a data frame. To illustrate the concepts let me first add some missing values to the mtcars
dataset.
data(mtcars)
set.seed(1)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 3)] <- NA
The is.na()
function is generic and has a method for data frames so you can directly pass it a data frame as input.
na <- is.na(mtcars)
na[1:15, 1:7]
## mpg cyl disp hp drat wt qsec
## Mazda RX4 FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## Mazda RX4 Wag FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## Datsun 710 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Hornet 4 Drive FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Hornet Sportabout FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Valiant FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Duster 360 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 240D FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 230 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 280 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 280C FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## Merc 450SE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 450SL FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 450SLC FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Cadillac Fleetwood FALSE FALSE FALSE FALSE FALSE FALSE FALSE
As you can see the result is a matrix of logical
values. Getting the total number of NA
s then is simple because sum()
works with matrices as well as vectors.
sum(is.na(mtcars))
## [1] 15
Arguably, though, the total number of missing values in a dataset is a rather crude measure. It gets much more interesting if we look at missing values across variables and records in the dataset. That enables detecting patterns that might inform future modeling decisions.
Counting NA
s across either rows or columns can be achieved by using the apply()
function. This function takes three arguments: X
is the input matrix, MARGIN
is an integer, and FUN
is the function to apply to each row or column. MARGIN = 1
means to apply the function across rows and MARGIN = 2
across columns.
apply(X = is.na(mtcars), MARGIN = 1, FUN = sum)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 3 3 0 0
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 0 0 0 0
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 0 0 3 0
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 0 0 0 0
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 0 0 0 0
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 0 0 3 0
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 0 0 0 0
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 3 0 0 0
apply(X = is.na(mtcars), MARGIN = 2, FUN = sum)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 5 0 0 5 0 5 0 0
If you want to get the proportion of missing values per row or column just change the FUN
argument to mean
.
apply(X = is.na(mtcars), MARGIN = 1, FUN = mean)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 0.2727273 0.2727273 0.0000000 0.0000000
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 0.0000000 0.0000000 0.0000000 0.0000000
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 0.0000000 0.0000000 0.2727273 0.0000000
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 0.0000000 0.0000000 0.0000000 0.0000000
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 0.0000000 0.0000000 0.0000000 0.0000000
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 0.0000000 0.0000000 0.2727273 0.0000000
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 0.0000000 0.0000000 0.0000000 0.0000000
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 0.2727273 0.0000000 0.0000000 0.0000000
To finish up this post, let’s have a quick look at how to visualize missing data. There are a lot of different packages for that purpose out there but I really like the {heatmaply}
package which generates an interactive heatmap.
heatmaply::heatmaply_na(mtcars)
Neat, isn’t it?