This post is also available in Spanish.

To check for missing values in `R` you might be tempted to use the equality operator `==` with your vector on one side and `NA` on the other. Don’t!

If you insist, you’ll get a useless results.

``````x <- c(1, 5, NA, 3, NA)
x == NA``````
``## [1] NA NA NA NA NA``

Instead use the `is.na()` function.

``is.na(x)``
``## [1] FALSE FALSE  TRUE FALSE  TRUE``

Armed with that knowledge let’s explore how to calculate some basic summary statistics about missing values in your data. First of all, to count the total number of `NA`s in a vector you can simply `sum()` up the result of `is.na()`.

``sum(is.na(x))``
``## [1] 2``

Confused why you can sum `TRUE` and `FALSE` values? R automatically converts `logical` vectors to `integer` vectors when using arithmetic functions. In the process `TRUE` gets turned to `1` and `FALSE` gets converted to `0`. Thus, `sum(is.na(x))` gives you the total number of missing values in `x`.

To get the proportion of missing values you can proceed by dividing the result of the previous operation by the length of the input vector.

``sum(is.na(x)) / length(x)``
``## [1] 0.4``

Look careful at the code above. Does that “formula” look somehow familiar? Summing up all elements in a vector and dividing by the total numbers of elements, that’s calculating the arithmetic mean! So, instead of using `sum()` and `length()` we can simply use `mean()` to get the proportion of `NA`s in a vector.

``mean(is.na(x))``
``## [1] 0.4``

Enough of vectors, though, let’s look at counting missing values in a data frame. To illustrate the concepts let me first add some missing values to the `mtcars` dataset.

``````data(mtcars)
set.seed(1)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 3)] <- NA``````

The `is.na()` function is generic and has a method for data frames so you can directly pass it a data frame as input.

``````na <- is.na(mtcars)
na[1:15, 1:7]``````
``````##                      mpg   cyl  disp    hp  drat    wt  qsec
## Mazda RX4          FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## Mazda RX4 Wag      FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## Datsun 710         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Hornet 4 Drive     FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Hornet Sportabout  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Valiant            FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Duster 360         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 240D          FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 230           FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 280           FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 280C          FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
## Merc 450SE         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 450SL         FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Merc 450SLC        FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Cadillac Fleetwood FALSE FALSE FALSE FALSE FALSE FALSE FALSE``````

As you can see the result is a matrix of `logical` values. Getting the total number of `NA`s then is simple because `sum()` works with matrices as well as vectors.

``sum(is.na(mtcars))``
``## [1] 15``

Arguably, though, the total number of missing values in a dataset is a rather crude measure. It gets much more interesting if we look at missing values across variables and records in the dataset. That enables detecting patterns that might inform future modeling decisions.

Counting `NA`s across either rows or columns can be achieved by using the `apply()` function. This function takes three arguments: `X` is the input matrix, `MARGIN` is an integer, and `FUN` is the function to apply to each row or column. `MARGIN = 1` means to apply the function across rows and `MARGIN = 2` across columns.

``apply(X = is.na(mtcars), MARGIN = 1, FUN = sum)``
``````##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##                   3                   3                   0                   0
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##                   0                   0                   0                   0
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##                   0                   0                   3                   0
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##                   0                   0                   0                   0
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##                   0                   0                   0                   0
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##                   0                   0                   3                   0
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##                   0                   0                   0                   0
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##                   3                   0                   0                   0``````
``apply(X = is.na(mtcars), MARGIN = 2, FUN = sum)``
``````##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
##    0    0    0    5    0    0    5    0    5    0    0``````

If you want to get the proportion of missing values per row or column just change the `FUN` argument to `mean`.

``apply(X = is.na(mtcars), MARGIN = 1, FUN = mean)``
``````##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##           0.2727273           0.2727273           0.0000000           0.0000000
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##           0.0000000           0.0000000           0.0000000           0.0000000
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##           0.0000000           0.0000000           0.2727273           0.0000000
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##           0.0000000           0.0000000           0.0000000           0.0000000
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##           0.0000000           0.0000000           0.0000000           0.0000000
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##           0.0000000           0.0000000           0.2727273           0.0000000
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##           0.0000000           0.0000000           0.0000000           0.0000000
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##           0.2727273           0.0000000           0.0000000           0.0000000``````

To finish up this post, let’s have a quick look at how to visualize missing data. There are a lot of different packages for that purpose out there but I really like the `{heatmaply}` package which generates an interactive heatmap.

``heatmaply::heatmaply_na(mtcars)``

Neat, isn’t it?

Thomas Neitmann

3582 Words

2021-01-25 00:00 +0700

f932891 @ 2021-02-07