The goal of mde is to ease exploration of missingness.
Loading the package
library(mde)
To get a simple missingness report, use na_summary:
na_summary(airquality)
#> variable missing complete percent_complete percent_missing
#> 1 Day 0 153 100.00000 0.000000
#> 2 Month 0 153 100.00000 0.000000
#> 3 Ozone 37 116 75.81699 24.183007
#> 4 Solar.R 7 146 95.42484 4.575163
#> 5 Temp 0 153 100.00000 0.000000
#> 6 Wind 0 153 100.00000 0.000000
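For intuition, the columns of this report can be reproduced with base R. The sketch below is not mde's implementation. It only assumes that missing counts the NA values per column and that both percentages are taken over the number of rows. The name manual_summary is illustrative.

```r
# Base-R sketch of the quantities na_summary reports (not mde's own code).
n_missing <- colSums(is.na(airquality))  # NA count per column
n_rows <- nrow(airquality)               # 153 rows in airquality

manual_summary <- data.frame(
  variable = names(n_missing),
  missing = unname(n_missing),
  complete = n_rows - unname(n_missing),
  percent_complete = 100 * (n_rows - unname(n_missing)) / n_rows,
  percent_missing = 100 * unname(n_missing) / n_rows
)
manual_summary
```

Rows appear here in the data frame's own column order, whereas the na_summary output above lists variables alphabetically.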
To sort this summary by a given column:
na_summary(airquality, sort_by = "percent_complete")
#> variable missing complete percent_complete percent_missing
#> 3 Ozone 37 116 75.81699 24.183007
#> 4 Solar.R 7 146 95.42484 4.575163
#> 1 Day 0 153 100.00000 0.000000
#> 2 Month 0 153 100.00000 0.000000
#> 5 Temp 0 153 100.00000 0.000000
#> 6 Wind 0 153 100.00000 0.000000
To reset (drop) row names, set reset_rownames to TRUE. This is especially useful when row names are simply numeric and carry no additional information.
na_summary(airquality, sort_by = "percent_complete", reset_rownames = TRUE)
#> variable missing complete percent_complete percent_missing
#> 1 Ozone 37 116 75.81699 24.183007
#> 2 Solar.R 7 146 95.42484 4.575163
#> 3 Day 0 153 100.00000 0.000000
#> 4 Month 0 153 100.00000 0.000000
#> 5 Temp 0 153 100.00000 0.000000
#> 6 Wind 0 153 100.00000 0.000000
To sort by percent_missing instead:
na_summary(airquality, sort_by = "percent_missing")
#> variable missing complete percent_complete percent_missing
#> 1 Day 0 153 100.00000 0.000000
#> 2 Month 0 153 100.00000 0.000000
#> 5 Temp 0 153 100.00000 0.000000
#> 6 Wind 0 153 100.00000 0.000000
#> 4 Solar.R 7 146 95.42484 4.575163
#> 3 Ozone 37 116 75.81699 24.183007
To sort the above in descending order:
na_summary(airquality, sort_by="percent_missing", descending = TRUE)
#> variable missing complete percent_complete percent_missing
#> 3 Ozone 37 116 75.81699 24.183007
#> 4 Solar.R 7 146 95.42484 4.575163
#> 1 Day 0 153 100.00000 0.000000
#> 2 Month 0 153 100.00000 0.000000
#> 5 Temp 0 153 100.00000 0.000000
#> 6 Wind 0 153 100.00000 0.000000
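The sort_by and descending arguments behave like ordering the summary by one of its columns. A rough base-R equivalent, which assumes nothing about mde's internals and uses illustrative names (pct_missing, sorted_desc), might look like this:

```r
# Percentage of missing values per column, computed with base R.
pct_missing <- 100 * colMeans(is.na(airquality))

# Ascending order (compare with sort_by = "percent_missing").
sort(pct_missing)

# Descending order (compare with descending = TRUE).
sorted_desc <- sort(pct_missing, decreasing = TRUE)
sorted_desc
```

Columns with identical percentages may come out in a different relative order than in the na_summary output.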
To exclude certain columns from the analysis:
na_summary(airquality, exclude_cols = c("Day", "Wind"))
#> variable missing complete percent_complete percent_missing
#> 1 Month 0 153 100.00000 0.000000
#> 2 Ozone 37 116 75.81699 24.183007
#> 3 Solar.R 7 146 95.42484 4.575163
#> 4 Temp 0 153 100.00000 0.000000
To include or exclude via regex match:
na_summary(airquality, regex_kind = "inclusion",pattern_type = "starts_with", pattern = "O|S")
#> variable missing complete percent_complete percent_missing
#> 1 Ozone 37 116 75.81699 24.183007
#> 2 Solar.R 7 146 95.42484 4.575163
na_summary(airquality, regex_kind = "exclusion",pattern_type = "regex", pattern = "^[O|S]")
#> variable missing complete percent_complete percent_missing
#> 1 Day 0 153 100 0
#> 2 Month 0 153 100 0
#> 3 Temp 0 153 100 0
#> 4 Wind 0 153 100 0
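A note on the pattern in the second call: inside square brackets, | is a literal character rather than alternation, so "^[O|S]" matches names starting with O, S, or |. It selects the same columns here, but "^[OS]" or "^(O|S)" states the intent more precisely. The base-R check below illustrates this with grepl and does not call mde; all variable names are illustrative.

```r
cols <- names(airquality)

# Character class as written in the example: matches "O", "|" or "S".
as_written <- cols[grepl("^[O|S]", cols)]

# Equivalent, more precise alternatives.
char_class <- cols[grepl("^[OS]", cols)]
alternation <- cols[grepl("^(O|S)", cols)]

# The original pattern also matches a name that starts with a literal "|".
pipe_match <- grepl("^[O|S]", "|odd_name")

as_written
```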
To get this summary by group:
test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = c(rep(NA,4),"No"),ID2 = c("E","E","D","E","D"))
na_summary(test2,grouping_cols = c("ID","ID2"))
#> # A tibble: 2 × 7
#> ID ID2 variable missing complete percent_complete percent_missing
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 B D Vals 1 1 50 50
#> 2 A E Vals 3 0 0 100
na_summary(test2, grouping_cols="ID")
#> Warning in na_summary.data.frame(test2, grouping_cols = "ID"): All non grouping
#> values used. Using select non groups is currently not supported
#> # A tibble: 4 × 6
#> ID variable missing complete percent_complete percent_missing
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A Vals 3 0 0 100
#> 2 A ID2 0 3 100 0
#> 3 B Vals 1 1 50 50
#> 4 B ID2 0 2 100 0
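The grouped figures for Vals can be checked by hand with base R. This is only a sketch of the arithmetic behind the table, not how mde computes it; the helper names (missing_by_id, complete_by_id, percent_missing_by_id) are illustrative.

```r
test2 <- data.frame(
  ID = c("A", "A", "B", "A", "B"),
  Vals = c(rep(NA, 4), "No"),
  ID2 = c("E", "E", "D", "E", "D")
)

# Missing and complete counts of Vals within each ID group.
missing_by_id <- tapply(is.na(test2$Vals), test2$ID, sum)
complete_by_id <- tapply(!is.na(test2$Vals), test2$ID, sum)

# Percent missing per group.
percent_missing_by_id <- 100 * missing_by_id / (missing_by_id + complete_by_id)
percent_missing_by_id
```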
get_na_counts
This provides a convenient way to show the number of missing values column-wise. It is relatively fast: in tests on about 400,000 rows, it took a few microseconds.
To get the number of missing values in each column of airquality, we can use the function as follows:
get_na_counts(airquality)
The above might be less useful if one would like to get the results by group. In that case, one can provide a vector of grouping column names in grouping_cols.
test <- structure(list(Subject = structure(c(1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), res = c(NA, 1, 2, 3), ID = structure(c(1L,
1L, 2L, 2L), .Label = c("1", "2"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
get_na_counts(test, grouping_cols = "ID")
#> # A tibble: 2 × 3
#> ID Subject res
#> <fct> <int> <int>
#> 1 1 0 1
#> 2 2 0 0
percent_missing
This is a simple and quick way to look at the percentage of data that is missing column-wise.
We can get the results by group by providing an optional grouping_cols character vector.
percent_missing(test, grouping_cols = "Subject")
#> # A tibble: 2 × 3
#> Subject res ID
#> <fct> <dbl> <dbl>
#> 1 A 50 0
#> 2 B 0 0
To exclude some columns from the above exploration, one can provide an optional character vector in exclude_cols.
percent_missing(airquality,exclude_cols = c("Day","Temp"))
#> Ozone Solar.R Wind Month
#> 1 24.18301 4.575163 0 0
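Excluding columns amounts to dropping them before the percentages are computed. A minimal base-R sketch of that idea, assuming nothing about how mde handles exclude_cols internally (kept and pct_kept are illustrative names):

```r
exclude_cols <- c("Day", "Temp")

# Keep every column that is not listed in exclude_cols.
kept <- setdiff(names(airquality), exclude_cols)

# Percent missing over the remaining columns only.
pct_kept <- 100 * colMeans(is.na(airquality[, kept, drop = FALSE]))
pct_kept
```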
sort_by_missingness
This provides a simple and relatively fast way to sort variables by missingness. Unless stated otherwise, it does not currently support sorting grouped percentages.
Usage:
sort_by_missingness(airquality, sort_by = "counts")
#> variable percent
#> 1 Wind 0
#> 2 Temp 0
#> 3 Month 0
#> 4 Day 0
#> 5 Solar.R 7
#> 6 Ozone 37
To sort in descending order:
sort_by_missingness(airquality, sort_by = "counts", descend = TRUE)
#> variable percent
#> 1 Ozone 37
#> 2 Solar.R 7
#> 3 Wind 0
#> 4 Temp 0
#> 5 Month 0
#> 6 Day 0
To use percentages instead:
sort_by_missingness(airquality, sort_by = "percents")
#> variable percent
#> 1 Wind 0.000000
#> 2 Temp 0.000000
#> 3 Month 0.000000
#> 4 Day 0.000000
#> 5 Solar.R 4.575163
#> 6 Ozone 24.183007
Please note that the mde project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
For further exploration, please run browseVignettes("mde").
To raise an issue, please do so here
Thank you, feedback is always welcome :)