data.checker is a package for helping with boilerplate
data checks. It enables you to automate fundamental data checks which,
while simple, can be time-consuming to implement.
data.checker:
- Checks data against a user-supplied schema that defines which columns and data types are expected
- Lets you add custom data checks based on multiple columns
- Creates exports of the results for QA
Initialising the data checker is simple. All you need to supply are a dataset and a schema. The schema is a named list that tells the data checker what sorts of columns and values to expect.
schema <- list(
  check_duplicates = FALSE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)
schema
#> $check_duplicates
#> [1] FALSE
#>
#> $check_completeness
#> [1] FALSE
#>
#> $columns
#> $columns$age
#> $columns$age$type
#> [1] "integer"
#>
#> $columns$age$optional
#> [1] FALSE
#>
#>
#> $columns$sex
#> $columns$sex$type
#> [1] "character"
#>
#> $columns$sex$optional
#> [1] FALSE
Running the new_validator function will create a
Validator object.
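The call that creates the validator object printed next is not shown here. A minimal sketch, assuming the same data frame as the add_qa_entry example later in this document and the positional (data, schema) call used there:

```r
# Assumption: this data frame is borrowed from the add_qa_entry example
# further down; the data originally used at this point is not shown.
df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)

# Same schema as defined above.
schema <- list(
  check_duplicates = FALSE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

# new_validator takes the data and the schema and returns a Validator object.
validator <- data.checker::new_validator(df, schema)
```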
print(validator)
#> System information
#> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#> Date: 2025-01-01
#> sysname: Windows
#> release: 10 x64
#> version:
#> nodename:
#> machine:
#> login: username
#> user: username
#> effective_user: username
#> udomain:
#> R version : R version 4.5.1 (2025-06-13 ucrt)
#> data.checker version: 0.0.0.9000
#>
The Validator object logs system information which can
be exported along with the QA log, meaning you have a comprehensive
record of what was done, when and on what systems. Printing the
Validator object will show you the current QA log.
The check function will run the full suite of checks on
your Validator object and add them to the log.
check_results <- data.checker::check(validator)
print(check_results)
#> System information
#> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#> Date: 2025-01-01
#> sysname: Windows
#> release: 10 x64
#> version:
#> nodename:
#> machine:
#> login: username
#> user: username
#> effective_user: username
#> udomain:
#> R version : R version 4.5.1 (2025-06-13 ucrt)
#> data.checker version: 0.0.0.9000
#>
#> Timestamp Description Outcome Failing Ids n Failing Entry Type
#> ---------- ------------------------------------------------------------------------ -------- ------------ ---------- -----------
#> 14:24:36 Column names contain no symbols other than underscores. pass 0 error
#> 14:24:36 Column names contain no capital letters. pass 0 error
#> 14:24:36 All mandatory columns are present. pass 0 error
#> 14:24:36 There are no unexpected columns. pass 0 error
#> 14:24:36 Removed schema information for optional columns that aren't in the data N/A info
#> 14:24:36 Correct column types fail 1 1 error
The export function will export your log in html, csv,
yaml or json. We strongly recommend exporting these automated QA logs
along with your outputs so you have a record of which automated checks
were done and what they found.
Alternatively, you can use the validate function to run
the full process.
Some additional arguments you can use when running
check_and_export include:
The schema has certain mandatory and optional fields.
- check_duplicates: TRUE or FALSE. If TRUE, the dataset will be checked for duplicate rows.
- check_completeness: TRUE or FALSE. If TRUE, the dataset will be checked to ensure there is at least one row for all combinations of factors.
- columns: a named list with an entry for each column.
To restrict the duplicate or completeness check to a subset of columns,
use the duplicate_cols or completeness_cols field and provide a list of
columns to check.
For each column, you should include a type (“character”, “integer”, “double” or “logical”). You also need an “optional” setting (TRUE or FALSE): if FALSE, the checker will raise an error if the column is missing; if TRUE, it will not. At least one column in your schema must have optional = FALSE.
You can also optionally define a class if you want it to be checked. There are three special types you can choose: “Date”, “datetime” and “factor”. In R, these are implemented as a specific combination of types and classes, but the data checker sets up those parts of the schema for you.
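The type names above match what base R's typeof() reports, and the special types are a base type plus a class. The sketch below uses only base R (no data.checker calls) to show what typical columns look like; that the checker compares against typeof() and class() in exactly this way is an assumption.

```r
# Base R only: inspect the storage type of typical numeric columns.
whole_numbers <- 1:10           # the colon operator yields integers
decimals      <- c(10, 20, 30)  # plain numeric literals are doubles
whole_literal <- c(10L, 20L)    # the L suffix forces integer storage

typeof(whole_numbers)
#> [1] "integer"
typeof(decimals)
#> [1] "double"
typeof(whole_literal)
#> [1] "integer"

# The special types are a base type combined with a class attribute.
a_date   <- as.Date("2021-01-01")
a_factor <- factor(c("England", "Wales"))
a_time   <- as.POSIXct("2021-01-01 12:00:00", tz = "UTC")

typeof(a_date);   class(a_date)
#> [1] "double"
#> [1] "Date"
typeof(a_factor); class(a_factor)
#> [1] "integer"
#> [1] "factor"
typeof(a_time);   class(a_time)
#> [1] "double"
#> [1] "POSIXct" "POSIXt"
```

A column built from plain numeric literals, such as age = c(10, 11, 13), is therefore a double. A schema that declares it as “integer” would be expected to fail the column type check unless the values are written with the L suffix or converted with as.integer().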
Optional checks can be applied to each column depending on the column
type. In the schema, these should form part of the columns
list.
Schema objects can get pretty large and clutter your code, and you
should not need to edit your code every time you want to change your
schema. Instead, you can create your schema as a yaml or json file.
You can then supply new_validator with the file
path, and the package will do the rest.
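One way to produce such a file is to write an existing schema list out from R. This is a sketch under two assumptions: the separate yaml package is installed (it is not part of data.checker), and the file is expected to mirror the named-list structure shown earlier. The path and the small schema here are illustrative and are not the contents of example_schema.yaml.

```r
# Assumption: the 'yaml' package is available.
schema <- list(
  check_duplicates = FALSE,
  check_completeness = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)

# Hypothetical location; use a path inside your project in practice.
schema_path <- file.path(tempdir(), "minimal_schema.yaml")

# Serialise the nested list to YAML.
yaml::write_yaml(schema, schema_path)

# Read it back to confirm the structure survives the round trip.
round_trip <- yaml::read_yaml(schema_path)
names(round_trip$columns)
#> [1] "age" "sex"
```

Keeping the schema in a file like this means it can be reviewed and versioned separately from the analysis code.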
df <- data.frame(
  id = 1:10,
  age = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
  country = factor(
    c("England", "England", "Wales", "Scotland", "Wales", "England", "Northern Ireland", "Wales", "Scotland", "Northern Ireland"),
    levels = c("England", "Scotland", "Wales", "Northern Ireland")
  ),
  date = lubridate::ymd(c(
    "2021-01-01",
    "2021-02-01",
    "2021-02-01",
    "2021-03-01",
    "2021-03-01",
    "2021-03-01",
    "2021-04-01",
    "2021-04-01",
    "2021-04-01",
    "2021-05-01"
  ))
)
data_check_results <- data.checker::new_validator(schema = "example_schema.yaml", data = df) |>
  data.checker::check()
print(data_check_results)
#> System information
#> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#> Date: 2025-01-01
#> sysname: Windows
#> release: 10 x64
#> version:
#> nodename:
#> machine:
#> login: username
#> user: username
#> effective_user: username
#> udomain:
#> R version : R version 4.5.1 (2025-06-13 ucrt)
#> data.checker version: 0.0.0.9000
#>
#> Timestamp Description Outcome Failing Ids n Failing Entry Type
#> ---------- -------------------------------------------------------------------------- -------- ------------ ---------- -----------
#> 14:24:37 Column sex unused schema entries: min_length, max_length, allowed_strings N/A warning
#> 14:24:37 Column country unused schema entries: levels N/A warning
#> 14:24:38 Column names contain no symbols other than underscores. pass 0 error
#> 14:24:38 Column names contain no capital letters. pass 0 error
#> 14:24:38 All mandatory columns are present. pass 0 error
#> 14:24:38 There are no unexpected columns. pass 0 error
#> 14:24:37 Removed schema information for optional columns that aren't in the data N/A info
#> 14:24:38 Correct column types pass 0 error
#> 14:24:38 Correct column classes pass 0 error
#> 14:24:38 Column id contains no missing values pass 0 error
#> 14:24:38 Column id: values are above or equal to 0 pass 0 error
#> 14:24:38 Column id: values are below or equal to 1000 pass 0 error
#> 14:24:38 Column age contains no missing values pass 0 error
#> 14:24:38 Column age: values are above or equal to 0 pass 0 error
#> 14:24:38 Column age: values are below or equal to 120 pass 0 error
#> 14:24:38 Column sex contains no missing values pass 0 error
#> 14:24:38 Column country contains no missing values pass 0 error
#> 14:24:38 Column date contains no missing values pass 0 error
#> 14:24:38 Column date: dates are after 2020-01-01 pass 0 error
#> 14:24:38 Column date: dates are before 2023-12-31 pass 0 error
You can write your own checks using the add_check
function. This is particularly useful for checks involving more than one
column, which cannot be configured using the standard template. The
checks are done in the context of the original data, meaning you can
reference columns as if they are variables in the environment (similar
to tidy evaluation). This is recommended because it guarantees the
checks are done on the correct data only. Alternatively, you can use
standard evaluation (see example below).
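Before the full example, it can help to see what the two styles of condition evaluate to. The sketch below uses only base R, with with() standing in for evaluation in the context of the data; how add_check evaluates the condition internally is not shown in this document, so treat the stand-in as an assumption.

```r
df <- data.frame(
  id = 1:10,
  age = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
)

# Column names resolved inside the data frame (bare sex and age).
in_context <- with(df, !(sex == "M" & age > 90))

# Standard evaluation: columns referenced explicitly through df$.
standard <- !(df$sex == "M" & df$age > 90)

# Both give one TRUE/FALSE per row, and they agree.
identical(in_context, standard)
#> [1] TRUE

# Every row satisfies the condition: the only age above 90 belongs to an "F" row.
all(in_context)
#> [1] TRUE
```

The standard-evaluation form depends on df still being the same data that was passed to the validator, which is why referencing the columns directly is the safer choice.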
df <- data.frame(
  id = 1:10,
  age = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
)
schema <- list(
  check_duplicates = FALSE,
  check_completeness = FALSE,
  columns = list(
    id = list(type = "double", optional = FALSE),
    age = list(type = "double", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)
data_check_results <- data.checker::new_validator(df, schema) |>
  data.checker::check() |>
  data.checker::add_check(
    description = "There are no males over 90 (tidy evaluation)",
    condition = !(sex == "M" & age > 90)
  ) |>
  data.checker::add_check(
    description = "There are no males over 90 (standard evaluation)",
    condition = !(df$sex == "M" & df$age > 90)
  )
print(data_check_results)
#> System information
#> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#> Date: 2025-01-01
#> sysname: Windows
#> release: 10 x64
#> version:
#> nodename:
#> machine:
#> login: username
#> user: username
#> effective_user: username
#> udomain:
#> R version : R version 4.5.1 (2025-06-13 ucrt)
#> data.checker version: 0.0.0.9000
#>
#> Timestamp Description Outcome Failing Ids n Failing Entry Type
#> ---------- ------------------------------------------------------------------------ -------- ------------ ---------- -----------
#> 14:24:38 Column names contain no symbols other than underscores. pass 0 error
#> 14:24:38 Column names contain no capital letters. pass 0 error
#> 14:24:38 All mandatory columns are present. pass 0 error
#> 14:24:38 There are no unexpected columns. pass 0 error
#> 14:24:38 Removed schema information for optional columns that aren't in the data N/A info
#> 14:24:38 Correct column types fail 1 1 error
#> 14:24:38 There are no males over 90 (tidy evaluation) pass 0 error
#> 14:24:38 There are no males over 90 (standard evaluation) pass 0 error
You can choose to add your own entries to the QA log using the
add_qa_entry function. The function expects a datachecker
object as the first argument and a description. You can also optionally
add:
- failing_ids: a vector containing the columns/rows that failed the checks.
- outcome: TRUE/FALSE for passing/failing checks, or NA if you want to leave the field blank. Defaults to NA.
- entry_type: either “info”, “warning” or “error”. Info = neutral log record, warning = something is wrong but could be safely ignored, error = something is wrong that is likely to break your code. Defaults to “info”.
df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)
schema <- list(
  check_completeness = FALSE,
  check_duplicates = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)
validator <- data.checker::new_validator(df, schema)
validator <- data.checker::add_qa_entry(
  validator,
  description = "Example custom log entry",
  entry_type = "info"
)
print(validator)
#> System information
#> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#> Date: 2025-01-01
#> sysname: Windows
#> release: 10 x64
#> version:
#> nodename:
#> machine:
#> login: username
#> user: username
#> effective_user: username
#> udomain:
#> R version : R version 4.5.1 (2025-06-13 ucrt)
#> data.checker version: 0.0.0.9000
#>
#> Timestamp Description Outcome Failing Ids n Failing Entry Type
#> ---------- ------------------------- -------- ------------ ---------- -----------
#> 14:24:39 Example custom log entry N/A info
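The optional arguments can be combined to record the result of a check you have run yourself. The sketch below uses the argument names listed above (failing_ids, outcome, entry_type); the check itself, the age threshold and the description are made up for illustration, and how the entry is rendered in the log is not shown here.

```r
df <- data.frame(
  age = c(10, 11, 13, 15, 22, 34, 80),
  sex = c("M", "F", "M", "F", "M", "F", "M")
)
schema <- list(
  check_completeness = FALSE,
  check_duplicates = FALSE,
  columns = list(
    age = list(type = "integer", optional = FALSE),
    sex = list(type = "character", optional = FALSE)
  )
)
validator <- data.checker::new_validator(df, schema)

# Hypothetical manual check: find the rows where age exceeds 65.
over_65 <- which(df$age > 65)

validator <- data.checker::add_qa_entry(
  validator,
  description = "No ages above 65 (manual entry)",
  failing_ids = over_65,           # row numbers that failed the check
  outcome = length(over_65) == 0,  # TRUE only when no rows failed
  entry_type = "warning"
)
print(validator)
```

With this data only the seventh row fails, so the entry is logged with a failing outcome and one failing id.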