datadiff

Data Validation with YAML Rules

Overview

datadiff is a comprehensive R package for data validation that lets you compare datasets using configurable validation rules defined in YAML files.
It supports exact matching, tolerance-based numeric comparisons, text normalization, and row count validation.

Installation

remotes::install_github("thinkr-open/datadiff")

Quick Start

library(datadiff)

# Create reference and candidate datasets
reference <- data.frame(
  id = 1:3,
  amount = c(100.00, 200.00, 300.00),
  category = c("A", "B", "C")
)

candidate <- data.frame(
  id = 1:3,
  amount = c(100.01, 200.01, 300.001),  # Small differences within tolerance
  category = c("a", "b", "c")  # Different case
)

# Zero configuration: near-exact rules generated from the reference structure
result <- compare_datasets_from_yaml(reference, candidate, key = "id")
result$all_passed  # FALSE here: exact rules reject the differences above

# To tune tolerances, generate a rules template and edit it
rules_path <- tempfile(fileext = ".yaml")
write_rules_template(reference, key = "id",
                     path = rules_path,
                     numeric_abs = 0.01,
                     character_case_insensitive = TRUE)

# Compare datasets with the tuned rules (key stays explicit)
result <- compare_datasets_from_yaml(reference, candidate,
                                     key = "id", path = rules_path)

result$all_passed   # overall verdict (TRUE/FALSE)
result$coverage     # one row per check performed: column, type, n, n_failed, status
result$summary      # aggregate counts (n_checks, n_pass, n_fail, ...)

# Printing the response lazily renders the full pointblank-style report
# (built on demand, only when displayed):
print(result$response)

# ...or export that report to a standalone HTML file:
datadiff_report_html(result, file = "report.html")

Key Features

YAML Configuration: Define validation rules in human-readable YAML files
Flexible Comparisons: Support for exact matching and tolerance-based comparisons
Text Normalization: Case-insensitive comparison and whitespace trimming
Type-Aware Validation: Different rules for different data types
Comprehensive Reporting: Detailed validation reports powered by pointblank
Row Count Validation: Ensure datasets have the expected number of rows
Fast on wide tables: an all-pass comparison short-circuits the per-column pointblank agent, so validating hundreds/thousands of columns stays quick
Faithful coverage: result$coverage always lists every check performed (even when everything passes), and result$response renders the full pointblank HTML report lazily, only when printed

Dependencies

This package is built on top of the excellent pointblank package, which provides the robust data validation engine.
Pointblank handles all the heavy lifting for validation logic, error reporting, and test execution.

Configuration

write_rules_template() generates a starting YAML. You then edit by_name to override any rule for a specific column.

Full example with column-level overrides (`by_name`)

version: 1
defaults:
  na_equal: yes
  ignore_columns: 
    documentation
  keys: id
  label: reference vs candidate
row_validation:
  check_count: yes
  expected_count: ~   # null = use reference row count
  tolerance: 0
by_type:
  numeric:
    abs: 1.0e-09      # near-exact by default for all numeric columns
    rel: 0
  integer:
    abs: 0
  character:
    equal_mode: exact
    case_insensitive: no
    trim: no
  date:
    equal_mode: exact
  datetime:
    equal_mode: exact
  logical:
    equal_mode: exact
by_name:
  id: []              # no override - inherits integer rule (abs: 0)
  amount:
    abs: 0.01         # override: accept differences up to 0.01 for this column
    rel: 0
  unit_price:
    abs: 0.001
    rel: 0.0001       # mixed mode: 0.001 + 0.01% of reference value
  category:
    case_insensitive: yes   # override: ignore case for this column only
    trim: yes
  created_at:
    equal_mode: exact       # redundant here, but explicit for documentation

How by_name interacts with by_type: rules are merged, with by_name taking precedence. A column not listed in by_name uses its by_type defaults unchanged.

Column	Effective rule	Source
`id`	`abs: 0`	`by_type.integer`
`amount`	`abs: 0.01, rel: 0`	`by_name.amount` overrides `by_type.numeric`
`unit_price`	`abs: 0.001, rel: 0.0001`	`by_name.unit_price` overrides `by_type.numeric`
`category`	`case_insensitive: yes, trim: yes`	`by_name.category` overrides `by_type.character`
`created_at`	`equal_mode: exact`	`by_type.date` (by_name is redundant here)

Numeric tolerance: formula, edge effects, and best practices

Threshold formula

For numeric columns, validation relies on a single combined threshold:

threshold  = abs + rel * |reference_value|
result = OK  if  |candidate - reference| <= threshold

The two parameters add up: they are not two independent guards.

Parameter	Role	Default value
`abs`	Absolute tolerance: fixed floor, independent of the magnitude of values	`1e-9`
`rel`	Relative tolerance: fraction of the reference value added to the threshold	`0`

Pure absolute mode (recommended by default)

With rel: 0, the threshold is constant regardless of the magnitude of the values being compared:

by_type:
  numeric:
    abs: 0.01   # fixed threshold: any difference > 0.01 is detected
    rel: 0

Reference	Candidate	Difference	Threshold	Result
1.00	1.005	0.005	0.01	OK
1 000 000.00	1 000 000.005	0.005	0.01	OK
0.000001	0.000001 + 0.005	0.005	0.01	OK
1.00	1.02	0.02	0.01	ERROR
1 000 000.00	1 000 000.02	0.02	0.01	ERROR

The same difference of 0.02 is detected whether you compare thousandths or millions.

Pure relative mode

With abs: 0, the threshold is proportional to the reference value:

by_type:
  numeric:
    abs: 0      # no fixed floor
    rel: 0.01   # tolerance of 1% of the reference value

Reference	Candidate	Difference	Threshold (1%)	Result
100.00	100.50	0.50	1.00	OK
1 000 000.00	1 005 000.00	5 000	10 000	OK
0.00	0.001	0.001	0	ERROR (implicit division by zero)

Warning: if the reference value is 0, the relative threshold is 0, so any difference, even tiny, will be detected as an error.
This is why abs acts as a safety floor.

Mixed mode: when and how to use it

Combining both parameters is useful when values can be close to zero and very large, and you want a tolerance that adapts in both cases:

by_type:
  numeric:
    abs: 0.001   # floor: protects the ref ~ 0 case
    rel: 0.01    # +1% for large values

For ref = 1 000 000: threshold = 0.001 + 0.01 * 1 000 000 = 10 000.001

Pitfall: with rel > 0 and large reference values, the threshold can become much wider than you intuitively expect.
For example, rel: 1e-9 on a value of 12 000 000 yields a threshold of ~ 0.012, so a difference of 0.000565 would pass undetected, even with abs: 1e-9.

Rule of thumb: use rel: 0 (the default) unless you explicitly need a tolerance proportional to the magnitude of the data.

Character comparison options

For character columns, you can configure text normalization and comparison behavior:

equal_mode: Comparison mode ("exact" or "normalized")
case_insensitive: Whether to ignore case differences (true/false)
trim: Whether to trim whitespace before comparison (true/false)

Example:

by_type:
  character:
    equal_mode: normalized      # Apply normalization before comparison
    case_insensitive: true      # Ignore case differences
    trim: true                  # Remove leading/trailing whitespace

For reference value "Hello World":

Candidate "hello world" passes with case_insensitive: true
Candidate " Hello World " passes with trim: true
Candidate "HELLO WORLD" passes with both options enabled
Candidate "Hello Universe" fails regardless of normalization

Row count validation

You can validate that datasets have an expected number of rows using the row_validation section:

check_count: Whether to enable row count validation (true/false)
expected_count: Expected number of rows (if null, uses the reference dataset row count)
tolerance: Allowed deviation from the expected count

Example:

row_validation:
  check_count: true
  expected_count: 1000
  tolerance: 50

This validates that the candidate dataset has between 950 and 1050 rows (1000 +/- 50).

If expected_count is not specified, the reference dataset’s row count is used as the expected value.

Comparing Parquet files (large datasets)

datadiff can compare Parquet files that are too large to fit in RAM. The recommended approach uses arrow::open_dataset() - do not call arrow::to_duckdb() yourself before passing to datadiff; the package handles the Arrow -> DuckDB conversion internally with a single connection.

Recommended strategy: Arrow Dataset (lazy, out-of-core)

library(datadiff)
library(arrow)

ds_ref  <- arrow::open_dataset("path/to/reference/")
ds_cand <- arrow::open_dataset("path/to/candidate/")

# Generate a rules template from the schema (no data loaded)
write_rules_template(ds_ref, key = "ID", path = "rules.yaml")

# Compare - stays lazy until the final boolean slim table
result <- compare_datasets_from_yaml(
  data_reference = ds_ref,
  data_candidate = ds_cand,
  key = "ID",
  path = "rules.yaml"
)

Internally, compare_datasets_from_yaml():

Opens a private DuckDB connection (fresh_con).
Materialises each Parquet dataset as a DuckDB physical temp table via read_parquet() - all memory is managed by DuckDB’s buffer pool, so disk spilling works correctly.
Runs the full join + 125 boolean expressions as a single lazy SQL query.
dplyr::compute() materialises only the slim boolean result table (~125 logical columns * N rows) - the wide source data is never loaded into R.
dplyr::collect() brings the slim boolean table (~few GB) into R and passes a plain data.frame to pointblank - no live DuckDB connection needed during interrogation.
Disconnects and destroys fresh_con on exit.

Memory tuning: `duckdb_memory_limit`

DuckDB’s default memory cap (80 % of total RAM) can leave insufficient headroom when R, Arrow, and the OS are already using significant memory. The duckdb_memory_limit parameter controls how much RAM DuckDB may use before spilling intermediate results to tempdir():

result <- compare_datasets_from_yaml(
  data_reference    = ds_ref,
  data_candidate    = ds_cand,
  key               = "ID",
  path              = "rules.yaml",
  duckdb_memory_limit = "8GB"   # default - safe for machines with >= 16 GB
)

Machine RAM	Recommended `duckdb_memory_limit`	Notes
8 GB	`"3GB"`	Leaves room for R + OS
16 GB	`"6GB"`	Balanced
32 GB	`"8GB"` (default)	Spills when needed
64 GB+	`"20GB"`	Reduces spilling, faster

Increasing the limit reduces disk I/O and speeds up the comparison; decreasing it protects against OOM on memory-constrained machines. The limit only applies when Arrow datasets are used - it has no effect for data.frame or tbl_lazy inputs.

Strategy comparison

Strategy	Input type	RAM usage	Requires
Arrow Dataset ✅ recommended	`arrow::open_dataset()`	Slim boolean table only (~few GB)	arrow, duckdb
Arrow Table	`arrow::read_parquet(as_data_frame=FALSE)`	Same as Arrow Dataset	arrow, duckdb
Lazy table (dbplyr)	`tbl(con, "table_name")`	Slim boolean table only	DBI, dbplyr
`data.frame`	`read.csv()`, `readr::read_csv()`, etc.	Full data in RAM	-

What NOT to do

# ❌ DO NOT convert to DuckDB yourself before passing to datadiff
ds_ref_duckdb  <- arrow::to_duckdb(ds_ref)   # creates Arrow's internal connection
ds_cand_duckdb <- arrow::to_duckdb(ds_cand)  # different connection!
result <- compare_datasets_from_yaml(ds_ref_duckdb, ds_cand_duckdb, ...)
# -> cross-connection join fails in pointblank

# ❌ DO NOT collect before passing
ref_df <- dplyr::collect(ds_ref)   # loads full 4 GB into R RAM

Disk spilling location

DuckDB spills to tempdir() when the memory limit is reached. On Windows this is typically C:\Users\<user>\AppData\Local\Temp. Ensure that directory has sufficient free disk space (up to ~2-3* the size of your Parquet files in the worst case).

Main Functions

write_rules_template(): Generate a YAML rules template
read_rules(): Load and validate rules from YAML
compare_datasets_from_yaml(): Main validation function
detect_column_types(): Infer column data types
preprocess_dataframe(): Apply text normalization
analyze_columns(): Compare column structures

Testing

Run the test suite:

devtools::test()

The development of this package has been sponsored by:

License

MIT License

Dev part

Run the checks and the coverage locally (embedded outputs went stale as the package evolved, so this README no longer freezes them at compile time):

devtools::check()
covr::package_coverage()