Introduction to datadiff

datadiff compares two datasets - a reference and a candidate - using validation rules defined in a YAML file. It is built on top of pointblank and supports exact matching, tolerance-based numeric comparisons, text normalization, and row count validation.

1. A first comparison

library(datadiff)

ref <- data.frame(
  id       = 1:4,
  revenue  = c(1000.00, 2000.00, 3000.00, 4000.00),
  category = c("A", "B", "C", "D"),
  active   = c(TRUE, TRUE, FALSE, TRUE)
)

cand <- data.frame(
  id       = 1:4,
  revenue  = c(1000.005, 2000.001, 3000.009, 4000.00),  # tiny differences
  category = c("a", "b", "c", "D"),                      # lowercase
  active   = c(TRUE, TRUE, FALSE, TRUE)
)

Generate a rules template and tune it:

rules_path <- tempfile(fileext = ".yaml")

write_rules_template(
  ref,
  key                        = "id",
  path                       = rules_path,
  numeric_abs                = 0.01,   # accept differences up to 0.01
  character_case_insensitive = TRUE    # ignore case for all char columns
)

Run the comparison:

result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)
result$all_passed
#> [1] TRUE

2. Understanding the return value

compare_datasets_from_yaml() returns a list with eight elements:

names(result)
#> [1] "all_passed"           "agent"                "reponse"             
#> [4] "missing_in_candidate" "extra_in_candidate"   "applied_rules"       
#> [7] "coverage"             "summary"

Element	Description
`all_passed`	`TRUE` if every validation step passed
`coverage`	One row per check actually performed (see below)
`summary`	Aggregate counts derived from `coverage`
`agent`	The configured pointblank agent (before interrogation)
`reponse`	A `datadiff_report`: a real pointblank agent that renders the full report lazily when printed (see below)
`missing_in_candidate`	Columns present in reference but absent from candidate
`extra_in_candidate`	Columns present in candidate but absent from reference
`applied_rules`	The effective per-column rules that were applied

Seeing exactly what was checked

result$coverage is the first thing to look at: it lists every check that was performed, one row per column, with its check type, the number of rows compared, the number of failures, and a PASS/FAIL status. It is always present - even when everything passes - so a green run never looks empty:

result$coverage
#> datadiff coverage: 7 checks - 7 PASS, 0 FAIL
#>       column      check n n_failed status
#>      revenue col_exists 1        0   PASS
#>     category col_exists 1        0   PASS
#>       active col_exists 1        0   PASS
#>      revenue  tolerance 4        0   PASS
#>     category   equality 4        0   PASS
#>       active   equality 4        0   PASS
#>  <row_count>  row_count 1        0   PASS

result$summary rolls those rows up into counts:

result$summary
#> $n_checks
#> [1] 7
#> 
#> $n_pass
#> [1] 7
#> 
#> $n_fail
#> [1] 0
#> 
#> $n_rows_failed_total
#> [1] 0
#> 
#> $all_passed
#> [1] TRUE

Because coverage is computed directly from the same boolean checks that decide the verdict, result$summary$all_passed always matches result$all_passed. To see only the problems, filter on the status:

result$coverage[result$coverage$status == "FAIL", ]

The full pointblank report

result$reponse stays a genuine pointblank agent - pointblank::all_passed() and pointblank::get_data_extracts() work on it as usual. Printing it renders the full per-column pointblank report (HTML in interactive sessions). That report is built only when you display it, so a comparison itself stays fast even on very wide tables:

print(result$reponse)                              # renders the report on demand
datadiff_report_html(result, file = "report.html") # ...or save it to a file

Inspecting applied rules

applied_rules shows the exact rules used for each column - useful to verify that by_name overrides were applied correctly:

result$applied_rules$revenue
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
result$applied_rules$category
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] FALSE

Column presence

result$missing_in_candidate
#> character(0)
result$extra_in_candidate
#> character(0)

Accessing failing rows

When all_passed is FALSE, use pointblank::get_sundered_data() to extract the rows that failed at least one validation step:

ref_fail <- data.frame(id = 1:5, value = c(1, 2, 3, 4, 5))
cand_fail <- data.frame(id = 1:5, value = c(1, 2, 99, 4, 99))  # rows 3 and 5 wrong

result_fail <- compare_datasets_from_yaml(ref_fail, cand_fail, key = "id")
result_fail$all_passed
#> [1] FALSE

# Rows that failed at least one step
failed_rows <- pointblank::get_sundered_data(result_fail$reponse, type = "fail")
failed_rows
#>   id value value__reference value__ok row_count_ok
#> 1  3    99                3     FALSE         TRUE
#> 2  5    99                5     FALSE         TRUE

The type = "pass" variant returns rows that passed all steps. This is useful to understand the scope of the problem before investigating further.

Controlling how many failing rows are extracted

For large datasets, extracting all failing rows can consume significant memory. Three mutually exclusive parameters cap this:

# Keep only the first 100 failing rows per validation step
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     get_first_n = 100)

# Random sample of 50 failing rows per step
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     sample_n = 50)

# 10% of failing rows, capped at 500
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     sample_frac = 0.1, sample_limit = 500)

# Disable extraction entirely (fastest - only pass/fail counts are kept)
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     extract_failed = FALSE)

3. Comparing without a YAML file

When path = NULL (the default), datadiff auto-generates rules from the reference dataset structure. This is useful for a quick sanity check without any configuration:

ref_quick  <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))
cand_quick <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))

# No path needed - rules are generated on the fly
result_quick <- compare_datasets_from_yaml(ref_quick, cand_quick, key = "id")
result_quick$all_passed
#> [1] TRUE

The auto-generated rules use near-exact numeric tolerance (abs = 1e-9) and exact character matching - equivalent to calling write_rules_template() with all defaults.

4. YAML rules in depth

write_rules_template() generates a fully annotated YAML. Here is a complete example with all sections explained:

version: 1

defaults:
  na_equal: yes           # treat NA == NA as a pass
  ignore_columns:         # columns excluded from comparison entirely
    - documentation
    - updated_at
  keys: id                # join key (single or composite)
  label: ref vs cand      # label shown in the pointblank report

row_validation:
  check_count: yes
  expected_count: ~       # null = use reference row count
  tolerance: 0            # exact match required

by_type:                  # rules applied to all columns of a given type
  numeric:
    abs: 1.0e-09          # near-exact by default
    rel: 0
  integer:
    abs: 0                # integers must match exactly
  character:
    equal_mode: exact
    case_insensitive: no
    trim: no
  date:
    equal_mode: exact
  datetime:
    equal_mode: exact
  logical:
    equal_mode: exact

by_name:                  # column-specific overrides (take precedence over by_type)
  id: []                  # no override - inherits integer rule
  revenue:
    abs: 0.01             # accept differences up to 0.01
  category:
    case_insensitive: yes
    trim: yes

Rules are merged: by_name entries extend or override by_type entries. A field not listed in by_name keeps its by_type default.

Column	Effective rule	Source
`id`	`abs: 0`	`by_type.integer`
`revenue`	`abs: 0.01, rel: 0`	`by_name` overrides `by_type.numeric`
`category`	`case_insensitive: yes, trim: yes`	`by_name` overrides `by_type.character`

Reading rules from a file

Use read_rules() to inspect what was actually loaded - useful for debugging or building tooling on top of datadiff:

loaded <- read_rules(rules_path)

loaded$defaults$na_equal
#> [1] TRUE
loaded$by_type$numeric
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
loaded$by_type$character
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] FALSE

5. A realistic `by_name` example

The following dataset mixes several column types, each requiring a different validation strategy:

ref_full <- data.frame(
  id          = 1:4,
  price       = c(9.99, 19.99, 4.50, 149.00),      # numeric: small absolute tolerance
  quantity    = c(10L, 5L, 20L, 1L),                # integer: exact
  description = c("Widget A", "Widget B", "  Gadget", "TOOL"),  # needs trim + case
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),          # logical: exact
  created     = as.Date(c("2024-01-01", "2024-01-02",
                           "2024-01-03", "2024-01-04"))
)

cand_full <- data.frame(
  id          = 1:4,
  price       = c(9.995, 19.99, 4.50, 149.00),  # row 1: diff = 0.005 < 0.01
  quantity    = c(10L, 5L, 20L, 1L),
  description = c("widget a", "Widget B", "Gadget", "tool"),  # case + spaces
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),
  created     = as.Date(c("2024-01-01", "2024-01-02",
                           "2024-01-03", "2024-01-04"))
)

Build the YAML and write column-specific overrides:

rules_full <- tempfile(fileext = ".yaml")

write_rules_template(
  ref_full,
  key                        = "id",
  path                       = rules_full,
  numeric_abs                = 1e-9,     # conservative default
  character_case_insensitive = FALSE,    # strict default for character
  character_trim             = FALSE
)

# Read, patch by_name, write back
rules_obj <- read_rules(rules_full)

rules_obj$by_name$price       <- list(abs = 0.01)           # +/-0.01 for price
rules_obj$by_name$description <- list(case_insensitive = TRUE, trim = TRUE)

yaml::write_yaml(rules_obj, rules_full)

result_full <- compare_datasets_from_yaml(ref_full, cand_full,
                                          key = "id", path = rules_full)
result_full$all_passed
#> [1] TRUE

# Verify the effective rules for each column
result_full$applied_rules$price
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
result_full$applied_rules$description
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] TRUE
result_full$applied_rules$quantity
#> $abs
#> [1] 0

6. Numeric tolerance

Formula

For every numeric column, the comparison uses a single combined threshold:

threshold = abs + rel * |reference_value|
PASS  if  |candidate - reference| <= threshold

Absolute tolerance (`abs`)

The threshold is constant, independent of the magnitude of the values:

ref_num  <- data.frame(id = 1:3, price = c(1.00, 1000.00, 1e6))
cand_ok  <- data.frame(id = 1:3, price = c(1.005, 1000.005, 1e6 + 0.005))
cand_nok <- data.frame(id = 1:3, price = c(1.02,  1000.02,  1e6 + 0.02))

rules_abs <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_abs, numeric_abs = 0.01)

compare_datasets_from_yaml(ref_num, cand_ok,  key = "id", path = rules_abs)$all_passed
#> [1] TRUE
compare_datasets_from_yaml(ref_num, cand_nok, key = "id", path = rules_abs)$all_passed
#> [1] FALSE

The same threshold 0.01 applies whether the value is 1 or 1 000 000.

Relative tolerance (`rel`)

The threshold is proportional to the reference value - useful when you want to accept a percentage deviation:

rules_rel <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_rel,
                     numeric_abs = 0, numeric_rel = 0.01)

# ref = 1000, diff = 9, threshold = 0.01 * 1000 = 10 -> PASS
cand_pct <- data.frame(id = 1:3, price = c(1.009, 1009.0, 1e6 * 1.009))
compare_datasets_from_yaml(ref_num, cand_pct, key = "id", path = rules_rel)$all_passed
#> [1] TRUE

Warning: if a reference value is 0, the relative threshold is 0 and any difference will be flagged as an error. Use abs as a safety floor.

Mixed mode

Combine both parameters when values span a wide range including near-zero:

by_type:
  numeric:
    abs: 0.001   # floor: protects against false positives when ref ~ 0
    rel: 0.005   # +0.5% for larger values

For ref = 1 000 000: threshold = 0.001 + 0.005 * 1 000 000 = 5000.001

Rule of thumb: keep rel: 0 (the default) unless you explicitly need a tolerance proportional to the magnitude of the data.

IEEE 754 correction

Floating-point subtraction can introduce rounding errors:

# In double precision, this is slightly above 0.01
100.01 - 100.00
#> [1] 0.01

datadiff automatically adds a correction of 8 * .Machine$double.eps * |ref| to the threshold to absorb these representation errors without meaningfully widening the user-specified tolerance.

`warn_at` and `stop_at`

These two parameters control the pointblank action thresholds, expressed as the fraction of rows that fail a validation step:

result <- compare_datasets_from_yaml(
  ref, cand,
  key     = "id",
  warn_at = 0.05,   # warn if > 5% of rows fail any step
  stop_at = 0.20    # stop (error) if > 20% of rows fail any step
)

The default (1e-14) means that any single failing row triggers the threshold, which is appropriate for data validation where zero differences are expected. Raise these values if you want the report to remain green while a small fraction of rows diverge.

7. Text comparison

Three independent options control character column comparison:

Option	Effect
`case_insensitive: yes`	Convert both values to lowercase before comparing
`trim: yes`	Strip leading/trailing whitespace before comparing
`equal_mode: normalized`	Apply both transformations

ref_txt  <- data.frame(id = 1:4, label = c("Hello", "World", "Foo", "Bar"))
cand_txt <- data.frame(
  id    = 1:4,
  label = c("hello", "  World  ", "FOO", "Baz")  # case, spaces, mismatch
)

# Strict: rows 1, 2, 3 fail
rules_strict <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_strict)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id",
                           path = rules_strict)$all_passed
#> [1] FALSE

# Relaxed: case + trim - only row 4 ("Baz" vs "Bar") fails
rules_relax <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_relax,
                     character_case_insensitive = TRUE,
                     character_trim             = TRUE)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id",
                           path = rules_relax)$all_passed
#> [1] FALSE

Column-level overrides in by_name apply only to the specified column, leaving all other character columns unaffected.

8. Row count validation

The row_validation section checks that the candidate has the expected number of rows.

ref_rows  <- data.frame(id = 1:5, value = 1:5)
cand_ok   <- data.frame(id = 1:5, value = 1:5)   # 5 rows - exact match
cand_more <- data.frame(id = 1:7, value = 1:7)   # 7 rows - 2 extra

rules_count <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_count,
                     check_count_default         = TRUE,
                     expected_count_default      = 5,
                     row_count_tolerance_default = 0)

compare_datasets_from_yaml(ref_rows, cand_ok,   key = "id",
                           path = rules_count)$all_passed
#> [1] TRUE
compare_datasets_from_yaml(ref_rows, cand_more, key = "id",
                           path = rules_count)$all_passed
#> [1] FALSE

With a tolerance:

rules_tol <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_tol,
                     check_count_default         = TRUE,
                     expected_count_default      = 5,
                     row_count_tolerance_default = 3)  # accept 5 +/- 3

# row_count check PASSES: |7 - 5| = 2 is within tolerance 3.
# all_passed is still FALSE here: the 2 extra keyed rows (id 6, 7) have
# no reference counterpart and fail the per-column value comparison.
compare_datasets_from_yaml(ref_rows, cand_more, key = "id",
                           path = rules_tol)$all_passed
#> [1] FALSE

When expected_count is null in the YAML (or expected_count_default = NULL in write_rules_template()), the reference row count is used as the target.

9. Handling NA values

The na_equal setting controls whether NA == NA is treated as a pass:

ref_na  <- data.frame(id = 1:3, value = c(1.0, NA, 3.0))
cand_na <- data.frame(id = 1:3, value = c(1.0, NA, 3.0))  # identical NAs

# na_equal: yes (default) - NA == NA passes
rules_na_yes <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_yes,
                     na_equal_default = TRUE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id",
                           path = rules_na_yes)$all_passed
#> [1] TRUE

# na_equal: no - NA == NA fails
rules_na_no <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_no,
                     na_equal_default = FALSE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id",
                           path = rules_na_no)$all_passed
#> [1] FALSE

na_equal applies to all column types including numeric (with tolerance), character, logical, and date columns.

10. Column management

Ignoring columns

Columns listed in ignore_columns_default are excluded from comparison. Presence/absence checks for those columns are also skipped:

ref_ign  <- data.frame(id = 1:3, value = 1:3, updated_at = Sys.time())
cand_ign <- data.frame(id = 1:3, value = 1:3,
                       updated_at = Sys.time() + 3600)  # different timestamp

rules_ign <- tempfile(fileext = ".yaml")
write_rules_template(ref_ign, key = "id", path = rules_ign,
                     ignore_columns_default = "updated_at")

compare_datasets_from_yaml(ref_ign, cand_ign, key = "id",
                           path = rules_ign)$all_passed
#> [1] TRUE

Missing and extra columns

Columns present in the reference but absent from the candidate generate a dedicated failing step. Extra columns in the candidate are reported but do not cause a failure:

ref_cols  <- data.frame(id = 1:2, a = 1:2, b = 1:2)
cand_cols <- data.frame(id = 1:2, a = 1:2, c = 1:2)  # b missing, c extra

result_cols <- compare_datasets_from_yaml(ref_cols, cand_cols, key = "id")
result_cols$missing_in_candidate   # b
#> [1] "b"
result_cols$extra_in_candidate     # c
#> [1] "c"
result_cols$all_passed             # FALSE: b is missing
#> [1] FALSE

Utility: `analyze_columns()`

analyze_columns() exposes the column comparison logic independently - useful for pre-flight checks before running the full validation:

analysis <- analyze_columns(ref_cols, cand_cols,
                            ignore_columns = character(0))
str(analysis)
#> List of 6
#>  $ cols_reference      : chr [1:3] "id" "a" "b"
#>  $ cols_candidate      : chr [1:3] "id" "a" "c"
#>  $ missing_in_candidate: chr "b"
#>  $ extra_in_candidate  : chr "c"
#>  $ common_cols         : chr [1:2] "id" "a"
#>  $ ignored_cols        : chr(0)

11. Key-based vs positional comparison

With a key

A key column joins the candidate to the reference, handling different row orders and unequal row counts gracefully:

ref_key  <- data.frame(id = 1:3, value = c(10, 20, 30))
cand_key <- data.frame(id = c(3, 1, 2), value = c(30, 10, 20))  # shuffled

result_key <- compare_datasets_from_yaml(ref_key, cand_key, key = "id")
result_key$all_passed
#> [1] TRUE

Without a key (positional)

Rows are compared position by position. Both datasets must have the same number of rows:

ref_pos  <- data.frame(value = c(1.0, 2.0, 3.0))
cand_pos <- data.frame(value = c(1.0, 2.0, 3.0))

result_pos <- compare_datasets_from_yaml(ref_pos, cand_pos)
#> key is missing
result_pos$all_passed
#> [1] TRUE

Composite keys

Multiple columns can form a composite key:

ref_comp <- data.frame(
  year  = c(2023, 2023, 2024),
  month = c(1, 2, 1),
  value = c(100, 200, 300)
)
cand_comp <- data.frame(
  year  = c(2024, 2023, 2023),
  month = c(1,    2,    1),
  value = c(300,  200,  100)
)

result_comp <- compare_datasets_from_yaml(ref_comp, cand_comp,
                                          key = c("year", "month"))
result_comp$all_passed
#> [1] TRUE

Key in YAML vs key parameter

The key parameter to compare_datasets_from_yaml() takes precedence over the keys field in the YAML defaults section. This lets you reuse a shared YAML file while overriding the join key programmatically:

rules_key <- tempfile(fileext = ".yaml")
write_rules_template(ref_comp, key = "year", path = rules_key)  # YAML says year

# Override at call time with the composite key
result_override <- compare_datasets_from_yaml(
  ref_comp, cand_comp,
  key  = c("year", "month"),   # overrides YAML
  path = rules_key
)
result_override$all_passed

Duplicate key detection

If key values are not unique, datadiff warns before running the comparison:

ref_dup  <- data.frame(id = c(1, 1, 2), value = c(10, 11, 20))
cand_dup <- data.frame(id = c(1, 2),    value = c(10, 20))

tryCatch(
  compare_datasets_from_yaml(ref_dup, cand_dup, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
#> Warning: Duplicate keys detected! The key column(s) [id] must be unique in both datasets.
#>   - data_reference: 1 duplicate key value(s) affecting 2 rows (examples: id = 1)
#> Comparison results will be unreliable: the join will produce multiple rows per key, leading to incorrect or non-deterministic validation results.
#> Please ensure your key column(s) uniquely identify each row, or choose different key column(s).

12. Type mismatch detection

When a column has incompatible types in reference and candidate, datadiff warns and adds a dedicated failing step - instead of silently coercing or crashing:

ref_type  <- data.frame(id = 1:2, year = c(2023L, 2024L))  # integer
cand_type <- data.frame(id = 1:2, year = c("2023", "2024")) # character

tryCatch(
  compare_datasets_from_yaml(ref_type, cand_type, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
#> Warning: Type mismatch detected in 1 column(s): 'year' (reference: integer, candidate: character). Each will be reported as a validation error.

integer and numeric are treated as compatible types - tolerance arithmetic works correctly across them and no mismatch is raised.

13. Utility functions

`detect_column_types()`

Returns the datadiff type inferred for each column ("integer", "numeric", "character", "date", "datetime", "logical"):

df_types <- data.frame(
  id        = 1L,
  amount    = 1.5,
  label     = "x",
  flag      = TRUE,
  day       = Sys.Date(),
  timestamp = Sys.time()
)

detect_column_types(df_types)
#>          id      amount       label        flag         day   timestamp 
#>   "integer"   "numeric" "character"   "logical"      "date"  "datetime"

These are the same types used to match columns against by_type rules in the YAML.

`derive_column_rules()`

Shows the merged per-column rules for a given dataset and rules object - equivalent to result$applied_rules but callable without running the full comparison:

rules_obj2 <- read_rules(rules_path)
merged <- derive_column_rules(ref, rules_obj2)
merged$revenue
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
merged$category
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] FALSE

`analyze_columns()`

Already shown in section 10. Useful to quickly check which columns are common, missing, or extra before committing to a full validation run.

`preprocess_dataframe()`

Applies text normalization rules to a dataframe. Useful for inspecting what the data looks like after normalization, before comparing:

df_raw <- data.frame(label = c("  Hello ", "WORLD", "  Foo  "))
rules_norm <- list(
  label = list(equal_mode = "normalized",
               case_insensitive = TRUE,
               trim = TRUE)
)
preprocess_dataframe(df_raw, rules_norm)
#>   label
#> 1 hello
#> 2 world
#> 3   foo

`add_tolerance_columns()`

Adds the __absdiff, __thresh, and __ok diagnostic columns to a comparison dataframe. Useful for debugging which rows are right on the edge of the tolerance threshold:

cmp <- data.frame(
  value            = c(1.005, 1.02, 1.0),
  value__reference = c(1.000, 1.00, 1.0)
)
rules_debug <- list(value = list(abs = 0.01, rel = 0))

cmp_annotated <- add_tolerance_columns(cmp, "value", rules_debug,
                                       ref_suffix = "__reference",
                                       na_equal   = TRUE)
cmp_annotated[, c("value__absdiff", "value__thresh", "value__ok")]
#>   value__absdiff value__thresh value__ok
#> 1          0.005          0.01      TRUE
#> 2          0.020          0.01     FALSE
#> 3          0.000          0.01      TRUE

14. Language and locale

By default, pointblank reports are rendered in English. You can change the language per call or globally for a session.

Per call

result_fr <- compare_datasets_from_yaml(
  ref, cand,
  key    = "id",
  lang   = "fr",
  locale = "fr_FR"
)

Global option

Set once in your script or .Rprofile and all subsequent calls will use it:

options(datadiff.lang   = "fr",
        datadiff.locale = "fr_FR")

# All calls now produce French reports without passing lang/locale every time
result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)

Supported languages include "en", "fr", "de", "it", "es", "pt", "zh", "ja", "ru". See the pointblank documentation for the full list.

15. Working with large datasets

Many columns (wide tables)

datadiff first computes the verdict in a single vectorised pass over boolean comparison columns. When everything passes, it skips building a per-column pointblank agent entirely (that step grows roughly quadratically with the number of columns), so a wide reference - hundreds or thousands of columns - validates in seconds instead of minutes. When something fails, only the columns that actually fail get a pointblank step, so the cost scales with the number of failing columns, not the total.

This optimisation is invisible to you: all_passed, coverage, and the extracted failing cells are exactly what a full per-column run would produce. The only thing deferred is the rendered report, which is built on demand when you print result$reponse (section 2).

Lazy tables (dbplyr)

Any SQL-backed table wrapped in dplyr::tbl() can be passed directly. The join, normalization, and boolean expressions are pushed down to SQL - no data is loaded into R until the final slim result table:

library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "reference", ref)
DBI::dbWriteTable(con, "candidate", cand)

tbl_ref  <- dplyr::tbl(con, "reference")
tbl_cand <- dplyr::tbl(con, "candidate")

result_lazy <- compare_datasets_from_yaml(
  tbl_ref, tbl_cand,
  key  = "id",
  path = rules_path
)

result_lazy$all_passed
DBI::dbDisconnect(con)

This works with any DBI-compatible backend: SQLite, PostgreSQL, Snowflake, etc.

Arrow / Parquet (out-of-core)

For files too large to fit in RAM, pass arrow::open_dataset() directly - the package handles the Arrow -> DuckDB conversion internally with a single private connection:

library(arrow)

ds_ref  <- arrow::open_dataset("path/to/reference/")
ds_cand <- arrow::open_dataset("path/to/candidate/")

# Generate a template from the schema (no data loaded into RAM)
write_rules_template(ds_ref, key = "id", path = "rules.yaml")

result <- compare_datasets_from_yaml(
  data_reference      = ds_ref,
  data_candidate      = ds_cand,
  key                 = "id",
  path                = "rules.yaml",
  duckdb_memory_limit = "8GB"    # tune to your machine's RAM
)

result$all_passed

Do not call arrow::to_duckdb() yourself before passing to datadiff. The package opens its own private DuckDB connection; passing pre-converted tables from a different connection will cause a cross-connection join error.

Memory tuning

Machine RAM	Recommended `duckdb_memory_limit`
8 GB	`"3GB"`
16 GB	`"6GB"`
32 GB	`"8GB"` (default)
64 GB+	`"20GB"`

Function	Role
`write_rules_template()`	Generate a YAML rules template from a reference dataset
`read_rules()`	Load and validate a YAML rules file
`compare_datasets_from_yaml()`	Compare reference and candidate datasets
`detect_column_types()`	Inspect the type inferred for each column
`derive_column_rules()`	See the merged per-column rules for a dataset + rules pair
`analyze_columns()`	Compare column structure between two datasets
`preprocess_dataframe()`	Apply text normalization rules to a dataframe
`add_tolerance_columns()`	Add `__absdiff`, `__thresh`, `__ok` columns for debugging