Introduction to datadiff

datadiff compares two datasets — a reference and a candidate — using validation rules defined in a YAML file. It is built on top of pointblank and supports exact matching, tolerance-based numeric comparisons, text normalization, and row count validation.

The typical workflow is:

  1. Generate a YAML rules template from the reference dataset.
  2. Edit the YAML to configure tolerances, ignored columns, etc.
  3. Call compare_datasets_from_yaml() and inspect the result.

1. A first comparison

library(datadiff)

ref <- data.frame(
  id       = 1:4,
  revenue  = c(1000.00, 2000.00, 3000.00, 4000.00),
  category = c("A", "B", "C", "D"),
  active   = c(TRUE, TRUE, FALSE, TRUE)
)

cand <- data.frame(
  id       = 1:4,
  revenue  = c(1000.005, 2000.001, 3000.009, 4000.00),  # tiny differences
  category = c("a", "b", "c", "D"),                      # lowercase
  active   = c(TRUE, TRUE, FALSE, TRUE)
)

Generate a rules template and tune it:

rules_path <- tempfile(fileext = ".yaml")

write_rules_template(
  ref,
  key                        = "id",
  path                       = rules_path,
  numeric_abs                = 0.01,   # accept differences up to 0.01
  character_case_insensitive = TRUE    # ignore case for all char columns
)

Run the comparison:

result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)
result$all_passed
#> [1] TRUE

2. Understanding the return value

compare_datasets_from_yaml() returns a list with six elements:

names(result)
#> [1] "all_passed"           "agent"                "reponse"             
#> [4] "missing_in_candidate" "extra_in_candidate"   "applied_rules"

Element                 Description
all_passed              TRUE if every validation step passed
agent                   The configured pointblank agent (before interrogation)
reponse                 The interrogated pointblank agent (full results)
missing_in_candidate    Columns present in reference but absent from candidate
extra_in_candidate      Columns present in candidate but absent from reference
applied_rules           The effective per-column rules that were applied

Inspecting applied rules

applied_rules shows the exact rules used for each column — useful to verify that by_name overrides were applied correctly:

result$applied_rules$revenue
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
result$applied_rules$category
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] FALSE

Column presence

result$missing_in_candidate
#> character(0)
result$extra_in_candidate
#> character(0)

Accessing failing rows

When all_passed is FALSE, use pointblank::get_sundered_data() to extract the rows that failed at least one validation step:

ref_fail <- data.frame(id = 1:5, value = c(1, 2, 3, 4, 5))
cand_fail <- data.frame(id = 1:5, value = c(1, 2, 99, 4, 99))  # rows 3 and 5 wrong

result_fail <- compare_datasets_from_yaml(ref_fail, cand_fail, key = "id")
result_fail$all_passed
#> [1] FALSE

# Rows that failed at least one step
failed_rows <- pointblank::get_sundered_data(result_fail$reponse, type = "fail")
failed_rows
#>   id value value__reference value__absdiff value__thresh value__ok row_count_ok
#> 1  3    99                3             96         1e-09     FALSE         TRUE
#> 2  5    99                5             94         1e-09     FALSE         TRUE

The type = "pass" variant returns the rows that passed every step. Comparing the two subsets is a quick way to gauge the scope of the problem before investigating further.

Controlling how many failing rows are extracted

For large datasets, extracting all failing rows can consume significant memory. Three mutually exclusive sampling parameters (get_first_n, sample_n, sample_frac) cap the extraction; sample_limit puts an upper bound on the sample_frac result, and extract_failed = FALSE disables extraction entirely:

# Keep only the first 100 failing rows per validation step
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     get_first_n = 100)

# Random sample of 50 failing rows per step
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     sample_n = 50)

# 10% of failing rows, capped at 500
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     sample_frac = 0.1, sample_limit = 500)

# Disable extraction entirely (fastest — only pass/fail counts are kept)
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     extract_failed = FALSE)

3. Comparing without a YAML file

When path = NULL (the default), datadiff auto-generates rules from the reference dataset structure. This is useful for a quick sanity check without any configuration:

ref_quick  <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))
cand_quick <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))

# No path needed — rules are generated on the fly
result_quick <- compare_datasets_from_yaml(ref_quick, cand_quick, key = "id")
result_quick$all_passed
#> [1] TRUE

The auto-generated rules use near-exact numeric tolerance (abs = 1e-9) and exact character matching — equivalent to calling write_rules_template() with all defaults.


4. YAML rules in depth

write_rules_template() generates a fully annotated YAML. Here is a complete example with all sections explained:

version: 1

defaults:
  na_equal: yes           # treat NA == NA as a pass
  ignore_columns:         # columns excluded from comparison entirely
    - documentation
    - updated_at
  keys: id                # join key (single or composite)
  label: ref vs cand      # label shown in the pointblank report

row_validation:
  check_count: yes
  expected_count: ~       # null = use reference row count
  tolerance: 0            # exact match required

by_type:                  # rules applied to all columns of a given type
  numeric:
    abs: 1.0e-09          # near-exact by default
    rel: 0
  integer:
    abs: 0                # integers must match exactly
  character:
    equal_mode: exact
    case_insensitive: no
    trim: no
  date:
    equal_mode: exact
  datetime:
    equal_mode: exact
  logical:
    equal_mode: exact

by_name:                  # column-specific overrides (take precedence over by_type)
  id: []                  # no override — inherits integer rule
  revenue:
    abs: 0.01             # accept differences up to 0.01
  category:
    case_insensitive: yes
    trim: yes

Rules are merged: by_name entries extend or override by_type entries. A field not listed in by_name keeps its by_type default.

Column      Effective rule                      Source
id          abs: 0                              by_type.integer
revenue     abs: 0.01, rel: 0                   by_name overrides by_type.numeric
category    case_insensitive: yes, trim: yes    by_name overrides by_type.character
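
The merge can be sketched in base R with modifyList(). This is only an illustration of the merge semantics described above, not datadiff's internal code:

```r
# Sketch of the by_type / by_name merge semantics (illustration only;
# not datadiff's internal implementation)
by_type_numeric <- list(abs = 1e-9, rel = 0)
by_name_revenue <- list(abs = 0.01)

# Fields present in by_name override the matching by_type fields;
# fields not listed in by_name keep their by_type defaults
effective <- modifyList(by_type_numeric, by_name_revenue)
str(effective)
#> List of 2
#>  $ abs: num 0.01
#>  $ rel: num 0
```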

Reading rules from a file

Use read_rules() to inspect what was actually loaded — useful for debugging or building tooling on top of datadiff:

loaded <- read_rules(rules_path)

loaded$defaults$na_equal
#> [1] TRUE
loaded$by_type$numeric
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
loaded$by_type$character
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] FALSE

5. A realistic by_name example

The following dataset mixes several column types, each requiring a different validation strategy:

ref_full <- data.frame(
  id          = 1:4,
  price       = c(9.99, 19.99, 4.50, 149.00),      # numeric: small absolute tolerance
  quantity    = c(10L, 5L, 20L, 1L),                # integer: exact
  description = c("Widget A", "Widget B", "  Gadget", "TOOL"),  # needs trim + case
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),          # logical: exact
  created     = as.Date(c("2024-01-01", "2024-01-02",
                           "2024-01-03", "2024-01-04"))
)

cand_full <- data.frame(
  id          = 1:4,
  price       = c(9.995, 19.99, 4.50, 149.00),  # row 1: diff = 0.005 < 0.01
  quantity    = c(10L, 5L, 20L, 1L),
  description = c("widget a", "Widget B", "Gadget", "tool"),  # case + spaces
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),
  created     = as.Date(c("2024-01-01", "2024-01-02",
                           "2024-01-03", "2024-01-04"))
)

Build the YAML and write column-specific overrides:

rules_full <- tempfile(fileext = ".yaml")

write_rules_template(
  ref_full,
  key                        = "id",
  path                       = rules_full,
  numeric_abs                = 1e-9,     # conservative default
  character_case_insensitive = FALSE,    # strict default for character
  character_trim             = FALSE
)

# Read, patch by_name, write back
rules_obj <- read_rules(rules_full)

rules_obj$by_name$price       <- list(abs = 0.01)           # ±0.01 for price
rules_obj$by_name$description <- list(case_insensitive = TRUE, trim = TRUE)

yaml::write_yaml(rules_obj, rules_full)
result_full <- compare_datasets_from_yaml(ref_full, cand_full,
                                          key = "id", path = rules_full)
result_full$all_passed
#> [1] TRUE

# Verify the effective rules for each column
result_full$applied_rules$price
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
result_full$applied_rules$description
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] TRUE
result_full$applied_rules$quantity
#> $abs
#> [1] 0

6. Numeric tolerance

Formula

For every numeric column, the comparison uses a single combined threshold:

threshold = abs + rel × |reference_value|
PASS  if  |candidate − reference| ≤ threshold
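
As a sanity check, the formula can be evaluated directly in base R. This is a standalone sketch; the passes() helper below is hypothetical and not part of datadiff:

```r
# Direct evaluation of the combined-threshold formula
# (passes() is a hypothetical helper, not a datadiff function)
passes <- function(candidate, reference, abs_tol = 0, rel_tol = 0) {
  threshold <- abs_tol + rel_tol * abs(reference)
  abs(candidate - reference) <= threshold
}

passes(1000.005, 1000, abs_tol = 0.01)   # diff 0.005 <= 0.01
#> [1] TRUE
passes(1000.02,  1000, abs_tol = 0.01)   # diff 0.02  >  0.01
#> [1] FALSE
```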

Absolute tolerance (abs)

The threshold is constant, independent of the magnitude of the values:

ref_num  <- data.frame(id = 1:3, price = c(1.00, 1000.00, 1e6))
cand_ok  <- data.frame(id = 1:3, price = c(1.005, 1000.005, 1e6 + 0.005))
cand_nok <- data.frame(id = 1:3, price = c(1.02,  1000.02,  1e6 + 0.02))

rules_abs <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_abs, numeric_abs = 0.01)

compare_datasets_from_yaml(ref_num, cand_ok,  key = "id", path = rules_abs)$all_passed
#> [1] TRUE
compare_datasets_from_yaml(ref_num, cand_nok, key = "id", path = rules_abs)$all_passed
#> [1] FALSE

The same threshold 0.01 applies whether the value is 1 or 1 000 000.

Relative tolerance (rel)

The threshold is proportional to the reference value — useful when you want to accept a percentage deviation:

rules_rel <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_rel,
                     numeric_abs = 0, numeric_rel = 0.01)

# ref = 1000, diff = 9, threshold = 0.01 × 1000 = 10 → PASS
cand_pct <- data.frame(id = 1:3, price = c(1.009, 1009.0, 1e6 * 1.009))
compare_datasets_from_yaml(ref_num, cand_pct, key = "id", path = rules_rel)$all_passed
#> [1] TRUE

Warning: if a reference value is 0, the relative threshold is 0 and any difference will be flagged as an error. Use abs as a safety floor.
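
The zero-reference pitfall is easy to reproduce with the threshold formula itself (a base-R illustration, not package code):

```r
# threshold = abs + rel * |reference|; with abs = 0 and reference = 0,
# the threshold collapses to 0 and any nonzero difference fails
abs(0.0001 - 0) <= 0 + 0.01 * abs(0)
#> [1] FALSE

# An abs floor (here 0.001) rescues near-zero reference values
abs(0.0001 - 0) <= 0.001 + 0.01 * abs(0)
#> [1] TRUE
```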

Mixed mode

Combine both parameters when values span a wide range including near-zero:

by_type:
  numeric:
    abs: 0.001   # floor: protects against false positives when ref ≈ 0
    rel: 0.005   # +0.5% for larger values

For ref = 1 000 000: threshold = 0.001 + 0.005 × 1 000 000 = 5000.001

Rule of thumb: keep rel: 0 (the default) unless you explicitly need a tolerance proportional to the magnitude of the data.

IEEE 754 correction

Floating-point subtraction can introduce rounding errors:

# In double precision, the difference is slightly above 0.01,
# even though the default printing hides it
100.01 - 100.00
#> [1] 0.01
(100.01 - 100.00) > 0.01
#> [1] TRUE

datadiff automatically adds a correction of 8 × .Machine$double.eps × |ref| to the threshold to absorb these representation errors without meaningfully widening the user-specified tolerance.
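
The effect can be verified by hand. The 8 × .Machine$double.eps factor is taken from the description above, and the snippet only illustrates the mechanism:

```r
# The user-specified tolerance alone rejects the representation error...
(100.01 - 100.00) <= 0.01
#> [1] FALSE

# ...while the corrected threshold absorbs it without meaningfully
# widening the tolerance (correction ~ 1.8e-13 for |ref| = 100)
corrected <- 0.01 + 8 * .Machine$double.eps * abs(100.00)
(100.01 - 100.00) <= corrected
#> [1] TRUE
```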

warn_at and stop_at

These two parameters control the pointblank action thresholds, expressed as the fraction of rows that fail a validation step:

result <- compare_datasets_from_yaml(
  ref, cand,
  key     = "id",
  warn_at = 0.05,   # warn if > 5% of rows fail any step
  stop_at = 0.20    # stop (error) if > 20% of rows fail any step
)

The default (1e-14) means that any single failing row triggers the threshold, which is appropriate for data validation where zero differences are expected. Raise these values if you want the report to remain green while a small fraction of rows diverge.


7. Text comparison

Three independent options control character column comparison:

Option                  Effect
case_insensitive: yes   Convert both values to lowercase before comparing
trim: yes               Strip leading/trailing whitespace before comparing
equal_mode: normalized  Apply both transformations

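The two transformations amount to tolower() and trimws() applied before comparison. The sketch below illustrates the semantics in base R; it is not the package's implementation:

```r
# Normalization semantics in base R (illustration only)
normalize <- function(x, case_insensitive = FALSE, trim = FALSE) {
  if (trim)             x <- trimws(x)    # strip leading/trailing whitespace
  if (case_insensitive) x <- tolower(x)   # fold case
  x
}

normalize("  Hello ", case_insensitive = TRUE, trim = TRUE)
#> [1] "hello"
```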
ref_txt  <- data.frame(id = 1:4, label = c("Hello", "World", "Foo", "Bar"))
cand_txt <- data.frame(
  id    = 1:4,
  label = c("hello", "  World  ", "FOO", "Baz")  # case, spaces, mismatch
)

# Strict: rows 1, 2, 3 fail
rules_strict <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_strict)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id",
                           path = rules_strict)$all_passed
#> [1] FALSE

# Relaxed: case + trim — only row 4 ("Baz" vs "Bar") fails
rules_relax <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_relax,
                     character_case_insensitive = TRUE,
                     character_trim             = TRUE)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id",
                           path = rules_relax)$all_passed
#> [1] FALSE

Column-level overrides in by_name apply only to the specified column, leaving all other character columns unaffected.


8. Row count validation

The row_validation section checks that the candidate has the expected number of rows.

ref_rows  <- data.frame(id = 1:5, value = 1:5)
cand_ok   <- data.frame(id = 1:5, value = 1:5)   # 5 rows — exact match
cand_more <- data.frame(id = 1:7, value = 1:7)   # 7 rows — 2 extra

rules_count <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_count,
                     check_count_default         = TRUE,
                     expected_count_default      = 5,
                     row_count_tolerance_default = 0)

compare_datasets_from_yaml(ref_rows, cand_ok,   key = "id",
                           path = rules_count)$all_passed
#> [1] TRUE
compare_datasets_from_yaml(ref_rows, cand_more, key = "id",
                           path = rules_count)$all_passed
#> [1] FALSE

With a tolerance:

rules_tol <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_tol,
                     check_count_default         = TRUE,
                     expected_count_default      = 5,
                     row_count_tolerance_default = 3)  # accept 5 ± 3

# 7 rows: |7 - 5| = 2 ≤ 3 → PASS
compare_datasets_from_yaml(ref_rows, cand_more, key = "id",
                           path = rules_tol)$all_passed
#> [1] TRUE

When expected_count is null in the YAML (or expected_count_default = NULL in write_rules_template()), the reference row count is used as the target.


9. Handling NA values

The na_equal setting controls whether NA == NA is treated as a pass:

ref_na  <- data.frame(id = 1:3, value = c(1.0, NA, 3.0))
cand_na <- data.frame(id = 1:3, value = c(1.0, NA, 3.0))  # identical NAs

# na_equal: yes (default) — NA == NA passes
rules_na_yes <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_yes,
                     na_equal_default = TRUE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id",
                           path = rules_na_yes)$all_passed
#> [1] TRUE

# na_equal: no — NA == NA fails
rules_na_no <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_no,
                     na_equal_default = FALSE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id",
                           path = rules_na_no)$all_passed
#> [1] FALSE

na_equal applies to all column types including numeric (with tolerance), character, logical, and date columns.


10. Column management

Ignoring columns

Columns listed in ignore_columns_default are excluded from comparison. Presence/absence checks for those columns are also skipped:

ref_ign  <- data.frame(id = 1:3, value = 1:3, updated_at = Sys.time())
cand_ign <- data.frame(id = 1:3, value = 1:3,
                       updated_at = Sys.time() + 3600)  # different timestamp

rules_ign <- tempfile(fileext = ".yaml")
write_rules_template(ref_ign, key = "id", path = rules_ign,
                     ignore_columns_default = "updated_at")

compare_datasets_from_yaml(ref_ign, cand_ign, key = "id",
                           path = rules_ign)$all_passed
#> [1] TRUE

Missing and extra columns

Columns present in the reference but absent from the candidate generate a dedicated failing step. Extra columns in the candidate are reported but do not cause a failure:

ref_cols  <- data.frame(id = 1:2, a = 1:2, b = 1:2)
cand_cols <- data.frame(id = 1:2, a = 1:2, c = 1:2)  # b missing, c extra

result_cols <- compare_datasets_from_yaml(ref_cols, cand_cols, key = "id")
result_cols$missing_in_candidate   # b
#> [1] "b"
result_cols$extra_in_candidate     # c
#> [1] "c"
result_cols$all_passed             # FALSE: b is missing
#> [1] FALSE

Utility: analyze_columns()

analyze_columns() exposes the column comparison logic independently — useful for pre-flight checks before running the full validation:

analysis <- analyze_columns(ref_cols, cand_cols,
                            ignore_columns = character(0))
str(analysis)
#> List of 6
#>  $ cols_reference      : chr [1:3] "id" "a" "b"
#>  $ cols_candidate      : chr [1:3] "id" "a" "c"
#>  $ missing_in_candidate: chr "b"
#>  $ extra_in_candidate  : chr "c"
#>  $ common_cols         : chr [1:2] "id" "a"
#>  $ ignored_cols        : chr(0)

11. Key-based vs positional comparison

With a key

A key column joins the candidate to the reference, handling different row orders and unequal row counts gracefully:

ref_key  <- data.frame(id = 1:3, value = c(10, 20, 30))
cand_key <- data.frame(id = c(3, 1, 2), value = c(30, 10, 20))  # shuffled

result_key <- compare_datasets_from_yaml(ref_key, cand_key, key = "id")
result_key$all_passed
#> [1] TRUE

Without a key (positional)

Rows are compared position by position. Both datasets must have the same number of rows:

ref_pos  <- data.frame(value = c(1.0, 2.0, 3.0))
cand_pos <- data.frame(value = c(1.0, 2.0, 3.0))

result_pos <- compare_datasets_from_yaml(ref_pos, cand_pos)
#> key is missing
result_pos$all_passed
#> [1] TRUE

Composite keys

Multiple columns can form a composite key:

ref_comp <- data.frame(
  year  = c(2023, 2023, 2024),
  month = c(1, 2, 1),
  value = c(100, 200, 300)
)
cand_comp <- data.frame(
  year  = c(2024, 2023, 2023),
  month = c(1,    2,    1),
  value = c(300,  200,  100)
)

result_comp <- compare_datasets_from_yaml(ref_comp, cand_comp,
                                          key = c("year", "month"))
result_comp$all_passed
#> [1] TRUE

Key in YAML vs key parameter

The key parameter to compare_datasets_from_yaml() takes precedence over the keys field in the YAML defaults section. This lets you reuse a shared YAML file while overriding the join key programmatically:

rules_key <- tempfile(fileext = ".yaml")
write_rules_template(ref_comp, key = "year", path = rules_key)  # YAML says year

# Override at call time with the composite key
result_override <- compare_datasets_from_yaml(
  ref_comp, cand_comp,
  key  = c("year", "month"),   # overrides YAML
  path = rules_key
)
result_override$all_passed

Duplicate key detection

If key values are not unique, datadiff warns before running the comparison:

ref_dup  <- data.frame(id = c(1, 1, 2), value = c(10, 11, 20))
cand_dup <- data.frame(id = c(1, 2),    value = c(10, 20))

tryCatch(
  compare_datasets_from_yaml(ref_dup, cand_dup, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
#> Warning: Duplicate keys detected! The key column(s) [id] must be unique in both datasets.
#>   - data_reference: 1 duplicate key value(s) affecting 2 rows (examples: id = 1)
#> Comparison results will be unreliable: the join will produce multiple rows per key, leading to incorrect or non-deterministic validation results.
#> Please ensure your key column(s) uniquely identify each row, or choose different key column(s).

12. Type mismatch detection

When a column has incompatible types in reference and candidate, datadiff warns and adds a dedicated failing step — instead of silently coercing or crashing:

ref_type  <- data.frame(id = 1:2, year = c(2023L, 2024L))  # integer
cand_type <- data.frame(id = 1:2, year = c("2023", "2024")) # character

tryCatch(
  compare_datasets_from_yaml(ref_type, cand_type, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
#> Warning: Type mismatch detected in 1 column(s): 'year' (reference: integer, candidate: character). Each will be reported as a validation error.

The integer and numeric types are treated as compatible: tolerance arithmetic works correctly across them, and no type mismatch is raised.
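
This compatibility follows from R's arithmetic promotion: subtracting an integer from a double yields a double, so the threshold formula applies unchanged (base-R illustration):

```r
# Integer reference vs numeric candidate: subtraction promotes to double,
# so tolerance arithmetic works across the two types
ref_year  <- 2023L     # integer
cand_year <- 2023.0    # numeric
abs(cand_year - ref_year) <= 1e-9
#> [1] TRUE
```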


13. Utility functions

detect_column_types()

Returns the datadiff type inferred for each column ("integer", "numeric", "character", "date", "datetime", "logical"):

df_types <- data.frame(
  id        = 1L,
  amount    = 1.5,
  label     = "x",
  flag      = TRUE,
  day       = Sys.Date(),
  timestamp = Sys.time()
)

detect_column_types(df_types)
#>          id      amount       label        flag         day   timestamp 
#>   "integer"   "numeric" "character"   "logical"      "date"  "datetime"

These are the same types used to match columns against by_type rules in the YAML.

derive_column_rules()

Shows the merged per-column rules for a given dataset and rules object — equivalent to result$applied_rules but callable without running the full comparison:

rules_obj2 <- read_rules(rules_path)
merged <- derive_column_rules(ref, rules_obj2)
merged$revenue
#> $abs
#> [1] 0.01
#> 
#> $rel
#> [1] 0
merged$category
#> $equal_mode
#> [1] "exact"
#> 
#> $case_insensitive
#> [1] TRUE
#> 
#> $trim
#> [1] FALSE

analyze_columns()

Already shown in section 10. Useful to quickly check which columns are common, missing, or extra before committing to a full validation run.

preprocess_dataframe()

Applies text normalization rules to a dataframe. Useful for inspecting what the data looks like after normalization, before comparing:

df_raw <- data.frame(label = c("  Hello ", "WORLD", "  Foo  "))
rules_norm <- list(
  label = list(equal_mode = "normalized",
               case_insensitive = TRUE,
               trim = TRUE)
)
preprocess_dataframe(df_raw, rules_norm)
#>   label
#> 1 hello
#> 2 world
#> 3   foo

add_tolerance_columns()

Adds the __absdiff, __thresh, and __ok diagnostic columns to a comparison dataframe. Useful for debugging which rows are right on the edge of the tolerance threshold:

cmp <- data.frame(
  value            = c(1.005, 1.02, 1.0),
  value__reference = c(1.000, 1.00, 1.0)
)
rules_debug <- list(value = list(abs = 0.01, rel = 0))

cmp_annotated <- add_tolerance_columns(cmp, "value", rules_debug,
                                       ref_suffix = "__reference",
                                       na_equal   = TRUE)
cmp_annotated[, c("value__absdiff", "value__thresh", "value__ok")]
#>   value__absdiff value__thresh value__ok
#> 1          0.005          0.01      TRUE
#> 2          0.020          0.01     FALSE
#> 3          0.000          0.01      TRUE

14. Language and locale

By default, pointblank reports are rendered in English. You can change the language per call or globally for a session.

Per call

result_fr <- compare_datasets_from_yaml(
  ref, cand,
  key    = "id",
  lang   = "fr",
  locale = "fr_FR"
)

Global option

Set once in your script or .Rprofile and all subsequent calls will use it:

options(datadiff.lang   = "fr",
        datadiff.locale = "fr_FR")

# All calls now produce French reports without passing lang/locale every time
result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)

Supported languages include "en", "fr", "de", "it", "es", "pt", "zh", "ja", "ru". See the pointblank documentation for the full list.


15. Working with large datasets

Lazy tables (dbplyr)

Any SQL-backed table wrapped in dplyr::tbl() can be passed directly. The join, normalization, and boolean expressions are pushed down to SQL — no data is loaded into R until the final slim result table:

library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "reference", ref)
DBI::dbWriteTable(con, "candidate", cand)

tbl_ref  <- dplyr::tbl(con, "reference")
tbl_cand <- dplyr::tbl(con, "candidate")

result_lazy <- compare_datasets_from_yaml(
  tbl_ref, tbl_cand,
  key  = "id",
  path = rules_path
)

result_lazy$all_passed
DBI::dbDisconnect(con)

This works with any DBI-compatible backend: SQLite, PostgreSQL, Snowflake, etc.

Arrow / Parquet (out-of-core)

For files too large to fit in RAM, pass arrow::open_dataset() directly — the package handles the Arrow → DuckDB conversion internally with a single private connection:

library(arrow)

ds_ref  <- arrow::open_dataset("path/to/reference/")
ds_cand <- arrow::open_dataset("path/to/candidate/")

# Generate a template from the schema (no data loaded into RAM)
write_rules_template(ds_ref, key = "id", path = "rules.yaml")

result <- compare_datasets_from_yaml(
  data_reference      = ds_ref,
  data_candidate      = ds_cand,
  key                 = "id",
  path                = "rules.yaml",
  duckdb_memory_limit = "8GB"    # tune to your machine's RAM
)

result$all_passed

Do not call arrow::to_duckdb() yourself before passing to datadiff. The package opens its own private DuckDB connection; passing pre-converted tables from a different connection will cause a cross-connection join error.

Memory tuning

Machine RAM   Recommended duckdb_memory_limit
8 GB          "3GB"
16 GB         "6GB"
32 GB         "8GB" (default)
64 GB+        "20GB"

Summary

Function                       Role
write_rules_template()         Generate a YAML rules template from a reference dataset
read_rules()                   Load and validate a YAML rules file
compare_datasets_from_yaml()   Compare reference and candidate datasets
detect_column_types()          Inspect the type inferred for each column
derive_column_rules()          See the merged per-column rules for a dataset + rules pair
analyze_columns()              Compare column structure between two datasets
preprocess_dataframe()         Apply text normalization rules to a dataframe
add_tolerance_columns()        Add __absdiff, __thresh, __ok columns for debugging