In any data analysis project involving multiple tables, a primary
task is to identify how these tables can be joined together. This
typically involves matching foreign keys in one table to primary keys in
another. The map_join_paths()
function automates this
discovery process, making it faster and less error-prone.
This function operates in two powerful modes:
Metadata-Driven Discovery: It can find join
paths based solely on a metadata definition, matching tables where a
defined grouping_variable
identically matches another
table’s identifier_columns
(primary key). This is extremely
fast and useful for well-documented schemas.
Data-Driven Discovery: Optionally, by providing the actual data, the function can scan column values to find “inferred” joins where key names do not match. This is invaluable for exploring new or messy datasets.
First, let’s define metadata for a standard retail scenario with
customers
, products
, and
transactions
tables.
# Define metadata for each table
<- table_info("customers", "c.csv", "customer_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))
customers_meta
<- table_info("products", "p.csv", "product_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="category")))))
products_meta
<- table_info("transactions", "t.csv", "trans_id", list(
transactions_meta list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
# This grouping variable will match the primary key of 'customers'
list(AggregatedName="a", AggregationFunction="sum", GroupingVariables="customer_id"),
# This one will match the primary key of 'products'
list(AggregatedName="b", AggregationFunction="sum", GroupingVariables="product_id")
))
))
# Combine into a master metadata object
<- rbindlist(list(customers_meta, products_meta, transactions_meta)) master_meta
Now, we can find the join paths using only this metadata.
# Find paths without looking at the data
<- map_join_paths(master_meta)
metadata_paths print(metadata_paths)
#> table_from table_to key_from key_to
#> <char> <char> <list> <list>
#> 1: transactions customers customer_id customer_id
#> 2: transactions products product_id product_id
The function correctly identifies two METADATA
paths:
transactions
can be joined to customers
on
customer_id
, and to products
on
product_id
.
The function also handles composite keys. Let’s imagine a table
daily_promos
whose primary key is the combination of
product_id
and region
.
<- table_info("daily_promos", "d.csv", c("product_id", "region"), list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))
daily_promos_meta
# Add a grouping variable to transactions that matches this composite key
<- table_info("transactions", "t.csv", "trans_id", list(
transactions_multi_meta list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
list(AggregatedName="promo_rev", AggregationFunction="sum", GroupingVariables=c("product_id", "region"))
))
))
<- rbindlist(list(daily_promos_meta, transactions_multi_meta)) multi_key_meta
<- map_join_paths(multi_key_meta)
multi_key_paths print(multi_key_paths)
#> table_from table_to key_from key_to
#> <char> <char> <list> <list>
#> 1: transactions daily_promos product_id,region product_id,region
The function correctly identifies the single, multi-variable join path.
What if our data is messy and key names don’t align? Consider an
inventory
table where the product key is called
sku
, and an orders
table that refers to it as
product_code
.
# Define the data
<- data.table(sku = c("s1", "s2", "s3"), stock = c(10, 20, 5))
inventory_data <- data.table(order_id = 1:2, customer_ref = "c1", product_code = c("s1", "s2"))
orders_data
<- list(
data_list inventory = inventory_data,
orders = orders_data
)
# Define the metadata. Note the mismatched names.
<- table_info("inventory", "i.csv", "sku", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="stock")))))
inventory_meta <- table_info("orders", "o.csv", "order_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="product_code")))))
orders_meta
<- rbindlist(list(inventory_meta, orders_meta)) inferred_meta
Running map_join_paths
with only metadata would fail to
find a path here. But by providing the data_list
, we enable
the data-driven search.
<- map_join_paths(inferred_meta, data_list = data_list)
inferred_paths print(inferred_paths)
#> table_from table_to key_from key_to
#> <char> <char> <list> <list>
#> 1: orders inventory product_code sku
Success! The function scanned the values and found that
orders$product_code
can be joined to
inventory$sku
. It correctly marks this path as
INFERRED
.
The true power of the function is when it combines both methods, automatically de-duplicating and prioritizing explicit metadata joins over inferred ones. This provides a complete and reliable map of all possible connections in your data ecosystem.