Automated Join Path Discovery

Introduction

In any data analysis project involving multiple tables, a primary task is to identify how these tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The map_join_paths() function automates this discovery process, making it faster and less error-prone.

This function operates in two powerful modes:

  1. Metadata-Driven Discovery: It can find join paths based solely on a metadata definition, matching tables where a defined grouping_variable identically matches another table’s identifier_columns (primary key). This is extremely fast and useful for well-documented schemas.

  2. Data-Driven Discovery: Optionally, by providing the actual data, the function can scan column values to find “inferred” joins where key names do not match. This is invaluable for exploring new or messy datasets.

Case 1: Standard Joins via Metadata

First, let’s define metadata for a standard retail scenario with customers, products, and transactions tables.

# Define metadata for each table
customers_meta <- table_info("customers", "c.csv", "customer_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))

products_meta <- table_info("products", "p.csv", "product_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="category")))))

transactions_meta <- table_info("transactions", "t.csv", "trans_id", list(
  list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
    # This grouping variable will match the primary key of 'customers'
    list(AggregatedName="a", AggregationFunction="sum", GroupingVariables="customer_id"),
    # This one will match the primary key of 'products'
    list(AggregatedName="b", AggregationFunction="sum", GroupingVariables="product_id")
  ))
))

# Combine into a master metadata object
master_meta <- rbindlist(list(customers_meta, products_meta, transactions_meta))

Now, we can find the join paths using only this metadata.

# Find paths without looking at the data
metadata_paths <- map_join_paths(master_meta)
print(metadata_paths)
#>      table_from  table_to    key_from      key_to
#>          <char>    <char>      <list>      <list>
#> 1: transactions customers customer_id customer_id
#> 2: transactions  products  product_id  product_id

The function correctly identifies two METADATA paths: transactions can be joined to customers on customer_id, and to products on product_id.

Case 2: Multi-Variable Key Joins

The function also handles composite keys. Let’s imagine a table daily_promos whose primary key is the combination of product_id and region.

daily_promos_meta <- table_info("daily_promos", "d.csv", c("product_id", "region"), list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))

# Add a grouping variable to transactions that matches this composite key
transactions_multi_meta <- table_info("transactions", "t.csv", "trans_id", list(
  list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
    list(AggregatedName="promo_rev", AggregationFunction="sum", GroupingVariables=c("product_id", "region"))
  ))
))

multi_key_meta <- rbindlist(list(daily_promos_meta, transactions_multi_meta))
multi_key_paths <- map_join_paths(multi_key_meta)
print(multi_key_paths)
#>      table_from     table_to          key_from            key_to
#>          <char>       <char>            <list>            <list>
#> 1: transactions daily_promos product_id,region product_id,region

The function correctly identifies the single, multi-variable join path.

Case 3: Inferring Joins from Data

What if our data is messy and key names don’t align? Consider an inventory table where the product key is called sku, and an orders table that refers to it as product_code.

# Define the data
inventory_data <- data.table(sku = c("s1", "s2", "s3"), stock = c(10, 20, 5))
orders_data <- data.table(order_id = 1:2, customer_ref = "c1", product_code = c("s1", "s2"))

data_list <- list(
  inventory = inventory_data,
  orders = orders_data
)

# Define the metadata. Note the mismatched names.
inventory_meta <- table_info("inventory", "i.csv", "sku", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="stock")))))
orders_meta <- table_info("orders", "o.csv", "order_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="product_code")))))

inferred_meta <- rbindlist(list(inventory_meta, orders_meta))

Running map_join_paths with only metadata would fail to find a path here. But by providing the data_list, we enable the data-driven search.

inferred_paths <- map_join_paths(inferred_meta, data_list = data_list)
print(inferred_paths)
#>    table_from  table_to     key_from key_to
#>        <char>    <char>       <list> <list>
#> 1:     orders inventory product_code    sku

Success! The function scanned the values and found that orders$product_code can be joined to inventory$sku. It correctly marks this path as INFERRED.

Combining Both Methods

The true power of the function is when it combines both methods, automatically de-duplicating and prioritizing explicit metadata joins over inferred ones. This provides a complete and reliable map of all possible connections in your data ecosystem.