This vignette provides an introduction to the DBmaps
package and its approach to simplifying the analysis of data from
relational databases within R. A common and often repetitive challenge
in data analysis involves aggregating detailed “fact” tables (like
transactions or event logs) before they can be meaningfully joined with
descriptive “dimension” tables (like customers or products).
DBmaps
is designed to streamline and automate this entire
workflow.
The core principle of DBmaps
is a metadata-driven
approach. Instead of hand-writing complex data.table
join and aggregation code, the user first describes the analytical
potential of their tables in a structured format called metadata. The
package then uses this metadata to automatically discover join paths,
create executable plans, and produce the final merged dataset. This
makes the entire process more efficient, scalable, and less prone to
error.
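To make the contrast concrete, here is the kind of manual data.table code this workflow replaces. This is a hedged sketch, assuming a transactions fact table and a products dimension table like the ones used later in this vignette:

```r
library(data.table)

# Manually aggregate the fact table first...
agg <- transactions[, .(revenue = sum(price * quantity)), by = product_id]

# ...then left-join the summary onto the dimension table.
result <- merge(products, agg, by = "product_id", all.x = TRUE)
```

Every new metric or grouping key means writing and debugging another aggregate-then-join pair by hand; DBmaps generates this code from metadata instead.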
This vignette walks through a complete, end-to-end example based on a typical e-commerce dataset. We will perform the following steps:
1. Describe each table's analytical potential with table_info().
2. Collect the metadata with create_metadata_registry() and add_table().
3. Discover join paths with map_join_paths().
4. Generate an executable plan with create_join_plan().
5. Visualize the plan with plot_join_plan().
6. Execute the plan with execute_join_plan().
The example uses four CSV files that represent common tables in an e-commerce database; they are available in the project's GitHub repository.
First, let’s load these tables into our R session.
library(data.table)

customers <- fread("customers.csv")
products <- fread("products.csv")
transactions <- fread("transactions.csv")
views <- fread("views.csv")

transactions[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]
views[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]
cat("--- All 4 Raw Data Tables Loaded Successfully ---\n")
#> --- All 4 Raw Data Tables Loaded Successfully ---
cat("---Customers Data---\n")
#> ---Customers Data---
print(head(customers, 5))
#> customer_id age gender income region
#> <char> <int> <char> <num> <char>
#> 1: ojMHmfN2Sm6q 65 F 64000 West
#> 2: J0gCRLHHBQHP 56 M 20000 South
#> 3: w3sSuV4CKchY 61 F 46000 South
#> 4: vtv2HAYCuytf 55 F 23000 Northeast
#> 5: ROZV6uiSAX06 28 M 26000 West
cat("---Products Data---\n")
#> ---Products Data---
print(head(products, 5))
#> product_id category original_price
#> <char> <char> <num>
#> 1: vfoWflekKTfI8RLd shoes 125.60
#> 2: fv9UcGlbaaC50Dod shoes 111.99
#> 3: 4Qm8RsYXaoob5ei5 shoes 152.99
#> 4: RBZTCdq3jmEYWdQG shoes 109.99
#> 5: 24vpcjMKjW5RjAIX shoes 114.44
cat("---Transactions Data---\n")
#> ---Transactions Data---
print(head(transactions, 5))
#> customer_id product_id time quantity price
#> <char> <char> <POSc> <int> <num>
#> 1: FmE9hPfaAFGa 00FhooNhhAGw7oSK 2020-01-31 15:45:31 1 303.99
#> 2: I3tDJ0lIGCf8 00FhooNhhAGw7oSK 2020-01-22 21:03:29 1 303.99
#> 3: KZ5KtjdTt5nx 00FhooNhhAGw7oSK 2020-01-17 10:44:28 1 303.99
#> 4: VmcGCmN9KewB 00FhooNhhAGw7oSK 2020-01-05 23:21:13 1 303.99
#> 5: ao6KutnPBoWn 00FhooNhhAGw7oSK 2020-01-11 15:16:31 3 303.99
cat("---Views Data---\n")
#> ---Views Data---
print(head(views, 5))
#> customer_id product_id time
#> <char> <char> <POSc>
#> 1: w9cgtNFwluJ2 r72EPsfvcX2kwv7L 2020-01-01 00:00:44
#> 2: 67VJSA0nhtCD VYJfCxlPfHLkWjR9 2020-01-01 00:00:48
#> 3: t3IiGzB9nTpJ MSQ6YkD8WOHQCPk0 2020-01-01 00:00:54
#> 4: sNYLWL9lDLrs Ttpe0cP0sBfMptR0 2020-01-01 00:01:02
#> 5: sNYLWL9lDLrs dSZEgMnGCvuX4y3h 2020-01-01 00:01:19
table_info()
The table_info() function is the primary tool for creating metadata. It captures descriptive information about a table and its analytical potential. The key argument is key_outcome_specs, a list that defines how a table can be aggregated. A crucial rule in DBmaps is that every aggregation method must include one or more GroupingVariables, as this is what makes the resulting aggregated data mergeable.
customers_info <- table_info(
  table_name = "customers", source_identifier = "customers.csv", identifier_columns = "customer_id",
  key_outcome_specs = list(list(OutcomeName = "CustomerCount", ValueExpression = 1, AggregationMethods = list(
    list(AggregatedName = "CountByRegion", AggregationFunction = "sum", GroupingVariables = "region")
  )))
)

products_info <- table_info(
  table_name = "products", source_identifier = "products.csv", identifier_columns = "product_id",
  key_outcome_specs = list(list(OutcomeName = "ProductCount", ValueExpression = 1, AggregationMethods = list(
    list(AggregatedName = "ProductsPerCategory", AggregationFunction = "sum", GroupingVariables = "category")
  )))
)

transactions_info <- table_info(
  table_name = "transactions", source_identifier = "transactions.csv", identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(OutcomeName = "Revenue", ValueExpression = quote(price * quantity), AggregationMethods = list(
      list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id"),
      list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id"),
      list(AggregatedName = "DailyRevenueByCustomerProduct", AggregationFunction = "sum", GroupingVariables = c("customer_id", "product_id", "time"))
    )),
    list(OutcomeName = "UnitsSold", ValueExpression = quote(quantity), AggregationMethods = list(
      list(AggregatedName = "TotalUnitsByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id")
    ))
  )
)

views_info <- table_info(
  table_name = "views", source_identifier = "views.csv", identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(list(OutcomeName = "ViewCount", ValueExpression = 1, AggregationMethods = list(
    list(AggregatedName = "ViewsByProduct", AggregationFunction = "count", GroupingVariables = "product_id"),
    list(AggregatedName = "ViewsByCustomer", AggregationFunction = "count", GroupingVariables = "customer_id")
  )))
)

cat("---Metadata for transactions---\n")
#> ---Metadata for transactions---
print(transactions_info)
#> table_name source_identifier identifier_columns outcome_name
#> <char> <char> <list> <char>
#> 1: transactions transactions.csv customer_id,product_id,time Revenue
#> 2: transactions transactions.csv customer_id,product_id,time Revenue
#> 3: transactions transactions.csv customer_id,product_id,time Revenue
#> 4: transactions transactions.csv customer_id,product_id,time UnitsSold
#> value_expression aggregated_name aggregation_function
#> <char> <char> <char>
#> 1: price * quantity RevenueByCustomer sum
#> 2: price * quantity RevenueByProduct sum
#> 3: price * quantity DailyRevenueByCustomerProduct sum
#> 4: quantity TotalUnitsByCustomer sum
#> grouping_variable
#> <list>
#> 1: customer_id
#> 2: product_id
#> 3: customer_id,product_id,time
#> 4: customer_id
The output of table_info() has several key columns that drive the entire DBmaps system. Here is a brief explanation of what each one does:
table_name: The conceptual name of the table (e.g., “transactions”).
identifier_columns: This list-column contains the name(s) of the column(s) that form the primary key of the raw table. This is what other tables can join TO.
outcome_name: A high-level description of the metric being calculated (e.g., “Revenue”).
value_expression: The R code (as a string) that defines how to calculate the raw value for the outcome from the source table’s columns (e.g., “price * quantity”).
aggregated_name: The final name for the new, aggregated column in the summarized table (e.g., “RevenueByCustomer”). This is the name you will use in your selections when creating a plan.
aggregation_function: The function to apply to the value_expression during grouping (e.g., “sum”).
grouping_variable: This is the most critical column for joins. It is a list-column containing the name(s) of the column(s) to group by. The values in this column define the “join key” that can be used to join this aggregated data FROM.
A crucial rule in DBmaps is that every aggregation method must include one or more GroupingVariables, as this is what makes the resulting aggregated data mergeable with other tables.
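To see how these columns fit together, consider the RevenueByCustomer specification defined earlier: value_expression, aggregation_function, aggregated_name, and grouping_variable combine into a single data.table call. The sketch below shows that translation; it mirrors the code the planner generates rather than the package's internal implementation:

```r
# value_expression:     price * quantity
# aggregation_function: sum
# aggregated_name:      RevenueByCustomer
# grouping_variable:    customer_id
agg <- transactions[, .(RevenueByCustomer = sum(price * quantity)),
                    by = .(customer_id)]
```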
To manage metadata for multiple tables, DBmaps provides a simple registry system. create_metadata_registry() initializes a registry object, and add_table() adds the metadata for each table to it. This avoids the need to manually combine data.table objects.
meta <- create_metadata_registry()

meta <- add_table(meta, customers_info)
#> Added metadata for table: customers
meta <- add_table(meta, products_info)
#> Added metadata for table: products
meta <- add_table(meta, views_info)
#> Added metadata for table: views
meta <- add_table(meta, transactions_info)
#> Added metadata for table: transactions
print(meta)
#> table_name source_identifier identifier_columns outcome_name
#> <char> <char> <list> <char>
#> 1: customers customers.csv customer_id CustomerCount
#> 2: products products.csv product_id ProductCount
#> 3: views views.csv customer_id,product_id,time ViewCount
#> 4: views views.csv customer_id,product_id,time ViewCount
#> 5: transactions transactions.csv customer_id,product_id,time Revenue
#> 6: transactions transactions.csv customer_id,product_id,time Revenue
#> 7: transactions transactions.csv customer_id,product_id,time Revenue
#> 8: transactions transactions.csv customer_id,product_id,time UnitsSold
#> value_expression aggregated_name aggregation_function
#> <char> <char> <char>
#> 1: 1 CountByRegion sum
#> 2: 1 ProductsPerCategory sum
#> 3: 1 ViewsByProduct count
#> 4: 1 ViewsByCustomer count
#> 5: price * quantity RevenueByCustomer sum
#> 6: price * quantity RevenueByProduct sum
#> 7: price * quantity DailyRevenueByCustomerProduct sum
#> 8: quantity TotalUnitsByCustomer sum
#> grouping_variable
#> <list>
#> 1: region
#> 2: category
#> 3: product_id
#> 4: customer_id
#> 5: customer_id
#> 6: product_id
#> 7: customer_id,product_id,time
#> 8: customer_id
map_join_paths()
In any data analysis project involving multiple tables, a primary task is to identify how the tables can be joined, typically by matching foreign keys in one table to primary keys in another. The map_join_paths() function automates this discovery, making it faster and less error-prone.
The function operates in two modes:
Metadata-Driven Discovery: It finds join paths from the metadata alone, matching tables where a defined grouping_variable identically matches another table's identifier_columns (primary key). This is extremely fast and useful for well-documented schemas.
Data-Driven Discovery: Optionally, given the actual data, the function scans column values to find "inferred" joins where key names do not match. This is invaluable for exploring new or messy datasets.
The function is most powerful when both methods are combined, producing a complete and reliable map of all possible connections in your data ecosystem.
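The metadata-driven matching rule itself is simple: a directional join path exists from table A to table B whenever one of A's grouping_variable sets matches B's identifier_columns. The sketch below illustrates that comparison (a simplified, order-insensitive check written for this vignette, not the package's actual implementation):

```r
# Simplified rule: do the grouping variables equal the target's primary key?
key_match <- function(grouping_vars, identifier_cols) {
  identical(sort(grouping_vars), sort(identifier_cols))
}

key_match("customer_id", "customer_id")            # TRUE: joinable
key_match("region", "customer_id")                 # FALSE: no path
key_match(c("customer_id", "product_id", "time"),
          c("time", "customer_id", "product_id"))  # TRUE: same key set
```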
# Create a named list of the actual data tables for the functions to use
all_tables <- list(
  customers = customers,
  products = products,
  transactions = transactions,
  views = views
)

# Generate the join map
paths <- map_join_paths(meta, all_tables)
print(paths)
#> table_from table_to key_from
#> <char> <char> <list>
#> 1: views products product_id
#> 2: views customers customer_id
#> 3: transactions customers customer_id
#> 4: transactions products product_id
#> 5: transactions views customer_id,product_id,time
#> key_to
#> <list>
#> 1: product_id
#> 2: customer_id
#> 3: customer_id
#> 4: product_id
#> 5: customer_id,product_id,time
The resulting “Join Map” shows every valid, directional, many-to-one join that can be performed after aggregation.
create_join_plan()
Now we define our analytical goal. Let’s say we want to create a product-level summary table that includes the product’s category, its total revenue, and the total number of times it was viewed.
The create_join_plan() function translates this high-level request into a concrete, step-by-step recipe of data.table code.
# Define our desired output
selections <- list(
  products = c("product_id", "category"), # Base columns from the products table
  transactions = "RevenueByProduct",      # The aggregated revenue by product
  views = "ViewsByProduct"                # The aggregated view count by product
)

# Generate the plan
plan <- create_join_plan(
  base_table = "products",
  selections = selections,
  metadata_dt = meta,
  join_map = paths
)
print(plan)
#> step operation target details
#> <num> <char> <char> <char>
#> 1: 1 AGGREGATE agg_transactions Aggregate 'transactions'
#> 2: 2 AGGREGATE agg_views Aggregate 'views'
#> 3: 3 MERGE merged_step_3 Merge 'products' with 'agg_transactions'
#> 4: 4 MERGE merged_step_4 Merge 'merged_step_3' with 'agg_views'
#> 5: 5 SELECT final_data Select final columns
#> code
#> <char>
#> 1: agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)), by = .(product_id)]
#> 2: agg_views <- views[, .(ViewsByProduct = .N), by = .(product_id)]
#> 3: merged_step_3 <- merge(x = products, y = agg_transactions, by = c('product_id'), all.x = TRUE)
#> 4: merged_step_4 <- merge(x = merged_step_3, y = agg_views, by = c('product_id'), all.x = TRUE)
#> 5: final_data <- merged_step_4[, .SD, .SDcols = c('product_id','category','RevenueByProduct','ViewsByProduct')]
The output of the planner is a data.table that acts as an executable recipe. Each row represents a step, but the final column, code, is the most powerful feature. It contains the precise, executable R code for each step of the data manipulation process. Let’s break down how the plan achieves our goal:
1. Steps 1 & 2 (AGGREGATE): The planner analyzed our request for RevenueByProduct and ViewsByProduct. It correctly determined from the metadata that these require aggregating the transactions and views tables, respectively. It then generated the exact data.table code to create two new, summarized tables: agg_transactions and agg_views.
2. Steps 3 & 4 (MERGE): The planner then constructed a sequence of join operations. It first merges the aggregated transactions (agg_transactions) onto our products base table. Then, it merges the aggregated views (agg_views) onto the result of the first merge. This sequential chaining correctly combines all the necessary data.
3. Step 5 (SELECT): Finally, the planner generates the code to select only the columns we originally requested from the final, fully merged table.
This demonstrates the core value of the planner: it automates the tedious and error-prone task of determining the correct sequence of aggregations and joins, providing a concrete set of steps that will join the data in our preferred manner.
Handling Invalid Requests
A key feature of the planner is its ability to validate user requests. What happens if we ask for an aggregation that cannot logically be joined to our base table?
Let's ask for RevenueByProduct (grouped by product_id) to be joined to the customers table (keyed by customer_id). This is not a valid join.
# Add product metadata for this example
products_meta <- table_info("products", "p.csv", "product_id", list(list(OutcomeName = "x", ValueExpression = 1, AggregationMethods = list(list(AggregatedName = "y", AggregationFunction = "z", GroupingVariables = "category")))))

transactions_meta_v2 <- table_info("transactions", "t.csv", "trans_id", list(
  list(OutcomeName = "Revenue", ValueExpression = quote(price * qty), AggregationMethods = list(
    # This aggregation is by product_id, not customer_id
    list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id")
  ))
))

invalid_metadata <- create_metadata_registry()
invalid_metadata <- add_table(invalid_metadata, products_meta)
#> Added metadata for table: products
invalid_metadata <- add_table(invalid_metadata, transactions_meta_v2)
#> Added metadata for table: transactions

# The invalid request
invalid_selections <- list(
  customers = "customer_id",
  transactions = "RevenueByProduct"
)
Instead of producing a faulty plan or a cryptic error, create_join_plan() issues a clear, informative warning, skips the invalid selection, and builds a plan from the remaining valid columns.
create_join_plan(
base_table = "customers",
selections = invalid_selections,
metadata_dt = invalid_metadata
)
#> Warning in create_join_plan(base_table = "customers", selections =
#> invalid_selections, : No direct path found from 'transactions' to 'customers'.
#> Skipping this table.
#> step operation target details
#> <num> <char> <char> <char>
#> 1: 1 SELECT final_data Select final columns
#> code
#> <char>
#> 1: final_data <- customers[, .SD, .SDcols = c('customer_id')]
The request is invalid because the join key of the selected aggregation does not match the join key of the base table:
The base table is customers, whose primary join key is customer_id.
The requested RevenueByProduct aggregation comes from the transactions table.
Its grouping variable is product_id.
create_join_plan() correctly sees that there is no direct path to join a table keyed by product_id to a table keyed on customer_id.
This strict validation ensures that only logical and correct data manipulation plans are generated, preventing common data analysis errors.
To make the plan even clearer, we can visualize it as a flowchart
using plot_join_plan()
.
# This requires the DiagrammeR package
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
# Generate the plot object
  visualize <- plot_join_plan(plan)
  visualize
} else {
  cat("Install the 'DiagrammeR' package to visualize the join plan.")
}
The graph shows the flow of data from the source tables (blue boxes) through intermediate aggregations (gray ellipses) to the final merged result and selection (yellow diamond).
The final step is to execute the plan using execute_join_plan(). This function takes the plan and the list of source data tables, runs the generated code, and returns the final dataset.
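Conceptually, the executor seeds an environment with the source tables and evaluates each step's code string in order. The sketch below illustrates the idea; it is an assumption about the behavior written for this vignette, not the package's actual implementation (which may add validation and error handling):

```r
# Minimal executor sketch: run each generated step in one environment
run_plan_sketch <- function(plan, tables) {
  env <- list2env(tables, parent = globalenv())
  for (code in plan$code) {
    eval(parse(text = code), envir = env)
  }
  env$final_data  # the final SELECT step assigns this name
}
```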
# The executor runs the plan in a clean environment
final_dt <- execute_join_plan(plan, all_tables)
#> Executing step 1: agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)), by = .(product_id)]
#> Executing step 2: agg_views <- views[, .(ViewsByProduct = .N), by = .(product_id)]
#> Executing step 3: merged_step_3 <- merge(x = products, y = agg_transactions, by = c('product_id'), all.x = TRUE)
#> Executing step 4: merged_step_4 <- merge(x = merged_step_3, y = agg_views, by = c('product_id'), all.x = TRUE)
#> Executing step 5: final_data <- merged_step_4[, .SD, .SDcols = c('product_id','category','RevenueByProduct','ViewsByProduct')]
# Show the first few rows of the final, merged data.table
print(head(final_dt))
#> Key: <product_id>
#> product_id category RevenueByProduct ViewsByProduct
#> <char> <char> <num> <int>
#> 1: 00FhooNhhAGw7oSK coat 6079.80 6
#> 2: 00WqbkYf3W2wNq3e shoes 1134.07 4
#> 3: 01JF4pjW10biuBjY pants 1189.20 8
#> 4: 02L4UPiT4Dq1qrIg shirt 1209.78 11
#> 5: 03B7HZ3DtP6QKIxH shirt 1420.71 13
#> 6: 03QSka6hKhYLP7y5 shirt 740.00 4
The DBmaps
workflow successfully automated the entire
process. By investing a small amount of time to define metadata, we were
able to automatically discover relationships, generate a complex join
plan, visualize it, and execute it with just a few high-level function
calls. This demonstrates a powerful, scalable, and reproducible approach
to data preparation and analysis in R.