Introduction to DBmaps

Introduction

This vignette provides an introduction to the DBmaps package and its approach to simplifying the analysis of data from relational databases within R. A common and often repetitive challenge in data analysis involves aggregating detailed “fact” tables (like transactions or event logs) before they can be meaningfully joined with descriptive “dimension” tables (like customers or products). DBmaps is designed to streamline and automate this entire workflow.

The core principle of DBmaps is a metadata-driven approach. Instead of hand-writing complex data.table join and aggregation code, the user first describes the analytical potential of their tables in a structured format called metadata. The package then uses this metadata to automatically discover join paths, create executable plans, and produce the final merged dataset. This makes the entire process more efficient, scalable, and less error-prone.

A Complete Workflow Example

This vignette walks through a complete, end-to-end example based on a typical e-commerce dataset. We will perform the following steps:

  1. Load the four raw data tables.
  2. Define the metadata for each table using table_info().
  3. Initialize a registry to store all metadata definitions and populate it with create_metadata_registry() and add_table().
  4. Automatically discover all possible join paths with map_join_paths().
  5. Define a desired final dataset and generate an execution plan with create_join_plan().
  6. Visualize the plan with plot_join_plan().
  7. Execute the plan to generate the final data.table with execute_join_plan().

1. The Example Data

The example uses four CSV files (customers.csv, products.csv, transactions.csv, and views.csv) that represent common tables in an e-commerce database. You can download them directly from the project’s GitHub repository.

First, let’s load these tables into our R session.


library(data.table)

# Assumed: the four CSVs have been downloaded to the working directory
customers    <- fread("customers.csv")
products     <- fread("products.csv")
transactions <- fread("transactions.csv")
views        <- fread("views.csv")

# Parse the ISO-8601 timestamp columns
transactions[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]
views[, time := as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]

cat("--- All 4 Raw Data Tables Loaded Successfully ---\n")
#> --- All 4 Raw Data Tables Loaded Successfully ---
cat("---Customers Data---\n")
#> ---Customers Data---
print(head(customers, 5))
#>     customer_id   age gender income    region
#>          <char> <int> <char>  <num>    <char>
#> 1: ojMHmfN2Sm6q    65      F  64000      West
#> 2: J0gCRLHHBQHP    56      M  20000     South
#> 3: w3sSuV4CKchY    61      F  46000     South
#> 4: vtv2HAYCuytf    55      F  23000 Northeast
#> 5: ROZV6uiSAX06    28      M  26000      West
cat("---products Data---\n")
#> ---products Data---
print(head(products, 5))
#>          product_id category original_price
#>              <char>   <char>          <num>
#> 1: vfoWflekKTfI8RLd    shoes         125.60
#> 2: fv9UcGlbaaC50Dod    shoes         111.99
#> 3: 4Qm8RsYXaoob5ei5    shoes         152.99
#> 4: RBZTCdq3jmEYWdQG    shoes         109.99
#> 5: 24vpcjMKjW5RjAIX    shoes         114.44
cat("---Transactions Data---\n")
#> ---Transactions Data---
print(head(transactions, 5))
#>     customer_id       product_id                time quantity  price
#>          <char>           <char>              <POSc>    <int>  <num>
#> 1: FmE9hPfaAFGa 00FhooNhhAGw7oSK 2020-01-31 15:45:31        1 303.99
#> 2: I3tDJ0lIGCf8 00FhooNhhAGw7oSK 2020-01-22 21:03:29        1 303.99
#> 3: KZ5KtjdTt5nx 00FhooNhhAGw7oSK 2020-01-17 10:44:28        1 303.99
#> 4: VmcGCmN9KewB 00FhooNhhAGw7oSK 2020-01-05 23:21:13        1 303.99
#> 5: ao6KutnPBoWn 00FhooNhhAGw7oSK 2020-01-11 15:16:31        3 303.99
cat("---Views Data---\n")
#> ---Views Data---
print(head(views, 5))
#>     customer_id       product_id                time
#>          <char>           <char>              <POSc>
#> 1: w9cgtNFwluJ2 r72EPsfvcX2kwv7L 2020-01-01 00:00:44
#> 2: 67VJSA0nhtCD VYJfCxlPfHLkWjR9 2020-01-01 00:00:48
#> 3: t3IiGzB9nTpJ MSQ6YkD8WOHQCPk0 2020-01-01 00:00:54
#> 4: sNYLWL9lDLrs Ttpe0cP0sBfMptR0 2020-01-01 00:01:02
#> 5: sNYLWL9lDLrs dSZEgMnGCvuX4y3h 2020-01-01 00:01:19

2. Defining Metadata with table_info()

The table_info() function is the primary tool for creating metadata. It captures descriptive information about a table and its analytical potential. The key argument is key_outcome_specs, a list that defines how a table can be aggregated. A crucial rule in DBmaps is that every aggregation method must include one or more GroupingVariables, as this is what makes the resulting aggregated data mergeable.

customers_info <- table_info(
  table_name = "customers", source_identifier = "customers.csv", identifier_columns = "customer_id",
  key_outcome_specs = list(list(OutcomeName = "CustomerCount", ValueExpression = 1, AggregationMethods = list(
    list(AggregatedName = "CountByRegion", AggregationFunction = "sum", GroupingVariables = "region")
  )))
)

products_info <- table_info(
  table_name = "products", source_identifier = "products.csv", identifier_columns = "product_id",
  key_outcome_specs = list(list(OutcomeName = "ProductCount", ValueExpression = 1, AggregationMethods = list(
    list(AggregatedName = "ProductsPerCategory", AggregationFunction = "sum", GroupingVariables = "category")
  )))
)

transactions_info <- table_info(
  table_name = "transactions", source_identifier = "transactions.csv", identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(
    list(OutcomeName = "Revenue", ValueExpression = quote(price * quantity), AggregationMethods = list(
      list(AggregatedName = "RevenueByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id"),
      list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id"),
      list(AggregatedName = "DailyRevenueByCustomerProduct", AggregationFunction = "sum", GroupingVariables = c("customer_id", "product_id", "time"))
    )),
    list(OutcomeName = "UnitsSold", ValueExpression = quote(quantity), AggregationMethods = list(
      list(AggregatedName = "TotalUnitsByCustomer", AggregationFunction = "sum", GroupingVariables = "customer_id")
    ))
  )
)
views_info <- table_info(
  table_name = "views", source_identifier = "views.csv", identifier_columns = c("customer_id", "product_id", "time"),
  key_outcome_specs = list(list(OutcomeName = "ViewCount", ValueExpression = 1, AggregationMethods = list(
    list(AggregatedName = "ViewsByProduct", AggregationFunction = "count", GroupingVariables = "product_id"), 
    list(AggregatedName = "ViewsByCustomer", AggregationFunction = "count", GroupingVariables = "customer_id")
  )))
)
cat("---Metadata for transactions---\n")
#> ---Metadata for transactions---
print(transactions_info)
#>      table_name source_identifier          identifier_columns outcome_name
#>          <char>            <char>                      <list>       <char>
#> 1: transactions  transactions.csv customer_id,product_id,time      Revenue
#> 2: transactions  transactions.csv customer_id,product_id,time      Revenue
#> 3: transactions  transactions.csv customer_id,product_id,time      Revenue
#> 4: transactions  transactions.csv customer_id,product_id,time    UnitsSold
#>    value_expression               aggregated_name aggregation_function
#>              <char>                        <char>               <char>
#> 1: price * quantity             RevenueByCustomer                  sum
#> 2: price * quantity              RevenueByProduct                  sum
#> 3: price * quantity DailyRevenueByCustomerProduct                  sum
#> 4:         quantity          TotalUnitsByCustomer                  sum
#>              grouping_variable
#>                         <list>
#> 1:                 customer_id
#> 2:                  product_id
#> 3: customer_id,product_id,time
#> 4:                 customer_id

Understanding the Metadata Columns

The output of table_info() has several key columns that drive the entire DBmaps system. Here is a brief explanation of what each one does:

table_name: The conceptual name of the table (e.g., “transactions”).

identifier_columns: This list-column contains the name(s) of the column(s) that form the primary key of the raw table. This is what other tables can join TO.

outcome_name: A high-level description of the metric being calculated (e.g., “Revenue”).

value_expression: The R code (as a string) that defines how to calculate the raw value for the outcome from the source table’s columns (e.g., “price * quantity”).

aggregated_name: The final name for the new, aggregated column in the summarized table (e.g., “RevenueByCustomer”). This is the name you will use in your selections when creating a plan.

aggregation_function: The function to apply to the value_expression during grouping (e.g., “sum”).

grouping_variable: This is the most critical column for joins. It is a list-column containing the name(s) of the column(s) to group by. The values in this column define the “join key” that can be used to join this aggregated data FROM.

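To see how these columns drive code generation, consider the RevenueByCustomer row above. Conceptually, the planner translates a single metadata row into one data.table aggregation call. A minimal sketch of that translation (the same pattern appears in the generated plan code in Section 5):

# value_expression becomes the j expression, wrapped in aggregation_function;
# grouping_variable becomes the by clause; aggregated_name names the new column
transactions[, .(RevenueByCustomer = sum(price * quantity)), by = .(customer_id)]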

3. Building a Registry with create_metadata_registry() and add_table()

To manage metadata for multiple tables, DBmaps provides a simple registry system. create_metadata_registry() initializes a registry object, and add_table() adds the metadata for each table to it. This avoids the need to manually combine data.table objects.


meta <- create_metadata_registry()
meta <- add_table(meta, customers_info)
#> Added metadata for table: customers
meta <- add_table(meta, products_info)
#> Added metadata for table: products
meta <- add_table(meta, views_info)
#> Added metadata for table: views
meta <- add_table(meta, transactions_info)
#> Added metadata for table: transactions
print(meta)
#>      table_name source_identifier          identifier_columns  outcome_name
#>          <char>            <char>                      <list>        <char>
#> 1:    customers     customers.csv                 customer_id CustomerCount
#> 2:     products      products.csv                  product_id  ProductCount
#> 3:        views         views.csv customer_id,product_id,time     ViewCount
#> 4:        views         views.csv customer_id,product_id,time     ViewCount
#> 5: transactions  transactions.csv customer_id,product_id,time       Revenue
#> 6: transactions  transactions.csv customer_id,product_id,time       Revenue
#> 7: transactions  transactions.csv customer_id,product_id,time       Revenue
#> 8: transactions  transactions.csv customer_id,product_id,time     UnitsSold
#>    value_expression               aggregated_name aggregation_function
#>              <char>                        <char>               <char>
#> 1:                1                 CountByRegion                  sum
#> 2:                1           ProductsPerCategory                  sum
#> 3:                1                ViewsByProduct                count
#> 4:                1               ViewsByCustomer                count
#> 5: price * quantity             RevenueByCustomer                  sum
#> 6: price * quantity              RevenueByProduct                  sum
#> 7: price * quantity DailyRevenueByCustomerProduct                  sum
#> 8:         quantity          TotalUnitsByCustomer                  sum
#>              grouping_variable
#>                         <list>
#> 1:                      region
#> 2:                    category
#> 3:                  product_id
#> 4:                 customer_id
#> 5:                 customer_id
#> 6:                  product_id
#> 7: customer_id,product_id,time
#> 8:                 customer_id

4. Discovering Join Paths with map_join_paths()

In any data analysis project involving multiple tables, a primary task is to identify how these tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The map_join_paths() function automates this discovery process, making it faster and less error-prone.

This function operates in two powerful modes:

  1. Metadata-Driven Discovery: It can find join paths from the metadata alone, matching one table’s defined grouping_variable exactly against another table’s identifier_columns (primary key). This is extremely fast and well suited to well-documented schemas (see the sketch after this list).

  2. Data-Driven Discovery: Optionally, by providing the actual data, the function can scan column values to find “inferred” joins where key names do not match. This is invaluable for exploring new or messy datasets.
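
For example, metadata-only discovery can be run without supplying any data. A hedged sketch, assuming the table list argument is optional (as the description above implies):

# Metadata-driven discovery only: no data tables supplied
paths_meta_only <- map_join_paths(meta)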

Combining Both Methods

The function is at its most powerful when both methods are combined, yielding a complete and reliable map of every possible connection in your data ecosystem.

# Create a named list of the actual data tables for the functions to use
all_tables <- list(
  customers = customers,
  products = products,
  transactions = transactions,
  views = views
)

# Generate the join map
paths <- map_join_paths(meta, all_tables)
print(paths)
#>      table_from  table_to                    key_from
#>          <char>    <char>                      <list>
#> 1:        views  products                  product_id
#> 2:        views customers                 customer_id
#> 3: transactions customers                 customer_id
#> 4: transactions  products                  product_id
#> 5: transactions     views customer_id,product_id,time
#>                         key_to
#>                         <list>
#> 1:                  product_id
#> 2:                 customer_id
#> 3:                 customer_id
#> 4:                  product_id
#> 5: customer_id,product_id,time

The resulting “Join Map” shows every valid, directional, many-to-one join that can be performed after aggregation.
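
Because the join map is itself a data.table, it can be queried directly. For example, to list only the discovered joins that land on the products table:

# Rows 1 and 4 of the map above: views -> products and transactions -> products
paths[table_to == "products"]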

5. Creating an Execution Plan with create_join_plan()

Now we define our analytical goal. Let’s say we want to create a product-level summary table that includes the product’s category, its total revenue, and the total number of times it was viewed.

The create_join_plan() function translates this high-level request into a concrete, step-by-step recipe of data.table code.

# Define our desired output
selections <- list(
  products = c("product_id", "category"),  # Base columns from the products table
  transactions = "RevenueByProduct",        # The aggregated revenue by product
  views = "ViewsByProduct"                  # The aggregated view count by product
)

# Generate the plan
plan <- create_join_plan(
  base_table = "products",
  selections = selections,
  metadata_dt = meta,
  join_map = paths
)

print(plan)
#>     step operation           target                                  details
#>    <num>    <char>           <char>                                   <char>
#> 1:     1 AGGREGATE agg_transactions                 Aggregate 'transactions'
#> 2:     2 AGGREGATE        agg_views                        Aggregate 'views'
#> 3:     3     MERGE    merged_step_3 Merge 'products' with 'agg_transactions'
#> 4:     4     MERGE    merged_step_4   Merge 'merged_step_3' with 'agg_views'
#> 5:     5    SELECT       final_data                     Select final columns
#>                                                                                                            code
#>                                                                                                          <char>
#> 1:          agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)), by = .(product_id)]
#> 2:                                             agg_views <- views[, .(ViewsByProduct = .N), by = .(product_id)]
#> 3:               merged_step_3 <- merge(x = products, y = agg_transactions, by = c('product_id'), all.x = TRUE)
#> 4:                 merged_step_4 <- merge(x = merged_step_3, y = agg_views, by = c('product_id'), all.x = TRUE)
#> 5: final_data <- merged_step_4[, .SD, .SDcols = c('product_id','category','RevenueByProduct','ViewsByProduct')]

The output of the planner is a data.table that acts as an executable recipe. Each row represents a step, but the final column, code, is the most powerful feature. It contains the precise, executable R code for each step of the data manipulation process. Let’s break down how the plan achieves our goal:

1. Steps 1 & 2 (AGGREGATE): The planner analyzed our request for RevenueByProduct and ViewsByProduct. It correctly determined from the metadata that these require aggregating the transactions and views tables, respectively. It then generated the exact data.table code to create two new, summarized tables: agg_transactions and agg_views.

2. Steps 3 & 4 (MERGE): The planner then constructed a sequence of join operations. It first merges the aggregated transactions (agg_transactions) onto our products base table. Then, it merges the aggregated views (agg_views) onto the result of the first merge. This sequential chaining correctly combines all the necessary data.

3. Step 5 (SELECT): Finally, the planner generates the code to select only the columns we originally requested from the final, fully merged table.

This demonstrates the core value of the planner: it automates the tedious and error-prone task of determining the correct sequence of aggregations and joins, producing a concrete, reproducible set of steps that joins the data exactly as requested.
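
For comparison, here is the same pipeline written by hand: essentially the five generated steps from the code column above, composed manually.

# Manual equivalent of the generated plan
agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)), by = .(product_id)]
agg_views <- views[, .(ViewsByProduct = .N), by = .(product_id)]
merged <- merge(products, agg_transactions, by = "product_id", all.x = TRUE)
merged <- merge(merged, agg_views, by = "product_id", all.x = TRUE)
final_data <- merged[, .(product_id, category, RevenueByProduct, ViewsByProduct)]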

Handling Invalid Requests

A key feature of the planner is its ability to validate user requests. What happens if we ask for an aggregation that cannot logically be joined to our base table?

Let’s ask for RevenueByProduct (grouped by product_id) to be joined to the customers table (keyed by customer_id). This is not a valid join.

# Add product metadata for this example
products_meta <- table_info(
  table_name = "products", source_identifier = "p.csv", identifier_columns = "product_id",
  key_outcome_specs = list(list(OutcomeName = "x", ValueExpression = 1, AggregationMethods = list(
    list(AggregatedName = "y", AggregationFunction = "z", GroupingVariables = "category")
  )))
)
transactions_meta_v2 <- table_info(
  table_name = "transactions", source_identifier = "t.csv", identifier_columns = "trans_id",
  key_outcome_specs = list(
    list(OutcomeName = "Revenue", ValueExpression = quote(price * qty), AggregationMethods = list(
      # This aggregation is by product_id, not customer_id
      list(AggregatedName = "RevenueByProduct", AggregationFunction = "sum", GroupingVariables = "product_id")
    ))
  )
)
invalid_metadata <- create_metadata_registry()
invalid_metadata <- add_table(invalid_metadata, products_meta)
#> Added metadata for table: products
invalid_metadata <- add_table(invalid_metadata, transactions_meta_v2)
#> Added metadata for table: transactions

# The invalid request
invalid_selections <- list(
  customers = "customer_id",
  transactions = "RevenueByProduct"
)

Instead of producing a faulty plan or a cryptic error, create_join_plan() issues a clear, informative warning, skips the unjoinable table, and returns a plan for the remaining valid selections.

create_join_plan(
  base_table = "customers",
  selections = invalid_selections,
  metadata_dt = invalid_metadata
)
#> Warning in create_join_plan(base_table = "customers", selections =
#> invalid_selections, : No direct path found from 'transactions' to 'customers'.
#> Skipping this table.'
#>     step operation     target              details
#>    <num>    <char>     <char>               <char>
#> 1:     1    SELECT final_data Select final columns
#>                                                          code
#>                                                        <char>
#> 1: final_data <- customers[, .SD, .SDcols = c('customer_id')]

The reason this is invalid is that the join key of the selected aggregation does not match the join key of the base table.

  1. The base_table is customers, whose primary join key is customer_id.
  2. The selection asks for the RevenueByProduct aggregation from the transactions table.
  3. According to our metadata, the RevenueByProduct aggregation is grouped by (and therefore keyed on) product_id.
  4. The planner, create_join_plan(), correctly determines that there is no direct path joining a table keyed on product_id to a table keyed on customer_id.

This strict validation ensures that only logical and correct data manipulation plans are generated, preventing common data analysis errors.
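
You can also check for a valid path yourself before requesting a plan. A hedged sketch, again assuming the data argument to map_join_paths() is optional:

# Build the join map for the invalid metadata and look for the required path
invalid_paths <- map_join_paths(invalid_metadata)
invalid_paths[table_from == "transactions" & table_to == "customers"]  # expect zero rows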

6. Visualizing the Plan

To make the plan even clearer, we can visualize it as a flowchart using plot_join_plan().

# This requires the DiagrammeR package
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  # Generate the plot object
  visualize <- plot_join_plan(plan)
  visualize  
} else {
  cat("Install the 'DiagrammeR' package to visualize the join plan.")
}

The graph shows the flow of data from the source tables (blue boxes) through intermediate aggregations (gray ellipses) to the final merged result and selection (yellow diamond).
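
If plot_join_plan() returns an htmlwidget (as DiagrammeR graphs typically are), the flowchart can also be saved as a standalone HTML file. A sketch under that assumption:

# Assumes the plot object is an htmlwidget
if (requireNamespace("htmlwidgets", quietly = TRUE)) {
  htmlwidgets::saveWidget(visualize, "join_plan.html")
}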

7. Executing the Plan

The final step is to execute the plan using execute_join_plan(). This function takes the plan and the list of source data tables and runs the generated code, returning the final dataset.

# The executor runs the plan in a clean environment
final_dt <- execute_join_plan(plan, all_tables) 
#> Executing step 1: agg_transactions <- transactions[, .(RevenueByProduct = sum(price * quantity)), by = .(product_id)]
#> Executing step 2: agg_views <- views[, .(ViewsByProduct = .N), by = .(product_id)]
#> Executing step 3: merged_step_3 <- merge(x = products, y = agg_transactions, by = c('product_id'), all.x = TRUE)
#> Executing step 4: merged_step_4 <- merge(x = merged_step_3, y = agg_views, by = c('product_id'), all.x = TRUE)
#> Executing step 5: final_data <- merged_step_4[, .SD, .SDcols = c('product_id','category','RevenueByProduct','ViewsByProduct')]

# Show the first few rows of the final, merged data.table
print(head(final_dt))
#> Key: <product_id>
#>          product_id category RevenueByProduct ViewsByProduct
#>              <char>   <char>            <num>          <int>
#> 1: 00FhooNhhAGw7oSK     coat          6079.80              6
#> 2: 00WqbkYf3W2wNq3e    shoes          1134.07              4
#> 3: 01JF4pjW10biuBjY    pants          1189.20              8
#> 4: 02L4UPiT4Dq1qrIg    shirt          1209.78             11
#> 5: 03B7HZ3DtP6QKIxH    shirt          1420.71             13
#> 6: 03QSka6hKhYLP7y5    shirt           740.00              4
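
Because every merge in the plan used all.x = TRUE with products as the base table, the left joins preserve the base table’s rows, so the result should contain exactly one row per product. A quick sanity check:

# Left joins onto 'products' keep one row per product
stopifnot(nrow(final_dt) == nrow(products))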

Conclusion

The DBmaps workflow successfully automated the entire process. By investing a small amount of time to define metadata, we were able to automatically discover relationships, generate a complex join plan, visualize it, and execute it with just a few high-level function calls. This demonstrates a powerful, scalable, and reproducible approach to data preparation and analysis in R.