In the DBmaps package, a foundational step is understanding the structure of individual tables. The primary goal of this is to define how tables can be transformed or summarized so they can be easily joined with other tables.
This vignette will walk you through using the table_info() function to capture this metadata. We will build a “Table of Tables” for a sample retail dataset where each defined aggregation prepares a table for a subsequent merge by grouping it on a potential join key.
table_info()
FunctionThe table_info() function is the primary tool for creating this metadata. It captures descriptive information about a table and its analytical potential, returning a tidy data.table.
The key argument is key_outcome_specs, a list that defines how a table can be aggregated. A crucial rule in DBmaps is that every aggregation method must include one or more GroupingVariables, as this is what makes the resulting aggregated data mergeable.
For the customers table, we want to define how to count the number of customers by region. This prepares the data to be potentially joined with another table that also has a region column.
<- data.table(
customers customer_id = c("C001", "C002", "C003", "C004", "C005"),
region = c("Asia", "Europe", "Asia", "Americas", "Europe")
)
# Use table_info() to capture the metadata
<- table_info(
customers_info_dt table_name = "customers",
source_identifier = "customers.csv",
identifier_columns = "customer_id",
key_outcome_specs = list(
list(
OutcomeName = "CustomerCount",
ValueExpression = 1, # Each row = one customer
AggregationMethods = list(
# ** NOTE **: We only define aggregations that group by a variable.
# The table-wide "TotalCustomers" was removed as it doesn't prepare
# the table for a join on a key.
list(
AggregatedName = "CustomersByRegion",
AggregationFunction = "sum",
GroupingVariables = "region" # Grouped by region
)
)
)
)
)
print(customers_info_dt)
#> table_name source_identifier identifier_columns outcome_name
#> <char> <char> <list> <char>
#> 1: customers customers.csv customer_id CustomerCount
#> value_expression aggregated_name aggregation_function grouping_variable
#> <char> <char> <char> <list>
#> 1: 1 CustomersByRegion sum region
Next, for the products table, our goal is to define how to count the number of products within each category. This would allow us to join this product count to another table containing category-level information.
<- data.table(
products product_id = c("P001", "P002", "P003", "P004", "P005", "P006"),
category = c("A", "B", "A", "C", "B", "C")
)
# Capture metadata via table_info():
<- table_info(
products_info_dt table_name = "products",
source_identifier = "products.csv",
identifier_columns = "product_id",
key_outcome_specs = list(
list(
OutcomeName = "ProductCount",
ValueExpression = 1, # Each row = one product
AggregationMethods = list(
list(
AggregatedName = "ProductsPerCategory",
AggregationFunction = "sum",
GroupingVariables = "category"
)
)
)
)
)
print(products_info_dt)
#> table_name source_identifier identifier_columns outcome_name
#> <char> <char> <list> <char>
#> 1: products products.csv product_id ProductCount
#> value_expression aggregated_name aggregation_function grouping_variable
#> <char> <char> <char> <list>
#> 1: 1 ProductsPerCategory sum category
The transactions table is our central fact table. We will define a “Revenue” outcome that can be aggregated by customer_id or product_id. These aggregations are the most common use case, as they prepare the transactions table to be joined with customers or products.
# Tiny in-memory mimic of "transactions.csv":
<- data.table(
transactions transaction_id = c("T001", "T002", "T003", "T004", "T005"),
customer_id = c("C001", "C002", "C001", "C003", "C004"),
product_id = c("P001", "P002", "P001", "P003", "P002"),
price = c(10, 20, 22, 11, 21),
quantity = c(1, 2, 1, 3, 2)
)
<- table_info(
transactions_info_dt table_name = "transactions",
source_identifier = "transactions.csv",
identifier_columns = "transaction_id",
key_outcome_specs = list(
list(
OutcomeName = "Revenue",
ValueExpression = quote(price * quantity),
AggregationMethods = list(
list(
AggregatedName = "RevenueByCustomer",
AggregationFunction = "sum",
GroupingVariables = "customer_id"
),list(
AggregatedName = "RevenueByProduct",
AggregationFunction = "sum",
GroupingVariables = "product_id"
)
)
)
)
)
print(transactions_info_dt)
#> table_name source_identifier identifier_columns outcome_name
#> <char> <char> <list> <char>
#> 1: transactions transactions.csv transaction_id Revenue
#> 2: transactions transactions.csv transaction_id Revenue
#> value_expression aggregated_name aggregation_function grouping_variable
#> <char> <char> <char> <list>
#> 1: price * quantity RevenueByCustomer sum customer_id
#> 2: price * quantity RevenueByProduct sum product_id
Finally, we can combine the metadata from each table into a single master data.table. This master table gives a complete, queryable overview of all the pre-defined, merge-ready aggregations in our database.
<- rbindlist(
master_metadata_dt list(customers_info_dt, products_info_dt, transactions_info_dt)
)
# Print the combined master metadata
print(master_metadata_dt)
#> table_name source_identifier identifier_columns outcome_name
#> <char> <char> <list> <char>
#> 1: customers customers.csv customer_id CustomerCount
#> 2: products products.csv product_id ProductCount
#> 3: transactions transactions.csv transaction_id Revenue
#> 4: transactions transactions.csv transaction_id Revenue
#> value_expression aggregated_name aggregation_function grouping_variable
#> <char> <char> <char> <list>
#> 1: 1 CustomersByRegion sum region
#> 2: 1 ProductsPerCategory sum category
#> 3: price * quantity RevenueByCustomer sum customer_id
#> 4: price * quantity RevenueByProduct sum product_id
# Show the structure of the combined data.table
cat("\nStructure of the master metadata data.table:\n")
#>
#> Structure of the master metadata data.table:
str(master_metadata_dt)
#> Classes 'data.table' and 'data.frame': 4 obs. of 8 variables:
#> $ table_name : chr "customers" "products" "transactions" "transactions"
#> $ source_identifier : chr "customers.csv" "products.csv" "transactions.csv" "transactions.csv"
#> $ identifier_columns :List of 4
#> ..$ : chr "customer_id"
#> ..$ : chr "product_id"
#> ..$ : chr "transaction_id"
#> ..$ : chr "transaction_id"
#> $ outcome_name : chr "CustomerCount" "ProductCount" "Revenue" "Revenue"
#> $ value_expression : chr "1" "1" "price * quantity" "price * quantity"
#> $ aggregated_name : chr "CustomersByRegion" "ProductsPerCategory" "RevenueByCustomer" "RevenueByProduct"
#> $ aggregation_function: chr "sum" "sum" "sum" "sum"
#> $ grouping_variable :List of 4
#> ..$ : chr "region"
#> ..$ : chr "category"
#> ..$ : chr "customer_id"
#> ..$ : chr "product_id"
#> - attr(*, ".internal.selfref")=<externalptr>