Defining Table Metadata with table

Introduction

In the DBmaps package, a foundational step is understanding the structure of individual tables. The primary goal of this is to define how tables can be transformed or summarized so they can be easily joined with other tables.

This vignette will walk you through using the table_info() function to capture this metadata. We will build a “Table of Tables” for a sample retail dataset where each defined aggregation prepares a table for a subsequent merge by grouping it on a potential join key.

The `table_info()` Function

The table_info() function is the primary tool for creating this metadata. It captures descriptive information about a table and its analytical potential, returning a tidy data.table.

The key argument is key_outcome_specs, a list that defines how a table can be aggregated. A crucial rule in DBmaps is that every aggregation method must include one or more GroupingVariables, as this is what makes the resulting aggregated data mergeable.

Example 1: The Customers Table

For the customers table, we want to define how to count the number of customers by region. This prepares the data to be potentially joined with another table that also has a region column.

customers <- data.table(
  customer_id = c("C001", "C002", "C003", "C004", "C005"),
  region = c("Asia", "Europe", "Asia", "Americas", "Europe")
)

# Use table_info() to capture the metadata
customers_info_dt <- table_info(
  table_name = "customers",
  source_identifier = "customers.csv",
  identifier_columns = "customer_id",
  key_outcome_specs = list(
    list(
      OutcomeName = "CustomerCount",
      ValueExpression = 1,  # Each row = one customer
      AggregationMethods = list(
        # ** NOTE **: We only define aggregations that group by a variable.
        # The table-wide "TotalCustomers" was removed as it doesn't prepare
        # the table for a join on a key.
        list(
          AggregatedName = "CustomersByRegion",
          AggregationFunction = "sum",
          GroupingVariables = "region"  # Grouped by region
        )
      )
    )
  )
)

print(customers_info_dt)
#>    table_name source_identifier identifier_columns  outcome_name
#>        <char>            <char>             <list>        <char>
#> 1:  customers     customers.csv        customer_id CustomerCount
#>    value_expression   aggregated_name aggregation_function grouping_variable
#>              <char>            <char>               <char>            <list>
#> 1:                1 CustomersByRegion                  sum            region

Example 2: The Products Table

Next, for the products table, our goal is to define how to count the number of products within each category. This would allow us to join this product count to another table containing category-level information.

products <- data.table(
  product_id = c("P001", "P002", "P003", "P004", "P005", "P006"),
  category   = c("A", "B", "A", "C", "B", "C")
)

# Capture metadata via table_info():
products_info_dt <- table_info(
  table_name = "products",
  source_identifier = "products.csv",
  identifier_columns = "product_id",
  key_outcome_specs = list(
    list(
      OutcomeName = "ProductCount",
      ValueExpression = 1,  # Each row = one product
      AggregationMethods = list(
        list(
          AggregatedName = "ProductsPerCategory",
          AggregationFunction = "sum",
          GroupingVariables = "category"
        )
      )
    )
  )
)

print(products_info_dt)
#>    table_name source_identifier identifier_columns outcome_name
#>        <char>            <char>             <list>       <char>
#> 1:   products      products.csv         product_id ProductCount
#>    value_expression     aggregated_name aggregation_function grouping_variable
#>              <char>              <char>               <char>            <list>
#> 1:                1 ProductsPerCategory                  sum          category

Example 3: The Transactions Table

The transactions table is our central fact table. We will define a “Revenue” outcome that can be aggregated by customer_id or product_id. These aggregations are the most common use case, as they prepare the transactions table to be joined with customers or products.

# Tiny in-memory mimic of "transactions.csv":
transactions <- data.table(
  transaction_id = c("T001", "T002", "T003", "T004", "T005"),
  customer_id = c("C001", "C002", "C001", "C003", "C004"),
  product_id = c("P001", "P002", "P001", "P003", "P002"),
  price = c(10, 20, 22, 11, 21),
  quantity = c(1, 2, 1, 3, 2)
)

transactions_info_dt <- table_info(
  table_name = "transactions",
  source_identifier = "transactions.csv",
  identifier_columns = "transaction_id",
  key_outcome_specs = list(
    list(
      OutcomeName = "Revenue",
      ValueExpression = quote(price * quantity),
      AggregationMethods = list(
        list(
          AggregatedName = "RevenueByCustomer",
          AggregationFunction = "sum",
          GroupingVariables = "customer_id"
        ),
        list(
          AggregatedName = "RevenueByProduct",
          AggregationFunction = "sum",
          GroupingVariables = "product_id"
        )
      )
    )
  )
)

print(transactions_info_dt)
#>      table_name source_identifier identifier_columns outcome_name
#>          <char>            <char>             <list>       <char>
#> 1: transactions  transactions.csv     transaction_id      Revenue
#> 2: transactions  transactions.csv     transaction_id      Revenue
#>    value_expression   aggregated_name aggregation_function grouping_variable
#>              <char>            <char>               <char>            <list>
#> 1: price * quantity RevenueByCustomer                  sum       customer_id
#> 2: price * quantity  RevenueByProduct                  sum        product_id

Assembling a Master Metadata Table

Finally, we can combine the metadata from each table into a single master data.table. This master table gives a complete, queryable overview of all the pre-defined, merge-ready aggregations in our database.

master_metadata_dt <- rbindlist(
  list(customers_info_dt, products_info_dt, transactions_info_dt)
)

# Print the combined master metadata
print(master_metadata_dt)
#>      table_name source_identifier identifier_columns  outcome_name
#>          <char>            <char>             <list>        <char>
#> 1:    customers     customers.csv        customer_id CustomerCount
#> 2:     products      products.csv         product_id  ProductCount
#> 3: transactions  transactions.csv     transaction_id       Revenue
#> 4: transactions  transactions.csv     transaction_id       Revenue
#>    value_expression     aggregated_name aggregation_function grouping_variable
#>              <char>              <char>               <char>            <list>
#> 1:                1   CustomersByRegion                  sum            region
#> 2:                1 ProductsPerCategory                  sum          category
#> 3: price * quantity   RevenueByCustomer                  sum       customer_id
#> 4: price * quantity    RevenueByProduct                  sum        product_id

# Show the structure of the combined data.table
cat("\nStructure of the master metadata data.table:\n")
#> 
#> Structure of the master metadata data.table:
str(master_metadata_dt)
#> Classes 'data.table' and 'data.frame':   4 obs. of  8 variables:
#>  $ table_name          : chr  "customers" "products" "transactions" "transactions"
#>  $ source_identifier   : chr  "customers.csv" "products.csv" "transactions.csv" "transactions.csv"
#>  $ identifier_columns  :List of 4
#>   ..$ : chr "customer_id"
#>   ..$ : chr "product_id"
#>   ..$ : chr "transaction_id"
#>   ..$ : chr "transaction_id"
#>  $ outcome_name        : chr  "CustomerCount" "ProductCount" "Revenue" "Revenue"
#>  $ value_expression    : chr  "1" "1" "price * quantity" "price * quantity"
#>  $ aggregated_name     : chr  "CustomersByRegion" "ProductsPerCategory" "RevenueByCustomer" "RevenueByProduct"
#>  $ aggregation_function: chr  "sum" "sum" "sum" "sum"
#>  $ grouping_variable   :List of 4
#>   ..$ : chr "region"
#>   ..$ : chr "category"
#>   ..$ : chr "customer_id"
#>   ..$ : chr "product_id"
#>  - attr(*, ".internal.selfref")=<externalptr>