When generating SQL for multiple ATLAS cohort definitions, the naive approach treats each cohort independently – parsing its JSON, expanding its concept sets against the vocabulary, scanning CDM domain tables, building qualified events, and writing results. This works, but it repeats expensive database operations (vocabulary joins, full-table scans, result-table I/O) once per cohort.
atlasCohortGenerator implements a DAG-based batch optimizer that reorganizes a set of cohort SQL scripts into a single execution plan with shared intermediate stages. The optimizer produces semantically identical results to the one-at-a-time approach while eliminating redundant work. This vignette explains how the one-at-a-time approach works, how the optimizer restructures it into a shared execution plan, how result equivalence is guaranteed, and what speedups to expect.
The traditional CIRCE approach generates a fully self-contained SQL script for each cohort. Every script independently performs five phases: parsing the cohort's JSON definition, expanding its concept sets against the vocabulary, scanning the CDM domain tables, building qualified events, and writing results. For a batch of N cohorts, the database executes N independent pipelines.
Each cohort creates its own local #Codesets temp table, joins against CONCEPT_ANCESTOR and CONCEPT to expand descendants, scans CDM domain tables (e.g. DRUG_EXPOSURE, CONDITION_OCCURRENCE) with joins to its local codesets, and writes directly to the results tables. For N cohorts, the database therefore performs N sets of vocabulary joins, N scans of each referenced CDM domain table, and N separate DELETE+INSERT cycles against the result tables.
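As an illustration, each standalone script contains a vocabulary expansion and a full domain-table scan of roughly this shape (a simplified sketch — real CIRCE output also handles mapped and excluded concepts; 1112807 is used here as an example ancestor concept for aspirin):

```sql
-- Per-cohort codeset expansion, repeated in every standalone script
CREATE TABLE #Codesets (codeset_id int NOT NULL, concept_id bigint NOT NULL);

INSERT INTO #Codesets (codeset_id, concept_id)
SELECT 0, ca.descendant_concept_id
FROM @vocabulary_database_schema.CONCEPT_ANCESTOR ca
WHERE ca.ancestor_concept_id IN (1112807); -- e.g. aspirin

-- Full domain-table scan, joined to the local codesets
SELECT de.person_id, de.drug_exposure_start_date
FROM @cdm_database_schema.DRUG_EXPOSURE de
JOIN #Codesets cs ON de.drug_concept_id = cs.concept_id;
```

Running N such scripts repeats both the CONCEPT_ANCESTOR join and the DRUG_EXPOSURE scan N times.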
The optimizer restructures these N independent pipelines into a single DAG with shared intermediate stages. The key insight is that vocabulary expansion and CDM table scanning are cohort-independent – we can do them once and share the results.
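A minimal sketch of the shared codeset stage, assuming a cohort-keyed layout (the actual columns of #CODESETS_GLOBAL may differ, and #batch_ancestors is a hypothetical table holding every cohort's ancestor concepts):

```sql
-- Expand all cohorts' concept sets in one pass over the vocabulary
SELECT b.cohort_id, b.codeset_id, ca.descendant_concept_id AS concept_id
INTO #CODESETS_GLOBAL
FROM #batch_ancestors b  -- hypothetical: (cohort_id, codeset_id, ancestor_concept_id)
JOIN @vocabulary_database_schema.CONCEPT_ANCESTOR ca
  ON ca.ancestor_concept_id = b.ancestor_concept_id;

-- The distinct concept ids drive the domain-filtering stage
SELECT DISTINCT concept_id
INTO #ALL_CONCEPTS
FROM #CODESETS_GLOBAL;
```

One CONCEPT_ANCESTOR join now serves every cohort in the batch.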
The batch script executes in a strict topological order with no back-edges or cross-cohort dependencies:
| Phase | What happens | Shared? |
|---|---|---|
| 1. Staging setup | Create `#COHORT_STAGE`, `#INCLUSION_EVENTS_STAGE`, `#INCLUSION_STATS_STAGE` | Once |
| 2. Global codesets | Build `#CODESETS_GLOBAL` with all cohorts' concept sets; extract `#ALL_CONCEPTS` | Once |
| 3. Domain filtering | Create `##DRUG_EXPOSURE_FILTERED`, `##CONDITION_OCCURRENCE_FILTERED`, etc. | Once per domain |
| 4. Phase 1 | For each cohort: slice local `#Codesets` from global, build primary events into `#QUALIFIED_EVENTS_ALL` | Per-cohort, shared output |
| 5. Phase 2 | For each cohort: slice `#qualified_events` from `#QUALIFIED_EVENTS_ALL`, run inclusion rules, end strategy, write to staging | Per-cohort |
| 6. Finalize | Single DELETE by batch IDs + INSERT from staging into result tables | Once |
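Put together, the assembled script follows this skeleton (pseudo-SQL; statement details elided and the `#QUALIFIED_EVENTS_ALL` column list is illustrative):

```sql
-- Phases 1-3: shared stages, executed once per batch
-- CREATE #COHORT_STAGE / #INCLUSION_EVENTS_STAGE / #INCLUSION_STATS_STAGE
-- Build #CODESETS_GLOBAL and #ALL_CONCEPTS
-- SELECT ... INTO ##DRUG_EXPOSURE_FILTERED, ##CONDITION_OCCURRENCE_FILTERED, ...

-- Phase 4: per-cohort primary events into the shared table
-- (one INSERT per cohort in the batch)
INSERT INTO #QUALIFIED_EVENTS_ALL (cohort_id, person_id, event_id, start_date, end_date, ordinal)
SELECT @cohort_id, ...
FROM ##DRUG_EXPOSURE_FILTERED ...;

-- Phase 5: per-cohort inclusion rules and end strategy, writing to staging
-- Phase 6: one DELETE by batch ids + one INSERT into the results tables
```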
Consider three cohorts that identify users of different drugs:
Each cohort operates in complete isolation:
Notice: CONCEPT_ANCESTOR is joined 3 times;
DRUG_EXPOSURE is scanned 3 times;
OBSERVATION_PERIOD is scanned 3 times; aspirin concepts are
expanded twice (Cohorts A and C); 3 separate DELETE+INSERT cycles hit
the results table.
The optimizer merges the shared work:
The savings are clear:
| Operation | One-at-a-Time | Optimized |
|---|---|---|
| CONCEPT_ANCESTOR joins | 3 | 1 |
| DRUG_EXPOSURE scans | 3 | 1 |
| OBSERVATION_PERIOD scans | 3 | 1 |
| CONDITION_OCCURRENCE scans | 1 | 1 |
| Result table writes | 3 DELETE + 3 INSERT | 1 DELETE + 1 INSERT |
The batch optimizer must produce exactly the same cohort membership as running each cohort independently. We guarantee this through four mechanisms:
Both paths use the same
build_cohort_query_internal() function to generate
per-cohort SQL. The batch path does not change the SQL logic; it only
changes how the pieces are assembled:
```r
# Single-cohort path:
result <- build_cohort_query_internal(cohort, opts)
sql <- result$full_sql

# Batch path (same builder, structured output):
result <- build_cohort_query_internal(cohort, opts)
# Uses result$codeset_union_parts, result$primary_events_sql,
# result$tail_sql to assemble into the batch script
```

The structured output from the builder provides clean separation points (`codeset_union_parts`, `primary_events_sql`, `tail_sql`) rather than relying on fragile regex extraction from generated SQL.
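Conceptually, the assembler stitches these parts together along the following lines (a sketch only — `assemble_batch_script()`'s real internals differ, and `structured_list` is assumed to hold one builder result per cohort):

```r
# Sketch: combine the structured builder outputs into one batch script.
assemble_sketch <- function(structured_list) {
  # Union all cohorts' codeset parts into the global codeset build
  codesets <- paste(
    unlist(lapply(structured_list, `[[`, "codeset_union_parts")),
    collapse = "\nUNION ALL\n"
  )
  # Concatenate per-cohort Phase 1 inserts, then per-cohort tails
  phase1 <- paste(
    vapply(structured_list, `[[`, character(1), "primary_events_sql"),
    collapse = "\n"
  )
  phase2 <- paste(
    vapply(structured_list, `[[`, character(1), "tail_sql"),
    collapse = "\n"
  )
  paste(codesets, phase1, phase2, sep = "\n\n")
}
```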
Domain-filtered tables include both standard and source concept columns:
```sql
SELECT * INTO ##DRUG_EXPOSURE_FILTERED
FROM @cdm_database_schema.DRUG_EXPOSURE de
WHERE de.drug_concept_id IN (SELECT concept_id FROM #ALL_CONCEPTS)
   OR de.drug_source_concept_id IN (SELECT concept_id FROM #ALL_CONCEPTS);
```

This ensures that rows matching only on `drug_source_concept_id` (e.g., unmapped records with `drug_concept_id = 0`) are retained in the batch path, matching the single-cohort behavior, which queries the full CDM table.
When a cohort specifies QualifiedLimit: "First", the
single-cohort SQL includes WHERE QE.ordinal = 1 to select
only the first qualifying event per person. The Phase 1 rewrite must
preserve this filter. The assembler explicitly checks:
```r
# In assemble_batch_script():
if (isTRUE(structured$has_qualified_limit) &&
    !grepl("ordinal\\s*=\\s*1", ins2, perl = TRUE)) {
  next  # Skip Phase 1 for this cohort; run full script instead
}
```

If the rewritten INSERT does not contain the `ordinal = 1` filter, the cohort is routed through the full-script path rather than the Phase 1 + tail path, preserving correct semantics.
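For a cohort with `QualifiedLimit: "First"`, the rewritten Phase 1 INSERT must therefore carry the filter through, roughly like this (shape and column list illustrative):

```sql
-- Phase 1 insert for one cohort: the ordinal = 1 filter from the
-- single-cohort SQL must survive the rewrite
INSERT INTO #QUALIFIED_EVENTS_ALL
SELECT @cohort_id, QE.person_id, QE.event_id,
       QE.start_date, QE.end_date, QE.ordinal
FROM #qualified_events QE
WHERE QE.ordinal = 1; -- first qualifying event per person
```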
The package includes validate_batch_equivalence() which
runs both paths on the same CDM and compares results row-by-row:
```r
# Run batch path
cdm <- generateCohortSet2(cdm, cohortSet, name = "batch_test")

# Run each cohort independently (optimize = FALSE)
for (i in seq_len(nrow(cohortSet))) {
  atlas_json_to_sql_batch(cohortSet[i, ], optimize = FALSE)
}

# Compare per cohort_definition_id:
# (subject_id, cohort_start_date, cohort_end_date) must match exactly
```

The test suite (`test-batch-equivalence.R`) verifies:
- `isExcluded` concept-set semantics (excluded concepts removed via a `LEFT JOIN ... IS NULL` pattern)
- `includeMapped` concept-set semantics (mapped source concepts added via a `concept_relationship` join)

Several additional safety checks prevent silent divergence:
- `can_vectorize` tracking: a per-cohort boolean that must be `TRUE` for the Phase 1 + tail path to be used
- Cohorts whose end strategy materializes `#strategy_ends` skip Phase 1 and run the full script
- `apply_batch_rewrites()` flags any `@`-parameters that survive rewriting (except known-safe ones like `@vocabulary_database_schema`)
- `force_full_path` option: setting `options(atlasCohortGenerator.force_full_path = TRUE)` disables Phase 1 entirely and routes all cohorts through the full-script path, which is useful for debugging

The speedup from batch optimization depends on the workload. The main factors are how many cohorts share vocabulary and domains, the size of the CDM, and the cost of result-table I/O.
Cohorts that use entirely different CDM domains (e.g., one uses only `DRUG_EXPOSURE`, another only `MEASUREMENT`) still benefit from the once-only stages: the shared staging setup, the single global codeset build (one set of vocabulary joins instead of N), and the single DELETE+INSERT cycle against the result tables.

Typical speedup: 1.5–3x for 5–20 cohorts. The savings come primarily from vocabulary and I/O consolidation.
Cohorts that share CDM domains (e.g., multiple drug-exposure cohorts, or multiple condition-based cohorts) see the largest benefit: the shared `DRUG_EXPOSURE` table is filtered once into `##DRUG_EXPOSURE_FILTERED`, and every cohort then reads from the smaller filtered table instead of rescanning the full CDM table.

Typical speedup: 3–10x for 10–50 cohorts sharing 2–3 domains. The savings grow roughly linearly with N because each additional cohort avoids a full CDM scan.
This is the best case. Subset cohorts (e.g., "all aspirin users" vs. "aspirin users with diabetes" vs. "aspirin users over age 65") share the same domain-filtered tables (e.g. `##DRUG_EXPOSURE_FILTERED`) and largely overlapping concept sets, so slicing each cohort's events out of `#QUALIFIED_EVENTS_ALL` is fast.

Typical speedup: 5–15x for 5–20 subset cohorts. The primary-events build (the most expensive phase) operates on pre-filtered data, and the per-cohort Phase 2 blocks do minimal additional work.
For large-scale studies generating 100+ cohorts, the typical speedup is 5–20x, depending on CDM size and cohort diversity.
| Workload | N | Shared Domains | Speedup Range |
|---|---|---|---|
| Independent, different domains | 5–20 | Low | 1.5–3x |
| Independent, overlapping domains | 10–50 | High | 3–10x |
| Subset cohorts | 5–20 | Very high | 5–15x |
| Large phenotype library | 100+ | Mixed | 5–20x |
The optimizer adds negligible overhead for small batches (< 5 cohorts) and never produces slower results than the one-at-a-time approach.
```r
# Generate optimized batch SQL for multiple cohorts
batch_sql <- atlas_json_to_sql_batch(
  json_inputs = list(
    "path/to/aspirin_users.json",
    "path/to/metformin_users.json",
    "path/to/aspirin_diabetics.json"
  ),
  cdm_schema = "@cdm_database_schema",
  results_schema = "@results_schema",
  target_dialect = "sql server",
  optimize = TRUE  # default
)

# Or with a cohort set data frame (e.g. from CDMConnector::readCohortSet)
cohort_set <- data.frame(
  cohort_definition_id = c(1, 2, 3),
  cohort = c("aspirin.json", "metformin.json", "aspirin_diabetics.json")
)
batch_sql <- atlas_json_to_sql_batch(
  json_inputs = cohort_set,
  optimize = TRUE
)
```

```r
# Dump the final batch SQL for inspection
options(atlasCohortGenerator.debug_sql_file = "batch_debug.sql")
cdm <- generateCohortSet2(cdm, cohort_set, name = "debug_run")

# Force all cohorts through the full (non-optimized) script path
options(atlasCohortGenerator.force_full_path = TRUE)
cdm <- generateCohortSet2(cdm, cohort_set, name = "full_path_run")
```

The batch optimizer does not use an explicit graph data structure. Instead, the DAG is enforced by SQL execution order: the assembled script is a linear sequence of SQL statements in which each phase only references tables created in earlier phases. There are no back-edges (later phases never write to tables read by earlier phases) and no cross-cohort dependencies (no cohort's Phase 2 block reads from another cohort's Phase 2 output).
This linear ordering is a valid topological sort of the implicit DAG:
```
#CODESETS_GLOBAL
        |
        v
#ALL_CONCEPTS
        |
        +----> ##DRUG_EXPOSURE_FILTERED
        +----> ##CONDITION_OCCURRENCE_FILTERED
        +----> ##PROCEDURE_OCCURRENCE_FILTERED
        +----> ... (other domains)
        +----> ##ATLAS_OBSERVATION_PERIOD
        |
        v
#QUALIFIED_EVENTS_ALL   (Phase 1: per-cohort inserts, shared table)
        |
        v
Per-cohort Phase 2 blocks   (independent of each other)
        |
        v
Staging tables (#COHORT_STAGE, ...)
        |
        v
Finalize (results tables)
```
Because Phase 2 blocks are independent, they could in principle be parallelized. The current implementation executes them sequentially within a single SQL batch, but the DAG structure means no ordering constraint exists between different cohorts’ Phase 2 blocks.
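As a sketch of what such parallelization could look like — hypothetical, since the package does not currently expose per-cohort Phase 2 chunks, and cross-connection visibility would require the global (`##`) tables the optimizer already uses:

```r
# Hypothetical parallel execution of independent Phase 2 blocks.
# phase2_chunks: one SQL string per cohort, each reading only shared
# (##) tables built by the earlier, sequential phases.
run_phase2_parallel <- function(connect_fn, phase2_chunks) {
  parallel::mclapply(phase2_chunks, function(sql) {
    con <- connect_fn()           # one connection per worker
    on.exit(DBI::dbDisconnect(con))
    DBI::dbExecute(con, sql)      # no cross-cohort dependency, any order is safe
  })
}
```

Each cohort's block would still write to its own staging rows, so the final DELETE+INSERT phase remains a single sequential step.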