---
title: "Introduction to treeSS"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to treeSS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = FALSE
)
```

## Overview

The **treeSS** package implements the tree-spatial scan statistic of
Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters
that are simultaneously anomalous in geographic space and on a
hierarchical classification tree. It generalises Kulldorff's (1997)
circular spatial scan by searching jointly over circular spatial
zones $z$ and tree branches $g$, returning the (zone, branch) pair
that maximises the likelihood ratio.

The package targets epidemiological and surveillance applications
where the events to be clustered (deaths, hospitalisations, crimes,
collisions) carry both a *location* and a *categorical hierarchy*
(ICD-10 chapter $\to$ subchapter $\to$ leaf, NAICS sector, road type,
etc.). Detecting clusters jointly in both dimensions can identify
specific cause-area combinations that pure spatial or pure tree scans
would miss.

The package provides:

| Function | Purpose |
|:---------|:--------|
| `treespatial_scan()` | Tree-spatial scan (main entry point) |
| `circular_scan()`    | Kulldorff's circular spatial scan (tree-free) |
| `tree_scan()`        | Tree-based scan (space-free) |
| `iterative_scan()`   | Sequential conditional scan with Holm–Bonferroni correction |
| `filter_clusters()`  | Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 |
| `get_cluster_regions()` | Tidy region-by-cluster table for plotting |
| `aggregate_tree()`   | Roll counts up the tree |

All scans accept the Poisson and Binomial models of the original paper
and run the Monte Carlo step under OpenMP via the `n_cores` argument.

The package ships four example datasets:

| Dataset | Country | Domain | Regions | Tree |
|:--------|:--------|:-------|:--------|:-----|
| `rj_mortality` + `rj_tree` | Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) |
| `chicago_crimes` + `chicago_tree` | USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) |
| `fl_deaths` | USA | Public health | 65 counties | (built by user from raw data) |
| `london_collisions` + `london_tree` | UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) |

This vignette walks through `rj_mortality` (reproducing the published
analysis) and `chicago_crimes` (a different-domain illustration), then
shows how to use `iterative_scan()` for sequential cluster detection.
The Florida and London examples are shipped in `inst/examples/` and
discussed in the companion paper.

---

## Example 1 — Infant mortality in Rio de Janeiro

This reproduces Section 5.2 of Cançado et al. (2025): all live births
and infant deaths in the 89 municipalities of Rio de Janeiro state in
2016, classified by the ICD-10 hierarchy. The pre-built dataset
combines deaths, live births, ICD-10 leaf codes and centroid
coordinates in long format.

### Run the scan

```{r rj-data}
library(treeSS)
data(rj_mortality)
data(rj_tree)

str(rj_mortality, max = 1)
#> 'data.frame': 1440 obs. of 9 variables: region_id, ibge_code,
#>   name, live_births, x, y, node_id, cases, ...
```

```{r rj-scan}
result_rj <- treespatial_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_pop_pct = 0.50,
  nsim        = 999, seed = 2016,
  n_cores     = 4L
)
print(result_rj)
```

Running this produces the most likely cluster `P209` with 18
municipalities, log-likelihood ratio $LR \approx 38.83$, $p < 0.001$,
27 observed deaths against 3.34 expected — closely matching the
published Table 7 ($LR = 38.48$, 18 municipalities, 27 deaths,
3.37 expected). The minor LR difference comes from a SIM database
update between the paper's analysis and the dataset shipped here;
the conclusions are identical.

### Multiple distinct clusters (paper-faithful)

To extract more than one cluster from a single scan, the paper's
Section 5.1.1 specifies a filtering rule: a candidate (zone, branch)
pair is retained only if its branch is disjoint from every previously
retained branch (no ancestor/descendant relation), or its zone is
disjoint from every previously retained zone (no region in common).
This is implemented in `filter_clusters()`:

```{r rj-filter}
filter_clusters(result_rj)[1:5, c("node_id", "n_regions",
                                   "cases", "expected", "llr")]
```

For visualisation, `get_cluster_regions()` returns a tidy data.frame
ready to merge with a polygon layer:

```{r rj-getregions, eval = FALSE}
# Most likely cluster only (single map)
cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE)

# Top-2 distinct clusters (faceted map)
cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)
```

A complete plotting workflow with `geobr` and `ggplot2`:

```{r rj-plot}
library(ggplot2)
library(geobr)
library(sf)

mun <- read_municipality(code_muni = "RJ", year = 2016)
mun$code6 <- as.integer(substr(mun$code_muni, 1, 6))

cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE),
             unique(rj_mortality[, c("region_id", "ibge_code")]),
             by = "region_id")
mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code")

ggplot(mun_facet) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel, nrow = 1) +
  theme_minimal() +
  theme(legend.position = "none")
```

A complete worked example with detailed annotation is shipped at
`system.file("examples", "example_brazil_rj.R", package = "treeSS")`.

---

## Example 2 — Crime in Chicago

The Chicago dataset illustrates the package on a non-medical domain
with a much larger tree (2841 nodes covering crime type, description
and location). It also demonstrates a recurring choice users face
in surveillance applications: **which population should be the
denominator?**

`chicago_crimes` ships with two denominator columns:

  * `population` — the total incidents in each community area. This
    is a *compositional* denominator: it expresses each cell as a
    share of the citywide incident count and answers the question
    *"which crime types over-occur in which areas, relative to the
    citywide profile?"*
  * `pop_residential` — the residential population of each community
    area (ACS 2020 5-year, aggregated to community areas by CMAP
    Community Data Snapshots). This is the appropriate denominator
    for an *incidence-rate* analysis (incidents per resident),
    analogous to deaths per live birth in Example 1.

The choice changes both the model and the interpretation. We use the
incidence-rate denominator below.

### Run the scan

```{r chi-scan}
data(chicago_crimes)
data(chicago_tree)

result_chi <- treespatial_scan(
  cases       = chicago_crimes$cases,
  population  = chicago_crimes$pop_residential,
  region_id   = chicago_crimes$region_id,
  x           = chicago_crimes$x,
  y           = chicago_crimes$y,
  node_id     = chicago_crimes$node_id,
  tree        = chicago_tree,
  max_pop_pct = 0.25,
  nsim        = 999, seed = 2023,
  n_cores     = 4L
)
print(result_chi)
```

The most likely cluster covers a contiguous group of community areas
on Chicago's South Side (Englewood, West Englewood, Auburn Gresham,
Roseland and neighbouring CCAs), driven by an over-occurrence of
violent and property crimes against residents. The relative risk for
this cluster is approximately 1.58 — a 58 % excess over the citywide
rate.

### Visualise the most likely cluster

`chicago_crimes` joins to the bundled `chicago_map` polygon layer via
the `area_number` column:

```{r chi-plot}
data(chicago_map)

cr1 <- merge(get_cluster_regions(result_chi, n_clusters = 1, overlap = FALSE),
             unique(chicago_crimes[, c("region_id", "area_number", "name")]),
             by = "region_id")
chi <- merge(chicago_map, cr1, by.x = "AREA_NUM", by.y = "area_number",
              all.x = TRUE)

ggplot(chi) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52"), na.value = "gray95",
                    name = "Cluster") +
  theme_minimal()
```

---

## Multi-cluster detection: `iterative_scan()`

`filter_clusters()` (used in Example 1) returns distinct clusters
extracted from a *single* run of the scan. An alternative — popular
in the spatial-epidemiology literature since Zhang, Assunção and
Kulldorff (2010) — is to run the scan, remove the cases assigned to
the detected cluster, and re-run on the residual data. Repeating this
yields a sequence of clusters whose subsequent components are
*conditionally most likely*, given that the prior clusters have been
explained away.

The function `iterative_scan()` implements this procedure. **Note that
it is not part of Cançado et al. (2025);** users seeking strict
fidelity to that paper should use `filter_clusters()`. The iterative
procedure is offered as a useful extension and is documented as such.

Because each iteration is a separate hypothesis test on data modified
by the previous iteration, the raw p-values overstate significance.
`iterative_scan()` collects the raw p-values from all iterations and
applies the Holm–Bonferroni correction at the end:

```{r rj-iter}
iter_rj <- iterative_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_iter    = 5, alpha = 0.05,
  nsim        = 999, seed = 2016,
  max_pop_pct = 0.50, n_cores = 4L
)
print(iter_rj)
```

The returned `clusters` data.frame has one row per iteration with the
raw p-value, the Holm-adjusted p-value, and a `significant` flag:

```{r rj-iter-table}
iter_rj$clusters[, c("iteration", "node_id", "n_regions",
                      "llr", "pvalue", "pvalue_adjusted",
                      "significant")]
```

Mapping the iterative result uses the same S3 dispatch:

```{r rj-iter-map}
cr_it <- merge(get_cluster_regions(iter_rj, overlap = TRUE),
               unique(rj_mortality[, c("region_id", "ibge_code")]),
               by = "region_id")
mun_it <- merge(mun, cr_it, by.x = "code6", by.y = "ibge_code",
                all.x = TRUE)

ggplot(mun_it) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0",
                                "3" = "#55A868", "4" = "#8172B2",
                                "5" = "#CCB974"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel) +
  theme_minimal() + theme(legend.position = "none")
```

When to use which:

  * Use `filter_clusters()` / `get_cluster_regions(n_clusters = N)`
    when you want to follow Cançado et al. (2025) exactly.
  * Use `iterative_scan()` when residual clustering is plausible and
    you want a sequence of conditionally distinct clusters with
    proper multiple-testing control.

---

## Computational notes

The Monte Carlo step (replicates of the null Poisson/Binomial counts
under proportional intensities) dominates the runtime. All scan
functions accept `n_cores`, which parallelises the Monte Carlo step
via OpenMP. On a 4-core laptop the Rio de Janeiro analysis runs in
under 20 seconds with `nsim = 999`; the Chicago analysis (much larger
tree) takes around two minutes.

For tasks where the case counts are concentrated near the leaves and
the tree is deep, `aggregate_tree()` can be used to compute the
internal-node sums once and reuse them across calls.

---

## Other examples

The `inst/examples/` directory of the installed package contains
fully annotated worked examples for all four datasets:

```{r examples-dir}
list.files(system.file("examples", package = "treeSS"))
#> [1] "example_brazil_rj.R"  "example_chicago.R"
#> [2] "example_florida.R"    "example_london.R"
```

The Florida script demonstrates the full workflow from raw long
data — building the ICD-10 tree from scratch, downloading county
polygons via `tigris`, assembling parallel input vectors and
plotting with `ggplot2`. The London script illustrates the use of
a non-medical tree (light × road × junction) on traffic collisions,
with an interactive `leaflet` map.

These two examples are also the basis of the detailed use cases in
the companion R Journal paper.

---

## References

Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H.
(2025). A tree-spatial scan statistic. *Environmental and Ecological
Statistics*, 32, 953–978.
[doi:10.1007/s10651-025-00670-w](https://doi.org/10.1007/s10651-025-00670-w)

Kulldorff, M. (1997). A spatial scan statistic. *Communications in
Statistics — Theory and Methods*, 26(6), 1481–1496.

Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan
statistic for database disease surveillance. *Biometrics*, 59(2),
323–331.

Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan
statistics adjusted for multiple clusters. *Journal of Probability
and Statistics*, 2010, 642379.

Holm, S. (1979). A simple sequentially rejective multiple test
procedure. *Scandinavian Journal of Statistics*, 6(2), 65–70.