bacontrees is an R package for modelling discrete
sequential data using Bayesian Context
Trees (BaCon Trees). Context trees, also known as
Variable Length Markov Chains (VLMCs), are parsimonious Markov models
where the order of dependence can vary with the observed past.
bacontrees provides:
ContextTree)
that exposes the full context tree for building custom algorithms.baConTree):
closed-form marginal likelihood, exact MAP tree, exact prior/posterior
probabilities for any tree, and exact sampling from the posterior — all
via an efficient recursive algorithm.fit_vlmc()
estimates the context tree via the context algorithm with
likelihood-ratio pruning.rvlmc()).You can install the development version of bacontrees from GitHub with:
# install.packages("pak")
pak::pak("Freguglia/bacontrees")Or with devtools:
# install.packages("devtools")
devtools::install_github("Freguglia/bacontrees")A Variable Length Markov Chain (VLMC) is a Markov model where the relevant past (the context) that determines the next symbol’s distribution has a variable length. Further pasts are only distinguished when the data supports it, making VLMCs much more parsimonious than fixed-order Markov chains.
The set of contexts forms a complete subtree of the full (maximal)
suffix tree over the alphabet. bacontrees represents this
subtree explicitly as a ContextTree object, whose
active leaf nodes are exactly the contexts of the model.
Under the Bayesian formulation, a Dirichlet prior is placed on the transition probabilities at each context and a prior distribution over tree topologies is defined through per-node weights. The resulting posterior over trees can be computed exactly: a single bottom-up recursion over the maximal tree yields the marginal likelihood, the posterior probability of every candidate tree, the MAP tree, and an exact sampler — all without MCMC.
The package ships with two example datasets: abc_vec (a
single sequence of length 1000) and abc_list (a list of
three sequences of length 1000 each), generated from a three-symbol VLMC
with alphabet {a, b, c}.
library(bacontrees)
data(abc_list)
head(abc_list[[1]], 20)
#> [1] "a" "c" "a" "c" "a" "c" "c" "b" "c" "b" "a" "c" "c" "b" "c" "b" "c" "b" "b"
#> [20] "c"Use rvlmc() to generate synthetic sequences given a
context list and the corresponding transition probability vectors.
alphabet <- c("a", "b", "c")
context_list <- c("*.a", "*.b", "*.c.a", "*.c.b", "*.c.c")
context_probs <- list(
c(0.10, 0.20, 0.70), # after 'a'
c(0.33, 0.33, 0.34), # after 'b'
c(0.20, 0.10, 0.70), # after 'ac'
c(0.01, 0.98, 0.01), # after 'bc'
c(0.40, 0.40, 0.20) # after 'cc'
)
set.seed(42)
seq1 <- rvlmc(1000, alphabet, context_list, context_probs)
head(seq1, 20)
#> [1] "a" "a" "a" "a" "c" "c" "b" "a" "b" "a" "a" "c" "c" "a" "c" "c" "b" "c" "b"
#> [20] "b"fit_vlmc()fit_vlmc() implements the classical context algorithm:
build a maximal tree, then prune nodes whose likelihood-ratio test
statistic falls below cutoff.
fit <- fit_vlmc(abc_list, cutoff = 20, max_length = 4)
fit$getActiveNodes()
#> [1] "*.a" "*.b" "*.c.a" "*.c.b" "*.c.c"ContextTree (and baConTree) objects have an
S3 plot() method that uses ggraph
internally and can be customised like any ggplot2 object.
plot(fit)
baConTreebaConTree extends ContextTree with Bayesian
machinery. On construction it runs a bottom-up recursion that
precomputes the normalising constants (\(\sigma\)-values) at every node of the
maximal tree. These values make all key Bayesian quantities available as
direct, closed-form computations.
bt <- baConTree$new(abc_list, maximalDepth = 3, alpha = 0.01,
priorWeights = function(node) exp(-node$getDepth() / 3))The exact log marginal likelihood of the data (integrating over all tree topologies and all transition probability matrices) is available immediately after construction:
bt$getMarginalLikelihood(log = TRUE)
#> [1] -2665.658The MAP tree is found exactly in a single top-down pass — no sampling required:
bt$activateMap()
bt$getActiveNodes()
#> [1] "*.a" "*.c.a" "*.b" "*.c.b" "*.c.c"The prior and posterior probability of any candidate tree
can be evaluated exactly by activating that tree and calling
activeTreeProbabilities():
bt$activateFromContexts(c("*.a", "*.b", "*.c.a", "*.c.b", "*.c.c"))
bt$activeTreeProbabilities()
#> $prior
#> [1] 0.04045974
#>
#> $posterior
#> [1] 0.9998251Draw independent samples directly from the prior or posterior without MCMC:
set.seed(1)
bt$sampleTree("posterior")
#> [1] "AAAAADNAAA"
bt$getActiveNodes()
#> [1] "*.a" "*.c.a" "*.b" "*.c.b" "*.c.c"metropolis_vlmc()For settings where exact enumeration is not the goal (e.g.,
diagnosing multimodality or studying the mixing behaviour of MCMC on
tree spaces), metropolis_vlmc() provides a
Metropolis-Hastings sampler over context trees.
library(progressr)
with_progress({
result <- metropolis_vlmc(
abc_list,
n_steps = 2000,
max_depth = 3,
alpha = 0.01,
context_weights = function(node) exp(-node$getDepth() / 3),
burnin = 200
)
})
print(result)
#> Tree with highest posterior probability ( 1 ):
#> { *.a, *.c.a, *.b, *.c.b, *.c.c }
#> # A tibble: 1 × 4
#> tree_contexts prob n tree_code
#> <chr> <dbl> <int> <chr>
#> 1 {*.a,*.c.a,*.b,*.c.b,*.c.c} 1 1800 AAAAADNAAAThe returned object contains:
| Field | Description |
|---|---|
$df |
Data frame of context sets ranked by posterior probability |
$codes |
Named list mapping tree codes to their context vectors |
$chain |
Full MCMC chain of tree codes (for convergence diagnostics) |
$baConTree |
The baConTree object used internally |
ContextTreeThe core data structure. Manages the maximal suffix tree, tracks which nodes are active (the current context set), and handles data attachment. Designed to be subclassed or used directly for building custom algorithms over context trees.
| Method | Description |
|---|---|
$new(dataset, maximalDepth, alphabet) |
Construct a context tree |
$root() |
Return the root node |
$validate() |
Check that the maximal tree structure is valid |
$getMaximalDepth() |
Return the maximal depth of the tree |
$getActiveNodes() |
Return paths of active (leaf) nodes |
$getLeaves() |
Return paths of maximal-tree leaves |
$getInnerNodes() |
Return inner nodes of the active tree |
$getGrowableNodes() |
Return active nodes that can be grown |
$getPrunableNodes() |
Return active nodes that can be pruned |
$nodeExists(path) |
Check if a node with the given path exists |
$getParentNode(path) |
Return the parent of a node |
$getChildrenNodes(path) |
Return the children of a node |
$getSiblingNodes(path) |
Return the siblings of a node |
$activateRoot() |
Set active tree to root-only |
$activateMaximal() |
Set active tree to all maximal leaves |
$activateByCode(code) |
Restore a previously stored tree code |
$activateFromContexts(contexts) |
Set active tree to match a context vector or string |
$growActive(path) |
Grow the active tree at a given node |
$pruneActive(path) |
Prune the active tree at a given node |
$setData(dataset) |
Attach sequence data and compute counts |
$getDataset() |
Retrieve attached Sequence object |
$getAlphabet() |
Retrieve the Alphabet object |
$activeTreeCode() |
Compact binary encoding of the active tree |
$igraph() |
Convert to an igraph object |
plot(ct) |
Plot with ggraph |
baConTreeExtends ContextTree with exact Bayesian inference. The
constructor runs a bottom-up recursion to precompute the normalising
constants for the prior and posterior, enabling all quantities below to
be evaluated without MCMC.
| Method | Description |
|---|---|
$new(data, maximalDepth, alpha, priorWeights, initialTree) |
Construct a Bayesian context tree and precompute recursive quantities |
$getMarginalLikelihood(log) |
Exact marginal likelihood of the data |
$activateMap() |
Set active tree to the exact MAP tree |
$activeTreeProbabilities(log) |
Exact prior and posterior probabilities of the active tree |
$sampleTree(type) |
Draw an exact independent sample from the prior or posterior |
$runMetropolisHastings(steps) |
Run a Metropolis-Hastings MCMC sampler (optional) |
$getChain() |
Retrieve the MCMC chain as a data frame |