multChernoff computes finite-sample tail bounds for the
likelihood ratio test (LRT) under multinomial sampling. These bounds can
be used to obtain conservative p-values and critical values when the
sample size is not large relative to the alphabet size, so standard
chi-squared asymptotics may be inaccurate.
The package exposes three main functions:
mgfBound() evaluates the upper bound on the moment
generating function.tailProbBound() returns a conservative upper bound on
P(LRT > x).criticalValue() returns a finite-sample critical value
for a target tail probability.The package is most useful when:
In this setting, tailProbBound() and
criticalValue() provide conservative finite-sample
alternatives to asymptotic likelihood-ratio approximations.
The most direct workflow is to evaluate the finite-sample tail bound for a fixed LRT value.
This value is a conservative upper bound on the tail probability. The
function also supports independent multinomial samples by passing
vectors of matching lengths for k and n, for
example
tailProbBound(x = 12, k = c(3, 4), n = c(12, 20)).
Naturalist Alexander Steven Corbet spent two years trapping butterflies in Peninsular Malaysia. Here is the data collected by Corbet.
| Frequency | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Species | 118 | 74 | 44 | 24 | 29 | 22 | 20 | 19 | 20 | 15 | 12 | 14 | 6 | 12 | 6 |
Here we ask the question: what percentage of butterflies in Malaya belonged to the species that Corbet had not seen? That is, we want to estimate the proportion of butterflies from all the unseen species. Clearly, the MLE is zero based on the sample. Instead, we ask for an upper bound with 95% confidence.
Let k = 435 + 1, where 435 is the number of species observed by Corbet. The corresponding observed distribution is the distribution from the table above with a zero entry at the end. In the following program, we maximize the last entry of the underlying distribution subject to a tail bound on the LRT, which equals the Kullback-Leibler divergence KL(observed distribution || true distribution) multiplied by twice the sample size.
See the paper at the bottom of this page for details.
library(plyr)
# Corbet butterfly data
# https://en.wikipedia.org/wiki/Unseen_species_problem
corbet.butterfly <- data.frame(j=1:15, n_j=c(118,74,44,24,29,22,20,19,20,15,12,14,6,12,6))
n.butterfly <- Reduce(c, alply(corbet.butterfly, 1, function(.df) rep(.df$j, .df$n_j)))
alpha <- 0.05
n.observed <- c(n.butterfly, 0) # the last one is the unseen
n <- sum(n.observed)
k <- length(n.observed)
p.observed <- n.observed / n
# critical value
t.alpha <- criticalValue(k, n, p=alpha, verbose = TRUE)We get printout critical value = 962.402970. This value
is then used in the following convex program.
# convex program
p <- Variable(k)
obj <- p[k]
constr <- list(p>=0,
sum(p) == 1,
2 * n * sum(p.observed * (log(p.observed) - log(p))) <= t.alpha)
prob <- Problem(Maximize(obj), constr)
result <- solve(prob, verbose=FALSE) # you should try other solvers
# result
print(result$status)
unseen <- result$value
cat(sprintf("unseen <= %f\n", unseen))With MOSEK solver, the percentage of unseen species is at most 21.1%.
F. Richard Guo and Thomas S. Richardson (2021). Chernoff-Type Concentration of Empirical Probabilities in Relative Entropy. IEEE Transactions on Information Theory, 67(1), 549-558. https://doi.org/10.1109/TIT.2020.3034539