multChernoff: Get Started

Overview

multChernoff computes finite-sample tail bounds for the likelihood ratio test (LRT) under multinomial sampling. These bounds can be used to obtain conservative p-values and critical values when the sample size is not large relative to the alphabet size, so standard chi-squared asymptotics may be inaccurate.

The package exposes three main functions:

mgfBound() evaluates the upper bound on the moment generating function.
tailProbBound() returns a conservative upper bound on P(LRT > x).
criticalValue() returns a finite-sample critical value for a target tail probability.

When to Use This Package

The package is most useful when:

the data are multinomial or a product of independent multinomial samples
the alphabet size is not small relative to the sample size
asymptotic chi-squared calibration may be too optimistic

In this setting, tailProbBound() and criticalValue() provide conservative finite-sample alternatives to asymptotic likelihood-ratio approximations.

Basic Usage

Tail Probabilities

The most direct workflow is to evaluate the finite-sample tail bound for a fixed LRT value.

tailProbBound(x = 20, k = 7, n = 50)
#> [1] 0.304216

This value is a conservative upper bound on the tail probability. The function also supports independent multinomial samples by passing vectors of matching lengths for k and n, for example tailProbBound(x = 12, k = c(3, 4), n = c(12, 20)).

Computing a Critical Value

To calibrate a test at level p, use criticalValue().

criticalValue(k = 10, n = 40, p = 0.05)
#> [1] 30.77278

The returned value can be used as a finite-sample alternative to the standard chi-squared cutoff.

Example: What percentage of butterflies were unseen by Corbet?

Naturalist Alexander Steven Corbet spent two years trapping butterflies in Peninsular Malaysia. Here is the data collected by Corbet.

Frequency	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Species	118	74	44	24	29	22	20	19	20	15	12	14	6	12	6

Here we ask the question: what percentage of butterflies in Malaya belonged to the species that Corbet had not seen? That is, we want to estimate the proportion of butterflies from all the unseen species. Clearly, the MLE is zero based on the sample. Instead, we ask for an upper bound with 95% confidence.

Let k = 435 + 1, where 435 is the number of species observed by Corbet. The corresponding observed distribution is the distribution from the table above with a zero entry at the end. In the following program, we maximize the last entry of the underlying distribution subject to a tail bound on the LRT, which equals the Kullback-Leibler divergence KL(observed distribution || true distribution) multiplied by twice the sample size.

See the paper at the bottom of this page for details.

library(plyr)

# Corbet butterfly data
# https://en.wikipedia.org/wiki/Unseen_species_problem
corbet.butterfly <- data.frame(j=1:15, n_j=c(118,74,44,24,29,22,20,19,20,15,12,14,6,12,6))
n.butterfly <- Reduce(c, alply(corbet.butterfly, 1, function(.df) rep(.df$j, .df$n_j)))

alpha <- 0.05
n.observed <- c(n.butterfly, 0)    # the last one is the unseen
n <- sum(n.observed)
k <- length(n.observed)
p.observed <- n.observed / n

# critical value
t.alpha <- criticalValue(k, n, p=alpha, verbose = TRUE)

cat(sprintf("critical value = %f\n", t.alpha))
#> critical value = 962.402970

We get printout critical value = 962.402970. This value is then used in the following convex program.

# convex program
p <- Variable(k)
obj <- p[k]
constr <- list(p>=0, 
               sum(p) == 1, 
               2 * n * sum(p.observed * (log(p.observed) - log(p))) <= t.alpha)
prob <- Problem(Maximize(obj), constr)
result <- solve(prob, verbose=FALSE) # you should try other solvers

# result
print(result$status)
unseen <- result$value

cat(sprintf("unseen <= %f\n", unseen))

With MOSEK solver, the percentage of unseen species is at most 21.1%.