BayesPMTools Package Tutorial

library(bayespmtools)

Introduction

Current sample size calculations for external validation of risk prediction models require researchers to specify fixed values of assumed model performance metrics alongside target precision levels. Using a Bayesian framework enables researchers to 1) incorporate uncertainty in model performance into sample size calculations, and 2) specify targets based on expected precision, assurance probabilities, and Value of Information (VoI).

The BayesPMTools R package implements our proposed Bayesian approach to Riley’s multi-criteria sample size calculations. Details of this approach are provided in Sadatsafavi et al. (2026): https://doi.org/10.1002/sim.70389.

The two core functions of this package are bpm_valsamp and bpm_valprec. In general, bpm_valsamp is used for sample size calculations given a set of sample size rules, while bpm_valprec is used in fixed designs: quantifying the anticipated precision for a given set of sample sizes.

In this tutorial, we use the bpm_valsamp function to determine the sample size for an example scenario, and use bpm_valprec to 1) check the validity of the results and 2) perform a VoI analysis quantifying the expected gain in net benefit across sample size components.

This tutorial assumes the BayesPMTools package is installed on your computer. Please refer to https://github.com/resplab/bayespmtools for more instructions on how to install and use this package.

# Setting the random number seed for reproducibility
set.seed(123)

The ISARIC Model

As a case study, we present sample size calculations for the ISARIC 4C model, a risk prediction model for predicting deterioration in patients hospitalized with COVID-19 (Gupta et al. 2021). The investigators used electronic hospital records from all 9 health regions of the UK to develop and validate this model: London was held out for external validation, while the other 8 regions were used to develop the model.

We assume we plan to validate this model in the London region.

Specifying Evidence (Uncertainty Quantification)

The investigators used leave-one-region-out cross-validation to estimate out-of-sample performance metrics of the model. We use this analysis as the source of evidence on model performance for the London region. Under the exchangeability assumption, the performance of the model in the London region is a random draw from the predictive distribution of performance metrics observed across the other 8 regions of the UK.

The Bayesian sample size calculation approach requires specifying prior distributions on 1) outcome prevalence (prev), 2) the c-statistic (cstat), 3) the calibration slope (cal_slp), and 4) one of the calibration intercept (cal_int), the observed-to-expected (O/E) ratio (cal_oe), or mean calibration (cal_mean, the difference between average observed and predicted risks), which is the one we use in this example.

These distributions are derived from a meta-analysis of the internal-external validation results from the original study. Details of this meta-analysis are provided in Sadatsafavi et al. (2026). Here, we directly use the predictive distributions from this meta-analysis.

Evidence can be specified in different ways. Perhaps the most straightforward is a list of R formulas, as below:

evidence <- list(
    prev~beta(mean=0.428, sd=0.030), 
    cstat~beta(mean=0.761, cih=0.773), 
    cal_mean~norm(-0.009, 0.125),  #mean and SD
    cal_slp~norm(0.995, 0.024)     #mean and SD
  )

Note the flexibility in specifying the parameters of these distributions. For prevalence, we parameterize the Beta distribution by its mean and SD; for the c-statistic, we use the mean and the upper bound of the 95% interval (cih, i.e., the 97.5th percentile). These values are internally mapped to the parameters of the respective distribution. If parameters are not named, they are taken as the natural parameters of the distribution as defined in R, as in the last two components.
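As an aside, the mapping from (mean, sd) to the two shape parameters of the Beta distribution follows the standard method-of-moments identities. The helper below is our own illustration of that mapping, not the package’s internal code:

```r
# Method-of-moments mapping from (mean, sd) to Beta shape parameters:
# for Beta(a, b), mean = a/(a+b) and var = mean*(1-mean)/(a+b+1).
# Illustrative helper, not part of bayespmtools.
beta_from_moments <- function(mean, sd) {
  s <- mean * (1 - mean) / sd^2 - 1   # s = a + b
  c(shape1 = mean * s, shape2 = (1 - mean) * s)
}

beta_from_moments(0.428, 0.030)
```

Plugging the returned shape1/shape2 into rbeta() reproduces a prevalence distribution with the requested mean and SD.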

Currently recognized distributions are norm (Normal), logitnorm (Logit-normal), probitnorm (Probit-normal), and beta (Beta).

Sample size targets and rules

Targets of sample size / precision calculations are generally divided into two groups: statistical metrics of model performance, and the net benefit (NB) as a measure of clinical utility.

For statistical metrics, the user can choose among the c-statistic (cstat), mean calibration error (cal_mean), O/E ratio (cal_oe), calibration intercept (cal_int), and calibration slope (cal_slp). For our example, we use the same targets as chosen in Riley et al. (2021). The target 95% confidence interval widths are as follows: c-statistic: 0.10, O/E ratio: 0.22, calibration slope: 0.30.

For these metrics, two types of rules are definable: those that target the expected CI width (ECIW), and those that target a given quantile of these metrics (QCIW). The former can be seen as the Bayesian equivalent of the frequentist sample size calculation in Riley et al. (2021). The latter is unique to our Bayesian framework and can be used to devise ‘assurance-type’ rules: determining the sample size that provides a given level of assurance that the CI width will be below a desired target.

For our case study, we calculate sample sizes targeting the expected value of these CI widths, as well as sample sizes that provide 90% assurance that the future CI width will be at most as wide as the targets specified above. Expected CI width (eciw) rules only require a target CI width. Quantile CI width (qciw) rules require the target CI width to be specified alongside the desired assurance level (quantile). In our example we use a quantile of 0.9 for 90% assurance.
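To make the distinction concrete, the toy simulation below (our own illustration, using a hypothetical distribution of future CI widths) shows how the two rule types summarize the same Monte Carlo draws: eciw targets the mean width, while qciw targets an upper quantile of it.

```r
# Toy illustration of ECIW vs. QCIW (not package code).
# Suppose Monte Carlo simulation yields draws of the future 95% CI width
# for some metric; the two rule types summarize these draws differently.
set.seed(123)
widths <- rlnorm(10000, meanlog = log(0.09), sdlog = 0.2)  # hypothetical draws

eciw <- mean(widths)            # ECIW rule: expected CI width
qciw <- quantile(widths, 0.9)   # QCIW rule: width not exceeded with 90% assurance

c(eciw = eciw, qciw = unname(qciw))
```

Because qciw controls an upper quantile of the width distribution rather than its mean, it always demands a larger sample size than eciw for the same target width.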

For net benefit (NB) as a decision-theoretic metric, this framework suggests two sample size rules. The first is Optimality Assurance. An Optimality Assurance of 90% indicates that we want to be at least 90% confident that the strategy we declare as optimal (highest NB) in the sample is truly the most optimal strategy; the oa.nb parameter is therefore set at 0.9. The second rule is voi.nb, defined as the ratio of the Expected Value of Sample Information (EVSI) at a given sample size to the Expected Value of Perfect Information (EVPI). A ratio of 0.8 means the planned study is expected to reduce the NB loss due to uncertainty by 80%. In our case study, we do not impose a VoI criterion. Instead, we will later calculate VoI metrics at the sample size components obtained for the other targets. This will help us inspect the efficiency of these sample sizes in terms of expected gain in NB.

targets_samp <- list(eciw.cstat=0.1,
                eciw.cal_oe=0.22,
                eciw.cal_slp=0.30,
                qciw.cstat=c(0.9, 0.1),
                qciw.cal_oe=c(0.9, 0.22),
                qciw.cal_slp=c(0.9, 0.30),
                oa.nb=0.90)

Sample size calculator (bpm_valsamp())

This is the main function call that computes the sample size requirements based on our evidence and targets. It returns the required sample size for external validation for each parameter requested in the targets.

samp <- bpm_valsamp(evidence=evidence, #Evidence as a list
                   targets=targets_samp, #Targets (as specified above)
                   n_sim=10000, #Number of Monte Carlo simulations
                   threshold=0.2) #Risk threshold for NB calculations

Let’s take a look at the sample size components.

#Print results
print(samp$results)
#>   eciw.cstat  eciw.cal_oe eciw.cal_slp   qciw.cstat  qciw.cal_oe qciw.cal_slp 
#>          347          440         1022          396          517         1172 
#>        oa.nb 
#>          306

As can be seen above, different components require different sample sizes. The largest sample size is 1172, dictated by the assurance rule for the calibration slope. The smallest component, at 306, pertains to Optimality Assurance.

Computing precision / VoI for a given sample size (bpm_valprec())

This function returns the precision / VoI metrics corresponding to the given sample sizes and targets. Its utility is in fixed designs, where we are interested in evaluating the sufficiency of the available sample size for various targets.

For this example, we use the outputs from bpm_valsamp to verify that the 95% CI widths and VoI values at the sample size components, as derived by bpm_valprec, are indeed close to the desired ones. As such, we use the same evidence element defined earlier, and create a new targets element, targets_prec, that contains the same components as targets_samp but set to TRUE for the eciw targets and to 0.9 for the qciw targets.

N_prec <- samp$results  #Sample size components from the main function call.

targets_prec <- list(eciw.cstat = TRUE,
                     eciw.cal_oe = TRUE,
                     eciw.cal_slp = TRUE,
                     qciw.cstat = 0.9,
                     qciw.cal_oe = 0.9,
                     qciw.cal_slp = 0.9,
                     oa.nb = TRUE)

Note that for ECIW (as well as Optimality Assurance), we simply express our desire to have them calculated (TRUE). For QCIW values, the quantile needs to be supplied.

prec <- bpm_valprec(
  N = N_prec, #Sample sizes
  evidence = evidence, #Evidence as a list
  dist_type="logitnorm", #Distribution type for calibrated risks
  method="sample", #Sample based or two-level ("2s") method
  targets = targets_prec, #Targets specified above
  n_sim = 10000, #Number of Monte Carlo simulations
  threshold=0.2) #Risk threshold for NB VoI calculations

The main output of bpm_valprec() is a matrix named results that contains all requested precision / VoI metrics at all requested sample sizes. The rows of this matrix are named after the requested sample sizes (as character strings), while the columns are named after the sample size rules. This allows us to extract the relevant cells from this matrix with the short code below:


res <- prec$results[cbind(samp$results, names(samp$results))]
names(res) <- names(samp$results)

print(res)
#>   eciw.cstat  eciw.cal_oe eciw.cal_slp   qciw.cstat  qciw.cal_oe qciw.cal_slp 
#>   0.10081432   0.21673020   0.30495915   0.09995627   0.21972258   0.30001731 
#>        oa.nb 
#>   0.88560000

These values should be close to the requested precision / assurance / VoI targets. If this is not the case, one should rerun the analysis with a larger value of n_sim, as well as check whether the evidence is specified reasonably.

Drawing the EVSI curve

Finally, we use VoI analysis to determine the relative efficiency of sample size components in terms of their expected NB gain.

Here, we call the bpm_valprec() function with only one target (voi.nb=TRUE). Also, note that for the evidence input, we are passing the sample object from the main results. This bypasses the sample creation step, which was already performed in the earlier call. Because of this, some other inputs are no longer needed. For example, there is no need to specify n_sim, as this is already determined by the number of rows of the sample data frame.

voi <- bpm_valprec(
  N = N_prec, #Sample sizes at which calculations are to be conducted
  evidence = samp$sample, #Previously drawn sample (bypasses sample creation)
  targets = list(voi.nb=TRUE), #Only requesting EVSI and EVPI
  threshold=0.2) #Risk threshold for NB VoI calculations

Below we use ggplot to draw the EVSI curve for each sample size component. Note that the main Y axis shows the raw EVSI(N) values, which asymptote to the EVPI value. The secondary (right) Y axis shows the EVSI(N)/EVPI ratio. This ratio asymptotes to 1 as the sample size increases.

library(ggplot2)

graph_shapes <- c("oa.nb"=18,
                  "eciw.cstat"=15,
                  "qciw.cstat"=15,
                  "eciw.cal_oe"=16,
                  "qciw.cal_oe"=16,
                  "eciw.cal_slp"=17,
                  "qciw.cal_slp"=17)

graph_names <- c("oa.nb"="Optimality assurance",
                  "eciw.cstat"="c-statistic (expected value)",
                  "qciw.cstat"="c-statistic (assurance)",
                  "eciw.cal_oe"="O/E ratio (expected value)",
                  "qciw.cal_oe"="O/E ratio (assurance)",
                  "eciw.cal_slp"="Calibration slope (expected value)",
                  "qciw.cal_slp"="Calibration slope (assurance)")

graph_colors <- c("oa.nb"="black",
                  "eciw.cstat"="#1f77b4", 
                  "qciw.cstat"="orange",
                  "eciw.cal_oe"="#1f77b4",
                  "qciw.cal_oe"="orange",
                  "eciw.cal_slp"="#1f77b4",
                  "qciw.cal_slp"="orange")

df <- data.frame(N=voi$N, EVPI=voi$results[,'voi.evpi'], EVSI=voi$results[,'voi.evsi'])
o <- order(df$N)
df <- df[o, ]

df$shapes <- graph_shapes[rownames(df)]
df$graph_names <- graph_names[rownames(df)]
df$colors <- graph_colors[rownames(df)]
df$names <- rownames(df)

df$names <- factor(df$names, levels = unique(df$names))


df$EVSI_EVPI_ratio <- df$EVSI / df$EVPI
scale_factor <- max(df$EVSI) / max(df$EVSI_EVPI_ratio)
df$EVSI_EVPI_ratio_scaled <- df$EVSI_EVPI_ratio * scale_factor


# EVSI plot
plt <- ggplot(df, aes(x=N)) +
  geom_line(aes(y=EVSI), color="black") +  # EVSI curve
  geom_point(aes(y=EVSI, shape=names, color=names), size=2) +
  labs(x = "Sample size (N)", y = "Expected gain in NB (EVSI)", title = "") +
  scale_shape_manual(
    name = "Sample size targets and rules",
    values = df$shapes,  # One shape per sample size rule
    labels = df$graph_names  # Human-readable rule names
  ) +
  scale_color_manual(
    name = "Sample size targets and rules",
    values = df$colors,  # One color per sample size rule
    labels = df$graph_names  # Human-readable rule names
  ) +
  # Primary Y-axis limits
  scale_y_continuous(
    limits = c(0, max(df$EVSI, df$EVPI)),
    sec.axis = sec_axis(~ . / scale_factor, name = "EVSI/EVPI Ratio")  # Secondary Y-axis
  ) +
  # Horizontal reference line at the EVPI (the asymptote of the EVSI curve)
  geom_hline(yintercept = df$EVPI[1], color = "gray") +
  theme_minimal()
print(plt)

The EVSI graph can be used to reason, for example, about whether one can relax the sample size rules around the calibration slope (which more than double the required sample size) without losing much clinical utility.

References

Gupta, Rishi K., Ewen M. Harrison, Antonia Ho, Annemarie B. Docherty, Stephen R. Knight, Maarten van Smeden, Ibrahim Abubakar, et al. 2021. “Development and Validation of the ISARIC 4C Deterioration Model for Adults Hospitalised with COVID-19: A Prospective Cohort Study.” The Lancet Respiratory Medicine 9 (4): 349–59.
Riley, Richard D., Thomas P. A. Debray, Gary S. Collins, Lucinda Archer, Joie Ensor, Maarten van Smeden, and Kym I. E. Snell. 2021. “Minimum Sample Size for External Validation of a Clinical Prediction Model with a Binary Outcome.” Statistics in Medicine 40 (19): 4230–51.
Sadatsafavi, Mohsen, Paul Gustafson, Solmaz Setayeshgar, Laure Wynants, and Richard D Riley. 2026. “Bayesian Sample Size Calculations for External Validation Studies of Risk Prediction Models.” Statistics in Medicine. https://doi.org/10.1002/sim.70389.