Once a SuperSurv ensemble is trained, we need to verify that it actually
outperforms the individual base learners. Because survival data involve
right-censoring, we cannot rely on standard classification metrics like
simple accuracy.
Instead, we evaluate the model across three critical dimensions:

1. Calibration: Does the predicted survival probability match the actual observed survival rate?
2. Discrimination (AUC & C-index): Can the model correctly rank which patient will survive longer?
3. Overall Accuracy (Brier Score): A combined measure of both calibration and discrimination.
This tutorial demonstrates how to extract these metrics and visualize
the benchmark comparisons using SuperSurv’s built-in
evaluation suite.
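To make the discrimination idea concrete before turning to the built-in suite, here is a minimal, self-contained sketch using only the survival package and its bundled lung data (illustrative only, not SuperSurv code):

library(survival)

# Harrell's C-index: the proportion of comparable patient pairs whose
# survival times the model ranks correctly.
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
risk <- predict(cox_fit, type = "lp")  # higher linear predictor = higher risk
concordance(Surv(lung$time, lung$status) ~ risk, reverse = TRUE)

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the same interpretation applies to the ensemble metrics reported later in this tutorial.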
We begin by loading the metabric dataset and defining
our evaluation time grid (new.times).
Crucial Methodological Note: Your
new.times grid should never exceed the maximum
observed follow-up time in your training cohort. For example, if your
training data only spans 1 to 100 days, predicting survival at day 150
is extrapolation. Non-parametric models (like Survival Trees) simply cannot
produce estimates beyond the last observed time, and parametric models (like
the Weibull) will generate highly unreliable tail estimates. Always keep your
evaluation grid within the limits of your observed data!
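As a minimal sketch of this rule, once the train data frame and new.times grid are defined in the setup chunk below, a one-line check guards against accidental extrapolation:

# Every evaluation time must fall within the observed follow-up
# of the training data.
stopifnot(all(new.times <= max(train$duration)))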
library(SuperSurv)
library(survival)
data("metabric", package = "SuperSurv")
set.seed(123)
# 70/30 train-test split
train_idx <- sample(seq_len(nrow(metabric)), floor(0.7 * nrow(metabric)))
train <- metabric[train_idx, ]
test <- metabric[-train_idx, ]
# Covariate columns are prefixed with "x"
X_tr <- train[, grep("^x", names(metabric))]
X_te <- test[, grep("^x", names(metabric))]
# Our max follow-up is well beyond 200, so this grid is safe.
new.times <- seq(50, 200, by = 25)

We will fit an ensemble consisting of a Cox model, a Weibull model, and a Survival Tree using the default Least Squares meta-learner.
# Candidate learners: Cox model, Weibull model, and survival tree (rpart)
my_library <- c("surv.coxph", "surv.weibull", "surv.rpart")
fit_supersurv <- SuperSurv(
time = train$duration,
event = train$event,
X = X_tr,
newdata = X_te,
new.times = new.times,
event.library = my_library,  # candidate models for the event (survival) times
cens.library = my_library,   # candidate models for the censoring distribution
control = list(saveFitLibrary = TRUE),
verbose = FALSE,
nFolds = 3  # number of cross-validation folds
)

The eval_summary() function automatically generates
predictions on your test set and returns a clean, comparative table of
the integrated metrics across your entire time grid.
# Evaluate performance directly using the fitted model and test data
performance_results <- eval_summary(
object = fit_supersurv,
newdata = X_te,
time = test$duration,
event = test$event,
eval_times = new.times
)

Note: Look for the model with the lowest IBS (Integrated Brier Score) and the highest iAUC/Uno's C-index.
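For intuition, the IBS is simply the time-specific Brier score averaged over the evaluation grid. The sketch below shows only the integration step (the function name integrate_brier and the input brier_t are hypothetical); a full implementation must also account for censoring when computing each time-specific Brier score, typically via inverse-probability-of-censoring weighting:

# Conceptual sketch: average time-specific Brier scores over the grid
# using the trapezoidal rule. `brier_t` holds one Brier score per time
# in `eval_times` (already adjusted for censoring).
integrate_brier <- function(eval_times, brier_t) {
  widths <- diff(eval_times)
  mids <- (head(brier_t, -1) + tail(brier_t, -1)) / 2
  sum(widths * mids) / (max(eval_times) - min(eval_times))
}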
A single integrated number rarely tells the whole story. Different models excel at different follow-up periods (e.g., a Cox model might dominate short-term survival, while a Random Forest dominates long-term).
The plot_benchmark() function generates a stacked
dashboard to visualize this dynamic performance over time.
plot_benchmark(
object = fit_supersurv,
newdata = X_te,
time = test$duration,
event = test$event,
eval_times = new.times
)

Interpretation: The SuperSurv ensemble should ideally hug the bottom edge of the Brier score curve, i.e., show the lowest prediction error across the follow-up period.

Before deploying a model to the clinic, doctors need to know whether its probabilities are reliable. If the model predicts a patient has a 40% chance of surviving past Time = 100, do roughly 40% of similar patients actually survive that long?
We use plot_calibration() to evaluate this at a specific
clinical milestone. The function groups patients into risk quantiles and
plots their predicted probability against the actual observed
Kaplan-Meier survival rate.
# Assess calibration specifically at Time = 150
plot_calibration(
object = fit_supersurv,
newdata = X_te,
time = test$duration,
event = test$event,
eval_time = 150,
bins = 5 # Group patients into 5 risk quintiles
)

Interpretation: A perfectly calibrated model will follow the 45-degree dashed black line. Points above the line indicate the model is under-predicting survival (being too pessimistic), while points below the line indicate it is over-predicting survival (being too optimistic).
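For readers who want to see the mechanics behind this kind of plot, here is a self-contained, by-hand version of the quantile-based calibration check. It uses a Cox model on the survival package's lung data rather than the SuperSurv fit, and the time point (365 days) and variable names are choices made for this sketch only:

library(survival)

# By-hand calibration at t = 365 days: bin patients by predicted survival,
# then compare each bin's mean prediction to its Kaplan-Meier estimate.
lung2 <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
cox_fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung2)

# Predicted survival probability at t = 365 for every patient
sf <- survfit(cox_fit, newdata = lung2)
pred365 <- as.numeric(summary(sf, times = 365, extend = TRUE)$surv)

# Quintile bins of predicted survival
grp <- cut(pred365, quantile(pred365, probs = seq(0, 1, 0.2)),
           include.lowest = TRUE, labels = FALSE)

# Observed Kaplan-Meier survival at t = 365 within each bin
obs365 <- sapply(split(seq_len(nrow(lung2)), grp), function(idx) {
  km <- survfit(Surv(time, status) ~ 1, data = lung2[idx, ])
  summary(km, times = 365, extend = TRUE)$surv
})

data.frame(predicted = tapply(pred365, grp, mean), observed = obs365)

Each row of the resulting table is one point on the calibration plot; the closer the predicted and observed columns are, the better the calibration at that milestone.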