Model and output customisation in missingHE

Each model fitting function in missingHE provides a series of customisation options to allow a flexible specification of the missingness models in terms of modelling assumptions and output manipulation. Examples include: choice of outcome and prior probability distributions, model structure, missing data assumptions, type of diagnostic and model assessment measures, choice of numeric and graphical posterior summaries. This high level of flexibility aims at facilitating the tasks of the user related to the implementation of the health economic analysis, interpretation of key parameters as well as the need to assess model performance and conduct sensitivity analyses.

This tutorial illustrates how to customise the specification and output of the different types of models that can be fitted in missingHE using the different arguments available inside the functions of the package. For illustration purposes, the built-in data sets called MenSS and PBS will be used in this tutorial as toy examples, which are directly available when installing and loading missingHE in your R workspace. See Introduction to missingHE and Longitudinal models in missingHE for a overview about the different modelling approaches available in missingHE, practical examples on how to implement them using default argument choices and a general presentation of the two data sets.

If you would like to have more information on the package, or would like to point out potential issues in the current version, feel free to contact the maintainer at a.gabrio@maastrichtuniversity.nl. Suggestions on how to improve the package are also very welcome. A development version of the package is also available from the author’s main GitHub repository at https://github.com/AnGabrio/missingHE.

1 Modelling assumptions

The majority of the customisation options about the assumptions and specification of the modelling approaches available in missingHE are common across the four main functions of the package, namely selection (selection models), pattern (pattern mixture models), hurdle (hurdle models) and lmdm (longitudinal missing data model). Among these, key options include:

Other options are instead specific to the type of modelling approach selected and are therefore not applicable to all functions. Among these, key examples include:

In the following, a general overview of the choices available in missingHE to specify these modelling assumptions using dedicated arguments from each main function of the package. An application of the methods to the two built-in data sets of the package will be used as a demonstrative example to illustrate the impact of these choices on the analysis results.

2 Outcome distribution

Choice of the distribution for the outcome variables is typically informed based on the characteristics of the observed data by checking whether possible inconsistencies between distributional assumptions and observed data occur. A common example is given by assuming both effectiveness and cost variables to be normally distributed, a typical assumption made in standard statistical analyses of continuous data. Indeed, provided a sufficient sample size is obtained from the study data, the CLT guarantees that the sampling distribution of the sample mean will be normal, regardless of the shape of the population distribution. In CEA, however, the context of the analysis poses some challenges which may violate this assumption: 1) sample sizes about health economic outcomes are often limited and further reduced by considerable proportions of missing values; 2) a combination of data features that violate normality are often present in the observed data (e.g. skewness); 3) interest is not in statistical inference but rather in uncertainty assessment and quantification. The occurrence of these elements lead analysts to consider alternative distributional forms for the modelling of cost-effectiveness outcomes, in an attempt to improve the fit and convey more robust results.

missingHE allows the user to choose among a set of pre-defined parametric forms for the distributions assigned to the effectiveness and cost outcomes, which are shared across all model fitting functions in the package. The choice of the available distributions for the effetiveness and cost outcomes was informed based on the current methodological literature and may be conveyed to the desired package function by passing a specific character name to the arguments dist_e and dist_c. In particular, the set of distributions available (with associated character names) is:

Note that, while costs are usually expressed in monetary terms and can be modelled using continuous distributions, the level of measurement of effectiveness variables may differ according to the specific outcome which should be used to inform the choice of the relevant modelling distribution.

To illustrate how alternative outcome distributions may be assumed to conduct the health economic assessment using missingHE, we use data from the built-in data set MenSS as a toy example. The data come from a pilot randomised trial on young males at risk of sexually transmitted infections (STIs) which, among its objectives, included the cost-effectiveness assessment of the MenSS’s Safer Sex (MenSS) intervention compared to the Standard of Care (SoC). Key variables in MenSS include: Total costs and QALYs computed over the trial period (c and e); number of sexual intercourses ("sex_inst") and whether an STI was diagnosed ("sti") at follow-up; baseline values for all effectiveness variables and demographics, treatment ("trt") and site ("site") indicator.

The following code is used to fit three selection models using the function selection to conduct the cost-effectiveness assessment under a Missing At Random (MAR) assumption. The models differ in terms of the distributional assumptions specified for the two health economic variables, namely QALYs (e) and Total costs (c):

Prior to fitting the model, the data in MenSS are pre-processed by: replacing the default levels for the factor variable trt (\(1\) and \(2\)) with the actual treatment names ("SoC" and "MenSS"); add/subtract to the cost/QALY data a small constant of \(0.01\) to allow estimation of Gamma and Beta distributions, whose support does not include \(0\) and \(1\), respectively.

> #rename trt levels
> MenSS$trt <- factor(MenSS$trt)
> levels(MenSS$trt) <- c("SoC", "MenSS")
> MenSS$e <- MenSS$e - 0.01 #ensure no ones QALYs occur
> MenSS$c <- MenSS$c + 0.01 #ensure no zero costs
> 
> #fit models with different distributions for outcomes
> #1=Normal-Normal
> sm1_nn <- selection(data = MenSS, dist_e = "norm", dist_c = "norm",
+                     model.eff = e ~ trt + u.0, model.cost = c ~ trt + e,
+                     model.me = me ~ 1, model.mc = mc ~ 1,
+                     type = "MAR", n.iter = 1000, ref = 2)
> #2=Normal-Gamma
> sm1_ng <- selection(data = MenSS, dist_e = "norm", dist_c = "gamma",
+                     model.eff = e ~ trt + u.0, model.cost = c ~ trt + e,
+                     model.me = me ~ 1, model.mc = mc ~ 1,
+                     type = "MAR", n.iter = 1000, ref = 2)
> #3=Beta-Gamma
> sm1_bg <- selection(data = MenSS, dist_e = "beta", dist_c = "gamma",
+                     model.eff = e ~ trt + u.0, model.cost = c ~ trt + e,
+                     model.me = me ~ 1, model.mc = mc ~ 1,
+                     type = "MAR", n.iter = 1000, ref = 2)

Before looking at the model results, it is generally a good idea to assess the fit of the model to the observed data to inform the decision of which model results should be trusted more. Within the Bayesian framework, alternative approaches to model selection based on performance measures are available. missingHE allows the user to compute three main types of predictive information criteria (PIC), taken from common relative measures of Bayesian model fit, using the pic function. The available criteria include: Deviance Information Criterion (DIC); Widely Applicable Information Criterion (WAIC); Leave-One-Out Information Criterion (LOOIC).

The following code shows how the results of any model fitted in missingHE may be passed to the argument x of pic to compute the desired criterion. In particular, we pass the three model result objects into pic as a list and select "looic" as value for criterion to indicate that we want the LOOIC criterion to be computed for all models, and we store the results into the new list object looic_m123. Next, we print out the computed criteria for each model by accessing the inner object looic_m123$pic.

> #estimate looic for all models based on complete cases
> looic_m123 <- pic(x = list(sm1_nn, sm1_ng, sm1_bg),
+                   criterion = "looic", cases = "cc")
> #print criteria for each model
> looic_m123$pic
#> LOOIC model 1 LOOIC model 2 LOOIC model 3 
#>      776.1071      626.7495      583.8404

The printed estimates for the criteria suggest a generally better fit for model 3 compared to model 2 and model 1, as indicated by lower LOOIC values. This in accordance with the expectation that distributions which allow to capture skewness, such as Gamma and Beta, tend to have a better fit to empirical data that show such features. Note that the criteria were generated by setting cases to "cc", that is their calculation was only based on the complete cases. Alternative cases choices are: available effectiveness cases ("ac_e"), available cost cases ("ac_c") and all cases ("all"). Although it is possible to compute the criteria based on imputed cases, it is generally recommended to assess the model fit only based on the observed cases (or after integrating out the missing values). Thus, care should be used when interpreting the PIC results directly computed based on some imputed data.

Depending on the type of criteria chosen, pic provides additional output related to the computation of the quantity. For example, the object looic_m123 is a list which stores several elements that were used in the computation of LOOIC (pic) for each model, such as estimates for the effective number of parameters (peff) as well as standard error estimates of pic_se and peff_se. These elements can be accessed and printed out in a similar way to what shown for the main LOOIC quantities. The following command can be used to access and print out peff estimates used in the calculation of LOOIC for each model.

> #print peff values
> looic_m123$peff

An alternative approach to assess the predictive ability of the model is to use the model parameters to generate replications of the data and compare them to the original data to assess the absolute predictive performance of a model with respect to the observed data. If the model is reasonable, the replications should look similar to the original data. To this purpose, missingHE allows the user to compute a series of graphical and/or numeric posterior predictive checks (PPC) using the function ppc.

As an example, the following commands are used to generated posterior density plots for a total of \(20\) replications of QALYs and/or Total costs under each model, which are compared to the empirical densities of the original outcome variables. The arguments inside ppc can be used to customise the resulting plot with respect to:

> #compare densities of observed vs replicated outcome data under each model
> ppc_dens_m1 <- ppc(x = sm1_nn, type = "dens_overlay", outcome = "both",
+                    ndisplay = 20, trt = "none")
> ppc_dens_m2 <- ppc(x = sm1_ng, type = "dens_overlay", outcome = "costs",
+                    ndisplay = 20, trt = "none")
> ppc_dens_m3 <- ppc(x = sm1_bg, type = "dens_overlay", outcome = "effects",
+                    ndisplay = 20, trt = "none")
Posterior densities of 20 replicated data under model 1 compared to the empirical density of the original data.

Figure 1: Posterior densities of 20 replicated data under model 1 compared to the empirical density of the original data.

Posterior densities of 20 replicated data under model 2 compared to the empirical density of the original data.

Figure 2: Posterior densities of 20 replicated data under model 2 compared to the empirical density of the original data.

Posterior densities of 20 replicated data under model 3 compared to the empirical density of the original data.

Figure 3: Posterior densities of 20 replicated data under model 3 compared to the empirical density of the original data.

Figure 1, Figure 2 and Figure 3 show how the predictive performance of the models changes considerably depending on the specific distributional assumptions made about the outcome variables. In particular, we see that under normality (model 1), replications for both effects and costs fail to capture the skewness observed in the empirical data distributions. Conversely, replications generated using Gamma (model 2) and Beta (model 3) distributions seem to better capture the features of the empirical cost and effectiveness data, respectively.

3 Missingness mechanism

Choice of the missingness mechanism for the outcomes is typically very difficult to justify unless strong justifications can be provided for the exclusion of alternative options. According to Rubin’s taxonomy, the types of mechanisms can be generally represented into two distinct groups: MAR and Missing Not At Random (MNAR), depending on whether the missingness probability is assumed to exclusively depend on observed quantities or not, respectively. Note that, in the literature, a third class called Missing Completely At Random (MCAR) is often identified which, however, can be also considered as a subclass of MAR where the missingness probability does not depend on any observed or unobserved variable (i.e. it is totally random). Depending on the modelling approach chosen, assumptions about the missingness mechanism can be encoded into the model in different ways. For example, selection models directly allow the specification of the missingness probability model, while pattern mixture models implicitly define the mechanism class based on the type of restrictions imposed to identify the outcome distributions in each pattern.

missingHE allows the specification of either a MAR or MNAR assumption about the effectiveness and cost missingness mechanisms in each approach through the argument type. All the three selection models fitted to the MenSS data set were estimated under a MAR assumption, where the missingness outcomes’ probabilities (model.me and model.mc) were assumed to not depend on any variable. Although this might be reasonable in some cases, it is generally more plausible that some observed variables provide some information that can help to improve estimation of the missingness probabilities.

The following code shows how estimation of selection models can be extended to include additional observed variables as predictors inside the missingness models to make the MAR assumptions more plausible. As an example, and for ease of presentation, we consider here only model 1 and include as possible predictors into model.me and model.mc the baseline utilities (u.0) and age (age) variables.

> #fit selection model 1 (Normal-Normal) under MAR conditional on u.0 and age
> sm1_nn_mar <- selection(data = MenSS, dist_e = "norm", dist_c = "norm",
+                     model.eff = e ~ trt + u.0, model.cost = c ~ trt + e,
+                     model.me = me ~ u.0 + age, model.mc = mc ~ u.0 + age,
+                     type = "MAR", n.iter = 1000, ref = 2)

Posterior summaries for all model parameters may be inspected using the print function. The argument display may also be used to select the group of parameters for which the output should be printed out, with choices being: fixed effects ("fixed"), random effects ("random"), individual-specific parameter estimates ("conditional"), point-wise log-likelihood values ("loglik"), imputed values ("mis"). For example, estimates for the “fixed effects” parameters (default) from sm1_nn_mar may be displayed by typing

> # print posterior summaries for fixed effects
> print(x = sm1_nn_mar, display = "fixed")
#>                   mean           sd          2.5%          50%       97.5% Rhat n.eff
#> alpha[1]      0.537543   0.07152566     0.4036043    0.5369433   0.6845948 1.00   750
#> alpha[2]      0.031314   0.02926121    -0.0272582    0.0310349   0.0888309 1.00  1000
#> alpha[3]      0.387055   0.07543976     0.2311733    0.3868189   0.5257348 1.00  1000
#> beta[1]     206.726886  40.44526701   124.6127786  206.1344268 283.9295557 1.00  1000
#> beta[2]     -18.708654  61.03221363  -133.5226363  -17.7765709  93.5840807 1.01   700
#> beta_f     -670.735959 295.39243714 -1234.4685963 -677.3889611 -78.9401711 1.01  1000
#> gamma_c[1]    1.509014   1.17361708    -0.6837026    1.5030422   3.8651579 1.00  1000
#> gamma_c[2]   -0.162815   1.08218964    -2.3058642   -0.1710584   1.8753562 1.01   550
#> gamma_c[3]   -0.015225   0.01981889    -0.0534190   -0.0149466   0.0236365 1.01   190
#> gamma_e[1]    1.579171   1.22403787    -0.6847831    1.5287345   4.1158728 1.00  1000
#> gamma_e[2]   -0.199901   1.12750306    -2.5655645   -0.1408840   1.9308528 1.00  1000
#> gamma_e[3]   -0.016423   0.02082310    -0.0562370   -0.0167658   0.0262968 1.00   370
#> p_c           0.713347   0.03529381     0.6450136    0.7132304   0.7785806 1.00  1000
#> p_e           0.713786   0.03572281     0.6404884    0.7151350   0.7831580 1.01   300
#> s_c         211.691979  21.91369782   173.6119028  209.8974606 258.9207116 1.02   150
#> s_e           0.092589   0.01001392     0.0753114    0.0917834   0.1146809 1.00   360
#> tau_c         0.000023   0.00000474     0.0000149    0.0000227   0.0000332 1.02   150
#> tau_e       120.662141  25.53671991    76.0358035  118.7056328 176.3105665 1.00   360
#> tmu_c       196.843069  33.15147952   130.4365508  196.5808806 265.2473601 1.01   290
#> tmu_e         0.895440   0.01396320     0.8683272    0.8959254   0.9228379 1.00   310

The output generated by print shows for each parameter a selection of standard posterior summary measures obtained directly from JAGS including: mean, standard deviation, median, \(2.5\%\) and \(97.5\%\) percentiles as well as two MCMC diagnostic measures, the potential scale reduction factor (Rhat) and the number of effective sample size (n.eff). The names and number of parameters displayed reflect the structure of the outcome (model.eff and model.cost) and missingness (model.me and model.mc) models as specified in sm1_nn_mar. In particular:

Posterior distributions for each of these quantities may be extracted from the list object sm1_nn_mar by accessing the inner JAGS output model. For example, if different types of summaries or a visual inspection of the full posterior sample of a specific parameter is desired, e.g. p_e, we may access its posterior samples by typing

> #store posterior samples of p_e and compute some custom summaries
> p_e <- sm1_nn_mar$model_output$model$BUGSoutput$sims.list$p_e
> summary(c(p_e))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.5647  0.6915  0.7151  0.7138  0.7369  0.8222

The function coef allows to compute and display key posterior summaries for only the regression coefficients in model.eff and model.cost, following standard convention for reporting coefficient estimates for outcome regression models. The arguments random and digits may be used to indicate whether coefficient estimates of “random effects” (if present) in the outcome models should be displayed instead, and the number of digits to be displayed for all estimates.

> #print coefficient estimates from model.eff and model.cost
> coef(sm1_nn_mar, random = FALSE, digits = 2)
#> $Effects
#>             Mean   SD    QL   QU
#> (Intercept) 0.54 0.07  0.40 0.68
#> trtMenSS    0.03 0.03 -0.03 0.09
#> u.0         0.39 0.08  0.23 0.53
#> 
#> $Costs
#>                Mean     SD       QL     QU
#> (Intercept)  206.73  40.45   124.61 283.93
#> trtMenSS     -18.71  61.03  -133.52  93.58
#> e           -670.74 295.39 -1234.47 -78.94

The output produced is split into two parts: the first related to the coefficient estimates for the effectiveness model ($effects), and the second to the coefficient estimates for the costs model ($costs). For each coefficient, the name of the associated variable alongside posterior mean, standard deviation, lower and upper bounds of a \((1-\alpha)\%\) credibility interval (CI) are displayed. By default, boundary estimates for a \(95\%\) interval are displayed, but the user can also specify custom low/high boundary estimates using the optional argument prob, i.e. prob = (0.05, 0.95) for a \(90\%\) CI.

4 Outcome model structure

Assumptions about the modelling structure of the effectiveness and cost outcome models can be specified across all model fitting functions through the arguments model.eff and model.cost, respectively. The arguments require a formula object which mimics standard linear regression specification e = trt + and c = trt +, where e and c and trt denote the outcome and treatment specific variable names accepted by missingHE which must always be included into the formulae. A generalised linear regression approach is used to specify the models at the location level of the outcomes, with default link functions automatically selected by missingHE according to the distributional assumptions specified in dist_e and dist_c, e.g. identity for Normal, logarithmic for Gamma and Weibull, and logit for Beta or Bernoulli. Any fully-observed variable in the data frame passed to data may be specified as additional covariates into the outcome models using a + sign. For example, the following code shows how to customise Model 1 (under MAR) to include different groups of covariates into the effectiveness and cost outcome and missingness models.

> #fit selection model 1 (Normal-Normal) with covariates into both outcome and missingness models
> sm1_nn_cov <- selection(data = MenSS, dist_e = "norm", dist_c = "norm",
+                     model.eff = e ~ trt + u.0 + age, 
+                     model.cost = c ~ trt + age + employment + e,
+                     model.me = me ~ u.0 + age, model.mc = mc ~ age + employment,
+                     type = "MAR", n.iter = 1000, ref = 2)

If appropriate, missingHE allows the user to further to customise each model structure to account for possible clustering (e.g. based on different centres) through a formula notation inspired to the one used by the lme4 package. More specifically, standard “random effects” parameters may be incorporated into each model by adding + (x | z) to the usual “fixed effects” component of the model formula, where x denotes a linear specification for the random effects variable names to be added to the model (only if x already included as fixed effects) and z denotes the clustering variable name. For example, the following code shows how random intercepts (1) based on site may be included into both outcome selection models fitted to MenSS.

> #fit selection model 1 (Normal-Normal) with covariates and random intercepts
> #into the model
> sm1_nn_cov_re <- selection(data = MenSS, dist_e = "norm", dist_c = "norm",
+                     model.eff = e ~ trt + u.0 + age + (1 | site), 
+                     model.cost = c ~ trt + age + employment + e + (1 | site),
+                     model.me = me ~ u.0 + age, model.mc = mc ~ age + employment,
+                     type = "MAR", n.iter = 1000, ref = 2)

As per usual, functions such as print and coef may be used to summarise posterior estimates for all relevant model parameters. When interest is in assessing MCMC convergence of a given model, the function diagnostic may be used to inspect graphical diagnostics for each family of model parameters. For example, the following code shows how autocorrelation plots for all regression coefficients in model.eff based on the formula stored in sm1_nn_cov.

Checking convergence using the diagnostic function for a family of model parameters estimated in missingHE, for example through inspection of the autocorrelation plots.

Figure 4: Checking convergence using the diagnostic function for a family of model parameters estimated in missingHE, for example through inspection of the autocorrelation plots.

Graphs are displayed for each MCMC chain (default\(=2\)) and model parameters belonging to the specified family (alpha), while the argument type is used to select the specific type of diagnostic plot to be compute based on a set of available names accepted by missingHE. Several types of graphs are available in missingHE, among which, some of the most popular plots and associated names include: trace ("traceplot"), density ("denplot"), histogram ("histogram), autocorrelation ("acf"), and running means ("running") plots.

The accepted names for the family of parameter, specified param, are related to the names assigned by missingHE to the model parameters in the JAGS text file, which can be displayed using the print function. Note that many parameter names are shared across all model fitting functions in missingHE. For example, the names "alpha" and "beta" are accepted by all functions and respectively denote all fixed effects regression coefficients in model.eff and model.cost, while "random.alpha" and "random.beta" denote the corresponding random effects parameters (if included). Similarly, "mu_e", "sd.e", "mu_c" and "sd.c" denote the effectiveness and cost outcome mean and standard deviation parameters. Some of family names are specific, or have a different interpretation, according to the modelling approach and function used. A quick summary of the most important family names unique to each approach in missingHE is the following:

5 Prior distributions

missingHE fits each modelling approach under weakly-informative prior distribution choices for all model parameters based on common practice and assuming a general lack of prior knowledge about the parameters. Examples include assuming Normal distributions centred at \(0\) with generally low precision values (i.e. \(1\times10^{6}\)) for all regression coefficients, and Half-Cauchy distributions with a scale value of \(2.5\). However, when desired and if prior information on some parameters is available, the user may overwrite missingHE defaul prior choices using the argument prior included in each modelling function. It is important that user-specific prior choices are passed to prior in a list form containing information about the desired parameters which need to be identified according to the type of model under consideration. For example, the following command may be used to update the priors about the effectiveness and cost regression coefficients and standard deviations when fitting a selection model using selection.

> #update priors on outcome regression coefficients and standard deviations
> myprior <- list("beta.prior" = c("norm", 0, 0.01), 
+                 "beta_f.prior" = c("norm", 0, 0.001),
+                 "alpha.prior" = c("norm", 0, 0.01),
+                 "sigma.prior.e" = c("unif", 0, 5),
+                 "sigma.prior.c" = c("unif", 0, 1000))
> #update model with new priors
> sm2_nn_cov <- selection(data = MenSS, dist_e = "norm", dist_c = "norm",
+                     model.eff = e ~ trt + u.0 + age, 
+                     model.cost = c ~ trt + age + employment + e,
+                     model.me = me ~ u.0 + age, model.mc = mc ~ age + employment,
+                     type = "MAR", n.iter = 1000, ref = 2,
+                     prior = myprior)

The object myprior must be a list whose elements should take names accepted by missingHE denoting the specific parameters whose priors need to be overwritten. For example the character names "beta.prior", "beta_f.prior and "alpha.prior" indicate that the default priors for all covariate coefficients in model.cost, the coefficient associated with e in model.cost and all covariate coefficients in model.eff need to be replaced. The content of these list elements are recognised by missingHE as providing the information necessary to write the updated priors for these parameters and must be provided in the form of vectors. The first element of each vector tells the name of the specific distribution assumed for the prior, while the successive elements inside each vector denote the canonical parameters indexing the chosen distribution. For example, for the first groups of parameters "beta", myprior instructs missingHE to replace their default priors with Normal distributions centred at \(0\) with a precision of \(0.01\). Similarly, myprior instructs missingHE to specify Uniform priors on the standard deviations with differing ranges for the effectiveness and cost outcomes, through the information contained in the "sigma.prior.e" and "sigma.prior.c" elements.

Note that specific character names for the custom prior list elements as well as specific names for the prior distributions and the total number and possible values of the parameters indexing the priors must be used. missingHE allows the user to specify custom priors for parameters that are either shared across all types of model fitting functions, such as "beta.prior" and "alpha.prior" for regression coefficients in model.eff and model.cost, or specific to given functions, such as "delta.prior.e" and "delta.prior.c" for MNAR parameters in model.me and model.mc. The available distribution forms, their character names and parameterisation used by missingHE match those available from the JAGS manual (version 4.3.0). With regard to the prior list elements, some of the most important parameter names that are specific to certain modelling approaches, with associated parameter interpretation, are:

6 Conclusions

Additional information about customisation options available for each functions in missingHE as well as summary information about each model specification, assumptions and parameter estimation may be accessed from the help file of each function by typing help(function).