RMeDPower2: helping design and analyses of data from repeated measures experiments

Min-Gyoung Shin

Gladstone Institutes
mingyoung.shin@gladstone.ucsf.edu

Reuben Thomas

Gladstone Institutes
reuben.thomas@gladstone.ucsf.edu

Abstract

Biomedical research very often involves data generated from repeated measures experiments. RMeDPower2 is an R package that provides complete functionality to analyse data coming from repeated measures experiments, i.e., where one has repeated measures from the same biological/independent units or samples. RMeDPower2 helps test the modeling assumptions one makes, identify outlier observations, outlier units at different levels of the design, estimates statistical power or perform sample size calculations, estimate parameters of interest and also to visualize the association being tested. The functionality is limited to testing associations of one predictor (continuous or categorical, e.g., disease status or brain pathology) along with one another covariate (e.g., gender status) in the context of hierarchical or crossed experimental designs. This vignette illustrates the use of this package in multiple contexts relevant to biomedical research.

RMeDPower2 defines the experimental design for the data, probability model of the data generating distribution and necessary parameters required for sample size calculation using convenient S4 class objects. It uses these objects in one framework that brings together the functionality implemented in multiple R packages - lme4 (implementation of linear mixed effects models), influence.ME (identification of outlier units), EnvStats (identification of outlier observations), DHARMa (testing of modeling assumptions for non-normal distributions), simr (sample-size calculations) and tidyverse (data manipulation and visualization).

1 Getting started

1.1 Installing `RMeDPower2`

RMeDPower2 can be installed using the following commands

install.packages("devtools")
library(devtools)
install_github('gladstone-institutes/RMeDPower2', build_vignettes=TRUE, force = TRUE)

1.2 Quick tutorial

The workflow can be accomplished in 5 main steps as described in the sections below (1.2.1-.2.5).

1.2.1 Load libraries and read input data

The first step is to load RMeDPower2 and read-in the input data as a data frame. This is accomplished by the following commands:

suppressPackageStartupMessages(library(RMeDPower2))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))

data <- plate_assay_pilot_data

Example data has experiment, line as experience columns, classification as the condition column, and cell size as the response column.

knitr::kable(data[1:5,] )

experiment	line	cell_size1	cell_size2	cell_size3	cell_size4	cell_size5	cell_size6
experiment1	cellline1	353.8401	353.8401	353.8401	353.8401	353.8401	353.8401
experiment1	cellline1	456.3522	456.3522	456.3522	456.3522	456.3522	456.3522
experiment1	cellline1	350.7909	350.7909	350.7909	350.7909	350.7909	350.7909
experiment1	cellline1	387.4861	387.4861	387.4861	387.4861	387.4861	387.4861
experiment1	cellline1	403.9912	403.9912	403.9912	403.9912	403.9912	403.9912

1.2.2 Specify design, model and power parameters

The following set of commands allows users to specify parameters for experimental design, assumed probability models and distributions, and power simulations.

Use a text editor to modify the template jsonfile capturing the design for the data and read in this file.

design <- readDesign(file.path(template_dir,"design_cell_assay.json"))

Use a text editor to modify the template jsonfile capturing the error probability model and read in this file.

model <- readProbabilityModel(file.path(template_dir,"prob_model.json"))

Use a text editor to modify the template jsonfile capturing the power parameters and read in this file.

power_param <- readPowerParams(file.path(template_dir,"power_param.json"))

1.2.3 Diagnose data and model

The next set of commands will allow the user to evaluate the datatype, examine if the model is appropriate and identity outliers.

To diagnose the data and model, test model assumptions, identify outlier observations, outlier biological/independent units.

diagnose_res <- diagnoseDataModel(data, design, model)
data_update = diagnose_res$Data_updated

Modify design objects (e.g., remove outliers), model objects (e.g., change probability distribution assumption) if needed based on output of step 1 above.

diagnose_res_update <- diagnoseDataModel(data_update, design_update, model_update)

Repeat steps 1 and 2 until the models assumptions are met. For example, users can choose to remove outlier-experiments so that the distribution is close to normal.

1.2.4 Statistical power estimation

This section of code will allow the user to run sample-size calculations from the pilot input data.

power_res <- calculatePower(data, design, model, power_param)

1.2.5 Estimate parameters of interest

Once we have run the models, the user can estimate and visualize parameters of interest in order to investigate the association between response variable and condition variables.

estimate_res <- getEstimatesOfInterest(data, design, model, print_plots = F)

1.3 Input data

Users will have to ensure that the input data is organized in a format that will allow RMedPower2 to read in the data. The data should be structured in a table with rows representing the individual observations or responses and columns representing not only the corresponding explanatory variables but also other variables that capture the hierarchical or crossed design of how the data was generated. It is important that the columns have names or headers - these column names will be used to define the relevant S4 class objects.

For example, the next below illustrates the input data from a cellular assay where each of the 2,588 observations represents a measurement of size (cell_size2) of a single cell, from a given cell line (line) that is either from a control or disease condition (classification) and is measured in a given experimental batch (experiment).

## 'data.frame':    2588 obs. of  4 variables:
##  $ experiment    : Factor w/ 3 levels "experiment1",..: 1 1 1 1 1 1 1 1 1 3 ...
##  $ line          : Factor w/ 10 levels "cellline1","cellline10",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ classification: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cell_size2    : num  354 456 351 387 404 ...

1.4 Three key classes of `RMeDPower2`

Here, users will learn how to set up parameters for experimental designs, probability and distributional assumptions, and power simulations.

1.4.1 `RMeDesign` S4 class

The underlying design for the data will be stored in an object (R data structure with variables) of this class. The slots of this class are:

response_column: character that is the column name in the input data table that captures the responses or individual observations. This has to be set by the user.
covariate: character that is the column name in the input data table that captures the covariate information. This will be used as a fixed effect in the mixed effects model of the data. This slot is set to NULL if there are no covariates.
condition_column: character that is the column name in the input data table that captures the predictor/independent variable information. This will be used as a fixed effect in the mixed effects model of the data.This has to be set by the user.
experimental_columns: character vector that represents the column names in the input data table that captures the experimental variables of the design. These will be used as random effects in the mixed effects model of the data. The order of the names in this character vector is important. The first element captures the top hierarchical level of the design, the second element the next level of the hierarchy and so on. The hierarchy is explicitly modeled in the specification of the mixed effects models. The hierarchy will not be modeled for variables specified in the crossed_columns slot (see below). This has to be set by the user.
condition_is_categorical: logical value equal to TRUE if the variable in the condition_column is to be considered as a categorical variable and is equal to FALSE otherwise. Defaults to TRUE.
crossed_columns: character vector that represents the column names in the input data table that are a subset of the values in experimental_columns representing crossed factors. Defaults to NULL.
total_column: character that is the column name in the input table that will be used to offset values in the mixed effects models. Useful for modeling count data. Defaults to NULL.
outlier_alpha: numeric value that is the p-value threshold used to identify outlier observations. Defaults to 0.05.
na_action: character equal to complete or unique. To be used in the context where the input data has multiple responses that will each be sequentially modeled. For example, from a single set of data, a user is gathering numerous observations. Here we refer to complete in the scenario where the observed outliers that identified for one response will also be identified as outliers for all the other responses while unique refers to the situation where the outlier observations identified for one response will only be used for the model of that particular response. Defaults to complete.
include_interaction: Whether to include condition * covariate interaction
random_slope_variable: Variable for random slopes (typically “condition_column”)
covariate_is_categorical: Specify whether the covariate variable is categorical. TRUE: Categorical, FALSE: Continuous.

To create an object of class RMeDesign, the following code is used:

design <- new("RMeDesign")
print(design)

## An object of class "RMeDesign"
## Slot "response_column":
## [1] "response_column"
## 
## Slot "covariate":
## NULL
## 
## Slot "condition_column":
## [1] "condition_column"
## 
## Slot "condition_is_categorical":
## [1] TRUE
## 
## Slot "experimental_columns":
## [1] "experimental_column"
## 
## Slot "crossed_columns":
## NULL
## 
## Slot "total_column":
## NULL
## 
## Slot "outlier_alpha":
## [1] 0.05
## 
## Slot "na_action":
## [1] "complete"
## 
## Slot "include_interaction":
## [1] NA
## 
## Slot "random_slope_variable":
## NULL
## 
## Slot "covariate_is_categorical":
## [1] NA

In practice, for a given data set the design can be input from a jsonfile. The user can use a text editor to modify the design_template.json file available with the package to input the relevant information based on the column names of the input data. For example, the design for the cell size data referred to above can be read using the function readDesign.

design <- readDesign(file.path(template_dir, "design_cell_assay.json"))
print(design)

## An object of class "RMeDesign"
## Slot "response_column":
## [1] "cell_size2"
## 
## Slot "covariate":
## NULL
## 
## Slot "condition_column":
## [1] "classification"
## 
## Slot "condition_is_categorical":
## [1] TRUE
## 
## Slot "experimental_columns":
## [1] "experiment" "line"      
## 
## Slot "crossed_columns":
## [1] "line"
## 
## Slot "total_column":
## NULL
## 
## Slot "outlier_alpha":
## [1] 0.05
## 
## Slot "na_action":
## [1] "complete"
## 
## Slot "include_interaction":
## [1] NA
## 
## Slot "random_slope_variable":
## NULL
## 
## Slot "covariate_is_categorical":
## [1] NA

1.4.2 `ProbabilityModel` S4 class

The error probability distribution is specified by two slots in objects of the ProbabilityModel class.

error_is_non_normal: Is the logical value that is TRUE if the underlying probability distribution is not assumed to be Normal and is FALSE otherwise.
family_p: This character value specifies the probability distribution. Accepted values are poisson(=poisson(link="log")), binomial(=binomial(link="logit")), bionomial_log(=binomial(link="log")), Gamma_log(=Gamma(link="log")), Gamma(=Gamma(link="inverse")), negative_binomial. Defaults to NULL.

model <- readProbabilityModel(file.path(template_dir, "prob_model.json"))
print(model)

## An object of class "ProbabilityModel"
## Slot "error_is_non_normal":
## [1] FALSE
## 
## Slot "family_p":
## NULL

1.4.3 `PowerParams` S4 class

The parameters described below are necessary for sample-size estimation and are stored in the following slots of the PowerParams class.

target_columns: This describes the character vector with column names of experimental variables for which the sample-size estimation is to be performed
power_curve: This is numeric 1 or 0. 1: Power simulation over a range of sample sizes or levels. 0: Power calculation over a single sample size or a level. Defaults to 1.
nsimn: The number of simulations to estimate power. Defaults to 10.
levels: numeric 1 or 0. 1: Amplify the number of corresponding target parameter. 0: Amplify the number of samples from the corresponding target parameter, ex) If target_columns = c(“experiment”,“cell_line”) and if you want to expand the number of experiment and sample more cells from each cell line, set levels = c(1,0).
max_size: Maximum levels or sample sizes to test. Default if set to NULL equals the current level or the current sample size x 5. ex) If max_levels = c(10,5), it will test upto 10 experiments and 5 cell lines.
alpha: Type I error for sample-size estimation. Defaults to 0.05.
breaks: Levels /sample sizes of the variable to be specified along the power curve. Default if set to NULL equals max (1, round (the number of current levels / 5)).
effect_size: A 3 dimensional numeric vector given the proposed effect sizes/parameter estimates for the condition_column, covariate and interaction term between these two variables, respectively in that order. Depending on the model not all elements of the effect_size would make sense. For example, in situations, where there is only the condition column, the second and third elements of effect_size should be equal to NA. If you know the effect size of your condition variable, the effect size can be provided as a parameter. Default set to NULL,i.e., if the effect size is not provided then it will be estimated from your data.
icc: Intra-class coefficients for each of the experimental variables. Used only in the situation when error_is_non_normal=FALSE and when the data does not allow variance component estimation.

power_param <- readPowerParams(file.path(template_dir, "power_param.json"))
print(power_param)

## An object of class "PowerParams"
## Slot "target_columns":
## [1] "experiment"
## 
## Slot "power_curve":
## [1] 1
## 
## Slot "nsimn":
## [1] 10
## 
## Slot "levels":
## [1] 1
## 
## Slot "max_size":
## [1] 1
## 
## Slot "alpha":
## [1] 0.05
## 
## Slot "breaks":
## NULL
## 
## Slot "effect_size":
## NULL
## 
## Slot "icc":
## NULL

1.5 Three key functions of `RMeDPower2`

The following are functions for users to manipulate data, estimate power, and perform association analysis.

1.5.1 `diagnoseDataModel(data, design, model)` function

This set of code, tests the model assumptions by using quantile-quantile, residuals vs fitted and residuals versus predicted plots. If needed, the code also transforms the responses if feasible and identifies outlier observations and also outlier experimental units.

1.5.2 `calculatePower(data, design, model, power_param)` function

This set of code performs sample-size calculations for given experimental design/target variable.

1.5.3 `getEstimatesOfInterest(data, design, model)` function

This string estimates parameter of interest and also plots the resulting association with predictor of interest using residuals from the model that removes the effects of the modeled experimental variables.

An overview of the anticipated usage of the functionality of this package is shown in the figure below.

Flowchart illustrating the functionality of the package

2 Application examples

Here we illustrate the application of RMeDPower2 using data derived from in vitro assays. We further illustrate the RMeDPower2 across the other domains in biomedical research on our website: https://gladstone-institutes.github.io/RMeDPower2/index.html .

Flowchart illustrating the functionality of the package

2.1 Plate assays

iPSC lines from control and patient derived iPSCs from sALS patients were obtained through the Answer ALS (AALS) consortium36 . AALS is the largest repository of ALS patient samples, and contains publicly available patient-specific iPSCs and multi-OMIC data from motor neurons derived from those iPSCs—including RNA Seq, proteomics and whole genome sequence data— all of which is matched with longitudinal clinical data (https://dataportal.answerals.org/home)36.

We used 6 control lines and 4 sALS lines and differentiated each line into motor neurons (iMNs). iMNs were transduced with a morphology marker and imaged on day ~25 using RM37-45 as previously described36. Images were processed and the soma size of neurons were captured using contour ellipse46 function adapted for use in python (https://www.python.org/).

The next step is to read the data and perform data distribution diagnostics, association analysis, and power simulation.

First, we will read the data from the package and json files with specified parameters:

template_dir <- system.file("input_templates/cell_assay_data", package = "RMeDPower2")
data <- plate_assay_pilot_data
design <- readDesign(file.path(template_dir,"design_cell_assay.json"))
model <- readProbabilityModel(file.path(template_dir,"prob_model.json"))
power_param <- readPowerParams(file.path(template_dir,"power_param.json"))

A visualization of the distribution of cell_size2 across experiments and cell-lines as box-plots and also in terms of empirical cumulative distribution plots suggests the observations have an extremely long tail. We will therefore filter out the top 5 percent of the observations to ensure the results are not skewed by these extreme observations. We also see that the observations in experiment 3 are in general lower than those in the other two experiments.

data %<>% filter(cell_size2 < quantile(cell_size2, 0.95))

We would like to estimate the difference in the mean cell-sizes across all the observations derived from all experimental batches between the ALS disease lines and the control cell lines (modeled by the classification variable). The first step is run diagnosis on the data and model using the ‘diagnoseDataModel’ function. The Q-Q plot of the residuals of the LMM model fit to the log-transformed cell-sizes suggests a better fit than the one fit to the cell sizes in the natural scale , assuming that the data generating distribution is based on the Normal distribution. Therefore, we will hereafter model the log-transformed cell sizes. The reasonably horizontal best fit lines for the residuals from this versus the fitted values, and the square-root of the absolute values of the residuals versus the fitted values, Figure 7e suggest both the LMM model assumptions of linearity and homoscedasticity are reasonable. There were no identified individual outlier observations based on the application of the Rosner’s test to the distribution of the residuals in.

diagnose_res <- diagnoseDataModel(data, design, model)

The diagnoseDataModel function fits at least one model representing one in the natural scale of the response of interest. Additionally, if individual observations are inferred as outliers using the rosner’s test applied to the residuals of the fitted model then another model is fit without these outlier observations. In addition, if the assumed probability distribution of the residuals of the model is normal then a further additional model are fit using logarithm (log) transformed responses and another one if there are outliers in the log-transformed model fit.

To identify all the models fit to the data

print(diagnose_res$models)

## [1] "Natural scale" "Log scale"

We see that two models were fit to this data - one in the natural scale of cell size and the other in the logarithmic scale. No individual observations were identified as outliers in each of these models.

Let us look the basic QC plots for the models fit in the natural scale of the responses.

plot_no <- 1
plots <- diagnose_res$diagnostic_plots$natural_scale$plots
captions <- diagnose_res$diagnostic_plots$natural_scale$captions
for (i in seq_along(plots)) {
  print(plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:** %s. %s\n\n", plot_no,names(plots)[i], captions[i]))
  plot_no <- plot_no + 1
}

Figure 1: residuals_QQ. Check normality of residuals (natural scale). Expectation is that the black points lie on or close to the solid red diagonal line. Check the Normal Q-Q plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 2: residuals_vs_fitted. Check for linearity (natural scale). Expectation is that the best fit solid red line is horizontal or close to being horizontal. Check the Residuals vs Fitted plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 3: residuals_homoscedasticity. Check for homoscedasticity (natural scale). Expectation is that the best fit solid red line is horizontal or close to being so. Check the Scale-Location plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 4: random_effects_QQ_1. Check normality of random effects for experimental_column2_(Intercept) (natural scale): line. Expectation is that the black points lie on or close to the solid red diagonal line. Check the Normal Q-Q plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 5: random_effects_QQ_2. Check normality of random effects for experimental_column1_(Intercept) (natural scale): experiment. Expectation is that the black points lie on or close to the solid red diagonal line. Check the Normal Q-Q plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

cat("\n\n\n")

cooks_plots <- diagnose_res$cooks_plots$natural_scale
for(i in 1:length(cooks_plots)) {
  print(cooks_plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
  plot_no <- plot_no + 1
}

Figure 6:

Figure 7:

print("Inferred outlier groups are")

[1] “Inferred outlier groups are”

print(diagnose_res$inferred_outlier_groups$natural_scale)

$experiment [1] “experiment3”

$line [1] “cellline4” “cellline9”

Now, let us look at the same set of QC plots for models on the log-transformed responses.

plots <- diagnose_res$diagnostic_plots$log_scale$plots
captions <- diagnose_res$diagnostic_plots$log_scale$captions
for (i in seq_along(plots)) {
  print(plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:** %s. %s\n\n", plot_no,names(plots)[i], captions[i]))
  plot_no <- plot_no + 1
}

Figure 8: residuals_QQ. Check normality of residuals (log scale). Expectation is that the black points lie on or close to the solid red diagonal line. Check the Normal Q-Q plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 9: residuals_vs_fitted. Check for linearity (log scale). Expectation is that the best fit solid red line is horizontal or close to being horizontal. Check the Residuals vs Fitted plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 10: residuals_homoscedasticity. Check for homoscedasticity (log scale). Expectation is that the best fit solid red line is horizontal or close to being so. Check the Scale-Location plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 11: random_effects_QQ_1. Check normality of random effects for experimental_column2_(Intercept) (log scale): line. Expectation is that the black points lie on or close to the solid red diagonal line. Check the Normal Q-Q plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

Figure 12: random_effects_QQ_2. Check normality of random effects for experimental_column1_(Intercept) (log scale): experiment. Expectation is that the black points lie on or close to the solid red diagonal line. Check the Normal Q-Q plots for good and bad examples here: https://library.virginia.edu/data/articles/diagnostic-plots

cooks_plots <- diagnose_res$cooks_plots$log_scale

cat("\n\n\n")

for(i in 1:length(cooks_plots)) {
  print(cooks_plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
  plot_no <- plot_no + 1
}

Figure 13:

Figure 14:

print("Inferred outlier groups are")

[1] “Inferred outlier groups are”

print(diagnose_res$inferred_outlier_groups$natural_scale)

$experiment [1] “experiment3”

$line [1] “cellline4” “cellline9”

We see normality assumption is better met with the log-transformed model (Figure 6) compared to the model in the natural scale (Figure 1).

The code also outputs Cook’s distance measures for each experiment and cell-line that is intended to capture how influential each of these were in determining the final parameter estimates in the LMM model.

The observations in experimental batch 3 and those for cell lines 4 and 9 were identified as outliers using a 4/n (where n is the number of replicates, n=3 for experimental batches and n=10 for cell-lines) threshold applied to the estimates of Cook’s distance for the estimated model parameters. This warranted further investigation on whether or not there was something unusual with the observations made in experimental batch 3 (example: the images from the microscope displayed low signal to noise and may have skewed the measurements) and also for cell line 4# and 9# (example: differentiation runs, genetic background was different from the other lines) that would suggest observations linked to this experiment and cell lines may need to be removed from the analysis. Our investigation did not indicate anything untoward and therefore we ran a sensitivity analysis where we dropped all observations linked to either experiment 3, cell line 4# or cell line 9#, refit the model and visualized the resulting change in ALS status associated mean soma perimeter value. There were only relatively slight changes in the estimated changes in the mean, so that the conclusions we would draw of an increase in the soma cell perimeter in ALS lines remains irrespective of whether or not we drop observations from the experimental batch or the two cell lines.

We will use the log transformed cell_size2 data for parameter estimation using the following commands.

design_update <- design
data_update <- diagnose_res$Data_updated
print(colnames(data_update))

##  [1] "cell_size2_logTransformed" "experiment"               
##  [3] "line"                      "classification"           
##  [5] "cell_size1"                "cell_size2"               
##  [7] "cell_size3"                "cell_size4"               
##  [9] "cell_size5"                "cell_size6"

design_update@response_column <-  "cell_size2_logTransformed"
estimate_res <- getEstimatesOfInterest(data_update, design_update, model,print_plots = FALSE)

Figure 15:

The median residual boxplot shows the differences in the distribution of the response variable for the two conditions.

##print model summary output
print(estimate_res[[1]])

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: response_column ~ condition_column + (1 | experimental_column1) +  
##     (1 | experimental_column2)
##    Data: data.update
## 
## REML criterion at convergence: -3678.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8747 -0.6665 -0.0869  0.6377  2.9014 
## 
## Random effects:
##  Groups               Name        Variance  Std.Dev.
##  experimental_column2 (Intercept) 0.0002438 0.01561 
##  experimental_column1 (Intercept) 0.0021256 0.04610 
##  Residual                         0.0129115 0.11363 
## Number of obs: 2458, groups:  experimental_column2, 10; experimental_column1, 3
## 
## Fixed effects:
##                   Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)        5.77509    0.02764 2.19350 208.924  9.3e-06 ***
## condition_column1  0.03072    0.01182 8.47900   2.599   0.0302 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## cndtn_clmn1 -0.171

As a result of linear mixed model-based association analysis, the response variable was found to be significantly associated with the condition variable with an estimate of 0.03.

We can also estimate the statistical needed to estimate the observed differences in mean perimeter across 15 experimental batches

power_param@max_size <- 15
power_res <- calculatePower(data_update, design_update, model, power_param)

[[1]]

Figure 16:

The power curve shows that at least eight or more experimental batches are required to achieve a power of 80% or greater. The points in the plot represent the mean estimate of power over the simulations while the error bars represent the 95% confidence intervals.

Accordingly, we will now load data from 8 experimental batches and estimate our parameter of interest.

full_data <- plate_assay_full_data

##filter the top 5% of the observations
full_data %<>% filter(perim_2th_effect < quantile(perim_2th_effect, 0.95))

design@response_column <- "perim_2th_effect"
design@condition_column <- "classif"
full_diag_res <- diagnoseDataModel(full_data, design, model)
full_data_update <- full_diag_res$Data_updated
full_design <- design
full_design@response_column <- "perim_2th_effect_logTransformed"

Note now, none of the experimental batches or the 13 cell-lines are considered outliers or unduly influencing the final parameter estimates.

cooks_plots <- full_diag_res$cooks_plots$log_scale
cat("\n\n\n")

for(i in 1:length(cooks_plots)) {
  print(cooks_plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
  plot_no <- plot_no + 1
}

Figure 17:

Figure 18:

print("Inferred outlier groups are")

[1] “Inferred outlier groups are”

print(diagnose_res$inferred_outlier_groups$natural_scale)

$experiment [1] “experiment3”

$line [1] “cellline4” “cellline9”

We derive an estimate of 0.027 as the mean difference in perimeter with an associated significance of 0.006 using the data from 8 experimental batches.

full_res <- getEstimatesOfInterest(full_data_update, full_design, model, print_plots = FALSE)
print(full_res[[1]])

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: response_column ~ condition_column + (1 | experimental_column1) +  
##     (1 | experimental_column2)
##    Data: data.update
## 
## REML criterion at convergence: -10215.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1653 -0.6837 -0.0505  0.6413  3.0906 
## 
## Random effects:
##  Groups               Name        Variance Std.Dev.
##  experimental_column2 (Intercept) 0.000168 0.01296 
##  experimental_column1 (Intercept) 0.000883 0.02971 
##  Residual                         0.010516 0.10255 
## Number of obs: 5989, groups:  experimental_column2, 13; experimental_column1, 8
## 
## Fixed effects:
##                    Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)        5.753966   0.011811 10.213598 487.158   <2e-16 ***
## condition_column1  0.027159   0.007924 10.925415   3.427   0.0057 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## cndtn_clmn1 -0.309

Figure 19:

3 Website

Please check https://gladstone-institutes.github.io/RMeDPower2/index.html for more detailed tutorials in other applications involving single cell RNA, mouse behavior and electrophysiology experiments. These will include illustration of the application of RMeDPower2 to non-normal count data and an explanation of the simulation setup for sample size calculations. There is also an guide to defining and explaining the required classes by the package. The guide include an Rshiny app to help an user appropriately define the classes appropriate for their experimental design and question of interest.

4 Session info

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] magrittr_2.0.4   dplyr_1.1.4      ggplot2_4.0.0    RMeDPower2_1.0.2
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        influence.ME_0.9-10 simr_1.0.9         
##  [4] xfun_0.53           bslib_0.9.0         lattice_0.22-7     
##  [7] numDeriv_2016.8-1.1 vctrs_0.6.5         tools_4.5.1        
## [10] Rdpack_2.6.4        generics_0.1.4      parallel_4.5.1     
## [13] pbkrtest_0.5.5      tibble_3.3.0        pkgconfig_2.0.3    
## [16] Matrix_1.7-3        DHARMa_0.5.0        RColorBrewer_1.1-3 
## [19] S7_0.2.0            lifecycle_1.0.4     binom_1.1-1.1      
## [22] stringr_1.6.0       compiler_4.5.1      farver_2.1.2       
## [25] lmerTest_3.2-1      carData_3.0-5       litedown_0.7       
## [28] htmltools_0.5.8.1   sass_0.4.10         yaml_2.3.10        
## [31] Formula_1.2-5       pillar_1.11.1       car_3.1-3          
## [34] nloptr_2.2.1        jquerylib_0.1.4     tidyr_1.3.1        
## [37] MASS_7.3-65         cachem_1.1.0        reformulas_0.4.4   
## [40] iterators_1.0.14    boot_1.3-31         abind_1.4-8        
## [43] nlme_3.1-168        commonmark_2.0.0    tidyselect_1.2.1   
## [46] digest_0.6.37       stringi_1.8.7       purrr_1.2.0        
## [49] labeling_0.4.3      splines_4.5.1       fastmap_1.2.0      
## [52] grid_4.5.1          RLRsim_3.1-8        cli_3.6.5          
## [55] broom_1.0.10        withr_3.0.2         backports_1.5.0    
## [58] scales_1.4.0        plotrix_3.8-4       rmarkdown_2.29     
## [61] ggtext_0.1.2        lme4_2.0-1          png_0.1-8          
## [64] evaluate_1.0.5      knitr_1.50          rbibutils_2.3      
## [67] EnvStats_3.1.0      markdown_2.0        mgcv_1.9-3         
## [70] rlang_1.1.6         gridtext_0.1.5      Rcpp_1.1.0         
## [73] glue_1.8.0          xml2_1.4.0          rstudioapi_0.17.1  
## [76] minqa_1.2.8         jsonlite_2.0.0      plyr_1.8.9         
## [79] R6_2.6.1

RMeDPower2: helping design and analyses of data from repeated measures experiments

Min-Gyoung Shin

Reuben Thomas

1 Getting started

1.1 Installing RMeDPower2

1.2 Quick tutorial

1.2.1 Load libraries and read input data

1.2.2 Specify design, model and power parameters

1.2.3 Diagnose data and model

1.2.4 Statistical power estimation

1.2.5 Estimate parameters of interest

1.3 Input data

1.4 Three key classes of RMeDPower2

1.4.1 RMeDesign S4 class

1.4.2 ProbabilityModel S4 class

1.4.3 PowerParams S4 class

1.5 Three key functions of RMeDPower2

1.5.1 diagnoseDataModel(data, design, model) function

1.5.2 calculatePower(data, design, model, power_param) function

1.5.3 getEstimatesOfInterest(data, design, model) function

2 Application examples

2.1 Plate assays

3 Website

4 Session info

1.1 Installing `RMeDPower2`

1.4 Three key classes of `RMeDPower2`

1.4.1 `RMeDesign` S4 class

1.4.2 `ProbabilityModel` S4 class

1.4.3 `PowerParams` S4 class

1.5 Three key functions of `RMeDPower2`

1.5.1 `diagnoseDataModel(data, design, model)` function

1.5.2 `calculatePower(data, design, model, power_param)` function

1.5.3 `getEstimatesOfInterest(data, design, model)` function