| Title: | Estimating Propensity Scores (PS), PS-Based Weights, and Effects |
| Version: | 0.1.2 |
| Description: | Toolbox that provides a streamlined, end-to-end workflow for propensity score analysis in generating real-world evidence from real-world data. The package covers the full analytic pipeline - from estimating propensity scores via logistic regression, to calculating weights or creating a matched cohort, to generating publication-ready Table 1s with standardized mean differences and weighted balance diagnostics. It also estimates incidence rates, hazard ratios, risk ratios, and risk differences with support for stratified and direct-standardized analyses. All core functions produce formatted 'Excel' reports with embedded 'README' documentation, making results immediately shareable with collaborators and stakeholders. Methods are based on Rosenbaum and Rubin (1983) <doi:10.1093/biomet/70.1.41>, Austin (2011) <doi:10.1080/00273171.2011.568786>, and Desai et al. (2017) <doi:10.1097/EDE.0000000000000595>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| Language: | en-US |
| RoxygenNote: | 7.3.3 |
| Imports: | stats, utils, parallel, survival, survey |
| Suggests: | openxlsx, ggplot2, MatchIt, testthat (≥ 3.0.0) |
| URL: | https://github.com/hanseul0618/rwetools |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-05-22 02:23:19 UTC; hanse |
| Author: | Hanseul Cho [aut, cre], Georg Hahn [aut], Janinne Ortega-Montiel [aut], Julie Paik [aut], Elisabetta Patorno [aut] |
| Maintainer: | Hanseul Cho <hanseul0618@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-28 13:30:02 UTC |
Add a README sheet to an openxlsx workbook
Description
Add a README sheet to an openxlsx workbook
Usage
add_readme_sheet(wb, readme_text, verbose = TRUE)
Arguments
wb |
An openxlsx workbook object |
readme_text |
Character string (may contain newlines) |
verbose |
Logical. Print progress messages (default TRUE). |
Value
Called for its side effect (adds a sheet to wb). Returns
NULL invisibly.
Build a Table 1 comparing baseline characteristics between groups
Description
Generates a descriptive Table 1 with counts, percentages, means, SDs, crude differences, standardized mean differences (SMD), and missingness. Supports inverse probability weighting via a user-supplied weight column and optionally saves the output to an Excel file.
Usage
build_table1(
in_df,
out_xlsxpath = NULL,
exposure_var,
exp_value = 1,
ref_value = 0,
use_weights = FALSE,
weight_var = "psweight",
cont_vars = NULL,
cat_vars = NULL,
binary_vars = NULL,
drop_vars = NULL,
drop_varpattern = NULL,
Var_colname = "Variable",
Vartype_colname = "Type",
Total_colname = "Total",
Exp_colname = "Exp",
Ref_colname = "Ref",
CrudeDiff_colname = "Crude_diff",
StdDiff_colname = "Std_diff",
MissingTotal_colname = "Missing_Total_N_Pct",
MissingExp_colname = "Missing_Exp_N_Pct",
MissingRef_colname = "Missing_Ref_N_Pct",
ExistingTotal_colname = "Existing_Total_N_denom",
ExistingExp_colname = NULL,
ExistingRef_colname = NULL,
mean_decimal = 2,
sd_decimal = 2,
n_decimal = 0,
pct_decimal = 1,
use_absolute_values_for_diff = FALSE,
add_n_of_patients_row = TRUE,
verbose = TRUE
)
Arguments
in_df |
Data frame containing the analytic cohort. |
out_xlsxpath |
Character string. File path for Excel output, or
|
exposure_var |
Character string. Name of the binary exposure column. |
exp_value |
Value in |
ref_value |
Value in |
use_weights |
Logical. Apply weights? (default FALSE). |
weight_var |
Character string. Name of the weight column (default
|
cont_vars |
Character vector of continuous variable names. |
cat_vars |
Character vector or named list of categorical variable names. If a named list, each element should be a character vector of factor levels. |
binary_vars |
Character vector of binary variable names. |
drop_vars |
Character vector of variable names to exclude. |
drop_varpattern |
Character vector of regex patterns; matching variables are excluded. |
Var_colname, Vartype_colname, Total_colname, Exp_colname, Ref_colname |
Column-name overrides for the output table. |
CrudeDiff_colname, StdDiff_colname |
Column-name overrides for difference columns. |
MissingTotal_colname, MissingExp_colname, MissingRef_colname |
Column-name overrides for missingness columns. |
ExistingTotal_colname, ExistingExp_colname, ExistingRef_colname |
Column-name overrides for non-missing-count columns. Set to |
mean_decimal, sd_decimal |
Integer. Decimal places for mean and SD. |
n_decimal |
Integer. Decimal places for counts (default 0). |
pct_decimal |
Integer. Decimal places for percentages (default 1). |
use_absolute_values_for_diff |
Logical. Show absolute differences? (default FALSE). |
add_n_of_patients_row |
Logical. Prepend an |
verbose |
Logical. Print progress messages (default TRUE). |
Value
Invisibly returns the Table 1 data frame.
Side Effects
When out_xlsxpath is not NULL, creates the output directory
(if needed) and writes an Excel workbook.
Examples
csv_path <- system.file("extdata", "sample_data.csv", package = "rwetools")
df <- read.csv(csv_path)
# Unweighted Table 1
tbl <- build_table1(
in_df = df,
exposure_var = "exposure",
cont_vars = c("cont1", "cont2", "cont3"),
binary_vars = c("binary1", "binary2"),
cat_vars = c("cat1", "cat2"),
verbose = FALSE
)
head(tbl)
# Weighted Table 1 with Excel output (requires openxlsx)
if (requireNamespace("openxlsx", quietly = TRUE)) {
df_ps <- estimate_ps(
in_df = df,
exposure_var = "exposure",
class_vars = c("cat1", "cat2", "cat3", "cat4"),
cont_vars = c("cont1", "cont2", "cont3"),
verbose = FALSE
)
df_wt <- create_ps_weights(
in_df = df_ps,
exposure_var = "exposure",
ps_var = "ps",
weight_method = "mw",
weight_var = "mw_wt",
verbose = FALSE
)
out_xlsx <- tempfile(fileext = ".xlsx")
tbl_wt <- build_table1(
in_df = df_wt,
out_xlsxpath = out_xlsx,
exposure_var = "exposure",
use_weights = TRUE,
weight_var = "mw_wt",
cont_vars = c("cont1", "cont2", "cont3"),
binary_vars = c("binary1", "binary2"),
cat_vars = c("cat1", "cat2"),
verbose = FALSE
)
}
Calculate C-statistic (concordance) for PS model discrimination
Description
Uses survival::concordance which is O(n log n) and supports sample
weights natively.
Usage
calc_c_statistic(
exposure_vec,
ps_vec,
weights_vec = NULL,
label = "PS Model",
verbose = TRUE
)
Arguments
exposure_vec |
Binary exposure vector (0/1) |
ps_vec |
Propensity score vector |
weights_vec |
Optional numeric weights vector (for post-weighting / post-matching c-stat) |
label |
Character label printed in console (default "PS Model") |
verbose |
Logical. Print progress messages (default TRUE). |
Value
A list with elements c_stat (numeric) and se (numeric),
or NULL if survival is not available.
Canonize levels for consistent factor conversion
Description
Converts a vector to a factor with canonized levels. Logical vectors
become factor("0", "1"), numeric vectors are coerced via
as.character, and character vectors are trimmed.
Usage
canonize_levels(x)
Arguments
x |
A vector (logical, numeric, integer, character, or factor). |
Value
A factor with canonical levels.
Check propensity score assumptions
Description
Internal function used by create_ps_weights and other PS
functions to run diagnostic checks on propensity scores, including perfect
separation, positivity, overlap, and covariate balance.
Usage
check_ps_assumptions_internal(data, ps_var, exposure, verbose = TRUE)
Arguments
data |
Data frame containing the propensity score and exposure columns. |
ps_var |
Character string. Name of the propensity score column. |
exposure |
Character string. Name of the binary exposure column. |
verbose |
Logical. Print progress messages (default TRUE). |
Value
A list of assumption-check results, each element containing a
detected flag and relevant summary statistics.
Combine named lists, keeping unique values
Description
Merges multiple named lists by name, concatenating and de-duplicating the
values for each shared key. Useful for combining cat_vars_w_levels
outputs from multiple calls to get_var_types.
Usage
combine_named_list(...)
Arguments
... |
Named lists to combine. |
Value
A single named list with unique, combined values per key.
Create a balance (Love) plot comparing standardized differences
Description
Create a balance (Love) plot comparing standardized differences
Usage
create_love_plot(
variable_names,
crude_std_diff,
adjusted_std_diff,
crude_label,
adjusted_label,
title,
output_path,
colors = NULL,
shapes = NULL,
use_absolute = FALSE
)
Arguments
variable_names |
Character vector of variable names |
crude_std_diff |
Numeric vector of crude/pre standardized differences |
adjusted_std_diff |
Numeric vector of adjusted/post standardized differences |
crude_label |
Label for crude/pre group (e.g. "Unweighted", "Pre-matching", "Crude") |
adjusted_label |
Label for adjusted/post group (e.g. "Weighted", "Post-matching") |
title |
Plot title |
output_path |
Full file path for saved plot |
colors |
Named character vector of length 2 with crude_label and adjusted_label as names |
shapes |
Optional named integer vector of length 2 with crude_label and adjusted_label as names giving ggplot2 shape codes (e.g., 16 = filled circle, 17 = filled triangle). If NULL, ggplot2 default shapes are used. |
use_absolute |
Logical. Plot absolute values of std diff? (default FALSE) |
Value
Invisibly returns a ggplot object, or invisible(NULL)
if ggplot2 is unavailable or no data remain after removing NAs.
Side Effects
Saves the plot to output_path via ggplot2::ggsave.
Calculate PS Fine Stratification Weights, Trim Non-overlapping Regions, and Generate Diagnostics
Description
Combined function that (1) creates PS fine strata, (2) calculates stratification weights, (3) trims non-overlapping PS regions, and optionally (4) builds a crude vs weighted balance table (Table 1) and (5) generates diagnostic plots.
Usage
create_ps_fs_weights(
in_df,
out_csvpath = NULL,
out_xlsxpath_report = NULL,
out_dir_plots = NULL,
trim_nonoverlap_region = TRUE,
exposure_var = "exp",
exp_value = 1,
ref_value = 0,
ps_var = NULL,
weight_var = "ps_fs_wt",
number_of_strata = 50,
stratification_method = c("exposure", "cohort"),
estimand = c("ATT", "ATE"),
make_unwt_wt_table1 = FALSE,
table1_cont_vars = NULL,
table1_binary_vars = NULL,
table1_cat_vars = NULL,
std_diff_threshold = 0.1,
readme_text = NULL,
verbose = TRUE
)
Arguments
in_df |
Data frame with PS already calculated (the "crude" / untrimmed data) |
out_csvpath |
Character. Path for output CSV of trimmed+weighted data (optional) |
out_xlsxpath_report |
Character. Path for Excel diagnostic report (optional) |
out_dir_plots |
Character. Directory to save diagnostic plots (NULL = skip plots) |
trim_nonoverlap_region |
Logical. Trim non-overlapping PS regions (default TRUE) |
exposure_var |
Character. Name of binary exposure variable (default "exp") |
exp_value |
Value representing exposed group (default 1) |
ref_value |
Value representing reference group (default 0) |
ps_var |
Character. Name of PS variable in the data (required) |
weight_var |
Character. Name for the stratification weight variable (default "ps_fs_wt") |
number_of_strata |
Integer. Number of strata to create (default 50) |
stratification_method |
Character. "exposure" or "cohort" (default "exposure") |
estimand |
Character. "ATT" or "ATE" (default "ATT") |
make_unwt_wt_table1 |
Logical. Build crude vs weighted balance table (default FALSE) |
table1_cont_vars |
Character vector. Continuous vars for Table 1 (auto-detected if NULL) |
table1_binary_vars |
Character vector. Binary vars for Table 1 (auto-detected if NULL) |
table1_cat_vars |
Character vector. Categorical vars for Table 1 (auto-detected if NULL) |
std_diff_threshold |
Numeric. Threshold for acceptable standardised difference (default 0.1) |
readme_text |
Character. Optional message for README sheet in Excel report |
verbose |
Logical. Print progress messages (default TRUE). |
Value
Invisibly returns the trimmed+weighted data frame.
Side Effects
Writes a CSV file when
out_csvpathis provided.Creates directories, writes an Excel diagnostic report, and saves PNG plot files when the corresponding path arguments are supplied.
Examples
csv_path <- system.file("extdata", "sample_data.csv", package = "rwetools")
df_ps <- estimate_ps(
in_df = read.csv(csv_path),
exposure_var = "exposure",
class_vars = c("cat1", "cat2", "cat3", "cat4"),
cont_vars = c("cont1", "cont2", "cont3"),
verbose = FALSE
)
result <- create_ps_fs_weights(
in_df = df_ps,
exposure_var = "exposure",
ps_var = "ps",
weight_var = "fs_wt",
number_of_strata = 10,
stratification_method = "exposure",
estimand = "ATT",
verbose = FALSE
)
summary(result$fs_wt)
Perform propensity score matching and optionally generate diagnostic reports.
Description
This function performs 1:k nearest neighbor PS matching using MatchIt and generates comprehensive diagnostic reports including assumption checks, balance tables, and plots.
Usage
create_ps_matched_cohort(
in_df = NULL,
in_csvpath = NULL,
out_csvpath_matcheddata = NULL,
out_csvpath_crudedata_w_matchindicator = NULL,
out_xlsxpath_report = NULL,
out_dir_plots = NULL,
exposure_var = "exp",
exp_value = 1,
ref_value = 0,
ps_var = "ps",
ratio = 1,
caliper = 0.2,
replace = FALSE,
make_crude_matched_table1 = FALSE,
table1_cont_vars = NULL,
table1_binary_vars = NULL,
table1_cat_vars = NULL,
std_diff_threshold = 0.1,
readme_text = NULL,
verbose = TRUE
)
Arguments
in_df |
Data frame containing the input data with PS already calculated (optional if in_csvpath provided) |
in_csvpath |
Character string. Path to input CSV file with PS already calculated (optional if in_df provided) |
out_csvpath_matcheddata |
Character string. Path for matched cohort CSV file (optional) |
out_csvpath_crudedata_w_matchindicator |
Character string. Path for crude data with match indicator CSV file (optional) |
out_xlsxpath_report |
Character string. Path for output Excel diagnostic report (optional). |
out_dir_plots |
Character string. Directory to save plot files (optional). |
exposure_var |
Character string. Name of the binary exposure/treatment variable column (default: "exp") |
exp_value |
Value representing the exposed/treated group (default: 1) |
ref_value |
Value representing the reference/control group (default: 0) |
ps_var |
Character string. Name of the PS variable column in the data (default: "ps") |
ratio |
Integer. Matching ratio (1:k). Default is 1 for 1:1 matching. |
caliper |
Numeric. Caliper width in standard deviations of logit(PS). Default is 0.2. |
replace |
Logical. Whether to match with replacement. Default is FALSE. |
make_crude_matched_table1 |
Logical. Whether to generate crude and matched Table 1 (default: FALSE). |
table1_cont_vars |
Character vector. Names of continuous variables for Table 1 (optional, auto-detected if NULL) |
table1_binary_vars |
Character vector. Names of binary variables for Table 1 (optional, auto-detected if NULL) |
table1_cat_vars |
Character vector. Names of categorical variables for Table 1 (optional, auto-detected if NULL) |
std_diff_threshold |
Numeric. Threshold for acceptable standardized difference (default: 0.1) |
readme_text |
Character string. Optional message to include in README sheet of Excel report |
verbose |
Logical. Print progress messages (default TRUE). |
Value
A data frame containing the matched cohort only (crude data with match indicator can be saved to CSV via out_csvpath_crudedata_w_matchindicator).
Side Effects
Writes matched-cohort and/or crude-data CSV files when the corresponding path arguments are provided.
Creates directories, writes an Excel diagnostic report, and saves PNG plot files when the corresponding path arguments are supplied.
Examples
# Requires MatchIt
if (requireNamespace("MatchIt", quietly = TRUE)) {
csv_path <- system.file("extdata", "sample_data.csv", package = "rwetools")
df_ps <- estimate_ps(
in_df = read.csv(csv_path),
exposure_var = "exposure",
class_vars = c("cat1", "cat2", "cat3", "cat4"),
cont_vars = c("cont1", "cont2", "cont3"),
verbose = FALSE
)
matched <- create_ps_matched_cohort(
in_df = df_ps,
exposure_var = "exposure",
ps_var = "ps",
ratio = 1,
caliper = 0.2,
verbose = FALSE
)
nrow(matched)
}
Calculate propensity score weights and optionally generate diagnostic reports.
Description
This function calculates various types of propensity score weights (matching weights, overlap weights, IPTW, stabilized IPTW) and adds them to a dataset that already contains propensity scores. Optionally generates comprehensive diagnostic reports including assumption checks, balance tables, and plots.
Usage
create_ps_weights(
in_df = NULL,
in_csvpath = NULL,
out_csvpath = NULL,
out_xlsxpath_report = NULL,
out_dir_plots = NULL,
exposure_var = "exp",
exp_value = 1,
ref_value = 0,
ps_var = "ps",
weight_method = c("mw", "ow", "iptw", "stabiptw"),
weight_var = "psweight",
estimand = c("ATT", "ATE", "ATO"),
make_unwt_wt_table1 = FALSE,
table1_cont_vars = NULL,
table1_binary_vars = NULL,
table1_cat_vars = NULL,
std_diff_threshold = 0.1,
readme_text = NULL,
verbose = TRUE
)
Arguments
in_df |
Data frame containing the input data with PS already calculated (optional if in_csvpath provided) |
in_csvpath |
Character string. Path to input CSV file with PS already calculated (optional if in_df provided) |
out_csvpath |
Character string. Path for output CSV file (optional) |
out_xlsxpath_report |
Character string. Path for output Excel diagnostic report (optional). If NULL, no diagnostic report is generated. |
out_dir_plots |
Character string. Directory to save plot files (optional). If NULL, no plots are saved. |
exposure_var |
Character string. Name of the binary exposure/treatment variable column (default: "exp") |
exp_value |
Value representing the exposed/treated group (default: 1) |
ref_value |
Value representing the reference/control group (default: 0) |
ps_var |
Character string. Name of the PS variable column in the data (default: "ps") |
weight_method |
Character string. Type of weight to calculate: "mw" (matching weights), "ow" (overlap weights), "iptw" (inverse probability weights), "stabiptw" (stabilized IPTW). Only ONE method can be specified. |
weight_var |
Character string. Name for the weight variable column (default: "psweight") |
estimand |
Character string. Target estimand: "ATT" (average treatment effect on treated), "ATE" (average treatment effect), or "ATO" (average treatment effect in overlap population). Only used for IPTW methods. For MW/OW, estimand is determined by the method itself. |
make_unwt_wt_table1 |
Logical. Whether to generate unweighted and weighted Table 1 (default: FALSE). Only used if out_xlsxpath_report is provided. |
table1_cont_vars |
Character vector. Names of continuous variables for Table 1 (optional, auto-detected if NULL) |
table1_binary_vars |
Character vector. Names of binary variables for Table 1 (optional, auto-detected if NULL) |
table1_cat_vars |
Character vector. Names of categorical variables for Table 1 (optional, auto-detected if NULL) |
std_diff_threshold |
Numeric. Threshold for acceptable standardized difference (default: 0.1) |
readme_text |
Character string. Optional message to include in README sheet of Excel report |
verbose |
Logical. Print progress messages (default TRUE). |
Value
A data frame with the weight column added. Also saves to CSV if path specified.
Side Effects
Writes a CSV file when
out_csvpathis provided.Creates directories, writes an Excel diagnostic report, and saves PNG plot files when the corresponding path arguments are supplied.
Examples
csv_path <- system.file("extdata", "sample_data.csv", package = "rwetools")
df_ps <- estimate_ps(
in_df = read.csv(csv_path),
exposure_var = "exposure",
class_vars = c("cat1", "cat2", "cat3", "cat4"),
cont_vars = c("cont1", "cont2", "cont3"),
verbose = FALSE
)
# Add matching weights
df_mw <- create_ps_weights(
in_df = df_ps,
exposure_var = "exposure",
ps_var = "ps",
weight_method = "mw",
weight_var = "mw_wt",
verbose = FALSE
)
summary(df_mw$mw_wt)
# Add IPTW weights with Excel diagnostic report (requires openxlsx, ggplot2)
if (requireNamespace("openxlsx", quietly = TRUE) &&
requireNamespace("ggplot2", quietly = TRUE)) {
out_xlsx <- tempfile(fileext = ".xlsx")
out_dir <- tempdir()
df_iptw <- create_ps_weights(
in_df = df_ps,
exposure_var = "exposure",
ps_var = "ps",
weight_method = "iptw",
weight_var = "iptw_wt",
estimand = "ATE",
out_xlsxpath_report = out_xlsx,
make_unwt_wt_table1 = TRUE,
out_dir_plots = out_dir,
verbose = FALSE
)
}
Estimate Incidence Rates and Hazard Ratios
Description
This function estimates incidence rates and hazard ratios for time-to-event outcomes. It can perform unweighted analysis, weighted (propensity score adjusted) analysis, or both, using potentially different datasets for each analysis type. Incidence rate CIs are estimated using Poisson distribution.
Usage
estimate_hr_ir(
in_df_unwt = NULL,
in_df_wted = NULL,
out_xlsxpath = NULL,
exposure_var = "exp",
exp_value = 1,
ref_value = 0,
outcome_var = NULL,
weight_var = NULL,
survival_time = NULL,
time_unit = c("days", "years"),
ir_per_pyears = 1000,
confidence_level = 0.95,
readme_text = NULL,
stratify_by = NULL,
verbose = TRUE
)
Arguments
in_df_unwt |
Data frame for unweighted analysis. Set to NULL to skip unweighted analysis. This could be the full cohort, matched data, or any specific population |
in_df_wted |
Data frame for weighted analysis. Set to NULL to skip weighted analysis. This could be the same as unweighted data or a different population (e.g., trimmed) |
out_xlsxpath |
Character string. Path for output Excel file with all results |
exposure_var |
Character string. Name of binary exposure/treatment variable (default: "exp") |
exp_value |
Value representing the exposed/treated group (default: 1) |
ref_value |
Value representing the reference/control group (default: 0) |
outcome_var |
Character string. Name of the event indicator variable (0=censored, 1=event) |
weight_var |
Character string. Name of weight variable in the weighted dataset. Required when in_df_wted is not NULL. |
survival_time |
Character string. Name of survival/follow-up time variable (required) |
time_unit |
Character string. Unit of time for survival analysis: "days" or "years". Default: "days" |
ir_per_pyears |
Numeric. Multiplier for incidence rate per person-years. Allowed values: 1, 100, 1000, 10000, 100000. Default: 1000 (i.e., IR per 1,000 person-years). Column name is fixed as IR_per_Npy_Unwt / IR_per_Npy_Wted. The actual multiplier value is documented in the Analysis Summary sheet. |
confidence_level |
Numeric. Confidence level for confidence intervals (default: 0.95 for 95% CI) |
readme_text |
Character string. Optional message to include in README sheet at beginning of Excel file. Use this to document the analysis parameters, data sources, or any notes |
stratify_by |
Character string. Name of the stratification variable. When specified: - Cox model uses strata() in the formula (separate baseline hazards per stratum) - IR/IRD are computed via direct standardization using total cohort person-time distribution as weights Default: NULL (no stratification) |
verbose |
Logical. Print progress messages (default TRUE). |
Details
The function allows flexible analysis configurations:
Unweighted only: Set in_df_unwt = data, in_df_wted = NULL
Weighted only: Set in_df_unwt = NULL, in_df_wted = data
Both: Provide both datasets (can be the same or different populations)
For unweighted analysis:
Uses standard Cox regression without weights
Can be applied to matched data, trimmed data, or full cohort
For weighted analysis:
Uses weighted Cox regression with specified weights
Typically applied to PS-weighted data (IPTW, MW, OW, FS weights)
Stratification (stratify_by parameter): When stratify_by is specified:
Cox model: Fits a stratified Cox proportional hazards model using strata() in the model formula. This allows separate baseline hazard functions for each stratum while estimating a common treatment effect (HR) across strata.
IR/IRD: Uses direct standardization with total cohort person-time distribution as standard weights.
IR_std = sum(Rate_k * W_k) where W_k = PT_total_k / PT_total
Variance via delta method: Var = sum(W_k^2 * d_k / PT_k^2)
IRD_std = IR_std_exposed - IR_std_unexposed
CI for standardized IR uses log-transformation to ensure positive bounds
ir_per_pyears parameter: Controls the person-year multiplier for incidence rates and incidence rate differences. For example:
ir_per_pyears = 1000: rates expressed per 1,000 person-years (default)
ir_per_pyears = 100000: rates expressed per 100,000 person-years Output column names are fixed (e.g., IR_per_Npy_Unwt, IR_per_Npy_Wted). The multiplier value is recorded in the Analysis Summary sheet.
Value
Invisibly returns a list containing:
incidence_rates: Data frame with event counts, person-years, and incidence rates
hazard_ratios: Data frame with unweighted and/or weighted HR estimates
unweighted_model: The unweighted Cox model object (if unweighted analysis performed)
weighted_model: The weighted Cox model object (if weighted analysis performed)
stratum_details_unwt: Data frame with stratum-level details (if stratified, unweighted)
stratum_details_wted: Data frame with stratum-level details (if stratified, weighted) Also saves results to Excel file when out_xlsxpath is provided
Side Effects
Creates the output directory and writes an Excel workbook when
out_xlsxpath is provided.
Examples
csv_path <- system.file("extdata", "sample_data.csv", package = "rwetools")
df <- read.csv(csv_path)
# Unweighted analysis only
result <- estimate_hr_ir(
in_df_unwt = df,
in_df_wted = NULL,
exposure_var = "exposure",
outcome_var = "outcome",
survival_time = "follow_up_days",
time_unit = "days",
ir_per_pyears = 1000,
verbose = FALSE
)
result$incidence_rates
# Weighted analysis with Excel output (requires openxlsx)
if (requireNamespace("openxlsx", quietly = TRUE)) {
df_ps <- estimate_ps(
in_df = df,
exposure_var = "exposure",
class_vars = c("cat1", "cat2", "cat3", "cat4"),
cont_vars = c("cont1", "cont2", "cont3"),
verbose = FALSE
)
df_wt <- create_ps_weights(
in_df = df_ps,
exposure_var = "exposure",
ps_var = "ps",
weight_method = "mw",
weight_var = "mw_wt",
verbose = FALSE
)
out_xlsx <- tempfile(fileext = ".xlsx")
result2 <- estimate_hr_ir(
in_df_unwt = df,
in_df_wted = df_wt,
out_xlsxpath = out_xlsx,
exposure_var = "exposure",
outcome_var = "outcome",
survival_time = "follow_up_days",
weight_var = "mw_wt",
ir_per_pyears = 1000,
verbose = FALSE
)
}
Calculate propensity scores and add them to the dataset
Description
This function calculates propensity scores using logistic regression and adds them as a new column to the dataset. It can output both an R data frame and/or CSV file. Additionally, it can save odds ratio table from the PS model to Excel or data frame.
Usage
estimate_ps(
in_df = NULL,
in_csvpath = NULL,
out_csvpath = NULL,
out_xlsxpath_odds_ratio = NULL,
exposure_var = "exposure",
exp_value = 1,
ref_value = 0,
class_vars = NULL,
cont_vars = NULL,
ps_var = "ps",
interactions = NULL,
exclude_vars_w_extreme_distribution = FALSE,
verbose = TRUE
)
Arguments
in_df |
Data frame containing the input data (optional if in_csvpath provided) |
in_csvpath |
Character string. Path to input CSV file (optional if in_df provided) |
out_csvpath |
Character string. Path for output CSV file (optional, if NULL no CSV is saved) |
out_xlsxpath_odds_ratio |
Character string. Path for output Excel file with OR table (optional) |
exposure_var |
Character string. Name of the binary exposure/treatment variable column (default: "exposure") |
exp_value |
Value representing the exposed/treated group (default: 1) |
ref_value |
Value representing the reference/control group (default: 0) |
class_vars |
Character vector. Names of categorical/factor variables to include in PS model |
cont_vars |
Character vector. Names of continuous variables to include in PS model |
ps_var |
Character string. Name for the calculated PS variable column (default: "ps") |
interactions |
Character string. Interaction terms to include (e.g., "var1:var2 + var1:var3") |
exclude_vars_w_extreme_distribution |
Logical. If TRUE, automatically excludes variables with extreme distribution (categorical: only 1 unique level in either group, or any level with count < 5 in either group; continuous: constant/single unique value (SD < 1e-6) in either group, or SMD > 1.5 between groups) instead of throwing error (default: FALSE). The categorical screen unions levels across the two exposure groups, so a level present in one group but absent in the other is counted as n = 0 and flagged by the cnt < 5 rule. |
verbose |
Logical. Print progress messages (default TRUE). |
Value
A data frame with the PS column added. Also saves to CSV if path specified.
Side Effects
Writes a CSV file when
out_csvpathis provided.Writes an Excel file with odds-ratio table when
out_xlsxpath_odds_ratiois provided.
Examples
csv_path <- system.file("extdata", "sample_data.csv", package = "rwetools")
df <- read.csv(csv_path)
# Basic usage: estimate PS and add it as a new column
result <- estimate_ps(
in_df = df,
exposure_var = "exposure",
class_vars = c("cat1", "cat2", "cat3", "cat4"),
cont_vars = c("cont1", "cont2", "cont3"),
verbose = FALSE
)
head(result$ps)
# With CSV output and Excel OR table (requires openxlsx)
if (requireNamespace("openxlsx", quietly = TRUE)) {
out_csv <- tempfile(fileext = ".csv")
out_xlsx <- tempfile(fileext = ".xlsx")
result2 <- estimate_ps(
in_df = df,
out_csvpath = out_csv,
out_xlsxpath_odds_ratio = out_xlsx,
exposure_var = "exposure",
class_vars = c("cat1", "cat2"),
cont_vars = c("cont1", "cont2"),
exclude_vars_w_extreme_distribution = TRUE,
verbose = FALSE
)
}
Estimate Risk Ratios and Risk Differences using Cumulative Incidence
Description
This function estimates risk ratios (RR) and risk differences (RD) for time-to-event outcomes using cumulative incidence at a specified timepoint. Two estimators are supported:
- Kaplan-Meier (KM)
Standard 1 - S(t) cumulative incidence. Appropriate when there are no competing risks or competing events are treated as censored.
- Aalen-Johansen (AJ)
Cumulative incidence function accounting for competing risks. Produces unbiased risk estimates in the presence of competing events by treating them as a separate absorbing state rather than censoring. Requires a competing event indicator via
competing_event_var. Uses multi-statesurvfitinternally (survival>= 3.1).
It can perform unweighted analysis, weighted (propensity score adjusted) analysis, or both, using potentially different datasets for each analysis type. Confidence intervals can be estimated using bootstrapping, analytical (delta method / Greenwood), or both.
Usage
estimate_rr_rd(
in_df_unwt = NULL,
in_df_wted = NULL,
out_xlsxpath = NULL,
exposure_var = "exp",
exp_value = 1,
ref_value = 0,
outcome_var = NULL,
weight_var = NULL,
survival_time = NULL,
time_unit = c("days", "months", "years"),
rr_rd_at_timepoint = 365,
rr_rd_per_individuals = 1000,
confidence_level = 0.95,
conf_int_method = c("bootstrap", "analytical", "both"),
risk_estimator = c("KM", "AJ"),
competing_event_var = NULL,
bootstrap_count = 500,
stratify_by = NULL,
n_cores = NULL,
seed = NULL,
readme_text = NULL,
verbose = TRUE
)
Arguments
in_df_unwt |
Data frame for unweighted analysis. Set to NULL to skip unweighted analysis. This could be the full cohort, matched data, or any specific population. |
in_df_wted |
Data frame for weighted analysis. Set to NULL to skip weighted analysis. This could be the same as unweighted data or a different population (e.g., trimmed). |
out_xlsxpath |
Character string. Path for output Excel file with all results. |
exposure_var |
Character string. Name of binary exposure/treatment variable (default: "exp"). |
exp_value |
Value representing the exposed/treated group (default: 1). |
ref_value |
Value representing the reference/control group (default: 0). |
outcome_var |
Character string. Name of the event indicator variable (0=censored, 1=event). |
weight_var |
Character string. Name of weight variable in the weighted dataset.
Required when |
survival_time |
Character string. Name of survival/follow-up time variable (required). |
time_unit |
Character string. Unit of time for survival analysis: "days", "months", or "years". Default: "days". |
rr_rd_at_timepoint |
Numeric. Timepoint at which to calculate cumulative incidence.
Units depend on |
rr_rd_per_individuals |
Numeric. Denominator for expressing risk and risk difference. Default: 1000. |
confidence_level |
Numeric. Confidence level for confidence intervals (default: 0.95). |
conf_int_method |
Character string. Method for confidence interval estimation:
Default: "bootstrap". |
risk_estimator |
Character string. Cumulative incidence estimator:
Default: "KM". |
competing_event_var |
Character string. Name of the competing event indicator variable
(1 = competing event occurred, e.g., death; 0 = no competing event).
Required when |
bootstrap_count |
Integer. Number of bootstrap iterations for CI estimation (default: 500).
Ignored when |
stratify_by |
Character string. Name of stratification variable for direct standardization (default: NULL). When specified, the function: (1) calculates stratum weights (w_k = N_k / N_total) from the total population, (2) computes risk within each stratum using the selected estimator, (3) standardizes: Risk_std = Sum(Risk_k * w_k), (4) derives RR and RD from standardized risks. Bootstrap CIs use stratified resampling (within each stratum). Analytical SEs use Greenwood formula: Var(Risk_std) = Sum(w_k^2 * Var(R_k)). |
n_cores |
Integer. Number of CPU cores for parallel bootstrap. Default uses all
available cores minus 1. Set to 1 to disable parallelization.
Ignored when |
seed |
Integer or NULL. Optional seed for the bootstrap random number
generator. When NULL (default), the caller's current RNG state
is used and not modified. When an integer is supplied, it is
passed to |
readme_text |
Character string. Optional message to include in README sheet of Excel output. |
verbose |
Logical. Print progress messages (default TRUE). |
Details
When stratify_by is specified, the function performs direct standardization:
stratum-specific risks are combined using the total population distribution as the standard.
Bootstrap resampling is stratified to preserve stratum structure.
The function proceeds as follows:
Fits cumulative incidence curves by exposure group using the selected estimator
Extracts cumulative incidence (risk) at the specified timepoint
Computes Risk Ratio (RR) = Risk_exposed / Risk_reference
Computes Risk Difference (RD) = Risk_exposed - Risk_reference
Estimates confidence intervals using the selected method
Kaplan-Meier (KM) estimator: Risk = 1 - S(t), where S(t) is the KM survival estimate. Competing events, if present, are treated as censored, which can overestimate cumulative incidence.
Aalen-Johansen (AJ) estimator:
Cumulative incidence function from a multi-state model via
survfit with Surv(time, factor(status)).
Competing events are modelled as a separate absorbing state, producing
unbiased risk estimates. Variance is based on the counting-process
variance estimator (Aalen, 1978).
Analytical CI method:
RR: delta method on log scale. Var(log(RR)) = Var(R_exp)/R_exp^2 + Var(R_ref)/R_ref^2. CI = exp(log(RR) +/- z * SE(log(RR))).
RD: Var(RD) = (Var(R_exp) + Var(R_ref)) * per_n^2. CI = RD +/- z * SE(RD).
Variance from
survfit(Greenwood for KM, counting-process for AJ).
Bootstrap CI method:
Percentile bootstrap with parallel computation via
parLapply using L'Ecuyer-CMRG random number streams.
Pass an integer to seed for reproducible bootstrap CIs; by default
(seed = NULL) the caller's current RNG state is used and not modified.
When stratify_by is specified (Direct Standardization):
Stratum weights w_k = N_k / N_total from the total population
Within each stratum k, risk R_{ik} is computed for each exposure group i
Standardized risk: Risk_std_i = Sum_k(R_{ik} * w_k)
RR = Risk_std_1 / Risk_std_0; RD = Risk_std_1 - Risk_std_0
Analytical SE: Var(Risk_std_i) = Sum_k(w_k^2 * Var(R_{ik}))
Bootstrap CIs use stratified resampling with fixed original population weights
Value
Invisibly returns a list containing:
- estimates
Data frame with RR, RD, confidence intervals, and
Risk_Estimator("KM"or"AJ").- cumulative_incidence
Data frame with cumulative incidence by group. Columns:
Risk(probability),Risk_LCI/Risk_UCI(log-transformed CI),Risk_SE(SE on 0-1 scale),Risk_Var(variance),RiskperN(scaled risk),RiskperN_LCI/RiskperN_UCI(scaled CI), andRisk_Estimator.- stratum_details
(if
stratify_byspecified) Data frame with stratum-level results andRisk_Estimator.- unweighted_bootstrap
(if applicable) Bootstrap results matrix.
- weighted_bootstrap
(if applicable) Bootstrap results matrix.
Also saves results to Excel file when out_xlsxpath is provided.
Side Effects
Creates the output directory and writes an Excel workbook when
out_xlsxpath is provided.
Examples
csv_path <- system.file("extdata", "sample_data.csv", package = "rwetools")
df <- read.csv(csv_path)
# KM with analytical CIs (fast, no bootstrap)
result <- estimate_rr_rd(
in_df_unwt = df,
in_df_wted = NULL,
exposure_var = "exposure",
outcome_var = "outcome",
survival_time = "follow_up_days",
time_unit = "days",
rr_rd_at_timepoint = 365,
rr_rd_per_individuals = 1000,
conf_int_method = "analytical",
verbose = FALSE
)
result$estimates
# KM with bootstrap CIs (slow) and competing risk (AJ estimator)
result2 <- estimate_rr_rd(
in_df_unwt = df,
in_df_wted = NULL,
exposure_var = "exposure",
outcome_var = "outcome",
survival_time = "follow_up_days",
time_unit = "days",
rr_rd_at_timepoint = 365,
rr_rd_per_individuals = 1000,
risk_estimator = "AJ",
competing_event_var = "competing_event",
conf_int_method = "bootstrap",
bootstrap_count = 100,
n_cores = 1,
verbose = FALSE
)
Format mean and standard deviation as "mean (SD)"
Description
Format mean and standard deviation as "mean (SD)"
Usage
fmt_mean_sd(m, s, mean_digits = 2, sd_digits = 2, use_abs = FALSE)
Arguments
m |
Numeric scalar. The mean value. |
s |
Numeric scalar. The standard deviation. |
mean_digits |
Integer. Decimal places for the mean (default 2). |
sd_digits |
Integer. Decimal places for the SD (default 2). |
use_abs |
Logical. If TRUE, use absolute values (default FALSE). |
Value
Character string in the form "mean (sd)", or "NA" when m is NA.
Format count and percentage as "n (pct%)"
Description
Format count and percentage as "n (pct%)"
Usage
fmt_n_pct(n, d, n_digits = 0, pct_digits = 1, use_abs = FALSE)
Arguments
n |
Numeric scalar. The count (numerator). |
d |
Numeric scalar. The denominator. |
n_digits |
Integer. Decimal places for the count (default 0). |
pct_digits |
Integer. Decimal places for the percentage (default 1). |
use_abs |
Logical. If TRUE, use absolute values (default FALSE). |
Value
Character string in the form "n (pct%)", or "NA" when inputs are
missing or d == 0.
Format a numeric value as a fixed-point string
Description
Format a numeric value as a fixed-point string
Usage
fmt_num(x, digits = 3, use_abs = FALSE)
Arguments
x |
Numeric scalar to format. |
digits |
Integer. Number of decimal places (default 3). |
use_abs |
Logical. If TRUE, format the absolute value (default FALSE). |
Value
Character string, or "NA" when x is NA.
Format a proportion as a percentage string
Description
Format a proportion as a percentage string
Usage
fmt_pct(p, digits = 1, use_abs = FALSE)
Arguments
p |
Numeric scalar. A proportion (0 to 1 scale). |
digits |
Integer. Decimal places for the percentage (default 1). |
use_abs |
Logical. If TRUE, use absolute value (default FALSE). |
Value
Character string in the form "xx.x%", or "–" when p
is NA.
Classify variables as continuous, binary, or categorical
Description
Inspects one or more data frames to determine the type of each variable using a combination of pattern-based rules and data-driven heuristics.
Usage
get_var_types(df_list, max_cat_levels = 12)
Arguments
df_list |
A data frame or list of data frames to inspect. |
max_cat_levels |
Integer. Numeric variables with more than this many unique values are classified as continuous (default 12). |
Value
A named list with elements:
- cat_vars
Sorted character vector of categorical variable names.
- cat_vars_w_levels
Named list mapping each categorical variable to its sorted unique levels.
- binary_vars
Sorted character vector of binary (0/1) variable names.
- cont_vars
Sorted character vector of continuous variable names.
Compute weighted proportion for a single factor level
Description
Compute weighted proportion for a single factor level
Usage
get_weighted_prop(design, var, level)
Arguments
design |
A |
var |
Character string. Name of the categorical variable. |
level |
Character string. The factor level whose proportion is needed. |
Value
Numeric scalar: the weighted proportion, 0 if the level is absent,
or NA_real_ on error.
Compute weighted mean, SD, and total weight from a survey design
Description
Compute weighted mean, SD, and total weight from a survey design
Usage
get_weighted_stats(des, var_name)
Arguments
des |
A |
var_name |
Character string. Name of the variable in |
Value
A named numeric vector with elements mean, sd, and
sum_w. Returns c(mean = NA, sd = NA, sum_w = 0) when all
values are NA.
Get non-baseline factor levels
Description
Returns all factor levels except those commonly used as a baseline (e.g., "0", "No", "None", "FALSE").
Usage
levels_excl_zero(f)
Arguments
f |
A factor. |
Value
Character vector of non-baseline levels.
Process a single categorical or binary variable for Table 1
Description
Computes weighted counts, percentages, crude differences, and standardized
mean differences for each level of a categorical or binary variable.
Uses stats::xtabs to compute the full weighted cross-tabulation
once per variable, then extracts per-level results efficiently.
Usage
process_catbin_direct(
v,
dat,
levels_to_show = c("all", "non_zero"),
predefined_levels = NULL,
N_all,
N_ref,
N_tx,
idx_ref,
idx_tx,
w_vec,
grp_vec,
using_w,
n_decimal = 0,
pct_decimal = 1,
use_abs = FALSE,
calculate_existing = FALSE,
verbose = TRUE
)
Arguments
v |
Character string. Variable name. |
dat |
Data frame containing |
levels_to_show |
Character. Either |
predefined_levels |
Character vector of levels to display, or |
N_all |
Integer. Total number of patients. |
N_ref |
Integer. Number of reference-group patients. |
N_tx |
Integer. Number of exposed-group patients. |
idx_ref |
Integer vector. Row indices for the reference group. |
idx_tx |
Integer vector. Row indices for the exposed group. |
w_vec |
Numeric vector. Pre-computed weight vector for all rows. |
grp_vec |
Integer vector. Pre-computed group vector (0/1) for all rows. |
using_w |
Logical. Whether weights are active (any weight > 1). |
n_decimal |
Integer. Decimal places for counts (default 0). |
pct_decimal |
Integer. Decimal places for percentages (default 1). |
use_abs |
Logical. Use absolute values for differences (default FALSE). |
calculate_existing |
Logical. If TRUE, compute non-missing counts (default FALSE). |
verbose |
Logical. Print progress messages (default TRUE). |
Value
A matrix of list rows (one per level), or NULL if the
variable is not in dat or has no displayable levels.
Summarize propensity-score distribution by exposure group
Description
Summarize propensity-score distribution by exposure group
Usage
summarize_ps_by_group(ps_vals, group_vals, group_labels = NULL)
Arguments
ps_vals |
Numeric vector of propensity scores |
group_vals |
Vector of group indicators (0/1) |
group_labels |
Named character vector, e.g. c("0"="Control","1"="Treated"). If NULL, uses numeric group values. |
Value
data.frame with one row per group containing n, mean, SD, min, quartiles, and max of the propensity score.