---
title: "FrailtyCompRisk"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{FrailtyCompRisk}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(FrailtyCompRisk)
```

# Package FrailtyCompRisk

## Overview

The package **FrailtyCompRisk** is originally designed for statisticians
and epidemiologists in order to analyze time-to-event data where
individuals are nested within centers (e.g., hospitals or clinics), and
where multiple causes of failure may occur.

The package is build upon the random-effects model for the
subdistribution hazard presented in "Analysing multicentre competing
risks data with a mixed proportional hazards model for the
subdistribution" [Katsahian S, Resche-Rigon M, Chevret S, Porcher R.].

## The Model:

Consider $N$ observations of a time-to-event variable $T$, a cause of
event $\epsilon$, and a covariate $Z$:
$(T_i, \epsilon_i, Z_i)_{i=1,\dots,N}$. Let $\epsilon = 1$ denote the
cause of failure of interest.

The cumulative incidence function (CIF) for cause 1 is defined as: $$
F_1(t)=P(T \le t,\epsilon = 1)
$$ Fine and Gray proposed a proportional hazards model for the
subdistribution hazard associated with $F_1(t)$, defined as: $$
\alpha_1(t) = \frac{\frac{dF_1(t)}{dt}}{1 - F_1(t)}
$$ The subdistribution hazard for individual $i$ is modeled as:

$$
\alpha_{1}(t | Z_i, u) = \alpha_{1,0}(t) e^{(Z_i^\top \beta)}
$$ where:

-   $\alpha_{1,0}(t)$ is an unspecified baseline subdistribution hazard,

-   $\beta$ is the vector of covariate effects.

This model assumes that all individuals are independent. To account for
clustering (e.g., patients within centers), a random effect is added:

Let $k(i) \in {1, \dots, K}$ be the cluster (center) to which subject
$i$ belongs. Then the subdistribution hazard becomes:

$$
\alpha_{1,k}(t | Z_i, u) = \alpha_{1,0}(t) e^{(Z_i^\top \beta \ + \ u_{k(i)})}
$$ where $u_k \sim \mathcal{N}(0, \theta)$ are independent Gaussian
random effects for each cluster $k$, with variance $\theta$.

## Method

The package uses a restricted maximum likelihood (REML) or maximum
likelihood (ML) approach to estimate the parameters $\beta$, $\theta$,
and $u_k$. The estimation relies on a Newton-Raphson optimization
routine applied to the (restricted) log-likelihood.

## Features

The package provides several main functions, $\\$
`Reml_CompRisk_frailty()` Estimates $\beta$, $\theta$, and $u_k$ under
the subdistribution hazard model with random effects. Also returns the
$p$-value for testing $H_0: \theta = 0$.

`Reml_CompRisk_frailty()` can take 5 arguments:

1.  `data` (mandatory), a data frame with the following columns:

    -   `times` : Observed times to event or censoring.

    -   `status` : Event type: 0 = censored, 1 = cause of interest, \>1
        = other competing risks.

    -   `clusters` : Cluster membership (e.g., centers or subjects).

    -   `covariates` : (optional) Columns from the 4th onward are used
        as covariates.

2.  `cluster_censoring` (optional): Logical. If TRUE, accounts for
    center-specific censoring using a frailty model on censoring times.
    Default is FALSE.

3.  `max_iter` (optional): Maximum number of iterations for the
    Newton-Raphson algorithm (default = 300).

4.  `tol` (optional): Convergence tolerance for the likelihood and
    frailty variance (default = 1e-6).

5.  `threshold` (optional): Lower bound for the frailty variance
    parameter $\theta$. If the estimated value falls below this
    threshold, frailty is considered negligible (default = 1e-5).

$\\$

`Ml_CompRisk()` Estimates $\beta$ for the subdistribution hazard model
without random effects.

`Ml_CompRisk()` can take 3 arguments:

1.  `data` (mandatory), a data frame with the following columns:

    -   `times` : Observed times to event or censoring.

    -   `status` : Event type: 0 = censored, 1 = cause of interest, \>1
        = other competing risks.

    -   `covariates` : (optional) Columns from the 3th onward are used
        as covariates.

2.  `max_iter` (optional): Maximum number of iterations for the
    Newton-Raphson algorithm (default = 100).

3.  `tol` (optional): Convergence tolerance for the likelihood and
    frailty variance (default = 1e-6).

$\\$

`Reml_Cox_frailty()` Estimates $\beta$, $\theta$, and $u_k$ under a Cox
model with random effects, and provides a $p_{value}$ for testing
$\theta = 0$.

`Reml_Cox_frailty()` can take 4 arguments:

1.  `data` (mandatory), a data frame with the following columns:

    -   `times` : Observed times to event or censoring.

    -   `status` : Event type: 0 = censored, 1 = cause of interest

    -   `clusters` : Cluster membership (e.g., centers or subjects).

    -   `covariates` : (optional) Columns from the 4th onward are used
        as covariates.

2.  `max_iter` (optional): Maximum number of iterations for the
    Newton-Raphson algorithm (default = 300).

3.  `tol` (optional): Convergence tolerance for the likelihood and
    frailty variance (default = 1e-6).

4.  `threshold` (optional): Lower bound for the frailty variance
    parameter $\theta$. If the estimated value falls below this
    threshold, frailty is considered negligible (default = 1e-5).

$\\$

`Ml_Cox()` Estimates $\beta$ under a standard Cox proportional hazards
model.

`Ml_Cox()` can take 3 arguments:

1.  `data` (mandatory), a data frame with the following columns:

    -   `times` : Observed times to event or censoring.

    -   `status` : Event type: 0 = censored, 1 = cause of interest

    -   `covariates` : (optional) Columns from the 3th onward are used
        as covariates.

2.  `max_iter` (optional): Maximum number of iterations for the
    Newton-Raphson algorithm (default = 100).

3.  `tol` (optional): Convergence tolerance for the likelihood and
    frailty variance (default = 1e-6).

`Parameters_estimation()` Encapsulates the four previous methods into
one unified function.

`Parameters_estimation()` can take 6 arguments:

1.  `data` (mandatory), a data frame with the following columns:

    -   `times` : Observed times to event or censoring.

    -   `status` : Event type: 0 = censored, 1 = cause of interest, \>1
        = other competing risks.

    -   `clusters` : Cluster membership (e.g., centers or subjects).

    -   `covariates` : (optional) Columns from the 4th onward are used
        as covariates.

2.  `method` (optional):

    -   "CompRisk_frailty" (default): Competing risks model for cause 1
        with shared frailty.
    -   "CompRisk": Competing risks model for cause 1 without frailty.
    -   "Cox_frailty": Cox proportional hazards model with shared
        frailty.
    -   "Cox": Standard Cox proportional hazards model.

3.  `cluster_censoring` (optional): Logical. If TRUE, accounts for
    center-specific censoring using a frailty model on censoring times.
    Default is FALSE.

4.  `max_iter` (optional): Maximum number of iterations for the
    Newton-Raphson algorithm (default = 100 or 300, depends on the
    method chosen).

5.  `tol` (optional): Convergence tolerance for the likelihood and
    frailty variance (default = 1e-6).

6.  `threshold` (optional): Lower bound for the frailty variance
    parameter $\theta$. If the estimated value falls below this
    threshold, frailty is considered negligible (default = 1e-5).

$\\$

`simulate_data()` Simulates clustered time-to-event data with competing
risks.

`simulate_data()` can take 8 arguments:

1.  `G` : a vector of group or cluster identifiers (length N, where N is
    the size sample). Each value indicates which cluster the individual
    belongs to.

2.  `Z` : a matrix of covariates (dimensions N x p), where p is the size
    of the vector describing an individual). Can be set to Matrix(0,0,N)
    if no covariates are used.

3.  `prop` : proportion of individuals susceptible to cause 1 (default:
    0.6).

4.  `beta` : a numeric vector of regression coefficients, one per
    covariate (p).

5.  `theta` : variance of the shared frailty term for event times (cause
    1).

6.  `cens` : logical, indicating whether censoring should be simulated
    (default: FALSE).

7.  `pcens` : target censoring proportion (default: 0.25).

8.  `tau` : variance of the shared frailty term for censoring times
    (default: 0). $\\$

`check_data_format()` Verifies input data structure and formatting.

`check_data_format()` can only take 1 argument, a data frame. $\\$

## Examples

```{r}
set.seed(123)

n_cov = 2
n_per_cluster = 15
n_cluster = 20
n = n_cluster * n_per_cluster

G = rep(1:n_cluster, each = n_per_cluster)
Z = matrix(rnorm(n*n_cov,0,1),ncol = n_cov)

df = simulate_data(G,Z,prop = 0.6,beta = c(1,1.2),theta = 0.6,cens = TRUE)

 #Estimate using REML with frailty

res <- Reml_CompRisk_frailty(df)

 #Print estimated coefficients

res$beta
res$theta
```

## Computational Complexity

The algorithm involves computing weighted matrices and solving penalized
systems in each iteration. For $N$ individuals, $K$ clusters and $p$
covariables describing each individual:

-   Sparse matrix multiplication and integration steps: \~
    $\mathcal{O}(N^2)$

-   Newton-Raphson updates (inversion of Fisher information matrix): \~
    $\mathcal{O}((p + K)^3)$ in the worst case

In total, the complexity of the algorithm is
\~$\mathcal{O}(N^2 + Np + NK + (p + K)^3)$ in the worst case.

However, the implementation uses sparse matrices (`Matrix` package) to
significantly reduce practical computational cost.

## References

The `Matrix` package

Katsahian S, Resche-Rigon M, Chevret S, Porcher R. (2006). *Analysing
multicentre competing risks data with a mixed proportional hazards model
for the subdistribution.* *Statistics in Medicine*, 25(26), 4267–4278.