MSCA

Background

This library implements basic tools for unsupervised learning (clustering) of instances, such as patients, that are described by multiple censored time-to-event endpoints. It was developed for situations where events have an additive impact, such as the accumulation of multiple long-term conditions in patients, rather than marking a radical change in an individual's status from one state to another, as is common in the social sciences. This vignette (in progress) briefly describes how to conduct a clustering analysis using a toy dataset.

When it is based on distances or dissimilarities between the analysed instances, an unsupervised analysis workflow follows these steps:

compute some distances between instances
use a procedure to construct a hierarchy
use a criterion to decide the number of clusters obtained from the hierarchy
use appropriate statistics and graphs to describe the defined typology
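
The four steps above can be sketched on a small example using base R only. This is a minimal illustration, not the MSCA workflow itself: the toy matrix `x`, the average linkage and the choice of `k = 2` are assumptions made for the example.

```r
# Toy binary data: 6 instances (rows) described by 5 binary indicators (columns)
x <- rbind(
  a = c(1, 1, 0, 0, 0),
  b = c(1, 1, 1, 0, 0),
  c = c(1, 0, 1, 0, 0),
  d = c(0, 0, 0, 1, 1),
  e = c(0, 0, 1, 1, 1),
  f = c(0, 0, 0, 0, 1)
)

# 1. Distances between instances ("binary" is the Jaccard distance for 0/1 data)
d <- dist(x, method = "binary")

# 2. Build a hierarchy
h <- stats::hclust(d, method = "average")

# 3. Decide on a number of clusters and cut the tree (k = 2 is chosen by hand here)
cl <- cutree(h, k = 2)

# 4. Describe the typology: cluster sizes and mean indicator profile per cluster
table(cl)
aggregate(as.data.frame(x), by = list(cluster = cl), FUN = mean)
```

In this toy example, instances a, b and c end up in one cluster and d, e and f in the other, because instances within each group share most of their indicators.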

The main purpose of the proposed tools is to compute the Jaccard distance between patients on multiple censored time-to-event indicators. As a result, patients with similar trajectories are expected to be clustered together, whereas patients with divergent health trajectories are likely to be assigned to different clusters.
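
As a reminder, the Jaccard distance between two binary profiles is one minus the number of indicators present in both, divided by the number of indicators present in at least one. The sketch below computes it by hand for two hypothetical patients, `p1` and `p2`, and cross-checks the result with base R. It only illustrates the general definition; how MSCA applies it to censored state matrices is not shown here.

```r
# Two hypothetical patients described by six binary indicators (1 = present)
p1 <- c(1, 1, 0, 0, 1, 0)
p2 <- c(1, 0, 0, 1, 1, 0)

shared  <- sum(p1 == 1 & p2 == 1)  # indicators present in both patients
present <- sum(p1 == 1 | p2 == 1)  # indicators present in at least one patient

jaccard_dist <- 1 - shared / present
jaccard_dist
#> [1] 0.5

# Cross-check with base R: dist(method = "binary") uses the same definition
as.numeric(dist(rbind(p1, p2), method = "binary"))
#> [1] 0.5
```

Indicators that are absent in both patients do not contribute to the distance, so two patients are not considered similar merely because they both lack most conditions.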

In the first section, we show how to construct censored state matrices from time-stamped records, using simulated electronic health records. In section 2, we show how to compute patient dissimilarities and derive a simple typology. In section 3, we illustrate the use of the CLARA procedure in this setting for larger sets of patients (> 15000).

From electronic health records to state matrices

Load data and compute individual patient state matrices

library(MSCA)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

data(EHR)
head(EHR)
#> # A tibble: 6 × 3
#>   link_id  reg      aos
#>   <chr>    <chr>  <dbl>
#> 1 K_610836 ltc_30  83.0
#> 2 K_739086 ltc_22  73.0
#> 3 K_661210 ltc_15  79.9
#> 4 K_866970 ltc_31  30.6
#> 5 K_270151 ltc_31  39.6
#> 6 K_243582 ltc_16  63.5
EHR %>%
  nrow()
#> [1] 4856

Our toy dataset is composed of 4856 records covering 35 long-term conditions and two absorbing states (death or censoring).

EHR %>%
  group_by( reg ) %>%
  tally
#> # A tibble: 37 × 2
#>    reg        n
#>    <chr>  <int>
#>  1 cens    1225
#>  2 death    927
#>  3 ltc_1     88
#>  4 ltc_10   171
#>  5 ltc_11    21
#>  6 ltc_12    83
#>  7 ltc_13   122
#>  8 ltc_14    11
#>  9 ltc_15    91
#> 10 ltc_16   259
#> # ℹ 27 more rows

The function make_state_matrices() is used to obtain the individual patient state matrices:

s_mat <- make_state_matrices(
  data = EHR,
  id = "link_id",
  ltc = "reg",
  aos = "aos",
  l = 111,
  fail_code = "death",
  cens_code = "cens"
)
dim( s_mat )
#> [1] 4144 2152
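
To make the idea of a state matrix more concrete, the sketch below builds one by hand for a single hypothetical patient. It is a simplified illustration in base R, not the implementation used by make_state_matrices(): the helper `toy_state_matrix()`, the toy records, and the conventions that a condition is coded 1 from its age of onset onwards and NA after the end of follow-up are all assumptions made for this example.

```r
# Hypothetical records for one patient: event code and age at onset (years)
rec <- data.frame(
  reg = c("ltc_1", "ltc_2", "death"),
  aos = c(2.4, 4.0, 5.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per condition, one column per year of age 0..l.
# Assumptions: a condition is coded 1 from its onset onwards, and every age
# after the end of follow-up (death or censoring) is coded NA.
toy_state_matrix <- function(rec, conditions, l, end_codes = c("death", "cens")) {
  ages <- 0:l
  m <- matrix(0, nrow = length(conditions), ncol = length(ages),
              dimnames = list(conditions, as.character(ages)))
  for (i in seq_len(nrow(rec))) {
    if (rec$reg[i] %in% conditions) {
      m[rec$reg[i], ages >= rec$aos[i]] <- 1
    }
  }
  end_age <- rec$aos[rec$reg %in% end_codes]
  if (length(end_age) == 1) m[, ages > end_age] <- NA
  m
}

m <- toy_state_matrix(rec, conditions = c("ltc_1", "ltc_2", "ltc_3"), l = 7)
m
```

In the output of dim( s_mat ) above, the 2152 columns equal the 1225 censoring plus 927 death records, and 4144 = 37 × 112. This would be consistent with one column per patient and one row per combination of state and time point (ages 0 to l = 111), but that reading is inferred from the numbers and should be checked against the package documentation.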

Compute the Jaccard distance between patients

The fast_jaccard_dist() function, used below, speeds up the computation of the Jaccard distance between patients.

library( cluster )
library( fastcluster )
#> 
#> Attaching package: 'fastcluster'
#> The following object is masked from 'package:stats':
#> 
#>     hclust
# Compute the Jaccard distance
d_mat <- fast_jaccard_dist( s_mat , as.dist = TRUE )

# Build a hierarchical clustering with hclust() (the fastcluster version, which masks stats::hclust)
h_mat <- hclust(d = d_mat , method = 'ward.D2' )
h_mat
#> 
#> Call:
#> hclust(d = d_mat, method = "ward.D2")
#> 
#> Cluster method   : ward.D2 
#> Number of objects: 2152

# Get a typology

ct_mat_8 <- cutree( h_mat , k = 8 )
table( ct_mat_8 )
#> ct_mat_8
#>    1    2    3    4    5    6    7    8 
#> 1172  116  242  129  157   91  136  109
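
The vignette does not state which criterion led to k = 8. One commonly used option is the average silhouette width, available through silhouette() in the cluster package. The sketch below applies it to simulated binary profiles; the toy data, the candidate range `ks` and the use of base R's dist() are assumptions made for the illustration, and the same loop could be run on `h_mat` and `d_mat`.

```r
library(cluster)  # provides silhouette()

# Toy data: two hypothetical groups of 30 binary profiles with opposite patterns
set.seed(42)
n <- 30
grp_a <- cbind(matrix(rbinom(n * 4, 1, 0.9), n), matrix(rbinom(n * 4, 1, 0.1), n))
grp_b <- cbind(matrix(rbinom(n * 4, 1, 0.1), n), matrix(rbinom(n * 4, 1, 0.9), n))
x <- rbind(grp_a, grp_b)

# Jaccard distance and Ward hierarchy, as in the main example
d <- dist(x, method = "binary")
h <- stats::hclust(d, method = "ward.D2")

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(h, k = k), d)
  mean(sil[, "sil_width"])
})

data.frame(k = ks, avg_sil = round(avg_sil, 3))

# Candidate with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
best_k
```

A higher average silhouette width indicates more compact and better separated clusters. In practice, the criterion is best combined with an inspection of the dendrogram and with the interpretability of the resulting typology.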

Analyse clusters and get sequence statistics

Once a typology has been defined, it is useful to obtain basic sequence statistics by cluster. This requires a little data manipulation:

# Get a data frame with patient id and cluster assignment
df1 <- data.frame( link_id = names(ct_mat_8) , cl = paste0('cl_',ct_mat_8)) 
head(df1)  
#>    link_id   cl
#> 1  K_10030 cl_1
#> 2 K_101275 cl_1
#> 3  K_10227 cl_1
#> 4 K_102385 cl_1
#> 5 K_102612 cl_1
#> 6 K_103518 cl_1

# Merge with primary data
EHR_cl <- EHR %>%
  left_join( df1 )
#> Joining with `by = join_by(link_id)`

# Get event sequences by cluster
dt_seq <- get_cluster_sequences(
  dt =  EHR_cl ,
  cl_col = "cl",
  id_col = "link_id",
  event_col = "reg",
  k = 2
)

# Get basic stats by cluster
sequence_stats(
  seq_data = dt_seq ,
  min_seq_freq = 0.03,
  min_conditional_prob = 0,
  min_relative_risk = 0
)
#> [[1]]
#> # A tibble: 1 × 8
#>   from  to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr> <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_6 death        54    0.397            0.761          1.19     59.9   80.2
#> 
#> [[2]]
#> # A tibble: 1 × 8
#>   from   to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr>  <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_22 death       109    0.450            0.813          1.16     61.9   84.0
#> 
#> [[3]]
#> # A tibble: 3 × 8
#>   from   to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr>  <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_31 death        39   0.302            0.609           1.29     45.6   72.7
#> 2 ltc_7  ltc_…         5   0.0388           0.833           3.45     45.6   45.6
#> 3 ltc_31 ltc_…         4   0.0310           0.0625          1.45     45.6   56.9
#> 
#> [[4]]
#> # A tibble: 1 × 8
#>   from   to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr>  <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_16 death        64    0.552            0.853          1.19     73.6   86.3
#> 
#> [[5]]
#> # A tibble: 7 × 8
#>   from   to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr>  <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_10 death        76   0.0648            0.731          1.07     67.3   82.0
#> 2 ltc_31 death        48   0.0410            0.762          1.12     71.0   82.0
#> 3 ltc_13 death        41   0.0350            0.683          1.00     61.6   82.0
#> 4 ltc_18 death        41   0.0350            0.804          1.18     71.8   82.0
#> 5 ltc_15 death        40   0.0341            0.784          1.15     73.0   82.0
#> 6 ltc_23 death        40   0.0341            0.741          1.09     61.6   82.0
#> 7 ltc_1  death        37   0.0316            0.74           1.08     57.1   82.0
#> 
#> [[6]]
#> # A tibble: 2 × 8
#>   from   to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr>  <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_16 death        26   0.239            0.591           1.27     51.0   70.0
#> 2 ltc_16 ltc_…         4   0.0367           0.0909          1.61     51.0   65.4
#> 
#> [[7]]
#> # A tibble: 1 × 8
#>   from  to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr> <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_7 death        64    0.408            0.703          1.21     64.1   81.5
#> 
#> [[8]]
#> # A tibble: 4 × 8
#>   from   to    seq_count seq_freq conditional_prob relative_risk med.from med.to
#>   <chr>  <chr>     <int>    <dbl>            <dbl>         <dbl>    <dbl>  <dbl>
#> 1 ltc_18 death        28   0.308            0.596          1.29      53.1   69.4
#> 2 ltc_18 ltc_…         5   0.0549           0.106          1.05      53.1   58.6
#> 3 ltc_18 ltc_7         4   0.0440           0.0851         0.757     53.1   55.9
#> 4 ltc_18 ltc_6         3   0.0330           0.0638         1.42      53.1   56