Assessing Effects of Timing Errors on Sequence Analaysis Outcomes with MCseqReplic

A Quick Tour

For this quick tour, we consider the following toy dataset of six sequences of length 4, of which four are unique sequences.

exdata <- read.table(text="
                a a b b
                a a b b
                b b a a
                a c c b
                b b a c
                b b a c
                ")
weights=rep(1, nrow(exdata))
s.exdata <- TraMineR::seqdef(exdata, weights = weights, id=paste("id",1:nrow(exdata), sep=""))
#>  [>] 3 distinct states appear in the data:
#>      1 = a
#>      2 = b
#>      3 = c
#>  [>] state coding:
#>        [alphabet]  [label]  [long label]
#>      1  a           a        a
#>      2  b           b        b
#>      3  c           c        c
#>  [>] sum of weights: 6 - min/max: 1/1
#>  [>] 6 sequences in the data set
#>  [>] min/max sequence length: 4/4

Function MCseqReplicate proposes a choice of three basic models to generate the random time changes (see details in Ritschard and Liao 2026):

"indep" transition times are changed independently of each others
"keep.dss" time changes are constraint to respect the DSS, i.e. the sequence of successive states.
"relative" after each time change, the time to the next transition remains unchanged

We illustrate by generating only three replications, for which we can print all replicated MC-sets. In practice, as shown in Ritschard and Liao (2026), we would need at least 100 MC-sets to stabilize sensitivity results and it would not make sense to print so many MC-sets. We select the default model "keep.dss" and, by setting J\(=1\), a uniform time change distribution among the three values \(\{-1,0,1\}\).

## 3 altered sequence datasets
set.seed(3)
(altseq.list <- MCseqReplicate(s.exdata, J=1, R=3, include.obs=TRUE))
#> [[1]]
#>        Sequence
#> R1-id1 a-b-b-b 
#> R1-id2 a-a-a-b 
#> R1-id3 b-b-a-a 
#> R1-id4 a-c-c-b 
#> R1-id5 b-b-a-c 
#> R1-id6 b-a-a-c 
#> 
#> [[2]]
#>        Sequence
#> R2-id1 a-a-b-b 
#> R2-id2 a-a-b-b 
#> R2-id3 b-b-a-a 
#> R2-id4 a-c-c-b 
#> R2-id5 b-b-a-c 
#> R2-id6 b-a-a-c 
#> 
#> [[3]]
#>        Sequence
#> R3-id1 a-a-a-b 
#> R3-id2 a-b-b-b 
#> R3-id3 b-a-a-a 
#> R3-id4 a-c-b-b 
#> R3-id5 b-a-c-c 
#> R3-id6 b-b-a-c 
#> 
#> [[4]]
#>     Sequence
#> id1 a-a-b-b 
#> id2 a-a-b-b 
#> id3 b-b-a-a 
#> id4 a-c-c-b 
#> id5 b-b-a-c 
#> id6 b-b-a-c 
#> 
#> attr(,"unique")
#> [1] FALSE
#> attr(,"obs")
#> [1] TRUE

MCseqReplicate returns a list of MC-sets, the replicated datasets, together here with the observed dataset as last element because we have set include.obs=TRUE. Inputting this list to MCdisslist together with instructions about which dissimilarity measure to use, we get the list of dissimilarity matrices.

(dist.list <- MCdisslist(altseq.list, method="LCS"))
#> [[1]]
#>        R1-id1 R1-id2 R1-id3 R1-id4 R1-id5
#> R1-id2      4                            
#> R1-id3      4      4                     
#> R1-id4      4      4      6              
#> R1-id5      4      6      2      4       
#> R1-id6      6      4      2      4      2
#> 
#> [[2]]
#>        R2-id1 R2-id2 R2-id3 R2-id4 R2-id5
#> R2-id2      0                            
#> R2-id3      4      4                     
#> R2-id4      4      4      6              
#> R2-id5      4      4      2      4       
#> R2-id6      4      4      2      4      2
#> 
#> [[3]]
#>        R3-id1 R3-id2 R3-id3 R3-id4 R3-id5
#> R3-id2      4                            
#> R3-id3      2      6                     
#> R3-id4      4      2      6              
#> R3-id5      6      6      4      4       
#> R3-id6      6      4      4      4      2
#> 
#> [[4]]
#>     id1 id2 id3 id4 id5
#> id2   0                
#> id3   4   4            
#> id4   4   4   6        
#> id5   4   4   2   4    
#> id6   4   4   2   4   0
#> 
#> attr(,"toref")
#> [1] FALSE
#> attr(,"obs")
#> [1] TRUE

Instead of the uniform distribution of timing error at each transition that we specified with J=1, we could specify any distribution such as J=c(1,2,3), which gives higher chances to delay a transition than advancing it. Using MCpj, we can also specify a distribution complying with a priori information about the probability of no error and the expected non-zero timing error (see the Appendix provided as supplementary material of Ritschard and Liao (2026)). For example:

MCpj(Emean=1.05, pzero=.5)
#> [1] 0.25 0.50 0.25
#> attr(,"lambda")
#> [1] 0.09838649

The attribute `"lambda" is the parameter value of the underlying Poisson distribution.

Once we have the list of dissimilarity matrices, we can replicate multiple times any dissimilarity-based analysis. But let us first explore how the dissimilarities vary across MC-sets and their correlations with observed dissimilarities.

Exploring the MC-simulated dissimilraties

We can compute statistical summaries of the distribution of each dissimilarity across the MC-sets.


(MCdistSE <- MCseqdistSE(dist.list))
#> 6  sequences, 3  dissimilarity MC-replications
#> Dissimilarity between MC-simulated sequences , 3  dissimilarity replications
#>  diss.o: Observed dissimilarities
#>     id1 id2 id3 id4 id5
#> id2   0                
#> id3   4   4            
#> id4   4   4   6        
#> id5   4   4   2   4    
#> id6   4   4   2   4   0
#> 
#>  MC.mean: Mean of dissimilarities 
#>          id1      id2      id3      id4      id5
#> id2 2.666667                                    
#> id3 3.333333 4.666667                           
#> id4 4.000000 3.333333 6.000000                  
#> id5 4.666667 5.333333 2.666667 4.000000         
#> id6 5.333333 4.000000 2.666667 4.000000 2.000000
#> 
#>  MC.sd: Standard deviation of dissimilarities
#>          id1      id2      id3      id4      id5
#> id2 2.309401                                    
#> id3 1.154701 1.154701                           
#> id4 0.000000 1.154701 0.000000                  
#> id5 1.154701 1.154701 1.154701 0.000000         
#> id6 1.154701 0.000000 1.154701 0.000000 0.000000
#> 
#>  MC.se: Standard eror of dissimilarities
#>          id1      id2      id3      id4      id5
#> id2 3.527668                                    
#> id3 1.333333 1.333333                           
#> id4 0.000000 1.333333 0.000000                  
#> id5 1.333333 1.763834 1.333333 0.000000         
#> id6 1.763834 0.000000 1.333333 0.000000 2.000000
#> 
#>  MC.bias: Bias (observed minus MC.mean) 
#>            id1        id2        id3        id4        id5
#> id2 -2.6666667                                            
#> id3  0.6666667 -0.6666667                                 
#> id4  0.0000000  0.6666667  0.0000000                      
#> id5 -0.6666667 -1.3333333 -0.6666667  0.0000000           
#> id6 -1.3333333  0.0000000 -0.6666667  0.0000000 -2.0000000

MC.sd measures departure from MC.mean, and the standard error MC.se departure from the corresponding observed dissimilarities diss.o. Function MCratios returns additional summaries relative to the ratios of the observed distances over their standard errors

MCratios(MCdistSE)
#> 6  sequences
#> Dissimilarity between MC-simulated sequences
#>  diss.z: Ratios diss/MC.se 
#>           id1       id2       id3       id4       id5
#> id2  0.000000                                        
#> id3  3.000000  3.000000                              
#> id4 99.000000  3.000000 99.000000                    
#> id5  3.000000  2.267787  1.500000 99.000000          
#> id6  2.267787 99.000000  1.500000 99.000000  0.000000
#> 
#>  MC.mean.z: Ratios MC.mean/mean.se 
#>     id1 id2 id3 id4 id5
#> id2   2                
#> id3   5   7            
#> id4 Inf   5 Inf        
#> id5   7   8   4 Inf    
#> id6   8 Inf   4 Inf Inf
#> 
#>  mean.se: Standard error of mean simulated dissimilarities 
#>           id1       id2       id3       id4       id5
#> id2 1.3333333                                        
#> id3 0.6666667 0.6666667                              
#> id4 0.0000000 0.6666667 0.0000000                    
#> id5 0.6666667 0.6666667 0.6666667 0.0000000          
#> id6 0.6666667 0.0000000 0.6666667 0.0000000 0.0000000

With function MCdisscor we get the correlations (Spearman by default) between distances in each MC-set and the observed distances.

MCdisscorr(dist.list)
#> Extracting diss.o from disslist
#> [1] 0.7426484 0.9923502 0.4397608

Here, the distances computed on the second MC-set are very strongly correlated with the distances between the observed sequences.

Now that we have gained some information about how distances can vary across MC-sets, let us look at how SA outcomes can be affected by timing errors. We successively consider group comparison, clustering, and MDS scores but we could as well analyze effects on other outcomes such as representative sequences (Gabadinho and Ritschard 2013) and the distribution of complexity indicators (Ritschard 2023), the latter being of course independent of the distances between sequences.

Group comparison

For example, assuming the first 3 sequences are female sequences and the other male sequences, we replicate the sex-group comparison within each MC-set and the observed sequences using TraMineR::dissassoc, which returns, among others, pseudo \(F\) and \(R^2\) values with their p-values (Studer et al. 2011). To save place, we print the outcome for the first MC-set only.

sex <- c("f","f","f","m","m","m")
assoc.list <- lapply(dist.list, 
                     TraMineR::dissassoc, 
                     group=sex)
assoc.list[[1]]
#> Pseudo ANOVA table:
#>              SS df      MSE
#> Exp    2.666667  1 2.666667
#> Res    7.333333  4 1.833333
#> Total 10.000000  5 2.000000
#> 
#> Test values  (p-values based on 1000 permutation):
#>                    t0 p.value
#> Pseudo F   1.45454545   0.391
#> Pseudo Fbf 2.18181818   0.391
#> Pseudo R2  0.26666667   0.391
#> Bartlett   0.01327808   0.391
#> Levene     1.00000000   0.393
#> 
#> Inconclusive intervals: 
#> 0.00383  <  0.01  <  0.0162
#> 0.03649  <  0.05  <  0.0635 
#> 
#> Discrepancy per level:
#>       n discrepancy
#> f     3    1.333333
#> m     3    1.111111
#> Total 6    1.666667

Since we have only two groups (f and m), we can also use function TraMineRextras::dissCompare to compute LRT and \(\Delta\)BIC statistics (Liao and Fasang 2020):

library(TraMineRextras) ## for function dissCompare
comp.list <- suppressMessages(lapply(dist.list, 
                                     TraMineRextras::dissCompare, 
                                     group=sex, 
                                     squared=FALSE, 
                                     s=0))
comp.list[[1]]
#>       LRT   p-value Delta BIC Bayes Factor
#> 1 1.86093 0.1725176 0.0691701      1.03519

Alternatively, we can collect statistics (R2, LRT, BIC) and p-values with function MCcompgrp:

comptab <- MCcompgrp(dist.list, group=sex, 
                     dissassoc.args=list(R=1000),
                     dissCompare.args=list(squared=FALSE))
round(comptab,3)
#>         R2 p(R2)   LRT p(LRT) Delta BIC Bayes Factor
#> [1,] 0.267 0.389 1.861  0.173     0.069        1.035
#> [2,] 0.308 0.196 2.206  0.137     0.415        1.230
#> [3,] 0.312 0.291 2.248  0.134     0.456        1.256
#> [4,] 0.360 0.198 2.678  0.102     0.886        1.557

Summarizing group comparison results of the three MC-sets

summary(comptab[-nrow(comptab),])
#>        R2             p(R2)             LRT            p(LRT)      
#>  Min.   :0.2667   Min.   :0.1960   Min.   :1.861   Min.   :0.1338  
#>  1st Qu.:0.2872   1st Qu.:0.2435   1st Qu.:2.034   1st Qu.:0.1356  
#>  Median :0.3077   Median :0.2910   Median :2.206   Median :0.1374  
#>  Mean   :0.2956   Mean   :0.2920   Mean   :2.105   Mean   :0.1479  
#>  3rd Qu.:0.3101   3rd Qu.:0.3400   3rd Qu.:2.227   3rd Qu.:0.1550  
#>  Max.   :0.3125   Max.   :0.3890   Max.   :2.248   Max.   :0.1725  
#>    Delta BIC        Bayes Factor  
#>  Min.   :0.06917   Min.   :1.035  
#>  1st Qu.:0.24188   1st Qu.:1.133  
#>  Median :0.41459   Median :1.230  
#>  Mean   :0.31339   Mean   :1.174  
#>  3rd Qu.:0.43550   3rd Qu.:1.243  
#>  Max.   :0.45640   Max.   :1.256

We do not comment the inferential results, which don’t make much sense for this illustrative example with two groups of size 3.

Clustering

Regarding clustering, a first aspect of interest is the number of clusters. We investigate how, for Ward hierarchical clustering, this number can vary with timing errors by means of function MCclustqual. We use ward.D, i.e. we don’t square the dissimilarities because LCS is not an Euclidean distance. Here, with 4 distinct sequences, the number of clusters is at most 4. The CQIs, computed by means of WeightedCluster (Studer 2013), of the MC-sets are stored in the qual.tab attribute. Below, we display the second element only of the list:

clqual <- MCclustqual(dist.list, clustmeth="ward.D", ncluster=4, verbose=FALSE)
round(clqual$qual.tab[[2]],3)
#>            PBC HG  HGSD   ASW  ASWw    CH    R2   CHsq  R2sq HC
#> cluster2 0.744  1 0.890 0.452 0.635 3.429 0.462  5.455 0.577  0
#> cluster3 0.905  1 1.000 0.750 0.833 5.000 0.769 11.500 0.885  0
#> cluster4 0.862  1 0.984 0.667 0.833 5.111 0.885 10.889 0.942  0

The print method for the outcome of MCclustqual prints, by default, the table of cluster number \(k\) for which the statistics reach their maximum (minimum for HC) by MC-sets and the frequency of these numbers \(k\) across the MC-sets.

clqual
#> $qual.max
#>     PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
#> MC1   4  2    4   4    4  2  4    4    4  2
#> MC2   3  2    3   3    3  4  4    3    4  2
#> MC3   3  3    3   4    4  3  4    3    4  3
#> Obs   3  2    3   4    4  4  4    4    4  2
#>  
#> $max.freq
#>   PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
#> 2   0  2    0   0    0  1  0    0    0  2
#> 3   2  1    2   1    1  1  0    2    0  1
#> 4   1  0    1   2    2  1  3    1    3  0

Distribution of optimal number \(k\) over the three MC-sets:

clqual$max.freq
#>   PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
#> 2   0  2    0   0    0  1  0    0    0  2
#> 3   2  1    2   1    1  1  0    2    0  1
#> 4   1  0    1   2    2  1  3    1    3  0

Plotting the PBC by number \(k\) of clusters

ggplotMCcqi(clqual, cqi="PBC")

From this plot, three clusters seems a good solution. So, let us now partition in three groups with PAM and retrieve the vector of cluster membership (labelled with index of cluster medoids) by MC-sets:

clust.list <- lapply(dist.list, WeightedCluster::wcKMedoids, k=3, cluster.only=TRUE)
clust.list
#> [[1]]
#> [1] 4 2 6 4 6 6
#> 
#> [[2]]
#> [1] 2 2 6 4 6 6
#> 
#> [[3]]
#> [1] 3 4 3 4 6 6
#> 
#> [[4]]
#> [1] 2 2 6 4 6 6

The clustering of the second MC-set gives the same results as of the observed sequences. Using MCclustcomp, which uses package aricode (Chiquet et al. 2023; Sundqvist et al. 2022), we can compute cluster dissimilarity indices (CDI) for comparing the solutions of the MC-sets with the solution obtained for the observed sequences:

(res <- MCclustcomp(clust.list, AMI=TRUE))
#>                [,1]     [,2]       [,3]
#> RI        0.8666667 1.000000 0.66666667
#> ARI       0.6590909 1.000000 0.07407407
#> MI        0.7803552 1.011404 0.54930614
#> NMI       0.7715562 1.000000 0.50000000
#> VI        0.4620981 0.000000 1.01140426
#> NVI       0.3719239 0.000000 0.64804096
#> ID        0.2310491 0.000000 0.54930614
#> NID       0.2284438 0.000000 0.50000000
#> V         0.7905694 1.000000 0.64549722
#> MARI      0.6666667 1.000000 0.06250000
#> MARIraw   0.1333333 0.200000 0.01111111
#> Frobenius 1.5000000 0.000000 2.33333333
#> AMI       0.5729972 1.000000 0.07759626

Assessing Effects of Timing Errors on Sequence Analaysis Outcomes with MCseqReplic

Gilbert Ritschard

July 1, 2026

Introduction

A Quick Tour

Exploring the MC-simulated dissimilraties

Group comparison

Clustering

MDS scores

Closing remarks

References