The common practice in sequence analysis (SA) in social research is to consider sequence data as exact, i.e., free of measurement errors, which is a strong assumption in a domain where data are most often collected through surveys. In particular, timing errors typically result from seam effects in prospective longitudinal surveys (Engstrom and Sinibaldi 2024; Jäckle and Eckman 2020) and because of telescoping effects in retrospective surveys (Abate et al. 2022; Celhay et al. 2024).
The MCseqReplic package provides different tools, based on
Monte-Carlo (MC) simulations, for exploring how SA outcomes could vary
or resist in presence of timing errors. The MC simulation procedure is
detailed in Ritschard and Liao (2026).
Starting with a sequence object created with TraMineR (Gabadinho et al. 2011), the main function,
MCseqReplicate, replicates multiple times the sequence
dataset with randomly modified transition times. Using this series of
modified datasets—called MC-sets—additional functions can generate the
associated series of MC-dissimilarity matrices, estimate standard errors
of the observed dissimilarities, compute cluster quality indices (CQI)
for a range of partition sizes by MC-set, compute cluster comparison
indices (CCI) comparing the solution resulting from each MC-set with the
partitioning of the observed sequences, and compute correlations of MDS
scores of MC-sets with MDS scores of observed dissimilarities.
For this quick tour, we consider the following toy dataset of six sequences of length 4, of which four are unique sequences.
exdata <- read.table(text="
a a b b
a a b b
b b a a
a c c b
b b a c
b b a c
")
weights=rep(1, nrow(exdata))
s.exdata <- TraMineR::seqdef(exdata, weights = weights, id=paste("id",1:nrow(exdata), sep=""))
#> [>] 3 distinct states appear in the data:
#> 1 = a
#> 2 = b
#> 3 = c
#> [>] state coding:
#> [alphabet] [label] [long label]
#> 1 a a a
#> 2 b b b
#> 3 c c c
#> [>] sum of weights: 6 - min/max: 1/1
#> [>] 6 sequences in the data set
#> [>] min/max sequence length: 4/4Function MCseqReplicate proposes a choice of three basic
models to generate the random time changes (see
details in Ritschard and Liao 2026):
"indep" transition times are changed independently of
each others"keep.dss" time changes are constraint to respect the
DSS, i.e. the sequence of successive states."relative" after each time change, the time to the next
transition remains unchangedWe illustrate by generating only three replications, for which we can
print all replicated MC-sets. In practice, as shown in Ritschard and Liao (2026), we would need at
least 100 MC-sets to stabilize sensitivity results and it would not make
sense to print so many MC-sets. We select the default model
"keep.dss" and, by setting J\(=1\), a uniform time change distribution
among the three values \(\{-1,0,1\}\).
## 3 altered sequence datasets
set.seed(3)
(altseq.list <- MCseqReplicate(s.exdata, J=1, R=3, include.obs=TRUE))
#> [[1]]
#> Sequence
#> R1-id1 a-b-b-b
#> R1-id2 a-a-a-b
#> R1-id3 b-b-a-a
#> R1-id4 a-c-c-b
#> R1-id5 b-b-a-c
#> R1-id6 b-a-a-c
#>
#> [[2]]
#> Sequence
#> R2-id1 a-a-b-b
#> R2-id2 a-a-b-b
#> R2-id3 b-b-a-a
#> R2-id4 a-c-c-b
#> R2-id5 b-b-a-c
#> R2-id6 b-a-a-c
#>
#> [[3]]
#> Sequence
#> R3-id1 a-a-a-b
#> R3-id2 a-b-b-b
#> R3-id3 b-a-a-a
#> R3-id4 a-c-b-b
#> R3-id5 b-a-c-c
#> R3-id6 b-b-a-c
#>
#> [[4]]
#> Sequence
#> id1 a-a-b-b
#> id2 a-a-b-b
#> id3 b-b-a-a
#> id4 a-c-c-b
#> id5 b-b-a-c
#> id6 b-b-a-c
#>
#> attr(,"unique")
#> [1] FALSE
#> attr(,"obs")
#> [1] TRUEMCseqReplicate returns a list of MC-sets, the replicated
datasets, together here with the observed dataset as last element
because we have set include.obs=TRUE. Inputting this list
to MCdisslist together with instructions about which
dissimilarity measure to use, we get the list of dissimilarity
matrices.
(dist.list <- MCdisslist(altseq.list, method="LCS"))
#> [[1]]
#> R1-id1 R1-id2 R1-id3 R1-id4 R1-id5
#> R1-id2 4
#> R1-id3 4 4
#> R1-id4 4 4 6
#> R1-id5 4 6 2 4
#> R1-id6 6 4 2 4 2
#>
#> [[2]]
#> R2-id1 R2-id2 R2-id3 R2-id4 R2-id5
#> R2-id2 0
#> R2-id3 4 4
#> R2-id4 4 4 6
#> R2-id5 4 4 2 4
#> R2-id6 4 4 2 4 2
#>
#> [[3]]
#> R3-id1 R3-id2 R3-id3 R3-id4 R3-id5
#> R3-id2 4
#> R3-id3 2 6
#> R3-id4 4 2 6
#> R3-id5 6 6 4 4
#> R3-id6 6 4 4 4 2
#>
#> [[4]]
#> id1 id2 id3 id4 id5
#> id2 0
#> id3 4 4
#> id4 4 4 6
#> id5 4 4 2 4
#> id6 4 4 2 4 0
#>
#> attr(,"toref")
#> [1] FALSE
#> attr(,"obs")
#> [1] TRUEInstead of the uniform distribution of timing error at each
transition that we specified with J=1, we could specify any
distribution such as J=c(1,2,3), which gives higher chances
to delay a transition than advancing it. Using MCpj, we can
also specify a distribution complying with a priori information about
the probability of no error and the expected non-zero timing error (see
the Appendix provided as supplementary material of Ritschard and Liao (2026)). For example:
The attribute `"lambda" is the parameter value of the
underlying Poisson distribution.
Once we have the list of dissimilarity matrices, we can replicate multiple times any dissimilarity-based analysis. But let us first explore how the dissimilarities vary across MC-sets and their correlations with observed dissimilarities.
We can compute statistical summaries of the distribution of each dissimilarity across the MC-sets.
(MCdistSE <- MCseqdistSE(dist.list))
#> 6 sequences, 3 dissimilarity MC-replications
#> Dissimilarity between MC-simulated sequences , 3 dissimilarity replications
#> diss.o: Observed dissimilarities
#> id1 id2 id3 id4 id5
#> id2 0
#> id3 4 4
#> id4 4 4 6
#> id5 4 4 2 4
#> id6 4 4 2 4 0
#>
#> MC.mean: Mean of dissimilarities
#> id1 id2 id3 id4 id5
#> id2 2.666667
#> id3 3.333333 4.666667
#> id4 4.000000 3.333333 6.000000
#> id5 4.666667 5.333333 2.666667 4.000000
#> id6 5.333333 4.000000 2.666667 4.000000 2.000000
#>
#> MC.sd: Standard deviation of dissimilarities
#> id1 id2 id3 id4 id5
#> id2 2.309401
#> id3 1.154701 1.154701
#> id4 0.000000 1.154701 0.000000
#> id5 1.154701 1.154701 1.154701 0.000000
#> id6 1.154701 0.000000 1.154701 0.000000 0.000000
#>
#> MC.se: Standard eror of dissimilarities
#> id1 id2 id3 id4 id5
#> id2 3.527668
#> id3 1.333333 1.333333
#> id4 0.000000 1.333333 0.000000
#> id5 1.333333 1.763834 1.333333 0.000000
#> id6 1.763834 0.000000 1.333333 0.000000 2.000000
#>
#> MC.bias: Bias (observed minus MC.mean)
#> id1 id2 id3 id4 id5
#> id2 -2.6666667
#> id3 0.6666667 -0.6666667
#> id4 0.0000000 0.6666667 0.0000000
#> id5 -0.6666667 -1.3333333 -0.6666667 0.0000000
#> id6 -1.3333333 0.0000000 -0.6666667 0.0000000 -2.0000000MC.sd measures departure from MC.mean, and
the standard error MC.se departure from the corresponding
observed dissimilarities diss.o. Function
MCratios returns additional summaries relative to the
ratios of the observed distances over their standard errors
MCratios(MCdistSE)
#> 6 sequences
#> Dissimilarity between MC-simulated sequences
#> diss.z: Ratios diss/MC.se
#> id1 id2 id3 id4 id5
#> id2 0.000000
#> id3 3.000000 3.000000
#> id4 99.000000 3.000000 99.000000
#> id5 3.000000 2.267787 1.500000 99.000000
#> id6 2.267787 99.000000 1.500000 99.000000 0.000000
#>
#> MC.mean.z: Ratios MC.mean/mean.se
#> id1 id2 id3 id4 id5
#> id2 2
#> id3 5 7
#> id4 Inf 5 Inf
#> id5 7 8 4 Inf
#> id6 8 Inf 4 Inf Inf
#>
#> mean.se: Standard error of mean simulated dissimilarities
#> id1 id2 id3 id4 id5
#> id2 1.3333333
#> id3 0.6666667 0.6666667
#> id4 0.0000000 0.6666667 0.0000000
#> id5 0.6666667 0.6666667 0.6666667 0.0000000
#> id6 0.6666667 0.0000000 0.6666667 0.0000000 0.0000000With function MCdisscor we get the correlations
(Spearman by default) between distances in each MC-set and the observed
distances.
Here, the distances computed on the second MC-set are very strongly correlated with the distances between the observed sequences.
Now that we have gained some information about how distances can vary across MC-sets, let us look at how SA outcomes can be affected by timing errors. We successively consider group comparison, clustering, and MDS scores but we could as well analyze effects on other outcomes such as representative sequences (Gabadinho and Ritschard 2013) and the distribution of complexity indicators (Ritschard 2023), the latter being of course independent of the distances between sequences.
For example, assuming the first 3 sequences are female sequences and
the other male sequences, we replicate the sex-group comparison within
each MC-set and the observed sequences using
TraMineR::dissassoc, which returns, among others, pseudo
\(F\) and \(R^2\) values with their p-values (Studer et al. 2011). To save place, we print
the outcome for the first MC-set only.
sex <- c("f","f","f","m","m","m")
assoc.list <- lapply(dist.list,
TraMineR::dissassoc,
group=sex)
assoc.list[[1]]
#> Pseudo ANOVA table:
#> SS df MSE
#> Exp 2.666667 1 2.666667
#> Res 7.333333 4 1.833333
#> Total 10.000000 5 2.000000
#>
#> Test values (p-values based on 1000 permutation):
#> t0 p.value
#> Pseudo F 1.45454545 0.391
#> Pseudo Fbf 2.18181818 0.391
#> Pseudo R2 0.26666667 0.391
#> Bartlett 0.01327808 0.391
#> Levene 1.00000000 0.393
#>
#> Inconclusive intervals:
#> 0.00383 < 0.01 < 0.0162
#> 0.03649 < 0.05 < 0.0635
#>
#> Discrepancy per level:
#> n discrepancy
#> f 3 1.333333
#> m 3 1.111111
#> Total 6 1.666667Since we have only two groups (f and m), we
can also use function TraMineRextras::dissCompare to
compute LRT and \(\Delta\)BIC
statistics (Liao and Fasang 2020):
library(TraMineRextras) ## for function dissCompare
comp.list <- suppressMessages(lapply(dist.list,
TraMineRextras::dissCompare,
group=sex,
squared=FALSE,
s=0))
comp.list[[1]]
#> LRT p-value Delta BIC Bayes Factor
#> 1 1.86093 0.1725176 0.0691701 1.03519Alternatively, we can collect statistics (R2, LRT, BIC) and p-values
with function MCcompgrp:
comptab <- MCcompgrp(dist.list, group=sex,
dissassoc.args=list(R=1000),
dissCompare.args=list(squared=FALSE))
round(comptab,3)
#> R2 p(R2) LRT p(LRT) Delta BIC Bayes Factor
#> [1,] 0.267 0.389 1.861 0.173 0.069 1.035
#> [2,] 0.308 0.196 2.206 0.137 0.415 1.230
#> [3,] 0.312 0.291 2.248 0.134 0.456 1.256
#> [4,] 0.360 0.198 2.678 0.102 0.886 1.557Summarizing group comparison results of the three MC-sets
summary(comptab[-nrow(comptab),])
#> R2 p(R2) LRT p(LRT)
#> Min. :0.2667 Min. :0.1960 Min. :1.861 Min. :0.1338
#> 1st Qu.:0.2872 1st Qu.:0.2435 1st Qu.:2.034 1st Qu.:0.1356
#> Median :0.3077 Median :0.2910 Median :2.206 Median :0.1374
#> Mean :0.2956 Mean :0.2920 Mean :2.105 Mean :0.1479
#> 3rd Qu.:0.3101 3rd Qu.:0.3400 3rd Qu.:2.227 3rd Qu.:0.1550
#> Max. :0.3125 Max. :0.3890 Max. :2.248 Max. :0.1725
#> Delta BIC Bayes Factor
#> Min. :0.06917 Min. :1.035
#> 1st Qu.:0.24188 1st Qu.:1.133
#> Median :0.41459 Median :1.230
#> Mean :0.31339 Mean :1.174
#> 3rd Qu.:0.43550 3rd Qu.:1.243
#> Max. :0.45640 Max. :1.256We do not comment the inferential results, which don’t make much sense for this illustrative example with two groups of size 3.
Regarding clustering, a first aspect of interest is the number of
clusters. We investigate how, for Ward hierarchical clustering, this
number can vary with timing errors by means of function
MCclustqual. We use ward.D, i.e. we don’t
square the dissimilarities because LCS is not an Euclidean distance.
Here, with 4 distinct sequences, the number of clusters is at most 4.
The CQIs, computed by means of WeightedCluster (Studer 2013), of the MC-sets are stored in the
qual.tab attribute. Below, we display the second element
only of the list:
clqual <- MCclustqual(dist.list, clustmeth="ward.D", ncluster=4, verbose=FALSE)
round(clqual$qual.tab[[2]],3)
#> PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
#> cluster2 0.744 1 0.890 0.452 0.635 3.429 0.462 5.455 0.577 0
#> cluster3 0.905 1 1.000 0.750 0.833 5.000 0.769 11.500 0.885 0
#> cluster4 0.862 1 0.984 0.667 0.833 5.111 0.885 10.889 0.942 0The print method for the outcome of
MCclustqual prints, by default, the table of cluster number
\(k\) for which the statistics reach
their maximum (minimum for HC) by MC-sets and the frequency of these
numbers \(k\) across the MC-sets.
clqual
#> $qual.max
#> PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
#> MC1 4 2 4 4 4 2 4 4 4 2
#> MC2 3 2 3 3 3 4 4 3 4 2
#> MC3 3 3 3 4 4 3 4 3 4 3
#> Obs 3 2 3 4 4 4 4 4 4 2
#>
#> $max.freq
#> PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
#> 2 0 2 0 0 0 1 0 0 0 2
#> 3 2 1 2 1 1 1 0 2 0 1
#> 4 1 0 1 2 2 1 3 1 3 0Distribution of optimal number \(k\) over the three MC-sets:
clqual$max.freq
#> PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
#> 2 0 2 0 0 0 1 0 0 0 2
#> 3 2 1 2 1 1 1 0 2 0 1
#> 4 1 0 1 2 2 1 3 1 3 0Plotting the PBC by number \(k\) of clusters
From this plot, three clusters seems a good solution. So, let us now partition in three groups with PAM and retrieve the vector of cluster membership (labelled with index of cluster medoids) by MC-sets:
clust.list <- lapply(dist.list, WeightedCluster::wcKMedoids, k=3, cluster.only=TRUE)
clust.list
#> [[1]]
#> [1] 4 2 6 4 6 6
#>
#> [[2]]
#> [1] 2 2 6 4 6 6
#>
#> [[3]]
#> [1] 3 4 3 4 6 6
#>
#> [[4]]
#> [1] 2 2 6 4 6 6The clustering of the second MC-set gives the same results as of the
observed sequences. Using MCclustcomp, which uses package
aricode (Chiquet et al. 2023;
Sundqvist et al. 2022), we can compute cluster dissimilarity
indices (CDI) for comparing the solutions of the MC-sets with the
solution obtained for the observed sequences:
(res <- MCclustcomp(clust.list, AMI=TRUE))
#> [,1] [,2] [,3]
#> RI 0.8666667 1.000000 0.66666667
#> ARI 0.6590909 1.000000 0.07407407
#> MI 0.7803552 1.011404 0.54930614
#> NMI 0.7715562 1.000000 0.50000000
#> VI 0.4620981 0.000000 1.01140426
#> NVI 0.3719239 0.000000 0.64804096
#> ID 0.2310491 0.000000 0.54930614
#> NID 0.2284438 0.000000 0.50000000
#> V 0.7905694 1.000000 0.64549722
#> MARI 0.6666667 1.000000 0.06250000
#> MARIraw 0.1333333 0.200000 0.01111111
#> Frobenius 1.5000000 0.000000 2.33333333
#> AMI 0.5729972 1.000000 0.07759626The numerical representation of sequences by multidimensional
scalling (MDS) scores is often used to order sequences in index plots
and to visualize the topological organization of clusters.(Piccarreta and Lior 2010) Function
MCmdscorr computes the first MDS factor of each MC-set and,
when observed dissimilarities are provided, the correlation between the
scores of each MC-set with the scores for the observed dissimilarities.
By default the function returns the Spearman correlations:
We get the first MDS scores either by setting what="mds"
or the second element of the list returned when
what="both". We illustrate by using the scores as sorting
variable in index plots.
MCmdsboth <- MCmdscorr(dist.list, what="both")
#> Spearman correlations between 1st MDS factor of observed and replicated distances
#> Extracting diss.o from disslist
#> Computing first MDS scores of observed dissimilarities
#> Computing first MDS scores of replicated dissimilarities
#> | | | 0% | |=================================== | 50% | |======================================================================| 100%
#> Computing correlations
MCmds <- MCmdsboth[[2]]
nset <- length(MCmds)
title <- paste0("MC",1:nset)
if (attr(dist.list,"obs")) title[nset] <- "Obs"
layout(matrix(c(1:4,rep(5,4)),nrow=2,byrow=TRUE), heights = c(.75,.25))
for (i in 1:length(altseq.list)) {
seqIplot(altseq.list[[i]],sortv=MCmds[[i]], with.legend=FALSE,
ylab=NA, yaxis=FALSE, main=title[i])
}
seqlegend(altseq.list[[1]],ncol=3, cex=1.5 )There is no specific function for more than one MDS factor by MCset.
However, they can easily be obtained with lapply.
mds2.list <- lapply(dist.list, cmdscale, k=2)
mds2.list[[1]]
#> [,1] [,2]
#> R1-id1 2 0.000000e+00
#> R1-id2 2 2.229184e+00
#> R1-id3 -2 1.658581e+00
#> R1-id4 2 -2.229184e+00
#> R1-id5 -2 -1.658581e+00
#> R1-id6 -2 6.543791e-16Computing MDS scores is time consuming. Therefore, the previous computation can be very long when the number of sequences and the number of MC-sets is large. We strongly suggest using parallel computation in that case.
This vignette described the basic usage of the functions provided by
MCseqReplic. Please refer to help pages for a full list of
the options of each function. For example, MCmdscorr and
MCclustqual propose options of parallel computing to
replicate the time consuming computation of MDS scores and range of CQIs
values.
For illustration, the vignette used a small toy example dataset,
which allowed to print most of the results returned by the functions.
Real datasets are much larger and, as already mentioned, they should be
replicated at least 100 times to sufficiently stabilize sensitivity
results. As a consequence, it would not make sense to print the full
outcomes of each function. Outcomes such as the list of MC-sets and the
list of dissimilarity matrices are supposed to feed further analyses and
plots. There are print and summary methods for
the outcome of some functions (MCseqdistSE,
MCratios, MCclustqual) that print partial and
summary results.