Package {statMatchLCM}


Type: Package
Title: Statistical Matching Using Latent Class Models
Version: 1.2
Description: Tools for statistical matching based on latent class models. The package supports data integration when no unique identifiers are available by modeling the joint distribution of variables through latent categorical structures. It covers estimation of latent class models, probabilistic matching between donor and recipient data sets, and generation of synthetic linked data under uncertainty. It is particularly useful in survey research and data fusion applications that combine information from multiple sources while preserving statistical properties and accounting for measurement error and missing data mechanisms.
License: GPL-3
Encoding: UTF-8
RoxygenNote: 7.3.3
Imports: nnet, StatMatch
Suggests: NPBayesImputeCat
Depends: R (≥ 4.1.0)
LazyData: true
NeedsCompilation: no
Packaged: 2026-05-11 17:11:33 UTC; woali
Author: Alicja Wolny-Dominiak [aut, cre], Israa Lewaaelhamd [aut], Mohammed Ali Ismail [aut]
Maintainer: Alicja Wolny-Dominiak <woali@ue.katowice.pl>
Repository: CRAN
Date/Publication: 2026-05-15 20:20:08 UTC

Dataset datA

Description

A simple dataset with categorical variables.

Usage

datA

Format

A data frame with 20 observations and 2 variables:

X

Category (e.g., "F", "M")

Y1

Color category (e.g., "blue")

Source

Simulated data


Create stacked A/B dataset for MDC

Description

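Stacks data sets A and B into one joint data set in which the variable not observed in a source (Z1 for A, Y1 for B) is missing. This joint data set is required for mass data combination (MDC).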

Usage

datAB_to_SM(datA, datB)

Arguments

datA

data.frame A

datB

data.frame B

Value

data.frame with harmonized structure

Examples

data(datA)
data(datB)
datAB_to_SM(datA, datB)
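
The stacked layout can be illustrated with a small base R sketch. It assumes that rows from A receive NA in Z1 and rows from B receive NA in Y1; the toy frames a and b are hypothetical, and this is not the package's implementation.

```r
# Hypothetical toy inputs mirroring the column layout of datA and datB
a <- data.frame(X = c("F", "M", "F"), Y1 = c("blue", "red", "blue"))
b <- data.frame(X = c("M", "F"), Z1 = c("green", "red"))

# Add the variable that each source lacks, filled with NA
a$Z1 <- NA_character_
b$Y1 <- NA_character_

# Stack the two sources using a common column order
cols <- c("X", "Y1", "Z1")
stacked <- rbind(a[, cols], b[, cols])
stacked
```

The result has one row per unit of A and B, with the missing blocks marking what an imputation model would have to fill in.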

Dataset datB

Description

A simple dataset with categorical variables.

Usage

datB

Format

A data frame with 15 observations and 2 variables:

X

Category (e.g., "F", "M")

Z1

Color category (e.g., "red", "green", "blue")

Source

Simulated data


Convert factor/character variables to numeric with mapping

Description

Converts all factor or character columns in a data.frame to numeric codes and stores the mapping tables.

Usage

fact_to_num(df)

Arguments

df

data.frame with factor or character columns

Value

A list with:

data

data.frame with numeric-coded variables

levels

list of factor levels

tables

list of mapping tables (factor → numeric)

Examples

data(datA)
fact_to_num(datA)
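
A minimal base R sketch of the idea follows. It assumes integer codes assigned in factor-level order; the column names level_fact and level_num are taken from the num_to_fact() documentation, while the toy vector x is hypothetical and the actual coding used by fact_to_num() may differ.

```r
# Hypothetical toy column; as.integer(factor(.)) yields codes 1..k in level order
x <- c("F", "M", "F", "F")
f <- factor(x)
codes <- as.integer(f)

# Mapping table from original labels to numeric codes
map <- data.frame(level_fact = levels(f),
                  level_num = seq_along(levels(f)))
codes
map
```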

Convert numeric codes back to factor

Description

Restores a factor variable from numeric codes using a mapping table created by fact_to_num().

Usage

num_to_fact(x, table)

Arguments

x

numeric vector

table

data.frame with columns level_fact and level_num

Value

A factor vector:

levels

Defined by table$level_num.

labels

Defined by table$level_fact.
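
The mapping described above corresponds to a plain factor() call, as the following base R sketch shows. The table map and the vector x are hypothetical toy inputs in the documented layout; the sketch does not call num_to_fact() itself.

```r
# Hypothetical mapping table with the documented columns
map <- data.frame(level_fact = c("F", "M"), level_num = c(1, 2))
x <- c(1, 2, 2, 1)

# Levels come from level_num, labels from level_fact
restored <- factor(x, levels = map$level_num, labels = map$level_fact)
restored
```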


Quality assessment of synthetic Z1

Description

Evaluates the quality of the synthetic target variable by computing the Hellinger distance between the reference and synthetic distributions. This measure quantifies the degree of similarity between the two distributions, providing an assessment of the accuracy and coherence of the data fusion process.

Usage

sm_quality(step1, step2)

Arguments

step1

output from smc_step1()

step2

output from step 2 method

Value

A list with:

heli_latent

A numeric value representing the Hellinger distance between the reference and synthetic distributions.

ref_distr

A numeric vector or table representing the reference (original) distribution.

synt_distr

A numeric vector or table representing the synthetic (generated) distribution.
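
For orientation, the Hellinger distance between two discrete distributions can be computed in base R as sketched below. The helper hellinger() and the toy counts are hypothetical, and the scaling by 1/sqrt(2), which bounds the distance in [0, 1], is an assumption; the package may use a different normalisation.

```r
# Hellinger distance between two discrete distributions on the same categories
hellinger <- function(p, q) {
  p <- p / sum(p)  # normalise counts to probabilities
  q <- q / sum(q)
  sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
}

ref  <- c(blue = 10, green = 6, red = 4)  # hypothetical reference counts
synt <- c(blue = 9,  green = 7, red = 4)  # hypothetical synthetic counts
hellinger(ref, synt)
```

A value of 0 means the two distributions coincide; under this scaling a value of 1 means they share no category with positive probability.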


SMA1 – Nearest-neighbour hot deck on shared variables

Description

Second step of the SMA procedure using nearest-neighbour hot deck matching on shared variables (Y1, Z1) to fuse datasets while preserving their joint distribution.

Usage

sma1_step2(step1)

Arguments

step1

list returned by the SMA step 1 procedure

Value

A list with:

datA_fused_2

The final fused dataset A.

out_nndd

NNDD matching results.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )

  # adding auxiliary information
  datABC <- rbind(datAB, datC)

  # call DPMPM (reduced settings for speed)
  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC,
    nrun = 50,
    burn = 10,
    thin = 10,
    K = 20,
    aalpha = 0.25,
    balpha = 0.25,
    m = 2,
    seed = 1234,
    silent = TRUE
  )

  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  step_second_aii1 <- sma1_step2(step_first_aii)

  result_aii1 <- step_second_aii1$datA_fused_2
  head(result_aii1)
}

SMA2 – Hot deck on fitted multinomial probabilities

Description

Second step of SMA using multinomial models and nearest-neighbour hot deck on fitted probabilities to improve data fusion accuracy.

Usage

sma2_step2(step1)

Arguments

step1

list returned by the SMA step 1 procedure

Value

A list with:

datA_fused_2

A data frame containing the final fused (imputed) version of dataset A.

out_nnd

A list with results from the nearest-neighbour distance matching procedure.

model_form

A formula object specifying the model used.

FitteddatA_imp1

A data frame of fitted values obtained for dataset A.

FitteddatC_imp1

A data frame of fitted values obtained for dataset C.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )

  # adding auxiliary information
  datABC <- rbind(datAB, datC)

  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC,
    nrun = 50,
    burn = 10,
    thin = 10,
    K = 20,
    aalpha = 0.25,
    balpha = 0.25,
    m = 2,
    seed = 1234,
    silent = TRUE
  )

  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  step_second_aii2 <- sma2_step2(step_first_aii)

  result_aii2 <- step_second_aii2$datA_fused_2
  head(result_aii2)
}

SMA3 – Multinomial simulation approach

Description

Second step of SMA that generates the target variable by sampling from multinomial probability distributions derived from a donor-based model, to achieve statistically coherent data fusion.

Usage

sma3_step2(step1)

Arguments

step1

list returned by the SMA step 1 procedure

Value

A list with:

datA_fused_2

A data frame containing the final fused version of dataset A.

beta_BZ

A numeric vector of estimated model coefficients.

AA_dummy

A matrix or data frame of dummy variables constructed for dataset A.

BB_dummy

A matrix or data frame of dummy variables constructed for dataset B.

prob_Z_all

A numeric matrix of predicted probabilities for each category of variable Z (rows correspond to observations).

zero_one

A binary matrix (one-hot encoded) sampled from prob_Z_all, where each row contains a single 1 indicating the selected category of Z.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )

  # adding auxiliary information
  datABC <- rbind(datAB, datC)

  # call DPMPM (reduced settings for speed)
  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC,
    nrun = 50,
    burn = 10,
    thin = 10,
    K = 20,
    aalpha = 0.25,
    balpha = 0.25,
    m = 2,
    seed = 1234,
    silent = TRUE
  )

  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  step_second_aii3 <- sma3_step2(step_first_aii)

  result_aii3 <- step_second_aii3$datA_fused_2
  head(result_aii3)
}
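
The multinomial simulation step described for sma3_step2() can be sketched in base R as follows. The probability matrix prob is hypothetical and plays the role of prob_Z_all, and one_hot plays the role of zero_one; how the package actually estimates the probabilities is not shown here.

```r
set.seed(1)

# Hypothetical probability matrix: one row per unit, one column per category of Z
prob <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.1, 0.8),
              c(0.3, 0.4, 0.3))
colnames(prob) <- c("blue", "green", "red")

# One multinomial draw of size 1 per row gives a one-hot indicator row
one_hot <- t(apply(prob, 1, function(p) rmultinom(1, size = 1, prob = p)))
colnames(one_hot) <- colnames(prob)

# Translate each indicator row into the sampled category label
z_sim <- colnames(prob)[max.col(one_hot, ties.method = "first")]
z_sim
```

Because the value is drawn rather than set to the most probable category, repeated runs with different seeds reflect the uncertainty in Z.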

SMA step 1: selection of best imputed dataset

Description

Selects the imputed dataset minimizing the Hellinger distance between the reference distribution from dataset A and the synthetic distribution from dataset B, in a three-sample statistical matching framework (A, B, C).

Usage

sma_step1(datA, datB, datC, output_ll)

Arguments

datA

data.frame A

datB

data.frame B

datC

data.frame C

output_ll

list with imputed datasets (impdata)

Value

A list with imputed datasets:

datABC_imp1

The full imputed dataset combining A, B, and C.

datA_imp1

Subset of the imputed data corresponding to dataset A.

datB_imp1

Subset corresponding to dataset B.

datC_imp1

Subset corresponding to dataset C.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)
  datC <- data.frame(
    X = datB$X[1:4],
    Y1 = datA$Y1[1:4],
    Z1 = datB$Z1[1:4]
  )

  # adding auxiliary information
  datABC <- rbind(datAB, datC)

  # call DPMPM
  output_list_AII <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datABC, nrun = 500, burn = 50, thin = 50,
    K = 80, aalpha = 0.25, balpha = 0.25,
    m = 2, seed = 1234, silent = FALSE
  )

  step_first_aii <- sma_step1(datA, datB, datC, output_list_AII)
  names(step_first_aii)
}

SMC1 - Nearest-neighbour hot deck on original variables

Description

Performs nearest-neighbour hot deck imputation on the selected imputed datasets by matching recipient units in dataset A with donor units in dataset B based on common variables. The procedure transfers the target variable from the nearest donor to construct a statistically coherent fused dataset.

Usage

smc1_step2(step1)

Arguments

step1

output from smc_step1()

Value

A list with:

datA_fused_2

The final fused dataset A after step 2 of the procedure.

out_nndd

A list containing nearest-neighbour distance matching results.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)

  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB,
    nrun = 50,
    burn = 10,
    thin = 10,
    K = 20,
    aalpha = 0.25,
    balpha = 0.25,
    m = 2,
    seed = 1234,
    silent = TRUE
  )

  step_first <- smc_step1(datA, datB, output_list)
  step_second <- smc1_step2(step_first)

  result <- step_second$datA_fused_2
  result
}
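
The matching idea behind a nearest-neighbour hot deck on categorical variables can be sketched in base R. The frames rec and don are hypothetical, the distance is a simple count of mismatching variables, and ties go to the first donor; these choices are assumptions and do not reproduce the matching routine the package relies on.

```r
# Hypothetical recipient (no Z1) and donor (with Z1) data
rec <- data.frame(X = c("F", "M", "F"), Y1 = c("blue", "red", "red"))
don <- data.frame(X = c("M", "F", "F"), Y1 = c("red", "blue", "green"),
                  Z1 = c("green", "red", "blue"))
match_vars <- c("X", "Y1")

# For each recipient, count mismatches with every donor and take the closest
donor_id <- vapply(seq_len(nrow(rec)), function(i) {
  mismatch <- sapply(match_vars, function(v) don[[v]] != rec[[v]][i])
  which.min(rowSums(mismatch))  # ties resolved by taking the first donor
}, integer(1))

# Transfer the donor's Z1 to the recipient
fused <- cbind(rec, Z1 = don$Z1[donor_id])
fused
```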

SMC2 – Hot deck on fitted multinomial probabilities

Description

Implements a model-based hot deck data fusion approach by estimating multinomial models on both datasets and matching observations based on their fitted probability distributions.

Usage

smc2_step2(step1)

Arguments

step1

output from smc_step1()

Value

A list with:

datA_fused_2

A data frame containing the final fused (imputed) version of dataset A.

out_nndd

A list with results from the nearest-neighbour distance matching procedure.

model_form

A formula object used to fit the models.

FitteddatA_imp1

A data frame of fitted values obtained from the model for dataset A.

FitteddatB_imp1

A data frame of fitted values obtained from the model for dataset B.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)

  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB,
    nrun = 50,
    burn = 10,
    thin = 10,
    K = 20,
    aalpha = 0.25,
    balpha = 0.25,
    m = 2,
    seed = 1234,
    silent = TRUE
  )

  step_first <- smc_step1(datA, datB, output_list)
  step_second <- smc2_step2(step_first)

  result <- step_second$datA_fused_2
  result
}
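
Matching on fitted probabilities can be illustrated with the base R sketch below. The matrices fit_rec and fit_don stand in for fitted class probabilities of recipients and donors, and the Manhattan distance is an assumption; the model fitting itself is omitted.

```r
# Hypothetical fitted class probabilities (rows = units, columns = categories)
fit_rec <- rbind(c(0.6, 0.3, 0.1),
                 c(0.2, 0.2, 0.6))
fit_don <- rbind(c(0.1, 0.3, 0.6),
                 c(0.7, 0.2, 0.1),
                 c(0.3, 0.4, 0.3))
z_don <- c("red", "blue", "green")  # donor values of the target variable

# Manhattan distance from every recipient (rows) to every donor (columns)
dist_rd <- apply(fit_don, 1, function(d) rowSums(abs(sweep(fit_rec, 2, d))))

# Nearest donor per recipient and the transferred value
donor_id <- apply(dist_rd, 1, which.min)
z_imputed <- z_don[donor_id]
z_imputed
```

Units with similar predicted distributions are paired even when their observed categories differ, which is the main difference from matching on the original variables.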

SMC3 – Multinomial simulation approach

Description

Generates the target variable using multinomial simulation based on fitted probability distributions to preserve uncertainty and variability.

Usage

smc3_step2(step1)

Arguments

step1

output from smc_step1()

Value

A list with:

datA_fused_2

A data frame containing the final fused (imputed) version of dataset A.

FitteddatA_imp1

A data frame of fitted values obtained from the model for dataset A.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)

  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB,
    nrun = 50,
    burn = 10,
    thin = 10,
    K = 20,
    aalpha = 0.25,
    balpha = 0.25,
    m = 2,
    seed = 1234,
    silent = TRUE
  )

  step_first <- smc_step1(datA, datB, output_list)
  step_second <- smc3_step2(step_first)

  result <- step_second$datA_fused_2
  result
}

SMC step 1: selection of best imputed dataset

Description

Identifies and selects the imputed dataset that minimizes the Hellinger distance between the reference and synthetic distributions, thereby achieving the highest level of distributional similarity and improving the statistical consistency of the imputation.

Usage

smc_step1(datA, datB, output_ll)

Arguments

datA

data.frame A

datB

data.frame B

output_ll

list with imputed datasets (impdata)

Value

A list with imputed datasets:

datAB_imp1

The full imputed dataset obtained after combining A and B.

datA_imp1

Subset of datAB_imp1 corresponding to dataset A.

datB_imp1

Subset of datAB_imp1 corresponding to dataset B.

Examples

if (requireNamespace("NPBayesImputeCat", quietly = TRUE)) {
  data(datA)
  data(datB)

  datAB <- datAB_to_SM(datA, datB)

  # call DPMPM (reduced settings for speed)
  output_list <- NPBayesImputeCat::DPMPM_nozeros_imp(
    X = datAB,
    nrun = 50,
    burn = 10,
    thin = 10,
    K = 20,
    aalpha = 0.25,
    balpha = 0.25,
    m = 2,
    seed = 1234,
    silent = TRUE
  )

  step_first <- smc_step1(datA, datB, output_list)
  str(step_first$datA_imp1)
  str(step_first$datB_imp1)
}
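
The selection rule can be sketched in base R: compute the Hellinger distance for each candidate imputation and keep the one with the smallest value. The helper hellinger(), the toy factors, and the choice of which variable is compared are hypothetical; the sketch does not reproduce the internals of smc_step1().

```r
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

# Hypothetical reference values and two candidate imputations of one variable
lev <- c("blue", "green", "red")
reference  <- factor(c("blue", "blue", "red", "green"), levels = lev)
candidates <- list(
  factor(c("red", "red", "red", "green"), levels = lev),
  factor(c("blue", "red", "blue", "green"), levels = lev)
)

# Relative frequencies on a common set of levels
ref_distr <- prop.table(table(reference))
dists <- vapply(candidates,
                function(z) hellinger(ref_distr, prop.table(table(z))),
                numeric(1))

best <- which.min(dists)  # index of the candidate closest to the reference
best
```

Fixing the factor levels up front keeps the frequency tables aligned even when a candidate never takes one of the categories.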