Let \(x_i = \left(x_i^1,\cdots,x_i^p\right)^T\) and \(x_j = \left(x_j^1,\cdots,x_j^p\right)^T\) be the feature vectors of two distinct patients \(i\) and \(j\). A first rough idea is to compute the \(L_1\)-norm of the difference between these two feature vectors, here scaled by the number of features \(p\): \[ \|x_i - x_j\|_{L_1}=\frac{1}{p}\sum\limits_{k=1}^p|x_i^k - x_j^k| \]
However, this naive approach raises several problems:
This measure is well-defined for numerical features but not for categorical features, such as gender (male/female). We need a method to define the distance for categorical variables.
\(|x_i^k - x_j^k|\) is not scale-invariant, meaning that if one changes the unit of measurement (e.g., from meters to centimeters), the contribution of this feature would increase by a factor of 100. Features with large values will dominate the distance.
All features are treated equally, which may not reflect their actual importance. For example, in the case of lung cancer, the patient’s smoking status (no/yes) seems more relevant than the city they live in.
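The scale problem can be made concrete with a small sketch. The patients, feature values, and the function name `mean_abs_distance` are hypothetical and chosen only for illustration; the function implements the scaled \(L_1\) formula above for purely numerical features.

```python
def mean_abs_distance(x_i, x_j):
    """Mean absolute difference between two numeric feature vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x_i, x_j)) / len(x_i)


# Hypothetical patients described by (height, age); height in metres.
patient_a_m = (1.80, 50.0)
patient_b_m = (1.60, 52.0)

# The same two patients, with height expressed in centimetres.
patient_a_cm = (180.0, 50.0)
patient_b_cm = (160.0, 52.0)

dist_m = mean_abs_distance(patient_a_m, patient_b_m)
dist_cm = mean_abs_distance(patient_a_cm, patient_b_cm)

print(dist_m, dist_cm)
```

Although the patients are identical in both cases, the distance grows by roughly a factor of ten, because the height term now contributes 20 instead of 0.2 and outweighs the age difference.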
This package addresses these issues by deriving a weighted distance from regression model coefficients, which naturally handles mixed feature types and reflects each feature’s importance.
To address the challenges discussed earlier, we need to perform two steps:
Generalize the L1-distance to handle different variable types, including numerical and categorical features.
Assign appropriate weights to each feature to:
a. eliminate the dependency on the scale of the variables, and
b. reflect the importance of each feature.
By incorporating these modifications, we arrive at the following weighted distance measure:
\[ d(x_i, x_j) = \sum\limits_{k=1}^p |\alpha(x_i^k, x_j^k) d(x_i^k, x_j^k)| \]
The weighted distance measure now includes weights \(\alpha(x_i^k, x_j^k)\), which are determined based on the training data. In the next section, we will discuss how to obtain these weights and effectively compute the weighted distance measure for mixed data types and varying feature importance.
Let \(\alpha(x_i^k, x_j^k)\) denote the weight and \(d(x_i^k, x_j^k)\) the distance for feature \(k\) between observations \(i\) and \(j\). We define them as follows:
If feature \(k\) is numerical, then
\[d(x_i^k, x_j^k) = x_i^k - x_j^k\] and
\[\alpha(x_i^k, x_j^k) = \hat{\beta}_k\]
If feature \(k\) is categorical, then
\[d(x_i^k, x_j^k) = 1 \text{ when } x_i^k \neq x_j^k \text{ else } 0\] and \[\alpha(x_i^k, x_j^k) = \hat{\beta}_k^i - \hat{\beta}_k^j,\]
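where \(\hat{\beta}_k\) is the estimated coefficient of feature \(k\), and \(\hat{\beta}_k^i\) is the estimated coefficient of the level of feature \(k\) observed for patient \(i\), both taken from a regression model (linear, logistic, or Cox proportional hazards (CPH) model). The regression coefficients thus serve as weights that reflect each feature’s importance and scale.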
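As a minimal sketch of these definitions, the weighted distance can be computed from a set of coefficients as follows. The coefficient values, feature names, and the function `weighted_distance` are hypothetical and are not taken from the package. The sketch assumes that the categorical indicator is 1 when the two levels differ, and that the reference level of a categorical feature has coefficient 0.

```python
def weighted_distance(x_i, x_j, numeric_coefs, categorical_coefs):
    """Sum of |alpha * d| over all features of two observations.

    numeric_coefs:     {feature: beta_k}
    categorical_coefs: {feature: {level: beta_k for that level}}
    """
    total = 0.0
    for feature, beta in numeric_coefs.items():
        # Numerical feature: d is the raw difference, alpha is beta_k.
        total += abs(beta * (x_i[feature] - x_j[feature]))
    for feature, level_coefs in categorical_coefs.items():
        level_i, level_j = x_i[feature], x_j[feature]
        # Categorical feature: d is 1 if the levels differ (assumption),
        # alpha is the difference of the two level coefficients.
        d = 1.0 if level_i != level_j else 0.0
        alpha = level_coefs[level_i] - level_coefs[level_j]
        total += abs(alpha * d)
    return total


# Hypothetical coefficients, e.g. from a fitted logistic regression.
numeric_coefs = {"age": 0.03}
categorical_coefs = {"smoker": {"no": 0.0, "yes": 0.9}}

patient_i = {"age": 60, "smoker": "yes"}
patient_j = {"age": 50, "smoker": "no"}
patient_k = {"age": 50, "smoker": "yes"}

d_ij = weighted_distance(patient_i, patient_j, numeric_coefs, categorical_coefs)
d_ik = weighted_distance(patient_i, patient_k, numeric_coefs, categorical_coefs)

print(d_ij, d_ik)
```

With these illustrative coefficients, a difference in smoking status contributes three times as much to the distance as a ten-year age difference, and rescaling age would rescale its coefficient accordingly.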
In a random forest, two observations \(i\) and \(j\) are considered more similar the closer to one the fraction of trees is in which they share the same terminal node (Breiman, 2002).
\[d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T 1_{[x_i \text{ and } x_j \text{ share the same terminal node in tree } t]},\] where \(M\) is the number of trees that contain both observations and \(T\) is the total number of trees.
A drawback of this measure is that the decision is binary, meaning that potentially similar observations might be counted as dissimilar. For example, suppose a final cut-off is consistently made around age 58, and observation 1 has an age of 56 while observation 2 has an age of 60. In this case, the distance between observation 1 and observation 2 would be the same as the distance between observation 1 and an observation with an age of 80. This limitation makes the proximity measure less sensitive to small differences between observations, potentially affecting the overall analysis of similarity.
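The age example can be reproduced with a short sketch. It assumes a toy forest in which every tree is a single split at age 58; the helper names `proximity_distance` and `leaf_for_age` are hypothetical. An entry of `None` marks a tree that does not contain the observation, so that only the \(M\) trees containing both observations are counted.

```python
import math


def proximity_distance(leaves_i, leaves_j):
    """Proximity-based distance from per-tree terminal-node ids.

    leaves_i, leaves_j: one terminal-node id per tree, or None if the
    observation is not contained in that tree.
    """
    shared = 0  # trees in which both observations share a terminal node
    common = 0  # M: trees that contain both observations
    for leaf_i, leaf_j in zip(leaves_i, leaves_j):
        if leaf_i is None or leaf_j is None:
            continue
        common += 1
        if leaf_i == leaf_j:
            shared += 1
    if common == 0:
        raise ValueError("no tree contains both observations")
    return math.sqrt(1.0 - shared / common)


def leaf_for_age(age, cutoff=58):
    """Toy tree with one split: ages below the cutoff go left."""
    return "left" if age < cutoff else "right"


n_trees = 3  # every toy tree uses the same cut-off
leaves = {age: [leaf_for_age(age)] * n_trees for age in (56, 60, 80)}

d_56_60 = proximity_distance(leaves[56], leaves[60])
d_56_80 = proximity_distance(leaves[56], leaves[80])
d_60_80 = proximity_distance(leaves[60], leaves[80])

print(d_56_60, d_56_80, d_60_80)
```

In this toy setting the ages 56 and 60 are as far apart as 56 and 80, while 60 and 80 have distance zero, which is exactly the insensitivity described above.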
In contrast to the proximity measure, the depth measure takes into account the number of edges between the terminal nodes of two observations, rather than only whether those terminal nodes coincide. This distance measure is averaged over all trees and is defined as:
\[ d(x_i, x_j) = \frac{1}{M}\sum\limits_{t=1}^T g_{ij}^{(t)}, \]
where \(M\) is the number of trees containing both observations, \(T\) is the total number of trees, and \(g_{ij}^{(t)}\) is the number of edges between the terminal nodes of observations \(i\) and \(j\) in tree \(t\).
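The following sketch shows one way to count the edges between two terminal nodes and average them over trees. It assumes each tree is given as a child-to-parent mapping, which is a hypothetical representation chosen for illustration and not the package's internal format; the function names are likewise hypothetical. The toy tree first splits at age 70 and then makes its final split at age 58.

```python
def path_to_root(node, parent):
    """Nodes visited when walking from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def edges_between(node_a, node_b, parent):
    """Number of edges on the path between two nodes of one tree."""
    steps_from_a = {n: k for k, n in enumerate(path_to_root(node_a, parent))}
    for steps_from_b, n in enumerate(path_to_root(node_b, parent)):
        if n in steps_from_a:  # first common ancestor
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes do not belong to the same tree")


def depth_distance(leaves_i, leaves_j, parents):
    """Average edge count over the trees containing both observations."""
    total, common = 0, 0
    for leaf_i, leaf_j, parent in zip(leaves_i, leaves_j, parents):
        if leaf_i is None or leaf_j is None:
            continue  # tree does not contain both observations
        common += 1
        total += edges_between(leaf_i, leaf_j, parent)
    if common == 0:
        raise ValueError("no tree contains both observations")
    return total / common


# Toy tree: root splits at age 70; the left child "L" splits at age 58.
# Terminal nodes: "A" (age < 58), "B" (58 <= age < 70), "C" (age >= 70).
toy_tree = {"A": "L", "B": "L", "L": "root", "C": "root"}
parents = [toy_tree, toy_tree]  # two identical toy trees

leaves = {56: ["A", "A"], 60: ["B", "B"], 80: ["C", "C"]}

d_56_60 = depth_distance(leaves[56], leaves[60], parents)
d_56_80 = depth_distance(leaves[56], leaves[80], parents)

print(d_56_60, d_56_80)
```

Here the observations aged 56 and 60 end in different terminal nodes but are only two edges apart, whereas 56 and 80 are three edges apart, so the depth measure separates the two cases that the proximity measure treats alike.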
For more details and a thorough explanation of the depth measure, refer to the publication by Englund and Verikas: “A novel approach to estimate proximity in a random forest: An exploratory study.”
By accounting for tree structure, the depth measure provides a more nuanced similarity assessment than the binary proximity measure.