Examples

CosinusDistance

The cosine distance is a distance measure based on the cosine similarity. Let \(A\) be the data matrix and \(A_i\), \(A_j\) two row vectors of \(A\). The cosine similarity is defined as \(s(i,j) = \cos(\theta) = \frac{\mathbf{A_i} \cdot \mathbf{A_j}}{\|\mathbf{A_i}\| \|\mathbf{A_j}\|}\), where \(\theta\) is the angle between \(A_i\) and \(A_j\), and the cosine distance as \(d(i,j) = \max(s) - s(i,j)\).

data(Hepta) 
distMatrix = CosinusDistance(Hepta$Data)
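
The two formulas above can be reproduced in a few lines of base R. This is only a sketch on a hypothetical toy matrix, not the package's implementation; the names `A`, `norms`, `S` and `D` are chosen for illustration, and \(\max(s)\) is assumed to be the largest entry of the similarity matrix.

```r
# Hypothetical toy matrix; each row is one vector A_i
A <- rbind(c(1, 0), c(1, 1), c(0, 2))

# Euclidean norm of every row
norms <- sqrt(rowSums(A^2))

# Cosine similarity s(i,j) = (A_i . A_j) / (|A_i| |A_j|)
S <- (A %*% t(A)) / (norms %o% norms)

# Cosine distance d(i,j) = max(s) - s(i,j)
# (assumption: the maximum is taken over the whole similarity matrix)
D <- max(S) - S
print(round(D, 4))
```

Rows 1 and 3 are orthogonal, so their similarity is 0 and their distance equals the maximum similarity.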

Dist2All

The Dist2All function calculates the distances from a given point \(x\) to all points (rows) of a given data matrix \(A\). Various distance measures can be chosen, e.g. Euclidean, Manhattan (City Block), Mahalanobis, or Bhjattacharyya; for a complete list, see parallelDist. The distance measure is specified with the method argument. The function returns a vector of the distances from point \(x\) to all points in \(A\), as well as the indices of the k nearest neighbors for the chosen distance measure, ordered by ascending distance.

data(Hepta)
V = Dist2All(Hepta$Data[1,],Hepta$Data, method = "euclidean", knn=3)
# Vector of distances from Hepta$Data[1,] to all rows in Hepta$Data (the first entry is the distance to itself)
print(V$distToAll)
#>   [1] 0.00000000 0.08058895 0.04781043 0.09214454 0.03886835 0.09699465
#>   [7] 0.09783282 0.06061589 0.10267484 0.08356635 0.12395455 0.13170909
#>  [13] 0.09367107 0.07311107 0.10489254 0.07201004 0.16944962 0.08932404
#>  [19] 0.13545775 0.03665085 0.12492320 0.13485587 0.05292353 0.03097211
#>  [25] 0.14614814 0.08677934 0.02394002 0.11479889 0.03230518 0.14903284
#>  [31] 0.13893677 0.14948497 2.52446917 3.01913398 3.21579110 3.17836461
#>  [37] 3.15769630 2.57796199 3.31600855 3.42089925 3.83891632 3.58352663
#>  [43] 3.82807942 3.74241501 3.66695458 2.56495911 3.48798635 2.51574735
#>  [49] 2.47210300 3.08088830 3.29044933 2.51797162 2.77927187 3.11074797
#>  [55] 3.09722767 3.55496535 3.33477595 3.09117752 3.14815058 2.98838821
#>  [61] 2.90743022 2.93734164 2.66675926 2.64259662 3.36701734 3.90809765
#>  [67] 3.30677234 2.66204951 2.20011186 3.05133415 3.51307757 2.65792595
#>  [73] 3.32292220 2.87368651 2.72029774 2.61406504 2.92254677 2.83595565
#>  [79] 3.55531532 2.53112234 2.76796248 2.90106297 3.11617310 3.60091602
#>  [85] 3.55476025 2.31093773 2.62329978 3.22935873 2.56992529 3.40588864
#>  [91] 3.90022679 2.81386462 3.02034146 3.05283900 2.13176098 3.07322426
#>  [97] 3.49643390 3.33818899 3.28321841 2.64631876 3.34413644 3.69818991
#> [103] 2.86346004 3.65485742 3.78012860 3.58415974 2.65018304 3.56255550
#> [109] 3.65163561 2.91422029 3.07258132 2.45181926 2.29991462 3.20917355
#> [115] 3.70924494 2.59280107 2.97424022 2.83887470 3.53219603 2.70771842
#> [121] 3.03205030 3.31160172 2.47996181 3.05245948 3.12721819 3.63906971
#> [127] 3.07121966 2.40720597 2.77981952 3.75378880 3.93878434 2.63787864
#> [133] 3.57013739 3.00944011 3.00081140 3.10025752 2.44570366 3.09900684
#> [139] 2.94566780 3.22610410 3.77257806 2.95948219 3.04835200 3.29707317
#> [145] 2.38829944 3.36077136 3.68833648 2.18316289 2.99890839 2.81540383
#> [151] 2.42404613 3.81733227 2.92926568 3.45549966 3.21561093 3.37903200
#> [157] 2.41146632 3.09742210 2.93177839 3.02379783 3.01282943 2.31164299
#> [163] 2.92613725 3.30081802 2.89988712 2.83634572 2.95293088 3.38450777
#> [169] 2.22953148 2.83342086 3.52553473 2.32071642 2.65455358 2.52694921
#> [175] 2.78506782 3.55896170 2.21862698 3.10491516 2.20840668 2.95602706
#> [181] 3.02296244 3.87704358 3.02381731 3.93379495 3.70924221 3.03949680
#> [187] 3.13826953 3.00121181 2.97098494 2.90194795 3.67516270 3.42685363
#> [193] 3.65196565 2.40343230 3.17742347 2.80353846 3.04065098 2.98600351
#> [199] 3.22565744 3.16701313 2.52899115 3.72693787 2.57746647 3.77579621
#> [205] 2.94798545 3.06495823 2.52541787 2.76796966 3.30391078 2.95077124
#> [211] 3.67311616 2.69897901
# Indices of the k nearest neighbors according to the Euclidean distance
print(V$KNN)
#> [1]  1 27 24
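
To make the two return values concrete, the following base R sketch computes the same kind of result for the Euclidean case on a hypothetical toy matrix. It is an illustration under stated assumptions, not the code behind Dist2All; the variable names `distToAll` and `knn` merely mirror the output fields shown above.

```r
# Hypothetical toy data: 5 points in 2 dimensions
A <- rbind(c(0, 0), c(3, 4), c(1, 0), c(0, 2), c(6, 8))
x <- A[1, ]

# Euclidean distance from x to every row of A, kept in row order
diffs <- sweep(A, 2, x)               # subtract x from each row
distToAll <- sqrt(rowSums(diffs^2))

# Indices of the k nearest neighbors, sorted by ascending distance;
# x itself appears first because it is a row of A (distance 0)
k <- 3
knn <- order(distToAll)[seq_len(k)]

print(distToAll)
print(knn)
```

As in the Hepta output above, the query point is its own nearest neighbor, so only k - 1 of the returned indices refer to other points.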

DistanceMatrix

For a given \([1:n, 1:d]\) data matrix \(A\) with \(n\) cases and \(d\) variables, the function calculates the symmetric \([1:n, 1:n]\) distance matrix for a chosen distance measure. The method argument specifies the distance measure (euclidean by default).

data(Hepta)
Dmatrix = DistanceMatrix(Hepta$Data, method='euclidean')

Options for method include:

‘euclidean’, ‘sqEuclidean’, ‘binary’, ‘cityblock’, ‘maximum’, ‘canberra’, ‘cosine’, ‘chebychev’, ‘jaccard’, ‘mahalanobis’, ‘minkowski’, ‘manhattan’, ‘braycur’.

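For the method ‘minkowski’, the parameter dim can be used to specify the value of \(p\) in \(\left( \sum_{i=1}^{d} |A_{j i} - A_{l i}|^p \right)^{1/p}\), the distance between rows \(j\) and \(l\) of \(A\), where the sum runs over the \(d\) variables.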

Dmatrix = DistanceMatrix(Hepta$Data, method='minkowski', dim=3)
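
The Minkowski formula can be checked independently of the package with base R. The sketch below uses `stats::dist`, whose `p` argument plays the role that dim plays above, and compares it with a direct evaluation of the sum over the columns. The toy matrix and the helper `minkowski` are hypothetical and only serve the illustration.

```r
# Hypothetical toy matrix: 3 cases, 2 variables
A <- rbind(c(0, 0), c(3, 4), c(1, 1))
p <- 3

# Reference result from base R
D_base <- as.matrix(dist(A, method = "minkowski", p = p))

# Direct evaluation of (sum_i |a_i - b_i|^p)^(1/p) for two rows a and b
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

idx <- seq_len(nrow(A))
D_manual <- outer(idx, idx,
                  Vectorize(function(j, l) minkowski(A[j, ], A[l, ], p)))

# Both routes should agree up to floating-point tolerance
print(all.equal(unname(D_base), D_manual))
```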

Fractional Distances

The fractional distance function uses the formula of the Minkowski metric to calculate the distances and allows fractional values \(p \in (0,1]\) (the exponent \(1/p\) is undefined for \(p = 0\)), which can be useful for high-dimensional data [Aggarwal et al., 2001].

data(Hepta)
distMatrix = FractionalDistance(Hepta$Data, p = 1/2)
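
A minimal base R sketch of the idea, assuming the fractional distance is simply the Minkowski formula evaluated with \(p < 1\). The function `fractional_dist` and the toy matrix are hypothetical and are not taken from the package.

```r
# Hypothetical helper: Minkowski formula with a fractional exponent p
fractional_dist <- function(A, p) {
  stopifnot(p > 0, p <= 1)
  n <- nrow(A)
  D <- matrix(0, n, n)
  for (j in seq_len(n)) {
    for (l in seq_len(n)) {
      D[j, l] <- sum(abs(A[j, ] - A[l, ])^p)^(1 / p)
    }
  }
  D
}

# Hypothetical toy matrix: 3 cases, 2 variables
A <- rbind(c(0, 0), c(1, 4), c(9, 0))
D <- fractional_dist(A, p = 1/2)
print(D)
```

For \(p < 1\) the result is not a metric in the strict sense: in this toy example the distance between rows 2 and 3 exceeds the sum of their distances to row 1, so the triangle inequality does not hold.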

Tfidf-distance

The term frequency-inverse document frequency (Tf-idf) is a statistical measure of relevance of a term \(t\) to a document \(d\) in a collection of documents \(D\). The Tfidf-distance for two documents \(d_i\), \(d_j \in D\) is then the absolute difference between the Tfidf-values.

One application in bioinformatics is the calculation of distances between genes using the Tfidf-distance, based on GO terms (Gene Ontology terms). For this, a matrix \(A\) with \(n\) genes as rows and \(m\) GO terms as columns is used, where genes are interpreted as documents and GO terms as terms [Thrun, 2022].

data(Hearingloss_N109)
V = Tfidf_dist(Hearingloss_N109$FeatureMatrix_Gene2Term, tf_fun = mean)
# Get distances
dist = V$dist
# Get weights
TfidfWeights = V$TfidfWeights

By default, the (augmented) term frequency is calculated from the mean of the non-zero entries; a different function can be supplied with the argument tf_fun.

BIDistances

Introduction to Bioinformatic Distances