library(BIDistances)
#> Warning: replacing previous import 'Rcpp::LdFlags' by 'RcppParallel::LdFlags'
#> when loading 'DataVisualizations'
This package contains various distance measures that are useful for bioinformatics data.
Installation using GitHub
library(remotes)
install_github("Mthrun/BIDistances")
The cosine distance is a distance measure based on the cosine similarity. Let $A$ be the data matrix and $A_i$, $A_j$ two row vectors of $A$. The cosine similarity is then defined as $s(i,j) = \cos(\theta) = \frac{A_i \cdot A_j}{|A_i| |A_j|}$, and the cosine distance as $d(i,j) = \max$.
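The similarity formula above can be sketched in base R. This is a minimal illustration on a made-up matrix, not the package's implementation; the helper name `cosine_similarity` and the toy data are assumptions introduced here, and only the similarity (not the distance) is computed.

```r
# Hypothetical toy data: three row vectors in two dimensions
A <- rbind(c(1, 0),
           c(1, 1),
           c(0, 2))

# Cosine similarity s(i, j) = (A_i . A_j) / (|A_i| |A_j|) for all row pairs
# (hypothetical helper, not part of BIDistances)
cosine_similarity <- function(A) {
  norms <- sqrt(rowSums(A^2))       # Euclidean length of every row
  (A %*% t(A)) / (norms %o% norms)  # dot products divided by length products
}

S <- cosine_similarity(A)
print(round(S, 4))
```

Rows 1 and 3 are orthogonal, so their similarity is 0, while every row has similarity 1 with itself. A row consisting only of zeros has length 0 and would yield `NaN` in this sketch.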
The Dist2All function calculates the distances from a given point x to all points (rows) of a given data matrix A. Various distance measures can be chosen with the method argument, e.g. Euclidean, Manhattan (city block), Mahalanobis, or Bhattacharyya; for a complete list, see the documentation of parallelDist. The function returns the vector of distances from x to every row of A, in the row order of A, as well as the indices of the k nearest neighbors under the chosen distance measure, ordered by ascending distance.
data(Hepta)
V = Dist2All(Hepta$Data[1,],Hepta$Data, method = "euclidean", knn=3)
# Vector of distances from Hepta$Data[1,] to all rows of Hepta$Data (the first entry is the distance of the point to itself)
print(V$distToAll)
#> [1] 0.00000000 0.08058895 0.04781043 0.09214454 0.03886835 0.09699465
#> [7] 0.09783282 0.06061589 0.10267484 0.08356635 0.12395455 0.13170909
#> [13] 0.09367107 0.07311107 0.10489254 0.07201004 0.16944962 0.08932404
#> [19] 0.13545775 0.03665085 0.12492320 0.13485587 0.05292353 0.03097211
#> [25] 0.14614814 0.08677934 0.02394002 0.11479889 0.03230518 0.14903284
#> [31] 0.13893677 0.14948497 2.52446917 3.01913398 3.21579110 3.17836461
#> [37] 3.15769630 2.57796199 3.31600855 3.42089925 3.83891632 3.58352663
#> [43] 3.82807942 3.74241501 3.66695458 2.56495911 3.48798635 2.51574735
#> [49] 2.47210300 3.08088830 3.29044933 2.51797162 2.77927187 3.11074797
#> [55] 3.09722767 3.55496535 3.33477595 3.09117752 3.14815058 2.98838821
#> [61] 2.90743022 2.93734164 2.66675926 2.64259662 3.36701734 3.90809765
#> [67] 3.30677234 2.66204951 2.20011186 3.05133415 3.51307757 2.65792595
#> [73] 3.32292220 2.87368651 2.72029774 2.61406504 2.92254677 2.83595565
#> [79] 3.55531532 2.53112234 2.76796248 2.90106297 3.11617310 3.60091602
#> [85] 3.55476025 2.31093773 2.62329978 3.22935873 2.56992529 3.40588864
#> [91] 3.90022679 2.81386462 3.02034146 3.05283900 2.13176098 3.07322426
#> [97] 3.49643390 3.33818899 3.28321841 2.64631876 3.34413644 3.69818991
#> [103] 2.86346004 3.65485742 3.78012860 3.58415974 2.65018304 3.56255550
#> [109] 3.65163561 2.91422029 3.07258132 2.45181926 2.29991462 3.20917355
#> [115] 3.70924494 2.59280107 2.97424022 2.83887470 3.53219603 2.70771842
#> [121] 3.03205030 3.31160172 2.47996181 3.05245948 3.12721819 3.63906971
#> [127] 3.07121966 2.40720597 2.77981952 3.75378880 3.93878434 2.63787864
#> [133] 3.57013739 3.00944011 3.00081140 3.10025752 2.44570366 3.09900684
#> [139] 2.94566780 3.22610410 3.77257806 2.95948219 3.04835200 3.29707317
#> [145] 2.38829944 3.36077136 3.68833648 2.18316289 2.99890839 2.81540383
#> [151] 2.42404613 3.81733227 2.92926568 3.45549966 3.21561093 3.37903200
#> [157] 2.41146632 3.09742210 2.93177839 3.02379783 3.01282943 2.31164299
#> [163] 2.92613725 3.30081802 2.89988712 2.83634572 2.95293088 3.38450777
#> [169] 2.22953148 2.83342086 3.52553473 2.32071642 2.65455358 2.52694921
#> [175] 2.78506782 3.55896170 2.21862698 3.10491516 2.20840668 2.95602706
#> [181] 3.02296244 3.87704358 3.02381731 3.93379495 3.70924221 3.03949680
#> [187] 3.13826953 3.00121181 2.97098494 2.90194795 3.67516270 3.42685363
#> [193] 3.65196565 2.40343230 3.17742347 2.80353846 3.04065098 2.98600351
#> [199] 3.22565744 3.16701313 2.52899115 3.72693787 2.57746647 3.77579621
#> [205] 2.94798545 3.06495823 2.52541787 2.76796966 3.30391078 2.95077124
#> [211] 3.67311616 2.69897901
# Indices of the k nearest neighbors under the Euclidean distance (index 1 is the query point itself)
print(V$KNN)
#> [1] 1 27 24
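The two return values shown above can be reproduced conceptually in a few lines of base R. This is only a sketch for the Euclidean case, assuming a hypothetical helper `dist_to_all` and made-up data; it is not how Dist2All is implemented internally.

```r
# Hypothetical helper, not the package implementation
dist_to_all <- function(x, A, knn = 3) {
  diffs <- sweep(A, 2, x)              # subtract x from every row of A
  d <- sqrt(rowSums(diffs^2))          # Euclidean distance to each row
  list(distToAll = d,                  # distances in the row order of A
       KNN = order(d)[seq_len(knn)])   # indices of the knn smallest distances
}

# Hypothetical toy data: four points in the plane
A <- rbind(c(0, 0), c(3, 4), c(1, 0), c(0, 2))
V <- dist_to_all(A[1, ], A, knn = 3)
print(V$distToAll)  # 0 5 1 2
print(V$KNN)        # 1 3 4
```

Because the query point is itself a row of the matrix, it appears as its own nearest neighbor with distance 0, just as index 1 does in the Hepta output.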
For a given [1:n, 1:d] data matrix A with n cases and d variables, the function calculates the symmetric [1:n, 1:n] distance matrix for a chosen distance measure. The method argument specifies the distance measure (euclidean by default).
Options for method include:
‘euclidean’, ‘sqEuclidean’, ‘binary’, ‘cityblock’, ‘maximum’, ‘canberra’, ‘cosine’, ‘chebychev’, ‘jaccard’, ‘mahalanobis’, ‘minkowski’, ‘manhattan’, ‘braycur’.
For the method ‘minkowski’, the parameter dim can be used to specify the value of $p$ in the distance between rows $j$ and $l$ of $A$: $\left( \sum_{i=1}^{d} |A_{ji} - A_{li}|^p \right)^{1/p}$
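For comparison, base R's `stats::dist` produces the same kind of symmetric n-by-n matrix for a small subset of these measures. The snippet below is a sketch on made-up data that uses only base R; note that `dist` names the Minkowski exponent `p`, which is a property of base R and not of this package.

```r
# Hypothetical toy data: n = 4 cases, d = 2 variables
A <- rbind(c(0, 0), c(3, 4), c(1, 0), c(0, 2))

# stats::dist returns a compact "dist" object; as.matrix expands it to n x n
D_euclid <- as.matrix(dist(A, method = "euclidean"))
D_mink3  <- as.matrix(dist(A, method = "minkowski", p = 3))

print(dim(D_euclid))          # n rows and n columns
print(isSymmetric(D_euclid))  # distance matrices are symmetric
print(D_euclid[1, 2])         # sqrt(3^2 + 4^2) = 5
print(D_mink3[1, 2])          # (3^3 + 4^3)^(1/3) = 91^(1/3)
```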
The fractional distance function uses the formula of the Minkowski metric to calculate the distances and allows fractional values $p \in (0,1]$, which can be useful for high-dimensional data [Aggarwal et al., 2001].
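To make the effect of a fractional exponent concrete, the Minkowski formula can be evaluated directly. The following is a minimal sketch in base R; the helper `fractional_dist` and the toy data are hypothetical and are not the package's function.

```r
# Hypothetical helper illustrating the Minkowski formula with fractional p
fractional_dist <- function(A, p = 0.5) {
  stopifnot(p > 0)                 # 1/p is undefined for p = 0
  n <- nrow(A)
  D <- matrix(0, n, n)
  for (j in seq_len(n)) {
    for (l in seq_len(n)) {
      # (sum_i |A_ji - A_li|^p)^(1/p)
      D[j, l] <- sum(abs(A[j, ] - A[l, ])^p)^(1 / p)
    }
  }
  D
}

# Hypothetical toy data: three points in the plane
A <- rbind(c(0, 0), c(1, 1), c(4, 9))
D_half <- fractional_dist(A, p = 0.5)
print(D_half)
```

In this toy example the direct distance between rows 1 and 3 (25) exceeds the detour via row 2 (4 plus roughly 20.8), which illustrates that the triangle inequality need not hold for p < 1.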
The term frequency-inverse document frequency (Tf-idf) is a statistical measure of the relevance of a term $t$ to a document $d$ in a collection of documents $D$. The Tfidf-distance between two documents $d_i, d_j \in D$ is then the absolute difference between their Tfidf-values.
One example of its use for bioinformatics data is the calculation of distances between genes based on GO terms (Gene Ontology terms). For this, a matrix A with n genes as rows and m GO terms as columns is used, where genes are interpreted as documents and GO terms as terms [Thrun, 2022].
data(Hearingloss_N109)
V = Tfidf_dist(Hearingloss_N109$FeatureMatrix_Gene2Term, tf_fun = mean)
# Get distances
dist = V$dist
# Get weights
TfidfWeights = V$TfidfWeights
By default, the (augmented) term frequency is calculated using the mean of the non-zero entries; a different function can be specified with the argument tf_fun.
[Thrun, 2021] Thrun, M. C.: The Exploitation of Distance Distributions for Clustering, International Journal of Computational Intelligence and Applications, Vol. 20(3), pp. 2150016, DOI: 10.1142/S1469026821500164, 2021.
[Thrun, 2022] Thrun, M. C.: Knowledge-based Identification of Homogenous Structures in Genes, 10th World Conference on Information Systems and Technologies (WorldCist’22), in: Rocha, A., Adeli, H., Dzemyda, G., Moreira, F. (eds) Information Systems and Technologies, Lecture Notes in Networks and Systems, Vol. 468, pp. 81-90, DOI: 10.1007/978-3-031-04826-5_9, Budva, Montenegro, 12-14 April, 2022.
[Aggarwal et al., 2001] Aggarwal, C. C., Hinneburg, A., Keim, D.: On the Surprising Behavior of Distance Metrics in High Dimensional Space, 2001.