ConfidenceEllipse

R-CMD-check Project Status: Active – The project has reached a stable, usable state and is being actively developed. Lifecycle: stable License: MIT codecov

The ConfidenceEllipse package computes the coordinate points of confidence ellipses and ellipsoids for a given bivariate and trivariate dataset. The size of the ellipse and ellipsoid is determined by the confidence level, and the shape is determined by the covariance matrix. The confidence level is usually chosen to be 95% or 99%, and the resulting confidence region contains the points that are expected to lie within the multivariate distribution.

Installation

You can install the development version of ConfidenceEllipse like so:

# install.packages("remotes")
# remotes::install_github("ChristianGoueguel/ConfidenceEllipse")

Example

library(magrittr)
library(tidyselect)
library(patchwork)
library(dplyr)
library(ggplot2)
library(purrr)
library(tidyr)
library(ConfidenceEllipse)

Dataset

The dataset is comprised of 13 different measurements for 180 archaeological glass vessels from different groups (Janssen, K.H.A., De Raedt, I., Schalm, O., Veeckman, J.: Microchim. Acta 15 (suppl.) (1998) 253-267. Compositions of 15th - 17th century archaeological glass vessels excavated in Antwerp.)

data("glass", package = "ConfidenceEllipse")
glass %>% glimpse()
#> Rows: 180
#> Columns: 14
#> $ glassType <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, …
#> $ Na2O      <dbl> 13.904, 14.194, 14.668, 14.800, 14.078, 13.600, 12.942, 15.6…
#> $ MgO       <dbl> 2.244, 2.184, 3.034, 2.455, 2.480, 1.648, 2.690, 2.028, 2.25…
#> $ Al2O3     <dbl> 1.312, 1.310, 1.362, 1.385, 1.072, 2.012, 1.420, 1.242, 1.07…
#> $ SiO2      <dbl> 67.752, 67.076, 63.254, 63.790, 68.768, 69.628, 64.012, 70.6…
#> $ P2O5      <dbl> 0.884, 0.938, 0.988, 1.200, 0.682, 0.698, 0.966, 0.210, 0.75…
#> $ SO3       <dbl> 0.052, 0.024, 0.064, 0.115, 0.070, 0.038, 0.046, 0.310, 0.03…
#> $ Cl2O      <dbl> 0.936, 0.966, 0.886, 0.988, 0.966, 0.908, 0.896, 0.676, 0.93…
#> $ K2O       <dbl> 3.044, 3.396, 2.828, 2.878, 2.402, 3.196, 2.526, 2.326, 2.32…
#> $ CaO       <dbl> 8.784, 8.636, 11.088, 10.833, 8.808, 6.160, 12.982, 6.324, 9…
#> $ MnO       <dbl> 0.674, 0.698, 1.240, 0.978, 0.310, 1.170, 0.874, 0.214, 0.60…
#> $ Fe2O3     <dbl> 0.364, 0.336, 0.400, 0.433, 0.242, 0.650, 0.516, 0.278, 0.25…
#> $ BaO       <dbl> 0.040, 0.040, 0.046, 0.025, 0.022, 0.156, 0.014, 0.032, 0.02…
#> $ PbO       <dbl> 0.004, 0.198, 0.134, 0.120, 0.102, 0.136, 0.120, 0.062, 0.02…

Confidence Region

Classical and robust confidence ellipse

First, the confidence_ellipse function is used to compute coordinate points of the confidence ellipse and then the ellipse is plotted on a two-dimensional plot x and y of the data. Points that lie outside the ellipse are considered to be outliers, while points that lie within the ellipse are considered to be part of the underlying distribution with the specified confidence level conf_level.

ellipse_95 <- confidence_ellipse(glass, x = SiO2, y = Na2O, conf_level = 0.95)
rob_ellipse_95 <- confidence_ellipse(glass, x = SiO2, y = Na2O, conf_level = 0.95, robust = TRUE)
ellipse_95 %>% glimpse()
#> Rows: 361
#> Columns: 2
#> $ x <dbl> 74.45896, 74.35724, 74.25314, 74.14669, 74.03792, 73.92686, 73.81356…
#> $ y <dbl> 21.99964, 22.08244, 22.16235, 22.23932, 22.31335, 22.38440, 22.45245…
cutoff <- qchisq(0.95, df = 2)
 MDsquared <- glass %>%
  select(SiO2, Na2O) %>%
  as.matrix() %>%
  mahalanobis(colMeans(.), cov(.), inverted = FALSE)
plot1 <- 
  ggplot() +
  geom_path(data = ellipse_95, aes(x = x, y = y), color = "blue", linewidth = 1L) +
  geom_point(
    data = glass %>% 
      mutate(md = MDsquared) %>%
      filter(md <= cutoff),
    aes(x = SiO2, y = Na2O),
    shape = 21L, color = "black", fill = "lightblue", size = 3L) +
  geom_point(
    data = glass %>% 
      mutate(md = MDsquared) %>%
      filter(md > cutoff),
    aes(x = SiO2, y = Na2O),
    shape = 21L, color = "black", fill = "gold", size = 3L) +
  labs(
    x = "SiO2 (wt.%)", 
    y = "Na2O (wt.%)",
    title = "Classical confidence ellipse\nat 95% confidence level") +
  theme_bw() +
  theme(legend.position = "none")
x_mcd <- glass %>%
  select(SiO2, Na2O) %>%
  as.matrix() %>%
  robustbase::covMcd()

rob_MDsquared <- glass %>%
  select(SiO2, Na2O) %>%
  as.matrix() %>%
  mahalanobis(x_mcd$center, x_mcd$cov)
plot2 <- 
  ggplot() +
  geom_path(data = rob_ellipse_95, aes(x = x, y = y), color = "blue", linewidth = 1L) +
  geom_point(
    data = glass %>% 
      mutate(md = rob_MDsquared) %>%
      filter(md <= cutoff),
    aes(x = SiO2, y = Na2O),
    shape = 21L, color = "black", fill = "lightblue", size = 3L) +
  geom_point(
    data = glass %>% 
      mutate(md = rob_MDsquared) %>%
      filter(md > cutoff),
    aes(x = SiO2, y = Na2O),
    shape = 21L, color = "black", fill = "gold", size = 3L) +
  labs(
    x = "SiO2 (wt.%)", 
    y = "Na2O (wt.%)",
    title = "Robust confidence ellipse\nat 95% confidence level") +
  theme_bw() +
  theme(legend.position = "none")
plot1 | plot2

Grouping

For grouping bivariate data, the .group_by argument can be used if the data contains an unique grouping variable (.group_by = NULL by default). When a grouping variable is provided, the function will compute the ellipses separately for each level of the factor, allowing you to explore potential differences or patterns within subgroups of the data.

It’s important to note that the grouping variable should be appropriately coded as a factor before passing it to the .group_by argument. If the variable is currently stored as a character or numeric type, you may need to convert it to a factor using functions like as.factor() or forcats::as_factor().

rpca_scores <- glass %>%
  select(where(is.numeric) )%>% 
  pcaPP::PCAproj(method = "qn") %>%
  pluck("scores") %>%
  as_tibble() %>%
  mutate(glassType = glass %>% pull(glassType)) %>%
  rename(PC1 = Comp.1, PC2 = Comp.2) 
ellipse_pca <- rpca_scores %>% confidence_ellipse(x = PC1, y = PC2, .group_by = glassType)
ggplot() +
  geom_point(data = rpca_scores, aes(x = PC1, y = PC2, color = glassType, shape = glassType), size = 3L) +
  geom_path(data = ellipse_pca, aes(x = x, y = y, color = glassType), linewidth = 1L) +
  scale_color_brewer(palette = "Set1", direction = 1) +
  labs(x = "PC1", y = "PC2", title = "Principal component analysis") +
  theme_bw() +
  theme(legend.position = "none")

Ellipsoid

The confidence_ellipsoid function accepts an additional variable z and computes the ellipsoid for trivariate data.

ellipsoid_grp <- glass %>% confidence_ellipsoid(SiO2, Na2O, Fe2O3, glassType)
ellipsoid_grp %>% glimpse()
#> Rows: 10,000
#> Columns: 4
#> $ x         <dbl> 67.32486, 67.32486, 67.32486, 67.32486, 67.32486, 67.32486, …
#> $ y         <dbl> 14.51964, 14.51964, 14.51964, 14.51964, 14.51964, 14.51964, …
#> $ z         <dbl> 0.5971494, 0.5971494, 0.5971494, 0.5971494, 0.5971494, 0.597…
#> $ glassType <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
rgl::setupKnitr(autoprint = TRUE)
rgl::plot3d(
  x = ellipsoid_grp$x, 
  y = ellipsoid_grp$y, 
  z = ellipsoid_grp$z,
  xlab = "SiO2 (wt.%)", 
  ylab = "Na2O (wt.%)", 
  zlab = "Fe2O3 (wt.%)",
  type = "s", 
  radius = 0.03,
  col = as.numeric(ellipsoid_grp$glassType)
  )
rgl::points3d(
  x = glass$SiO2, 
  y = glass$Na2O, 
  z = glass$Fe2O3, 
  col = as.numeric(glass$glassType),
  size = 5
  )
rgl::view3d(theta = 260, phi = 30, fov = 60, zoom = .85)