Introduction

Here my goal is to begin exploring some CGM (continuous glucose monitoring) data to get a better understanding for how to work with these types of data and what their potential are. This was inspired by Irina Gaynanova’s website (https://irinagain.github.io/CGM/) where her lab group worked on compiling CGM datasets and calculating various statistics from these data. In fact, they also created an R package and associated shiny app for exploring CGM data, which I may use in this exploration here.

The Data

The data come from this repository: (https://github.com/irinagain/Awesome-CGM) where Itina Gaynanova and her colleagues compiled free and available CGM datasets.

The specific datasets I will use below includes Allepo et al. (2017) (https://diabetesjournals.org/care/article/40/4/538/3687/REPLACE-BG-A-Randomized-Trial-Comparing-Continuous) and Hall et al. (2018) (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005143#pbio.2005143.s010)

Required disclaimer: The source of the data is from Allepo et al. (2017) and Hall et al. (2018), but the analyses, content and conclusions presented herein are solely the responsibility of the authors and have not been reviewed or approved by Allepo et al. (2017) or Hall et al. (2018).

Ideas and preliminary notes

Here are just some ideas of ways in which I could approach these data:

  • Basic visualizations of CGM readings by subject over time

  • Daily summaries of average fluctuations including variation and/or confidence ribbons

  • The R Package iglu (stands for ierpreting glucose?) can allow the calculation of numerous metrics for blood glucose profiles which may be more or less useful for helping us analyze and quantify these profiles in various contexts.

  • For example, maybe these metrics can be used as features in some type of predictive model for diabetes.

  • Those data might also be useful for predicting future glucose levels when implementing automatic insulin supply (e.g. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0253125)

  • To not reinvent the wheel, here is a good reference from the study above about the models they used for predicting glucose levels into the near future (15 and 60 minute mark) (https://doi.org/10.1371/journal.pone.0253125.s015). These included ARIMA, Support Vector Regression, Gradient-boosting trees, Feed-forward neural networks, and recurrent neural networks.

  • There is also this thing called a Surveillence Error Grid which assigns different levels of risk to predictions of blood glucose levels. For example, predicting a glucose level of 120 but the actual value being 500 is very risky compared to predicting 160 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4764212/)

Packages and Functions

Load the necessary packages and functions here:

library(tidyverse) # for magic
library(RSQLite) # for loading SQLite data
library(iglu) # for CGM metrics
library(factoextra) # clustering algorithms & visualization
library(ggforce) # add ellipses to pca plots
library(concaveman) # for adding hulls to pca plots
library(vegan) # for NMDS analysis
library(caret) # for cross-validation
library(ropls) # for PCA and PLS regression (to install: https://rdrr.io/bioc/ropls/)
library(chemhelper) # for use with ropls (to install: https://rdrr.io/github/Aariq/chemhelper/f/README.md)
library(ggrepel) # add labels to a plot that don't overlap
library(glue) # for formatting strings of text in figures
library(cowplot) # for plotting multiple plots together

Upload data

Hall 2018 data

# Read the raw data in
raw_hall_data = read_tsv("raw_data/hall-data/hall-data-main.txt")
## Warning: One or more parsing issues, see `problems()` for details
# I get a warning because "low" was used for a few rows of readings,
# maybe because they were too low to for the meter.

# what could these 'low' values actually be?
sort(raw_hall_data$GlucoseValue)[1:20]
##  [1] 40 40 40 41 41 41 41 41 42 42 42 42 42 42 43 43 43 43 43 43
hist(raw_hall_data$GlucoseValue, breaks=100)

# Ok, so for now will fill in the "low" values in glucose with 39. I'm guessing values in the histogram didn't go much further below that.
clean_hall_data <- select(raw_hall_data, id = subjectId, time = DisplayTime, gl = GlucoseValue) %>% 
  mutate(gl = ifelse(is.na(gl), 39, gl))

Other data that also need to be uploaded:

# meal data:
raw_meal_metadata = read_tsv("raw_data/hall-data/hall-meal-data.tsv")
raw_meal_metadata
## # A tibble: 486 × 6
##    userID   meal  glucotype nb_meals_glucotype mealType nb_meals_tot
##    <chr>    <chr> <chr>                  <dbl> <chr>           <dbl>
##  1 2133-001 Bar 1 moderate                   1 Bar                 6
##  2 2133-001 Bar 1 low                        0 Bar                 6
##  3 2133-001 Bar 1 severe                     0 Bar                 6
##  4 2133-001 Bar 2 moderate                   1 Bar                 6
##  5 2133-001 Bar 2 low                        0 Bar                 6
##  6 2133-001 Bar 2 severe                     0 Bar                 6
##  7 2133-001 CF 1  moderate                   1 CF                  6
##  8 2133-001 CF 1  low                        0 CF                  6
##  9 2133-001 CF 1  severe                     0 CF                  6
## 10 2133-001 CF 2  moderate                   0 CF                  6
## # … with 476 more rows
# Need to use SQLite for loading the subject data
filename <- "raw_data/hall-data/hall-data-subjects.db"
sqlite.driver <- dbDriver("SQLite")
db <- dbConnect(sqlite.driver,
                dbname = filename)
dbListTables(db)
## [1] "clinical"
hall_subject_data <- as_tibble(dbReadTable(db,"clinical"))

# names(hall_subject_data)
# hall_subject_data$diagnosis
# Get just age, BMI, ID, Height, Weight, and the study's predicted diagnosis:

hall_subject_data_clean <- select(hall_subject_data, id = userID, 
                                  age = Age, BMI, height = Height,
                                  weight = Weight, diagnosis)

Analysis

Calculating Gaynanova’s metrics for this dataset:

# This calculates the percent above 140, 180, and 250:
above_percent(clean_hall_data)
## # A tibble: 57 × 4
##    id          above_140 above_180 above_250
##    <chr>           <dbl>     <dbl>     <dbl>
##  1 1636-69-001    11.5       2.55          0
##  2 1636-69-026    13.3       0.278         0
##  3 1636-69-028    16.4       1.54          0
##  4 1636-69-032     2.92      0.168         0
##  5 1636-69-035    17.6       1.97          0
##  6 1636-69-048     0.393     0             0
##  7 1636-69-053     4.61      0.107         0
##  8 1636-69-060    13.9       0.275         0
##  9 1636-69-064    14.4       1.20          0
## 10 1636-69-069     9.96      0.369         0
## # … with 47 more rows

Ooooh this is cool! Let’s generate some features based on these types of percentages:

above <- above_percent(clean_hall_data, targets_above = c(140, 180, 250))
below <- below_percent(clean_hall_data, targets_below = c(54, 70))
median_gl <- median_glu(clean_hall_data)

person_metrics <- left_join(above, below, by="id") %>% 
  left_join(median_gl, by="id")

Now, I’m curious to see how well these very basic data about percentages above and below certain glucose thresholds (and the median) can be used to predict the predicted diabetes diagnosis from Hall et al. 2018.

Simple K-Means

First let’s do a simple k-means cluster analysis and color the points based on diagnosis to see if any pattern emerges. But first have to remove (or impute the missing values in our dataset):

complete.cases(person_metrics) # all there!
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Then standardize all variables for PCA and also create a distance matrix for later use:

person_metrics_std <- mutate(person_metrics, across(where(is.double), scale))
person_metrics_dist <- dist(person_metrics_std[,-1]) #default is bray
set.seed(123)
cluster1 <- kmeans(person_metrics_std[,-1], 3)
cluster1$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 1 3 1 1 1 3 1 1 2 3 1 3 1 1 1 1 3 1 1 2
## [39] 1 3 2 1 3 3 3 3 1 3 3 1 3 1 1 3 1 1 1
# Now create a simple PCA ordination to visualize
pca <- princomp(person_metrics_std[,-1])
pca_coords <- pca$scores[,1:2]
# hall_subject_data_clean$cluster <- cluster1$cluster
pca_data_fin <- bind_cols(hall_subject_data_clean, 
                          as_tibble(pca_coords), 
                          as_tibble(cluster1$cluster)) %>% 
  rename(cluster1 = value) %>% 
  # also make the clusters categoric:
  mutate(cluster1 = as.factor(cluster1))

Now to visualize these results:

ggplot(data=pca_data_fin) + 
  geom_mark_hull(aes(x=Comp.1, y=Comp.2, color=cluster1, 
                     fill=after_scale(alpha(color, 0.2))), 
                 concavity=0.2, size=0.2, show.legend=FALSE) +
  geom_point(aes(x=Comp.1, y=Comp.2, color=diagnosis),
             size=4) +
  ggsci::scale_colour_npg() +
  coord_equal() +
  theme_minimal() +
  theme(panel.border = element_rect(fill= "transparent"))

Ok, but the problem is that the data look like the are not normally distributed (one of the assumptions of PCA), so why not try the non-parametric NMDS ordination:

# use the previously calculated distance matrix for the nmds
set.seed(111)
nmds1 = metaMDS(person_metrics_dist, k=2) # K = number of reduced dimensions
## Run 0 stress 0.03971878 
## Run 1 stress 0.03963769 
## ... New best solution
## ... Procrustes: rmse 0.008219732  max resid 0.04475595 
## Run 2 stress 0.03963927 
## ... Procrustes: rmse 0.002671526  max resid 0.01452217 
## Run 3 stress 0.06180934 
## Run 4 stress 0.04204511 
## Run 5 stress 0.03973114 
## ... Procrustes: rmse 0.008713421  max resid 0.04764022 
## Run 6 stress 0.03974215 
## ... Procrustes: rmse 0.007865419  max resid 0.04281825 
## Run 7 stress 0.06653507 
## Run 8 stress 0.05070582 
## Run 9 stress 0.04264866 
## Run 10 stress 0.0400417 
## ... Procrustes: rmse 0.01627036  max resid 0.0896567 
## Run 11 stress 0.04011182 
## ... Procrustes: rmse 0.01709139  max resid 0.09425783 
## Run 12 stress 0.0502237 
## Run 13 stress 0.03964217 
## ... Procrustes: rmse 0.0007827899  max resid 0.004243243 
## ... Similar to previous best
## Run 14 stress 0.03990895 
## ... Procrustes: rmse 0.01500035  max resid 0.081659 
## Run 15 stress 0.04979458 
## Run 16 stress 0.04260557 
## Run 17 stress 0.06321982 
## Run 18 stress 0.06180829 
## Run 19 stress 0.04276806 
## Run 20 stress 0.04207285 
## *** Solution reached
nmds_coords <- nmds1$points
nmds_data_fin <- bind_cols(hall_subject_data_clean, 
                           as_tibble(nmds_coords), 
                           as_tibble(cluster1$cluster)) %>% 
  rename(cluster1 = value) %>% 
  # also make the clusters categoric:
  mutate(cluster1 = as.factor(cluster1))
ggplot(data=nmds_data_fin) + 
  geom_mark_hull(aes(x=MDS1, y=MDS2, color=cluster1, 
                     fill=after_scale(alpha(color, 0.2))), 
                 concavity=0.2, size=0.2, show.legend=FALSE) +
  geom_point(aes(x=MDS1, y=MDS2, color=diagnosis),
             size=4) +
  ggsci::scale_colour_npg() +
  #coord_fixed(ratio=1) +
  theme_minimal() +
  theme(panel.border = element_rect(fill= "transparent"))

I guess not much difference from the PCA… Oh well! But what’s cool to see is that even with some REALLY basic metrics we can start to see a bit of association between the clusters and Hall et al. (2018) predictions. What happens when we add some other basic features such as BMI and age?

person_metrics_updated <- left_join(person_metrics, hall_subject_data_clean, by="id")

# Now need to fill in NAs (can just use the mean of each column):
person_metrics_updated[!complete.cases(person_metrics_updated),]
## # A tibble: 10 × 12
##    id         above_140 above_180 above_250 below_54 below_70 median   age   BMI
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>  <dbl> <dbl> <dbl>
##  1 1636-69-0…     16.4      1.54          0   0        0.384     110    50  27.3
##  2 1636-69-1…      7.54     0.474         0   0        0.0527    109    53  38  
##  3 2133-002        1.27     0             0   0        3.82       92    29  21.4
##  4 2133-003        5.15     0.443         0   0        0.886      97    36  24.6
##  5 2133-004       24.6      5.01          0   0        0.732     125    54  28.1
##  6 2133-008        0        0             0   0.222    4.43       89    31  19  
##  7 2133-019        8.83     0.111         0   0.0555   1.44      105    64  29.9
##  8 2133-033        1.90     0             0   0.211    7.72       92    42  19.6
##  9 2133-035        4.43     0.273         0   0.0546   0.546      98    61  30.1
## 10 2133-041        5.49     0.215         0   0.215    3.98      111    51  27.3
## # … with 3 more variables: height <dbl>, weight <dbl>, diagnosis <chr>
# looks like it's just height and weight values
person_metrics_updated_all <- mutate(person_metrics_updated, 
                                     height = ifelse(is.na(height), 
                                                     mean(height, na.rm=T),
                                                     height),
                                     weight = ifelse(is.na(weight), 
                                                     mean(weight, na.rm=T),
                                                     weight))
# Then standardize:
person_metrics_updated_std <- mutate(person_metrics_updated_all, 
                                     across(where(is.double), scale))
# Remove the diagnosis and id columns:
person_metrics_upd <- select(person_metrics_updated_std, -diagnosis, -id)

# Then finally, calculate the clusters and PCA again:
set.seed(123)
cluster2 <- kmeans(person_metrics_std[,-1], 3)
cluster2$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 1 3 1 1 1 3 1 1 2 3 1 3 1 1 1 1 3 1 1 2
## [39] 1 3 2 1 3 3 3 3 1 3 3 1 3 1 1 3 1 1 1
# Now create a simple PCA ordination to visualize
pca <- princomp(person_metrics_std[,-1])
pca_coords <- pca$scores[,1:2]
# hall_subject_data_clean$cluster <- cluster1$cluster
pca_data_fin <- bind_cols(hall_subject_data_clean, 
                          as_tibble(pca_coords), 
                          as_tibble(cluster2$cluster)) %>% 
  rename(cluster2 = value) %>% 
  # also make the clusters categoric:
  mutate(cluster2 = as.factor(cluster2))

ggplot(data=pca_data_fin) + 
  geom_mark_hull(aes(x=Comp.1, y=Comp.2, color=cluster2, 
                     fill=after_scale(alpha(color, 0.2))), 
                 concavity=0.2, size=0.2, show.legend=FALSE) +
  geom_point(aes(x=Comp.1, y=Comp.2, color=diagnosis),
             size=4) +
  ggsci::scale_colour_npg() +
  coord_equal() +
  theme_minimal() +
  theme(panel.border = element_rect(fill= "transparent"))

Not that much more different. Overall, it looks like cluster 1 is pretty much non-diabetic, cluster 2 is pre-diabetic or diabetic, and cluster 3 is a mix. I think next it is time to bring out the big guns and add in many more of Gaynanova’s metrics and use those to actually train a model for predicting Hall’s predicted diagnoses using KNN, Decision Trees, etc.

Predicting Hall diagnosis with supervised ML

Let’s try logitic regression, k-NN, a Decision Tree, Partial Least Squares-Discriminant Analysis, and Random Forest together with more of Gaynanova’s metrics and compare results. But first, since there are a lot of metrics, we should do a bit of feature selection and remove as much multicollinearity/redunancy as possible (though not always necessary to do this).

First calculate all of Gaynanaova’s metrics:

# # note that this also computes interpolation of missing time points in the data for calculating all the time-dependent metrics.
# hall_full <- all_metrics(clean_hall_data)
# # add the diagnoses too:
# hall_full_diag <- left_join(hall_full, select(hall_subject_data_clean, id, diagnosis))
# # export this so that I don't have to recalculate the metrics each time since it takes a little while:
# write.csv(hall_full_diag, "clean_data/hall_full.csv", row.names=FALSE)
hall_full <- read_csv("clean_data/hall_full.csv")
head(hall_full)
## # A tibble: 6 × 50
##   id       ADRR  COGI    CV  eA1C   GMI GRADE GRADE_eugly GRADE_hyper GRADE_hypo
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>       <dbl>      <dbl>
## 1 1636-6… 16.0  0.956  25.2  5.40  5.90 1.96         47.3       51.9       4.06 
## 2 1636-6… 10.4  0.990  17.5  5.64  6.06 2.24         57.7       42.0       0.458
## 3 1636-6…  9.08 0.969  22.2  5.65  6.07 2.50         45.4       53.5       1.52 
## 4 1636-6…  7.39 0.998  14.1  5.40  5.90 1.47         86.2       13.6       1.34 
## 5 1636-6… 10.7  0.971  22.5  5.69  6.10 2.62         44.6       55.2       1.32 
## 6 1636-6…  8.13 0.981  13.5  5.01  5.63 0.696        94.0        3.07     12.2  
## # … with 40 more variables: HBGI <dbl>, LBGI <dbl>, hyper_index <dbl>,
## #   hypo_index <dbl>, IGC <dbl>, IQR <dbl>, J_index <dbl>, M_value <dbl>,
## #   MAD <dbl>, MAGE <dbl>, above_140 <dbl>, above_180 <dbl>, above_250 <dbl>,
## #   below_54 <dbl>, below_70 <dbl>, in_range_63_140 <dbl>,
## #   in_range_70_180 <dbl>, range <dbl>, SD <dbl>, Min. <dbl>, 1st Qu. <dbl>,
## #   Median <dbl>, Mean <dbl>, 3rd Qu. <dbl>, Max. <dbl>, Conga <dbl>,
## #   GVP <dbl>, MODD <dbl>, SD.Roc <dbl>, CV_Measures_Mean <dbl>, …
# Also normalize/scale the features:
# Need to add "as.numeric" because scale for some reason is not returning a pure numeric
hall_full_std <- mutate(hall_full, across(where(is.double), ~ as.numeric(scale(.x))))

Preliminary PCA

all_metrics_pca <- opls(dplyr::select(hall_full_std, -id, -diagnosis), predI = 10, fig.pdfC="none") 
## PCA
## 57 samples x 48 variables
## standard scaling of predictors
##       R2X(cum) pre ort
## Total    0.969  10   0
pca_data <- as_tibble(all_metrics_pca@suppLs$xModelMN)
pca_scores <- get_scores(all_metrics_pca)

pca_cor_dat <-
  cor(pca_scores[,2:3], pca_data) %>%
    t() %>%
    as_tibble(rownames = "variable")

# pca.cor.dat
pca_loading_plot <-
  ggplot(pca_cor_dat) +
  geom_segment(aes(x = 0, y = 0, xend = p1, yend = p2),
               arrow = arrow(length = unit(0.15, "cm"))) +
  # gghighlight(p.adj < 0.05, use_direct_label = FALSE) +
  geom_label_repel(aes(x = p1, y = p2, label = variable),
                   segment.alpha = 0.6, direction = "y", size = 3, point.padding = 0.2,
                   min.segment.length = 0, force = 5, max.overlaps=25) +
  theme_bw() +
  labs(x = "Correlation to PC1",
       y = "Correlation to PC2",
       title = "PCA correlation plot")

mycolors <- c("diabetic" = "#455d95", "non-diabetic" = "#eea587", "pre-diabetic" = "#8f9bbe")
pca_score_dat <- get_plotdata(all_metrics_pca)
pca_score_plot <-
  ggplot(pca_score_dat$scores, aes(x = p1, y = p2, color = hall_full_std$diagnosis)) +
  geom_point(size = 3) +
  scale_color_manual("Hall Prediction:", values = mycolors) +
  labs(x = glue("PC1 ({all_metrics_pca@modelDF$R2X[1]*100}%)"),
       y = glue("PC2 ({all_metrics_pca@modelDF$R2X[2]*100}%)"),
       title = "PCA score plot") +
  theme_bw() +
  theme(legend.position = "right")

plot_grid(pca_loading_plot,
          pca_score_plot + 
            theme(legend.position = "bottom"),
          ncol = 2, nrow = 1)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
PCA score and loading plot.

PCA score and loading plot.

You can see that there is a lot of potential correlation between the metrics (i.e., a lot of redundancy), but more than 75% of the variation is captured by just two axes. There’s a bit of a distinction in the three diagnosis predictions, but they still aren’t too strong. Next step is to do some actual supervised modeling.

PLS-DA

# Try PLSDA with different numbers of predictive axes to see what 
# is optimal and reduce overfitting
plsda_results <- tibble(NULL)
for(i in 1:5){
  all_metrics_plsda <-
  opls(
    x = dplyr::select(hall_full_std, -id, -diagnosis), #X data
    y = hall_full_std$diagnosis, #Y data
    fig.pdfC = "none", #suppresses default plotting
    predI = i, # set number of predictive axes
    permI = 200) #use 200 permutations to generate a p-value
  
  plsda_results <- bind_rows(plsda_results, 
                             as_tibble(all_metrics_plsda@summaryDF))
}
plsda_results
## # A tibble: 5 × 8
##   `R2X(cum)` `R2Y(cum)` `Q2(cum)` RMSEE   pre   ort  pR2Y   pQ2
##        <dbl>      <dbl>     <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1      0.554      0.166    0.102  0.374     1     0 0.005 0.005
## 2      0.749      0.21     0.069  0.368     2     0 0.015 0.005
## 3      0.812      0.258    0.0421 0.364     3     0 0.015 0.01 
## 4      0.838      0.328   -0.0484 0.346     4     0 0.005 0.045
## 5      0.876      0.356   -0.126  0.341     5     0 0.02  0.03

R2X(cum) is the proportion of variation in the data explained by the predictive axes (the CGM metrics) (R^2). R2Y(cum) is the proportion of variation in the diagnoses explained by the model. With one predictive component (first row), PLS-DA explains 55.4% of total variation, but only explains 16.6% of the difference between diagnoses. Q2(cum) is the predictive power of the model calculated through cross-validation, but it can never be greater than R2Y(cum). In some ways this metric is the most important because we want to understand the predictive power of the model, which in this case looks to be around 10% (if I understand ‘Q2(cum)’ correctly). Finally, the columns pR2Y and pQ2 are permutation generated p-values based on R^2 or Q^2 values.

Thus, we can say that one predictive component yields the greatest predictive power (though only around 10%), but that this predictive power is statistically significant (p < 0.05). Yay! Ok, but 10% predictive power is really not great.

Decision Tree

Load some necessary packages:

library(rpart) # decision trees
library(rpart.plot) # decision tree visualizations

Take a look at our categories (something I should have done way sooner!):

table(hall_full$diagnosis)
## 
##     diabetic non-diabetic pre-diabetic 
##            5           38           14

Actually, since there are so few diabetic classifications (only 5!), maybe it’s way better to lump diabetic and pre-diabetic into one “potential-diabetic” grouping:

hall_full_mod <- mutate(hall_full, diagnosis = case_when(diagnosis == "diabetic" ~ "potential-diabetic",
                                                         diagnosis == "pre-diabetic" ~ "potential-diabetic",
                                                         TRUE ~ "non-diabetic"))
table(hall_full_mod$diagnosis)
## 
##       non-diabetic potential-diabetic 
##                 38                 19

That’s better!

Next, split the data into training and test:

set.seed(123)
hall_train <- sample_frac(hall_full_mod, 0.7)
hall_test  <- anti_join(hall_full_mod, hall_train, by = 'id')

#Make sure we have all classes present in each:
table(hall_test$diagnosis)
## 
##       non-diabetic potential-diabetic 
##                  9                  8
table(hall_train$diagnosis)
## 
##       non-diabetic potential-diabetic 
##                 29                 11

Now setup the decision tree model and test it with just a few variables:

#names(hall_train)
simp_tree <- rpart(diagnosis ~ Median + above_250 + above_180 + above_140 + below_54 + below_70, data = hall_train)
rpart.plot(simp_tree)

Oh wow, it chose to use just median! I wonder how accurate that is though?

printcp(simp_tree)
## 
## Classification tree:
## rpart(formula = diagnosis ~ Median + above_250 + above_180 + 
##     above_140 + below_54 + below_70, data = hall_train)
## 
## Variables actually used in tree construction:
## [1] Median
## 
## Root node error: 11/40 = 0.275
## 
## n= 40 
## 
##        CP nsplit rel error  xerror    xstd
## 1 0.27273      0   1.00000 1.00000 0.25673
## 2 0.01000      1   0.72727 0.81818 0.24009
p <- predict(simp_tree, hall_test, type = 'class')
confusionMatrix(p, as.factor(hall_test$diagnosis), positive='potential-diabetic')
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           non-diabetic potential-diabetic
##   non-diabetic                  8                  3
##   potential-diabetic            1                  5
##                                             
##                Accuracy : 0.7647            
##                  95% CI : (0.501, 0.9319)   
##     No Information Rate : 0.5294            
##     P-Value [Acc > NIR] : 0.04207           
##                                             
##                   Kappa : 0.5211            
##                                             
##  Mcnemar's Test P-Value : 0.61708           
##                                             
##             Sensitivity : 0.6250            
##             Specificity : 0.8889            
##          Pos Pred Value : 0.8333            
##          Neg Pred Value : 0.7273            
##              Prevalence : 0.4706            
##          Detection Rate : 0.2941            
##    Detection Prevalence : 0.3529            
##       Balanced Accuracy : 0.7569            
##                                             
##        'Positive' Class : potential-diabetic
## 

The p-value here indicates that there is a “significant” predictive ability beyond the percent of imbalance in the data (for example, there are 9 non-diabetics in theis test set vs. 8 potential-diabetics, so by chance alone you are more likely to predict non-diabetics because there are more than 50% of those, the “no information rate”). The accuracy is 0.7647, with a 95% confidence interval between 0.501 and 0.9319. OK, so it’s not GREAT, but still, for ONE feature?? Then there’s the kappa value of 0.5211

Kappa value interpretation Landis & Koch (1977): <0 No agreement 0 — .20 Slight .21 — .40 Fair .41 — .60 Moderate .61 — .80 Substantial .81–1.0 Perfect

But see this… https://towardsdatascience.com/interpretation-of-kappa-values-2acd1ca7b18f

Ok, but it also might be overfitting a really tiny dataset that we have. What happens if we use a k-fold cross-validation technique on this instead? Let’s try that:

# this creates random stratified folds to ensure a balance between the classes in each set (the diagnoses)
set.seed(123)
folds <- createFolds(hall_full_mod$diagnosis, k = 6)
str(folds)
## List of 6
##  $ Fold1: int [1:9] 6 10 12 21 22 25 39 56 57
##  $ Fold2: int [1:9] 2 5 7 17 20 26 44 48 54
##  $ Fold3: int [1:10] 3 8 11 18 19 23 33 45 49 52
##  $ Fold4: int [1:9] 13 15 24 28 40 41 42 47 55
##  $ Fold5: int [1:10] 1 9 27 30 31 35 36 38 46 51
##  $ Fold6: int [1:10] 4 14 16 29 32 34 37 43 50 53
cv_results <- lapply(folds, function(x) {
  hall_train <- hall_full_mod[-x,]
  hall_test <- hall_full_mod[x,]
  simp_tree <- rpart(diagnosis ~ Median + above_250 + above_180 + above_140 + below_54 + below_70, data = hall_train)
  diag_pred <- predict(simp_tree, hall_test, type = 'class')
  diag_actual <- as.factor(hall_test$diagnosis)
  result <- confusionMatrix(diag_pred, diag_actual, positive='potential-diabetic')
  return(list(result$overall["Kappa"], result$overall["Accuracy"]))
})
results_clean <- unlist(cv_results)

accuracy <- results_clean[grepl("Accuracy", names(results_clean))]
kappa <- results_clean[grepl("Kappa", names(results_clean))]

mean(accuracy)
## [1] 0.7185185
mean(kappa)
## [1] 0.3289562

Ok, so this doesn’t look quite as encouraging anymore. And this is probably from the fact that we have so little data and used very few features. What about using Random Forest so that we can add more features?

Random Forest

library(randomForest) # for random forests!
set.seed(123)
# Need to fix the data so that there are no spaces in the feature names for `randomForest()` to work:
names(hall_full_mod) <- str_replace(names(hall_full_mod), " ", "_")
hall_full_rf <- hall_full_mod
names(hall_full_rf) <- sub("^","i",names(hall_full_mod)) # and add a i to the start (for now)
hall_full_rf <- select(hall_full_rf, -iid) # remove ID

partition <- createDataPartition(hall_full_rf$idiagnosis, p = .7, list=F)
hall_train <- hall_full_rf[partition,-1]
hall_test <- hall_full_rf[-partition,-1]
set.seed(123)
hall_rf <- randomForest(as.factor(idiagnosis) ~ ., # try all features
                        mtry = 10, # select 10 random features each time
                        ntree = 500, # number of trees
                        data = hall_train)
p <- predict(hall_rf, hall_test, type = 'class')
confusionMatrix(p, as.factor(hall_test$idiagnosis), positive='potential-diabetic')
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           non-diabetic potential-diabetic
##   non-diabetic                  9                  2
##   potential-diabetic            2                  3
##                                             
##                Accuracy : 0.75              
##                  95% CI : (0.4762, 0.9273)  
##     No Information Rate : 0.6875            
##     P-Value [Acc > NIR] : 0.4069            
##                                             
##                   Kappa : 0.4182            
##                                             
##  Mcnemar's Test P-Value : 1.0000            
##                                             
##             Sensitivity : 0.6000            
##             Specificity : 0.8182            
##          Pos Pred Value : 0.6000            
##          Neg Pred Value : 0.8182            
##              Prevalence : 0.3125            
##          Detection Rate : 0.1875            
##    Detection Prevalence : 0.3125            
##       Balanced Accuracy : 0.7091            
##                                             
##        'Positive' Class : potential-diabetic
## 

Accuracy of 75% and Kappa of 0.41. Can we tune this forest?

# set training options (6-fold cross val + 10 repeats)
ctrl <- trainControl(method = "repeatedcv", number = 6, repeats = 10)

# create a grid of how many features it tries at any one time:
grid_rf <- expand.grid(.mtry = c(2, 5, 10, 20, 35))

set.seed(123)
m_rf <- train(as.factor(idiagnosis) ~ ., data = hall_full_rf, method = "rf",
              metric = "Kappa", trControl = ctrl,
              tuneGrid = grid_rf)
m_rf
## Random Forest 
## 
## 57 samples
## 48 predictors
##  2 classes: 'non-diabetic', 'potential-diabetic' 
## 
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 10 times) 
## Summary of sample sizes: 48, 47, 47, 48, 48, 47, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.6636869  0.2265621
##    5    0.6761279  0.2515812
##   10    0.6629461  0.2330395
##   20    0.6670539  0.2482402
##   35    0.6513973  0.2145016
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

Ok, so it looks like the best we can get is around an accuracy of 64% with a kappa of .17

Though the model isn’t perfect, it does shed some light on which features might be more important:

varImpPlot(m_rf$finalModel)

But I’m still not pleased with this. There is a lot of collinearity in the data, so what if we used the top 6(?) PCA components as the features instead? And boost the tree?

PCA Decision Tree (boosted?)

all_metrics_pca <- opls(dplyr::select(hall_full_std, -id, -diagnosis), predI = 6, fig.pdfC="none") 
## PCA
## 57 samples x 48 variables
## standard scaling of predictors
##       R2X(cum) pre ort
## Total    0.918   6   0
pca_scores <- get_scores(all_metrics_pca)
pca_scores$sample <- hall_full_mod$diagnosis
hall_pca_metrics <- rename(pca_scores, diagnosis = sample)
hall_pca_metrics
## # A tibble: 57 × 7
##    diagnosis              p1      p2     p3     p4     p5      p6
##    <chr>               <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1 potential-diabetic -6.40   1.45   -1.13  -0.297  2.12  -0.0328
##  2 potential-diabetic -3.74  -2.37   -0.553  1.16  -1.50  -0.306 
##  3 non-diabetic       -4.98  -1.37    1.52   3.31   2.56  -2.03  
##  4 potential-diabetic  1.12  -4.48    0.163  0.219 -0.560 -0.715 
##  5 non-diabetic       -6.70  -0.784   1.65   0.819  0.882  0.832 
##  6 non-diabetic        4.35  -3.88   -0.354 -0.963 -0.775  0.438 
##  7 non-diabetic        0.884 -1.49   -0.858 -1.37   0.320 -0.314 
##  8 non-diabetic       -2.59  -5.85    3.31   2.38  -2.31  -0.356 
##  9 non-diabetic       -6.21  -0.0706 -2.94   1.03  -2.14  -1.29  
## 10 non-diabetic       -2.25  -2.14    0.566  0.402  1.15   0.751 
## # … with 47 more rows

So here are our new data! Now uses a 6-fold cross-validation as before:

set.seed(123)
folds <- createFolds(hall_pca_metrics$diagnosis, k = 6)

cv_PCA_results <- lapply(folds, function(x) {
  hall_train <- hall_pca_metrics[-x,]
  hall_test <- hall_pca_metrics[x,]
  simp_tree <- rpart(diagnosis ~ ., data = hall_pca_metrics)
  diag_pred <- predict(simp_tree, hall_test, type = 'class')
  diag_actual <- as.factor(hall_test$diagnosis)
  result <- confusionMatrix(diag_pred, diag_actual, positive='potential-diabetic')
  return(list(result$overall["Kappa"], result$overall["Accuracy"]))
})
results_clean <- unlist(cv_PCA_results)

accuracy <- results_clean[grepl("Accuracy", names(results_clean))]
kappa <- results_clean[grepl("Kappa", names(results_clean))]

mean(accuracy)
## [1] 0.8055556
mean(kappa)
## [1] 0.5239975

Wow! That’s pretty good for a simple decision tree! What about boosting it?

# Same as above but boosted and using C5.0 method:
cv_PCA_results <- lapply(folds, function(x) {
  hall_train <- hall_pca_metrics[-x,]
  hall_test <- hall_pca_metrics[x,]
  simp_tree <- C5.0(as.factor(diagnosis) ~ ., data = hall_pca_metrics, trials = 10)
  diag_pred <- predict(simp_tree, hall_test, type = 'class')
  diag_actual <- as.factor(hall_test$diagnosis)
  result <- confusionMatrix(diag_pred, diag_actual, positive='potential-diabetic')
  return(list(result$overall["Kappa"], result$overall["Accuracy"]))
})
results_clean <- unlist(cv_PCA_results)

accuracy <- results_clean[grepl("Accuracy", names(results_clean))]
kappa <- results_clean[grepl("Kappa", names(results_clean))]

mean(accuracy)
## [1] 0.8703704
mean(kappa)
## [1] 0.680303

AMAZING. That’s cool, we’ve gotten accuracy up to 87% and the kappa up to 0.68.

# Don't know why this gives me an error when I knit:
#C5imp(simp_tree)

Interestingly, when looking at the variable importance of each PCA component, it seemed like the components were not totally in order of their importance of explaining overall variation in the data.

Conclusion

This is only a temporary conclusion, but for now, the interesting finding is that Gaynanova’s metrics can provide a simple way of characterizing Hall (2018)’s glycemic signatures (glucotypes) with a fairly good accuracy of 87% and kappa of 0.68. The key was using a boosted decision tree after reducing dimensionality with a basic PCA ordination. The boosted tree that used the first 6 components of the PCA was even more superior than using a tuned randomForest that could use any features it wanted.

There’s much more to be done, but I’ll end it here for now.

References

Broll S, Buchanan D, Chun E, Muschelli J, Fernandes N, Seo J, Shih J, Urbanek J, Schwenck J, Gaynanova I (2021). iglu: Interpreting Glucose Data from Continuous Glucose Monitors. R package version 3.0.0. R package version 3.1.0

Mary Martin, Elizabeth Chun, David Buchanan, Eric Wang, Sangaman Senthil & Irina Gaynanova. (2020, June 15). irinagain/Awesome-CGM: List of public CGM datasets (Version v1.0.0). Zenodo.