Here my goal is to begin exploring some CGM (continuous glucose monitoring) data to get a better understanding of how to work with these types of data and what their potential is. This was inspired by Irina Gaynanova's website (https://irinagain.github.io/CGM/), where her lab group worked on compiling CGM datasets and calculating various statistics from them. In fact, they also created an R package and an associated Shiny app for exploring CGM data, which I may use in this exploration.
The data come from this repository (https://github.com/irinagain/Awesome-CGM), where Irina Gaynanova and her colleagues compiled freely available CGM datasets.
The specific datasets I will use below include Aleppo et al. (2017) (https://diabetesjournals.org/care/article/40/4/538/3687/REPLACE-BG-A-Randomized-Trial-Comparing-Continuous) and Hall et al. (2018) (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005143#pbio.2005143.s010).
Required disclaimer: The source of the data is Aleppo et al. (2017) and Hall et al. (2018), but the analyses, content and conclusions presented herein are solely the responsibility of the authors and have not been reviewed or approved by Aleppo et al. (2017) or Hall et al. (2018).
Here are just some ideas of ways in which I could approach these data:
Basic visualizations of CGM readings by subject over time
Daily summaries of average fluctuations including variation and/or confidence ribbons (a rough sketch of both of these ideas follows right after this list)
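For instance, here is a minimal sketch of what those first two ideas could look like, using the cleaned Hall data (clean_hall_data, with columns id, time, and gl) that gets created further below. This assumes the time column was parsed as a datetime:
library(tidyverse) # ggplot2 + dplyr
library(lubridate) # for pulling out the hour of day

few_ids <- unique(clean_hall_data$id)[1:4] # just a handful of subjects

# Idea 1: raw CGM traces by subject over time
clean_hall_data %>%
  filter(id %in% few_ids) %>%
  ggplot(aes(x = time, y = gl)) +
  geom_line(size = 0.3) +
  facet_wrap(~ id, scales = "free_x") +
  labs(x = NULL, y = "Glucose (mg/dL)")

# Idea 2: average daily profile per subject with a +/- 1 SD ribbon
clean_hall_data %>%
  filter(id %in% few_ids) %>%
  mutate(hour = hour(time)) %>%
  group_by(id, hour) %>%
  summarise(mean_gl = mean(gl, na.rm = TRUE),
            sd_gl = sd(gl, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = hour, y = mean_gl)) +
  geom_ribbon(aes(ymin = mean_gl - sd_gl, ymax = mean_gl + sd_gl), alpha = 0.2) +
  geom_line() +
  facet_wrap(~ id) +
  labs(x = "Hour of day", y = "Glucose (mg/dL)")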
The R package iglu (short for "interpreting glucose") allows the calculation of numerous metrics for blood glucose profiles, which may be more or less useful for helping us analyze and quantify these profiles in various contexts.
For example, maybe these metrics can be used as features in some type of predictive model for diabetes.
These metrics might also be useful for predicting future glucose levels when implementing automated insulin delivery (e.g. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0253125).
To not reinvent the wheel, here is a good reference from the study above describing the models they used to predict glucose levels into the near future (at the 15- and 60-minute marks) (https://doi.org/10.1371/journal.pone.0253125.s015). These included ARIMA, support vector regression, gradient-boosted trees, feed-forward neural networks, and recurrent neural networks.
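Just to make the ARIMA idea concrete, here is a minimal sketch (nothing like the tuned models in that paper) on a simulated 5-minute CGM-like series, where 15 and 60 minutes ahead correspond to 3 and 12 steps:
# Simulate ~24 h of fake 5-minute readings, fit an ARIMA, and forecast ahead.
# The AR coefficients and the arima() order are arbitrary, just for illustration.
set.seed(1)
gl <- 110 + arima.sim(model = list(ar = c(0.8, 0.15)), n = 288, sd = 4)
fit <- arima(gl, order = c(2, 1, 1))
fc <- predict(fit, n.ahead = 12) # 12 steps x 5 min = 60 minutes
fc$pred[c(3, 12)] # ~15-minute and ~60-minute-ahead forecasts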
There is also something called a Surveillance Error Grid, which assigns different levels of risk to predictions of blood glucose levels. For example, predicting a glucose level of 120 when the actual value is 500 is far riskier than predicting 160 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4764212/).
Load the necessary packages and functions here:
library(tidyverse) # for magic
library(RSQLite) # for loading SQLite data
library(iglu) # for CGM metrics
library(factoextra) # clustering algorithms & visualization
library(ggforce) # add ellipses to pca plots
library(concaveman) # for adding hulls to pca plots
library(vegan) # for NMDS analysis
library(caret) # for cross-validation
library(ropls) # for PCA and PLS regression (to install: https://rdrr.io/bioc/ropls/)
library(chemhelper) # for use with ropls (to install: https://rdrr.io/github/Aariq/chemhelper/f/README.md)
library(ggrepel) # add labels to a plot that don't overlap
library(glue) # for formatting strings of text in figures
library(cowplot) # for plotting multiple plots together
# Read the raw data in
= read_tsv("raw_data/hall-data/hall-data-main.txt") raw_hall_data
## Warning: One or more parsing issues, see `problems()` for details
# I get a warning because "low" was used for a few rows of readings,
# maybe because they were too low for the meter.
# what could these 'low' values actually be?
sort(raw_hall_data$GlucoseValue)[1:20]
## [1] 40 40 40 41 41 41 41 41 42 42 42 42 42 42 43 43 43 43 43 43
hist(raw_hall_data$GlucoseValue, breaks=100)
# Ok, so for now I will fill in the "low" values in glucose with 39,
# just below the lowest numeric reading (40) shown in the sort and histogram above.
clean_hall_data <- select(raw_hall_data, id = subjectId, time = DisplayTime, gl = GlucoseValue) %>%
  mutate(gl = ifelse(is.na(gl), 39, gl))
Other data that also need to be loaded:
# meal data:
= read_tsv("raw_data/hall-data/hall-meal-data.tsv")
raw_meal_metadata raw_meal_metadata
## # A tibble: 486 × 6
## userID meal glucotype nb_meals_glucotype mealType nb_meals_tot
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 2133-001 Bar 1 moderate 1 Bar 6
## 2 2133-001 Bar 1 low 0 Bar 6
## 3 2133-001 Bar 1 severe 0 Bar 6
## 4 2133-001 Bar 2 moderate 1 Bar 6
## 5 2133-001 Bar 2 low 0 Bar 6
## 6 2133-001 Bar 2 severe 0 Bar 6
## 7 2133-001 CF 1 moderate 1 CF 6
## 8 2133-001 CF 1 low 0 CF 6
## 9 2133-001 CF 1 severe 0 CF 6
## 10 2133-001 CF 2 moderate 0 CF 6
## # … with 476 more rows
# Need to use SQLite for loading the subject data
filename <- "raw_data/hall-data/hall-data-subjects.db"
sqlite.driver <- dbDriver("SQLite")
db <- dbConnect(sqlite.driver,
                dbname = filename)
dbListTables(db)
## [1] "clinical"
hall_subject_data <- as_tibble(dbReadTable(db, "clinical"))
# names(hall_subject_data)
# hall_subject_data$diagnosis
# Get just age, BMI, ID, Height, Weight, and the study's predicted diagnosis:
hall_subject_data_clean <- select(hall_subject_data, id = userID,
                                  age = Age, BMI, height = Height,
                                  weight = Weight, diagnosis)
# This calculates the percent above 140, 180, and 250:
above_percent(clean_hall_data)
## # A tibble: 57 × 4
## id above_140 above_180 above_250
## <chr> <dbl> <dbl> <dbl>
## 1 1636-69-001 11.5 2.55 0
## 2 1636-69-026 13.3 0.278 0
## 3 1636-69-028 16.4 1.54 0
## 4 1636-69-032 2.92 0.168 0
## 5 1636-69-035 17.6 1.97 0
## 6 1636-69-048 0.393 0 0
## 7 1636-69-053 4.61 0.107 0
## 8 1636-69-060 13.9 0.275 0
## 9 1636-69-064 14.4 1.20 0
## 10 1636-69-069 9.96 0.369 0
## # … with 47 more rows
Ooooh this is cool! Let’s generate some features based on these types of percentages:
above <- above_percent(clean_hall_data, targets_above = c(140, 180, 250))
below <- below_percent(clean_hall_data, targets_below = c(54, 70))
median_gl <- median_glu(clean_hall_data)

person_metrics <- left_join(above, below, by = "id") %>%
  left_join(median_gl, by = "id")
Now, I'm curious to see how well these very basic data about percentages above and below certain glucose thresholds (plus the median) can recover the diabetes diagnoses predicted by Hall et al. (2018).
First let's do a simple k-means cluster analysis and color the points by diagnosis to see if any pattern emerges. But before that, we have to remove (or impute) any missing values in our dataset:
complete.cases(person_metrics) # all there!
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Then standardize all variables for PCA and also create a distance matrix for later use:
person_metrics_std <- mutate(person_metrics, across(where(is.double), scale))
person_metrics_dist <- dist(person_metrics_std[,-1]) # default is euclidean distance
set.seed(123)
cluster1 <- kmeans(person_metrics_std[,-1], 3)
cluster1$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 1 3 1 1 1 3 1 1 2 3 1 3 1 1 1 1 3 1 1 2
## [39] 1 3 2 1 3 3 3 3 1 3 3 1 3 1 1 3 1 1 1
# Now create a simple PCA ordination to visualize
pca <- princomp(person_metrics_std[,-1])
pca_coords <- pca$scores[,1:2]
# hall_subject_data_clean$cluster <- cluster1$cluster
pca_data_fin <- bind_cols(hall_subject_data_clean,
                          as_tibble(pca_coords),
                          as_tibble(cluster1$cluster)) %>%
  rename(cluster1 = value) %>%
  # also make the clusters categoric:
  mutate(cluster1 = as.factor(cluster1))
Now to visualize these results:
ggplot(data = pca_data_fin) +
  geom_mark_hull(aes(x = Comp.1, y = Comp.2, color = cluster1,
                     fill = after_scale(alpha(color, 0.2))),
                 concavity = 0.2, size = 0.2, show.legend = FALSE) +
  geom_point(aes(x = Comp.1, y = Comp.2, color = diagnosis),
             size = 4) +
  ggsci::scale_colour_npg() +
  coord_equal() +
  theme_minimal() +
  theme(panel.border = element_rect(fill = "transparent"))
Ok, but the problem is that the data look like they are not normally distributed (one of the assumptions of PCA), so why not try the non-parametric NMDS ordination:
# use the previously calculated distance matrix for the nmds
set.seed(111)
nmds1 <- metaMDS(person_metrics_dist, k = 2) # k = number of reduced dimensions
## Run 0 stress 0.03971878
## Run 1 stress 0.03963769
## ... New best solution
## ... Procrustes: rmse 0.008219732 max resid 0.04475595
## Run 2 stress 0.03963927
## ... Procrustes: rmse 0.002671526 max resid 0.01452217
## Run 3 stress 0.06180934
## Run 4 stress 0.04204511
## Run 5 stress 0.03973114
## ... Procrustes: rmse 0.008713421 max resid 0.04764022
## Run 6 stress 0.03974215
## ... Procrustes: rmse 0.007865419 max resid 0.04281825
## Run 7 stress 0.06653507
## Run 8 stress 0.05070582
## Run 9 stress 0.04264866
## Run 10 stress 0.0400417
## ... Procrustes: rmse 0.01627036 max resid 0.0896567
## Run 11 stress 0.04011182
## ... Procrustes: rmse 0.01709139 max resid 0.09425783
## Run 12 stress 0.0502237
## Run 13 stress 0.03964217
## ... Procrustes: rmse 0.0007827899 max resid 0.004243243
## ... Similar to previous best
## Run 14 stress 0.03990895
## ... Procrustes: rmse 0.01500035 max resid 0.081659
## Run 15 stress 0.04979458
## Run 16 stress 0.04260557
## Run 17 stress 0.06321982
## Run 18 stress 0.06180829
## Run 19 stress 0.04276806
## Run 20 stress 0.04207285
## *** Solution reached
nmds_coords <- nmds1$points
nmds_data_fin <- bind_cols(hall_subject_data_clean,
                           as_tibble(nmds_coords),
                           as_tibble(cluster1$cluster)) %>%
  rename(cluster1 = value) %>%
  # also make the clusters categoric:
  mutate(cluster1 = as.factor(cluster1))
ggplot(data = nmds_data_fin) +
  geom_mark_hull(aes(x = MDS1, y = MDS2, color = cluster1,
                     fill = after_scale(alpha(color, 0.2))),
                 concavity = 0.2, size = 0.2, show.legend = FALSE) +
  geom_point(aes(x = MDS1, y = MDS2, color = diagnosis),
             size = 4) +
  ggsci::scale_colour_npg() +
  #coord_fixed(ratio=1) +
  theme_minimal() +
  theme(panel.border = element_rect(fill = "transparent"))
I guess not much difference from the PCA… Oh well! But what’s cool to see is that even with some REALLY basic metrics we can start to see a bit of association between the clusters and Hall et al. (2018) predictions. What happens when we add some other basic features such as BMI and age?
person_metrics_updated <- left_join(person_metrics, hall_subject_data_clean, by = "id")

# Now need to fill in NAs (can just use the mean of each column):
person_metrics_updated[!complete.cases(person_metrics_updated),]
## # A tibble: 10 × 12
## id above_140 above_180 above_250 below_54 below_70 median age BMI
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1636-69-0… 16.4 1.54 0 0 0.384 110 50 27.3
## 2 1636-69-1… 7.54 0.474 0 0 0.0527 109 53 38
## 3 2133-002 1.27 0 0 0 3.82 92 29 21.4
## 4 2133-003 5.15 0.443 0 0 0.886 97 36 24.6
## 5 2133-004 24.6 5.01 0 0 0.732 125 54 28.1
## 6 2133-008 0 0 0 0.222 4.43 89 31 19
## 7 2133-019 8.83 0.111 0 0.0555 1.44 105 64 29.9
## 8 2133-033 1.90 0 0 0.211 7.72 92 42 19.6
## 9 2133-035 4.43 0.273 0 0.0546 0.546 98 61 30.1
## 10 2133-041 5.49 0.215 0 0.215 3.98 111 51 27.3
## # … with 3 more variables: height <dbl>, weight <dbl>, diagnosis <chr>
# looks like it's just height and weight values
person_metrics_updated_all <- mutate(person_metrics_updated,
                                     height = ifelse(is.na(height),
                                                     mean(height, na.rm = TRUE),
                                                     height),
                                     weight = ifelse(is.na(weight),
                                                     mean(weight, na.rm = TRUE),
                                                     weight))
# Then standardize:
person_metrics_updated_std <- mutate(person_metrics_updated_all,
                                     across(where(is.double), scale))
# Remove the diagnosis and id columns:
person_metrics_upd <- select(person_metrics_updated_std, -diagnosis, -id)
# Then finally, calculate the clusters and PCA again:
set.seed(123)
cluster2 <- kmeans(person_metrics_std[,-1], 3)
cluster2$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 1 3 1 1 1 3 1 1 2 3 1 3 1 1 1 1 3 1 1 2
## [39] 1 3 2 1 3 3 3 3 1 3 3 1 3 1 1 3 1 1 1
# Now create a simple PCA ordination to visualize
pca <- princomp(person_metrics_std[,-1])
pca_coords <- pca$scores[,1:2]
# hall_subject_data_clean$cluster <- cluster1$cluster
pca_data_fin <- bind_cols(hall_subject_data_clean,
                          as_tibble(pca_coords),
                          as_tibble(cluster2$cluster)) %>%
  rename(cluster2 = value) %>%
  # also make the clusters categoric:
  mutate(cluster2 = as.factor(cluster2))
ggplot(data = pca_data_fin) +
  geom_mark_hull(aes(x = Comp.1, y = Comp.2, color = cluster2,
                     fill = after_scale(alpha(color, 0.2))),
                 concavity = 0.2, size = 0.2, show.legend = FALSE) +
  geom_point(aes(x = Comp.1, y = Comp.2, color = diagnosis),
             size = 4) +
  ggsci::scale_colour_npg() +
  coord_equal() +
  theme_minimal() +
  theme(panel.border = element_rect(fill = "transparent"))
Not all that different. Overall, it looks like cluster 1 is pretty much non-diabetic, cluster 2 is pre-diabetic or diabetic, and cluster 3 is a mix. I think next it is time to bring out the big guns: add in many more of Gaynanova's metrics and use those to actually train a model for predicting Hall's predicted diagnoses using k-NN, decision trees, etc.
Let's try logistic regression, k-NN, a decision tree, partial least squares-discriminant analysis (PLS-DA), and random forest together with more of Gaynanova's metrics and compare results. But first, since there are a lot of metrics, we should do a bit of feature selection and remove as much multicollinearity/redundancy as possible (though this isn't always necessary).
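As a side note, k-NN (one of the models listed above that I don't end up fitting below) would be straightforward with caret; here's a rough sketch, assuming the hall_full_mod table (the iglu metrics plus a two-class diagnosis column) that gets built further down. The x/y interface sidesteps the formula problems caused by metric names with spaces:
# Sketch only: repeated 6-fold cross-validation for a k-NN classifier on the CGM metrics
ctrl <- trainControl(method = "repeatedcv", number = 6, repeats = 10)

set.seed(123)
m_knn <- train(x = select(hall_full_mod, -id, -diagnosis),
               y = factor(hall_full_mod$diagnosis),
               method = "knn",
               preProcess = c("center", "scale"), # k-NN needs standardized features
               tuneGrid = data.frame(k = c(3, 5, 7, 9)),
               trControl = ctrl)
m_knn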
First calculate all of Gaynanova's metrics:
# # note that this also computes interpolation of missing time points in the data for calculating all the time-dependent metrics.
# hall_full <- all_metrics(clean_hall_data)
# # add the diagnoses too:
# hall_full_diag <- left_join(hall_full, select(hall_subject_data_clean, id, diagnosis))
# # export this so that I don't have to recalculate the metrics each time since it takes a little while:
# write.csv(hall_full_diag, "clean_data/hall_full.csv", row.names=FALSE)
<- read_csv("clean_data/hall_full.csv")
hall_full head(hall_full)
## # A tibble: 6 × 50
## id ADRR COGI CV eA1C GMI GRADE GRADE_eugly GRADE_hyper GRADE_hypo
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1636-6… 16.0 0.956 25.2 5.40 5.90 1.96 47.3 51.9 4.06
## 2 1636-6… 10.4 0.990 17.5 5.64 6.06 2.24 57.7 42.0 0.458
## 3 1636-6… 9.08 0.969 22.2 5.65 6.07 2.50 45.4 53.5 1.52
## 4 1636-6… 7.39 0.998 14.1 5.40 5.90 1.47 86.2 13.6 1.34
## 5 1636-6… 10.7 0.971 22.5 5.69 6.10 2.62 44.6 55.2 1.32
## 6 1636-6… 8.13 0.981 13.5 5.01 5.63 0.696 94.0 3.07 12.2
## # … with 40 more variables: HBGI <dbl>, LBGI <dbl>, hyper_index <dbl>,
## # hypo_index <dbl>, IGC <dbl>, IQR <dbl>, J_index <dbl>, M_value <dbl>,
## # MAD <dbl>, MAGE <dbl>, above_140 <dbl>, above_180 <dbl>, above_250 <dbl>,
## # below_54 <dbl>, below_70 <dbl>, in_range_63_140 <dbl>,
## # in_range_70_180 <dbl>, range <dbl>, SD <dbl>, Min. <dbl>, 1st Qu. <dbl>,
## # Median <dbl>, Mean <dbl>, 3rd Qu. <dbl>, Max. <dbl>, Conga <dbl>,
## # GVP <dbl>, MODD <dbl>, SD.Roc <dbl>, CV_Measures_Mean <dbl>, …
# Also normalize/scale the features:
# Need to add "as.numeric" because scale for some reason is not returning a pure numeric
<- mutate(hall_full, across(where(is.double), ~ as.numeric(scale(.x)))) hall_full_std
all_metrics_pca <- opls(dplyr::select(hall_full_std, -id, -diagnosis), predI = 10, fig.pdfC = "none")
## PCA
## 57 samples x 48 variables
## standard scaling of predictors
## R2X(cum) pre ort
## Total 0.969 10 0
pca_data <- as_tibble(all_metrics_pca@suppLs$xModelMN)
pca_scores <- get_scores(all_metrics_pca)

pca_cor_dat <-
  cor(pca_scores[,2:3], pca_data) %>%
  t() %>%
  as_tibble(rownames = "variable")
# pca_cor_dat

pca_loading_plot <-
  ggplot(pca_cor_dat) +
  geom_segment(aes(x = 0, y = 0, xend = p1, yend = p2),
               arrow = arrow(length = unit(0.15, "cm"))) +
  # gghighlight(p.adj < 0.05, use_direct_label = FALSE) +
  geom_label_repel(aes(x = p1, y = p2, label = variable),
                   segment.alpha = 0.6, direction = "y", size = 3, point.padding = 0.2,
                   min.segment.length = 0, force = 5, max.overlaps = 25) +
  theme_bw() +
  labs(x = "Correlation to PC1",
       y = "Correlation to PC2",
       title = "PCA correlation plot")
<- c("diabetic" = "#455d95", "non-diabetic" = "#eea587", "pre-diabetic" = "#8f9bbe")
mycolors <- get_plotdata(all_metrics_pca)
pca_score_dat <-
pca_score_plot ggplot(pca_score_dat$scores, aes(x = p1, y = p2, color = hall_full_std$diagnosis)) +
geom_point(size = 3) +
scale_color_manual("Hall Prediction:", values = mycolors) +
labs(x = glue("PC1 ({all_metrics_pca@modelDF$R2X[1]*100}%)"),
y = glue("PC2 ({all_metrics_pca@modelDF$R2X[2]*100}%)"),
title = "PCA score plot") +
theme_bw() +
theme(legend.position = "right")
plot_grid(pca_loading_plot,
+
pca_score_plot theme(legend.position = "bottom"),
ncol = 2, nrow = 1)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
You can see that there is a lot of potential correlation between the metrics (i.e., a lot of redundancy), but more than 75% of the variation is captured by just two axes. There’s a bit of a distinction in the three diagnosis predictions, but they still aren’t too strong. Next step is to do some actual supervised modeling.
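Before that, here's a quick way to put a number on that redundancy using caret's findCorrelation(), which flags metrics that are highly correlated (|r| > 0.9) with another metric and could be dropped. This is just exploratory; below I mostly lean on PCA/PLS to deal with the redundancy instead:
# How many of the 48 standardized metrics are nearly redundant?
metric_mat <- dplyr::select(hall_full_std, -id, -diagnosis)
high_cor <- findCorrelation(cor(metric_mat), cutoff = 0.9, names = TRUE)
high_cor # metrics that could be dropped
length(high_cor)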
# Try PLSDA with different numbers of predictive axes to see what
# is optimal and reduce overfitting
plsda_results <- tibble(NULL)
for(i in 1:5){
  all_metrics_plsda <-
    opls(
      x = dplyr::select(hall_full_std, -id, -diagnosis), # X data
      y = hall_full_std$diagnosis, # Y data
      fig.pdfC = "none", # suppresses default plotting
      predI = i, # set number of predictive axes
      permI = 200) # use 200 permutations to generate a p-value
  plsda_results <- bind_rows(plsda_results,
                             as_tibble(all_metrics_plsda@summaryDF))
}
plsda_results
## # A tibble: 5 × 8
## `R2X(cum)` `R2Y(cum)` `Q2(cum)` RMSEE pre ort pR2Y pQ2
## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.554 0.166 0.102 0.374 1 0 0.005 0.005
## 2 0.749 0.21 0.069 0.368 2 0 0.015 0.005
## 3 0.812 0.258 0.0421 0.364 3 0 0.015 0.01
## 4 0.838 0.328 -0.0484 0.346 4 0 0.005 0.045
## 5 0.876 0.356 -0.126 0.341 5 0 0.02 0.03
R2X(cum) is the proportion of variation in the predictors (the CGM metrics) explained by the predictive axes. R2Y(cum) is the proportion of variation in the diagnoses explained by the model. With one predictive component (first row), PLS-DA explains 55.4% of the total variation in the metrics but only 16.6% of the differences between diagnoses. Q2(cum) is the predictive power of the model estimated through cross-validation, and it can never be greater than R2Y(cum). In some ways this metric is the most important because we want to understand the predictive power of the model, which in this case looks to be around 10% (if I understand 'Q2(cum)' correctly). Finally, the columns pR2Y and pQ2 are permutation-generated p-values for the R2Y and Q2 values.
Thus, we can say that one predictive component yields the greatest predictive power (though only around 10%), but that this predictive power is statistically significant (p < 0.05). Yay! Ok, but 10% predictive power is really not great.
Load some necessary packages:
library(rpart) # decision trees
library(rpart.plot) # decision tree visualizations
Take a look at our categories (something I should have done way sooner!):
table(hall_full$diagnosis)
##
## diabetic non-diabetic pre-diabetic
## 5 38 14
Actually, since there are so few diabetic classifications (only 5!), maybe it’s way better to lump diabetic and pre-diabetic into one “potential-diabetic” grouping:
hall_full_mod <- mutate(hall_full, diagnosis = case_when(diagnosis == "diabetic" ~ "potential-diabetic",
                                                         diagnosis == "pre-diabetic" ~ "potential-diabetic",
                                                         TRUE ~ "non-diabetic"))
table(hall_full_mod$diagnosis)
##
## non-diabetic potential-diabetic
## 38 19
That’s better!
Next, split the data into training and test:
set.seed(123)
hall_train <- sample_frac(hall_full_mod, 0.7)
hall_test <- anti_join(hall_full_mod, hall_train, by = 'id')
#Make sure we have all classes present in each:
table(hall_test$diagnosis)
##
## non-diabetic potential-diabetic
## 9 8
table(hall_train$diagnosis)
##
## non-diabetic potential-diabetic
## 29 11
Now set up the decision tree model and test it with just a few variables:
#names(hall_train)
simp_tree <- rpart(diagnosis ~ Median + above_250 + above_180 + above_140 + below_54 + below_70, data = hall_train)
rpart.plot(simp_tree)
Oh wow, it chose to use just median! I wonder how accurate that is though?
printcp(simp_tree)
##
## Classification tree:
## rpart(formula = diagnosis ~ Median + above_250 + above_180 +
## above_140 + below_54 + below_70, data = hall_train)
##
## Variables actually used in tree construction:
## [1] Median
##
## Root node error: 11/40 = 0.275
##
## n= 40
##
## CP nsplit rel error xerror xstd
## 1 0.27273 0 1.00000 1.00000 0.25673
## 2 0.01000 1 0.72727 0.81818 0.24009
p <- predict(simp_tree, hall_test, type = 'class')
confusionMatrix(p, as.factor(hall_test$diagnosis), positive = 'potential-diabetic')
## Confusion Matrix and Statistics
##
## Reference
## Prediction non-diabetic potential-diabetic
## non-diabetic 8 3
## potential-diabetic 1 5
##
## Accuracy : 0.7647
## 95% CI : (0.501, 0.9319)
## No Information Rate : 0.5294
## P-Value [Acc > NIR] : 0.04207
##
## Kappa : 0.5211
##
## Mcnemar's Test P-Value : 0.61708
##
## Sensitivity : 0.6250
## Specificity : 0.8889
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.7273
## Prevalence : 0.4706
## Detection Rate : 0.2941
## Detection Prevalence : 0.3529
## Balanced Accuracy : 0.7569
##
## 'Positive' Class : potential-diabetic
##
The p-value here indicates that there is a "significant" predictive ability beyond the class imbalance in the data (for example, there are 9 non-diabetics in this test set vs. 8 potential-diabetics, so by always guessing the majority class you would already be right 52.9% of the time, the "no information rate"). The accuracy is 0.7647, with a 95% confidence interval between 0.501 and 0.9319. OK, so it's not GREAT, but still, for ONE feature?? Then there's the kappa value of 0.5211.
Kappa value interpretation (Landis & Koch 1977): < 0 no agreement; 0 to 0.20 slight; 0.21 to 0.40 fair; 0.41 to 0.60 moderate; 0.61 to 0.80 substantial; 0.81 to 1.0 almost perfect.
But see this… https://towardsdatascience.com/interpretation-of-kappa-values-2acd1ca7b18f
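To make that kappa value a little more concrete, here's the arithmetic behind it, reproduced by hand from the confusion matrix above (observed agreement vs. the agreement expected by chance given the row and column totals):
# Cohen's kappa by hand from the confusion matrix above (rows = predicted, cols = reference)
cm <- matrix(c(8, 3,
               1, 5), nrow = 2, byrow = TRUE)
n <- sum(cm) # 17 test subjects
p_observed <- sum(diag(cm)) / n # = the accuracy, 0.7647
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2 # chance agreement, ~0.51
(p_observed - p_chance) / (1 - p_chance) # kappa, ~0.52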
Ok, but it also might be overfitting the really tiny dataset that we have. What happens if we use a k-fold cross-validation technique instead? Let's try that:
# this creates random stratified folds to ensure a balance between the classes in each set (the diagnoses)
set.seed(123)
folds <- createFolds(hall_full_mod$diagnosis, k = 6)
str(folds)
## List of 6
## $ Fold1: int [1:9] 6 10 12 21 22 25 39 56 57
## $ Fold2: int [1:9] 2 5 7 17 20 26 44 48 54
## $ Fold3: int [1:10] 3 8 11 18 19 23 33 45 49 52
## $ Fold4: int [1:9] 13 15 24 28 40 41 42 47 55
## $ Fold5: int [1:10] 1 9 27 30 31 35 36 38 46 51
## $ Fold6: int [1:10] 4 14 16 29 32 34 37 43 50 53
cv_results <- lapply(folds, function(x) {
  hall_train <- hall_full_mod[-x,]
  hall_test <- hall_full_mod[x,]
  simp_tree <- rpart(diagnosis ~ Median + above_250 + above_180 + above_140 + below_54 + below_70, data = hall_train)
  diag_pred <- predict(simp_tree, hall_test, type = 'class')
  diag_actual <- as.factor(hall_test$diagnosis)
  result <- confusionMatrix(diag_pred, diag_actual, positive = 'potential-diabetic')
  return(list(result$overall["Kappa"], result$overall["Accuracy"]))
})
results_clean <- unlist(cv_results)

accuracy <- results_clean[grepl("Accuracy", names(results_clean))]
kappa <- results_clean[grepl("Kappa", names(results_clean))]
mean(accuracy)
## [1] 0.7185185
mean(kappa)
## [1] 0.3289562
Ok, so this doesn’t look quite as encouraging anymore. And this is probably from the fact that we have so little data and used very few features. What about using Random Forest so that we can add more features?
library(randomForest) # for random forests!
set.seed(123)
# Need to fix the data so that there are no spaces in the feature names for `randomForest()` to work:
names(hall_full_mod) <- str_replace(names(hall_full_mod), " ", "_")
hall_full_rf <- hall_full_mod
names(hall_full_rf) <- sub("^", "i", names(hall_full_mod)) # and add an "i" to the start (for now)
hall_full_rf <- select(hall_full_rf, -iid) # remove ID

partition <- createDataPartition(hall_full_rf$idiagnosis, p = .7, list = FALSE)
hall_train <- hall_full_rf[partition, -1]
hall_test <- hall_full_rf[-partition, -1]
set.seed(123)
hall_rf <- randomForest(as.factor(idiagnosis) ~ ., # try all features
                        mtry = 10, # select 10 random features each time
                        ntree = 500, # number of trees
                        data = hall_train)
p <- predict(hall_rf, hall_test, type = 'class')
confusionMatrix(p, as.factor(hall_test$idiagnosis), positive = 'potential-diabetic')
## Confusion Matrix and Statistics
##
## Reference
## Prediction non-diabetic potential-diabetic
## non-diabetic 9 2
## potential-diabetic 2 3
##
## Accuracy : 0.75
## 95% CI : (0.4762, 0.9273)
## No Information Rate : 0.6875
## P-Value [Acc > NIR] : 0.4069
##
## Kappa : 0.4182
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.6000
## Specificity : 0.8182
## Pos Pred Value : 0.6000
## Neg Pred Value : 0.8182
## Prevalence : 0.3125
## Detection Rate : 0.1875
## Detection Prevalence : 0.3125
## Balanced Accuracy : 0.7091
##
## 'Positive' Class : potential-diabetic
##
Accuracy of 75% and Kappa of 0.41. Can we tune this forest?
# set training options (6-fold cross-validation + 10 repeats)
ctrl <- trainControl(method = "repeatedcv", number = 6, repeats = 10)

# create a grid of how many features it tries at any one time:
grid_rf <- expand.grid(.mtry = c(2, 5, 10, 20, 35))

set.seed(123)
m_rf <- train(as.factor(idiagnosis) ~ ., data = hall_full_rf, method = "rf",
              metric = "Kappa", trControl = ctrl,
              tuneGrid = grid_rf)
m_rf
## Random Forest
##
## 57 samples
## 48 predictors
## 2 classes: 'non-diabetic', 'potential-diabetic'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 10 times)
## Summary of sample sizes: 48, 47, 47, 48, 48, 47, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6636869 0.2265621
## 5 0.6761279 0.2515812
## 10 0.6629461 0.2330395
## 20 0.6670539 0.2482402
## 35 0.6513973 0.2145016
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
Ok, so it looks like the best we can get is around an accuracy of 68% with a kappa of about 0.25 (at mtry = 5).
Though the model isn’t perfect, it does shed some light on which features might be more important:
varImpPlot(m_rf$finalModel)
But I’m still not pleased with this. There is a lot of collinearity in the data, so what if we used the top 6(?) PCA components as the features instead? And boost the tree?
all_metrics_pca <- opls(dplyr::select(hall_full_std, -id, -diagnosis), predI = 6, fig.pdfC = "none")
## PCA
## 57 samples x 48 variables
## standard scaling of predictors
## R2X(cum) pre ort
## Total 0.918 6 0
pca_scores <- get_scores(all_metrics_pca)
pca_scores$sample <- hall_full_mod$diagnosis
hall_pca_metrics <- rename(pca_scores, diagnosis = sample)
hall_pca_metrics
## # A tibble: 57 × 7
## diagnosis p1 p2 p3 p4 p5 p6
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 potential-diabetic -6.40 1.45 -1.13 -0.297 2.12 -0.0328
## 2 potential-diabetic -3.74 -2.37 -0.553 1.16 -1.50 -0.306
## 3 non-diabetic -4.98 -1.37 1.52 3.31 2.56 -2.03
## 4 potential-diabetic 1.12 -4.48 0.163 0.219 -0.560 -0.715
## 5 non-diabetic -6.70 -0.784 1.65 0.819 0.882 0.832
## 6 non-diabetic 4.35 -3.88 -0.354 -0.963 -0.775 0.438
## 7 non-diabetic 0.884 -1.49 -0.858 -1.37 0.320 -0.314
## 8 non-diabetic -2.59 -5.85 3.31 2.38 -2.31 -0.356
## 9 non-diabetic -6.21 -0.0706 -2.94 1.03 -2.14 -1.29
## 10 non-diabetic -2.25 -2.14 0.566 0.402 1.15 0.751
## # … with 47 more rows
So here are our new data! Now use 6-fold cross-validation as before:
set.seed(123)
folds <- createFolds(hall_pca_metrics$diagnosis, k = 6)

cv_PCA_results <- lapply(folds, function(x) {
  hall_train <- hall_pca_metrics[-x,]
  hall_test <- hall_pca_metrics[x,]
  simp_tree <- rpart(diagnosis ~ ., data = hall_pca_metrics)
  diag_pred <- predict(simp_tree, hall_test, type = 'class')
  diag_actual <- as.factor(hall_test$diagnosis)
  result <- confusionMatrix(diag_pred, diag_actual, positive = 'potential-diabetic')
  return(list(result$overall["Kappa"], result$overall["Accuracy"]))
})
results_clean <- unlist(cv_PCA_results)

accuracy <- results_clean[grepl("Accuracy", names(results_clean))]
kappa <- results_clean[grepl("Kappa", names(results_clean))]
mean(accuracy)
## [1] 0.8055556
mean(kappa)
## [1] 0.5239975
Wow! That’s pretty good for a simple decision tree! What about boosting it?
# Same as above but boosted and using the C5.0 method:
library(C50) # for the C5.0() boosted decision trees
cv_PCA_results <- lapply(folds, function(x) {
  hall_train <- hall_pca_metrics[-x,]
  hall_test <- hall_pca_metrics[x,]
  simp_tree <- C5.0(as.factor(diagnosis) ~ ., data = hall_pca_metrics, trials = 10)
  diag_pred <- predict(simp_tree, hall_test, type = 'class')
  diag_actual <- as.factor(hall_test$diagnosis)
  result <- confusionMatrix(diag_pred, diag_actual, positive = 'potential-diabetic')
  return(list(result$overall["Kappa"], result$overall["Accuracy"]))
})
results_clean <- unlist(cv_PCA_results)

accuracy <- results_clean[grepl("Accuracy", names(results_clean))]
kappa <- results_clean[grepl("Kappa", names(results_clean))]
mean(accuracy)
## [1] 0.8703704
mean(kappa)
## [1] 0.680303
AMAZING. That’s cool, we’ve gotten accuracy up to 87% and the kappa up to 0.68.
# Don't know why this gives me an error when I knit:
#C5imp(simp_tree)
Interestingly, when looking at the variable importance of each PCA component, the components were not ranked strictly in the order of how much overall variation in the data they explain.
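One way to get that importance table to show up in the knitted document, I think, would be to refit the boosted tree once outside the lapply() call so the model object actually exists in the global environment; a rough sketch:
# The cross-validated C5.0 models above only live inside the lapply() function
# environment (probably why C5imp() errored on knit). Refit once on all the
# PCA-score data just to inspect which components the boosted tree leans on:
full_c50 <- C5.0(as.factor(diagnosis) ~ ., data = hall_pca_metrics, trials = 10)
C5imp(full_c50, metric = "usage")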
This is only a temporary conclusion, but for now, the interesting finding is that Gaynanova's metrics can provide a simple way of characterizing Hall et al. (2018)'s glycemic signatures (glucotypes), with a fairly good accuracy of 87% and kappa of 0.68. The key was using a boosted decision tree after reducing dimensionality with a basic PCA ordination. The boosted tree that used the first 6 components of the PCA even outperformed a tuned random forest that was free to use any of the features.
There’s much more to be done, but I’ll end it here for now.
Broll S, Buchanan D, Chun E, Muschelli J, Fernandes N, Seo J, Shih J, Urbanek J, Schwenck J, Gaynanova I (2021). iglu: Interpreting Glucose Data from Continuous Glucose Monitors. R package version 3.1.0.
Martin M, Chun E, Buchanan D, Wang E, Senthil S, Gaynanova I (2020, June 15). irinagain/Awesome-CGM: List of public CGM datasets (Version v1.0.0). Zenodo.