Here my goal is to begin exploring some CGM (continuous glucose monitoring) data to get a better understanding of how to work with these types of data and what their potential is. This was inspired by Irina Gaynanova’s website, where her lab group compiled CGM datasets and calculated various statistics from them. They also created an R package and an associated Shiny app for exploring CGM data, which I may use in this exploration.

The Data

The data come from a repository in which Irina Gaynanova and her colleagues compiled freely available CGM datasets.

The specific datasets I will use below include Aleppo et al. (2017) and Hall et al. (2018).

Required disclaimer: The source of the data is Aleppo et al. (2017) and Hall et al. (2018), but the analyses, content, and conclusions presented herein are solely the responsibility of the authors and have not been reviewed or approved by Aleppo et al. (2017) or Hall et al. (2018).

Ideas and preliminary notes

Here are just some ideas of ways in which I could approach these data:

  • Basic visualizations of CGM readings by subject over time

  • Daily summaries of average fluctuations including variation and/or confidence ribbons

  • The R package iglu (stands for interpreting glucose?) allows the calculation of numerous metrics for blood glucose profiles, which may be more or less useful for analyzing and quantifying these profiles in various contexts.

  • For example, maybe these metrics can be used as features in some type of predictive model for diabetes.

  • These data might also be useful for predicting future glucose levels when implementing automatic insulin supply.

  • To avoid reinventing the wheel, there is a good reference from the study above on the models they used for predicting glucose levels into the near future (at the 15- and 60-minute marks). These included ARIMA, support vector regression, gradient-boosting trees, feed-forward neural networks, and recurrent neural networks.

  • There is also the Surveillance Error Grid, which assigns different levels of risk to predictions of blood glucose levels. For example, predicting a glucose level of 120 when the actual value is 500 is very risky compared to predicting 160.
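
As a rough sketch of the daily-summary and metrics ideas above, here is how an hourly profile and a few simple summary statistics might be computed in base R. Everything in it is an assumption for illustration: the data are simulated, the column names `id`, `time`, and `gl` are placeholders, and the 70 to 180 range is just a commonly used target range. It is not the iglu implementation of any metric.

```r
# Simulated CGM trace: one reading every 5 minutes for 3 days (made-up data).
set.seed(1)
n <- 3 * 24 * 12
cgm <- data.frame(
  id   = "subject_1",
  time = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") +
    seq(0, by = 300, length.out = n)
)

# Fractional hour of day, used to give the fake signal a daily cycle.
hour_frac <- as.numeric(format(cgm$time, "%H")) +
  as.numeric(format(cgm$time, "%M")) / 60
cgm$gl <- 110 + 25 * sin(2 * pi * (hour_frac - 6) / 24) + rnorm(n, sd = 10)

# A few simple per-subject summary metrics (hand-computed, hypothetical names).
metrics <- data.frame(
  id           = "subject_1",
  mean_gl      = mean(cgm$gl),
  sd_gl        = sd(cgm$gl),
  cv_pct       = 100 * sd(cgm$gl) / mean(cgm$gl),
  pct_in_range = 100 * mean(cgm$gl >= 70 & cgm$gl <= 180)
)

# Hourly profile across days: mean and SD of readings within each clock hour.
cgm$hour <- as.integer(format(cgm$time, "%H"))
hourly <- aggregate(gl ~ hour, data = cgm,
                    FUN = function(x) c(mean = mean(x), sd = sd(x)))
hourly <- do.call(data.frame, hourly)  # flatten to columns gl.mean and gl.sd

# Bounds for a mean +/- 1 SD ribbon.
hourly$lower <- hourly$gl.mean - hourly$gl.sd
hourly$upper <- hourly$gl.mean + hourly$gl.sd

print(metrics)
print(head(hourly))
```

The `hourly` table has one row per clock hour, so its `lower` and `upper` columns could be passed to something like `geom_ribbon()` in ggplot2 to draw the variation band around the mean line.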

Packages and Functions

Load the necessary packages and functions here:

library(tidyverse) # for magic
library(RSQLite) # for loading SQLite data
library(iglu) # for CGM metrics
library(factoextra) # clustering algorithms & visualization
library(ggforce) # add ellipses to pca plots
library(concaveman) # for adding hulls to pca plots
library(vegan) # for NMDS analysis
library(caret) # for cross-validation
library(ropls) # for PCA and PLS regression
library(chemhelper) # for use with ropls
library(ggrepel) # add labels to a plot that don't overlap
library(glue) # for formatting strings of text in figures
library(cowplot) # for plotting multiple plots together
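
Since several of these packages do not come with a default R installation, it may help to check which ones are available before calling `library()`. The following is only a minimal sketch: the package names are copied from the list above, and it reports what is missing rather than guessing where each package should be installed from.

```r
# Packages named in the library() calls above.
pkgs <- c("tidyverse", "RSQLite", "iglu", "factoextra", "ggforce",
          "concaveman", "vegan", "caret", "ropls", "chemhelper",
          "ggrepel", "glue", "cowplot")

# requireNamespace() returns TRUE/FALSE instead of raising an error.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```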

Upload data

Hall 2018 data

# Read the raw data in
raw_hall_data = read_tsv("raw_data/hall-data/hall-data-main.txt")
## Warning: One or more parsing issues, see `problems()` for details
# I get a warning because "low" was recorded for a few rows of readings,
# maybe because they were too low for the meter to register.

# What could these 'low' values actually be? For comparison, the smallest
# numeric readings in the data are:
##  [1] 40 40 40 41 41 41 41 41 42 42 42 42 42 42 43 43 43 43 43 43
hist(raw_hall_data$GlucoseValue, breaks=100)
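
One way to keep track of the "low" rows, rather than letting them silently turn into parsing problems, might be to read the glucose column as text first and convert it afterwards. The sketch below uses a tiny inline table in place of the real file. Only the `GlucoseValue` column name comes from the code above; the `subjectId` column, the rows, and the `is_low` and `GlucoseValue_num` names are made up for illustration.

```r
# Tiny stand-in for the raw file (made-up rows, hypothetical subjectId column).
raw_text <- "subjectId\tGlucoseValue\nA\t112\nA\tlow\nA\t98\nB\t41\nB\tlow\n"

# Read every column as character so "low" survives the import.
raw <- read.delim(text = raw_text, sep = "\t", colClasses = "character")

# Flag the non-numeric "low" entries explicitly.
raw$is_low <- tolower(raw$GlucoseValue) == "low"

# Convert only the numeric entries; "low" rows stay NA.
raw$GlucoseValue_num <- NA_real_
raw$GlucoseValue_num[!raw$is_low] <- as.numeric(raw$GlucoseValue[!raw$is_low])

# How many "low" rows are there, and what is the smallest numeric reading?
print(table(raw$is_low))
print(min(raw$GlucoseValue_num, na.rm = TRUE))
```

With readr, a similar effect could probably be had by passing `col_types = cols(GlucoseValue = col_character())` to `read_tsv()`. Whether the "low" rows should then be dropped, left as `NA`, or replaced by some floor value is a separate decision that depends on the analysis.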