About this analysis

This analysis is one of a series of posts that showcases my approach to exploring data and drawing insights from it. Here I explore the relationship season or month and the availability of Airbnb listings. Enjoy!


What is the relationship between season and availability?


First, load required packages:

library(tidyverse)
library(lubridate) # for working with dates

Upload the airbnb calendar data:

calendar <- read_csv("data/calendar.csv")
calendar
## # A tibble: 1,393,570 × 4
##    listing_id date       available price 
##         <dbl> <date>     <lgl>     <chr> 
##  1     241032 2016-01-04 TRUE      $85.00
##  2     241032 2016-01-05 TRUE      $85.00
##  3     241032 2016-01-06 FALSE     <NA>  
##  4     241032 2016-01-07 FALSE     <NA>  
##  5     241032 2016-01-08 FALSE     <NA>  
##  6     241032 2016-01-09 FALSE     <NA>  
##  7     241032 2016-01-10 FALSE     <NA>  
##  8     241032 2016-01-11 FALSE     <NA>  
##  9     241032 2016-01-12 FALSE     <NA>  
## 10     241032 2016-01-13 TRUE      $85.00
## # ℹ 1,393,560 more rows

For this visualization, my initial thought was to make a time-series graphic with a confidence ribbon that shows how seasonality is related to availability. In the process I realized the data needed a lot of detrending to be able to extract at least a reasonable estimate of seasonal availability, so I generalized the figure to availability per month.

First, I want to summarize the calendar data by aggregating the proportion of availability by calendar date and seeing a quick visualization of that: how many unique dates?

length(unique(calendar$date))
## [1] 365

Exactly 365, nice!

Now, group by date and summarize availability as the average of Trues and Falses. In other words, the proportion of listings that are available on any given date:

avail_summ <- group_by(calendar, date) %>%
  summarize(prop_avail = mean(available)) %>%
  ungroup()

avail_summ
## # A tibble: 365 × 2
##    date       prop_avail
##    <date>          <dbl>
##  1 2016-01-04      0.454
##  2 2016-01-05      0.489
##  3 2016-01-06      0.478
##  4 2016-01-07      0.465
##  5 2016-01-08      0.467
##  6 2016-01-09      0.486
##  7 2016-01-10      0.526
##  8 2016-01-11      0.545
##  9 2016-01-12      0.554
## 10 2016-01-13      0.549
## # ℹ 355 more rows

Nice! Now plot a simple line plot of prop_avail ~ date:

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = prop_avail)) +
  theme_grey()

Gosh! That looks horrible. But I think there are a few things going on here. First, the general trend of going from low availability to high availability is probably just the fact that fewer rentals are booked far in advance, so by chance alone there will be higher availability the further away from January you get (since these data were all scraped at the beginning of January 2016). The other thing that’s going on is that there are three sudden drops, one at 90 days, one at 180 days, and one at 365 days. These drops might have to do with the window that hosts open up their listings for. And I think the defaults are 90 days, 180 days, and 365 days. So many listings are just automatically blocked off after 90, 180, etc, which might cause those sudden drops, but I’m not totally sure. Either way, we have to get rid of them to get at the seasonality signal.

First, remove those steps. Add an indicator of those steps by completely detrending the data using diff():

avail_summ <- mutate(avail_summ, prop_avail_detrend = c(NA, diff(prop_avail)))

Looking at those data we can see those three big spikes going down:

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = prop_avail_detrend)) +
  theme_grey()
## Warning: Removed 1 row containing missing values (`geom_line()`).

The algorithm for removing those spikes from the data is as follows: Wherever a value goes below a certain threshold in the completely detrended data—destep_indicator (i.e., those huge down spikes), subtract that value from every following value in the dataset:

# Extract the spike indicator:
destep_indicator <- avail_summ$prop_avail_detrend

# Get the spots where those drops happen (90, 180, and 365 days)
# I just arbitrarily set the threshold to - 0.025 to capture those three
destep_spots <- which(destep_indicator < -0.025)
# Extract how far each step is:
destep_size <- destep_indicator[destep_indicator < -0.025][-1]
# Then create a blank vector of zeros,
destep_vec <- rep(0, 365)
# and then loop through the three spikes, adding the value to all the points
# that follow:
for(i in seq_along(destep_spots)){
  destep_vec[destep_spots[i]:365] <- destep_vec[destep_spots[i]:365] + destep_size[i]
}

# Finally, subtract that vector from the original data to remove the steps:
avail_summ <- mutate(avail_summ, prop_avail_destep = prop_avail - destep_vec)

Cool! So now let’s plot the ‘de-stepped’ availability data:

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = prop_avail_destep)) +
  theme_grey()

That’s looking much better!

Next step is to remove the trend that people trend that people tend to book fewer listings far in advance and more listings close to present (hence the generally increasing availability over time).

Ideally, we would have data on how far in advance each booking was made so that we could build a model that precisely fit those data. With that we would be able to remove exactly that trend from the availability data and be left with only the seasonality trends (plus other random error). But I don’t have those data, so the next best thing is to estimate that curve myself. It’s tricky because over-fitting would lose the seasonality curve, but under-fitting would not detrend the data as well. I chose to use a General Additive Model for fitting the curve since I have no mechanistic reason to choose any specific type of model, and GAMs do well at fitting unknown relationships.

I ended up settling on a GAM model with a k of 6 and smoothing factor of 0.8. This is a key assumption to my final figure, but not having a more objective way to pick the model, I’ll go for this one:

library(mgcv) # load package for using GAMs
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
mod_1 <- gam(prop_avail_destep ~  s(seq_along(date), k=6), data=avail_summ, sp=0.8)

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = predict(mod_1)), color=2, size=1) +
  geom_line(aes(y = prop_avail_destep), size=1, color="grey20") +
  theme_grey()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The red line shows the GAM model prediction line which seems to fit fairly well without dipping into the potentially seasonal variation.

Now, I can use the residuals from the model to finally see if there are any seasonality trends but also summarize by month since day or week might be too precise for the broad assumptions I’ve been making about how to detrend the dataset:

avail_summ <- mutate(avail_summ, prop_avail_detrended = mod_1$residuals,
                     day = yday(date),
                     month = month(date,label = T))

ggplot(avail_summ, aes(x = month)) +
  geom_point(aes(y = prop_avail_detrended), size=1, color="grey20") +
  theme_grey()

Seems like some seasonal trends are showing up! January is still a bit off probably because of how the original model fit, so perhaps best to remove those values and interpolate between February and December.

First extract the current values for December, January (for the length), and February, and then apply a custom interpolation function:

month1 <- avail_summ$prop_avail_detrended[avail_summ$month == "Dec"]
month2_int <- avail_summ$prop_avail_detrended[avail_summ$month == "Jan"]
month3 <- avail_summ$prop_avail_detrended[avail_summ$month == "Feb"]

# Custom function to interpolate between two months:
month_interp <- function(month1, month2_int, month3){
  m1_mean <- mean(month1)
  m3_mean <- mean(month3)
  m1_sd <- sd(month1)
  m3_sd <- sd(month3)
  len <- length(month2_int)
  set.seed(123)
  return(rnorm(n=len, mean=mean(c(m1_mean, m3_mean)), sd=mean(c(m1_sd, m3_sd))))
}

# Then apply the interpolation:
avail_summ_detrended <- select(avail_summ, date, prop_avail_detrended, day, month) %>%
  mutate(prop_avail_detrended = ifelse(month=="Jan", 
                                       month_interp(month1, month2_int, month3),
                                       prop_avail_detrended))

Now see how it looks:

ggplot(avail_summ_detrended, aes(x = month)) +
  geom_point(aes(y = prop_avail_detrended), size=1, color="grey20") +
  theme_grey()

Not bad! Now for making the actual visualization:

# Calculate summary stats for plotting the relative availability
# for each month:

# First a function to bootstrap confidence intervals:
conf_int <- function(x = avail_summ_detrended$prop_avail_detrended){
  samp_vec <- NA
  for(i in 1:2000){
    samp_vec[i] = median(sample(x, replace=T))
  }
  return(quantile(samp_vec, c(0.025,0.975)))
}

# Calculate the confidence intervals and median values:
avail_summ_summ <- group_by(avail_summ_detrended, month) %>%
  summarize(med_avail = median(prop_avail_detrended),
            lower = conf_int(prop_avail_detrended)[1],
            upper = conf_int(prop_avail_detrended)[2]) %>%
  ungroup() %>%
  # Make the bar color blue or red depending on whether
  # the median is below or above average availability:
  mutate(dir_col = ifelse(med_avail <=0, "#e34d3f", "00a398"))

ggplot(avail_summ_detrended, aes(x = month)) +
  geom_hline(aes(yintercept = 0)) +
  geom_col(data = avail_summ_summ, aes(y = med_avail, fill=dir_col)) +
  geom_errorbar(data = avail_summ_summ, aes(ymax = upper, ymin = lower),
                width = 0.3, color = "grey20") +
  # Calling it "relative availability index" because I don't trust the 
  # absolute value to be a real indicator of availability since I 
  # had to do a lot of transforming to get to here and these
  # are not real availabilities, they are only based on what 
  # was booked at the time of data collection.
  labs(x = "Month", y = "Relative Availability\nIndex (± 95% CI)",
       title="AirBnB Seasonal Availability (2016)*",
       caption="* These data are based on detrended availability of Seattle, WA listings from the start of 2016.
Values for January were interpolated from February and December.") +
  theme_gray() +
  theme(plot.margin = unit(c(.7,.7,.7,.7), "cm"),
        panel.grid = element_blank(),
        # panel.grid.major.x = element_line(color="grey90"),
        panel.grid.major.y = element_line(color="grey90"),
        panel.grid.minor.y = element_line(color="grey90"),
        panel.background = element_rect(fill=NA),
        legend.position = "none",
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        axis.ticks.x = element_blank(),
        axis.title.x = element_text(size = 13, margin=ggplot2::margin(t=0.5, unit="cm")),
        axis.title.y = element_text(size = 13, margin=ggplot2::margin(r=0.5, unit="cm")),
        plot.title = element_text(size = 16),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0, size=10,  margin=margin(t=10), color = "grey40", face="italic"))

Conclusion

The big assumption behind this visualization was how I removed the trend based around the fact that people are probably less likely to book far in advance than sooner. And because I didn’t have historic availability data, I had to infer future availability. The result is a relative availability index.

Future work should look at using historic data on how long in advance bookings are made, and with that I could build a better model with which to remove the trend in the data and present a more accurate visualization of seasonality trends.