About this analysis

This analysis is one of a series of posts that showcases my approach to exploring data and drawing insights from it. Here I explore the relationship season or month and the availability of Airbnb listings. Enjoy!


What is the relationship between season and availability?


First, load required packages:

library(tidyverse)
library(lubridate) # for working with dates

Upload the airbnb calendar data:

calendar <- read_csv("data/calendar.csv")
calendar
## # A tibble: 1,393,570 × 4
##    listing_id date       available price 
##         <dbl> <date>     <lgl>     <chr> 
##  1     241032 2016-01-04 TRUE      $85.00
##  2     241032 2016-01-05 TRUE      $85.00
##  3     241032 2016-01-06 FALSE     <NA>  
##  4     241032 2016-01-07 FALSE     <NA>  
##  5     241032 2016-01-08 FALSE     <NA>  
##  6     241032 2016-01-09 FALSE     <NA>  
##  7     241032 2016-01-10 FALSE     <NA>  
##  8     241032 2016-01-11 FALSE     <NA>  
##  9     241032 2016-01-12 FALSE     <NA>  
## 10     241032 2016-01-13 TRUE      $85.00
## # ℹ 1,393,560 more rows

For this visualization, my initial thought was to make a time-series graphic with a confidence ribbon that shows how seasonality is related to availability. In the process I realized the data needed a lot of detrending to be able to extract at least a reasonable estimate of seasonal availability, so I generalized the figure to availability per month.

First, I want to summarize the calendar data by aggregating the proportion of availability by calendar date and seeing a quick visualization of that: how many unique dates?

length(unique(calendar$date))
## [1] 365

Exactly 365, nice!

Now, group by date and summarize availability as the average of Trues and Falses. In other words, the proportion of listings that are available on any given date:

avail_summ <- group_by(calendar, date) %>%
  summarize(prop_avail = mean(available)) %>%
  ungroup()

avail_summ
## # A tibble: 365 × 2
##    date       prop_avail
##    <date>          <dbl>
##  1 2016-01-04      0.454
##  2 2016-01-05      0.489
##  3 2016-01-06      0.478
##  4 2016-01-07      0.465
##  5 2016-01-08      0.467
##  6 2016-01-09      0.486
##  7 2016-01-10      0.526
##  8 2016-01-11      0.545
##  9 2016-01-12      0.554
## 10 2016-01-13      0.549
## # ℹ 355 more rows

Nice! Now plot a simple line plot of prop_avail ~ date:

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = prop_avail)) +
  theme_grey()

Gosh! That looks horrible. But I think there are a few things going on here. First, the general trend of going from low availability to high availability is probably just the fact that fewer rentals are booked far in advance, so by chance alone there will be higher availability the further away from January you get (since these data were all scraped at the beginning of January 2016). The other thing that’s going on is that there are three sudden drops, one at 90 days, one at 180 days, and one at 365 days. These drops might have to do with the window that hosts open up their listings for. And I think the defaults are 90 days, 180 days, and 365 days. So many listings are just automatically blocked off after 90, 180, etc, which might cause those sudden drops, but I’m not totally sure. Either way, we have to get rid of them to get at the seasonality signal.

First, remove those steps. Add an indicator of those steps by completely detrending the data using diff():

avail_summ <- mutate(avail_summ, prop_avail_detrend = c(NA, diff(prop_avail)))

Looking at those data we can see those three big spikes going down:

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = prop_avail_detrend)) +
  theme_grey()
## Warning: Removed 1 row containing missing values (`geom_line()`).

The algorithm for removing those spikes from the data is as follows: Wherever a value goes below a certain threshold in the completely detrended data—destep_indicator (i.e., those huge down spikes), subtract that value from every following value in the dataset:

# Extract the spike indicator:
destep_indicator <- avail_summ$prop_avail_detrend

# Get the spots where those drops happen (90, 180, and 365 days)
# I just arbitrarily set the threshold to - 0.025 to capture those three
destep_spots <- which(destep_indicator < -0.025)
# Extract how far each step is:
destep_size <- destep_indicator[destep_indicator < -0.025][-1]
# Then create a blank vector of zeros,
destep_vec <- rep(0, 365)
# and then loop through the three spikes, adding the value to all the points
# that follow:
for(i in seq_along(destep_spots)){
  destep_vec[destep_spots[i]:365] <- destep_vec[destep_spots[i]:365] + destep_size[i]
}

# Finally, subtract that vector from the original data to remove the steps:
avail_summ <- mutate(avail_summ, prop_avail_destep = prop_avail - destep_vec)

Cool! So now let’s plot the ‘de-stepped’ availability data:

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = prop_avail_destep)) +
  theme_grey()

That’s looking much better!

Next step is to remove the trend that people trend that people tend to book fewer listings far in advance and more listings close to present (hence the generally increasing availability over time).

Ideally, we would have data on how far in advance each booking was made so that we could build a model that precisely fit those data. With that we would be able to remove exactly that trend from the availability data and be left with only the seasonality trends (plus other random error). But I don’t have those data, so the next best thing is to estimate that curve myself. It’s tricky because over-fitting would lose the seasonality curve, but under-fitting would not detrend the data as well. I chose to use a General Additive Model for fitting the curve since I have no mechanistic reason to choose any specific type of model, and GAMs do well at fitting unknown relationships.

I ended up settling on a GAM model with a k of 6 and smoothing factor of 0.8. This is a key assumption to my final figure, but not having a more objective way to pick the model, I’ll go for this one:

library(mgcv) # load package for using GAMs
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
mod_1 <- gam(prop_avail_destep ~  s(seq_along(date), k=6), data=avail_summ, sp=0.8)

ggplot(avail_summ, aes(x = date)) +
  geom_line(aes(y = predict(mod_1)), color=2, size=1) +
  geom_line(aes(y = prop_avail_destep), size=1, color="grey20") +
  theme_grey()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.