About this analysis

This analysis is one of a series of posts that showcase my approach to exploring data and drawing insights from it. Here I explore the relationship between availability and price in an Airbnb listings dataset. Enjoy!


How many days in advance should you be booking your Airbnb?


First, load required packages:

library(tidyverse)
library(ggtext) # for adding rich text to plots

Load the Airbnb listings data:

listings <- read_csv("data/listings.csv")

There’s a warning message due to a typo in one of the zipcode values, but that won’t affect this analysis.

Now explore the “availability” and price data from listings:

# Keep the listing id, every column with "avail" in its name, and price:
avail_data <- select(listings, id, contains("avail"), price) %>%
  # and parse price to numeric
  mutate(price = parse_number(price))
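
As a quick illustration of what the parsing step does, here is a minimal sketch on a few made-up price strings. The `raw_price` values and their "$" and "," formatting are assumptions for illustration, not rows taken from the listings file:

```r
library(readr)

# Hypothetical price strings; the "$" and "," formatting is an assumption.
raw_price <- c("$85.00", "$1,250.00", "$40.00")

# parse_number() drops the non-numeric characters around the number and
# treats "," as a grouping mark under the default locale.
price_num <- parse_number(raw_price)
```

If the real price column used a different format (for example a different decimal mark), the `locale` argument of `parse_number()` would need adjusting.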

Context

Before looking at the data, I thought about potential trends between price and percent availability. For example, more expensive rentals might also be more desirable, so guests might book them further in advance, leaving them with lower availability over longer time spans than cheaper rentals.

There are many more questions to ask and ways to look at these data, including extracting availability in more detail directly from the “calendar” dataframe, and many different interpretations can be derived. More context would help narrow these down. After exploring a bit, I took the creative license to focus on visualizing the relationship between availability and price, with the goal of helping Airbnb customers understand how far in advance they should book their stays given their budget.

Data exploration

The “availability_XX” columns give the number of days a listing is available over the next XX days. I think the proportion of available days (rather than the raw count) is more useful, since longer periods (30, 60, 90, and 365 days) have more days that can be available, so raw counts are not comparable across periods.
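
Here is a minimal sketch of that conversion on a small made-up data frame. The `toy` data and the `prop_` column prefix are invented for illustration; only the `availability_XX` column names come from the listings data. Each count is divided by its own window length, so all four columns end up on the same 0 to 1 scale:

```r
library(dplyr)

# Toy stand-in for avail_data; the values are invented for illustration.
toy <- tibble(
  id = 1:3,
  availability_30  = c(0, 15, 30),
  availability_60  = c(0, 45, 60),
  availability_90  = c(9, 45, 90),
  availability_365 = c(73, 146, 365)
)

# Divide each availability_XX column by XX, reading XX from the column name.
toy_prop <- toy %>%
  mutate(across(
    starts_with("availability_"),
    ~ .x / as.numeric(sub("availability_", "", cur_column())),
    .names = "prop_{.col}"
  ))
```

The histograms below do the same thing inline (for example `availability_30/30`); storing the proportions as columns is just one way to avoid repeating the division.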

table(avail_data$has_availability) # all are true
## 
## TRUE 
## 3818
par(mfrow=c(2,3)) # set plotting parameters to a 2x3 grid
hist(avail_data$availability_30/30, breaks = 20, main="")
hist(avail_data$availability_60/60, breaks = 20, main="")
hist(avail_data$availability_90/90, breaks = 20, main="")
hist(avail_data$availability_365/365, breaks = 20, main="")
hist(avail_data$price, breaks = 20, main="")
hist(log(avail_data$price), breaks = 20, main="")

par(mfrow=c(1,1)) # reset plotting parameters to default

OK, so there are a lot of 1s and 0s (totally available and no vacancies). The number of listings with no vacancies (0s) also goes down over longer time spans into the future, as you’d expect. And as with the other assessment questions, price is right-skewed but becomes approximately normal after a log transform.
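
One rough way to put a number on the skew, beyond eyeballing the histograms, is to compare sample skewness before and after logging. This is only a sketch: `sim_price` is simulated log-normal data standing in for the real prices, and the `skewness()` helper is a hypothetical function defined here, not something used elsewhere in this analysis:

```r
# Simulated stand-in for price: log-normal draws, NOT the real listings data.
set.seed(1)
sim_price <- rlnorm(1000, meanlog = log(100), sdlog = 0.6)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  x <- x[is.finite(x)] # drop NA and -Inf (e.g. from log(0))
  mean((x - mean(x))^3) / sd(x)^3
}

skew_raw <- skewness(sim_price)      # clearly positive for right-skewed data
skew_log <- skewness(log(sim_price)) # close to zero if the log is roughly symmetric
```

With the real data, the same two calls on `avail_data$price` and `log(avail_data$price)` would give a quick check on what the last two histograms suggest.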

Now for some quick preliminary visualizations to see what we have to work with in terms of availability ~ price:

# customize my plotting theme:
theme_mygrey <- theme_gray() +
  theme(plot.margin = unit(c(.7,.7,.7,.7), "cm"),
        panel.grid.minor = element_blank(),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        axis.title.x = element_text(size = 15, margin=ggplot2::margin(t=0.5, unit="cm")),
        axis.title.y = element_text(size = 15, margin=ggplot2::margin(r=0.5, unit="cm")),
        plot.title = element_text(size = 16))

# Here is a direct scatterplot of availability ~ price:
ggplot(avail_data, aes(x = price, y = availability_30/30)) +
  geom_point(alpha = 0.5, stroke = 0, size = 3) +
  labs(x = "Price (log-scaled)", y = "0-30 day Availability") +
  scale_x_continuous(trans = "log10", labels=scales::dollar_format()) +
  theme_mygrey