About this analysis

This analysis is one of a series of posts that showcases my approach to exploring data and drawing insights from it. Here I explore the relationship between availability and price in an Airbnb listings dataset. Enjoy!


How many days in advance should you be booking your Airbnb?


First, load required packages:

library(tidyverse)
library(ggtext) # for adding rich text to plots

Upload the airbnb listings data:

listings <- read_csv("data/listings.csv")

There’s a warning message due to a typo in one of the zipcode values, but that won’t affect this analysis.

Now explore the “availability” and price data from listings:

# Extract all columns that contain "avail" in their name:
avail_data <- select(listings, id, contains("avail"), price) %>%
  # and parse price to numeric
  mutate(price = parse_number(price))

Context

Before looking at the data, I started thinking about some of the potential trends between price and percent availability. For example, maybe more expensive rentals are also more desirable so they have a lower availability across longer time spans than cheaper rentals since guests might book them with more time in advance.

The thing is, there are a ton more questions and ways to look at the data, as well as extracting availability directly from the “calendar” dataframe in more detail. It would help to have more context since there are so many different ways to look at these data and many different interpretations that can be derived. After exploring a bit, I took the creative license to focus on visualizing the relationship between availability and price with the goal of helping Airbnb customers better understand how far in advance they should be booking their stays based on their budget.

Data exploration

The “availability_XX” columns indicate the number of days available XX days into the future. But I think the proportion of days (rather than the actual value) is more useful since each period of time (30, 60, 90, and 365) has more potential days that can be available.

table(avail_data$has_availability) # all are true
## 
## TRUE 
## 3818
par(mfrow=c(2,3)) # set plotting parameters to a 2x3 grid
hist(avail_data$availability_30/30, 20, main="")
hist(avail_data$availability_60/60, 20, main="")
hist(avail_data$availability_90/90, 20, main="")
hist(avail_data$availability_365/365, 20, main="")
hist(avail_data$price, 20, main="")
hist(log(avail_data$price), 20, main="")

par(mfrow=c(1,1)) # reset plotting parameters to default

Ok, so there are a lot of 1s and 0s (totally available and no vacancies). The number of listings with no vacancies (0s) also goes down over greater time spans into the future as you’d expect. And as with the other assessment questions, price is skewed but can be transformed to ~normality by logging.

Now for some quick preliminary visualizations to see what we have to work with in terms of availability ~ price:

# customize my plotting theme:
theme_mygrey <- theme_gray() +
  theme(plot.margin = unit(c(.7,.7,.7,.7), "cm"),
        panel.grid.minor = element_blank(),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        axis.title.x = element_text(size = 15, margin=ggplot2::margin(t=0.5, unit="cm")),
        axis.title.y = element_text(size = 15, margin=ggplot2::margin(r=0.5, unit="cm")),
        plot.title = element_text(size = 16))

# Here is a direct scatterplot of availability ~ price:
ggplot(avail_data, aes(x = price, y = availability_30/30)) +
  geom_point(alpha = 0.5, stroke = 0, size = 3) +
  labs(x = "Price (log-scaled)", y = "0-30 day Availability") +
  scale_x_continuous(trans = "log10", labels=scales::dollar_format()) +
  theme_mygrey

Cool! The problem is that there is so little to meaningfully extract from this visualization. I looked at all the different time periods as well, but they all look more or less the same. Conclusion is that I need to do some feature engineering to extract some more meaningful aspects of these data.

Data cleaning and feature engineering

Here I do all the cleaning, engineering, and converting from wide to long format. For example, instead of the total availability 0 to 30 days, 0 to 60 days, and 0 to 90 days, etc., it would be more informative to subtract the availability from the previous chunk of time so that we have the availability of a booking per different windows of time (0 to 30 days, 30 to 60 days, 60 to 90 days, etc.).

# Remove the "has_availability" column (not useful, all TRUEs)
avail_data_long <- select(avail_data, -has_availability) %>%
  # Calculate the available days within each time period (rather than the cumulative)
  mutate(available_0to30 = availability_30,
         available_30to60 = availability_60 - availability_30,
         available_60to90 = availability_90 - availability_60,
         available_90to365 = availability_365 - availability_90) %>%
  # Remove the previous availability_XX columns
  select(-contains("availability")) %>%
  # Convert to long format
  pivot_longer(cols = contains("_"), 
               names_to = "availability_window", 
               values_to = "available_days") %>%
  # extract and parse the number of days in each availability window:
  mutate(window = case_when(grepl("0to30", availability_window) ~ 30,
                            grepl("30to60", availability_window) ~ 30,
                            grepl("60to90", availability_window) ~ 30,
                            grepl("90to365", availability_window) ~ 275)) %>%
  # Calculate percent availability by dividing available_days by the window size:
  mutate(availability_perc = round((available_days / window * 100), 2)) %>%
  # Create a label for the windows by converting "30to60" to "30 to 60":
  mutate(window_lab = str_split_fixed(availability_window, "_", n = 2)[,2],
         window_lab = paste(str_split_fixed(window_lab, "to", n = 2)[,1],
                            "to",
                            str_split_fixed(window_lab, "to", n = 2)[,2])) %>%
  # Break the prices into categorical bins to make it easier to create a heatmap
  mutate(price_class = cut(price,
                           breaks=c(-Inf, 50, 100, 200, 300, 400, 500, Inf),
                           labels=c("< $50","$51-$100",
                                    "$101-$200","$201-$300",
                                    "$301-$400","$401-$500","> $500")))
avail_data_long
## # A tibble: 15,272 × 8
##         id price availability_window available_days window availability_perc
##      <dbl> <dbl> <chr>                        <dbl>  <dbl>             <dbl>
##  1  241032    85 available_0to30                 14     30             46.7 
##  2  241032    85 available_30to60                27     30             90   
##  3  241032    85 available_60to90                30     30            100   
##  4  241032    85 available_90to365              275    275            100   
##  5  953595   150 available_0to30                 13     30             43.3 
##  6  953595   150 available_30to60                 0     30              0   
##  7  953595   150 available_60to90                 3     30             10   
##  8  953595   150 available_90to365              275    275            100   
##  9 3308979   975 available_0to30                  1     30              3.33
## 10 3308979   975 available_30to60                 5     30             16.7 
## # ℹ 15,262 more rows
## # ℹ 2 more variables: window_lab <chr>, price_class <fct>

The next step is to summarize the data across all the listings by grouping by each availability window and price bin. The first important summary statistic is the overall average percentage of days available. Then, using that I calculated the number of listings that had at least some availability (availability_perc > 0), and then used that to calculate the proportion of listings with some availability (perc_avail_list). Finally, I calculated the inverse of that as the proportion of listings that were completely booked (perc_booked_list):

avail_data_heat <- group_by(avail_data_long, window_lab, price_class) %>% 
  summarize(mean_avail_days = mean(availability_perc),
            # total number of available listings
            tot_avail_listings = sum(availability_perc > 0),
            # proportion of listings that are still available
            perc_avail_list = sum(availability_perc > 0)/length(availability_perc),
            # the inverse, or proportion of listings already booked:
            perc_booked_list = 1 - perc_avail_list) %>%
  ungroup()
## `summarise()` has grouped output by 'window_lab'. You can override using the
## `.groups` argument.
avail_data_heat
## # A tibble: 28 × 6
##    window_lab price_class mean_avail_days tot_avail_listings perc_avail_list
##    <chr>      <fct>                 <dbl>              <int>           <dbl>
##  1 0 to 30    < $50                  56.2                283           0.811
##  2 0 to 30    $51-$100               57.6               1290           0.801
##  3 0 to 30    $101-$200              55.1               1044           0.745
##  4 0 to 30    $201-$300              53.2                207           0.695
##  5 0 to 30    $301-$400              53.9                 67           0.698
##  6 0 to 30    $401-$500              45.5                 23           0.639
##  7 0 to 30    > $500                 53.7                 21           0.778
##  8 30 to 60   < $50                  70.4                286           0.819
##  9 30 to 60   $51-$100               68.7               1280           0.795
## 10 30 to 60   $101-$200              65.1               1053           0.752
## # ℹ 18 more rows
## # ℹ 1 more variable: perc_booked_list <dbl>

Finally, break the proportion of listings that were completely booked (perc_booked_list) into bins:

# create a new variable from perc_booked_list to split it into categories
avail_data_heat <- mutate(avail_data_heat,
                           perc_booked_list_cat = cut(perc_booked_list, breaks = c(seq(0, 1.00, 0.1)), include.lowest = T,
                                                     labels = rev(c("91-100%", "81-90%","71-80%", "61-70%",
                                                                    "51-60%", "41-50%","31-40%", "21-30%",
                                                                    "11-20%", "0-10%")))) %>%
  # reverse order of the levels so that the legend looks right
  mutate(perc_booked_list_cat=factor(perc_booked_list_cat, levels=rev(levels(perc_booked_list_cat))))
# Create a text description that will go with this figure:
text_desc <- tibble(desc = "<b>Since the best listings<br>
                    probably go first, use this<br>
                    chart to find how far in<br>
                    advance you should be<br>
                    booking your rental so<br>
                    you don't miss out!</b><br><br>
                    e.g., If your budget is<br>
                    $201-$300, the top 31-40%<br>
                    of those rentals might already<br>
                    be gone if you only book<br>
                    0 to 30 days in advance!")
# Now plot a heatmap
ggplot(avail_data_heat, aes(x = window_lab, y = price_class, fill = perc_booked_list_cat)) +
  # create the heatmap geometry
  geom_tile(color = "white", size=0.2) +
  # format the legend
  guides(fill = guide_legend(title="Listings booked**")) +
  # format the labels and text
  labs( x="Days in Advance", y = "Price Range",
        title = "How many days in advance should you\nbe booking your AirBnB?*",
        caption = "* Data are based on AirBnB listings from Seattle, WA (Jan, 2016)\n** Listings may also appear booked if they were blocked by the host") +
  # add the text description on the right of the plot
  geom_richtext(data=text_desc, fill = NA, label.color = NA, hjust=0, vjust = 1,
               aes(label=desc), x = 4.6, y = 4.3, color = "grey40") +
  # allow text to display outside of the plotting area
  coord_fixed(clip = "off") +
  # adjust the axis scales so that there is not extra space
  scale_y_discrete(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0)) +
  # create a color ramp scale for the heatmap
  scale_fill_manual(values = (colorRampPalette(c("#f2c718", "#28989e"))(4)), na.value = "grey90") +
  # use the basic "grey" theme
  theme_grey(base_size=0) +
  # but add in some customizations to make it look pretty:
  theme(plot.margin = unit(c(1,2,1,1), "cm"),
        panel.grid.minor = element_blank(),
        axis.text.x = element_text(size = 14, margin = margin(t=5), angle=45, hjust=1),
        axis.text.y = element_text(size = 14, hjust=1, margin = margin(r=10)),
        axis.title.x = element_text(size = 15, margin=margin(t=5), color = "grey40"),
        axis.title.y = element_text(size = 17, margin=margin(r=15), color = "grey40"),
        plot.title = element_text(color = "grey40", hjust = 0,size = 18, face = "bold",
                                  margin = margin(b=20)),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0, size=10,  margin=margin(t=15), 
                                    color = "grey40", face="italic"),
        legend.position = "right", legend.direction = "vertical", 
        legend.justification = "top", legend.key.size = unit(25, "pt"),
        legend.title = element_text(color = "grey40", size = 14),
        legend.text = element_text(color = "grey40", size = 12, face = "bold"),
        legend.margin = margin(l=10))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Conclusion

The visualization I ended up using was a simple heat map that shows the relationship between different price ranges and listing availability (or lack thereof). I chose this one because I wanted to present a simple display of actionable insights from an otherwise very complex relationship between price and availability. To accomplish this, I ended up calculating various features such as the percentage of listings that were completely booked out and the binning of various continuous variables such as price and availability. Then I leveraged the power of ggplot2 and tweaked the theme parameters until I was satisfied with the final product.

Another important consideration is audience. I designed this visualization more as an infographic with a general consumer audience in mind. An audience that is not super technical, but also appreciates finding “life hacks” for optimizing decisions such as vacation planning.

Finally, the big caveat here is that I did not use any statistical tests on the back-end to confirm the reliability of these data on different times of the year. The data were collected at the beginning of 2016, so the differences in the windows of availability might just reflect differences in seasonality. More work would need to be done before this was made public as an official recommendation chart (or at least the caveats explained clearly with the figure).