First, load required packages:
library(tidyverse)
library(ggtext) # for adding rich text to plots
Upload the airbnb listings data:
listings <- read_csv("data/listings.csv")
There’s a warning message due to a typo in one of the zipcode values, but that won’t affect this analysis.
Now explore the “availability” and price data from listings:
# Extract all columns that contain "avail" in their name:
avail_data <- select(listings, id, contains("avail"), price) %>%
# and parse price to numeric
mutate(price = parse_number(price))
Before looking at the data, I started thinking about some of the potential trends between price and percent availability. For example, maybe more expensive rentals are also more desirable so they have a lower availability across longer time spans than cheaper rentals since guests might book them with more time in advance.
The thing is, there are a ton more questions and ways to look at the data, as well as extracting availability directly from the “calendar” dataframe in more detail. It would help to have more context since there are so many different ways to look at these data and many different interpretations that can be derived. After exploring a bit, I took the creative license to focus on visualizing the relationship between availability and price with the goal of helping Airbnb customers better understand how far in advance they should be booking their stays based on their budget.
The “availability_XX” columns indicate the number of days available XX days into the future. But I think the proportion of days (rather than the actual value) is more useful since each period of time (30, 60, 90, and 365) has more potential days that can be available.
table(avail_data$has_availability) # all are true
##
## TRUE
## 3818
par(mfrow=c(2,3)) # set plotting parameters to a 2x3 grid
hist(avail_data$availability_30/30, 20, main="")
hist(avail_data$availability_60/60, 20, main="")
hist(avail_data$availability_90/90, 20, main="")
hist(avail_data$availability_365/365, 20, main="")
hist(avail_data$price, 20, main="")
hist(log(avail_data$price), 20, main="")
par(mfrow=c(1,1)) # reset plotting parameters to default
Ok, so there are a lot of 1s and 0s (totally available and no vacancies). The number of listings with no vacancies (0s) also goes down over greater time spans into the future as you’d expect. And as with the other assessment questions, price is skewed but can be transformed to ~normality by logging.
Now for some quick preliminary visualizations to see what we have to work with in terms of availability ~ price:
# customize my plotting theme:
theme_mygrey <- theme_gray() +
theme(plot.margin = unit(c(.7,.7,.7,.7), "cm"),
panel.grid.minor = element_blank(),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
axis.title.x = element_text(size = 15, margin=ggplot2::margin(t=0.5, unit="cm")),
axis.title.y = element_text(size = 15, margin=ggplot2::margin(r=0.5, unit="cm")),
plot.title = element_text(size = 16))
# Here is a direct scatterplot of availability ~ price:
ggplot(avail_data, aes(x = price, y = availability_30/30)) +
geom_point(alpha = 0.5, stroke = 0, size = 3) +
labs(x = "Price (log-scaled)", y = "0-30 day Availability") +
scale_x_continuous(trans = "log10", labels=scales::dollar_format()) +
theme_mygrey
Cool! The problem is that there is so little to meaningfully extract from this visualization. I looked at all the different time periods as well, but they all look more or less the same. Conclusion is that I need to do some feature engineering to extract some more meaningful aspects of these data.
Here I do all the cleaning, engineering, and converting from wide to long format. For example, instead of the total availability 0 to 30 days, 0 to 60 days, and 0 to 90 days, etc., it would be more informative to subtract the availability from the previous chunk of time so that we have the availability of a booking per different windows of time (0 to 30 days, 30 to 60 days, 60 to 90 days, etc.).
# Remove the "has_availability" column (not useful, all TRUEs)
avail_data_long <- select(avail_data, -has_availability) %>%
# Calculate the available days within each time period (rather than the cumulative)
mutate(available_0to30 = availability_30,
available_30to60 = availability_60 - availability_30,
available_60to90 = availability_90 - availability_60,
available_90to365 = availability_365 - availability_90) %>%
# Remove the previous availability_XX columns
select(-contains("availability")) %>%
# Convert to long format
pivot_longer(cols = contains("_"),
names_to = "availability_window",
values_to = "available_days") %>%
# extract and parse the number of days in each availability window:
mutate(window = case_when(grepl("0to30", availability_window) ~ 30,
grepl("30to60", availability_window) ~ 30,
grepl("60to90", availability_window) ~ 30,
grepl("90to365", availability_window) ~ 275)) %>%
# Calculate percent availability by dividing available_days by the window size:
mutate(availability_perc = round((available_days / window * 100), 2)) %>%
# Create a label for the windows by converting "30to60" to "30 to 60":
mutate(window_lab = str_split_fixed(availability_window, "_", n = 2)[,2],
window_lab = paste(str_split_fixed(window_lab, "to", n = 2)[,1],
"to",
str_split_fixed(window_lab, "to", n = 2)[,2])) %>%
# Break the prices into categorical bins to make it easier to create a heatmap
mutate(price_class = cut(price,
breaks=c(-Inf, 50, 100, 200, 300, 400, 500, Inf),
labels=c("< $50","$51-$100",
"$101-$200","$201-$300",
"$301-$400","$401-$500","> $500")))
avail_data_long
## # A tibble: 15,272 × 8
## id price availability_window available_days window availability_perc
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 241032 85 available_0to30 14 30 46.7
## 2 241032 85 available_30to60 27 30 90
## 3 241032 85 available_60to90 30 30 100
## 4 241032 85 available_90to365 275 275 100
## 5 953595 150 available_0to30 13 30 43.3
## 6 953595 150 available_30to60 0 30 0
## 7 953595 150 available_60to90 3 30 10
## 8 953595 150 available_90to365 275 275 100
## 9 3308979 975 available_0to30 1 30 3.33
## 10 3308979 975 available_30to60 5 30 16.7
## # ℹ 15,262 more rows
## # ℹ 2 more variables: window_lab <chr>, price_class <fct>
The next step is to summarize the data across all the listings by
grouping by each availability window and price bin. The first important
summary statistic is the overall average percentage of days available.
Then, using that I calculated the number of listings that had at least
some availability (availability_perc > 0
), and
then used that to calculate the proportion of listings with
some availability (perc_avail_list
). Finally, I calculated
the inverse of that as the proportion of listings that were completely
booked (perc_booked_list
):
avail_data_heat <- group_by(avail_data_long, window_lab, price_class) %>%
summarize(mean_avail_days = mean(availability_perc),
# total number of available listings
tot_avail_listings = sum(availability_perc > 0),
# proportion of listings that are still available
perc_avail_list = sum(availability_perc > 0)/length(availability_perc),
# the inverse, or proportion of listings already booked:
perc_booked_list = 1 - perc_avail_list) %>%
ungroup()
## `summarise()` has grouped output by 'window_lab'. You can override using the
## `.groups` argument.
avail_data_heat
## # A tibble: 28 × 6
## window_lab price_class mean_avail_days tot_avail_listings perc_avail_list
## <chr> <fct> <dbl> <int> <dbl>
## 1 0 to 30 < $50 56.2 283 0.811
## 2 0 to 30 $51-$100 57.6 1290 0.801
## 3 0 to 30 $101-$200 55.1 1044 0.745
## 4 0 to 30 $201-$300 53.2 207 0.695
## 5 0 to 30 $301-$400 53.9 67 0.698
## 6 0 to 30 $401-$500 45.5 23 0.639
## 7 0 to 30 > $500 53.7 21 0.778
## 8 30 to 60 < $50 70.4 286 0.819
## 9 30 to 60 $51-$100 68.7 1280 0.795
## 10 30 to 60 $101-$200 65.1 1053 0.752
## # ℹ 18 more rows
## # ℹ 1 more variable: perc_booked_list <dbl>
Finally, break the proportion of listings that were completely booked
(perc_booked_list
) into bins:
# create a new variable from perc_booked_list to split it into categories
avail_data_heat <- mutate(avail_data_heat,
perc_booked_list_cat = cut(perc_booked_list, breaks = c(seq(0, 1.00, 0.1)), include.lowest = T,
labels = rev(c("91-100%", "81-90%","71-80%", "61-70%",
"51-60%", "41-50%","31-40%", "21-30%",
"11-20%", "0-10%")))) %>%
# reverse order of the levels so that the legend looks right
mutate(perc_booked_list_cat=factor(perc_booked_list_cat, levels=rev(levels(perc_booked_list_cat))))
# Create a text description that will go with this figure:
text_desc <- tibble(desc = "<b>Since the best listings<br>
probably go first, use this<br>
chart to find how far in<br>
advance you should be<br>
booking your rental so<br>
you don't miss out!</b><br><br>
e.g., If your budget is<br>
$201-$300, the top 31-40%<br>
of those rentals might already<br>
be gone if you only book<br>
0 to 30 days in advance!")
# Now plot a heatmap
ggplot(avail_data_heat, aes(x = window_lab, y = price_class, fill = perc_booked_list_cat)) +
# create the heatmap geometry
geom_tile(color = "white", size=0.2) +
# format the legend
guides(fill = guide_legend(title="Listings booked**")) +
# format the labels and text
labs( x="Days in Advance", y = "Price Range",
title = "How many days in advance should you\nbe booking your AirBnB?*",
caption = "* Data are based on AirBnB listings from Seattle, WA (Jan, 2016)\n** Listings may also appear booked if they were blocked by the host") +
# add the text description on the right of the plot
geom_richtext(data=text_desc, fill = NA, label.color = NA, hjust=0, vjust = 1,
aes(label=desc), x = 4.6, y = 4.3, color = "grey40") +
# allow text to display outside of the plotting area
coord_fixed(clip = "off") +
# adjust the axis scales so that there is not extra space
scale_y_discrete(expand = c(0, 0)) +
scale_x_discrete(expand = c(0, 0)) +
# create a color ramp scale for the heatmap
scale_fill_manual(values = (colorRampPalette(c("#f2c718", "#28989e"))(4)), na.value = "grey90") +
# use the basic "grey" theme
theme_grey(base_size=0) +
# but add in some customizations to make it look pretty:
theme(plot.margin = unit(c(1,2,1,1), "cm"),
panel.grid.minor = element_blank(),
axis.text.x = element_text(size = 14, margin = margin(t=5), angle=45, hjust=1),
axis.text.y = element_text(size = 14, hjust=1, margin = margin(r=10)),
axis.title.x = element_text(size = 15, margin=margin(t=5), color = "grey40"),
axis.title.y = element_text(size = 17, margin=margin(r=15), color = "grey40"),
plot.title = element_text(color = "grey40", hjust = 0,size = 18, face = "bold",
margin = margin(b=20)),
plot.title.position = "plot",
plot.caption.position = "plot",
plot.caption = element_text(hjust = 0, size=10, margin=margin(t=15),
color = "grey40", face="italic"),
legend.position = "right", legend.direction = "vertical",
legend.justification = "top", legend.key.size = unit(25, "pt"),
legend.title = element_text(color = "grey40", size = 14),
legend.text = element_text(color = "grey40", size = 12, face = "bold"),
legend.margin = margin(l=10))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The visualization I ended up using was a simple heat map that shows the relationship between different price ranges and listing availability (or lack thereof). I chose this one because I wanted to present a simple display of actionable insights from an otherwise very complex relationship between price and availability. To accomplish this, I ended up calculating various features such as the percentage of listings that were completely booked out and the binning of various continuous variables such as price and availability. Then I leveraged the power of ggplot2 and tweaked the theme parameters until I was satisfied with the final product.
Another important consideration is audience. I designed this visualization more as an infographic with a general consumer audience in mind. An audience that is not super technical, but also appreciates finding “life hacks” for optimizing decisions such as vacation planning.
Finally, the big caveat here is that I did not use any statistical tests on the back-end to confirm the reliability of these data on different times of the year. The data were collected at the beginning of 2016, so the differences in the windows of availability might just reflect differences in seasonality. More work would need to be done before this was made public as an official recommendation chart (or at least the caveats explained clearly with the figure).