Our first, over-confident attempt to become homeowners came in the summer of 2016. We found a house in our price range in a neighbourhood that we liked. It hadn’t been on the market since 1965 and needed a lot of work, but it was liveable. I decided to do a bit of research.
The base data for this analysis comes from the MLS system: all homes sold in the M1E FSA (forward sortation area), a geographical region roughly coterminous with the Guildwood neighbourhood in Scarborough, Toronto. These data aren’t publicly available and I’ll hold off saying exactly how I got them for now, but everything is above board. Guildwood is a quiet, older suburb near the boundary of the GTA, but the local GO Train station means that you can commute to downtown Toronto in about 35 minutes.
Real estate listings are geospatial data, so a map makes sense as the starting point for this analysis. Because the data I am using comes with postal codes, I thought that matching postal codes to latitude and longitude would be a piece of cake. Unfortunately, however, I wasn’t able to find a readily available dataset of postal codes with geospatial data attached. Canada Post has one, but you have to buy it. After a fair bit of digging around in Google I came up with a crowd-sourced alternative that seems to work fine. Let’s start by loading up the required libraries for all of our analysis, and then downloading and extracting the right geospatial data from the open-source database of geospatial postal codes:
```r
# Necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(scales)
library(rgdal)
library(rgeos)
library(leaflet)
library(htmlwidgets)

# Postal code polygons
# Download from: http://geocoder.ca/?freedata=1
postal <- readOGR(".", "CanadaPostalCodePolygons")
latlong <- as.data.frame(coordinates(postal))   # polygon centroids
postal <- as.data.frame(postal$ZIP)
loc <- cbind(latlong, postal)
names(loc) <- c("long", "lat", "postal")        # name the key column for the merge below
```
The next step is to read in our house data and merge it with the location data:
```r
# Read in the house data and attach coordinates by postal code
list <- read.csv("house_data_final.csv", stringsAsFactors = FALSE)
list_m <- merge(list, loc, by = "postal")
```
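One caveat worth flagging: `merge()` performs an inner join by default, so any listing whose postal code doesn’t appear in the crowd-sourced polygon file silently drops out of `list_m`. A quick sanity check (a sketch, assuming the column names used above) makes any loss visible:

```r
# Listings whose postal code failed to match a polygon
# (these rows are silently dropped by the inner join above)
unmatched <- anti_join(list, loc, by = "postal")
nrow(unmatched)            # how many listings we would lose
unique(unmatched$postal)   # which postal codes to investigate or clean up
```

If the count is more than a handful, it’s worth cleaning the postal code field (stray spaces, lowercase letters) before merging.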
At this point we have what we need to map: a list of listings with corresponding latitude and longitude coordinates. I decided to use the `leaflet` package to create an interactive map. I’d like each point on the map, which represents a listing, to provide some information when you click on it, so the next step is to create an appropriate popup using some of the fields in our dataframe:
```r
# Build an HTML popup string for each listing
pop <- paste0("<strong>Address: </strong>", list_m$address,
              "<br><strong>Date Sold: </strong>", list_m$sold_date,
              "<br><strong>Bedrooms: </strong>", list_m$bed,
              "<br><strong>Bathrooms: </strong>", list_m$bath,
              "<br><strong>List Price: </strong>", list_m$list,
              "<br><strong>Sale Price: </strong>", list_m$sold)
```
`leaflet` provides a fantastic piped interface: we pick a basemap, layer our listings on top as circle markers, and attach the popups, all in a few lines:
```r
leaflet(list_m) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(radius = 4, color = "navy", stroke = FALSE,
                   fillOpacity = 1, popup = ~pop)
```
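The `htmlwidgets` package loaded at the top is what lets us export this map as a standalone HTML file, which is handy for embedding it in a post. A minimal sketch (the filename is my own placeholder):

```r
# Assign the map to an object, then save it as a self-contained HTML file
m <- leaflet(list_m) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(radius = 4, color = "navy", stroke = FALSE,
                   fillOpacity = 1, popup = ~pop)
saveWidget(m, file = "guildwood_map.html", selfcontained = TRUE)
```

With `selfcontained = TRUE` all of the JavaScript and data are bundled into the single file, so it can be shared or hosted as-is.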
Let’s dive into the listings. One of the most important criteria for house-hunters is always the number of bedrooms a potential property has: bedrooms are inextricably linked to your plans for the house as a home, the number of individuals it will be able to accommodate, its room for growth, etc. We were no different: we knew we needed more than the 2 br we were currently occupying if we planned to stay in the house for any length of time, and we thought that we were unlikely to be able to afford a 4 or 5 br home. Our overall preference, based on both capacity and price, was to get a 3 br house. My first cut at the data is to look at the number of listings in the dataset based on the number of bedrooms:
```r
# Histogram of listings by number of bedrooms
ggplot(list, aes(bed)) +
  geom_histogram(binwidth = 0.2, fill = "navyblue") +
  scale_x_continuous(breaks = seq(0, 6, 1)) +
  stat_bin(binwidth = 1, geom = "text", aes(label = ..count..),
           vjust = -1.5, size = 3.3) +
  scale_y_continuous(limits = c(0, 500)) +
  labs(title = "Histogram of Listings by # of Bedrooms",
       subtitle = "Guildwood (M1E); 2014-2016. Bars represent the # of homes with x bedrooms",
       x = "# of Bedrooms", y = "# of Homes") +
  theme_minimal() +
  theme(axis.text.y = element_blank())
```
Guildwood looks like the right neighbourhood to be shopping for a 3 br home! Between July 2014 and July 2016, 445 3 br homes were listed and sold, more than twice as many as any other size of house. (I’m going to remove the repetitive, stylistic elements of the `ggplot` call above in the code for future plots.)
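One tidy way to avoid repeating those stylistic elements is to store them once in a list and add the list to each plot; `ggplot2` applies each element in turn. This is a sketch of the pattern (the `plot_style` name is my own), not code from the original analysis:

```r
# A reusable list of shared plot elements
plot_style <- list(
  theme_minimal(),
  theme(axis.text.y = element_blank())
)

# Any subsequent plot can append the shared style with `+`
ggplot(list, aes(bed)) +
  geom_histogram(binwidth = 0.2, fill = "navyblue") +
  plot_style
```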
Slicing the data by number of bedrooms is a useful way of understanding general patterns, even if we were specifically interested in 3 br houses. The next chunk of code uses the `aggregate` function to identify the average sale price by number of bedrooms, giving us a quick comparison of average prices across categories. The result of `aggregate` is piped into the `ggplot` code used to plot the results: this has the benefit of integrated code that doesn’t leave any intermediate dataframes behind.
```r
# Average selling price by number of bedrooms
round(aggregate(sold ~ bed, data = list, mean)) %>%
  ggplot(aes(as.factor(bed), sold)) +
  geom_point(color = "navyblue") +
  scale_y_continuous(limits = c(500000, 1200000)) +
  geom_text(aes(label = comma(sold)), vjust = -0.80, size = 3.3)
```
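For readers more at home in the tidyverse, the same summary can be written with the `dplyr` verbs loaded at the top. This is an equivalent sketch rather than the original code; note the explicit `NA` filter, which `aggregate` handles for us automatically:

```r
# dplyr equivalent of the aggregate() call above
list %>%
  filter(!is.na(sold)) %>%
  group_by(bed) %>%
  summarise(sold = round(mean(sold))) %>%
  ggplot(aes(as.factor(bed), sold)) +
  geom_point(color = "navyblue") +
  geom_text(aes(label = comma(sold)), vjust = -0.80, size = 3.3)
```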
Averages are useful, but obscure a lot of the variation that likely exists within each bedroom category. Boxplots are a useful way of visualizing some of that variation. The code below plots a boxplot for each bedroom category, with a jittered scatterplot of the individual sale prices layered on top:
```r
# Boxplot of sale prices by number of bedrooms
list %>%
  na.omit() %>%
  ggplot(aes(as.factor(bed), sold, group = bed)) +
  geom_boxplot() +
  geom_jitter(color = "navyblue", width = 0.35)
```
Finally, since 3 br houses were our target, let’s zoom in on that category and look at the distribution of sale prices directly:

```r
# Distribution of sale prices for 3 br homes
list %>%
  filter(bed == 3) %>%
  ggplot(aes(sold)) +
  geom_histogram(bins = 50, fill = "navyblue")
```
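To put some numbers on that distribution, summary statistics for the 3 br sale prices are one line away (a sketch using the same dataframe):

```r
# Numeric summary (min, quartiles, median, mean, max) of 3 br sale prices
list %>%
  filter(bed == 3) %>%
  pull(sold) %>%
  summary()
```

The median is the figure to watch here: with a right-skewed price distribution it is a better guide to what a typical 3 br home actually sold for than the mean plotted earlier.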