If you have not yet read the README.md associated with this project, located in the projects GitHub repository, please do so for information regarding the project's purpose, sources, data, and more.
In this analysis, I use house price data from the United States to measure the effect of the 2008 financial crisis on median house prices in two different kinds of cities - those with a higher proportion of university students vs. those with a lower proportion of university students. I call these university and non-university towns, respectively.
We all know that house prices to plummeted nationwide during the 2008 financial crisis. Is it possible that university towns and non-university towns differed in their ability to withstand this crash? My hypothesis is that university towns saw a smaller change in housing prices than non-university towns during this recession. While this is outside the scope of the analysis, I suspect that the demand for a university education may be relatively price inelastic, particularly given the value our society sees in post-secondary education. Furthermore, students have relatively easy access to private, state, and federal subsidies in the form of loans, grants, and scholarships. I hypothesize that these subsidies dampen the effect of decreased income on post-secondary enrollment rates. This would better retain residents, therefore cushioning house prices in university towns compared to non-university towns.
In this analysis, I clean three datasets, merge them as required, conduct summary statistics, and take a look at some visual analyses. Finally, I use a t-test to test my hypothesis.
University Towns Dataset
First, I clean the dataset from Wikipedia containing the list of university towns. Taking a look at at the first few entries, there is only one column with states and cities mixed together. However, note that the rows with states contain 'edit', while the city names contain the associated university within paranthesis. I will be using this to parse out the cities and states and move them to their correct columns. Lastly, some city names also have a number in brackets. These correspond to a footnote on the Wikipedia page, and must also be removed.
Here is a look at the first 10 entries of the dataset. The entire dataset has 567 entries.
I convert the above into a dataset with two columns, 'State' and 'City', without any brackets, paranthesis, or university names. This new dataframe is
Below is a look at the first 20 rows of the cleaned dataframe. As you can see, there are two columns, State and City, and each entry is clear and clean, with no paranthesis, brackets, or other unwanted characters.
Here is a preliminary look at the first 20 entries of the raw, unedited GDP dataset. As I mention in the project's README.md, it helps to download the data and open the file on Excel to see how it is supposed to look on a spreadsheet.
The dataset contains the GDP in various units for every year from 1927-2016, and every quarter from 1947q1 to 2016q2. As you can see, the way Python reads in the file is a total mess.
For my analysis, I'm only interested in the year 2000 onwards. For a more reliable analysis, I'll be using quarterly GDP data rather than annual. Lastly, I'll be using GDP measured in billions of chained 2009 dollars.
Take a look at the first twenty entries of the cleaned GDP dataset, above. There are two columns, Quarter and GDP. The dataframe starts at the first quarter of 2000, or 2000q1. While you can't see from the above, the last entry is 2016q2.
Housing Prices Dataset
Like with the GDP data, it helps if you first take a look on Excel to see what the table looks like on speadsheet software.
This dataset contains the median house price for houses sold within a subset of cities within each state. The data is split by month, starting at 1996-04, or April 1994, to 2016-08, or August 2016. Here is how it looks right after imported to Python. This is just the first ten rows, the entire dataset has 10,730 rows and 251 columns.
In my final dataset, I want just the State and City columns, and columns for each quarter (not month) from 2001q1 to 2016q2. Additionally, the State column in this dataframe contains acronyms for each state, not the full name. For example, 'New York' is written as 'NY'. My unitowns dataset contains the full names of each state. I need these to be same for merging down the road. I use the dictionary
states to replace the acronyms with the full name.
When converting from months to quarters, the quarter is the mean of the corresponding month's house prices. For example, the column 2000q1 will be the average of the columns 2000-01, 2000-02, and 2000-03.
The dictionary and cleaning for this dataset are below.
states was provided by Coursera.
Taking a look at the first twenty entries of the
house dataframe below, cities are now grouped by state in an orderly fashion. Columns now reflect median house prices from 2000-2016 quarterly.
Finding the Recession in the Data
A recession is defined as two consecutive quarters of falling GDP. Similairly, I'll define a recovery as two consecutive quarters of increasing GDP. In the cells below, I use the GDP dataframe to find the quarter where the recession started, the quarter where it ended, and the quarter where GDP was lowest during the recession. These are labelled as
The 2008 Financial Crisis - A Visualization
I verify the recession start, end, and bottom quarters are correct by taking a look at a small section of the GDP dataframe, below. It can be visually verified that GDP rose until 2008q3, hit a local minimum at 2009q2, and rose every quarter thereafter.
Looking at data from farther back, 2000q1 - 2016q2, the magnitude of this recession becomes more clear.
Here's a quick look at the summary statistics for GDP during throughout the recession, 2008q3-2009q4.
The Recession's Effect on Housing Prices
Here's a look at the average median house price by state throughout the recession. Interestingly, it can be seen that many states already have falling housing prices. On the next dataframe, we'll look at the quarters leading up to the start of the recession.
Looking at the quarters leading up to the recession, the quarter in which median housing prices start to fall for each state can be identified. For Alabama, for example, prices peaked in 2007q4. For New York, a state with aproximately double the average median house price, peaked in 2007q3.
Is there a difference between university and non-university towns? Are these statistically different?
To answer this question, I first calculate the measurement to be used for the t-test,
housing_recession_diff. It is the difference between the 2008q2 and 2009q4 median house price at the city level. I take the quarter before the recession start to capture the full effect of the recession. Here is a look at the resulting dataframe.
Now, I label all cities as university towns and non-university towns.
This results in two dataframes. Here is a look at the top 10 rows when 'University Town' = True.
Here is a look at the top 10 rows when 'University Town' = False.
Now, I find if the average difference in median house price for university towns and non-university towns is different.
University towns have, on average, a lower difference in median house prices from 2008q2 to 2009q4. For University Towns, the mean difference was 16133.72 USD. For non-University Towns, the difference was 24276.07 USD. I test the significance of the difference between these two groups using a t-test. However, I am making several assumptions about the data. In order for the t-test results to be valid, these assumptions should be verified to be true. I will verify these assumptions at a future time, as I have yet to learn how using Python. Additionally, with this amount of observations in both groups, you would typically want to run a z-test, not a t-test.
The t-test tests whether the difference between the average change (difference) in median housing prices from 2008q2-2009q4 is significant. The hypothesis is that they are significantly different, while the null hypothesis is that they aren't. Using mathematical notation: H1: tstat !=0, and H0: tstat =0. A tstat =0 means that being a university town has no effect on median house price difference.
The t-test returns a tstat of -3.5 and a p-value of 0.0004. Because 0.0004<0.01, we fail to reject the null hypothesis that two populations are not significantly different at a 1% significance level. At this significance level, we know that the tstat !=0 and therefore being a university town does significantly effect the change in median house price from 2008q2-2009q4. Furthermore, a negative tstat supports the results found when looking at the mean change between the two groups, as being a university town leads to a lower difference in median house price from 2008q2-2009q4.