*If you have not yet read the README.md associated with this project, located in the project's GitHub repository, please do so for information regarding the project's purpose, sources, data, and more.*

In this analysis, I use house price data from the United States to measure the effect of the 2008 financial crisis on median house prices in two kinds of cities: those with a higher proportion of university students and those with a lower proportion. I call these university towns and non-university towns, respectively.

We all know that house prices plummeted nationwide during the 2008 financial crisis. Is it possible that university towns and non-university towns differed in their ability to withstand this crash? My hypothesis is that university towns saw a smaller change in house prices than non-university towns during this recession. While it is outside the scope of this analysis, I suspect that the demand for a university education may be relatively price inelastic, particularly given the value our society places on post-secondary education. Furthermore, students have relatively easy access to private, state, and federal subsidies in the form of loans, grants, and scholarships. I hypothesize that these subsidies dampen the effect of decreased income on post-secondary enrollment rates. This would better retain residents, cushioning house prices in university towns compared to non-university towns.

In this analysis, I clean three datasets, merge them as required, compute summary statistics, and examine a few visualizations. Finally, I use a t-test to test my hypothesis.

Importing the relevant Python libraries.
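The import cell itself isn't rendered here; a typical set for this kind of analysis (the notebook's exact imports may differ) would be:

```python
# pandas for dataframes, numpy for numerics, scipy.stats for the t-test,
# matplotlib for the visualizations.
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
```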

**University Towns Dataset**

First, I clean the dataset from Wikipedia containing the list of university towns. Taking a look at the first few entries, there is only one column, with states and cities mixed together. However, note that the rows with states contain 'edit', while the city names are followed by the associated university in parentheses. I will use these markers to parse out the cities and states and move them into their correct columns. Lastly, some city names also have a number in brackets. These correspond to footnotes on the Wikipedia page and must also be removed.

Here is a look at the first 10 entries of the dataset. The entire dataset has 567 entries.


I convert the above into a dataframe with two columns, 'State' and 'City', without any brackets, parentheses, or university names. This new dataframe is `unitowns`.

Below is a look at the first 20 rows of the cleaned dataframe. As you can see, there are two columns, State and City, and each entry is clean, with no parentheses, brackets, or other unwanted characters.
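The parsing described above can be sketched as follows. The `raw` frame here is a tiny hypothetical stand-in for the Wikipedia list; the notebook's actual parsing code may differ.

```python
import re
import pandas as pd

# Stand-in for the raw Wikipedia column: state rows contain '[edit]',
# city rows carry the university in parentheses, sometimes followed by
# a bracketed footnote number.
raw = pd.DataFrame({"RegionName": [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Tuscaloosa (University of Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]})

rows = []
state = None
for entry in raw["RegionName"]:
    if "[edit]" in entry:
        # State header row: keep only the name before the bracket.
        state = entry.split("[")[0].strip()
    else:
        # City row under the current state: drop everything from the first
        # parenthesis or bracket onward.
        city = re.sub(r"\s*(\(.*|\[.*)", "", entry).strip()
        rows.append({"State": state, "City": city})

unitowns = pd.DataFrame(rows, columns=["State", "City"])
```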


**GDP Dataset**

Here is a preliminary look at the first 20 entries of the raw, unedited GDP dataset. As I mention in the project's README.md, it helps to download the data and open the file in Excel to see how it is meant to look in a spreadsheet.

The dataset contains the GDP in various units for every year from 1927-2016, and every quarter from 1947q1 to 2016q2. As you can see, the way Python reads in the file is a total mess.


For my analysis, I'm only interested in the year 2000 onwards. For a more reliable analysis, I'll be using quarterly GDP data rather than annual. Lastly, I'll be using GDP measured in billions of chained 2009 dollars.
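A minimal sketch of this cleanup, using illustrative GDP numbers. The commented `read_excel` line assumes the layout of the BEA 'gdplev' spreadsheet used in the Coursera course (quarter labels in column 4, chained-2009-dollar GDP in column 6); adjust if your file differs.

```python
import pandas as pd

# gdp = pd.read_excel("gdplev.xls", skiprows=5, usecols=[4, 6],
#                     names=["Quarter", "GDP"])

# Stand-in frame with placeholder values so the filtering step is runnable:
gdp = pd.DataFrame({
    "Quarter": ["1999q4", "2000q1", "2000q2", "2000q3"],
    "GDP": [12323.3, 12359.1, 12592.5, 12607.7],
})

# Keep 2000q1 onward. String comparison works here because every label has
# the same 'YYYYqN' shape, so lexicographic order matches chronological order.
gdp = gdp[gdp["Quarter"] >= "2000q1"].reset_index(drop=True)
```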


Take a look at the first twenty entries of the cleaned GDP dataset, above. There are two columns, Quarter and GDP. The dataframe starts at the first quarter of 2000, or 2000q1. While you can't see it above, the last entry is 2016q2.

**Housing Prices Dataset**

Like with the GDP data, it helps to first take a look in Excel to see what the table looks like in spreadsheet software.

This dataset contains the median house price for houses sold within a subset of cities in each state. The data is split by month, from 1996-04, or April 1996, to 2016-08, or August 2016. Here is how it looks right after being imported into Python. This is just the first ten rows; the entire dataset has 10,730 rows and 251 columns.


In my final dataset, I want just the State and City columns, plus a column for each quarter (not month) from 2000q1 to 2016q2. Additionally, the State column in this dataframe contains acronyms for each state rather than the full name. For example, 'New York' is written as 'NY'. My `unitowns` dataframe contains the full name of each state, and the two need to match for merging down the road. I use the dictionary `states` to replace the acronyms with the full names.

When converting from months to quarters, the quarter is the mean of the corresponding month's house prices. For example, the column 2000q1 will be the average of the columns 2000-01, 2000-02, and 2000-03.

The dictionary and cleaning for this dataset are below.

*The dictionary `states` was provided by Coursera.*
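The month-to-quarter step can be sketched as follows on a tiny hypothetical stand-in frame; the real `house` dataframe has a price column for every month from 1996-04 to 2016-08, and the `states` dictionary would be applied first, e.g. `house["State"] = house["State"].map(states)`.

```python
import pandas as pd

# Stand-in: one city, the three months of 2000q1.
house = pd.DataFrame({
    "State": ["New York"], "City": ["New York"],
    "2000-01": [300000.0], "2000-02": [302000.0], "2000-03": [304000.0],
})

# Relabel each monthly column with its quarter (e.g. '2000-02' -> '2000q1'),
# then average the columns that share a quarter label.
months = pd.to_datetime(house.columns[2:])
prices = house.iloc[:, 2:].copy()
prices.columns = months.to_period("Q").strftime("%Yq%q")
quarterly = prices.T.groupby(level=0).mean().T

house = pd.concat([house[["State", "City"]], quarterly], axis=1)
```

Each quarter column is the mean of its three monthly columns, matching the rule described above.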

Taking a look at the first twenty entries of the `house` dataframe below, cities are now grouped by state in an orderly fashion, and the columns now hold quarterly median house prices from 2000q1 to 2016q2.


**Finding the Recession in the Data**

A recession is defined as two consecutive quarters of falling GDP. Similarly, I'll define a recovery as two consecutive quarters of increasing GDP. In the cells below, I use the GDP dataframe to find the quarter where the recession started, the quarter where it ended, and the quarter where GDP was lowest during the recession. These are labelled `recession_start`, `recession_end`, and `recession_bottom`, respectively.
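The search for these three quarters can be sketched as follows, on illustrative GDP values (the notebook runs this logic on the full `gdp` dataframe):

```python
import pandas as pd

# Illustrative GDP values around the crisis.
gdp = pd.DataFrame({
    "Quarter": ["2008q1", "2008q2", "2008q3", "2008q4",
                "2009q1", "2009q2", "2009q3", "2009q4"],
    "GDP": [14889.5, 14963.4, 14891.6, 14577.0,
            14375.0, 14355.6, 14402.5, 14541.9],
})

g = gdp["GDP"]
recession_start = recession_end = None
for i in range(2, len(gdp)):
    # Two consecutive falling quarters start the recession...
    if recession_start is None and g[i] < g[i - 1] < g[i - 2]:
        recession_start = gdp["Quarter"][i - 1]   # first falling quarter
    # ...and two consecutive rising quarters end it.
    elif recession_start is not None and g[i] > g[i - 1] > g[i - 2]:
        recession_end = gdp["Quarter"][i]
        break

# The bottom is the lowest GDP between start and end.
between = gdp[(gdp["Quarter"] >= recession_start) &
              (gdp["Quarter"] <= recession_end)]
recession_bottom = between.loc[between["GDP"].idxmin(), "Quarter"]
```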


**The 2008 Financial Crisis - A Visualization**

I verify that the recession start, end, and bottom quarters are correct by taking a look at a small section of the GDP dataframe, below. It can be visually verified that GDP peaked in 2008q2, fell to a local minimum at 2009q2, and rose every quarter thereafter.


Looking at data from further back, 2000q1-2016q2, the magnitude of this recession becomes clearer.


Here's a quick look at the summary statistics for GDP throughout the recession, 2008q3-2009q4.


**The Recession's Effect on Housing Prices**

Here's a look at the average median house price by state throughout the recession. Interestingly, many states already had falling house prices at the recession's start. In the next dataframe, we'll look at the quarters leading up to the recession.


Looking at the quarters leading up to the recession, the quarter in which median house prices started to fall can be identified for each state. For Alabama, for example, prices peaked in 2007q4. For New York, a state with approximately double the average median house price, prices peaked in 2007q3.


**Is there a difference between university and non-university towns? Are these statistically different?**

To answer this question, I first calculate the measurement to be used in the t-test, `housing_recession_diff`. It is the difference between the 2008q2 and 2009q4 median house prices at the city level. I take the quarter before the recession start to capture the full effect of the recession. Here is a look at the resulting dataframe.
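The measure itself is a simple column subtraction; the prices below are hypothetical stand-ins for the real `house` dataframe.

```python
import pandas as pd

# Stand-in prices for two cities at the quarter before the recession
# (2008q2) and at the recession's end (2009q4).
house = pd.DataFrame({
    "State": ["New York", "Alabama"],
    "City": ["New York", "Auburn"],
    "2008q2": [520000.0, 150000.0],
    "2009q4": [480000.0, 145000.0],
})

# Positive values mean the median price fell over the recession.
house["housing_recession_diff"] = house["2008q2"] - house["2009q4"]
```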


Now, I label all cities as university towns and non-university towns.
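One way to sketch this labelling step is a left merge against `unitowns` with `indicator=True`, which flags which (State, City) pairs appear in the university-towns list. The frames here are tiny hypothetical stand-ins, and the notebook's actual method may differ.

```python
import pandas as pd

house = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "City": ["Auburn", "Birmingham"],
    "housing_recession_diff": [5000.0, 12000.0],
})
unitowns = pd.DataFrame({"State": ["Alabama"], "City": ["Auburn"]})

# '_merge' is 'both' when the city appears in the university-towns list.
merged = house.merge(unitowns, on=["State", "City"],
                     how="left", indicator=True)
merged["University Town"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")

# Split into the two groups used for the t-test.
uni = merged[merged["University Town"]]
non_uni = merged[~merged["University Town"]]
```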

This results in two dataframes. Here is a look at the top 10 rows when 'University Town' = True.


Here is a look at the top 10 rows when 'University Town' = False.


Now, I check whether the average difference in median house price differs between university towns and non-university towns.

University towns have, on average, a smaller difference in median house prices from 2008q2 to 2009q4. For university towns, the mean difference was 16,133.72 USD; for non-university towns, it was 24,276.07 USD. I test the significance of the difference between these two groups using a t-test. *However*, I am making several assumptions about the data. In order for the t-test results to be valid, these assumptions should be verified; I will do so at a future time, as I have yet to learn how to check them in Python. Additionally, with this many observations in both groups, you would typically run a z-test rather than a t-test.
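The test itself is a single `scipy.stats.ttest_ind` call on the two groups' `housing_recession_diff` values. The arrays below are random stand-ins centred on the group means reported above; the sample sizes and spreads are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical per-city price differences for the two groups, centred on
# the reported means (16,133.72 vs 24,276.07 USD).
rng = np.random.default_rng(0)
uni_diff = rng.normal(16133.72, 8000, 250)
non_uni_diff = rng.normal(24276.07, 12000, 9000)

# Welch's t-test (equal_var=False) does not assume equal group variances.
tstat, pvalue = stats.ttest_ind(uni_diff, non_uni_diff, equal_var=False)
```

A negative `tstat` indicates the first group's mean is lower, matching the direction of the difference reported above.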


The t-test evaluates whether the difference between the two groups' average changes in median house prices from 2008q2 to 2009q4 is significant. The null hypothesis (H0) is that the two group means are equal; the alternative hypothesis (H1) is that they are not. If the means are equal, being a university town has no effect on the change in median house price.

The t-test returns a t-statistic of -3.5 and a p-value of 0.0004. Because 0.0004 < 0.01, we reject the null hypothesis that the two population means are equal at the 1% significance level. At this significance level, being a university town does significantly affect the change in median house price from 2008q2 to 2009q4. Furthermore, the negative t-statistic is consistent with the group means above: university towns saw a smaller drop in median house prices from 2008q2 to 2009q4 than non-university towns.