Analyze I - Aline Santoso
The data is split into three CSV files:
The votes dataframe has 3141 rows and 7 columns while the data dataframe has 3141 rows and 54 columns.
For 5 of the columns in the dataset, provide answers to these questions and create histograms of the column values if they're numerical or bar charts if they're categorical. While you're required to do this for 5 of the columns, we encourage you to mentally answer these questions for every column in the dataset.
First, let's focus on the votes dataframe and take a look of what they contain.
Let's answer some of the questions. As a note, my column name is in the format of "dataframe.column_name"
|Column Name||What does the column measure?||Absolute/Relative?||Categorical/Numerical?|
|data.State||The state where each county is located||N/A||Categorical|
|data.Median Earnings 2010||The median earnings of the population in 2010 in each FIPS code i.e. in each county||Absolute||Numerical|
|data.Graduate Degree||The percentage of population in each county who has graduate degree||Relative||Numerical|
|data.Votes||The number of votes in each FIPS code i.e. in each county||Absolute||Numerical|
|votes.votes16_trumpd||Number of votes for Trump||Relative||Numerical|
As can be seen from the histogram above, some of the data is positively-skewed distributed (also called right-skewed distribution). I will address this in the cleaning section of the data.
What other cleanup steps would be helpful here? For each cleanup step you take, describe what you've done, why you've chosen to take this step, and include the code in the cell of your Jupyter notebook.
Personally speaking, I don't like the how the names in "data" dataframe has different formats for each column. For example, there are columns with the names properly formatted such as "Adults 65 and Older Living in Poverty" and columns improperly formatted such as "Child.Poverty.living.in.families.below.the.poverty.line". I'm going to reformat these columns by removing the special characters such as "." or "_" and title capitalizing it.
In general, there are 3 ways to deal with null/missing values: 1. Ignore or drop the records with missing values 2. Fill in the values with mean/mode/median or the nearest neighbor 3. Use the prediction model to predict the values of the missing data
Before I start dealing with the missing values though, let's check the number of null values in each column for the votes dataframe:
Since the percentage of null values in the votes dataframe is really small, we can safely drop the rows with the missing values. Furthermore, since all the rows for these missing values are the same for both Trump and Clinton, just in case, let's flag them with -1.
Let's check the number of null values in each column for the "data" dataframe:
From the information above, the State column has no null values while the ST column has 1 null value. Let's find the name of the state that is related to the missing "ST" value and fill the missing "ST" value with the correct state abbreviation.
Let's check the percentage of missing values in the "data" dataframe.
The column "Precincts" and "Votes" are both missing 32 values and the percentage of null values for these two columns are approximately 1%. With such a low percentage, let's flag these categories and fill them with -1.
From the table above, we can see that the missing values are from the same rows, let's drop these rows now.
For other missing values, I will flag it with -1 values.
We can generally see which ones are normally distributed from the histograms above. Hence, I will use mean to replace missing values for these columns: 1. Preschool Enrollment Ratio Enrolled Ages 3 And 4 2. Low Birthweight 3. Teen Births 4. Children In Single Parent Households 5. Adult Smoking 6. Injury Deaths 7. Infant Mortality
For the skewed data, I will use median to replace the missing values for these columns: 1. Sexually Transmitted Infections 2. HIV Prevalence Rate 3. Violent Crime 4. Homicide Rate
There are 6 columns with duplicates. Now, let's drop these duplicates:
As previously mentioned, some of the distributions are highly skewed. For analysis and modelling purposes, it's advisable to transform the dataset to be statistically close enough to a Gaussian dataset. When the dataset is Gaussian, statistical analysis tests such as t-test and ANOVA can then be applied to the dataset. As such, I'm going to perform a logarithmic transformation to some of the skewed data. Since population and racial demographics might have important effect on the vote results for this exercise, I'm going to make a list of column names with population in it and see their distribution.
As you can see, majority of the population data is very skewed. For this exercise, let's use logarithmic transformation to normalize the distribution of the "Total Population". The same transformation can be done for other skewed data as well.
I'm going to insert the relative vote count for each party in the election to the existing votes dataframe.
Now that we've gone through the process of cleaning our data and preparing it for analysis, we can begin the analysis process. See below for the key questions you'll be answering as part of this assign,ent.
In this part we will perform some statistical analysis on the data.
To find out which data columns are correlated to the Republican vote share, I will use t-test for correlation coeefficents from the stats library.
Even with the corrections, there are still a lot of variables which are significantly correlated to the Republican vote shares. Let's filter them out using their correlation coefficient (r-values):
That looks better so the ones who have a positive significant correlation with the 2016 Republican vote share is the White Population, White Asian Population and Sire Homogeneity. However, Homicide Rate, Children In Single Parent Households, Asian American Population, African American Population, At Least Bachelors'S Degree, Hiv Prevalence Rate, Graduate Degree and Sexually Transmitted Infections have a negative significant correlation. In general, population and education correlates either positively or negatively to the Republican vote share.
To find out whether these trends were present in 2012 and 2008 election, let's put them all together in a dataframe and visualize them. Before this though, let's calculate the r- and p-values for the year of 2008 and 2012.
Let's plot these values into a stacked bar chart and heat map to see whether the trends hold up for the other two elections.
From the bar plot above, we can see that generally 2016 election is more polaried than the other two elections, meaning those variables that are correlated (either positively or negatively) to the Republican relative vote share in 2008 and 2012 become even more so in 2016. Some exceptions include variables such as Uninsured, Teen Births, Adults 65 And Older Living in Poverty, At least High School Diploma, Children Under 6 Living in Poverty, Native American Population, Adult Smoking and Service Occupations. These variables have a lower correlation (either positively or negatively) in 2016 than in 2008 and 2012. However, these categories have correlation values less than abs(0.3). This can be illustrated further by using a heat map.
We can see here that the as we go from 2008 to the 2016, the map goes lighter if it's positively correlated and goes darker if it's negatively correlated. The trends in 2016 also exist in 2012 and 2008, such as White (Not Latino) population still have the highest positive correlation to the Republican relative vote share. White Asian Population is also consistently high but since it's a sum of Asian and White Population, this is not a surprising trend.
While counties with high Graduate Degree have high negative correlation in 2012 and 2008 elections to the Republican relative vote share, it's even more negatively correlated in 2016. This trend is similar to counties with high percentages of people having "At least Bachelor's Degree". Sexually Diseases are also consistently negatively correlated.
What's interesting is the variables that show a jump in the correlation such as Production Transportation And Material Moving Occupations, whose correlation coefficient doubles positively in 2016. On the opposite end, Latino Population and Management Professional And Related Occupations also double negatively in 2016 i.e. they vote Democrats more than they did in 2012 and 2008. It is important to note that in the correlation values for these categories are relatively small i.e. ~0.2 and ~-0.2 respectively.
Since Total Population is the sum of all racial demographics and White Asian Population is the sum of the White Population and Asian, I'm going to ignore them for analyzing the trend in racial demographics.
From the p-values, all the racial demographics are significantly correlated to the Republican vote share, either positively or negatively.
For White Population, they are consistently and positively correlated with the Republican vote share throughout the three elections. African American, Asian American, Latino and Other Races are negatively correlated throughout the three election years. Native American correlation coefficient was more positively correlated in 2016 even though their correlation coefficients were small.
Looking at the bar chart and the heat map, one trend that can be seen is that the 2016 election polarizes the racial demographics even more. The demographic that is positively correlated to the Republican relative vote share in 2012 and 2008 becomes even more so in 2016 and vice versa.
Let's merge the two dataframes first into one dataframe called merged_df and then group them by state.
Now I will create a pivoted multi-index dataframe, where the columns are the Racial Demographics and the Year, while the row index is the state. p-values less than 0.05 is filtered out in this dataframe.
Let's chart a simple line graph and see whether the trend holds. Please note that there will gaps in the line graph because for those areas where the correlation coefficient is not significantly correlated, there will be gaps.
As previously seen before, there is a divide between the population groups. White (Not Latino) Population has a high positive correlation compared to the rest of the groups. Asian American and African American generally have a low correlation to the Republican vote share across the states. Latino and Native American has a lot of missing values since their p-values show that their correlation values are not signicantly correlated.
Comparing from state, it is interesting to see that depending on the states, the correlation coefficient changes. For example, in Texas, African American and Asian American correlation value is the highest amongst all the states. In California and Oregon, the correlation values for White (Not Latino) Population drops down to one of the lowest numbers.
It's important to note that the demographics that have non-significant correlation values might be due to the low number of population in there. For example, Latino and Other Race have significant correlation values in Mississipi probably because their population is concentrated there.
Now, let's compare the data for each election year in a heat map. I will also turn off the scrollable auto window for ease of readibility
From the heatmap above, we can see that for the majorities of the states, the Native American Population are not significantly correlated to the Republican vote share.
For the White (Not Latino) Population, we can see a clear distinction in correlation values between different states. For example, states that are consistently Republican in 2008 and 2012 like Alabama, Maryland and South Carolina become even more positively correlated to the Republican. While states that have low correlation coefficients lean slightly more towards the Republican in 2016 such as in California, Oregon, Iowa and Kansas etc.
Majorities of the correlation coefficients for the Native American are not significantly correlate but the trends for the ones that are present remain consistent between each election year and divided dependent on the states. For example, Arkansas has a relatively high positive correlation coefficient (0.5) throughout while South and North Dakota have a high negative correlation coefficient (-0.8).
Asian American and African American Population have a similar trend, where they have a high negative correlation. Texas is an exception for African American Population, where the correlation value becomes more sensitive in that state. It is worth nothing for in general, for both of these demographics, the correlation values decrease in 2016. This can be seen in Connecticut and and Wisconsin for Asian American demographics.
Other races and Latino population are either neutral, where their correlation coefficients are around 0 or slightly negative. The stand out are states such as Alabama, Arkansas and Misssissipi.
In this part we will perform some more advanced statistical analysis on the data.
Visualizing the percentage share of the Republican votes in each county is difficult if we look at each point individually since there are more than 3000 data points. I will use histogram here to see the distribution for the percentage share of the Republican votes in each election.
First, let's plot these histograms together and see whether they're similar to each other.
There seems to be a difference in the distribution between each election year from the histogram. I will use a one-way ANOVA to see how similar the distribution for the Republican Relative Vote between each year is. I'm defining similarity here as how similar the distribution is. Essentially, my null hypothesis is the mean for the distribution in each election year is statistically equal to each. Setting my critical p-value to 0.05 as usual, let's run the one-way ANOVA to see the F-value.
The p-value is less than p_critical of 0.05 so I'm rejecting the null hypothesis that mean-value is the same for these three election years and that they are significantly different statistically.
For this question, let's aggregate the data by the mean values first and plot the histogram for each election year.
Let's do the one-way ANOVA for this data that is grouped by states and then two-sampled, two-tailed t-test on the 2012 and 2016 elections.
From the one-way ANOVA, the p-value is < 0.05 signifying that the mean of distributions for the three elections is significantly different. However, the two-sampled t-test between the election year of 2012 and 2016 has a p-value > 0.05, which means that the mean of the distribution between the election year 2012 and 2016 is similar when we grouped the data by states. The p-value of the t-test between the election year of 2008 and 2016 indicate that the mean of the two election years is significantly different from each other.
Now, let's look at how the data looks like for each state in a line graph.
We can see from the line graph that each election generally follows the same trend across each state. Those states that have a high Republican vote share in 2008 and 2012 still have high Republican vote share in 2016. Looking from this perspective, we can say that the three elections are similar.
However, in 2016, the Republican vote share is even more amplified compared to the rest of the years. Those states that have Republican vote share around 50-60% jumped to 60 - 70% in some states such as in Ilinois, Indiana, Iowa, Michigan, Minnesota, North Dakota and Ohio. From this perspective, the election in 2016 differs in the unexpected jump of Republican vote share in some of the states.
Create a model which measures the fraction of votes a party (Republican) gets within a county based on the county data. When building your model keep the following points in mind:
The dependent variable for this exercise is the fraction of votes that Republicans for the year 2008, 2012 and 2016.
If the goal of the model is to predict the outcome of the election in 2016, it also makes sense to use the past election data in 2008 and 2012 as the train set and the 2016 data as the test set so I'm going to average the fraction of votes for 2008 and 2012.
The independent variables are those variables that have high correlation values. From our correlation dataframe before, we can apply a cut-off correlation coefficient. Let's include only those variables with correlation coefficient greater than 0.3 and less than -0.3.
White Asian Population is the sum of the White and Asian Population while the Total Population is the sum of all population. Including these columns are going to bias our model so let's drop these.
Let's see how many independent variables we end up using for the model.
The fraction of votes that the Republican party gets within a county is a continuous value between 0 and 1. Logistic regression model is suitable in this case. Even though our final result is a continuous value in the form of fraction instead of discrete values of 0 and 1, this problem is essentially a classification problem with a binary dependent variable of Republican vs. Democrat. For this case though, the binary dependent variable is just transformed to a continuous fraction value between 0 and 1.
Linear regression can also be suitable for this sort of problem. However, to use linear regression for this model, we have to transform the dependent variable from y to z where: @@0@@ We can then apply linear regression to z.
I will do both to see how both models perform.
To fit the fraction of votes that the Republican party gets within a county i.e. the "y-value", I will two different cases. First, I will use the average fraction of votes from all the election years in 2008, 2012 and 2016. In the second one, I will only use the average fraction of votes for only the years in 2008 and 2012.
In the real-world scenario where the model is supposed to predict the outcome of the 2016 election, it makes sense to fit only the Republican fraction votes data from 2008 and 2012 as our training data.
Let's put the logistic prediction values together in a dataframe.
Let's put the linear prediction values together in a dataframe.
To evaluate the fit of the model, I can evaluate it on the average Republican fraction of votes or just on the election year of 2016. Hence, I have 4 different choices of prediction values and two y-true-values I can evaluate my model on.
Using the @@0@@ value to evaluate the fit of my model, I will have 8 different @@1@@ depending on the combinations of prediction values and y-values I will use them on.
From the two tables above, we can see that once set up properly, the performance of both the linear and logistic regression is similar to each other. The best @@0@@ for both models happens when we're fitting the predicted values of Republican fraction votes for all election years (2008, 2012 and 2016) and evaluate them against the true values Republican fraction votes for all election years i.e. @@1@@ is ~75%. The worst @@2@@ is when we fit only the 2008 and 2012 Republican fraction votes and evaluate them against the true values of 2016 Republican fraction votes i.e. @@3@@ is ~50%.
Looking back at the analysis of Republican Relative Vote across all states, this result makes sense. We can already see that these three election years are significantly different from each other so using only 2008 and 2012 data will result in a lower @@4@@ value.
Furthermore, the relatively low @@5@@ values can be explained by the large number of independent variables used in the model. Using the regression models, I assumed that the independent variables are independent of each other and are not correlated to each other but it might not be the case, i.e. multicollinearity might exist in the model.
Personally, I don't think using @@6@@ value is enough to evaluate the performance of each model. To truly evaluate the performance of the model, predicting the electoral votes for each state will be the correct evaluation. Electoral vote prediction is a binary classification problem since it's a winner takes all. As such, the accuracy score for the logistic regression model can be evaluated for this.
One way I would deal predict the vote share for counties that only contained the racial data is to set up a regression model with only the racial data as the independent variables to predict what the vote share would be. However, with only the racial data as the independent variable, I would predict that the model performance will be poor.
Another way is to fill in the missing data points by predicting the missing values using regression models. Once the missing values are filled in, we can then include the data in our model to predict the vote share for those counties.
Another way, which is quite popular to fill in missing data points is to use the machine learning algorithms such as k-nearest neighbors (KNN) and random forest. For KNN, an example would be to predict the Graduate Degree fraction in that particular county by assuming that the Graduate Degree fraction of other counties with similar racial data will be the same.
In the previous deliverable, you build a model that predicts the fraction of votes that the Republican party gets within a county. The goal of this deliverable is create a model that predicts the outcome of an entire election (i.e. predicts who will become president).
Each of the 50 US states gets a certain number of electoral college votes, (this info is contained in the ec.csv file). The party that gets the majority of the vote share in a state receives all of that state's electoral college votes. The overall presidential winner is the party that gets the most electoral college votes.
There are many ways you can approach this challenge, but one possible way involves using your county-level preimpordictions from the model that you build last week to determine the vote share of a party within each state.
Whichever method you choose, describe to us the reasoning behind your model. Fit your model on the 2008 and 2012 election date and then use your model to predict the outcome of the 2016 election. Does your model predict the winner accurately?
Using 2008 and 2012 election data as training set, I'm going to convert the data format from "wide" to long
The independent variables are those variables that have high correlation values. From our correlation dataframe before, we can apply a cut-off correlation coefficient. Let's include only those variables with correlation coefficient greater than 0.2 and less than -0.2 and only with p-values < 0.05
Let's put the logistic prediction values together in a dataframe.
Each county has a different number of total votes and a state, as a consequence has a different total number of states. Hence, our predicted probabilities of Republican vote fraction in each county (y_predict) needs to be weighted based on the fraction of total votes in each county over the total votes where the county is in.
Electoral votes is a "winner-takes-all" in a state-all. The party that has the majority of the votes win all the electoral college votes. The threshold is hence 0.5.
To calculate the y_predict per state, I will group all the counties by the state and calculate the sum of the weighted y_predict per county by state.
In this section, I will calculate the accuracy of my model based on how many states I got the electoral college votes correct or not.
Since each state has a different number of electoral votes, I will also calculate a weighted accuracy where if I predict the states with a high number of electoral college votes such as Texas, it'll matter more than states such as North Dakota.
As can be seen from the result above, the accuracy of the model is 82.35% while the weighted accuracy is 90% right. This is because the model predicts the states with the high number of electoral votes correctly, hence, resulting in higher weighted accuracy. These values look good enough for me so I'm leaving the model as it is. Furthermore, the model also predicts that the Republican wins the 2016 election. The prediction is quite borderline though, because if we actually drop the missing votes in Alaska, the model wasn't able to predict the votes correctly and would actually predict that the Democrats would win. However, since we flag Alaska instead of dropping them, the model is able to predict the votes correctly for Alaska.
The real total number of electoral votes is actually 306 electoral votes, not 302 but some states such as Maine actually split their electoral votes. Furthermore, the state of Alaska is missing in the data, which has 3 electoral votes that Republican did win. Hence, with Maine and Alaska, it'll be 306 votes. Despite the missing data in Alaska, the model predicts correctly that Republican won Alaska.