Education Analysis in Portugal

Students enrolled in a Portuguese and Math Class.

Skills Shown:

Machine Learning: Regression & Classification
Data Visualization
Dimensionality Reduction

Introduction:

I have long held an interest in education, and I will be working in the education field this summer. One of the most important goals in education is improving assessment scores and final grades. It is a particularly tricky task because students differ along so many characteristics. The datasets I analyzed describe achievement for Portuguese students at the secondary education level. There were two datasets: the first covers students in a Portuguese class and the second covers students in a math class. Many of my questions revolve around the G3 (final grade) variable present in both, as it is usually considered the most important variable in education assessment. I was also curious whether I could cluster students, since different clusters might learn differently in the classroom. Finally, I aimed to see if I could predict the “famsup” (family educational support) variable using a classification algorithm. My reasoning was that students who do not have family support for their education are likely not putting an optimal amount of effort into school.

Questions I asked:

1. How does the distribution of final grades for the math class compare to the final grades of the Portuguese class?
2. Which variables are highly positively/negatively correlated with final grades in the math class and the Portuguese class?
3. Can I use a linear regression model to predict G3 scores for both classes?
4. Do students who drink heavily on the weekend get worse final grades than those who do not drink at all on the weekend?
5. Can students in the math class be clustered into groups?
6. Can I create a classification model to predict family support for students in the Portuguese class?

Data Sources:

The two datasets I used were acquired from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/student+performance).

There were 33 variables in each dataset. According to the UCI documentation, the math dataset has 395 students and the Portuguese dataset has 649. The most important variables in both datasets were:
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
famsup - family educational support (binary: yes or no)
Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

Importing the necessary libraries

Question 1: How does the distribution of final grades for the math class compare to the final grades of the Portuguese class?

Conclusion: As you will see from my code, students do better, on average, in the Portuguese class. The histogram of Portuguese G3 values is also left-skewed. According to the QQ plots, neither distribution is normal.

I found that, on average, students score around 10.4 in the math class and 11.9 in the Portuguese class (out of 20). The median and mode are likewise higher for the Portuguese class. The two distributions are similar in that neither is normal; however, G3 scores for the Portuguese class are clearly more left-skewed (that graph is below). Implications: students do better, on average, in the Portuguese class, and I would not consider either group to have normally distributed data.
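
The comparison above rests on a handful of summary statistics. A minimal sketch of how they could be computed with pandas follows; the helper name `summarize_grades` and the synthetic grades are assumptions for illustration and are not the notebook's original code or data.

```python
import numpy as np
import pandas as pd


def summarize_grades(g3: pd.Series) -> dict:
    """Return centre and shape statistics for a column of final grades."""
    return {
        "mean": float(g3.mean()),
        "median": float(g3.median()),
        "mode": float(g3.mode().iloc[0]),  # smallest value if several modes tie
        "skew": float(g3.skew()),          # negative values indicate a left skew
    }


# Synthetic stand-in grades on the 0-20 scale (NOT the UCI data).
rng = np.random.default_rng(0)
demo_g3 = pd.Series(np.clip(np.round(rng.normal(12, 3, size=500)), 0, 20))
summary = summarize_grades(demo_g3)
print(summary)
```

Running the same helper on the G3 column of each class would give the two sets of numbers to compare side by side.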

Question 2: What variables are highly positively/negatively correlated for final grades in the math class and Portuguese class?

Conclusion: In the following section I computed correlation matrices for both datasets. After sorting the values in descending order, I found that, for the most part, the variables rank in a very similar order of correlation with G3 in both datasets. G2, G1, and Medu are all strongly positively correlated with G3 in both, while failures is strongly negatively correlated in both.
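
A minimal sketch of the ranking step described above, assuming Pearson correlation and a pandas dataframe with a numeric G3 column. The function name `rank_correlations` and the synthetic data are illustrative assumptions; the real notebook would pass the math or Portuguese dataframe instead.

```python
import numpy as np
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str = "G3") -> pd.Series:
    """Pearson correlation of each numeric column with `target`, high to low."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Synthetic stand-in data (NOT the UCI data), built so grades track each other.
rng = np.random.default_rng(1)
n = 300
ability = rng.normal(size=n)
g1 = 10 + 3 * ability + rng.normal(size=n)
g2 = g1 + 2 * rng.normal(size=n)
demo = pd.DataFrame({
    "school": ["GP"] * n,  # non-numeric column, ignored by the ranking
    "G1": g1,
    "G2": g2,
    "G3": g2 + rng.normal(size=n),
    "Medu": np.clip(np.round(2 + 0.5 * ability + rng.normal(size=n)), 0, 4),
    "failures": (ability < -0.5).astype(int),
})
ranked = rank_correlations(demo)
print(ranked)
```

Sorting in descending order puts the strongest positive correlates first and the strongest negative correlates last, which makes the two classes easy to compare.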

Question 3: Can I use a Linear Regression model to predict G3 scores for both classes?

Conclusion: I would say the answer to this question is yes, but with a caveat. I fit an OLS regression for each class, including the variables that had an absolute correlation of at least .2 with the G3 score in their respective dataset. Both models had a very high R^2 value (north of .8). However, I question these results because the residuals are not normally distributed and do not look random, which violates the OLS assumptions.
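
The selection rule and the fit can be sketched as follows. This is an illustration under stated assumptions: it uses NumPy's least-squares solver in place of whatever OLS routine the notebook used, and the helper names (`select_predictors`, `fit_ols`) and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def select_predictors(df, target="G3", threshold=0.2):
    """Numeric columns whose absolute correlation with `target` is >= threshold."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()


def fit_ols(df, predictors, target="G3"):
    """Ordinary least squares with an intercept; returns (beta, R^2, residuals)."""
    X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
    y = df[target].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return beta, r2, residuals


# Synthetic stand-in data (NOT the UCI data).
rng = np.random.default_rng(2)
n = 400
g1 = 10 + 3 * rng.normal(size=n)
g2 = g1 + rng.normal(size=n)
demo = pd.DataFrame({
    "G1": g1,
    "G2": g2,
    "absences": rng.integers(0, 30, size=n).astype(float),  # unrelated to G3
    "G3": 0.3 * g1 + 0.7 * g2 + rng.normal(size=n),
})
predictors = select_predictors(demo)
beta, r2, residuals = fit_ols(demo, predictors)
print(predictors, round(r2, 3))
```

The returned residuals are what the histogram, QQ plot, lag plot and run sequence plot would then be drawn from.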

I found that without G1 and G2 included in the models, OLS returns a low R^2 value.

OLS for Math Class: independent variables are failures, Medu, G2 and G1. R^2 value of .824.

Residuals For Math Class: Plotted histogram, qq plot, lag plot and Run Sequence plot.

OLS for Portuguese Class: independent variables are failures, Medu, G2, G1, studytime, Fedu and Dalc.

Residuals for Portuguese Class: Plotted histogram, qq plot, lag plot and Run Sequence plot.

OLS without G1 or G2: kept other variables

Question 4: Do heavy weekend drinkers get worse final grades (G3) than those who do not drink at all on the weekends?

Conclusion: For this section, I found that heavy weekend drinkers (either 4 or 5 for the Walc variable) did worse, on average, than those who did not drink on the weekend. A Mann-Whitney U test gave a statistically significant p-value.

I first created two dataframes: one for heavy drinkers and one for non-drinkers. I then used a Mann-Whitney U test to determine whether the difference in medians was significant (I did not use a t-test because the data was not normally distributed). The p-value was below .05, so I judged the difference to be significant. I then calculated the means for both groups and saw that the students who did not drink did better on average.
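
A minimal sketch of this comparison using `scipy.stats.mannwhitneyu`. Two things here are assumptions: treating `Walc == 1` (the lowest level, "very low") as the non-drinking group, and using a two-sided alternative. The helper name `compare_drinkers` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_drinkers(df, alpha=0.05):
    """Mann-Whitney U test of G3 for heavy (Walc 4-5) vs. lowest (Walc 1) drinkers."""
    heavy = df.loc[df["Walc"] >= 4, "G3"]
    none = df.loc[df["Walc"] == 1, "G3"]  # assumption: level 1 stands for non-drinkers
    _, p_value = mannwhitneyu(heavy, none, alternative="two-sided")
    return {
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
        "mean_heavy": float(heavy.mean()),
        "mean_none": float(none.mean()),
    }


# Synthetic stand-in data (NOT the UCI data): heavy drinkers score lower by design.
rng = np.random.default_rng(3)
n = 600
walc = rng.integers(1, 6, size=n)
g3 = np.clip(rng.normal(12 - 2 * (walc >= 4), 3), 0, 20)
result = compare_drinkers(pd.DataFrame({"Walc": walc, "G3": g3}))
print(result)
```

The test is rank-based, so it does not require the grades in either group to be normally distributed.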

Null Hypothesis: There is no difference in median Math G3 score for students who are heavy drinkers compared to those who are not.

The p-value is less than .05, so there seems to be enough evidence to reject the null hypothesis.

Performing Dimensionality Reduction on math_class data using Factor Analysis

This will be relevant for my clustering section.

Before using Factor Analysis, I normalized my math class data using the preprocessing module from scikit-learn.

I created two factors because I wanted to use the factors to visualize the K Means clustering later in this notebook.
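
A rough sketch of the normalize-then-reduce step, assuming `StandardScaler` as the normalization (the notebook may have used a different function from `sklearn.preprocessing`) and scikit-learn's `FactorAnalysis` with two components. The helper name `two_factor_scores` and the synthetic matrix are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler


def two_factor_scores(X, random_state=0):
    """Standardize the columns of X, then reduce them to two factor scores."""
    X_scaled = StandardScaler().fit_transform(X)
    fa = FactorAnalysis(n_components=2, random_state=random_state)
    scores = fa.fit_transform(X_scaled)  # one (factor 1, factor 2) pair per student
    return scores, fa.components_        # components_ holds the factor loadings


# Synthetic stand-in matrix (NOT the UCI data): 200 rows, 6 numeric columns.
rng = np.random.default_rng(4)
latent = rng.normal(size=(200, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(200, 6))
scores, loadings = two_factor_scores(X_demo)
print(scores.shape, loadings.shape)
```

The two score columns can serve as the x and y axes of a scatter plot coloured by cluster label.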

I was aiming to create two factors that were determined by students.

Question 5: Clustering. Can math students be grouped into different clusters based on numeric variables?

Conclusion: For this part, I found that both clustering methods gave me similar results: students from the math class can be clustered into two groups, where one group is much larger than the other. I would say that these students are not easily clustered into distinct groups, especially when plotted against the two factors from my factor analysis.

Hierarchical Clustering

I decided to use both forms of clustering we learned in class to see if the results would be consistent. I first created a dataframe of only the numerical variables. I then created a square-form distance matrix using Euclidean distance as my metric (I got similar results with cosine) and used the single linkage method (I got similar results with other linkage methods). Finally, I graphed a dendrogram, which showed two clusters, one much larger than the other.
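
The steps above can be sketched with SciPy as follows. The helper name `single_linkage_labels` and the synthetic points are assumptions; the cut into two clusters via `fcluster` is added here only to make the dendrogram's two groups explicit.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def single_linkage_labels(X, n_clusters=2, metric="euclidean"):
    """Single-linkage clustering; returns cluster labels and the square distance matrix."""
    condensed = pdist(X, metric=metric)      # pairwise distances, condensed form
    square = squareform(condensed)           # square form, handy for inspection
    Z = linkage(condensed, method="single")  # linkage expects the condensed form
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return labels, square


# Synthetic stand-in points (NOT the UCI data): one large group and one small group.
rng = np.random.default_rng(5)
X_demo = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(20, 1, (10, 3))])
labels, square = single_linkage_labels(X_demo)
print(np.bincount(labels)[1:])  # cluster sizes (labels start at 1)
```

Passing the linkage matrix `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree. Note that `linkage` treats a square 2-D input as raw observations, so the condensed distances are what should be passed to it.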

K-means Clustering to see if I come up with a different result

To find the number of clusters I would need, I used the silhouette plot file provided in class and found that two clusters gave the highest silhouette score. I was slightly disappointed because one cluster was much larger than the other. Additionally, the two factors I created using Factor Analysis did not do a great job of visualizing the clusters: there is a lot of overlap in the graph.
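
The class-provided silhouette script is not reproduced here. As a stand-in, a minimal sketch of choosing the number of clusters by mean silhouette score with scikit-learn might look like this; the helper name `best_k_by_silhouette`, the candidate range of k, and the synthetic points are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 7), random_state=0):
    """Fit K-means for each k and return the k with the highest mean silhouette."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in points (NOT the UCI data): two groups of unequal size.
rng = np.random.default_rng(6)
X_demo = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(10, 1, (50, 2))])
best_k, scores = best_k_by_silhouette(X_demo)
print(best_k, scores)
```

A high silhouette score at k = 2 says the two-cluster split is the cleanest of the candidates. It does not say the clusters are balanced in size.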

Question 6: Can I predict if a student in the Portuguese Class has family support using a classification algorithm?

Conclusion: For this part, I used both the Random Forests and Naive Bayes algorithms to make predictions. I had limited success with both when comparing my accuracy scores to the baseline, although Naive Bayes was slightly better. I would therefore say the answer to this question is no: neither model was great at predicting family educational support.

Before plugging any data into these classifiers, I z-scored the data so that all variables were on the same scale. To decide which variables to use in both models, I split the students into two groups: those with family support and those without. I then averaged every variable within each group and took the difference between the two group means. I included the variables whose absolute difference was above .1. The variables that I ended up including were G3, Walc, studytime, Fedu, age, and Medu.
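
A minimal sketch of this selection rule, assuming the z-scoring is applied before the group means are compared and that famsup is stored as the strings "yes" and "no". The helper name `select_by_mean_gap` and the tiny example table are hypothetical.

```python
import pandas as pd


def select_by_mean_gap(df, label="famsup", threshold=0.1):
    """Z-score numeric columns, then keep those whose group means differ by > threshold."""
    numeric = df.select_dtypes(include="number")
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    group_means = z.groupby(df[label]).mean()
    gap = (group_means.loc["yes"] - group_means.loc["no"]).abs()
    return gap[gap > threshold].sort_values(ascending=False)


# Tiny hand-made table (NOT the UCI data): Medu differs by group, "flat" does not.
demo = pd.DataFrame({
    "famsup": ["yes"] * 4 + ["no"] * 4,
    "Medu": [4, 4, 3, 3, 1, 1, 2, 2],
    "flat": [1, 2, 1, 2, 1, 2, 1, 2],
})
selected = select_by_mean_gap(demo)
print(selected)
```

Because the columns are z-scored first, the .1 threshold is measured in standard deviations and means the same thing for every variable.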

I split the dataframe into a training and testing set. 70% of the original data is in the training set.

I ran a for loop to find the depth and number of estimators that gave the best accuracy score for the Random Forests model.
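
Such a search loop might look like the sketch below. The candidate grids, the helper name `search_forest`, and the synthetic data are assumptions. The sketch scores each candidate on a 30% hold-out split; whether the notebook scored on its test set or on a separate validation set is not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def search_forest(X, y, depths=(2, 4, 6), n_trees=(25, 50), random_state=0):
    """Try every (max_depth, n_estimators) pair and keep the most accurate one."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, train_size=0.7, random_state=random_state, stratify=y
    )
    best = {"accuracy": -1.0}
    for depth in depths:
        for n in n_trees:
            model = RandomForestClassifier(
                max_depth=depth, n_estimators=n, random_state=random_state
            )
            model.fit(X_tr, y_tr)
            acc = accuracy_score(y_val, model.predict(X_val))
            if acc > best["accuracy"]:
                best = {"accuracy": float(acc), "max_depth": depth, "n_estimators": n}
    return best


# Synthetic stand-in data (NOT the UCI data).
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
best = search_forest(X_demo, y_demo)
print(best)
```

Settings chosen this way are tuned to the hold-out split, so the winning accuracy tends to be optimistic. Cross-validation on fresh folds gives a fairer estimate.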

I fit both models, and evaluated them using a 10-fold cross validation. I also plotted a confusion matrix for both of them. I found results that were slightly better than the baseline for Naive Bayes.
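
A minimal sketch of the 10-fold comparison against a baseline. Several choices here are assumptions rather than the notebook's settings: `GaussianNB` as the Naive Bayes variant, a majority-class `DummyClassifier` as the baseline, and the forest hyperparameters. The helper name `compare_to_baseline` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def compare_to_baseline(X, y, cv=10, random_state=0):
    """Mean cross-validated accuracy for a majority-class baseline and two models."""
    models = {
        "baseline": DummyClassifier(strategy="most_frequent"),
        "naive_bayes": GaussianNB(),
        "random_forest": RandomForestClassifier(
            n_estimators=50, max_depth=4, random_state=random_state
        ),
    }
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in models.items()
    }


# Synthetic stand-in data (NOT the UCI data) with an imbalanced binary label.
rng = np.random.default_rng(8)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0.3).astype(int)
results = compare_to_baseline(X_demo, y_demo)
print(results)
```

With an imbalanced label such as famsup, the majority-class accuracy is the number a model has to beat before its accuracy means anything.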

Random Forest Model

Cross Validation Results here were not good. Some of the results were below the baseline for several measures.

Naive Bayes Model

The Cross Validation Scores were slightly better for the Bayes model. However, I would still not say it gave me great results.

Future Work

I feel there is a lot of future work that could improve my findings. I would love to see a dataset with the same variables as this one but for a different age group or a different class subject; comparing my results with similar work on such data could be fascinating. In terms of improving the results of this project, there are several things I could do. Since my residuals for Question 3 were not normally distributed, I could look into nonlinear regression methods for the same task. For Questions 5 and 6, I could try different clustering and classification methods to try to get better results.