I have long held an interest in education. In fact, I will be working in the education realm this summer. One of the most important aspects of education is improving assessment scores and final grades. It is a particularly tricky task, as students differ along so many characteristics. The particular datasets that I decided to analyze tried to approach achievement for Portuguese students at a secondary education level. There were two datasets I looked at. The first corresponded to students in a Portuguese class and the other corresponded to those in a math class. A lot of my questions revolved around the G3 (Final Grade) variable present in both, as it is usually considered the most important variable in education assessment. I was also curious if I could cluster students because perhaps it was possible that these clusters learn differently in the classroom. Finally, I aimed to see if I could predict the “famsup” (Family Education Support) variable using a classification algorithm. If an individual does not have family support for his/her education, it is likely they are not putting an optimal amount of effort into school.
Questions I asked:1.How does the distribution of finals grades for the math class compare to the final grades of the Portuguese class? 2.What variables are highly positively/negatively correlated for final grades in the math class and Portuguese class? 3.Can I use a Linear Regression model to predict G3 scores for both classes? 4.Do students who drink heavily class on the weekend get worse final grades than those who do not drink at all on the weekend? 5.Can students in the math class be clustered into groups? 6.Can I create a classification model to predict family support for students in the Portuguese class?
The two datasets I used were acquired from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/student+performance).There were 33 different variablesin both datasets with 650 subjects in both. The most important variables in both datasets were: G1 - first period grade (numeric: from 0 to 20) G2 - second period grade (numeric: from 0 to 20) G3 - final grade (numeric: from 0 to 20, output target) Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) famsup - family educational support (binary: yes or no) Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
I found that ,on average, students get around a 10.4 in the math class and a 11.9 in the Portuguese class (out of 20). Similarly, median and mode are both higher for final Portuguese class grades. The distributions are similar in that they are both not normal. However, G3 scores for the Portuguese Class are definitely more left skewed (That graph is below). Implications: Students do better, on average, in the Portuguese class. I would not consider either group to have normally distributed data.
I feel as though there is a lot of future work that could be done to improve my findings. I would love to see a dataset with the same variables as this one but for a different age group or different class subject. Comparing the results I had with this project with similar work for different age groups or class subject could produce fascinating results. In terms of improving my results for this project, there are several things I could do. Since my residuals for question 3 were not normally distributed I could look into using nonlinear regression methods for the same task in the future. For questions 5 and 6, I could try different clustering/classification methods to try to get better results.