Part Three: Machine Learning Solution 1: Logistic Regression Model
Machine Learning Solution 1: Using Linear Regression
This machine learning algorithm utilizes high school students’ quantitative attributes to predict their college GPAs. The admissions team determines that this algorithm can paint a picture of how well a student will perform academically in college. Additionally, the admissions team can gain insight into their eventual student body by running this algorithm across the entire prospective high school student applications.
To begin, the admissions team cleans the dataset. The team converts all student attributes to quantitative data, and then normalizes each column. Through this process, the team can quantify each student’s deviation from the average. As a result, the team maps the “Parents Level of Education” column to numeric labels, normalizes the data, and sets the ‘college GPA’ column as our test y label (Exhibit 2).
Exhibit 2: Cleaned & Normalized Student Dataset
Next, the team determines which attributes have the strongest correlation with college grade point average by implementing a correlation matrix on the dataset. The team produces a correlation matrix between each field and college GPA, shown below as a heatmap (Exhibit 3):
Exhibit 3: Correlation Matrix Heatmap
Viewing its analysis, the admissions team deduces that high school GPA, SAT score, and ACT score are strongly correlated with college GPA. This result is consistent with the research; high school GPAs serve as the strongest individual contributor to academic achievement (Allensworth & Clark, 2020). Another resource claims high school performance & class attendance form the most powerful foundation for students, and are shown to be least likely to produce a student earning a D, F, or Withdrawl (Al Hazaa & Abdel-Salam, 2021).
The average difference in the team’s sample’s college GPA and high school GPA is xx. Generally speaking, college studies are more difficult for the student. An attribute of its difficulty is that students are often graded against a curve in their classes; the grades they earn are relative to their peers.
Finally, the admissions team computes a linear regression model. The team uses 500 iterations, an alpha of 0.1, and standard gradient descent to produce a model that has a total mean squared error of around 0.185 for 10,000 values. Thus, the model calculates the expected college grade point averages of incoming students with impressive accuracy.
This is an ongoing five part series: Read on to learn about the different chapters creating and analyzing this project!
Previous:
Introduction: “Luring Big Fish to Your Small Pond”
Part One: Context and Problem Statement
Part Two: Reason for Models and Data Collection
Current:
Part Three: Machine Learning Solution 1: Logistic Regression
Next:
Part Four: Machine Learning Solution 2: K-Means Cluster Analysis