Part Two: Reason for Models and Data Collection
Reason for Models
When determining the student’s predicted college GPA, the team selected a regression analysis over other models because it can best quantify each factor to determine a predicted float label, the GPA, rounded to two decimal places. The team predicts the college GPA to determine how well the student will perform academically from high school to college.
Additionally, our dataset does not contain labeled data signifying whether or not a student attended a “Small Pond school; thus, a K-Means cluster algorithm is the ideal use case of grouping similar data points together.
Many machine learning models require using labeled data to train our models, but this dataset allows us to clump together similar students to determine which “cluster group” they would fall into.
Materials & Data Sources
This analysis begins with data collection and data preprocessing to gain an accurate insight into the student population.
First, the team must find dataset representing high school seniors interested in attending college. The team needs key factors from each student — their grade point average (GPA) & test scores, parental income, parental level of education, and years required to graduate.
Fortunately, the team found a dataset on Kaggle. It gives a distribution of 10,000 students, each with SAT and ACT scores, GPA, and parental income of high schoolers across the United States (College GPA Prediction). Exhibit 1 gives an example of what a sample college dataset would look like:
Exhibit 1: High School Dataset
Assumptions
When conducting research of any kind, it is important to note exceptions. This paper assumes these high schoolers will perform similarly academically and their parental incomes are comparable for the next four years of college. Even though this algorithm gives an accurate prediction of a static moment in time, these students are impacted by external circumstances that make the transition to college an easy or difficult one. This objective ruling of predicted college GPA, and student cluster is based on quantitative factors.
This is an ongoing five part series: Read on to learn about the different chapters creating and analyzing this project!
Previous:
Introduction: “Luring Big Fish to Your Small Pond”
Part One: Context and Problem Statement
Current:
Part Two: Reason for Models and Data Collection
Next:
Part Three: Machine Learning Solution 1: Logistic Regression
Part Four: Machine Learning Solution 2: K-Means Cluster Analysis