Previous blog posts have described how the source dataset was created from scratch, using a mixture of tools such as R and Python. Now that we have the dataset, what are the next steps, and what do we need to consider?
Which Machine Learning Algorithm?
In this case we have a target (outcome) variable, i.e. Attendance, so a supervised ML algorithm is required. Because Attendance is a numeric value that we want to predict from several predictors, the algorithm chosen is Multiple Linear Regression.
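To make the idea concrete, here is a minimal sketch of fitting a multiple linear regression by ordinary least squares with NumPy. The data is synthetic, and the predictor names (temperature, home_rank) and the coefficients used to generate it are purely hypothetical stand-ins, not columns or values from the real dataset.

```python
import numpy as np

# Synthetic stand-in data. The predictors and the "true" coefficients
# below are assumptions made for illustration only.
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(15, 5, n)
home_rank = rng.integers(1, 21, n).astype(float)
attendance = 30000 + 400 * temperature - 500 * home_rank + rng.normal(0, 1000, n)

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), temperature, home_rank])
y = attendance

# Ordinary least squares: find the coefficients that minimise the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coef
residuals = y - fitted

# R-squared: the share of the variance in y explained by the model.
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print("intercept, temperature, home_rank:", np.round(coef, 1))
print("R-squared:", round(r_squared, 3))
```

Each coefficient is read as the expected change in Attendance for a one-unit change in that predictor, holding the other predictors fixed.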
Exploratory Data Analysis (EDA)
EDA is a key step. It involves exploring the dataset, producing initial descriptive statistics, carrying out univariate and multivariate analysis, and examining the correlations between variables.
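As a rough illustration of what that first pass might look like, the sketch below computes a few descriptive statistics per variable and a correlation matrix using NumPy. Again the data is synthetic and the variable names are hypothetical, so this shows the shape of the analysis rather than any real result.

```python
import numpy as np

# Synthetic stand-in data with hypothetical variable names.
rng = np.random.default_rng(1)
n = 200
data = {
    "temperature": rng.normal(15, 5, n),
    "home_rank": rng.integers(1, 21, n).astype(float),
}
data["attendance"] = (
    30000 + 400 * data["temperature"] - 500 * data["home_rank"] + rng.normal(0, 1000, n)
)

# Univariate analysis: basic descriptive statistics for each variable.
summary = {
    name: {
        "mean": col.mean(),
        "std": col.std(ddof=1),  # sample standard deviation
        "min": col.min(),
        "median": np.median(col),
        "max": col.max(),
    }
    for name, col in data.items()
}

# Correlation analysis: Pearson correlation between every pair of variables.
# np.corrcoef treats each row as one variable, so we stack the columns as rows.
names = list(data)
corr = np.corrcoef(np.vstack([data[k] for k in names]))

for name, stats in summary.items():
    print(name, {k: round(float(v), 2) for k, v in stats.items()})
print(names)
print(np.round(corr, 2))
```

In practice a pandas DataFrame with describe() and corr() gives the same information with less code. Plain NumPy is used here only to keep the sketch dependency-light.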
For linear regression to give reliable results, a number of assumptions need to be satisfied:
- There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s).
- There should be no correlation between the residual (error) terms.
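- The independent variables should not be highly correlated with one another. Strong correlation between predictors is known as multicollinearity.
- The error terms must have constant variance. This property is known as homoskedasticity; its absence is heteroskedasticity.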
- The error terms must be normally distributed.
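A few of the assumptions listed above can be given quick numeric checks. The sketch below is one possible approach in NumPy, run on synthetic data with hypothetical predictors: the Durbin-Watson statistic for correlated residuals, variance inflation factors (VIF) for correlated predictors, an informal check of residual spread against fitted values, and residual skewness and excess kurtosis as a rough normality check. The helper names are our own, and these checks are rules of thumb rather than formal tests.

```python
import numpy as np

def ols(X, y):
    """Fit OLS; X must already contain an intercept column. Returns (coef, fitted, residuals)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    return coef, fitted, y - fitted

def durbin_watson(residuals):
    """Values near 2 suggest no first-order correlation between successive residuals."""
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals**2)

def vif(predictors):
    """Variance inflation factor per column of an (n, p) array with no intercept column."""
    n, p = predictors.shape
    factors = []
    for j in range(p):
        target = predictors[:, j]
        others = np.column_stack([np.ones(n), np.delete(predictors, j, axis=1)])
        _, _, res = ols(others, target)
        r2 = 1 - np.sum(res**2) / np.sum((target - target.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return np.array(factors)

# Synthetic stand-in data: hypothetical predictors, independent normal errors.
rng = np.random.default_rng(2)
n = 200
predictors = np.column_stack([rng.normal(15, 5, n), rng.integers(1, 21, n).astype(float)])
y = 30000 + predictors @ np.array([400.0, -500.0]) + rng.normal(0, 1000, n)

X = np.column_stack([np.ones(n), predictors])
_, fitted, residuals = ols(X, y)

dw = durbin_watson(residuals)
vifs = vif(predictors)

# Informal homoskedasticity check: the size of the residuals should not trend with the fitted values.
spread_corr = np.corrcoef(np.abs(residuals), fitted)[0, 1]

# Rough normality check: skewness and excess kurtosis should both be near 0.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z**3)
excess_kurtosis = np.mean(z**4) - 3

print("Durbin-Watson:", round(dw, 2))
print("VIF:", np.round(vifs, 2))
print("corr(|residuals|, fitted):", round(spread_corr, 2))
print("skewness, excess kurtosis:", round(skewness, 2), round(excess_kurtosis, 2))
```

Residual plots (residuals against fitted values, and a Q-Q plot) usually complement these numbers, since a single statistic can hide patterns that are obvious on a chart.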
We will look further into these assumptions, alongside the EDA, in the next blog post.