Summary to Date / Next Steps?

Previous blog posts described how the source dataset was created from scratch, using a mixture of tools including R and Python. Now that we have the dataset, what are the next steps, and what do we need to consider?

Which Machine Learning Algorithm?

This article by Jason Brownlee provides a really good overview of the different types of ML algorithms, as does this one on Analytics Vidhya.

For this specific case, we have a target (outcome) variable, i.e. Attendance, so a supervised ML algorithm is required, specifically Multiple Linear Regression.
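To make the idea concrete, here is a minimal sketch of a multiple linear regression fitted by ordinary least squares with NumPy. The data is synthetic, and the predictor names (temperature, is_weekend) are made up purely for illustration; they are not columns from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors, invented for this sketch (not from the real dataset).
temperature = rng.normal(15, 5, n)
is_weekend = rng.integers(0, 2, n)

# Synthetic target: a known linear, additive relationship plus random noise.
attendance = 50 + 2.0 * temperature + 10.0 * is_weekend + rng.normal(0, 1, n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), temperature, is_weekend])

# Ordinary least squares: find the coefficients minimising squared error.
coef, *_ = np.linalg.lstsq(X, attendance, rcond=None)
intercept, b_temperature, b_weekend = coef

predicted = X @ coef
print(f"intercept={intercept:.2f}, temperature={b_temperature:.2f}, "
      f"is_weekend={b_weekend:.2f}")
```

Because the target was generated from known coefficients, the fitted values should land close to 50, 2 and 10, which is a quick sanity check that the fit behaves as expected.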

Exploratory Data Analysis (EDA)

EDA is an important step that involves exploring the dataset: producing some initial descriptive statistics, carrying out univariate and multivariate analysis, and examining the correlations between variables.
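As a rough sketch of what a first EDA pass could look like in pandas, the snippet below produces descriptive statistics and a correlation matrix. The DataFrame is synthetic and its column names are assumptions made for illustration; the real dataset would be loaded in its place.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic stand-in for the real dataset; column names are hypothetical.
df = pd.DataFrame({
    "temperature": rng.normal(15, 5, n),
    "is_weekend": rng.integers(0, 2, n),
})
df["attendance"] = (
    50 + 2 * df["temperature"] + 10 * df["is_weekend"] + rng.normal(0, 3, n)
)

# Univariate view: count, mean, std, min, quartiles and max per column.
summary = df.describe()

# Multivariate view: pairwise Pearson correlations between all columns.
correlations = df.corr()

# Correlation of each predictor with the target, strongest first.
target_corr = (
    correlations["attendance"].drop("attendance").sort_values(ascending=False)
)

print(summary)
print(target_corr)
```

Histograms and scatter plots would normally accompany these tables, but the numeric summaries alone already show which predictors move with the target.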

Regression Assumptions

For regression to work correctly, a number of assumptions need to be satisfied:

  1.  There should be a linear and additive relationship between the dependent (response) variable and the independent (predictor) variable(s).
  2.  There should be no correlation between the residual (error) terms, i.e. no autocorrelation.
  3.  The independent variables should not be correlated with one another, i.e. no multicollinearity.
  4.  The error terms must have constant variance. This property is known as homoskedasticity; its absence is heteroskedasticity.
  5.  The error terms must be normally distributed.
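
Several of the assumptions above can be given a quick numeric check. The following is a minimal sketch using NumPy only, on synthetic data deliberately built to satisfy the assumptions; the predictor names are hypothetical, and the checks are rough diagnostics rather than formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Made-up, independent predictors (names are hypothetical).
predictors = np.column_stack([
    rng.normal(15, 5, n),  # "temperature"
    rng.normal(20, 4, n),  # "ticket_price"
])
attendance = (
    50 + 2.0 * predictors[:, 0] - 1.5 * predictors[:, 1] + rng.normal(0, 3, n)
)

def fit_ols(X, y):
    """Return least-squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

X = np.column_stack([np.ones(n), predictors])
fitted = X @ fit_ols(X, attendance)
residuals = attendance - fitted

# Assumption 2: Durbin-Watson statistic, close to 2 when the residuals
# show no first-order autocorrelation.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Assumption 3: variance inflation factor (VIF) per predictor, obtained by
# regressing each predictor on the others. Values near 1 mean little
# multicollinearity.
def variance_inflation_factors(P):
    rows, cols = P.shape
    vifs = []
    for j in range(cols):
        target = P[:, j]
        others = np.column_stack([np.ones(rows), np.delete(P, j, axis=1)])
        fit = others @ fit_ols(others, target)
        r2 = 1 - np.sum((target - fit) ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

vifs = variance_inflation_factors(predictors)

# Assumption 4: crude homoskedasticity check. If the residual spread grows
# with the fitted values, this correlation moves away from 0.
spread_corr = np.corrcoef(np.abs(residuals), fitted)[0, 1]

# Assumption 5: skewness and excess kurtosis of the residuals, both close
# to 0 for normally distributed errors.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3

print(durbin_watson, vifs, spread_corr, skewness, excess_kurtosis)
```

Assumption 1 (linearity) is usually judged visually, by plotting the residuals against the fitted values and looking for any curved pattern.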

We will look further into these assumptions, alongside the EDA, in the next blog post.
