Previous blog posts detailed how the source dataset was created from scratch, using a mixture of tools such as R and Python. Now that we have it, what are the next steps, and what do we need to consider?

**Which Machine Learning Algorithm?**

This article by Jason Brownlee provides a really good overview of the different types of ML algorithms, as does this one on Analytics Vidhya.

In this specific case, given that we have a target (outcome) variable, i.e. Attendance, a supervised ML algorithm is required, specifically multiple linear regression.
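To make the idea concrete, here is a minimal sketch of a multiple linear regression fitted by ordinary least squares with NumPy. The data is synthetic and the two predictors (`x1`, `x2`) are hypothetical stand-ins, since the actual columns of the dataset are not listed in this post; only the target name, attendance, comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors; the real dataset's columns are not shown in this post.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Synthetic attendance built from a known linear, additive relationship plus noise.
attendance = 5000 + 300 * x1 - 150 * x2 + rng.normal(scale=50, size=n)

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: find the coefficients that minimise the squared residuals.
coef, *_ = np.linalg.lstsq(X, attendance, rcond=None)
intercept, b1, b2 = coef

# Fitted values for the same observations.
predicted = X @ coef

print(f"intercept={intercept:.1f}, b1={b1:.1f}, b2={b2:.1f}")
```

Because the data was generated from known coefficients, the estimates should land close to 5000, 300 and -150, which is a handy sanity check before trying the same approach on real data.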

**Exploratory Data Analysis (EDA)**

This is an important step that involves exploring the dataset: producing some initial descriptive statistics, carrying out univariate and multivariate analysis, and running a correlation analysis.
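As a rough illustration of what those first EDA steps might look like in pandas, the sketch below builds a small synthetic table and produces descriptive statistics and a correlation matrix. The column names `temperature` and `ticket_price` are assumptions made up for the example and are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100

# Synthetic data with hypothetical column names, for illustration only.
df = pd.DataFrame({
    "temperature": rng.normal(15, 5, n),
    "ticket_price": rng.normal(30, 4, n),
})
df["attendance"] = (
    20000
    + 400 * df["temperature"]
    - 200 * df["ticket_price"]
    + rng.normal(0, 1000, n)
)

# Univariate view: count, mean, std, min, quartiles and max for each column.
summary = df.describe()

# Multivariate view: pairwise Pearson correlations between all columns.
corr = df.corr()

# Correlation of each predictor with the target, strongest positive first.
target_corr = corr["attendance"].drop("attendance").sort_values(ascending=False)

print(summary)
print(target_corr)
```

On a real dataset, the same three calls (`describe`, `corr`, and a sorted slice of the target column) give a quick first impression of which variables are worth a closer look.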

**Regression Assumptions**

For regression to work correctly, a number of assumptions need to be satisfied:

- There should be a **linear** and **additive** relationship between the *dependent* (**response**) variable and the *independent* (**predictor**) variable(s).
- There should be **no correlation** between the residual (error) terms.
- The independent variables should not be correlated.
- The error terms must have constant variance. This phenomenon is known as **homoskedasticity**.
- The error terms must be normally distributed.
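The assumptions listed above can be checked informally from the residuals of a fitted model. The sketch below is one possible set of quick numeric checks using only NumPy, run on synthetic data with hypothetical predictors `x1` and `x2`. The statistics chosen (Durbin-Watson, a variance ratio between two halves of the fitted values, and sample skewness) are illustrative choices, not the only way to test these assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic data with hypothetical predictors that satisfies the assumptions by construction.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 10 + 2 * x1 - 3 * x2 + rng.normal(scale=1.0, size=n)

# Fit by ordinary least squares and compute residuals.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# No correlation between error terms: Durbin-Watson statistic.
# Values near 2 suggest no first-order autocorrelation.
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Independent variables should not be correlated: pairwise correlation of predictors.
predictor_corr = np.corrcoef(x1, x2)[0, 1]

# Constant variance: compare residual variance for low vs high fitted values.
# A ratio far from 1 hints at heteroskedasticity.
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2:]]
var_ratio = np.var(high) / np.var(low)

# Normality: sample skewness of the residuals (about 0 for a normal distribution).
z = (residuals - residuals.mean()) / residuals.std()
skew = np.mean(z ** 3)

print(f"Durbin-Watson={dw:.2f}, predictor corr={predictor_corr:.2f}, "
      f"variance ratio={var_ratio:.2f}, skewness={skew:.2f}")
```

These numbers are only a first pass; residual plots (residuals against fitted values, and a Q-Q plot) usually tell a fuller story than any single statistic.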

We will look further into these assumptions, alongside the EDA, in the next blog post.