Predicting Soccer Attendances – Feature Scaling

Taken from Wikipedia ...

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

Within the soccer dataset I have produced, a number of features cover very different ranges of values. For example, "capacity" has values of 30,000+ and "distance" has values of 250+, whereas "form home" and "form away" have values not exceeding 20.

Given the above, feature scaling looks like something that needs to be considered. Further research seems to confirm that assumption, although how much it matters depends on the algorithm: for some (e.g. distance-based methods such as clustering or KNN) it is more critical, while for others performing the scaling has no real impact. See this link for further info.
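To illustrate why distance-based methods are sensitive to scale, here is a minimal sketch. The three matches, their values, and the assumed feature ranges are invented for illustration and are not taken from my dataset.

```python
import math

# Hypothetical matches as (capacity, home form); values invented for illustration.
a = (30000.0, 5.0)
b = (30500.0, 18.0)  # similar capacity to a, very different form
c = (32000.0, 5.0)   # different capacity from a, identical form

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Assumed (min, max) per feature, used only for this sketch.
ranges = [(10000.0, 60000.0), (0.0, 20.0)]

def rescale(p):
    """Min-max rescale each feature into [0, 1] using the assumed ranges."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(p, ranges))

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
scaled_ab = euclidean(rescale(a), rescale(b))
scaled_ac = euclidean(rescale(a), rescale(c))

# On raw values, capacity dominates, so b looks like the nearer neighbour of a.
print(raw_ab < raw_ac)
# After rescaling, the difference in form counts too, so c is the nearer one.
print(scaled_ab > scaled_ac)
```

On the raw values the form difference barely registers next to capacity, which is the kind of effect that makes scaling matter for KNN-style algorithms.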

There are a number of different methods that can be used for feature scaling; again, Wikipedia provides a reasonable explanation of these here.

Initially, for this project, I have looked at the "rescaling" or "min-max" method. My first attempt can be found in my GitLab repo here, where I applied the method manually in a Python/Jupyter notebook by writing a function that applies the scaling to the features in my dataset. Further research then revealed the Scikit-Learn library and, within it, the MinMaxScaler class, which automates the application of the min-max method of scaling. A second Python/Jupyter notebook, which uses scikit-learn, can be found here. Interestingly, and to my delight, the output of the manual first script appears to match that of the scikit-learn automated process.
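As a rough sketch of that comparison (not the code from either notebook), the snippet below rescales a small made-up array by hand and with MinMaxScaler, then checks that the two results agree. The sample values and the helper name min_max_scale are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up rows of (capacity, distance, form home); not values from the real dataset.
X = np.array([
    [30000.0, 250.0, 12.0],
    [45000.0,  40.0,  7.0],
    [60000.0, 120.0, 20.0],
    [25000.0,  10.0,  3.0],
])

def min_max_scale(values):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    return (values - col_min) / (col_max - col_min)

manual = min_max_scale(X)
# MinMaxScaler defaults to feature_range=(0, 1), the same target range as above.
automated = MinMaxScaler().fit_transform(X)

matches = np.allclose(manual, automated)
print(matches)
```

Note that the hand-written version would divide by zero for a column whose values are all identical, so a real dataset may need a guard for constant features.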

A further point to note: the second script also makes use of the LabelEncoder class, which converts categorical data within the dataset (e.g. the Home Team and DayEve features) into numerical values.
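A minimal sketch of what LabelEncoder does, using a made-up list of team names rather than the real Home Team column:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of a categorical feature; not taken from the real dataset.
home_team = ["Leeds", "Arsenal", "Leeds", "Everton"]

encoder = LabelEncoder()
# fit_transform learns the distinct labels (sorted) and maps each to an integer.
codes = encoder.fit_transform(home_team)

print(list(encoder.classes_))  # the learned labels, in sorted order
print(list(codes))             # integer code for each original entry

# inverse_transform maps the integer codes back to the original labels.
restored = list(encoder.inverse_transform(codes))
```

One caveat worth keeping in mind: the integer codes imply an ordering between teams that does not really exist, and the scikit-learn documentation describes LabelEncoder as intended for target labels, so a one-hot style encoding may be worth considering for input features.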

Another useful link discussing feature scaling, and the scikit-learn package in particular: Scikit-Learn (Feature Scaling)
