Predicting Soccer Attendances – Part 4: Dataset Creation with Python

Dataset Creation

The creation of a dataset was split into two parts.

The initial phase involved looking at League One fixtures and creating a dataset in Excel for matches played in the 2016/17 season. The process was manual: data was sourced from various websites and added to an Excel spreadsheet. This spreadsheet is named "league1_2016_17_v5.xlsx" and can be found in the GitHub repository linked below.

The second phase was linked to improving my understanding of the Python language. A Python script was created (later converted into a Jupyter notebook). With the aid of various other source files, the script, when run, creates a dataset for Championship matches in the 2016/17 season.

The script (a Jupyter/IPython notebook) can be found in my GitHub repository here.

The two datasets produced are then merged into one main dataset, which will be used in the modelling phase.
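As a rough illustration of the merge step, the sketch below stacks two small frames with pandas. It is not the code from the notebook: the League One row is a made-up placeholder, the "Division" column is an assumed addition for keeping track of where each row came from, and only a handful of columns are shown.

```python
import pandas as pd

# Placeholder League One row (hypothetical values, for illustration only).
league_one = pd.DataFrame({
    "Date": ["06/08/2016"],
    "HomeTeam": ["Team A"],
    "AwayTeam": ["Team B"],
    "Attendance": [5000],
})

# Two Championship rows taken from the example output in this post.
championship = pd.DataFrame({
    "Date": ["05/08/2016", "06/08/2016"],
    "HomeTeam": ["Fulham", "Birmingham"],
    "AwayTeam": ["Newcastle", "Cardiff"],
    "Attendance": [23922, 19833],
})

# Assumed extra column: record the division so the source of each row
# is still known after the two datasets are combined.
league_one["Division"] = "League One"
championship["Division"] = "Championship"

# Stack the rows; ignore_index gives the combined frame a clean 0..n-1 index.
combined = pd.concat([league_one, championship], ignore_index=True)

# The dates are day/month/year, so parse them with an explicit format.
combined["Date"] = pd.to_datetime(combined["Date"], format="%d/%m/%Y")

print(combined)
```

Stacking with `pd.concat` only works cleanly when both datasets share the same column names, so any differences between the Excel file and the script output would need to be reconciled first.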

See my GitHub repository here for a full list of the files used in the creation of the dataset.

The features produced by this Python script are as follows:

Feature List

  1. Date : Date of the match
  2. HomeTeam : Home team
  3. AwayTeam : Away team
  4. Day_Eve : Is the game a day or evening match?
  5. Day Type : Is the game at a weekend or during the week?
  6. Holiday : Is the game played on a bank holiday?
  7. Hol Type : Same as Holiday.
  8. Capacity : Capacity of the home team's ground
  9. Average Travelling Fans : Average number of travelling fans that the away team takes (based on the previous season)
  10. Cheapest Season T : Lowest season ticket price for the home team
  11. Home League Position : Position of the home team at the time of the game
  12. Away League Position : Position of the away team at the time of the game
  13. Form Home : Current form of the home team (based on the last 5 matches)
  14. Form Away : Current form of the away team (based on the last 5 matches)
  15. Distance : Distance between the home side's ground and the away team
  16. Temperature : Temperature on the day of the game, Weather Event
  17. Lowest Home Ticket Price : Lowest ticket price for a home fan
  18. Lowest Away Ticket Price : Lowest ticket price for an away fan
  19. Home PostCode : Postcode of the home team
  20. Away PostCode : Postcode of the away team
  21. Attendance : Attendance for the game
  22. Highest Home Ticket Price : Highest ticket price that a home fan can pay
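Two of the features above can be derived directly from the fixture list, and a minimal sketch of how that might look is given here. Both definitions are assumptions rather than the notebook's own: Day Type is coded as 0 for a weekend and 1 for a weekday (which matches the example output in this post), and form is taken to be league points won in the previous five matches. The function names and the points sequence are hypothetical.

```python
import pandas as pd


def day_type(date: pd.Timestamp) -> int:
    """Return 0 for a weekend match and 1 for a weekday match (assumed coding)."""
    # dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend.
    return 0 if date.dayofweek >= 5 else 1


def form_before_each_match(points: pd.Series) -> pd.Series:
    """Points won in the previous five matches, excluding the current one.

    Assumed definition of "form": 3 for a win, 1 for a draw, 0 for a loss.
    """
    # shift(1) drops the current match so only earlier results count;
    # min_periods=1 lets the early-season games use fewer than five matches.
    previous = points.shift(1)
    return previous.rolling(window=5, min_periods=1).sum().fillna(0).astype(int)


# Opening weekend of the 2016/17 Championship season.
friday = pd.to_datetime("05/08/2016", format="%d/%m/%Y")
saturday = pd.to_datetime("06/08/2016", format="%d/%m/%Y")
print(day_type(friday), day_type(saturday))

# Hypothetical run of results for one team, in match order.
points = pd.Series([3, 1, 0, 3, 3, 3, 0])
form = form_before_each_match(points)
print(form.tolist())
```

With this definition a team's form is 0 going into its first game, which is consistent with the zeros in the Form Home and Form Away columns for the opening fixtures.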

The table below shows example output from the dataset file:

| Date | HomeTeam | AwayTeam | Day_Eve | Day Type | Hol Type | Capacity | Average Travelling Fans | Cheapest Season T | Home League Position | Away League Position | Form Home | Form Away | Distance | Temperature | Lowest Home Ticket Price | Lowest Away Ticket Price | Attendance | Highest Home Ticket Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 05/08/2016 | Fulham | Newcastle | E | 1 | 0 | 25700 | 3140 | 254 | 0 | 0 | 0 | 0 | 249 | 20.2 | 25 | 24 | 23922 | 45 |
| 06/08/2016 | Birmingham | Cardiff | D | 0 | 0 | 30016 | 775 | 230 | 0 | 0 | 0 | 0 | 89 | 18.8 | 15 | 22 | 19833 | 40 |
| 06/08/2016 | Blackburn | Norwich City | D | 0 | 0 | 31367 | 1661 | 279 | 0 | 0 | 0 | 0 | 175 | 17.9 | 18 | 20 | 12641 | 35 |
| 06/08/2016 | Bristol City | Wigan Athletic | D | 0 | 0 | 21497 | 1284 | 299 | 0 | 0 | 0 | 0 | 145 | 19.1 | 25 | 20 | 17635 | 41 |
| 06/08/2016 | Derby | Brighton & Hove Albion | D | 0 | 0 | 33597 | 1611 | 319 | 0 | 0 | 0 | 0 | 154 | 19.3 | 17.6 | 25 | 28749 | 33 |
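Once the dataset file has been produced, a quick check of this kind can catch data-entry mistakes before modelling. The sketch below is only an illustration: it assumes the data is available as comma-separated text (the actual file format may differ), embeds three of the rows above with a reduced set of columns, and flags any match where the attendance is larger than the ground capacity.

```python
import io

import pandas as pd

# Three of the example rows, reduced to a few columns and written as CSV
# (assumed format; the real dataset file may be stored differently).
sample = io.StringIO(
    "Date,HomeTeam,AwayTeam,Capacity,Attendance\n"
    "05/08/2016,Fulham,Newcastle,25700,23922\n"
    "06/08/2016,Birmingham,Cardiff,30016,19833\n"
    "06/08/2016,Blackburn,Norwich City,31367,12641\n"
)

matches = pd.read_csv(sample)

# Dates are day/month/year; an explicit format avoids month/day mix-ups.
matches["Date"] = pd.to_datetime(matches["Date"], format="%d/%m/%Y")

# Sanity check: an attendance above capacity suggests a data-entry error.
over_capacity = matches[matches["Attendance"] > matches["Capacity"]]

print(matches.dtypes)
print(f"Rows with attendance above capacity: {len(over_capacity)}")
```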

The next update will look at feature scaling and how it might be applied to the dataset created in this project.
