To briefly summarise Part 1 of this blog series, the steps involved in the data science process are:
- Business / Data Understanding
- Data Acquisition
- Data Preparation/Wrangling
- Data Modelling and Evaluation
Business or Data Understanding
I am trying to produce a model that will predict the attendances of soccer matches in the English Football League. I therefore need to consider which features, or associated items of data, are likely to contribute to the number of people who will turn up and watch a game of soccer.
I would suggest the following are relevant:
- Capacity of the home team's ground
- The current form of the two teams playing each other
- The distance that the away team fans have to travel
- The weather/temperature on the day
- The league position of each team
- Day of the week, and whether it is a bank holiday
- Ticket prices
- The average number of fans the away team usually take
This information will form the start of the dataset I am looking to create. Each of the above is a potential feature that will ultimately be used to produce a predictive model for soccer attendances.
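As a rough sketch, one row of that dataset could be represented in Python as shown here. The class name, field names, types, and the way form is encoded are all assumptions made for illustration, and the values in the example row are placeholders rather than real match data.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class MatchFeatures:
    """One candidate row of the dataset: a single match and its features."""
    home_team: str
    away_team: str
    home_capacity: int            # capacity of the home team's ground
    home_form_points: int         # assumed encoding: points from recent games
    away_form_points: int
    away_travel_km: float         # distance the away fans have to travel
    home_league_position: int
    away_league_position: int
    day_of_week: str
    is_bank_holiday: bool
    # Items that may have to come from other sources start out empty.
    temperature_c: Optional[float] = None
    ticket_price_gbp: Optional[float] = None
    away_avg_travelling_fans: Optional[int] = None
    attendance: Optional[int] = None  # the value the model will predict


# Placeholder values only, to show the shape of a row.
example = MatchFeatures(
    home_team="Home FC",
    away_team="Away FC",
    home_capacity=10000,
    home_form_points=7,
    away_form_points=4,
    away_travel_km=120.0,
    home_league_position=8,
    away_league_position=15,
    day_of_week="Saturday",
    is_bank_holiday=False,
)

row = asdict(example)  # a plain dict, ready to be written out as a CSV row
print(sorted(row))
```

Keeping the target (attendance) in the same record as the features means each match maps to exactly one row, which is the shape most modelling tools expect.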
Scope also needs to be considered at this point. For this model, and given that I support a lower league team (Northampton Town FC, The Cobblers!!), I have decided initially to concentrate on League 1 and the Championship within England.
Within a business environment, the data acquisition step may well be where data is sourced from existing systems. For this project, I am not aware of any pre-built dataset that holds the data I need, so I will be looking at websites that hold details of soccer matches to use as a source. The main site I have identified to use is:
This is a feature-packed website that contains a lot of the information we need for each soccer match, and it will be a good starting point for my journey into producing the dataset that will then be used in the predictive modelling phase. Clearly, however, some data items will not be found here, notably weather-related data; more on that later.
Having decided where I can obtain the data from, how do I get it into a dataset for modelling?
In the early stages of this project, while I was sourcing data for League 1 matches, I used a very manual approach to get the data. This involved creating a CSV file with data sourced from the aforementioned soccer website. Note that for some data items, such as the weather/temperature and the distance between teams, I had to source the data from other websites, then manually calculate these features and add them to the CSV file.
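To give an idea of how a feature like the distance between teams could be calculated rather than looked up by hand, here is a minimal sketch using the haversine formula. It assumes the latitude and longitude of each ground are already known, and it gives the straight-line ("as the crow flies") distance, which will be shorter than the road distance fans actually travel. The function name and the coordinates in the usage line are placeholders, not real ground locations.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius of the Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Placeholder coordinates for a home ground and an away ground.
home_ground = (52.0, -1.0)
away_ground = (53.0, -2.0)
distance = haversine_km(*home_ground, *away_ground)
print(f"{distance:.1f} km")
```

A straight-line figure like this is probably good enough as a first feature; if travel time turns out to matter more than distance, it could be swapped for road or rail data later.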
As you can imagine, this very inflexible, manual approach to creating the dataset was time-consuming and open to error. It was at this point, while producing the Championship matches dataset, that I decided to use a combination of Python and R scripts that would considerably reduce the manual work involved. Note that there were still some manual aspects, and "web scraping" techniques could potentially have been employed to reduce manual intervention even further. However, my experience in this area was limited, so I did not use that technique. It is very much a technique I would like to use in the future, though (watch out for a future blog post on that!!).
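As a small illustration of the kind of step a script can take over from manual editing, the sketch below reads match rows from CSV text and derives the day of the week and a bank holiday flag from the match date. This is not one of the actual scripts from this project: the column names, the placeholder team names, and the function name are all assumptions, and the set of bank holiday dates has to be supplied by whoever runs it.

```python
import csv
import io
from datetime import date, datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_calendar_features(rows, bank_holidays):
    """Return copies of the rows with day_of_week and is_bank_holiday added."""
    enriched = []
    for row in rows:
        match_day = datetime.strptime(row["match_date"], "%Y-%m-%d").date()
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row["day_of_week"] = DAY_NAMES[match_day.weekday()]
        new_row["is_bank_holiday"] = match_day in bank_holidays
        enriched.append(new_row)
    return enriched


# Stand-in for a CSV file on disk; the columns and teams are hypothetical.
csv_text = """match_date,home_team,away_team
2024-01-01,Home FC,Away FC
2024-01-06,Home FC,Other FC
"""

# Bank holidays supplied by hand; New Year's Day 2024 is the only one here.
bank_holidays = {date(2024, 1, 1)}

rows = list(csv.DictReader(io.StringIO(csv_text)))
enriched_rows = add_calendar_features(rows, bank_holidays)
for r in enriched_rows:
    print(r["match_date"], r["day_of_week"], r["is_bank_holiday"])
```

Deriving these columns from the date removes one source of typing errors, since the date becomes the only value that has to be entered correctly.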
Part 3 of this blog series will look in detail at the Python and R scripts.
For now though, I will leave you with a taster and, as promised, more on the weather-related issue. Firstly, the following website was used to obtain what are known as station IDs, codes that denote the location of a weather station that records temperature:
This site was then used in conjunction with an R package called “weatherData”.