Final Project Report: Predicting the 2018 FIFA World Cup Results

Team members: Maxim Hammer, Sai Mahabhashyam, Megala Anandakumar, Johnny Guamanquispe

Assigned TF: Karan R Motwani

Last edited: 12/12/2018

Group ID: 2

Project Outline

Goal: The goal of the project is to predict the outcome of every match of the 2018 FIFA World Cup, rather than only the eventual champion. We treat all 64 games (48 group-stage and 16 knockout games) as independent matches, and we test our predictions against the actual results at the end of 90 minutes. A knockout game that is level after 90 minutes continues so that one team advances; we do not consider that post-90-minute play and record such a game as a draw.
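
The labeling rule described above can be sketched as follows. This is only a minimal illustration: the function name and its arguments (goals scored by each side at the end of regular time) are hypothetical and do not come from our dataset.

```python
def outcome_at_90(team_goals: int, opponent_goals: int) -> str:
    """Return the 90-minute result from the first team's point of view."""
    if team_goals > opponent_goals:
        return "Win"
    if team_goals < opponent_goals:
        return "Loss"
    return "Draw"

# A knockout game level at 1-1 after 90 minutes is labeled a draw,
# regardless of what happens afterwards.
labels = [outcome_at_90(2, 1), outcome_at_90(0, 3), outcome_at_90(1, 1)]
print(labels)
```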

Model Type: Since we are predicting the outcome of a match, the possible outcomes are Win, Loss, and Draw, which makes this a classification problem. We initially planned a binary classifier (Win vs. Loss), since the knockout stage does not allow a draw as a final outcome. After discussing with our TF, we understood that we could model all three possible states at the end of 90 minutes of play, as explained earlier, and test our model against those results. We therefore used binary classification only for the initial EDA and model; the final project submission uses 3-outcome classification models.

Preliminary Models: We initially built two simple baseline models: a Random Generator and Logistic Regression. With 3 possible outcomes drawn with equal probability, the Random Generator is expected to have an accuracy of about 0.33. For the second baseline, Logistic Regression, the features considered were FIFA rankings, game location (country), and continent. The model is trained with the negative log-likelihood loss function.
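
To make the two baseline quantities concrete, the sketch below simulates a uniform random guesser and computes the negative log-likelihood of a uniform probability forecast. The outcomes are synthetic and deliberately imbalanced (they are not match data), and the helper names are our own for illustration.

```python
import math
import random

CLASSES = ["Win", "Loss", "Draw"]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def negative_log_likelihood(y_true, probs):
    """Mean NLL; probs is a list of dicts mapping class -> predicted probability."""
    return -sum(math.log(p[t]) for t, p in zip(y_true, probs)) / len(y_true)

rng = random.Random(0)
# Synthetic, imbalanced outcomes (illustration only, not real results).
y_true = rng.choices(CLASSES, weights=[0.5, 0.3, 0.2], k=30000)
# Random Generator: pick each class with equal probability.
y_rand = [rng.choice(CLASSES) for _ in y_true]

random_accuracy = accuracy(y_true, y_rand)
uniform_forecast = [{c: 1 / 3 for c in CLASSES}] * len(y_true)
uniform_nll = negative_log_likelihood(y_true, uniform_forecast)
print(round(random_accuracy, 3), round(uniform_nll, 4))
```

The random guesser lands near 1/3 accuracy whatever the true class balance is, and a uniform forecast has a negative log-likelihood of ln 3 (about 1.0986) per game, so a trained classifier should score below that value to be useful.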

Feature Engineering: We compiled several features: demographics (population and GDP of a country), continent (used in the logistic regression baseline), distance traveled by teams, stakes (how important the match is, e.g. World Cup vs. friendly), and, very importantly, team and player strengths. We also built features to indicate “form” or “momentum” as well as “team chemistry”. Intuitively, a team with a better form or chemistry value has a better chance of winning a game. We will show EDA plots to check whether this intuition holds, in which case these features can be good predictors in our final model. We tried various decision-tree-based models (bagging, boosting, and random forests) as well as neural nets. As explained earlier, our final model uses 3 outcome states (Win, Loss, Draw).
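
As an illustration of what a "form" feature could look like, the sketch below averages the points a team earned over its most recent games. The 3/1/0 points scheme, the window size, and the function name are assumptions made for this example; they are not necessarily the definition used in our final feature.

```python
from collections import deque

# Assumed scoring scheme for illustration: 3 for a win, 1 for a draw, 0 for a loss.
POINTS = {"Win": 3, "Draw": 1, "Loss": 0}

def rolling_form(results, window=5):
    """For each game, mean points over up to `window` previous games.

    The current game is excluded so the feature never sees the outcome
    it is meant to help predict. Returns None when there is no history.
    """
    recent = deque(maxlen=window)
    form = []
    for result in results:
        form.append(sum(recent) / len(recent) if recent else None)
        recent.append(POINTS[result])
    return form

# One team's results in chronological order (made-up sequence).
team_results = ["Win", "Win", "Draw", "Loss", "Win"]
form_values = rolling_form(team_results, window=3)
print(form_values)
```

Computing the value from previous games only matters here: including the current result would leak the label into the feature.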

Data Sources and Cleaning

Approach and Sources: Our aim was to take the historical collection of FIFA games and exclude the rows and columns that are not relevant. For example, it did not make sense to keep very old data involving countries that no longer exist or participate. Also, there are no FIFA rankings prior to 1993, so we discarded data from before 1993.

Then, we added new columns (features) from player data, demographics (population, GDP, etc.), game location and each team's distance from its home country, and stakes (whether it is a serious match or a friendly). We also added “Form Factor” and “Team Chemistry” variables, which we will explain later.

Data Sources:

FIFA ranking data: Source is https://www.kaggle.com/tadhgfitzgerald

Player Data: We used the data provided as well as data scraped from https://sofifa.com

Continents: Source is https://gist.github.com/pamelafox/986163

Distances: Source is https://gist.githubusercontent.com/mtriff/.../countries_distances.csv

This is the shortest distance from one country's border to the other's. Neighboring countries have a distance of zero.

Population: Source is https://data.worldbank.org/indicator/SP.POP.TOTL

This is yearly population by country.

GDP per capita: Source is https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD

Data Cleaning

There were a few issues, primarily when joining data from different sources.

Country names: One of the main issues was that some country names differed between the CSV files obtained from different sources. One of the tasks was therefore to identify the mismatched names, manually replace them, and then perform the merge operation (for example, North Korea vs Korea DPR). There was also the issue that the demographics data sets contained United Kingdom, while the other data set contained England, Ireland, Northern Ireland, and Wales as individual countries. We looked up data for those countries from various online sources and populated approximate values in the CSV manually.
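
The name-matching step can be sketched as below. This is a simplified illustration using plain Python structures: the alias table holds only the example mentioned above, and the rank and population numbers are placeholders, not real values.

```python
# Hypothetical alias table: maps a source-specific spelling to one canonical name.
NAME_ALIASES = {"Korea DPR": "North Korea"}

def canonical(name: str) -> str:
    """Return the canonical country name, trimming stray whitespace."""
    cleaned = name.strip()
    return NAME_ALIASES.get(cleaned, cleaned)

# Toy stand-ins for two sources (placeholder numbers).
rankings = [{"country": "Korea DPR", "rank": 1}, {"country": "Brazil", "rank": 2}]
population = {"North Korea": 111, "Brazil": 222}

# Names that would silently drop out of the join if left unfixed.
unmatched_before = sorted({r["country"] for r in rankings} - set(population))

# Join on the canonical name instead of the raw spelling.
merged = [
    {
        "country": canonical(r["country"]),
        "rank": r["rank"],
        "population": population.get(canonical(r["country"])),
    }
    for r in rankings
]
unmatched_after = [row["country"] for row in merged if row["population"] is None]
print(unmatched_before, unmatched_after)
```

Listing the unmatched names before merging is what makes the manual replacement tractable: any name still unmatched afterwards needs a new alias or a manual lookup.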

Data Imputation: Some values were missing, for example in the population data. We imputed them through simple means, such as using the previous year’s population.
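
A minimal sketch of the previous-year imputation is shown below, assuming the yearly values for one country are held in a dictionary keyed by year. The function name and the numbers are placeholders for illustration.

```python
def forward_fill_by_year(series):
    """Fill missing yearly values with the most recent earlier known value.

    `series` maps year -> value or None. Years before the first known
    value stay None, since there is no previous year to copy from.
    """
    filled = {}
    last_known = None
    for year in sorted(series):
        if series[year] is not None:
            last_known = series[year]
        filled[year] = last_known
    return filled

# Placeholder numbers, not real population figures.
population_by_year = {2014: 100, 2015: None, 2016: 104, 2017: None, 2018: None}
filled_population = forward_fill_by_year(population_by_year)
print(filled_population)
```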

Incorrect values: In the player data, some values were stored as unevaluated arithmetic expressions, such as 60+3. These needed to be interpreted correctly and cleaned up before joining.
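
One possible way to clean such values is sketched below. It assumes that a value like 60+3 should be read as the arithmetic result (63); if only the base value is wanted, the function would return the first number instead. The function name is hypothetical.

```python
import re

# Matches "<number><+ or -><number>", allowing spaces around the operator.
_RATING = re.compile(r"^\s*(\d+)\s*([+-])\s*(\d+)\s*$")

def parse_rating(raw):
    """Convert '60+3' -> 63, '71-2' -> 69, '75' -> 75; return None if unparseable."""
    text = str(raw).strip()
    if text.isdigit():
        return int(text)
    match = _RATING.match(text)
    if match is None:
        return None
    base, sign, delta = int(match.group(1)), match.group(2), int(match.group(3))
    return base + delta if sign == "+" else base - delta

parsed = [parse_rating(v) for v in ["60+3", "75", "71-2", "n/a"]]
print(parsed)
```

Parsing with a regular expression rather than evaluating the string keeps the cleanup safe against unexpected content in scraped data.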

Different format: We needed multiple rows for each country, one per year, whereas the data we found had just one row per country, with the years as column names.
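
The wide-to-long reshaping can be sketched as follows. The column name "Country Name" and the values are assumptions for this toy example; a real extract may carry additional non-year columns that would need to be skipped as well.

```python
import csv
import io

# Toy wide-format extract: one row per country, one column per year (placeholder values).
WIDE_CSV = """Country Name,2016,2017,2018
Brazil,10,11,12
Germany,20,,22
"""

def wide_to_long(wide_file, id_column="Country Name"):
    """Yield one record per country-year pair from a wide-format CSV."""
    for row in csv.DictReader(wide_file):
        country = row[id_column]
        for column, value in row.items():
            if column == id_column:
                continue
            yield {
                "country": country,
                "year": int(column),
                "value": float(value) if value else None,  # blank cell -> missing
            }

long_rows = list(wide_to_long(io.StringIO(WIDE_CSV)))
print(len(long_rows), long_rows[0])
```

With one row per country and year, the demographic data can be joined to each game on the pair (country, year of the match).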

Not enough data: Finally, we had difficulty finding data online for years earlier than 2008 or 2010 (we will keep trying for the next few days). We might need to decide to discard some of the older games or to do some imputation if required.

Organization of the Report

The rest of the report is organized as follows. First, we discuss the Exploratory Data Analysis (EDA) of all the data we have collected and cleaned up. In that section, we discuss the relationship between the features and the outcome in the training set, and we show the new features created from the available data based on some intuition about the game. The next section covers Feature Engineering and Model Building. There, we show how the models were built and compare the results of all the models. Finally, the last section presents our conclusions as well as the scope for future work. We would like to thank the Professors and Teaching Fellows for their guidance in the course as well as in this project.