Final Project Report: Predicting the 2018 FIFA World Cup Results

Team members: Maxim Hammer, Sai Mahabhashyam, Megala Anandakumar, Johnny Guamanquispe

Assigned TF: Karan R Motwani

Last edited: 12/12/2018

Group ID: 2

Project Outline

Goal: The goal of the project is to predict the outcome of every match of the 2018 FIFA World Cup, not specifically who will be the final winner. We treat all 64 games (48 group-stage and 16 knockout games) as independent matches. That is, our predictions are tested against the actual results of the 64 games at the end of 90 minutes. A knockout game must produce a winner, so a game that is level after 90 minutes continues into extra time and, if needed, penalties. We ignore that post-90-minute play and count such a game as a draw.
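The labeling rule described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `outcome_at_90` and the convention of labeling from the first-listed team's perspective are assumptions made here.

```python
def outcome_at_90(team_goals: int, opponent_goals: int) -> str:
    """Label a match from the first-listed team's perspective (assumed convention),
    using only the score at the end of 90 minutes."""
    if team_goals > opponent_goals:
        return "Win"
    if team_goals < opponent_goals:
        return "Loss"
    return "Draw"


# Illustrative 90-minute scores (made up, not real match data).
# A knockout game level at 1-1 after 90 minutes is labeled a Draw,
# regardless of what happens in extra time or penalties.
scores = [(2, 1), (1, 1), (0, 3)]
labels = [outcome_at_90(team, opp) for team, opp in scores]
print(labels)
```

With this rule, group-stage and knockout games share the same three labels, which is what allows all 64 games to be scored the same way.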

Model Type: Since we are predicting the outcome of a match, the possible outcomes are Win, Loss and Draw, which makes this a classification problem. We initially planned to build a binary classifier with just Win and Loss, since the knockout stages do not allow a draw as a final outcome. However, after discussing with our TF, we understood that we could consider all three possible states at the end of 90 minutes of play, as explained earlier, and test our model against those results. We therefore used binary classification only for the initial EDA and model, and we use 3-outcome classification models for the final project submission.

Preliminary Models: We initially built two simple baseline models: a Random Generator and Logistic Regression. With 3 possible outcomes guessed with equal probability, the Random Generator model is expected to have an accuracy of about 0.33. For the second baseline model, Logistic Regression, the features considered were FIFA rankings, Game Location (country), and Continent. The model is trained with the Negative Log Likelihood loss function.
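As a rough check on the two quantities mentioned above, the sketch below simulates a uniform random guesser and evaluates the mean negative log likelihood of a uniform prediction. The labels are synthetic and the helper names are hypothetical; this is not the project's baseline code.

```python
import math
import random

OUTCOMES = ["Win", "Loss", "Draw"]


def random_baseline_accuracy(true_labels, seed=0):
    """Accuracy of guessing each outcome uniformly at random."""
    rng = random.Random(seed)
    guesses = [rng.choice(OUTCOMES) for _ in true_labels]
    hits = sum(g == t for g, t in zip(guesses, true_labels))
    return hits / len(true_labels)


def negative_log_likelihood(probs, true_labels):
    """Mean NLL; `probs` is a list of dicts mapping outcome -> predicted probability."""
    total = sum(math.log(p[t]) for p, t in zip(probs, true_labels))
    return -total / len(true_labels)


# Synthetic labels, for illustration only (not match data).
label_rng = random.Random(1)
labels = [label_rng.choice(OUTCOMES) for _ in range(30000)]

acc = random_baseline_accuracy(labels)

# A model that always predicts 1/3 for each outcome has NLL = log(3).
uniform = [{o: 1 / 3 for o in OUTCOMES} for _ in labels]
nll = negative_log_likelihood(uniform, labels)

print(round(acc, 3), round(nll, 4))
```

The expected accuracy of the uniform guesser is 1/3 whatever the true mix of wins, losses and draws, and log(3) is the loss that any trained 3-outcome model should improve on.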

Feature Engineering: We compiled several features, such as demographics (population and GDP of a country), continent (used in the logistic regression baseline), distance traveled by teams, stakes (how important the match is, e.g. World Cup vs. friendly), and, very importantly, team and player strengths. We also built features to indicate “form” or “momentum” as well as “team chemistry”. Intuitively, a team with a better form or chemistry value has a better chance of winning a game. We will show EDA plots to check whether this intuition holds; if it does, these features can be good predictors in our final model. We tried various Decision Tree based models, such as Bagging, Boosting and Random Forests, as well as Neural Nets. As explained earlier, our final model uses 3 outcome states (Win, Loss, Draw).
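This section does not define how "form" is computed, so the following is only one plausible sketch. It assumes form is the average points per game (3 for a win, 1 for a draw, 0 for a loss) over a team's most recent matches; the function name, the point values and the window size are all assumptions.

```python
from collections import deque

# Assumed scoring: standard league points per result.
POINTS = {"Win": 3, "Draw": 1, "Loss": 0}


def rolling_form(results, window=5):
    """For each match, return the average points over up to `window` earlier matches.

    Only matches played before the current one are used, so the feature
    never includes the outcome it is meant to help predict. The first
    match has no history and gets None.
    """
    recent = deque(maxlen=window)
    form = []
    for result in results:
        form.append(sum(recent) / len(recent) if recent else None)
        recent.append(POINTS[result])
    return form


# Illustrative result sequence for one team, in chronological order.
form = rolling_form(["Win", "Win", "Draw", "Loss"], window=3)
print(form)
```

Computing the value before appending the current result is the important design choice: it keeps the feature free of leakage from the label.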

Data Sources and Cleaning

Approach and Sources: Our aim was to take the historical collection of FIFA games and exclude the rows and columns that are not relevant. For example, it did not make sense to keep very old data involving countries that no longer exist or participate. Also, there are no FIFA rankings prior to 1993, so we discarded data prior to 1993.
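The row filtering described above might look like the sketch below. The record layout, team names and the `DEFUNCT_TEAMS` set are placeholders invented for illustration; only the 1993 cutoff comes from the report.

```python
from datetime import date

FIRST_RANKED_YEAR = 1993          # no FIFA rankings before this year
DEFUNCT_TEAMS = {"Team X"}        # placeholder for teams that no longer exist

# Placeholder match records (not real data).
matches = [
    {"date": date(1990, 6, 8), "team": "Team A", "opponent": "Team B"},
    {"date": date(1994, 6, 17), "team": "Team A", "opponent": "Team X"},
    {"date": date(2014, 6, 12), "team": "Team A", "opponent": "Team C"},
]


def is_relevant(match):
    """Keep matches from 1993 onward that involve only current teams."""
    recent_enough = match["date"].year >= FIRST_RANKED_YEAR
    both_active = not ({match["team"], match["opponent"]} & DEFUNCT_TEAMS)
    return recent_enough and both_active


kept = [m for m in matches if is_relevant(m)]
print(len(kept), kept[0]["date"])
```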

Then, we added new columns (features) from Player Data, Demographics (Population, GDP, etc.), game location and its distance from each team's country, and stakes (whether it is a serious match or a friendly one). We also added “Form Factor” and “Team Chemistry” variables, which we will explain later.

Data Sources:

This is the shortest distance from the border of one country to the border of the other. Neighboring countries have a distance of zero.

This is yearly population by country.

Data Cleaning

There were a few issues, primarily when joining data from different sources.

Organization of the Report

The rest of the report is organized as follows. First, we discuss the Exploratory Data Analysis (EDA) of all the data we have collected and cleaned. In that section, we discuss the relationship between the features and the outcome in the training set, and we show the new features created from the available data based on intuition about the game. The next section covers Feature Engineering and Model Building: we show how the models were built and compare their results. Finally, the last section presents our conclusions and the scope for future work. We would like to thank the Professors and Teaching Fellows for their guidance in the course as well as on this project.