Conclusion and Future Work

Conclusion

Our aim for this project was to predict the outcomes of the 64 games played in the 2018 World Cup. We took the results of games from the past several decades, discarded some of the older data, formed some obvious variables and some new features, and took a shot at forming a couple of more complex features based on intuition. Although our ultimate goal was high predictive capability, our initial goal was to apply the methods learned in the course, such as Data Collection, Data Wrangling, and forming a core data set for all the team members to work on together. This in itself was a formidable task because of discrepancies across the different datasets found on the Internet.

With a reasonably cleaned-up dataset, we ran several models learned in the course: Logistic Regression; decision-tree-based models such as Bagging, Boosting, and Random Forests; Neural Networks; and Support Vector Machines. Some of the more advanced models surprised us by failing to produce highly accurate results. This was a good lesson: in practice there may not be one method that always beats the others; performance can depend on the data and on the feature set.
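To illustrate the comparison described above, here is a minimal sketch of how we might run the course models side by side. The dataset below is synthetic stand-in data, not our actual match features, and the model settings are illustrative assumptions rather than our tuned configurations.

```python
# Sketch: comparing the course models on synthetic 3-class data
# (standing in for Win/Draw/Loss). Features here are hypothetical.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic data: 1000 "matches", 10 numeric features, 3 outcomes.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Bagging": BaggingClassifier(random_state=42),
    "Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=42),
}
for name, model in models.items():
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {acc:.3f}")
```

Running a loop like this makes the point in the text concrete: the ranking of the methods shifts with the data, so no single model can be assumed best in advance.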

Given our time limitations, we reviewed the results of the various models and picked two to delve into and tune further: Random Forests and Neural Networks. Our best three-outcome (Win/Draw/Loss) accuracy was 66.67%, achieved with Random Forests. With the same Random Forest algorithm restricted to two classes (Win and Loss, discarding Draw), accuracy improved markedly to 82%.
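The two-class variant amounts to filtering drawn matches out before training and evaluation. A minimal sketch, using hypothetical random data in place of our real feature set:

```python
# Sketch: re-scoring a Random Forest on two classes only,
# dropping drawn matches. Data below is hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = rng.integers(0, 3, size=600)   # 0 = Loss, 1 = Draw, 2 = Win

mask = y != 1                      # discard the Draw class entirely
X2, y2 = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(X2, y2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"two-class accuracy: {clf.score(X_test, y_test):.2f}")
```

Dropping draws both removes the hardest-to-separate class and turns the problem into a binary one, which is why the same algorithm scores noticeably higher under this framing.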

Future Work

Given the accuracy rates we have seen for most of our models, we feel we cannot yet confidently predict World Cup game results. For example, the “team chemistry” variable that we ended up discarding after a lot of effort needs to be reinvented; that is, more research is needed on how to represent a team’s “chemistry” in the data. Is it just the standard deviation of the players’ strengths? Should it be a metric computed separately for attack, defense, and midfield? Or should it be an entirely different metric that captures the coordination among team members?
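The first two candidate definitions above could be prototyped in a few lines. This is only a sketch of the idea; the team names, positions, and strength ratings below are invented for illustration.

```python
# Sketch: candidate "team chemistry" features, computed as the standard
# deviation of player strength ratings, overall and per position group.
# All player data below is hypothetical.
import pandas as pd

players = pd.DataFrame({
    "team":     ["BRA"] * 6 + ["GER"] * 6,
    "position": ["attack", "attack", "midfield",
                 "midfield", "defense", "defense"] * 2,
    "strength": [90, 88, 85, 84, 83, 82,   # BRA squad
                 87, 86, 86, 85, 85, 84],  # GER squad
})

# Candidate 1: overall spread of strength within each squad.
overall = players.groupby("team")["strength"].std()
# Candidate 2: the same spread, split by attack/defense/midfield.
by_line = players.groupby(["team", "position"])["strength"].std()
print(overall)
print(by_line)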

As future work, we propose more research on forming such features, which could lift our model's prediction accuracy from the sixties into the eighties and make it a formidable model. We could also spend more time tuning the models' hyperparameters to improve accuracy. As mentioned before, since the test data is of a specific type, that is, World Cup matches played in a single host country where most matches are on neutral ground (little home advantage), should we consider a smaller, more comparable subset for our training set? Would that improve accuracy? We leave many such questions as future work.
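The hyperparameter-tuning step mentioned above could be structured as a cross-validated grid search. This sketch uses synthetic data and an illustrative grid; the actual parameter ranges worth searching would depend on our real feature set.

```python
# Sketch: tuning Random Forest hyperparameters with a grid search.
# Data is synthetic and the grid values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=3, random_state=1)
grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", None],
}
search = GridSearchCV(RandomForestClassifier(random_state=1), grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Cross-validation here also hints at an answer to the train-set question raised above: the same machinery could compare a full historical training set against a smaller, World-Cup-only subset under identical settings.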
