TLDR: I made an app that predicts outcomes of ODI cricket matches - find it here.

Following on from my previous play with visualising cricket data, I wanted to apply some of the machine learning tools I learnt in my first-year classes at UCLA and see how well I could build a predictive model for ODI cricket results.

I excluded a little more data this time. ODI cricket has evolved a lot in recent years, so to help predict current games I only considered matches played in 2010 or later.
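As a rough illustration of that cut-off, here is a minimal Python sketch. The record layout (`date`, `team_a`, `team_b`, `winner`) and the placeholder rows are assumptions made up for the example, not the actual dataset or the code behind the app.

```python
from datetime import date

# Placeholder records; the field names and values are hypothetical.
matches = [
    {"date": date(2008, 3, 2), "team_a": "Team A", "team_b": "Team B", "winner": "Team A"},
    {"date": date(2010, 1, 5), "team_a": "Team B", "team_b": "Team C", "winner": "Team C"},
    {"date": date(2015, 2, 14), "team_a": "Team A", "team_b": "Team C", "winner": "Team A"},
]

FIRST_YEAR = 2010  # keep matches played in 2010 or later

# Drop everything before the cut-off year.
recent = [m for m in matches if m["date"].year >= FIRST_YEAR]

print(len(matches), "matches in total,", len(recent), "kept")
```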

I split the data into training and testing sets and tried a wide array of models. For each model I tried various terms, including interaction terms, to find the best performer in terms of prediction error. A non-exhaustive list follows:

  • Logistic/Probit/Cauchit Regression
  • Logistic Lasso Regression
  • Logistic Ridge Regression
  • Random Forest
  • Naive Bayes Classifier
  • Support Vector Machine (SVM)
  • Kernel Regularised Regression
  • Random Forest Ensemble
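The comparison itself can be sketched in a few lines of Python: hold out a test set, fit each candidate on the training set, and rank the candidates by test misclassification rate. Everything here is an assumption for illustration: the synthetic data, the single made-up feature, and the two toy "models" stand in for the real features and the models listed above.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, rows):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(1 for x, y in rows if predict(x) != y)
    return wrong / len(rows)

# Synthetic (feature, label) pairs: x is a made-up strength difference,
# y is 1 if the first team wins. The outcome is deliberately noisy.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    y = 1 if rng.random() < 0.5 + 0.3 * x else 0
    data.append((x, y))

train, test = split_train_test(data)

def fit_majority(train_rows):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in train_rows)
    label = 1 if 2 * ones >= len(train_rows) else 0
    return lambda x: label

def fit_sign_rule(train_rows):
    """Toy rule: predict a win whenever the feature is positive."""
    return lambda x: 1 if x > 0 else 0

candidates = {"majority": fit_majority, "sign rule": fit_sign_rule}
test_errors = {name: error_rate(fit(train), test) for name, fit in candidates.items()}
best = min(test_errors, key=test_errors.get)
print(test_errors, "best:", best)
```

Scoring on held-out matches rather than the training data is what keeps the more flexible models from looking better than they really are.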

The best model in the end was ridge-regularised logistic regression on a subset of the available terms. Notably, the more flexible kernel regularised least squares did not improve on it, suggesting that the issue was not a mis-specified model but inherently noisy data. Anecdotally, this seems reasonable.
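For readers who have not met it, ridge-regularised logistic regression minimises the average log-loss plus a penalty of (lam / 2) times the squared size of the weights, which shrinks the coefficients towards zero. The sketch below fits it by plain gradient descent on a tiny made-up dataset. The function names, the penalty strength `lam`, the learning rate, and the data are all assumptions for illustration; this is not the code or the tuning used for the app.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_ridge_logistic(X, y, lam=1.0, lr=0.1, n_iter=2000):
    """Minimise mean log-loss + (lam / 2) * ||w||^2 by gradient descent.

    The intercept b is left unpenalised, as is conventional.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            # Log-loss gradient plus the ridge term lam * w[j].
            w[j] -= lr * (grad_w[j] / n + lam * w[j])
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that the label is 1 for feature vector x."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Tiny made-up dataset with one feature and some label noise.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]

w_weak, b_weak = fit_ridge_logistic(X, y, lam=0.01)
w_strong, b_strong = fit_ridge_logistic(X, y, lam=1.0)
print("weak penalty:", w_weak, "strong penalty:", w_strong)
```

A stronger penalty pulls the coefficient closer to zero, so the predicted probabilities sit nearer 0.5. With noisy outcomes, that caution is one plausible reason a penalised linear model can hold its own against more flexible ones.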