Premier League Match Prediction
- Elisha Antunes
- Jun 12, 2025
- 3 min read
Updated: Jun 24, 2025
As a lifelong football fan and data enthusiast, I’ve often wondered: how much of a football match is truly unpredictable? While tactics, injuries, and momentum make the game beautifully chaotic, patterns do emerge across seasons.
In this project, I set out to uncover some of those patterns — and to see if a machine learning model could predict match outcomes in the Premier League using historical match data.
Goal of the Project
The objective was simple, but a bit ambitious in execution:
Predict the full-time result (home win, draw, away win) of a Premier League match based solely on pre-match statistical data.
To pull this off, I used a public dataset, cleaned and prepped the stats into something a model could understand, trained it to make predictions, and then tested how well it did — all while digging into the small details that often hint at whether a team will win, lose, or draw.
Step 1: Cleaning and Structuring the Data
The dataset used was a season file from Football-Data.co.uk, specifically E0.csv, which contains detailed match-level statistics for the Premier League.
I began by narrowing the scope to just the essential columns:
HomeTeam and AwayTeam
FTHG (Full-Time Home Goals)
FTAG (Full-Time Away Goals)
FTR (Full-Time Result: H, D, or A)
To prepare the data for modeling:
I removed any rows with missing values
I encoded the result variable (FTR) into a numerical format:
0 = Home Win
1 = Draw
2 = Away Win
I created new features such as:
goal_diff = home goals - away goals
total_goals = home goals + away goals
I also label encoded team names to allow the algorithm to process them numerically — a crucial step in preparing categorical data for training.
These simple but effective transformations helped the model begin to recognize patterns in team performance and goal distributions.
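The preparation steps above can be sketched in pandas and scikit-learn. The tiny DataFrame and the derived column names (`result`, `home_enc`, `away_enc`) are illustrative stand-ins, not the exact names from my notebook, but the column headers match the Football-Data.co.uk file:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hand-made sample standing in for E0.csv (same column names as Football-Data.co.uk)
df = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Everton", "Arsenal"],
    "AwayTeam": ["Chelsea", "Liverpool", "Everton"],
    "FTHG": [2, 0, 1],
    "FTAG": [1, 0, 3],
    "FTR":  ["H", "D", "A"],
})

df = df.dropna()  # remove any rows with missing values

# Encode the full-time result: 0 = home win, 1 = draw, 2 = away win
df["result"] = df["FTR"].map({"H": 0, "D": 1, "A": 2})

# Derived features
df["goal_diff"] = df["FTHG"] - df["FTAG"]
df["total_goals"] = df["FTHG"] + df["FTAG"]

# Label-encode team names so the model can consume them; fitting one
# encoder on both columns keeps each club's code consistent
team_encoder = LabelEncoder()
team_encoder.fit(pd.concat([df["HomeTeam"], df["AwayTeam"]]))
df["home_enc"] = team_encoder.transform(df["HomeTeam"])
df["away_enc"] = team_encoder.transform(df["AwayTeam"])
```

Fitting a single encoder across both team columns matters: otherwise the same club could receive different codes depending on whether it played at home or away.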
Step 2: Model Building with Random Forest
I chose the Random Forest Classifier from scikit-learn — a proven ensemble method that’s well-suited for tabular data and multi-class classification problems.
The process involved:
Splitting the data into training (80%) and testing (20%) subsets
Training the model on the selected features:
Encoded HomeTeam, AwayTeam
goal_diff
total_goals
Fitting the model and generating predictions on the test set
```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
Step 3: Evaluating Model Performance
To evaluate how well the model predicted match outcomes, I calculated the accuracy and plotted a confusion matrix.
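A minimal sketch of that evaluation, using hypothetical label arrays in place of the real `y_test` and `y_pred` (0 = home win, 1 = draw, 2 = away win):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical stand-ins for the real test labels and predictions
y_test = np.array([0, 0, 1, 2, 0, 2])
y_pred = np.array([0, 0, 0, 2, 0, 1])

acc = accuracy_score(y_test, y_pred)

# Rows = actual class, columns = predicted class (0=H, 1=D, 2=A)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

print(acc)
print(cm)
```

For the plot itself, scikit-learn's `ConfusionMatrixDisplay` can render `cm` directly with matplotlib.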
The confusion matrix revealed:
The model had the highest success rate in predicting home wins, which aligns with the well-known home advantage in football.
Draws were the most difficult to predict — a known challenge in sports modeling due to their low frequency and variance.
The overall accuracy was respectable for a baseline model using only team and goal statistics, without incorporating deeper variables like team form, injuries, or betting odds.

These results validated the approach and helped me understand where the model's predictive power was strong — and where it could be improved.
Key Insights
Simple features can go a long way. Goal difference and total goals proved useful indicators.
Team names carry weight. Encoding teams captured latent strength differences over a season.
Draws require more context. Without advanced stats (like xG or match tempo), predicting draws remains a challenge.
Future Enhancements
There are several promising directions for taking this project further:
Incorporate advanced statistics like expected goals (xG), possession, or pass accuracy
Add player-level data or lineup information
Explore alternative models such as XGBoost, logistic regression, or neural networks
Evaluate performance across multiple seasons for better generalization
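As a first step toward those comparisons, cross-validation makes it easy to line up candidate models side by side. This sketch uses synthetic features as a stand-in for the real match data; the point is the comparison loop, not the scores:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded team/goal features and 3-class labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)

for model in (RandomForestClassifier(n_estimators=50, random_state=0),
              LogisticRegression(max_iter=1000)):
    # 5-fold cross-validation gives a steadier estimate than one split
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```

The same loop extends naturally to XGBoost or a small neural network once those libraries are in the environment.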
Takeaways
This project wasn’t just about football — it was about seeing how structured data, smart feature choices, and beginner-friendly machine learning tools can come together to mimic the kind of gut instinct we usually chalk up to experience.
It also reflects the kind of challenge I enjoy: breaking down a high-variance, high-stakes environment into measurable signals and turning them into repeatable, testable predictions.


