By Chris Clouten
As fans of a particular club it’s reasonable to believe that we Gooners don’t exactly get the full picture of the league. Sure, many of us watch other games, or teams, but not with the same eye for detail and painstaking patience as we do with the Gunners. As Arsenal fans (and in a sense, fans of football in general) we have some idea of what the generic, lifeless data points of football might tell us about the game and league we love. Yet when one digs past the domain-specific knowledge and into the numbers, the results are quite interesting.
I collected data on the past four seasons of the Barclays Premier League. What I wanted to know was quite simple: over the course of a season, looking at various team statistics, which ends up being the better predictor of a team’s overall success (i.e. points total): home form or away form?
With my datasets for each season split into home, away, and overall, I began building a regression model to analyze the problem and to predict final league position based on a team’s statistics (or in data-nerd talk, features) for both home and away. For the regression itself, I chose Ordinary Least Squares. OLS regression is a method for estimating the unknown parameters in a linear regression model by minimizing the sum of squared vertical distances between the observed responses in the dataset (a club’s overall points total) and the responses predicted by the linear approximation (the points predicted by the model given home or away data).
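For readers who would like to see the idea in code, here is a minimal sketch of OLS with a single feature, written in plain Python. It is not the project's actual model, which selected several features per dataset; the numbers, the choice of home points as the lone feature, and the names fit_ols and predict are all made up for illustration.

```python
# Hypothetical numbers for illustration only; NOT taken from the project's datasets.
home_points = [30.0, 35.0, 40.0, 45.0, 50.0]    # assumed single feature
total_points = [50.0, 61.0, 68.0, 77.0, 89.0]   # assumed observed response


def fit_ols(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for one feature.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted response for a single feature value."""
    return intercept + slope * x


slope, intercept = fit_ols(home_points, total_points)
print(f"Predicted total for 42 home points: {predict(slope, intercept, 42.0):.1f}")
```

With more than one feature the same principle applies, but the fit is usually handed to a statistics library rather than written out by hand.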
With the data cleaned, the model revealed what I’m sure many of us would have guessed: how a team does at home is a better indicator of where that team will finish the season than how it does on the road. Statistically speaking, the difference is modest but not meaningless: the model produced an average R-squared of .90 for home and .85 for away. For anyone unfamiliar with statistics, R-squared measures how well the model’s predictions fit the actual results; it is the share of the variation in points totals that the model accounts for. The higher the R-squared, the better the fit. In our case, the home dataset was the better fit, by about .05 (five percentage points).*
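To make R-squared a little more concrete, here is a small sketch of how it can be computed from actual and predicted points totals. The function name r_squared and the numbers are hypothetical, chosen only to show the arithmetic; they are not the project's results.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of simply guessing the mean every time.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical points totals for illustration only.
actual_points = [50.0, 61.0, 68.0, 77.0, 89.0]
predicted_points = [51.0, 60.0, 69.0, 78.0, 87.0]
print(f"R-squared: {r_squared(actual_points, predicted_points):.3f}")
```

A value of 1 means the predictions match the actual results exactly, while a value of 0 means the model does no better than guessing the average points total for every club.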
While all this is great (many of us would have guessed to begin with that home form was more important than away), what proves to be quite novel are the features the model selected for each dataset, as well as the features both datasets shared. These are both topics I’ll dig into in another post. For now, what is important is that over the past four seasons, how a team performs at home has been the better predictor of how they will end the season.
Looking at this season’s league table for Home Results one sees (obviously) Manchester City topping it, with Chelsea not far behind, followed by Liverpool and then Arsenal (both of whom have played one fewer home game than Chelsea and City). What the model tells us is that each of the aforementioned teams (yes, as much as it pains me to say it, even The Scousers) should be considered title contenders right now, even if City are currently six points off top-of-the-table Arsenal and Liverpool have hit a dip in form.
With just one home defeat in the League this season, Arsenal have done quite well to bounce back from the opening day defeat to Aston Villa. Yet, long as the season is, if the past four seasons are anything to go by, then how Arsenal perform at home over the rest of the season will be of crucial importance to whether or not this season has a happy ending.
* This project is fully summarized here: http://triplec1988.github.io/regressionBPL/#!. All source code, datasets, images, presentation and analysis are hosted on GitHub, licensed under the Apache License 2.0.