
Deep Learning and Transfer Learning Architectures for English Premier League Player Performance Forecasting

Daniel Frees* 1   Pranav Ravella 1   Charlie Zhang 1

arXiv:2405.02412v1 [cs.LG] 3 May 2024

Abstract

This paper presents a groundbreaking model for forecasting English Premier League (EPL) player performance using convolutional neural networks (CNNs). We evaluate Ridge regression, LightGBM, and CNNs on the task of predicting a player's upcoming FPL score from historical FPL data over the previous weeks. Our baseline models, Ridge regression and LightGBM, achieve solid performance and emphasize the importance of recent FPL points, influence, creativity, threat, and playtime in predicting EPL player performances. Our optimal CNN architecture achieves better performance with fewer input features and even outperforms the best previous EPL player performance forecasting models in the literature. The optimal CNN architecture also achieves very strong Spearman correlation with player rankings, indicating strong implications for supporting the development of FPL artificial intelligence (AI) agents and providing analysis for FPL managers. We additionally perform transfer learning experiments on soccer news data collected from The Guardian for the same task of predicting upcoming player score, but do not identify a strong predictive signal in natural-language news texts, achieving worse performance than both the CNN and baseline models. Overall, our CNN-based approach marks a significant advancement in EPL player performance forecasting and lays the foundation for transfer learning to other EPL prediction tasks, such as win-loss odds for sports betting and the development of cutting-edge FPL AI agents.

1. Introduction

Soccer is the most popular sport in the world, and most soccer fans agree that the greatest of its many leagues is the English Premier League (EPL). The Premier League is so popular that over 11 million people sign up each year to manage their own 'Fantasy' Premier League (FPL) team, selecting players each week based on their expected performance (Fantasy Football Reports). FPL managers then score points according to the real-world performance of their selected players that week. For example, if a midfielder such as James Maddison gets an assist and a goal, he scores 3 + 5 = 8 points, plus further points for clean sheets and other positive contributions. It is natural to ask whether machine learning can competitively model FPL player performance, measured through the proxy of FPL points.

The task of predicting a single player's upcoming performance is challenging, especially because soccer is a low-scoring sport with very high variation in possible outcomes (Bunker & Susnjak, 2022). However, achieving solid performance on this fine-grained regression task could translate into much better performance on the simpler tasks of player selection (ranking) and match outcome prediction (3-class classification). As a result, developing a superior player forecasting model has implications for FPL artificial intelligence (AI) agents, for profiting from sports betting by winning against official betting agencies in expectation, and more.

Here we construct several model architectures with the goal of predicting upcoming player match performance. In predicting a player p's upcoming gameweek performance, our input data consists of tabular player performance data from the prior gameweeks. For our transfer learning models, we also input recent news corpus data mentioning p leading up to the upcoming gameweek. Since FPL points are scored differently for each position, four models are trained for each architecture: goalkeeper (GK), defender (DEF), midfielder (MID), and forward (FWD) models.

1 Stanford University. Correspondence to: Daniel Frees <dfrees@stanford.edu>, Pranav Ravella <pravella@stanford.edu>, Charlie Zhang <czzq@stanford.edu>.


2. Related Works

2.1. Soccer Match Outcome Binary Classification

Researchers have applied machine learning to soccer forecasting problems extensively over the last decade. EPL match prediction was a very popular research area in the mid 2010s, with researchers investigating Poisson processes, Bayesian networks, graph modeling, and other techniques with success (Baboota, 2018; Grund, 2012; Koopman, 2013; Razali, 2017). For example, Koopman (2013) successfully modeled EPL match results with Poisson processes and beat betting odds at sports agencies to win money. While these models are strong, they address a simpler problem than player performance forecasting.

2.2. Cutting Edge Player Performance Forecasting

More recently, modern machine learning approaches stemming from the field of natural language processing (NLP), such as LSTMs, designed for learning sequential data, have been proposed as strong candidates for modeling EPL player performance, since they can model complex patterns in time-series data (Lindberg & Sodenberg, 2020). While LSTMs are strong model candidates for time-series data, the most recent research suggests that one-dimensional (1D) CNNs may be more powerful due to their increased flexibility in modeling time patterns through filters (Wibawa, 2022). In fact, CNNs have been shown to be a feasible architecture for EPL player performance forecasting, though the original research in this area did not experiment widely to find an optimal CNN architecture, and failed to report test errors (Ramdas, 2022).

2.3. Natural Language Signals

The idea that predictive signals for upcoming soccer match results could be obtained from text data is not new; in fact, several researchers have investigated the use of Twitter text data for this task (Godin, 2019; Schumaker, 2016). Godin had great success, managing a 30% profit by betting laterally to official bookmakers using his Twitter aggregation model. However, no literature to our knowledge has investigated the use of a news corpus for soccer prediction, nor has the efficacy of text data been evaluated for the more challenging task of player performance forecasting.

3. Datasets and Features

We scraped 2020-2021 and 2021-2022 season EPL data from vaastav/Fantasy-Premier-League (vaastav, 2023). Fuzzy matching was used to clean up variations in player names so that data could be matched together; slight mismatches were typically the result of abbreviation or variations in diacritics. Each player's data was cleaned using pandas (McKinney, 2011) and augmented with a column for upcoming match difficulty, feature-engineered from official difficulty ratings for EPL teams each season. Benched players with no minutes played were dropped. Notably, if benched players are not dropped, model performance is skewed high, since predicting that these players will score 0 points becomes nearly trivial for every model investigated here. After dropping benched players, data was organized by position (GK, DEF, MID, FWD).

Time-series data was discretized into windows of size w. For the CNN, the entire w × f window of f features is retained to allow for pattern learning in the CNN filters; we denote this feature window X(i). For baseline modeling, a sliding average over w weeks is used to generate a sliding-average feature vector x(i). The target data y(i) and upcoming match difficulty d(i) are collected from the week following the current window.

Figure 1. High-Level Data Scraping and Cleaning Pipeline.

Based on the stability of validation errors in initial experiments, we settled on split sizes of approximately 60% training, 25% validation, and 15% testing data. Splits were performed player-by-player to avoid data leakage.

Table 1. Dataset Size by Model (# of Examples)
Note: the transfer model was trained on fewer data due to compute limitations.

        Train   Validation   Test
GK      1272    558          429
DEF     6016    2548         1479
MID     6753    2993         1839
FWD     2437    963          735

Notably, some players are much more challenging to forecast than others (e.g., a superstar like Erling Haaland will fluctuate in score more wildly than a solid CDM like Yves Bissouma). As a result, some splits end up with more predictable players, and validation and test error can wind up lower than train error. To mitigate this, we stratified on player skill (discretized from an engineered helper feature, avg score). Cross-validation (CV) demonstrated improved average performance with skill stratification and more consistency between splits (stdev score stratification performed equivalently, likely due to the heteroscedasticity of player points residuals across avg score).
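To make the player-wise, skill-stratified split concrete, the sketch below shows one way such a 60/25/15 split could be produced. The column names (player, total_points), the bin count, and the use of scikit-learn's train_test_split are illustrative assumptions rather than our exact pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_player_split(df: pd.DataFrame, n_bins: int = 4, seed: int = 0):
    """Split player-wise (no player appears in two splits), stratifying on a
    discretized skill proxy (mean FPL points per gameweek)."""
    # One row per player with an engineered avg_score helper feature.
    skill = (df.groupby("player")["total_points"].mean()
               .rename("avg_score").reset_index())
    skill["skill_bin"] = pd.qcut(skill["avg_score"], q=n_bins,
                                 labels=False, duplicates="drop")

    # 60 / 25 / 15 split on *players*, stratified by skill bin.
    train_p, rest_p = train_test_split(
        skill, train_size=0.60, stratify=skill["skill_bin"], random_state=seed)
    val_p, test_p = train_test_split(
        rest_p, train_size=0.625, stratify=rest_p["skill_bin"], random_state=seed)

    in_split = lambda players: df["player"].isin(players["player"])
    return df[in_split(train_p)], df[in_split(val_p)], df[in_split(test_p)]
```

With very small player sets (e.g., goalkeepers), some skill bins may be too sparse to stratify on, which is exactly the difficulty discussed in Section 6.1.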


Standard scaling (z = (x − µ)/σ) was applied to the CNN and Ridge regression input data, but not to the LightGBM input data: LightGBM was the primary comparison model competing with our custom CNN, and tree-based models are invariant to feature scaling.

Table 2. Example X(i) with w = 2

goals scored   assists   total points   name
0              0         1              Aleks. Mitrović
2              0         12             Aleks. Mitrović

Table 3. Example d(i), y(i) corresponding to the X(i) in Table 2

Upcoming Match Difficulty (d(i))   Target Score (y(i))
-1                                 2
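The windowing that produces examples like Tables 2-3 can be sketched as follows for a single player's chronologically ordered gameweek history. The column names (difficulty, total_points) and the helper signature are assumptions for illustration, not the exact preprocessing code.

```python
import numpy as np
import pandas as pd

def build_examples(player_df: pd.DataFrame, feature_cols, w: int = 3):
    """Slice one player's ordered gameweeks into (X_window, x_avg, d, y) tuples.

    X_window : w x f matrix of raw features (CNN input, X(i))
    x_avg    : f-vector of sliding-window means (baseline input, x(i))
    d        : upcoming match difficulty for the target week (d(i))
    y        : FPL points scored in the target week (y(i))
    """
    feats = player_df[feature_cols].to_numpy()
    examples = []
    for t in range(len(player_df) - w):
        X_window = feats[t:t + w]              # previous w gameweeks
        x_avg = X_window.mean(axis=0)          # sliding average for baselines
        target = player_df.iloc[t + w]         # the week being predicted
        examples.append((X_window, x_avg,
                         target["difficulty"], target["total_points"]))
    return examples
```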
3.1. Transfer Learning Dataset

Figure 2. Expansion of Data Pipeline for Transfer Learning per Position.

Additional steps were required to construct the transfer learning dataset. News corpus data for each player was scraped from The Guardian (Guardian, 1821). An average of ≈ 950 articles was found per player, but some players were discussed much more than others (stdev ≈ 900). For each example, the three most recent articles referencing the player prior to the upcoming game were collected. The tabular windowed data (X) was reworded as a sentence and appended to each text example alongside the first 512 words of each article (N) to form each example XN (Figure 2).

3.2. Data EDA

4. Methods

4.1. Baseline Models

For our first baseline model, we chose a simple regression algorithm: Ridge regression. For a more complex baseline, we evaluated gradient boosting. Gradient boosting builds a sequence of trees in which the next tree in the sequence is fit to the residuals of the current ensemble. In other words, let T(x; \Theta_m) be the m-th tree in the ensemble; then its prediction target should be the negative gradient, -g_m, at iteration m, where

\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big( -g_{im} - T(x_i; \Theta) \big)^2

In essence, it is an ensemble method whose output can be expressed as a combination of the predictions of multiple weak learners, f_M(x) = \eta \sum_{m=1}^{M} T(x; \Theta_m). The implementation of gradient boosting we use here is LightGBM (Ke et al., 2017). Compared to other implementations, LightGBM grows trees leaf-wise (as opposed to depth-wise), splitting the leaf node with the highest gain across all depth levels (Shi, 2007).
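A minimal sketch of the two baselines is shown below, using scikit-learn's Ridge and LightGBM's scikit-learn wrapper. The input arrays are assumed to be the sliding-average features x(i) with match difficulty appended; the LightGBM hyperparameter values mirror the optimal configuration reported later in Section 5.1 and are otherwise placeholders.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge

# Assumed arrays: x_avg_* are (n_examples, f) sliding averages, d_* are difficulties.
X_train = np.hstack([x_avg_train, d_train.reshape(-1, 1)])
X_val = np.hstack([x_avg_val, d_val.reshape(-1, 1)])

ridge = Ridge(alpha=1.0).fit(X_train, y_train)

gbm = LGBMRegressor(
    n_estimators=50,        # small ensemble; CV favored aggressive complexity limits
    max_depth=3,
    num_leaves=7,
    min_child_samples=70,   # minimum observations per leaf
    reg_lambda=10.0,        # L2 regularization strength
    learning_rate=0.1,
).fit(X_train, y_train)

print("Ridge val MSE:", np.mean((ridge.predict(X_val) - y_val) ** 2))
print("LGBM  val MSE:", np.mean((gbm.predict(X_val) - y_val) ** 2))
```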


4.2. Deep Learning Models

Figure 3. Custom CNN Architecture for FPL Performance Forecasting.

For our deep learning model, we architected a custom one-dimensional convolutional neural network (1D CNN) using TensorFlow (Abadi et al., 2015), enabling the learning of complex time-series patterns by using backpropagation to derive filter weights. The model was constructed iteratively through eleven versions, performing grid searches for optimal hyperparameters and evaluating the average of the top-10 lowest validation MSEs at each iteration to choose the best architecture going forward. Our optimal configuration consists of a single one-dimensional convolution layer, which is flattened and concatenated with upcoming match difficulty before being passed through a single dense hidden layer, and then to the final output layer (Figure 3).

Our cost function for this CNN consisted of MSE with ElasticNet (L1L2) regularization on both the convolution weight matrices (C) and the dense layer weight matrix (W^{[1]}), with L1 strength determined by \lambda_1 and L2 strength determined by \lambda_2. Thus, our cost function for batch size B was as follows:

J_{CNN} = \frac{1}{B} \sum_{i=1}^{B} \big( y^{(i)} - \hat{y}^{(i)} \big)^2
        + \lambda_1 \big( \|C\|_1 + \|W^{[1]}\|_1 \big)
        + \lambda_2 \big( \|C\|_2^2 + \|W^{[1]}\|_2^2 \big)

To train our CNN, we used backpropagation to calculate gradients with respect to our cost function, and the Adam optimizer (Kingma & Ba, 2014) for weight updates. Early stopping was employed to further counteract overfitting.
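The sketch below illustrates this architecture and training setup in Keras. The filter count, dense-unit count, and regularization strengths are placeholders; the ElasticNet penalty, Adam optimizer, and early-stopping settings follow the description above and Section 5.2.2.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_fpl_cnn(w: int, f: int, n_filters: int = 64, kernel_size: int = 2,
                  dense_units: int = 32, l1: float = 1e-4, l2: float = 1e-4):
    """Minimal sketch of Figure 3: 1D conv over the w x f window, flattened,
    concatenated with upcoming match difficulty, one dense hidden layer,
    then a linear output predicting FPL points."""
    reg = regularizers.L1L2(l1=l1, l2=l2)        # ElasticNet penalty on C and W[1]

    window = tf.keras.Input(shape=(w, f), name="feature_window")      # X(i)
    difficulty = tf.keras.Input(shape=(1,), name="match_difficulty")  # d(i)

    x = layers.Conv1D(n_filters, kernel_size, activation="relu",
                      kernel_regularizer=reg)(window)
    x = layers.Flatten()(x)
    x = layers.Concatenate()([x, difficulty])
    x = layers.Dense(dense_units, activation="relu", kernel_regularizer=reg)(x)
    out = layers.Dense(1, name="predicted_points")(x)

    model = tf.keras.Model([window, difficulty], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
    return model

# Early stopping with the tolerance/patience reported in Section 5.2.2.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=20, restore_best_weights=True)
# model.fit([X_train, d_train], y_train, validation_data=...,
#           epochs=250, batch_size=32, callbacks=[early_stop])
```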
4.3. Transfer Learning Model Architecture

Forecasting player performance using news corpus data boils down to a text regression problem. Scoring text using transformers has proven to be one of the best methods for tasks such as these (Hasan, 2022). A transformer network uses attention blocks to model token connections in input sequences (in the form of directional soft weights between tokens) and to output embeddings. These embeddings can then be fed into a fully-connected layer to output logits (real numbers in the embedding space). For classification, a softmax activation converts the logits into probabilities to predict the next best word (Hasan, 2022), whereas for regression we project the output to a single real-valued score. The self-attention mechanism at the core of transformers enables them to understand context and perform extremely well at extracting signal from longform texts such as news articles. Given the length of our news corpus data, a Longformer model from AllenAI with a sequence length of 4096 was chosen for the transfer learning regression. Most transformer-based models are unable to process long sequences because the attention mechanism scales quadratically; Longformer uses a sliding-window attention mechanism to achieve this in linear time (Beltagy et al., 2020).

Since we desire a real-number output, we want a single output logit from the regression. Our selected HuggingFace model (Wolf et al., 2020) applies a sigmoid activation to the output logit, so we appended a simple scaling layer to project the sigmoid output onto a reasonable range of possible FPL scores (between −5 and 24). This architecture is not ideal because it attempts to train a scaled classification head as a regression. However, compute constraints meant that we could not freeze layers up to the logits and retrain a fully-connected layer. Our goal here was merely to identify whether there is any learnable signal in these longform texts. Should there prove to be a predictive signal, it would make sense to purchase a GPU cluster to learn weights more deeply back into the Longformer network and achieve better performance.

Each input example XN was tokenized into 4096 tokens for input into our Longformer model. Each model (GK, DEF, MID, FWD) took around four to seven hours to train for 25 epochs using an NVIDIA A100 with PyTorch (Paszke et al., 2019), so tuning was limited and data was restricted to the 2021-2022 season.
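A hedged sketch of this setup is shown below. The checkpoint name (allenai/longformer-base-4096) and the wrapper class are assumptions for illustration; only the single-logit sigmoid plus scaling-to-[−5, 24] idea comes from the description above.

```python
import torch
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

MODEL = "allenai/longformer-base-4096"           # assumed checkpoint
tokenizer = LongformerTokenizerFast.from_pretrained(MODEL)
backbone = LongformerForSequenceClassification.from_pretrained(MODEL, num_labels=1)

class ScaledScoreHead(torch.nn.Module):
    """Squash the single logit with a sigmoid, then rescale to the FPL range [-5, 24]."""
    def __init__(self, model, lo: float = -5.0, hi: float = 24.0):
        super().__init__()
        self.model, self.lo, self.hi = model, lo, hi

    def forward(self, **inputs):
        logit = self.model(**inputs).logits       # shape (batch, 1)
        return self.lo + (self.hi - self.lo) * torch.sigmoid(logit)

model = ScaledScoreHead(backbone)
enc = tokenizer("Maddison impressed again for Tottenham ...",  # hypothetical XN text
                truncation=True, max_length=4096, return_tensors="pt")
pred_points = model(**enc)                        # trained against MSE on true FPL points
```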
5. Experiments and Results

We used mean squared error (MSE) as the training objective and evaluation metric for our baseline, CNN, and transfer learning models.

Table 4. Holdout (Test) MSE for Optimal Architectures

        Ridge   LGBM   CNN    Transfer*
GK      6.46    6.22   5.08   8.22
DEF     7.20    7.24   5.87   9.66
MID     6.08    6.11   6.16   8.40
FWD     7.19    7.28   6.22   10.12
AVG     6.73    6.71   5.83   9.10
*Trained only on 2021-2022 due to GPU limitations.

5.1. Baseline ML Experiments

The experiments for Ridge regression and gradient boosting mostly consisted of hyperparameter tuning. We used grid search with stratified 5-fold CV (Pedregosa et al., 2011). This is particularly critical for gradient boosting, as it has numerous hyperparameters to inhibit excessive model complexity. The hyperparameter search space can grow exponentially, so we iteratively performed several rounds of grid search to rule out unreasonable options.
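One way such a search could be wired up is sketched below, with the 5 folds stratified on a discretized player-skill bin and passed to GridSearchCV as precomputed splits. The variables skill_bins, X_train, and y_train are assumed arrays, and the grid values are placeholders.

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "num_leaves": [7, 15, 31],
    "min_child_samples": [20, 70],
    "reg_lambda": [1.0, 10.0],
}

# Stratify folds on a discretized skill bin rather than on the regression target.
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
             .split(X_train, skill_bins))

search = GridSearchCV(LGBMRegressor(), param_grid,
                      scoring="neg_mean_squared_error", cv=folds, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```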


In Figure 5, we observed that even with the adjusted search space, multiple combinations were still vulnerable to overfitting of varying severity. How does the best set of hyperparameters achieve the optimal bias-variance tradeoff? At the ensemble level, only 50 trees are fit. Each tree has a depth of only 3, an L2 regularization strength of 10, and 7 leaf nodes, each holding a minimum of 70 observations. As our dataset is smaller than the typical application scenario that calls for LightGBM, CV ended up favoring hyperparameters that aggressively limit tree complexity.

There are multiple ways we can interpret baseline model outputs and feature importance. For Ridge regression, we can compare the coefficients, shown as a heatmap in Figure 6. For LightGBM, we can examine in Figure 4 what percentage of all splits is based on a given feature, as a proxy for its importance. Alternatively, as visualized in Figure 7, Shapley values can illustrate each feature's numerical contribution to the final tree ensemble output, adding some much-needed explainability (Lundberg et al., 2020).

Figure 4. LightGBM feature importance measured by the percentage of splits performed with a given feature.

Figure 5. 2D histogram of the average performance of LightGBM during CV for each combination in the hyperparameter grid.

First, we observe several features that are consistently important for both baseline models. These include, for example, minutes (players with more playtime have more opportunities to score points and are more likely recognized by the team to be on the starting lineup), difficulty gap (the opponent's strength relative to the player's team can translate to ball possession percentages and pressure), influence/creativity/threat (these FPL-compiled metrics take into account finer-grained match events beyond low-occurrence events such as goals and assists), and total points (this sliding-window average of recent weeks' points is a fair indicator of a player's recent form). Subsequently, we notice the effect of modeling each position separately. For example, the LightGBM goalkeeper model values saves heavily and, compared to other positions, relies much less on creativity/threat (which focus on actions building up to goals) than on influence.

A number of the regression coefficients agree in directionality of effect across positions. This breaks down quite strongly for bps (bonus points), goals scored, goals conceded, clean sheets, threat, and total points. The predictor for which the coefficient trend varies most across positions is goals scored (the sliding-window average of goals in recent weeks) for forwards. One possible explanation is that scoring goals is such a rare event for individual players that a goal-scoring week is often followed by a week without goals and with low points. Our results do seem to suggest that goal-scoring is not an independent event week-to-week. Whether the negative effect of previous-week goals on current-week goals for forwards is an element of psychology, team management, or something else entirely is not within the scope of this analysis, but it does call for further research.

Figure 6. Heatmap of Ridge regression coefficients. Models have the same regularization strength. Warm colors represent positive coefficients and vice versa.

From Tables 4 and 5 it is noticeable that gradient boosting consistently achieved lower MSE on the training set, yet tied with Ridge regression during validation and testing. This could be an indicator of slight overfitting despite the grid search to balance fit and generalizability. As a result, Ridge regression is favorable for its superior interpretability. However, in the future, if more advanced features with interplay are introduced (such as tactics, team formations, and player movement on the field), gradient boosting may have a more definitive performance advantage.

Figure 7. LightGBM feature contribution visualized with Shapley values. The starting point E[f(x)] represents the average (FWD) model output over the testing set. The ending point f(x) represents the model output for a given observation, with the feature values specified on the left side.
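A decomposition like Figure 7 could be produced with the shap library roughly as follows; the model and data variable names (fwd_gbm, X_test_fwd) are assumptions for illustration.

```python
import shap

# Per-prediction Shapley values for the trained LightGBM forward model.
explainer = shap.TreeExplainer(fwd_gbm)
explanation = explainer(X_test_fwd)      # SHAP values for every holdout example

# Waterfall plot for one observation: starts at E[f(x)] and walks
# feature-by-feature to the model output f(x), as in Figure 7.
shap.plots.waterfall(explanation[0])
```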
Table 5. Avg MSE during CV for baseline models

           Ridge            LightGBM
Model      Train   Valid    Train   Valid
GK         6.59    6.65     6.34    6.68
DEF        7.11    7.14     6.69    7.12
MID        6.12    6.14     5.87    6.18
FWD        6.75    6.77     6.41    6.73

5.2. Deep Learning Experiments

5.2.1. Hardware

All CNN training was performed using TensorFlow (Abadi et al., 2015) on a 14-core Apple Silicon M3 Max CPU.

5.2.2. Hyperparameters

Experiments were performed using learning curves to determine optimal training hyperparameters of 250 epochs, a learning rate of 0.001, a batch size of 32, and early stopping tolerance and patience of 1 × 10−4 and 20 iterations, respectively. A custom implementation of grid search was developed to determine the optimal window size, kernel size, number of filters, number of dense neurons, activation functions, number of numerical features to include, and stratification strategy for the GK, DEF, MID, and FWD models. Subsequent experiments reduced the hyperparameter search space by fixing certain optimal hyperparameters based on the results of previous experiment iterations.

Figure 8. Large grid search for CNN v6: Train, Val, and Test MSEs.

5.2.3. Architecture

Model architecture was refined iteratively through the first 7 CNN grid search experiments, yielding the architecture described in Section 4.2. This architecture was selected based on decreased average validation MSE, as well as a reduced upper bound on validation MSE. For example, between our v6 architecture experiments (see Figure 8) and our v11 architecture experiments, the upper bound on validation MSE fell by approximately 50%, and average validation MSE fell by 20%.

5.2.4. Optimal Architecture and Parameters

The optimal CNN models, selected by lowest validation MSE, achieved a holdout MSE of 5.08 for goalkeepers, 5.87 for defenders, 6.16 for midfielders, and 6.22 for forwards. The top-5 averaged CNN models achieved a holdout MSE of 6.08 for goalkeepers, 6.50 for defenders, 6.27 for midfielders, and 6.59 for forwards. An abridged view of their hyperparameter configurations can be found in Table 6.

Table 6. Optimal configuration for each CNN model (abridged)

Position   # Weeks Window   Kernel Size   Num Features
GK         6                1             PTS ONLY
DEF        9                1             PTS ONLY
MID        3                2             PTS ONLY
FWD        9                1             PTS ONLY

5.2.5. Feature Importances

As seen in Table 6, only the previous w weeks of FPL points and the upcoming match difficulty d were critical for the CNN to best predict a player's points in the upcoming week.


Notably, including ict index (containing information about influence, creativity, and threat) did not benefit the CNN model, despite the importance of each of these features in the optimal LightGBM models. The minutes feature was important in the optimal trained Ridge regression models (as selected by cross-validation), but also did not prove to be an important feature for the CNN.

5.2.6. Qualitative Analysis

Qualitatively, we observe reasonable results. Taking a closer look at the best and worst examples for the optimized CNN MID model, we see that the two worst predictions (Table 7) are made on outstanding outlier performances of 21 points, likely via hat tricks plus bonus points. The best two examples (Table 7) occur where the CNN predicts relatively poor performance based on low-scoring previous weeks, and the player performs as predicted. The CNN model makes reasonable predictions, but is unsurprisingly unable to forecast outliers.

Table 7. Best and worst 2 examples by MSE (CNN MID model)

Points True   Points Pred   MSE     d(i)   Previous Weeks (Wk 0, Wk 1, Wk 2)
21            2.2           352.8   -2.0   2, 3, 2
21            4.6           270.4    2.0   3, 15, 12
2             2.0           ≈0       0.0   2, 6, 1
1             1.0           ≈0       3.0   0, 1, 0

5.2.7. Predictions vs. True Performances

While our CNN models achieve great performance relative to Ridge regression, LightGBM, and previous deep learning models in the literature (Lindberg & Sodenberg, 2020), they fail to accurately predict outliers. As seen in Table 7, the worst errors for the midfielder model occurred when players had standout performances. Below, we visualize this general failure to predict outliers across each CNN model. Most likely, training punishes outlier predictions due to their rarity: as seen in the plots, player gameweeks with points between 0 and 3 constitute the majority of the data. As such, unless the CNN is extremely certain a player is going to have an outlier performance, it is better off predicting within the typical range of player performance values, since on average a wrong outlier prediction adds greater error than a prediction within the aforementioned low-score range.

Figure 9. Goalkeeper Predictions vs. True Performance (Holdout Data).

Figure 10. Defender Predictions vs. True Performance (Holdout Data).

Figure 11. Midfielder Predictions vs. True Performance (Holdout Data).

Figure 12. Forward Predictions vs. True Performance (Holdout Data).

5.2.8. Overfitting

Overfitting plagued early versions of the CNN, whereby validation MSE would inflect upwards as training continued. Implementing ElasticNet regularization and a slower learning rate with low patience for early stopping largely mitigated this. Some overfitting lingers in the final architecture, likely the result of employing grid search with regular train/val/test shuffle splits rather than cross-validation, due to compute time limitations.


Notice that in Figure 13, the validation data performs better than the training data (both in the first iteration, prior to any learning, and in the final iteration). We hypothesize that this results from the wide variance in prediction difficulty between splits, depending on the score variance of the players in each split. Various strategies for stratified splitting were tested, including stratification on player score standard deviation (week-to-week) and on player average score. Unfortunately, with a limited number of players in the EPL and the need to keep players separate across splits, it was impossible to completely solve the issue. Further discussion of this anomaly can be found in the Limitations section.

Figure 13. CNN v11 GK learning curve. Adam achieves quick convergence; the validation set gets a lucky split and we overfit to the validation data via early stopping.

5.3. Transfer Learning Experiments

5.3.1. Learning News Corpus Signal

Figure 14 shows the learning curves for the MID and FWD transfer learning models. The GK and DEF learning curves were very similar to the MID model's other than final MSE loss. As expected, the model overfit for most positions due to massive model parameterization relative to the small dataset. As a result, the model generalizes quite poorly to the validation and test sets compared with the baseline models and the CNN (Table 4). The FWD model stands out in that it was unable to learn any signal in the news corpus data to improve even the training MSE (see Figure 14).

Figure 14. Learning curve and metrics for MID and FWD.

5.3.2. News Corpus Data Quality

Beyond asserting that articles occurred before upcoming kickoff time and mentioned each player, articles were not evaluated for quality in great detail. Many players did not have recent articles before kickoff time, and as such many articles were outdated or referred to general EPL news completely unrelated to the upcoming game.

Furthermore, predicting goal-scoring (critical for FWDs) for a particular player based on text analysis of upcoming games is harder than predicting a strong defensive performance (critical for DEFs) against a weaker team. This provides another possible justification for the inability of the transfer learning model to learn better predictions for FWDs (Figure 14). Overall, the news corpus did not provide a strong signal, as evidenced by the poorer performance of the transfer learning model compared with the CNN, Ridge, and LightGBM models (Table 4).

5.4. Player Ranking Performance

To gain a better understanding of how well our best CNN models would perform on a player ranking task, such as selecting players in Fantasy Premier League to maximize total team points, we calculate the Spearman correlation between our model predictions and the true player points on the holdout data. We calculate the generalized Spearman's ρs as follows:

\rho_s = \frac{S_{y\hat{y}}}{S_y S_{\hat{y}}}
       = \frac{\tfrac{1}{n}\sum_{i=1}^{n} \big(R(y^{(i)}) - \overline{R(y)}\big)\big(R(\hat{y}^{(i)}) - \overline{R(\hat{y})}\big)}
              {\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} \big(R(y^{(i)}) - \overline{R(y)}\big)^2}\;\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} \big(R(\hat{y}^{(i)}) - \overline{R(\hat{y})}\big)^2}}

where:

• y^{(i)} is the true player points for player i.
• \hat{y}^{(i)} is the predicted player points for player i.
• R(y^{(i)}) is the rank of the true player points for player i.
• R(\hat{y}^{(i)}) is the rank of the predicted player points for player i.
• \overline{R(y)} is the mean rank of the true player points.
• \overline{R(\hat{y})} is the mean rank of the predicted player points, averaged across all players.
• S_{y\hat{y}} is the covariance of the true and predicted ranks.
• S_y and S_{\hat{y}} are the standard deviations of the true ranks and predicted ranks, respectively.


Note that it is important that we calculate the generalized version of Spearman's ρ, which handles tied ranks, since the true player points each gameweek are always integers; this leads to many ties.
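In practice, the tie-aware coefficient above matches what scipy.stats.spearmanr computes by assigning average ranks to tied scores; the toy arrays below are illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr

# Integer FPL points produce many ties; spearmanr handles them via average ranks.
y_true = np.array([2, 2, 6, 1, 9, 2])        # hypothetical true holdout points
y_pred = np.array([1.7, 3.1, 4.9, 0.8, 6.2, 2.4])  # hypothetical model predictions

rho, _ = spearmanr(y_true, y_pred)
print(f"Spearman's rho: {rho:.2f}")
```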
Heung-min might play very well but score no goals and
A high Spearman correlation indicates that the predicted model points are highly monotonically related to the true player points across predicted player gameweeks. By extension, an AI agent or FPL manager using our model to select players should achieve better performance the higher the Spearman correlation. Looking at our Spearman correlation results in Table 8, we see that our optimized Ridge and LightGBM models achieve moderately high Spearman correlations, and the CNN models achieve strong Spearman correlations with the true player rankings.

Overall, our results suggest that the CNN model is also the highest performer on the player ranking task, and that the CNN model is highly promising as a foundation for developing an automatic FPL player selection tool or for providing player selection insights in a human-in-the-loop (HITL) pipeline.

Table 8. Ranking Performance: Spearman's Rho for Optimal Configurations

            GK     DEF    MID    FWD
Ridge       0.50   0.40   0.49   0.47
LightGBM    0.53   0.40   0.49   0.48
CNN         0.70   0.57   0.58   0.62

6. Conclusion

Here we developed, to the best of our knowledge, the highest-performing model for EPL player performance forecasting in the literature. Averaged across positions, our CNN architecture outperformed our best LightGBM models by 13%, and the previous best model in the literature (an LSTM) by 30% (Lindberg & Sodenberg, 2020).¹ Furthermore, we investigated the feasibility of forecasting performance using signals from text data collected via a news corpus, combined with transfer learning of a transformer-based model. Initial experiments did not demonstrate increased efficacy relative to our CNN forecasting model, though data and compute limitations contributed to the lower performance. Lastly, we evaluated Spearman correlation as a proxy for expected player ranking performance for our optimal Ridge, LightGBM, and CNN models, finding that the CNN models achieve highly promising ranking performance.

¹ Same data source, but model evaluated on different seasons.

6.1. Limitations

The foremost limitation of the present work is that we were unable to completely standardize the difficulty of the performance prediction task between train/val/test splits. Because soccer is a low-scoring sport, player FPL points can vary widely. For example, a talented forward such as Son Heung-min might play very well but score no goals and receive a yellow card, netting 1 point. The next week, Son might score a hat trick and bonus points, netting 17 points. Predicting these high-score, low-probability events is significantly more challenging than predicting a moderately positive performance (1 assist, 5-6 points) by a consistent playmaking midfielder such as Kevin De Bruyne.

We attempted to standardize the prediction task across splits by stratifying our train/val/test splits by player skill (measured by average FPL points per week) and player variance (measured by the standard deviation of FPL points per week), but were unable to achieve completely similar splits. This is partly due to the fact that splits are performed player-wise to prevent data leakage, which rules out perfect stratified splits in small player sets, such as the EPL goalkeepers. As seen in Figure 13, sometimes one split ended up with an easier prediction task. Note that splits were kept consistent across runs when comparing hyperparameters, so that an easy validation split would not skew the hyperparameter optimization results. The variation between splits is problematic regardless, as we 'leave some performance behind' in overfitting to the validation dataset, and the noisiness of split difficulty by extension means our estimates of prediction accuracy are noisy. Cross-validation combined with a superior split strategy might largely mitigate this issue and would be a great extension to this work. Cross-validation should be possible with a larger GPU cluster given the relatively small dataset.

GPU availability was also a limitation in the transfer learning experiments. Without significant external funding, we did not have access to the GPU compute necessary to train our Longformer model sufficiently.


6.2. Future Directions

Future work should focus on improving the CNN architecture proposed here via a cross-validation grid search experiment framework and more advanced stratification strategies to normalize prediction difficulty between splits. A simple next direction for further stratification strategies would be to stratify splits based on average week-to-week point difference, though there may be other measures of variability that better mitigate difficulty differences between splits.

The optimal CNN architecture presented here has strong implications for the development of FPL AI agents and sports betting models. An AI agent could be developed on top of our current model by leveraging budget optimization techniques alongside the CNN (Gupta, 2017). Furthermore, sports betting models might be developed based on aggregated player predictions for each competing team. By utilizing intermediary player performance features in learning win-loss probabilities (i.e., by training a logistic regression for wins and losses with player performances as input features), it may be possible to predict better odds than sports betting agencies and achieve profit in expectation.

Another valuable extension to this work would be investigating the benefits of learning via direct optimization of Spearman's correlation. Recent research has developed a methodology for direct optimization of Spearman's correlation using convex hull projections (Blondel et al., 2020). Directly optimizing for Spearman's correlation might improve player ranking performance, though this would need to be confirmed empirically, especially because ranking optimization is a novel field.

Lastly, traditional natural language understanding (NLU) techniques should be employed to directly expose sentiment, entity relationships, and other valuable NLU features attached to the relevant EPL player as features for the transformer model. Directly exposing these features might enable more effective transfer learning by reducing noise in the input data.

Though our minimal transfer learning example here failed to detect a signal, experimentation with a more complete transfer learning architecture might be worthwhile. To properly model the regression task, transfer learning should be conducted by freezing the Longformer (or other transformer) weights, using them as a feature extractor, and learning one or more fully-connected layers to predict the real-valued upcoming FPL points. Transfer learning could also be improved by collecting datasets across multiple news corpora and training with more compute power.

7. Code

All code for the present work can be found at: https://github.com/danielfrees/mlpremier.

8. Acknowledgments

We would like to acknowledge the support of Stanford's CS229 teaching staff, who helped us work through the initial iteration of this project.

9. Funding

No outside funding was utilized in support of this research.


References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Schuster, M., Monga, R., Moore, S., Murray, D., Olah, C., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. TensorFlow Authors, 2015. Software available from https://www.tensorflow.org/.

Baboota, R. et al. Predictive analysis and modelling football results using machine learning approach for English Premier League. 2018. URL https://www.sciencedirect.com/science/article/pii/S0169207018300116.

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer, 2020.

Blondel, M., Teboul, O., Berthet, Q., and Djolonga, J. Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871, 2020. URL https://arxiv.org/abs/2002.08871.

Bunker, R. and Susnjak, T. The application of machine learning techniques for predicting match results in team sport: A review. 2022. URL https://jair.org/index.php/jair/article/view/13509/26786.

Fantasy Football Reports. How Many People Play FPL? https://www.fantasyfootballreports.com/how-many-people-play-fpl/. Accessed: May 7, 2024.

Godin, F. et al. Beating the bookmakers: Leveraging statistics and Twitter microposts for predicting soccer results. 2019. URL https://fredericgodin.com/wp-content/uploads/2019/03/Beating-the-bookmakers-leveraging-statistics-and-Twitter-microposts-for-predicting-soccer-results.pdf.

Grund, H. Network structure and team performance: The case of English Premier League soccer teams. 2012. URL https://www.sciencedirect.com/science/article/pii/S0378873312000500.

Guardian, T. The Guardian, 1821. URL https://www.theguardian.com/.

Gupta, S. Ensemble ARIMA and RNN method for FPL team selection. 2017. URL https://browse.arxiv.org/pdf/1909.12938.pdf.

Hasan, M. Transformers in natural language processing. 09 2022. doi: 10.13140/RG.2.2.18062.84809.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30:3146-3154, 2017.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. URL https://arxiv.org/abs/1412.6980.

Koopman, S. et al. A dynamic bivariate Poisson model for analysing and forecasting match results in the English Premier League. 2013. URL https://academic.oup.com/jrsssa/article/178/1/167/7058470.

Lindberg, A. and Sodenberg, B. Comparison of machine learning approaches applied to predicting football players' performance. Technical report, 2020. URL https://odr.chalmers.se/server/api/core/bitstreams/c7d1c22f-c8e5-4dd9-b07c-cf57733f1592/content.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56-67, 2020. doi: 10.1038/s42256-019-0138-9.

McKinney, W. pandas: a foundational Python library for data analysis and statistics. Python for Data Analysis, 2011. URL https://pandas.pydata.org/.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.


Ramdas, D. Using convolution neural networks to predict the performance of footballers in the Fantasy Premier League. Technical report, 2022. URL https://www.researchgate.net/profile/Delano-Ramdas/publication/360009648_Using_Convolution_Neural_Networks_to_Predict_the_Performance_of_Footballers_in_the_Fantasy_Premier_League/links/625c73fc4173a21a0d1a9543/Using-Convolution-Neural-Networks-to-Predict-the-Performance-of-Footballers-in-the-Fantasy-Premier-League.pdf.

Razali, N. et al. Predicting football matches results using Bayesian networks for English Premier League (EPL). 2017. URL https://iopscience.iop.org/article/10.1088/1757-899X/226/1/012099/meta.

Schumaker, R. Predicting wins and spread in the Premier League using a sentiment analysis of Twitter. 2016. URL https://www.sciencedirect.com/science/article/pii/S0167923616300835.

Shi, H. Best-first decision tree learning. Master's thesis, 2007.

vaastav. Fantasy Premier League Repository. https://github.com/vaastav/Fantasy-Premier-League, 2023.

Wibawa, A. Time-series analysis with smoothed convolutional neural network. 2022. URL https://journalofbigdata.springeropen.com/articles/10.1186/s40537-022-00599-y.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. HuggingFace's Transformers: State-of-the-art natural language processing, 2020.


A. Learned Average Filters

Here, we briefly examine the (not so pretty) learned filters of the optimal CNNs to get a qualitative sense of longitudinal patterns in player performance. Note that we need to be careful when interpreting learned filters, as their effects on the final output are obscured by the myriad of weighted connections between the flattened filters and the dense layer(s) of the CNN.

To get a sense of excitatory (more important/predictive) patterns in the time-series data, we begin by retrieving the filter weights from the second hidden layer of the CNN model. Next, we perform z-score normalization on the filter weights to ensure they have a mean of 0 and a standard deviation of 1. Finally, we calculate the mean filter by taking the average of all (64) normalized filters.
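The extraction and averaging procedure just described might look as follows for a Keras model like the one sketched in Section 4.2; the layer index and the 64-filter count are assumptions about the trained model object.

```python
import numpy as np

# Assume `model` is the optimal MID CNN; the Conv1D layer index is an assumption.
conv = model.layers[2]
weights = conv.get_weights()[0]            # shape: (kernel_size, f, n_filters)

# z-score each filter to mean 0 / std 1, then average across all filters.
flat = weights.reshape(-1, weights.shape[-1])          # (kernel_size * f, n_filters)
normalized = (flat - flat.mean(axis=0)) / flat.std(axis=0)
mean_filter = normalized.mean(axis=1).reshape(weights.shape[:2])

print(mean_filter)   # near-zero values suggest no clear temporal pattern (Figure 15)
```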
As seen in Table 6, the MID model was the only one whose optimal configuration (as selected by grid search) had a kernel of size > 1, so we examine the average standardized filter learned for the MID model (Figure 15). We see that there may be a very slightly more excitatory effect of the penultimate week in each application of the filter, though the average normalized filter weights are quite close to 0. Between the confounding effects of dense layer weights and the near-zero average normalized filter weights, it seems most likely that there is no consistent pattern in longitudinal feature importance. In other words, based on this initial qualitative filter analysis, it does not appear that older weeks or more recent weeks are consistently more predictive of upcoming performance. Of course, attempting to interpret anything from such a small filter is a mostly futile qualitative exercise to begin with.

Figure 15. Averaged Normalized 1D Convolution Filter (MID Model).

