Deep Learning and Transfer Learning Architectures For English Premier League Player Performance Forecasting
Abstract

1. Introduction
1Stanford University. Correspondence to: Daniel Frees <dfrees@stanford.edu>, Pranav Ravella <pravella@stanford.edu>, Charlie Zhang <czzq@stanford.edu>.
2. Related Works

2.1. Soccer Match Outcome Binary Classification

Researchers have applied machine learning to soccer forecasting problems extensively over the last decade. EPL match prediction was a very popular research area in the mid 2010s, with researchers investigating Poisson processes, Bayesian networks, graph modeling, and other techniques with success (Baboota, 2018; Grund, 2012; Koopman, 2013; Razali, 2017). For example, Koopman (2013) was able to successfully model EPL match results with Poisson processes and beat betting odds at sports agencies to win money. While these models are strong, they model a simpler problem than player performance forecasting.

2.2. Cutting-Edge Player Performance Forecasting

More recently, modern machine learning approaches stemming from the field of natural language processing (NLP), such as LSTMs designed for learning sequential data, have been proposed as strong candidates for modeling EPL player performance, since they have the capability to model complex patterns in time-series data (Lindberg & Sodenberg, 2020). While LSTMs are strong model candidates for time-series data, the most recent research suggests that one-dimensional (1D) CNNs may be more powerful due to their increased flexibility in modeling time patterns through filters (Wibawa, 2022). In fact, CNNs have been shown to be a feasible architecture for EPL player performance forecasting, though the original research in this area did not experiment widely to find an optimal CNN architecture and failed to report test errors (Ramdas, 2022).

2.3. Natural Language Signals

The idea that predictive signals for upcoming soccer match results could be obtained via text data is not new; in fact, several researchers have investigated the usage of Twitter text data for this task (Godin, 2019; Schumaker, 2016). Godin had great success, managing a 30% profit by betting laterally to official bookmakers using his Twitter aggregation model. However, no literature to our knowledge has investigated the use of a news corpus for soccer prediction, nor has the efficacy of text data been evaluated for the more challenging task of player performance forecasting.

The data was cleaned using pandas (McKinney, 2011) and augmented with a column for upcoming match difficulties, feature engineered from official difficulty ratings for EPL teams each season. Benched players with no minutes played were dropped. Notably, if benched players are not dropped, model performance is skewed high, since predicting these players to score 0 points becomes nearly trivial for all models investigated here to learn. After dropping benched players, data was organized by position (GK, DEF, MID, FWD). Time-series data was discretized into windows of size w. For the CNN, the entire w × f window of f features is retained to allow for pattern learning in the CNN filters. We denote this feature-window X(i). For baseline modeling, a sliding average of w weeks is used to generate a sliding-average feature vector x(i). The target data y(i) and upcoming match difficulty d(i) are collected from the week following the current window.
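To make the windowing concrete, the sketch below builds the w × f feature-windows X(i), the sliding-average baseline vectors x(i), and the next-week targets y(i) and difficulties d(i) from a single player's gameweek history. It is a minimal illustration assuming a per-player DataFrame sorted by gameweek with hypothetical column names (total_points, next_match_difficulty); it is not our actual pipeline.

```python
import numpy as np
import pandas as pd

def make_windows(player_df: pd.DataFrame, w: int, feature_cols):
    """Build CNN windows X^(i) (w x f), sliding-average baselines x^(i),
    and next-week targets y^(i) / difficulties d^(i) for one player.
    Assumes player_df is sorted by gameweek."""
    X, x_avg, y, d = [], [], [], []
    feats = player_df[feature_cols].to_numpy()
    points = player_df["total_points"].to_numpy()
    difficulty = player_df["next_match_difficulty"].to_numpy()
    for t in range(len(player_df) - w):
        X.append(feats[t : t + w])              # full w x f window for the CNN
        x_avg.append(feats[t : t + w].mean(0))  # sliding average for baselines
        y.append(points[t + w])                 # FPL points in the following week
        d.append(difficulty[t + w])             # upcoming match difficulty
    return np.array(X), np.array(x_avg), np.array(y), np.array(d)
```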
Figure 1. High-Level Data Scraping and Cleaning Pipeline.

Based on the stability of validation errors in initial experiments, we settled on split sizes of approximately 60% training, 25% validation, and 15% testing data. Splits were performed player-by-player to avoid data leakage.

Table 1. Dataset Size by Model (# of Examples)
Note: Transfer model trained on fewer data due to compute limitations.

Position   Train   Validation   Test
GK         1272    558          429
DEF        6016    2548         1479
MID        6753    2993         1839
FWD        2437    963          735
more consistency between splits (stdev score stratification performed equivalently, likely due to the heteroscedasticity of player points residuals across avg score).

Standard scaling (z = (x − µ)/σ) was applied to the CNN and Ridge Regression input data, but not to the LightGBM. LightGBM was the primary comparison model competing with our custom CNN, and tree-based models are invariant to scaling.

Table 2. Example X(i) with w = 2

goals scored   assists   total points   name
0              0         1              Aleks. Mitrović
2              0         12             Aleks. Mitrović

3.2. Data EDA

4. Methods

4.1. Baseline Models

For our first baseline model, we chose a simple regression algorithm: Ridge regression. For a more complex baseline, we evaluated gradient boosting. Gradient boosting builds a sequence of trees in which the next tree in the sequence is fit to the residuals of the current ensemble. In other words, let T(x; Θ_m) be the m-th tree in the ensemble; then its prediction target should be the negative gradient, −g_m, at iteration m, where

Θ_m = argmin_Θ Σ_{i=1..N} (−g_im − T(x_i; Θ))²
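For squared-error loss, the negative gradient −g_im is simply the residual of the current ensemble, so the update above can be sketched in a few lines. The snippet is an illustrative residual-fitting loop using scikit-learn decision trees, not the LightGBM implementation (Ke et al., 2017) used in our experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=50, max_depth=3, lr=0.1):
    """Gradient boosting for MSE: each tree T(x; Theta_m) is fit to the
    negative gradient, i.e. the residuals of the current ensemble."""
    F = np.full(len(y), y.mean(), dtype=float)   # initial constant prediction
    trees = []
    for m in range(n_trees):
        neg_grad = y - F                         # -g_m for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        F += lr * tree.predict(X)                # update the ensemble
        trees.append(tree)
    return trees, y.mean()

def predict_gbm(trees, base, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)
```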
layer, and then to the final layer (Figure 3).

Our cost function for this CNN consisted of MSE with ElasticNet (L1L2) regularization on both the convolution weight matrices (C) and the dense-layer weight matrix (W[1]), with L1 strength determined by λ1 and L2 strength determined by λ2. Thus, our cost function for batch size B was as follows:

J_CNN = (1/B) Σ_{i=1..B} (y(i) − ŷ(i))² + λ1 (||C||_1 + ||W[1]||_1) + λ2 (||C||_2² + ||W[1]||_2²)

To train our CNN, we used backpropagation to calculate gradients with respect to our cost function, and the Adam optimizer (Kingma & Ba, 2014) for weight updates. Early stopping was employed to further counteract overfitting.
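For concreteness, the sketch below expresses a 1D-CNN regressor with this MSE-plus-ElasticNet cost in PyTorch. The layer sizes and framework choice are illustrative assumptions and do not reproduce our exact architecture.

```python
import torch
import torch.nn as nn

class Player1DCNN(nn.Module):
    def __init__(self, n_features, window, n_filters=16, kernel=2, hidden=32):
        super().__init__()
        self.conv = nn.Conv1d(n_features, n_filters, kernel_size=kernel)
        self.dense = nn.Linear(n_filters * (window - kernel + 1), hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, window, n_features)
        h = torch.relu(self.conv(x.transpose(1, 2)))
        h = torch.relu(self.dense(h.flatten(1)))
        return self.out(h).squeeze(-1)

def elasticnet_mse(model, y_hat, y, lam1=1e-4, lam2=1e-4):
    """MSE plus L1/L2 penalties on the conv weights (C) and first dense weights (W[1])."""
    C, W1 = model.conv.weight, model.dense.weight
    loss = torch.mean((y - y_hat) ** 2)
    loss += lam1 * (C.abs().sum() + W1.abs().sum())
    loss += lam2 * ((C ** 2).sum() + (W1 ** 2).sum())
    return loss
```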
4.3. Transfer Learning Model Architecture

Forecasting player performance using news corpus data boils down to a text regression problem. Scoring text using transformers has proven to be one of the best methods for tasks such as these (Hasan, 2022). A transformer network uses attention blocks to model token connections in input sequences (in the form of directional soft weights between tokens) and output embeddings. These embeddings can then be input into a fully-connected layer to output logits (real numbers in the embedding space). For classification, a softmax activation is used to convert the logits into probabilities to predict the next best word (Hasan, 2022), whereas for regression we project the output to a single real-valued score. The self-attention mechanism core to transformers enables them to understand context and perform extremely well in understanding signals from longform texts such as news articles. Given the length of our news corpus data, a longformer model from AllenAI with a sequence length of 4096 was chosen for the transfer learning regression. Most transformer-based models are unable to process long sequences because the attention mechanism scales quadratically; Longformer uses a sliding-window attention mechanism to do this in linear time (Beltagy et al., 2020).

Since we desire a real-number output, we want a single output logit from the regression. Our selected HuggingFace model (Wolf et al., 2020) applies sigmoid activation to output logits, so we appended a simple scaling layer to project the sigmoid output to a reasonable range of possible FPL scores (between −5 and 24). This architecture is not ideal because it attempts to train a scaled classification problem as regression. However, compute constraints meant that we could not freeze layers up to the logits and retrain a fully-connected layer. Our goal here was merely to identify whether there is any learnable signal from these longform texts. Should there prove to be a predictive signal, it would make sense to purchase a GPU cluster to learn weights more deeply back into the longformer network and achieve better performance.

Each input example X_N was tokenized into 4096 tokens for input into our longformer model. Each model (GK, DEF, MID, FWD) took around four to seven hours to train for 25 epochs using an NVIDIA A100 with PyTorch (Paszke et al., 2019), so tuning was limited and data was restricted to the 2021–22 season.
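A minimal sketch of this setup is shown below, assuming the allenai/longformer-base-4096 checkpoint with a single-logit sequence-classification head; the sigmoid-plus-rescaling step mirrors the scaling layer described above, but the snippet is illustrative rather than our training code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
backbone = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=1  # single output logit
)
backbone.eval()

LOW, HIGH = -5.0, 24.0  # plausible range of FPL scores

def predict_points(article_text: str) -> float:
    """Tokenize up to 4096 tokens, get one logit, squash with sigmoid,
    then rescale to the FPL point range [-5, 24]."""
    inputs = tokenizer(article_text, truncation=True, max_length=4096,
                       return_tensors="pt")
    with torch.no_grad():
        logit = backbone(**inputs).logits.squeeze()
    return (LOW + (HIGH - LOW) * torch.sigmoid(logit)).item()
```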
5. Experiments and Results

We used mean squared error (MSE) as our cost function to evaluate our baseline, CNN, and transfer learning models.

Table 4. Holdout (Test) MSE for Optimal Architectures

                     Model
        Ridge   LGBM   CNN    Transfer*
GK      6.46    6.22   5.08   8.22
DEF     7.20    7.24   5.87   9.66
MID     6.08    6.11   6.16   8.40
FWD     7.19    7.28   6.22   10.12
AVG     6.73    6.71   5.83   9.10
*Trained only on 2021-2022 due to GPU limitations.

5.1. Baseline ML Experiments

The experiments for Ridge regression and gradient boosting mostly consisted of hyperparameter tuning. We used grid search with stratified 5-fold CV (Pedregosa et al., 2011). This is particularly critical for gradient boosting, as it has numerous hyperparameters to inhibit excessive model complexity. The hyperparameter search space can grow exponentially, so we iteratively performed several rounds of grid search to rule out unreasonable options.

In Figure 5, we observed that even with the adjusted search space, multiple combinations were still vulnerable to overfitting of varying severity. How does the best set of hyperparameters achieve the optimal bias-variance tradeoff? On the ensemble level, only 50 trees are fit. Each tree had a depth of only 3, an L2 regularization strength of 10, and 7 leaf nodes, each holding a minimum of 70 observations. As our dataset was smaller than a typical application scenario that calls for LightGBM, CV ended up favoring hyperparameters that more aggressively limit tree complexity.

There are multiple ways we can interpret baseline model outputs and feature importance. For Ridge Regression, we can compare the coefficients, shown as a heatmap in
5.2.2. Hyperparameters

Experiments were performed using learning curves to determine optimal training hyperparameters of 250 epochs, a learning rate of 0.001, a batch size of 32, and an early stopping tolerance and patience of 1 × 10⁻⁴ and 20 iterations, respectively. A custom implementation of grid search was developed to determine the optimal window size, kernel size, number of filters, number of dense neurons, activation functions, number of numerical features to include, and stratification strategy for the GK, DEF, MID, and FWD models. Subsequent experiments reduced the hyperparameter search space by fixing certain optimal hyperparameters based on the results of previous experiment iterations.
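A stripped-down version of such a grid search is sketched below; the parameter grid is illustrative and train_and_validate is a hypothetical stand-in for the routine that windows the data, trains a CNN, and returns validation MSE. The per-position optima selected by our search are listed in the table that follows.

```python
from itertools import product

param_grid = {
    "window": [3, 6, 9],
    "kernel": [1, 2, 3],
    "n_filters": [8, 16, 32],
    "features": ["PTSONLY", "ALL"],
}

def grid_search(train_and_validate):
    """Evaluate every hyperparameter combination and keep the
    configuration with the lowest validation MSE."""
    best_cfg, best_mse = None, float("inf")
    for values in product(*param_grid.values()):
        cfg = dict(zip(param_grid.keys(), values))
        mse = train_and_validate(**cfg)       # hypothetical training routine
        if mse < best_mse:
            best_cfg, best_mse = cfg, mse
    return best_cfg, best_mse
```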
Position   Window (# weeks)   Kernel size   Features
GK         6                  1             PTSONLY
DEF        9                  1             PTSONLY
MID        3                  2             PTSONLY
FWD        9                  1             PTSONLY

5.2.5. Feature Importances

As seen in Table 6, only the previous w weeks of FPL points and the upcoming match difficulty d were critical for the CNN to best predict a player's points in the upcoming week.
5.2.8. Overfitting

Overfitting plagued early versions of the CNN, whereby validation MSE would inflect upwards as training continued. Implementing ElasticNet regularization and a slower learning rate with low patience for early stopping largely mitigated this. Some overfitting lingers in the final architecture, likely the result of employing grid search with regular train/val/test shuffle splits rather than cross-validation due to compute limitations.
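The early-stopping rule can be sketched as a small helper, where min_delta and patience correspond to the 1 × 10⁻⁴ tolerance and 20-iteration patience reported in Section 5.2.2; this is an illustration, not our training loop.

```python
class EarlyStopper:
    """Stop training once validation MSE has not improved by at least
    min_delta for `patience` consecutive epochs."""
    def __init__(self, min_delta=1e-4, patience=20):
        self.min_delta, self.patience = min_delta, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_mse: float) -> bool:
        if val_mse < self.best - self.min_delta:
            self.best = val_mse
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```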
Figure 14. Learning curve and metrics for MID and FWD.
• ŷ(i) is the predicted player points for player i.
• R(y(i)) is the rank of the true player points for player i.
• R(ŷ(i)) is the rank of the predicted player points for player i.
• R̄(y) is the mean rank of the true player points.
• R̄(ŷ) is the mean rank of the predicted player points, averaged across all players.
• S_yŷ is the covariance of the true and predicted ranks.
• S_y and S_ŷ are the standard deviations of the true ranks and predicted ranks, respectively.

Note that it is important we calculate the generalized version of Spearman's ρ, which handles tied ranks, since the true player points each gameweek are always integers. This leads to many ties.

A high Spearman correlation indicates that the predicted model points are highly monotonically related to the true player points across predicted player gameweeks. By extension, an AI agent or player using our model to select players in FPL should achieve better performance the higher the Spearman correlation. Looking at our Spearman correlation results in Table 8, we see that our optimized Ridge and LightGBM models achieve moderately high Spearman correlations and the CNN models achieve strong Spearman correlations with the true player rankings.

Overall, our results suggest that the CNN model is also the highest performer in the player ranking task, and that the CNN model is highly promising as a foundation for developing an automatic FPL player selection tool or for providing player selection insights in a human-in-the-loop (HITL) pipeline.

Table 8. Ranking Performance: Spearman's Rho for Optimal Configurations

            GK     DEF    MID    FWD
Ridge       0.50   0.40   0.49   0.47
LightGBM    0.53   0.40   0.49   0.48
CNN         0.70   0.57   0.58   0.62
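In practice this tie-aware ρ is what scipy computes, since tied values receive their average rank. The snippet below is a small illustration with made-up point values, not our evaluation code.

```python
import numpy as np
from scipy.stats import spearmanr

# Integer FPL points produce many ties; spearmanr assigns tied values
# their average rank, giving the generalized rho = S_yyhat / (S_y * S_yhat).
y_true = np.array([2, 2, 6, 1, 12, 2, 5])                # true gameweek points (illustrative)
y_pred = np.array([2.4, 1.9, 5.1, 0.7, 8.3, 3.0, 4.2])   # model predictions

rho, _ = spearmanr(y_true, y_pred)
print(f"Spearman rho: {rho:.2f}")
```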
6. Conclusion

Here we developed, to the best of our knowledge, the highest-performing model for EPL player performance forecasting in the literature. Averaged across positions, our CNN architecture outperformed our best LightGBM models by 13%, and the previous best model in the literature (an LSTM) by 30% (Lindberg & Sodenberg, 2020)¹. Furthermore, we investigated the feasibility of forecasting performance using signals from text data collected via a news corpus, combined with transfer learning of a transformer-based model. Initial experiments did not demonstrate increased efficacy relative to our CNN forecasting model, though data and compute limitations contributed to the lower performance. Lastly, we evaluated Spearman correlation as a proxy to expected player ranking performance for our optimal Ridge, LightGBM, and CNN models, finding that the CNN models achieve highly promising ranking performance.

¹Same data source, but model evaluated on different seasons.

6.1. Limitations

The foremost limitation of the present work is that we were unable to completely standardize the difficulty of the performance prediction task between train/val/test splits. Because soccer is a low-scoring sport, player FPL points can vary widely. For example, a talented forward player such as Son Heung-min might play very well but score no goals and receive a yellow card, netting 1 point. The next week, Son might score a hat trick and bonus points, netting 17 points. Predicting these high-score, low-probability events is significantly more challenging than predicting a moderately positive performance (1 assist, 5-6 points) by a consistent playmaking midfielder such as Kevin De Bruyne.

We attempted to standardize the prediction task across splits by stratifying our train/val/test splits by player skill (measured by average FPL points per week) and player variance (measured by standard deviation of FPL points per week), but were unable to achieve completely similar splits. This is partly due to the fact that splits are performed player-wise to prevent data leakage, which prevents the possibility of perfect stratified splits in small player-sets, such as the EPL goalkeepers. As seen in Figure 13, sometimes one split ended up with an easier prediction task. Note that splits were kept consistent across runs in comparing various hyperparameters, so that an easy validation split would not skew the hyperparameter optimization results. The variation between splits is problematic regardless, as we 'leave some performance behind' in overfitting to the validation dataset, and the noisiness of split difficulty by extension means our estimates of prediction accuracy are noisy. Cross-validation combined with a superior split strategy might largely mitigate this issue, and would be a great extension to this work. Cross-validation should be possible with a larger GPU cluster given the relatively small dataset.

GPU availability was also a limitation in the transfer learning experiments. Without significant external funding, we did not have access to the significant GPU compute necessary to train our longformer model sufficiently.
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Schuster, M., Monga, R., Moore, S., Murray, D., Olah, C., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. TensorFlow Authors, 2015. Software available from https://www.tensorflow.org/.

Baboota, R. et al. Predictive analysis and modelling football results using machine learning approach for English Premier League. 2018. URL https://www.sciencedirect.com/science/article/pii/S0169207018300116.

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer, 2020.

Blondel, M., Teboul, O., Berthet, Q., and Djolonga, J. Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871, 2020. doi: 10.48550/arXiv.2002.08871. URL https://arxiv.org/abs/2002.08871.

Bunker, R. and Susnjak, T. The application of machine learning techniques for predicting match results in team sport: A review. 2022. URL https://jair.org/index.php/jair/article/view/13509/26786.

Fantasy Football Reports. How Many People Play FPL? https://www.fantasyfootballreports.com/how-many-people-play-fpl/. Accessed: May 7, 2024.

Godin, F. et al. Beating the bookmakers: Leveraging statistics and Twitter microposts for predicting soccer results. 2019. URL https://fredericgodin.com/wp-content/uploads/2019/03/Beating-the-bookmakers-leveraging-statistics-and-Twitter-microposts-for-predicting-soccer-results.pdf.

Grund, H. Network structure and team performance: The case of English Premier League soccer teams. 2012. URL https://www.sciencedirect.com/science/article/pii/S0378873312000500.

Guardian, T. The Guardian, 1821. URL https://www.theguardian.com/.

Gupta, S. Ensemble ARIMA and RNN method for FPL team selection. 2017. URL https://browse.arxiv.org/pdf/1909.12938.pdf.

Hasan, M. Transformers in natural language processing. September 2022. doi: 10.13140/RG.2.2.18062.84809.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30:3146–3154, 2017.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. URL https://arxiv.org/abs/1412.6980.

Koopman, S. et al. A dynamic bivariate Poisson model for analysing and forecasting match results in the English Premier League. 2013. URL https://academic.oup.com/jrsssa/article/178/1/167/7058470.

Lindberg, A. and Sodenberg, B. Comparison of machine learning approaches applied to predicting football players performance. Technical report, 2020. URL https://odr.chalmers.se/server/api/core/bitstreams/c7d1c22f-c8e5-4dd9-b07c-cf57733f1592/content.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, 2020. doi: 10.1038/s42256-019-0138-9.

McKinney, W. Pandas: a foundational Python library for data analysis and statistics. Python for Data Analysis, 2011. URL https://pandas.pydata.org/.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.