Chapter 4 Modeling

4.1 Variable and Model Selection

After creating the pace-of-play metrics, we evaluated their effectiveness as predictors in models of game outcomes. We implemented models with and without the pace metrics to determine whether including them yields higher accuracy. We only considered pre-game, historical variables that can be gathered before each game is played and did not include traditional post-game performance-based features, such as the number of shots or corners taken. Although the pace metric we use is measured in the same game whose outcome it helps predict, we anticipate that it can be made into a pre-game variable by substituting historical measurements of pace.

Table 4.1: Description of modeling variables.

| Role | Variable | Description | Values |
|---|---|---|---|
| Response | FTR | Outcome of a game with respect to home team | -1 (Loss), 0 (Draw), 1 (Win) |
| Predictor | Unbeaten | Current home unbeaten streak (win/draw) | 0 to 19 |
| Predictor | Derby | Indicates if game is a derby | 0 (No), 1 (Yes) |
| Predictor | League | League that game takes place in | EPL, Ligue 1, Bundesliga, Serie A, La Liga |
| Predictor | Team | Name of the home team | Liverpool, Barcelona, etc. |

For each of the 1,826 games, the response variable, FTR, describes the outcome with respect to the home team. An FTR of 1 indicates that the home team won, 0 indicates a draw, and -1 indicates that the away team won. Unbeaten is the home team’s current home unbeaten streak leading up to a game. The streak resets to 0 when a team loses at home. For the first home game of the season, a team’s unbeaten streak from the 2016-17 season is used. For example, Manchester City went unbeaten in their last 12 home games in the 2016-17 season, so their Unbeaten for their first home game is 12. For the 14 newly promoted teams, Unbeaten for the first home game is 0, because a home unbeaten streak earned in a lower division is not comparable to one earned in the first division. Derby is an indicator variable, where 1 indicates that a game is a derby. A game is marked as a derby if the two teams are located in the same city (Manchester City vs. Manchester United) or if there is a historical rivalry (El Clásico). League specifies which of the five leagues the game takes place in, and Team provides the name of the home team.
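
The streak bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in this study: the function name `home_unbeaten_streaks` and its arguments are hypothetical, and the results are assumed to be in chronological order and to use the FTR coding from Table 4.1.

```python
def home_unbeaten_streaks(results, carry_over=0):
    """Return the Unbeaten value going into each home game.

    results: home outcomes in chronological order, coded as FTR
             (1 = win, 0 = draw, -1 = loss).
    carry_over: streak at the end of the previous season
                (0 for a newly promoted team).
    """
    streaks = []
    current = carry_over
    for ftr in results:
        streaks.append(current)          # value known before kickoff
        current = 0 if ftr == -1 else current + 1
    return streaks


# Hypothetical team that carried over a 12-game run, then won, drew,
# lost and won again at home.
example = home_unbeaten_streaks([1, 0, -1, 1], carry_over=12)
```

Here `example` holds the streak available before each of the four games, so the home loss in the third game only shows up as a reset in the fourth entry.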

Table 4.2: Description of pace variables.

| Variable | Description | Values (m/s) |
|---|---|---|
| \(\Delta_{ij}^{AZ}\) | Sum of the differences in total velocity for all zones (1-8) for home team \(i\) and away team \(j\) | -88.93 to 67.72 |
| \(\Delta_{ij}^{OZ}\) | Sum of the differences in total velocity for all zones in offensive half (5-8) for home team \(i\) and away team \(j\) | -80.68 to 45.55 |
| \(\Delta_{ij}^{FZ}\) | Sum of the differences in total velocity for zones 5, 7, 8 for home team \(i\) and away team \(j\) | -79.1 to 39.74 |

For each game, we conducted a zonal analysis of \(V_{T}\) for the home and away teams. To aggregate velocity within each zone, we took the median, rather than the mean, of the median velocities of the 5x5 polygrids. Lower tier teams have fewer recorded events per game and are more susceptible to outliers in both the polygrid and zonal analyses, so using the median of the medians makes the zonal velocities more resistant to outliers. We then calculated the difference (home - away) in \(V_{T}\) for each of the 8 zones. Let \(i \in \{1, 2, \dots, 98\}\) represent the home team and let \(j \in \{1, 2, \dots, 38\}\) (\(\{1, 2, \dots, 34\}\) for teams in the Bundesliga) index the \(j^{th}\) game team \(i\) plays during the season. Then \(\Delta_{ij}^{AZ}\) is the sum of the differences for all 8 zones, \(\Delta_{ij}^{OZ}\) is the sum of the differences for the four zones in the offensive half (5-8), and \(\Delta_{ij}^{FZ}\) is the sum of the differences in the flank zones (5, 7, 8).
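
The aggregation above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names (`zone_velocity`, `pace_delta`), the dictionary layout keyed by zone number, and all numeric values are invented for the example and are not taken from the data.

```python
from statistics import mean, median

ALL_ZONES = (1, 2, 3, 4, 5, 6, 7, 8)
OFFENSIVE_ZONES = (5, 6, 7, 8)
FLANK_ZONES = (5, 7, 8)


def zone_velocity(cell_medians):
    """Median of the per-cell median velocities within one zone."""
    return median(cell_medians)


def pace_delta(home_velocity, away_velocity, zones):
    """Sum over `zones` of the (home - away) zonal velocity differences."""
    return sum(home_velocity[z] - away_velocity[z] for z in zones)


# Invented per-cell medians for one zone; 30.0 mimics an outlier cell.
cells = [4.0, 4.1, 4.2, 4.3, 30.0]
robust = zone_velocity(cells)  # median of medians, unaffected by the outlier
naive = mean(cells)            # mean of medians, pulled upward by the outlier

# Invented zonal velocities (m/s) for one home and one away team.
home = {z: 5.0 for z in ALL_ZONES}
away = {z: 4.0 for z in ALL_ZONES}
away[6] = 7.0  # the away team is faster in zone 6 only

delta_az = pace_delta(home, away, ALL_ZONES)
delta_oz = pace_delta(home, away, OFFENSIVE_ZONES)
delta_fz = pace_delta(home, away, FLANK_ZONES)
```

Because zone 6 is excluded from the flank zones, the away team's advantage there lowers `delta_az` and `delta_oz` but leaves `delta_fz` untouched.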

To evaluate the models, we first split the data into training and test sets. The test data, which is 21.5% of the full data, includes 2 home and 2 away games for each of the 98 teams, for a total of 392 games. We perform 4-fold cross validation on the training data and then assess model performance by predicting on the test data. We propose two types of models: the first is a hierarchical logistic regression model that distinguishes wins from non-wins (draws and losses), while the second is a multinomial logistic regression model that predicts all three potential match outcomes. Draws and losses together form the baseline category of the response variable in the hierarchical logistic models, and losses are the baseline in the multinomial logistic models. We preferred these models over other classification algorithms because we are concerned with both predictive power and interpretability.
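
As a rough sketch of the 4-fold procedure, the snippet below partitions the training indices into four folds and pairs each held-out fold with the remaining three. The text does not state how games were assigned to folds, so the random shuffle, the seed, and the function name `k_fold_indices` are assumptions made for illustration only.

```python
import random


def k_fold_indices(n_samples, k=4, seed=0):
    """Return k (training, validation) index pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # assumed: random assignment
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds)
                    if f != held_out for idx in fold]
        splits.append((training, validation))
    return splits


n_train = 1826 - 392  # games remaining after the test set is removed
splits = k_fold_indices(n_train, k=4)
fold_sizes = sorted(len(validation) for _, validation in splits)
```

Each game in the training data is used for validation exactly once, and the reported "mean" training metrics correspond to averaging a metric over the four validation folds.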

For both sets of models, we first constructed a baseline model that only uses the predictors listed in Table 4.1. We then added one of the pace variables from Table 4.2 to determine whether it improves the model’s accuracy. We add only one pace variable at a time because the three are highly correlated with one another. Interaction effects between the baseline predictors and quadratic terms for Unbeaten and the pace variables were also considered, but none of these modifications significantly improved the predictive power of any model. We used accuracy and AUC as evaluation metrics for the hierarchical logistic regression, and accuracy and True Positive Rate (TPR) for the multinomial logistic regression. Model assumptions and diagnostics, such as binned residual plots, are discussed in the Appendix.
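
For readers less familiar with the two binary metrics, a minimal Python sketch follows. It uses the pairwise (rank-based) definition of AUC and a 0.5 classification threshold, both of which are assumptions for illustration; the labels and scores are toy values, not model output.

```python
def accuracy(y_true, y_pred):
    """Share of games whose predicted class matches the observed class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random win is scored above a random non-win.

    Ties between a win and a non-win count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    favourable = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return favourable / (len(pos) * len(neg))


# Toy data: 1 = home win, 0 = draw or loss; scores are predicted win probabilities.
y_true = [1, 0, 1, 0, 0]
scores = [0.8, 0.3, 0.4, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

toy_accuracy = accuracy(y_true, y_pred)
toy_auc = auc(y_true, scores)
```

Accuracy depends on the chosen threshold, whereas AUC only depends on how the games are ranked by predicted probability, which is why the two metrics can move differently.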

4.2 Hierarchical Logistic Model

The baseline hierarchical logistic model (without any pace variables) is as follows:

\[ \begin{aligned} Y_{ij} &\sim \text{Bernoulli}(\pi_{ij}) \\ \log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) &= \beta_{0} + \beta_{1} \cdot \text{Unbeaten}_{ij} + \beta_{2} \cdot I(\text{Derby}_{ij}=\text{Yes}) + \alpha_{i} \\ \alpha_i &\sim N(0,\tau^2) \end{aligned} \tag{1.1} \]

The modified hierarchical logistic model (with the pace variable) is as follows:

\[ \begin{aligned} Y_{ij} &\sim \text{Bernoulli}(\pi_{ij}) \\ \log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) &= \beta_{0} + \beta_{1} \cdot \text{Unbeaten}_{ij} + \beta_{2} \cdot I(\text{Derby}_{ij}=\text{Yes}) + \beta_3 \cdot \Delta_{ij}^{FZ} + \alpha_{i} \\ \alpha_i &\sim N(0,\tau^2) \end{aligned} \tag{1.2} \]

Recall that the baseline of the response variable is draws and losses. \(Y_{ij}\) is the outcome (win vs. draw/loss) of the game and \(\pi_{ij}\) is the probability that home team \(i\) wins the game. \(\alpha_i\) represents the random intercept term for team \(i\). We do not include a random intercept for League, as most of the variability between leagues is already explained by the variability between teams. The only difference between models 1.1 and 1.2 is the addition of the pace variable \(\Delta_{ij}^{FZ}\) in 1.2.
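
To make the link function concrete, the sketch below inverts the logit in model 1.2 to obtain a win probability. The default coefficients are the rounded pace-model log odds ratios reported in Table 4.4; the function name `win_probability`, the example inputs, and the choice \(\alpha_i = 0\) (a hypothetical team with an average intercept) are assumptions for illustration.

```python
import math


def win_probability(unbeaten, derby, delta_fz, alpha_i=0.0,
                    beta0=-0.29, beta1=0.04, beta2=-0.80, beta3=-0.05):
    """Inverse logit of the linear predictor in model 1.2.

    Defaults are rounded values from Table 4.4 (pace model); alpha_i is
    the team-specific random intercept, set to 0 for a hypothetical team.
    """
    eta = (beta0 + beta1 * unbeaten + beta2 * derby
           + beta3 * delta_fz + alpha_i)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical game: 5-game home unbeaten run, not a derby, equal flank pace.
p_regular = win_probability(unbeaten=5, derby=0, delta_fz=0.0)
# Same game, but played as a derby.
p_derby = win_probability(unbeaten=5, derby=1, delta_fz=0.0)
```

Because the coefficients are rounded to two decimals, these probabilities are only indicative; a team with a positive \(\alpha_i\) would have a higher win probability for the same covariates.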

Table 4.3: Hierarchical logistic model results with 4-fold cross validation.

| Model | Train Data: Mean Accuracy | Train Data: Mean AUC | Test Data: Accuracy | Test Data: AUC |
|---|---|---|---|---|
| Baseline | 58.29% | 56.53 | 63.52% | 60.57 |
| \(\Delta_{ij}^{FZ}\) | 59.91% | 58.52 | 64.29% | 61.68 |

The baseline hierarchical logistic model reports an accuracy of 63.52% and AUC of 60.57 on the test data while the best performing pace model, the one with \(\Delta_{ij}^{FZ}\), reports a slightly higher accuracy of 64.29% and AUC of 61.68. This suggests that the addition of a pace variable does not significantly improve the predictive power of the model on the test set. The results for the other two pace models can be found in Appendix Table 6.2.
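
To put the test-set difference in concrete terms, the reported accuracies can be converted back into counts of correctly predicted games. This is back-of-the-envelope arithmetic on the values in Table 4.3 and the test-set size of 392 games given in Section 4.1; the helper name `correct_games` is hypothetical.

```python
N_TEST_GAMES = 392  # test-set size stated in Section 4.1


def correct_games(accuracy_pct, n_games=N_TEST_GAMES):
    """Approximate number of correct predictions implied by an accuracy."""
    return round(accuracy_pct / 100 * n_games)


baseline_correct = correct_games(63.52)  # baseline model, Table 4.3
pace_correct = correct_games(64.29)      # pace model, Table 4.3
extra_correct = pace_correct - baseline_correct
```

Under these figures, the pace model classifies 3 more of the 392 test games correctly than the baseline, which is consistent with describing the improvement as slight.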

We expect the pace model with \(\Delta_{ij}^{AZ}\) to have the lowest performance of the three pace models, because this variable weights pace evenly across the entire pitch. Even though pace varies in the defensive half of the pitch, these differences are not necessarily indicative of a team’s scoring capabilities. Variation in pace in the offensive half is more indicative of a team’s attacking strength, which is more directly related to the outcome of a match.

Table 4.4: Coefficients obtained from models 1.1 and 1.2.

| Predictor | Baseline Model: Log Odds Ratio | Baseline Model: Odds Ratio (CI) | Baseline Model: p-value | Pace Model: Log Odds Ratio | Pace Model: Odds Ratio (CI) | Pace Model: p-value |
|---|---|---|---|---|---|---|
| (Intercept) | -0.24 | 0.78 (0.67, 0.92) | < 0.01 | -0.29 | 0.75 (0.64, 0.88) | < 0.001 |
| Unbeaten | 0.03 | 1.03 (1, 1.07) | 0.06 | 0.04 | 1.04 (1.01, 1.08) | 0.02 |
| Derby | -0.9 | 0.41 (0.24, 0.69) | < 0.01 | -0.80 | 0.45 (0.27, 0.76) | < 0.01 |
| \(\Delta_{ij}^{FZ}\) | | | | -0.05 | 0.95 (0.94, 0.97) | < 0.001 |

Table 4.4 displays the log odds ratios and odds ratios for all the variables used in the baseline and pace models. All the coefficients, except for Unbeaten in the baseline model (p = 0.06), are statistically significant. We note that the log odds ratio for \(\Delta_{ij}^{FZ}\) is negative and statistically significant. This indicates that for each one meter per second increase in \(\Delta_{ij}^{FZ}\), that is, in the home team’s \(V_T\) in the flank zones relative to the away team’s, the odds of the home team winning the match are expected to be multiplied by 0.95, holding all else constant. This reflects the results from Figures 3.5 and 6.3, which showed that lower ranked teams, and thus teams that are expected to have a lower chance of winning a match, generally have a higher \(V_T\).
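
The step from a log odds ratio to an odds multiplier is a single exponentiation, sketched below. The coefficient -0.05 is the rounded value from Table 4.4; the 10 m/s change is a hypothetical value chosen only to show how the effect compounds, and `odds_multiplier` is an illustrative name.

```python
import math


def odds_multiplier(log_odds_ratio, change=1.0):
    """Factor by which the odds are multiplied when a predictor rises by `change`."""
    return math.exp(log_odds_ratio * change)


per_unit = odds_multiplier(-0.05)              # one m/s increase in the pace variable
per_ten = odds_multiplier(-0.05, change=10.0)  # hypothetical 10 m/s increase
```

A one-unit increase multiplies the odds by roughly 0.95, while a 10 m/s increase multiplies them by roughly 0.61, since the per-unit factors multiply rather than add.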

4.3 Multinomial Logistic Model

The baseline multinomial logistic model (without any pace variables) is as follows:

\[ \begin{aligned} P(Y_{ij} = k) &= \pi_{ijk} \\ \log\left(\frac{\pi_{ijk}}{\pi_{ij1}}\right) &= \beta_{0k} + \beta_{1k} \cdot \text{Unbeaten}_{ij} + \beta_{2k} \cdot I(\text{Derby}_{ij}=\text{Yes}) + \sum_{t = 2}^{98}\beta_{3kt} \cdot I(\text{Team}_{i}=t) \\ k &\in \{2, 3\} \end{aligned} \tag{2.1} \]

The modified multinomial logistic model (with the pace variable) is as follows:

\[ \begin{aligned} P(Y_{ij} = k) &= \pi_{ijk} \\ \log\left(\frac{\pi_{ijk}}{\pi_{ij1}}\right) &= \beta_{0k} + \beta_{1k} \cdot \text{Unbeaten}_{ij} + \beta_{2k} \cdot I(\text{Derby}_{ij}=\text{Yes}) + \sum_{t = 2}^{98}\beta_{3kt} \cdot I(\text{Team}_{i}=t) + \beta_{4k} \cdot \Delta_{ij}^{FZ} \\ k &\in \{2, 3\} \end{aligned} \tag{2.2} \]

Recall that the baseline of the response variable, \(k=1\), is a loss. \(\pi_{ijk}\) is the probability that the game ends in a draw when \(k=2\) and a win when \(k=3\). For the term Team, \(t=1\) is the baseline, which is Manchester City. The only difference between models 2.1 and 2.2 is the addition of the pace variable \(\Delta_{ij}^{FZ}\) in 2.2.
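
Since the multinomial model is written as two log odds against the loss baseline, the three outcome probabilities follow by exponentiating and normalizing. The sketch below shows this step only; the function name `outcome_probabilities` and the input values are hypothetical and do not come from the fitted models.

```python
import math


def outcome_probabilities(eta_draw, eta_win):
    """Convert baseline-category log odds into loss/draw/win probabilities.

    eta_draw = log(P(draw) / P(loss)), eta_win = log(P(win) / P(loss)).
    """
    exp_draw = math.exp(eta_draw)
    exp_win = math.exp(eta_win)
    denom = 1.0 + exp_draw + exp_win  # the baseline (loss) contributes exp(0) = 1
    return {
        "loss": 1.0 / denom,
        "draw": exp_draw / denom,
        "win": exp_win / denom,
    }


# Hypothetical linear predictors for one game.
probs = outcome_probabilities(eta_draw=-0.2, eta_win=0.5)
predicted = max(probs, key=probs.get)  # class with the highest probability
```

Predicting the class with the highest probability is one common decision rule; it is used here as an assumption, not as a statement of how predictions were made in this study.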

Table 4.5: Multinomial logistic model results with 4-fold cross validation.

| Model | Train Data: Mean Accuracy | Test Data: Accuracy |
|---|---|---|
| Baseline | 45.27% | 45.15% |
| \(\Delta_{ij}^{FZ}\) | 45.23% | 46.68% |

The baseline multinomial model reports an accuracy of 45.15% on the test data, while the best performing of the three pace models, the one with \(\Delta_{ij}^{FZ}\), reports a slightly higher accuracy of 46.68%. However, this pace model performs marginally worse than the baseline in cross validation on the training data (45.23% versus 45.27%). Once again, we see that the addition of a pace variable does not significantly improve the predictive power of the model. The results for the other two pace models can be found in Appendix Table 6.5.

Table 4.6: True Positive Rates from the baseline and modified multinomial models for each match outcome.

| Model | Train Data: Mean TPR (win) | Train Data: Mean TPR (draw) | Train Data: Mean TPR (loss) | Test Data: TPR (win) | Test Data: TPR (draw) | Test Data: TPR (loss) |
|---|---|---|---|---|---|---|
| Baseline | 65.1% | 16.55% | 38.08% | 69.64% | 11.65% | 39.67% |
| \(\Delta_{ij}^{FZ}\) | 64.61% | 16.85% | 38.38% | 67.86% | 13.59% | 45.45% |

Table 4.6 indicates that the models predict wins reasonably well but struggle to predict draws and losses. On paper, one team is typically stronger than the other and thus more likely to win. Predicting a draw or loss requires one team to perform either better or worse than it normally does, which is unexpected and therefore harder to predict. Draws are considerably harder to predict than losses because a draw additionally requires that both teams score the same number of goals.
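
The per-outcome True Positive Rate reported in Table 4.6 is the share of games with a given observed outcome that the model also predicted as that outcome. A minimal sketch on invented labels follows, using the FTR coding from Table 4.1; the function name `true_positive_rates` is hypothetical.

```python
def true_positive_rates(y_true, y_pred, classes=(-1, 0, 1)):
    """TPR (recall) for each outcome class: correct predictions / actual cases."""
    rates = {}
    for c in classes:
        predictions_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        if predictions_for_c:
            hits = sum(p == c for p in predictions_for_c)
            rates[c] = hits / len(predictions_for_c)
        else:
            rates[c] = float("nan")  # class never observed
    return rates


# Invented outcomes and predictions (1 = win, 0 = draw, -1 = loss).
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, -1, 1, 0, -1, 1, -1]
rates = true_positive_rates(y_true, y_pred)
```

Unlike overall accuracy, these rates are computed separately within each observed outcome, so a model that rarely predicts draws shows a low draw TPR even if its accuracy looks acceptable.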

Table 4.7: Coefficients obtained from models 2.1 and 2.2.

| Predictor | Response | Baseline Model: Log Odds | Baseline Model: Odds Ratio (CI) | Baseline Model: p-value | Pace Model: Log Odds | Pace Model: Odds Ratio (CI) | Pace Model: p-value |
|---|---|---|---|---|---|---|---|
| (Intercept) | FTR = draw | 2.09 | 8.07 (0.66, 98.31) | 0.1 | 1.69 | 5.41 (0.44, 66) | 0.19 |
| Unbeaten | FTR = draw | -0.16 | 0.85 (0.8, 0.91) | < 0.001 | -0.15 | 0.86 (0.8, 0.91) | < 0.001 |
| Derby | FTR = draw | 0.16 | 1.18 (0.6, 2.31) | 0.64 | 0.22 | 1.25 (0.63, 2.45) | 0.52 |
| \(\Delta_{ij}^{FZ}\) | FTR = draw | | | | -0.05 | 0.95 (0.93, 0.97) | < 0.001 |
| (Intercept) | FTR = win | 4.16 | 64.2 (7.46, 552.51) | < 0.001 | 3.68 | 39.47 (4.58, 339.97) | < 0.01 |
| Unbeaten | FTR = win | -0.14 | 0.86 (0.82, 0.92) | < 0.001 | -0.14 | 0.87 (0.82, 0.92) | < 0.001 |
| Derby | FTR = win | -1.15 | 0.32 (0.16, 0.63) | < 0.01 | -1.05 | 0.35 (0.18, 0.69) | < 0.01 |
| \(\Delta_{ij}^{FZ}\) | FTR = win | | | | -0.06 | 0.94 (0.92, 0.96) | < 0.001 |

Table 4.7 displays the log odds and odds ratios for all the variables used in the baseline and pace models. The log odds and odds ratios for the Team variable for both models can be found in Appendix Tables 6.6 and 6.7. Once again, we note that both log odds for \(\Delta_{ij}^{FZ}\) are negative and statistically significant. This indicates that for each one meter per second increase in \(\Delta_{ij}^{FZ}\), that is, in the home team’s \(V_T\) in the flank zones relative to the away team’s, the odds of the home team drawing a match versus losing are expected to be multiplied by 0.95 and the odds of the home team winning a match versus losing are expected to be multiplied by 0.94, holding all else constant. Both draws and wins are better outcomes than a loss, so this matches the results from Figures 3.5 and 6.3 and from Table 4.4. Teams with higher \(V_T\) are generally weaker and thus more likely to lose a match than to draw or win it.