Chapter 5 Discussion
Our findings show that although pace varies considerably, it is generally highest in the offensive third of the pitch, is relatively consistent across leagues, and increases as team quality decreases, though pace is much more variable among the bottom-tier teams. The relationship with team quality is most noticeable in a team’s goal kicks. Top-tier teams may feel more confident playing out from the back and may be less likely to take longer goal kicks. Bottom-tier teams, on the other hand, may struggle to maintain possession for long, so their goalkeepers may feel pressured to take longer goal kicks in the hope that one leads to a goal-scoring opportunity. We also see that teams vary in their ability to attack and defend pace in different regions of the pitch.
Forward attacking pace (\(V_E\)) is currently the most widely used metric of team-level pace (Harkins, 2016; Alexander, 2017; Silva, David and Swartz, 2018), but Yu et al. (2019) note that \(V_E\) decreases drastically as teams move into offensive regions of the pitch and is thus not an ideal metric for measuring a team’s offensive capabilities. Our findings contrast with this result: \(V_E\) declines only in the polygrids in front of the goal, not in the other polygrids in the offensive half. Since \(V_E\) in the offensive half is also comparable to that in the defensive half, we believe that \(V_E\) is an appropriate metric to gauge team-level pace.
Although we extracted meaningful findings from the pace metrics, there were limitations with the available data and methodology. The first is the presence of inaccurately tagged events. Incorrectly labeled coordinates or timestamps can affect the calculation of the pace metrics. In addition, we assumed that the ball always traveled in a straight line, as we did not know the true trajectory of the ball or if a player dribbled the ball before passing. Another limitation is the lack of player tracking data, as this type of data is not widely publicly available. Player tracking data could provide more information about the true, 3D trajectory of the ball, thus giving us more robust and accurate pace metrics. Lastly, many of the explanations we provided for our results are hypotheses that we cannot fully verify. We are unsure if some of our results are simply due to noise in the data or if they actually hold across seasons.
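To make the straight-line assumption concrete, the following is a minimal Python sketch of how pace between two consecutive tagged events might be computed under it. The function name, the coordinate convention (x along the attacking direction, in meters), the time unit (seconds), and the example values are assumptions made for illustration; the forward component shown here is one plausible reading of forward pace and may differ from the exact definition of \(V_E\) used in this work.

```python
import math


def straight_line_pace(x0, y0, t0, x1, y1, t1):
    """Return (total_pace, forward_pace) between two consecutive events.

    Assumes the ball travels in a straight line from (x0, y0) at time t0
    to (x1, y1) at time t1. Coordinates are assumed to be in meters with
    x along the attacking direction, and times are assumed to be in seconds.
    """
    dt = t1 - t0
    if dt <= 0:
        # A mis-tagged timestamp would otherwise give an undefined or negative pace.
        raise ValueError("events must be in strictly increasing time order")
    dx = x1 - x0
    dy = y1 - y0
    total_pace = math.hypot(dx, dy) / dt  # straight-line distance per second
    forward_pace = dx / dt                # displacement toward the opponent's goal per second
    return total_pace, forward_pace


# Made-up coordinates and timestamps, for illustration only.
total, forward = straight_line_pace(30.0, 20.0, 10.0, 33.0, 24.0, 12.0)
print(total, forward)
```

Because a straight segment is the shortest path between two points, any dribble or curved ball flight between the two events covers at least as much distance, so the total pace computed this way is a lower bound on the true value. The guard on the time difference also illustrates how a single incorrectly tagged timestamp can invalidate a pace calculation.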
While the models performed adequately at predicting the outcome of a match, they have some limitations. Our dataset contained only 1,826 regular-season matches, so splitting it into train and test sets further reduced the amount of data available to train the models. Another limitation is the lack of uncorrelated pre-game variables. We tried other variables, such as the number of points a team accumulated in the previous season and the number of top 100 players on a team. Unfortunately, they were all highly correlated with one another and with the Unbeaten variable. We also considered variables such as the average age of a team’s players and the market valuation of a team’s players, but were unable to find these values on a game-by-game basis.
5.1 Future Steps
This analysis considers pace at the team and league levels. However, pace can also be evaluated at the player level. Future work includes quantifying player-level pace and applying network analysis to passing networks to determine a player’s value within a team and whether that value changes over the season. We can also examine the relationship between pace and events in the game, such as pace before a shot is taken and how pace affects a team’s pass completion rate.
Our models may perform better if we incorporate data from multiple seasons. This would not only provide more data on which to train the models but could also help quantify heterogeneity across seasons, if season is included as a random effect in the hierarchical logistic model. We can also consider incorporating network-based variables. Finally, it may be worthwhile to consider other classification approaches, such as a multinomial hierarchical logistic regression, or to adopt a Bayesian approach to modeling.
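As a hedged illustration only, and not the specification fitted in this work, a season random effect could enter a binary hierarchical logistic model as a random intercept. All symbols below are assumptions introduced for this sketch: \(y_{is}\) is an indicator of the outcome of match \(i\) in season \(s\), \(\mathbf{x}_{is}\) is the vector of pre-game covariates, and \(u_s\) is a season-specific intercept.

\[
\operatorname{logit} \Pr(y_{is} = 1) = \beta_0 + \mathbf{x}_{is}^{\top} \boldsymbol{\beta} + u_s, \qquad u_s \sim \mathcal{N}(0, \sigma_u^2)
\]

Under this formulation, the estimated variance \(\sigma_u^2\) would summarize heterogeneity across seasons: a value near zero would suggest that match outcomes relate to the covariates in much the same way from season to season.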