I recently built a quick from-scratch Premier League season simulator. For the game simulator, I simply took the Squad’s average goals in the season and fed that into a Poisson model generator (rpois in R) and ran against the opposing squad’s simulated distribution. This is OK, but it doesn’t take into account a few things : home/away splits, strength of defense, or noise in the data.
Home/away is easy enough to grab. We have goals allowed. Noise can be reduced by using Expected Goals (xG) instead of Goals. My biggest challenge was how to capture the interaction between Goals and opposing defense. I eventually found a really helpful article that used the following methodology:
For the home team, calc their average home xG and divide by league average xG.
Then calculate the opposition average home Expected Goals Allowed (xGA) and divide by league average.
Multiply these two numbers together, then multiply by league average home xG.
For example, let’s look at Man City’s match against Sheffield United. City went into the game with a home xG of 1.78, which is 1.03 from league average of 1.73. Sheffield United had an away xGA of 1.29, which of 1.05 from league average of 1.23. So we get 1.03 * 1.05 * 1.73, or 1.87 xG for City. It’s a relatively small adjustment, but worth the trouble.
Note these are done using 12 week exponential moving averages (another concept I took from the article linked above). This is in Premier League / Champions league only, and uses the most recent data to forecast the rest of the season. The idea being that we can capture change in team performance without giving up too much historical data. At some point it’d be nice to update the model to adjust the moving average after each simulated match, but I’m saving that for another day.
OK, I made all this effort, how well does it work? We’ll calculate each match’s Brier scores., using matchweeks 18 and 19 for testing. The total is simply the average of the Brier for Win, Draw, and Loss.
Show the code
library(gt)library(gtExtras)library(hrbrthemes)library(patchwork) library(pracma)library(plotly)library(rvest)library(tidyverse)library(worldfootballR)#https://beatthebookie.blog/2021/06/07/the-predictive-power-of-xg/#set seed for reproducibilityset.seed(20231228)#use this to for the season end year#makes it easier to update stuff laterseason_end <-2024ema_duration <-12#how many seasons we want to simulatenumber_of_sims <-20000#rather than use a case statement over and over, just make this a functioncalculate_points <-function(a, b){case_when(a > b ~3, a == b ~1, a > b ~0,TRUE~0)}#fetch current table using the worldfootballr packageprem_2024_table <-fb_season_team_stats(country ="ENG", gender ="M", season_end_year ="2024", tier ="1st", stat_type ="league_table")pl_table <- prem_2024_table %>%select(Squad, MP, Pts, W, D, L, GF, GA, GD, xG, xGA, Rk) %>%rename(Points = Pts,Goals = GF,Goals_Allowed = GA,Rank = Rk,Goal_Dif = GD,xG_Allowed = xGA)#get match results for current season and prior season#also grab champions league so we have something for clubs that got promotedmatch_pl_cs <-fb_match_results(country ="ENG", gender ="M",season_end_year = season_end,tier ="1st")match_pl_ps <-fb_match_results(country ="ENG", gender ="M",season_end_year = season_end -1,tier ="1st")match_efl_ps <-fb_match_results(country ="ENG", gender ="M",season_end_year = season_end -1,tier ="2nd")#set aside unplayed matches for laterpl_future_matches <- match_pl_cs %>%filter(is.na(HomeGoals) | Notes =='Match Suspended') %>%select(-MatchURL) %>%as_tibble()#get max date with a gameMax_Date <- match_pl_cs %>%filter(!is.na(HomeGoals)) %>%summarise(Max_Date =max(Date)) %>%pull()#get home/away goals for the league using moving averageleaguge_exp <- match_pl_ps %>%rbind(match_pl_cs) %>%rbind(match_efl_ps) %>%mutate(Wk =as.numeric(Wk)) %>%group_by(Competition_Name, Season_End_Year, Wk) %>%summarise(mean_Home_xG =mean(Home_xG, na.rm =TRUE),mean_Away_xG =mean(Away_xG, na.rm =TRUE),d =min(Date)) %>%arrange(d) %>%group_by(Competition_Name) %>%mutate(lg_ema_Home_xG =movavg(mean_Home_xG, n = ema_duration, type ='e'),lg_ema_Away_xG =movavg(mean_Away_xG, n = ema_duration, type ='e') ) %>%select(-mean_Home_xG, -mean_Away_xG, -d)#get club expected goals relative to league average for both home and away#using a expoential moving average for each, i.e. compared to league over #last n matches, how good/bad are they at scoring and preventing goalsclub_xg_ema <- match_pl_ps %>%rbind(match_pl_cs) %>%rbind(match_efl_ps) %>%mutate(Wk =as.numeric(Wk)) %>%select(Competition_Name, Season_End_Year, Wk, Home, Away, Home_xG, Away_xG) %>%gather(key = Location, value = Squad, Home, Away) %>%filter(!is.na(Home_xG)) %>%filter(!(Season_End_Year ==2024& Wk ==20)) %>%mutate(xG =case_when(Location =='Home'~ Home_xG, TRUE~ Away_xG),xGA =case_when(Location =='Home'~ Away_xG , TRUE~ Home_xG)) %>%select(-Home_xG, -Away_xG) %>%arrange(Squad, Season_End_Year, Wk) %>%group_by(Squad, Location) %>%mutate(ema_xG =movavg(xG, n = ema_duration, type ='e'),ema_xGA =movavg(xGA, n = ema_duration, type ='e')) %>%inner_join(leaguge_exp) %>%mutate(rel_xG =case_when(Location =='Home'~ ema_xG / lg_ema_Home_xG,TRUE~ ema_xG / lg_ema_Away_xG),rel_xGa =case_when(Location =='Home'~ ema_xGA / lg_ema_Away_xG,TRUE~ ema_xGA / lg_ema_Home_xG)) %>%group_by(Squad, Location) %>%filter(Season_End_Year == season_end) %>%filter(Wk ==max(Wk)) %>%select(Wk, Squad, lg_ema_Home_xG, lg_ema_Away_xG, rel_xG, rel_xGa)sim_games3 <-function(home, away){ n_sims <- number_of_sims home_table <- club_xg_ema %>%filter(Squad == home & Location =='Home') away_table <- club_xg_ema %>%filter(Squad == away & Location =='Away') home_exp <- (home_table$rel_xG * away_table$rel_xGa) * home_table$lg_ema_Home_xG away_exp <- (away_table$rel_xG * home_table$rel_xGa) * away_table$lg_ema_Home_xG home_goals <-rpois(n_sims, home_exp) away_goals<-rpois(n_sims, away_exp) results <-cbind(home_goals, away_goals) %>% as_tibble %>%mutate(goal_dif = home_goals - away_goals,home_points =calculate_points(home_goals, away_goals),away_points =calculate_points(away_goals, home_goals),sim_num =row_number()) %>%relocate(sim_num) results }#old version of simmer just in casesim_games <-function(home_goals, away_goals){ n_sims <- number_of_sims home_goals <-rpois(n_sims, home_goals) away_goals<-rpois(n_sims, away_goals) results <-cbind(home_goals, away_goals) %>% as_tibble %>%mutate(goal_dif = home_goals - away_goals,home_points =calculate_points(home_goals, away_goals),away_points =calculate_points(away_goals, home_goals),sim_num =row_number()) %>%relocate(sim_num) results }sim_new <- match_pl_cs %>%filter(Wk %in%c(18, 19) &!is.na(HomeGoals)) %>%select(Wk, Date, Home, Away, HomeGoals, AwayGoals) %>%inner_join(pl_table %>%select(Squad, Goals, xG, MP) %>%rename(Home_Goals = Goals, Home_xG = xG, Home_MP = MP), join_by(Home == Squad)) %>%inner_join(pl_table %>%select(Squad, Goals, xG, MP) %>%rename(Away_Goals = Goals, Away_xG = xG, Away_MP = MP), join_by(Away == Squad)) %>%mutate(Home_G_per_Match = Home_Goals / Home_MP,Away_G_per_Match = Away_Goals / Away_MP) %>%mutate(simmed_games =map2(Home, Away, ~sim_games3(.x, .y))) %>%unnest(simmed_games)sim_agg_with_brier <- sim_new %>%mutate(W =case_when(HomeGoals > AwayGoals ~1, TRUE~0),D =case_when(HomeGoals == AwayGoals ~1, TRUE~0),L =case_when(HomeGoals < AwayGoals ~1, TRUE~0)) %>%mutate(W_sim =case_when(home_goals > away_goals ~1, TRUE~0),D_sim =case_when(home_goals == away_goals ~1, TRUE~0),L_sim =case_when(home_goals < away_goals ~1, TRUE~0)) %>%group_by(Wk, Home, Away) %>%summarise(W =mean(W),D =mean(D),L =mean(L),HomeGoals =mean(HomeGoals),AwayGoals =mean(AwayGoals),W_sim =mean(W_sim),D_sim =mean(D_sim),L_sim =mean(L_sim)) %>%mutate(Brier_W =case_when(W ==1~ (W_sim -1)^2, TRUE~ W_sim ^2),Brier_D =case_when(D ==1~ (D_sim -1)^2, TRUE~ D_sim ^2),Brier_L =case_when(L ==1~ (L_sim -1)^2, TRUE~ L_sim ^2),Brier_T = (Brier_W + Brier_D + Brier_L) /3 )bw <- sim_agg_with_brier %>%ggplot(aes(x = Brier_W))+geom_density()+geom_rug()+theme_ipsum()+labs(title ='Brier Win')+xlim(0, 0.85)bd <- sim_agg_with_brier %>%ggplot(aes(x = Brier_D))+geom_density()+geom_rug()+theme_ipsum()+labs(title ='Brier Draw')+xlim(0, 0.85)bl <- sim_agg_with_brier %>%ggplot(aes(x = Brier_L))+geom_density()+geom_rug()+theme_ipsum()+labs(title ='Brier Loss')+xlim(0, 0.85)bt <- sim_agg_with_brier %>%ggplot(aes(x = Brier_T))+geom_density()+geom_rug()+theme_ipsum()+labs(title ='Brier Total')+xlim(0, 0.85)(bw | bd) / (bl | bt)
This isn’t bad - the average sits around 0.35, so we’re better than a coin toss. It’s doing suspiciously well on draws, while win and loss only look OK. Let’s compare our average prediction for each outcome with the average outcomes overall.
Not too shabby, with an overall Brier score of 0.24. We’d like it to be closer, it overestimated draws and underestimated wins in our test data. Where is it missing? Let’s check our top 10 Brier scores. Win/loss perspectives are from the home team’s perspective.
Our highest score comes from Arsenal vs. West Ham. The model pegged the Gunners with a 67.4% chance of winning and a 17.3% chance of losing, leaving West Ham with a mere 15.3% win probability. They won it 0-2, while the xG was 2.7 to 1.4 suggests it was a tough loss it was nonetheless a loss. (It was a great game to watch, too - both squads played very well.) I think we’d all have been suspicious of a model that heavily favored West Ham going in - if anything, 15.3% was too high a win probability for them - so we won’t worry about this match.
Going down the list, most of these are upsets. I like the Wolves and all, but wouldn’t pick them to beat Chelsea (even considering how lousy Chelsea’s been lately), yet here we go. Nottingham Forest* beating Newcastle was another great game to watch, and yet again an upset.
*Caveat: Nottingham is quickly becoming my favorite underdog squad. Just a lot of fun to watch.
The model thought City had a 73.5% chance of beating Everton, and guess what City did? Most of these are really good clubs beating up weaker clubs, though there are also some mid-table clubs in there as well.
The chief thing here is that the model is being perhaps too conservative. City’s odds for winning against Everton were surely higher. (Yeah, the Brier for their win was 0.07, but if it’d given 80% odds the score would’ve been 0.04, a 43% improvement.) As a rule, the higher the odds, the lower the Brier score when you’re right. So that’s definitely something to look tweaking going forward. For now, it’ll do.
With this done, we can update the projected table for the rest of the season.
Show the code
pl_simmed <- pl_future_matches %>%select(Wk, Date, Home, Away) %>%inner_join(pl_table %>%select(Squad, Goals, xG, MP) %>%rename(Home_Goals = Goals, Home_xG = xG, Home_MP = MP), join_by(Home == Squad)) %>%inner_join(pl_table %>%select(Squad, Goals, xG, MP) %>%rename(Away_Goals = Goals, Away_xG = xG, Away_MP = MP), join_by(Away == Squad)) %>%mutate(Home_G_per_Match = Home_Goals / Home_MP,Away_G_per_Match = Away_Goals / Away_MP) %>%mutate(simmed_games =map2(Home, Away, ~sim_games3(.x, .y)))#unnest our sims, split into home and awaypl_simmed_long <- pl_simmed %>%select(Wk, Date, Home, Away, simmed_games) %>%unnest(simmed_games)pl_simmed_home <- pl_simmed_long %>%select(sim_num, Home, home_points, home_goals, away_goals) %>%rename(Squad = Home,Goals = home_goals,Goals_Allowed = away_goals,Points = home_points ) %>%mutate(W =case_when(Goals > Goals_Allowed ~1, TRUE~0),D =case_when(Goals == Goals_Allowed ~1, TRUE~0),L =case_when(Goals < Goals_Allowed ~1, TRUE~0))pl_simmed_away <- pl_simmed_long %>%select(sim_num, Away, away_points, away_goals, home_goals) %>%rename(Squad = Away,Goals = away_goals,Goals_Allowed = home_goals,Points = away_points ) %>%mutate(W =case_when(Goals > Goals_Allowed ~1, TRUE~0),D =case_when(Goals == Goals_Allowed ~1, TRUE~0),L =case_when(Goals < Goals_Allowed ~1, TRUE~0))#we need actual in the data as well, so we repliate the table #to match our number of simstable_rep <-seq(1:number_of_sims) %>%as_tibble() %>%rename(sim_num = value) %>%crossing(pl_table %>%select(Squad, Points, Goals, Goals_Allowed, W, D, L))#take all of our sims and summarize. Note this includes the sim number if we want to play with thatpl_aggregated <- pl_simmed_home %>%rbind(pl_simmed_away) %>%rbind(table_rep) %>%group_by(sim_num, Squad) %>%summarise_if(is.numeric, sum, na.rm =TRUE) %>%mutate(Goal_Dif = Goals - Goals_Allowed) %>%group_by(sim_num) %>%arrange(desc(Points), desc(Goal_Dif)) %>%mutate(Rank =rank(desc(Points), ties.method="first"),win_PL =case_when(Rank ==1~1, TRUE~0),qualify_UCL =case_when(Rank <=4~1, TRUE~0),relgated =case_when(Rank >=18~1, TRUE~0))pl_team <- pl_aggregated %>%ungroup() %>%select(-sim_num, -Rank) %>%group_by(Squad) %>%summarise_if(is.numeric, mean, na.rm =TRUE) %>%arrange(desc(Points))pl_sim_table <- pl_table %>%inner_join(pl_team, by =join_by(Squad)) %>%select(-xG, -xG_Allowed) %>%arrange(desc(win_PL), relgated) pl_sim_table %>%gt() %>%gt_theme_espn() %>%tab_spanner(label ="Current Table",columns =c("MP", "Points.x", "Goals.x", "Goals_Allowed.x", "Goal_Dif.x", "Rank", "W.x", "D.x", "L.x") ) %>%tab_spanner(label ="Projected",columns =c("Points.y", "Goals.y", "Goals_Allowed.y", "Goal_Dif.y", "win_PL", "qualify_UCL", "relgated", "W.y", "D.y", "L.y") ) %>%gt_add_divider(columns ="Rank", style ="solid", color ='lightgrey') %>%data_color(columns =c("win_PL", "qualify_UCL", "relgated"),method ="numeric",palette =c('#00ff85', '#ffffff'),#omain = c(0.01, 1),na_color ="white",reverse =TRUE ) %>%fmt_percent(columns =c("win_PL", "qualify_UCL", "relgated"), decimals =1) %>%fmt_number(columns =c("Points.y", "Goals.y", "Goals_Allowed.y", "Goal_Dif.y","W.y", "D.y", "L.y"), decimals =1) %>%cols_width(win_PL ~px(95), qualify_UCL ~px(95), relgated ~px(95)) %>%cols_label(Points.x ="Points",Goals.x ="G",Goals_Allowed.x ="GA",Goal_Dif.x ="GD",W.x ="W",D.x ="D",L.x ="L",Points.y ="Points",Goals.y ="G",Goals_Allowed.y ="GA",Goal_Dif.y ="GD",win_PL ='Win PL',qualify_UCL ='Qualify UCL',relgated ='Relegated',W.y ="W",D.y ="D",L.y ="L", ) %>%tab_header(title ='Premier League Projected Standings',subtitle =paste('Using data from matches played through ', Max_Date)) %>%tab_footnote('Table data courtesy of fbref.com') %>%tab_style(style =cell_text(align ="center"),locations =cells_column_labels()) %>%cols_hide(columns =c(W.x, D.x, L.x, W.y, D.y, L.y))
Premier League Projected Standings
Using data from matches played through 2024-01-02
Squad
Current Table
Projected
MP
Points
G
GA
GD
Rank
Points
G
GA
GD
Win PL
Qualify UCL
Relegated
Liverpool
20
45
43
18
25
1
79.4
81.7
41.0
40.6
43.7%
99.0%
0.0%
Manchester City
19
40
45
21
24
3
77.2
87.0
43.9
43.2
28.1%
97.3%
0.0%
Arsenal
20
40
37
20
17
4
77.0
75.4
38.7
36.7
24.1%
97.4%
0.0%
Aston Villa
20
42
43
27
16
2
70.9
75.1
52.7
22.3
4.0%
78.9%
0.0%
Tottenham
20
39
42
29
13
5
61.4
75.3
68.1
7.2
0.0%
10.1%
0.0%
Newcastle Utd
20
29
39
29
10
9
59.7
79.9
60.2
19.7
0.0%
7.2%
0.0%
Chelsea
20
28
34
31
3
10
57.9
73.1
61.9
11.3
0.0%
4.0%
0.0%
Brighton
20
31
38
33
5
7
57.3
72.9
66.1
6.8
0.0%
2.9%
0.0%
West Ham
20
34
33
30
3
6
56.3
61.4
63.3
−1.9
0.0%
1.8%
0.0%
Manchester Utd
20
31
22
27
-5
8
55.1
53.2
60.0
−6.8
0.0%
1.1%
0.0%
Bournemouth
19
25
28
35
-7
12
51.3
58.3
65.6
−7.3
0.0%
0.2%
0.2%
Brentford
19
19
26
31
-5
16
48.6
61.0
61.3
−0.3
0.0%
0.1%
0.5%
Wolves
20
28
30
31
-1
11
47.1
55.4
66.9
−11.5
0.0%
0.0%
0.9%
Crystal Palace
20
21
22
29
-7
14
44.1
48.6
58.8
−10.3
0.0%
0.0%
3.7%
Fulham
20
24
28
35
-7
13
43.7
52.5
68.3
−15.8
0.0%
0.0%
4.5%
Everton
20
16
24
28
-4
17
39.9
53.6
59.6
−6.0
0.0%
0.0%
13.5%
Nott'ham Forest
20
20
24
35
-11
15
40.1
51.3
71.3
−20.0
0.0%
0.0%
14.2%
Burnley
20
11
20
41
-21
19
30.5
40.8
69.8
−29.1
0.0%
0.0%
82.9%
Luton Town
19
15
23
37
-14
18
30.7
43.3
75.8
−32.5
0.0%
0.0%
83.8%
Sheffield Utd
20
9
15
49
-34
20
26.5
36.4
82.8
−46.4
0.0%
0.0%
95.7%
Table data courtesy of fbref.com
Liverpool is sitting pretty nice right now, with City and Arsenal in contention. Liverpool also just beat Newcastle 4-2, with an xG line of 7-0.6 (!), so their numbers are a bit inflated right now.
The takeaway here is that bootstrapping is relatively trivial, but making sure that what you’re bootstrapping is solid is a lot more difficult. Soccer has a fair amount on unpredictability which should adjust our expectations, but that doesn’t mean there’s no room for improvement.