We Have the 538 Simulator at Home

statistics
r
soccer
Author

Mark Jurries II

Published

December 29, 2023

One of my weekly check-ins during the last Premier League season was the 538 Predictions page. Seeing the odds of where each team would finish up after each matchweek was always satisfying. However; Nate Silver is no longer at 538 and he took this model with him, so it’s no longer being updated. With the season at the halfway point, I decided to make my own sim to scratch that itch.

For the match sims, I take each squad’s average goals per game and use that to generate scores. This is ludicrously simplistic - it’s not taking into account opposing defense, recent history, who’s playing, etc. There’s also a case to use expected goals instead of goals. But for a quick and dirty model, it’ll do the trick.

We then run the sim for each match 20,000 times, assigning an ID to each that will be used for a season-level sim. For instance, here’s the summary of the expected outcome of this weekend’s match of Manchester City (2.38 goals per game) against Sheffield United (0.78 goals per game). Hit “Show the Code” if you want to see how it works under the hood.

Show the code
library(gt)
library(gtExtras)
library(rvest)
library(tidyverse)
library(worldfootballR)

#set seed for reproducibility
set.seed(20231228)

#fetch current table using the worldfootballr package
prem_2024_table <- fb_season_team_stats(country = "ENG", 
                                        gender = "M", 
                                        season_end_year = "2024", 
                                        tier = "1st", 
                                        stat_type = "league_table")

pl_table <- prem_2024_table %>%
  select(Squad, MP, Pts, W, D, L, GF, GA, GD, xG, xGA, Rk) %>%
  rename(Points = Pts,
         Goals = GF,
         Goals_Allowed = GA,
         Rank = Rk,
         Goal_Dif = GD,
         xG_Allowed = xGA)

#read data from fb ref. Since there's only one table, it's pretty simple.
pl_url <- read_html("https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures")

pl_tbls <- html_nodes(pl_url, "table")

pl_scores <- pl_tbls %>%
  html_table(fill = TRUE)

#rather than use a case statement over and over, just make this a function
calculate_points <- function(a, b){
  case_when(a > b ~ 3,
            a == b ~ 1,
            a > b ~ 0,
            TRUE ~ 0)
}

#how many seasons we want to simulate
number_of_sims <- 20000

#function to simulate games. Accepts goals as an average,
#so could use average goals or average xG
sim_games <- function(home_goals, away_goals){
  n_sims <- number_of_sims
  home_goals <- rpois(n_sims, home_goals)
  away_goals<- rpois(n_sims, away_goals)
  
  results <- cbind(home_goals, away_goals) %>%
    as_tibble %>%
    mutate(goal_dif = home_goals - away_goals,
           home_points = calculate_points(home_goals, away_goals),
           away_points = calculate_points(away_goals, home_goals),
           sim_num = row_number()) %>%
    relocate(sim_num)
  
  results  
}

#clean our data
pl_prepped <- pl_scores[[1]] %>%
  as_tibble(.name_repair = "unique") %>%
  rename(HomexG = `xG...6`,
         AwayxG = `xG...8`) %>%
  separate(Score, c('HomeGoal', 'AwayGoal')) %>%
  mutate(HomePoints = calculate_points(HomeGoal, AwayGoal),
         AwayPoints = calculate_points(AwayGoal, HomeGoal)) %>%
  filter(is.na(Wk) == FALSE) %>%
  mutate(HomeGoal = as.numeric(HomeGoal),
         AwayGoal = as.numeric(AwayGoal))

#get max date with a game
Max_Date <- pl_prepped  %>%
  filter(`Match Report` == 'Match Report') %>%
  summarise(Max_Date = max(Date)) %>%
  pull()

#get unplayed games - we don't need to sim games that have already been played
pl_remaning <- pl_prepped %>%
  filter(`Match Report` != 'Match Report' | Notes == 'Match Suspended')

#do some light prep to bring in average goals for each team from the table,
#then run our sime
pl_simmed <- pl_remaning %>%
  select(Wk, Date, Home, Away) %>%
  inner_join(pl_table %>% select(Squad, Goals, MP) %>% rename(Home_Goals = Goals, Home_MP = MP), join_by(Home == Squad)) %>%
  inner_join(pl_table %>% select(Squad, Goals, MP) %>% rename(Away_Goals = Goals, Away_MP = MP), join_by(Away == Squad)) %>%
  mutate(Home_G_per_Match = Home_Goals / Home_MP,
         Away_G_per_Match = Away_Goals / Away_MP) %>%
  mutate(simmed_games = map2(Home_G_per_Match, Away_G_per_Match, ~sim_games(.x, .y)))

#unnest our sims, split into home and away
pl_simmed_long <- pl_simmed %>%
  select(Wk, Date, Home, Away, simmed_games) %>%
  unnest(simmed_games)

pl_simmed_home <- pl_simmed_long %>%
  select(sim_num, Home, home_points, home_goals,  away_goals) %>%
  rename(Squad = Home,
         Goals = home_goals,
         Goals_Allowed = away_goals,
         Points = home_points
         ) %>%
  mutate(W = case_when(Goals > Goals_Allowed ~ 1, TRUE ~ 0),
         D = case_when(Goals == Goals_Allowed ~ 1, TRUE ~ 0),
         L = case_when(Goals < Goals_Allowed ~ 1, TRUE ~ 0))

pl_simmed_away <- pl_simmed_long %>%
  select(sim_num, Away, away_points, away_goals, home_goals) %>%
  rename(Squad = Away,
         Goals = away_goals,
         Goals_Allowed = home_goals,
         Points = away_points
         ) %>%
  mutate(W = case_when(Goals > Goals_Allowed ~ 1, TRUE ~ 0),
         D = case_when(Goals == Goals_Allowed ~ 1, TRUE ~ 0),
         L = case_when(Goals < Goals_Allowed ~ 1, TRUE ~ 0))

#we need actual in the data as well, so we repliate the table 
#to match our number of sims
table_rep <- seq(1:number_of_sims) %>%
  as_tibble() %>%
  rename(sim_num = value) %>%
  crossing(pl_table %>% select(Squad, Points, Goals, Goals_Allowed, W, D, L))

#take all of our sims and summarize. Note this includes the sim number if we want to play with that
pl_aggregated <- pl_simmed_home %>%
  rbind(pl_simmed_away) %>%
  rbind(table_rep) %>%
  group_by(sim_num, Squad) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  mutate(Goal_Dif = Goals - Goals_Allowed) %>%
  group_by(sim_num) %>%
  arrange(desc(Points), desc(Goal_Dif)) %>%
  mutate(Rank = rank(desc(Points), ties.method="first"),
         win_PL = case_when(Rank == 1 ~ 1, TRUE ~ 0),
         qualify_UCL = case_when(Rank <= 4 ~ 1, TRUE ~ 0),
         relgated = case_when(Rank >= 18 ~ 1, TRUE ~ 0))

pl_team <- pl_aggregated %>%
  ungroup() %>%
  select(-sim_num, -Rank) %>%
  group_by(Squad) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  arrange(desc(Points))

pl_sim_table <- pl_table %>%
  inner_join(pl_team, by = join_by(Squad)) %>%
  select(-xG, -xG_Allowed) %>%
  arrange(desc(win_PL), relgated) 

pl_simmed_long %>%
  filter(Date == '2023-12-30' & Home == 'Manchester City') %>%
  mutate(W = case_when(home_goals > away_goals ~ 1, TRUE ~ 0),
         D = case_when(home_goals == away_goals ~ 1, TRUE ~ 0),
         L = case_when(home_goals < away_goals ~ 1, TRUE ~ 0)) %>%
  group_by(Date, Home, Away) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  select(-Wk, -sim_num) %>%
  ungroup() %>%
  gt() %>%
  gt_theme_espn() %>%
  cols_label(home_goals = 'Home Goals',
             away_goals = 'Away Goals',
             goal_dif = 'GD',
             home_points = 'Home Points',
             away_points = 'Away Points') %>%
  fmt_percent(columns = c(W, D, L), decimals = 1) %>%
  fmt_number(columns = c(home_goals, away_goals, goal_dif, home_points, away_points),
             decimals = 2)
Date Home Away Home Goals Away Goals GD Home Points Away Points W D L
2023-12-30 Manchester City Sheffield Utd 2.40 0.79 1.61 2.36 0.48 73.4% 16.1% 10.5%

City is expected to win this game in 73.4% of sims, draw in 16.1%, and lose in 10.5%, for an average of 2.36 points. Notice that the expected goals for each club wound up matching their average goals per match for the season.

Now that we have this down, let’s get to our projected table.

Show the code
pl_sim_table %>%
  gt() %>%
  gt_theme_espn() %>%
  tab_spanner(
    label = "Current Table",
    columns = c("MP", "Points.x", "Goals.x", "Goals_Allowed.x", "Goal_Dif.x", "Rank", "W.x", "D.x", "L.x")
  ) %>%
  tab_spanner(
    label = "Projected",
    columns = c("Points.y", "Goals.y", "Goals_Allowed.y", "Goal_Dif.y", "win_PL", "qualify_UCL", "relgated", "W.y", "D.y", "L.y")
  ) %>%
  gt_add_divider(columns = "Rank", style = "solid", color = 'lightgrey') %>%
  data_color(
    columns = c("win_PL", "qualify_UCL", "relgated"),
    method = "numeric",
    palette = c('#00ff85', '#ffffff'),
    #omain = c(0.01, 1),
    na_color = "white",
    reverse = TRUE
  ) %>%
  fmt_percent(columns = c("win_PL", "qualify_UCL", "relgated"), decimals = 1) %>%
  fmt_number(columns = c("Points.y", "Goals.y", "Goals_Allowed.y", "Goal_Dif.y",
                         "W.y", "D.y", "L.y"), decimals = 1) %>%
  cols_width(win_PL ~ px(95),
             qualify_UCL ~ px(95),
             relgated ~ px(95)) %>%
  cols_label(
    Points.x = "Points",
    Goals.x = "G",
    Goals_Allowed.x = "GA",
    Goal_Dif.x = "GD",
    W.x = "W",
    D.x = "D",
    L.x = "L",
    Points.y = "Points",
    Goals.y = "G",
    Goals_Allowed.y = "GA",
    Goal_Dif.y = "GD",
    win_PL = 'Win PL',
    qualify_UCL = 'Qualify UCL',
    relgated = 'Relegated',
    W.y = "W",
    D.y = "D",
    L.y = "L",
  ) %>%
  tab_header(title = 'Premier League Projected Standings',
             subtitle = paste('Using data from matches played through ', Max_Date)) %>%
  tab_footnote('Table data courtesy of fbref.com') %>%
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_column_labels()) %>%
  cols_hide(columns = c(W.x, D.x, L.x, W.y, D.y, L.y))
Premier League Projected Standings
Using data from matches played through 2023-12-28
Squad Current Table Projected
MP Points G GA GD Rank Points G GA GD Win PL Qualify UCL Relegated
Manchester City 18 37 43 21 22 4 75.4 90.8 50.9 39.8 36.3% 89.1% 0.0%
Liverpool 19 42 39 16 23 1 74.8 78.0 44.9 33.1 29.5% 86.6% 0.0%
Aston Villa 19 39 40 25 15 3 72.4 80.0 53.9 26.2 15.9% 74.6% 0.0%
Arsenal 19 40 36 18 18 2 71.0 72.0 47.0 25.0 11.3% 66.6% 0.0%
Tottenham 19 36 39 28 11 5 68.7 77.9 56.9 21.0 5.7% 49.4% 0.0%
West Ham 19 33 33 30 3 6 62.0 66.0 59.2 6.8 0.4% 11.2% 0.0%
Brighton 19 30 38 33 5 8 62.2 76.0 61.9 14.1 0.4% 12.7% 0.0%
Newcastle Utd 19 29 37 25 12 9 60.7 74.0 53.9 20.1 0.3% 8.6% 0.0%
Manchester Utd 19 31 21 25 -4 7 51.5 42.0 54.8 −12.9 0.0% 0.2% 0.1%
Chelsea 19 25 31 29 2 10 52.7 61.9 58.3 3.6 0.0% 0.4% 0.1%
Bournemouth 18 25 27 32 -5 12 52.7 57.1 62.6 −5.5 0.0% 0.4% 0.1%
Wolves 19 25 27 31 -4 11 50.0 54.1 60.4 −6.4 0.0% 0.1% 0.4%
Fulham 19 21 26 34 -8 13 45.2 52.0 63.6 −11.5 0.0% 0.0% 2.9%
Brentford 18 19 25 28 -3 14 44.4 52.9 60.0 −7.1 0.0% 0.0% 4.3%
Everton 19 16 24 25 -1 17 38.8 48.1 54.6 −6.6 0.0% 0.0% 19.8%
Nott'ham Forest 19 17 22 34 -12 16 38.4 44.0 63.7 −19.7 0.0% 0.0% 24.5%
Luton Town 18 15 21 34 -13 18 37.6 44.3 65.3 −20.9 0.0% 0.0% 30.8%
Crystal Palace 19 18 19 28 -9 15 37.0 38.0 58.0 −20.0 0.0% 0.0% 33.0%
Burnley 19 11 18 38 -20 19 29.3 36.0 68.0 −32.0 0.0% 0.0% 86.6%
Sheffield Utd 19 9 15 47 -32 20 24.8 30.0 77.1 −47.2 0.0% 0.0% 97.5%
Table data courtesy of fbref.com

So even though City is currently in 4th, the model has them winning the PL in 36% of simulations. (And the model has no idea that they should be getting DeBruyne and Haaland back in January.) It has Arsenal slipping to 4th, with Liverpool sliding to 2nd (with a 29.5% chance of winning the league) and Aston Villa remaning in 3rd. Meanwhile, Manchester United drops all the way to 10th. Having a negative goal differential is not a great long-term strategy, as it turns out.

Meanwhile, poor Sheffield United is almost certain to be relegated, as is Burnley. Crystal Palace, Luton Town, Nottingham Forest and Everton will be fighting to avoid the final relegation spot.

Building this framework was not particularly time-consuming. Improving the match simulator would certainly help things out, as would making it flexible enough to back-test to check accuracy - as a City fan, I like seeing them on top in my projections, but as an analyst I’m skeptical of any model that shows me what I want to see. Still, it looks reasonable enough to work with for now.