Building a Premier League Champion Simulator

statistics
r
soccer
Author

Mark Jurries II

Published

April 28, 2023

City beat Arsenal 4-1 this week, bringing them to within 2 points of the top with 7 games to go compared to Arsenal’s 5. This gave me the itch to simulate the rest of the season, which is one of those projects that’s easy to start but would take a while to make really good. So what we’ve got here is a basic start, but there are better models out there.

We’ll start by creating a function that simply uses R’s pbinom function (using each teams’ goals per 90 minutes so far this season) to simulate 10000 games. It’s technically doing so for both teams playing, then checking how many points City or Arsenal would walk away with. This doesn’t consider opponent strength, home away, games played in the last week, expected goals, etc., so it’s overly simplistic.

We’ll save all of our simulations and check the results of each. For example:

Show the code
library(hrbrthemes)
library(gt)
library(gtExtras)
library(tidyverse)

set.seed(20230427)

#yes, manual input of the data - said this was quick. Could be improved by scraping data online.

man_city_schedule <- tibble(team = 'Man_City', 
       opp = c('Fulham', 'West Ham', 'Leeds United', 'Everton', 'Chelsea', 'Brighton', 'Brentford'),
       goals_per_90 = 2.58,
       opp_goals_per_90 = c(1.28, 1.03, 1.18, 0.72, 0.94, 1.70, 1.45))

arsenal_schedule <- tibble(team = 'Arsenal',
                          opp = c('Chelsea', 'Newcastle', 'Brighton', 'Nottingham Forest', 'Wolves'),
                          goals_per_90 = 2.27,
                          opp_goals_per_90 = c(0.94, 1.68, 1.70, 0.82, 0.79))

#function to simulate games
sim_games <- function(team_goals, opp_goals){
  n_sims <- 10000
  team_goals <- rpois(n_sims, team_goals)
  opp_goals<- rpois(n_sims, opp_goals)
  
  results <- cbind(team_goals, opp_goals) %>%
    as_tibble %>%
    mutate(goal_dif = team_goals - opp_goals,
           points = case_when(team_goals > opp_goals ~ 3, team_goals == opp_goals ~ 1, TRUE ~ 0),
           sim_num = row_number()) %>%
    relocate(sim_num)

  results  
}

simmed_season <- man_city_schedule %>%
  rbind(arsenal_schedule) %>%
  mutate(game_sims = map2(goals_per_90, opp_goals_per_90, ~sim_games(.x, .y))) %>%
  unnest(cols = c(game_sims))

simmed_season %>%
  filter(sim_num == 1) %>%
  mutate(team = ifelse(team == 'Man_City', 'Man City', team)) %>%
  select(-goals_per_90, -opp_goals_per_90, -sim_num) %>%
  gt(groupname_col = 'team') %>%
  gt_theme_espn() %>%
  summary_rows(
    groups = everything(),
    columns = c(team_goals, opp_goals, goal_dif, points),
    fns = list(
      Total = ~ sum(., na.rm = TRUE)
    )
  ) %>%
  tab_header(title = 'SIM 1 RESULTS')
SIM 1 RESULTS
opp team_goals opp_goals goal_dif points
Man City
Fulham 1 2 -1 0
West Ham 3 1 2 3
Leeds United 2 1 1 3
Everton 4 0 4 3
Chelsea 3 0 3 3
Brighton 1 2 -1 0
Brentford 2 0 2 3
Total 16 6 10 15
Arsenal
Chelsea 3 3 0 1
Newcastle 1 0 1 3
Brighton 1 3 -2 0
Nottingham Forest 5 0 5 3
Wolves 2 2 0 1
Total 12 8 4 8

So in sim 1, City gained 15 points and Arsenal only 8. With the gap currently at 2 points, that gives City 5 point advantage and the title. Chelsea loses to both clubs in this sim, so it checks out from a realistic results perspective.

Now that we have our bearings on individual sims, let’s aggregate things at the points level - i.e what are the different remaining combinations of points for each club and how likely are the different permutations?

Show the code
sims_agg <- simmed_season %>%
  group_by(team, sim_num) %>%
  summarise(points = sum(points),
            goal_dif = sum(goal_dif)) %>%
  pivot_wider(names_from = team, values_from  = c(points, goal_dif)) %>%
  mutate(winner = case_when(points_Man_City - points_Arsenal > 2 ~ 'City',
                            points_Man_City - points_Arsenal == 2 ~ 'Draw',
                            TRUE ~ 'Arsenal')) %>%
  group_by(winner, points_Man_City, points_Arsenal) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  ungroup() %>%
  mutate(perc = n / sum(n))

sims_agg %>%
  ggplot(aes(x = as.factor(points_Man_City), 
             y = as.factor(points_Arsenal), 
             fill = winner))+
  geom_tile(color = "darkgrey",
            lwd = 0.5,
            linetype = 1)+
  geom_text(aes(label = scales::percent(perc, accuracy = .11)))+
  theme_ipsum()+
  scale_fill_manual(values = c('#EF0107', '#6CABDD', 'grey'))+
  xlab('Man City Points')+
  ylab('Arsenal Points')+
  theme(legend.position = 'none',
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())+
  ggtitle('Arsenal vs City Expected Outcome Odds')

So we’d say there’s a 0.11% chance that Arsenal gets 15 more points and City gets 7. The most likely outcome is City with 16 points and Arsenal with 10, occurring in 2.97% of our sims. But as interesting as this may or may not be*, we can do better and just aggregate everything up to who has the most points.

*Let’s face it - this is a lot.
Show the code
sims_agg %>%
  group_by(winner) %>%
  summarise(sims = sum(n),
            perc = sum(perc)) %>%
  arrange(desc(sims)) %>%
  gt() %>%
  gt_theme_espn() %>%
  fmt_percent(columns = perc, decimals = 1) %>%
  fmt_number(columns = sims, decimals = 0)
winner sims perc
City 7,386 73.9%
Arsenal 2,000 20.0%
Draw 614 6.1%

City wins in 73.9% of our sims. And we may as well give them the ties - they currently have a 13 goal lead in differential, which is going to be hard to make up in the time remaining, and anyway it didn’t happen in any of our sims. As high as 80% odds are, that’s actually under the 90% odds 538 is giving City*.

*They’ve spent years perfecting their model, while I just threw this together in an hour, so we’ll cut some slack here.

So if you’re a City fan, you can mostly relax. And if you’re an Arsenal fan, 10% to 20% might seem low, but…