Simulating Data, Or How Many Goals Will Haaland Get?

probability
r
soccer
Author

Mark Jurries II

Published

September 2, 2022

Erling Haaland is off to a fantastic start this year, with 9 goals in only 5 games. If we assumed that he would keep this pace up, then an average of 1.8 goals per game times 38 games would result in 68 goals, easily smashing Mo Salah’s record of 32. This projection is perhaps a bit silly, though. Thankfully, we can use simulation in R to arrive at (somewhat) more reasonable goals.

He wears nine and has scored nine, which is nice.
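
The pace projection above is a single multiplication. As a minimal sketch (the variable names are illustrative and not part of the code used later):

```r
# Naive pace projection: assume he keeps scoring at his current rate all season
goals_so_far <- 9
games_played <- 5
season_games <- 38

goals_per_game <- goals_so_far / games_played      # 1.8 goals per game
naive_projection <- goals_per_game * season_games  # about 68.4 goals
naive_projection
```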

Goal scoring roughly follows a Poisson distribution, so we’ll use the rpois function in R to simulate the remaining 33 games, with the scoring rate taken from his games so far this year plus his goals last season for Borussia Dortmund. For example:

library(tidyverse)
library(hrbrthemes)
library(gt)
library(ggtext)

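# Fix the random seed so the simulated results are reproducible
set.seed(82722)

# Goals per game: 5 Premier League games this season, 24 Bundesliga games last season
haaland_pl <- c(2, 0, 1, 3, 3)
haaland_bl <- c(2, 0, 1, 2, 2, 2, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 0, 3, 0, 1)
haaland_both <- c(haaland_bl, haaland_pl)
games_remaining <- 38 - length(haaland_pl)  # 33 games left

# One simulated rest-of-season: a Poisson draw per remaining game,
# with the rate set to his average goals per game across both samples
example_sim <- rpois(games_remaining, mean(haaland_both))
example_sim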
 [1] 1 0 0 0 3 0 0 2 1 1 0 2 1 0 1 1 0 2 1 1 1 0 1 1 2 2 0 2 0 0 2 1 5

We see a bunch of zeros here, which is realistic - nobody’s going to score every match. There are lots of 1s and 2s, which also fits. There’s a 5 in there, which is... optimistic. But that’s just one run. What if we did it 10,000 times instead? Note we’ll now add his current 9 goals to the total, since we know those already happened and don’t need to simulate them*.

*Look, we’re already simulating the future. Let’s leave the past alone.
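# Preallocate one slot per simulated season
haaland_sim <- numeric(10000)

for (i in seq_along(haaland_sim)) {
  # Simulate the remaining games, then add the goals he has already scored
  haaland_sim[i] <- sum(rpois(games_remaining, mean(haaland_both))) + sum(haaland_pl)
}

# as_tibble() on a bare vector stores it in a column named `value`
poisson_hist <- haaland_sim %>%
  as_tibble() %>%
  ggplot(aes(x = value)) +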
  geom_histogram(binwidth = 1, color = 'white', fill = '#6CABDD')+
  theme_ipsum()+
  xlab('Projected Goals')+
  ylab('Number of Simulations')+
  ggtitle('Haaland Projected Goals Using Poisson Simulation',
          subtitle = '10,000 Simulations')+
  theme(plot.caption = element_text(hjust = 0, face= "italic"),
        plot.title.position = "plot",
        plot.caption.position =  "plot") 

poisson_hist

We get a range going from about 30 to 60, with a median of 44. Still dominant stuff, but we’re already closer to reality than our original 68, which doesn’t even show up as a realistic option here.
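
Because a sum of independent Poisson draws is itself Poisson, with a rate equal to the sum of the individual rates, the simulated quantiles can be cross-checked without simulating anything. A minimal sketch, assuming the same per-game rate and the same 33 remaining games as above (the names lambda_game, lambda_rest and analytic_quantiles are illustrative):

```r
# Goals per game: last season's Bundesliga games, then this season's 5 PL games
haaland_pl <- c(2, 0, 1, 3, 3)
haaland_bl <- c(2, 0, 1, 2, 2, 2, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 0, 3, 0, 1)
haaland_both <- c(haaland_bl, haaland_pl)
games_remaining <- 38 - length(haaland_pl)

lambda_game <- mean(haaland_both)             # per-game scoring rate
lambda_rest <- games_remaining * lambda_game  # rate for the rest of the season combined

# Exact 5%, 50% and 95% quantiles of the season total:
# goals already scored plus a single Poisson draw for the remainder
analytic_quantiles <- sum(haaland_pl) + qpois(c(.05, .5, .95), lambda_rest)
analytic_quantiles
```

If the simulation is behaving, these should land on or very near the quantiles of the 10,000 simulated seasons.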

We can also simulate by taking his goals per game from this season and last and randomly drawing one score for each game we simulate. Since our sample is pretty small, we’ll put a score back in the bag after we’ve drawn it. So if we reach in and grab a 3, we’ll note it down, then put it back in the bag before we draw our next number. This technique is called bootstrapping, and it lets us get a pretty decent guess using a small amount of data.
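
To make the bag concrete, here is a minimal sketch of the draw itself: a tally of what is in the bag, then one small resample. The resample size of five and the seed are arbitrary choices for illustration only.

```r
# All 29 per-game scores: 24 Bundesliga games followed by 5 Premier League games
haaland_both <- c(2, 0, 1, 2, 2, 2, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 0, 3, 0, 1,
                  2, 0, 1, 3, 3)

# What's in the bag: how many games he finished with 0, 1, 2 or 3 goals
bag <- table(haaland_both)
bag

set.seed(1)  # arbitrary seed, only so this illustration is repeatable

# replace = TRUE puts each score back before the next draw,
# so the same game can be picked more than once
one_draw <- sample(haaland_both, size = 5, replace = TRUE)
one_draw
```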

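# Preallocate one slot per simulated season
boot_sim <- numeric(10000)

for (i in seq_along(boot_sim)) {
  # Resample past games with replacement, then add the goals already scored.
  # Note: this draws 30 games, not the 33 in games_remaining; the results
  # reported below come from this setting.
  boot_sim[i] <- sum(sample(haaland_both, size = 30, replace = TRUE)) + sum(haaland_pl)
}

boot_hist <- boot_sim %>%
  as_tibble() %>%
  ggplot(aes(x = value)) +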
  geom_histogram(binwidth = 1, color = 'white', fill = '#6CABDD')+
  theme_ipsum()+
  xlab('Projected Goals')+
  ylab('Number of Simulations')+
  ggtitle('Haaland Projected Goals Using Bootstrap',
          subtitle = '10,000 Simulations')+
  theme(plot.caption = element_text(hjust = 0, face= "italic"),
        plot.title.position = "plot",
        plot.caption.position =  "plot") 

boot_hist

quantile(haaland_sim, c(.05, .5, .95)) %>%
  rbind(quantile(boot_sim, c(.05, .5, .95))) %>%
  as_tibble() %>%
  mutate(n = row_number(),
         type = case_when(n == 1 ~ 'Poisson',
                          TRUE ~ 'Bootstrap')) %>%
  select(type, '5%', '50%', '95%') %>%
  gt() %>%
    tab_style(
    locations = cells_column_labels(columns = everything()),
      style = list(
        cell_text(weight = "bold")
      )
    )
type        5%   50%   95%
Poisson     35    44    54
Bootstrap   32    41    51

Bootstrapping landing on a lower median, demonstrated in table form

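It’s subtle, but the bootstrap has shifted things down a bit from Poisson, with the median projection now at 41. In the table, the 5% to 95% band is 19 goals wide for both methods, so at this resolution the range has moved rather than narrowed. Much of that shift is mechanical: the bootstrap code draws 30 games while the Poisson run simulates all 33 remaining, and at roughly 1.07 goals per game, three fewer games is worth about three goals. Narrower projections may or may not be a good thing depending on the subject, amount of data fed in, etc. The more interesting difference is in what each method thinks a single game can look like. Poisson puts some probability on every goal count from zero upward (remember the 5-goal game it thought about earlier), while bootstrapping just takes what he’s done and assumes he’ll do it a bunch more times, so it never even considers a game above his observed best of 3 goals.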
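
The gap between what the two methods can imagine for a single game is easy to put a number on. A minimal sketch, using the same per-game scores as above (the variable names are illustrative):

```r
# All 29 per-game scores: 24 Bundesliga games followed by 5 Premier League games
haaland_both <- c(2, 0, 1, 2, 2, 2, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 0, 3, 0, 1,
                  2, 0, 1, 3, 3)

lambda_game <- mean(haaland_both)  # per-game rate fed to the Poisson simulation

# Bootstrapping can only ever return a score that is already in the sample
max_bootstrap_game <- max(haaland_both)

# Poisson still gives a small probability to anything above that
p_four_plus <- ppois(3, lambda_game, lower.tail = FALSE)  # P(4 or more goals in a game)
p_five_plus <- ppois(4, lambda_game, lower.tail = FALSE)  # P(5 or more goals in a game)

c(max_bootstrap_game = max_bootstrap_game,
  p_four_plus = p_four_plus,
  p_five_plus = p_five_plus)
```

Small per-game probabilities add up over 33 games and 10,000 seasons, which is why the occasional big game shows up in the Poisson runs but never in the bootstrap.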

We should note that just because the median is 41 doesn’t mean that any final tally other than 41 means the model is wrong. Ninety percent of the simulated seasons land between 32 and 51, so there’s a wide range of plausible outcomes. 41 is simply the midpoint of the simulations, not the only outcome. In a business context, a narrower range would help us make better decisions, though we must be careful not to be more confident about our predictions than is warranted.
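
To see how rarely a season lands exactly on the median, we can count. A rough sketch that reruns the bootstrap in a self-contained way, with the same 30 resampled games as the code above but an arbitrary seed of its own, so the exact shares will differ slightly from the run reported in the table:

```r
haaland_pl <- c(2, 0, 1, 3, 3)
haaland_bl <- c(2, 0, 1, 2, 2, 2, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 0, 3, 0, 1)
haaland_both <- c(haaland_bl, haaland_pl)

set.seed(1)  # arbitrary seed for this illustration, not the one used above

# 10,000 bootstrap seasons: 30 resampled games plus the goals already scored
boot_totals <- replicate(
  10000,
  sum(sample(haaland_both, size = 30, replace = TRUE))
) + sum(haaland_pl)

boot_median <- median(boot_totals)

# Share of seasons that finish exactly on the median
share_on_median <- mean(boot_totals == boot_median)

# Share of seasons inside the 5% to 95% band
band <- quantile(boot_totals, c(.05, .95))
share_in_band <- mean(boot_totals >= band[1] & boot_totals <= band[2])

c(share_on_median = share_on_median, share_in_band = share_in_band)
```

Only a small fraction of seasons should hit the median on the nose, while the large majority fall somewhere inside the band.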

Note that like any model, our assumptions are pretty important. This model doesn’t answer questions like “what if Haaland gets hurt?”, “will the League figure him out?”, or “will Foden remember to pass him the ball?”. It also ignores quality of opponents - as fun as hat tricks are, they’re slightly less impressive against Nottingham Forest than they would be against Arsenal. We could further improve this by comparing to other strikers around the same age and their historical performance, or maybe by doing something with Expected Goals. Nonetheless, the question may wind up being not if he’ll set the record, but by how much he’ll break it.
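
One assumption we can check with the data at hand is the Poisson one. A Poisson variable has a variance equal to its mean, so a sample variance far from the sample mean would be a warning sign. A rough sketch (29 games is a small sample, so treat this as a sanity check rather than a formal test; dispersion_ratio is an illustrative name):

```r
# All 29 per-game scores: 24 Bundesliga games followed by 5 Premier League games
haaland_both <- c(2, 0, 1, 2, 2, 2, 1, 1, 0, 2, 0, 0, 2, 1, 0, 0, 0, 0, 0, 2, 0, 3, 0, 1,
                  2, 0, 1, 3, 3)

# For Poisson data these two numbers should be close
c(mean = mean(haaland_both), variance = var(haaland_both))

# Ratio near 1 is consistent with Poisson; well above 1 suggests
# more game-to-game swing than Poisson expects
dispersion_ratio <- var(haaland_both) / mean(haaland_both)
dispersion_ratio
```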