Haaland Goal Forecast Update

statistics
r
soccer
Author

Mark Jurries II

Published

November 14, 2022

With the Premier League not playing any more games until the end of the year due to the World Cup, now seems a natural time to check in on our forecasts of how many goals Erling Haaland will score. I wanted to see how the number changes with each matchweek, so I set the code to run a forecast after every game, adding his goal tally from that game to the pool of data used for the sims.

Conceptually, this means that the deeper we get into the season, the more accurate the forecasts should get, since they’re drawing from increasingly larger samples. We could use the whole data set regardless of time, but that would be cheating since the future hadn’t happened yet*. Since our forecast is a combination of future projections and actual totals, it leans more and more on real results (18 goals in 13 appearances!) that aren’t subject to change, so it will become more stable as we draw to the close of the season next spring.

*Yes, the infamous past future theoretical tense.
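
To make the "forecast after every game" idea concrete, here is a minimal sketch in R. The goal tallies are made-up toy numbers, not Haaland's match log, and `forecast_total()` is a hypothetical helper written for illustration rather than the function behind the chart. After game `k` it only resamples from games 1 through `k`, then adds the simulated remainder of the season to the goals already banked.

```r
set.seed(42)

# Toy per-game goal tallies (illustrative only, not real match data)
toy_goals <- c(2, 0, 1, 3, 3, 1, 0, 1, 2, 0)
season_length <- 38

# Hypothetical helper: forecast a season total after `k` games,
# resampling only from the games that have already been played
forecast_total <- function(goals, k, season_length, n_sims = 2000) {
  pool <- goals[seq_len(k)]          # data available after game k
  scored <- sum(pool)                # locked-in goals, not subject to change
  games_left <- season_length - k
  sims <- replicate(n_sims, {
    # resample by index so a one-game pool is handled correctly
    idx <- sample.int(length(pool), size = games_left, replace = TRUE)
    scored + sum(pool[idx])
  })
  quantile(sims, c(0.05, 0.5, 0.95))
}

# One row per game played: 5%, 50% and 95% quantiles of the forecast
forecasts <- t(sapply(seq_along(toy_goals), function(k) {
  forecast_total(toy_goals, k, season_length)
}))
forecasts
```

Each row only uses information that existed at that point in the season, and the banked goals make up a growing share of every simulated total as `k` increases.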

Enough jibber-jabber, on to the fun stuff!

library(hrbrthemes)
library(ggridges)
library(plotly)
library(tidyverse)
library(worldfootballR)

haaland_summary_2022 <- fb_player_match_logs("https://fbref.com/en/players/1f44ac21/Erling-Haaland", season_end_year = 2022, stat_type = 'summary')
haaland_summary_2023 <- fb_player_match_logs("https://fbref.com/en/players/1f44ac21/Erling-Haaland", season_end_year = 2023, stat_type = 'summary')

haaland_game_logs <- haaland_summary_2022 %>%
  rbind(haaland_summary_2023) %>%
  as_tibble()

injury_days <- tibble(Date = as.Date('2022-10-29'),
                      Comp = 'Premier League',
                      Round = 'Matchweek 14',
                      Opponent = 'Leicester City',
                      Gls = NA,
                      xG_Expected = NA)

haaland_prepped <- haaland_game_logs %>%
  filter(Comp %in% c('Bundesliga', 'Premier League')) %>%
  rename(Gls = Gls_Performance) %>%
  select(Date, Comp, Round, Opponent, Gls, xG_Expected) %>%
  mutate(Date = as.Date(Date)) %>%
  rbind(injury_days) %>%
  arrange(Date) %>%
  mutate(rownum = row_number(),
         running_Gls_avg = cummean(ifelse(is.na(Gls), 0, Gls)))
  
haaland_gls_vctrs <- NULL

for(i in 1:max(haaland_prepped$rownum)){
  
  tmp_vector <- haaland_prepped %>%
    filter(rownum <= i & is.na(Gls) == FALSE) %>%
    select(Gls) %>%
    as.vector()
  
  tmp_tibble <- i %>%
    as_tibble %>%
    mutate(gls_vector = tmp_vector) %>%
    rename(rownum = value)
  
  haaland_gls_vctrs <- haaland_gls_vctrs %>%
    rbind(tmp_tibble)
}

haaland_prepped_final <- haaland_prepped %>%
  group_by(Comp, rownum, Gls) %>%
  nest() %>%
  inner_join(haaland_gls_vctrs, by = 'rownum') %>%
  filter(Comp == 'Premier League') %>%
  ungroup() %>%
  mutate(total_gls = cumsum(ifelse(is.na(Gls), 0, Gls)))

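# Bootstrap forecast: resample past per-game goal tallies for the remaining games
mc_sim <- function(goal_vector, n_games, goals_scored){
  
  sample_sim <- numeric(10000)
  
  for(i in 1:10000){
    # Resample by index; sample(x, ...) draws from 1:x when x is a single number
    idx <- sample.int(length(goal_vector), size = n_games, replace = TRUE)
    sample_sim[i] <- sum(goal_vector[idx]) + goals_scored
  }
  
  sample_sim
}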

# Poisson forecast: draw each remaining game from a Poisson with the observed mean
poisson_sim <- function(goal_vector, n_games, goals_scored){
  p_sim <- numeric(10000)
  
  for(i in 1:10000){
    sim <- sum(rpois(n_games, mean(goal_vector))) + goals_scored
    p_sim[i] <- sim  
  }
  
  p_sim
}

haaland_simmed <- haaland_prepped_final %>%
  mutate(games_left = 38 - row_number()) %>%
  #rowwise() %>%
  mutate(simmed_data = pmap(list(gls_vector, games_left, total_gls), mc_sim)) %>%
  mutate(p_simmed_data = pmap(list(gls_vector, games_left, total_gls), poisson_sim)) %>%
  unnest(data) %>%
  rowwise() %>%
  mutate(median_sim = median(simmed_data),
         lower_025 = quantile(simmed_data, .025),
         lower_05 = quantile(simmed_data, .05),
         upper_95 = quantile(simmed_data, .95),
         upper_975 = quantile(simmed_data, .975)) %>%
  mutate(Matchweek = gsub('Matchweek ', '', Round)) %>%
  ungroup() %>%
  mutate(Matchweek = fct_reorder(Matchweek, -rownum))

haaland_sim_trend_chart <- haaland_simmed %>%
  ggplot(aes(x = Date, 
             y = median_sim,
             text = paste('Date:', Date, 
                          "<br>Goals:", Gls,
                          "<br>Goals to Date:", total_gls,
                          "<br>Forecasted Total:", median_sim,
                          "<br><br>90% Range:", lower_05, "to", upper_95,
                          "<br>95% Range:", lower_025, "to", upper_975)
             ))+
  geom_ribbon(aes(ymin = lower_025, ymax = upper_975), 
              fill = 'grey80',
              alpha = 0.5,
              group = 1)+
  geom_ribbon(aes(ymin = lower_05, ymax = upper_95), 
              fill = 'grey70', 
              alpha = 0.5,
              group = 1)+
  geom_line(group = 1, color = '#6CABDD')+
  geom_hline(yintercept = 32, linetype = 'dashed')+
  annotate('text', x = as.Date('2022-08-14'), y = 32, label = "Salah's Record", vjust = 6, size = 3)+
  ggtitle('Erling Haaland 2022 Projected Goal Total by Week')+
  theme_ipsum()

ggplotly(haaland_sim_trend_chart, tooltip = 'text') %>% 
  layout(hoverlabel = list(bgcolor = "#6CABDD"))

We started the season with a pretty bullish forecast of 37, and it shot up to 44 after back-to-back hat tricks at the end of August. It’s been hovering around 46 since his hat trick against United on October 2. Interestingly, it didn’t budge when he missed a game with a foot injury on Oct. 29*. Three hat tricks made the model slightly optimistic, though a sluggish past few weeks have corrected that some.

*I encoded this as an NA, which doesn’t budge our forecasts. I used NA instead of a zero since zero indicates the presence of an absence, while NA is the absence of a presence.
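
A small sketch of why the NA-versus-zero choice matters for the pool that future games are resampled from. The tallies are toy values, not the real match log; the point is only that dropping an NA leaves the per-game scoring rate alone, while a zero would count as a game he played and failed to score in.

```r
# Toy goal tallies with one missed game recorded as NA (illustrative values)
goals_with_na <- c(2, 1, NA, 3)

# Dropping the NA keeps the missed game out of the sampling pool
pool_na <- goals_with_na[!is.na(goals_with_na)]

# Recording the missed game as 0 adds a scoreless game to the pool
pool_zero <- ifelse(is.na(goals_with_na), 0, goals_with_na)

mean(pool_na)    # 2 goals per game
mean(pool_zero)  # 1.5 goals per game
```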

What’s slightly insane is that even the low end of the forecast has him beating the PL record by a few goals. This assumes he stays healthy and the league doesn’t figure out a way to slow him down, of course. There are still a lot of matches to be played, but so far, he’s exceeded an already bullish forecast.
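
One way to put a number on "even the low end beats the record" is to ask what share of simulated seasons finishes above 32 goals. The sketch below is a rough Poisson version that uses only the 18 goals in 13 appearances quoted earlier as the scoring rate. The 24 remaining games is an assumption (a 38-game season, with him playing every match left), and because it ignores the earlier seasons in the sampling pool, its numbers will not match the chart.

```r
set.seed(2022)

goals_so_far <- 18   # from the post: 18 goals in 13 appearances
appearances <- 13
games_left <- 24     # assumption: he plays every remaining league match
record <- 32         # the record line drawn on the chart

# Per-game scoring rate from this season only (a simplifying assumption)
rate <- goals_so_far / appearances

# The sum of independent Poisson games is itself Poisson,
# so the rest of the season can be drawn in a single call
sim_totals <- goals_so_far + rpois(10000, lambda = rate * games_left)

# Share of simulated seasons that finish above the record
prob_record <- mean(sim_totals > record)

quantile(sim_totals, c(0.025, 0.5, 0.975))
prob_record
```

The same `mean(sims > 32)` check could be applied to any vector of simulated season totals, including bootstrap-style ones.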