Upset Wins in the Beautiful Game

Categories: statistics, r, soccer
Author: Mark Jurries II
Published: January 5, 2023

We all love a good underdog win. Be it the Orioles beating the Yankees*, the Lions beating whoever they’re playing that week, or Croatia beating Brazil, we love it when the team nobody thinks will win pulls it off.

*Or anybody beating the Yankees, let’s be honest.

But how often does this happen? We'd need not only the outcome of the game itself, but also a way to measure just how unlikely the win was in the first place. Thankfully, 538 makes their soccer* match-level predictions available on GitHub (details about the data here), so we can see what odds their model gave each team before the match and use that as a proxy for how strongly a win was expected. Let's look at an example game.

*I was going to do this with baseball, and was thinking through how to go about calcing who the favorite was when I stumbled on this. Sometimes you just go with what’s there.
Show the code
library(hrbrthemes)
library(gt)
library(plotly)
library(tidyverse)
library(zoo)

soccer_data <- read.csv('spi_matches.csv') %>%
  mutate(game_id = paste(date, team1, 'vs', team2))

soccer_home <- soccer_data %>%
  filter(league == "Barclays Premier League") %>%
  select(game_id, season, date, league, team = team1, spi = spi1, 
         win_prob = prob1, lose_prob = prob2, tie_prob = probtie,
         goals_for = score1, goals_against = score2) %>%
  mutate(location = 'home')

soccer_away <- soccer_data %>%
  filter(league == "Barclays Premier League") %>%
  select(game_id, season, date, league, team = team2, spi = spi2, 
         win_prob = prob2, lose_prob = prob1, tie_prob = probtie,
         goals_for = score2, goals_against = score1) %>%
  mutate(location = 'away')

soccer_tall <- soccer_home %>%
  bind_rows(soccer_away) %>%
  mutate(W = case_when(goals_for > goals_against ~ 1, TRUE ~ 0),
         D = case_when(goals_for == goals_against ~ 1, TRUE ~ 0),
         L = case_when(goals_for < goals_against ~ 1, TRUE ~ 0))

soccer_tall %>%
  arrange(desc(win_prob)) %>%
  filter(game_id == '2019-04-03 Manchester City vs Cardiff City') %>%
  select(date, team, location, win_prob, lose_prob, tie_prob, goals_for, goals_against, W) %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
      )
    )
| date | team | location | win_prob | lose_prob | tie_prob | goals_for | goals_against | W |
|------|------|----------|----------|-----------|----------|-----------|---------------|---|
| 2019-04-03 | Manchester City | home | 0.9389 | 0.0080 | 0.0531 | 2 | 0 | 1 |
| 2019-04-03 | Cardiff City | away | 0.0080 | 0.9389 | 0.0531 | 0 | 2 | 0 |

I picked the game with the highest win probability in the data (which only goes back to 2016/17, but that's enough for our purposes), which was Man City vs. Cardiff City back in 2019. The 538 model gave The Citizens a 93.89% chance of winning the game, which they did, 2-0. But c'mon, that's the big boys winning - let's find the biggest upset instead!
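
If you're curious how I'd hunt these down, here's a rough sketch (assuming the soccer_tall frame built above; slice_max and slice_min are dplyr helpers, and the select is just for readability): grab the row with the highest pre-game win probability, then grab the win with the lowest one.

# Most lopsided pre-game odds in the data
soccer_tall %>%
  slice_max(win_prob, n = 1) %>%
  select(game_id, team, win_prob, W)

# Biggest upset: among all wins, the lowest pre-game win probability
soccer_tall %>%
  filter(W == 1) %>%
  slice_min(win_prob, n = 1) %>%
  select(game_id, team, win_prob, W)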

Show the code
soccer_tall %>%
  filter(game_id == '2021-04-03 Chelsea vs West Bromwich Albion') %>%
  select(date, team, location, win_prob, lose_prob, tie_prob, goals_for, goals_against, W) %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
      )
    )
| date | team | location | win_prob | lose_prob | tie_prob | goals_for | goals_against | W |
|------|------|----------|----------|-----------|----------|-----------|---------------|---|
| 2021-04-03 | Chelsea | home | 0.8659 | 0.0198 | 0.1143 | 2 | 5 | 0 |
| 2021-04-03 | West Bromwich Albion | away | 0.0198 | 0.8659 | 0.1143 | 5 | 2 | 1 |

Oh, Chelsea. You had an 87% chance of winning and 11% of tying, whereas lonely West Bromwich Albion - who would go on to be relegated that season - was only given a 2% chance of winning*. Yet they beat mighty Chelsea 5-2 in the textbook definition of an upset.

*soyouresayingtheresachance.gif

OK, so we've got our bearings and have a pretty good idea how this works. Let's take it to the next step: what percent of games do teams with a 5% chance of winning actually win? Should we make like Han Solo and insist we're never told the odds, or is there something there?

Show the code
win_hist <- soccer_tall %>%
  filter(season < 2022) %>%
  # bin pre-game win probability into 5-point buckets
  mutate(win_prob_bin = cut(win_prob, seq(0, 1, by = 0.05))) %>%
  group_by(win_prob_bin) %>%
  summarise(win_percent = mean(W)) %>%
  ggplot(aes(x = win_prob_bin, y = win_percent))+
  geom_bar(stat = 'identity', fill = '#055C9D')+
  theme_ipsum()+
  scale_y_percent()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
  labs(title = 'PL Win % by Pre-Game Win Odds',
       subtitle = '2016/17 to 2021/22',
       caption = 'Data source: 538')

ggplotly(win_hist)

Well, that's good news for the 538 model - if a team has a win probability between 0% and 5%, they win about 3% of their games, which seems reasonable. There are some odd bits (teams given between 70% and 75% odds only win 65% of their matches), but those are likely just random noise.
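
If we wanted more than a shrug at that 70%-75% blip, one quick check (a sketch, again leaning on the soccer_tall frame from above) is a binomial test of the observed wins in that bin against the model's average stated probability:

# Games landing in the 70%-75% bin (matching cut()'s right-closed bins),
# and how many of them ended in a win
bin_check <- soccer_tall %>%
  filter(season < 2022, win_prob > 0.70, win_prob <= 0.75) %>%
  summarise(games = n(), wins = sum(W), avg_prob = mean(win_prob))

# Is that win count consistent with the model's average probability for the bin?
binom.test(bin_check$wins, bin_check$games, p = bin_check$avg_prob)

A p-value comfortably above 0.05 would back up the "just random noise" read.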

The overall story is that if a team is a favorite to win, they probably will. And that’s a good thing - the rarity of upsets makes them unique and memorable. As West Brom reminded us, even a mere 2% chance is still a chance.
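
And if you want one number to back that up, a closing sketch (same soccer_tall frame as above) is to keep the pre-game favorite in each match and see how often they actually won:

# For each match, keep the team with the higher pre-game win probability,
# then check how often that favorite actually won (draws count against them)
soccer_tall %>%
  filter(season < 2022) %>%
  group_by(game_id) %>%
  slice_max(win_prob, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  summarise(favorite_win_rate = mean(W))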