Forecasting the MLB Postseason

probability
r
baseball
Author

Mark Jurries II

Published

October 5, 2022

The MLB postseason is upon us, and we all want to make our best guess who will win it all. We could just go to Fangraph’s Playoff Odds and see who they’re picking - and their model is good - but it’s so much more fun to build our own model.

Before we begin, we need to make sure we know what it is we want to do. These are fairly basic requirements, but having them laid out before we start a project is always a good practice. What we want is a script that will:

To simulate a game, we’ll use Bill James’ Log5 method. It’s perhaps a bit simple for this - it doesn’t factor in home field advantage, starting pitcher, etc., but our task is already pretty big so that’s out of scope. We’ll test it on a team with a .615 win percent facing one with a .550.

Show the code
library(gt)
library(tidyverse)

set.seed(9262)

james_log5 <- function(team_a_win_prob, team_b_win_prob){
  
    (team_a_win_prob - (team_a_win_prob * team_b_win_prob)) / 
    (team_a_win_prob + team_b_win_prob - 2 * team_a_win_prob * team_b_win_prob)
  
}

james_log5(.615, .550)
[1] 0.5665302

OK, so the .615 team has a .566 chance of beating the .550 team. Since we’ll be using this number to check the odds of a winning a single game (using R’s rbinom function), we want to make sure it’s not biased. So let’s run the function the other way to make sure it doesn’t throw an odd result.

Show the code
james_log5(.550, .615)
[1] 0.4334698

Excellent. We get .433, which we can add to .566 to get 1. So the order we put the teams in shouldn’t bias the model. Next, we can use our new function inside another function to simulate a series. We’ll run it once to make sure it works.

Show the code
sim_series <- function(team_a_name, 
                       team_a_win_prob, 
                       team_b_name, 
                       team_b_win_prob, 
                       series_length){
  
  team_a_prob <- james_log5(team_a_win_prob, team_b_win_prob)
  tmp_tracker <- tibble(team = c(team_a_name, team_b_name),
                        W = c(0, 0),
                        L = c(0, 0),
                        won_series = c(0,0),
                        win_perc = c(team_a_win_prob, team_b_win_prob))
  
  for (i in 1:series_length){
    if(tmp_tracker[1,]$won_series == 1 | tmp_tracker[2,]$won_series == 1){
      break
    }else{
    if(rbinom(1,1,team_a_prob) == 1){
      tmp_tracker[1,]$W = tmp_tracker[1,]$W + 1
      tmp_tracker[2,]$L = tmp_tracker[2,]$L + 1
      if(tmp_tracker[1,]$W > series_length/2){
        tmp_tracker[1,]$won_series = 1
      }else{
        tmp_tracker[1,]$won_series = 0
      }
    }else{
      tmp_tracker[2,]$W = tmp_tracker[2,]$W + 1
      tmp_tracker[1,]$L = tmp_tracker[1,]$L + 1
      if(tmp_tracker[2,]$W > series_length/2){
        tmp_tracker[2,]$won_series = 1
      }else{
        tmp_tracker[2,]$won_series = 0
      }
    }
    }
    }
  tmp_tracker
}

sim_series(team_a_name = 'A', team_a_win_prob = .615, team_b_name = 'B', team_b_win_prob = .550, series_length = 7) 
# A tibble: 2 × 5
  team      W     L won_series win_perc
  <chr> <dbl> <dbl>      <dbl>    <dbl>
1 A         1     4          0    0.615
2 B         4     1          1    0.55 

So team A was better during the season but still lost in 5. That’s baseball for ya. Note we brought back two rows, one for each team - this way, if we want, we can reference either team later. We also added a flag for whether or not they won the series, which will be useful when we want to count those up later.

We need to know who’s playing, which we’ll enter manually. Normally we’d automate this, but since this is a one-time deal it’s faster this way and we aren’t losing anything. We’re going to use a variable to control our total number of simulations, since we need to create a tibble for each round and the code gets long. This makes it easier to test with a low number early on and change later. We’ll set it to 20,000.

*The golden rule of analysis - automate by default or if it makes testing easier, but don’t waste time automating if you’re not going back.
Show the code
al_teams <- tibble(seed = 1, team = 'HOU', win_perc = .654) %>%
  add_row(seed = 2, team = 'NYY', win_perc = .611) %>%
  add_row(seed = 3, team = 'CLE', win_perc = .568) %>%
  add_row(seed = 4, team = 'TOR', win_perc = .568) %>%
  add_row(seed = 5, team = 'SEA', win_perc = .556) %>%
  add_row(seed = 6, team = 'TBR', win_perc = .531)

nl_teams <- tibble(seed = 1, team = 'LAD', win_perc = .685) %>%
  add_row(seed = 2, team = 'ATL', win_perc = .623) %>%
  add_row(seed = 3, team = 'STL', win_perc = .574) %>%
  add_row(seed = 4, team = 'NYM', win_perc = .623) %>%
  add_row(seed = 5, team = 'SDP', win_perc = .549) %>%
  add_row(seed = 6, team = 'PHI', win_perc = .537)

nsims <- 20000

Now “all” we need to do is simulate 4 Wildcard Rounds, 4 Division Series, 2 Championship Series and one World Series. There’s likely a more elegant way to doing this, but for what it is it’s OK. Let’s run a bunch of sims and see what the first one looks like.

Show the code
#Run AL wildcard through CS
al_wc_1 <- al_teams[6,] %>%
  cbind(al_teams[3,]) %>%
  rename(seed_a = 1, team_a = 2, win_perc_a = 3, seed_b = 4, team_b = 5, win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                              team_a_win_prob = .x$win_perc_a,
                              team_b_name = .x$team_b,
                              team_b_win_prob = .x$win_perc_b,
                              series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'Wildcard')

al_wc_2 <- al_teams[5,] %>%
  cbind(al_teams[4,]) %>%
  rename(seed_a = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         seed_b = 4, 
         team_b = 5, 
         win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'Wildcard')

#al_ds
al_ds_1 <- al_wc_1 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(al_teams[2,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'DS')

al_ds_2 <- al_wc_2 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(al_teams[1,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'DS')

#alcs
al_cs <- al_ds_1 %>%
  filter(won_series == 1) %>%
  inner_join(al_ds_2, by = c('sim_id', 'won_series')) %>%
  select(sim_id,
         team_a = team.x,
         win_perc_a = win_perc.x,
         team_b = team.y,
         win_perc_b = win_perc.y) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 7))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'CS')

#Run NL wildcard through CS
nl_wc_1 <- nl_teams[6,] %>%
  cbind(nl_teams[3,]) %>%
  rename(seed_a = 1, team_a = 2, win_perc_a = 3, seed_b = 4, team_b = 5, win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'Wildcard')

nl_wc_2 <- nl_teams[5,] %>%
  cbind(nl_teams[4,]) %>%
  rename(seed_a = 1, team_a = 2, win_perc_a = 3, seed_b = 4, team_b = 5, win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'Wildcard')

#nl_ds
nl_ds_1 <- nl_wc_1 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(nl_teams[2,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'DS')

nl_ds_2 <- nl_wc_2 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(nl_teams[1,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'DS')

#nlcs
nl_cs <- nl_ds_1 %>%
  filter(won_series == 1) %>%
  inner_join(nl_ds_2, by = c('sim_id', 'won_series')) %>%
  select(sim_id,
         team_a = team.x,
         win_perc_a = win_perc.x,
         team_b = team.y,
         win_perc_b = win_perc.y) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 7))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'CS')

#world series time
ws <- al_cs %>%
  filter(won_series == 1) %>%
  inner_join(nl_cs, by = c('sim_id', 'won_series')) %>%
  select(sim_id,
         team_a = team.x,
         win_perc_a = win_perc.x,
         team_b = team.y,
         win_perc_b = win_perc.y) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 7))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = '', round = 'WS')

playoffs_all <- al_wc_1 %>%
  rbind(nl_wc_1) %>%
  rbind(al_wc_2) %>%
  rbind(nl_wc_2) %>%
  rbind(al_ds_1) %>%
  rbind(nl_ds_1) %>%
  rbind(al_ds_2) %>%
  rbind(nl_ds_2) %>%
  rbind(al_cs) %>%
  rbind(nl_cs) %>%
  rbind(ws)
Show the code
playoffs_all %>%
  filter(sim_id == 1 & won_series == 1) %>%
  ungroup() %>%
  select(-won_series, -win_perc) %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  )
sim_id team W L league round
1 TBR 2 0 AL Wildcard
1 PHI 2 1 NL Wildcard
1 SEA 2 1 AL Wildcard
1 NYM 2 1 NL Wildcard
1 NYY 3 1 AL DS
1 ATL 3 1 NL DS
1 SEA 3 1 AL DS
1 LAD 3 1 NL DS
1 SEA 4 3 AL CS
1 ATL 4 1 NL CS
1 ATL 4 3 WS

Tampa Bay and Seattle won the AL Wildcards, while the Phillies and the Mets won NL. The Yankees and Mariners proceeded to win the Division Series, while the Braves and Dodgers advance in the NL. The Mariners and the Braves then won their Championship Series and went on to the World Series, where Atlanta won it all. There are 19,999 more of these, we’ll spare the commentary and show a summary instead.

Show the code
playoffs_all %>%
  group_by(team, round) %>%
  summarise(series_won = sum(won_series)) %>%
  mutate(series_perc = series_won / nsims) %>%
  pivot_wider(id_cols = team, names_from = round, values_from = series_perc) %>%
  select(team, Wildcard, DS, CS, WS) %>%
  arrange(desc(WS)) %>%
  ungroup() %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  ) %>%
  fmt_percent(
    columns = 2:5,
    decimals = 1
  )
team Wildcard DS CS WS
LAD NA 67.0% 46.3% 30.4%
HOU NA 67.5% 43.4% 22.7%
ATL NA 62.1% 26.2% 13.8%
NYY NA 61.6% 28.6% 11.9%
NYM 61.3% 23.1% 12.7% 6.6%
STL 55.6% 22.5% 7.2% 3.0%
CLE 54.8% 23.1% 8.9% 3.0%
TOR 51.6% 17.1% 7.8% 2.7%
SEA 48.4% 15.4% 6.7% 2.1%
SDP 38.7% 9.9% 3.9% 1.4%
TBR 45.2% 15.3% 4.7% 1.3%
PHI 44.4% 15.4% 3.8% 1.1%

Well, if you’re a Dodgers fan, this is the model you want. In 30% of our simulations they won it all. The Phillies won 1.1% of simulations, so if they actually win we can say they did so as underdogs. This model is more bullish on the top teams and bearish on the lower compared to Fangraphs. Their model is much more sophisticated - they’re well beyond the Log5 method - so I’d give it a lot more weight.

Remember, the top two teams in each league get a bye in first round. Our sim still had them play 20,000 rounds, though, so them having the best World Series odds is a function of them being the best.

One often has regrets after finishing a project like this, and in this case, I built something that can’t really be updated with actual information. So if the Mariners win the wildcard, for instance, I can’t take the Blue Jays out of the remaining calcs. Or if the Braves are up 2-1 in the DS, I can’t let this model know that since it starts every series from scratch. Live and learn. Regardless, this gives a decent idea of what to expect. Hopefully the excitement of the postseason exceeds that of reading a bunch of probability tables.

Bonus 1: Just for kicks - how many teams made it through with no loses?

Show the code
playoffs_all %>%
  group_by(team, sim_id) %>%
  filter(won_series == 1) %>%
  mutate(won_ws = max(case_when(round == 'WS' ~ won_series, TRUE ~ 0))) %>%
  filter(won_ws == 1) %>%
  summarise(W = sum(W),
            L = sum(L)) %>%
  mutate(sweep = case_when(L == 0 ~ 1, TRUE ~ 0)) %>%
  group_by(team) %>%
  summarise(world_series_wins = n(),
            sweeps = sum(sweep)) %>%
  arrange(desc(sweeps)) %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  ) %>%
  fmt_number(
    columns = 2,
    decimals = 0,
    use_seps = TRUE
  )
team world_series_wins sweeps
LAD 6,076 66
HOU 4,537 35
ATL 2,760 21
NYY 2,385 7
NYM 1,326 1
STL 608 1
CLE 594 0
PHI 230 0
SDP 277 0
SEA 414 0
TBR 260 0
TOR 533 0

Bonus 2: here are all of the World Series permutations. 21% of sims had an Astros-Dodgers series, with Houston winning 40% of sims and the Dodgers 60%.

Show the code
ws %>%
  group_by(sim_id) %>%
  mutate(NL_Team = lead(team)) %>%
  mutate(NL_Wins = case_when(won_series == 0 ~ 1, TRUE ~ 0)) %>%
  rename(AL_Team = team,
         AL_Wins = won_series) %>%
  group_by(AL_Team, NL_Team) %>%
  filter(!is.na(NL_Team)) %>%
  summarise(Series = n(),
            AL_Wins = sum(AL_Wins),
            NL_Wins = sum(NL_Wins)) %>%
  mutate(AL_Win_Perc = AL_Wins / Series,
         NL_Win_Perc = NL_Wins / Series,
         Perc_of_Sims = Series/nsims)%>%
  arrange(desc(Series)) %>%
  ungroup() %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  ) %>%
  fmt_number(
    columns = 3:5,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_percent(
    columns = 6:8,
    decimals = 1
  )
AL_Team NL_Team Series AL_Wins NL_Wins AL_Win_Perc NL_Win_Perc Perc_of_Sims
HOU LAD 4,032 1,704 2,328 42.3% 57.7% 20.2%
NYY LAD 2,615 865 1,750 33.1% 66.9% 13.1%
HOU ATL 2,241 1,270 971 56.7% 43.3% 11.2%
NYY ATL 1,558 702 856 45.1% 54.9% 7.8%
HOU NYM 1,091 618 473 56.6% 43.4% 5.5%
CLE LAD 849 218 631 25.7% 74.3% 4.2%
TOR LAD 722 159 563 22.0% 78.0% 3.6%
NYY NYM 720 320 400 44.4% 55.6% 3.6%
SEA LAD 633 153 480 24.2% 75.8% 3.2%
HOU STL 625 428 197 68.5% 31.5% 3.1%
CLE ATL 444 158 286 35.6% 64.4% 2.2%
TOR ATL 410 163 247 39.8% 60.2% 2.0%
TBR LAD 407 83 324 20.4% 79.6% 2.0%
NYY STL 392 211 181 53.8% 46.2% 2.0%
HOU SDP 359 261 98 72.7% 27.3% 1.8%
SEA ATL 335 109 226 32.5% 67.5% 1.7%
HOU PHI 328 256 72 78.0% 22.0% 1.6%
TBR ATL 243 69 174 28.4% 71.6% 1.2%
CLE NYM 233 87 146 37.3% 62.7% 1.2%
NYY SDP 221 129 92 58.4% 41.6% 1.1%
NYY PHI 218 158 60 72.5% 27.5% 1.1%
TOR NYM 189 82 107 43.4% 56.6% 0.9%
SEA NYM 180 67 113 37.2% 62.8% 0.9%
TBR NYM 127 40 87 31.5% 68.5% 0.6%
TOR STL 126 63 63 50.0% 50.0% 0.6%
CLE STL 119 61 58 51.3% 48.7% 0.6%
SEA STL 100 42 58 42.0% 58.0% 0.5%
TBR STL 78 27 51 34.6% 65.4% 0.4%
CLE PHI 66 36 30 54.5% 45.5% 0.3%
CLE SDP 63 34 29 54.0% 46.0% 0.3%
TOR PHI 59 37 22 62.7% 37.3% 0.3%
TOR SDP 56 29 27 51.8% 48.2% 0.3%
SEA PHI 44 22 22 50.0% 50.0% 0.2%
TBR PHI 43 19 24 44.2% 55.8% 0.2%
SEA SDP 41 21 20 51.2% 48.8% 0.2%
TBR SDP 33 22 11 66.7% 33.3% 0.2%