Forecasting the MLB Postseason

The MLB postseason is upon us, and we all want to make our best guess who will win it all. We could just go to Fangraph’s Playoff Odds and see who they’re picking - and their model is good - but it’s so much more fun to build our own model.

Before we begin, we need to make sure we know what it is we want to do. These are fairly basic requirements, but having them laid out before we start a project is always a good practice. What we want is a script that will:

Simulate a game against any two teams.
Repeat for a series, stopping if one team has one the majority of available games.
Use a bracket format exactly like the new MLB format.
Replicate the entire process n times, with a consistent ID allowing us to track an entire sim from Wild Card to World Series.

To simulate a game, we’ll use Bill James’ Log5 method. It’s perhaps a bit simple for this - it doesn’t factor in home field advantage, starting pitcher, etc., but our task is already pretty big so that’s out of scope. We’ll test it on a team with a .615 win percent facing one with a .550.

Show the code

library(gt)
library(tidyverse)

set.seed(9262)

james_log5 <- function(team_a_win_prob, team_b_win_prob){
  
    (team_a_win_prob - (team_a_win_prob * team_b_win_prob)) / 
    (team_a_win_prob + team_b_win_prob - 2 * team_a_win_prob * team_b_win_prob)
  
}

james_log5(.615, .550)

[1] 0.5665302

OK, so the .615 team has a .566 chance of beating the .550 team. Since we’ll be using this number to check the odds of a winning a single game (using R’s rbinom function), we want to make sure it’s not biased. So let’s run the function the other way to make sure it doesn’t throw an odd result.

Show the code

james_log5(.550, .615)

[1] 0.4334698

Excellent. We get .433, which we can add to .566 to get 1. So the order we put the teams in shouldn’t bias the model. Next, we can use our new function inside another function to simulate a series. We’ll run it once to make sure it works.

Show the code

sim_series <- function(team_a_name, 
                       team_a_win_prob, 
                       team_b_name, 
                       team_b_win_prob, 
                       series_length){
  
  team_a_prob <- james_log5(team_a_win_prob, team_b_win_prob)
  tmp_tracker <- tibble(team = c(team_a_name, team_b_name),
                        W = c(0, 0),
                        L = c(0, 0),
                        won_series = c(0,0),
                        win_perc = c(team_a_win_prob, team_b_win_prob))
  
  for (i in 1:series_length){
    if(tmp_tracker[1,]$won_series == 1 | tmp_tracker[2,]$won_series == 1){
      break
    }else{
    if(rbinom(1,1,team_a_prob) == 1){
      tmp_tracker[1,]$W = tmp_tracker[1,]$W + 1
      tmp_tracker[2,]$L = tmp_tracker[2,]$L + 1
      if(tmp_tracker[1,]$W > series_length/2){
        tmp_tracker[1,]$won_series = 1
      }else{
        tmp_tracker[1,]$won_series = 0
      }
    }else{
      tmp_tracker[2,]$W = tmp_tracker[2,]$W + 1
      tmp_tracker[1,]$L = tmp_tracker[1,]$L + 1
      if(tmp_tracker[2,]$W > series_length/2){
        tmp_tracker[2,]$won_series = 1
      }else{
        tmp_tracker[2,]$won_series = 0
      }
    }
    }
    }
  tmp_tracker
}

sim_series(team_a_name = 'A', team_a_win_prob = .615, team_b_name = 'B', team_b_win_prob = .550, series_length = 7)

# A tibble: 2 × 5
  team      W     L won_series win_perc
  <chr> <dbl> <dbl>      <dbl>    <dbl>
1 A         1     4          0    0.615
2 B         4     1          1    0.55

So team A was better during the season but still lost in 5. That’s baseball for ya. Note we brought back two rows, one for each team - this way, if we want, we can reference either team later. We also added a flag for whether or not they won the series, which will be useful when we want to count those up later.

We need to know who’s playing, which we’ll enter manually. Normally we’d automate this, but since this is a one-time deal it’s faster this way and we aren’t losing anything. We’re going to use a variable to control our total number of simulations, since we need to create a tibble for each round and the code gets long. This makes it easier to test with a low number early on and change later. We’ll set it to 20,000.

*The golden rule of analysis - automate by default or if it makes testing easier, but don’t waste time automating if you’re not going back.

Show the code

al_teams <- tibble(seed = 1, team = 'HOU', win_perc = .654) %>%
  add_row(seed = 2, team = 'NYY', win_perc = .611) %>%
  add_row(seed = 3, team = 'CLE', win_perc = .568) %>%
  add_row(seed = 4, team = 'TOR', win_perc = .568) %>%
  add_row(seed = 5, team = 'SEA', win_perc = .556) %>%
  add_row(seed = 6, team = 'TBR', win_perc = .531)

nl_teams <- tibble(seed = 1, team = 'LAD', win_perc = .685) %>%
  add_row(seed = 2, team = 'ATL', win_perc = .623) %>%
  add_row(seed = 3, team = 'STL', win_perc = .574) %>%
  add_row(seed = 4, team = 'NYM', win_perc = .623) %>%
  add_row(seed = 5, team = 'SDP', win_perc = .549) %>%
  add_row(seed = 6, team = 'PHI', win_perc = .537)

nsims <- 20000

Now “all” we need to do is simulate 4 Wildcard Rounds, 4 Division Series, 2 Championship Series and one World Series. There’s likely a more elegant way to doing this, but for what it is it’s OK. Let’s run a bunch of sims and see what the first one looks like.

Show the code

#Run AL wildcard through CS
al_wc_1 <- al_teams[6,] %>%
  cbind(al_teams[3,]) %>%
  rename(seed_a = 1, team_a = 2, win_perc_a = 3, seed_b = 4, team_b = 5, win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                              team_a_win_prob = .x$win_perc_a,
                              team_b_name = .x$team_b,
                              team_b_win_prob = .x$win_perc_b,
                              series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'Wildcard')

al_wc_2 <- al_teams[5,] %>%
  cbind(al_teams[4,]) %>%
  rename(seed_a = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         seed_b = 4, 
         team_b = 5, 
         win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'Wildcard')

#al_ds
al_ds_1 <- al_wc_1 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(al_teams[2,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'DS')

al_ds_2 <- al_wc_2 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(al_teams[1,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'DS')

#alcs
al_cs <- al_ds_1 %>%
  filter(won_series == 1) %>%
  inner_join(al_ds_2, by = c('sim_id', 'won_series')) %>%
  select(sim_id,
         team_a = team.x,
         win_perc_a = win_perc.x,
         team_b = team.y,
         win_perc_b = win_perc.y) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 7))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'AL', round = 'CS')

#Run NL wildcard through CS
nl_wc_1 <- nl_teams[6,] %>%
  cbind(nl_teams[3,]) %>%
  rename(seed_a = 1, team_a = 2, win_perc_a = 3, seed_b = 4, team_b = 5, win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'Wildcard')

nl_wc_2 <- nl_teams[5,] %>%
  cbind(nl_teams[4,]) %>%
  rename(seed_a = 1, team_a = 2, win_perc_a = 3, seed_b = 4, team_b = 5, win_perc_b = 6) %>%
  crossing(sim_id = 1:nsims) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 3))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'Wildcard')

#nl_ds
nl_ds_1 <- nl_wc_1 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(nl_teams[2,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'DS')

nl_ds_2 <- nl_wc_2 %>%
  filter(won_series == 1) %>%
  select(sim_id, team, win_perc) %>%
  cbind(nl_teams[1,]) %>%
  select(-seed) %>%
  rename(sim_id = 1, 
         team_a = 2, 
         win_perc_a = 3, 
         team_b = 4, 
         win_perc_b = 5) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 5))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'DS')

#nlcs
nl_cs <- nl_ds_1 %>%
  filter(won_series == 1) %>%
  inner_join(nl_ds_2, by = c('sim_id', 'won_series')) %>%
  select(sim_id,
         team_a = team.x,
         win_perc_a = win_perc.x,
         team_b = team.y,
         win_perc_b = win_perc.y) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 7))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = 'NL', round = 'CS')

#world series time
ws <- al_cs %>%
  filter(won_series == 1) %>%
  inner_join(nl_cs, by = c('sim_id', 'won_series')) %>%
  select(sim_id,
         team_a = team.x,
         win_perc_a = win_perc.x,
         team_b = team.y,
         win_perc_b = win_perc.y) %>%
  group_by(sim_id) %>%
  nest() %>%
  mutate(sim_tmp = map(data, ~sim_series(team_a_name = .x$team_a,
                                         team_a_win_prob = .x$win_perc_a,
                                         team_b_name = .x$team_b,
                                         team_b_win_prob = .x$win_perc_b,
                                         series_length = 7))) %>%
  select(sim_id, sim_tmp) %>%
  unnest() %>%
  mutate(league = '', round = 'WS')

playoffs_all <- al_wc_1 %>%
  rbind(nl_wc_1) %>%
  rbind(al_wc_2) %>%
  rbind(nl_wc_2) %>%
  rbind(al_ds_1) %>%
  rbind(nl_ds_1) %>%
  rbind(al_ds_2) %>%
  rbind(nl_ds_2) %>%
  rbind(al_cs) %>%
  rbind(nl_cs) %>%
  rbind(ws)

Show the code

playoffs_all %>%
  filter(sim_id == 1 & won_series == 1) %>%
  ungroup() %>%
  select(-won_series, -win_perc) %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  )

sim_id	team	W	L	league	round
1	TBR	2	0	AL	Wildcard
1	PHI	2	1	NL	Wildcard
1	SEA	2	1	AL	Wildcard
1	NYM	2	1	NL	Wildcard
1	NYY	3	1	AL	DS
1	ATL	3	1	NL	DS
1	SEA	3	1	AL	DS
1	LAD	3	1	NL	DS
1	SEA	4	3	AL	CS
1	ATL	4	1	NL	CS
1	ATL	4	3		WS

Tampa Bay and Seattle won the AL Wildcards, while the Phillies and the Mets won NL. The Yankees and Mariners proceeded to win the Division Series, while the Braves and Dodgers advance in the NL. The Mariners and the Braves then won their Championship Series and went on to the World Series, where Atlanta won it all. There are 19,999 more of these, we’ll spare the commentary and show a summary instead.

Show the code

playoffs_all %>%
  group_by(team, round) %>%
  summarise(series_won = sum(won_series)) %>%
  mutate(series_perc = series_won / nsims) %>%
  pivot_wider(id_cols = team, names_from = round, values_from = series_perc) %>%
  select(team, Wildcard, DS, CS, WS) %>%
  arrange(desc(WS)) %>%
  ungroup() %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  ) %>%
  fmt_percent(
    columns = 2:5,
    decimals = 1
  )

team	Wildcard	DS	CS	WS
LAD	NA	67.0%	46.3%	30.4%
HOU	NA	67.5%	43.4%	22.7%
ATL	NA	62.1%	26.2%	13.8%
NYY	NA	61.6%	28.6%	11.9%
NYM	61.3%	23.1%	12.7%	6.6%
STL	55.6%	22.5%	7.2%	3.0%
CLE	54.8%	23.1%	8.9%	3.0%
TOR	51.6%	17.1%	7.8%	2.7%
SEA	48.4%	15.4%	6.7%	2.1%
SDP	38.7%	9.9%	3.9%	1.4%
TBR	45.2%	15.3%	4.7%	1.3%
PHI	44.4%	15.4%	3.8%	1.1%

Well, if you’re a Dodgers fan, this is the model you want. In 30% of our simulations they won it all. The Phillies won 1.1% of simulations, so if they actually win we can say they did so as underdogs. This model is more bullish on the top teams and bearish on the lower compared to Fangraphs. Their model is much more sophisticated - they’re well beyond the Log5 method - so I’d give it a lot more weight.

Remember, the top two teams in each league get a bye in first round. Our sim still had them play 20,000 rounds, though, so them having the best World Series odds is a function of them being the best.

One often has regrets after finishing a project like this, and in this case, I built something that can’t really be updated with actual information. So if the Mariners win the wildcard, for instance, I can’t take the Blue Jays out of the remaining calcs. Or if the Braves are up 2-1 in the DS, I can’t let this model know that since it starts every series from scratch. Live and learn. Regardless, this gives a decent idea of what to expect. Hopefully the excitement of the postseason exceeds that of reading a bunch of probability tables.

Bonus 1: Just for kicks - how many teams made it through with no loses?

Show the code

playoffs_all %>%
  group_by(team, sim_id) %>%
  filter(won_series == 1) %>%
  mutate(won_ws = max(case_when(round == 'WS' ~ won_series, TRUE ~ 0))) %>%
  filter(won_ws == 1) %>%
  summarise(W = sum(W),
            L = sum(L)) %>%
  mutate(sweep = case_when(L == 0 ~ 1, TRUE ~ 0)) %>%
  group_by(team) %>%
  summarise(world_series_wins = n(),
            sweeps = sum(sweep)) %>%
  arrange(desc(sweeps)) %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  ) %>%
  fmt_number(
    columns = 2,
    decimals = 0,
    use_seps = TRUE
  )

team	world_series_wins	sweeps
LAD	6,076	66
HOU	4,537	35
ATL	2,760	21
NYY	2,385	7
NYM	1,326	1
STL	608	1
CLE	594	0
PHI	230	0
SDP	277	0
SEA	414	0
TBR	260	0
TOR	533	0

Bonus 2: here are all of the World Series permutations. 21% of sims had an Astros-Dodgers series, with Houston winning 40% of sims and the Dodgers 60%.

Show the code

ws %>%
  group_by(sim_id) %>%
  mutate(NL_Team = lead(team)) %>%
  mutate(NL_Wins = case_when(won_series == 0 ~ 1, TRUE ~ 0)) %>%
  rename(AL_Team = team,
         AL_Wins = won_series) %>%
  group_by(AL_Team, NL_Team) %>%
  filter(!is.na(NL_Team)) %>%
  summarise(Series = n(),
            AL_Wins = sum(AL_Wins),
            NL_Wins = sum(NL_Wins)) %>%
  mutate(AL_Win_Perc = AL_Wins / Series,
         NL_Win_Perc = NL_Wins / Series,
         Perc_of_Sims = Series/nsims)%>%
  arrange(desc(Series)) %>%
  ungroup() %>%
  gt() %>%
  tab_style(
    locations = cells_column_labels(columns = everything()),
    style = list(
      cell_text(weight = "bold")
    )
  ) %>%
  fmt_number(
    columns = 3:5,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  fmt_percent(
    columns = 6:8,
    decimals = 1
  )

AL_Team	NL_Team	Series	AL_Wins	NL_Wins	AL_Win_Perc	NL_Win_Perc	Perc_of_Sims
HOU	LAD	4,032	1,704	2,328	42.3%	57.7%	20.2%
NYY	LAD	2,615	865	1,750	33.1%	66.9%	13.1%
HOU	ATL	2,241	1,270	971	56.7%	43.3%	11.2%
NYY	ATL	1,558	702	856	45.1%	54.9%	7.8%
HOU	NYM	1,091	618	473	56.6%	43.4%	5.5%
CLE	LAD	849	218	631	25.7%	74.3%	4.2%
TOR	LAD	722	159	563	22.0%	78.0%	3.6%
NYY	NYM	720	320	400	44.4%	55.6%	3.6%
SEA	LAD	633	153	480	24.2%	75.8%	3.2%
HOU	STL	625	428	197	68.5%	31.5%	3.1%
CLE	ATL	444	158	286	35.6%	64.4%	2.2%
TOR	ATL	410	163	247	39.8%	60.2%	2.0%
TBR	LAD	407	83	324	20.4%	79.6%	2.0%
NYY	STL	392	211	181	53.8%	46.2%	2.0%
HOU	SDP	359	261	98	72.7%	27.3%	1.8%
SEA	ATL	335	109	226	32.5%	67.5%	1.7%
HOU	PHI	328	256	72	78.0%	22.0%	1.6%
TBR	ATL	243	69	174	28.4%	71.6%	1.2%
CLE	NYM	233	87	146	37.3%	62.7%	1.2%
NYY	SDP	221	129	92	58.4%	41.6%	1.1%
NYY	PHI	218	158	60	72.5%	27.5%	1.1%
TOR	NYM	189	82	107	43.4%	56.6%	0.9%
SEA	NYM	180	67	113	37.2%	62.8%	0.9%
TBR	NYM	127	40	87	31.5%	68.5%	0.6%
TOR	STL	126	63	63	50.0%	50.0%	0.6%
CLE	STL	119	61	58	51.3%	48.7%	0.6%
SEA	STL	100	42	58	42.0%	58.0%	0.5%
TBR	STL	78	27	51	34.6%	65.4%	0.4%
CLE	PHI	66	36	30	54.5%	45.5%	0.3%
CLE	SDP	63	34	29	54.0%	46.0%	0.3%
TOR	PHI	59	37	22	62.7%	37.3%	0.3%
TOR	SDP	56	29	27	51.8%	48.2%	0.3%
SEA	PHI	44	22	22	50.0%	50.0%	0.2%
TBR	PHI	43	19	24	44.2%	55.8%	0.2%
SEA	SDP	41	21	20	51.2%	48.8%	0.2%
TBR	SDP	33	22	11	66.7%	33.3%	0.2%