The MLB postseason is upon us, and we all want to make our best guess at who will win it all. We could just go to FanGraphs’ Playoff Odds and see who they’re picking - and their model is good - but it’s so much more fun to build our own model.
Before we begin, we need to make sure we know what it is we want to do. These are fairly basic requirements, but having them laid out before we start a project is always a good practice. What we want is a script that will:
Simulate a game between any two teams.
Repeat for a series, stopping once one team has won the majority of available games.
Replicate the entire process n times, with a consistent ID allowing us to track an entire sim from Wild Card to World Series.
To simulate a game, we’ll use Bill James’ Log5 method. It’s perhaps a bit simple for this (it doesn’t factor in home field advantage, starting pitchers, etc.), but our task is already pretty big, so that’s out of scope. We’ll test it on a team with a .615 winning percentage facing one with a .550.
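The folded code isn’t reproduced here, so as a minimal sketch: the standard Log5 formula puts the chance that a team with winning percentage a beats a team with winning percentage b at (a - ab) / (a + b - 2ab). The function name `james_log5` is the one used in this post; the argument names are assumptions.

```r
# Log5: probability that team A beats team B, given each team's winning percentage.
# The argument names (wp_a, wp_b) are illustrative assumptions.
james_log5 <- function(wp_a, wp_b) {
  (wp_a - wp_a * wp_b) / (wp_a + wp_b - 2 * wp_a * wp_b)
}

# The .615 team against the .550 team
james_log5(.615, .550)  # about 0.5665
```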
OK, so the .615 team has a .566 chance of beating the .550 team. Since we’ll be using this number to check the odds of winning a single game (using R’s rbinom function), we want to make sure it’s not biased. So let’s run the function the other way to make sure it doesn’t throw an odd result.
james_log5(.550, .615)
[1] 0.4334698
Excellent. We get .433, which we can add to .566 to get 1. So the order we put the teams in shouldn’t bias the model. Next, we can use our new function inside another function to simulate a series. We’ll run it once to make sure it works.
# A tibble: 2 × 5
  team      W     L won_series win_perc
  <chr> <dbl> <dbl>      <dbl>    <dbl>
1 A         1     4          0    0.615
2 B         4     1          1    0.55
So team A was better during the season but still lost in 5. That’s baseball for ya. Note we brought back two rows, one for each team - this way, if we want, we can reference either team later. We also added a flag for whether or not they won the series, which will be useful when we want to count those up later.
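A series simulator along these lines could look like the sketch below. It assumes the Log5 formula for the single-game probability and draws one game at a time with `rbinom`, stopping as soon as a team clinches. The name `sim_series` and its arguments are hypothetical, and it returns a base data frame rather than a tibble so it runs without extra packages.

```r
# Log5 single-game probability (argument names are assumptions)
james_log5 <- function(wp_a, wp_b) {
  (wp_a - wp_a * wp_b) / (wp_a + wp_b - 2 * wp_a * wp_b)
}

# Simulate a best-of-`games` series between two teams.
# sim_series() and its arguments are hypothetical names for this sketch.
sim_series <- function(team_a, team_b, wp_a, wp_b, games = 7) {
  wins_needed <- games %/% 2 + 1
  p_a <- james_log5(wp_a, wp_b)
  w_a <- 0
  w_b <- 0
  # Play one game at a time and stop once either team has clinched
  while (w_a < wins_needed && w_b < wins_needed) {
    a_won <- rbinom(1, size = 1, prob = p_a)
    w_a <- w_a + a_won
    w_b <- w_b + (1 - a_won)
  }
  # One row per team, with a flag for the series winner
  data.frame(
    team = c(team_a, team_b),
    W = c(w_a, w_b),
    L = c(w_b, w_a),
    won_series = as.numeric(c(w_a, w_b) == wins_needed),
    win_perc = c(wp_a, wp_b)
  )
}

set.seed(2022)  # arbitrary seed so the run is repeatable
sim_series("A", "B", .615, .550, games = 7)
```

Passing `games = 3` or `games = 5` would cover the shorter rounds with the same function.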
We need to know who’s playing, which we’ll enter manually. Normally we’d automate this, but since this is a one-time deal it’s faster this way and we aren’t losing anything. We’re going to use a variable to control our total number of simulations, since we need to create a tibble for each round and the code gets long. This makes it easier to test with a low number early on and change later. We’ll set it to 20,000.
*The golden rule of analysis - automate by default or if it makes testing easier, but don’t waste time automating if you’re not going back.
al_teams <- tibble(seed = 1, team = 'HOU', win_perc = .654) %>%
  add_row(seed = 2, team = 'NYY', win_perc = .611) %>%
  add_row(seed = 3, team = 'CLE', win_perc = .568) %>%
  add_row(seed = 4, team = 'TOR', win_perc = .568) %>%
  add_row(seed = 5, team = 'SEA', win_perc = .556) %>%
  add_row(seed = 6, team = 'TBR', win_perc = .531)

nl_teams <- tibble(seed = 1, team = 'LAD', win_perc = .685) %>%
  add_row(seed = 2, team = 'ATL', win_perc = .623) %>%
  add_row(seed = 3, team = 'STL', win_perc = .574) %>%
  add_row(seed = 4, team = 'NYM', win_perc = .623) %>%
  add_row(seed = 5, team = 'SDP', win_perc = .549) %>%
  add_row(seed = 6, team = 'PHI', win_perc = .537)

nsims <- 20000
Now “all” we need to do is simulate 4 Wild Card Series, 4 Division Series, 2 Championship Series and one World Series. There’s likely a more elegant way of doing this, but for what it is it’s OK. Let’s run a bunch of sims and see what the first one looks like.
Tampa Bay and Seattle won the AL Wild Card Series, while the Phillies and the Mets won in the NL. The Yankees and Mariners proceeded to win the Division Series, while the Braves and Dodgers advanced in the NL. The Mariners and the Braves then won their Championship Series and went on to the World Series, where Atlanta won it all. There are 19,999 more of these, so we’ll spare the commentary and show a summary instead.
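To give a feel for how one round can be replicated with a consistent ID, here is a minimal sketch of the AL Wild Card round only (seed 3 vs. 6 and seed 4 vs. 5, best of three). It is not the full script: `series_winner` is a hypothetical helper, the column names are assumptions, and `nsims` is kept small here for testing.

```r
# Log5 single-game probability (argument names are assumptions)
james_log5 <- function(wp_a, wp_b) {
  (wp_a - wp_a * wp_b) / (wp_a + wp_b - 2 * wp_a * wp_b)
}

# Name of the team that wins one best-of-`games` series.
# Hypothetical helper for this sketch.
series_winner <- function(team_a, team_b, wp_a, wp_b, games) {
  wins_needed <- games %/% 2 + 1
  p_a <- james_log5(wp_a, wp_b)
  w_a <- 0
  w_b <- 0
  while (w_a < wins_needed && w_b < wins_needed) {
    a_won <- rbinom(1, size = 1, prob = p_a)
    w_a <- w_a + a_won
    w_b <- w_b + (1 - a_won)
  }
  if (w_a == wins_needed) team_a else team_b
}

nsims <- 1000  # small for testing; the post uses 20,000

set.seed(2022)  # arbitrary seed so the run is repeatable
al_wild_card <- data.frame(
  sim_id = seq_len(nsims),  # one ID per simulated postseason
  wc_3v6 = replicate(nsims, series_winner("CLE", "TBR", .568, .531, games = 3)),
  wc_4v5 = replicate(nsims, series_winner("TOR", "SEA", .568, .556, games = 3))
)

head(al_wild_card, 3)
```

Later rounds would look up each sim’s winners by `sim_id` and feed them into the next series, which is what lets a single sim be followed from Wild Card to World Series.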
Well, if you’re a Dodgers fan, this is the model you want. In 30% of our simulations they won it all. The Phillies won 1.1% of simulations, so if they actually win we can say they did so as underdogs. Compared to FanGraphs, this model is more bullish on the top teams and more bearish on the lower ones. Their model is much more sophisticated - they’re well beyond the Log5 method - so I’d give it a lot more weight.
Remember, the top two teams in each league get a bye in the first round. Our sim still had them play 20,000 rounds, though, so them having the best World Series odds is a function of them being the best.
One often has regrets after finishing a project like this, and in this case, I built something that can’t really be updated with actual information. So if the Mariners win the Wild Card Series, for instance, I can’t take the Blue Jays out of the remaining calculations. Or if the Braves are up 2-1 in the DS, I can’t let this model know that, since it starts every series from scratch. Live and learn. Regardless, this gives a decent idea of what to expect. Hopefully the excitement of the postseason exceeds that of reading a bunch of probability tables.
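For what it’s worth, one possible way around the second limitation is to let the series function accept games already played. This is only a sketch of that idea, not what the script here does: `sim_series_from`, its arguments, and the .537 opponent are all hypothetical.

```r
# Log5 single-game probability (argument names are assumptions)
james_log5 <- function(wp_a, wp_b) {
  (wp_a - wp_a * wp_b) / (wp_a + wp_b - 2 * wp_a * wp_b)
}

# Simulate the rest of a best-of-`games` series from a known score.
# Returns TRUE if team A ends up winning. Hypothetical helper.
sim_series_from <- function(wp_a, wp_b, games, start_w_a = 0, start_w_b = 0) {
  wins_needed <- games %/% 2 + 1
  p_a <- james_log5(wp_a, wp_b)
  w_a <- start_w_a
  w_b <- start_w_b
  # Only the remaining games are simulated
  while (w_a < wins_needed && w_b < wins_needed) {
    a_won <- rbinom(1, size = 1, prob = p_a)
    w_a <- w_a + a_won
    w_b <- w_b + (1 - a_won)
  }
  w_a == wins_needed
}

set.seed(2022)  # arbitrary seed so the run is repeatable
# Share of sims in which a .623 team that is up 2-1 in a best-of-5
# closes out a (hypothetical) .537 opponent
up_2_1 <- mean(replicate(5000, sim_series_from(.623, .537, games = 5,
                                               start_w_a = 2, start_w_b = 1)))
up_2_1
```

Leaving both starting values at 0 reproduces the from-scratch behavior, so a function like this could replace the original series function without changing earlier results.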
Bonus 1: Just for kicks - how many teams made it through with no losses?
Bonus 2: Here are all of the World Series permutations. 21% of sims had an Astros-Dodgers series, with Houston winning 40% of those and the Dodgers 60%.