We all deal with time series every day*. Whether it’s sales, conversion rates, whether we’re skipping dessert like we said we would, or any of the other things we keep track of, we want to know if things are getting better or worse. But how can we keep from overreacting to one drop? Surely if things have an average, it follows that some points will be below it and some above. Do we want a full-scale investigation every time the numbers dip? If not, what should our criteria be?
*You could even measure how many days you deal with time series if you like.
The standard approach, rightly, is to plot your data as a time series. Traditionally, we also add the average and +/- 2 standard deviations to give us an idea of how much the data fluctuates. Let’s do this with stolen bases per game. (Data courtesy FanGraphs.) MLB changed the rules this year, limiting pickoff attempts per plate appearance and making the bases bigger, so we want to see if that changed anything. We’ll look at 1923 - 2023*; 100 seasons seems like enough.
*Typically, we’d compare only completed seasons, or look at season to date. But since we’re using a rate stat and the season is almost done, we’ll let it slide.
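For reference, a chart along those lines can be sketched as below. This is an assumption on my part about how the original chart was built: it presumes the `league_xmr` data frame used later in the post, with `Season` and `sb_per_g` columns, plus the same packages the other chunks use.

```r
library(ggplot2)    # plotting
library(hrbrthemes) # theme_ipsum()

# Assumed: league_xmr holds one row per season,
# with Season and sb_per_g (stolen bases per game) columns
mu    <- mean(league_xmr$sb_per_g)
sigma <- sd(league_xmr$sb_per_g)

ggplot(league_xmr, aes(x = Season, y = sb_per_g)) +
  geom_line() +
  geom_hline(yintercept = mu) +
  geom_hline(yintercept = c(mu - 2 * sigma, mu + 2 * sigma),
             linetype = "dashed") +
  theme_ipsum() +
  labs(title = "Stolen Bases per Game",
       subtitle = "1923 - 2023 (partial)") +
  ylab("")
```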
We can see some interesting trends here - the drop in stolen bases with a low point in the 50s, a gradual climb until a plateau in the late 70s (it’d be interesting to look at this without Rickey or Raines), a drop heading into the 2000s with a brief blip around 2010, and finally the sharp increase this year with the rule changes.
Our reference lines are distinctly unhelpful, though. Only a handful of seasons sit near the average, and everything’s within our 2 sigma bounds. What went wrong here? Essentially, we forgot to check whether our data is normally distributed. Let’s make up for that now by charting the distribution. We’ll include the average as a solid line for reference.
Show the code
mean_line <- max(league_xmr$mean_sb_per_g)

league_xmr %>%
  ggplot(aes(x = sb_per_g)) +
  geom_histogram(aes(y = ..density..), fill = "grey85") +
  geom_density() +
  geom_vline(aes(xintercept = mean_line)) +
  theme_ipsum() +
  labs(title = "Stolen Bases per Game",
       subtitle = "1923 - 2023 (partial)") +
  xlab("")
Now, if this had been normally distributed, we’d have a nice bell shape. Instead, we have two bumps, reflecting the high and low eras we saw when we trended the data, with our overall average sitting in between.
Thankfully, there are other ways to handle this. XmR charts are a great alternative, since they’re more robust to outliers and can also adapt to new means. They use the differences between consecutive data points to set the limits, and come with a variety of rules to identify what’s in and out of control. If you want the nuts and bolts, Stacey Barr has a helpful guide, or read Don Wheeler’s book or Stephen Few’s overview in “Signal”. In our case, we’ll let the xmr package do all the lifting for us, including recalculating the limits when appropriate.
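To make the moving-range idea concrete, here’s a rough, self-contained sketch of the arithmetic behind the limits. The data here is made up, and 2.66 is the standard scaling constant for the X portion of an XmR chart; the package layers the detection and recalculation rules on top of this.

```r
# Made-up rate data, just to illustrate the arithmetic
x <- c(0.55, 0.60, 0.58, 0.62, 0.57, 0.90, 0.95, 0.92)

mR     <- abs(diff(x))              # moving ranges between consecutive points
centre <- mean(x)                   # the central line
upper  <- centre + 2.66 * mean(mR)  # Upper Natural Process Limit
lower  <- centre - 2.66 * mean(mR)  # Lower Natural Process Limit
```

Because the limits are built from point-to-point differences rather than overall spread, a single outlier barely widens them, which is where the robustness comes from.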
Show the code
league_xmr %>%
  ggplot(aes(x = Season)) +
  geom_ribbon(aes(ymin = `Lower Natural Process Limit`,
                  ymax = `Upper Natural Process Limit`),
              fill = "grey85") +
  geom_line(aes(y = sb_per_g)) +
  geom_line(aes(y = `Central Line`)) +
  theme_ipsum() +
  ylab("") +
  labs(title = "Stolen Bases per Game")
Note how much more useful our bounds become - they reflect the era, not the overall picture, so they tell a more accurate story. We can also see just how large an increase 2023 represents.
In any context, we’d want to look at the component parts as well as the overall. For instance, if you’re looking at return rates, your overall might be fine but you may have individual warehouses or product lines that need attention. Each warehouse may have distinct processes that would further warrant reporting on each separately. So let’s break down our stolen base chart by team. To keep things clean, we’ll limit ourselves to 1998 - 2023, since ’98 was the last time expansion teams were added.
Show the code
team_data <- data %>%
  mutate(HR_per_PA = HR / PA,
         runs_game = R / TG,
         sb_per_g  = SB / TG) %>%
  mutate(Team = case_when(Team == 'MON' ~ 'WSN',
                          Team == 'ANA' ~ 'LAA',
                          Team == 'TBD' ~ 'TBR',
                          Team == 'FLA' ~ 'MIA',
                          TRUE ~ Team)) %>%
  group_by(Team) %>%
  filter(Season >= 1998)

team_xmr <- team_data %>%
  group_split(Team) %>%
  map(xmr, measure = "sb_per_g", recalc = TRUE) %>%
  map_df(as_tibble)

team_xmr %>%
  ggplot(aes(x = Season)) +
  geom_ribbon(aes(ymin = `Lower Natural Process Limit`,
                  ymax = `Upper Natural Process Limit`),
              fill = "grey85") +
  geom_line(aes(y = sb_per_g)) +
  geom_line(aes(y = `Central Line`)) +
  theme_ipsum() +
  facet_wrap(Team ~ .) +
  ylab("")
Here, we see that almost all teams have spiked this year. The Rangers, Marlins, and Giants have all dropped, and some teams are basically flat, but most are taking advantage of the rule change. More steals don’t always help, though - the A’s and Royals have both increased, but have lousy records.
This also shows how a bunch of small increases can add up to one big one. Not many teams have changes as dramatic as the league total, but taken together they show a shift in how teams approach baserunning.
One note - most visualization software doesn’t support XmR out of the box. You can hack it in Tableau, and there’s a plugin for PowerBI that does a decent job*. But generally, you’ll be using mean and standard deviation. This can be fine depending on your data, and in most cases you’d be showing the last 60 months, last 20 quarters, etc., where finding in/out of range is a bit simpler. You can also use a moving mean and standard deviation; the number of time periods to use will depend on your data.
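If you go the moving mean and standard deviation route, a minimal base-R sketch might look like the following. The 12-period window and the toy series are assumptions on my part - tune the window to your data (zoo’s `rollmean`/`rollapply` would also do this job).

```r
# Rolling statistic over a trailing window; returns NA until the window fills
roll_stat <- function(x, width, fn) {
  sapply(seq_along(x), function(i) {
    if (i < width) NA_real_ else fn(x[(i - width + 1):i])
  })
}

y <- sin(seq_len(60) / 5)    # stand-in for your monthly metric
m <- roll_stat(y, 12, mean)  # moving mean
s <- roll_stat(y, 12, sd)    # moving standard deviation

# Flag points outside the rolling +/- 2 sigma band
out_of_range <- y > m + 2 * s | y < m - 2 * s
```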
*There, I said something nice about PowerBI.
Also note that in this instance, stolen bases per game going up could be either good or bad depending on any number of factors. Your data should be fairly clear: more profit is good, more complaints are bad, etc.
Ultimately, measuring time series is a bit of an art and a science - knowing your data well enough to know what methods work best, how large a timeframe to include, and how to break it down. But by having some sort of standard, you can show when things are normal and don’t require additional work vs. when something’s way out of bounds.