Baseball data analytics fun

Aug 10, 2018

What is more fun than data analysis? Data analysis with baseball trend data, that’s what! People have been saying this year that most Major League Baseball at bats end in either a strikeout or a home run. That is an exaggeration, of course, but these two outcomes do seem to be becoming increasingly common. I wondered whether the strikeout trend was real and, if so, when it started. Using the Lahman database, let’s look at the history.

First, let’s look at our raw data showing the number of strikeouts by year.
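For readers who want to follow along, here is a minimal sketch of that first aggregation. It is not the code behind the chart. It assumes a table shaped like the Lahman `Teams` table (one row per team per season, with `yearID` and batter strikeouts in `SO`), and the rows below are made-up placeholders rather than real statistics. With the real data you would load the table instead, for example with `pd.read_csv("Teams.csv")`, where the file path is an assumption about where your copy lives.

```python
import pandas as pd

# Made-up illustrative rows shaped like the Lahman Teams table
# (one row per team per season). These numbers are NOT real data.
teams = pd.DataFrame(
    {
        "yearID": [1980, 1980, 1981, 1981],
        "teamID": ["AAA", "BBB", "AAA", "BBB"],  # hypothetical team codes
        "SO": [800, 900, 500, 600],              # batter strikeouts
    }
)

# Sum strikeouts across all teams for each season.
so_by_year = teams.groupby("yearID")["SO"].sum()
print(so_by_year)
```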

Well, the trend is definitely up and to the right, but we also notice a few outlier years there. What could be going on?

First, there is a blip down in 1981. There was a players’ strike that year. Work stoppages also reduced the number of games played in other years, such as 1994, which shows up in the chart above as well. So it seems we will have to normalize for the number of games played per team per season. That normalization will also make years with 154 games apples-to-apples with years with 162 games. As a side note, until researching the schedule changes, I was not aware that the National League moved to the 162-game schedule in 1962, one year after the American League did in 1961.

Are there other things that could explain the upward trend? Ah, baseball expanded in 1977, meaning there were more games played, hence more plate appearances, and therefore more strikeouts after that, all else being equal. Other expansion years were 1961, 1962, 1969, 1993, and 1998. So it seems we will have to normalize not just for the number of games played per season in the schedule but also for the number of teams.

What other changes could be muddying the data water? What about the steroids era? Or the lowering of the mound after the 1968 season? Or the introduction of the designated hitter in the American League in 1973? (Presumably pitchers strike out more than DHs do, or at least the DH advocates would have us believe that!) Rather than adjust the numbers for these, I would prefer to use them to explain differences that we (maybe) will see.

To adjust for the variable number of games played each year, whether due to league expansion or a change in schedule length, we can calculate the number of strikeouts per game. But is that even the best metric? Wouldn’t the highest-scoring teams in the league have more plate appearances per game and therefore more chances for strikeouts in each game? I tend to think of this as a secondary effect in terms of magnitude, but it’s simple enough to account for it now, and we can always come back and test the assumption afterwards. For now, let’s look at strikeouts per plate appearance. We will use plate appearances rather than at bats because we want to ensure that things don’t get distorted by walks, for example.
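To make the per-game idea concrete, here is a small sketch under the same assumptions as before: a `Teams`-style table where `G` is the number of games each team played, filled with made-up numbers. Because `G` is counted once per team, dividing summed strikeouts by summed `G` gives strikeouts per team per game, which handles both schedule length and the number of teams in one step.

```python
import pandas as pd

# Made-up rows: a full 162-game season and a shortened 108-game season.
# These numbers are NOT real data.
teams = pd.DataFrame(
    {
        "yearID": [1980, 1980, 1981, 1981],
        "teamID": ["AAA", "BBB", "AAA", "BBB"],  # hypothetical team codes
        "G": [162, 162, 108, 108],               # games played by each team
        "SO": [810, 972, 540, 648],              # batter strikeouts
    }
)

# Sum strikeouts and team-games per season, then divide.
per_year = teams.groupby("yearID")[["SO", "G"]].sum()
per_year["SO_per_team_game"] = per_year["SO"] / per_year["G"]
print(per_year)
```

In this toy example the raw strikeout total falls by a third in the short season, while the per-game rate is identical in both years. That is exactly the kind of distortion the normalization is meant to remove.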

One data challenge with plate appearances is that they are not always tracked in aggregate as clearly as at bats. We can use the shorthand of at bats + walks + hit by pitch, but that leaves out sacrifice bunts, sacrifice flies, and catcher’s interference. For purposes of this analysis, those omitted outcomes are a small share of plate appearances (catcher’s interference in particular is few and far between), so let’s ignore them and use strikeouts divided by the sum of at bats, walks, and hit by pitch, or K/(AB+BB+HBP).
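The metric itself is a one-liner. The sketch below writes it as a small function; the name `strikeout_rate` and the sample numbers are made up for illustration.

```python
def strikeout_rate(so: int, ab: int, bb: int, hbp: int) -> float:
    """Approximate K/PA as K / (AB + BB + HBP)."""
    pa = ab + bb + hbp  # approximate plate appearances
    if pa == 0:
        raise ValueError("no plate appearances to divide by")
    return so / pa


# Made-up season totals, not real data.
rate = strikeout_rate(so=1000, ab=5500, bb=450, hbp=50)
print(f"{rate:.3f}")
```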

Wow, all of that time just to look at our raw data and decide on a metric. We have not even started the analysis yet! This is a key point I often see in business cases: if the wrong question is asked or the wrong item is measured, the analysis becomes less meaningful. It is worth spending a bit more time upfront looking at our data and making sure our measures reflect reality as well as possible. The metric does not have to be perfect (e.g. we can ignore catcher’s interference), but it should not ignore more meaningful factors (e.g. league expansion).

Let’s see what we get now. I will also include walks per plate appearance as a comparison point.
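Here is a hedged sketch of how the two rates could be computed per season with Pandas. It again assumes `Teams`-style columns (`AB`, `BB`, `HBP`, `SO`) and uses made-up numbers. Treating a missing `HBP` value as zero is my own assumption for handling gaps, so check how your copy of the data records those seasons before relying on it.

```python
import pandas as pd

# Made-up rows, NOT real data. One HBP value is deliberately missing.
teams = pd.DataFrame(
    {
        "yearID": [2000, 2000, 2001, 2001],
        "AB": [5500, 5500, 5400, 5600],
        "BB": [450, 550, 500, 500],
        "HBP": [50, 50, None, 50],
        "SO": [1000, 1200, 1100, 1300],
    }
)

# Assumption: treat a missing HBP count as zero rather than dropping the row.
cols = ["AB", "BB", "HBP", "SO"]
totals = teams.fillna({"HBP": 0}).groupby("yearID")[cols].sum()

# Sum the components first, then divide, so each season gets one
# league-wide rate instead of an average of team rates.
totals["PA"] = totals["AB"] + totals["BB"] + totals["HBP"]
totals["K_per_PA"] = totals["SO"] / totals["PA"]
totals["BB_per_PA"] = totals["BB"] / totals["PA"]
print(totals[["K_per_PA", "BB_per_PA"]])
```

The resulting frame has one row per season, so its two rate columns can be handed to a plotting library such as Bokeh as two lines over `yearID`.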

Interesting, strikeouts per plate appearance (red line) have definitely been on the upswing, most notably since 1980.

One nice effect of switching our metric from gross strikeouts per season to K/PA is that, by removing the noise of season length and number of teams, we see that K/PA actually went down from the late 1960s to 1980, one of the few sustained declines in MLB history. We would have missed this if we had used only the first chart at the top of this post.

Anyway, what could explain this roughly 12-year downward trend in strikeouts per plate appearance? The DH was introduced in the American League in 1973. While that is only one batter out of nine in each lineup, it does make a visible difference in the strikeout rates (maybe an analysis for a different post!). That only applies, however, to the American League, and the National League also sees a reduction. The mound was lowered in the late 1960s to lessen the dominance that pitchers had at that time. That’s the best explanation I can come up with. But then I am at a loss to explain why it was not just a one-time step change, and why the trend reversed and began rising again after 1980. One idea presented by Geoff Young of The Hardball Times is that the introduction of specialty pitchers meant tired starters were being removed earlier in games and replaced by fresh relief arms. In theory, that effect would work in the opposite direction from the mound change and would have been the more pronounced of the two. Again, that is an analysis for a different post!
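For anyone who wants to check the league split themselves, a sketch of the comparison might look like the following. It assumes the table carries a league column named `lgID` with values `AL` and `NL`, as the Lahman `Teams` table does. The numbers are invented purely to show the mechanics and say nothing about what the real data shows.

```python
import pandas as pd

# Made-up rows, NOT real data: one row per league per season for brevity.
teams = pd.DataFrame(
    {
        "yearID": [1972, 1972, 1973, 1973],
        "lgID": ["AL", "NL", "AL", "NL"],
        "AB": [5000, 5000, 5000, 5000],
        "BB": [450, 450, 450, 450],
        "HBP": [50, 50, 50, 50],
        "SO": [880, 880, 825, 880],
    }
)

# Aggregate by season and league, then compute K/(AB+BB+HBP) for each group.
grouped = teams.groupby(["yearID", "lgID"])[["AB", "BB", "HBP", "SO"]].sum()
grouped["K_per_PA"] = grouped["SO"] / (
    grouped["AB"] + grouped["BB"] + grouped["HBP"]
)

# Pivot so each league becomes its own column, ready to plot as two lines.
by_league = grouped["K_per_PA"].unstack("lgID")
print(by_league)
```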

For those interested, here is the little bit of Python (including Pandas and Bokeh libraries, of course) that I used for the analysis. It is no expert’s Python, that’s for sure, but it worked to get the job done.

Written by Stuart Zussman