How Predictive Are Statcast Metrics?

Intro

Baseball is a hard sport to predict because of all the random variation in it, and that alone probably accounts for about half of sabermetrics. I was curious how predictive Statcast metrics are, which inspired this project. For this assignment, I looked at the 2017-2021 player-seasons that had at least 200 plate appearances, which gives me a sample of 1548 player-seasons. A lot has been written about how long it takes for Statcast metrics to stabilize, and it turns out they stabilize much faster than other stats like batting average. I’ve read that Statcast metrics stabilize within 70 batted balls (some within 40), which a player will easily reach by the time he has 200 plate appearances. Statcast’s main metrics are the batted ball ones, such as average and max exit velocity, since those are the ones that actually need Statcast in order to exist, but any player’s Baseball Savant dashboard will also show plate discipline metrics like chase rate (the percentage of pitches outside the strike zone that a hitter swings at) and whiff rate (the percentage of swings that miss).

This is Aaron Judge's Statcast dashboard.

How Did Statcast Metrics Change in 2021?

Did Statcast metrics change in 2021? We do know one thing: the baseball was different this year. In recent years, there had been lots of talk about the baseballs being “juiced” for more distance and home runs (though Major League Baseball tried to pretend otherwise for a while).

The new baseball didn’t seem to have much of an effect on home runs: in 2020, 3.5% of plate appearances resulted in home runs, while in 2021, that number only dropped to 3.3%. Even though the home run rate barely changed, there still could have been an effect on Statcast metrics, and I want to make sure that I’m comparing apples to apples when comparing 2021 to prior years. The plan for the new baseball, according to The Athletic, was “reducing the weight of the ball by less than one-tenth of an ounce, and also a slight decrease in the bounciness of the ball.” One comparison I heard (unfortunately I can’t remember the source) likened the old baseball to a basketball and the new, lighter, less bouncy baseball to a balloon. A basketball would come off the bat with a lower exit velocity but travel farther, while a balloon would come off with a higher exit velocity but travel a shorter distance. Did the distributions of the metrics reflect this?

In 2021, higher max exit velocities look even more frequent, which supports the prediction that the higher-drag, lighter ball would have higher exit velocities. All in all, though, the distribution is pretty similar. The same can be said about the average exit velocity distribution, where the 2021 line has nearly the same shape as the 2017-2020 one, just shifted slightly to the right. Hard hit rate is the percentage of a player’s batted balls that are 95.0 mph or faster, so it is no surprise that similar patterns show up in its distribution comparison. Barrel rate is a similar metric, but instead of looking at batted balls above a specific exit velocity, it looks for batted balls with exit velocity and launch angle combinations that have produced at least a .500 batting average and 1.500 slugging percentage in the Statcast era. That makes it an interesting one to look at, because it deals with more than just exit velocity. Once again, the distributions have nearly the same shape, with 2021 shifted slightly to the right to reflect a small bump in exit velocity.

Because the compared distributions for these four metrics have similar shapes, and because the shift in 2021 is still relatively tiny, I do think it is fair to lump 2021 stats in with 2017-2020 stats. However, it is important to keep in mind that year-to-year changes are present. xwOBA, meanwhile, has stayed the same, which makes sense to me because it is more of an all-encompassing stat.
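As a small illustration of the hard hit rate definition (batted balls at 95.0 mph or faster, divided by all batted balls), here is a minimal Python sketch. The function name and the sample exit velocities are made up for illustration and are not from the project's data.

```python
def hard_hit_rate(exit_velocities, threshold=95.0):
    """Share of batted balls hit at or above the threshold (in mph)."""
    if not exit_velocities:
        raise ValueError("need at least one batted ball")
    hard = sum(1 for ev in exit_velocities if ev >= threshold)
    return hard / len(exit_velocities)


# Made-up exit velocities (mph) for one hitter, for illustration only.
sample_evs = [101.3, 88.0, 95.0, 72.4, 110.1, 94.9, 99.7, 83.2]
rate = hard_hit_rate(sample_evs)
print(f"Hard hit rate: {rate:.1%}")
```

Note that a ball hit at exactly 95.0 mph counts as hard hit, while one at 94.9 mph does not.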

R-Squared (Coefficients of Determination)

R² quantifies the share of the variability in the y variable that can be explained by the x variables. When used between just two variables, it can be calculated by squaring the Pearson correlation coefficient between them. First, I looked at how strongly each metric determines its own value the next year. For example, the first value shows how strongly a player’s average exit velocity in the current year determines his average exit velocity the next year.
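For the two-variable case described here, the calculation can be sketched in plain Python: compute the Pearson correlation coefficient, then square it. This is only an illustrative sketch; the function names and the sample exit velocities are assumptions, not the project's code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """R-squared between two variables: the squared Pearson correlation."""
    return pearson_r(xs, ys) ** 2


# Made-up average exit velocities (mph): current year vs. next year.
current_ev = [87.1, 89.4, 91.0, 88.2, 92.5, 86.3]
next_ev = [87.8, 88.9, 90.2, 89.0, 91.7, 87.0]
print(f"Year-to-year R-squared: {r_squared(current_ev, next_ev):.3f}")
```

Because the correlation is squared, a strong negative relationship produces the same R² as an equally strong positive one, so R² says how tight the relationship is, not which direction it runs.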

The big takeaway here is that batted ball metrics have much stronger year-to-year R² values than standard stats like batting average and slugging percentage, and even wOBA. Of the four batted ball stats, max exit velocity is the strongest year-to-year, which makes sense to me, as it seems more like a true skill stat than a performance stat. The highest R² values, however, belong to the plate discipline metrics, showing that most players really are who they are when it comes to their plate discipline habits.

Which metrics are the strongest determinants of how well the player performed in the current year? I use wOBA as the rate stat for overall offensive output. There are a few takeaways here. Among the batted ball metrics, max exit velocity is by far the lowest, which makes sense again because it is more of a skill stat than a performance stat. Barrel rate is the highest, which feels right to me because I view barrel rate as the most informative of the four. Another takeaway is that strikeout rate has almost no explanatory strength, yet walk rate has a notable R². Slugging percentage is much higher than batting average, which is to be expected considering wOBA is weighted (a double is worth more than a single, and so on) and batting average carries a lot of random variation with it.

Lastly, we have the most important set of R² values, the one we came here for: the relationship between each metric and a hitter’s wOBA the next season. The biggest takeaway here for me is that xwOBA has a stronger value than wOBA itself. Clearly the inventors of xwOBA are doing a good job of estimating true talent level. Meanwhile, plate discipline metrics and batting average are very weak. Slugging percentage is much higher than batting average, which is to be expected considering the randomness that comes with batting average. The most surprising thing for me is seeing which batted ball metrics are stronger than others. My prediction would have been that barrel rate would be the strongest determinant of next-year wOBA, considering its definition seems to adhere most closely to simply how often a batter hits a ball well, yet it is lower than both hard hit rate and average exit velocity. Personally, I’ve always ignored average exit velocity in favor of barrel rate because by definition it seemed less informative, so I’m surprised to see it higher and find this very insightful. Meanwhile, none of the batted ball stats are stronger determinants than slugging percentage, and it’s neat to see a non-batted ball, non-expected, matter-of-fact, old school number performing better.
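Comparing a metric in one year to wOBA in the next requires lining up each player-season with the same player's following season. One way to do that pairing is sketched below in Python. The record layout and field names (`player_id`, `year`, `xwoba`, `woba`) are hypothetical, the numbers are made up, and this is not necessarily how the project's R code does it.

```python
def pair_with_next_season(seasons, metric, target):
    """Pair each season's `metric` with the same player's `target` one year later.

    Seasons without a qualifying following season are skipped.
    """
    by_key = {(s["player_id"], s["year"]): s for s in seasons}
    pairs = []
    for (player_id, year), current in sorted(by_key.items()):
        following = by_key.get((player_id, year + 1))
        if following is not None:
            pairs.append((current[metric], following[target]))
    return pairs


# Made-up player-seasons, for illustration only.
seasons = [
    {"player_id": "a", "year": 2018, "xwoba": 0.350, "woba": 0.340},
    {"player_id": "a", "year": 2019, "xwoba": 0.360, "woba": 0.365},
    {"player_id": "a", "year": 2021, "xwoba": 0.330, "woba": 0.345},
    {"player_id": "b", "year": 2019, "xwoba": 0.300, "woba": 0.310},
    {"player_id": "b", "year": 2020, "xwoba": 0.315, "woba": 0.305},
]
pairs = pair_with_next_season(seasons, metric="xwoba", target="woba")
print(pairs)
```

In this example, player "a" contributes only the 2018-to-2019 pair, because his 2019 and 2021 seasons have no qualifying season in the following year. The resulting pairs are what would be fed into an R² calculation.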

Machine Learning

I was curious how accurately these stats could predict a player’s next-year wOBA using machine learning. In my algorithms, I included 12 variables as features: plate appearances, walk rate, strikeout rate, average exit velocity, barrel rate, max exit velocity, hard hit rate, chase rate, xwOBA, expected batting average, age, and whiff rate. I wanted to include plate appearances because I figured seasons with larger sample sizes could carry more weight, and I wanted to include age because players entering or leaving their prime could see their next-year wOBA behave differently. As a baseline, I predicted that every player in the test dataset would have the same next-year wOBA as his current-year wOBA. This produced a root mean square error (RMSE) of .044. An error of .044 is significant in the context of wOBA because wOBA is measured in thousandths: the difference between a .300 wOBA hitter and a .344 wOBA hitter is pretty large. Can a machine learning model do better?
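The baseline described above (predict next-year wOBA to be the same as current-year wOBA, then score it with RMSE) can be sketched in a few lines of Python. The wOBA values below are made up for illustration, so the result will not match the .044 figure from the real test set.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predictions and actual values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up wOBA values for four hitters, for illustration only.
current_woba = [0.320, 0.350, 0.290, 0.400]
next_woba = [0.340, 0.330, 0.310, 0.380]

# Baseline: the prediction for next year is simply this year's wOBA.
baseline_rmse = rmse(current_woba, next_woba)
print(f"Baseline RMSE: {baseline_rmse:.3f}")
```

Any model worth keeping should produce a lower RMSE than this on the same test set, since the baseline uses no information beyond the current-year number.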

For this assignment, I tried four different machine learning algorithms: decision trees, decision forests, support vector machines, and neural networks. Each algorithm produced similar RMSE results, with the decision forest the best at .036. While this is more accurate than the baseline, it is still disappointing to not see it any lower. It really goes to show how hard it is to predict performance the next season.

When looking at which features were most important in the decision forest model, it’s no surprise to see xwOBA ranked highest, considering we saw earlier that its R² value was the strongest.

Something interesting to look at is the physical decision tree created by the decision tree algorithm, which can be seen above. The variable names in the feature importance plot and decision tree correspond as follows: X0 - plate appearances, X1 - walk rate, X2 - strikeout rate, X3 - average exit velocity, X4 - barrel rate, X5 - max exit velocity, X6 - hard hit rate, X7 - chase rate, X8 - xwOBA, X9 - expected batting average, X10 - age, X11 - whiff rate.
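To give a feel for what a single node in a tree like that is doing: a regression tree looks for a threshold on one feature that splits the players into two groups, such that predicting each group's average next-year wOBA leaves the smallest total squared error. Below is a minimal one-split sketch (a "stump") in plain Python. It is a simplified illustration, not the library code used for this project, and the function name and data are made up.

```python
def best_split(xs, ys):
    """Find the threshold on one feature that minimizes total squared error
    when each side of the split predicts its own mean.

    Returns (threshold, left_mean, right_mean).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    unique_xs = sorted(set(xs))
    if len(unique_xs) < 2:
        raise ValueError("need at least two distinct feature values to split")

    def mean(values):
        return sum(values) / len(values)

    def sse(values):
        # Sum of squared errors around the group's own mean.
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    best_error = None
    # Candidate thresholds are midpoints between neighboring feature values.
    for low, high in zip(unique_xs, unique_xs[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best_error is None or error < best_error:
            best_error = error
            best = (threshold, mean(left), mean(right))
    return best


# Made-up data: current-year xwOBA (feature) and next-year wOBA (target).
xwoba = [0.280, 0.300, 0.310, 0.360, 0.380, 0.400]
next_woba = [0.290, 0.300, 0.295, 0.350, 0.360, 0.355]
threshold, left_mean, right_mean = best_split(xwoba, next_woba)
print(f"Split at xwOBA <= {threshold:.3f}: "
      f"predict {left_mean:.3f}, otherwise predict {right_mean:.3f}")
```

A full decision tree repeats this search over every feature and then recurses into each side of the chosen split, and a decision forest averages many such trees built on resampled data.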

To complete this project, I used data from Fangraphs, Baseball Savant, and the Chadwick Baseball Bureau. Everything else was completed in RStudio, and the machine learning was completed in Python Jupyter notebooks. Special thanks to Dr. Varol Kayan, whose Python scripts taught in class I borrowed from heavily to complete this project. You can find the files I created for this project on GitHub.

