Introducing the Full Statcast Database (2019-2021)

Back in July, me and my school partners created a similar data warehouse for every pitch thrown in the 2021 season up until June 23. The project was a lot of fun and you can read a PDF version of it here under the Data Warehousing section. I wanted to continue that project for all of 2021 and beyond, and have finally completed it. Baseball Savant contains data from every season since 2008, and Statcast metrics were measured starting in 2015. In a perfect world, I wanted my database to include everything since 2008, and at the very least 2015 to have a "Statcast Era" database, but unfortunately my student account in Oracle has a limited table space, so I was limited to 2019, 2020, and 2021. Across those three seasons, there were a total of 1,743,857 pitches thrown. 

My database is set up in a similar dimension model "warehouse" style. The "fact table" contains all the pitches that I just described, with all their details such as velocity, spin rate, pitch type, result of the pitch, and so on. There are then two "dimension tables," one that describes the pitcher on the play and one that describes the batter, with descriptive attributes such as birthday, height, weight, and handedness. 

This has been something that I've wanted to complete for a while and I'm really excited for it. Here are a few example queries that allowed me to have a lot of fun with it. I'll have the queries in question form.

1) What was Aroldis Chapman's pitch arsenal in 2021?

2) What are the most common results of a 3-0 pitch?









3) Of the 875 times the 3-0 pitch was put into play, what was the result?









This means that batters hit for a .387 batting average and a .824 slugging percentage when swinging on 3-0. I think I might do a separate 3-0 pitch analysis later on.

4) Who had the most hits against 97+ mph pitches?









5) Of all the players who threw at least one pitch, what countries were they from?














6) How did fastball spin rate change month by month?









There is a big drop off in June when the crack down began. Starting in August, it gradually increases again.

7) What are the most common results of hitting a ball 110+ mph?












Hitting a ball 110+ mph results in a .667 batting average and 1.597 slugging percentage, although this isn't accurate because there are 761 batted balls where the event was not recorded (null).

8) Which players had the most 110+ mph batted balls that were NOT hits?

We see here Vladimir Guerrero Jr. had the most batted balls at 110+ mph that didn't result in a hit, but he also had the most batted balls at 110+  mph overall. If we look at percentage of those batted balls that did not result in hits, Yandy Diaz of the Rays is the least luckiest, where 12 of his 28 such batted balls were not hits. Of the high-volume 110+ exit velocity hitters, Aaron Judge was the least luckiest, where 32% of his such batted balls were not hits. Mike Zunino had the best percentage, where only two of his 42 such batted balls were not hits. Pete Alonso had the best percentage among the high-volume guys, where only 12% of his 110+ mph batted balls were not hits. 

Of course, we know that exit velocity is not the only thing that determines likelihood of a hit; luck and launch angle have an impact as well. A launch angle less than 10 degrees is typically a groundball. A launch angle of 10-25 degrees is typically a line drive. Guerrero's average launch angle on those batted balls was 6.5 degrees, while Yandy Diaz was a very low 3.2 degrees. These guys were often hitting ground balls. Zunino had an average of 21.8, a stark contrast. Alonso had an average of 13.2, while Judge had an average of 9.1.

SQL Code

1)

SELECT pitch_name, game_year, round(avg(release_speed), 1) as velocity, round((count(*)/ (select count(*) from MLB.pitch_fact where pitcher = 547973 AND game_year = 2021)*100), 2) as percent_thrown FROM MLB.pitch_fact WHERE pitcher = 547973 AND game_year = 2021 GROUP BY pitch_name, game_year ORDER BY pitch_name, game_year DESC;

2)

SELECT description, count(*), round((count(*)/ (select count(*) from MLB.pitch_fact where balls = 3 AND strikes = 0)*100), 2) as percent FROM MLB.pitch_fact WHERE balls = 3 AND strikes = 0 GROUP BY description ORDER BY count(*) DESC;

3)

SELECT events, count(*), round((count(*)/ (select count(*) from MLB.pitch_fact where balls = 3 AND strikes = 0 AND events NOT IN ('walk', 'hit_by_pitch'))*100), 2) as percent FROM MLB.pitch_fact WHERE balls = 3 AND strikes = 0 AND events NOT IN ('walk', 'hit_by_pitch') GROUP BY events ORDER BY count(*) DESC;

4)

SELECT batter, nameFirst, nameLast, count(*) FROM MLB.pitch_fact f INNER JOIN MLB.batter_dimension b ON f.batter = b.playerID WHERE release_speed > 97.0 AND events IN ('home_run', 'single', 'double', 'triple') GROUP BY batter, nameFirst, nameLast ORDER BY count(*) DESC;

5)

SELECT birthCountry, COUNT(DISTINCT pitcher) AS No_Of_Pitchers FROM MLB.pitch_fact f INNER JOIN MLB.pitcher_dimension p ON f.pitcher = p.playerID WHERE game_year = 2021 GROUP BY birthCountry ORDER BY No_Of_Pitchers DESC;

6)

SELECT EXTRACT(month from TO_DATE(game_date, 'YYYY-MM-DD')) AS month, ROUND(AVG(release_spin_rate),0) AS spin_rate FROM MLB.pitch_fact WHERE pitch_type = 'FF' AND game_year = 2021 GROUP BY EXTRACT(month from TO_DATE(game_date, 'YYYY-MM-DD')) ORDER BY month;

7)

SELECT count(*), events FROM MLB.pitch_fact WHERE launch_speed > 110 GROUP BY events ORDER BY count(*) DESC;

8)

SELECT batter, nameFirst, nameLast, count(*) AS not_hits FROM MLB.pitch_fact f INNER JOIN MLB.batter_dimension b ON f.batter = b.playerID WHERE launch_speed > 110 AND events NOT IN ('home_run', 'single', 'double', 'triple') GROUP BY batter, nameFirst, nameLast ORDER BY count(*) DESC; SELECT batter, nameFirst, nameLast, count(*) as batted_balls_110, round(AVG(launch_angle),1) FROM MLB.pitch_fact f INNER JOIN MLB.batter_dimension b ON f.batter = b.playerID WHERE launch_speed > 110 GROUP BY batter, nameFirst, nameLast ORDER BY count(*) DESC;















Comments

Popular posts from this blog

Analyzing Strike Zone Data From the Statcast Database

How Predictive Are Statcast Metrics?