Player Stat Percentiles in R

One of the best features of Baseball Savant is the percentile dashboard on each player's page. Everybody loves them. They're a great first-glance visual for seeing whether a player is generally performing well or poorly in Statcast's batted-ball metrics and expected statistics. These percentile visuals work so well because they give us the context of how a player compares to the rest of the league. For example, Aaron Judge's .287 batting average sounds good-not-great until you realize that it was an 89th percentile batting average, and that the median batting average in 2021 was .249.

The only thing missing in Baseball Savant's dashboards? They only cover Statcast metrics. If I want a first-glance look at a player, I don't just want to know how he compares to the league in Statcast metrics, but in all the metrics important to me. Picture being in a time crunch during a fantasy baseball draft. I'll hop onto a player's Baseball Savant page to get a quick idea of how well he is performing there, and then I'll check his Fangraphs page for his more standard stats (his slash line, walk rate, strikeout rate, and so on). Not only do I have to check two websites, but I'm also reading his Fangraphs page without the context of percentiles.

To fix that, I created two functions in R: mlb_percentiles_batters() and mlb_percentiles_pitchers(). The functions use Bill Petti's baseballr package to scrape the Fangraphs and Baseball Savant leaderboards, find the player in those leaderboards by his player ID, and calculate his percentiles for selected stats. Each function then prints both the player's value and his percentile for those stats. Even with a lot going on in the background, the functions take "only" about 21 seconds to run. That's a long time, but still much more convenient than visiting two pages and not knowing the percentiles on one of them.
Here are some examples.
  • mlb_percentiles_batters("Alex", "Kirilloff", 2021, 200)
  • mlb_percentiles_pitchers("Sandy", "Alcantara", 2021, 100)
The four parameters are first name, last name, year, and qualifier. The qualifier is the minimum playing time a player needs to be included in the comparison: plate appearances for batters and innings pitched for pitchers.
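The percentile step itself is small. Here is a minimal sketch of it using base R's ecdf() on a made-up vector of batting averages. The helper name pct_rank and the numbers are illustrative assumptions, not part of the functions below and not real leaderboard data:

```r
# Hypothetical helper: percentile of `value` within `league`, as a whole number.
# ecdf(league) builds a function that returns the share of league values <= its input.
pct_rank <- function(league, value, lower_is_better = FALSE) {
  p <- round(ecdf(league)(value) * 100)
  # For stats like ERA, flip the scale so that a higher percentile is still better.
  if (lower_is_better) 100 - p else p
}

# Made-up batting averages for a five-player "league"
league_avg <- c(0.220, 0.240, 0.250, 0.270, 0.300)

pct_rank(league_avg, 0.270)                          # 4 of the 5 values are <= .270
pct_rank(league_avg, 0.220, lower_is_better = TRUE)  # flipped scale
```

The functions in this post do the same thing inline for every stat, with the league vector coming from the scraped leaderboards.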
Alex Kirilloff interests me: he struggled badly after his call-up to the majors, then went on a strong power streak (including four home runs in three games at one point), until a wrist injury slowed him down and ultimately ended his year. Here is what his stats and percentiles look like among batters with at least 200 plate appearances in 2021:


His Statcast metrics look very promising compared to his standard stats.

Here is the pitcher function for Sandy Alcantara. Alcantara had a bit of a breakout year in 2021. He's one of the hardest-throwing pitchers in baseball, yet he was always more of a groundball pitcher than a strikeout pitcher. At the end of the season, he hit another gear and had a string of high-strikeout performances. How does his breakout year look compared to pitchers with at least 100 innings? There is good reason to be a fan of his in 2022 as well:


Code

mlb_percentiles_batters <- function(first_name, last_name, year, qual) {
  library(baseballr)
  
  #Pull player id from name (assumes the name matches exactly one player in the lookup)
  id <- playerid_lookup(last_name, first_name)
  id <- id$mlbam_id
  
  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Create Fangraphs dataset
  fangraphs <- fg_bat_leaders(year, year, qual = qual, ind = 0)
  
  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Create Baseball Savant dataset
  savant <- scrape_savant_leaderboards(leaderboard = "exit_velocity_barrels", year = year, 
                                       player_type = "batter", min_pa = 1)
  savant <- savant[c(1:4, 8, 9, 17, 19)]
  #Pulls avg_hit_speed (9), max_hit_speed (8), ev95percent (17), brl_percent (19)
  savant2 <- scrape_savant_leaderboards(leaderboard = "expected_statistics", year = year, 
                                        player_type = "batter", min_pa = 1)
  savant2 <- savant2[c(4, 8, 11, 14)]
  #Pulls est_ba (8), est_slg (11), est_woba (14)
  
  #Merge the two savant datasets together
  savant_df <- merge(savant, savant2, by = "player_id")
  
  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Merge Fangraphs with Baseball Savant
  
  chadwick <- get_chadwick_lu()
  #Pull only Fangraphs playerid (key_fangraphs) and Baseball Savant player id (key_mlbam)
  chadwick <- chadwick[c(3,7)]
  
  df1 <- merge(fangraphs, chadwick, by.x = "playerid", by.y = "key_fangraphs")
  df <- merge(df1, savant_df, by.x = c("key_mlbam"), by.y = c("player_id"))
  
  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Print stats and percentiles
  print(paste("Total players compared to: ", nrow(df)))
  print(paste("Player age: ",df$Age[ which(df$key_mlbam == id)]))
  print(paste("AVG:      ",df$AVG[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$AVG)(df$AVG[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("OBP:      ",df$OBP[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$OBP)(df$OBP[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("SLG:      ",df$SLG[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$SLG)(df$SLG[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("wOBA:     ",df$wOBA[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$wOBA)(df$wOBA[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  
  #Keep the rates unrounded so the player is ranked on his exact value; round only for display
  k_rate <- df$SO[ which(df$key_mlbam == id)] / df$PA[ which(df$key_mlbam == id)]
  print(paste("K%:       ",format(round(k_rate*100, 1), nsmall=1),", ", 100-(round(ecdf(df$SO / df$PA)(k_rate), 2)*100),"percentile"))

  bb_rate <- df$BB[ which(df$key_mlbam == id)] / df$PA[ which(df$key_mlbam == id)]
  print(paste("BB%:      ",format(round(bb_rate*100, 1), nsmall=1),", ", round(ecdf(df$BB / df$PA)(bb_rate), 2)*100,"percentile"))
  
  print(paste("avgEV:    ",format(round(df$avg_hit_speed[ which(df$key_mlbam == id)], 1), nsmall=1),", ",
              round(ecdf(df$avg_hit_speed)(df$avg_hit_speed[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("maxEV:    ",format(round(df$max_hit_speed[ which(df$key_mlbam == id)], 1), nsmall=1),", ",
              round(ecdf(df$max_hit_speed)(df$max_hit_speed[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("HardHit%: ",format(round(df$ev95percent[ which(df$key_mlbam == id)], 1), nsmall=1),", ",
              round(ecdf(df$ev95percent)(df$ev95percent[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("Barrel%:  ",format(round(df$brl_percent[ which(df$key_mlbam == id)], 1), nsmall=1),", ",
              round(ecdf(df$brl_percent)(df$brl_percent[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("xBA:      ",df$est_ba[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$est_ba)(df$est_ba[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("xSLG:     ",df$est_slg[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$est_slg)(df$est_slg[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("xwOBA:    ",df$est_woba[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$est_woba)(df$est_woba[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  
}
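
The two merge() calls in the function above do the heavy lifting of lining up Fangraphs rows with Baseball Savant rows. Here is a rough sketch of that join on toy data. The table contents and ids are made up for illustration; only the column names mirror the ones the function uses:

```r
# Toy stand-ins for the three tables; every id and value here is made up.
fangraphs <- data.frame(playerid = c(101, 102, 103),
                        AVG = c(0.250, 0.270, 0.300))
crosswalk <- data.frame(key_fangraphs = c(101, 102, 103),
                        key_mlbam = c(9001, 9002, 9003))
savant <- data.frame(player_id = c(9001, 9002),
                     est_ba = c(0.260, 0.255))

# Step 1: attach the MLBAM id to each Fangraphs row.
step1 <- merge(fangraphs, crosswalk, by.x = "playerid", by.y = "key_fangraphs")
# Step 2: join on the MLBAM id. merge() keeps only rows present on both sides,
# so a player missing from either leaderboard drops out.
combined <- merge(step1, savant, by.x = "key_mlbam", by.y = "player_id")

nrow(combined)  # player 103 has no Savant row, so he is not in the result
```

Because merge() defaults to an inner join, the "Total players compared to" count reflects only players found in both leaderboards.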


mlb_percentiles_pitchers <- function(first_name, last_name, year, qual) {
  library(baseballr)
  
  #Pull player id from name (assumes the name matches exactly one player in the lookup)
  id <- playerid_lookup(last_name, first_name)
  id <- id$mlbam_id
  
  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Create Fangraphs dataset
  fangraphs <- fg_pitch_leaders(year, year, qual = qual, ind = 0)

  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Create Baseball Savant dataset
  savant <- scrape_savant_leaderboards(leaderboard = "expected_statistics", year = year, 
                                        player_type = "pitcher", min_pa = 1)
  savant <- savant[c(4, 8, 11, 13, 14, 17)]
  #Pulls player_id (4), est_ba (8), est_slg (11), woba (13), est_woba (14), xera (17)
  
 
  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Merge Fangraphs with Baseball Savant
  
  chadwick <- get_chadwick_lu()
  #Pull only Fangraphs playerid (key_fangraphs) and Baseball Savant player id (key_mlbam)
  chadwick <- chadwick[c(3,7)]
  
  df1 <- merge(fangraphs, chadwick, by.x = "playerid", by.y = "key_fangraphs")
  df <- merge(df1, savant, by.x = c("key_mlbam"), by.y = c("player_id"))
  
  #-------------------------------------------------------------------------------------------------
  #-------------------------------------------------------------------------------------------------
  #Print stats and percentiles
  #Note: Stats where lower is better (e.g., ERA) are reported as (100 - percentile) so that higher is always better
  print(paste("Total players compared to: ", nrow(df)))
  print(paste("Player age: ",df$Age[ which(df$key_mlbam == id)]))
  print(paste("ERA:     ",df$ERA[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$ERA)(df$ERA[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("FIP:     ",df$FIP[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$FIP)(df$FIP[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("xFIP:    ",df$xFIP[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$xFIP)(df$xFIP[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("SIERA:   ",df$SIERA[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$SIERA)(df$SIERA[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("kwERA:   ",df$kwERA[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$kwERA)(df$kwERA[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("xERA:    ",df$xera[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$xera)(df$xera[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("K%:      ",df$K_pct[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$K_pct)(df$K_pct[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("BB%:     ",df$BB_pct[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$BB_pct)(df$BB_pct[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("K-BB%:   ",df$`K-BB_pct`[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$`K-BB_pct`)(df$`K-BB_pct`[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("GB%:     ",df$GB_pct[ which(df$key_mlbam == id)],", ",
              round(ecdf(df$GB_pct)(df$GB_pct[ which(df$key_mlbam == id)]), 2)*100,"percentile"))
  print(paste("HR/9:    ",df$HR_9[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$HR_9)(df$HR_9[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("AVG:     ",df$AVG[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$AVG)(df$AVG[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("xBA:     ",df$est_ba[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$est_ba)(df$est_ba[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("wOBA:    ",df$woba[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$woba)(df$woba[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("xwOBA:   ",df$est_woba[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$est_woba)(df$est_woba[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  print(paste("WHIP:    ",df$WHIP[ which(df$key_mlbam == id)],", ",
              100-(round(ecdf(df$WHIP)(df$WHIP[ which(df$key_mlbam == id)]), 2)*100),"percentile"))
  
}
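
Both functions repeat the same print pattern for every stat. One possible way to tighten that up is a small helper like the sketch below. The name stat_line and its arguments are hypothetical, and the toy leaderboard is made up; it assumes a merged data frame with a key_mlbam column, as in the functions above:

```r
# Hypothetical helper: build one "label: value, percentile" line.
# `df` is the merged leaderboard, `col` a column name, `id` the player's MLBAM id.
stat_line <- function(df, label, col, id, lower_is_better = FALSE) {
  value <- df[[col]][df$key_mlbam == id]
  pct <- round(ecdf(df[[col]])(value) * 100)
  # Flip the scale for stats where a lower value is better (ERA, WHIP, BB%, ...).
  if (lower_is_better) pct <- 100 - pct
  paste0(label, ": ", value, ", ", pct, " percentile")
}

# Made-up three-pitcher leaderboard for illustration only
toy <- data.frame(key_mlbam = c(1, 2, 3), ERA = c(2.50, 3.75, 4.25))
print(stat_line(toy, "ERA", "ERA", id = 2, lower_is_better = TRUE))
```

With something like this, each stat becomes a single call, and the lower-is-better flip lives in one place instead of being repeated by hand.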
