I was having some fun with PITCHf/x data and generalized additive models. PITCHf/x tracks the trajectory and location of every pitch in MLB. It is pretty accurate and opens up baseball to more analyses than ever before. Generalized additive models (GAMs) are statistical models that place minimal assumptions on the form of the relationship you are fitting. Traditional statistical models are linear, in that they assume the response variable you are modeling is a linear function of the explanatory variables. GAMs assume only that the relationship is "smooth." Here is a good example of a relationship that might traditionally have been modeled as linear, but for which smoothness is a much better assumption.

I fit a GAM to PITCHf/x data. The response is whether or not Ichiro swung. The explanatory variables are the pitch's horizontal location (x), its vertical location (z), and the day of the year. Naturally, we expect the probability of swinging to change as the pitch gets closer to or further from the center of the strike zone. Additionally, I was interested in seeing whether his swinging propensity changed as the year went on.
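
One way to write such a model down, assuming a logistic link and a single joint smooth of location and day (the notation is illustrative, and the exact smooth structure is an assumption rather than something stated above):

$$
\operatorname{logit} \Pr(\text{swing}_i = 1) = \beta_0 + f(x_i, z_i, t_i)
$$

Here $x_i$ and $z_i$ are the horizontal and vertical location of pitch $i$, $t_i$ is the day of the year, and $f$ is an unknown smooth function estimated from the data. An ordinary linear logistic regression would be the special case $f(x, z, t) = \beta_1 x + \beta_2 z + \beta_3 t$.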

You can see that the probability of swinging is smooth in both location and time. Also, you can see (ever so slightly) that the probability of swinging increased as the year went on. Looking at the splits, his walk percentage was 28/395 (7.1%) in the first half and 17/337 (5.0%) in the second half. This agrees with the swing probability increasing.

I used the mgcv package in R to run the GAM. I created an image for every day and stitched them together into a movie with ffmpeg. The R code is here.
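
For readers without the linked code, here is a minimal sketch of what a workflow like this might look like. It is not the original script: the data are simulated, the column names (`px`, `pz`, `day`, `swing`), the helper `swing_surface`, the tensor-product smooth `te(px, pz, day)`, and the frame file names are all assumptions made for illustration.

```r
library(mgcv)

set.seed(1)
n <- 2000

# Simulated stand-in for the PITCHf/x data (hypothetical column names).
pitches <- data.frame(
  px  = runif(n, -2, 2),                    # horizontal location, feet
  pz  = runif(n, 0.5, 4.5),                 # vertical location, feet
  day = sample(90:275, n, replace = TRUE)   # day of the year
)
# Swings are more likely near the middle of the zone and slightly
# more likely later in the year (made-up coefficients).
eta <- 1 - 1.5 * (pitches$px^2 + (pitches$pz - 2.5)^2) +
  0.003 * (pitches$day - 180)
pitches$swing <- rbinom(n, 1, plogis(eta))

# Logistic GAM with one joint smooth of location and day (an assumed structure).
fit <- gam(swing ~ te(px, pz, day), family = binomial,
           data = pitches, method = "REML")

# Predicted swing probability on a location grid for a single day.
xs <- seq(-2, 2, length.out = 50)
zs <- seq(0.5, 4.5, length.out = 50)
swing_surface <- function(fit, day) {
  grid <- expand.grid(px = xs, pz = zs)
  grid$day <- day
  p <- as.numeric(predict(fit, newdata = grid, type = "response"))
  matrix(p, nrow = length(xs))  # rows index px, columns index pz
}

# Write one image per day into a temporary directory.
frame_dir <- tempdir()
days <- c(100, 180, 260)  # use every day of the season for a full movie
for (i in seq_along(days)) {
  png(file.path(frame_dir, sprintf("frame_%03d.png", i)))
  image(xs, zs, swing_surface(fit, days[i]), zlim = c(0, 1),
        xlab = "x (ft)", ylab = "z (ft)", main = paste("Day", days[i]))
  dev.off()
}

# The frames could then be stitched together from the shell, for example:
#   ffmpeg -framerate 10 -i frame_%03d.png swing.mp4
```

Using a single `te()` term lets the location surface change shape over the season. A simpler alternative would be additive terms such as `s(px, pz) + s(day)`, which only lets the whole surface shift up or down with time.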
