Fangraphs recently published an interesting dataset that measures defensive efficiency of fielders. For each player, the Inside Edge dataset breaks their opportunities to make plays into five categories, ranging from almost impossible to routine. It also records the proportion of times that the player successfully made the play. With this data, we can see how successful each player is for each type of play. I wanted to think of a way to combine these five proportions into one fielding metric.

Factor Analysis of Baseball's Hall of Fame Voters

Introduction
Matrix factorization has proven to be one of the most effective approaches to collaborative filtering. The most common example of collaborative filtering is predicting how much a viewer will like a movie. The power of matrix factorization was one of the key lessons of the Netflix Prize (see http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf).
Using the movie rating example, the idea is that there are some underlying features of the movie and underlying attributes of the user that interact to determine if the user will like the movie.
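The idea of latent movie features interacting with latent user attributes can be sketched in a few lines. This is a minimal illustration, not the method used in the post or in the Netflix Prize: the function name `factorize`, the toy ratings matrix, and the hyperparameters (`rank`, `lr`, `reg`, `epochs`) are all assumptions chosen for the example. Each user and each movie gets a short vector, and a predicted rating is their dot product, fit by gradient descent on the observed ratings only.

```python
import numpy as np

def factorize(ratings, mask, rank=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Fit ratings ~ U @ V.T on the observed entries (mask == True).

    Plain full-batch gradient descent with L2 regularization.
    All hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = 0.1 * rng.standard_normal((n_users, rank))  # latent user attributes
    V = 0.1 * rng.standard_normal((n_items, rank))  # latent movie features
    for _ in range(epochs):
        # Residuals on observed entries only; unobserved entries contribute 0.
        err = mask * (ratings - U @ V.T)
        grad_U = err @ V - reg * U
        grad_V = err.T @ U - reg * V
        U += lr * grad_U
        V += lr * grad_V
    return U, V

# Hypothetical toy data: 4 users x 4 movies, 0 marks "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
mask = ratings > 0

U, V = factorize(ratings, mask)
predicted = U @ V.T  # includes predictions for the unrated cells

# Root mean squared error on the observed ratings.
rmse = np.sqrt(np.mean((ratings - predicted)[mask] ** 2))
# Baseline for comparison: predicting 0 for every observed rating.
baseline_rmse = np.sqrt(np.mean(ratings[mask] ** 2))
```

The entries of `predicted` that correspond to unrated cells are the model's guesses for how much each user would like a movie they have not seen.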

I was having some fun with PITCHf/x data and generalized additive models. PITCHf/x keeps track of the trajectory, path, and location of every pitch in MLB. It is pretty accurate and opens up baseball to more analyses than ever before. Generalized additive models (GAMs) are statistical models that place minimal assumptions on the form of the model you are fitting. Traditional statistical models are linear, in that they assume that the response variable you are modelling is a linear function of the explanatory variables.

Recently, Chris Perez, the closer for the Indians, displayed some frustration with the fans for not supporting the team. Currently, they have the lowest attendance in the majors -- by a decent margin. The Indians are averaging about 15,000 fans per home game, while the next closest team, the Oakland A's, is averaging 19,000. It seemed like an odd time for Perez to bring this up because attendance was in the 29,000s at each of the last two home games.

After signing a huge deal with the Angels, Pujols has been having a really bad year. He hasn't hit a home run this year, breaking a career-long streak. So I thought it would be a good idea to use some statistics to estimate how good or bad we should expect Pujols to be this year.
Coming into the year, he had a career .328/.420/.617 AVG/OBP/SLG line. Through one month, he has a .

I guess you could call this On Bayes Percentage. *cough*
Fresh off learning Bayesian techniques in one of my classes last quarter, I thought it would be fun to try to apply the method. I was able to find some examples of Hierarchical Bayes being used to analyze baseball data at Wharton.
Setting up the problem
On base percentage (OBP) is probably the most important basic offensive statistic in baseball.
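For readers unfamiliar with the statistic, OBP is the standard ratio (H + BB + HBP) / (AB + BB + HBP + SF): times on base via hit, walk, or hit by pitch, divided by at-bats plus walks, hit by pitches, and sacrifice flies. A small sketch of that calculation follows; the function name and the stat line are hypothetical, made up for illustration rather than taken from any real player.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = ab + bb + hbp + sf
    if denominator == 0:
        raise ValueError("OBP is undefined with no qualifying plate appearances")
    return (h + bb + hbp) / denominator

# Hypothetical season line (not a real player's numbers).
obp = on_base_percentage(h=150, bb=60, hbp=5, ab=500, sf=5)
print(f"{obp:.3f}")
```

Because the numerator counts successes and the denominator counts opportunities, OBP behaves like a proportion, which is what makes it a natural fit for the Bayesian treatment of rates.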
