Fangraphs recently published an interesting dataset that measures defensive efficiency of fielders. For each player, the Inside Edge dataset breaks their opportunities to make plays into five categories, ranging from almost impossible to routine. It also records the proportion of times that the player successfully made the play. With this data, we can see how successful each player is for each type of play. I wanted to think of a way to combine these five proportions into one fielding metric.
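As a first pass (just a sketch with made-up category names and weights, not necessarily the metric developed in the post), one simple option is a weighted average of the five success rates that gives more credit for converting the harder plays:

```python
import pandas as pd

# Hypothetical Inside Edge-style success rates by play difficulty.
# The category names and all numbers here are made up for illustration.
players = pd.DataFrame(
    {"impossible": [0.00, 0.05],
     "remote":     [0.10, 0.20],
     "unlikely":   [0.35, 0.50],
     "even":       [0.60, 0.75],
     "routine":    [0.97, 0.99]},
    index=["Player A", "Player B"],
)

# Weight harder categories more heavily, then collapse the five
# proportions into a single weighted-average fielding score.
weights = pd.Series(
    {"impossible": 5.0, "remote": 4.0, "unlikely": 3.0, "even": 2.0, "routine": 1.0}
)
weights /= weights.sum()

players["fielding_score"] = players[weights.index].mul(weights).sum(axis=1)
print(players["fielding_score"])
```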
Trevor Hastie and Rob Tibshirani are currently teaching a MOOC covering an introduction to statistical learning. I am very familiar with most of the material in the course, having read Elements of Statistical Learning many times over.
One great thing about the class, however, is that they are truly experts and have collaborated with many of the influential researchers in their field. Because of this, when covering certain topics, they have included interviews with statisticians who made important contributions to the field.
CD1025’s Playlist and Summerfest

Last time, I showed you how to download CD1025’s playlist going back to last year, and some exploratory analysis revealed a few gaps in the data. Using that data, I would like to look at the artists playing in this week’s Summerfest. Summerfest is one of the biggest shows that CD1025 puts on every year and is hyped quite a bit on the station.
Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now.
Since the NCAA Men's Basketball Tournament moved to 64 teams, a 16 seed has never upset a 1 seed. You might be tempted to say that the probability of such an upset must therefore be 0. But we know better than that.
In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases.
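As a concrete example of one such approach (a sketch on simulated game results, not the analysis from the post), a logistic regression of the outcome on the seed difference pools information across all matchups, so a pairing with no historical upsets still gets a nonzero upset probability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated tournament games: seed difference (e.g., 15 for a 1-vs-16 game)
# and whether the better seed won. The data are purely illustrative.
seed_diff = rng.integers(1, 16, size=500)
true_prob = 1 / (1 + np.exp(-0.25 * seed_diff))
better_seed_won = rng.binomial(1, true_prob)

# The fitted curve smooths win probability across seed differences instead of
# relying on the raw win-loss record of any single matchup.
model = LogisticRegression().fit(seed_diff.reshape(-1, 1), better_seed_won)

p_win = model.predict_proba(np.array([[15]]))[0, 1]
print(f"Estimated P(1 seed beats 16 seed): {p_win:.3f}")
print(f"Estimated P(16 seed is upset):     {1 - p_win:.3f}")
```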
The famous probabilist and statistician Persi Diaconis wrote an article not too long ago about the "Markov chain Monte Carlo (MCMC) Revolution." The paper describes how a diverse set of problems can be solved with MCMC. The first example he gives is a text decryption problem solved with a simple Metropolis-Hastings sampler.
I was always stumped by those cryptograms in the newspaper and thought it would be pretty cool if I could crack them with statistics.
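The heart of the sampler Diaconis describes looks roughly like the sketch below (a toy version, not the exact code I ended up with): score a candidate key by the bigram log-likelihood of its decryption under a reference text, propose swapping two letters in the key, and accept or reject the swap with the usual Metropolis rule.

```python
import math
import random
import string

random.seed(1)
ALPHABET = string.ascii_lowercase + " "

def bigram_log_probs(reference_text):
    """Log P(next char | current char) with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    text = "".join(c for c in reference_text.lower() if c in ALPHABET)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {a: {b: math.log(counts[a][b] / sum(counts[a].values()))
                for b in ALPHABET}
            for a in ALPHABET}

def log_score(ciphertext, key, log_probs):
    """Bigram log-likelihood of the ciphertext decoded with `key`.

    Assumes the ciphertext contains only lowercase letters and spaces.
    """
    decoded = ciphertext.translate(str.maketrans(ALPHABET, key))
    return sum(log_probs[a][b] for a, b in zip(decoded, decoded[1:]))

def metropolis_decrypt(ciphertext, log_probs, iters=20000):
    key = list(ALPHABET)
    current = log_score(ciphertext, "".join(key), log_probs)
    for _ in range(iters):
        i, j = random.sample(range(len(key)), 2)      # propose swapping two letters
        key[i], key[j] = key[j], key[i]
        proposed = log_score(ciphertext, "".join(key), log_probs)
        # Metropolis rule: always accept an improvement, accept a worse key
        # with probability exp(proposed - current).
        if random.random() >= math.exp(min(0.0, proposed - current)):
            key[i], key[j] = key[j], key[i]           # reject: undo the swap
        else:
            current = proposed
    return ciphertext.translate(str.maketrans(ALPHABET, "".join(key)))
```

For the scores to be informative the reference text needs to be long (the example in Diaconis's paper builds its letter-pair counts from War and Peace), and in practice it helps to run several chains from random starting keys and keep the best-scoring one.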
Factor Analysis of Baseball's Hall of Fame Voters
Random forests™ are great. They are one of the best "black-box" supervised learning methods. If you have lots of data and lots of predictor variables, you could do a lot worse than random forests. They can deal with messy, real data, and if there are lots of extraneous predictors, they have no problem. They automatically do a good job of finding interactions as well, and they make no assumption that the response has a linear (or even smooth) relationship with the predictors.
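To illustrate (a sketch on simulated data, not an example from the original post): a random forest fit to a nonlinear response with an interaction and twenty pure-noise predictors, next to an ordinary linear model with no feature engineering.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Two informative predictors with a nonlinear interaction, plus 20 noise columns.
n = 2000
X = rng.normal(size=(n, 22))
y = np.sin(X[:, 0]) * X[:, 1] + 0.5 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
lm = LinearRegression().fit(X_train, y_train)

# The forest finds the nonlinearity and the interaction while ignoring the
# noise columns; the untransformed linear model cannot.
print("random forest R^2:", round(rf.score(X_test, y_test), 3))
print("linear model R^2: ", round(lm.score(X_test, y_test), 3))
```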