Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now.
Since the NCAA Men's Basketball Tournament expanded to 64 teams, a 16 seed has never upset a 1 seed. You might be tempted to say that the probability of such an event must therefore be 0. But we know better than that.
In this post, I am interested in looking at different ways of estimating how the odds of winning a game change as the difference between seeds increases.
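To see why zero observed upsets does not force an estimate of zero, here is a minimal sketch of one possible estimator (not necessarily the one used in this post): the posterior mean of the upset probability under a Beta prior. The function name `upset_probability`, the uniform Beta(1, 1) prior, and the game count of 100 are all assumptions made for illustration; the game count in particular is a placeholder, not the real number of 1-versus-16 games.

```python
def upset_probability(n_games, n_upsets=0, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the upset probability under a Beta(prior_a, prior_b) prior.

    With a binomial likelihood the posterior is
    Beta(prior_a + n_upsets, prior_b + n_games - n_upsets),
    whose mean is (prior_a + n_upsets) / (prior_a + prior_b + n_games).
    """
    return (n_upsets + prior_a) / (n_games + prior_a + prior_b)


# n_games=100 is a placeholder value, not the actual number of games played.
estimate = upset_probability(n_games=100)
print(estimate)
```

Even with no upsets in the data, the estimate is small but strictly positive, and it shrinks toward 0 only as more upset-free games accumulate.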
Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods based on asymptotic normal theory that can help you decide whether to add or remove one variable at a time. The problem is that this is a greedy approach, so you can easily get stuck in a local optimum. Another approach is to look at all possible subsets of the variables and see which one maximizes an objective function (accuracy on a test set, for example).
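The exhaustive approach can be sketched as follows. This is an illustration under stated assumptions, not the code behind this post: the function name `best_subset` is hypothetical, the data are synthetic, the model is ordinary least squares, and the objective is test-set mean squared error (minimized) rather than accuracy.

```python
from itertools import combinations

import numpy as np


def best_subset(X_train, y_train, X_test, y_test):
    """Try every non-empty subset of columns; return (best_cols, best_mse)."""
    n_features = X_train.shape[1]
    best_cols, best_mse = None, np.inf
    for k in range(1, n_features + 1):
        for cols in combinations(range(n_features), k):
            idx = list(cols)
            # Prepend an intercept column, then fit ordinary least squares.
            A_train = np.column_stack([np.ones(len(X_train)), X_train[:, idx]])
            coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
            # Score the fitted model on the held-out test set.
            A_test = np.column_stack([np.ones(len(X_test)), X_test[:, idx]])
            mse = float(np.mean((y_test - A_test @ coef) ** 2))
            if mse < best_mse:
                best_cols, best_mse = cols, mse
    return best_cols, best_mse


# Synthetic data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

best_cols, best_mse = best_subset(X[:150], y[:150], X[150:], y[150:])
print(best_cols, best_mse)
```

With p variables this fits 2^p - 1 models, so it is only practical when p is small. That cost is the price of avoiding the greedy search.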
I was having some fun with PITCHf/x data and generalized additive models. PITCHf/x keeps track of the trajectory, path, and location of every pitch in MLB. It is pretty accurate and opens up baseball to more analyses than ever before. Generalized additive models (GAMs) are statistical models that put minimal assumptions on the type of model you are fitting. Traditional statistical models are linear, in that they assume that the response variable you are modelling is a linear function of the explanatory variables.
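The contrast between the two kinds of model can be written down in their standard textbook forms. The symbols here are notation introduced for illustration: $y$ is the response, $x_1, \dots, x_p$ are the explanatory variables, $g$ is a link function, and the $f_j$ are smooth functions estimated from the data. A linear model assumes

$$
E[y] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p ,
$$

while a GAM replaces each linear term with an unspecified smooth function:

$$
g\bigl(E[y]\bigr) = \beta_0 + f_1(x_1) + \cdots + f_p(x_p) .
$$

The linear model is the special case where $g$ is the identity and each $f_j(x_j) = \beta_j x_j$.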