Posts

I am planning to finish school soon and I would like to shed some weight before moving on. I have collected a fair number of books that …

Time lapses are a fun way to quickly show a long period of time. They typically involve setting up your camera on a tripod and taking …

Fangraphs recently published an interesting dataset that measures defensive efficiency of fielders. For each player, the Inside Edge …

Trevor Hastie and Rob Tibshirani are currently teaching a MOOC covering an introduction to statistical learning. I am very familiar …

In a previous post, I showed you how to scrape playlist data from Columbus, OH alternative rock station CD102.5. Since it's the end of …

CD1025’s Playlist and Summerfest: Last time, I showed you how to download CD1025’s playlist back to last year and did some …

CD1025 is an “alternative” radio station here in Columbus. They are one of the few remaining radio stations that are independently …

Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now. Since the …

As a grad student, I do lots of searches for research related to my own. When I am off campus, a lot of the relevant results are not …

A lot of times we are given a data set in Excel format and we want to run a quick analysis using R's functionality to look at advanced …

The famous probabilist and statistician Persi Diaconis wrote an article not too long ago about the "Markov chain Monte Carlo (MCMC) …

Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal components). An RBM is a probabilistic and …

Factor Analysis of Baseball's Hall of Fame Voters

With the election nearly upon us, I wanted to share an easy way I just found to download polling data and graph a few with ggplot2. …

Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical …

Introduction: Matrix factorization has been proven to be one of the best ways to do collaborative filtering. The most common example of …

I have been toying around with Kaggle's Million Song Dataset Challenge recently because I have some interest in collaborative filtering …

Random forests™ are great. They are one of the best "black-box" supervised learning methods. If you have lots of data and lots of …

Forgive me if you are already aware of this, but I found it quite alarming. I know that most code is interpreted by the computer in …

I was having some fun with PITCHf/x data and generalized additive models. PITCHf/x keeps track of the trajectory, path, location of …

Don't you hate it when you are running a long piece of code and you keep checking the results every 15 minutes, hoping it will finish? …

Recently, Chris Perez, the closer for the Indians, displayed some frustration with the fans for not supporting the team. Currently, …

After signing a huge deal with the Angels, Pujols has been having a really bad year. He hasn't hit a home run this year, breaking a …

Correlation matrices are a common way to look at the dependence of a set of variables. When the variables have spatial relationships, …

That title is quite a mouthful. This quarter, I have been reading papers on Spectral Clustering for a reading group. The basic goal of …

I am a big fan of SAS's JMP software. It is the first statistical program I learned and I really like how they emphasize visualization. …

I guess you could call this On Bayes Percentage. *cough* Fresh off learning Bayesian techniques in one of my classes last quarter, I …

Continuing my series of trying to figure out which team is best to pick for survival football and then ignoring it, I present my week 3 …

So this is late, but I already did the analysis and I wanted to share my results for posterity. I used the same method as last time to …

The NFL season is starting tomorrow night and I am in a survival league this year. If you are not familiar, in a survival league, each …

If the past is a predictor of future performance, then there is about a 99.3% chance that I will stop updating this in 2 weeks. But you …