I have recently been experimenting with Kaggle's Million Song Dataset Challenge because I am interested in collaborative filtering, specifically via matrix factorization. I haven't made much progress in the competition (all three of my submissions are below the baseline), but I have learned a few things about dealing with large amounts of data.
The goal of the competition is to predict the 500 most likely songs each of 110,000 users will listen to next.
Forgive me if you are already aware of this, but I found it quite alarming. I know that computers store numbers in binary while we enter them in decimal, so the conversion can introduce floating-point rounding errors. But the example I have below is so simple that it really surprised me.
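As a generic illustration of that decimal-to-binary conversion issue (written in Python purely for illustration, and separate from the R and MATLAB case described here), a minimal sketch might look like this. The variable names are made up for the example.

```python
import math
from decimal import Decimal

# 0.1 has no finite binary expansion, so the stored double is only the
# nearest representable value. Decimal(float) shows that stored value exactly.
stored_tenth = Decimal(0.1)
print(stored_tenth)  # not exactly the decimal literal 0.1

# The small representation errors can surface after simple arithmetic.
total = 0.1 + 0.2
exact_equal = total == 0.3               # False with IEEE 754 doubles
close_enough = math.isclose(total, 0.3)  # True: comparison with a tolerance

print(exact_equal, close_enough)
```

Comparing with a tolerance instead of `==` is the usual way around this. R and MATLAB both use double precision by default, so the same caveat applies there.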
I was converting a function from R into MATLAB so that a colleague could use it.