Random forests™ are great. They are one of the best "black-box" supervised learning methods. If you have lots of data and lots of predictor variables, you could do worse than random forests. They handle messy, real data well. If there are lots of extraneous predictors, they have no problem. They also automatically do a good job of finding interactions, and they make no assumption that the response has a linear (or even smooth) relationship with the predictors.
As noted in the Wikipedia article, they do lack some interpretability. But what they lack in interpretability, they more than make up for in predictive power, which I believe is much more important than interpretation in most cases. Even though you cannot easily tell how one variable affects the prediction, you can easily create a partial dependence plot which shows how the response will change as you change the predictor. You can also do this for two variables at once to see the interaction of the two.
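For example, here is a minimal sketch of a one-variable partial dependence plot with the randomForest package, assuming rf1 and df are the fitted forest and data frame from the simulation code later in this post (for two variables at once you would need something like the pdp package, which I do not show here):
# one-variable partial dependence: how the predicted response changes with x1,
# averaging over the values of the other predictors
library(randomForest)
partialPlot(rf1, pred.data = df, x.var = "x1")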
Also helping with interpretation is that random forests can output a list of predictor variables that they believe to be important in predicting the outcome. If nothing else, you can subset the data to include only the most "important" variables and use those with another model. The randomForest package in R has two measures of importance. One is the "total decrease in node impurities from splitting on the variable, averaged over all trees." I do not know much about this one and will not talk about it further. The other is based on a permutation test. The idea is that if the variable is not important (the null hypothesis), then rearranging its values will not degrade prediction accuracy. Random forests use the out-of-bag (OOB) samples to measure that prediction accuracy.
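To make the permutation idea concrete, here is a rough sketch that permutes each predictor in turn and records the increase in mean squared error. For simplicity it predicts on the full training data; the measure in randomForest actually does this tree by tree on the OOB samples. The names rf1 and df refer to the simulation code later in the post.
# rough sketch of permutation importance (not the exact OOB calculation)
perm_importance <- function(rf, data, response) {
  base_mse <- mean((predict(rf, data) - data[[response]])^2)
  vars <- setdiff(names(data), response)
  sapply(vars, function(v) {
    shuffled <- data
    shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and the response
    mean((predict(rf, shuffled) - data[[response]])^2) - base_mse
  })
}
perm_importance(rf1, df, "y")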
In my experience, this measure does a pretty good job of finding the most important predictors, but it has issues with correlated predictors. For example, I was working on a problem where I was predicting the price at which electricity trades. One feature that I knew would be very important was the amount of electricity being used at that same time. But I thought there might also be a relationship between price and the electricity being used a few hours before and after. When I ran the random forest with these variables, the electricity used 1 hour after was found to be more important than the electricity used at the same time. Yet when I included the 1-hour-after electricity use instead of the current-hour use, the cross-validation (CV) error increased, and using both did not significantly change the CV error compared to using just the current hour. Because the electricity used at the current hour and at the hour after are highly correlated, the importance measure had trouble telling which one mattered more. In truth, given the electricity use at the current hour, the electricity use at the hour after did not improve the predictive accuracy.
Why does the importance measure give extra weight to correlated predictors? Strobl et al. give some intuition for what is happening and propose a solution. Basically, the permutation test is ill-posed: it tests whether the variable is independent of the response as well as of all the other predictors. Since correlated predictors are obviously not independent of each other, they get high importance scores. Strobl et al. propose a permutation test in which you condition on the correlated predictors. This is a little tricky when the correlated predictors are continuous, but you can read the paper for the details.
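To get a feel for what conditioning means, here is an illustration (not party's actual algorithm, which conditions on the split points used in the trees): permute x2 only within bins of x1, so the permuted x2 keeps its correlation with x1 but loses any extra association with the response. The variables are simulated the same way as in the code below.
# illustration of a conditional permutation: shuffle x2 within quantile bins of x1
x1 <- rnorm(1000)
x2 <- rnorm(1000, x1, 1)
bins <- cut(x1, breaks = quantile(x1, probs = seq(0, 1, 0.2)), include.lowest = TRUE)
x2_perm <- ave(x2, bins, FUN = sample)
cor(x1, x2)       # about 0.7
cor(x1, x2_perm)  # still strongly correlated after the within-bin shuffle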
Another way to think of it: since each split considers only a subset of the candidate variables, a variable that is correlated with an "important" variable may be considered for a split when the "important" variable is not. The correlated variable then gets selected for the split. It does hold some predictive value, but only because of the truly important variable, so it is understandable that this procedure would rate it as important.
I ran a simulation experiment (similar to the one in the paper) to demonstrate the issue. I simulated 5 predictor variables. Only the first is related to the response, but the second has a correlation of about 0.7 with the first. Luckily, Strobl et al. included their version of importance in the party package in R. Below I compare the variable importance from the randomForest package with the importance from the party package, computed both with and without taking correlated predictors into account.
# simulate the data: only x1 drives the response; x2 is correlated with x1 (about 0.7)
x1 <- rnorm(1000)
x2 <- rnorm(1000, x1, 1)
y <- 2 * x1 + rnorm(1000, 0, 0.5)
df <- data.frame(y, x1, x2, x3 = rnorm(1000), x4 = rnorm(1000), x5 = rnorm(1000))
# run the randomForest implementation
library(randomForest)
rf1 <- randomForest(y ~ ., data = df, mtry = 2, ntree = 50, importance = TRUE)
importance(rf1, type = 1)  # permutation-based importance
# run the party implementation
library(party)
cf1 <- cforest(y ~ ., data = df, control = cforest_unbiased(mtry = 2, ntree = 50))
varimp(cf1)                      # unconditional permutation importance
varimp(cf1, conditional = TRUE)  # conditional permutation importance
For randomForest, the ratio of the importance of the first variable to that of the second is 4.53. For party without accounting for correlation it is 7.35, and accounting for correlation it is 369.5. Higher ratios are better because they mean the importance of the first variable stands out more clearly. party's conditional implementation is clearly doing its job.
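For reference, the ratios came from dividing the first variable's importance by the second's, along these lines (the exact numbers will vary from run to run since no seed was set):
imp <- importance(rf1, type = 1)
imp["x1", ] / imp["x2", ]        # randomForest
vi <- varimp(cf1)
vi["x1"] / vi["x2"]              # party, unconditional
vic <- varimp(cf1, conditional = TRUE)
vic["x1"] / vic["x2"]            # party, conditional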
There is a downside: it takes much longer to calculate the conditional importance than the unconditional one. For the party package in this example, it took 0.39 seconds without conditioning and 204.34 seconds with it. I could not even run the conditional importance on the electricity price example. There might be a research opportunity in finding a quicker approximation.
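The timings came from wrapping the calls in system.time; your numbers will depend on your machine:
system.time(varimp(cf1))                       # unconditional
system.time(varimp(cf1, conditional = TRUE))   # conditional, much slower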
Possibly up next: confidence limits for random forest predictions.
1)"we need to choose m number of variables randomly for each node.."-can u pleas explain it..
OR
2)"take 1 bootstrap sample,choose some variables and create a decision tree"-is it correct??
Than you very much.
"Error in model.matrix.default(as.formula(f), data = blocks) :
allocMatrix: too many elements specified"
I'm not sure how to deal with it. I'm predicting a 2-level factor from 23 continuous predictors (about 2,200 data points). I know it's a lot to work with, but I'm not sure how to get correct varimp values with this large data set.
I tried increasing the threshold, but it didn't help: fit.varimp <- varimp(fit.cf, threshold = 0.8, conditional = TRUE)
Do you have any suggestions? Thanks.
What it is doing isn't all that complicated. Say you want a partial dependence plot for the variable X_1. For each value X_1 = x that you want to plot, you take the average of the predictions with X_1 set to x and the other explanatory variables left at the n values they take in the data set. You are trying to average out the other variables.
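In code, using the rf1 forest and df data frame from the post above, that averaging looks roughly like this:
# manual partial dependence for x1: fix x1 at each grid value, leave the other
# columns as they are, predict, and average the predictions
grid <- seq(min(df$x1), max(df$x1), length.out = 50)
pd <- sapply(grid, function(x) {
  newdata <- df
  newdata$x1 <- x
  mean(predict(rf1, newdata))
})
plot(grid, pd, type = "l", xlab = "x1", ylab = "average prediction")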
I have a question for you: In your text you said:
"""
... Even though you cannot easily tell how one variable affects the prediction, you can easily create a partial dependence plot which shows how the response will change as you change the predictor. You can also do this for two variables at once to see the interaction of the two.
"""
Could you please provide an example of how to build the partial dependence plot?
Thank you for the post, and thanks in advance.