Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods, based on asymptotic normal theory, that can help you decide whether to add or remove one variable at a time. The problem is that this is a greedy approach, and it can easily get stuck at a local optimum. Another approach is to look at all possible subsets of the variables and see which one maximizes an objective function (accuracy on a test set, for example).
Heuristics are required if the number of variables, p, gets large (say, more than 40), because the number of possible subsets is 2^p, even excluding interactions. One method that works well is called tabu search (TS). TS is a more general optimization method, but in the case of finding the best subset it is similar to stepwise methods: at each step, it tries to improve the objective function as much as possible by adding or removing one variable. One difference is that TS will not add or remove variables that have been recently added or removed, which keeps the search from getting stuck at a local maximum. There are more details to the algorithm that you can read up on if you would like.
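To make the mechanics concrete, here is a minimal sketch of that loop for subset selection. This is not the tabuSearch package's implementation; objective and n.vars are placeholders for a function that scores a 0/1 inclusion vector and for the number of candidate variables.

# Minimal sketch of tabu search over variable subsets (illustrative only).
# `objective` scores a 0/1 inclusion vector; `n.vars` is the number of variables.
tabu.sketch <- function(objective, n.vars, iters = 20, tabu.len = 5) {
  config <- rbinom(n.vars, 1, 0.5)          # random starting subset
  tabu <- rep(0, n.vars)                    # iterations each variable stays tabu
  best <- config
  best.val <- objective(config)
  for (it in 1:iters) {
    # score every allowed single-variable flip (add or drop one variable)
    cand.val <- rep(-Inf, n.vars)
    for (j in which(tabu == 0)) {
      cand <- config
      cand[j] <- 1 - cand[j]
      cand.val[j] <- objective(cand)
    }
    j.star <- which.max(cand.val)           # take the best non-tabu move,
    config[j.star] <- 1 - config[j.star]    # even if it does not improve
    tabu <- pmax(tabu - 1, 0)
    tabu[j.star] <- tabu.len                # the flipped variable becomes tabu
    if (cand.val[j.star] > best.val) {
      best <- config
      best.val <- cand.val[j.star]
    }
  }
  list(best = best, value = best.val)
}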
There is a tabuSearch package in R that implements this algorithm. I wanted to find a best subset regression using generalized additive models (GAMs) from the mgcv package. An issue arose because, to use mgcv, you need to specify in the formula which terms are smooth and which are not. The general form of the formula looks like:
gam(y ~ s(x1) + s(x2) + x3, data=train)
where x1 and x2 are smooth terms and x3 is treated like it is in a linear model.
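For a runnable illustration of that syntax, here is a tiny example on simulated data (the variable names x1, x2, x3 are just placeholders, not the housing variables):

library(mgcv)
set.seed(1)
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n),
                  x3 = factor(sample(0:1, n, replace = TRUE)))
toy$y <- sin(2*pi*toy$x1) + toy$x2^2 + (toy$x3 == 1) + rnorm(n, sd = 0.3)
fit <- gam(y ~ s(x1) + s(x2) + x3, data = toy)   # x1, x2 smooth; x3 a plain factor term
summary(fit)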
My data set has a combination of continuous variables, which I want to treat as smooth, and categorical variables, which I want to treat as factors. For each variable in the subset, I need to identify whether it is a factor or not, and then create a string of the variable name with or without s().
For example, th is a vector of 0's and 1's that indicates whether each variable (column) is in the regression. I used the housing data set from the UCI ML repository. With 13 explanatory variables, there are 8192 possible main-effects models. I am predicting MEDV, so I start the formula string with "MEDV ~". Then I loop through each element of th. If it is 1, I add that variable to the formula: I check whether it is a factor, and if so I just add the name of the variable plus a "+"; if it is continuous, I add the column name wrapped in "s()". Finally, I convert the string to a formula using as.formula() and plug that formula into the gam function.
num.cols = sum(th)
fo.str = "MEDV ~"
cum.cols = 0
for (i in 1:length(th)) {
  if (th[i] > 0) {
    # factors enter as-is; continuous variables are wrapped in s() for a smooth term
    if (is.factor(train[, i])) {
      fo.str = paste(fo.str, colnames(train)[i], sep = " ")
    } else {
      fo.str = paste(fo.str, " s(", colnames(train)[i], ")", sep = "")
    }
    cum.cols = cum.cols + 1
    # add a "+" between terms, but not after the last one
    if (cum.cols < num.cols) {
      fo.str = paste(fo.str, "+")
    }
  }
}
fo = as.formula(fo.str)
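As a quick check of what this produces: with the housing columns in the order used in the full code below, a hypothetical th that selects only CRIM, CHAS (a factor), and RM would build the following formula string.

th = c(1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0)   # hypothetical subset: CRIM, CHAS, RM
# running the loop above then gives
# fo.str == "MEDV ~ s(CRIM) + CHAS + s(RM)"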
For evaluating the subsets, I split the data into training, validation, and test sets. I fit each subset on the training set and measured the R-squared of its predictions on the validation set. The full code can be found below.
library(mgcv)
library(tabuSearch)

# http://archive.ics.uci.edu/ml/datasets/Housing
housing = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) = c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
housing$CHAS = as.factor(housing$CHAS)
housing$RAD = as.factor(housing$RAD)  # changed to a factor because it has only 9 unique values
summary(housing)

# split into training, validation, and test sets
set.seed(20120823)
cv = sample(nrow(housing))
train = housing[cv[1:300], ]
valid = housing[cv[301:400], ]
test = housing[cv[401:nrow(housing)], ]
ssto = sum((valid$MEDV - mean(valid$MEDV))^2)  # total sum of squares on the validation set

# objective function: validation R-squared of the GAM built from the 0/1 vector th
evaluate <- function(th) {
  num.cols = sum(th)
  if (num.cols == 0) return(0)
  fo.str = "MEDV ~"
  cum.cols = 0
  for (i in 1:length(th)) {
    if (th[i] > 0) {
      if (is.factor(train[, i])) {
        fo.str = paste(fo.str, colnames(train)[i], sep = " ")
      } else {
        fo.str = paste(fo.str, " s(", colnames(train)[i], ")", sep = "")
      }
      cum.cols = cum.cols + 1
      if (cum.cols < num.cols) {
        fo.str = paste(fo.str, "+")
      }
    }
  }
  # colnames(train)[c(th,0)==1]
  fo = as.formula(fo.str)
  gam1 <- gam(fo, data = train)
  pred1 <- predict(gam1, valid, se.fit = FALSE)
  sse <- sum((pred1 - valid$MEDV)^2, na.rm = TRUE)
  return(1 - sse/ssto)
}

res <- tabuSearch(size = ncol(train)-1, iters = 20, objFunc = evaluate, listSize = 5,
                  config = rbinom(ncol(train)-1, 1, .5), nRestarts = 4, verbose = TRUE)
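To pull the best subset back out of the result object, something along these lines should work (res$configKeep and res$eUtilityKeep hold the visited configurations and their objective values, the same slots used in the plotting code further down):

best.iter <- which.max(res$eUtilityKeep)                  # iteration with the highest validation R-squared
res$eUtilityKeep[best.iter]
colnames(train)[which(res$configKeep[best.iter, ] == 1)]  # variables included in that subset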
The search found a subset with a 0.8678 R-squared on the validation set (and 0.8349 on the test set). The formula found was:
MEDV ~ s(CRIM) + s(INDUS) + s(NOX) + s(RM) + s(DIS) + RAD + s(TAX) + s(PTRATIO) + s(LSTAT).
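The test-set number can be reproduced along the following lines: refit the reported formula on the training data, predict the held-out test set, and compare against the test set's total sum of squares (a sketch, assuming all RAD levels in the test set also appear in the training set).

fo.best <- MEDV ~ s(CRIM) + s(INDUS) + s(NOX) + s(RM) + s(DIS) + RAD +
  s(TAX) + s(PTRATIO) + s(LSTAT)
gam.best <- gam(fo.best, data = train)
pred.test <- predict(gam.best, test)
ssto.test <- sum((test$MEDV - mean(test$MEDV))^2)
1 - sum((pred.test - test$MEDV)^2)/ssto.test   # test-set R-squared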
Visualizing the results
The tabu search gave me the subset it thinks is best. But I would like to get a better handle on how it arrived at it, whether there were lots of models of similar quality but with different variables, and whether there were variables that were always in the top-performing models. I created a heat plot that shows whether or not each variable was in the regression at each iteration. Below you can see which variables were included in each iteration. A few variables seem to be more important because they are included in almost every iteration, like LSTAT, PTRATIO, and RM. But this doesn't tell me which iterations were the best.
In the chart below, I shaded each region by the rank of the model in that iteration; a higher rank means it did better. It is not easy to read, but we can glean a little more information from this chart. The models with RAD and DIS do significantly better than the models without them, even though these variables are not in every iteration. Further, the models with AGE seem a bit worse than those without it.
The R code to make these plots is below.
library(reshape)
library(ggplot2); theme_set(theme_bw())

# one row per iteration: which variables were in the model, plus its R-squared and rank
tabu.df = data.frame(res$configKeep)
colnames(tabu.df) = colnames(train)[1:(ncol(train)-1)]
tabu.df$Iteration = 1:nrow(tabu.df)
tabu.df$RSquared = res$eUtilityKeep
tabu.df$Rank = rank(tabu.df$RSquared)

# long format for ggplot; zero out R-squared/rank for variables not in the model
tabu.melt = melt(tabu.df, id = c("Iteration", "RSquared", "Rank"))
tabu.melt$RSquared = ifelse(tabu.melt$value == 1, tabu.melt$RSquared, 0)
tabu.melt$Rank = ifelse(tabu.melt$value == 1, tabu.melt$Rank, 0)

# heat maps of variable inclusion (0/1) and of model rank
(pHeat01 <- ggplot(tabu.melt, aes(Iteration, variable)) + geom_tile(aes(fill = value)) +
    scale_fill_gradient(low = "white", high = "steelblue", guide = FALSE))
(pHeatRank <- ggplot(tabu.melt, aes(Iteration, variable)) + geom_tile(aes(fill = Rank)) +
    scale_fill_gradient(low = "white", high = "steelblue"))
Nice approach with the heat map to go beyond just accepting the optimal solution. I agree that variables present in every iteration seem to be important, but this could perhaps also mean that the tabu search was trapped in a subregion; perhaps the move operator, which only adds or deletes a single variable, is not powerful enough to escape this region. Some possible remedies for exploring a larger part of the search landscape (if indeed this is a problem) might be:
- Multiple tabu searches, each with a randomized set of initial variables in the model
- Other move operators. How about finding correlation clusters among the variables, then letting a move add or delete an entire cluster? Both move operators can be used in the search, with some metaheuristic to guide them.
- Use metaheuristic techniques like "ruin and recreate" during the tabu search; occasionally, "ruin" the model by removing a random subset of the variables.
One more thought. With the tabu search already in place for regression, it wouldn't be too hard to explore some nonlinear models. You could create temp variables that are products of the original variables, then let the tabu search add or delete these as well. Do you think such an approach would have any merit in this case?
I described the meat of the TS algorithm in the post. If you want more info, I found this reference informative: http://www.math.ntua.gr/~fouskakis/Publications/4.pdf