Restricted Boltzmann Machines in R

Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal components). An RBM is a probabilistic and undirected graphical model. They are becoming more popular in machine learning due to recent success in training them with contrastive divergence. They have been proven useful in collaborative filtering, being one of the most successful methods in the Netflix challenge (paper). Furthermore, they have been tantamount to training deep learning models, which appear to be the best current models for image and digit recognition.

I do not want to go into too much detail, but I would like to give a high level overview of RBMs. Edwin Chen has an introduction that is much better. The usual version of an RBM is a probabilistic model for binary vectors. An image can be represented as a binary vector if each pixel that is dark enough is represented as a 1 and the non-dark pixels are 0's. In addition to the visible binary vector, it is assumed that there is a hidden binary vector, and each element of the hidden unit is connected to each unit of the visible unit by symmetric weights. The weights are represented by the matrix W, where the i,jth component is the weight between hidden unit i and visible unit j. It is important that there are no connections between hidden units or between visible units. The probability that visible unit j is 1 is the inverse logistic function of the sum of the weights connected to visible unit j, in which the hidden units are 1. In math notation (where σ is the inverse logistic, or sigmoid, function):

Pr(vj=1|h,W)=σ(ihiWi,j).

The weights are symmetric, so

Pr(hi=1|v,W)=σ(jvjWi,j).
As you can see, the model is defined by its conditional probabilities. The task is to find the weight matrix W that best explains the visible data for a given number of hidden units.

I have been taking Geoff Hinton's coursera course on neural networks, which I would recommend to anyone interested. One of the programming assignments was to fill in code to write an RBM in Matlab/Octave. I have since tried to find a version for R, but have not had any luck, so I decided to translate the code myself. Below is the code to calculate the weight matrix. It is fairly simple and I only use contrastive divergence 1. The input data is p×n instead of the usual transpose.

# rbm_w is a matrix of size <number of hidden units> by <number of visible units>
# visible_state is matrix of size <number of visible units> by <number of data cases>
# hidden_state is a binary matrix of size <number of hidden units> by <number of data cases>
 
visible_state_to_hidden_probabilities <- function(rbm_w, visible_state) {
1/(1+exp(-rbm_w %*% visible_state))
}
 
hidden_state_to_visible_probabilities <- function(rbm_w, hidden_state) {
1/(1+exp(-t(rbm_w) %*% hidden_state))
}
 
configuration_goodness <- function(rbm_w, visible_state, hidden_state) {
out=0
for (i in 1:dim(visible_state)[2]) {
out=out+t(hidden_state[,i]) %*% rbm_w %*% visible_state[,i]
}
out/dim(visible_state)[2]
}
 
configuration_goodness_gradient <- function(visible_state, hidden_state) {
hidden_state %*% t(visible_state)/dim(visible_state)[2]
}
 
sample_bernoulli <- function(mat) {
dims=dim(mat)
matrix(rbinom(prod(dims),size=1,prob=c(mat)),dims[1],dims[2])
}
 
cd1 <- function(rbm_w, visible_data) {
visible_data = sample_bernoulli(visible_data)
H0=sample_bernoulli(visible_state_to_hidden_probabilities(rbm_w, visible_data))
vh0=configuration_goodness_gradient(visible_data, H0)
V1=sample_bernoulli(hidden_state_to_visible_probabilities(rbm_w, H0))
H1=visible_state_to_hidden_probabilities(rbm_w, V1)
vh1=configuration_goodness_gradient(V1, H1)
vh0-vh1
}
 
rbm <- function(num_hidden, training_data, learning_rate, n_iterations, mini_batch_size=100, momentum=0.9, quiet=FALSE) {
# This trains a model that's defined by a single matrix of weights.
# <num_hidden> is the number of hidden units
# cd1 is a function that takes parameters <model> and <data> and returns the gradient (or approximate gradient in the case of CD-1) of the function that we're maximizing. Note the contrast with the loss function that we saw in PA3, which we were minimizing. The returned gradient is an array of the same shape as the provided <model> parameter.
# This uses mini-batches no weight decay and no early stopping.
# This returns the matrix of weights of the trained model.
n=dim(training_data)[2]
p=dim(training_data)[1]
if (n %% mini_batch_size != 0) {
stop("the number of test cases must be divisable by the mini_batch_size")
}
model = (matrix(runif(num_hidden*p),num_hidden,p) * 2 - 1) * 0.1
momentum_speed = matrix(0,num_hidden,p)
 
start_of_next_mini_batch = 1;
for (iteration_number in 1:n_iterations) {
if (!quiet) {cat("Iter",iteration_number,"\n")}
mini_batch = training_data[, start_of_next_mini_batch:(start_of_next_mini_batch + mini_batch_size - 1)]
start_of_next_mini_batch = (start_of_next_mini_batch + mini_batch_size) %% n
gradient = cd1(model, mini_batch)
momentum_speed = momentum * momentum_speed + gradient
model = model + momentum_speed * learning_rate
}
return(model)
}
Created by Pretty R at inside-R.org

I loaded the hand written digit data that was given in the class. To train the RBM, use the syntax below.

weights=rbm(num_hidden=30, training_data=train, learning_rate=.09, n_iterations=5000,
mini_batch_size=100, momentum=0.9)
Created by Pretty R at inside-R.org

After training the weights, I can visualize them. Below is a plot of strength of the weights going to each pixel. Each facet displays the weights going to/coming from one of the hidden units. I trained 30 units so that it would be easy to show them all on one plot. You can see that most of the hidden units correspond strongly to one digit or another. I think this means it is finding the joint structure of all of the pixels and that pixels representing those numbers are darkened together often.

library(ggplot2)
library(reshape2)
mw=melt(weights); mw$Var3=floor((mw$Var2-1)/16)+1; mw$Var2=(mw$Var2-1)%%16 + 1; mw$Var3=17-mw$Var3;
ggplot(data=mw)+geom_tile(aes(Var2,Var3,fill=value))+facet_wrap(~Var1,nrow=5)+
scale_fill_continuous(low='white',high='black')+coord_equal()+
labs(x=NULL,y=NULL,title="Visualization of Weights")+
theme(legend.position="none")
Created by Pretty R at inside-R.org

Comments

Andrew Landgraf
There are no biases in this implementation. You could make v_1=1, but there still wouldn't be biases in the hidden units.
sidharth
where are the biases ? have you included them in the weights ?
Andrew Landgraf
The weights are randomly initialized, so any 2 runs will give different results. However, if your figure looks similar to mine, in that it looks like the weights represent numbers, I would say it is working correctly.
Xiaolin Gui
Thank you for sharing your code.I found the output figure had different with yours when I do the experiment. I have load the MNIST image data successfully. And the data's format is 784 ×60000. The pixels have been binarization. >= 127——>1.
I hope you give me some advice.
email:guixiaolinde@gmail.com
Anonymous
I have managed to import the training data (from the Coursera website) into R:

library(R.matlab)
dat = readMat("data_set.mat")
......
weights=rbm(num_hidden=30, training_data= dat$data[,,1]$training[[1]], learning_rate=.09, n_iterations=5000,
mini_batch_size=100, momentum=0.9)
Kent Johnson
R code for assignment 4 was posted here:
https://class.coursera.org/neuralnets-2012-001/forum/thread?thread_id=87&post_id=5105
I used it as the basis for a work project.
Zachary Mayer
I believe there's also a kaggle competition using that data right now: http://www.kaggle.com/c/digit-recognizer
Andrew
I don't want to post it here because of possible copyright violations, but you can download it in Octave format from the Coursera site (in programming assignment 4) or you can find it at this webpage: http://yann.lecun.com/exdb/mnist/
Anonymous
Do you have a sample of training data?
Avatar
Andrew J. Landgraf
Data Scientist