Quick Post About Getting and Plotting Polls in R

With the election nearly upon us, I wanted to share an easy way I just found to download polling data and graph some of it with ggplot2. dlinzer on GitHub created a function to download poll data from the Huffington Post's Pollster API.

The default is to download national tracking polls from the presidential election. After sourcing the function, I load the required packages, download the data, and make the plot.
library(XML)
library(reshape)
library(ggplot2); theme_set(theme_bw())
 
dat <- pollstR(pages=20)
ggplot(dat, aes(end.date, Obama/(Obama+Romney))) +
  geom_point(alpha=.5) +
  geom_smooth(aes(weight=sqrt(N))) +
  geom_hline(aes(yintercept=0.5), lty=2, size=1) +
  labs(title="Proportion of Vote for Obama", x="Last Date of Poll", y=NULL)
I have used transparency so that you can see when there are many polls on top of each other. You can see that Obama's lead decreased substantially after the first debate but has crawled back up since then. Of course, I am treating all polls as equal (although I am weighting by the square root of the sample size) when the truth is that some polls are better than others and some are biased.
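To make the weighting concrete, here is a minimal base-R sketch of the two quantities used in the plot: the two-party share Obama/(Obama+Romney) and a weight of sqrt(N). The data frame `polls` and its numbers are made up purely for illustration; they are not real polls and do not come from the Pollster API.

```r
# Made-up polls for illustration only; not real polling data.
polls <- data.frame(
  Obama  = c(48, 50, 47, 49),
  Romney = c(46, 45, 48, 47),
  N      = c(400, 900, 1600, 2500)
)

# Two-party share: undecided and third-party responses drop out.
polls$share <- polls$Obama / (polls$Obama + polls$Romney)

# Weight each poll by the square root of its sample size,
# so a poll with 4x the respondents counts 2x as much.
w <- sqrt(polls$N)

wt.avg    <- weighted.mean(polls$share, w)  # sqrt(N)-weighted average
plain.avg <- mean(polls$share)              # every poll counted equally
```

Using sqrt(N) rather than N keeps very large polls from dominating the smoother; it is a judgment call, not the only reasonable choice.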

To have some more fun, I will show what some of the data from swing states look like. The code below loops through the swing states and downloads the polls. Then it plots the polls for each state in different facets.

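swing.states=c("ohio","florida","virginia","colorado","nevada","north-carolina")
for (s in swing.states) {
  print(s)
  # chart name for this state, e.g. "2012-ohio-president-romney-vs-obama"
  dat.state <- pollstR(chart=paste("2012-",s,"-president-romney-vs-obama",sep=""),pages="all")
  # keep only the columns shared by every state
  dat.state=subset(dat.state,select=c("id","pollster","start.date","end.date","method","N","Obama","Romney"))
  dat.state$State=s
 
  # start the combined data frame with the first state, then append the rest
  if (s==swing.states[1]) {
    dat=dat.state
  } else {
    dat=rbind(dat,dat.state)
  }
}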
 
library(lubridate)
dat$end.date=ymd(as.character(dat$end.date))
ggplot(dat, aes(end.date, Obama/(Obama+Romney))) +
  geom_point(alpha=.5) +
  geom_smooth(aes(weight=sqrt(N))) +
  geom_hline(aes(yintercept=0.5), lty=2, size=1) +
  labs(title="Proportion of Vote for Obama", x="Last Date of Poll", y=NULL) +
  facet_wrap(~State) +
  xlim(c(mdy("8/1/2012"), mdy("11/6/2012")))
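The loop above grows dat with rbind and special-cases the first state. An alternative sketch collects one data frame per state in a list and binds them once at the end. Note that `fetch_state` here is a hypothetical stand-in that returns made-up rows so the example runs offline; in the real code its place would be taken by the pollstR() call shown above.

```r
# Hypothetical stand-in for pollstR(); returns made-up rows, not real polls.
fetch_state <- function(chart) {
  data.frame(end.date = c("2012-09-01", "2012-10-01"),
             N = c(500, 800), Obama = c(48, 47), Romney = c(46, 47),
             stringsAsFactors = FALSE)
}

swing.states <- c("ohio", "florida", "virginia")

# Build every chart name in one vectorised call.
charts <- paste("2012-", swing.states, "-president-romney-vs-obama", sep = "")

# One data frame per state, each tagged with its state name.
state.list <- lapply(seq_along(swing.states), function(i) {
  d <- fetch_state(charts[i])
  d$State <- swing.states[i]
  d
})

# Bind all states together in a single step.
dat.all <- do.call(rbind, state.list)
```

This avoids depending on which state comes first and only calls rbind once, which matters little for six states but scales better.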

Unfortunately the x-axis didn't show up very well, but it starts at August 1. There have been quite a few polls in Ohio and Florida, haven't there? The state polls did not have nearly the same shift that the national poll did in reaction to the first debate. The state with the largest bump is Colorado, where the debate was held.

By just looking at the tracking polls, I think you would reach the same conclusions that Nate Silver has with his fancy model. Ohio, Virginia, Nevada, and Colorado favor Obama. North Carolina favors Romney and Florida just barely tips toward Romney as well.

Finally, here are just the smoothed running means, all on one plot. You can see that there was also a first debate effect in Ohio.

ggplot(dat, aes(end.date, Obama/(Obama+Romney))) +
  geom_smooth(aes(colour=State, weight=sqrt(N)), se=FALSE, size=2) +
  geom_hline(aes(yintercept=0.5), lty=2, size=1) +
  labs(title="Proportion of Vote for Obama", x="Last Date of Poll", y=NULL) +
  xlim(c(mdy("8/1/2012"), mdy("11/6/2012")))
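For anyone curious what the smoothed lines are doing, here is a rough base-R sketch of a weighted loess fit, which, as far as I know, is what geom_smooth falls back to by default for data sets of this size. The data are simulated for illustration only, and the span of 0.75 is an assumption based on the loess default rather than something taken from the plots above.

```r
set.seed(1)

# Simulated polls, for illustration only (not real data).
n <- 40
sim <- data.frame(
  day = sort(sample(1:90, n, replace = TRUE)),
  N   = sample(c(400, 600, 1000, 1500), n, replace = TRUE)
)
sim$share <- 0.51 + rnorm(n, sd = 0.015)

# Local regression of share on day, with each poll weighted by sqrt(N).
fit <- loess(share ~ day, data = sim, weights = sqrt(N), span = 0.75)

# Evaluate the smooth on an even grid inside the observed range.
grid <- data.frame(day = seq(min(sim$day), max(sim$day), length.out = 50))
grid$smooth <- predict(fit, newdata = grid)
```

Plotting grid$smooth against grid$day gives one smoothed line; the ggplot code above does the equivalent once per State.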

Comments

Andrew Clark
Interesting work. Thanks for the tip on the API/github. I have had a look at the GOP campaign - surely one of the strangest in history
http://wp.me/p17axt-jK
Andrew
Thanks for the tip. To add more information to the national tracker plot, you can make the size of the points relative to the number polled. You can also facet it by the type of poll. Like this:

ggplot(dat,aes(end.date,Obama/(Obama+Romney)))+
geom_point(aes(size=sqrt(N)),alpha=.5)+
geom_smooth(aes(weight=sqrt(N)))+
geom_hline(aes(yintercept=0.5),lty=2,size=1)+
labs(title="Proportion of Vote for Obama",x="Last Date of Poll",y=NULL)+
facet_wrap(~method)+
theme(axis.text.x=element_text(angle=-90))
Tony Hirst
You can tidy up the x-axis labels by rotating them, eg using something like:
+theme(axis.text.x=element_text(angle=-90))
Andrew J. Landgraf
Data Scientist