Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets, which are easy to access. But the majority of online data exists as web content such as blogs, news stories, and cooking recipes. With precompiled files, accessing the data is fairly straightforward: just download the file, unzip it if necessary, and import it into R. For “wild” data, however, getting the data into an analyzable format is more difficult. Accessing online data of this sort is sometimes referred to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package, make this task possible.

readLines

For basic webscraping tasks, the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure (http) servers. In its simplest form, readLines() takes a single argument: the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")
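Note that readLines() returns a character vector with one element per line of page source, which you can inspect before parsing:

# readLines() returns one element per line of page source
length(web_page)   # how many lines were retrieved
head(web_page)     # peek at the first few lines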

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserve for January 2009. Because the listserve is on a secure site (i.e. it has https:// rather than http:// in the URL) we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on this site.

One note: by itself, readLines() can only acquire the data. You'll need to use grep(), gsub(), or their equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")

# Pull out the appropriate line
author_lines <- web_page[grep("<I>", web_page)]

# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)

# Present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]

We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.

The RCurl package

To get more advanced HTTP features such as POST capabilities and https access, you'll need to use the RCurl package. To do webscraping tasks with the RCurl package, use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed; the htmlTreeParse() function from the XML package is tailored for just this task. Because getURL() can access secure sites, we can use the live archive as an example this time.

# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library("RCurl")

# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library("XML")

# Get the January 2009 archive
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)

# Parse the HTML; asText = TRUE tells htmlTreeParse that jan09 holds
# page content rather than a file name or URL
jan09_parsed <- htmlTreeParse(jan09, asText = TRUE)

# Continue on, similar to the above
...
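One way to continue (a sketch, not the original code): re-parse with useInternalNodes = TRUE so that XPath queries work, then pull out the author names. This assumes the live archive wraps author names in <I> tags, just as the static copy above did:

# A sketch of one possible continuation; assumes authors appear in <I> tags
jan09_doc <- htmlTreeParse(jan09, asText = TRUE, useInternalNodes = TRUE)
authors <- xpathSApply(jan09_doc, "//i", xmlValue)

# Present only the ten most frequent posters, as before
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]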

For basic webscraping tasks readLines() will be enough and avoids overcomplicating the task. For more difficult procedures or for tasks requiring other HTTP features, getURL() or other functions from the RCurl package may be required. For more information on cURL, visit the project page at https://curl.se/.
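As a brief illustration of the POST capability mentioned above, here is a minimal sketch using RCurl's postForm(); the endpoint and form fields are made up for the example:

# Hypothetical POST request with RCurl; the URL and fields below are
# placeholders, not a real service
result <- postForm("https://www.example.com/submit",
                   name = "R User",
                   email = "user@example.com")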

Helpful statistical references

In a previous article I provided a list of R programming resources. As a complement to that post, I’ve compiled below a list of statistically oriented websites that colleagues and I have found useful. For the most part, these sites focus on statistics and quantitative research methods rather than programming.

This first grouping lists sites that are mostly one-stop shops for research design and analytical information. The first two (and especially the UCLA website) are Tier I statistics/research methods sites. They are indispensable. The three remaining sites in this section cover less advanced topics and focus more on the basics, but may be helpful for the R user who is more programmer than statistician.

The second group of sites comprises technical references such as statistical dictionaries and notation guides. The final section lists two sites that have detailed information and examples focused on running statistical analyses in R. Note that the UCLA site also includes many examples using R.

Comprehensive coverage

Statistical computing at UCLA

Statnotes: Topics in Multivariate Analysis, by G. David Garson

Introductory Statistics: Concepts, models, and applications

Social Research Methods Knowledge Base

Wolfram MathWorld

Technical References

StatSoft statistical glossary

Glossary of technical notation

Dictionary of Algorithms and Data Structures

R-specific sites

Journal of Statistical Software

Quick-R

If you know of another site for either R programming or statistics that I’ve missed, mention it in the comments below and I’ll add it to the proper list.