Calling Python from R with rPython

Python has generated a good bit of buzz over the past year as an alternative to R. Personal biases aside, an expert makes the best use of the available tools, and sometimes Python is better suited to a task. As a case in point, I recently wanted to pull data via the Reddit API. There isn’t an R package that provides easy access to the Reddit API, but there is a very well designed and documented Python module called PRAW (or, the Python Reddit API Wrapper). Using this module I was able to develop a Python-based solution to get and analyze the data I needed without too much trouble.

However, I prefer working in R, so I was glad to discover the rPython package, which enables calling Python scripts from R. After finding rPython, I was able to rewrite my purely Python script as a primarily R-based program.

If you want to use rPython there are a couple of prerequisites you’ll need to address if you haven’t already. No surprise, you’ll need to have Python installed. After that, you’ll need to install the PRAW module via pip install praw. Finally, install the rPython package from CRAN. (But see the note below first if you’re on Windows.)

After you’ve completed those steps, it’s as easy as writing your Python script and adding a line or two to your R code.

First create a Python script that imports the praw module and does the first data call:

import praw

# Set the user agent information
# IMPORTANT: Change this if you borrow this code. Reddit has very strong 
# guidelines about how to report user agent information 
r = praw.Reddit('Check New Articles script based on code by ProgrammingR.com')
   
# Create a (lazy) generator that will get the data when we call it below
new_subs = r.get_new(limit=100)

# Get the data and put it into a usable format
new_subs=[str(x) for x in new_subs]

Since the Python session is persistent, we can also create a shorter Python script that we can use to fetch updated data without reimporting the praw module:

# Create a (lazy) generator that will get the data when we call it below
new_subs = r.get_new(limit=100)

# Get the data and create a list of strings
new_subs=[str(x) for x in new_subs]

Finally, some R code that calls the Python script and gets the data from the Python variables we create:

library(rPython)

# Load/run the main Python script
python.load('GetNewRedditSubmissions.py')

# Get the variable
new_subs_data <- python.get('new_subs')

# Load/run re-fetch script
python.load('RefreshNewSubs.py')

# Get the updated variable
new_subs_data <- python.get('new_subs')

head(new_subs_data)

A few final notes:

  • The main drawback to the rPython package is that it currently doesn’t run on Windows. The developer (Carlos J. Gil Bellosta) is working to fix this, though. If that wrinkle gets resolved, I can see this being a very popular package.
  • You can use RStudio to write your Python programs, which is easier than switching to another IDE for simple scripts. However, it causes an issue with EOL characters. Namely, you need to add a blank line at the end of each .py file to get it to load properly.
  • The Python session rPython initiates is associated with the R session. Any Python modules you load or variables you create will be available until you remove them or close the R session.
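The last point, that the Python session persists for the life of the R session, can be illustrated with a minimal sketch. It assumes rPython and a working Python installation are available (hence the guard), and the variable names values and doubled are made up for the example:

```r
# Sketch only: runs the Python calls only if rPython is installed
rpython_available <- requireNamespace("rPython", quietly = TRUE)

if (rpython_available) {
  library(rPython)

  # Create a Python variable from R, then use it on the Python side
  python.assign("values", c(1, 2, 3))
  python.exec("doubled = [v * 2 for v in values]")

  # A later call still sees both variables because the session persists
  print(python.get("doubled"))

  # Remove a variable explicitly once it is no longer needed
  python.exec("del values")
}
```

Anything not deleted this way stays in memory until the R session ends, which is why the shorter refresh script above can reuse r without importing praw again.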

Creating your personal, portable R code library with GitHub

Note: I sold ProgrammingR.com and its online assets in 2015. I’m posting this for posterity, but no longer control access to the GitHub account.

As I discussed in a previous post, I have a few helper functions I’ve created that I commonly use in my work. Until recently, I manually included these functions at the start of my R scripts by either the tried-and-true copy-and-paste method, or by extracting them from a local file with the source() function. The former approach has the benefit of keeping the helper code inextricably attached to the main script, but it adds a good bit of code to wade through. The latter approach keeps the code cleaner, but requires that whoever is running the code always has access to the sourced file and that it is always in the same relative path – and that makes sharing or moving code more difficult. The start of a recent project requiring me to share my helper function library prompted me to find a better solution.

The resulting approach takes advantage of GitHub Gists and R’s ability to source via a web-based location to enable you to create a personal, portable library of R functions for private use or to share.

The process is very straightforward. If you don’t have a GitHub account, getting one is the first step. After doing so, click on “Gist” in the menu bar to create a new Gist. (There are a couple of reasons for using a Gist instead of a formal repository. For one, repositories cannot be made private unless you have a paid GitHub membership. Gists, however, can be created as “Secret” and therefore available only to people who have the exact URL.) Name your Gist, add your code, and choose whether you want it to be a secret or public Gist to save that revision. After saving your Gist you will be taken to a screen displaying your new Gist.

Now that you’ve created your Gist, you just need to source it in your R code. That step is easily done via the source() function. Rather than including a file path as you would with a local file, you simply include the URL to your Gist. Note that you can’t just copy the URL of the GitHub page you’re currently on. You need to source the URL of the raw code. To get that URL, click on the <> button on the top right of your Gist (circled in red below). This will take you to the raw code for the current revision. To get the path that always points to the most recent revision, remove everything in the URL after “/raw/”.

[Figure: GitHub Gist page with the <> button circled in red]

Regardless of which version you link to, you’ll still be left with a long URL that includes a randomly generated alphanumeric string that you don’t want to have to try to remember. Here’s a pro tip: use a URL shortener that lets you define a custom URL (e.g., tinyurl) to create a shortened version that is easy to remember. Something like the little library at http://tinyurl.com/ProgRLib (YES! You can use and add to this library!).

Now you’ve got a portable, easy to reference, personal library of functions that you can easily include in your code or share with others.
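As a rough sketch of the sourcing step, the example below uses a temporary local file as a stand-in for the Gist so that it runs without a network connection. The helper pctChange and the environment name helpers are invented for illustration, and the commented URL is a placeholder pattern rather than a working address:

```r
# Stand-in for a Gist: write a tiny helper library to a temporary file
lib_path <- tempfile(fileext = ".R")
writeLines("pctChange <- function(old, new) (new - old) / old * 100", lib_path)

# Load the helpers into their own environment so they don't clutter the workspace
helpers <- new.env()
source(lib_path, local = helpers)

# Call a helper from the sourced library
print(helpers$pctChange(50, 75))

# With a real Gist the call has the same shape, only with the raw URL
# (placeholder, not a working address):
# source("https://gist.githubusercontent.com/<user>/<gist-id>/raw/", local = helpers)
```

Sourcing into a separate environment is optional; calling source() with just the path or URL puts the functions in the global workspace, which is what the post describes.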

Global Indicator Analyses with R

I was recently asked by a client to create a large number of “proof of concept” visualizations that illustrated the power of R for compiling and analyzing disparate datasets. The client was specifically interested in automated analyses of global data. A little research led me to the WDI package.

The WDI package is a tool to “search, extract and format data from the World Bank’s World Development Indicators” (WDI help). In essence, it is an R-based wrapper for the World Bank Economic Indicators Data API. When used in combination with the information on the World Bank data portal it provides easy access to thousands of global datapoints.

Here is an example use case that illustrates how simple and easy it is to use, especially with a little help from the countrycode and ggplot2 packages:

library(WDI)
library(ggplot2)
library(countrycode)

# Use the WDIsearch function to get a list of fertility rate indicators
indicatorMetaData <- WDIsearch("Fertility rate", field="name", short=FALSE)

# Define a list of countries for which to pull data
countries <- c("United States", "Britain", "Sweden", "Germany")

# Convert the country names to iso2c format used in the World Bank data
iso2cNames <- countrycode(countries, "country.name", "iso2c")

# Pull data for each country for the first two fertility rate indicators, for the years 2001 to 2011
wdiData <- WDI(iso2cNames, indicatorMetaData[1:2,1], start=2001, end=2011)

# Pull out indicator names
indicatorNames <- indicatorMetaData[1:2, 1]

# Create trend charts for the first two indicators
for (indicatorName in indicatorNames) { 
  pl <- ggplot(wdiData, aes(x=year, y=wdiData[,indicatorName], group=country, color=country))+
    geom_line(size=1)+
    scale_x_continuous(name="Year", breaks=c(unique(wdiData[,"year"])))+
    scale_y_continuous(name=indicatorName)+
    scale_linetype_discrete(name="Country")+
    theme(legend.title=element_blank())+
    ggtitle(paste(indicatorMetaData[indicatorMetaData[,1]==indicatorName, "name"], "\n"))
  # Save each chart as <indicator code>.jpg (empty separator so the name has no stray characters)
  ggsave(paste(indicatorName, ".jpg", sep=""), pl)
}

[Figures: WDI package visualization 1; WDI package visualization 2]

This code can be adapted to quickly pull and visualize many pieces of data. Even if you don’t have an analytic need for the WDI data, the ease of access and depth of information available via the WDI package make them perfect for creating toy examples for classes, presentations or blogs, or conveying the power and depth of available R packages.

SPARQL with R in less than 5 minutes

In this article we’ll get up and running on the Semantic Web in less than 5 minutes using SPARQL with R. We’ll begin with a brief introduction to the Semantic Web then cover some simple steps for downloading and analyzing government data via a SPARQL query with the SPARQL R package.

What is the Semantic Web?

To newcomers, the Semantic Web can sound mysterious and ominous. By most accounts, it’s the wave of the future, but it’s hard to pin down exactly what it is. This is in part because the Semantic Web has been evolving for some time but is just now beginning to take a recognizable shape (DuCharme 2011). Detailed definitions of the Semantic Web abound, but simply put, it is an attempt to structure the unstructured data on the Web and to formalize the standards that make that structure possible. In other words, it’s an attempt to create a data definition for the Web.

The primary method for accessing data made available on the Semantic Web is via SPARQL queries to a data provider’s endpoint. Endpoints are portals to data that a provider has made available for querying. This data is often published in RDF format. If you’re familiar with using SQL to query a database, then the idea of using a query language (SPARQL) to access a structured database or data repository (the endpoint) should be common ground. The practice is also similar to webscraping an XML document with a known structure.

The Semantic Web and R

A primary goal of the Semantic Web movement is to enable machines and applications to interface using standardized data definitions. As the Semantic Web grows, it will be ripe for mining with statistical programs that already lend themselves to automation – such as R. To that end, let’s dive into a simple query that will illustrate 75% of what you’ll be doing on the Semantic Web.

Accessing Data.gov datasets with SPARQL

We’ll use data at the Data.gov endpoint for this example. Data.gov has a wide array of public data available, making this example generalizable to many other datasets. One of the key challenges of querying a Semantic Web resource is knowing what data is accessible. Sometimes the best way to find this out is to run a simple query with no filters that returns only a few results or to directly view the RDF. Fortunately, information on the data available via Data.gov has been cataloged on a wiki hosted by Rensselaer. We’ll use Dataset 1187 for this example. It’s simple and has interesting data – the total number of wildfires and acres burned per year, 1960-2008.

R code

Data.gov provides a Virtuoso endpoint here, which we could use to submit manual queries. But we want to automate this process, so we’ll use Willem Robert van Hage and Tomi Kauppinen’s SPARQL package to access the endpoint.

library(SPARQL) # SPARQL querying package
library(ggplot2)

# Step 1 - Set up preliminaries and define query
# Define the data.gov endpoint
endpoint <- "http://services.data.gov/sparql"

# create query statement
query <-
"PREFIX  dgp1187: <http://data-gov.tw.rpi.edu/vocab/p/1187/>
SELECT ?ye ?fi ?ac
WHERE {
?s dgp1187:year ?ye .
?s dgp1187:fires ?fi .
?s dgp1187:acres ?ac .
}"

# Step 2 - Use SPARQL package to submit query and save results to a data frame
qd <- SPARQL(endpoint,query)
df <- qd$results

# Step 3 - Prep for graphing

# Numbers are usually returned as characters, so convert to numeric and create a
# variable for "average acres burned per fire"
str(df)
df <- as.data.frame(apply(df, 2, as.numeric))
str(df)

df$avgperfire <- df$ac/df$fi

# Step 4 - Plot some data
ggplot(df, aes(x=ye, y=avgperfire, group=1)) +
geom_point() +
stat_smooth() +
scale_x_continuous(breaks=seq(1960, 2008, 5)) +
xlab("Year") +
ylab("Average acres burned per fire")

ggplot(df, aes(x=ye, y=fi, group=1)) +
geom_point() +
stat_smooth() +
scale_x_continuous(breaks=seq(1960, 2008, 5)) +
xlab("Year") +
ylab("Number of fires")

ggplot(df, aes(x=ye, y=ac, group=1)) +
geom_point() +
stat_smooth() +
scale_x_continuous(breaks=seq(1960, 2008, 5)) +
xlab("Year") +
ylab("Acres burned")

# In less than 5 mins we have written code to download just
# the data we need and have an interesting result to explore!

If you’re familiar with R, the only foreign code is that in Steps 1 and 2. There we define the location of the endpoint and write the SPARQL query. If you have experience with SQL, you’ll notice that SPARQL syntax is very similar. The most notable exception is that Semantic Web data is structured around chunks of data that contain three pieces of information. In Semantic Web terminology these three pieces of information are referred to as subjects, predicates and objects, and together they are called “triples”. A SPARQL query returns pieces of these chunks of data; exactly which pieces are returned depends on the query. We have defined a prefix which establishes that everywhere we say “dgp1187:” we want it to know that we mean “<http://data-gov.tw.rpi.edu/vocab/p/1187/>”. This saves us from having to retype long URIs every time we need to reference them. URIs are used extensively on the Semantic Web, so this is a very helpful feature.

In English, our query says, “Give me the values for the attributes “fires”, “acres” and “year” wherever they are defined, and assign them to variables named “fi”, “ac” and “ye” respectively. Also fetch the location of the value in the data as a variable called ?s and merge the other data values by that variable.”

The last bit of the query is the closest we get to SPARQL magic. Because the variable ?s is present in each clause of the query, our fire, acres burned, and year data will be merged by that variable. This is somewhat analogous to saying, “Once you’ve found a row of data that contains information on the number of fires, also return the acres burned and year data from that row and list them on the same row in the new dataset.” (If the logic behind the query isn’t yet clear, or you’re ready to see how you can make this query more advanced or parsimonious, the text and web pages in the “References” and “Additional Information” sections at the end of this article are good resources for learning SPARQL.)
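The merging behavior described above can be mimicked in plain R. This is only an analogy, not how a SPARQL engine works internally, and the triples below are made-up values rather than the real Data.gov figures; the helper pick is hypothetical:

```r
# Toy triples: made-up values for illustration, not the real Data.gov figures
triples <- data.frame(
  s = rep(c("entry1", "entry2"), each = 3),
  p = rep(c("year", "fires", "acres"), times = 2),
  o = c(1960, 100, 2000, 1961, 80, 2400),
  stringsAsFactors = FALSE
)

# Each clause of the query picks out the triples for one predicate and
# names the object after the query variable (?ye, ?fi, ?ac)
pick <- function(predicate, var) {
  out <- triples[triples$p == predicate, c("s", "o")]
  names(out) <- c("s", var)
  out
}

# Sharing ?s across the clauses behaves like merging the pieces by subject
result <- merge(merge(pick("year", "ye"), pick("fires", "fi"), by = "s"),
                pick("acres", "ac"), by = "s")
print(result)
```

Each subject ends up as one row with its year, fires and acres side by side, which is the shape of the data frame the real query returns.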

Once defined, we pass this query and endpoint information to the SPARQL function which returns an object with two values – a data frame of the new data and a list of namespaces. At this point, we’re concerned with the data frame, so we pull it out in the last line of Step 2.

[Figure: Data.gov SPARQL example]

From that point on we treat the data just like any other dataset. Running some quick graphs, we see that although the number of fires per year has decreased, the number of acres burned per fire and the number of acres burned per year increased considerably between 1975 and 2008. (Despite being born in 1975, I disavow any responsibility for this trend.)

Summary

In a few short minutes, using R and SPARQL, we wrote code to pull and do initial analyses on an interesting government dataset. Hopefully, the power of R for mining the Semantic Web is evident from this simple example. As more data becomes available in RDF format, automated solutions for mining and analyzing the Semantic Web will become more and more useful.

Reference:

DuCharme, Bob (2011-07-14). Learning SPARQL. O’Reilly Media.

Additional Information:

Basic tutorial on querying Data.gov

More detailed tutorial on querying Data.gov

R for Dummies – De Vries and Meys (2012)

The for Dummies series has been around since 1991. (A bit of trivia, DOS for Dummies was the first title.) I’ve owned a few books in the series and have been adequately impressed with most of them, but when I learned there was an R for Dummies I was immediately skeptical. Possibly I was skeptical because R has a steep learning curve and many idiosyncrasies, so the idea of an R for Dummies text seemed oxymoronic – it’s difficult to imagine a (successfully) dumbed-down version of an introductory R text. But if you’re familiar with the for Dummies series, you already know that the moniker is just for marketing. In reality, these books usually do a good job of distilling a topic down to the important components a new user needs to know. This edition is no exception.

Title: R for Dummies
Author(s): Andrie de Vries and Joris Meys
Publisher/Date: Wiley and Sons/2012
Statistics level: Not Applicable
Programming level: Beginner
Overall recommendation: Highly Recommended

The core topic areas that R for Dummies covers should come as no surprise: A basic overview of R and its capabilities, importing data into R, writing and debugging functions, summarizing data and graphing. In addition, there are sections covering potentially frustrating tasks for beginners such as working with dates and multidimensional arrays.

I am a big fan of periodically reviewing the fundamentals. I pick up introductory texts every once in a while to make sure I haven’t forgotten anything important, and R for Dummies is a nice addition to my library for that reason.

One thing that sets this book apart from other currently available introductory R texts is that it covers a couple of recent and important developments in R coding – namely the RStudio development environment and ggplot2 graphics.

If you’re not familiar with the for Dummies series, it is important to note that they are written in a specific informal style, which is in stark contrast to most R texts. (For reference, the style is more similar to The R Inferno than the ggplot2 manual.) You can get a sense of this style by browsing a few pages on Amazon to see if you find it helpful or distracting. On the other hand, R for Dummies has a more polished feel than many R texts I’ve read. I didn’t encounter any of the frustrating and distracting editing errors that are common in some R texts.

R for Dummies is primarily focused on R as a programming language, so for the most part, statistical analyses are presented only as a means of illustrating programming techniques. Given its focus on programming and fundamentals, this book is highly recommended for someone with little to no experience in R who wants to learn R programming. Intermediate to advanced R programmers like myself who want a current review of the fundamentals might also find it useful.

I might also recommend R for Dummies to experienced users of other programming languages who are new to R. The discussion of basic programming concepts, such as control flow, is minimal and focused primarily on details specific to R. It is not recommended for those looking to learn statistics in conjunction with R.

The current price of $20 USD puts it in the middle price range for texts of its kind. It is available as a paperback or Kindle text.

R Helper Functions

If you do a lot of R programming, you probably have a list of R helper functions set aside in a script that you include on R startup or at the top of your code. In some cases helper functions add capabilities that aren’t otherwise available. In other cases, they replicate functionality that is available elsewhere without loading unnecessary components. Below I present two of my most frequently used data manipulation helper functions as examples.

### Descriptives R Helper Function
# Display some basic descriptives
descs <- function (x) {
  if(!hidetables) {
    if(length(unique(x))>30) {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Number of categories is greater than 30, table not produced')
    } else {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Table results:')
      print(table(x, useNA='always'))
    }
   
  } else {
    print('Tables are hidden')
  }
}

# Set hide tables to true to hide tables
hidetables <- FALSE


### Dummy Variable R Helper Function
# Create dummy variables for each level of a categorical variable
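createDummies <- function(x, df, keepNAs = TRUE) {
  # Loop over the observed values of the variable; sort() drops NAs, so no
  # indicator is created for missing values
  for (i in sort(unique(df[, x]))) {
    if(keepNAs) {
      # Rows where the variable is NA stay NA in the indicator
      df[, paste(x,'.', i, sep = '')] <- ifelse(df[, x] != i, 0, 1)
    } else {
      # Rows where the variable is NA are coded 0
      df[, paste(x,'.', i, sep = '')] <- ifelse(df[, x] != i | is.na(df[, x]) , 0, 1)
    }
  }
  df
}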

The Descriptives R Helper Function produces a summary or table of the passed variable/object; it uses the number of unique values to determine whether to call just the summary() or summary() and table() functions. It also includes NAs by default in the tables (one of table()‘s biggest annoyances). Once the exploratory and data manipulation work is done, all output from this function can be suppressed by setting the hidetables object to TRUE.
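The table() annoyance mentioned above is easy to see on a toy vector (the values are made up for the example). This is just a small illustration of the default behavior versus the useNA='always' setting the helper relies on:

```r
# Toy vector with one missing value
x <- c("a", "b", NA, "a")

# Default: the NA is silently left out of the counts
default_counts <- table(x)
print(default_counts)

# With useNA = 'always' the NA gets its own cell, even when its count is zero
full_counts <- table(x, useNA = "always")
print(full_counts)
```

Comparing the totals of the two tables against length(x) is a quick way to spot missing values that the default output hides.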

The Dummy Variable R Helper Function creates indicator variables from all values of a variable. Based on experience, I avoid the factor object as much as possible and this approach allows me to quickly create indicators that can be used in any way I want.

If you don’t have a programming background or are just beginning with R, you might not have had time to realize the benefit of helper functions or identify the tasks you do repetitively, but it’s worthwhile to give the issue some consideration. Helper functions can be exceptionally useful for saving time on repetitive tasks or facilitating your work. They’re so useful in fact, that there is a special ProgrammingR event planned around helper functions coming soon. For the more experienced R programmers out there, make a mental note of the most useful helper functions you’ve written in the past. That list will come in handy in the near future!

Progress bars in R using winProgressBar

Using progress bars in R scripts can provide valuable timing feedback during development and additional polish to final products. winProgressBar and setWinProgressBar are the primary functions for creating progress bars in R.

Progress bars, and progress indicators in general, are relatively uncommon in R programming. This makes sense, as they can add bloat and, being design elements, they generally fall into the classification of “nice but not necessary”. However, during development, especially when using loops, progress bars can be a cleaner way of tracking loop progress than, for example, printing iteration numbers. And for programmers who prepare scripts or packages for non-programmers, they add feedback that users have come to expect from other software.

To add progress bars in R scripts use the winProgressBar and setWinProgressBar functions. For non-Windows users there’s also a very similar Tcl/Tk version (tkProgressBar and setTkProgressBar); depending on your current setup, you may need to install the tcltk library to use it.

Setting up a progress indicator to track the progress of a loop is very straightforward. First, initialize the display:

pb <- winProgressBar(title="Example progress bar", label="0% done", min=0, max=100, initial=0)

Use the title and label options to set the style of the display. The min and max options should use whatever values are most applicable to your task, but in most cases this will be displaying a percentage so a range of 0 to 100, starting at 0, makes sense.

After initializing the progress indicator, add your loop code and update the progress bar by calling the setWinProgressBar function at the end of each loop:

for(i in 1:100) {
  Sys.sleep(0.1) # slow down the code for illustration purposes
  info <- sprintf("%d%% done", round((i/100)*100))
  setWinProgressBar(pb, i/(100)*100, label=info)
}

# Once the loop is exited, close the progress bar window:

close(pb)

Here is a snapshot of the progress indicator in action:

[Figure: Windows progress bar example]

To track the progress of an entire script or program, you just need to place the initializing function (winProgressBar) at the beginning of the code and do updates via setWinProgressBar at key points within the flow of your program. For example, if the first task your code performs usually takes about 10% of the total time required, set the value to 10% after that task completes.
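Here is a rough sketch of that whole-script pattern. To keep it runnable on any platform it swaps in the console-based txtProgressBar for winProgressBar; the update logic is the same. The stage names and their time shares are assumptions invented for the example, and Sys.sleep() stands in for real work:

```r
# Hypothetical stages and their rough share of total run time (in percent)
stages <- c(load = 10, clean = 30, model = 50, report = 10)

# Initialize the bar once, at the start of the script
pb <- txtProgressBar(min = 0, max = 100, style = 3)

done <- 0
for (stage in names(stages)) {
  Sys.sleep(0.05)                  # stand-in for the real work of this stage
  done <- done + stages[[stage]]   # add this stage's share of the total
  setTxtProgressBar(pb, done)      # update once the stage has finished
}

close(pb)
```

In a real script the loop would be replaced by the actual tasks, with one update call after each.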

If a popup progress bar doesn’t work for your task, there is a minimalist option, txtProgressBar, that by default draws a line in the console. It also allows for some fun customizations:

[Figure: Text progress bar example]
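A minimal sketch of a customized console bar follows. The character, width and loop length are arbitrary choices for illustration; style = 3 is the variant that also prints a percentage:

```r
n <- 20

# char sets the symbol used to draw the bar, width its length in characters
pb <- txtProgressBar(min = 0, max = n, char = "+", width = 40, style = 3)

for (i in seq_len(n)) {
  Sys.sleep(0.01)            # slow down the loop for illustration purposes
  setTxtProgressBar(pb, i)
}

# Read the bar's current value before closing it
final_value <- getTxtProgressBar(pb)
close(pb)
```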

The Art of R Programming – Matloff (2011)

It’s difficult to write a book on an entire programming language and keep it manageable and concise, but The Art of R Programming does it as well as any text I’ve seen. Matloff covers, in detail and among other things, R data structures, programming idioms, performance enhancements, interfaces with other languages, debugging and graphing.

Title: The Art of R Programming
Author(s): Norman Matloff
Publisher/Date: No Starch Press/2011
Statistics level: Very Low
Programming level: Intermediate
Overall recommendation: Highly Recommended

There is the requisite “Introduction to R” section that is present in almost all R texts, but any beginners who benefit from this chapter may benefit from re-reading ARP after some additional practical experience with R. The issues that Matloff addresses and the solutions he provides are more salient after you’ve spent hours trying to resolve them.

The section on graphing is a good overview, but the average programmer may find it less useful than the other sections. Anyone looking for graphic optimization tips will be better served by a book focused specifically on graphing.

With that minor critique in mind, put simply, The Art of R Programming is a must read for all intermediate level R programmers. It covers nearly every method of performance enhancement available and provides a review of key fundamentals that may have been forgotten or missed.

One point of note: this text focuses almost solely on programming – the statistical examples are a means to an end, not an end themselves. For that reason, this book is recommended for those seeking to improve the efficiency of their programming rather than their statistical acumen.

At around $25 USD from Amazon, The Art of R Programming is one of the best R text values available. I highly recommend it for almost all R users. (You can also purchase this book directly from the publisher and get both the print and e-book version for around $40.)

Animations in R

Animated charts can be very helpful in illustrating concepts or discovering relationships, which makes them very helpful in teaching and exploratory research. Fortunately, creating animated graphs in R is fairly straightforward, once you have the right tools and understand a few basic principles about how the animations are created.
In this article I’ll provide an example of how to use the animation package to create an animated chart with a couple of bells and whistles.
The package installs out-of-the-box with several animations that are tailored for instruction. The examples are of varying complexity ranging from a simple coin flip simulation to illustrations of mathematical problems such as Buffon’s needle problem. In most scenarios, however, you’ll want to create your own animations, so let’s look at how to do that.
First, there are several different formats in which you can create your animations – GIF, HTML, LaTeX, SWF and mp4. The saveGIF() function call below illustrates the generic format for each of the calls:

saveGIF({
  for (i in 1:10) plot(runif(10), ylim = 0:1)
})

Understanding that the package creates animations by generating and then compiling many graphs is central to creating polished custom animations. As you can see, the syntax looks a little unfamiliar at first because the inside of the function call is a custom loop that creates the individual graphs. (Note: If you’re familiar with the way the boot() function works, this is somewhat similar.) Once those individual graphs are created, the function compiles the images in the format specified by the function call. As you might have guessed, most of the animation types require that you install 3rd party libraries for R to be able to do the compilations. The installation of these libraries is covered in the package help.
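To make the “generate, then compile” idea concrete, here is a sketch of just the first half in base R, without the animation package. It is not what saveGIF() does internally; it only shows one image file being produced per pass through the loop. The PDF device is used as a stand-in because it ships with every R installation, and the directory and file names are invented for the example:

```r
# Write each frame to its own file in a temporary directory
frame_dir <- tempfile("frames")
dir.create(frame_dir)

# onefile = FALSE with a %03d pattern gives one numbered file per plot
pdf(file.path(frame_dir, "frame%03d.pdf"), onefile = FALSE)
for (i in 1:5) {
  plot(runif(10), ylim = 0:1, main = paste("Frame", i))
}
dev.off()

# One file per pass through the loop
frames <- list.files(frame_dir)
print(frames)
```

The animation functions take over from here, handing the individual images to an external tool to be stitched together.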
Basic use of the animation functions is covered in the package help, but the application of the functions to novel tasks can still be a little difficult. As a result, I’ve created an example that illustrates how to use the functions to create animations with a couple of bells and whistles.
This animation plots the density functions of 150 draws of 100 values from a normally distributed random variable. To make things a little more interesting (i.e., make the distribution move), a constant that varies based on the iteration count is added to the 100 values. The chart also includes a slightly stylized frame tracker (or draw counter) along the top of the chart and a horizontal bar that notes the current position and previous two positions of the sample mean. Finally, the foreground color of the chart changes based on the mean of the distribution.

library(animation)

#Set delay between frames when replaying
ani.options(interval=.05)

# Set up a vector of colors for use below
col.range <- heat.colors(15)

# Begin animation loop
# Note the brackets within the parentheses
saveGIF({
	# For the most part, it's safest to start with graphical settings in 
	# the animation loop, as the loop adds a layer of complexity to 
	# manipulating the graphs. For example, the layout specification needs to 
	# be within animation loop to work properly.
	layout(matrix(c(1, rep(2, 5)), 6, 1))

	# Adjust the margins a little
	par(mar=c(4,4,2,1) + 0.1)

		# Begin the loop that creates the 150 individual graphs
		for (i in 1:150) {
			# Pull 100 observations from a normal distribution
			# and add a constant based on the iteration to move the distribution
			chunk <- rnorm(100)+sqrt(abs((i)-51))

			# Reset the color of the top chart every time (so that it doesn't change as the 
			# bottom chart changes)
			par(fg=1)

			# Set up the top chart that keeps track of the current frame/iteration
			# Dress it up a little just for fun
			plot(-5, xlim = c(1,150), ylim = c(0, .3), axes = F, xlab = "", ylab = "", main = "Iteration")
			abline(v=i, lwd=5, col = rgb(0, 0, 255, 255, maxColorValue=255))
			abline(v=i-1, lwd=5, col = rgb(0, 0, 255, 50, maxColorValue=255))
			abline(v=i-2, lwd=5, col = rgb(0, 0, 255, 25, maxColorValue=255))

			# Bring back the X axis
			axis(1)

			# Set the color of the bottom chart based on the distance of the distribution's mean from 0
			par(fg = col.range[mean(chunk)+3])

			# Set up the bottom chart
			plot(density(chunk), main = "", xlab = "X Value", xlim = c(-5, 15), ylim = c(0, .6))

			# Add a line that indicates the mean of the distribution. Add fainter lines to track
			# the previous two means (drawn oldest first, before the stored values are updated)
			abline(v=mean(chunk), col = rgb(255, 0, 0, 255, maxColorValue=255))
			if (exists("prevlastmean")) {abline(v=prevlastmean, col = rgb(255, 0, 0, 25, maxColorValue=255))}
			if (exists("lastmean")) {abline(v=lastmean, col = rgb(255, 0, 0, 50, maxColorValue=255)); prevlastmean <- lastmean}

			# Store the current mean for use in the next frame
			lastmean <- mean(chunk)
		}
})

And the final product:

A couple of closing notes:

  • Because there are external programs involved (e.g., SWF Tools, ImageMagick, FFmpeg), the setup for this package is slightly more difficult than the average package and things will likely seem less polished than normal. Things may also not work as well; you’ll need to be prepared to be flexible with your animation formats and graph layouts.
  • Animation works exceptionally well when smaller numbers of individual graphs are being compiled, but as the number of individual graphs grows, so does your likelihood of hitting a problem. E.g., although GIF is a very exportable and transportable format, and therefore ideal for many situations, I found that animations with more than ~500 source graphs just didn’t compile. The limit for HTML was similar. Your mileage may vary, but again, be prepared to be flexible.
  • If you do not need to transport your animation and it will have less than a few hundred individual images, you can avoid installing 3rd party software by using the saveHTML function. This output also includes an interface that allows you to pause and move within the animation easily.
  • As mentioned in the code above, if you’re having trouble getting a particular graphical parameter to work, make sure that it is in the internal loop. For efficiency, you want to keep the loop as clean as possible of course, but some things need to be specified each time a new chart is plotted, and therefore need to be inside the loop.
  • Animations aren’t very common in research presentations, but can provide extensive insight beyond static images. Given R’s advanced graphing capabilities, it’s possible to create very nice animations without needing to learn a completely different software package.

If you’ve created an animation you’d like to share or have additional tips, feel free to add them to the comments.

RStudio Development Environment

Compared to many other languages of equal popularity, there are relatively few development environments for R. In fact, the total number of production ready R IDEs could probably be counted on one hand. That deficiency is a small price to pay to use R and if you’re not already accustomed to using IDEs for other languages, you probably haven’t missed it too much. But RStudio goes a long way toward providing a full-featured R development platform that, once you’ve used it, quickly becomes hard to give up again.

RStudio has some nice graphical features and the layout is clean and logical for the most part. Functionally, some of the best features are:

  • Plot caching (allows you to flip back through previous graphs without rerunning them, making it much easier to review your graphical output)
  • Function, object and parameter completion that works even with user-defined functions (see below)
  • Shortcuts for quickly drilling down into functions

[Figure: RStudio parameter completion]

RStudio also provides version control integration (Git, SVN) which could prove to be very helpful, but I haven’t yet tested it. I can’t speak to how well it works, just that it is available.

In addition to these positives, RStudio has an active support system with developer participation via the RStudio support site.

Overall, I’ve been very impressed with RStudio over the past few weeks. If you haven’t yet tested it, I suggest you give it a try. Given the growth of R over recent years, I think it’s time we expected development tools to mature to the level that they have for other programming languages with similar levels of adoption. The only way that will produce sustainable, mature products is if there is a constant demand in the market.

Already using something else? Feel free to mention your favorite R IDE in the comments.