Creating your personal, portable R code library with GitHub

Note: I sold ProgrammingR.com and its online assets in 2015. I’m posting this for posterity, but no longer control access to the GitHub account.

As I discussed in a previous post, I have a few helper functions I’ve created that I commonly use in my work. Until recently, I manually included these functions at the start of my R scripts by either the tried-and-true copy-and-paste method, or by extracting them from a local file with the source() function. The former approach has the benefit of keeping the helper code inextricably attached to the main script, but it adds a good bit of code to wade through. The latter approach keeps the code cleaner, but requires that whoever is running the code always has access to the sourced file and that it is always in the same relative path – and that makes sharing or moving code more difficult. The start of a recent project requiring me to share my helper function library prompted me to find a better solution.

The resulting approach takes advantage of GitHub Gists and R’s ability to source via a web-based location to enable you to create a personal, portable library of R functions for private use or to share.

The process is very straightforward. If you don’t have a GitHub account, getting one is the first step. After doing so, click on “Gist” in the menu bar to create a new Gist. (There are a couple of reasons for using a Gist instead of a formal repository. For one, repositories cannot be made private unless you have a paid GitHub membership. Gists, however, can be created as “Secret”, making them available only to people who have the exact URL.) Name your Gist, add your code, and choose whether to save it as a secret or public Gist. After saving, you will be taken to a screen displaying your new Gist.

Now that you’ve created your Gist, you just need to source it in your R code. That step is easily done via the source() function. Rather than including a file path as you would with a local file, you simply include the URL to your Gist. Note that you can’t just copy the URL of the GitHub page you’re currently on. You need to source the URL of the raw code. To get that URL, click on the <> button on the top right of your Gist (circled in red below). This will take you to the raw code for the current revision. To get the path that always points to the most recent revision, remove everything in the URL after “/raw/”.

[Screenshot: the Gist page, with the <> raw-code button circled in red]

Regardless of which version you link to, you’ll still be left with a long URL that includes a randomly generated alphanumeric string that you don’t want to have to try to remember. Here’s a pro tip: use a URL shortener that lets you define a custom URL (e.g., tinyurl) to create a shortened version that is easy to remember. Something like the little library at http://tinyurl.com/ProgRLib (yes – you can use and add to this library!).
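If it helps to see the sourcing step spelled out, here is a minimal sketch (the raw-Gist URL is hypothetical; substitute the raw or shortened URL for your own Gist):

# Source helper functions straight from the web via a Gist's raw URL
# (hypothetical URL shown for illustration)
source("https://gist.githubusercontent.com/youruser/abc123/raw/")

# Or source via a memorable shortened URL, like the shared library above
source("http://tinyurl.com/ProgRLib")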

Now you’ve got a portable, easy to reference, personal library of functions that you can easily include in your code or share with others.

Global Indicator Analyses with R

I was recently asked by a client to create a large number of “proof of concept” visualizations that illustrated the power of R for compiling and analyzing disparate datasets. The client was specifically interested in automated analyses of global data. A little research led me to the WDI package.

The WDI package is a tool to “search, extract and format data from the World Bank’s World Development Indicators” (WDI help). In essence, it is an R-based wrapper for the World Bank Economic Indicators Data API. When used in combination with the information on the World Bank data portal it provides easy access to thousands of global datapoints.

Here is an example use case that illustrates how simple and easy it is to use, especially with a little help from the countrycode and ggplot2 packages:

library(WDI)
library(ggplot2)
library(countrycode)

# Use the WDIsearch function to get a list of fertility rate indicators
indicatorMetaData <- WDIsearch("Fertility rate", field="name", short=FALSE)

# Define a list of countries for which to pull data
countries <- c("United States", "Britain", "Sweden", "Germany")

# Convert the country names to iso2c format used in the World Bank data
iso2cNames <- countrycode(countries, "country.name", "iso2c")

# Pull data for each country for the first two fertility rate indicators, for the years 2001 to 2011
wdiData <- WDI(iso2cNames, indicatorMetaData[1:2,1], start=2001, end=2011)

# Pull out the indicator codes for the first two indicators (these become column names in wdiData)
indicatorNames <- indicatorMetaData[1:2, 1]

# Create trend charts for the first two indicators
for (indicatorName in indicatorNames) {
  pl <- ggplot(wdiData, aes(x=year, y=wdiData[, indicatorName], group=country, color=country)) +
    geom_line(size=1) +
    scale_x_continuous(name="Year", breaks=c(unique(wdiData[, "year"]))) +
    scale_y_continuous(name=indicatorName) +
    theme(legend.title=element_blank()) +
    ggtitle(paste(indicatorMetaData[indicatorMetaData[, 1]==indicatorName, "name"], "\n"))
  # Save each chart as a .jpg named after the indicator code
  ggsave(paste(indicatorName, ".jpg", sep=""), pl)
}

[Figures: example WDI trend charts for the two fertility rate indicators]

This code can be adapted to quickly pull and visualize many pieces of data. Even if you don’t have an analytic need for the WDI data, the ease of access and depth of information available via the WDI package make them perfect for creating toy examples for classes, presentations or blogs, or conveying the power and depth of available R packages.

SPARQL with R in less than 5 minutes

In this article we’ll get up and running on the Semantic Web in less than 5 minutes using SPARQL with R. We’ll begin with a brief introduction to the Semantic Web then cover some simple steps for downloading and analyzing government data via a SPARQL query with the SPARQL R package.

What is the Semantic Web?

To newcomers, the Semantic Web can sound mysterious and ominous. By most accounts, it’s the wave of the future, but it’s hard to pin down exactly what it is. This is in part because the Semantic Web has been evolving for some time but is just now beginning to take a recognizable shape (DuCharme 2011). Detailed definitions of the Semantic Web abound, but simply put, it is an attempt to structure the unstructured data on the Web and to formalize the standards that make that structure possible. In other words, it’s an attempt to create a data definition for the Web.

The primary method for accessing data made available on the Semantic Web is via SPARQL queries to a data provider’s endpoint. Endpoints are portals to data that a provider has made available for querying. This data is often published in RDF format. If you’re familiar with using SQL to query a database, then the idea of using a query language (SPARQL) to access a structured database or data repository (the endpoint) should be common ground. The practice is also similar to webscraping an XML document with a known structure.

The Semantic Web and R

A primary goal of the Semantic Web movement is to enable machines and applications to interface using standardized data definitions. As the Semantic Web grows, it will be ripe for mining with statistical programs that already lend themselves to automation – such as R. To that end, let’s dive into a simple query that will illustrate 75% of what you’ll be doing on the Semantic Web.

Accessing Data.gov datasets with SPARQL

We’ll use data at the Data.gov endpoint for this example. Data.gov has a wide array of public data available, making this example generalizable to many other datasets. One of the key challenges of querying a Semantic Web resource is knowing what data is accessible. Sometimes the best way to find this out is to run a simple query with no filters that returns only a few results or to directly view the RDF. Fortunately, information on the data available via Data.gov has been cataloged on a wiki hosted by Rensselaer. We’ll use Dataset 1187 for this example. It’s simple and has interesting data – the total number of wildfires and acres burned per year, 1960-2008.
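As a quick illustration of that exploratory approach, here is a minimal sketch of a filter-free query with a LIMIT clause, using the SPARQL package and Data.gov endpoint introduced in the next section; it simply returns a handful of triples so you can see how the data is structured:

library(SPARQL)

# Peek at a few raw triples from the endpoint (no filters)
peek <- SPARQL("http://services.data.gov/sparql",
               "SELECT ?s ?p ?o WHERE { ?s ?p ?o . } LIMIT 10")
head(peek$results)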

R code

Data.gov provides a Virtuoso SPARQL endpoint (the URL is defined in the code below), which we could use to submit manual queries. But we want to automate this process, so we’ll use Willem Robert van Hage and Tomi Kauppinen’s SPARQL package to access the endpoint.

library(SPARQL) # SPARQL querying package
library(ggplot2)

# Step 1 - Set up preliminaries and define query
# Define the data.gov endpoint
endpoint <- "http://services.data.gov/sparql"

# create query statement
query <-
"PREFIX  dgp1187: <http://data-gov.tw.rpi.edu/vocab/p/1187/>
SELECT ?ye ?fi ?ac
WHERE {
?s dgp1187:year ?ye .
?s dgp1187:fires ?fi .
?s dgp1187:acres ?ac .
}"

# Step 2 - Use SPARQL package to submit query and save results to a data frame
qd <- SPARQL(endpoint, query)
df <- qd$results

# Step 3 - Prep for graphing

# Numbers are usually returned as characters, so convert to numeric and create a
# variable for "average acres burned per fire"
str(df)
df <- as.data.frame(apply(df, 2, as.numeric))
str(df)

df$avgperfire <- df$ac/df$fi

# Step 4 - Plot some data
ggplot(df, aes(x=ye, y=avgperfire, group=1)) +
  geom_point() +
  stat_smooth() +
  scale_x_continuous(breaks=seq(1960, 2008, 5)) +
  xlab("Year") +
  ylab("Average acres burned per fire")

ggplot(df, aes(x=ye, y=fi, group=1)) +
  geom_point() +
  stat_smooth() +
  scale_x_continuous(breaks=seq(1960, 2008, 5)) +
  xlab("Year") +
  ylab("Number of fires")

ggplot(df, aes(x=ye, y=ac, group=1)) +
  geom_point() +
  stat_smooth() +
  scale_x_continuous(breaks=seq(1960, 2008, 5)) +
  xlab("Year") +
  ylab("Acres burned")

# In less than 5 mins we have written code to download just
# the data we need and have an interesting result to explore!

If you’re familiar with R, the only foreign code is that in Steps 1 and 2. There we define the location of the endpoint and write the SPARQL query. If you have experience with SQL, you’ll notice that SPARQL syntax is very similar. The most notable exception is that Semantic Web data is structured around chunks of data that contain three pieces of information. In Semantic Web terminology these three pieces of information are referred to as subjects, predicates and objects, and together they are called “triples”. A SPARQL query returns pieces of these chunks of data; exactly which pieces are returned depends on the query. We have defined a prefix which establishes that everywhere we write “dgp1187:” we mean “<http://data-gov.tw.rpi.edu/vocab/p/1187/>”. This saves us from having to retype long URIs every time we need to reference them. URIs are used extensively on the Semantic Web, so this is a very helpful feature.

In English, our query says, “Give me the values for the attributes “fires”, “acres” and “year” wherever they are defined, and assign them to variables named “fi”, “ac” and “ye” respectively. Also fetch the location of the value in the data as a variable called ?s and merge the other data values by that variable.”

The last bit of the query is the closest we get to SPARQL magic. Because the variable ?s is present in each clause of the query, our fire, acres burned, and year data will be merged by that variable. This is somewhat analogous to saying, “Once you’ve found a row of data that contains information on the number of fires, also return the acres burned and year data from that row and list them on the same row in the new dataset.” (If the logic behind the query isn’t yet clear, or you’re ready to see how you can make this query more advanced or parsimonious, the text and web pages in the “References” and “Additional Information” sections at the end of this article are good resources for learning SPARQL.)

Once defined, we pass this query and endpoint information to the SPARQL function, which returns an object with two values – a data frame of the new data and a list of namespaces. At this point, we’re concerned with the data frame, so we pull it out in the last line of Step 2.

[Figure: sample of the Data.gov SPARQL query results]

From that point on we treat the data just like any other dataset. Running some quick graphs, we see that although the number of fires per year has decreased, the number of acres burned per fire and the number of acres burned per year increased considerably between 1975 and 2008. (Despite being born in 1975, I disavow any responsibility for this trend.)

Summary

In a few short minutes, using R and SPARQL, we wrote code to pull and do initial analyses on an interesting government dataset. Hopefully, the power of R for mining the Semantic Web is evident from this simple example. As more data becomes available in RDF format, automated solutions for mining and analyzing the Semantic Web will become more and more useful.

Reference:

DuCharme, Bob (2011). Learning SPARQL. O’Reilly Media.

Additional Information:

Basic tutorial on querying Data.gov

More detailed tutorial on querying Data.gov

R for Dummies – De Vries and Meys (2012)

The for Dummies series has been around since 1991. (A bit of trivia, DOS for Dummies was the first title.) I’ve owned a few books in the series and have been adequately impressed with most of them, but when I learned there was an R for Dummies I was immediately skeptical. Possibly I was skeptical because R has a steep learning curve and many idiosyncrasies, so the idea of an R for Dummies text seemed oxymoronic – it’s difficult to imagine a (successfully) dumbed-down version of an introductory R text. But if you’re familiar with the for Dummies series, you already know that the moniker is just for marketing. In reality, these books usually do a good job of distilling a topic down to the important components a new user needs to know. This edition is no exception.

Title: R for Dummies
Author(s): Andrie de Vries and Joris Meys
Publisher/Date: Wiley and Sons/2012
Statistics level: Not Applicable
Programming level: Beginner
Overall recommendation: Highly Recommended

The core topic areas that R for Dummies covers should come as no surprise: A basic overview of R and its capabilities, importing data into R, writing and debugging functions, summarizing data and graphing. In addition, there are sections covering potentially frustrating tasks for beginners such as working with dates and multidimensional arrays.

I am a big fan of periodically reviewing the fundamentals. I pick up introductory texts every once in a while to make sure I haven’t forgotten anything important, and R for Dummies is a nice addition to my library for that reason.

One thing that sets this book apart from other currently available introductory R texts is that it covers a couple of recent and important developments in R coding – namely the RStudio development environment and ggplot2 graphics.

If you’re not familiar with the for Dummies series, it is important to note that they are written in a specific informal style, which is in stark contrast to most R texts. (For reference, the style is more similar to The R Inferno than the ggplot2 manual.) You can get a sense of this style by browsing a few pages on Amazon to see if you find it helpful or distracting. On the other hand, R for Dummies has a more polished feel than many R texts I’ve read. I didn’t encounter any of the frustrating and distracting editing errors that are common in some R texts.

R for Dummies is primarily focused on R as a programming language, so for the most part, statistical analyses are presented only as a means of illustrating programming techniques. Given its focus on programming and fundamentals, this book is highly recommended for someone with little to no experience in R who wants to learn R programming. Intermediate to advanced R programmers like myself who want a current review of the fundamentals might also find it useful.

I might also recommend R for Dummies to experienced users of other programming languages who are new to R. The discussion of basic programming concepts, such as control flow, is minimal and focused primarily on details specific to R. It is not recommended for those looking to learn statistics in conjunction with R.

The current price of $20 USD puts it in the middle price range for texts of its kind. It is available as a paperback or Kindle text.

R Helper Functions

If you do a lot of R programming, you probably have a list of R helper functions set aside in a script that you include on R startup or at the top of your code. In some cases helper functions add capabilities that aren’t otherwise available. In other cases, they replicate functionality that is available elsewhere without loading unnecessary components. Below I present two of my most frequently used data manipulation helper functions as examples.

### Descriptives R Helper Function
# Display some basic descriptives
# (relies on the global 'hidetables' flag defined below)
descs <- function (x) {
  if(!hidetables) {
    if(length(unique(x))>30) {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Number of categories is greater than 30, table not produced')
    } else {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Table results:')
      print(table(x, useNA='always'))
    }
   
  } else {
    print('Tables are hidden')
  }
}

# Set hide tables to true to hide tables
hidetables <- FALSE


### Dummy Variable R Helper Function
# Create dummy variables for each level of a categorical variable
# (assumes the levels of df[, x] are coded as the integers 1, 2, ..., k)
createDummies <- function(x, df, keepNAs = TRUE) {
  for (i in seq(1, length(unique(df[, x])))) {
    if(keepNAs) {
      # NAs in the source variable propagate to the dummy columns
      df[, paste(x, '.', i, sep = '')] <- ifelse(df[, x] != i, 0, 1)
    } else {
      # NAs in the source variable are coded as 0 in the dummy columns
      df[, paste(x, '.', i, sep = '')] <- ifelse(df[, x] != i | is.na(df[, x]), 0, 1)
    }
  }
  df
}

The Descriptives R Helper Function produces a summary or table of the passed variable/object; it uses the number of unique values to determine whether to call just the summary() or summary() and table() functions. It also includes NAs by default in the tables (one of table()‘s biggest annoyances). Once the exploratory and data manipulation work is done, all output from this function can be suppressed by setting the hidetables object to TRUE.

The Dummy Variable R Helper Function creates indicator variables from all values of a variable. Based on experience, I avoid the factor object as much as possible and this approach allows me to quickly create indicators that can be used in any way I want.
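For illustration, here is a minimal usage sketch of both helpers (the data frame and variable names are hypothetical):

# Hypothetical example data: a categorical variable coded 1-3
exampleDf <- data.frame(group = sample(1:3, 20, replace = TRUE))

# Summarize/tabulate the variable (respects the global hidetables flag)
descs(exampleDf$group)

# Add indicator columns group.1, group.2 and group.3
exampleDf <- createDummies("group", exampleDf)
head(exampleDf)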

If you don’t have a programming background or are just beginning with R, you might not have had time to realize the benefit of helper functions or identify the tasks you do repetitively, but it’s worthwhile to give the issue some consideration. Helper functions can be exceptionally useful for saving time on repetitive tasks or facilitating your work. They’re so useful in fact, that there is a special ProgrammingR event planned around helper functions coming soon. For the more experienced R programmers out there, make a mental note of the most useful helper functions you’ve written in the past. That list will come in handy in the near future!

Progress bars in R using winProgressBar

Using progress bars in R scripts can provide valuable timing feedback during development and additional polish to final products. winProgressBar and setWinProgressBar are the primary functions for creating progress bars in R.

Progress bars, and progress indicators in general, are relatively uncommon in R programming. This makes sense, as they can add bloat and, being design elements, they generally fall into the classification of “nice but not necessary”. However, during development, especially when using loops, progress bars can be a cleaner way of tracking loop progress than, for example, printing iteration numbers. And for programmers who prepare scripts or packages for non-programmers, they add feedback that users have come to expect from other software.

To add progress bars in R scripts, use the winProgressBar and setWinProgressBar functions. For non-Windows users there’s also a very similar Tcl/Tk version (tkProgressBar and setTkProgressBar); depending on your current setup, you may need to load the tcltk package to use it.

Setting up a progress indicator to track the progress of a loop is very straightforward. First, initialize the display:

pb <- winProgressBar(title="Example progress bar", label="0% done", min=0, max=100, initial=0)

Use the title and label options to set the style of the display. The min and max options should use whatever values are most applicable to your task, but in most cases this will be displaying a percentage so a range of 0 to 100, starting at 0, makes sense.

After initializing the progress indicator, add your loop code and update the progress bar by calling the setWinProgressBar function at the end of each loop:

for(i in 1:100) {
  Sys.sleep(0.1) # slow down the code for illustration purposes
  info <- sprintf("%d%% done", round((i/100)*100))
  setWinProgressBar(pb, i/(100)*100, label=info)
}

# Once the loop is exited, close the progress bar window:

close(pb)

Here is a snapshot of the progress indicator in action:

[Screenshot: winProgressBar window mid-run]
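For non-Windows users, the Tcl/Tk version mentioned above is a near drop-in replacement. A minimal sketch, assuming the tcltk package loads on your system:

library(tcltk)

pb <- tkProgressBar(title="Example progress bar", label="0% done", min=0, max=100, initial=0)

for(i in 1:100) {
  Sys.sleep(0.1) # slow down the code for illustration purposes
  setTkProgressBar(pb, i, label=sprintf("%d%% done", i))
}

close(pb)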

To track the progress of an entire script or program, you just need to place the initializing function (winProgressBar) at the beginning of the code and do updates via setWinProgressBar at key points within the flow of your program. For example, if the first task your code performs usually takes about 10% of the total time required, set the value to 10% after that task completes.
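As a rough sketch of that idea (the task breakdown and percentages are made up for illustration):

pb <- winProgressBar(title="Analysis progress", label="Starting...", min=0, max=100, initial=0)

# ... load and clean the data (assume roughly 10% of total run time) ...
setWinProgressBar(pb, 10, label="Data loaded")

# ... fit the models (assume roughly 80% of total run time) ...
setWinProgressBar(pb, 90, label="Models fit")

# ... write the output ...
setWinProgressBar(pb, 100, label="Finished")

close(pb)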

If a popup progress bar doesn’t work for your task, there is a minimalist option, txtProgressBar, that by default draws a line in the console. It also allows for some fun customizations:

[Screenshot: txtProgressBar console output]
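A minimal sketch of the text version (style 3 adds a percentage readout, and the char argument is one of the simple customizations available):

pb <- txtProgressBar(min=0, max=100, style=3, char="=")

for(i in 1:100) {
  Sys.sleep(0.05) # slow down the code for illustration purposes
  setTxtProgressBar(pb, i)
}

close(pb)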

The Art of R Programming – Matloff (2011)

It’s difficult to write a book on an entire programming language and keep it manageable and concise, but The Art of R Programming does it as well as any text I’ve seen. Matloff covers, in detail and among other things, R data structures, programming idioms, performance enhancements, interfaces with other languages, debugging and graphing.

Title:The Art of R Programming
Author(s): Norman Matloff
Publisher/Date: No Starch Press/2011
Statistics level: Very Low
Programming level: Intermediate
Overall recommendation: Highly Recommended

There is the requisite “Introduction to R” section that is present in almost all R texts, but any beginners who benefit from this chapter may benefit from re-reading ARP after some additional practical experience with R. The issues that Matloff addresses and the solutions he provides are more salient after you’ve spent hours trying to resolve them.

The section on graphing is a good overview, but the average programmer may find it less useful than the other sections. Anyone looking for graphic optimization tips will be better served by a book focused specifically on graphing.

With that minor critique in mind, put simply, The Art of R Programming is a must read for all intermediate level R programmers. It covers nearly every method of performance enhancement available and provides a review of key fundamentals that may have been forgotten or missed.

One point of note: this text focuses almost solely on programming – the statistical examples are a means to an end, not an end in themselves. For that reason, this book is recommended for those seeking to improve the efficiency of their programming rather than their statistical acumen.

At around ~$25 USD from Amazon, The Art of R Programming is one of the best R text values available. I highly recommend it for almost all R users. (You can also purchase this book directly from the publisher and get both the print and e-book version for ~$40.)

Animations in R

Animated charts can be very helpful in illustrating concepts or discovering relationships, which makes them valuable tools in teaching and exploratory research. Fortunately, creating animated graphs in R is fairly straightforward once you have the right tools and understand a few basic principles about how the animations are created.

In this article I’ll provide an example of how to use the animation package to create an animated chart with a couple of bells and whistles.

The package installs out-of-the-box with several animations that are tailored for instruction. The examples are of varying complexity, ranging from a simple coin flip simulation to illustrations of mathematical problems such as Buffon’s needle problem. In most scenarios, however, you’ll want to create your own animations, so let’s look at how to do that.
First, there are several different formats in which you can create your animations – GIF, HTML, LaTeX, SWF and mp4. The saveGIF() function call below illustrates the generic format for each of the calls:

library(animation)

saveGIF({
  for (i in 1:10) plot(runif(10), ylim = 0:1)
})

Understanding that the package creates animations by generating and then compiling many graphs is central to creating polished custom animations. As you can see, the syntax looks a little unfamiliar at first because the inside of the function call is a custom loop that creates the individual graphs. (Note: If you’re familiar with the way the boot() function works, this is somewhat similar.) Once those individual graphs are created, the function compiles the images in the format specified by the function call. As you might have guessed, most of the animation types require that you install 3rd party libraries for R to be able to do the compilations. The installation of these libraries is covered in the package help.

Basic use of the animation functions is covered in the package help, but the application of the functions to novel tasks can still be a little difficult. As a result, I’ve created an example that illustrates how to use the functions to create animations with a couple of bells and whistles.

This animation plots the density functions of 150 draws of 100 values from a normally distributed random variable. To make things a little more interesting (i.e., make the distribution move), a constant that varies based on the iteration count is added to the 100 values. The chart also includes a slightly stylized frame tracker (or draw counter) along the top of the chart and a vertical line that notes the current position and previous two positions of the sample mean. Finally, the foreground color of the chart changes based on the mean of the distribution.

library(animation)

#Set delay between frames when replaying
ani.options(interval=.05)

# Set up a vector of colors for use below
col.range <- heat.colors(15)

# Begin animation loop
# Note the brackets within the parentheses
saveGIF({
	# For the most part, it's safest to start with graphical settings in 
	# the animation loop, as the loop adds a layer of complexity to 
	# manipulating the graphs. For example, the layout specification needs to 
	# be within animation loop to work properly.
	layout(matrix(c(1, rep(2, 5)), 6, 1))

	# Adjust the margins a little
	par(mar=c(4,4,2,1) + 0.1)

		# Begin the loop that creates the 150 individual graphs
		for (i in 1:150) {
			# Pull 100 observations from a normal distribution
			# and add a constant based on the iteration to move the distribution
			chunk <- rnorm(100)+sqrt(abs((i)-51))

			# Reset the color of the top chart every time (so that it doesn't change as the 
			# bottom chart changes)
			par(fg=1)

			# Set up the top chart that keeps track of the current frame/iteration
			# Dress it up a little just for fun
			plot(-5, xlim = c(1,150), ylim = c(0, .3), axes = F, xlab = "", ylab = "", main = "Iteration")
			abline(v=i, lwd=5, col = rgb(0, 0, 255, 255, maxColorValue=255))
			abline(v=i-1, lwd=5, col = rgb(0, 0, 255, 50, maxColorValue=255))
			abline(v=i-2, lwd=5, col = rgb(0, 0, 255, 25, maxColorValue=255))

			# Bring back the X axis
			axis(1)

			# Set the color of the bottom chart based on the distance of the distribution's mean from 0
			par(fg = col.range[mean(chunk)+3])

			# Set up the bottom chart
			plot(density(chunk), main = "", xlab = "X Value", xlim = c(-5, 15), ylim = c(0, .6))

			# Add a line that indicates the current mean of the distribution, plus
			# faded lines marking the means from the previous two iterations
			if (exists("prevlastmean")) {abline(v=prevlastmean, col = rgb(255, 0, 0, 25, maxColorValue=255))}
			if (exists("lastmean")) {abline(v=lastmean, col = rgb(255, 0, 0, 50, maxColorValue=255)); prevlastmean <- lastmean;}
			abline(v=mean(chunk), col = rgb(255, 0, 0, 255, maxColorValue=255))

			# Store the current mean so it can be shown as a faded line on the next iteration
			lastmean <- mean(chunk)
		}
})

And the final product: [animated GIF of the moving density plot produced by the code above]

A couple of closing notes:

  • Because there are external programs involved (e.g., SWF Tools, ImageMagick, FFmpeg), the setup for this package is slightly more difficult than the average package and things will likely seem less polished than normal. Things may also not work as well; you’ll need to be prepared to be flexible with your animation formats and graph layouts.
  • Animation works exceptionally well when smaller numbers of individual graphs are being compiled, but as the number of individual graphs grows, so does your likelihood of hitting a problem. E.g., although GIF is a very exportable and transportable format, and therefore ideal for many situations, I found that animations with more than ~500 source graphs just didn’t compile. The limit for HTML was similar. Your mileage may vary, but again, be prepared to be flexible.
  • If you do not need to transport your animation and it will have fewer than a few hundred individual images, you can avoid installing 3rd party software by using the saveHTML function (see the sketch after this list). This output also includes an interface that allows you to pause and move within the animation easily.
  • As mentioned in the code above, if you’re having trouble getting a particular graphical parameter to work, make sure that it is in the internal loop. For efficiency, you want to keep the loop as clean as possible of course, but some things need to be specified each time a new chart is plotted, and therefore need to be inside the loop.
  • Animations aren’t very common in research presentations, but can provide extensive insight beyond static images. Given R’s advanced graphing capabilities, it’s possible to create very nice animations without needing to learn a completely different software package.
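As a minimal sketch of the saveHTML() option mentioned in the notes above (the file and image names are arbitrary):

library(animation)

saveHTML({
  for (i in 1:20) plot(rnorm(100) + i/10, ylim = c(-4, 6))
}, htmlfile = "example.html", img.name = "example_plot")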

If you’ve created an animation you’d like to share or have additional tips, feel free add them to the comments.

Installing quantstrat from R-forge and source

R is used extensively in the financial industry; many of my recent clients have been working in or developing products for the financial sector. Some common applications are to use R to analyze market data and evaluate quantitative trading strategies. Custom solutions are almost always the best way to do this, but the quantstrat package can make it easy to quickly get a high-level understanding of a strategy’s potential. However, quantstrat is still under development, and this, combined with a lack of documentation and the complex nature of the tasks involved, makes it difficult to work with. This article addresses one of the most basic issues with quantstrat – getting it installed. quantstrat and its required packages currently aren’t available on CRAN – you have to get them from R-forge. As a result, the installation is slightly less straightforward than for other packages and provides an opportunity to discuss how to install packages from R-forge and locally from source. Although this article focuses on installing quantstrat, these instructions will help with any R package that you need to build from source.

If you’re installing from R-forge, the process is only moderately different than installing from CRAN; simply change the install.packages command to point to the R-forge repository:

install.packages("FinancialInstrument", repos="http://R-Forge.R-project.org")

install.packages("blotter", repos="http://R-Forge.R-project.org")

install.packages("quantstrat", repos="http://R-Forge.R-project.org")

Since the FinancialInstrument and blotter packages are dependencies for quantstrat, you can download and install all three at once with just the last line.

In some cases, you may need to build the packages yourself. You’ll need to set your system up to compile R source code if it isn’t already. To do so, follow steps 1-3 below. If your system is already set up to compile R source code, you can skip to step 4.

1) Install the Rtools package (must be done manually from http://www.murdoch-sutherland.com/Rtools/)

2) Install LaTeX from www.miktex.org/

3) Install InnoSetup from http://www.jrsoftware.org/

4) Download the three package source files available from R-forge http://r-forge.r-project.org/R/?group_id=316

5) Install the packages using the commands below (substituting the appropriate version numbers):

install.packages("C:/yourpath/FinancialInstrument_0.9.18.tar.gz", repos = NULL, type="source")

install.packages("C:/yourpath/blotter_0.8.4.tar.gz", repos = NULL, type="source")

install.packages("C:/yourpath/quantstrat_0.6.1.tar.gz", repos = NULL, type="source")

Note that these directions are relevant until the packages are available on CRAN, after which, you’ll be able to download and install them like any other package (I’ll make a note on this post once that happens). Also note that since these packages are under heavy development, you’ll want to update them often.

Bayesian Computation with R – Albert (2009)

Title: Bayesian Computation with R
Author(s): Jim Albert
Publisher/Date: Springer/2009
Statistics level: High
Programming level: Low
Overall recommendation: Recommended

Bayesian Computation with R focuses primarily on providing the reader with a basic understanding of Bayesian thinking and the relevant analytic tools included in R. It does not explore either of those areas in detail, though it does hit the key points for both.

As with many R books, the first chapter is devoted to an introduction of data manipulation and basic analyses in R. This introductory chapter focuses more heavily on analyses than the similar introductory chapters in many other texts. The new R user who hasn’t yet built up a library of these chapters will find it useful, but for experienced R users or those with multiple R texts, there is little new information.

Albert’s introduction to the foundational Bayesian concepts (e.g., Bayes’ theorem) is concise and will be clear to those with a statistical background, but others may need to refresh their statistical knowledge before they can fully grasp the content in the second chapter. Those from programming backgrounds without extensive statistical knowledge may be better off beginning with a text that deals specifically with Bayesian analysis.

Many of the topics discussed in this text have limited application, but possibly the most broadly applicable chapter deals with Bayesian regression. Those interested in learning how to run and diagnose Bayesian regression in R will find almost everything they need to know here.

As with many R texts, Bayesian Computation with R has an accompanying package of functions available on CRAN (“LearnBayes”). The functions in this package are focused mainly on teaching Bayesian analysis, but also include some useful basic implementations.

This book straddles the line between introductory theory and intermediate-level statistical programming. Because of the omissions of information on each side of that line, the reader will get the most mileage from the text if he or she has access to resources (i.e., other texts, colleagues, or previous knowledge) that can fill in those omissions. For that reason, it would work well as a text for an upper-level course on Bayesian statistics and their application, but it is not well suited as a reference text, or as a guide for real-world analysis.

Overall, I recommend this book, with the caveat that interested readers should review the sample pages available on the Springer website and the functions in the “LearnBayes” package prior to purchasing. The text is currently available for approximately $50 in paperback and $40 for the Kindle version.