Playing with Gradient Descent in R

Gradient Descent is a workhorse of the machine learning world. As proof of its importance, it is one of the first algorithms Andrew Ng discusses in his canonical Coursera Machine Learning course. There are many flavors and adaptations, but starting simple is usually a good thing. In this example, it is used to minimize the cost function (the sum of squared errors, scaled by 1/2m) to obtain parameter estimates for a linear model. I.e.:

\text{minimize } J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

Which, for a linear model, yields the following update rules (both parameters are updated simultaneously on each iteration):

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}\right)

Where \theta_0 is our intercept and \theta_1 is the coefficient of our only predictor variable, so the hypothesis is h_\theta(x) = \theta_0 + \theta_1 x.

Ng’s course is Octave-based, but manually implementing the algorithm in an R script is a fun, simple exercise, and if you’re primarily an R user it might help you understand the algorithm better than the Octave examples do. The full code is in this repository, but here is the walkthrough, with a short code sketch after the list:

  • Create some linearly related data with known relationships
  • Write a function that takes the data and starting (or current) estimates as inputs
  • Calculate the cost based on the current estimates
  • Adjust the estimates in the direction and magnitude indicated by the scaling factor \alpha.
  • Recursively run the function, providing the new parameter estimates each time
  • Stop when the estimate converges (i.e., meets the stopping criteria based on the change in the estimates)
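The repository code is more complete, but here is a minimal, self-contained sketch of the same steps with simulated data, a fixed \alpha, and a stopping rule based on the change in the estimates (iteration is written as a plain loop rather than a recursive call):

# Create some linearly related data with a known relationship
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)      # known intercept = 2, slope = 3

alpha  <- 0.01                   # scaling factor (learning rate)
theta0 <- 0                      # starting estimate for the intercept
theta1 <- 0                      # starting estimate for the slope
m      <- length(y)

# Repeatedly adjust the estimates until the change is negligible
for (iter in 1:50000) {
  h     <- theta0 + theta1 * x               # current predictions
  grad0 <- (1 / m) * sum(h - y)              # partial derivative w.r.t. theta0
  grad1 <- (1 / m) * sum((h - y) * x)        # partial derivative w.r.t. theta1
  new0  <- theta0 - alpha * grad0
  new1  <- theta1 - alpha * grad1
  if (abs(new0 - theta0) < 1e-9 && abs(new1 - theta1) < 1e-9) break   # stopping criteria
  theta0 <- new0
  theta1 <- new1
}

c(theta0, theta1)                # should be close to coef(lm(y ~ x))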

This code is for a simple single-variable model. Adding more variables means calculating the partial derivative with respect to each parameter; in other words, adding a version of the \theta_1 update for each feature in the model. I.e.,

\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\right)
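With several predictors, the same update can be written in vectorized form. Here is a short sketch with new simulated data, where X is a design matrix whose leading column of 1s handles the intercept:

# Vectorized gradient descent over several predictors (illustrative sketch)
set.seed(2)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(100)

X     <- cbind(1, x1, x2)        # design matrix; the column of 1s is for theta_0
theta <- rep(0, ncol(X))         # starting estimates
alpha <- 0.01

for (iter in 1:20000) {
  theta <- theta - (alpha / nrow(X)) * t(X) %*% (X %*% theta - y)
}

drop(theta)                      # compare with coef(lm(y ~ x1 + x2))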

I sometimes use Gradient Descent as a ‘Hello World’ program when I’m playing with statistical packages. It helps you get a feel for the language and its capabilities.

Global Indicator Analyses with R

I was recently asked by a client to create a large number of “proof of concept” visualizations that illustrated the power of R for compiling and analyzing disparate datasets. The client was specifically interested in automated analyses of global data. A little research led me to the WDI package.

The WDI package is a tool to “search, extract and format data from the World Bank’s World Development Indicators” (WDI help). In essence, it is an R-based wrapper for the World Bank Economic Indicators Data API. When used in combination with the information on the World Bank data portal it provides easy access to thousands of global datapoints.

Here is an example use case that illustrates how easy the package is to use, especially with a little help from the countrycode and ggplot2 packages:

library(WDI)
library(ggplot2)
library(countrycode)

# Use the WDIsearch function to get a list of fertility rate indicators
indicatorMetaData <- WDIsearch("Fertility rate", field="name", short=FALSE)

# Define a list of countries for which to pull data
countries <- c("United States", "Britain", "Sweden", "Germany")

# Convert the country names to iso2c format used in the World Bank data
iso2cNames <- countrycode(countries, "country.name", "iso2c")

# Pull data for each country for the first two fertility rate indicators, for the years 2001 to 2011
wdiData <- WDI(iso2cNames, indicatorMetaData[1:2,1], start=2001, end=2011)

# Pull out the indicator codes (these become the column names in the WDI data)
indicatorNames <- indicatorMetaData[1:2, 1]

# Create trend charts for the first two indicators
for (indicatorName in indicatorNames) { 
  pl <- ggplot(wdiData, aes(x=year, y=wdiData[,indicatorName], group=country, color=country))+
    geom_line(size=1)+
    scale_x_continuous(name="Year", breaks=c(unique(wdiData[,"year"])))+
    scale_y_continuous(name=indicatorName)+
    scale_linetype_discrete(name="Country")+
    theme(legend.title=element_blank())+
    ggtitle(paste(indicatorMetaData[indicatorMetaData[,1]==indicatorName, "name"], "\n"))
  ggsave(paste(indicatorName, ".jpg", sep=""), pl)
}

(Figures: WDI package visualization 1 and WDI package visualization 2 – the fertility rate trend charts produced by the loop above.)

This code can be adapted to quickly pull and visualize many pieces of data. Even if you don’t have an analytic need for the WDI data, the ease of access and depth of information available via the WDI package make them perfect for creating toy examples for classes, presentations or blogs, or conveying the power and depth of available R packages.

R Helper Functions

If you do a lot of R programming, you probably have a list of R helper functions set aside in a script that you include on R startup or at the top of your code. In some cases helper functions add capabilities that aren’t otherwise available. In other cases, they replicate functionality that is available elsewhere without loading unnecessary components. Below I present two of my most frequently used data manipulation helper functions as examples.

### Descriptives R Helper Function
# Display some basic descriptives
descs <- function (x) {
  if(!hidetables) {
    if(length(unique(x))>30) {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Number of categories is greater than 30, table not produced')
    } else {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Table results:')
      print(table(x, useNA='always'))
    }
   
  } else {
    print('Tables are hidden')
  }
}

# Set hide tables to true to hide tables
hidetables <- FALSE


### Dummy Variable R Helper Function
# Create dummy variables for each level of a categorical variable
createDummies <- function(x, df, keepNAs = TRUE) {
  for (i in seq(1, length(unique(df[, x])))) {
    if(keepNAs) {
      df[, paste(x,'.', i, sep = '')] <- ifelse(df[, x] != i, 0, 1)
    } else {
      df[, paste(x,'.', i, sep = '')] <- ifelse(df[, x] != i | is.na(df[, x]) , 0, 1)     
    }
  }
  df
}

The Descriptives R Helper Function produces a summary and, where appropriate, a frequency table of the passed variable/object; it uses the number of unique values to determine whether to call just summary() or both summary() and table(). It also includes NAs in the tables by default (working around one of table()’s biggest annoyances). Once the exploratory and data manipulation work is done, all output from this function can be suppressed by setting the hidetables object to TRUE.
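For instance, a quick sketch of how it behaves with built-in data:

# Quick usage sketch (hidetables is FALSE, as set above): a high-cardinality
# variable gets only a summary, while a low-cardinality variable also gets a table
descs(rnorm(100))    # more than 30 unique values: summary only
descs(mtcars$cyl)    # 3 unique values: summary plus table (NAs included)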

The Dummy Variable R Helper Function creates indicator variables from all values of a variable. Based on experience, I avoid the factor object as much as possible, and this approach allows me to quickly create indicators that can be used in any way I want.
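A quick usage sketch with a toy data frame (hypothetical data, with the categorical variable coded 1 through 3):

# Usage sketch: a variable coded 1..3 becomes three indicator columns
df <- data.frame(id = 1:5, grp = c(1, 2, 3, 1, 2))
df <- createDummies("grp", df)
df   # now contains grp.1, grp.2, and grp.3; set keepNAs = FALSE to code NAs as 0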

If you don’t have a programming background or are just beginning with R, you might not have had time to realize the benefit of helper functions or identify the tasks you do repetitively, but it’s worthwhile to give the issue some consideration. Helper functions can be exceptionally useful for saving time on repetitive tasks or facilitating your work. They’re so useful, in fact, that there is a special ProgrammingR event planned around helper functions coming soon. For the more experienced R programmers out there, make a mental note of the most useful helper functions you’ve written in the past. That list will come in handy in the near future!

The Art of R Programming – Matloff (2011)

It’s difficult to write a book on an entire programming language and keep it manageable and concise, but The Art of R Programming does it as well as any text I’ve seen. Matloff covers, in detail and among other things, R data structures, programming idioms, performance enhancements, interfaces with other languages, debugging and graphing.

Title:The Art of R Programming
Author(s): Norman Matloff
Publisher/Date: No Starch Press/2011
Statistics level: Very Low
Programming level: Intermediate
Overall recommendation: Highly Recommended

There is the requisite “Introduction to R” section that is present in almost all R texts, but beginners who rely on this chapter would do well to re-read ARP after gaining some practical experience with R. The issues that Matloff addresses and the solutions he provides are more salient after you’ve spent hours trying to resolve them.

The section on graphing is a good overview, but the average programmer may find it less useful than the other sections. Anyone looking for graphic optimization tips will be better served by a book focused specifically on graphing.

With that minor critique in mind, put simply, The Art of R Programming is a must read for all intermediate level R programmers. It covers nearly every method of performance enhancement available and provides a review of key fundamentals that may have been forgotten or missed.

One point of note: this text focuses almost solely on programming – the statistical examples are a means to an end, not an end in themselves. For that reason, this book is recommended for those seeking to improve the efficiency of their programming rather than their statistical acumen.

At around $25 USD from Amazon, The Art of R Programming is one of the best R text values available. I highly recommend it for almost all R users. (You can also purchase this book directly from the publisher and get both the print and e-book versions for around $40.)

Animations in R

Animated charts can be very helpful in illustrating concepts or discovering relationships, which makes them valuable tools in teaching and exploratory research. Fortunately, creating animated graphs in R is fairly straightforward once you have the right tools and understand a few basic principles about how the animations are created.
In this article I’ll provide an example of how to use the animation package to create an animated chart with a couple of bells and whistles.
The package installs out-of-the-box with several animations that are tailored for instruction. The examples are of varying complexity ranging from a simple coin flip simulation to illustrations of mathematical problems such as Buffon’s needle problem. In most scenarios, however, you’ll want to create your own animations, so let’s look at how to do that.
First, there are several different formats in which you can create your animations – GIF, HTML, LaTeX, SWF and mp4. The saveGIF() function call below illustrates the generic format for each of the calls:

saveGIF({
  for (i in 1:10) plot(runif(10), ylim = 0:1)
})

Understanding that the package creates animations by generating and then compiling many individual graphs is central to creating polished custom animations. As you can see, the syntax looks a little unfamiliar at first because the inside of the function call is a custom loop that creates the individual graphs. (Note: if you’re familiar with the way the boot() function works, this is somewhat similar.) Once those individual graphs are created, the function compiles the images into the format specified by the function call. As you might have guessed, most of the animation types require that you install 3rd party libraries for R to be able to do the compilations. The installation of these libraries is covered in the package help.
Basic use of the animation functions is covered in the package help, but the application of the functions to novel tasks can still be a little difficult. As a result, I’ve created an example that illustrates how to use the functions to create animations with a couple of bells and whistles.
This animation plots the density functions of 150 draws of 100 values from a normally distributed random variable. To make things a little more interesting (i.e., make the distribution move), a constant that varies based on the iteration count is added to the 100 values. The chart also includes a slightly stylized frame tracker (or draw counter) along the top of the chart and a horizontal bar that notes the current position and previous two positions of the sample mean. Finally, the foreground color of the chart changes based on the mean of the distribution.

library(animation)

#Set delay between frames when replaying
ani.options(interval=.05)

# Set up a vector of colors for use below
col.range <- heat.colors(15)

# Begin animation loop
# Note the brackets within the parentheses
saveGIF({
	# For the most part, it's safest to start with graphical settings in 
	# the animation loop, as the loop adds a layer of complexity to 
	# manipulating the graphs. For example, the layout specification needs to 
	# be within animation loop to work properly.
	layout(matrix(c(1, rep(2, 5)), 6, 1))

	# Adjust the margins a little
	par(mar=c(4,4,2,1) + 0.1)

		# Begin the loop that creates the 150 individual graphs
		for (i in 1:150) {
			# Pull 100 observations from a normal distribution
			# and add a constant based on the iteration to move the distribution
			chunk <- rnorm(100)+sqrt(abs((i)-51))

			# Reset the color of the top chart every time (so that it doesn't change as the 
			# bottom chart changes)
			par(fg=1)

			# Set up the top chart that keeps track of the current frame/iteration
			# Dress it up a little just for fun
			plot(-5, xlim = c(1,150), ylim = c(0, .3), axes = F, xlab = "", ylab = "", main = "Iteration")
			abline(v=i, lwd=5, col = rgb(0, 0, 255, 255, maxColorValue=255))
			abline(v=i-1, lwd=5, col = rgb(0, 0, 255, 50, maxColorValue=255))
			abline(v=i-2, lwd=5, col = rgb(0, 0, 255, 25, maxColorValue=255))

			# Bring back the X axis
			axis(1)

			# Set the color of the bottom chart based on the distance of the distribution's mean from 0
			par(fg = col.range[mean(chunk)+3])

			# Set up the bottom chart
			plot(density(chunk), main = "", xlab = "X Value", xlim = c(-5, 15), ylim = c(0, .6))

			# Add a line that indicates the mean of the distribution. Add additional lines to track
			# previous means
			abline(v=mean(chunk), col = rgb(255, 0, 0, 255, maxColorValue=255))
			if (exists("lastmean")) {abline(v=lastmean, col = rgb(255, 0, 0, 50, maxColorValue=255)); prevlastmean <- lastmean;}
			if (exists("prevlastmean")) {abline(v=prevlastmean, col = rgb(255, 0, 0, 25, maxColorValue=255))}

			# Store the current mean so the next iteration can draw it as the previous mean
			lastmean <- mean(chunk)
		}
})

And the final product:

A couple of closing notes:

  • Because there are external programs involved (e.g., SWF Tools, ImageMagick, FFmpeg), the setup for this package is slightly more difficult than the average package and things will likely seem less polished than normal. Things may also not work as well; you’ll need to be prepared to be flexible with your animation formats and graph layouts.
  • Animation works exceptionally well when smaller numbers of individual graphs are being compiled, but as the number of individual graphs grows, so does your likelihood of hitting a problem. E.g., although GIF is a very exportable and transportable format, and therefore ideal for many situations, I found that animations with more than ~500 source graphs just didn’t compile. The limit for HTML was similar. Your mileage may vary, but again, be prepared to be flexible.
  • If you do not need to transport your animation and it will have less than a few hundred individual images, you can avoid installing 3rd party software by using the saveHTML function (see the short example after this list). This output also includes an interface that allows you to pause and move within the animation easily.
  • As mentioned in the code above, if you’re having trouble getting a particular graphical parameter to work, make sure that it is in the internal loop. For efficiency, you want to keep the loop as clean as possible of course, but some things need to be specified each time a new chart is plotted, and therefore need to be inside the loop.
  • Animations aren’t very common in research presentations, but can provide extensive insight beyond static images. Given R’s advanced graphing capabilities, it’s possible to create very nice animations without needing to learn a completely different software package.
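As referenced above, here is a minimal sketch of the saveHTML approach: the same generic pattern as saveGIF(), but compiled into a self-contained HTML page with a player interface (the output file name is just an example):

library(animation)

# Same generic loop-inside-the-call pattern, no external compilers required
saveHTML({
  for (i in 1:10) plot(runif(10), ylim = 0:1)
}, htmlfile = "random_points.html")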

If you’ve created an animation you’d like to share or have additional tips, feel free add them to the comments.

Installing quantstrat from R-forge and source

R is used extensively in the financial industry; many of my recent clients have been working in or developing products for the financial sector. Some common applications are using R to analyze market data and evaluate quantitative trading strategies. Custom solutions are almost always the best way to do this, but the quantstrat package can make it easy to quickly get a high-level understanding of a strategy’s potential. However, quantstrat is still under development, and this, combined with a lack of documentation and the complex nature of the tasks involved, makes it difficult to work with.

This article addresses one of the most basic issues with quantstrat – getting it installed. quantstrat and its required packages currently aren’t available on CRAN – you have to get them from R-forge. As a result, the installation is slightly less straightforward than for other packages, and it provides an opportunity to discuss how to install packages from R-forge and locally from source. Although this article focuses on installing quantstrat, these instructions will help with any R package that you need to build from source.

If you’re installing from R-forge, the process is only moderately different than installing from CRAN; simply change the install.packages command to point to the R-forge repository:

install.packages("FinancialInstrument", repos="http://R-Forge.R-project.org")

install.packages("blotter", repos="http://R-Forge.R-project.org")

install.packages("quantstrat", repos="http://R-Forge.R-project.org")

Since the FinancialInstrument and blotter packages are dependencies for quantstrat, you can download and install all three at once with just the last line.

In some cases, you may need to build the packages yourself. You’ll need to set your system up to compile R source code if it isn’t already. To do so, follow steps 1-3 below. If your system is already set up to compile R source code, you can skip to step 4.

1) Install Rtools (must be done manually from http://www.murdoch-sutherland.com/Rtools/)

2) Install LaTeX from www.miktex.org/

3) Install Inno Setup from http://www.jrsoftware.org/

4) Download the three package source files available from R-forge http://r-forge.r-project.org/R/?group_id=316

5) Install the packages using the commands below (substituting the appropriate version numbers):

install.packages("C:/yourpath/FinancialInstrument_0.9.18.tar.gz", repos = NULL, type="source")

install.packages("C:/yourpath/blotter_0.8.4.tar.gz", repos = NULL, type="source")

install.packages("C:/yourpath/quantstrat_0.6.1.tar.gz", repos = NULL, type="source")

Note that these directions are relevant only until the packages are available on CRAN, after which you’ll be able to download and install them like any other package (I’ll make a note on this post once that happens). Also note that since these packages are under heavy development, you’ll want to update them often.

Bayesian Computation with R – Albert (2009)

Title: Bayesian Computation with R
Author(s): Jim Albert
Publisher/Date: Springer/2009
Statistics level: High
Programming level: Low
Overall recommendation: Recommended

Bayesian Computation with R focuses primarily on providing the reader with a basic understanding of Bayesian thinking and the relevant analytic tools included in R. It does not explore either of those areas in detail, though it does hit the key points for both.

As with many R books, the first chapter is devoted to an introduction to data manipulation and basic analyses in R. This introductory chapter focuses more heavily on analyses than many of the similarly focused chapters in other texts do. The new R user who hasn’t yet built up a library of these chapters will find it useful, but for experienced R users or those with multiple R texts, there is little new information.

Albert’s introduction to the foundational Bayesian concepts (e.g., Bayes’ theorem) is concise and will be clear to those with a statistical background, but others may need to refresh their statistical knowledge before they can fully grasp the content in the second chapter. Those from programming backgrounds without extensive statistical knowledge may be better off beginning with a text that deals specifically with Bayesian analysis.

Many of the topics discussed in this text have limited application, but possibly the most broadly applicable chapter deals with Bayesian regression. Those interested in learning how to run and diagnose Bayesian regression in R will find almost everything they need to know here.

As with many R texts, Bayesian Computation with R has an accompanying package of functions available on CRAN (“LearnBayes”). The functions in this package are focused mainly on teaching Bayesian analysis, but also include some useful basic implementations.

This book straddles the line between introductory theory and intermediate-level statistical programming. Because of the omissions of information on each side of that line, the reader will get the most mileage from the text if he or she has access to resources (i.e., other texts, colleagues, or previous knowledge) that can fill in those omissions. For that reason, it would work well as a text for an upper-level course on Bayesian statistics and their application, but it is not well suited as a reference text, or as a guide for real-world analysis.

Overall, I recommend this book, with the caveat that interested readers should review the sample pages available on the Springer website here and the functions in the “LearnBayes” package prior to purchasing. The text is currently available for approximately $50 in paperback and $40 for the Kindle version.

Building Scoring and Ranking Systems in R

This guest article was written by author and consultant Tristan Yates (see his bio below). It emphasizes R’s data object manipulation and scoring capabilities via a detailed financial analysis example.

Scoring and ranking systems are extremely valuable management tools. They can be used to predict the future, make decisions, and improve behavior – sometimes all of the above. Think about how the simple grade point average is used to motivate students and make admissions decisions.

R is a great tool for building scoring and ranking systems. It’s a programming language designed for analytical applications with statistical capabilities. The capability to store and manipulate data in list and table form is built right into the core language.
 

But there’s also some validity to the criticism that R provides too many choices and not enough guidance. The best solution is to share your work with others, so in this article we show a basic design workflow for one such scoring and ranking system that can be applied to many different types of projects.

The Approach
For a recent article in Active Trader, we analyzed the risk of different market sectors over time with the objective of building less volatile investment portfolios. Every month, we scored each sector by its risk, using its individual ranking within the overall population, and used these rankings to predict future risk.

Here’s the workflow we used, and it can be applied to any scoring and ranking system that must perform over time (which most do):

  1. Load in the historical data for every month and ticker symbol.
  2. Load in the performance data for every month and ticker symbol.
  3. Generate scores and rankings for every month and ticker symbol based upon their relative position in the population on various indicators.
  4. Review the summary and look for trends.

In these steps, we used four data frames, as shown below:

Name         Contents
my.history   historical data
my.scores    scoring components, total scores, rankings
my.perf      performance data
my.summary   summary or aggregate data

One of my habits is to prefix my variables – it helps prevent collisions in the R namespace.

Some people put all of their data in the same data.frame, but keeping it separate reinforces good work habits. First, the historical data and performance data should never be manipulated, so it makes sense to keep it away from the more volatile scoring data.

Second, it helps draw a clear distinction between what we know at one point in time – which is historical data – and what we will know later – which is the performance data. That’s absolutely necessary for the integrity of the scoring system.

My.history, my.scores, and my.perf are organized like this:

yrmo     ticker   var1   var2   etc…
200401   XLF
200401   XLB
etc…

yrmo is the year and month and ticker is the item to be scored. We maintain our own list of dates (in yrmo format) and items in my.dates and my.items. Both these lists are called drivers, as they can help iterate through the data.frame, and we also have a useful data.frame called my.driver which has only the yrmo and ticker.

One trick – we keep the order the same for all of these data.frames. That way we can use indexes on one to query another. For example, this works just fine:

  Vol.spy <- my.history$vol.1[my.scores$r.vol.1 == 1]

Loading Data
First, we get our driver lists and my.driver data.frame set up. We select our date range and items from our population, and then build a data.frame using the rbind command.

  #this is based on previous analysis
  my.dates <- m2$yrmo[13:(length(m2$yrmo)-3)]
  my.items <- ticker.list[2:10]

  #now the driver
  my.driver <- data.frame()
  for (z.date in my.dates) {
    my.driver <- rbind(my.driver,data.frame(ticker=my.items,yrmo=z.date))
  }

Next, let’s get our historical and performance data. We can make a function that can be called once for each row in my.driver that then loads any data needed.

  my.seq <- 1:length(my.driver[,1])
  my.history <- data.frame(ticker=my.driver$ticker,yrmo=my.driver$yrmo,
    vol.1=sapply(my.seq,calc.sd.fn,-1,-1))

Each variable can be loaded by a function called with the sapply command. The calc.sd.fn function first looks up the ticker and yrmo from my.driver using the index provided, and then returns the data. You would have one function for each indicator that you want to load. My.perf, which holds the performance data, is built in the exact same way.
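The loader functions themselves aren’t shown here; a hypothetical sketch of one, assuming a raw long-format data frame my.raw of daily returns with yrmo, ticker, and ret columns, might look like the following (the two offset arguments, passed as -1, -1 via sapply above, would define the lookback window in a fuller implementation):

  # Hypothetical loader in the style described above; the real calc.sd.fn is not shown
  calc.sd.fn <- function(z.index, z.start.offset, z.end.offset) {
    z.ticker <- my.driver$ticker[z.index]
    z.yrmo   <- my.driver$yrmo[z.index]
    # Volatility = standard deviation of that ticker's daily returns in that month
    sd(my.raw$ret[my.raw$ticker == z.ticker & my.raw$yrmo == z.yrmo], na.rm = TRUE)
  }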

The rbind command is slow unfortunately, but loading the historical and performance data only needs to be done once.

Scoring The Data
This is where R really shines. Let’s look at the highest level first.

  my.scores <- data.frame()
  for (z.yrmo in my.dates) {
    my.scores <- rbind(my.scores,calc.scores.fn(z.yrmo))
    }
  my.scores$p.tot <- (my.scores$p.vol.1)

Every indicator gets its own score, and then those scores can be combined in any conceivable way to create a total score. In this very simple case, we’re only scoring one indicator, so we just use that score as the total score.
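For instance, if a second indicator had also been scored (say, a hypothetical p.ret.1 column), a weighted total could be built in one line:

  # Hypothetical weighted total across two scored indicators
  my.scores$p.tot <- 0.7 * my.scores$p.vol.1 + 0.3 * my.scores$p.ret.1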

For more complex applications, the ideal strategy is to use multiple indicators from multiple data sources to tell the same story. Ignore those who advocate reducing variables and cross-correlations. Instead, think like a doctor who wants to run just one more test and get that independent confirmation.

Now the calc functions:

  scaled.score.fn <- function(z.raw)
    {pnorm(z.raw,mean(z.raw),sd(z.raw))*100}
  scaled.rank.fn <- function(z.raw) {rank(z.raw)}

  calc.scores.fn <- function(z.yrmo) {
    z.df <- my.history[my.history$yrmo==z.yrmo,]
    z.scores <- data.frame(ticker=z.df$ticker,yrmo=z.df$yrmo,
      p.vol.1=scaled.score.fn(z.df$vol.1),r.vol.1=scaled.rank.fn(z.df$vol.1))
    z.scores
    }

The calc.scores.fn function queries the data.frame to pull the population data for just a single point in time. Then, each indicator is passed to the scaled.score.fn and scaled.rank.fn function, returning a list of scores and ranks.

Here, we use the pnorm function to convert each raw value into a percentile of the population (a scaled Z-score), which is good practice for ensuring that a scoring system isn’t dominated by a single indicator.

Checking the Scores
At this point, we create a new data.frame for summary analysis. We use the always useful and always confusing aggregate function and combine by rank. Notice how easily we can combine data from my.history, my.scores and my.perf.

  data.frame(rank=1:9,p.tot=aggregate(my.scores$p.tot,
    list(rank=my.scores$r.vol.1),mean)$x,ret.1=aggregate(my.perf$ret.1,
    list(rank=my.scores$r.vol.1),mean)$x,sd.1=aggregate(my.perf$ret.1,
    list(rank=my.scores$r.vol.1),sd)$x,vol.1=aggregate(my.history$vol.1,
    list(rank=my.scores$r.vol.1),mean)$x,vol.p1=aggregate(my.history$vol.1,
    list(rank=my.scores$r.vol.1),mean)$x)

Here’s the result. We could check plots or correlations, but the trend – higher relative volatility in the past (vol.p1, p.tot) is more likely to mean higher relative volatility in the future (vol.1, sd.1) – is crystal clear.

rank   p.tot   ret.1    sd.1   vol.1   vol.p1
1      12.1    0.131    4.03   16.5    13.8
2      19.4    0.0872   4.82   16.6    16.1
3      27.1    0.2474   4.96   20.1    18
4      35.6    0.4247   5.31   20.9    19.9
5      44.9    0.6865   5.98   22.1    21.7
6      53      0.3235   5.84   21.5    23.2
7      65.1    1.019    5.86   24.6    25.4
8      78      0.7276   6.04   26.9    28.4
9      96.4    0.0837   9.34   35.2    38.3

In the case of our analysis, the scores aren’t really necessary – we’re only ranking nine items every month. If we did have a larger population, we could use code like this to create subgroups (six groups shown here), and then use the above aggregate function with the new my.scores$group variable.

  my.scores$group <- cut(my.scores$p.tot,
    breaks=quantile(my.scores$p.tot,(0:6)/6),include.lowest=TRUE,labels=1:6)

Wrap-up
We ultimately only ended up scoring one variable, but it’s pretty easy to see how this framework could be expanded to dozens or more. Even so, it’s an easy system to describe – we grade each item by its ranking within the population. People don’t trust scoring systems that can’t be easily explained, and with good reason.

There’s not a lot of code here, and that’s a testimony to R’s capabilities. A lot of housekeeping work is done for you, and the list operations eliminate confusing nested loops. It can be a real luxury to program in R after dealing with some other “higher level” language.

We hope you find this useful and encourage you to share your own solutions as well.

Tristan Yates is the Executive Director of Yates Management, a management and analytical consulting firm serving financial and military clients. He is also the author of Enhanced Indexing Strategies and his writing and research have appeared in publications including the Wall Street Journal and Forbes/Investopedia.

Helpful statistical references

In a previous article I provided a list of R programming resources. As a complement to that post, I’ve compiled a list of statistically oriented websites that colleagues and I have found useful below. For the most part, these sites focus on statistics and quantitative research methods rather than programming.

This first grouping lists sites that are mostly one-stop-shops for research design and analytical information. The first two, (and especially the UCLA website) are Tier I statistics/research methods sites. They are indispensable. The three remaining sites in this section cover less advanced topics and focus more on basics, but may be helpful for the R user who is more programmer than statistician.

The second group of sites consists of technical references such as statistical dictionaries and notation guides. The final section lists two sites that have detailed information and examples focused on running statistical analyses in R. Note that the UCLA site also includes many examples using R.

Comprehensive coverage

Statistical computing at UCLA

Statnotes: Topics in Multivariate Analysis, by G. David Garson

Introductory Statistics: Concepts, models, and applications

Social Research Methods Knowledge Base

Wolfram MathWorld

Technical References

StatSoft statistical glossary

Glossary of technical notation

Dictionary of Algorithms and Data Structures

R specific sites

Journal of Statistical Software

QuickR

If you know of another site for either R programming or statistics that I’ve missed, mention it in the comments below and I’ll add it to the proper list.

Data Analysis and Graphics Using R – Maindonald and Braun (2003)

Title: Data Analysis and Graphics Using R: An Example-Based Approach
Author(s): John Maindonald; John Braun
Publisher/Date: Cambridge University Press/2003
Statistics level: Intermediate to advanced
Programming level: Beginner to intermediate
Overall recommendation: Highly recommended

Data Analysis and Graphics Using R (DAAG) covers an exceptionally large range of topics. Because of the book’s breadth, new and experienced R users alike will find the text helpful as a learning tool and resource, but it will be of most service to those who already have a basic understanding of statistics and the R system.

Although the text includes both an Introduction to R section (chapter one) and a discussion of the basics of quantitative data analysis (chapters two through four), these chapters will be most useful as overviews (or reviews for more experienced readers), as they lack the detail required to take a reader from no knowledge of these subjects to a functional understanding. For example, chapter one discusses importing data in .txt and .csv format, but the foreign package is not discussed until chapter fourteen – the final chapter of the book. In practice, .txt data structures are not common enough to justify relegating a discussion of the foreign package to the supplemental materials and a researcher stuck with a .sav or .dbf file would not leave chapter one with enough knowledge to import their data into R.

Chapters five through thirteen deal primarily with different flavors of regression techniques. These chapters are the truly valuable pieces of this work as each chapter covers one or two approaches in detail. The major analyses covered in this section include bivariate and multivariate regression, GLM and survival models, time-series analyses, repeated measures, classification trees, and factor analysis. As regression techniques are a core component of quantitative methods these chapters will be useful to many researchers across many industries and disciplines. Much of the discussion of graphing comes via diagnostic and exploratory techniques that are related to the analyses in this section.

As the subtitle suggests, examples of code accompany most significant discussions of analyses. Additionally, several full color plates of graphs are included in the appendices, allowing the authors to provide examples of color options.

DAAG is highly recommended for readers who have at least a basic understanding of quantitative analysis and at least some limited experience with R; however, more advanced readers will also find this book useful as a review and reference.