Bayesian Computation with R – Albert (2009)

Title: Bayesian Computation with R
Author(s): Jim Albert
Publisher/Date: Springer/2009
Statistics level: High
Programming level: Low
Overall recommendation: Recommended

Bayesian Computation with R focuses primarily on providing the reader with a basic understanding of Bayesian thinking and the relevant analytic tools included in R. It does not explore either of those areas in detail, though it does hit the key points for both.

As with many R books, the first chapter is devoted to an introduction to data manipulation and basic analyses in R. This introductory chapter focuses more heavily on analysis than the similar chapters in other texts do. The new R user who hasn’t yet built up a library of such introductions will find it useful, but experienced R users, and those who already own several R texts, will find little new information.

Albert’s introduction to the foundational Bayesian concepts (e.g., Bayes’ theorem) is concise and will be clear to those with a statistical background, but others may need to refresh their statistical knowledge before they can fully grasp the content in the second chapter. Those from programming backgrounds without extensive statistical knowledge may be better off beginning with a text that deals specifically with Bayesian analysis.

Many of the topics discussed in this text have limited application, but possibly the most broadly applicable chapter deals with Bayesian regression. Those interested in learning how to run and diagnose Bayesian regression in R will find almost everything they need to know here.

As with many R texts, Bayesian Computation with R has an accompanying package of functions available on CRAN (“LearnBayes”). The functions in this package are focused mainly on teaching Bayesian analysis, but also include some useful basic implementations.

This book straddles the line between introductory theory and intermediate-level statistical programming. Because of the omissions of information on each side of that line, the reader will get the most mileage from the text if he or she has access to resources (i.e., other texts, colleagues, or previous knowledge) that can fill in those omissions. For that reason, it would work well as a text for an upper-level course on Bayesian statistics and their application, but it is not well suited as a reference text, or as a guide for real-world analysis.

Overall, I recommend this book, with the caveat that interested readers should review the sample pages available on the Springer website and the functions in the “LearnBayes” package prior to purchasing. The text is currently available for approximately $50 in paperback and $40 for the Kindle version.

Building Scoring and Ranking Systems in R

This guest article was written by author and consultant Tristan Yates (see his bio below). It emphasizes R’s data object manipulation and scoring capabilities via a detailed financial analysis example.

Scoring and ranking systems are extremely valuable management tools. They can be used to predict the future, make decisions, and improve behavior – sometimes all of the above. Think about how the simple grade point average is used to motivate students and make admissions decisions.

R is a great tool for building scoring and ranking systems. It’s a programming language designed for analytical applications, with statistical capabilities and the ability to store and manipulate data in list and table form built right into the core language.

But there’s also some validity to the criticism that R provides too many choices and not enough guidance. The best solution is to share your work with others, so in this article we show a basic design workflow for one such scoring and ranking system that can be applied to many different types of projects.

The Approach
For a recent article in Active Trader, we analyzed the risk of different market sectors over time with the objective of building less volatile investment portfolios. Every month, we scored each sector by its risk, using its individual ranking within the overall population, and used these rankings to predict future risk.

Here’s the workflow we used, and it can be applied to any scoring and ranking system that must perform over time (which most do):

  1. Load in the historical data for every month and ticker symbol.
  2. Load in the performance data for every month and ticker symbol.
  3. Generate scores and rankings for every month and ticker symbol based upon their relative position in the population on various indicators.
  4. Review the summary and look for trends.

In these steps, we used four data frames, as shown below:

Name          Contents
my.history    historical data
my.scores     scoring components, total scores, rankings
my.perf       performance data
my.summary    summary or aggregate data

One of my habits is to prefix my variables – it helps prevent collisions in the R namespace.

Some people put all of their data in the same data.frame, but keeping it separate reinforces good work habits. First, the historical data and performance data should never be manipulated, so it makes sense to keep them away from the more volatile scoring data.

Second, it helps draw a clear distinction between what we know at one point in time – which is historical data – and what we will know later – which is the performance data. That’s absolutely necessary for the integrity of the scoring system.

The my.history, my.scores, and my.perf data.frames are organized like this:

yrmo     ticker   var1   var2   etc…
200401   XLF
200401   XLB
etc…

yrmo is the year and month, and ticker is the item to be scored. We maintain our own lists of dates (in yrmo format) and items in my.dates and my.items. Both of these lists are called drivers, as they are used to iterate through the data.frames, and we also have a useful data.frame called my.driver which holds only the yrmo and ticker columns.

One trick – we keep the order the same for all of these data.frames. That way we can use indexes on one to query another. For example, this works just fine:

  Vol.spy <- my.history$vol.1[my.scores$rank==1]

Loading Data
First, we get our driver lists and my.driver data.frame set up. We select our date range and items from our population, and then build a data.frame using the rbind command.

  #this is based on previous analysis
  my.dates <- m2$yrmo[13:(length(m2$yrmo)-3)]
  my.items <- ticker.list[2:10]

  #now the driver
  my.driver <- data.frame()
  for (z.date in my.dates) {
    my.driver <- rbind(my.driver,data.frame(ticker=my.items,yrmo=z.date))
  }

Next, let’s get our historical and performance data. We can write a function that is called once for each row in my.driver and loads whatever data is needed.

  my.seq <- 1:length(my.driver[,1])
  my.history <- data.frame(ticker=my.driver$ticker,yrmo=my.driver$yrmo,
    vol.1=sapply(my.seq,calc.sd.fn,-1,-1))

Each variable can be loaded by a function called with the sapply command. The calc.sd.fn function first looks up the ticker and yrmo from my.driver using the index provided, and then returns the data. You would have one function for each indicator that you want to load. The my.perf data.frame, which holds the performance data, is built in exactly the same way.
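
The article doesn’t show calc.sd.fn itself, so here is a minimal sketch of what such a loader might look like. It assumes a hypothetical data.frame my.daily with columns ticker, yrmo, and ret holding daily returns; the real function will differ.

  # hypothetical sketch of a loader like calc.sd.fn (not shown in the article);
  # my.daily (columns: ticker, yrmo, ret) is an assumed source of daily returns
  calc.sd.fn <- function(z.index, z.from, z.to) {
    z.ticker <- my.driver$ticker[z.index]
    z.pos    <- match(my.driver$yrmo[z.index], my.dates)
    z.months <- my.dates[(z.pos + z.from):(z.pos + z.to)]
    # volatility of this ticker's returns over the chosen window of months
    sd(my.daily$ret[my.daily$ticker == z.ticker & my.daily$yrmo %in% z.months])
  }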

Unfortunately, the rbind command is slow, but the historical and performance data only need to be loaded once.
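
One common way to speed this step up is to build the per-month pieces in a list and bind them in a single call; for example, the my.driver loop above could be written as:

  # build all the per-month pieces first, then rbind them once
  my.driver <- do.call(rbind,
    lapply(my.dates, function(z.date) data.frame(ticker = my.items, yrmo = z.date)))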

Scoring The Data
This is where R really shines. Let’s look at the highest level first.

  my.scores <- data.frame()
  for (z.yrmo in my.dates) {
    my.scores <- rbind(my.scores, calc.scores.fn(z.yrmo))
  }
  my.scores$p.tot <- my.scores$p.vol.1

Every indicator gets its own score, and those scores can then be combined in any conceivable way to create a total score. In this very simple case, we’re only scoring one indicator, so we just use that score as the total score.
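
For example, with a second indicator scored the same way (say, a hypothetical momentum score p.mom.1), the total could be a weighted combination:

  # hypothetical: weight two indicator scores into one total
  # (p.mom.1 is not part of this analysis)
  my.scores$p.tot <- 0.6 * my.scores$p.vol.1 + 0.4 * my.scores$p.mom.1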

For more complex applications, the ideal strategy is to use multiple indicators from multiple data sources to tell the same story. Ignore those who advocate reducing the number of variables and eliminating cross-correlations; instead, think like a doctor who wants to run just one more test to get that independent confirmation.

Now the calc functions:

  scaled.score.fn <- function(z.raw) {
    pnorm(z.raw, mean(z.raw), sd(z.raw)) * 100
  }
  scaled.rank.fn <- function(z.raw) {
    rank(z.raw)
  }

  calc.scores.fn <- function(z.yrmo) {
    z.df <- my.history[my.history$yrmo == z.yrmo, ]
    z.scores <- data.frame(ticker = z.df$ticker, yrmo = z.df$yrmo,
      p.vol.1 = scaled.score.fn(z.df$vol.1),
      r.vol.1 = scaled.rank.fn(z.df$vol.1))
    z.scores
  }

The calc.scores.fn function queries the data.frame to pull the population data for a single point in time. Each indicator is then passed to the scaled.score.fn and scaled.rank.fn functions, which return the scores and ranks.

Here, the pnorm function converts each raw value into a percentile of a normal distribution fitted to the population (in effect, rescaling the value’s Z-score onto a bounded 0-100 scale). This is a good practice for ensuring that a scoring system isn’t dominated by a single indicator.
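
As a quick illustration, applying scaled.score.fn to a small made-up vector shows how raw values get mapped onto the 0-100 scale:

  # toy example: middling values land near 50, extreme values push toward 0 or 100
  scaled.score.fn(c(10, 12, 15, 20, 40))
  # returns approximately 22, 27, 36, 52, and 96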

Checking the Scores
At this point, we create a new data.frame for summary analysis. We use the always useful (and always confusing) aggregate function to combine by rank. Notice how easily we can combine data from my.history, my.scores, and my.perf.

  data.frame(rank   = 1:9,
             p.tot  = aggregate(my.scores$p.tot,   list(rank = my.scores$r.vol.1), mean)$x,
             ret.1  = aggregate(my.perf$ret.1,     list(rank = my.scores$r.vol.1), mean)$x,
             sd.1   = aggregate(my.perf$ret.1,     list(rank = my.scores$r.vol.1), sd)$x,
             vol.1  = aggregate(my.history$vol.1,  list(rank = my.scores$r.vol.1), mean)$x,
             vol.p1 = aggregate(my.history$vol.1,  list(rank = my.scores$r.vol.1), mean)$x)

Here’s the result. We could check plots or correlations, but the trend – higher relative volatility in the past (vol.p1, p.tot) is more likely to mean higher relative volatility in the future (vol.1, sd.1) – is crystal clear.

rank   p.tot   ret.1    sd.1   vol.1   vol.p1
1      12.1    0.131    4.03   16.5    13.8
2      19.4    0.0872   4.82   16.6    16.1
3      27.1    0.2474   4.96   20.1    18
4      35.6    0.4247   5.31   20.9    19.9
5      44.9    0.6865   5.98   22.1    21.7
6      53      0.3235   5.84   21.5    23.2
7      65.1    1.019    5.86   24.6    25.4
8      78      0.7276   6.04   26.9    28.4
9      96.4    0.0837   9.34   35.2    38.3

In the case of our analysis, the scores aren’t really necessary – we’re only ranking nine items every month. If we did have a larger population, we could use code like this to create subgroups (six groups shown here), and then use the above aggregate function with the new my.scores$group variable.

  my.scores$group <- cut(my.scores$p.tot,
    breaks=quantile(my.scores$p.tot,(0:6)/6),include.lowest=TRUE,labels=1:6)
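
The summary step would then aggregate by group rather than by rank, for example:

  # mean total score within each of the six groups
  aggregate(my.scores$p.tot, list(group = my.scores$group), mean)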

Wrap-up
We ultimately only ended up scoring one variable, but it’s pretty easy to see how this framework could be expanded to dozens or more. Even so, it’s an easy system to describe – we grade each item by its ranking within the population. People don’t trust scoring systems that can’t be easily explained, and with good reason.

There’s not a lot of code here, and that’s a testimony to R’s capabilities. A lot of housekeeping work is done for you, and the list operations eliminate confusing nested loops. It can be a real luxury to program in R after dealing with some other “higher level” language.

We hope you find this useful and encourage you to share your own solutions as well.

Tristan Yates is the Executive Director of Yates Management, a management and analytical consulting firm serving financial and military clients. He is also the author of Enhanced Indexing Strategies and his writing and research have appeared in publications including the Wall Street Journal and Forbes/Investopedia.

Data Manipulation with R – Spector (2008)

Title: Data Manipulation with R
Author(s): Phil Spector
Publisher/Date: Springer/2008
Statistics level: N/A
Programming level: Intermediate
Overall recommendation: Highly recommended

If there is one book that every beginning R user coming from a programming background should have, it is Spector’s Data Manipulation with R. New R users with analytic backgrounds and experience with software packages such as SAS and SPSS will do well to start with Muenchen’s R for SAS and SPSS Users, especially given that a free abbreviated version is available, but those users should also make Data Manipulation with R a quick second addition to their library.

The text of this book is as concise and to the point as its title. It covers almost every relevant data manipulation topic in R, from modes and classes, through accessing data via database connections, to complex reshaping and aggregating functions. It has copious examples and the text hits just the right level of sophistication for the individual who has some experience with programming, but little experience with R idioms and data manipulation techniques.

My only critique of this book is that it skips over the basics of creating user-defined functions for data manipulation tasks. Spector addresses mapping functions to various data structures, but it seems likely that, at this level, the average R analyst would be better served by a discussion of how to simply create a function in R. Keep in mind that if you are looking for that type of information, you will need to look elsewhere. The same is true if you are looking for any sort of statistical instruction, as Data Manipulation with R focuses almost exclusively on programming.

Overall, I highly recommend this book. At around $45 USD, it is well worth the price. You’ll breeze through it on your first pass, but if you’re new to R you will get your money’s worth out of it as a reference text.

Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories, and cooking recipes. With precompiled files, accessing the data is fairly straightforward: just download the file, unzip if necessary, and import into R. For “wild” data, however, getting the data into an analyzable format is more difficult. Accessing online data of this sort is sometimes referred to as “webscraping”. Two R facilities, readLines() from the base package and getURL() from the RCurl package, make this task possible.

readLines

For basic webscraping tasks the readLines() function will usually suffice. readLines() allows simple access to webpage source data on non-secure servers. In its simplest form, readLines() takes a single argument – the URL of the web page to be read:

web_page <- readLines("http://www.interestingwebsite.com")

As an example of a (somewhat) practical use of webscraping, imagine a scenario in which we wanted to know the 10 most frequent posters to the R-help listserve for January 2009. Because the listserve is on a secure site (i.e., its URL begins with https:// rather than http://), we can't easily access the live version with readLines(). So for this example, I've posted a local copy of the list archives on this site.

One note: by itself, readLines() can only acquire the data. You'll need to use grep(), gsub(), or equivalents to parse the data and keep what you need.

# Get the page's source
web_page <- readLines("http://www.programmingr.com/jan09rlist.html")

# Pull out the appropriate line
author_lines <- web_page[grep("<I>", web_page)]

# Delete unwanted characters in the lines we pulled out
authors <- gsub("<I>", "", author_lines, fixed = TRUE)

# Present only the ten most frequent posters
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]

We can see that Gabor Grothendieck was the most frequent poster to R-help in January 2009.

The RCurl package

To get more advanced http features such as POST capabilities and https access, you'll need the RCurl package. For webscraping tasks with RCurl, use the getURL() function. After the data has been acquired via getURL(), it needs to be restructured and parsed; the htmlTreeParse() function from the XML package is tailored for just this task. Because getURL() can access secure sites, we can use the live archive as our example this time.

# Install the RCurl package if necessary
install.packages("RCurl", dependencies = TRUE)
library("RCurl")

# Install the XML package if necessary
install.packages("XML", dependencies = TRUE)
library("XML")

# Get first quarter archives
jan09 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)

jan09_parsed <- htmlTreeParse(jan09)

# Continue on similar to above
...
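
How you continue depends on the structure of the page, but as a rough sketch (assuming, as in the readLines() example above, that the author names sit inside <i> tags), the document could be re-parsed with internal nodes and queried with XPath:

# sketch only: re-parse so XPath queries can be used; the "//i" selector
# assumes author names are wrapped in <i> tags on the archive page
jan09_doc <- htmlParse(jan09, asText = TRUE)
authors <- xpathSApply(jan09_doc, "//i", xmlValue)
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts[1:10]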

For basic webscraping tasks, readLines() will be enough and avoids overcomplicating the task. For more difficult procedures, or for tasks requiring other http features, getURL() or other functions from the RCurl package may be required. For more information on cURL, visit the project page.

Helpful statistical references

In a previous article I provided a list of R programming resources. As a complement to that post, I’ve compiled a list of statistically oriented websites that colleagues and I have found useful below. For the most part, these sites focus on statistics and quantitative research methods rather than programming.

This first grouping lists sites that are mostly one-stop shops for research design and analytical information. The first two (and especially the UCLA website) are Tier I statistics/research methods sites; they are indispensable. The three remaining sites in this section cover less advanced topics and focus more on basics, but may be helpful for the R user who is more programmer than statistician.

The second group of sites consists of technical references such as statistical dictionaries and notation guides. The final section lists two sites that have detailed information and examples focused on running statistical analyses in R. Note that the UCLA site also includes many examples using R.

Comprehensive coverage

Statistical computing at UCLA

Statnotes: Topics in Multivariate Analysis, by G. David Garson

Introductory Statistics: Concepts, models, and applications

Social Research Methods Knowledge Base

Wolfram MathWorld

Technical References

StatSoft statistical glossary

Glossary of technical notation

Dictionary of Algorithms and Data Structures

R specific sites

Journal of Statistical Software

Quick-R

If you know of another site for either R programming or statistics that I’ve missed, mention it in the comments below and I’ll add it to the proper list.

Positioning charts with fig and fin

R offers several ways to spatially orient multiple graphs in a single graphing space. The layout() function and mfrow/mfcol parameter settings are adequate solutions for many tasks and allow the graphing space to be broken up into tabular or matrix-based arrangements. For more fine grained manipulation, the fig and fin parameter settings are available. This article illustrates the capabilities and use of fig and fin.
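
For reference, the tabular approach looks something like this (a minimal sketch, separate from the fig examples that follow):

# two plots side by side in a 1 x 2 grid using mfrow
par(mfrow = c(1, 2))
plot(1:10, main = "left panel")
plot(10:1, main = "right panel")
# reset to a single plotting region
par(mfrow = c(1, 1))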

First we’ll create some simulation data to work with:

# create data
sim.data <- cbind(replicate(5,runif(8,min=0, max=100)))

The code above results in a matrix object with eight rows and five columns.

The fig and fin parameters affect the same graphing elements via different units: fig takes normalized device coordinates (NDC), while fin takes dimensions in inches of the device region. Because the fig units are generally more user friendly, I will use fig in the examples below; selecting equivalent dimensions with fin would have an identical effect. Like other settings that use NDC to define graphing space, fig takes a four item vector wherein positions one and three define, as fractions of the device region, the starting points of the x and y axes, respectively, while positions two and four define the end points. The default fig setting is (0, 1, 0, 1), which uses the entire device space. The graph below illustrates the default settings of fig.

# graph cases by first column using default fig
# settings of 0 1 0 1 (the full device width and height)
par(mar=c(2, 2, 1, 1), new = FALSE, cex.axis = .6, mgp = c(0, 0, 0))

#open plot
plot(c(0,100), c(-1,1), type = "n", ylab = "", yaxt = "n", xlab = "")
points(sim.data[,1], replicate(8, 0), pch = 19, col = 1:8, cex = 1.5)
# add center reference line
abline(0,0)
legend("bottomright", fill = c(1:8), legend = c(1:8), ncol = 4)
[Figure: plot using the default fig setting]

To make the horizontal dimensions of the graph smaller, or to move the graph left or right, adjust the starting and ending x coordinates, given by the first and second positions of the fig value vector. To make the vertical dimensions of the graph smaller, or to move the graph up or down, adjust the starting and ending y coordinates given in the third and fourth positions, as below.

# decrease vertical span
par(fig=c(0, 1, .2, .8))

#open plot
plot(c(0,100), c(-1,1), type = "n", ylab = "", yaxt = "n", xlab = "")
points(sim.data[,1], replicate(8, 0), pch = 19, col = 1:8, cex = 1.5)
# add center reference line
abline(0,0)
legend("bottomright", fill = c(1:8), legend = c(1:8), ncol = 4)
[Figure: plot with fig = c(0, 1, .2, .8)]

It is possible to resize and move a single graph to any position on the graphing device using the approach above. You can also use this method to place multiple graphs of various sizes on a single device:

# place graph one in the bottom left
par(fig=c(0, .25, 0, .25), mar=c(2,.5,1,.5), mgp=c(0, 1, 0))

#open plot
plot(c(0,100), c(-1,1), type = "n", ylab = "", yaxt = "n", xlab = "")
points(sim.data[,1], replicate(8, 0), pch = 19, col = 1:8)
# add center reference line
abline(0,0)

# place graph two in the top right
# set graphing parameters for next plot and set new parameter to TRUE
par(fig=c(.75, 1, .75, 1), new = TRUE)

#open plot
plot(c(0,100), c(-1,1), type = "n", ylab = "", yaxt = "n", xlab = "")
points(sim.data[,2], replicate(8, 0), pch = 19, col = 1:8)
# add center reference line
abline(0,0)

# place main graph in the center
# set graphing parameters for next plot and set new parameter to TRUE
par(fig=c(.25, .75, .25, .75), new = TRUE)

#open plot
plot(c(0,100), c(-1,1), type = "n", ylab = "", yaxt = "n", xlab = "")
points(sim.data[,3], replicate(8, 0), pch = 19, col = 1:8, cex = 1.5)
# add center reference line
abline(0,0)
legend("bottomright", fill = c(1:8), legend = c(1:8), ncol = 4)
[Figure: three plots of different sizes positioned with fig]

For simplicity I have mostly avoided labels and titles in these graphs; however, they can be added and manipulated just as they would be without the use of fig or fin.

Online R programming resources

R can legitimately be called both a programming language and a statistical package. Many books address both the programming and statistical components of R, but invariably the discussion of statistical topics is more detailed than the discussion of programming capabilities. As a supplement, I’ve started the list of links below. Each of these sources deals specifically and almost exclusively with the programming aspects of R: objects, arrays, loops and conditional statements, custom functions, debugging, and so on. I’ll add to this list as I become aware of other sites. Please feel free to suggest additional sites in the comments.

For a list of complementary sites that focus on statistical principles and research methods (several deal specifically with R) read this article.

A Handbook of Statistical Analyses Using R – Everitt and Hothorn (2006)

Title: A Handbook of Statistical Analyses Using R
Author(s): Brian S. Everitt; Torsten Hothorn
Publisher/Date: Chapman & Hall/2006
Statistics level: Intermediate to advanced
Programming level: Intermediate
Overall recommendation: Highly recommended

A Handbook of Statistical Analyses Using R addresses several common statistical analyses in great detail. Over the course of 15 chapters, the handbook takes the reader from an introduction to R, through a discussion of statistical inference, to linear and logistic regression, tree analysis, survival analysis, longitudinal analysis, meta-analysis, factor analysis, scaling, and clustering. The handbook has a peer-reviewed journal style that will be familiar to academic researchers, and each chapter stands on its own. This approach makes the text exceptionally useful in the academic setting: a professor can distribute and assign the first chapter of the book to her Research Methods 101 course; the final chapters on scaling and dimensionality to her Psychometric Methods course; the last chapter on clustering to her Marketing Research course; and require the entire book for her graduate methods course. For custom research shops that are making the transition to R, or that frequently hire entry-level R users, this book will work well as a reference and training manual.

The handbook does show typical first-edition flaws: there are sporadic misspellings and incorrect word choices. The overall organization of the book is strong, but the chapter-level organization is less effective. Each chapter begins with a discussion of all of the datasets used in that chapter, followed by examples and applications based on those datasets. In chapters with several examples, the discussion of the data is too detached from its corresponding example; by the time the reader reaches the example based on the first dataset, they have likely forgotten the relevant details of that data’s structure. Grouping each data discussion with the example it accompanies would have made the example-based approach more effective.

The introductory section on R is one of the best I have read. It strikes an almost perfect balance between the programming and statistical features of R, and I frequently recommend this initial chapter to colleagues who have research experience but are new to R. Numerous graphs appear in the examples, and although there is virtually no general discussion of producing graphs in R, each graph presented in the text includes the code required to reproduce it. This omission is a welcome one, as it allows the authors to focus on the statistical details. Readers looking for a more general discussion of how to produce graphs in R should consider Data Analysis and Graphics Using R.

Controlling margins and axes with oma and mgp

When creating graphs, we’re usually most concerned with what happens near the center of our displays, as this is where most of the important information is generally held. But sometimes, either for aesthetics or clarity, we want to adjust what’s outside of the box – in the margins, labels or tick marks. The par() function offers several ways to do this and I’ll discuss two that deal primarily with spatial orientation – rather than content – below.

The oma, omd, and omi options

To control the width of the outer margins of your graph (the empty sections outside of the axes and labels), use the oma, omd, or omi option of the par() function. All three of these options affect the same space and differ only in the units used to define it: oma defines the space in lines of text, omd as fractions of the device region, and omi in inches. oma and omi take a four item vector where position one sets the bottom margin, position two the left margin, position three the top margin, and position four the right margin. omd uses a four item vector where positions one and three define, as fractions of the device region, the starting points of the x and y axes, respectively, while positions two and four define the end points. Because these options all affect the same graph space, changing one also changes the remaining two. A few examples of code and the charts they produce are shown below. To help illustrate the different margin sizes, the blue area indicates the dimensions of the device display:

# generate some data
x<-log(seq(1:100))

# oma, omd, and omi defaults
par(c("oma", "omd", "omi"))

$oma
[1] 0 0 0 0

$omd
[1] 0 1 0 1

$omi
[1] 0 0 0 0

# plot using default margin settings
plot(x,pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
title("Default")
[Figure: plot with default oma settings]
# add four lines to bottom and top margins
par(oma = c(4, 0, 4, 0))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
title("oma = c(4, 0, 4, 0)")
[Figure: plot with oma = c(4, 0, 4, 0)]
# change via omd
par(omd = c(.15, .85, .15, .85))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
title("omd = c(.15, .85, .15, .85)")
[Figure: plot with omd = c(.15, .85, .15, .85)]
# because oma, omd, and omi all affect the same graph space
# this doesn't make sense
par(omi = c(0, 0, 0, 0), omd = c(.10, .90, .10, .90))

# reset oma, omd, and omi to default by changing omi
par(omi = c(0, 0, 0, 0))

The mgp option

In addition to changing the margin size of your charts, you may also want to change the way axes and labels are spatially arranged. One method of doing so is the mgp parameter option. The mgp setting is defined by a three item vector wherein the first value gives the margin line on which axis titles are drawn, the second the line for the tick mark labels, and the third the line for the axis line itself. As with the oma option discussed above, these distances are given in lines of text. The default mgp setting is c(3, 1, 0). The examples below illustrate the effects of changing the various mgp values. Note: the mgp.axis() function in the Hmisc package can be used to change these settings for each axis individually.

# mgp default settings
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
[Figure: plot with default mgp settings]
# move labels close to axes
par(mgp = c(0, 1, 0))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
[Figure: axis labels moved close to the axes, mgp = c(0, 1, 0)]
# move tick labels out
par(mgp = c(0, 3, 0))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
[Figure: tick mark labels moved outward, mgp = c(0, 3, 0)]
# move tick lines out
par(mgp = c(0, 3, 2))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
[Figure: tick marks moved outward, mgp = c(0, 3, 2)]

Summary
The oma, omd, omi, and mgp parameter settings can be useful in defining and adjusting the outer regions of your charts. To arrange and size multiple graphing areas, you may also find other settings such as fig and fin, or the layout() function, helpful.

Data Analysis and Graphics Using R – Maindonald and Braun (2003)

Title: Data Analysis and Graphics Using R: An Example-Based Approach
Author(s): John Maindonald; W. John Braun
Publisher/Date: Cambridge University Press/2003
Statistics level: Intermediate to advanced
Programming level: Beginner to intermediate
Overall recommendation: Highly recommended

Data Analysis and Graphics Using R (DAAG) covers an exceptionally large range of topics. Because of the book’s breadth, new and experienced R users alike will find the text helpful as a learning tool and resource, but it will be of most service to those who already have a basic understanding of statistics and the R system.

Although the text includes both an Introduction to R section (chapter one) and a discussion of the basics of quantitative data analysis (chapters two through four), these chapters will be most useful as overviews (or reviews for more experienced readers), as they lack the detail required to take a reader from no knowledge of these subjects to a functional understanding. For example, chapter one discusses importing data in .txt and .csv format, but the foreign package is not discussed until chapter fourteen – the final chapter of the book. In practice, plain .txt files are not so common that a discussion of the foreign package can reasonably be relegated to the final chapter, and a researcher stuck with a .sav or .dbf file would not leave chapter one with enough knowledge to import their data into R.

Chapters five through thirteen deal primarily with different flavors of regression techniques. These chapters are the truly valuable pieces of this work, as each chapter covers one or two approaches in detail. The major analyses covered in this section include bivariate and multivariate regression, GLM and survival models, time-series analyses, repeated measures, classification trees, and factor analysis. As regression techniques are a core component of quantitative methods, these chapters will be useful to many researchers across many industries and disciplines. Much of the discussion of graphing comes via diagnostic and exploratory techniques that are related to the analyses in this section.

As the subtitle suggests, examples of code accompany most significant discussions of analyses. Additionally, several full color plates of graphs are included in the appendices, allowing the authors to provide examples of color options.

DAAG is highly recommended for readers who have at least a basic understanding of quantitative analysis and some limited experience with R; more advanced readers will also find this book useful as a review and reference.