A Handbook of Statistical Analyses Using R – Everitt and Hothorn (2006)

Title: A Handbook of Statistical Analyses Using R
Author(s): Brian S. Torvitt; Torsten Hothorn
Publisher/Date: Chapman & Hall/2006
Statistics level: Intermediate to advanced
Programming level: Intermediate
Overall recommendation: Highly recommended

A Handbook of Statistical Analyses Using R addresses a list of several common statistical analyses in great detail. Over a course of 15 chapters, the handbook takes the reader from an introduction to R through a discussion of statistical inference, to linear and logistic regression, tree analysis, survival analysis, longitudinal analysis, meta-analysis, factoring, scaling, and clustering. The handbook has a peer-reviewed journal style that will be familiar to academic researchers and each chapter stands on its own. This approach makes the text exceptionally useful in the academic setting as a professor can distribute and assign the first chapter of the book to her Research Methods 101 course; the final chapters on scaling and dimensionality to her Psychometrics Methods course; the last chapter on clustering to her Marketing Research course; and require the entire book for her graduate methods course. For custom research shops making the transition to R or who frequently hire new entry level R users, this book will work well as a reference and training manual.

The handbook does show typical first edition flaws. There are sporadic mistakes in grammar such as misspellings and incorrect words. The overall organization of the book is strong, but the chapter level organization is less effective. Each chapter begins with a discussion of all of the datasets used in that chapter and is followed by examples and applications based on those datasets. In chapters where there are several examples, the discussion of the data is too detached from its corresponding example. When the reader reaches the example based on the first dataset they have likely forgotten the relevant details about that data’s structure. Grouping the data discussions with the examples they accompanied would have made the example based approach more effective.

The introductory section on R is one of the best introductory sections I have read. It strikes an almost perfect balance between the programming and statistical features of R. I frequently recommend this initial chapter to colleagues who have research experience but are new to R. There are numerous graphs included in the examples in the text and although there is virtually no general discussion of producing graphs in R, each graph presented in this text includes the code required to reproduce it. This omission is a welcome one, as it allows the authors to focus more on statistical details. Readers looking for a more general discussion of how to produce graphs in R should consider Data Analysis and Graphing Using R.

Controlling margins and axes with oma and mgp

When creating graphs, we’re usually most concerned with what happens near the center of our displays, as this is where most of the important information is generally held. But sometimes, either for aesthetics or clarity, we want to adjust what’s outside of the box – in the margins, labels or tick marks. The par() function offers several ways to do this and I’ll discuss two that deal primarily with spatial orientation – rather than content – below.

The oma, omd, and omi options

To control the width of the outer margins of your graph (the empty sections outside of the axes and labels) use either the oma, omd, or omi option of the par() function. All three of these options have the same effect and differ only in the units used to define the parameter. oma defines the space in lines, omd as a fraction of the device region, and omi specifies the size in inches. oma and omi take a four item vector where position one sets the bottom margin, position two the left margin, position three the top margin and position four the right margin. omd uses a four item vector where positions one and three define, in percentages of the device region, the starting points of the x and y axes, respectively, while positions two and four define the end points. Because these options all effect the same graph space, changing one also changes the remaining two. A few examples of code and the charts they produce are shown below. To help illustrate the different margin sizes, the blue area indicates the dimensions of the device display:

# generate some data
x<-log(seq(1:100))

# oma, omd, and omi defaults
par()

$oma
[1] 0 0 0 0

$omd
[1] 0 1 0 1

$omi
[1] 0 0 0 0

# plot using default margin settings
plot(x,pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
title("Default")
oma default
# add four lines to bottom and top margins
par(oma = c(4, 0, 4, 0))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
title("oma = c(4, 0, 4, 0)")
oma 2
# change via omd
par(omd = c(.15, .85, .15, .85))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
title("omd = c(.15, .85, .15, .85)")
oma 3
# because oma, omd, and omi all affect the same graph space
# this doesn't make sense
par(omi = c(0, 0, 0, 0), omd = c(.10, .90, .10, .90))

# reset oma, omd, and omi to default by changing omi
par(omi = c(0, 0, 0, 0))

The mgp option

In addition to changing the margin size of your charts, you may also want to change the way axes and labels are spatially arranged. One method of doing so is the mgp parameter option. The mgp setting is defined by a three item vector wherein the first value represents the distance of the axis labels or titles from the axes, the second value is the distance of the tick mark labels from the axes, and the third is the distance of the tick mark symbols from the axes. As with the oma option discussed above, the distances are given in line widths. The defaults for the mgp setting are c(3, 1, 0). The examples below illustrate the effects of changing the various mgp values. Note: the mgp.axis() function in the Hmisc package can be used to change these settings for each axis individually.

# mgp default settings
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
mgp default
# move labels close to axes
par(mgp = c(0, 1, 0))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
mgp move labels
# move tick labels out
par(mgp = c(0, 3, 0))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
mgp move tick labels
# move tick lines out
par(mgp = c(0, 3, 2))
plot(x, pch=1, col = "red", ylab = "Y Label", xlab = "X Label")
mgp move tick lines

Summary
The oma, omd, omi, and mgp parameter settings can be useful in defining and adjusting the outer regions of your charts. To arrage and size multiple graphing areas you may also find other par() settings such as fig, fin, or layout helpful.

Data Analysis and Graphics Using R – Maindonald and Braun (2003)

Title: Data Analysis and Graphics Using R: An Example-Based Approach
Author(s): John Maindonald; John Braun
Publisher/Date: Cambridge University Press/2003
Statistics level: Intermediate to advanced
Programming level: Beginner to intermediate
Overall recommendation: Highly recommended

Data Analysis and Graphics Using R (DAAG) covers an exceptionally large range of topics. Because of the book’s breadth, new and experienced R users alike will find the text helpful as a learning tool and resource, but it will be of most service to those who already have a basic understanding of statistics and the R system.

Although the text includes both an Introduction to R section (chapter one) and a discussion of the basics of quantitative data analysis (chapters two through four), these chapters will be most useful as overviews (or reviews for more experienced readers), as they lack the detail required to take a reader from no knowledge of these subjects to a functional understanding. For example, chapter one discusses importing data in .txt and .csv format, but the foreign package is not discussed until chapter fourteen – the final chapter of the book. In practice, .txt data structures are not common enough to justify relegating a discussion of the foreign package to the supplemental materials and a researcher stuck with a .sav or .dbf file would not leave chapter one with enough knowledge to import their data into R.

Chapters five through thirteen deal primarily with different flavors of regression techniques. These chapters are the truly valuable pieces of this work as each chapter covers one or two approaches in detail. The major analyses covered in this section include bivariate and multivariate regression, GLM and survival models, time-series analyses, repeated measures, classification trees, and factor analysis. As regression techniques are a core component of quantitative methods these chapters will be useful to many researchers across many industries and disciplines. Much of the discussion of graphing comes via diagnostic and exploratory techniques that are related to the analyses in this section.

As the subtitle suggests, examples of code accompany most significant discussions of analyses. Additionally, several full color plates of graphs are included in the appendices, allowing the authors to provide examples of color options.

DAAG is highly recommended for readers who have at least a basic understanding of quantitative analysis and at least some limited experience with R, however, more advanced readers will also find this book useful as a review and reference.