R Helper Functions for Indeed.com Searches

If you need job listing data, Indeed.com is a natural choice. It is one of the most popular job sites on the Internet and has listings from a wide range of industries. Indeed has APIs for things like affiliate widgets, but nothing that allows one to directly download a list of job results. Fortunately, the URL structure and site layout are fairly straightforward and lend themselves to easy webscraping.
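To give a sense of what that scraping looks like under the hood, here is a minimal, hypothetical rvest sketch of pulling the titles from one page of results. The URL parameters and the CSS selector are assumptions and will break whenever Indeed changes its markup; the helper functions below handle those details for you.

library(rvest)

# Hypothetical example: the search URL parameters and the ".jobtitle"
# selector are assumptions and may not match Indeed's current markup
searchURL <- "https://www.indeed.com/jobs?q=JavaScript&start=0"
resultsPage <- read_html(searchURL)

jobTitles <- resultsPage %>%
  html_nodes(".jobtitle") %>%
  html_text(trim = TRUE)

head(jobTitles)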

The following functions wrap rvest capabilities for use on Indeed.com. A write-up of the project that required these, including more detailed examples, will follow at some point. For now, I think the use of these functions is straightforward enough without much documentation. If not, email me or ask questions in the comments.

To source just the helper functions you can use:

source("https://raw.githubusercontent.com/bryancshepherd/IndeedJobSearchFunctions/master/jobSearchFunctions.R")

The full GitHub repo is at https://github.com/bryancshepherd/IndeedJobSearchFunctions.

Get the job results – this may take a couple of minutes

jobResultsList = getJobs("JavaScript", nPages=10, includeSponsored = FALSE, showProgress = FALSE)
head(jobResultsList[["JavaScript"]][["Titles"]], 5)
## [1] "Frontend Software Engineer"                        
## [2] "Web Developer Front End HTML CSS"                  
## [3] "Sr. HTML5/JS Engineer"                             
## [4] "Web & Mobile Software Engineer"                    
## [5] "Software Engineer, JavaScript - Mobile - SNEI - SF"
head(jobResultsList[["JavaScript"]][["Summaries"]], 5)
## [1] "Expert in JavaScript, D3, AngularJS. The name ThousandEyes was born from two big ideas:...."                                                                       
## [2] "CSS, HTML, JavaScript:. Knowledge of JavaScript is handy. Web Developer Front End Programming...."                                                                 
## [3] "Hand coded JavaScript. Javascript, HTML 5, CSS 3, Angular. Xavient Information System is seeking a HTML/Javascript Developer with at least 3 year of expert..."    
## [4] "At least 1 year experience in applying knowledge of Javascript framework. Java, JSP, Servlets, Javascript Frameworks, HTML, Cascading Style Sheets (CSS),..."      
## [5] "Skilled JavaScript, HTML/CSS developer. Software Engineer, JavaScript - Mobile - SNEI - SF. 2+ years of single page web application development experience with..."

Collapse all of the terms into a large list and remove stopwords

cleanedJobData = cleanJobData(jobResultsList)
head(cleanedJobData[["JavaScript"]][["Titles"]], 5)
## [1] "frontend"  "software"  "engineer"  "web"       "developer"
head(cleanedJobData[["JavaScript"]][["Summaries"]], 5)
## [1] "expert"     "javascript" "d3"         "angularjs"  "name"

Create ordered wordlists for titles and descriptions

orderedTables = createWordTables(cleanedJobData)
head(orderedTables[["JavaScript"]][["Titles"]], 5)
##         Var1 Freq
## 27 developer   56
## 34  engineer   35
## 81       web   34
## 33       end   30
## 37     front   29
head(orderedTables[["JavaScript"]][["Summaries"]], 5)
##           Var1 Freq
## 264 javascript  129
## 107        css   45
## 230       html   43
## 538        web   41
## 173 experience   39

Create a flat file from the aggregated data for easier manipulation and plotting

flatFile = createFlatFile(orderedTables)
head(flatFile, 5)
##   searchTerm resultType resultTerms Freq   Percent
## 1 JavaScript     Titles   developer   56 15.642458
## 2 JavaScript     Titles    engineer   35  9.776536
## 3 JavaScript     Titles         web   34  9.497207
## 4 JavaScript     Titles         end   30  8.379888
## 5 JavaScript     Titles       front   29  8.100559

Playing with Gradient Descent in R

Gradient Descent is a workhorse in the machine learning world. As proof of its importance, it is one of the first algorithms that Andrew Ng discusses in his canonical Coursera Machine Learning course. There are many flavors and adaptations, but starting simple is usually a good thing. In this example, it is used to minimize the cost function (the sum of squared errors or SSE) for obtaining parameter estimates for a linear model. I.e.:

\text{minimize } J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

Which, for a linear model, yields the gradient descent update rules:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}\right)

Where \theta_0 is our intercept and \theta_1 is the parameter estimate of our only predictor variable.

Ng’s course is Octave-based, but manually implementing the algorithm in an R script is a fun, simple exercise, and if you’re primarily an R user it might help you understand the algorithm better than the Octave examples. The full code is in this repository, but here is the walkthrough (a minimal sketch follows the list):

  • Create some linearly related data with known relationships
  • Write a function that takes the data and starting (or current) estimates as inputs
  • Calculate the cost based on the current estimates
  • Adjust the estimates in the direction of the negative gradient, scaled by the learning rate \alpha
  • Recursively run the function, providing the new parameter estimates each time
  • Stop when the estimate converges (i.e., meets the stopping criteria based on the change in the estimates)
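Here is a minimal sketch of those steps. This is not the repository code; the simulated data, learning rate, and tolerance are arbitrary choices for illustration.

# Step 1: create some linearly related data with a known relationship
set.seed(42)
x <- runif(500, -5, 5)
y <- 2 + 3 * x + rnorm(500)

# Steps 2-5: a function that takes the data and current estimates,
# adjusts them using the gradient, and calls itself with the new estimates
gradientDescent <- function(x, y, theta0, theta1, alpha = 0.05, tol = 1e-6) {
  m <- length(y)
  h <- theta0 + theta1 * x
  newTheta0 <- theta0 - alpha * (1 / m) * sum(h - y)
  newTheta1 <- theta1 - alpha * (1 / m) * sum((h - y) * x)

  # Step 6: stop when the change in the estimates falls below the tolerance
  if (max(abs(c(newTheta0 - theta0, newTheta1 - theta1))) < tol) {
    return(c(intercept = newTheta0, slope = newTheta1))
  }
  gradientDescent(x, y, newTheta0, newTheta1, alpha, tol)
}

gradientDescent(x, y, theta0 = 0, theta1 = 0)   # should approach c(2, 3)
coef(lm(y ~ x))                                 # sanity check against lm()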

This code is for a simple single-variable model. Adding more variables means calculating the partial derivative with respect to each parameter – in other words, adding a version of the \theta_1 update for each feature in the model. I.e.,

\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\right)
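A vectorized version of that general update might look like the following sketch (again, not the repository code; X is assumed to be a design matrix with a leading column of 1s):

# One gradient descent step for any number of features;
# X is an m x (p + 1) design matrix whose first column is all 1s,
# theta is the current parameter vector of length p + 1
gradientStep <- function(X, y, theta, alpha) {
  m <- nrow(X)
  theta - alpha * (1 / m) * as.vector(t(X) %*% (X %*% theta - y))
}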

I sometimes use Gradient Descent as a ‘Hello World’ program when I’m playing with statistical packages. It helps you get a feel for the language and its capabilities.

Calling Python from R with rPython

Python has generated a good bit of buzz over the past year as an alternative to R. Personal biases aside, an expert makes the best use of the available tools, and sometimes Python is better suited to a task. As a case in point, I recently wanted to pull data via the Reddit API. There isn’t an R package that provides easy access to the Reddit API, but there is a very well designed and documented Python module called PRAW (or, the Python Reddit API Wrapper). Using this module I was able to develop a Python-based solution to get and analyze the data I needed without too much trouble.

However, I prefer working in R, so I was glad to discover the rPython package, which enables calling Python scripts from R. After finding rPython, I was able to rewrite my purely Python script as a primarily R-based program.

If you want to use rPython there are a couple of prerequisites you’ll need to address if you haven’t already. No surprise, you’ll need to have Python installed. After that, you’ll need to install the PRAW module via pip install praw. Finally, install the rPython package from CRAN. (But see the note below first if you’re on Windows.)
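For reference, the R side of that setup is a single line; the PRAW install happens outside of R:

# Install rPython from CRAN (PRAW is installed separately with `pip install praw`)
install.packages("rPython")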

After you’ve completed those steps, it’s as easy as writing your Python script and adding a line or two to your R code.

First create a Python script that imports the praw module and does the first data call:

import praw

# Set the user agent information
# IMPORTANT: Change this if you borrow this code. Reddit has very strong 
# guidelines about how to report user agent information 
r = praw.Reddit('Check New Articles script based on code by ProgrammingR.com')
   
# Create a (lazy) generator that will get the data when we call it below
new_subs = r.get_new(limit=100)

# Get the data and put it into a usable format
new_subs=[str(x) for x in new_subs]

Since the Python session is persistent, we can also create a shorter Python script to fetch updated data without re-importing the praw module:

# Create a (lazy) generator that will get the data when we call it below
new_subs = r.get_new(limit=100)

# Get the data and create a list of strings
new_subs=[str(x) for x in new_subs]

Finally, some R code that calls the Python script and gets the data from the Python variables we create:

library(rPython)

# Load/run the main Python script
python.load('GetNewRedditSubmissions.py')

# Get the variable
new_subs_data <- python.get('new_subs')

# Load/run the re-fetch script
python.load('RefreshNewSubs.py')

# Get the updated variable
new_subs_data <- python.get('new_subs')

head(new_subs_data)

A few final notes:

  • The main drawback to the rPython package is that it currently doesn’t run on Windows. The developer (Carlos J. Gil Bellosta) is working to fix this, though. If that wrinkle gets resolved, I can see this being a very popular package.
  • You can use RStudio to write your Python programs, which is easier than switching to another IDE for simple scripts. However, it causes an issue with EOL characters. Namely, you need to add a blank line at the end of each .py file to get it to load properly.
  • The Python session rPython initiates is associated with the R session. Any Python modules you load or variables you create will be available until you remove them or close the R session (a small example follows).
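As a small illustration of that persistence (the variable name here is hypothetical):

library(rPython)

# 'counter' lives in the embedded Python session between calls
python.exec("counter = 0")
python.exec("counter = counter + 1")
python.get("counter")   # returns 1 until you change it or end the R session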

Creating your personal, portable R code library with GitHub

Note: I sold ProgrammingR.com and its online assets in 2015. I’m posting this for posterity, but no longer control access to the GitHub account.

As I discussed in a previous post, I have a few helper functions I’ve created that I commonly use in my work. Until recently, I manually included these functions at the start of my R scripts by either the tried-and-true copy-and-paste method, or by extracting them from a local file with the source() function. The former approach has the benefit of keeping the helper code inextricably attached to the main script, but it adds a good bit of code to wade through. The latter approach keeps the code cleaner, but requires that whoever is running the code always has access to the sourced file and that it is always in the same relative path – and that makes sharing or moving code more difficult. The start of a recent project requiring me to share my helper function library prompted me to find a better solution.

The resulting approach takes advantage of GitHub Gists and R’s ability to source via a web-based location to enable you to create a personal, portable library of R functions for private use or to share.

The process is very straightforward. If you don’t have a GitHub account, getting one is the first step. After doing so, click on “Gist” in the menu bar to create a new Gist. (There are a couple of reasons for using a Gist instead of a formal repository. For one, repositories cannot be made private unless you have a paid GitHub membership. Gists, however, can be created as “Secret” and are therefore available only to people who have the exact URL.) Name your Gist, add your code, and save it as either a secret or public Gist. After saving, you will be taken to a screen displaying your new Gist.

Now that you’ve created your Gist, you just need to source it in your R code. That step is easily done via the source() function. Rather than including a file path as you would with a local file, you simply include the URL to your Gist. Note that you can’t just copy the URL of the GitHub page you’re currently on. You need to source the URL of the raw code. To get that URL, click on the <> button on the top right of your Gist (circled in red below). This will take you to the raw code for the current revision. To get the path that always points to the most recent revision, remove everything in the URL after “/raw/”.

[Screenshot: a GitHub Gist page, with the <> raw-code button at the top right]

Regardless of which version you link to, you’ll still be left with a long URL that includes a randomly generated alphanumeric string that you don’t want to have to try to remember. Here’s a pro tip: use a URL shortener that lets you define a custom URL (e.g., tinyurl) to create a shortened version that is easy to remember. Something like the little library at http://tinyurl.com/ProgRLib (YES! You can use and add to this library!).
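Putting it together, sourcing your library looks something like this (the Gist URL below is a made-up placeholder; substitute your own raw URL or shortened link):

# Source directly from the raw Gist URL (placeholder shown here)
source("https://gist.githubusercontent.com/yourusername/abc123/raw/helperFunctions.R")

# Or via an easy-to-remember shortened URL
source("http://tinyurl.com/ProgRLib")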

Now you’ve got a portable, easy to reference, personal library of functions that you can easily include in your code or share with others.

Global Indicator Analyses with R

I was recently asked by a client to create a large number of “proof of concept” visualizations that illustrated the power of R for compiling and analyzing disparate datasets. The client was specifically interested in automated analyses of global data. A little research led me to the WDI package.

The WDI package is a tool to “search, extract and format data from the World Bank’s World Development Indicators” (WDI help). In essence, it is an R-based wrapper for the World Bank Economic Indicators Data API. When used in combination with the information on the World Bank data portal it provides easy access to thousands of global datapoints.

Here is an example use case that illustrates how simple and easy it is to use, especially with a little help from the countrycode and ggplot2 packages:

library(WDI)
library(ggplot2)
library(countrycode)

# Use the WDIsearch function to get a list of fertility rate indicators
indicatorMetaData <- WDIsearch("Fertility rate", field="name", short=FALSE)

# Define a list of countries for which to pull data
countries <- c("United States", "Britain", "Sweden", "Germany")

# Convert the country names to iso2c format used in the World Bank data
iso2cNames <- countrycode(countries, "country.name", "iso2c")

# Pull data for each countries for the first two fertility rate indicators, for the years 2001 to 2011
wdiData <- WDI(iso2cNames, indicatorMetaData[1:2,1], start=2001, end=2011)

# Pull out indicator names
indicatorNames <- indicatorMetaData[1:2, 1]

# Create trend charts for the first two indicators
for (indicatorName in indicatorNames) { 
  pl <- ggplot(wdiData, aes(x=year, y=wdiData[,indicatorName], group=country, color=country))+
    geom_line(size=1)+
    scale_x_continuous(name="Year", breaks=c(unique(wdiData[,"year"])))+
    scale_y_continuous(name=indicatorName)+
    scale_linetype_discrete(name="Country")+
    theme(legend.title=element_blank())+
    ggtitle(paste(indicatorMetaData[indicatorMetaData[,1]==indicatorName, "name"], "\n"))
  ggsave(paste(indicatorName, ".jpg", sep=""), pl)
}

[Plots: WDI package visualizations – fertility rate trend charts for the selected countries, one per indicator]

This code can be adapted to quickly pull and visualize many pieces of data. Even if you don’t have an analytic need for the WDI data, the ease of access and depth of information available via the WDI package make them perfect for creating toy examples for classes, presentations or blogs, or conveying the power and depth of available R packages.

SPARQL with R in less than 5 minutes

In this article we’ll get up and running on the Semantic Web in less than 5 minutes using SPARQL with R. We’ll begin with a brief introduction to the Semantic Web then cover some simple steps for downloading and analyzing government data via a SPARQL query with the SPARQL R package.

What is the Semantic Web?

To newcomers, the Semantic Web can sound mysterious and ominous. By most accounts, it’s the wave of the future, but it’s hard to pin down exactly what it is. This is in part because the Semantic Web has been evolving for some time but is just now beginning to take a recognizable shape (DuCharme 2011). Detailed definitions of the Semantic Web abound, but simply put, it is an attempt to structure the unstructured data on the Web and to formalize the standards that make that structure possible. In other words, it’s an attempt to create a data definition for the Web.

The primary method for accessing data made available on the Semantic Web is via SPARQL queries to a data provider’s endpoint. Endpoints are portals to data that a provider has made available for querying. This data is often published in RDF format. If you’re familiar with using SQL to query a database, then the idea of using a query language (SPARQL) to access a structured database or data repository (the endpoint) should be common ground. The practice is also similar to webscraping an XML document with a known structure.

The Semantic Web and R

A primary goal of the Semantic Web movement is to enable machines and applications to interface using standardized data definitions. As the Semantic Web grows, it will be ripe for mining with statistical programs that already lend themselves to automation – such as R. To that end, let’s dive into a simple query that will illustrate 75% of what you’ll be doing on the Semantic Web.

Accessing Data.gov datasets with SPARQL

We’ll use data at the Data.gov endpoint for this example. Data.gov has a wide array of public data available, making this example generalizable to many other datasets. One of the key challenges of querying a Semantic Web resource is knowing what data is accessible. Sometimes the best way to find this out is to run a simple query with no filters that returns only a few results or to directly view the RDF. Fortunately, information on the data available via Data.gov has been cataloged on a wiki hosted by Rensselaer. We’ll use Dataset 1187 for this example. It’s simple and has interesting data – the total number of wildfires and acres burned per year, 1960-2008.

R code

Data.gov provides a Virtuoso SPARQL endpoint, which we could use to submit manual queries. But we want to automate this process, so we’ll use Willem Robert van Hage and Tomi Kauppinen’s SPARQL package to access the endpoint.

library(SPARQL) # SPARQL querying package
library(ggplot2)

# Step 1 - Set up preliminaries and define query
# Define the data.gov endpoint
endpoint <- "http://services.data.gov/sparql"

# create query statement
query <-
"PREFIX  dgp1187: <http://data-gov.tw.rpi.edu/vocab/p/1187/>
SELECT ?ye ?fi ?ac
WHERE {
?s dgp1187:year ?ye .
?s dgp1187:fires ?fi .
?s dgp1187:acres ?ac .
}"

# Step 2 - Use SPARQL package to submit query and save results to a data frame
qd <- SPARQL(endpoint,query)
df <- qd$results

# Step 3 - Prep for graphing

# Numbers are usually returned as characters, so convert to numeric and create a
# variable for "average acres burned per fire"
str(df)
df <- as.data.frame(apply(df, 2, as.numeric))
str(df)

df$avgperfire <- df$ac/df$fi

# Step 4 - Plot some data
ggplot(df, aes(x=ye, y=avgperfire, group=1)) +
geom_point() +
stat_smooth() +
scale_x_continuous(breaks=seq(1960, 2008, 5)) +
xlab("Year") +
ylab("Average acres burned per fire")

ggplot(df, aes(x=ye, y=fi, group=1)) +
geom_point() +
stat_smooth() +
scale_x_continuous(breaks=seq(1960, 2008, 5)) +
xlab("Year") +
ylab("Number of fires")

ggplot(df, aes(x=ye, y=ac, group=1)) +
geom_point() +
stat_smooth() +
scale_x_continuous(breaks=seq(1960, 2008, 5)) +
xlab("Year") +
ylab("Acres burned")

# In less than 5 mins we have written code to download just
# the data we need and have an interesting result to explore!

If you’re familiar with R, the only foreign code is that in Steps 1 and 2. There we define the location of the endpoint and write the SPARQL query. If you have experience with SQL, you’ll notice that SPARQL syntax is very similar. The most notable exception is that Semantic Web data is structured around chunks of data that contain three pieces of information. In Semantic Web terminology these three pieces of information are referred to as subjects, predicates and objects, and together they are called “triples”. A SPARQL query returns pieces of these chunks of data; exactly which pieces are returned depends on the query. The PREFIX line establishes that everywhere we write “dgp1187:” we mean “<http://data-gov.tw.rpi.edu/vocab/p/1187/>”. This saves us from having to retype long URIs every time we need to reference them. URIs are used extensively on the Semantic Web, so this is a very helpful feature.

In English, our query says, “Give me the values for the attributes “fires”, “acres” and “year” wherever they are defined, and assign them to variables named “fi”, “ac” and “ye” respectively. Also fetch the location of the value in the data as a variable called ?s and merge the other data values by that variable.”

The last bit of the query is the closest we get to SPARQL magic. Because the variable ?s is present in each clause of the query, our fire, acres burned, and year data will be merged by that variable. This is somewhat analogous to saying, “Once you’ve found a row of data that contains information on the number of fires, also return the acres burned and year data from that row and list them on the same row in the new dataset.” (If the logic behind the query isn’t yet clear, or you’re ready to see how you can make this query more advanced or parsimonious, the text and web pages in the “References” and “Additional Information” sections at the end of this article are good resources for learning SPARQL.)
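As one small example of a more parsimonious version, SPARQL lets you use semicolons to avoid repeating the subject ?s. The following sketch should be equivalent to the query above:

# Equivalent query written with semicolons so ?s appears only once
query2 <-
"PREFIX dgp1187: <http://data-gov.tw.rpi.edu/vocab/p/1187/>
SELECT ?ye ?fi ?ac
WHERE {
?s dgp1187:year ?ye ;
   dgp1187:fires ?fi ;
   dgp1187:acres ?ac .
}"

qd2 <- SPARQL(endpoint, query2)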

Once defined, we pass this query and endpoint information to the SPARQL function, which returns an object with two values – a data frame of the new data and a list of namespaces. At this point, we’re concerned with the data frame, so we pull it out in the last line of Step 2.

[Plot: Data.gov SPARQL example]

From that point on we treat the data just like any other dataset. Running some quick graphs, we see that although the number of fires per year has decreased, the number of acres burned per fire and the number of acres burned per year increased considerably between 1975 and 2008. (Despite being born in 1975, I disavow any responsibility for this trend.)

Summary

In a few short minutes, using R and SPARQL, we wrote code to pull and do initial analyses on an interesting government dataset. Hopefully, the power of R for mining the Semantic Web is evident from this simple example. As more data becomes available in RDF format, automated solutions for mining and analyzing the Semantic Web will become more and more useful.

Reference:

DuCharme, Bob (2011). Learning SPARQL. O’Reilly Media.

Additional Information:

Basic tutorial on querying Data.gov

More detailed tutorial on querying Data.gov

R for Dummies – De Vries and Meys (2012)

The for Dummies series has been around since 1991. (A bit of trivia, DOS for Dummies was the first title.) I’ve owned a few books in the series and have been adequately impressed with most of them, but when I learned there was an R for Dummies I was immediately skeptical. Possibly I was skeptical because R has a steep learning curve and many idiosyncrasies, so the idea of an R for Dummies text seemed oxymoronic – it’s difficult to imagine a (successfully) dumbed-down version of an introductory R text. But if you’re familiar with the for Dummies series, you already know that the moniker is just for marketing. In reality, these books usually do a good job of distilling a topic down to the important components a new user needs to know. This edition is no exception.

Title: R for Dummies
Author(s): Andrie de Vries and Joris Meys
Publisher/Date: Wiley and Sons/2012
Statistics level: Not Applicable
Programming level: Beginner
Overall recommendation: Highly Recommended

The core topic areas that R for Dummies covers should come as no surprise: A basic overview of R and its capabilities, importing data into R, writing and debugging functions, summarizing data and graphing. In addition, there are sections covering potentially frustrating tasks for beginners such as working with dates and multidimensional arrays.

I am a big fan of periodically reviewing the fundamentals. I pick up introductory texts every once in a while to make sure I haven’t forgotten anything important, and R for Dummies is a nice addition to my library for that reason.

One thing that sets this book apart from other currently available introductory R texts is that it covers a couple of recent and important developments in R coding – namely the RStudio development environment and ggplot2 graphics.

If you’re not familiar with the for Dummies series, it is important to note that they are written in a specific informal style, which is in stark contrast to most R texts. (For reference, the style is more similar to The R Inferno than the ggplot2 manual.) You can get a sense of this style by browsing a few pages on Amazon to see if you find it helpful or distracting. On the other hand, R for Dummies has a more polished feel than many R texts I’ve read. I didn’t encounter any of the frustrating and distracting editing errors that are common in some R texts.

R for Dummies is primarily focused on R as a programming language, so for the most part, statistical analyses are presented only as a means of illustrating programming techniques. Given its focus on programming and fundamentals, this book is highly recommended for someone with little to no experience in R who wants to learn R programming. Intermediate to advanced R programmers like myself who want a current review of the fundamentals might also find it useful.

I might also recommend R for Dummies to experienced users of other programming languages who are new to R. The discussion of basic programming concepts, such as control flow, is minimal and focused primarily on details specific to R. It is not recommended for those looking to learn statistics in conjunction with R.

The current price of $20 USD puts it in the middle price range for texts of its kind. It is available as a paperback or Kindle text.

R Helper Functions

If you do a lot of R programming, you probably have a list of R helper functions set aside in a script that you include on R startup or at the top of your code. In some cases helper functions add capabilities that aren’t otherwise available. In other cases, they replicate functionality that is available elsewhere without loading unnecessary components. Below I present two of my most frequently used data manipulation helper functions as examples.

### Descriptives R Helper Function
# Display some basic descriptives
descs <- function (x) {
  if(!hidetables) {
    if(length(unique(x))>30) {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Number of categories is greater than 30, table not produced')
    } else {
      print('Summary results:')
      print(summary(x))
      print('')
      print('Table results:')
      print(table(x, useNA='always'))
    }
   
  } else {
    print('Tables are hidden')
  }
}

# Set hide tables to true to hide tables
hidetables <- FALSE


### Dummy Variable R Helper Function
# Create dummy variables for each level of a categorical variable
createDummies <- function(x, df, keepNAs = TRUE) {
  for (i in seq(1, length(unique(df[, x])))) {
    if(keepNAs) {
      df[, paste(x,'.', i, sep = '')] <- ifelse(df[, x] != i, 0, 1)
    } else {
      df[, paste(x,'.', i, sep = '')] <- ifelse(df[, x] != i | is.na(df[, x]) , 0, 1)     
    }
  }
  df
}

The Descriptives R Helper Function produces a summary or table of the passed variable/object; it uses the number of unique values to determine whether to call just the summary() or summary() and table() functions. It also includes NAs by default in the tables (one of table()'s biggest annoyances). Once the exploratory and data manipulation work is done, all output from this function can be suppressed by setting the hidetables object to TRUE.

The Dummy Variable R Helper Function creates indicator variables from all values of a variable. Based on experience, I avoid the factor object as much as possible and this approach allows me to quickly create indicators that can be used in any way I want.
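For illustration, here is a quick usage sketch of both helpers on a made-up data frame:

# Hypothetical usage of both helper functions
exampleDf <- data.frame(grp = c(1, 2, 2, 3, 1))

hidetables <- FALSE
descs(exampleDf$grp)                 # prints summary() and table() output

exampleDf <- createDummies("grp", exampleDf)
head(exampleDf)                      # adds grp.1, grp.2 and grp.3 indicator columns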

If you don’t have a programming background or are just beginning with R, you might not have had time to realize the benefit of helper functions or identify the tasks you do repetitively, but it’s worthwhile to give the issue some consideration. Helper functions can be exceptionally useful for saving time on repetitive tasks or facilitating your work. They’re so useful in fact, that there is a special ProgrammingR event planned around helper functions coming soon. For the more experienced R programmers out there, make a mental note of the most useful helper functions you’ve written in the past. That list will come in handy in the near future!

Progress bars in R using winProgressBar

Using progress bars in R scripts can provide valuable timing feedback during development and additional polish to final products. winProgressBar and setWinProgressBar are the primary functions for creating progress bars in R.

Progress bars, and progress indicators in general, are relatively uncommon in R programming. This makes sense, as they can add bloat and, being design elements, they generally fall into the classification of “nice but not necessary”. However, during development, especially when using loops, progress bars can be a cleaner way of tracking loop progress than, for example, printing iteration numbers. And for programmers who prepare scripts or packages for non-programmers, they add feedback that users have come to expect from other software.

To add progress bars in R scripts use the winProgressBar and setWinProgressBar functions. For non-Windows users there’s also a very similar Tcl/Tk version (tkProgressBar and setTkProgressBar); depending on your current setup, you may need to install the tcltk package to use it.

Setting up a progress indicator to track the progress of a loop is very straightforward. First, initialize the display:

pb <- winProgressBar(title="Example progress bar", label="0% done", min=0, max=100, initial=0)

Use the title and label options to set the style of the display. The min and max options should use whatever values are most applicable to your task, but in most cases this will be displaying a percentage so a range of 0 to 100, starting at 0, makes sense.

After initializing the progress indicator, add your loop code and update the progress bar by calling the setWinProgressBar function at the end of each loop:

for(i in 1:100) {
  Sys.sleep(0.1) # slow down the code for illustration purposes
  info <- sprintf("%d%% done", round((i/100)*100))
  setWinProgressBar(pb, i/(100)*100, label=info)
}

# Once the loop is exited, close the progress bar window:

close(pb)

Here is a snapshot of the progress indicator in action:

[Screenshot: the Windows progress bar in action]

To track the progress of an entire script or program, you just need to place the initializing function (winProgressBar) at the beginning of the code and do updates via setWinProgressBar at key points within the flow of your program. For example, if the first task your code performs usually takes about 10% of the total time required, set the value to 10% after that task completes.
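A hypothetical script-level sketch of that pattern, with milestone percentages chosen arbitrarily:

pb <- winProgressBar(title="Analysis progress", label="0% done", min=0, max=100, initial=0)

# ... data import, roughly 10% of total runtime ...
setWinProgressBar(pb, 10, label="10% done - data imported")

# ... model fitting, roughly 70% of total runtime ...
setWinProgressBar(pb, 80, label="80% done - models fit")

# ... reporting ...
setWinProgressBar(pb, 100, label="100% done")
close(pb)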

If a popup progress bar doesn’t work for your task, there is a minimalist option, txtProgressBar, that by default draws a line in the console. It also allows for some fun customizations:

[Screenshot: a customized txtProgressBar in the console]
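For example, here is a minimal txtProgressBar version of the earlier loop (style 3 draws a console line with a percentage):

pb <- txtProgressBar(min=0, max=100, style=3)
for(i in 1:100) {
  Sys.sleep(0.1) # slow down the code for illustration purposes
  setTxtProgressBar(pb, i)
}
close(pb)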

The Art of R Programming – Matloff (2011)

It’s difficult to write a book on an entire programming language and keep it manageable and concise, but The Art of R Programming does it as well as any text I’ve seen. Matloff covers, in detail and among other things, R data structures, programming idioms, performance enhancements, interfaces with other languages, debugging and graphing.

Title: The Art of R Programming
Author(s): Norman Matloff
Publisher/Date: No Starch Press/2011
Statistics level: Very Low
Programming level: Intermediate
Overall recommendation: Highly Recommended

There is the requisite “Introduction to R” section that is present in almost all R texts, but any beginners who start with this chapter may benefit from re-reading ARP after some additional practical experience with R. The issues that Matloff addresses and the solutions he provides are more salient after you’ve spent hours trying to resolve them.

The section on graphing is a good overview, but the average programmer may find it less useful than the other sections. Anyone looking for graphic optimization tips will be better served by a book focused specifically on graphing.

With that minor critique in mind, put simply, The Art of R Programming is a must read for all intermediate level R programmers. It covers nearly every method of performance enhancement available and provides a review of key fundamentals that may have been forgotten or missed.

One point of note: this text focuses almost solely on programming – the statistical examples are a means to an end, not an end in themselves. For that reason, this book is recommended for those seeking to improve the efficiency of their programming rather than their statistical acumen.

At ~$25 USD from Amazon, The Art of R Programming is one of the best R text values available. I highly recommend it for almost all R users. (You can also purchase this book directly from the publisher and get both the print and e-book versions for ~$40.)