R Helper Functions for Indeed.com Searches

If you need job listing data, Indeed.com is a natural choice. It is one of the most popular job sites on the Internet and has listings from a wide range of industries. Indeed has APIs for things like affiliate widgets, but nothing that lets you directly download a list of job results. Fortunately, the URL structure and site layout are fairly straightforward and lend themselves to easy web scraping.

The following functions wrap rvest capabilities for use on Indeed.com. A write-up of the project that required these, including more detailed examples, will follow at some point. For now, I think the use of these functions is straightforward enough without much documentation. If not, email me or ask questions in the comments.

To source just the helper functions you can use:

source("https://raw.githubusercontent.com/bryancshepherd/IndeedJobSearchFunctions/master/jobSearchFunctions.R")

The full GitHub repo is at https://github.com/bryancshepherd/IndeedJobSearchFunctions.

Get the job results – this may take a couple of minutes

jobResultsList = getJobs("JavaScript", nPages=10, includeSponsored = FALSE, showProgress = FALSE)
head(jobResultsList[["JavaScript"]][["Titles"]], 5)
## [1] "Frontend Software Engineer"                        
## [2] "Web Developer Front End HTML CSS"                  
## [3] "Sr. HTML5/JS Engineer"                             
## [4] "Web & Mobile Software Engineer"                    
## [5] "Software Engineer, JavaScript - Mobile - SNEI - SF"
head(jobResultsList[["JavaScript"]][["Summaries"]], 5)
## [1] "Expert in JavaScript, D3, AngularJS. The name ThousandEyes was born from two big ideas:...."                                                                       
## [2] "CSS, HTML, JavaScript:. Knowledge of JavaScript is handy. Web Developer Front End Programming...."                                                                 
## [3] "Hand coded JavaScript. Javascript, HTML 5, CSS 3, Angular. Xavient Information System is seeking a HTML/Javascript Developer with at least 3 year of expert..."    
## [4] "At least 1 year experience in applying knowledge of Javascript framework. Java, JSP, Servlets, Javascript Frameworks, HTML, Cascading Style Sheets (CSS),..."      
## [5] "Skilled JavaScript, HTML/CSS developer. Software Engineer, JavaScript - Mobile - SNEI - SF. 2+ years of single page web application development experience with..."

Collapse all of the terms into a large list and remove stopwords

cleanedJobData = cleanJobData(jobResultsList)
head(cleanedJobData[["JavaScript"]][["Titles"]], 5)
## [1] "frontend"  "software"  "engineer"  "web"       "developer"
head(cleanedJobData[["JavaScript"]][["Summaries"]], 5)
## [1] "expert"     "javascript" "d3"         "angularjs"  "name"

Create ordered wordlists for titles and descriptions

orderedTables = createWordTables(cleanedJobData)
head(orderedTables[["JavaScript"]][["Titles"]], 5)
##         Var1 Freq
## 27 developer   56
## 34  engineer   35
## 81       web   34
## 33       end   30
## 37     front   29
head(orderedTables[["JavaScript"]][["Summaries"]], 5)
##           Var1 Freq
## 264 javascript  129
## 107        css   45
## 230       html   43
## 538        web   41
## 173 experience   39

Create a flat file from the aggregated data for easier manipulation and plotting

flatFile = createFlatFile(orderedTables)
head(flatFile, 5)
##   searchTerm resultType resultTerms Freq   Percent
## 1 JavaScript     Titles   developer   56 15.642458
## 2 JavaScript     Titles    engineer   35  9.776536
## 3 JavaScript     Titles         web   34  9.497207
## 4 JavaScript     Titles         end   30  8.379888
## 5 JavaScript     Titles       front   29  8.100559

Tracking Reddit Freshness with Python, D3.js, and an RPi

Background

A couple of years ago I purchased some Raspberry Pis to build a compute cluster at home. A cluster of RPis isn't going to win any computing awards, but they're fun little devices, and since they run a modified version of Debian Linux, much of what you learn working with them generalizes to the real world. Also, it's fun to say you have a compute cluster at your house, right?

Unfortunately, the compute cluster code and project notes are lost to the ages, so let us never speak of it again. However, after finishing that project I moved on to another one. That work was almost lost to the ages too, but as I was cleaning up the RPis for use as print and backup servers, I came across the aging, dusty code. Although old, it might be useful to others, so I figured I would write it up quickly and post the code to GitHub. This post describes using the Raspberry Pi, Python, Reddit, D3, and some basic statistics to set up a simple dashboard that displays the freshness of the Reddit front page and r/new.

Dashboarding

D3.js was the exciting new visualization solution at the time, so I decided to use the Pi's Apache-based web stack to serve a small D3-based dashboard.

Getting the data

Raspberry Pis come with several OS choices. Raspbian is a version of Debian tailored to the Raspberry Pi and is usually the best option for general use. It comes preinstalled with Python 2.7 and 3, meaning it's easy to get up and running for general computing. Given that Python is available out of the box, I often end up using the PRAW module to get toy data. PRAW is a wrapper for the Reddit API. It's a well-written package that I've used for several projects because of its ease of use and the depth of the available data. PRAW can be added in the usual way with:

sudo python3 -m pip install praw

(If you're using Python 2, substitute the appropriate interpreter in the command above.)

PRAW is straightforward to use, but in the interest of keeping this short, a full tutorial is beyond the scope of this write-up. You can check out the well-written docs and examples at the link above, and also check my code in the GitHub repo.
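For a flavor of what that looks like, here is a minimal sketch using the current PRAW API (version 4+; the original project predates it, so the repo code will look different). The credential values are placeholders you would fill in from your Reddit app settings:

import praw

# Placeholder credentials - register a script app at reddit.com/prefs/apps
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="freshness-dashboard sketch")

# Ranks and titles from the top of the default front page
for rank, submission in enumerate(reddit.front.hot(limit=10), start=1):
    print(rank, submission.id, submission.title)

# The newest submissions sitewide, for the freshness metric
for submission in reddit.subreddit("all").new(limit=10):
    print(submission.id, submission.created_utc)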

Analyzing the data

The analysis of the data was just to get to something that could be displayed on the dashboard, not to do fancy stats. The Pearson correlation numbers are essentially just placeholders, so don't put much weight on them. However, the r/new analysis is based on the percent of new articles on each API pull and ended up showing some interesting trends – just a reminder that value comes from the appropriateness of your stats, not their complexity.
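To make that metric concrete, here is a toy sketch of the freshness calculation; the IDs are made up, and in the real setup they would come from consecutive scheduled API pulls:

# Percent of submission IDs in the current pull that were not in the previous pull
def percent_new(current_ids, previous_ids):
    if not current_ids:
        return 0.0
    fresh = set(current_ids) - set(previous_ids)
    return 100.0 * len(fresh) / len(current_ids)

previous_pull = ["a1", "b2", "c3", "d4"]
current_pull = ["c3", "d4", "e5", "f6"]
print(percent_new(current_pull, previous_pull))  # 50.0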

Where statistics or data manipulation were required, they were done with SciPy and/or pandas. The Pearson correlation metric is defined as:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

but you knew that. I just wanted to add some LaTeX to make this write-up look better.
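In practice there is no need to hand-roll the formula; SciPy computes it directly. A quick sketch with made-up data (real inputs would be the ranks of the same articles on two consecutive front-page pulls):

from scipy.stats import pearsonr

# Hypothetical ranks of the same eight articles on two consecutive pulls
ranks_previous = [1, 2, 3, 4, 5, 6, 7, 8]
ranks_current = [2, 1, 3, 5, 4, 7, 6, 8]

r, p_value = pearsonr(ranks_previous, ranks_current)
print(round(r, 3))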

Displaying the data

D3 is the main workhorse in the data visualization, primarily using SVG. The code is relatively simple, as far as these things go. There are essentially two key code blocks: one for displaying the percent of new articles in r/new, and the other for displaying the correlation of article ranks on the front page. Below is a snippet that covers the r/new chart:


// Create the date parser, scales, and axes
// (width, height, and margin are defined earlier in the full source; see the repo)
var parseDate = d3.time.format("%Y%m%d%H%M%S").parse;
var x = d3.time.scale()
	.range([0, width]);
var y = d3.scale.linear()
	.range([height, 0]);
var xAxis = d3.svg.axis()
	.scale(x)
	.orient("bottom");
var yAxis = d3.svg.axis()
	.scale(y)
	.orient("left");

// Define the line
var line = d3.svg.line()
	.x(function(d) { return x(d["dt"]); })
	.y(function(d) { return y(+d["pn"]); });

// Skip a few lines ... (check the code in the repo for details)

// Clear any existing SVG elements, then set up the chart container
d3.selectAll('svg').remove();
var svg2 = d3.select("body").append("svg")
	.attr("width", width + margin.left + margin.right)
	.attr("height", height + margin.top + margin.bottom)
		.append("g")
	.attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// Bring in the data
d3.csv("./data/corr_hist.csv", function(data) {

	dataset2 = data.map(function(d) {
		return {
			corr: +d["correlation"],
			dt: parseDate(d["datetime"]),
			pn: +d["percentnew"],
			rmsd : +d["rmsd"]
		};
	});

	// Define the axes and chart text
	x.domain(d3.extent(dataset2, function(d) { return d.dt; }));
	y.domain([0,100]);
	svg2.append("g")
		.attr("class", "x axis")
		.attr("transform", "translate(0," + height + ")")
		.call(xAxis);
	svg2.append("g")
		.attr("class", "y axis")
		.call(yAxis)
	.append("text")
		.attr("transform", "rotate(-90)")
		.attr("y", 6)
		.attr("dy", ".71em")
		.style("text-anchor", "end")
		.text("% new");
	svg2.append("path")
		.datum(dataset2)
		.attr("class", "line")
		.attr("d", line);

	svg2.append("text")
	.attr("x", (width / 2))             
	.attr("y", -8)
	.attr("text-anchor", "middle")  
	.style("font-size", "16px")
	.style("font-weight", "bold")  
	.text("Freshness of articles in r/new");

});

And here’s a screenshot of what you get with that bit of code:

[Screenshot: the "Freshness of articles in r/new" chart produced by the code above]

As you can see, the number of new submissions follows a very strong trend, peaking in the early afternoon (EST) and hitting a low point in the early morning hours. Depending on your perspective, Americans need to spend less time on Reddit at work or Australia needs to pick up the pace.

If you want to know more of the details head on over to the GitHub repo.

Displaying LaTeX in Atom’s Live Markdown Preview

The Atom editor

I've been using the Atom editor for a few months and I am very pleased with it overall. It doesn't beat well-developed, language-specific IDEs such as RStudio or IPython notebooks, but it works very well for general-purpose editing and in cases where there isn't a comprehensive IDE available (e.g., Julia). It has an active development community, and that translates into a lot of extensibility. It has broad support for syntax highlighting and markdown editing, including a very helpful live markdown preview. Both of those features make writing markdown with embedded code very easy.

However, when I recently tried to include some LaTeX in one of my markdown files I hit a roadblock. Atom doesn't support LaTeX in markdown previews out of the box, and the solution took a little longer to find than I think it should have. Ergo, I'm posting this information in hopes that it might help others find the solution faster.

Defining the parameters

To be clear, this isn't about rendering .tex files in Atom. If you're looking for a package to build and display pure .tex files, that's fairly straightforward – start with the latex package.

The goal here is to display LaTeX blocks and inline LaTeX in markdown (i.e., .md file) previews. Being clear on that distinction can save you a good bit of time: getting the latex package set up can take a while, and it's not required at all for markdown files. If you start there, you'll waste an hour or two of your valuable time.

Markdown Preview Plus

The key to getting LaTeX to display in Atom markdown previews is the Markdown Preview Plus package. It is a fork of the core markdown preview package that adds several features, one of which is LaTeX support (via MathJax*).

What this package does can be a little unclear from the description, with some people assuming it provides support for .tex files (it does not). But I assure you, if you’re looking to preview LaTeX in your .md files this is what you want.

The install is simple:

apm install markdown-preview-plus

then disable the markdown preview package that ships with Atom (markdown-preview). Details on the installation process are in the package's documentation, if you need them.

After that, you need to know how to use the package. The docs are pretty clear that ctrl-shift-m opens a preview and ctrl-shift-x turns on math support, but that's only half the battle. If you try to use straight LaTeX from there you'll get no love. The key is in the package's math documentation – namely, you need to wrap your math blocks in $$ and inline math in single $.
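For example, with math support toggled on, a markdown file like the following previews with both forms rendered (reusing the Pearson formula from earlier):

Inline math such as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ renders within the sentence.

Display math gets its own $$-delimited block:

$$
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$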

And you’re done

With that information in hand, you should be good to go, hopefully in less time than it would have taken otherwise.


* The distinction between LaTeX and MathJax may or may not be important for your purposes.