Tracking Reddit Freshness with Python, D3.js, and an RPi

Background

A couple of years ago I purchased some Raspberry Pis to build a compute cluster at home. A cluster of RPis isn’t going to win any computing awards, but they’re fun little devices and since they run a modified version of Debian Linux, much of what you learn working with them is generalizable to the real world. Also, it’s fun to say you have a compute cluster at your house, right?

Unfortunately, the compute cluster code and project notes are lost to the ages, so let us never speak of it again. However, after finishing that project I moved on to another one. That work was almost lost to the ages too, but as I was cleaning up the RPis for use as print and backup servers, I came across the aging, dusty code. Although old, this code might be useful to others, so I figured I would write it up quickly and post the code to GitHub. This post describes using the Raspberry Pi, Python, Reddit, D3, and some basic statistics to set up a simple dashboard that displays the freshness of the Reddit front page and r/new.

Dashboarding

D3.js was the exciting new visualization solution at the time, so I decided to use the Pi's Apache-based web stack to serve a small D3-based dashboard.

Getting the data

Raspberry Pis come with several OS choices. Raspbian is a version of Debian tailored to the Raspberry Pi and is usually the best option for general use. It comes preinstalled with Python 2.7 and 3, meaning it's easy to get up and running for general computing. Given that Python is available out of the box, I often end up using the PRAW module to get toy data. PRAW is a wrapper for the Reddit API. It's a well-written package that I've used for several projects because of its ease of use and the depth of the available data. PRAW can be added in the usual way with:

sudo python3 -m pip install praw

(If you’re using Python != 3 just specify your version in the code above)

PRAW is straightforward to use, but I'm trying to keep this short, so a full walkthrough is beyond the scope of this write-up. You can check out the well-written docs and examples at the link above, and also check my code in the GitHub repo.
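That said, here's the general shape of the pull the dashboard depends on. This is a minimal sketch using the current PRAW API, which requires OAuth credentials (the client_id, client_secret, user_agent, and limit values are placeholders, and the repo itself uses the older PRAW interface):

import praw

# Placeholder credentials; register a "script" app on Reddit to get real ones
reddit = praw.Reddit(client_id='YOUR_ID',
                     client_secret='YOUR_SECRET',
                     user_agent='freshness tracker by u/yourname')

# One pull: the current front page plus the newest submissions site-wide
front_ids = [s.id for s in reddit.front.hot(limit=100)]
new_ids = [s.id for s in reddit.subreddit('all').new(limit=100)]

# Freshness boils down to how many of these IDs weren't seen on the previous pull
print(len(front_ids), len(new_ids))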

Analyzing the data

The analysis of the data was just to get to something that could be displayed on the dashboard, not to do fancy stats. The Pearson correlation numbers are essentially just placeholders, so don’t put much weight in them. However, the r/new analysis is based on the percent of new articles on each API pull and ended up showing some interesting trends – just a reminder that value comes from the appropriateness of your stats, not the complexity of them.

Where statistics or data manipulation were required, they were done with SciPy and/or pandas. The Pearson correlation metric is defined as:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

but you knew that. I just wanted to add some LaTeX to make this write-up look better.
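In practice the dashboard just leans on SciPy for that number. Here's a minimal sketch, with hypothetical lists standing in for an article's rank on two consecutive API pulls:

from scipy.stats import pearsonr

# Hypothetical example: ranks of the same ten articles on two consecutive pulls
ranks_previous = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ranks_current = [2, 1, 3, 5, 4, 6, 8, 7, 10, 9]

r, p = pearsonr(ranks_previous, ranks_current)
print(round(r, 3))  # close to 1.0 when the front page hasn't changed much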

Displaying the data

D3 is the main workhorse in the data visualization, primarily using SVG. The code is relatively simple, as far as these things go. There are essentially two key code blocks, one for displaying the percent of new articles in r/new, the other for displaying a correlation of article ranks on the front page. Below is a snippet that covers the r/new chart:


// Date parser, scales, and axes for the time-series chart
var parseDate = d3.time.format("%Y%m%d%H%M%S").parse;
var x = d3.time.scale()
	.range([0, width]);
var y = d3.scale.linear()
	.range([height, 0]);
var xAxis = d3.svg.axis()
	.scale(x)
	.orient("bottom");
var yAxis = d3.svg.axis()
	.scale(y)
	.orient("left");

// Define the line
var line = d3.svg.line()
	.x(function(d) { return x(d["dt"]); })
	.y(function(d) { return y(+d["pn"]); });

// Skip a few lines ... (check the code in the repo for details)

// Remove any existing SVG elements so the chart can be redrawn cleanly
var s = d3.selectAll('svg');
s = s.remove();

// Create the SVG container, leaving room for the margins
var svg2 = d3.select("body").append("svg")
	.attr("width", width + margin.left + margin.right)
	.attr("height", height + margin.top + margin.bottom)
		.append("g")
	.attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// Bring in the data
d3.csv("./data/corr_hist.csv", function(data) {

	var dataset2 = data.map(function(d) {
		return {
			corr: +d["correlation"],
			dt: parseDate(d["datetime"]),
			pn: +d["percentnew"],
			rmsd : +d["rmsd"]
		};
	});

	// Define the axes and chart text
	x.domain(d3.extent(dataset2, function(d) { return d.dt; }));
	y.domain([0,100]);
	svg2.append("g")
		.attr("class", "x axis")
		.attr("transform", "translate(0," + height + ")")
		.call(xAxis);
	svg2.append("g")
		.attr("class", "y axis")
		.call(yAxis)
	.append("text")
		.attr("transform", "rotate(-90)")
		.attr("y", 6)
		.attr("dy", ".71em")
		.style("text-anchor", "end")
		.text("% new");
	svg2.append("path")
		.datum(dataset2)
		.attr("class", "line")
		.attr("d", line);

	svg2.append("text")
	.attr("x", (width / 2))             
	.attr("y", -8)
	.attr("text-anchor", "middle")  
	.style("font-size", "16px")
	.style("font-weight", "bold")  
	.text("Freshness of articles in r/new");

});
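For completeness, the only contract between the Python collector and this chart is the CSV it reads: one row per API pull, with the column names used in the d3.csv mapping. Here's a minimal sketch of appending a row; the function name and the example values are made up, and the repo computes the real ones:

import csv
import os
from datetime import datetime

def append_pull(path, correlation, percent_new, rmsd):
    # Column names match the d3.csv mapping; the timestamp matches the
    # d3.time.format("%Y%m%d%H%M%S") parser used by the chart
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(['datetime', 'correlation', 'percentnew', 'rmsd'])
        writer.writerow([datetime.now().strftime('%Y%m%d%H%M%S'),
                         correlation, percent_new, rmsd])

# Hypothetical values from a single API pull
append_pull('./data/corr_hist.csv', 0.87, 42.0, 1.3)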

And here’s a screenshot of what you get with that bit of code:

[Screenshot: the "Freshness of articles in r/new" chart from the dashboard]

As you can see, the number of new submissions follows a very strong trend, peaking in the early afternoon (EST) and hitting a low point in the early morning hours. Depending on your perspective, Americans need to spend less time on Reddit at work or Australia needs to pick up the pace.

If you want to know more of the details head on over to the GitHub repo.

Physical Computing – Puppies vs. Kittens on Reddit

“The best way to have a good idea is to have a lot of ideas.” – Linus Pauling

Physical Computing

Although the term is used in a lot of ways, physical computing usually refers to using software to monitor and process analog inputs and then using that data to control mechanical processes in the real world. It's already commonplace in some areas (e.g., autopilots), but it will be all the rage as the Internet of Things grows and automates. It's also often used in interactive art and museum exhibits, like the Soundspace exhibit at the Museum of Life and Science in Durham, NC. In this case we're manipulating the brightness of two LEDs based on the popularity of animals on Reddit, which I'd say is closer to the art end of the spectrum than the autopilot end.

RPis

Raspberry Pis are great for generating ideas. Because they consume very little power and have a very small form factor, they almost beg you to think of a task for them, set them loose on it, and shove them away in a corner to work on it for a while. Because they run a modified version of Debian, it's easy to take advantage of things like the Apache web stack and Python to get things up and running quickly. In fact, in a previous post, I showed an example of using a Pi to fetch data and serve it to a simple D3-based dashboard.

This is another project that takes advantage of the Reddit API, Python, and the RPi's GPIO interface to visually answer the age-old question, "What's more popular on the internet – puppies or kittens?"

Get data via the Reddit API

Reddit has a very easy-to-use API, especially in combination with the Python PRAW module, so I use it for a lot of little projects. On top of the easy interface, Reddit is a very popular website, ergo lots of data. For this project I used Python to access the Reddit API and grab frequency counts for mentions of puppies and kittens. As you can see in the code (GitHub repo), I actually used a few canine and feline related terms, but 'puppies' and 'kittens' are where the interest and 'aww' factor are, so I'm sticking with that for the title.

The PRAW module does all the work getting the comments. After installing the module all that’s required is three lines of code:

import praw
r = praw.Reddit('Term tracker by u/mechanicalreddit') # The string is the user agent; swap in your own username
allComments = r.get_comments('all', limit=750) # The maximum number of comments that can be fetched at a time is 1000

You now have a lazy generator (allComments) that you can work with to pull comment details. After fetching the comments, tokenizing, and a few other details that you can look at in the GitHub repo, we have a space-delimited string of tokens (resultsSet) that we can send to a function that keeps a running sum for each set of terms:

def countAndSumThings(resultsSet, currentCounts):
    # resultsSet: space-delimited string of comment tokens
    # currentCounts: dict mapping each search term to its running count
    resultsSet = resultsSet.lower()
    for thing in currentCounts:
        thingLower = thing.lower()
        # Pad with spaces so only whole-word matches are counted
        searchThing = ' ' + thingLower + ' '
        thingCount = resultsSet.count(searchThing)
        currentCounts[thing] += thingCount
    return currentCounts
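And a quick usage sketch. The term lists here are illustrative (the repo uses a few more canine and feline terms), and I'm skipping the tokenizing step for brevity:

# Running totals for each set of terms (illustrative lists)
dogCounts = {'puppies': 0, 'dogs': 0}
catCounts = {'kittens': 0, 'cats': 0}

# Join the comment bodies into one space-padded string so the whole-word
# matching in countAndSumThings also works at the ends; the repo tokenizes
# and strips punctuation first, which is skipped here
resultsSet = ' ' + ' '.join(comment.body for comment in allComments) + ' '

dogCounts = countAndSumThings(resultsSet, dogCounts)
catCounts = countAndSumThings(resultsSet, catCounts)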

Visualizing the data – i.e., Little Blinky Things

Setting up the RPi GPIO circuitry to control the LEDs is beyond the scope of this post, and it's covered elsewhere better than I could manage here. Here are a couple of resources you may find helpful:

  • http://www.instructables.com/id/Easiest-Raspberry-Pi-GPIO-LED-Project-Ever/
  • http://www.thirdeyevis.com/pi-page-2.php
  • http://raspi.tv/2013/how-to-use-soft-pwm-in-rpi-gpio-pt-2-led-dimming-and-motor-speed-control

On the software side, the RPi.GPIO module provides the basic functionality. The first part of this code initializes the hardware and prepares it for the last two lines, which set the brightness of the LEDs.

# The RPi/Python/LED interface
import RPi.GPIO as GPIO ## Import GPIO library
GPIO.setmode(GPIO.BCM) ## Use Broadcom (BCM) pin numbering

# Initialize LEDs
# Green light
GPIO.setup(22, GPIO.OUT) ## Set up GPIO pin 22 as an output

# Initialize pulse width modulation on GPIO 22. Frequency=100Hz and OFF
pG = GPIO.PWM(22, 100)
pG.start(0)

# Red light
GPIO.setup(25, GPIO.OUT) ## Set up GPIO pin 25 as an output

# Initialize pulse width modulation on GPIO 25. Frequency=100Hz and OFF
pR = GPIO.PWM(25, 100)
pR.start(0)

# Skip some intermediary code...(see repo for the details)

# Update lighting
pG.ChangeDutyCycle(greenIntensity)
pR.ChangeDutyCycle(redIntensity)

Put this in a loop and you’re good to go with continual updating.
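Here's a rough sketch of what that loop can look like. The greenIntensity and redIntensity values are just each side's share of mentions, and dogCounts/catCounts are the running totals from the counting sketch above; anything beyond the names already shown in the snippets is an assumption, not the repo's exact code:

import time

while True:
    # Refresh the counts from a new batch of comments here (see the repo),
    # then express each side's share of mentions as a 0-100 duty cycle
    dogTotal = sum(dogCounts.values())
    catTotal = sum(catCounts.values())
    total = dogTotal + catTotal
    if total > 0:
        greenIntensity = 100.0 * dogTotal / total
        redIntensity = 100.0 * catTotal / total
        pG.ChangeDutyCycle(greenIntensity)
        pR.ChangeDutyCycle(redIntensity)
    time.sleep(60)  # wait a minute before the next update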

And here’s a picture of the final setup. I added a cover to disperse the light a little bit and help with some of the issues with visual perception discussed below:

[Photo: the finished LED setup, with a cover added to diffuse the light]

Puppies or Kittens?

Drum roll please….

If the Internet is 90% cats, Reddit is an anomaly. Dogs were usually the most popular topic of conversation.

Some frustrations with…

…hardware

I didn’t have any breadboard jumper wires to connect the LEDs and I had some difficulty finding them. I had some male-to-male jumpers from an Arduino kit, but connecting an RPi to a breadboard requires female-to-male connectors. I expected Radio Shack to have them, but no luck. Incidentally, an old 40-pin IDE connector will also work with the newer 40-pin RPi GPIOs, but not 80-wire connector. Note that while the headers for these both have 40 sockets and will physically fit onto the RPi board, attempting to use an 80-pin cable will almost certainly break your Pi. If you’re not sure which type of cable you have, just count the ridges in the cable from the wires. If you still have wires to count after you get to 40, you can’t use that cable. Unless, that is, you want to do what I ultimately did and use individual wires from the cable:

[Photo: disassembling an IDE cable into individual wires]

The individual wires of IDE cables are easy to separate and can be stripped with a tight pinch. Because they are single wires, they are easier to manage than if you tried to work with something like speaker wire.

…the visualization method

Brightness, it turns out, is not a great way to indicate magnitude. Between the way our eyes perceive brightness and the way LED output responds to changes in current, the comparison isn't very clear. The relationship between LED brightness and current is not linear, so the raw percentages can't simply be used as intensity levels.
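One common workaround, which this version doesn't do, is to run the raw percentage through a rough perceptual correction before handing it to ChangeDutyCycle, for example a gamma curve. The function below is a hypothetical sketch along those lines, reusing the pG and greenIntensity names from above:

def perceptual_duty_cycle(percent, gamma=2.2):
    # Map a 0-100 share onto a 0-100 PWM duty cycle through a gamma curve
    # so the LED's apparent brightness tracks the value more closely
    return 100.0 * (percent / 100.0) ** gamma

pG.ChangeDutyCycle(perceptual_duty_cycle(greenIntensity))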

…the data

Reddit gets very busy sometimes, and during those times comment posting can slow down considerably, with commenters receiving page errors and a comment backlog developing. Presumably every comment eventually gets posted and is available through the API, but I'm not 100% sure. If comprehensiveness were a concern, this would need to be looked into in more detail.

What’s next

In hindsight it would have been better to make the lights blink faster or slower rather than change the brightness. If there's a version 0.2 of this project, that will be one of the changes.

Currently you have to change the code directly to change the search terms. Although setting up an interactive session or GUI is overkill for this application, it would be pretty easy to have the Reddit bot that does the term searches check its Reddit messages every so often for a list of new search terms. In that case, changing the search terms would be as easy as sending a Reddit message in some predefined format.
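With a current version of PRAW, that check might look roughly like the sketch below. The message subject and the comma-separated term format are assumptions of mine, not something the repo implements:

def check_for_new_terms(reddit, current_counts):
    # Scan unread messages for a comma-separated list of terms and, if one is
    # found, reset the running counts to track those terms instead
    for message in reddit.inbox.unread(limit=None):
        if message.subject.lower() == 'new terms':
            terms = [t.strip() for t in message.body.split(',') if t.strip()]
            if terms:
                current_counts = {term: 0 for term in terms}
        message.mark_read()
    return current_counts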

And finally, maybe you don't care about the relative popularity of pets on Reddit. With the upcoming election, plugging in the candidates' names might be interesting. Now that I have some real breadboard wires, I'll probably set that up the next time I have a free weekend. But go ahead, feel free to clone the repo and do something useful with it.