When is the best time to tweet?

Twitter has made a few recent changes that make it hard to follow the chronology of tweets in my timeline. I’ve mostly accepted that, but as a result I don’t feel like I have a sense of when my followers are most active anymore. (Of course, there are a few people whose active times would be classified as ‘always’.) Anyway, these changes make me feel a little disconnected from my roughly 175 current tweeps and that makes me sad.

But beyond the emotional pain of no longer being able to really, I mean really, connect with one’s twitter followers on an existential level, some people care about all those Twitter status markers like retweet, reply, and like counts. For them, knowing when your followers are most active can help you improve those numbers and determine the best time to tweet.

Regardless of where you fall on that completely contrived spectrum, this is something that’s analyzable and would be cool to know. And for those that care, it can give some insight on the best time to engage with your Twitter followers.

So here’s a snippet of what I found using data from my Twitter followers’ activity over the last 28 days. Day of the week is along the y-axis, hour of the day is along the x-axis.

Number of unique followers posting or retweeting during a given hour
Chart of twitter behavior by day of week and hour

Number of retweets from followers during a given hour
Chart of retweet behavior by day of week and hour

You can see mid-afternoon is the most active time for my followers. Interestingly, as you get closer to Friday, the mid-afternoon activity increases in intensity and happens earlier.

I’m not going to bother running through code for this one in this write up. I’m thinking about throwing this up as a simple webservice for others to use; if so, I’ll do a detailed write up then.

Resources for Learning Basic Python Programming

As part of the Python for Data Science video series I wanted to provide some basic Python programming resources for those who may be new to Python. The list of links below is designed to get new Python programmers off to a quick start, and it focuses on the things that are most relevant to data analysis. (For example, there’s nothing in here about writing custom classes.)

If you have suggestions for other links, feel free to mention them in the comments below.

General overviews:

Installing packages:

Importing packages:

Data types:

Control flow:

Defining functions:
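To give a quick taste of what the topics above cover before you dive into the links, here’s a throwaway script that touches importing a package, a couple of data types, control flow, and a function definition (the example itself is just filler):

import math                       # importing a package

radii = [1.0, 2.5, 4.0]           # a list of floats -- two core data types

def circle_area(r):               # defining a function
    return math.pi * r ** 2

for r in radii:                   # control flow: a for loop...
    if r > 2:                     # ...and a conditional
        print("large circle, area =", round(circle_area(r), 2))
    else:
        print("small circle, area =", round(circle_area(r), 2))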

 

Left-Handed Chord Charts for Guitar and Mandolin

Left-handed chord charts are hard to come by, especially charts that are good for displaying on an iPad or printing and keeping in a songbook. Given that frustration, you can imagine how glad I was to come across some pretty comprehensive left-handed chord charts for guitar, mandolin, and ukulele on a site called Matt’s Music Monday.

Matt has done a really great thing for all us lefties out there. This format is perfect for quick reference as you’re trying to learn a song. I’ve posted a snippet of one of his charts below so you can see why you should go check them out. As you can see, this is good stuff. So if you’re looking for left-handed chord charts for guitar, mandolin, or ukulele, head over to his site to get the original files; they’re more complete and that way he gets the site traffic love.

Left-handed chord chart from Matt's Music Monday
Snippet of a left-handed chord chart from Matt’s Music Monday (http://mattsmusicmonday.tumblr.com)

Making GitHub Art

The contribution heatmaps on GitHub profiles are interesting. Although they are intended to be passive data visualizations, they don’t have to be. Specifically, they can act as a 7×N-pixel (very slowly) scrolling display. After realizing this, I decided I had to do something to shape the blank canvas that is my GitHub commit log.

“An artist is somebody who produces things that people don’t need to have.”
― Andy Warhol

The plan

Ostensibly, it should be pretty straightforward. The color of each cell of the heatmap is based on the number of commits made that day, so one just needs to automate the appropriate number of commits per day to get the desired shading. For simplicity, I decided to start by using the darkest shade possible to build some text.

The execution

And to be honest, it pretty much was that simple. The most difficult part was finding a Python library to automate the git commits. Many StackOverflow discussions essentially suggested rolling your own functions because it is relatively simple and flexible. Had I been building something I cared about more, that might have been the way to go, but I was determined not to spend more than a few minutes on this project and I didn’t need a lot of flexibility. I really wanted to find something off-the-shelf with good documentation.

Connecting to GitHub

I tried a few valiant entries into the Python/GitHub API space, but what some lacked in functionality the others lacked in documentation. Finally, I tried github3.py and found the right mix. Without too much trouble, I was able to automate connecting to GitHub and making commits. After a little research it looked like ~40 commits per day would be enough to keep the color scaling the way I wanted it.

There is a link to the GitHub repo at the end of this post. These are the main functions for connecting and committing to GitHub:

import csv
import time

from github3 import login

# Login helper
# Comma separated credentials are stored
# in the first row of auth.csv.
def github_login():
    with open('auth/auth.csv', newline='') as f:
        text = csv.reader(f)

        for row in text:
            user_name, password = row

    session = login(user_name, password)

    return(session)

# The function that submits the commits.
# The number of commits should be set to
# something quite a bit higher than your
# normal number of daily commits. Changing num_of_commits
# may also require changing sleep_time
# so that things still complete in a reasonable
# amount of time.
def do_typing(num_of_commits=30, sleep_time=20):

    me = github_login()

    repo = me.repository('your_github_username', 'GitHubTyper')

    for i in range(num_of_commits):
        # Create a file
        data = 'typing file'
        repo.create_file(path = 'files/dotfile.txt',
                         message = 'Add dot file',
                         content = data.encode('utf-8'))

        # Get the file reference for later use
        file_sha = repo.contents(path = 'files/dotfile.txt').sha

        # Delete the file
        repo.delete_file(path = 'files/dotfile.txt',
                         message = 'Delete dot file',
                         sha = file_sha)

        time.sleep(sleep_time)

Translating letters to a usable format

With a way to connect in hand, the code needed to know when to connect. Basically, I needed an on/off switch for every day represented on the heatmap. If the switch is on, the committing function should run, making the cell dark. If it is off, the committing function shouldn’t run, leaving the cell gray (or close to it, depending on what other commits are made that day).

Since we’re using the heatmap to display text, a matrix-based font seemed to make sense. If you’ve seen dot-matrix font styles, these will look familiar. Each position in the matrix corresponds to a day on the heatmap. I used values of ‘1’ and ‘0’ to indicate on and off days, respectively. (And technically these are lists, not matrices, but they are laid out like matrices to make them easier to create.)

As an example, here is the setup for the letter ‘A’:

letters_dict = {
'A' : [0,1,1,1,0,0,
       1,0,0,0,1,0,
       1,0,0,0,1,0,
       1,0,0,0,1,0,
       1,1,1,1,1,0,
       1,0,0,0,1,0,
       1,0,0,0,1,0],
...
}

These matrices are time consuming to create, so I’ve only created the few that I needed. If you make more, feel free to send them along via a pull request.
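To make the on/off logic concrete, here is a minimal sketch of how a scheduled run might decide whether today is an ‘on’ day. This is an illustration rather than the repo’s exact code: the function name should_commit_today is hypothetical, and it assumes the letters_dict shown above, a Sunday start date, and the 7×6 row-major layout of the letter matrices.

import datetime

LETTER_WIDTH = 6    # columns per letter, matching the 7x6 matrices above
DAYS_PER_WEEK = 7   # rows per heatmap column (Sunday at the top)

def should_commit_today(message, start_date, today=None):
    # Hypothetical helper: True if today's heatmap cell should be dark.
    today = today or datetime.date.today()
    days_elapsed = (today - start_date).days
    if days_elapsed < 0:
        return False

    # Each heatmap column is a week; each row is a day of the week.
    column = days_elapsed // DAYS_PER_WEEK
    row = days_elapsed % DAYS_PER_WEEK

    letter_index = column // LETTER_WIDTH    # which letter of the message
    letter_column = column % LETTER_WIDTH    # which column of that letter
    if letter_index >= len(message):
        return False

    letter = letters_dict[message[letter_index]]
    return letter[row * LETTER_WIDTH + letter_column] == 1

If it returns True, the scheduled job calls do_typing(); otherwise it exits without committing.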

Automating and scheduling the runs

Now that I had a way to do commits programmatically and something to commit, I needed a way to schedule the Python script to run at the appropriate time. The ultimate goal was to be able to tell the script what I wanted to do at the beginning and have it run unsupervised for a few weeks until it completed.

This was achieved with PythonAnywhere.com and a little bash script. Python Anywhere is a Python-oriented hosting environment. Among many other things, it can be used to schedule Python scripts to run at certain times of the day. A free account allows one daily task and http calls to a whitelist of domains. Fortunately, one task is all we need to run and GitHub.com is on the whitelist.

After uploading the Python code, I created a really simple bash script that calls the main Python script and is scheduled to run daily:

#!/bin/sh
python3.5 GitHubArt/main.py '$echo "Hi"' '2016-07-31'

And that’s all. The first parameter is the message to display on the GitHub heatmap, the second is the date on which to start typing. Since the GitHub heatmap starts with Sunday at the top, this date should also be a Sunday.

Philosophical Implications

I took a very obvious approach for someone with no artistic talent – I am using this functionality to print out a *nix command. Christo would not be impressed. Honestly, though, the prospect of using this to make art seems really cool. Given that the intensity can differ for each cell in the heatmap, it is essentially as versatile as a grayscale palette. If I had an artistic bone in my body I might give it a shot. For now, I’ll just use text and appreciate my simple creations as a Buddhist would, for their intrinsic and ephemeral beauty.

You can find all the project files here:
https://github.com/bryancshepherd/GitHubArt

Article XII Alexa Skill

I’ve redone this post several times already and haven’t been able to get the tone in line with the rest of the site, so I’ll just stick to the facts:

  • Donald Trump said as president he would support Article XII of the Constitution.
  • There is no Article XII of the Constitution.
  • I love my Amazon Echo.
  • I have been wanting to write a skill for my Amazon Echo (a.k.a. Alexa).
  • I wrote an Alexa skill that describes some things Article XII might cover, if it existed.
  • It is mostly based on this work by Tim Carr.
  • If you want to submit additional things that Article XII might cover (if it existed) tweet them to @bryancshepherd hashtag #whatisarticleXII or post them in the comments. I will add them to the next version of the skill if the initial version passes Amazon’s review.
  • There is no chance this skill will pass Amazon’s review.

One of the responses below is selected at random each time Alexa is asked ‘Alexa, what is Article Twelve?’

Article XII of the U.S. Constitution…

  • requires that all dogs be trained to shoot free throws, in the event that such a skill is required to settle an international dispute.
  • governs the creation, distribution, and taxation of loofahs.
  • states that harboring pink, furry, intergalactic lifeforms is prohibited.
  • makes it illegal to carry different denominations of change in the same pocket.
  • describes the process for making dingy whites all nice and sparkly again.
  • requires that cats look apathetic and nonchalant after doing something dumb.
  • certifies these are not the droids you’re looking for.
  • prohibits making the ‘It must be free’ joke to cashiers when products do not ring up correctly.
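For the curious, the random selection itself is only a few lines. This is a minimal sketch, not the actual skill code (which is based on Tim Carr’s work linked above); it assumes a Lambda-backed custom skill, uses the standard Alexa response JSON, and the handler layout and RESPONSES list here are made up for illustration.

import random

# Hypothetical list; the real skill uses the responses above.
RESPONSES = [
    "requires that all dogs be trained to shoot free throws, in the event that such a skill is required to settle an international dispute.",
    "governs the creation, distribution, and taxation of loofahs.",
]

def lambda_handler(event, context):
    # Pick one response at random and wrap it in the Alexa response format.
    speech = "Article Twelve of the U.S. Constitution " + random.choice(RESPONSES)
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }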

R Helper Functions for Indeed.com Searches

If you need job listing data, Indeed.com is a natural choice. It is one of the most popular job sites on the Internet and has listings from a wide range of industries. Indeed has APIs for things like affiliate widgets, but nothing that allows one to directly download a list of job results. Fortunately, the URL structure and site layout are fairly straightforward and lend themselves to easy webscraping.

The following functions wrap rvest capabilities for use on Indeed.com. A write up of the project that required these, including more detailed examples, will follow at some point. For now, I think the use of these functions is straightforward enough without much documentation. If not, email me or ask questions in the comments.

To source just the helper functions you can use:

source("https://raw.githubusercontent.com/bryancshepherd/IndeedJobSearchFunctions/master/jobSearchFunctions.R")

The full GitHub repo is here.

Get the job results – this may take a couple of minutes

jobResultsList = getJobs("JavaScript", nPages=10, includeSponsored = FALSE, showProgress = FALSE)
head(jobResultsList[["JavaScript"]][["Titles"]], 5)
## [1] "Frontend Software Engineer"                        
## [2] "Web Developer Front End HTML CSS"                  
## [3] "Sr. HTML5/JS Engineer"                             
## [4] "Web & Mobile Software Engineer"                    
## [5] "Software Engineer, JavaScript - Mobile - SNEI - SF"
head(jobResultsList[["JavaScript"]][["Summaries"]], 5)
## [1] "Expert in JavaScript, D3, AngularJS. The name ThousandEyes was born from two big ideas:...."                                                                       
## [2] "CSS, HTML, JavaScript:. Knowledge of JavaScript is handy. Web Developer Front End Programming...."                                                                 
## [3] "Hand coded JavaScript. Javascript, HTML 5, CSS 3, Angular. Xavient Information System is seeking a HTML/Javascript Developer with at least 3 year of expert..."    
## [4] "At least 1 year experience in applying knowledge of Javascript framework. Java, JSP, Servlets, Javascript Frameworks, HTML, Cascading Style Sheets (CSS),..."      
## [5] "Skilled JavaScript, HTML/CSS developer. Software Engineer, JavaScript - Mobile - SNEI - SF. 2+ years of single page web application development experience with..."

Collapse all of the terms into a large list and remove stopwords

cleanedJobData = cleanJobData(jobResultsList)
head(cleanedJobData[["JavaScript"]][["Titles"]], 5)
## [1] "frontend"  "software"  "engineer"  "web"       "developer"
head(cleanedJobData[["JavaScript"]][["Summaries"]], 5)
## [1] "expert"     "javascript" "d3"         "angularjs"  "name"

Create ordered wordlists for titles and descriptions

orderedTables = createWordTables(cleanedJobData)
head(orderedTables[["JavaScript"]][["Titles"]], 5)
##         Var1 Freq
## 27 developer   56
## 34  engineer   35
## 81       web   34
## 33       end   30
## 37     front   29
head(orderedTables[["JavaScript"]][["Summaries"]], 5)
##           Var1 Freq
## 264 javascript  129
## 107        css   45
## 230       html   43
## 538        web   41
## 173 experience   39

Create a flat file from the aggregated data for easier manipulation and plotting

flatFile = createFlatFile(orderedTables)
head(flatFile, 5)
##   searchTerm resultType resultTerms Freq   Percent
## 1 JavaScript     Titles   developer   56 15.642458
## 2 JavaScript     Titles    engineer   35  9.776536
## 3 JavaScript     Titles         web   34  9.497207
## 4 JavaScript     Titles         end   30  8.379888
## 5 JavaScript     Titles       front   29  8.100559

Tracking Reddit Freshness with Python, D3.js, and an RPi

Background

A couple of years ago I purchased some Raspberry Pis to build a compute cluster at home. A cluster of RPis isn’t going to win any computing awards, but they’re fun little devices and since they run a modified version of Debian Linux, much of what you learn working with them is generalizable to the real world. Also, it’s fun to say you have a compute cluster at your house, right?

Unfortunately, the compute cluster code and project notes are lost to the ages, so let us never speak of it again. However, after finishing that project I moved on to another one. That work was almost lost to the ages too, but as I was cleaning up the RPis for use as print and backup servers, I came across the aging and dusty code. Although old, this code might be useful to others, so I figured I would write it up quickly and post the code to GitHub. This post describes using the Raspberry Pi, Python, Reddit, D3, and some basic statistics to set up a simple dashboard that displays the freshness of the Reddit front page and r/new.

Dashboarding

D3.js was the exciting new visualization solution at the time so I decided to use the Pi’s Apache-based webstack to serve a small D3-based dashboard.

Getting the data

Raspberry Pis come with several OS choices. Raspbian is a version of Debian tailored to the Raspberry Pi and is usually the best option for general use. It comes preinstalled with Python 2.7 and 3, meaning it’s easy to get up and running for general computing. Given that Python is available out of the box, I often end up using the PRAW module to get toy data. PRAW is a wrapper for the Reddit API. It’s a well written package that I’ve used for several projects because of its ease of use and the depth of the available data. PRAW can be added in the usual way with:

sudo python3 -m pip install praw

(If you’re using Python != 3, just specify your version in the command above.)

PRAW is straightforward to use, but I’m trying to keep this short so it’s beyond the scope of this write up. You can check out the well-written docs and examples at the link above and also check my code in the GitHub repo.

Analyzing the data

The analysis of the data was just to get to something that could be displayed on the dashboard, not to do fancy stats. The Pearson correlation numbers are essentially just placeholders, so don’t put much weight in them. However, the r/new analysis is based on the percent of new articles on each API pull and ended up showing some interesting trends – just a reminder that value comes from the appropriateness of your stats, not the complexity of them.

Where statistics or data manipulation were required they were done with Scipy and/or pandas. The Pearson correlation metric is defined as:

r_{xy} =\frac{\sum ^n_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n_{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n_{i=1}(y_i - \bar{y})^2}}

but you knew that. I just wanted to add some LaTeX to make this writeup look better.
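For concreteness, here is a rough sketch of the two numbers the dashboard tracks. This is illustrative rather than the repo’s exact code: the function names are mine, and it assumes you hold on to the submission IDs and front-page ranks from the previous API pull.

# Illustrative only -- function names and structure are not the repo's exact code
from scipy.stats import pearsonr

def percent_new(previous_ids, current_ids):
    # Share of the current r/new pull that wasn't in the previous pull
    if not current_ids:
        return 0.0
    new_ids = set(current_ids) - set(previous_ids)
    return 100.0 * len(new_ids) / len(current_ids)

def rank_correlation(previous_ranks, current_ranks):
    # Pearson correlation of front-page ranks for stories present in both pulls;
    # previous_ranks / current_ranks map submission id -> rank
    common = [sid for sid in current_ranks if sid in previous_ranks]
    if len(common) < 2:
        return float('nan')
    r, _ = pearsonr([previous_ranks[s] for s in common],
                    [current_ranks[s] for s in common])
    return r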

Displaying the data

D3 is the main workhorse in the data visualization, primarily using SVG. The code is relatively simple, as far as these things go. There are essentially two key code blocks, one for displaying the percent of new articles in r/new, the other for displaying a correlation of article ranks on the front page. Below is a snippet that covers the r/new chart:


// Create some required variables
var parseDate = d3.time.format("%Y%m%d%H%M%S").parse;
var x = d3.time.scale()
	.range([0, width]);
var y = d3.scale.linear()
	.range([height, 0]);
var xAxis = d3.svg.axis()
	.scale(x)
	.orient("bottom");
var yAxis = d3.svg.axis()
	.scale(y)
	.orient("left");

// Define the line
var line = d3.svg.line()
	.x(function(d) { return x(d["dt"]); })
	.y(function(d) { return y(+d["pn"]); });

// Skip a few lines ... (check the code in the repo for details)

// Set some general layout parameters
var s = d3.selectAll('svg');
s = s.remove();
var svg2 = d3.select("body").append("svg")
	.attr("width", width + margin.left + margin.right)
	.attr("height", height + margin.top + margin.bottom)
		.append("g")
	.attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// Bring in the data
d3.csv("./data/corr_hist.csv", function(data) {

	dataset2 = data.map(function(d) {
		return {
			corr: +d["correlation"],
			dt: parseDate(d["datetime"]),
			pn: +d["percentnew"],
			rmsd : +d["rmsd"]
		};
	});

	// Define the axes and chart text
	x.domain(d3.extent(dataset2, function(d) { return d.dt; }));
	y.domain([0,100]);
	svg2.append("g")
		.attr("class", "x axis")
		.attr("transform", "translate(0," + height + ")")
		.call(xAxis);
	svg2.append("g")
		.attr("class", "y axis")
		.call(yAxis)
	.append("text")
		.attr("transform", "rotate(-90)")
		.attr("y", 6)
		.attr("dy", ".71em")
		.style("text-anchor", "end")
		.text("% new");
	svg2.append("path")
		.datum(dataset2)
		.attr("class", "line")
		.attr("d", line);

	svg2.append("text")
	.attr("x", (width / 2))             
	.attr("y", -8)
	.attr("text-anchor", "middle")  
	.style("font-size", "16px")
	.style("font-weight", "bold")  
	.text("Freshness of articles in r/new");

});

And here’s a screenshot of what you get with that bit of code:

RedditFreshness

As you can see, the number of new submissions follows a very strong trend, peaking in the early afternoon (EST) and hitting a low point in the early morning hours. Depending on your perspective, Americans need to spend less time on Reddit at work or Australia needs to pick up the pace.

If you want to know more of the details head on over to the GitHub repo.

Displaying LaTeX in Atom’s Live Markdown Preview

The Atom editor

I’ve been using the Atom editor for a few months and I am very pleased with it overall. It doesn’t beat well-developed, language-specific IDEs such as RStudio or IPython notebooks, but it works very well for general-purpose editing and in cases where there isn’t a comprehensive IDE available (e.g., Julia). It has an active development community and that translates into a lot of extensibility. It has broad support for syntax highlighting and markdown editing, including a very helpful live markdown preview. Both of those features make writing markdown with embedded code very easy.

However, when I recently tried to include some LaTeX in one of my markdown files I hit a roadblock. Atom doesn’t support LaTeX in markdown previews out of the box and the solution took a little longer to find than I think it should have. Ergo, I’m posting this information in hopes that it might help others find the solution faster.

Defining the parameters

To be clear, this isn’t about rendering .tex files in Atom. If you’re looking for a package to build and display pure .tex files, that’s fairly straightforward; start with the latex package.

The goal here is to display LaTeX blocks and inline LaTeX in markdown, i.e., .md file previews. Being clear on that distinction can save you a good bit of time, as getting the latex package set up can take a while and it’s not required at all for markdown files. If you start there, you’ll waste an hour or two of your valuable time.

Markdown Preview Plus

The key to getting LaTeX to display in Atom markdown previews is the Markdown Preview Plus package. It is a fork of the core markdown preview package that adds several features, one of which is LaTeX support (via MathJax*).

What this package does can be a little unclear from the description, with some people assuming it provides support for .tex files (it does not). But I assure you, if you’re looking to preview LaTeX in your .md files this is what you want.

The install is simple:

apm install markdown-preview-plus

then disable the markdown preview package that ships with Atom (markdown-preview). Details on the installation process are here, if you need them.

After that, you need to know how to use the package. The docs are pretty clear that ctrl-shift-m opens a preview and ctrl-shift-x turns on math support, but that’s only half the battle. If you try to use straight LaTeX from there you’ll get no love. The key is in this doc – namely, you need to add $$ to delineate your math blocks and $ for inline math.
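For example, with math support toggled on, a markdown file like the following previews correctly (the formula itself is just sample content to show the delimiters):

Inline math such as $E = mc^2$ stays within the sentence, while display math gets its own block:

$$
r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}
$$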

And you’re done

With that information in hand, you should be good to go, hopefully in less time than it would have taken otherwise.


* The distinction between LaTeX and MathJax may or may not be important for your purposes.

Physical Computing – Puppies VS Kittens on Reddit

“The best way to have a good idea is to have a lot of ideas.” – Linus Pauling

Physical Computing

Although the term is used in a lot of ways, physical computing usually refers to using software to monitor and process analog inputs and then use that data to control mechanical processes in the real world. It’s already commonplace in some areas (e.g., autopilots), but it will be all the rage as the Internet of Things grows and automates. It’s also often used in interactive art and museum exhibits like the Soundspace exhibit at the Museum of Life and Science in Durham, NC. In this case, we’re manipulating the brightness of two LEDs based on the popularity of animals on Reddit, which I’d say is closer to the art end of the spectrum than the autopilot end.

RPis

Raspberry Pis are great for generating ideas. Because they consume very little power and have a very small form factor, they almost beg you to think up tasks to set them working on and then shove them away in a corner for a while. Because they run a modified version of Debian, it’s easy to take advantage of things like the Apache webstack and Python to get things up and running quickly. In fact, in a previous post, I showed an example of using a Pi to fetch data and serve it to a simple D3-based dashboard.

This is another project that takes advantage of the Reddit API, Python on the RPi, and the GPIO interface on the RPi to visually answer the age old question “What’s more popular on the internet – puppies or kittens?”

Get data via the Reddit API

Reddit has a very easy to use API, especially in combination with the Python PRAW module, so I use it for a lot of little projects. On top of the easy interface, Reddit is a very popular website, ergo lots of data. For this project I used Python to access the Reddit API and grab frequency counts for mentions of puppies and kittens. As you can see in the code (GitHub repo), I actually used a few canine and feline related terms, but ‘puppies’ and ‘kittens’ are where the interest and ‘aww’ factor are, so I’m sticking with that for the title.

The PRAW module does all the work getting the comments. After installing the module all that’s required is three lines of code:

import praw
r = praw.Reddit('Term tracker by u/mechanicalreddit') # Use a descriptive user agent that includes your Reddit username
allComments = r.get_comments('all', limit=750) # The maximum number of comments that can be fetched at a time is 1000

You now have a lazy generator (allComments) that you can work with to pull comment details. After fetching the comments, tokenizing, and a few other details that you can look at in the Git repo, we have the tokens (joined back into a single space-separated string, resultsSet) that we can send to a function that keeps a running sum for each set of terms:

def countAndSumThings(resultsSet, currentCounts):
    # Lowercase everything so matching is case-insensitive
    resultsSet = resultsSet.lower()
    for thing in currentCounts:
        thingLower = thing.lower()
        # Pad with spaces so only whole-word matches are counted
        searchThing = ' '+thingLower+' '
        thingCount = resultsSet.count(searchThing)
        currentCounts[thing] += thingCount
    return currentCounts
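A hypothetical call might look like the following. This isn’t the repo’s exact loop (it skips the tokenizing step mentioned above, and the two terms are just a slice of the actual search lists), but it shows how the running counts accumulate:

# Hypothetical usage -- the real term lists and processing loop live in the repo
currentCounts = {'puppies': 0, 'kittens': 0}

for comment in allComments:
    currentCounts = countAndSumThings(comment.body, currentCounts)

print(currentCounts)  # e.g. {'puppies': 12, 'kittens': 9}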

Visualizing the data – i.e., Little Blinky Things

Setting up the RPi GPIO circuitry to control the LEDs is beyond the scope of this post, and it’s covered elsewhere better than I could manage here. Here are a few resources you may find helpful:

  • http://www.instructables.com/id/Easiest-Raspberry-Pi-GPIO-LED-Project-Ever/
  • http://www.thirdeyevis.com/pi-page-2.php
  • http://raspi.tv/2013/how-to-use-soft-pwm-in-rpi-gpio-pt-2-led-dimming-and-motor-speed-control

On the software side, the RPi.GPIO module provides the basic functionality. The first part of this code initializes the hardware and prepares it to handle the last two lines which set the brightness of the LEDs.

# The RPi/Python/LED interface
import RPi.GPIO as GPIO ## Import GPIO library
GPIO.setmode(GPIO.BCM) ## Use Broadcom (BCM) pin numbering

# Initialize LEDs
# Green light
GPIO.setup(22, GPIO.OUT) ## Set up GPIO pin 22 as an output

# Initialize pulse width modulation on GPIO 22. Frequency=100Hz and OFF
pG = GPIO.PWM(22, 100)
pG.start(0)

# Red light
GPIO.setup(25, GPIO.OUT) ## Set up GPIO pin 25 as an output

# Initialize pulse width modulation on GPIO 25. Frequency=100Hz and OFF
pR = GPIO.PWM(25, 100)
pR.start(0)

# Skip some intermediary code...(see repo for the details)

# Update lighting
pG.ChangeDutyCycle(greenIntensity)
pR.ChangeDutyCycle(redIntensity)

Put this in a loop and you’re good to go with continual updating.
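The greenIntensity and redIntensity values come from the skipped intermediary code. One simple way to derive them is to turn the two running counts into duty-cycle percentages; treat this as an illustration rather than the repo’s exact scaling (and the ‘frustrations’ section below explains why raw percentages aren’t ideal):

# Illustrative scaling only -- the repo's actual intermediary code may differ
dogCount = currentCounts['puppies']
catCount = currentCounts['kittens']
total = dogCount + catCount

if total > 0:
    greenIntensity = 100.0 * dogCount / total   # PWM duty cycle, 0-100
    redIntensity = 100.0 * catCount / total
else:
    greenIntensity = redIntensity = 0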

And here’s a picture of the final setup. I added a cover to disperse the light a little bit and help with some of the issues with visual perception discussed below:

Blinky Blinky

Puppies or Kittens?

Drum roll please….

If the Internet is 90% cats, Reddit is an anomaly. Dogs were usually the most popular topic of conversation.

Some frustrations with…

…hardware

I didn’t have any breadboard jumper wires to connect the LEDs and I had some difficulty finding them. I had some male-to-male jumpers from an Arduino kit, but connecting an RPi to a breadboard requires female-to-male connectors. I expected Radio Shack to have them, but no luck. Incidentally, an old 40-wire IDE cable will also work with the newer 40-pin RPi GPIO headers, but an 80-wire cable will not. Note that while the headers for both cable types have 40 sockets and will physically fit onto the RPi board, attempting to use an 80-wire cable will almost certainly break your Pi. If you’re not sure which type of cable you have, just count the ridges the individual wires form in the ribbon. If you still have wires to count after you get to 40, you can’t use that cable. Unless, that is, you want to do what I ultimately did and use individual wires from the cable:

Disassembling an IDE cable

The individual wires of IDE cables are easy to separate and can be stripped with a tight pinch. Because they are single wires, they are easier to manage than if you tried to work with something like speaker wire.

…the visualization method

Brightness, it turns out, is not a great way to indicate magnitude. Because of the way our eyes perceive color and the way LED brightness responds to increases in current, it’s not a very clear comparison. The relationship between LED brightness and current is not linear, which means the raw percentages can’t be used directly as intensity levels.

…the data

Reddit gets very busy sometimes, and during those times posting of comments can slow down considerably due to commenters receiving page errors and a comment backlog developing. Presumably every comment eventually gets posted and is available to the API, but I’m not 100% sure. If comprehensiveness was a concern this would need to be looked into in more detail.

What’s next

In hindsight it would have been better to make the lights blink faster or slower rather than change the brightness. If there’s a version 0.2 of this project, that will be one of the changes.

Currently you have to change the code directly to change the search terms. Although setting up an interactive session or GUI is overkill for this application, it would be pretty easy to have the Reddit bot that does the term searches check its Reddit messages every so often for a list of new search terms. In that case, changing the search terms would be as easy as sending a Reddit message in some predefined format.

And finally, maybe you don’t care about the relative popularity of pets on Reddit. With the upcoming election, plugging in the candidate names might be interesting. Now that I have some real breadboard wires, I’ll probably set that up the next time I have a free weekend. But go ahead, feel free to clone the repo and do something useful.

Playing with Gradient Descent in R

Gradient Descent is a workhorse in the machine learning world. As proof of its importance, it is one of the first algorithms that Andrew Ng discusses in his canonical Coursera Machine Learning course. There are many flavors and adaptations, but starting simple is usually a good thing. In this example, it is used to minimize the cost function (the sum of squared errors or SSE) for obtaining parameter estimates for a linear model. I.e.:

\text{minimize } J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

Taking the gradient and applying the update rule to a linear model gives:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}\right)

Where \theta_0 is our intercept and \theta_1 is the parameter estimate of our only predictor variable.
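To fill in the step between the cost function and these update rules: with the linear hypothesis and the general gradient descent update

h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}, \qquad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

the partial derivatives of J are

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) \qquad \text{and} \qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}

which is exactly what gets multiplied by \alpha in the updates above.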

Ng’s course is Octave-based, but manually calculating the algorithm in an R script is a fun, simple exercise, and if you’re primarily an R user it might help you understand the algorithm better than the Octave examples. The full code is in this repository, but here is the walkthrough:

  • Create some linearly related data with known relationships
  • Write a function that takes the data and starting (or current) estimates as inputs
  • Calculate the cost based on the current estimates
  • Adjust the estimates in the direction of the negative gradient, by an amount scaled by the learning rate \alpha.
  • Recursively run the function, providing the new parameter estimates each time
  • Stop when the estimate converges (i.e., meets the stopping criteria based on the change in the estimates)

This code is for a simple single-variable model. Adding additional variables means calculating the partial derivative with respect to each parameter. In other words, adding a version of the \theta_1 cost component for each feature in the model. I.e.,

\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\right)

I sometimes use Gradient Descent as a ‘Hello World’ program when I’m playing with statistical packages. It helps you get a feel for the language and its capabilities.