Analyze HN “Who is hiring?” with a little bit of Python and Beautiful Soup 🐍🍜

Santiago Basulto
Published in rmotr.com
Jun 11, 2016


On the first weekday of every month, a “Who is hiring?” post is submitted to Hacker News by an automated bot. We’ve always liked reading those posts to see if any interesting trends are emerging. We also share them with our students, and we all enjoy discussing them: “wow, see how many React offers since last month”, “look, the SEC is hiring!”, etc. It’s fun and insightful.

This month, a student (Phillip Wright) decided to go a little bit further and build a Python script that could tell us how many offers were posted for a certain keyword. For example, how many posts included the keyword “Python”, or “javascript”, or “Django”, etc. Needless to say, we thought it was an amazing idea. We even created a group project based on it, and asked our students to publish their projects to PyPI.

The resulting script was extremely simple, thanks to BeautifulSoup4 (and a tiny bit of requests). We believe it’s a good example of the power-simplicity balance of Python; especially for beginners, who might erroneously think that these types of tasks are too challenging for them to complete.

Understanding Beautiful Soup

Beautiful Soup (bs4, to friends) is an HTML parsing library that lets you analyze the contents of an HTML document. You can use simple selectors (like the ones in jQuery) to move through the DOM tree. The work we need to do to analyze the job offers is super simple: we just need to extract all the job offers from the post, and then we can use simple text processing to search them for the given keywords. To understand how the job offers are structured in the page, we first need to analyze the HTML markup used for HN posts.
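As a quick taste of the API, here is a minimal sketch that parses a tiny hand-written document and picks elements with a CSS selector. The markup and company names are made up for illustration; only `BeautifulSoup`, `select` and `get_text` come from bs4 itself.

```python
from bs4 import BeautifulSoup

# Tiny hand-written document, used only for illustration.
html = """
<ul id="offers">
  <li class="offer">Acme is hiring Python developers</li>
  <li class="offer">Globex is hiring JavaScript developers</li>
  <li class="note">Not an offer</li>
</ul>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

# CSS selectors work much like they do in jQuery.
offers = soup.select("li.offer")

# get_text() returns the text content of an element.
offer_texts = [li.get_text(strip=True) for li in offers]
print(offer_texts)
```

The same two steps (select elements, then read their text) are all we need for the HN thread.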

Analyzing Hacker News HTML

Hacker News still uses tables to structure its page HTML ¯\_(ツ)_/¯. On top of that, every comment in a post (regardless of whether it’s a direct answer to the main post or a reply to another comment) has the same HTML and CSS structure: it’s contained in a `<tr>` HTML tag with the CSS class `.athing`.

That means that there’s no simple way to differentiate a job offer (a “root” comment) from a reply to an offer. If we decided to just grab every `tr.athing` element and inspect its contents looking for keywords, the data would be inaccurate, because replies that happen to mention a keyword would be counted as job offers.
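To see the problem concretely, here is a minimal sketch that runs the naive approach on a hand-written fragment imitating the structure described above. The markup, the company name and the placement of `width` on a spacer image are assumptions for illustration, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample: one job offer and one reply to it.
# The real HN markup has more elements and may differ in detail.
SAMPLE_HTML = """
<table>
  <tr class="athing"><td><table><tr>
    <td class="ind"><img src="s.gif" height="1" width="0"></td>
    <td class="default">Acme | Remote | Python, Django</td>
  </tr></table></td></tr>
  <tr class="athing"><td><table><tr>
    <td class="ind"><img src="s.gif" height="1" width="40"></td>
    <td class="default">Do you also hire Python interns?</td>
  </tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Naive approach: treat every comment row as a job offer.
rows = soup.select("tr.athing")
python_rows = [row for row in rows if "python" in row.get_text().lower()]

# Both rows match, although only the first one is an actual offer.
print(len(python_rows))
```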

To solve this issue, we can rely on the visual difference between those two types of comments. Every root comment is aligned to the left of the screen, while replies to “root” comments are indented to the right. HN’s page produces that indentation through a hardcoded `width` attribute in the first `<td>` tag of the comment’s internal `<table>`, identified by the CSS class `.ind` (😱).

(Thanks HN for your brilliant markup 😠)

Now, using a little bit of bs4 magic, we can get all the `<tr>` elements containing job offers by keeping only the rows that are not indented.
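A minimal sketch of that filter follows. It is not the original script: `indent_width` and `get_job_posts` are hypothetical helper names, and the sample markup is hand-written. Since the exact location of the `width` attribute is an assumption here, the helper looks on the `.ind` cell itself and then on any element inside it.

```python
from bs4 import BeautifulSoup

# Hand-written sample: one root comment (width 0) and one reply (width 40).
SAMPLE_HTML = """
<table>
  <tr class="athing"><td><table><tr>
    <td class="ind"><img src="s.gif" height="1" width="0"></td>
    <td class="default">Acme | Remote | Python, Django</td>
  </tr></table></td></tr>
  <tr class="athing"><td><table><tr>
    <td class="ind"><img src="s.gif" height="1" width="40"></td>
    <td class="default">Do you also hire Python interns?</td>
  </tr></table></td></tr>
</table>
"""


def indent_width(row):
    """Return the indentation width of a comment row, or None if unknown."""
    cell = row.find("td", class_="ind")
    if cell is None:
        return None
    # Check the cell itself first, then any element inside it.
    holder = cell if cell.has_attr("width") else cell.find(attrs={"width": True})
    if holder is None:
        return None
    try:
        return int(holder["width"])
    except ValueError:
        return None


def get_job_posts(soup):
    """Keep only the root comments, i.e. rows with no indentation."""
    return [
        row for row in soup.find_all("tr", class_="athing")
        if indent_width(row) == 0
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
job_posts = get_job_posts(soup)
print([post.get_text(" ", strip=True) for post in job_posts])
```

For the live thread, the HTML would come from `requests.get(thread_url).text`, where `thread_url` stands for the address of that month’s post.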

Searching for keywords

If this is not your first time seeing Python, you’ll know how simple it’s going to be to search for keywords. The magic `in` operator on string objects is all we need to do the job.
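A minimal sketch of the search, working on plain strings so it runs on its own. The function name `find_keyword_matches`, the sample posts and the case-insensitive comparison are assumptions made for illustration; `post_matches` is assumed to map each keyword to the list of posts that mention it.

```python
def find_keyword_matches(posts, keywords):
    """Map each keyword to the list of posts whose text mentions it."""
    post_matches = {keyword: [] for keyword in keywords}
    for post in posts:
        lowered = post.lower()
        for keyword in keywords:
            # Plain substring test, ignoring case.
            if keyword.lower() in lowered:
                post_matches[keyword].append(post)
    return post_matches


# Made-up posts standing in for the text of each job offer.
posts = [
    "Acme | Remote | Python, Django",
    "Globex | NYC | JavaScript and React",
    "Initech | Austin | python backend engineer",
]

post_matches = find_keyword_matches(posts, ["Python", "javascript", "Django"])
print(post_matches["Python"])
```

Keep in mind that `in` tests for substrings, so a keyword like “Java” would also match posts that only mention “JavaScript”.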

We can now use the `post_matches` dict to get the information we need, for example, printing how many posts we found per keyword.
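One way to print those counts, sketched under the assumption that `post_matches` maps each keyword to a list of matching posts. The sample data and the helper name `keyword_counts` are made up for illustration.

```python
# Made-up data in the assumed shape: keyword -> list of matching posts.
post_matches = {
    "Python": ["post a", "post b"],
    "javascript": ["post c"],
    "Django": [],
}


def keyword_counts(post_matches):
    """Return (keyword, count) pairs, most frequent keyword first."""
    counts = [(keyword, len(posts)) for keyword, posts in post_matches.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


for keyword, count in keyword_counts(post_matches):
    print(f"{keyword}: {count}")
```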

That’s it! Extremely simple, as we promised. Of course there’s a lot of room for improvement, and we encourage you to keep experimenting with it.

If you’re interested in seeing how to build a full-fledged version of this script, make it a command-line utility, and publish it on PyPI (to make it pip-installable), check out the following post: https://medium.com/rmotr-com/publishing-group-projects-on-pypi-17e60031f522
