Screen Scraping with Node.js

by MondoWindow CTO Tyler Freeman

At MondoWindow, we do a lot of screen scraping of sites like Wikipedia, Wikitravel, etc. for geo-located content, so we can include it as pins on the map. One of the best new tools for this is Node.js, since you can write simple scripts that use jQuery to parse the HTML of various sites and lift out those delicious bits of content amongst all the stuff we don't need.

"Whoa, what are you saying there, Tyler? That’s a lot of code jargon."

Let's have a little background for the un-nerd-formed: Node.js basically lets you run JavaScript on a server instead of in your web browser. jQuery is a JavaScript library that makes things like finding certain bits of content in a web page really easy. Combining these two things in a techno ménage à trois with node.io, a data scraping framework, lets you do things that used to require paying hundreds of lowly interns to burn out their retinas and cramp their mouse fingers copying and pasting certain parts of pages into a database/spreadsheet/colored paper.

For instance, let's say we want to find all the pages on Wikitravel that relate to a destination near a certain airport. Wikitravel has an amazing community of contributors that have already gone through all the pages and tagged them with the IATA code of the nearest airport, using a special tag that looks something like "{{IATA|SFO}}" (in the case of SFO airport). Using node.io, we can take all the IATA codes in our airport database, and feed them one at a time into Wikitravel's search engine, in the format of the aforementioned tag, like this.

[Screenshot: Wikitravel search results for "{{IATA|SFO}}"]
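
For the code-curious, here is a minimal sketch of that search step: turning each IATA code into a search URL for its tag. The helper name iataSearchUrl, the base URL, and the "search" query parameter are assumptions for illustration, not lifted from our actual script, so check them against the real site before relying on them.

```javascript
// Hypothetical base URL and query parameter (assumptions, not from the original scraper).
const SEARCH_BASE = 'https://wikitravel.org/wiki/en/index.php';

// Build the search URL for one airport by wrapping its code in the {{IATA|...}} tag.
function iataSearchUrl(iataCode) {
  const tag = '{{IATA|' + iataCode.trim().toUpperCase() + '}}';
  // encodeURIComponent escapes the braces and the pipe so the tag survives in a query string.
  return SEARCH_BASE + '?search=' + encodeURIComponent(tag);
}

// Feed the codes in one at a time, as described above.
const airportCodes = ['SFO', 'oak', ' SNA '];
const searchUrls = airportCodes.map(iataSearchUrl);
console.log(searchUrls.join('\n'));
```

Normalizing the code (trimming and upper-casing) before building the tag is a small design choice that keeps a messy airport database from producing searches that silently return nothing.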

Now you can see that on that page is a bunch of links with descriptions, a search box, ads, etc. This is where jQuery comes in. By giving jQuery a simple selector like "$('#bodyContent ul li a')", we can isolate only the links on the page which are part of the search results (instead of links to ads, descriptive text, etc.). Then we can follow each link, download that page, and save it to our database to show on the map.

"But wait, Tyler, this search is all wrong! If I'm flying into SFO, I definitely want to know about San Francisco, but it's all the way at the bottom of the page!"

You are absolutely right, dear reader. So how do we find the most relevant page, instead of sending all those poor, unsuspecting tourists, fresh off the plane, to downtown El Cerrito? (Which, by the way, they would never forgive us for - I've been there and it's not pretty.)

The answer here is a little bit of artificial intelligence.

Well, in this case it's as close to "intelligence" as zombies are to Stephen Hawking, but it will get us close enough. We add a little bit of logic to check the title of each search result link against the airport's name and the city it serves. By weighting each word in the result's title according to what it matches, in decreasing order of importance (the airport name, then the city the airport serves, then whatever else might be in there - "John Wayne", anybody?), we can come up with a pretty good guess at the most important article to show to someone heading to a given airport.

For instance, SFO's long name is "San Francisco International Airport." The actual airport is located in San Bruno, CA. By using our little pseudo-AI algorithm, we can match the words "San Francisco" to the link at the bottom of our search page, and therefore determine that it's probably the page that people will find most useful. Of course, this is not foolproof, especially for those weirder airports, so don't fire those interns just yet!
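
To make the pseudo-AI a bit more concrete, here is a rough sketch of that kind of word matching. The weights, the list of generic words to ignore, the per-word averaging, and the names scoreTitle and rankTitles are all assumptions made up for this example; the real script may well score things differently.

```javascript
// Illustrative weights and stop words (assumptions, not values from the original scraper).
const WEIGHTS = { airportName: 3, cityServed: 2 };
const GENERIC_WORDS = new Set(['international', 'airport', 'regional', 'municipal']);

// Lowercase a string and split it into words, dropping generic airport terms.
// Note: this simple split only keeps ASCII letters and digits.
function toWords(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(function (word) { return word.length > 0 && !GENERIC_WORDS.has(word); });
}

// Score one search-result title against an airport record.
function scoreTitle(title, airport) {
  const nameWords = new Set(toWords(airport.name));
  const cityWords = new Set(toWords(airport.cityServed));
  const titleWords = toWords(title);
  if (titleWords.length === 0) return 0;

  let total = 0;
  titleWords.forEach(function (word) {
    if (nameWords.has(word)) total += WEIGHTS.airportName;
    else if (cityWords.has(word)) total += WEIGHTS.cityServed;
  });
  // Average per word so a long title with a couple of matches does not outrank an exact one.
  return total / titleWords.length;
}

// Return the titles with their scores, sorted from most to least relevant.
function rankTitles(titles, airport) {
  return titles
    .map(function (title) { return { title: title, score: scoreTitle(title, airport) }; })
    .sort(function (a, b) { return b.score - a.score; });
}

const sfo = { code: 'SFO', name: 'San Francisco International Airport', cityServed: 'San Francisco' };
const ranked = rankTitles(['El Cerrito', 'San Bruno', 'Oakland', 'San Francisco'], sfo);
console.log(ranked[0].title); // prints "San Francisco"
```

In this sketch "San Bruno" still picks up partial credit for the word "San", which is exactly the kind of fuzzy behavior that keeps the interns employed.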

By running this script thousands of times a second, we can quickly gather all the relevant articles for our entire airport database with our new robotic intern overlord. This is the magic that Node.js provides us, and we owe it to those awesome open-source developers for providing such a neat and easy way to do the dirty work. In kind, we've decided to give back to the Node community by posting the source of our little Wikitravel scraper script. You can find it on GitHub:

https://github.com/odbol/Data-Scraping-with-node.io

Go forth and scrape with zeal!

Tyler is MondoWindow’s CTO. He holds a master's degree in Digital Art and New Media from the University of California, Santa Cruz. He is also a cyborg, equipped with performance interfaces of his own design.