"Open Data" has been the flavor of the year. Ten years ago most people were content to read their favorite government-issued pamphlets and reports as PDFs on the web, but a few people knew that there were untold applications locked within the data in those documents. In the past year or so, many jurisdictions, most notably at the city level, have been releasing sets of raw data. Now that we're starting to get access to new sources of data, in the form of RSS feeds, geographical data, and dumps from other kinds of databases, the next question is: what do we do with it, and how?
This article takes a look at one particular instance, vantrash.ca, and how it's using public data to save people's marriages. You see, I don't know how it is in other places, but dealing with residential garbage in Vancouver, B.C., is a bit of a pain. Maybe this was covered by one of the visiting broadcasters during the recent Olympics, but I must have missed it, and I'll assume you're the type who would rather read blogs than watch network TV, so here's a quick explanation of Vancouver's trash pickup system.
You have a weekly garbage pickup day, until a statutory holiday comes along, and then your pickup day skips ahead by one workday. For example, during the first three stat-free months of winter, my pickup day is Monday. That's easy to deal with: I put out the recycling and garbage sometime Sunday, and we have a clean recycling box to start refilling Monday evening. But then along comes Easter, and afterwards everyone's garbage pickup day advances one day, unless your pickup day was Friday, in which case it becomes Monday. But wait, Easter Monday is also a stat, so that means you advance two days. The last week of the year can be even crazier, with a three-day jump due to Christmas, Boxing Day, and New Year's.
Combine that with a 6:30 AM pickup time, and you're going to find some perfectly good mornings ruined by the sound of the garbage trucks zooming past your house because you hadn't put anything out for pickup. You can always throw on some clothes, zoom out of the house, toss the recycling and the garbage can in the minivan, and go racing after the trucks, but that will just raise the ire of your neighbors, who will think you've used up your garbage quota and are trying to horn in on theirs. Plus the kids will complain about the smell in the van.
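To make the rule above concrete, here is a minimal sketch in Perl using only core modules. It rests on an assumption of mine, not on anything the city publishes: that the next pickup always falls five workdays after the previous one, where a workday is Monday to Friday and not a stat. The function names are invented for this illustration, and the holiday list holds just the 2010 Good Friday and Easter Monday dates as sample input.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Convert 'YYYY-MM-DD' to epoch seconds at midnight UTC.
sub to_epoch {
    my ($date) = @_;
    my ($y, $m, $d) = split /-/, $date;
    return timegm(0, 0, 0, $d, $m - 1, $y);
}

# Convert epoch seconds back to 'YYYY-MM-DD'.
sub to_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%d', gmtime($epoch));
}

# A workday is Monday to Friday and not listed as a holiday.
sub is_workday {
    my ($epoch, $holidays) = @_;
    my $wday = (gmtime($epoch))[6];    # 0 = Sunday, 6 = Saturday
    return 0 if $wday == 0 || $wday == 6;
    return $holidays->{ to_date($epoch) } ? 0 : 1;
}

# ASSUMPTION: the next pickup is five workdays after the previous one.
sub next_pickup {
    my ($date, $holidays) = @_;
    my $epoch    = to_epoch($date);
    my $workdays = 0;
    while ($workdays < 5) {
        $epoch += 24 * 60 * 60;        # step forward one calendar day
        $workdays++ if is_workday($epoch, $holidays);
    }
    return to_date($epoch);
}

# Sample input: Good Friday and Easter Monday, 2010.
my %holidays = map { $_ => 1 } ('2010-04-02', '2010-04-05');

# Monday, March 29 pickup: the two stats push the next one to Wednesday.
print next_pickup('2010-03-29', \%holidays), "\n";    # prints 2010-04-07
```

With an empty holiday hash the same function simply returns the date one week later, which is the stat-free winter schedule.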
Show me the Data
Shift the scene from a dark, wet winter morning looking at the tail lights of a garbage truck roaring down a back alley, to the warm, dry hackers' hangout at the Vancouver City Archives. A bunch of local hackers have arranged to meet with a group of City workers who have released some new data sets and are looking for feedback. The mission now is to find an interesting dataset, apply a bit of code, and whip up an application in three hours. I went there one evening. If you have a look at Vancouver's Open Data Catalogue, you'll see how many different data sets there are, ranging from a list of city alleyways and bikeways down to city-run webcams and zoning boundaries. Most of the datasets have a geographical aspect to them, which is not surprising considering the source.
Most of the geographical datasets ship in three formats: KML, SHP, and DWG. You can use KML files right off the bat with an online map viewing utility like Google Maps or Bing. For example, from the catalogue I see that the URL for the list of webcam data is http://data.vancouver.ca/download/kml/webcams.kml, but that file isn't so interesting to look at in raw form. If you open up Google Maps like so: http://maps.google.com/maps?q=http://gisweb2.vancouver.ca/google/kml/webcams.kml, you'll see a much more interesting view.
So right off the bat we can build trivial web applications with data like this. There's a lot of open data in this application, but not much open source. Let's bring in some code. The classic example is vantrash.ca by Luke Closs and Kevin Jones, two local Vancouver programmers who are always looking for an interesting project, typically one built with open-source technologies. They had read David Eaves's post "How Open Data even makes Garbage collection sexier, easier and cheaper", got the data, scraped a couple of other city sites to get the pickup schedule for each zone throughout the year, and whipped together an alpha in Perl. Users go to the site, click on their location, and enter their email address. The site figures out which zone they live in, and they then get a weekly email reminder the day before the pickup date for that week.
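The reminder step is easy to picture in code. What follows is only a sketch of the idea, not vantrash.ca's actual implementation: the zone name, the pickup date, the subscriber address, and the function names are all made up for illustration.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Return the date one day before a 'YYYY-MM-DD' date.
sub day_before {
    my ($date) = @_;
    my ($y, $m, $d) = split /-/, $date;
    my $epoch = timegm(0, 0, 0, $d, $m - 1, $y) - 24 * 60 * 60;
    return strftime('%Y-%m-%d', gmtime($epoch));
}

# Given today's date, list [address, zone] pairs that are due a reminder,
# i.e. subscribers in every zone with a pickup scheduled for tomorrow.
sub reminders_for {
    my ($today, $pickups, $subscribers) = @_;
    my @to_send;
    for my $zone (sort keys %$pickups) {
        next unless grep { day_before($_) eq $today } @{ $pickups->{$zone} };
        push @to_send, map { [ $_, $zone ] } @{ $subscribers->{$zone} || [] };
    }
    return @to_send;
}

# Hypothetical data: one zone, one pickup date, one subscriber.
my %pickups     = ('zone-a' => ['2010-04-07']);
my %subscribers = ('zone-a' => ['alice@example.com']);

for my $reminder (reminders_for('2010-04-06', \%pickups, \%subscribers)) {
    my ($address, $zone) = @$reminder;
    print "Remind $address: $zone pickup is tomorrow\n";
}
```

Run once a day from cron, something along these lines would be enough to drive the emails; actually sending them is left out here.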
Perl to the Rescue
The interesting part of this article is why they selected Perl. True, they both had been writing large amounts of Perl code in the last few years, but there are other reasons. And if you've read articles like this before, you can guess that CPAN is part of the answer. Notice those two other columns next to "KML" in the data catalogue: "DWG" and "SHP". A couple of minutes at Wikipedia told me that DWG files are an AutoCAD format, and that SHP files, also called shapefiles, follow an open specification for GIS systems. From that, knowing nothing else, I knew which format I'd rather work with, so I went to CPAN and downloaded Geo::Shapefile. I then filled in the gaps in the documentation by firing up an interactive Perl shell, typing in commands, and getting back large numbers of polygons. This is basically the process Luke and Kevin went through, taking vantrash.ca from concept through alpha to release after a few evenings and weekends of coding.
Calculating a user's zone based on his location is a typical problem from computational geometry, easily solved by downloading the Math::Polygon module, feeding it the data, and using the contains(POINT) method to determine which zone a point is in. There are no doubt other ways to solve the problem, and solving it isn't a strength peculiar to Perl. But this is the kind of problem Perl and CPAN make easy to solve.
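To show what a containment test involves, here is a minimal ray-casting sketch in plain Perl with no CPAN dependencies. It is a stand-in for illustration, not Math::Polygon's code and not what vantrash.ca runs. The zone names and coordinates are invented; real zones would come from the city's shapefile.

```perl
use strict;
use warnings;

# Ray-casting test: cast a horizontal ray from the point toward +x and
# count the polygon edges it crosses. An odd count means "inside".
sub polygon_contains {
    my ($point, $ring) = @_;
    my ($px, $py) = @$point;
    my $inside = 0;
    for my $i (0 .. $#$ring) {
        my ($x1, $y1) = @{ $ring->[$i] };
        my ($x2, $y2) = @{ $ring->[ ($i + 1) % @$ring ] };
        # Skip edges that do not straddle the ray's y coordinate.
        next unless ($y1 > $py) xor ($y2 > $py);
        # x coordinate where this edge meets the ray's height.
        my $x_cross = $x1 + ($py - $y1) * ($x2 - $x1) / ($y2 - $y1);
        $inside = !$inside if $px < $x_cross;
    }
    return $inside ? 1 : 0;
}

# Return the name of the first zone containing the point, or undef.
sub zone_for {
    my ($point, $zones) = @_;
    for my $name (sort keys %$zones) {
        return $name if polygon_contains($point, $zones->{$name});
    }
    return;
}

# Hypothetical zones: two adjacent squares.
my %zones = (
    'zone-a' => [ [0, 0], [4, 0], [4, 4], [0, 4] ],
    'zone-b' => [ [4, 0], [8, 0], [8, 4], [4, 4] ],
);

print zone_for([6, 2], \%zones), "\n";    # prints zone-b
```

With Math::Polygon installed, the hand-rolled test can be swapped for something like Math::Polygon->new(points => $ring)->contains($point). As far as I can tell from its documentation, the module works with closed polygons, so the first point would be repeated at the end of each ring.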
Show me the Code