Much ado about open data

Disclaimer: The content of this post is purely the opinion of the author, and not of any current or previous employer. I did not speak with the Maryland Transit Administration or The Transit App before writing this.

The tech and civic journosphere reported on Wednesday evening that the Maryland Transit Administration’s bus tracking project had been ‘hacked’ to ‘save Baltimore $600,000 in one day.’ Hyperbole ensued. If you aren’t familiar with the story, it can be summed up in the following quotes from The Transit App and the MTA’s rebuttal:

Baltimore’s data wasn’t made available in a developer-friendly format…This means that Baltimore’s real-time tracking data isn’t compatible with the most popular commuter apps… When reporters asked the MTA why they opted to only show the info on their mobile webpage … the MTA responded that it would have been too expensive.

Why are we still working on it? Well, our data is not in GTFS-RT format. (… General Transit Feed Specification-Real Time – is [a] data format used by developers to make transit apps.) …Our CAD/AVL system pre-dates Google. That’s why GTFS-RT was never a requirement… The cost to convert our CAD/AVL data prior to the development of the interface was going to cost MD taxpayers an additional $600,000.

I’d like to applaud both sides for being civil. Many similar discussions have devolved into the realm of toxicity. What I’ve found lacking in this discussion are nuance and actual lessons to be learned for the industry as a whole. That’s why I’m throwing my hat in the ring.

Required Transit Nerd Reading: TCRP 135

TCRP Report 135 provides a primer on the transit scheduling process. Intended for training those who have to schedule service or manage the process, it is well written and easy to follow. Whether you are a critic, nerd, foamer, or enthusiast, it is a worthwhile read that might illuminate some of the oddities you see in schedules.

A rant on platforms

This post covers the current trend towards ubiquitous technology in transit.

This is all well and good, but the article only alludes to how systems that succeed over the long term are implemented. The industry has a history of bad technology design decisions that result in monolithic systems that prove unmaintainable and non-expandable. There is also a recent trend among certain vendors to sell an intentionally crippled system by unbundling data ownership and the license to re-use the data from their standard product, so that agencies must pay extra to own their own data. This makes additional applications or analysis more difficult than necessary.

The tech-industry buzzword for a system that avoids this by exposing all of its information is “platform.” The term was vaulted to legendary status in this essay (http://steverant.pen.io/). A platform is, put simply, accessible by other platforms, modular and extensible, well-documented and well-tested, stable, maintainable, robust, and not redundant. Platforms ensure that the owner of the project owns not only the data, but the interfaces as well.

If you’re in a role to evaluate proposals, ask your vendors: show me the data!

Journal of Public Transportation article on OneBusAway

The latest issue of the Journal of Public Transportation has an article on the multi-region features of OneBusAway. By leveraging the existing OBA work, other locales can gain an off-the-shelf product that is far more extensible than any of the existing commercial offerings.

Disclaimer: I sit on the OBA board as part of my day job.

BASHing through GTFS

Quite a bit has been written about putting GTFS in a database. That is great, but the power of the Bash shell and structure of GTFS enables faster queries if you know what to do.

For example, you can find the stops on a specific trip using a trip_id. With BART’s GTFS loaded, let’s give it a try:

Requires: bash shell, GTFS files unzipped to a directory

First, let’s take a look at the top of the trips.txt file.

$ head trips.txt
route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id
01,WKDY,01SFO10,San Francisco Int'l Airport,0,,
01,SAT,01SFO10SAT,Millbrae,0,,
01,SUN,01SFO10SUN,Millbrae,0,,
01,WKDY,02SFO10,San Francisco Int'l Airport,0,,
01,SAT,02SFO10SAT,Millbrae,0,,
01,SUN,02SFO10SUN,Millbrae,0,,
01,WKDY,03SFO10,San Francisco Int'l Airport,0,,
01,SAT,03SFO10SAT,Millbrae,0,,
01,SUN,03SFO10SUN,Millbrae,0,,

The column we’re looking for is the third column (trip_id). “Grep” is the universal text search tool, and “|” tells the Bash shell to pass the output from one command to another, forming a chain. As GTFS is comma-separated, we’re able to use the “cut” tool on a file to get just what we need. The “xargs” tool then can pass that on to another command. Taking a trip ID, you can see the particular stops that are on a trip by looking at a combination of stop_times.txt and stops.txt.

$ grep 183OAK11 stop_times.txt | cut -d "," -f 4 | xargs -I {} grep {} stops.txt | cut -d "," -f 2

Coliseum/Oakland Airport
Oakland Int'l Airport
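
One caveat with the grep approach: it matches substrings, so a trip_id like 01SFO10 would also pull in the rows for 01SFO10SAT. Below is a minimal sketch of an exact-match alternative using awk. The function name stops_for_trip is hypothetical, and the sketch assumes the same column positions as the cut commands above (trip_id in column 1 and stop_id in column 4 of stop_times.txt; stop_id in column 1 and stop_name in column 2 of stops.txt) and no quoted fields that contain commas.

```shell
# Hypothetical helper: print the stop names for exactly one trip_id.
# Usage: stops_for_trip TRIP_ID [GTFS_DIR]   (GTFS_DIR defaults to ".")
stops_for_trip() {
  local trip_id="$1" gtfs_dir="${2:-.}"
  awk -F ',' -v trip="$trip_id" '
    # First file (stops.txt): remember each stop name under its exact stop_id.
    NR == FNR { name[$1] = $2; next }
    # Second file (stop_times.txt): compare the whole trip_id field,
    # so 183OAK11 does not also match 183OAK11SAT.
    $1 == trip { print name[$4] }
  ' "$gtfs_dir/stops.txt" "$gtfs_dir/stop_times.txt"
}
```

Run from the GTFS directory, stops_for_trip 183OAK11 should list the stop names in the order they appear in stop_times.txt.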

You can even go crazy with this. The following beast of a command will give *all* of the stops that are made on the Pittsburg line on any trip during the schedule.

$ grep -E "Pittsburg/Bay Point - SFIA/Millbrae" routes.txt | cut -d "," -f 1| xargs -I {} grep -e ^{} trips.txt | cut -d "," -f 3| xargs -I {} grep {} stop_times.txt | cut -d "," -f 4 | sort -u | xargs -I {} grep {} stops.txt

12TH,12th St. Oakland City Center,,37.803664,-122.271604,12TH,http://www.bart.gov/stations/12TH/
16TH,16th St. Mission,,37.765062,-122.419694,16TH,http://www.bart.gov/stations/16TH/
19TH,19th St. Oakland,,37.80787,-122.269029,19TH,http://www.bart.gov/stations/19TH/
19TH_N,19th St. Oakland,,37.80787,-122.269029,19TH,http://www.bart.gov/stations/19TH/
19TH_N,19th St. Oakland,,37.80787,-122.269029,19TH,http://www.bart.gov/stations/19TH/
24TH,24th St. Mission,,37.752254,-122.418466,24TH,http://www.bart.gov/stations/24TH/
BALB,Balboa Park,,37.72198087,-122.4474142,BALB,http://www.bart.gov/stations/BALB/
CIVC,Civic Center/UN Plaza,,37.779528,-122.413756,CIVC,http://www.bart.gov/stations/CIVC/
COLM,Colma,,37.684638,-122.466233,COLM,http://www.bart.gov/stations/COLM/
CONC,Concord,,37.973737,-122.029095,CONC,http://www.bart.gov/stations/CONC/
DALY,Daly City,,37.70612055,-122.4690807,DALY,http://www.bart.gov/stations/DALY/
EMBR,Embarcadero,,37.792976,-122.396742,EMBR,http://www.bart.gov/stations/EMBR/
GLEN,Glen Park,,37.732921,-122.434092,GLEN,http://www.bart.gov/stations/GLEN/
LAFY,Lafayette,,37.893394,-122.123801,LAFY,http://www.bart.gov/stations/LAFY/
MCAR,MacArthur,,37.828415,-122.267227,MCAR,http://www.bart.gov/stations/MCAR/
MCAR_S,MacArthur,,37.828415,-122.267227,MCAR,http://www.bart.gov/stations/MCAR/
MCAR_S,MacArthur,,37.828415,-122.267227,MCAR,http://www.bart.gov/stations/MCAR/
MLBR,Millbrae,,37.599787,-122.38666,MLBR,http://www.bart.gov/stations/MLBR/
MONT,Montgomery St.,,37.789256,-122.401407,MONT,http://www.bart.gov/stations/MONT/
NCON,North Concord/Martinez,,38.003275,-122.024597,NCON,http://www.bart.gov/stations/NCON/
ORIN,Orinda,,37.87836087,-122.1837911,ORIN,http://www.bart.gov/stations/ORIN/
PHIL,Pleasant Hill/Contra Costa Centre,,37.928403,-122.056013,PHIL,http://www.bart.gov/stations/PHIL/
PITT,Pittsburg/Bay Point,,38.018914,-121.945154,PITT,http://www.bart.gov/stations/PITT/
POWL,Powell St.,,37.784991,-122.406857,POWL,http://www.bart.gov/stations/POWL/
ROCK,Rockridge,,37.844601,-122.251793,ROCK,http://www.bart.gov/stations/ROCK/
SBRN,San Bruno,,37.637753,-122.416038,SBRN,http://www.bart.gov/stations/SBRN/
SFIA,San Francisco Int'l Airport,,37.616035,-122.392612,SFIA,http://www.bart.gov/stations/SFIA/
SSAN,South San Francisco,,37.664174,-122.444116,SSAN,http://www.bart.gov/stations/SSAN/
WCRK,Walnut Creek,,37.905628,-122.067423,WCRK,http://www.bart.gov/stations/WCRK/
WOAK,West Oakland,,37.80467476,-122.2945822,WOAK,http://www.bart.gov/stations/WOAK/
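
The repeated rows for 19TH_N and MCAR_S in the output above appear to be a side effect of substring matching: grep 19TH also matches the 19TH_N row, which is then printed again when 19TH_N itself is looked up. A rough sketch of the same query with exact field comparisons follows. The function name stops_for_route is hypothetical, and the column positions are assumed to be the ones the cut commands above rely on (route_id in column 1 of routes.txt and trips.txt, trip_id in column 3 of trips.txt, trip_id and stop_id in columns 1 and 4 of stop_times.txt, stop_id in column 1 of stops.txt), with no quoted fields that contain commas.

```shell
# Hypothetical helper: print each stops.txt row served by a route, once.
# Usage: stops_for_route "ROUTE NAME TEXT" [GTFS_DIR]   (GTFS_DIR defaults to ".")
stops_for_route() {
  local route_name="$1" gtfs_dir="${2:-.}"
  awk -F ',' -v pattern="$route_name" '
    # routes.txt: keep the route_id of any row containing the route name.
    FILENAME == ARGV[1] { if (index($0, pattern)) routes[$1] = 1; next }
    # trips.txt: keep trip_ids whose route_id is one of those routes.
    FILENAME == ARGV[2] { if ($1 in routes) trips[$3] = 1; next }
    # stop_times.txt: keep stop_ids visited by any of those trips.
    FILENAME == ARGV[3] { if ($1 in trips) stops[$4] = 1; next }
    # stops.txt: print a row only when its stop_id matches exactly.
    ($1 in stops) { print }
  ' "$gtfs_dir/routes.txt" "$gtfs_dir/trips.txt" \
    "$gtfs_dir/stop_times.txt" "$gtfs_dir/stops.txt"
}
```

Because each file is read only once, this also avoids launching a separate grep per trip, which adds up on a large stop_times.txt.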

Specifying and Validating Open Standard Interfaces using Open Source Tools: Part I

Interchange of data between systems can be a tricky affair. There are a handful of standards that exist, but figuring out how to use them is not always straightforward. This is the first in a brief series of posts devoted to showing how open standard interfaces for public transit data interchange can be specified with the skills learned in a first-semester Java course coupled with a hint of XML.

Standards with XML encoding typically have a readily-available XML Schema Definition (XSD), which formally describes the standard in machine-readable format. The two that I’ll cover in this series are TCIP and SIRI. The nifty trick with XSDs is that they can be “bound” (compiled) into objects in an object oriented programming language like Java or C#. XML that is “marshalled” (generated) from those objects can then be validated against the original XSD for completeness. The next post will introduce how to specify and validate interfaces using pre-compiled Java classes.
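
For a quick look at the validation half of that loop without writing any Java, a message can be checked against an XSD from the command line. This is only a sketch, not the binding approach this series describes: it assumes xmllint (part of libxml2) is installed, and the function name validate_against_xsd and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: validate one XML document against an XSD.
# Relies on xmllint from libxml2 (assumed to be installed).
# Returns 0 when the document validates and non-zero otherwise.
validate_against_xsd() {
  local xsd="$1" xml="$2"
  # --noout suppresses echoing the parsed document; --schema selects the XSD.
  xmllint --noout --schema "$xsd" "$xml"
}

# Example usage (placeholder file names):
#   validate_against_xsd siri.xsd response.xml && echo "valid"
```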

Slimming down the LION

New York City’s Department of City Planning provides a wonderful resource in the LION dataset, which contains lines for streets, paths, railroads, and administrative boundaries. There is one problem with the most recent version: it is only available in ESRI’s proprietary File GeoDatabase (or GDB) format. While the format has some advantages (more on these below), it is not easily accessible in many free/open source tools. I recently had occasion to convert LION to a more accessible format for display and light analysis, and this post outlines that methodology and provides the results.

AVL: Monitoring

Monitoring systems are invisible, yet crucial. They help find out if there is a problem, and also help pin down the source of a failure so it can be fixed. Ideally, all of the boxes from the introductory post should have some point at which they can be monitored, both on input and output interfaces. In this post, I will briefly introduce the concepts in monitoring and explain why AVL systems, whether turn-key or custom-built, should have a monitoring component, and why that component should evolve after delivery.

The four core concepts of monitoring systems are metrics, checks, alerts, and alarms.

AVL: APIs. Own your data, own your interfaces.

If you’ve done your homework on systems and how they hook up to the Internet, you have encountered the term API. APIs (Application Programming Interfaces) are defined methods for applications to talk to each other. All web users use them daily without knowing; if you load Google Maps or any Webmail page, your computer is first downloading a program that runs in your browser and that program is making ‘calls’ to an API to fill your screen with information. Many, though not all, AVL vendors include APIs with their system for data to be used in novel applications.

That is all well and good, but it might surprise some purchasers of AVL systems that they do not own the rights to data from their own vehicles. Much as airlines do with fares, some vendors unbundle data ownership and the license to re-use the data from their standard product, requiring extra payment to get access. This makes any additional applications or analysis very difficult. You shouldn’t let this happen, and many of us in the field are adamant on this point. Further legal opinion on the topic can be found in TCRP Legal Digest 37, which notes that:

Although real-time data as such are not copyrightable, the majority view seems to be that a license or other agreement with provisions restricting access to or use or dissemination of data are not preempted by the Copyright Act. The rationale is that contracts affect the rights of the parties to the contract and do not involve exclusive rights against the world as exist under the copyright laws.

Legal Digest 37 also contains sample contractual language to make data ownership clear. Ideally, this would be included in procurement documents.

Now that this is out of the way, I’d like to introduce Sean Barbeau’s presentation from APTA TransitTech 2013, which provides a very good introduction to APIs in the real-time transit context.

Relatively new and novel data sets are subject to one set of problems — the data itself will have been less well scrutinized and is more likely to contain errors, small and large. – Nate Silver