Querying GTFS with CSVkit – Transit Data Blog

A while ago I wrote a post about using built-in tools from the Bash command line shell to perform quick and dirty analysis of GTFS files. Those examples depended on the order of the columns in the GTFS, something which is not standard across all feeds. There is another set of utilities, csvkit, which can work around this problem and enable the same analysis to be performed regardless of column order.

This post assumes:

Some familiarity with the commandline. I use Bash on Linux or OS X, but PowerShell on Windows may work. Hint, a “|” means to pass data from one program to another, and “-” is shorthand for using that data in the next program along the line.
An installation of csvkit and its dependencies (python, etc)
Unzipped GTFS files.

The first tool that is worth taking a look at is csvlook, which simply pretty prints the file you feed into it. (forgive the appearance of the following on mobile.

~/bayarea_gtfs/muni$ csvlook routes.txt 
|-----------+-----------+------------------------+-------------------------+------------+------------+-----------+-------------+-------------------|
|  route_id | agency_id | route_short_name       | route_long_name         | route_desc | route_type | route_url | route_color | route_text_color  |
|-----------+-----------+------------------------+-------------------------+------------+------------+-----------+-------------+-------------------|
|  10867    | SFMTA     | 1                      | CALIFORNIA              |            | 3          |           |             |                   |

That could be more useful if we include only the filled in fields. This is where another tool, csvcut, becomes useful. It can cut output down only to the columns you ask for, like so:

~/bayarea_gtfs/muni$ csvcut -c agency_id,route_id,route_short_name,route_long_name routes.txt | csvlook
|------------+----------+------------------------+--------------------------|
|  agency_id | route_id | route_short_name       | route_long_name          |
|------------+----------+------------------------+--------------------------|
|  SFMTA     | 10867    | 1                      | CALIFORNIA               |
|  SFMTA     | 1033     | 1AX                    | CALIFORNIA A EXPRESS     |
|  SFMTA     | 1034     | 1BX                    | CALIFORNIA B EXPRESS     |
|  SFMTA     | 1031     | 31AX                   | BALBOA A EXPRESS         |
|  SFMTA     | 1032     | 31BX                   | BALBOA B EXPRESS         |

What if I want to know more some summary statistics of the dataset itself? The aptly named csvstat comes into play.

~/bayarea_gtfs/muni$ csvcut -c agency_id,route_short_name,route_long_name routes.txt | csvstat
  1. agency_id
	
	Nulls: False
	Values: SFMTA
  2. route_short_name
	
	Nulls: False
	Unique values: 85
	Max length: 22
  3. route_long_name
	
	Nulls: True
	Unique values: 76
	Max length: 23

Row count: 85

What’s tricky about using GTFS not in a database or a specialized library is that the data are normalized. In my previous post, I mentioned a hack using xargs, but csvkit has its own join functionality.

SF Muni’s GTFS has a non-human readable route_id, so analysis can be a little annoying, how about joining together and putting out something that ties together routes, trips, and stop_times?

~/bayarea_gtfs/muni$ csvjoin -c route_id trips.txt routes.txt | head -3 | csvjoin -c trip_id stop_times.txt - |  head | csvcut -c route_short_name,route_long_name,trip_headsign,arrival_time |csvlook
|-------------------+-----------------+-----------------------+---------------|
|  route_short_name | route_long_name | trip_headsign         | arrival_time  |
|-------------------+-----------------+-----------------------+---------------|
|  1                | CALIFORNIA      | the Richmond District | 05:33:00      |
|  1                | CALIFORNIA      | the Richmond District | 05:33:46      |
|  1                | CALIFORNIA      | the Richmond District | 05:34:35      |
|  1                | CALIFORNIA      | the Richmond District | 05:35:03      |
|  1                | CALIFORNIA      | the Richmond District | 05:35:27      |
|  1                | CALIFORNIA      | the Richmond District | 05:36:03      |
|  1                | CALIFORNIA      | the Richmond District | 05:36:47      |
|  1                | CALIFORNIA      | the Richmond District | 05:38:00      |
|  1                | CALIFORNIA      | the Richmond District | 05:38:30      |
|-------------------+-----------------+-----------------------+---------------|

What did this do? Join trips and routes, take only the first 3 rows of that, join stop_times to that, take the first 10 rows of that, only show route_short_name,route_long_name,trip_headsign,arrival_time, and make it pretty!

What about pulling together multiple GTFS files? There’s an awesome feature in Bash, globstar! It allows to search for the same filename in multiple directories, but it’s not enabled by default.

The possibilities are interesting, but I’ve got a boring example. What are the most common trip start times among the three biggest transit operators in the Bay Area– Ac Transit, BART, and SF MUNI?

~/bayarea_gtfs$shopt -s globstar

~/bayarea_gtfs$ csvstack **/stop_times.txt | csvgrep -c stop_sequence -r "^1$"| csvcut -c stop_sequence,arrival_time  > first_stops

~/bayarea_gtfs$ csvcut -c arrival_time first_stops | csvstat -y 0
  1. arrival_time
	
	Nulls: False
	Unique values: 1446
	5 most frequent values:
		11:00:00:	132
		17:00:00:	127
		19:00:00:	123
		12:00:00:	120
		14:00:00:	120
	Max length: 8

Row count: 36357

This one uses csvstack to concatenate all of the stop_times in the directory tree, cuts that down to departure time and sequence, takes only the first of the sequence (csvgrep), then cuts that again to departure time only. At this point, I saved the results to an intermediary file, as juggling this whole pipeline in memory was a bit much. Finally, I took that output and printed summary statistics on the result.

What does this say? It is more common for trips start on the hour (a real clockface headway), and among these three systems combined, more trips start precisely at 11:00 than at any other minute during the day.

Leave a Reply Cancel reply