Mommy, what’s a Protobuf?

A question that continually pops up when someone learns that GTFS-Realtime is intended to be formatted as Protocol Buffers (or Protobuf) is, understandably, “why?”

The documentation does little to answer this question for laypeople.

“Protocol buffers are a language- and platform-neutral mechanism for serializing structured data (think XML, but smaller, faster, and simpler). The data structure is defined in a gtfs-realtime.proto file, which then is used to generate source code to easily read and write your structured data from and to a variety of data streams, using a variety of languages – e.g. Java, C++ or Python.”

What does that mean? Why not JSON (which is smaller, faster and simpler than XML)? I’ve heard a number of complaints along the lines of Protobuf being “entirely unnecessary.” But it isn’t: it serves a purpose.

In short, Protobuf is designed to be:
* efficiently generated,
* efficiently transferred,
* efficiently processed (machine readable), and
* unambiguously understood by programs written in many languages.

Breaking each of these points down:

Protobuf is efficient to generate and process because programs do not have to worry about making data human-readable. For one program to hand its in-memory data to another, that data needs to be ‘serialized’ (or ‘marshaled’) into a stream of bytes and then parsed again on the other side. That round trip generally takes longest for XML, less time for JSON, and the least for Protobuf.
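To make “efficient to process” a little more concrete, here is a minimal sketch of a reader for the Protobuf wire format, written by hand in Python. It is not how the official bindings are implemented, and it only handles three wire types; the function names `read_varint` and `decode_fields` and the sample bytes are made up for illustration. The point is that decoding is a loop of “read a tag, read a value,” with no quotes to match, no escapes to undo, and no numbers to parse out of text.

```python
import struct


def read_varint(buf, pos):
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry data
        if not byte & 0x80:               # high bit clear: last byte
            return result, pos
        shift += 7


def decode_fields(buf):
    """Decode a flat message into {field_number: raw_value} (sketch only)."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint (ints, bools, enums)
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 5:  # 32-bit value; assumed here to be a float
            (value,) = struct.unpack_from("<f", buf, pos)
            pos += 4
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[field_number] = value
    return fields


# Hypothetical message: field 1 is the string "bus", field 2 is the integer 150.
message = bytes.fromhex("0a03627573" "10" "9601")
decoded = decode_fields(message)
print(decoded)
```

Note that the decoder only recovers field numbers and raw values. Turning field 1 into something called, say, a vehicle label is the job of the code generated from the .proto file.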

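Protobuf is efficient to transfer because the .proto file specifies exactly which fields and options a message may contain. Since the producer and the consumer both hold that definition, the message itself does not have to spell out field names or structure: each field is identified by a short numeric tag and its value is sent in a compact binary form. The result is far less data on the wire. For example, a 227KB Protobuf-encoded GTFS-Realtime feed has a JSON representation of around 1.25MB.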
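To see where the savings come from, here is a rough sketch that hand-encodes one small record in the Protobuf wire format and compares it with the same record as JSON. The message layout (a string `vehicle_id` as field 1, float `latitude` and `longitude` as fields 2 and 3, an integer `timestamp` as field 4) is a hypothetical example, not the real gtfs-realtime.proto, and in practice you would use generated bindings rather than encoding by hand.

```python
import json
import struct


def encode_varint(value):
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number, wire_type):
    """A field's key: its number and wire type packed into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_sample(vehicle_id, latitude, longitude, timestamp):
    """Encode the hypothetical four-field message described above."""
    vid = vehicle_id.encode("utf-8")
    return b"".join([
        tag(1, 2), encode_varint(len(vid)), vid,  # string: length-delimited
        tag(2, 5), struct.pack("<f", latitude),   # float: 4 bytes, little-endian
        tag(3, 5), struct.pack("<f", longitude),
        tag(4, 0), encode_varint(timestamp),      # integer: varint
    ])


sample = {
    "vehicle_id": "bus-1042",
    "latitude": 42.3601,
    "longitude": -71.0589,
    "timestamp": 1700000000,
}
binary = encode_sample(**sample)
text = json.dumps(sample).encode("utf-8")
print(len(binary), "bytes as Protobuf vs", len(text), "bytes as JSON")
```

The binary form comes to 26 bytes because each field name is replaced by a one-byte tag and the numbers are not written out as decimal text. The JSON form repeats every field name in every record, which adds up quickly across a feed with thousands of entities.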

Finally, Protobuf enforces that data will have an equivalent representation regardless of who designed the producing or consuming system. JSON and XML *can* do this, but both leave room to “roll your own” and sidestep strict enforcement of the format. This often causes unexpected problems. Protobuf bindings (the code, in whatever language you are working in, that is compiled from the .proto file) instead make it easier to generate and consume data reliably and repeatably. This means no need to worry about how a library handles the differences between CDATA and NMTOKEN, or how special characters are escaped, or…
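As a small illustration of the “roll-your-own” problem, the sketch below imagines two JSON producers that both claim to publish the same vehicle timestamp. The feeds, field names, and the `read_timestamp` helper are all invented for this example. Both payloads are perfectly valid JSON, yet the consumer has to guess at, and defend against, each producer’s choices.

```python
import json
from datetime import datetime, timezone

# Two hypothetical producers describing the same moment in time.
feed_a = '{"vehicle": "1042", "timestamp": 1700000000}'
feed_b = '{"vehicle": 1042, "timestamp": "2023-11-14T22:13:20Z"}'


def read_timestamp(raw):
    """Return POSIX seconds, coping with both representations seen so far."""
    value = json.loads(raw)["timestamp"]
    if isinstance(value, int):
        return value  # producer A: seconds since the epoch
    # Producer B: an ISO 8601 string in UTC.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


print(read_timestamp(feed_a), read_timestamp(feed_b))
```

With a shared .proto file, the timestamp field has one declared type, so every consumer reads it the same way and this kind of defensive branching is not needed.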

In short, yes, there is a learning curve for Protobuf, but using it makes projects easier in the long term.
