On Good Data

Below the fold are edited notes from a talk I gave at an industry conference last year. They're germane to a topic currently on the transit-developers list about best practices for GTFS. This post is an edit of an earlier ranting paper and the conference presentation itself.

The audience at this conference is diverse. As we all deploy ITS projects, the level of resources, and to some degree the vendors in play, will depend on the size of the organization. I'm going to speak in the abstract, because many of us face the same issues with open data and transit.

I want to pose the question: what is good, or highly effective data? I describe good data according to four “habits.”

There is a well-meaning misconception that commodity ITS projects (AVL, customer information, automated announcements, etc.) are easy. This misconception translates into the following formula:

[latex display="true"]\text{Funding} + \text{Product}_{\text{off-the-shelf}} = \text{Project}[/latex]

But failures still happen:

Describing the signs as ‘fiction,’ … councilor Davies asked, ‘In this technological age why on earth can we not get the real time information service to work?’

We dread having to speak these words.

How then, in an era where many technologies are off-the-shelf and systems engineering is prevalent, do these problems keep happening?

The answer is that data is the greatest unmitigated risk to projects. Reducing this risk requires moving past check boxes, pre-planning and clear thinking about data before implementation.

If you take away one thing from this post let it be this:

[Image: data poster]

Even if data exists, it may not serve the latest technologies. As technology evolves, requirements for data evolve as well.

For example, data from first-generation scheduling software did not work well with trip planners, and even an organization with modern scheduling software may find the data is not of sufficient quality for automated announcements or AVL.

We lack a common language to describe data and its possible problems. The four "habits" are a framework that has evolved in my mind to describe both the requirements a new system places on data and the problems that arise in systems implementation.

1. Precision

Precision is the level of refinement available in the dataset to store a piece of information reliably and repeatedly.

In other words: how many digits after the decimal?
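As a back-of-the-envelope illustration (mine, not from the talk), here is a minimal Python sketch of what those digits buy you for stop coordinates, assuming the common approximation of roughly 111,320 metres per degree of latitude:

```python
# Approximate ground distance represented by each decimal place of latitude.
# Uses ~111,320 m per degree of latitude; longitude shrinks with cos(latitude),
# so treat these as best-case figures.
METERS_PER_DEGREE_LAT = 111_320

for places in range(1, 8):
    resolution_m = METERS_PER_DEGREE_LAT * 10 ** -places
    print(f"{places} decimal place(s): ~{resolution_m:,.2f} m")
```

Two decimal places put a stop somewhere within a kilometre; five or more put it at the curb.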

2. Accuracy

Accuracy is the degree of closeness that a record or observation matches the real-world value.

Two questions need to be addressed in discussions of accuracy.

How quickly can a change on the ground make it into a dataset and the systems that use it?

How reliable is the equipment? For example, knowing the mean time between failures (MTBF) and circular error probable (CEP) of a positioning device.

A dataset can be accurate but not precise, precise but not accurate, neither, or both. Some organizations conduct expensive, highly detailed stop surveys, sending individuals to every stop to record its location and other data. The results are both precise and accurate at the time of their creation, but they lose accuracy over time. It is not uncommon for the very first survey to be considered only as part of a capital project, after which the organization believes the work is done. That is often a folly: due to changes on the ground, the accuracy of a stop dataset degrades over time if it is not updated. Streets are paved and closed, businesses open and close, and in some areas entire neighborhoods appear overnight.
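To make the distinction concrete, here is a small sketch with hypothetical coordinates: a record precise to seven decimal places, yet tens of metres from where the stop actually stands today.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical stop: the record carries seven decimal places (very precise),
# but the stop pole was moved across the intersection after the survey.
recorded = (47.6097614, -122.3331066)   # what the dataset says
observed = (47.6094100, -122.3338900)   # where the stop actually is today

error_m = haversine_m(*recorded, *observed)
print(f"Recorded vs. actual location: {error_m:.0f} m apart")
```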

3. Consistency

Precision and accuracy are the main concerns when a project uses, or relies heavily upon, a single dataset. The elephant in the room in projects that draw upon data from multiple systems is the consistency of that data. Consistency, in this case, refers to the ability to reliably join together the data from one or many systems. There is also another crucial element: the degree to which all systems see the same data at the same time.

Why consistency matters is better explained by this picture (a decomposition of an organically grown real-world system), which I find worth more than a thousand words.

[Image: KCM's data flow (source)]

All you need to take from that diagram is that if one system is inconsistent, you've got a problem.
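One cheap way to catch inconsistency before it bites is to diff the identifiers that two systems believe they share. Here is a minimal sketch, assuming an AVL export alongside a GTFS feed; the AVL file name and its stop_id column are assumptions, while stops.txt and stop_id are standard GTFS.

```python
import csv

def stop_ids(path, column="stop_id"):
    """Collect the set of stop identifiers found in a CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column].strip() for row in csv.DictReader(f)}

gtfs_stops = stop_ids("gtfs/stops.txt")   # stops the scheduling system knows about
avl_stops = stop_ids("avl_export.csv")    # stops the AVL/CAD system reports against

orphans = avl_stops - gtfs_stops   # referenced by AVL but never defined in GTFS
unused = gtfs_stops - avl_stops    # defined in GTFS but never seen by AVL

print(f"{len(orphans)} stop_ids in AVL but not in GTFS: {sorted(orphans)[:10]}")
print(f"{len(unused)} stop_ids in GTFS never reported by AVL")
```

If either set is large, the two systems are not talking about the same network.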

4. Transparency

Hopefully, when working with data from an existing system, there is documentation on what that data means. Often, however, when the data has been used for only one purpose or within one organization or department, documentation is not complete. Transparency refers to the degree to which the data conforms to the documentation provided.
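A quick way to measure transparency is to test the data against what the documentation claims. As a small sketch, assuming a GTFS feed at a hypothetical path, this checks routes.txt against the base route_type codes (0 through 7) in the GTFS reference; agencies using extended route types would need a larger allowed set.

```python
import csv
from collections import Counter

# Base route_type codes from the GTFS reference (extended codes are out of scope here).
ALLOWED_ROUTE_TYPES = {"0", "1", "2", "3", "4", "5", "6", "7"}

undocumented = Counter()
with open("gtfs/routes.txt", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        value = (row.get("route_type") or "").strip()
        if value not in ALLOWED_ROUTE_TYPES:
            undocumented[value] += 1

for value, count in undocumented.most_common():
    print(f"route_type={value!r} appears {count} times but is not in the documented set")
```

Every value the check flags is either a documentation gap or a data error; either way, it is a transparency problem.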

Conclusion: Why does this all matter?

As new uses for existing data emerge, the need for additional precision, accuracy, consistency, and transparency may increase with them. While data may exist, it is important to consider the data's PACT given its previous uses. Most large transit organizations have had a rough progression in systems along the following lines.

• Scheduling: Early scheduling systems focused primarily on driver schedules over geographic schedules, using a simple schematic network between them. Geography and intermediate stops were not taken into account.

• Detailed Scheduling: Modern scheduling systems include geographic information for paths and stops and can generate passing times for each stop, which allows for detailed GTFS creation (see the stop_times sanity-check sketch after this list).

• Computer Aided Dispatching: In order to dispatch effectively, locations of terminals and timepoints must be accurate. If a timepoint is on the wrong side of the intersection or block, the CAD system will produce incorrect schedule adherence in that vicinity.

• Static Trip Planning: Passengers need to know where to board and alight the bus.

• Real-Time Passenger Information: The location of every stop must be accurate within reason, and every revenue trip must be accounted for.

• Reporting and Analysis: Reporting on service delivery requires the behind-the-scenes data (e.g. runs and blocks) to be correct, above and beyond the passenger information use case.

• On Board Announcements: Announcements made too late because of incorrect stop locations or for stops that do not exist render the system useless.

• Real-Time Trip Planning: Planning trips in real time, sometimes hours in advance, requires the greatest level of PACT from all of the above: systems must concur for real-time information in the future to be valid.

• The “Connected Vehicle:” While this is still being defined, it will surely require more PACT from existing datasets. The progression of technology leads to greater needs from data, and often technology can grow before data is updated. Understanding this is a key to delivering both good data and good projects.
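As promised above, here is a minimal sketch of the kind of sanity check the Detailed Scheduling step implies: within each trip, generated passing times should never run backwards. The file path is an assumption; trip_id, stop_sequence, arrival_time, and departure_time are standard GTFS stop_times.txt fields.

```python
import csv
from collections import defaultdict

def seconds(hms):
    """Parse GTFS HH:MM:SS by hand, since times may exceed 24:00:00."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

trips = defaultdict(list)
with open("gtfs/stop_times.txt", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if not row["arrival_time"] or not row["departure_time"]:
            continue  # untimed intermediate stops are allowed in GTFS
        trips[row["trip_id"]].append(
            (int(row["stop_sequence"]), seconds(row["arrival_time"]), seconds(row["departure_time"]))
        )

for trip_id, rows in trips.items():
    rows.sort()
    last_departure = 0
    for seq, arrival, departure in rows:
        if arrival < last_departure or departure < arrival:
            print(f"trip {trip_id}: time runs backwards at stop_sequence {seq}")
        last_departure = departure
```

A feed that fails a check like this will misbehave in every downstream step, from CAD schedule adherence to real-time predictions.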
