AVL: Monitoring

Monitoring systems are invisible, yet crucial. They reveal when there is a problem and help pin down the source of a failure so it can be fixed. Ideally, every box from the introductory post should have some point at which it can be monitored, on both its input and output interfaces. In this post, I will briefly introduce the core concepts of monitoring and explain why AVL systems, whether turn-key or custom-built, should have a monitoring component, and why that component should continue to evolve after delivery.

The four core concepts of monitoring systems are metrics, checks, alerts, and alarms.

A metric is a structured set of data on a component’s performance that can be evaluated objectively. A fully developed monitoring system may collect hundreds or thousands of metrics. Metrics can be aggregated into summary statistics used for high-level evaluation of the system. For example, an AVL system may have one metric per route recording the number of buses being tracked on it. A summary statistic, the total number of buses currently being tracked, can then serve as a measure of the entire system’s health.
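As a concrete illustration, here is a minimal Python sketch of that example. The route names and vehicle counts are hypothetical and not drawn from any real AVL feed.

```python
from datetime import datetime, timezone

# One metric per route: the number of buses currently being tracked.
# (Hypothetical routes and counts, for illustration only.)
buses_tracked_per_route = {
    "route_1": 12,
    "route_7": 8,
    "route_22": 0,   # no vehicles reporting on this route right now
}

# A summary statistic aggregated from the per-route metrics,
# usable as a single high-level health indicator for the whole system.
total_buses_tracked = sum(buses_tracked_per_route.values())

print(f"{datetime.now(timezone.utc).isoformat()} "
      f"total_buses_tracked={total_buses_tracked}")
```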

A check (also called a “monitor”) is the logic that compares a metric against thresholds of normal performance. Checks also categorize metric performance into several states, such as “Clear” or “OK,” “Warning” or “Marginal,” and “Failure” or “Critical.” Checks can be simple (e.g. report OK if the system is receiving some data every N minutes), or contain a fair bit of complexity to reduce false positives (e.g. report OK if the system is receiving at least Y percent of the expected data every N minutes between the hours A and B). As checks become more complex, however, the chance of missing a real failure increases. Not all metrics require a check.
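The two example checks above might look something like the following sketch. The thresholds (N minutes, Y percent, and the A to B window) are placeholder values, and the function names are invented for illustration; they simply reuse the OK/Warning/Critical states listed above.

```python
from datetime import datetime, time, timedelta

OK, WARNING, CRITICAL = "OK", "Warning", "Critical"

def simple_check(last_data_received: datetime, now: datetime,
                 max_gap_minutes: int = 5) -> str:
    """Report OK if some data has arrived within the last N minutes."""
    gap = now - last_data_received
    if gap <= timedelta(minutes=max_gap_minutes):
        return OK
    if gap <= timedelta(minutes=2 * max_gap_minutes):
        return WARNING
    return CRITICAL

def windowed_check(received: int, expected: int, now: datetime,
                   min_fraction: float = 0.8,
                   window_start: time = time(5, 0),
                   window_end: time = time(23, 0)) -> str:
    """Report OK if at least Y percent of expected data arrived,
    evaluated only between the hours A and B (service hours)."""
    if not (window_start <= now.time() <= window_end):
        return OK  # outside service hours, missing data is expected
    if expected == 0:
        return WARNING  # cannot judge health without an expectation
    fraction = received / expected
    if fraction >= min_fraction:
        return OK
    if fraction >= min_fraction / 2:
        return WARNING
    return CRITICAL
```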

Alarms are events created by checks when a metric moves from the normal state into an alert state. Alarms should be set only on metrics that correlate strongly with actual system performance; if alarms are too noisy, they will be ignored.

An alert is the notification of an alarm. Stand-alone alerting systems (e.g. PagerDuty or Pingdom) exist to propagate messages to responsible parties by various means: email, phone, SMS, etc.
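The sketch below ties these last two concepts together: a check transition out of the OK state produces an alarm, and the alarm is then turned into an alert. The Alarm class, evaluate(), and send_alert() are invented names for illustration; a real deployment would hand notification off to a service such as PagerDuty rather than printing to the console.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Alarm:
    metric: str
    previous_state: str
    new_state: str
    raised_at: datetime

def evaluate(metric: str, previous_state: str, new_state: str) -> Optional[Alarm]:
    """Create an alarm only when a metric leaves the normal (OK) state."""
    if previous_state == "OK" and new_state != "OK":
        return Alarm(metric, previous_state, new_state,
                     datetime.now(timezone.utc))
    return None

def send_alert(alarm: Alarm) -> None:
    """An alert is the notification of an alarm; here it is just printed."""
    print(f"[{alarm.raised_at.isoformat()}] {alarm.metric}: "
          f"{alarm.previous_state} -> {alarm.new_state}")

alarm = evaluate("total_buses_tracked", "OK", "Critical")
if alarm:
    send_alert(alarm)
```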

In the introduction to this post, I mentioned that ideal monitoring implementations have many metrics on all parts of the system. This is not always possible. Effective real-world monitoring systems are constructed with feedback from system failures. For every failure that is not detected by way of an alarm, the root cause should be analyzed and relevant alarms and metrics should be put in place or updated. If you’re working with a vendor, they should have a plan for this.

For more information on monitoring, I highly recommend Effective Monitoring and Alerting by Slawek Ligus (AMZN). A big-picture systems view on the pitfalls of monitoring can be found in Charles Perrow’s classic Normal Accidents: Living with High-Risk Technologies.