Tuesday, February 28, 2012

On OneBusAway Inaccuracies

(This post is by S. Morris Rose. I'm the engineer who's been hired on a temporary basis to keep the services that power OneBusAway chugging away now that Brian Ferris, the engineer who created it, has moved on to work on transit projects at Google Zurich, though he still pitches in from time to time. The position is funded by contracts with King County Metro, Pierce Transit, and Sound Transit. I've been a technical staff member for Computer Science & Engineering at the University of Washington, where OneBusAway was created, for more than a decade.)

Many users have noticed that OneBusAway sometimes isn't very accurate: it might report that a bus is early when it's on time or late when it's early, display the status "scheduled departure" (which means that there is no real-time arrival data available for that trip), or leave out a scheduled trip entirely. In the case of Community Transit (which is not a project funder), the schedule data has simply gone missing. In this post, I'll explain a few of the factors that lead to these errors.

OneBusAway depends upon two types of data to tell you where your bus is: schedule data, which is all about where and when the agency plans for each bus to be, and real-time arrival data, which is all about where the bus is right now. Schedule data is updated only a few times a year. Real-time arrival data is updated constantly. Pull those two data types together and apply algorithms, and you've got a guess about when your bus will arrive. There can be, and are, problems with both data types and with the algorithms, and those problems lead to false predictions.
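To make that idea concrete, here is a minimal sketch in Python of how a scheduled time and a real-time report might be pulled together. It is only an illustration, not OneBusAway's actual code: the function name predict_arrival and the idea of expressing the real-time report as a schedule deviation in seconds are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def predict_arrival(
    scheduled: datetime, deviation_seconds: Optional[float]
) -> Tuple[datetime, str]:
    """Combine a scheduled arrival with a real-time schedule deviation.

    deviation_seconds is positive when the bus is running late and negative
    when it is running early. None means no real-time data arrived for the
    trip, so the schedule is all there is to show.
    """
    if deviation_seconds is None:
        # No real-time data: fall back to the published schedule.
        return scheduled, "scheduled departure"
    # Shift the scheduled time by however far off schedule the bus is.
    return scheduled + timedelta(seconds=deviation_seconds), "real-time"


if __name__ == "__main__":
    scheduled = datetime(2012, 2, 28, 8, 15)
    print(predict_arrival(scheduled, 240))   # bus reported 4 minutes late
    print(predict_arrival(scheduled, None))  # no real-time data for the trip
```

Notice that the result is only as good as both inputs: a stale schedule or a wrong deviation each shifts the answer, which is the theme of the rest of this post.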

In the case of schedule data, various things can go wrong. It can be incomplete, as is the case with the current King County Metro data; it can contain errors, as is the case with all complex datasets; or it can just be missing, as is the case, for now, with Community Transit data. Also, since the data only lands a few times a year but agencies make minor changes along the way (perhaps due to construction), it can be partially stale. And if a trip is canceled or rerouted, such as during a snow emergency, the schedule data can become desperately wrong.

Real-time (AVL, or automatic vehicle location) data is much more complex and fraught. Because the data is changing constantly, latency (the delay between when a data point is generated and when OneBusAway gets it) is a problem. Sometimes a trip goes missing due to technical issues, in which case only "scheduled departure" is shown. Some agencies don't even have real-time data (e.g. Community Transit). Complicating matters for King County Metro is the fact that they are transitioning from an older system based on a combination of radio beacons and wheel rotation counts to one based on GPS. (That process is about 60% complete, but there are still more than 500 buses to be converted. Some areas are behind others, including the northern area of Seattle, where there is a high concentration of OneBusAway users.) The task of combining the two types of real-time data has proven to be challenging.
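To see why latency matters, consider how far a bus moves while its position report is in transit. The Python sketch below works through that arithmetic and shows one possible way to flag a report as too old to trust. The function names, the 20 mph speed, the 90-second delay, and the 120-second cutoff are all made-up values for illustration, not figures from any agency's system.

```python
from datetime import datetime


def miles_moved_during_latency(speed_mph: float, latency_seconds: float) -> float:
    """Distance a bus covers between generating a report and our receiving it."""
    return speed_mph * latency_seconds / 3600.0


def is_stale(
    generated_at: datetime, received_at: datetime, max_age_seconds: float = 120.0
) -> bool:
    """True if the report is older than the (hypothetical) cutoff."""
    age_seconds = (received_at - generated_at).total_seconds()
    return age_seconds > max_age_seconds


if __name__ == "__main__":
    # A bus averaging 20 mph whose report arrives 90 seconds late
    # has already moved half a mile from the reported position.
    print(miles_moved_during_latency(20.0, 90.0))

    generated = datetime(2012, 2, 28, 8, 0, 0)
    print(is_stale(generated, datetime(2012, 2, 28, 8, 1, 30)))  # 90 s old
    print(is_stale(generated, datetime(2012, 2, 28, 8, 3, 0)))   # 180 s old
```

Half a mile is several stops on many routes, so even a modest delay can turn an "arriving now" into a bus that has already left.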

And then there are the algorithms. To predict an arrival, there is a lot to compute even after the position of a bus is known. For example, a mile of Montlake Boulevard at rush hour on Friday translates to a lot more time than that same mile two hours later. OneBusAway doesn't do its own arrival prediction; instead, we rely upon data from others, who in turn run their own or commercial software. This arrival prediction data comes from the agencies themselves for buses that use GPS, and from MyBus for buses using the older AVL system. (MyBus is a system running here at UW, from Dan Dailey and the Intelligent Transportation Systems project. A big thank-you to Dan and Joel Bradbury for continuing to keep this data up and available! OneBusAway has relied on it from the beginning, and will continue to do so while the older AVL system is still in use.)
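The Montlake example comes down to simple arithmetic: the same distance takes a different amount of time depending on how fast traffic is moving at that hour. Here is a small Python sketch of that calculation. The speeds in the table are invented for illustration (they are not measurements of Montlake Boulevard), and this is not how MyBus or the agencies' software actually computes predictions.

```python
# Hypothetical average speeds by hour of day (24-hour clock), in miles per hour.
ILLUSTRATIVE_SPEEDS_MPH = {
    17: 5.0,   # assumed rush-hour crawl
    19: 30.0,  # assumed free-flowing traffic two hours later
}


def travel_minutes(distance_miles: float, hour: int) -> float:
    """Minutes needed to cover distance_miles at the assumed speed for that hour."""
    speed_mph = ILLUSTRATIVE_SPEEDS_MPH[hour]
    return 60.0 * distance_miles / speed_mph


if __name__ == "__main__":
    print(travel_minutes(1.0, 17))  # one mile at 5 mph
    print(travel_minutes(1.0, 19))  # the same mile at 30 mph
```

With these assumed speeds, the same mile costs 12 minutes at 5 p.m. and 2 minutes at 7 p.m., so a predictor that ignores time of day would be off by 10 minutes on that stretch alone.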

Finally, when buses are on reroute due to snow (as happened last month), the arrival predictions currently range from wildly inaccurate to totally missing.

Add up all these issues, toss in a snowstorm in January and simultaneous major schedule changes in mid-February, and you get a service that sometimes tells you lies.