Hunting for Heisenbugs


Yale researchers have created an experimental operating system (OS) that can eliminate a class of dangerously elusive computer bugs. Determinator, as the OS is called, works by changing the way computers implement parallel processes. The work has helped to energize a nascent field, one that has grown remarkably in the last year. “There’s been a sudden explosion of interest,” comments Bryan Ford, Assistant Professor in the Computer Science Department and leader of the team that designed the OS, the first of its kind. Ultimately, their research can be used to make everything from media software to industrial networks safer and more robust.

The Bug Problem
Midnight. Friday. The computer clusters of the Arthur K. Watson building are packed with students cranking out their latest assignments. The atmosphere is tense, pregnant with frustration and fatigue. Dozens of eyes quietly focus on the lines of code in front of them, immersed in the dreaded act of debugging. Over the next few hours, the students will run their programs through a series of carefully designed tests, picking out the subtle missteps in their logic and the occasional typo, before submitting the final product for evaluation. For some, debugging can take eight hours or more. The programming itself probably took no more than two.

Almost every programmer can recall a night ruined by the fell combination of a particularly aggravating bug and a tight deadline. “It’s actually not that uncommon for debugging a really subtle bug to take days – there’s no upper limit on the time it may take,” says Ford. Bugs are a pervasive problem in programming, spawning entire research areas and application suites dedicated to their eradication. At best, bugs are nuisances, wasting valuable hours of a programmer’s time. At worst, undetected bugs can cause catastrophic system failures. Their existence is a liability in an increasingly wired world, one where stock markets, shuttle launches, and power grids all depend on robust computers and networks.

It may come as a surprise, then, that one of the biggest trends in computer science, parallel computing, can expose computer systems to a class of bugs that are alarmingly difficult to find, even for experts in the field.

Problems of Parallel Computing
Parallel computing, also known as parallelism, seems simple at first glance: improve performance by splitting a task into multiple processes that run simultaneously. The potential of parallelism to multiply computing power is especially attractive to hardware developers, who worry about how much smaller integrated circuits can be manufactured. Parallelism is essentially the only way that computers will keep getting faster, as core speeds have barely doubled over the past decade. With the rapid adoption of multi-core processors like the Intel Core series, parallelism now pervades modern computing.

A quad-core computer processor. Photo courtesy of Wikimedia Commons.

But despite its intuitive appeal, parallelism carries one inherent flaw: nondeterminism. Applications are typically made parallel through the use of threads, independent sequences of instructions that run concurrently. Though these threads usually run together in harmony, occasionally they clash, for example, by writing to the same variable at the same time. This can lead to unexpected results – the dreaded bug. Such “concurrency bugs” depend critically on the timing of the threads involved, which varies with each execution. By their very nature, parallel computations, at least in current implementations, are not deterministic – even if you know the entire configuration of inputs for a program, you cannot always predict the output. Sometimes one thread is a little faster than the other; sometimes the reverse is true.
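The clash described above is the classic “lost update.” Real threads interleave unpredictably, which is exactly what makes the bug elusive; the minimal Python sketch below instead stages the unlucky interleaving by hand, so the hazard is visible every time.

```python
# A simulated "lost update": two threads both read a shared counter
# before either writes back, so one increment is silently lost.
# With real threads this interleaving happens only occasionally;
# here it is staged explicitly to make the hazard visible.

counter = 0

# Step 1: both threads read the current value (each sees 0).
read_by_thread_a = counter
read_by_thread_b = counter

# Step 2: each thread writes back its own read value plus one.
counter = read_by_thread_a + 1   # thread A writes 1
counter = read_by_thread_b + 1   # thread B overwrites with 1, not 2

print(counter)  # 1 -- one of the two increments was lost
```

Had thread B read the counter *after* thread A wrote it, the result would have been 2; the outcome depends entirely on timing, which is the nondeterminism at issue.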

The nondeterminism of parallel systems makes concurrency bugs exceptionally challenging to debug: how are you supposed to find a bug in the first place when it doesn’t even occur consistently? Concurrency bugs are aptly known as “Heisenbugs,” after Werner Heisenberg, whose uncertainty principle holds that observing a quantum system inevitably disturbs it – like quantum particles, these bugs tend to vanish the moment you try to observe them. “They are the most subtle, the most complex bugs we know of,” says Amittai Aviram, a graduate student and co-author of the study.

The Determinator
The issue of nondeterminism in parallel systems had piqued Ford’s interest for some time, but he felt that most approaches in the field were either too idealistic, creating entirely new programming paradigms, or too narrow, focusing only on debugging techniques. With a background in OS design, Ford felt that an OS-based approach offered a pragmatic middle path, one that could eliminate the nondeterminism at the crux of the matter while still producing a product accessible to non-specialists. With a few slight adjustments, a modified OS would have the unique ability to provide a deterministic environment compatible with any pre-existing language. Moreover, by working at the operating system level, one could directly block applications from accessing time-sensitive parts of the computer’s machinery, enforcing determinism in a way that front-end applications cannot.

After a long period of conceptual development, Ford eventually wrote what would come to be Determinator, the first-ever OS created from the ground up for the purpose of enforcing determinism in parallel processing. He began to use Determinator in the Operating System Design class that he taught at Yale and polished the system further with the help of Aviram and Sen Hu, two graduate students in the Computer Science Department, and Shu-Chun Weng, a student from his OS class whose final project eventually became a crucial component of Determinator’s final build.

Determinator works by blocking applications from accessing hardware resources that introduce nondeterministic behavior. For example, instead of reading and writing to a shared memory space, each parallel process accesses a separate virtual memory space. At predetermined synchronization points, each component waits for the others to finish; they then compare their data, updating the conventionally shared memory space if no contradictions are present. When nondeterministic resources like real-time clocks must be accessed, the operating system records the pertinent data and passes it to the application rather than allowing direct access. Thus, if time-sensitive bugs do occur, they can be traced to their point of origin, and the OS can control the inputs so that a particular output is guaranteed. The system has been made deterministic – the walls of the Heisenbugs’ black box have collapsed.
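Determinator’s actual mechanism relies on virtual memory hardware; the Python sketch below merely imitates the idea with dictionary copies, and all function names are illustrative, not part of the real system. Each worker mutates a private copy of the shared state, and changes are merged only at an explicit synchronization point, where conflicting writes become a detected error rather than a silent race.

```python
# Simplified sketch of private-workspace parallelism: workers never
# touch shared state directly, so timing cannot affect the result.
import copy

def run_deterministic(shared, workers):
    # Each worker mutates its own private copy of the shared state.
    copies = []
    for work in workers:
        private = copy.deepcopy(shared)
        work(private)
        copies.append(private)

    # Synchronization point: merge changes back, detecting conflicts.
    merged = dict(shared)
    for private in copies:
        for key, value in private.items():
            if value == shared.get(key):
                continue  # this worker did not change the location
            current = merged.get(key, shared.get(key))
            if current != shared.get(key) and current != value:
                raise RuntimeError(f"write conflict on {key!r}")
            merged[key] = value
    return merged

# Workers touching disjoint locations merge cleanly, in any order.
result = run_deterministic({"x": 0, "y": 0},
                           [lambda m: m.update(x=1),
                            lambda m: m.update(y=2)])
print(result)  # {'x': 1, 'y': 2}
```

Because the merge happens only at the synchronization point, the outcome is the same no matter which worker finishes first; and if two workers had written different values to the same location, the merge would raise an explicit conflict instead of letting one update silently win.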

(a) Conventionally, threads read and write to a shared memory space. (b) This exposes computer systems to timing-sensitive bugs. (c) Determinator avoids this problem by copying the original data to separate virtual memory space for each thread. Image courtesy of Bryan Ford.

The Future of Parallel Computing
Besides eliminating Heisenbugs, a pervasively deterministic OS can also support security applications that are hard or impossible to adapt to conventional parallel machines. These applications are effective at identifying malicious code, but they rely extensively on logging and replaying events: by feeding a system the same initial state at multiple points and checking whether the output remains constant, attacks on a system’s code can be caught systematically. Such tools, however, require a deterministic environment; they are unreliable if a program’s output can change even with the same input, as it can in a nondeterministic OS where timing is critical. Determinator-like OSes, which block applications from accessing nondeterministic resources, provide an environment far better suited to these security mechanisms.
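The log-and-replay idea can be pictured with the short sketch below; the function names are hypothetical stand-ins, not any real security tool’s API. On a deterministic system, re-running a computation on its logged inputs must reproduce the recorded output exactly, so any divergence signals tampering or corruption.

```python
# Sketch of replay-based checking (all names illustrative): record a
# run's inputs and output, then replay and compare.

def compute(inputs):
    # Stand-in for an arbitrary deterministic computation.
    return sum(inputs) * 7

def record_run(inputs):
    # The system logs every input the program consumed, plus its output.
    return {"inputs": list(inputs), "output": compute(inputs)}

def replay_matches(log):
    # Replay from the log and compare against the recorded output.
    return compute(log["inputs"]) == log["output"]

log = record_run([3, 1, 4])
print(replay_matches(log))   # True -- the replay reproduces the output

log["output"] = 999          # a tampered or corrupted record...
print(replay_matches(log))   # False -- ...is caught by the mismatch
```

On a nondeterministic system this check breaks down: a legitimate run could produce a different output from the log purely because of thread timing, making mismatches meaningless.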

Like any proof-of-concept project, Determinator is far from complete. Concurrency bugs are eliminated, but at the cost of performance overhead. As problems become more “fine-grained” – that is, as parallel threads interact more frequently with one another – more processing is required to keep everything deterministic. Ultimately, any future solution to deterministic fine-grained parallelism will probably incorporate advances in both hardware and software.

In the future, OSes like Determinator may be commercialized, or existing OSes could draw Determinator-like elements into their designs, providing multicore-processing systems that implement efficient, secure, and deterministic parallelism. The impact would reach areas as diverse as video and music management, online searching, scientific modeling, and industrial networks. The possibilities are exciting. “I think we’ve made an important contribution to solving an important problem,” says Aviram. “A problem that’s central to the future of computing.”

About the Author
SOONWOOK “WOOKIE” HONG is a sophomore in Pierson College.

The author would like to thank Bryan Ford, Amittai Aviram, and Shu-Chun Weng for taking the time to share their exciting research with him.

Further Reading
B. Barney. Introduction to Parallel Computing. Lawrence Livermore National Laboratory.
E. Lee. The problem with threads. Computer, 39(5):33–42, May 2006.
S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes — a comprehensive study on real world concurrency bug characteristics. In 13th ASPLOS, pages 329–339, Mar. 2008.