Reliable Computer Systems Laboratory

Effective Model Checking of Real Systems

Real systems are difficult to get right because they must correctly handle a practically infinite number of rare events. For example, file systems must correctly recover from all possible crashes; distributed systems must ensure consistency and liveness despite of a large variety of rare events, such as machine crashes, network partition, message delays, and message loss. The complexity to handle these rare events often leads to corner-case errors that are difficult to test, and once detected in the field, impossible to reproduce.

Our goal is to effectively detect these bugs. The main technique we use is model checking. This formal verification technique systematically enumerates the possible execution paths of a distributed system by starting from an initial state and repeatedly performing all possible actions to this state and its successors. This state-space exploration makes rare actions such as machine crashes and network failures appear as often as common ones, thereby quickly driving a system into corner cases where subtle bugs surface.

However,naive application of model checking to large systems is prohibitive because of the huge cost required to write an abstract specification of the checked system. Recent work has developed implementation-level model checkers that check code directly, but these checkers still require an invasive, heavyweight port of the checked system to run inside these model checkers.

Instead of the heavyweight way of creating a simulated environment to run a system, we created an in-situ checking architecture that interlaces the mechanism for comprehensive checking within the checked system. This architecture enables users to check live systems, drastically reducing the overhead and the invasive infrastructure needed. For example, in eXplode (our storage system checker), the checking infrastructure is reduced down to a single device driver, dynamically loadable to a running OS kernel. These drivers are fairly small and easy to write; both the Linux and FreeBSD drivers of eXplode are less than two thousand lines of code. Once such a driver is loaded, eXplode can readily check any storage system that runs inside or above the OS kernel. Compared to the cost of weeks or months to check a system using the old approaches, eXplode only requires minutes, several orders of magnitude reduction.

Using this approach, we have found numerous serious errors in 20 widely used, well-tested systems. For example, we found data-loss errors that can vaporize all user data in 17 storage systems, including three version control software, a database, Linux NFS, ten local file systems, a software RAID, and a popular commercial virtual machine.

Publications
Download eXplode

[ source code | virtual machine | version history]

The above are three ways to get eXplode. You can download the eXplode source code locally, or download a virtual machine image with eXplode and eXplode-patched kernels compiled, or get the source code from sourceforge. You can browse the version history of eXplode at sourceforge.

For instructions to compile, build, and use eXplode, Please refer to the README file in the top-level directory after you uncompress the explode tar ball.

The eXplode distribution contains a generic model checker that can be applied to other systems. For details, please refer to the README file under directory mcl.