Real systems are difficult to get right because they must correctly handle a practically infinite number of rare events. For example, file systems must correctly recover from all possible crashes; distributed systems must ensure consistency and liveness despite of a large variety of rare events, such as machine crashes, network partition, message delays, and message loss. The complexity to handle these rare events often leads to corner-case errors that are difficult to test, and once detected in the field, impossible to reproduce.
Our goal is to effectively detect these bugs. The main technique we use is model checking. This formal verification technique systematically enumerates the possible execution paths of a distributed system by starting from an initial state and repeatedly performing all possible actions to this state and its successors. This state-space exploration makes rare actions such as machine crashes and network failures appear as often as common ones, thereby quickly driving a system into corner cases where subtle bugs surface.
However,naive application of model checking to large systems is prohibitive because of the huge cost required to write an abstract specification of the checked system. Recent work has developed implementation-level model checkers that check code directly, but these checkers still require an invasive, heavyweight port of the checked system to run inside these model checkers.
Instead of the heavyweight way of creating a simulated environment to run a system, we created an in-situ checking architecture that interlaces the mechanism for comprehensive checking within the checked system. This architecture enables users to check live systems, drastically reducing the overhead and the invasive infrastructure needed. For example, in eXplode (our storage system checker), the checking infrastructure is reduced down to a single device driver, dynamically loadable to a running OS kernel. These drivers are fairly small and easy to write; both the Linux and FreeBSD drivers of eXplode are less than two thousand lines of code. Once such a driver is loaded, eXplode can readily check any storage system that runs inside or above the OS kernel. Compared to the cost of weeks or months to check a system using the old approaches, eXplode only requires minutes, several orders of magnitude reduction.
Using this approach, we have found numerous serious errors in 20 widely used, well-tested systems. For example, we found data-loss errors that can vaporize all user data in 17 storage systems, including three version control software, a database, Linux NFS, ten local file systems, a software RAID, and a popular commercial virtual machine.
-
eXplode:
a Lightweight, General Approach to Finding Serious Storage System
Errors [PDF |
PS ]
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI '06)
Seattle, Washington, November, 2006.
Describes our main model checking approach. It made it so easy to thoroughly check systems that we checked total 17 storage systems and found data-loss errors in every system checked. This paper is my favorite on describing our approach. If you just want to read one paper on this approach, this is the paper you should read. -
eXplode:
a Lightweight, General System for Finding Serious Storage System
Errors [PDF |
PS ]
Workshop on the Evaluation of Software Detection Tools (BUGS '05), co-located with PLDI '05
Chicago, Illinois, June, 2005.
This paper describes an initial version of the eXplode system. -
MODIST: Transparent Model Checking of Unmodified
Distributed Systems [PDF |
PS]
Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI '09),
Boston, Massachusetts, 2009.
Describes how we checked three mature distributed systems, one of which is a Paxos implementation that has managed production data centers with more than 100K machines for over two years. For every system checked, we found flaws in its core distributed protocol, for example, the bugs that cause all nodes in an entire distributed system to make no progress. -
Using Model Checking to Find Serious File
System Errors [PDF |
PS]
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI '04)
San Francisco, California, December, 2004.
Best Paper
Describes how we model-checked the three most popular Linux file systems and found many serious data-loss errors. Compared to the eXplode approach, the approach described in this paper was more complex and we came up with many tricks to optimize it --- this is perhaps why they gave us best paper :) -
Using Model Checking to Find Serious File
System Errors [PDF |
PS]
ACM Transactions on Computer Systems (TOCS), 24(4):393-423, 2006.
Our OSDI 04 paper on FiSC got forwarded to TOCS.
[ source code | virtual machine | version history]
The above are three ways to get eXplode. You can download the eXplode source code locally, or download a virtual machine image with eXplode and eXplode-patched kernels compiled, or get the source code from sourceforge. You can browse the version history of eXplode at sourceforge.
For instructions to compile, build, and use eXplode, Please refer to the README file in the top-level directory after you uncompress the explode tar ball.
The eXplode distribution contains a generic model checker that can be applied to other systems. For details, please refer to the README file under directory mcl.