Jonathan S. Shapiro
University of Pennsylvania
ABSTRACT
|
Capabilities define a uniform semantics for system service invocation, enforce separation of concerns and encapsulation, and allow each program to be restricted to exactly that set of authorities it requires (the principle of least privilege). As applications and systems composed of pieces from many sources become widespread, the ability to provide such protections becomes an essential requirement for basic system integrity. If carefully architected, a capability system should be both faster and simpler than a comparable access-control-based system. In practice, implementations have fallen short of this performance potential; software systems from poor choices of abstraction, and software/hardware systems from a software-driven design methodology which led to excess complexity. Using the KeyKOS capability architecture as a starting point, we have constructed an experimentally-guided architecture, design, implementation, and validation of EROS (the Extremely Reliable Operating System), a persistent capability system which provides complete accountability for persistent, consumable and multiplexed resources. Where KeyKOS showed that a pure capability system was practical, EROS demonstrates that by careful selection of data structures and primitive atomic objects one can build a capability system that performs as well as or better than conventional systems on many primitive operating system functions, and gets close to the maximum performance realizable from the machine on some tasks. This refutes the commonly held belief that capability systems are inherently slow. |
Conventional access control mechanisms have used (object, user, authority) triples to describe authorized operations. Such an access control list (ACL) [Sal75] has two significant disadvantages:
An ACL is a statically expressed constraint describing a dynamic problem.
Access control on the basis of user introduces many opportunities for unintended sharing of authorities, especially in composable systems.
Both of these disadvantages are resolved by capabilities.
A capability is an unforgeable (object, type, authority) triple [Den66]. An operating system in which capabilities provide the system's mechanism for security and access control is a capability system. To access any object in such a system, a program must hold and invoke a capability for that object. Possession of a capability is a necessary and sufficient proof of authority to perform the operations authorized by the capability on the object that it designates.
The capability model is attractive for several reasons. It is possible to construct a formal semantics for such a system, and to prove useful properties from that semantics.
Footnote:
A paper describing this construction is currently in progress.
Simultaneously, capabilities provide a natural framework in which to expose the underlying machine representation in a secure and consistent fashion, allowing programmers to better understand application performance and providing fine-grain accountability for resources with low overhead. Finally, programs in a capability-based system can be confined [Lam73]. A confined program cannot leak information to an unauthorized party; it lives in a ``black box.'' Such an environment is a safe container within which to run an untrusted program.
We believe that confinement has become an essential security mechanism for modern operating systems. Today's users are connecting frequently using ever-faster connections and potentially visiting hundreds of servers in a single session. This access is routinely performed through applications that provide sophisticated programmability, run applications authored by untrusted (and sometimes adversarial) parties, embody exploitable runtime environments, and sit atop operating environments that are insecure. The operating environments in turn involve components assembled from hundreds of anonymous suppliers whose robustness and goodwill goes largely unexamined by the user. Many of these components operate under unnecessarily privileged conditions (either supervisor mode or cross-process state sharing), and frequently cause systemwide failures.
To date, systems that solve these problems -- capability systems -- have provided inadequate performance. This has rendered the tradeoff between security and performance an ``either-or'' proposition, and mainstream architecture has been driven primarily by performance. Previous work on KeyKOS [Bom92, Har85] has shown that constructing a pure capability system is feasible. Our research contention is that recent insights in microkernel performance improvement have made it possible to architect, design, and implement an exceedingly high-performance capability architecture. EROS, the Extremely Reliable Operating System, shows that by careful selection of data structures and primitive atomic objects one can build a capability operating system that performs as well as or better than conventional systems on many primitive operating system functions, and gets close to the maximum performance realizable from the machine on some tasks.
Our essential research question is whether the performance failings of capability systems are artifacts of particular design choices and implementation or are somehow intrinsic to capability architectures. Our hypothesis is that the poor showing of these systems is not intrinsic. In part, this belief is based on recent microkernel results, and in part on the observation that the basic idea behind RISC design -- structure a machine to do common things fast with minimal control complexity and uncommon things by layering -- is equally applicable to operating systems, and has not previously been applied to capability designs either at the conceptual abstraction level or at the level of implementation.
We propose to focus on several issues in capability architecture and implementation:
Is there a choice of operating system abstractions that particularly lends itself to efficient design and implementation? What issues are particularly important in reducing this architecture to practice?
What performance results can be achieved under this new and improved design and implementation, and how are they to be measured in a new kind of system?
Can the architectural and design results from microkernels be adapted successfully to a system where, for reasons of security architecture, it is desirable to implement persistence and therefore several drivers within the supervisor? In particular, can such a ``structured kernel'' quantitatively match the performance of a microkernel on interprocess communication and control transfer, which are the principle metrics of interest for decomposed systems, and critical enablers for confined execution?
To what degree is the resulting performance limited by the underlying hardware in modern architectures? Alternatively phrased: what are the performance-limiting factors in switching from one privilege domain to another?
What, if any, weaknesses exist in our proposed architecture, and how can they be minimized?
Particular bottlenecks in existing microkernel systems have been the interprocess communication, context switch, and page fault handling mechanisms. Because capability systems have similarly decomposed structure, these bottlenecks apply to capability architectures as well. If satisfactory performance on these operations can be achieved, we believe that binary emulators running on a capability substrate can meet or exceed the performance of current operating systems, and that new system architectures are enabled by this new substrate.
Footnote:
Demonstrating this is a subject of future work.
One consequence is that tradeoffs of security and performance need no longer be made on an ``either-or'' basis. Equally important, we will have demonstrated that capability systems are a promising area for future research in operating system architecture.
A previous system, KeyKOS, has shown that constructing a pure capability system is feasible [Bom92, Har85], and provides a relatively clean architecture from which to proceed. Using this system as a starting point, we propose to architect, design, implement, and validate a new capability architecture in an experimentally guided fashion. EROS will blend previous ideas in capability systems with more recent efforts in microkernel design and high-performance design. To be successful, this architecture must yield exceptional performance in the bottleneck areas identified above.
In an ideal evaluation, both the new operating system and a suitable environment emulator would be constructed and evaluated. Building and evaluating a binary compatible emulation environment is the subject of future work. To obtain an initial evaluation of the proposed system, we will use both qualitative examination of complexity and quantifiable measurement of key operations using microbenchmarks:
Comparison of the architected abstractions with those of UNIX, identifying key differences and simplifications in the architecture permitting a lighter footprint and/or higher performance [Qualitative].
Examining the interprocess communication mechanism, both in terms of latency and cache overhead and in terms of its structural impact on the system design and implementation. The quantitative measurements examine both the domain crossing and context switching overheads, while the cache traffic measurements support our contention that reducing abstractions has a concrete benefit for performance [Quantitative].
Examination of page fault latencies, and the contribution of each system component to these latencies. Page allocation, particularly in the heap, is critical to the overall performance of many applications, and a well-established bottleneck in decomposed systems [Quantitative].
Analysis of the domain-crossing path, identifying the major sources of delay contributed by each hardware functional unit [Quantitative].
Collectively, this evaluation should demonstrate that the EROS architecture compares favorably with conventional systems, and provides high performance support for protected and confined subsystems. Our results specifically rebut the contentions of Chen et. al. [Che93], and provide the first quantitative support for the views of Karger [Kar88].
To confirm these results, we will implement a subset of the EROS system for a second architecture, and show that both the basic structure of the system and the flow of control of the critical paths is not architecture specific.
The EROS architecture is unique in its reduction in number and complexity of abstractions, in its reduction to practice, and in a new and simplified object storage system. The architecture and implementation result in performance on critical operations (IPC, context switching) not delivered by any operating system on comparable hardware.
EROS demonstrates that by careful selection of protection granularity, data structures and primitive atomic objects one can build a capability system that performs as well or better than conventional systems on many primitive operating system functions, and gets close to the maximum performance realizable from the machine on some tasks.
Finally, EROS resolves a long-standing debate in the operating system community over whether full-featured kernels can deliver microkernel performance if suitably structured.
Early capability systems bear a historical relevance to EROS, but have little direct relevance on the issues raised here. An excellent overview of early capability efforts may be found in [Lev84]. Lampson's postmortem on CAL/TSS motivated some of the KeyKOS design [Lam76]. HYDRA/C.mmp is documented in [Wul81]. Colwell's postmortem [Col88] on the Intel i432 [Int83] is also instructive, particularly with respect to the hazards of complexity.
Little recent effort has been undertaken in the area of capability systems, but ongoing work in several other areas is immediately relevant to EROS.
Liedtke (L4) [Lie93a, Lie93b, Lie95] has architected and implemented or co-implemented the L4 kernel on several architectures. L4 is a tiny system, relying on the presence of user-mode applications to provide conventional operating system services and drivers. L4's major contributions are its small size and Liedtke's obsession with high-performance IPC. Mungi [Hei97] is a capability system emulator built on top of the L4 microkernel. Mungi is based on password capabilities, and implements persistence on a per-process basis. One consequence is that Mungi can neither rescind capabilities nor provide demonstrations of confinement.
In 1993, Liedtke published some remarkable IPC performance results for Pentium-family architectures in connection with the redesign of L3 [Lie93a]. In this paper, Liedtke put forth several architectural principles that he claimed were necessary precursors to fast IPC implementation, one of which was that the microkernel must be small. Liedtke's work continued until 1996, and culminated with a position paper entitled ``Microkernels Must and Can be Small'' [Lie96].
Where Liedtke has concentrated on hand-optimizing a single supervisor facility (IPC) and excluding everything possible from the supervisor, EROS implements a full-featured supervisor. Our work has focused on structuring the supervisor in such a way as to achieve better performance than Liedtke reports while retaining the security kernel and those functions needed to support low-latency device response within the supervisor. EROS is a compromise that demonstrates the feasibility of applying microkernel architecture and performance tuning techniques to structured kernels. In addition, the EROS architecture makes a policy decision to implement all pieces of the primitive security architecture within a single system, while L4 distributes this over several components.
EROS realizes a more powerful interface model than L4 in two regards. First, EROS implements co-process invocation rather than remote procedure call as its primitive IPC mechanism, which in some cases permits novel (and useful) forms of co-tasking and delegation. Second, EROS uses capability invocations for all system services, where L4 implements operation-specific system calls. The EROS approach allows system services to be semi-transparently proxied by user-level applications, which was not an objective in the L4 design.
Bryan Ford (Mach 4) has adapted earlier microkernel work done at Carnegie Mellon University. Ford completely rebuilt the Mach IPC mechanism, obtaining dramatic improvements in spite of constraints on the permissible changes to the interface [For93]. His subsequent work on Fluke was influenced by early discussions of EROS and KeyKOS [For96], and his implementation results were partly responsible for our investigation of capability invocation performance. Much of Ford's work has since been adapted by OSF into their Real-Time Mach kernel.
As with L4, the EROS architecture realizes a more powerful IPC abstraction than that of Mach 4, and unifies all service interfaces behind a single invocation mechanism.
Paul Karger has proposed SCAP, a hybrid system implementing both capabilities and access control lists [Kar88]. Karger argued that capability systems would be improved by certain hardware features, and suggested in-memory structures for efficient capability management, particularly efficient handling of rescinded authority. Only the IPC path of SCAP was implemented, and the system as a whole was never completed or evaluated with application-level code.
Karger's work focuses on providing a clear definition of several security policies of interest, and on evaluating the effectiveness of capabilities in implementing those policies. Ironically, Karger cites the paper that refutes the essential claim of the thesis: that capabilities must be combined with access control lists to provide multilevel security.
Karger's work does not address the question of persistence, and therefore does not take into account the performance impact of data structure constraints to support object storage (if included) or the complexity of the secure startup arguments if the capability model is not extended across layers of the storage hierarchy.
Because only fragments of SCAP were implemented, it is unclear how the end result should be judged. EROS provides the first quantitative support for some of Karger's hardware arguments, and demonstrates that a few were incompletely considered.
Hardy, Landau, and Frantz architected KeyKOS, a commercial capability system from which the the EROS architecture is modeled [Bom92, Har85]. KeyKOS was first deployed on the System/370 architecture in the early 1980's, and was briefly successful in transaction processing applications. Like EROS, KeyKOS provides universal persistence and a small set of architected abstractions. No attempt was made to evaluate the system relative to the realizable bounds of the underlying hardware.
KeyKOS predates recent concerns with resource reservation and end-to-end scheduling, and therefore omits essential controls over representation and scheduling. These issues have been addressed in EROS, and significantly impact the interprocess communication mechanism. In addition, the KeyKOS design adopted a representation pun for its abstractions that significantly complicates the supervisor's consistency rules. This issue is discussed further below under ``Representation Puns'' (Section 3.2.2).
Where KeyKOS focused on the question of whether a capability system was feasible, EROS is concerned with refining this system to the point where the need for hardware implementation can be clearly dismissed. The EROS architecture has built on the experiences of the KeyKOS team to produce a newer, faster, and simpler supervisor, and to provide better support for performance-critical utility programs.
BiiN is a joint Intel/Siemens research effort started in 1983, which resulted in an entirely new computer architecture and operating system [Pol90]. The hardware portion of this architecture has been publicized as the Intel 80960 XA microprocessor. The BiiN project is now defunct due to the withdrawal of Siemens from the collaboration. In spite of its political failure, BiiN is an architectural tour de force; certainly it is the most successful hardware-based capability system implemented to date.
At the hardware level, BiiN is noteworthy for having incorporated the entire low-level process queueing, dequeuing, and context switch operation into microcode. The hardware directly supports a cross-domain call operation, which executes in 15 microseconds, as compared to 1 microsecond for a normal procedure call. Processes are dispatched according to a 32-level priority scheme, which can be built on by the operating system to generate more flexible scheduling policies.
BiiN was conceived from the beginning as an object-oriented architecture, with access descriptors (the BiiN name for capability) serving as object designators. BiiN implements a tagged memory architecture, allowing the processor to securely distinguish between access descriptors and ordinary data. The access descriptor architecture is similar to the segmentation architecture of the Pentium family; access descriptors reference a global descriptor table entry, which in turn describes the location of an object or subsystem in main memory and its associated access rights. Every object in the BiiN architecture has its own address space, and object references are made using a (capability, offset) pair. The operating system implements conventional access control list protections on top of the hardware-implemented access descriptor mechanism.
BiiN/OS provides transacted persistence on an object-by-object basis. The operating system also provides an Update_tree operation that will recursively stabilize all objects reachable from a designated root within a single transaction.
Where BiiN is a complete implementation of an architecture proceeding from custom VLSI and microcode, EROS is built entirely on conventional hardware and provides essentially equivalent performance. On the Pentium family [Int94], the first version of the EROS architecture implements a round-trip cross-domain call and return in 4.5 microseconds, where the ordinary call and return operations take 0.16 microseconds.
Footnote:
As it happens, we were involved in the creation of the first high-performance compilers for the superscalar generation of the 80960 processor family. It should be noted that the relative performance of memory to CPU has degraded substantially since 1988, and that the 15:1 performance ratio obtained by the 80960 XA would be closer to 45:1 today, principally due to memory reference costs inherent in the BiiN descriptor architecture. It is not clear if the per-object address space mechanism defined by the BiiN architecture would be feasible in current generation TLB's at current clock rates.
The bulk of the cross-domain time is due to excessive hardware privilege crossing delays, readily reducible on other architectures. Unlike BiiN/OS, EROS uses capabilities as the exclusive means of object designation: even supervisor calls are made via a capability.
EROS does not use an indirection table to locate objects, nor does it provide a per-object address space. This avoids one of the complexities of BiiN/OS, which was deciding when and how to reclaim global object table entries. Karger has noted a similar problem with indirection tables, and proposed a singly linked list of capabilities as a solution [Kar88]. EROS at one time adopted both of these soltuions before settling on a doubly linked list implementation that outperforms both alternatives.
A first generation of the EROS system has been constructed and evaluated, resulting in several recognized shortcomings. A revised architecture addressing these issues has been generated, and an implementation of this architecture is nearly complete.
The first generation of EROS is closely modeled after the KeyKOS system with architectural additions for reservation-based resource control. This generation of the supervisor was substantially completed for Pentium-family uniprocessors, less the persistence implementation. The purpose of this first implementation was to obtain detailed insight into a previous, successful system, and to gain an understanding of the shortcomings of that system with respect to resource allocation.
Our interest in resource allocation was prompted by the work of Klara Nahrstedt at the University of Pennsylvania, and related work by Cliff Mercer at Carnegie Mellon. Nahrstedt's work shows fairly clearly that multimedia resource management must be handled end-to-end, and cannot be achieved without a supervisor that exports control over the mapping of abstractions onto physical resources [Nah96]. While KeyKOS had extensive mechanisms for abstraction control, it lacked any exported facility for resource allocation or real-time scheduling. Mercer's effort suggested one way to achieve this, but it was not immediately clear how well Mercer's processor capacity reserves [Mer93] would function in a transparently persistent architecture. We also considered work by Richard Black in connection with the Pegasus effort at the University of Cambridge, and rejected this path. The Pegasus approach relies on reducing the number of multiplexing points between the application and the device by giving up protection layers; our initial constraint was that security is an uncompromisable requirement
This first generation system has been used to pursue several lines of investigation, including high-performance IPC, confinement in the context of an active router, and the first stages of a binary-compatible Linux emulation environment.
The earliest result from EROS concerned the question of representation caching. The EROS architecture is predicated on the design assumption that all application state must be periodically written to disk. This assumption manifests in two ways:
The EROS supervisor must manage two object representations: one intended for high-performance execution and the other for efficient disk storage.
Where most systems presume that the supervisor owns the application state and will occasionally deign to report it to the user, EROS assumes that the application state is conceptually owned by the user at all times, and is occasionally ``borrowed'' by the supervisor for execution. One consequence is that certain consistency rules must be verified when state is cached by the supervisor, particularly in the face of representation puns.
Our first paper, published in POS '95 [Sha96a], described the consistency management and representation caching mechanisms of the first EROS kernel version. In this paper, we showed how the challenges above can both be handled efficiently. The resulting system has a significantly smaller effective supervisor footprint than conventional kernels.
Partly in response to Liedtke's work [Lie96], and partly because we knew capability invocation to be a critical operation, we set out to see how close we could get to Liedtke's results. The presentation order at IWOOOS allowed us the rare opportunity to quantitatively refute certain of Liedtke's claims within hours of their assertion [Sha96b]. After a somewhat brief but heated discussion, Liedtke conceded that our results measured directly comparable operations and was both fair and accurate. Somewhat to our surprise, our results slightly outperformed Liedtke's.
Footnote:
The published numbers matched Liedtke's results cycle for cycle. In light of the context of presentation, we did not emphasize that our IPC transmits a larger payload than Liedtke's for the same cost.
The essential impact of this result is that confinement can be supported cost-effectively, that there is no intrinsic prohibitive overhead to a capability-based interface, and that a carefully structured kernel can exceed the performance of the smallest, fastest microkernel currently in existence.
A number of system utilities were constructed for the first version of EROS, including the system storage allocator, two user-level pagers, the constructor (the utility that builds instances of confined subsystems), a mutex application, and utility programs that emulate files and directories. A subset POSIX emulation was constructed that provides the file portion of the POSIX interface for use by an extractor of tar images.
While cursory measurements indicate that file and address space performance under EROS is acceptable, these utilities drew attention to a number of shortcomings in the version 1 architecture beyond those that had emerged in the kernel implementation: demultiplexing, page fault handling, and a shortage of capability registers and storage mechanisms required in certain applications. Each of these issues is discussed below under ``Analysis of Shortcomings'' (Section 3.2).
In collaboration with Steve Muir, we have designed and implemented the framework of a confined active routing element. In addition to validating the system itself, this application poses a challenging test for the architecture. At high network speeds, the routing application is heavily dependent on demultiplexing and context switching, which EROS does not do well in the absence of a fast IPC implementation; our implementation at the time of the experiment was temporarily inoperable due to changes in the architecture.
The design and initial implementation convinced us that the application could be built atop the EROS substrate, but would be greatly assisted by an increase in capability registers and the addition of capability load/store instructions.
A paper on this work was submitted. The preliminary results have been published as a DSL technical report [Sha97].
The first steps toward a binary-compatible Linux emulator for EROS were taken by Geoff Milord, Mike Laskin, and Mike Berry. Their effort focused primarily on the emulation of the UNIX system call interface, implementation of the UNIX memory primitives in terms of the EROS architecture, and some preliminary file system support [Tho78].
Two observations emerged from this work: first, writing good emulators is quite difficult, but EROS does provide the necessary underpinnings. Second, other emulation efforts (Linux/L4, Lites) have succeeded in part because they have preserved the essential monolithic structure of the emulated system, giving over device management to the emulator.
The Dionysix effort was concerned with providing multiple simultaneous rehosted environments. This requires device virtualization, and consequently is more complex to implement than similar efforts on top of Mach or L4. In addition, some thought was required to determine how most efficiently to implement the system call interface. A replacement shared library was constructed and partially tested.
As with the routing experiments, an operating system emulator is essentially a multiplexed service, and the Dionysix effort ran into the same concerns that the active router did.
In collaboration with Sam Webber, we have developed agape, a formal model of capability system architectures.
Footnote:
Eros and Agape are Greek words which distinguish two very different means of experiencing love.
Eros describes passionate love; it is experienced as an all-consuming and desperate yearning for a lover. A victim of Eros becomes infatuated by the love-object, and is even willing to endure pain and hardship in order to preserve the attachment. Eros is characterised by feelings of excitement, anxiety, longing and drama.
Agape describes a love which is free of passion. The individual experiences feelings of contentment and stability. Often this form of love may not be directed to a single love-object but may be a general sensation of security and well-being.
Originally based closely on the EROS architecture, the model was subsequently simplified and generalized to cover any pure capability system with minimal modifications. The model covers the abstract architecture; it does not take into account resource management features or temporal modeling.
The simple reduction of the architecture to a tractable theoretical model is a significant step; especially so given that the mapping between the abstract architecture and the concrete architecture (the one implemented) is easily discerned. Further work would be necessary to relate the model to the concrete architecture realized by the EROS design, to relate that concrete architecture to the implementation, and the implementation to the generated program. In practice, the last step is of limited utility, as all currently shipping commodity processors have bugs permitting the supervisor mode of the processor to be compromised. In spite of its limitations, this model has generated considerable interest within IBM Yorktown, and we are organizing to write this work up for publication over the next few months.
Using this model, we have a nearly completed proof of correctness for the EROS confinement mechanism. This proof shows that a program authored by an untrusted party can be rendered confined without any knowledge of the program code.
The modeling process pointed out two shortcomings in EROS Version 1, one an implementation bug and the other an unnecessary architectural complication. The implementation bug emerged as we were formalizing the description of capability invocation. During the course of the discussion it became clear that what we were formalizing was consistent with the intended concrete architecture, but not consistent with the then-current implementation.
The architectural complication concerned the representation pun between capability pages and processes, and is described below.
In general, the performance of the first EROS version was surprisingly good, but the application efforts associated with this version of EROS pointed out a number of shortcomings in the original architecture, all of which are corrected in the second version.
The first version of EROS provided 16 supervisor-implemented capability registers for each process, of which register 0 was hardwired to a capability conveying no authority (more specifically: to the zero number capability). This version provided no direct mechanism by which a program could load and store capabilities; instead, a user-level program known as a supernode provided an indexed capability storage facility.
While the hardwiring of register zero was an improvement over KeyKOS, we found that 16 capability registers is overly constraining. Most of the programs we wished to build could be squeezed into 16 registers, but only barely. In addition, certain library routines want to manipulate capability registers, but in the absence of a canonicalized way to save and restore them the registers used by the library become a global constraint. KeyKOS implemented a source code preprocessor to manage this.
A particularly unfortunate aspect of the supernode program is that it interacts badly with the Pentium's slow privilege crossing and context switch mechanism. There are a few programs -- generally programs that are highly multiplexed -- that must efficiently manage large numbers of capabilities. After some thought, we concluded that more capability registers were desirable, and a pseudo-instruction for quickly loading and storing capabilities was required.
In addition, the dearth of capability registers and the absence of capability load/store instructions makes high-performance demultiplexing applications difficult. The typical mechanism for a demultiplexing application is to hold a capability for each client. Clearly, 15 clients is too small for many demultiplexing applications, and inserting an extra context switch into a high-performance demultiplexing tool seems like a poor choice.
The first version of EROS used two fundamental representations for all system abstractions. Pages held user data, and CaPages (capability pages) held capabilities. Other system abstractions were constructed from these representations and distinguished via type tags within the capabilities. A process address space, which is a mapping from addresses to pages, was specified using a tree of CaPages (Figure 1).
A process itself was specified using an architecture-dependent arrangement of CaPages (typically three) with an accompanying address space tree (Figure 2).
Representing address spaces as trees of CaPages is convenient; while it is possible for the address space tree to contain a capability that does not designate a CaPage, Page, or Number, provision for invalid mappings must already be made, and inappropriate capabilities in address space contexts can simply be treated as invalid subtrees.
For processes, matters are a bit more complicated. EROS version 1, following KeyKOS, had a notion of a ``well-formed process.'' A well-formed process is one in which all constituent CaPages are in their proper arrangements and all slots in those CaPages hold capabilities that satisfy a complex set of type constraints. For example, all CaPage slots that hold the values of process registers must hold number keys. When a process is scheduled to run, or when any operation manipulating the state of a process occurs, a check is made to determine if that process is well-formed. If it is not, the occupying thread of control evaporates or the operation fails in a well-defined fashion.
A regrettable number of places in the EROS version 1 supervisor had to be concerned with situations in which a currently running process could come to be malformed by a CaPage operation on one of the process's constituents. This obscured the supervisor code, was a rich source of errors, and added considerable complexity to the IPC and context switch paths. Further, the formal specification of a well-formed process proved sufficiently difficult and sufficiently EROS-specific that we concluded this representation pun was unsound. EROS version 2 includes processes as first-class objects.
The introduction of a new object type in turn dictated changes in the underlying object store. The version 1 EROS store partitioned the disk into areas of three types, which we refer to as ranges. Checkpoint ranges are reserved for use by the persistence mechanism as a write-ahead log. Page ranges hold the on-disk representation of a page. CaPage ranges hold the on-disk representation of a CaPage.
The need to partition the disk into typed ranges had already struck us as unfortunate; any such partitioning is sensitive to load and usage, and rebalancing a mispartitioned disk seemed likely to prove difficult. The introduction of a third persistent object type presented an additional compelling reason to revise the storage design.
When faced with a page fault at the end of the heap or stack, conventional operating systems allocate a page from the swap area. This page is ephemeral, and is known at allocation time to be zero. In consequence, allocating such a page is an extremely fast operation (7.2 microseconds under Linux on a 133Mhz Pentium). There are two problems with this allocation strategy:
Since no official storage has been allocated (or billed!) for this page, there is no way to preserve the storage across system failures. UNIX programs are not expected to survive such failures. As a corollary effect, swap space allocation becomes a source of denial of service attacks.
Footnote:
This strategy is exploited by a UNIX operating system testing utility, killer, written by Roger Faulkner of Sun Microsystems, Inc. Killer abuses swap space allocation and similar UNIX flaws so as to maximize any existing kernel windows of vulnerability, making them more easily reproduced using bug-specific test cases.
Swap space can become oversubscribed, at which point the operating system is forced to run complex triage policies to decide which undeserving program to shoot in the head. Storage allocated storage therefore violates referential integrity.
If these flaws are to be corrected, the performance consequences must be considered. Because the EROS system forwards the page fault to a user level handler, which in turn must purchase a page from a space bank (an EROS storage agent), the corresponding path in EROS is quite a bit slower. For most UNIX programs, heap-space allocation accounts for less than 0.1% of total execution time. Adding a small number of microseconds to each heap allocation operation is unlikely to cause measurable changes in performance for these programs.
Unfortunately a few programs -- notably the C compiler -- are both highly dependent on heap allocation and excessively stupid about allocation strategies. Heap growth accounts for nearly 10% of total execution time for cc1. Worse, this storage is allocated one page at a time, which makes it harder to amortize this cost.
If the calls to the space bank can be reduced, comparable performance can be obtained by the EROS architecture. Unfortunately, the page fault handling application requires more capability registers to be able to do this. This was a significant incentive for increasing the number of per-process capability registers.
Following our experiences with the first version of EROS, the architecture has been revised and simplified. The revised architecture is well on the way to being implemented.
Processes. The most significant change to the EROS architecture is the introduction of processes as a first class object type. An EROS process now has 32 capability registers, which relieves pressure on many existing applications.
The elimination of the Process/CaPage representation pun has eliminated quite a number of complications in the kernel; it is no longer possible for a process to be malformed, and no longer possible for an operation performed via a CaPage capability to impact a currently running process. One unfortunate consequence of this change is that it is no longer possible for two processes to share functional units. While this was not permitted by EROS version 1, there is no intrinsic reason why two collaborating processes should not be permitted to share floating point register state or capability register state, and relaxations of the original architecture to permit this had been contemplated.
Large Capability Pages. While it will probably not be implemented in the time frame of this thesis, we have a design for ``large capability pages'' and supervisor-implemented load and store instructions. A large capability page occupies the same amount of space as a data page, and can be mapped into the user address space. Any attempt to perform an ordinary (data) memory operation to such a page behaves as an access violation. Correspondingly, supervisor load/store capability instructions fail if they source or target a data page.
Surprisingly, the introduction of data pages adds essentially no code to the EROS supervisor; the capability load and store instructions are able to reuse the address space traversal facilities already used by the page fault handler.
Bigger CaPages. EROS version 1 CaPages held 16 capabilities, partly because this was a convenient size given the original EROS process structure, and partly because it conveniently captured the page table size of the IBM 370 architecture. With the elimination of the Process/CaPage representation pun and the introduction of large capability pages, CaPages are essentially reduced in usage to the description of address spaces. A CaPage holding 16 slots maps 4 bits of a virtual page address. Modern architectures with tree-structured mapping tables commonly translate 10 bits with each layer of tree, and 4 does not conveniently divide 10. This results in an unfortunate mapping generation complexity at the overlap between the first and second level page tables.
EROS version 2 increases the size of a CaPage to 32 capabilities. This reduces the number of layers of CaPages that must be traversed to perform a translation for most programs, and significantly simplifies an awkward case in the fast address fault satisfaction path. It proves fairly common for some previous fault to have already translated the upper bits of the virtual address and constructed an entry in the lower level page table corresponding to the current fault address. The current fault can bypass the traversal of the CaPage tree corresponding to the upper portions of the address, since the resulting permissions are already cached within the upper-level hardware page table. This significantly speeds translation of adjacent addresses, and slightly improves the referential locality visible to any object prefetch algorithm that may later be implemented.
The EROS version 2 store has only two disk range types: checkpoint ranges and object ranges. Checkpoint ranges have been carried forward from the version 1 architecture unchanged. Object ranges now consist of on-disk page frames, each of which has an associated entry in a side-table indicating the types of the elements contained. One way to think of this is that EROS allocates disk storage in page-sized clusters of objects. The number of objects held by such a cluster varies according to the size of the object, and currently ranges from 1 (Pages) to 9 (CaPages).
Somewhat to our surprise, this change has reduced the size of the EROS kernel, and improved localization of the complexities of the object load algorithm (reducing effective complexity).
A secondary benefit of this change concerns locality of reference. Because space banks allocate disk page frames in clusters, a side effect of the mixed object ranges is improved locality of reference. The previous architecture, while satisfying a page fault, must first traverse a tree of CaPages (loaded from one physical disk range) and then fetch the actual Page (loaded from another range, not geographically local to the first). In the new object store design, the CaPages and Pages have a decent chance of being co-located on the disk, reducing the need for arm motion and increasing the likely benefit of read ahead.
The final change to the version 1 architecture is the introduction of semi real-time checkpointing. This was accomplished by causing the process that dirties an object to pay the freight for stabilizing and migrating at least one object. Because stabilization and migration writes are asynchronous, this typically does not cause the process modifying the object to block. Further, if the object modified was snapshotted by the previous checkpoint, the migration of that object in the previous checkpoint can be suppressed, reducing the migration workload.
Footnote:
Either a subsequent checkpoint will successfully stabilize, rendering the old object in the checkpoint area stale, or the system will crash, at which time that object can be recovered from the checkpoint area. In either case, there is no need to migrate the object back to its home object range.
The combined effect of these changes is to eliminate a substantial percentage of the migration load, and to bill the cost of checkpoint and migration as part of the cost of modifying an object.
The resulting system is not truly real-time, because the application is not necessarily aware of when checkpoints will be taken. Other resource management mechanisms are needed to more strongly insulate real-time applications from background checkpointing.
The EROS effort as a whole goes well beyond the scope of any single thesis. We propose to focus the thesis on a part of the work to date: the capability-related portion of the EROS system architecture and implementation.
To demonstrate the effectiveness of this architecture and implementation, we propose to examine three areas:
Capability Invocation. The basic mechanism for performing actions on objects must be fast enough to support decomposition.
Emulation Performance. New systems must somehow run old applications. Ideally, the emulator should come within a small percentage of (or ideally exceed) the performance of existing applications on existing systems. There is considerable evidence within the microkernel world that this is feasible even on relatively slow microkernels. It should therefore be feasible on capability systems.
Confinement. We must show that isolation can be enforced on a subsystem with sufficiently high performance that use of confined subsystems becomes a viable engineering choice for those applications that need them.
Each of these issues is expanded briefly below.
The need for a fast capability invocation mechanism is self-evident; the issue is to decide how fast a mechanism must be to support effective decomposition. Liedtke's L4 makes heavy use of decomposition, and we propose to use his performance results as a metric of success [Lie93a, Lie93b, Lie95]. If our interprocess capability invocation mechanism can achieve comparable performance to that of L4, decomposition is certainly feasible.
Footnote:
We do not plan to duplicate Liedtke's results for ``small spaces.'' While Liedtke's results in this area are very impressive, and can be duplicated in EROS, the small spaces mechanism relies on a peculiarity of the Pentium family segmentation architecture that does not generalize to other architectures.
A second issue in capability invocation concerns the invocation of kernel-implemented services. For our metric here, we propose to compare the performance of a representative sample of such services to the most nearly analogous operation under Linux. If the EROS system can meet or exceed Linux performance on these operations, we will consider them adequate.
The performance of existing applications under an emulator is determined by three factors:
The speed of the processor and memory subsystems while executing user-mode instructions (which is OS independent).
The performance of the emulated service call interface, which directly depends on the architecture and performance of the underlying capability invocation mechanism.
The latency of page fault satisfaction for validatable addresses, which is a bottleneck for many applications.
Because the construction of a second operating system (or an emulator for one) is beyond the scope of a single thesis, we plan to evaluate each of these issues with microbenchmarks, comparing the performance under EROS to the equivalent performance under Linux.
Confinement is a problem of subsystem isolation. A well-architected subsystem has a small number of entry points that provide computationally substantive services. Ideally, a protected subsystem should be no more expensive to access than an unprotected subsystem. In practice, protected subsystems require more work: the address space must be changed and the data necessary to the operation must be copied across the protection boundary. The protected invocation should impose as little overhead as possible on a carefully tuned context switching mechanism.
As a practical goal, we propose that the protected call should be within a factor of 30 of basic procedure calls. On modern architectures this is feasible. On the Pentium and family it is feasible subject to (transparent) restrictions of the addressing model.
A factor of 30 is clearly not sufficiently fast to replace procedure calls; this is not the objective. If a confined subsystem executes several thousand instructions per invocation, we expect that the marginal increase in invocation time will be acceptable (i.e. 5% or less over the invocation as a whole). Further, we expect that isolation carries with it a reduction in the number of ways the subsystem can be compromised by the user, which in the end makes up for the cost of the invocation. Finally, we observe that separate subsystems are often capable of parallelism that is difficult to extract from monolithic applications.
The plan of action from this point is intended to generate a relatively complete thesis draft by the second week of September, 1997. The dates by each item are provided as a rough estimate only. With one exception, each of these steps is essential to answer the questions posed. The exception is the MIPS port, which is not strictly essential but provides significant supporting data.
Complete the introduction of first-class process objects per the second-generation EROS architecture. (7/11/97)
Reconstruct the fast capability invocation path per the new design, and measure the performance of this path for cross-process invocations. (7/18/97)
Update CaPages to hold 32 capabilities, implement page purchase caching, and measure the performance of the revised page fault handling mechanism relative to Linux. (7/25/97)
Measure the performance of invoking trivial supervisor-implemented services, and compare the results to those obtained under Linux. (7/27/97)
Construct minimal port to MIPS R4400 running in 32-bit mode. This will require console driver, page fault handling, and IPC path, but no device drivers. (8/15/97)
Analyze resulting data and write up architecture and results. Present dissertation to committee. (9/12/97)
An expansion on each milestone follows.
This modification is roughly half completed, and should be completable earlier than scheduled. It is necessary both to improve performance and to address the first generation failings identiifed above. In addition, the first-class process object makes the port to other architectures considerably simpler. Based on data collected for the earlier generation of the EROS architecture, we expect to be able to show a quantitative improvement due to this change.
Following the introduction of first class processes, we will be able to reconstruct the fast capability invocation path for later measurement. This is largely a reproduction of a path taken by earlier work in connection with the IWOOOS paper [Sha96b], and should not generate any surprises.
A key weakness in the current architecture, as described above, concerns page fault handling time. The proposed solution essentially involves constructing a slightly smarter user-level page fault handler and measuring its performance. The combination of this change plus the reconstructed fast IPC path should bring us comfortably close to the Linux performance figures for heap page allocation. The results must then be analyzed and compared.
The cost of encapsulating supervisor service calls by capabilities will be most apparent for those service calls that do minimal work. The cost of invoking an EROS capability to perform such a call needs to be measured and compared to the cost of equivalent service calls under Linux. We do not expect that EROS performance will match that of Linux, but we do expect their respective performance to be close enough to give little cause for concern.
This is the riskiest portion of the effort remaining, and arguably unnecessary. A port has been committed to Tandem, and therefore needs to be accomplished in any case. The advantage to doing this port is that the MIPS architecture implements a completely different page mapping and TLB handling mechanism, which is a good test of the ability of the EROS architecture to map to non tree-structured translation mechanisms. In addition, the MIPS architecture implements address space identifiers and a number of other features that will permit us to quantitatively evaluate the benefit of certain changes to the processor architecture that have been seen as desirable for capability and/or microkernel systems.
This port is highly restricted, in that we do not propose to generate a full EROS port. For purposes of these measurements, only the console, context switch, interrupt, and page fault mechanisms will need to be implemented. Some of the preliminary, non-essential preparation for this port will be accomplished by a student, which should help make the proposed porting time feasible.
This step will, of course, be a continuous process. Further, we anticipate generating several thesis chapters as the work proceeds, and making those available for early comment and feedback.
If successful, this research will demonstrate that by careful selection of data structures and primitive atomic objects one can build a capability system that performs as well or better than conventional systems on many primitive operating system functions, and gets close to the maximum performance realizable from the machine on some tasks. By doing so, we will have demonstrated the feasibility of capability systems for performance-sensitive computing. In addition, we will have demonstrated that the performance of protected subsystems need not be prohibitive, which will expand the realm of feasible engineering options in operating system and application design.
As incidental results, we expect to show:
That a high-performance context switch and capability invocation mechanism can be constructed on commodity hardware, which provides the necessary underpinnings for fault isolation. Secondarily, that such a system is fast enough to make decomposition feasible in many cases.
That a carefully structured full-featured supervisor can match the performance of the fastest microkernels if properly architected.
A revised method for implementing confinement within a relatively conventional architecture, which may be adaptable to conventional operating systems.
An analysis of the potential shortcomings of this architecture, accompanied by a clear identification of the root architectural requirement from which each shortcoming derives and the problems of relaxing that requirement. By implication, a clear identification of why the failure to meet these requirements in current-generation systems renders those systems necessarily unreliable and insecure.
Finally, this research will result in an artifact: the EROS system itself, which is a clean and relatively well-documented system from which further work can proceed.
| [Bom92] | Allen C. Bomberger, A. Peri Frantz, William S. Frantz, Ann C. Hardy, Norman Hardy, Charles R. Landau, Jonathan S. Shapiro. ``The KeyKOS NanoKernel Architecture,'' Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures. USENIX Association. April 1992. pp 95-112. |
| [Che93] | J. Bradley Chen and Brian N. Bershad, ``The Impact of Operating System Structure on Memory System Performance,'' Proceedings of the 14th ACM Symposium on Operating System Principles, ACM, 1993 |
| [Col88] | Robert P. Colwell, Edward F. Gehringer, E. Douglas Jensen, ``Performance Effects of Architectural Complexity in the Intel 432.'' ACM Transactions on Computer Systems, 6(3), August 1988, pp. 296--339. |
| [Den66] | J. B. Dennis and E. C. Van Horn, ``Programming Semantics for Multiprogrammed Computations'' in Communications of the ACM, vol. 9, pp. 143--154, March 1966. |
| [For93] | Bryan Ford and Jay Lepreau, ``Evolving Mach 3.0 to a Migrating Threads Model,'' Proceedings of the Winter USENIX Conference, January 1994. |
| [For96] | Bryan Ford, Mike Hibler, Jay Lepreau, Patrick Tullmann, Godmar Back, Stephen Clawson. ``Microkernels Meet Recursive Virtual Machines.'' Proc. OSDI'96, October 1996. |
| [Har85] | Norman Hardy. ``The KeyKOS Architecture'' Operating Systems Review. Oct. 1985, pp 8-25. |
| [Hei97] | Gernot Heiser, Kevin Elphinstone, Jerry Vochteloo, Stephen Russell, Jochen Liedtke. Implementation and Performance of the Mungi Single-Addess-Space Oeprating System. UNSW-CSE-TR-9704, June 1997 |
| [Int83] | Elliott I. Organick, A Programmer's View of the Intel 432 System McGraw-Hill Book Company, 1983. |
| [Int94] | Intel Corporation, Pentium Processor Family User's Manual. Volume 3. |
| [Kar88] | Paul Karger, Improving Security and Performance for Capability Systems, Technical Report No. 149, University of Cambridge Computer Laboratory, October 1988 (Ph. D. thesis). |
| [Lam73] | Butler W. Lampson. ``A Note on the Confinement Problem.'' Communications of the ACM Vol 16, No 10, 1973 |
| [Lam76] | Butler W. Lampson and Howerd E. Sutrgis. ``Reflections on an Operating System Design.'' Communications of the ACM Vol 19, No 5, May 1976. |
| [Lev84] | Henry M. Levy. Capability-Based Computer Systems. Digital Press. 1984. |
| [Lie93a] | Jochen Liedtke. ``Improving IPC by Kernel Design,'' Proceedings of the 14th ACM Symposium on Operating System Principles, ACM, 1993 |
| [Lie93b] | Jochen Liedtke. ``A Persistent System in Real Use -- Experiences of the First 13 Years'' Proceedings of the 3rd International Workshop on Object-Orientation in Operating Systems, Asheville, N.C. 1993 |
| [Lie95] | Jochen Liedtke. Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces, GMD Technical Report No. 933, November 1995. |
| [Lie96] | Jochen Liedtke. ``Microkernels Must and Can be Small,'' Proceedings of the 5th International Workshop on Object Orientation in Operating Systems, Seattle, Washington, November 1996, IEEE |
| [Mer93] | ``Processor Capacity Reserves: An Abstraction for Managing Processor Usage,'' C. W. Mercer, S. Savage and H. Tokuda, in Proc. 4th Workshop on Workstation Operating Systems, October 1993. |
| [Nah96] | ``Design, Implementation and Experiences of the OMEGA End-Point Architecture,'' K. Nahrstedt and J. M. Smith, in IEEE Journal on Selected Areas in Communications (Special Issue on Multimedia Systems) 14(7), September, 1996, pp. 1263--1279. |
| [OSF97] | See ``NONAIM3 Comparisons with HP-UX'' link from http://www.osf.org/mall/os/pa-mklinux/ |
| [Pol90] | Fred Pollack, Kevin Kahn, T. Don Dennis, Gerald Holzhammer, Herman D'Hooge, and Steve Tolopka, ``An Object-Oriented Distributed Operating System,'' Proceedings, Spring COMPCON '90, IEEE. |
| [Sal75] | ``The Protection of Information in Computer Systems,'' Jerome H. Saltzer and Michael D. Schroeder, in Proceedings of the IEEE, 63(9), Sept. 1975, pp. 1278--1308. |
| [Sha96a] | Jonathan S. Shapiro, David J. Farber, and Jonathan M. Smith. ``State Caching in the EROS Kernel -- Implementing Efficient Orthogonal Persistence in a Pure Capability System'', Proceedings of the 7th International Workshop on Persistent Object Systems, Cape May, N.J. 1996 |
| [Sha96b] | Jonathan S. Shapiro, David J. Farber, Jonathan M. Smith. ``The Measured Performance of a Fast Local IPC''. Proceedings of the 5th International Workshop on Object Orientation in Operating Systems, Seattle, Washington, November 1996, IEEE |
| [Sha97] | J. S. Shapiro, S. J. Muir, J. M. Smith, D. J. Farber. Operating System Support for Active Networks University of Pennsylvania CIS technical report number MS-CIS-97-03. http://www.cis.upenn.edu/~shap/EROS/osactive.ps.gz |
| [Tho78] | K. Thompson, ``UNIX Implementation,'' Bell System Technical Journal, Vol 57, No 6, Part 2, pp. 1931-1946, July/August 1978 |
| [Wul81] | William A. Wulf, Roy Levin, and Samuel P. Harbison. HYDRA/C.mmp: An Experimental Computer System McGraw Hill, 1981. |