This post is a cliff notes version I scrapped together after reading the paper Operating Systems Transactions. Although I strongly recommend you read the paper if you are interested in how the authors pulled inspiration from database systems to create a transactional operating system, this post should give you a good high overview if you are short on time and need a quick and shallow understanding.

Table of Contents

Abstract

System transactions enable application developers to update OS resources in an ACID (atomic, consistent, isolated, and durable) fashion.
TxOS is a variant of Linux that implements system transactions using new techniques, allowing fairness between system transactions and non-transaction activities

Introduction

The difficulty lies in making updates to multiple files (or shared data structures) at the same time. One example of this is updating user accounts, which requires making changes to the following files: /etc/passwd, /etc/shadow, /etc/group
One way for ensuring that a file is atomically updates is by using a “rename” operation, this system call replacing the contents of a file.
But for more complex updates, we’ll need to use something like flock for handling mutual exclusion. These advisory locks are just that: advisory. Meaning, someone can bypass these control, like an administrator, and just update the file directly.
Although one approach to fix these concurrency problems is by adding more and more system calls. But instead of taking this approach of constantly identifying and eliminating race conditions, why not percolate the responsibility up to the end user, by allowing system transactions?
These system transactions is what the paper proposes and this technique allows developers to group their transaction using system calls: sys_xbegin() and sysx_xend().
This paper focuses on a new approach to OS implementation and demonstrates the utility of system transactions by creating multiple prototypes.

Motivating Examples

Section covers two common application consistency problems: software upgrade and security
Both above examples and their race conditions can be solved by using ”’system transactions”’

Software installation or upgrade

Upgrading software is common but difficult
There are other approaches, each with their own drawbacks
One example is using a checkpoint based system. With checpoints, system can rollback. However, files not under the control of the checkpoint cannot be restored.
To work around the shortcomings of checkpoint, system transactions can be used to atomically roll forward or rollback the entire installation.

Eliminating races for security

Another type of attack is interleaving a symbolic link in between a user’s access and open system calls
By using transactions, the symbolic link is serialized (or ordered) either before or after and cannot see partial updates
The approach of adding transactions is more effective long term, instead of fixing race conditions as they pop up

Overview

System transactions make it easy on the developer to implement
Remainder of section describes the API and semantics

System Transactions

System transactions provide ACID (atomic, consistent, isolation, durability) semantics – but instead of at the database level, at the operating system level
Essentially, application programmer wraps their code in sys_xbegin() and sys_xend()

System transaction semantics

Similar to database semantics, system transactions are serializable and recoverable
Transactions are atomic and can be rolled back to a previous state
Transactions are durable (i.e. once transaction results are committed, they survive system crashes)
Kernel enforces the following invariant: only a single writer at a time (per object)
If there are multiple writers, system will detect this condition and abort one of the writers
Kernel enforces serialization
Durability is an option

Interaction of transactional and non-transactional threads

Serialization of transaction and non-transational updates is caclled strong isolation
Other implementations do not take a strong stance on the subject and are semantically murkey
By taking a strong stance, we can avoid unexpected behavior in the presence of non-transactional updates

System transaction progress

OS guarantees system transactions do not livelock with other system transactions
If two transactions are in progress, OS will select one of the transactions to commit, while restarting the other transaction
OS can enforce policies to limit abuse of transactions, similar to how OS can control access to memory, disk space, kernel threads etc

System transactions for system state

Key point: system transactions provide ACID semantics for system state but not for application state
When a system transaction aborts, OS will restore kernel data structures, but not touch or revert application state

Communication Model

Application programmer is responsible for not adding code that will communicate outside of a transaction. For example, by adding a request to a non-transactional thread, the application may deadlock

TxOS overview

TXOS Design

System transactions guarantee strong isolation

Interoperability and fairness

Whether or not a thread is a transactional or non transactional thread, it must check for conflicting annotation when accessing a kernel object
Often this check is done at the same time when a thread acquires a lock on the object
When there’s a conflict between a transaction and non-transactional thread, this is called asymmetric conflict. Instead of aborting the transaction, TxOS will suspend the non-transactional thread, promoting fairness between transactions and non-transactional threads.

Managing transactional state

Historically, databases and transactional OS will update data in place and maintain an undo log: this is known as eager version management
”Isn’t the undo log approach the approach the light recoverable virtual machine takes?”
In eager version management, systems hold lock until the commit is completed and is also known as two-phase locking
Deadlocking can happen and one typical strategy is to expose a timeout parameter to users
Too short of a timeout starves long transactions. Too long of a deadlock and can starve performance (this is a trade off, of course)
Unfortunately, eager version management can kill performance since the transaction must process its redo log and jeopardizes system’s overall performance
Therefore, TxOS uses lazy version management, operating on private copies of data structures
Main disadvantage of lazy versioning is the additional commit latency due to copying updates of the underlying data structures

Integration with transactional memory

Again, system transactions protect system state: not application state
Users can integrate iwth user level transaction memory systems if they want to protect application state
System calls are forbidden during user transactions since allowing so would violate transactional semantics

TxOS Kernel Implementation

Versioning data

TxOS applies a technique that’s borrowed from software transactional memory systems
During a transaction, a private copy of the object is made: this is known as a the shadow object
The other object is known as “stable”
During the commit, shadow object replaces the stable
A naive approach would be to simply replace the stable pointer, since the object may be the target of pointers from several other objects
For efficient commit of lazy versioned data, need to break up data into header and data.
”Really fascinating technique…”
Maintain a header and the header pointers to the object’s data. That means, other objects always access data via the header, the header never replaced by a transaction
Transactional code always has speculative object
The header splits data into different payloads, allowing the data to be accessed disjointly
OS garbage collects via read-copy update
Although read only data avoids cost of duplicating data, doing so complicates the programming model slightly
Ultimately, RCU is a technique that supports efficient, concurrent access to read-mostly data.

Conflict detection and resolution

TxOS provides transactions for 150 of 303 system calls in Linux
Providing transactions for these subset system calls requires an additional 3,300 lines of code – just for transaction management alone
A conflict occurs when transaction is about to write to an object but that object has been written by another transaction
Header information is used to determine the reader count (necessary for garbage collection)
A non-null writer pointer indicates an active transactional writer. Similarly, an empty reader lists means there are no readers
All conflicts are arbitrated by the contention manager
During a conflict, the contention manager arbitrates by using an osprio policy: the process with the higher scheduling process wins. But if both processes have the same priority, then the older one wins: this policy is known as timestamp.

Asymmetric conflicts

non-transactional threads cannot be rolled back, although transactional threads can always be rolled back. That being said, there must be mechanism to resolve the conflict in favor of the transactional thread otherwise that policy always favor the non-transactional thread
non-transactional threads cannot be rolled back but they can be preemted, a recent feature of Linux

Minimizing conflicts on lists

Kernel relies heavily on linked lists data structures

Managing transaction state

TxOS adds transaction objects to the kernel
Inside of transaction struct, the status (probably an alias to uint8_t) is updated atomically with a compare and swap operation
If transaction system call cannot complete because of conflict, it must abort
Roll back is possible by saving register state on the stack at the beginning of the system call, in the “checkpointed_registers” field
During abort, restore register state and call longjmp
Certain operations must not be done until commit; these operations are stored in deferred_ops. Similarly, some operations must be done during abort, and these operations are stored in undo_ops field.
Workset_list is a skip list that contains references to all objects in the transaction and the transaction’s private copies

Commit protocol

When sys_xend (i.e. transaction ends), transaction acquires lock for all items in (above mentioned) workset.
Once all locks are acquired, transaction performs one final check in its status word and verifies that the status has been set to abort.

Abort protocol

Abort must happen when transaction detects that it lost a conflict
Transaction must decrement the reference count and free the shadow objects

User level transactions

Can only support user-level transactions by coordinating commit of application state with system transaction’s commit

Lock-based STM requirements

Used a simplified variant of two-phase commit protocol
Essentially, user uses sys_xend() system call and must inspect the return code so that the user application can then decide what to do based off of the system call’s transaction

TxOS Kernel Subsystems

Remainder will discuss ACID semantics
Example will include ext3 file system

Transactional file system

Managed versioned data in the virtual filesystem layer
File system only needs to provide atomic updates to stable storage (i.e. via a journal)
By guaranteeing writes are done in a single journal transaction, ext3 is now transactional

Multi-process transactions

Forked children execute until sys_xend() or the process exits

Signal delivery

Application can decide whether to defer a signal until a later point
If deferred, signals are placed into queue

Future work

TxOS does not provide transactional semantics for all OS resources
If attempting to use transaction on unsupported resource, transaction will be aborted