This chapter is not intended as an introduction to synchronization. It is assumed that you already have some understanding of the basic concepts of locks and semaphores. If you need additional background reading, synchronization is covered in most introductory operating systems texts. However, since synchronization in the kernel is somewhat different from locking in an application, this chapter provides a brief overview to help ease the transition (or, for experienced kernel developers, to refresh your memory).
As an OS X kernel programmer, you have many choices of synchronization mechanisms at your disposal. The kernel itself provides two such mechanisms: locks and semaphores.
A lock is used for basic protection of shared resources. Multiple threads can attempt to acquire a lock, but only one thread can actually hold it at any given time (at least for traditional locks—more on this later). While that thread holds the lock, the other threads must wait. There are several different types of locks, differing mainly in what threads do while waiting to acquire them.
A semaphore is much like a lock, except that a finite number of threads can hold it simultaneously. Semaphores can be thought of as being much like piles of tokens. Multiple threads can take these tokens, but when there are none left, a thread must wait until another thread returns one. It is important to note that semaphores can be implemented in many different ways, so Mach semaphores may not behave in the same way as semaphores on other platforms.
In addition to locks and semaphores, certain low-level synchronization primitives such as test-and-set are also available, along with a number of other atomic operations. These additional operations are described in libkern/gen/OSAtomicOperations.c in the kernel sources. Such atomic operations may be helpful if you do not need something as robust as a full-fledged lock or semaphore. Since they are not general synchronization mechanisms, however, they are beyond the scope of this chapter.
Semaphores
Semaphores and locks are similar, except that with semaphores, more than one thread can be doing a given operation at once. Semaphores are commonly used when protecting multiple indistinct resources. For example, you might use a semaphore to prevent a queue from overflowing its bounds.
OS X uses traditional counting semaphores rather than binary semaphores (which are essentially locks). Mach semaphores obey Mesa semantics—that is, when a thread is awakened by a semaphore becoming available, it is not executed immediately. This presents the potential for starvation in multiprocessor situations when the system is under low overall load because other threads could keep downing the semaphore before the just-woken thread gets a chance to run. This is something that you should consider carefully when writing applications with semaphores.
Semaphores can be used any place where mutexes can occur. This precludes their use in interrupt handlers or within the context of the scheduler, and makes their use strongly discouraged in the VM system. The public API for semaphores is divided between the MIG-generated task.h file (located in your build output directory and included with #include <mach/task.h>) and osfmk/mach/semaphore.h (included with #include <mach/semaphore.h>).
The public semaphore API includes the following functions, which are described in xnu/osfmk/mach/semaphore.h (except for semaphore_create and semaphore_destroy, which are described in the MIG-generated task.h mentioned above).
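The following sketch summarizes the commonly used declarations. It is reconstructed from the headers named above rather than quoted from them, so treat the exact signatures as approximate:

    kern_return_t semaphore_create(task_t task, semaphore_t *semaphore,
                                   int policy, int value);
    kern_return_t semaphore_destroy(task_t task, semaphore_t semaphore);
    kern_return_t semaphore_signal(semaphore_t semaphore);
    kern_return_t semaphore_signal_all(semaphore_t semaphore);
    kern_return_t semaphore_signal_thread(semaphore_t semaphore, thread_t thread);
    kern_return_t semaphore_wait(semaphore_t semaphore);
    kern_return_t semaphore_timedwait(semaphore_t semaphore, mach_timespec_t wait_time);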
The use of these functions is relatively straightforward, with the exception of the semaphore_create, semaphore_destroy, and semaphore_signal_thread calls.
The semaphore and value parameters for semaphore_create are exactly what you would expect: a pointer to the semaphore structure to be filled out and the initial value for the semaphore, respectively.
The task parameter refers to the primary Mach task that will 'own' the lock. This task should be the one that is ultimately responsible for the subsequent destruction of the semaphore. The task parameter used when calling semaphore_destroy must match the one used when the semaphore was created.
For communication within the kernel, the task parameter should be the result of a call to current_task. For synchronization with a user process, you need to determine the underlying Mach task for that process by calling current_task on the kernel side and mach_task_self on the application side.
Note: In the kernel, be sure to always use current_task. In the kernel, mach_task_self returns a pointer to the kernel's VM map, which is probably not what you want.
The details of user-kernel synchronization are beyond the scope of this document.
The policy parameter is passed as the policy for the wait queue contained within the semaphore. The possible values are defined in osfmk/mach/sync_policy.h. Current possible values are:
SYNC_POLICY_FIFO
SYNC_POLICY_FIXED_PRIORITY
SYNC_POLICY_PREPOST
The FIFO policy is, as the name suggests, first-in, first-out. The fixed priority policy causes wait queue reordering based on fixed thread priority policies. The prepost policy causes the semaphore_signal function to not increment the counter if no threads are waiting on the queue. This policy is needed for creating condition variables (where a thread is expected to always wait until signalled). See the section Wait Queues and Wait Primitives for more information.
The semaphore_signal_thread call takes a particular thread from the wait queue and places it back into one of the scheduler's run queues, thus making that thread available to be scheduled for execution. If thread_act is NULL, the first thread in the queue is similarly made runnable.
With the exception of semaphore_create and semaphore_destroy, these functions can also be called from user space via RPC. See Calling RPC From User Applications for more information.
Condition Variables
The BSD portion of OS X provides msleep, wakeup, and wakeup_one, which are equivalent to condition variables with the addition of an optional time-out. You can find these functions in sys/proc.h in the Kernel framework headers.
The msleep call is similar to a condition variable. It puts a thread to sleep until wakeup or wakeup_one is called on that channel. Unlike a condition variable, however, you can set a timeout measured in clock ticks. This means that it is both a synchronization call and a delay. The prototypes follow:
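The sketch below is reconstructed from the sys/proc.h declarations and should be treated as approximate; msleep0 and other low-level variants are declared alongside these but, as noted below, are not recommended for general use:

    extern int  msleep(void *chan, lck_mtx_t *mtx, int pri, const char *wmesg,
                       struct timespec *ts);   /* relative timeout; NULL means no timeout */
    extern void wakeup(void *chan);            /* wake every thread sleeping on chan */
    extern void wakeup_one(caddr_t chan);      /* wake a single thread sleeping on chan */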
The three sleep calls are similar except in the mechanism used for timeouts. The function msleep0 is not recommended for general use.
In these functions, channel is a unique identifier representing a single condition upon which you are waiting. Normally, when msleep is used, you are waiting for a change to occur in a data structure. In such cases, it is common to use the address of that data structure as the value for channel, as this ensures that no code elsewhere in the system will be using the same value.
The priority argument has three effects. First, when wakeup is called, threads are inserted in the scheduling queue at this priority. Second, if the bit (priority & PCATCH) is zero, msleep0 does not allow signals to interrupt the sleep. Third, if the bit (priority & PDROP) is zero, msleep0 drops the mutex on sleep and reacquires it upon waking. If (priority & PDROP) is one, msleep0 drops the mutex if it has to sleep, but does not reacquire it.
The subsystem argument is a short text string that represents the subsystem that is waiting on this channel. It is used solely for debugging purposes.
The timeout argument is used to set a maximum wait time. The thread may wake sooner, however, if wakeup or wakeup_one is called on the appropriate channel. It may also wake sooner if a signal is received, depending on the value of priority. In the case of msleep0, this is given as a mach abstime deadline. In the case of msleep, this is given in relative time (seconds and nanoseconds).
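As a concrete illustration, here is a minimal, hypothetical usage sketch in the style of a condition variable; all names (my_state, mydrv_ready, and so on) are invented for the example:

    /* Hypothetical example: wait for a flag protected by a mutex. */
    struct my_state {
        lck_mtx_t   *mtx;       /* allocated elsewhere with lck_mtx_alloc_init */
        int          ready;
    };

    static void
    wait_for_ready(struct my_state *st)
    {
        lck_mtx_lock(st->mtx);
        while (!st->ready) {
            /* Sleeps on st as the channel; the mutex is dropped while asleep
             * and reacquired before msleep returns (PDROP is not set). */
            msleep(st, st->mtx, PRIBIO, "mydrv_ready", NULL);
        }
        lck_mtx_unlock(st->mtx);
    }

    static void
    mark_ready(struct my_state *st)
    {
        lck_mtx_lock(st->mtx);
        st->ready = 1;
        lck_mtx_unlock(st->mtx);
        wakeup(st);             /* wake every thread sleeping on this channel */
    }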
Outside the BSD portion of the kernel, condition variables may be implemented using semaphores.
Locks
OS X (and Mach in general) has three basic types of locks: spinlocks, mutexes, and read-write locks. Each of these has different uses and different problems. There are also many other types of locks that are not implemented in OS X, such as spin/sleep locks, some of which may be useful to implement for performance comparison purposes.
Spinlocks
A spinlock is the simplest type of lock. In a system with a test-and-set instruction or the equivalent, the code looks something like this:
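A minimal illustrative sketch, using compiler atomic builtins as the "equivalent" rather than the actual xnu implementation:

    typedef volatile int my_spinlock_t;    /* illustrative type, not the kernel's */

    static void
    my_spin_lock(my_spinlock_t *lock)
    {
        /* __sync_lock_test_and_set atomically stores 1 and returns the old value;
         * a nonzero return means another thread already holds the lock. */
        while (__sync_lock_test_and_set(lock, 1)) {
            while (*lock)
                ;                           /* spin until the lock reads as free */
        }
    }

    static void
    my_spin_unlock(my_spinlock_t *lock)
    {
        __sync_lock_release(lock);          /* atomically store 0 with release semantics */
    }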
In other words, until the lock is available, it simply 'spins' in a tight loop that keeps checking the lock until the thread's time quantum expires and the next thread begins to execute. Since the entire time quantum for the first thread must complete before the next thread can execute and (possibly) release the lock, a spinlock is very wasteful of CPU time, and should be used only in places where a mutex cannot be used, such as in a hardware exception handler or low-level interrupt handler.
Note that a thread may not block while holding a spinlock, because that could cause deadlock. Further, preemption is disabled on a given processor while a spinlock is held.
There are three basic types of spinlocks available in OS X: lck_spin_t (which supersedes simple_lock_t), usimple_lock_t, and hw_lock_t. You are strongly encouraged not to use hw_lock_t; it is mentioned only for the sake of completeness. Of these, only lck_spin_t is accessible from kernel extensions.
The u in usimple stands for uniprocessor, because they are the only spinlocks that provide actual locking on uniprocessor systems. Traditional simple locks, by contrast, disable preemption but do not spin on uniprocessor systems. Note that in most contexts, it is not useful to spin on a uniprocessor system, and thus you usually need only simple locks. Use of usimple locks is permissible for synchronization between thread context and interrupt context or between a uniprocessor and an intelligent device. However, in most cases, a mutex is a better choice.
Important: Simple and usimple locks that could potentially be shared between interrupt context and thread context must have their use coordinated with spl (see glossary). The IPL (interrupt priority level) must always be the same when acquiring the lock, otherwise deadlock may result. (This is not an issue for kernel extensions, however, as the spl functions cannot be used there.)
The spinlock functions accessible to kernel extensions are declared in kern/locks.h.
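A sketch of their prototypes follows; it is reconstructed rather than quoted, so treat the exact signatures as approximate:

    extern lck_spin_t   *lck_spin_alloc_init(lck_grp_t *grp, lck_attr_t *attr);
    extern void          lck_spin_init(lck_spin_t *lck, lck_grp_t *grp, lck_attr_t *attr);
    extern void          lck_spin_lock(lck_spin_t *lck);
    extern void          lck_spin_unlock(lck_spin_t *lck);
    extern void          lck_spin_destroy(lck_spin_t *lck, lck_grp_t *grp);
    extern void          lck_spin_free(lck_spin_t *lck, lck_grp_t *grp);
    extern wait_result_t lck_spin_sleep(lck_spin_t *lck, lck_sleep_action_t action,
                                        event_t event, wait_interrupt_t interruptible);
    extern wait_result_t lck_spin_sleep_deadline(lck_spin_t *lck, lck_sleep_action_t action,
                                                 event_t event, wait_interrupt_t interruptible,
                                                 uint64_t deadline);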
The arguments to these functions are described in detail in Using Lock Functions.
Mutexes
A mutex, mutex lock, or sleep lock is similar to a spinlock, except that instead of constantly polling, a waiting thread places itself on a queue of threads waiting for the lock and then yields the remainder of its time quantum. It does not execute again until the thread holding the lock wakes it (or, in some user space variations, until an asynchronous signal arrives).
Mutexes are more efficient than spinlocks for most purposes. However, they are less efficient in multiprocessing environments where the expected lock-holding time is relatively short. If the average time is relatively short but occasionally long, spin/sleep locks may be a better choice. Although OS X does not support spin/sleep locks in the kernel, they can be easily implemented on top of existing locking primitives. If your code performance improves as a result of using such locks, however, you should probably look for ways to restructure your code, such as using more than one lock or moving to read-write locks, depending on the nature of the code in question. See Spin/Sleep Locks for more information.
Because mutexes are based on blocking, they can only be used in places where blocking is allowed. For this reason, mutexes cannot be used in the context of interrupt handlers. Interrupt handlers are not allowed to block because interrupts are disabled for the duration of an interrupt handler, and thus, if an interrupt handler blocked, it would prevent the scheduler from receiving timer interrupts, which would prevent any other thread from executing, resulting in deadlock.
For a similar reason, it is not reasonable to block within the scheduler. Also, blocking within the VM system can easily lead to deadlock if the lock you are waiting for is held by a task that is paged out.
However, unlike simple locks, it is permissible to block while holding a mutex. This would occur, for example, if you took one lock, then tried to take another, but the second lock was being held by another thread. However, this is generally not recommended unless you carefully scrutinize all uses of that mutex for possible circular waits, as it can result in deadlock. You can avoid this by always taking locks in a certain order.
In general, blocking while holding a mutex specific to your code is fine as long as you wrote your code correctly, but blocking while holding a more global mutex is probably not, since you may not be able to guarantee that other developers' code obeys the same ordering rules.
A Mach mutex is of type mutex_t. The functions that operate on mutexes are summarized below.
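Under the lck_* naming used for the other lock types in this chapter, the mutex calls look roughly like the following (a sketch based on kern/locks.h; treat the exact signatures as approximate):

    extern lck_mtx_t    *lck_mtx_alloc_init(lck_grp_t *grp, lck_attr_t *attr);
    extern void          lck_mtx_init(lck_mtx_t *lck, lck_grp_t *grp, lck_attr_t *attr);
    extern void          lck_mtx_lock(lck_mtx_t *lck);
    extern boolean_t     lck_mtx_try_lock(lck_mtx_t *lck);
    extern void          lck_mtx_unlock(lck_mtx_t *lck);
    extern void          lck_mtx_destroy(lck_mtx_t *lck, lck_grp_t *grp);
    extern void          lck_mtx_free(lck_mtx_t *lck, lck_grp_t *grp);
    extern wait_result_t lck_mtx_sleep(lck_mtx_t *lck, lck_sleep_action_t action,
                                       event_t event, wait_interrupt_t interruptible);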
The arguments to these functions are described in detail in Using Lock Functions.
Read-Write Locks
Read-write locks (also called shared-exclusive locks) are somewhat different from traditional locks in that they are not always exclusive locks. A read-write lock is useful when shared data can be reasonably read concurrently by multiple threads except while a thread is modifying the data. Read-write locks can dramatically improve performance if the majority of operations on the shared data are in the form of reads (since it allows concurrency), while having negligible impact in the case of multiple writes.
A read-write lock allows this sharing by enforcing the following constraints:
Multiple readers can hold the lock at any time.
Only one writer can hold the lock at any given time.
A writer must block until all readers have released the lock before obtaining the lock for writing.
Readers arriving while a writer is waiting to acquire the lock will block until after the writer has obtained and released the lock.
The first constraint allows read sharing. The second constraint prevents write sharing. The third prevents read-write sharing, and the fourth prevents starvation of the writer by a steady stream of incoming readers.
Mach read-write locks also provide the ability for a reader to become a writer and vice-versa. In locking terminology, an upgrade is when a reader becomes a writer, and a downgrade is when a writer becomes a reader. To prevent deadlock, some additional constraints must be added for upgrades and downgrades:
Upgrades are favored over writers.
The second and subsequent concurrent upgrades will fail, causing that thread's read lock to be released.
The first constraint is necessary because the reader requesting an upgrade is holding a read lock, and the writer would not be able to obtain a write lock until the reader releases its read lock. In this case, the reader and writer would wait for each other forever. The second constraint is necessary to prevent the deadlock that would occur if two readers wait for the other to release its read lock so that an upgrade can occur.
The functions that operate on read-write locks are:
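A sketch of the interface, reconstructed from kern/locks.h (treat the exact signatures as approximate):

    extern lck_rw_t     *lck_rw_alloc_init(lck_grp_t *grp, lck_attr_t *attr);
    extern void          lck_rw_init(lck_rw_t *lck, lck_grp_t *grp, lck_attr_t *attr);
    extern void          lck_rw_lock(lck_rw_t *lck, lck_rw_type_t lck_rw_type);
    extern void          lck_rw_unlock(lck_rw_t *lck, lck_rw_type_t lck_rw_type);
    extern void          lck_rw_lock_shared(lck_rw_t *lck);
    extern void          lck_rw_unlock_shared(lck_rw_t *lck);
    extern void          lck_rw_lock_exclusive(lck_rw_t *lck);
    extern void          lck_rw_unlock_exclusive(lck_rw_t *lck);
    extern void          lck_rw_destroy(lck_rw_t *lck, lck_grp_t *grp);
    extern void          lck_rw_free(lck_rw_t *lck, lck_grp_t *grp);
    extern wait_result_t lck_rw_sleep(lck_rw_t *lck, lck_sleep_action_t action,
                                      event_t event, wait_interrupt_t interruptible);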
This is a more complex interface than that of the other locking mechanisms, and actually is the interface upon which the other locks are built.
The functions lck_rw_lock and lck_rw_unlock lock and unlock a lock as either shared (read) or exclusive (write), depending on the value of lck_rw_type, which can contain either LCK_RW_TYPE_SHARED or LCK_RW_TYPE_EXCLUSIVE. You should always be careful when using these functions, as unlocking a lock held in shared mode using an exclusive call, or vice versa, will lead to undefined results.
The arguments to these functions are described in detail in Using Lock Functions.
Spin/Sleep Locks
Spin/sleep locks are not implemented in the OS X kernel. However, they can be easily implemented on top of existing locks if desired.
For short waits on multiprocessor systems, the amount of time spent in the context switch can be greater than the amount of time spent spinning. When the time spent spinning while waiting for the lock becomes greater than the context switch overhead, however, mutexes become more efficient. For this reason, if there is a large degree of variation in wait time on a highly contended lock, spin/sleep locks may be more efficient than traditional spinlocks or mutexes.
Ideally, a program should be written in such a way that the time spent holding a lock is always about the same, and the choice of locking is clear. However, in some cases, this is not practical for a highly contended lock. In those cases, you may consider using spin/sleep locks.
The basic principle of spin/sleep locks is simple. A thread takes the lock if it is available. If the lock is not available, the thread may enter a spin cycle. After a certain period of time (usually a fraction of a time quantum or a small number of time quanta), the spin routine's time-out is reached, and it returns failure. At that point, the lock places the waiting thread on a queue and puts it to sleep.
In other variations on this design, spin/sleep locks determine whether to spin or sleep according to whether the lock-holding thread is currently on another processor (or is about to be).
For short wait periods on multiprocessor computers, the spin/sleep lock is more efficient than a mutex, and roughly as efficient as a standard spinlock. For longer wait periods, the spin/sleep lock is significantly more efficient than the spinlock and only slightly less efficient than a mutex. There is a period near the transition between spinning and sleeping in which the spin/sleep lock may behave significantly worse than either of the basic lock types, however. Thus, spin/sleep locks should not be used unless a lock is heavily contended and has widely varying hold times. When possible, you should rewrite the code to avoid such designs.
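A minimal sketch of one way such a lock could be layered on the existing mutex primitives; the spin count and the function name are arbitrary choices for the example:

    #define SPIN_TRIES 1000    /* arbitrary bound; tune to the expected hold time */

    static void
    spin_sleep_lock(lck_mtx_t *mtx)
    {
        int i;

        /* Spin briefly, polling with the non-blocking try operation. */
        for (i = 0; i < SPIN_TRIES; i++) {
            if (lck_mtx_try_lock(mtx))
                return;                 /* got the lock while spinning */
        }
        /* Give up spinning and block (sleep) like an ordinary mutex. */
        lck_mtx_lock(mtx);
    }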
Using Lock Functions
While most of the locking functions are straightforward, there are a few details related to allocating, deallocating, and sleeping on locks that require additional explanation. As the syntax of these functions is identical across all of the lock types, this section explains only the usage for spinlocks. Extending this to other lock types is left as a (trivial) exercise for the reader.
The first thing you must do when allocating locks is to allocate a lock group and a lock attribute set. Lock groups are used to name locks for debugging purposes and to group locks by function for general understandability. Lock attribute sets allow you to set flags that alter the behavior of a lock.
The following code illustrates how to allocate an attribute structure and a lock group structure for a lock. In this case, a spinlock is used, but with the exception of the lock allocation itself, the process is the same for other lock types.
Listing 17-1 Allocating lock attributes and groups (lifted liberally from kern_time.c)
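The listing is summarized by the following sketch, with identifier names modeled on kern_time.c; treat the exact names as illustrative rather than a verbatim copy:

    lck_grp_attr_t  *tz_slock_grp_attr;
    lck_grp_t       *tz_slock_grp;
    lck_attr_t      *tz_slock_attr;
    lck_spin_t      *tz_slock;

    void
    time_zone_slock_init(void)
    {
        /* Allocate the lock group attributes and the lock group. */
        tz_slock_grp_attr = lck_grp_attr_alloc_init();
        tz_slock_grp = lck_grp_alloc_init("tzlock", tz_slock_grp_attr);

        /* Allocate the lock attributes. */
        tz_slock_attr = lck_attr_alloc_init();

        /* Allocate and initialize the spinlock. */
        tz_slock = lck_spin_alloc_init(tz_slock_grp, tz_slock_attr);
    }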
The first argument to the lock initializer, of type lck_grp_t, is a lock group. This is used for debugging purposes, including lock contention profiling. The details of lock tracing are beyond the scope of this document; however, every lock must belong to a group (even if that group contains only one lock).
The second argument to the lock initializer, of type lck_attr_t, contains attributes for the lock. Currently, the only attribute available is lock debugging. This attribute can be set using lck_attr_setdebug and cleared with lck_attr_setdefault.
To dispose of a lock, you simply call the matching free functions. For example:
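The following sketch reuses the hypothetical names from the allocation sketch above:

    /* Free the lock first, then the supporting structures. */
    lck_spin_free(tz_slock, tz_slock_grp);
    lck_attr_free(tz_slock_attr);
    lck_grp_attr_free(tz_slock_grp_attr);
    lck_grp_free(tz_slock_grp);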
Note: While you can safely dispose of the lock attribute and lock group attribute structures, it is important to keep track of the lock group associated with a lock as long as the lock exists, since you will need to pass the group to the lock's matching free function when you deallocate the lock (generally at unload time).
The other two interesting functions are lck_spin_sleep and lck_spin_sleep_deadline. These functions release a spinlock and sleep until an event occurs, then wake. The latter includes a timeout, at which point it will wake even if the event has not occurred.
The parameter lck_sleep_action controls whether the lock will be reclaimed after sleeping, prior to this function returning. The valid options are:
LCK_SLEEP_DEFAULT
Release the lock while waiting for the event, then reclaim it. Read-write locks are held in the same mode as they were originally held.
LCK_SLEEP_UNLOCK
Release the lock and return with the lock unheld.
LCK_SLEEP_SHARED
Reclaim the lock in shared mode (read-write locks only).
LCK_SLEEP_EXCLUSIVE
Reclaim the lock in exclusive mode (read-write locks only).
The event parameter can be any arbitrary integer, but it must be unique across the system. To ensure uniqueness, a common programming practice is to use the address of a global variable (often the one containing a lock) as the event value. For more information on these events, see Event and Timer Waits.
The parameter interruptible indicates whether the scheduler should allow the wait to be interrupted by asynchronous signals. If this is false, any false wakes will result in the process going immediately back to sleep (with the exception of a timer expiration signal, which will still wake lck_spin_sleep_deadline).
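A minimal usage sketch follows; the names are hypothetical, and the waking side is assumed to use thread_wakeup, the kernel's event wakeup primitive:

    static int         data_ready;     /* its address doubles as the event value */
    static lck_spin_t *data_lock;      /* allocated elsewhere with lck_spin_alloc_init */

    static void
    wait_for_data(void)
    {
        lck_spin_lock(data_lock);
        while (!data_ready) {
            /* Drops data_lock, sleeps on &data_ready, and reacquires the lock
             * before returning (LCK_SLEEP_DEFAULT, uninterruptible wait). */
            (void) lck_spin_sleep(data_lock, LCK_SLEEP_DEFAULT,
                                  (event_t)&data_ready, THREAD_UNINT);
        }
        lck_spin_unlock(data_lock);
    }

    static void
    data_arrived(void)
    {
        lck_spin_lock(data_lock);
        data_ready = 1;
        lck_spin_unlock(data_lock);
        thread_wakeup((event_t)&data_ready);   /* wake the threads sleeping on this event */
    }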
Advanced Synchronization in Mac OS X: Extending Unix to SMP and Real-Time
BSDCon 2002 Paper, pp. 37–46 of the Proceedings
Louis G. Gerbarg
Apple Computer, Inc.
louis@apple.com
Abstract
Throughout the years, as Unix has grown and evolved, so has computer hardware. The 4.4BSD-Lite2 distribution had no support for two features that are becoming more and more important: SMP and real-time processing. With the release of Mac OS X, Apple has made extensive alterations to our kernel in order to support both SMP and real-time processing. These alterations affected both the BSD and Mach portions of our kernel, as well as shaping our driver system, IOKit.
These changes include new scheduling policies, support for kernel preemption, altered locking hierarchies, and new serialization primitives, as well as a driver architecture that allows developers to easily make their drivers SMP and preemption safe.
Traditional BSD kernels do some things very well. SMP is not one of them. The 4.4BSD-Lite2 source, on which NetBSD, FreeBSD, OpenBSD, and Mac OS X are based, did not have support for SMP. Its locking mechanisms were not set up for multiple processors, the kernel was not reentrant, and bottom half (interrupt time) drivers always worked directly within an interrupt context. As FreeBSD, Mac OS X, and NetBSD have moved to support SMP, they have had to overcome these shortcomings. Some aspects of their solutions are similar, some are wildly divergent. Both xnu and FreeBSD have decided to adopt interrupt thread contexts, as well as a number of similar new locking primitives.
Before delving into the intricacies of Mac OS X's advanced features, a brief overview of the kernel architecture and its history is necessary. Mac OS X is based around a BSD distribution known as Darwin. At the heart of Darwin is its kernel, xnu. xnu is a monolithic kernel based on sources from the OSF/mk Mach kernel and the BSD-Lite2 kernel source, as well as source that was developed at NeXT. All of this has been significantly modified by Apple.
xnu is not a traditional microkernel as its Mach heritage might imply. Over the years various people have tried methods of speeding up microkernels, including collocation (MkLinux) and optimized messaging mechanisms (L4) [microperf]. Since Mac OS X was not intended to work as a multi-server, and a crash of a BSD server was equivalent to a system crash from a user perspective, the advantages of protecting Mach from BSD were negligible. Rather than simple collocation, message passing was short-circuited by having BSD directly call Mach functions. While the abstractions are maintained within the kernel at the source level, the kernel is in fact monolithic. xnu exports both Mach 3.0 and BSD interfaces for userland applications to use. Use of the Mach interface is discouraged except for IPC, and if it is necessary to use a Mach API it should most likely be used indirectly through a system-provided wrapper API.
Operating systems use a number of structures and algorithms to ensure proper synchronization between various parts of the kernel. xnu uses several different locking structures, including the BSD lockmanager, Mach mutexes, simple locks, read-write locks, and funnels. Additionally, thread control is complicated by the use of Mach continuations and kernel preemption.
3.1 Simple Locks
Simple locks in Mach are standard spin locks. When a thread attempts to access a simple lock that is in use, it loops until the lock becomes free. This is useful when allowing the thread to sleep could cause a deadlock, or when one of the threads could be running in an interrupt context.
Simple locks are the safest general synchronization primitive to use when in doubt, but their CPU cost is very high. In general it is better to use a mutex if at all possible. If a piece of code attempts to acquire a simple lock it already holds, the result is a kernel panic.
3.2 Mutexes
Mach mutexes are fairly primitive and do not have the rich semantics that FreeBSD's mutexes have. They are sleep locks: when a thread attempts to acquire an in-use mutex, it sleeps until that mutex is available. Mutexes can be used from a thread context (though this is not always the best performance decision for things like drivers). If a piece of code attempts to acquire a mutex it already holds, the result is a kernel panic.
3.3 Read-Write Locks
Many variables within the kernel are safe to read so long as they are not being written. If a lock is highly contended, it is generally being acquired primarily by readers. Read-write locks solve this problem by allowing either multiple readers or a single writer to possess the lock. While there are APIs for promoting and demoting locks between the read and write states, their usage is discouraged and subject to change.
3.4 Continuations
One of the costs typically associated with context switches is saving and restoring thread stacks. This uses both CPU time and wired memory. In order to avoid this cost, Mac OS X uses Mach continuations whenever possible. A continuation allows the kernel to avoid saving or restoring a kernel stack across schedulings of the thread.
Continuations work within a non-preemptible context. Since the thread is not going to be preempted, its entry and exit points are well-defined. The thread begins executing through a call to a function pointer. It ends execution by making a call that tells the scheduler to schedule a new thread, and leaves a pointer to a function that should be executed the next time the thread is scheduled. It is the thread's responsibility to save and restore its own variables.
While it is useful to be aware of continuations, it is not generally necessary to directly interact with them. They may be useful for doing extremely low overhead threading, but in general it is best to use them indirectly through other kernel mechanisms such as IOWorkLoops.
Funnels are quite possibly one of the most confusing elements of xnu for people familiar with other BSD kernels. They are not a lock in the traditional sense of the word (though they are sometimes referred to as "flock" within the kernel). Funnels are used to serialize access to the BSD segment of the kernel. This is necessary because that portion of the codebase does not have fine-grained locking and is not fully reentrant. There are currently two funnels within the kernel: the kernel funnel (it might be more appropriate to call it the filesystem funnel, though it does protect a few calls besides the file systems) and the network funnel.
4.1 Funnels
Funnels first appeared in Digital UNIX [dgux], though their implementation in Mac OS X is entirely different, and significantly improved. Funnels are actually built on top of Mach mutexes. Each funnel backs into a mutex, and once a thread gains a funnel it holds that funnel while it is executing. The difference between a funnel and a mutex is that a mutex is held across a reschedule, while a funnel is not: the scheduler drops a thread's funnel when the thread is rescheduled and reacquires it when the thread runs again. That means that holding a funnel does not guarantee that another thread will not enter a critical section before the thread drops the funnel. What it does mean is that on a multiprocessor system it is guaranteed that no other thread will access the section concurrently from another CPU.
Originally there was a single funnel protecting the entirety of the BSD kernel. It was in many ways analogous to FreeBSD-current's Giant mutex (more on that later). Since networking and other kernel functions are generally separate, splitting the funnel into two is a major win for dual processor machines. Unfortunately, since holding both funnels can result in nasty deadlocks and other problems, holding both at the same time causes a panic. This can cause significant problems for entities that need to access items that are protected by each funnel. The primary entities this affects are network file systems. The funnel API has a call for swapping funnels, but in some cases this has proven to be too complicated to orchestrate (such as NFS serving). The API also provides a merge call which will combine the two funnels into a single funnel, backed by a single mutex. Unfortunately, the funnels cannot be unmerged, which causes a net performance loss.
The primary difference between Digital UNIX funnels and Mac OS X funnels is that on Digital UNIX there can only be one funnel, and it always runs on the primary CPU. On Mac OS X there can be multiple funnels, and funnels can run on any CPU (although a particular funnel may only be on one CPU at any given time).
There are primitives for creating funnels, but in general nobody should be creating new funnels. All control of the funnels is done through the thread_funnel_set() call.
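A hedged sketch of the usual pattern, assuming the kernel funnel is visible to the caller as kernel_flock (an assumption based on contemporary xnu sources, not on this paper):

    extern funnel_t *kernel_flock;

    /* Acquire the kernel funnel around a call into funnel-protected BSD code,
     * then restore the previous funnel state. */
    boolean_t was_held = thread_funnel_set(kernel_flock, TRUE);
    /* ... call funnel-protected BSD code here ... */
    (void) thread_funnel_set(kernel_flock, was_held);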
4.2 So long spl.
In BSD the various spl priority levels formed a locking hierarchy that could be used to guarantee synchronization between the interrupt and non-interrupt segments of a driver. Unfortunately, the spl definitions got less and less fine-grained over the years, and they were never particularly well suited for SMP. For these reasons Mac OS X no longer uses them. Instead it manages its synchronization through mutexes and the BSD funnel serializations.
If this sounds familiar to FreeBSD users, that is probably because FreeBSD-current actually has a funnel (or rather a magic mutex), Giant. FreeBSD plays scheduler games with Giant that are almost identical to what xnu does with funnels, although Mac OS X deals with them explicitly, through a different API than its mutexes. Like FreeBSD, xnu has replaced the functionality of the spls with these more flexible synchronization primitives. Unlike FreeBSD, the spl calls are still sprinkled through the kernel. Over the development of xnu they have been no-ops, then wrappers that acquired the funnels, and most recently they act as asserts to make sure the funnels are in the correct state when they are called.
There are two important aspects to real-time scheduling. One is the scheduling algorithm; the other is guaranteeing that latencies within the kernel are not excessive. While both will be discussed, this section focuses mostly on latency-related issues.
5.1 Interrupt Handling
True interrupt handlers cannot be preempted, and cannot sleep. Therefore, if there is a long path in an interrupt handler it will lead to high latency. In order to handle this, xnu generally uses a simple interrupt handler that processes the interrupt by triggering a handler that the driver has registered, which runs in a regular kernel thread context. This "pseudo interrupt" handler runs in a normal kernel thread context, where it can access the full kernel API. If true interrupt handling is necessary, the correct mechanism is generally an IOFilterInterruptEventSource (see below).
5.2 Scheduling Bands
xnu internally has 128 priority levels, ranging from 0 (lowest priority) to 127 (highest priority). They are divided into several major bands. Priorities 0 through 51 correspond to what is available through the traditional BSD interface; the default priority is 31. Priorities 52 through 63 correspond to elevated priorities. Priorities 64 through 79 are the highest priority regular threads, and are used by things like WindowServer. Priorities 80 through 95 are for kernel mode threads. Finally, priorities 96 through 127 correspond to real-time threads, which are treated differently than other threads by the scheduler.
5.3 Fixed and Degrading priorities
By default the scheduler creates threads with degradable priorities. These threads will have lower and lower effective priorities as they use (and abuse) their time allocations. This is particularly significant for real-time threads, since if they are truly abusive they will eventually degrade into non-real-time threads. This mechanism means that it is possible to allow non-superusers to create real-time threads.
There are also mechanisms to create fixed priority threads which will not degrade. Their creation is much more restrictive than degradable threads, since they can be used very effectively to perform a denial of service against a system.
5.4 Kernel Preemption
Kernel preemption is the main tool xnu uses to achieve low latencies. The kernel is preemptible, though in standard usage kernel preemption is turned off. Kernel preemption begins when a real-time thread is scheduled. Since the real-time thread has a higher priority than a kernel thread, it should be scheduled in favor of the kernel thread, and that is the point at which kernel preemption is activated.
Preemption changes the runtime characteristics of the kernel dramatically. Continuations are no longer nearly as useful, since the thread may be rescheduled at any point, which will require a stack. Additionally, all sorts of new deadlocks can arise. In order to cope with this, the locking primitives have been modified to work with preemption. Simple locks disable preemption while they are spinning. Mutexes only disable preemption while the thread is trying to gain access to their interlock (a spin lock protecting the mutex's private data structures). Additionally, the true interrupt handler is not preemptible.
What this means is that well-written code should not need to be aware of the fact that kernel preemption is enabled, and should just work if it properly uses the locking primitives. Preemption should be transparent to most kernel extensions and drivers. It may not be transparent if the driver uses an IOFilterInterruptEventSource, or does not make proper use of an IOWorkLoop, as described in the next section.
IOKit is the driver subsystem of the Mac OS X kernel. IOKit provides a number of synchronization primitives, ranging from simple wrappers to the Mach primitives, all the way through complex new synchronization constructs that massively simplify writing drivers for devices that are SMP clean and preemptible. IOKit is implemented in eC++ [eC++], a subset of C++, and uses a custom runtime type system.
6.1 IOLocks
IOKit provides wrappers to the Mach locking primitives. These wrappers provide some convenience as well as a consistent interface to the locking primitives.
6.1.1 IORWLock
IORWLock provides a wrapper to the standard Mach read-write locks.
6.1.2 IORecursiveLock
IORecursiveLock provides a wrapper to the standard Mach mutexes. Additionally, it has an internal reference counting mechanism that allows it to be locked recursively.
6.1.3 IOLock
IOLock provides a wrapper to the standard Mach mutexes. The semantics are the same. Recursive locking is not allowed.
6.1.4 IOSimpleLock
IOSimpleLock provides a wrapper to the standard Mach simple_locks. Additionally, it has an interface for enabling and disabling interrupts (drivers should probably be using an IOWorkLoop for synchronization, which will take care of interrupt-related issues).
6.2 IOWorkLoop
IOWorkLoops are constructs designed to simplify synchronization issues that arise when working with hardware in the multi-threaded, reentrant, preemptible environment present within xnu. Unlike the other locking primitives discussed earlier in this paper, the IOWorkLoop is a very complex entity that takes care of most of the more mundane synchronization issues for driver writers. Its interface is rather extensive, and somewhat complex.
The basic idea behind a work loop is that it forces anything attached to the work loop to run effectively single threaded. While anything is holding the work loop, none of the other associated event handlers or runActions can run. This effectively synchronizes the various items that are attached to the work loop. It also provides a convenient mechanism for servicing interrupts and timers while keeping them synchronized.
The work loop also takes care of a bunch of mundane issues such as turning on and off interrupts during certain locking procedures, meaning that driver writers can concentrate on getting their drivers working, not keeping their locking straight. Inherently there is some overhead in using work loops, and they do not serve every purpose, but they are quite flexible, and allow programmers to write correct drivers without intimate knowledge of xnu's internal synchronization mechanisms.
6.2.1 EventSources
IOEventSources are very flexible constructs for dealing with asynchronous events. While it is possible to implement new event sources, in general the provided IOInterruptEventSource and IOTimerEventSource are sufficient.
Event sources allow functions to be associated with asynchronous events, such as interrupts and timers. The full details and subtleties of how they work fall outside the scope of this paper, but the basic pattern for creating and using new event sources is outlined below.
Once an event source has been created, it can be added to a work loop. After that, any time the event happens it will automatically be processed, in the work loop context, by the function that was specified when the event source was created.
6.2.2 runActions
Event sources solve a significant amount of the synchronization issues drivers face in dealing with the bottom half (interrupt time) of the driver, but they do not deal with synchronizing the top half (non-interrupt) and bottom half of the driver. This synchronization is achieved through the use of runActions.
runActions simply link a particular invocation of a function to the work loop. While the runAction is operating, it holds the work loop, thus forcing synchronization with everything else on the work loop, including the interrupt and timer event handlers.
6.3 IOFilterInterruptEventSource
IOFilterInterruptEventSource is a subclass of IOInterruptEventSource. It is special because, in addition to running within the work loop thread's context, it runs directly in the primary interrupt context. This allows for much faster interrupt response time, but also means that an IOFilterInterruptEventSource cannot block, and must not use any kernel API that may block. In general, IOFilterInterruptEventSources should be used for cases where there are a lot of potential spurious interrupts, such as when a device shares an interrupt, or when processing only needs to be performed after several interrupts. The IOFilterInterruptEventSource can choose to ignore the interrupts that do not need processing, and pass the ones that do need processing on to an IOInterruptEventSource. A full description of the limitations imposed on code running within the primary interrupt context is beyond the scope of this paper.
Darwin provides a number of synchronization primitives, both traditional and unique. They provide mechanisms for writing high performance drivers, without requiring driver writers to become intimately familiar with the OS. This both simplifies driver bring up, and encourages more people to write Mac OS X drivers for their devices.
Mac OS X is an evolving system, and many of these features are still in their infancy. Over time it will likely evolve into a more fine-grained locking model, with certain compromises that are currently present being phased out. The basic architecture needed to support SMP and real-time exists, and for most things the interfaces should remain stable for the foreseeable future.