As stated earlier, FreeBSD supports running binaries from several other UNIX®es. This works because FreeBSD has an abstraction called the execution class loader. This wedges into the execve(2) syscall, so when execve(2) is about to execute a binary it examines its type.
There are basically two types of binaries in FreeBSD.
Shell-like text scripts which are identified by
#!
as their first two characters and normal
(typically ELF) binaries, which are a
representation of a compiled executable object. The vast
majority (one could say all of them) of binaries in FreeBSD are
from type ELF. ELF files contain a header, which specifies
the OS ABI for this ELF file. By reading this information,
the operating system can accurately determine what type of
binary the given file is.
Every OS ABI must be registered in the FreeBSD kernel. This
applies to the FreeBSD native OS ABI, as well. So when
execve(2) executes a binary it iterates through the list
of registered APIs and when it finds the right one it starts
to use the information contained in the OS ABI description
(its syscall table, errno
translation
table, etc.). So every time the process calls a syscall, it
uses its own set of syscalls instead of some global one. This
effectively provides a very elegant and easy way of supporting
execution of various binary formats.
The nature of emulation of different OSes (and also some other subsystems) led developers to invite a handler event mechanism. There are various places in the kernel, where a list of event handlers are called. Every subsystem can register an event handler and they are called accordingly. For example, when a process exits there is a handler called that possibly cleans up whatever the subsystem needs to be cleaned.
Those simple facilities provide basically everything that is needed for the emulation infrastructure and in fact these are basically the only things necessary to implement the Linux® emulation layer.
Emulation layers need some support from the operating system. I am going to describe some of the supported primitives in the FreeBSD operating system.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
The FreeBSD synchronization primitive set is based on the idea to supply a rather huge number of different primitives in a way that the better one can be used for every particular, appropriate situation.
To a high level point of view you can consider three kinds of synchronization primitives in the FreeBSD kernel:
atomic operations and memory barriers
locks
scheduling barriers
Below there are descriptions for the 3 families. For every lock, you should really check the linked manpage (where possible) for more detailed explanations.
Atomic operations are implemented through a set of
functions performing simple arithmetics on memory operands
in an atomic way with respect to external events
(interrupts, preemption, etc.). Atomic operations can
guarantee atomicity just on small data types (in the
magnitude order of the .long.
architecture C data type), so should be rarely used
directly in the end-level code, if not only for very
simple operations (like flag setting in a bitmap, for
example). In fact, it is rather simple and common to
write down a wrong semantic based on just atomic
operations (usually referred as lock-less). The FreeBSD
kernel offers a way to perform atomic operations in
conjunction with a memory barrier. The memory barriers
will guarantee that an atomic operation will happen
following some specified ordering with respect to other
memory accesses. For example, if we need that an atomic
operation happen just after all other pending writes (in
terms of instructions reordering buffers activities) are
completed, we need to explicitly use a memory barrier in
conjunction to this atomic operation. So it is simple to
understand why memory barriers play a key role for
higher-level locks building (just as refcounts, mutexes,
etc.). For a detailed explanatory on atomic operations,
please refer to atomic(9). It is far, however,
noting that atomic operations (and memory barriers as
well) should ideally only be used for building
front-ending locks (as mutexes).
Refcounts are interfaces for handling reference
counters. They are implemented through atomic operations
and are intended to be used just for cases, where the
reference counter is the only one thing to be protected,
so even something like a spin-mutex is deprecated. Using
the refcount interface for structures, where a mutex is
already used is often wrong since we should probably close
the reference counter in some already protected paths. A
manpage discussing refcount does not exist currently, just
check sys/refcount.h
for an overview
of the existing API.
FreeBSD kernel has huge classes of locks. Every lock is defined by some peculiar properties, but probably the most important is the event linked to contesting holders (or in other terms, the behavior of threads unable to acquire the lock). FreeBSD's locking scheme presents three different behaviors for contenders:
spinning
blocking
sleeping
numbers are not casual
Spin locks let waiters to spin until they cannot acquire the lock. An important matter do deal with is when a thread contests on a spin lock if it is not descheduled. Since the FreeBSD kernel is preemptive, this exposes spin lock at the risk of deadlocks that can be solved just disabling interrupts while they are acquired. For this and other reasons (like lack of priority propagation support, poorness in load balancing schemes between CPUs, etc.), spin locks are intended to protect very small paths of code, or ideally not to be used at all if not explicitly requested (explained later).
Block locks let waiters to be descheduled and blocked until the lock owner does not drop it and wakes up one or more contenders. In order to avoid starvation issues, blocking locks do priority propagation from the waiters to the owner. Block locks must be implemented through the turnstile interface and are intended to be the most used kind of locks in the kernel, if no particular conditions are met.
Sleep locks let waiters to be descheduled and fall asleep until the lock holder does not drop it and wakes up one or more waiters. Since sleep locks are intended to protect large paths of code and to cater asynchronous events, they do not do any form of priority propagation. They must be implemented through the sleepqueue(9) interface.
The order used to acquire locks is very important, not only for the possibility to deadlock due at lock order reversals, but even because lock acquisition should follow specific rules linked to locks natures. If you give a look at the table above, the practical rule is that if a thread holds a lock of level n (where the level is the number listed close to the kind of lock) it is not allowed to acquire a lock of superior levels, since this would break the specified semantic for a path. For example, if a thread holds a block lock (level 2), it is allowed to acquire a spin lock (level 1) but not a sleep lock (level 3), since block locks are intended to protect smaller paths than sleep lock (these rules are not about atomic operations or scheduling barriers, however).
This is a list of lock with their respective behaviors:
spin mutex - spinning - mutex(9)
sleep mutex - blocking - mutex(9)
pool mutex - blocking - mtx_pool(9)
sleep family - sleeping - sleep(9) pause tsleep msleep msleep spin msleep rw msleep sx
condvar - sleeping - condvar(9)
rwlock - blocking - rwlock(9)
sxlock - sleeping - sx(9)
lockmgr - sleeping - lockmgr(9)
semaphores - sleeping - sema(9)
Among these locks only mutexes, sxlocks, rwlocks and lockmgrs are intended to handle recursion, but currently recursion is only supported by mutexes and lockmgrs.
Scheduling barriers are intended to be used in order to drive scheduling of threading. They consist mainly of three different stubs:
critical sections (and preemption)
sched_bind
sched_pin
Generally, these should be used only in a particular context and even if they can often replace locks, they should be avoided because they do not let the diagnose of simple eventual problems with locking debugging tools (as witness(4)).
The FreeBSD kernel has been made preemptive basically to deal with interrupt threads. In fact, in order to avoid high interrupt latency, time-sharing priority threads can be preempted by interrupt threads (in this way, they do not need to wait to be scheduled as the normal path previews). Preemption, however, introduces new racing points that need to be handled, as well. Often, in order to deal with preemption, the simplest thing to do is to completely disable it. A critical section defines a piece of code (borderlined by the pair of functions critical_enter(9) and critical_exit(9), where preemption is guaranteed to not happen (until the protected code is fully executed). This can often replace a lock effectively but should be used carefully in order to not lose the whole advantage that preemption brings.
Another way to deal with preemption is the
sched_pin()
interface. If a piece of
code is closed in the sched_pin()
and sched_unpin()
pair of functions
it is guaranteed that the respective thread, even if it
can be preempted, it will always be executed on the same
CPU. Pinning is very effective in the particular case
when we have to access at per-cpu datas and we assume
other threads will not change those data. The latter
condition will determine a critical section as a too
strong condition for our code.
sched_bind
is an API used in
order to bind a thread to a particular CPU for all the
time it executes the code, until a
sched_unbind
function call does not
unbind it. This feature has a key role in situations
where you cannot trust the current state of CPUs (for
example, at very early stages of boot), as you want to
avoid your thread to migrate on inactive CPUs. Since
sched_bind
and
sched_unbind
manipulate internal
scheduler structures, they need to be enclosed in
sched_lock
acquisition/releasing when
used.
Various emulation layers sometimes require some
additional per-process data. It can manage separate
structures (a list, a tree etc.) containing these data for
every process but this tends to be slow and memory
consuming. To solve this problem the FreeBSD
proc
structure contains
p_emuldata
, which is a void pointer to
some emulation layer specific data. This
proc
entry is protected by the proc
mutex.
The FreeBSD proc
structure contains a
p_sysent
entry that identifies, which ABI
this process is running. In fact, it is a pointer to the
sysentvec
described above. So by
comparing this pointer to the address where the
sysentvec
structure for the given ABI is
stored we can effectively determine whether the process
belongs to our emulation layer. The code typically looks
like:
if (__predict_true(p->p_sysent != &elf_Linux®_sysvec))
return;
As you can see, we effectively use the
__predict_true
modifier to collapse the
most common case (FreeBSD process) to a simple return operation
thus preserving high performance. This code should be
turned into a macro because currently it is not very
flexible, i.e. we do not support Linux®64 emulation nor
A.OUT Linux® processes on i386.
The FreeBSD VFS subsystem is very complex but the Linux® emulation layer uses just a small subset via a well defined API. It can either operate on vnodes or file handlers. Vnode represents a virtual vnode, i.e. representation of a node in VFS. Another representation is a file handler, which represents an opened file from the perspective of a process. A file handler can represent a socket or an ordinary file. A file handler contains a pointer to its vnode. More then one file handler can point to the same vnode.
The namei(9) routine is a central entry point to pathname lookup and translation. It traverses the path point by point from the starting point to the end point using lookup function, which is internal to VFS. The namei(9) syscall can cope with symlinks, absolute and relative paths. When a path is looked up using namei(9) it is inputed to the name cache. This behavior can be suppressed. This routine is used all over the kernel and its performance is very critical.
The vn_fullpath(9) function takes the best effort to traverse VFS name cache and returns a path for a given (locked) vnode. This process is unreliable but works just fine for the most common cases. The unreliability is because it relies on VFS cache (it does not traverse the on medium structures), it does not work with hardlinks, etc. This routine is used in several places in the Linuxulator.
fgetvp
- given a thread and a
file descriptor number it returns the associated
vnode
vn_lock(9) - locks a vnode
vn_unlock
- unlocks a
vnode
VOP_READDIR(9) - reads a directory referenced by a vnode
VOP_GETATTR(9) - gets attributes of a file or a directory referenced by a vnode
VOP_LOOKUP(9) - looks up a path to a given directory
VOP_OPEN(9) - opens a file referenced by a vnode
VOP_CLOSE(9) - closes a file referenced by a vnode
vput(9) - decrements the use count for a vnode and unlocks it
vrele(9) - decrements the use count for a vnode
vref(9) - increments the use count for a vnode
All FreeBSD documents are available for download at https://download.freebsd.org/ftp/doc/
Questions that are not answered by the
documentation may be
sent to <freebsd-questions@FreeBSD.org>.
Send questions about this document to <freebsd-doc@FreeBSD.org>.