System Calls
Definition and purpose: In computing, a system call is the programmatic way in which a user-space program requests a service from the operating system kernel (System call – Wikipedia). System calls provide an essential interface between processes and the OS, allowing applications to perform operations like accessing hardware (e.g. disk I/O, network) or creating new processes via a controlled, privileged mechanism (System call – Wikipedia). Instead of directly manipulating hardware, programs use system calls as abstracted operations (e.g. reading a file, creating a socket), which the OS implements. This abstraction improves portability and safety, since the OS checks and executes these requests in a safe manner.
User mode vs. kernel mode: Modern CPUs operate in at least two modes: user mode (restricted) and kernel mode (privileged). Regular application code runs in user mode and cannot directly access hardware or critical resources, whereas the OS kernel runs in kernel mode with full privileges (System call – Wikipedia). When a process invokes a system call, the CPU transitions from user mode to kernel mode (often via a software interrupt or special trap instruction) and jumps to a predefined OS handler address (System Calls — The Linux Kernel documentation). The kernel then performs the requested service (with full hardware access) and returns to user mode. This mode switch is essential for protection: user code cannot perform dangerous operations except via system calls, which the kernel validates. Notably, a system call does not require a full context switch to a different process – it executes in the context of the calling process but with elevated privilege (System call – Wikipedia). The overhead of entering kernel mode is present, but it’s cheaper than a full process switch. (In some systems like microkernels, many services run in user space, meaning more frequent user-kernel transitions. This can increase overhead but isolates components for reliability.)
Lifecycle of a system call: The typical sequence is: (1) The application sets up arguments (e.g. which file to open) and invokes the system call (often via a library wrapper like C's open(), which triggers a CPU trap instruction). (2) The CPU switches to kernel mode and jumps to the system call handler; the kernel saves the user process's state (registers, program counter) and uses a system call number to index a table of service routines (System Calls — The Linux Kernel documentation). (3) The kernel routine for that call (e.g. the file-open implementation) executes, accessing hardware or kernel data as needed. (4) On completion, the kernel restores the saved state, switches the CPU back to user mode, and resumes the process, returning any result (e.g. a file descriptor or error code). In summary, “the execution mode switches to kernel, the OS performs the service, then restores registers and returns to user space” (System Calls — The Linux Kernel documentation). This transition and return happen quickly, but still incur more overhead than a normal function call due to mode switching and safety checks.
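To make the wrapper-vs-trap distinction concrete, here is a minimal sketch (assuming Linux with glibc) that requests the same kernel service twice: once through the ordinary write() library wrapper and once through the raw syscall() interface with an explicit system-call number.

```c
/* A minimal sketch (Linux/glibc assumed): the same kernel service invoked
 * through the usual libc wrapper and through the raw syscall(2) interface. */
#define _GNU_SOURCE          /* for syscall() */
#include <unistd.h>          /* write() wrapper, syscall() */
#include <sys/syscall.h>     /* SYS_write number */
#include <string.h>

int main(void) {
    const char *msg = "via libc wrapper\n";
    write(STDOUT_FILENO, msg, strlen(msg));              /* wrapper marshals args and traps */

    const char *raw = "via raw syscall()\n";
    syscall(SYS_write, STDOUT_FILENO, raw, strlen(raw)); /* explicit system-call number */
    return 0;
}
```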
Categories of system calls: System calls can be grouped into several broad categories based on their functionality (System call – Wikipedia) (System call – Wikipedia):
- Process control: create or terminate processes (e.g. fork() on Unix, CreateProcess on Windows), load/execute programs, signal and wait for events, allocate memory, etc (System call – Wikipedia). For example, a call like fork() creates a new process by duplicating the caller, while exec() replaces the process’s memory with a new program.
- File management: operations on files and directories – create, delete, open, close, read, write, reposition (seek), and get/set file attributes (System call – Wikipedia). For instance, open() returns a handle for file I/O, and read()/write() transfer data (a short POSIX sketch follows this list).
- Device management: requests to hardware or virtual devices – open/close device files, read/write to devices, or control device settings (System call – Wikipedia). This includes I/O control system calls (ioctl) that send commands to devices, and calls to request or release device access.
- Information maintenance: calls that get or set system and process information – e.g. getting the time or system configuration, setting timers, getting a process’s ID or priority (System call – Wikipedia).
- Communication: inter-process communication calls – create and use pipes, sockets, shared memory, or send/receive messages (System call – Wikipedia). For example, pipe() creates a unidirectional pipe, and networking sockets (via socket(), etc.) let processes communicate over the network.
- Protection: calls related to access control – e.g. setting file permissions or user access rights (System call – Wikipedia).
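As a concrete illustration of the file-management category, the following sketch (assuming a POSIX system; the path /etc/hostname is only an example) opens a file, reads from it, queries its attributes with fstat(), and closes it.

```c
/* A minimal sketch of the file-management category using POSIX calls:
 * open a file, read a chunk, report its size, and close it. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdio.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);   /* path assumed to exist; returns a descriptor or -1 */
    if (fd < 0) { perror("open"); return 1; }

    char buf[128];
    ssize_t n = read(fd, buf, sizeof buf - 1);  /* read up to 127 bytes */
    if (n >= 0) { buf[n] = '\0'; printf("read %zd bytes: %s", n, buf); }

    struct stat st;
    if (fstat(fd, &st) == 0)                    /* a "get attributes" system call */
        printf("file size: %lld bytes\n", (long long)st.st_size);

    close(fd);
    return 0;
}
```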
These categories are not rigid and specific OSes may have dozens to hundreds of system calls. For example, Linux and UNIX systems follow the POSIX standard for many calls (file and process operations), whereas Windows provides most OS services through the Win32 API, which internally uses native system calls (e.g. NtCreateFile). Despite differences, all major OS kernels (Windows, Linux, macOS, etc.) implement similar functionality via system calls, though naming and details differ.
Real-world OS design differences: Traditional monolithic kernels (Linux, Windows, UNIX) execute system calls within the kernel address space, whereas microkernels push many services (file systems, drivers, etc.) into user space processes. In a microkernel like Minix or seL4, a system call may involve additional IPC messages between user-space services (e.g. a file system server) – this can increase overhead but improves modularity and fault isolation. Another trend is applying system calls in virtualization: hypervisors offer hypercalls (similar in concept to system calls) for guest OSes to request hypervisor services. Overall, system calls remain the critical gate between user programs and the protected kernel, and efficient implementation (e.g. using fast CPU instructions like Intel’s SYSENTER/SYSCALL) is important for performance.
Processes & Threads
Process creation and states: A process is a program in execution – an OS abstraction containing the program’s code, data, and execution context (registers, program counter, memory mappings, etc.) (This article describes the role of process states in OS). When an OS creates a new process (via system calls like fork() or CreateProcess), it allocates a process control block (PCB) data structure to track the process’s state, resources, and metadata. The new process (child) initially shares or duplicates the parent’s context (e.g. in Unix, fork() creates a copy of the parent process). Processes typically go through a lifecycle with well-defined states (Microsoft PowerPoint – chap03 [Compatibility Mode]). A common five-state model includes: New (being created), Ready (fully prepared in memory and waiting to be assigned CPU time), Running (actively executing on a CPU core), Blocked/Waiting (unable to proceed until some event occurs, such as I/O completion), and Terminated (finished execution) (Microsoft PowerPoint – chap03 [Compatibility Mode]). For example, when a process is created (New) and admitted by the OS scheduler, it enters the Ready queue. When the scheduler dispatches it, it transitions to Running on a CPU. If it needs I/O or waits for a resource, it moves to a Waiting state until the event occurs, then goes back to Ready (This article describes the role of process states in OS). Eventually, it exits and enters the Terminated state, at which point the OS cleans up its resources. Only one process per CPU core can be in the Running state at a time (Process state – Wikipedia), but modern OSes rapidly context-switch the CPU among Ready processes to achieve multitasking. The PCB and scheduling data structures keep track of these state transitions. Contemporary OSes also support suspended states (e.g. Ready/Suspended and Blocked/Suspended), in which a process is swapped out of memory to disk to free RAM, introducing additional state nuances.
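A minimal sketch of this creation path on a POSIX system: fork() duplicates the caller, exec replaces the child's memory image with a new program, and the parent blocks (Waiting state) until the child terminates.

```c
/* A minimal sketch of process creation on UNIX (POSIX environment assumed):
 * fork() duplicates the caller, the child execs a new program, the parent waits. */
#include <unistd.h>
#include <stdio.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                 /* child receives a copy of the parent's context */
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                     /* child: replace memory image with a new program */
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        perror("execlp");               /* only reached if exec fails */
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);           /* parent blocks (Waiting) until the child Terminates */
    printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}
```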
While processes provide isolation (each has its own memory address space), creating a process is relatively heavy. For lighter concurrency, OSes implement threads. Threads are the unit of CPU scheduling and execution that live within processes. All threads in a process share the process’s memory and resources, but each thread has its own stack, CPU registers, and program counter. This makes context switches between threads cheaper than between processes (since no memory map switch is needed) (multithreading – Thread context switch Vs. process context switch – Stack Overflow) (multithreading – Thread context switch Vs. process context switch – Stack Overflow). Indeed, switching threads within the same process does not flush the processor’s memory management unit (TLB), whereas switching processes does – avoiding that TLB flush makes thread switches faster (multithreading – Thread context switch Vs. process context switch – Stack Overflow). However, threads lack the strong isolation of processes, so a bug in one thread can affect the entire process.
Thread models (user-level vs kernel-level): There are two primary threading models. User-level threads are managed by a user-space threading library (the OS kernel is unaware of them), whereas kernel-level threads are managed and scheduled by the OS kernel.
- User-level threads: These are lightweight threads implemented in a process’s own runtime (for example, via a threading library in the language or application). Operations like thread creation, synchronization, and context switching are done in user space, which makes them very fast (no kernel trap needed). Switching between user-level threads does not involve a kernel mode switch, so it can be as fast as a function call. However, since the OS only sees one task (the process), all user threads of a process map to one kernel scheduling unit (System call – Wikipedia). This is the many-to-one model: many user threads, but the OS schedules only a single kernel thread for the process (System call – Wikipedia). A drawback is if one user thread performs a blocking system call (e.g. read from disk), the entire process (and all its threads) are blocked, because the kernel isn’t aware of the other user threads that could run (System call – Wikipedia). Also, user threads cannot run concurrently on multiple CPUs (since OS sees only one thread). Despite these limitations, user-level threads (sometimes called green threads) offer extremely low-overhead context switches and were used in older systems or language runtimes.
- Kernel-level threads: Here the OS knows about each thread and schedules them individually. This is a one-to-one model: each user thread corresponds to a kernel thread (System call – Wikipedia). Most modern OSes (Linux, Windows, macOS) use kernel threads for user applications (System call – Wikipedia). The OS can schedule different threads of the same process on different CPUs in parallel. If one thread blocks on I/O, the kernel can switch to another thread in the same process – thus no process-wide freeze. The trade-off is that kernel threads have higher overhead to manage: creating or switching threads involves kernel calls. Context switching between two threads (even in the same process) goes through the kernel’s scheduler, requiring saving/restoring registers and possibly flushing some CPU state. Still, switching between two threads in the same process is generally cheaper than switching between threads of different processes, because the memory mapping remains the same (no TLB flush) (multithreading – Thread context switch Vs. process context switch – Stack Overflow) (multithreading – Thread context switch Vs. process context switch – Stack Overflow). Kernel threads also consume kernel resources (each has a kernel stack, etc.), so there’s a limit to how many can be efficiently supported.
Some systems implemented a hybrid many-to-many model or two-level model, where multiple user threads are multiplexed over a smaller or equal number of kernel threads (allowing concurrency but with some of the flexibility of user threads) (System call – Wikipedia). Older Solaris had such an M:N threading model. In practice, one-to-one threading (with possible internal scheduling optimizations) is prevalent today.
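Under the one-to-one model described above, each thread created through the POSIX thread API is backed by a kernel-schedulable entity. A minimal sketch (pthreads assumed; compile with -pthread):

```c
/* A minimal sketch of one-to-one (kernel-level) threading with POSIX threads:
 * each pthread_create() below is backed by a thread the kernel can schedule on any CPU. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running; the kernel may place it on any core\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);  /* kernel sees 4 schedulable threads */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);                        /* wait for each to finish */
    return 0;
}
```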
Context switching and performance: A context switch occurs when the CPU switches from running one thread/process to another. This requires saving the state (CPU register values, program counter, etc.) of the currently running thread and loading the saved state of the next thread to run. If the two threads belong to different processes, the OS must also switch the memory context (load a new page table or memory map), which invalidates cached memory translations (TLB) and may incur additional cache misses (multithreading – Thread context switch Vs. process context switch – Stack Overflow). Thus, process context switches carry more overhead than thread context switches. When switching between threads of the same process, the virtual memory space remains the same, avoiding TLB flush and reducing cache disruption (multithreading – Thread context switch Vs. process context switch – Stack Overflow) (multithreading – Thread context switch Vs. process context switch – Stack Overflow). In either case, there is a cost: entering the kernel scheduler, doing the save/restore, and warming up caches for the new thread. Excessive context switching (often caused by too many threads or too short time-slices) can degrade performance – the CPU spends time switching rather than executing useful work (The Performance Impact of Excessive Context Switching – Serkan Erip) (The Performance Impact of Excessive Context Switching – Serkan Erip). Therefore, OS schedulers try to balance responsiveness with minimizing unnecessary context switches. Modern CPUs also have features like cache affinity, so OS schedulers attempt to keep threads on the same CPU core to improve cache reuse.
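One way to observe this cost indirectly is through the per-process context-switch counters that Linux and the BSDs expose via getrusage(); the ru_nvcsw/ru_nivcsw fields used below are extensions beyond base POSIX, so treat this as a Linux/BSD-specific sketch.

```c
/* A minimal sketch (Linux/BSD assumed): struct rusage reports how many voluntary
 * (blocking) and involuntary (preemption) context switches this process incurred. */
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (int i = 0; i < 5; i++)
        usleep(1000);                       /* blocking sleeps cause voluntary switches */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("voluntary ctx switches:   %ld\n", ru.ru_nvcsw);
    printf("involuntary ctx switches: %ld\n", ru.ru_nivcsw);
    return 0;
}
```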
In summary, threads provide a mechanism for concurrency within a process with lower overhead than full processes. User-level threads minimize overhead further but at the cost of kernel transparency, while kernel-managed threads gain full OS support and parallelism at some performance cost. For example, in Linux, the clone() system call creates threads (or processes) with various sharing options, and Linux schedules all threads (which are essentially tasks with potentially shared memory) with the same scheduler. Windows provides a thread API and schedules threads preemptively with dynamic priorities. Both approaches aim to optimize throughput and responsiveness: “Switching between user-level threads is often faster… because it doesn’t require resetting memory protections… Kernel-level threads allow a thread to run while another thread in the same process is blocked… and can run simultaneously on multiprocessors.” (operating systems – What is the difference between user-level threads and kernel-level threads? – Computer Science Stack Exchange).
(Advanced note: Recent trends include unikernels, which blur the line between processes and the OS. A unikernel builds a single-address-space machine image containing one application linked with the OS kernel libraries. There is essentially one process/thread (with no user-kernel mode switch in the traditional sense). This can yield performance gains for specialized cloud applications, but sacrifices the multiprocess model. Also, OS kernels are exploring new concurrency paradigms; for example, the Linux kernel uses Read-Copy-Update (RCU), a lock-free synchronization mechanism, to efficiently handle read-mostly data structures in a concurrent environment, demonstrating that not all synchronization in kernels relies solely on classical locking.)
Inter-Process Communication (IPC)
Overview: Inter-process communication refers to mechanisms that allow processes to exchange data and signals. Since processes in modern OSes run in isolated memory spaces, IPC methods are needed for cooperation. Broadly, IPC can be message-based or memory-based (Inter-Process Communication – OMSCS Notes). In message-based IPC, processes send discrete messages via the kernel (which might copy data between address spaces) – examples include pipes, message queues, and sockets. In memory-based IPC, processes share a memory region, and communicate by reading/writing to that shared memory (synchronization needed to coordinate access) (Inter-Process Communication – OMSCS Notes). The choice of IPC mechanism depends on use case: for instance, sharing large data is efficient with shared memory, while message passing is useful for distributed systems or when synchronization complexity must be minimized.
Common IPC mechanisms and their characteristics:
- Pipes: A pipe is a unidirectional communication channel that connects the output of one process to the input of another. One process writes bytes into the pipe, and the other reads those bytes in FIFO order. Pipes are typically used between related processes (e.g. a parent and child) – for example, in UNIX, pipe() creates a pair of file descriptors for a parent and child to communicate (see the pipe sketch after this list). Anonymous pipes are unidirectional and exist only as long as the processes are alive, whereas named pipes (FIFOs) can persist and be used between any processes. Pipes are simple and have low overhead for small data streams, but are limited to one-way flow (for full duplex, two pipes can be used) and are local to one machine. They’re excellent for producer-consumer workflows (e.g. connecting shell commands). “A pipe is a unidirectional communication channel used for IPC between two related processes. One process writes to the pipe, and the other process reads from it.” ([PDF] Question & Answers – Sercan Külcü).
- Message Queues: Many operating systems provide message queue IPC (e.g. System V message queues or POSIX message queues on UNIX, or mailslots on Windows). A message queue allows processes to post discrete messages (structured chunks of data) to a kernel-managed queue, where another process can read them (often in FIFO order, though some systems allow prioritized messages). Queues facilitate asynchronous communication – the sender and receiver don’t need to interact at the same instant. The OS ensures synchronization (e.g. blocking a reader if the queue is empty, or optionally blocking a writer if the queue is full). Use case: when processes need to exchange distinct units of data (e.g. jobs, commands) and possibly have multiple senders/receivers. The advantage is decoupling (the sender can continue after posting the message) and that the kernel can enforce access control and ordering. The downside is that each message send/receive involves copying data into/out of the kernel, which can be less efficient for large volumes than shared memory. Still, message queues are widely used for moderate-sized communications.
- Shared Memory: Shared memory maps a region of physical memory into the address space of multiple processes, so they can all read/write the same bytes. Once set up (via system calls like shmget/shmat on UNIX or CreateFileMapping/MapViewOfFile on Windows), no further kernel intervention is needed for data transfer – the processes communicate by directly accessing memory, which is very fast (as fast as normal memory access). This makes shared memory the highest-throughput IPC method, especially for large data, because it avoids per-data syscalls. For example, two processes could share a large buffer to avoid copying data through pipes. However, shared memory comes with challenges: processes must coordinate access to avoid race conditions (usually using synchronization primitives like semaphores or mutexes, possibly provided by the OS or shared as well). Another drawback: shared memory is only applicable to processes on the same machine (unlike sockets, which can be networked). Use cases include multimedia applications (sharing video frames between processes), or client-server systems on one host where copying overhead must be minimized. The OS also typically needs to clean up shared memory segments if processes crash or fail to remove them.
- Sockets: Sockets are a more general IPC mechanism originally designed for network communication. They allow two-way byte-stream or message-based communication between processes, either on the same host or across a network. For local IPC, UNIX domain sockets (on POSIX systems) or Windows named pipes provide socket-like bidirectional IPC with high efficiency. Sockets can be of type STREAM (TCP semantics, reliable byte stream) or DGRAM (UDP semantics, message datagrams). Using a socket for local IPC incurs more overhead than a pipe or shared memory (since the data goes through the networking stack and might be copied more times), but it has the great advantage of uniformity – the same interface works for inter-host communication. For example, a client and server on the same machine might still use a loopback network socket; or a set of microservices might communicate via localhost sockets. Sockets are essential if IPC needs to be distributed across different machines. On a single machine, Unix domain sockets avoid network protocol overhead and use the filesystem namespace for addressing, making them quite efficient (the kernel still copies data between processes, but can bypass a lot of networking code).
- Other IPC mechanisms: There are signals – simple asynchronous notifications to processes (used in UNIX for events or to request termination, etc.), though signals carry minimal data (mostly just an identifier). There are OS-specific mechanisms like Remote Procedure Calls (RPC) which abstract IPC into a function-call model (often using sockets underneath for remote calls). On Windows, COM (Component Object Model) and OLE allow high-level object and data sharing between applications (built atop lower-level IPC and shared memory). There are also memory-mapped files (which map file contents into memory, effectively shared memory with the filesystem – useful for sharing large data via files), and RPC-based frameworks like gRPC or Distributed Computing Environment (DCE RPC) for network IPC. Each has pros and cons – for instance, RPC abstracts the details but can hide costs, whereas shared memory gives maximum performance but requires careful synchronization by the programmer.
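As referenced in the pipes item above, here is a minimal parent-child sketch (POSIX assumed): the parent writes a message into the pipe and the child reads it.

```c
/* A minimal sketch of pipe-based IPC between a parent and child (POSIX assumed):
 * the parent writes into one end of the pipe, the child reads from the other. */
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

int main(void) {
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }  /* fds[0] = read end, fds[1] = write end */

    pid_t pid = fork();
    if (pid == 0) {                      /* child: reader */
        close(fds[1]);                   /* close unused write end */
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("child received: %s", buf); }
        close(fds[0]);
        _exit(0);
    }

    close(fds[0]);                       /* parent: writer */
    const char *msg = "data through the pipe\n";
    if (write(fds[1], msg, strlen(msg)) < 0) perror("write");
    close(fds[1]);                       /* closing gives the reader EOF */
    waitpid(pid, NULL, 0);
    return 0;
}
```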
Use cases, advantages, limitations: In summary, IPC method selection depends on needs:
- If you have a producer-consumer pipeline (especially parent-child processes) and need simplicity, pipes are ideal (very simple API, OS-managed synchronization). Limitation: one-direction and local only.
- If you need to send structured messages/commands and have potentially multiple senders/receivers or want queued communication, message queues are convenient. They decouple processes and handle flow control, but are subject to kernel copy overhead and finite queue sizes.
- For large data sharing or high-frequency communication, shared memory is preferred because of minimal overhead (no kernel on each access). It’s the fastest IPC for bulk data (e.g. multimedia processing). However, it requires careful coordination (the OS doesn’t implicitly serialize access) and lacks inherent notification (often paired with semaphores or condition variables).
- Sockets are the most flexible, especially if you might extend communication to multiple machines. They allow client-server models and can use standard network protocols. The cost is higher latency (due to protocol overhead) and complexity of setup. On one system, Unix domain sockets offer a good compromise for bidirectional IPC without manual shared memory management.
- Some systems also use files for communication – one process writes to a file and another reads (not real-time but useful for logging or very loosely coupled communication). This is not efficient or real-time, but demonstrates the spectrum from tight coupling (shared mem) to loose coupling (file exchange).
In modern OS practice, multiple IPC mechanisms might be used together. For example, a database might use shared memory for transferring query results to a client process, but use a socket or pipe to send control commands and wake up the client. Operating systems like Linux provide all these IPC tools (pipes via pipe(), SysV or POSIX message queues, SysV/POSIX shared memory, signals, Unix sockets, etc.). Windows provides named pipes, mailslots, the socket API (Winsock), shared memory via file mapping, and high-level frameworks like Local Procedure Call (LPC) and COM.
Importantly, IPC often requires synchronization. For instance, if using shared memory, one must use mutexes or semaphores (which themselves are IPC objects) to protect concurrent access – this ties into the next topic of concurrency control. Also, performance can vary: shared memory and pipes reside in memory and are very fast; sockets might involve network stack overhead. Recent trends include zero-copy IPC techniques (to avoid data copying between kernel and user space) and use of kernel bypass for high-performance messaging in specialized systems.
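To make the shared-memory setup and cleanup concrete, here is a minimal single-process sketch using POSIX shm_open() and mmap() (Linux/glibc assumed; the segment name "/demo_shm" and its size are illustrative, and a real multi-process use would pair this region with a semaphore or mutex for coordination; older glibc versions require linking with -lrt).

```c
/* A minimal sketch of POSIX shared memory: create a named segment, map it,
 * write to it as plain memory, then unmap and unlink it. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *name = "/demo_shm";                    /* illustrative segment name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);   /* kernel object visible to other processes */
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, 4096);                               /* size the segment */

    char *region = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(region, "hello via shared memory");         /* plain memory access, no syscall per byte */
    printf("wrote: %s\n", region);

    munmap(region, 4096);
    close(fd);
    shm_unlink(name);                                  /* clean up the named segment */
    return 0;
}
```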
Concurrency & Synchronization
When multiple threads or processes execute concurrently (especially on multiple CPUs or interleaved on one CPU), we must deal with coordination and synchronization to avoid incorrect behavior. The fundamental problem addressed here is that concurrent entities might access shared resources (memory, files, devices) in an interleaved fashion, causing unpredictability known as race conditions if not controlled.
Critical sections and race conditions: A race condition occurs when the correctness of a computation depends on the relative timing (order of execution) of threads or processes, and threads “race” to access or modify shared data (Race Conditions and Critical Sections) (multithreading – What is a race condition? – Stack Overflow). For example, if two threads both try to increment a global counter, the final result could be wrong if their operations intermix (one thread’s update might be lost). A critical section is a piece of code that accesses a shared resource (such as a shared variable or file) that must not be concurrently executed by more than one thread (Race Conditions and Critical Sections). If two threads enter the same critical section at the same time, a race condition is likely – the state of the shared resource can become inconsistent. “A race condition occurs when two or more threads can access shared data and they try to change it at the same time. Because thread scheduling can swap between threads at any time, the exact ordering is unpredictable, so the result depends on who wins the race to update the data.” (multithreading – What is a race condition? – Stack Overflow). If, for instance, Thread A and Thread B both check a variable X and then update it, the lack of synchronization could lead to lost updates or other incorrect outcomes (the classic check-then-act race). Therefore, we identify critical sections in code and ensure that only one thread at a time executes a critical section on a given shared resource.
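The counter scenario above can be reproduced directly; the sketch below (pthreads assumed) protects the increment with a mutex, anticipating the primitives described in the following paragraphs. Without the lock/unlock pair, the final count typically falls short of 2,000,000 because interleaved read-modify-write sequences lose updates.

```c
/* A minimal sketch of the counter race: two threads each increment a shared
 * counter one million times; the mutex makes each increment a critical section. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);      /* enter critical section */
        counter++;                      /* read-modify-write now excludes the other thread */
        pthread_mutex_unlock(&lock);    /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```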
The need for synchronization: To prevent race conditions, threads must synchronize their execution. This typically means establishing mutual exclusion for critical sections – ensuring that if one thread is executing the critical section, others are excluded from it until it’s done. Additionally, synchronization may involve ordering constraints (e.g. thread B must wait until thread A has produced some data). Without proper synchronization, concurrency can lead to corrupted data structures, incorrect computations, or crashes. For example, imagine two threads withdrawing money from a bank account simultaneously – without synchronization, they might each see the same initial balance and both succeed in overdrawing the account, violating consistency.
Synchronization primitives: Operating systems and programming libraries offer various synchronization mechanisms:
- Mutex (Mutual Exclusion Lock): A mutex is the simplest lock that ensures only one thread can own it at a time. It has two operations: lock (or acquire) and unlock (or release). When a thread locks a mutex, if it’s already locked by another thread, the caller will block (or spin) until the mutex becomes available. Mutexes thus allow threads to serialize access to critical sections – e.g. a thread locks before entering a critical section and unlocks after leaving, guaranteeing exclusive access in between. “A mutex is a locking mechanism used to synchronize access to a resource. Only one thread or process can acquire a mutex at a given time.” (Execution Guardrails: Mutual Exclusion, Sub-technique T1480.002 – Enterprise | MITRE ATT&CK®). Mutexes are typically provided by the OS (e.g. pthread_mutex in the pthreads library on UNIX, CRITICAL_SECTION or Mutex objects on Windows). Internally, if a mutex is uncontended, locking/unlocking is very fast (a few atomic CPU instructions). If contended, the OS will block the waiting thread and schedule another, to avoid busy-waiting. Mutexes can cause blocking and context switches, so they should be held only for short durations to minimize performance impact.
- Semaphores: A semaphore is a more general synchronization primitive, essentially a protected integer counter with operations P (wait) and V (signal) (or in OS APIs, often called wait() and signal()/release() operations). A counting semaphore can allow multiple units of a resource, not just one. For example, a semaphore initialized to N allows up to N threads to enter; if N threads are already inside (the semaphore count goes to 0), further waiters block until someone signals (increments the count). A binary semaphore (count 0/1) is effectively a mutex (without the ownership concept). Semaphores can be used for mutual exclusion (binary semaphore) or for scheduling constraints (e.g. have a thread wait for an event by waiting on a semaphore that another thread will signal). Dijkstra’s classical definition: “A semaphore S is an integer variable that is accessed only through two atomic operations: wait (P) and signal (V)” (Semaphore (programming) – Wikipedia). Semaphores are versatile: for example, a semaphore initialized to 0 can be used as a signal for an event – one thread waits (blocks) until another thread signals the semaphore, at which point the first wakes up. Unlike a mutex, a semaphore doesn’t strictly have an “owner” (any thread can signal it), which can be useful for some producer-consumer problems. However, with binary semaphores it’s easy to accidentally misuse them (e.g. signal from the wrong thread), so many higher-level frameworks prefer mutexes or condition variables for locking semantics.
- Monitors and condition variables: A monitor is a high-level synchronization construct that combines mutual exclusion and the ability for threads to wait for conditions. In classic terms (Hoare or Mesa monitors), a monitor is essentially an object or module that has a mutex implicitly associated with it and condition variables for waiting. A condition variable is a queue of threads waiting for some condition to be true. Condition variables are used in conjunction with a mutex: a thread locks the mutex, then checks a condition (like “buffer not empty?”). If the condition is false, the thread waits on the condition variable, which atomically releases the mutex and puts the thread to sleep until another thread signals the condition (Monitor (synchronization) – Wikipedia). Another thread, when it changes the state (e.g. produces an item into the buffer), will lock the same mutex and signal the condition variable to wake one waiting thread, then unlock. This allows threads to sleep inside critical sections waiting for a specific event without holding the lock. The monitor abstraction, as used in Java for example, means every object has an implicit mutex and condition (in Java, the synchronized keyword on a method/block is the mutex, and wait()/notify() are condition variable operations on that object) ([Notes] Mutex, Semaphore, Monitor | by Tarun Jain | Medium). Monitors make synchronization safer by tying the lock and condition waits together. “A monitor consists of a mutex (lock) and at least one condition variable. Threads temporarily give up exclusive access (lock) in order to wait for a condition to be met, then regain the lock to resume.” (Monitor (synchronization) – Wikipedia). This pattern avoids busy-waiting and makes it easier to coordinate complex conditions, like producer-consumer where producers wait if the buffer is full and consumers wait if the buffer is empty, both protected by one mutex (a producer-consumer sketch follows this list).
- Spinlocks and others: A spinlock is a simple lock where a thread waits in a loop (“spins”) checking until the lock becomes available. Spinlocks avoid the overhead of a context switch when the wait is expected to be very short (they’re often used inside OS kernels or on multiprocessors for low-level synchronization). However, they tie up the CPU while waiting, so they are inefficient if held for longer. OS kernels also use specialized mechanisms like reader-writer locks (allow multiple concurrent readers or one writer), barriers (to make threads wait until all reach a certain point), and more.
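As promised above, a minimal producer-consumer sketch in the monitor style (pthreads assumed): one mutex guards the bounded buffer, and two condition variables let the producer and consumer sleep until "not full" or "not empty" holds.

```c
/* A minimal producer-consumer sketch with a mutex and condition variables,
 * illustrating the monitor pattern described above. */
#include <pthread.h>
#include <stdio.h>

#define CAP 8
static int buf[CAP], count = 0, in = 0, out = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 20; i++) {
        pthread_mutex_lock(&m);
        while (count == CAP)                      /* wait atomically releases m and sleeps */
            pthread_cond_wait(&not_full, &m);
        buf[in] = i; in = (in + 1) % CAP; count++;
        pthread_cond_signal(&not_empty);          /* wake a waiting consumer */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < 20; i++) {
        pthread_mutex_lock(&m);
        while (count == 0)
            pthread_cond_wait(&not_empty, &m);
        int item = buf[out]; out = (out + 1) % CAP; count--;
        pthread_cond_signal(&not_full);           /* wake a waiting producer */
        pthread_mutex_unlock(&m);
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```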
Challenges in concurrency: Even with these primitives, concurrency is tricky. Two classic problems:
- Deadlock: This occurs when threads cyclically wait on resources held by each other, and none can proceed. (Deadlock is significant enough to have its own section below.) For example, Thread A holds mutex X and waits for mutex Y, while Thread B holds Y and waits for X – both are stuck forever. Proper lock ordering or deadlock avoidance techniques are needed to prevent such situations in concurrent code.
- Starvation: A thread might never get CPU time or lock access because others are constantly favoured. For instance, a low-priority thread might starve if higher-priority threads keep running (unless the OS scheduler prevents this), or a thread might starve for a lock if other threads repeatedly beat it to acquiring the lock. This is often mitigated by fair scheduling or lock mechanisms (some mutexes offer fairness or FIFO queuing to ensure each waiter eventually gets the lock).
- Priority inversion: A scheduling anomaly where a high-priority thread is waiting on a resource (lock) held by a lower-priority thread, but a medium-priority thread preempts the low-priority one, causing the high-priority thread to wait longer – effectively inverting priorities. For example, a low-priority task L holds a mutex, a high-priority task H blocks on that mutex, and then a medium task M runs and preempts L (since M > L in priority), preventing L from releasing the lock, thereby blocking H indefinitely. This famously happened in the Mars Pathfinder mission (a high-priority thread was blocked by a low-priority thread holding a resource while medium threads ran). OSes address this by priority inheritance – temporarily boosting the priority of the low-priority thread holding the lock to that of the highest waiter, so it can run and release the lock (FreeRTOS Priority Inversion). Without such measures, priority inversion can lead to serious responsiveness issues in real-time systems. “Priority inversion is a bug where a high-priority task is indirectly preempted by a low-priority task (e.g. low-priority task holds a mutex needed by high-priority, and a medium-priority task runs, blocking the low-priority from running to release the mutex)” (FreeRTOS Priority Inversion). The solution is priority inheritance or protocols like priority ceiling, which the OS or synchronization library can implement to avoid unbounded priority inversion (a configuration sketch follows this list).
- Memory consistency and atomicity: On multiprocessors, even with locks, one must consider memory ordering. Modern CPUs may reorder memory operations for performance. Therefore, synchronization primitives typically also act as memory barriers, enforcing that operations inside critical sections appear in the intended order to other threads. Another issue is making sure certain operations are atomic (indivisible). Low-level synchronization often relies on atomic CPU instructions (like compare-and-swap (CAS) or atomic test-and-set) to implement locks and semaphores. These hardware primitives ensure that even if two CPUs attempt, say, to test-and-set a flag at the same time, one will succeed and the other will see the updated value.
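As a configuration sketch for the priority-inversion item above (assuming a POSIX system that supports the priority-inheritance protocol, such as Linux), a mutex can be created with PTHREAD_PRIO_INHERIT so the kernel boosts a low-priority holder while higher-priority threads wait on it:

```c
/* A minimal sketch of requesting priority inheritance on a POSIX mutex:
 * a low-priority thread holding shared_lock is temporarily raised to the
 * priority of the highest-priority waiter, bounding priority inversion. */
#include <pthread.h>

pthread_mutex_t shared_lock;

int init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* PTHREAD_PRIO_INHERIT: the holder inherits the priority of blocked waiters */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    int rc = pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;    /* 0 on success */
}
```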
In practice, writing correct concurrent programs is one of the hardest tasks. The primitives above help structure the solutions: mutexes for mutual exclusion, semaphores or condition variables for ordering and waiting. Higher-level concurrency libraries (like OpenMP, Java’s java.util.concurrent, etc.) build further abstractions (futures, thread pools, etc.) on top of these basic primitives. The OS supports concurrency by providing these mechanisms in the kernel (for inter-process sync, e.g. POSIX named semaphores, futexes in Linux) or by using them internally in scheduling and resource management. If synchronization is misused, issues like deadlock or livelock (threads not making progress but not technically deadlocked) can occur.
In summary, to safely share data in a concurrent environment, threads must synchronize access using proper primitives. A critical section should be protected by a mutex or equivalent, preventing race conditions. Condition variables allow threads to wait efficiently for specific conditions, avoiding busy-wait loops. Semaphores can handle more complex resource count scenarios or act as simple locks. The OS and hardware provide the underpinning (like atomic instructions) to make these primitives work correctly. The next section will delve deeper into deadlocks, one of the worst synchronization problems.
Deadlock
When a set of processes or threads are each waiting for an event that only another process in the set can cause, none of them can proceed – this situation is called a deadlock. In simpler terms, deadlock is a circular waiting predicament: Process A waits for a resource held by B, while B waits for a resource held by A (or a chain of such waiting exists). As a result, all are stuck permanently.
Conditions for deadlock: Deadlock doesn’t happen arbitrarily; it arises only if certain conditions hold simultaneously. These are known as the Coffman conditions (from a 1971 paper) (Deadlock), and all four are necessary for deadlock (Deadlock):
- Mutual Exclusion: At least one resource is non-shareable – only one process can use it at a time (Deadlock). For example, a printer device or a file lock that only one process can hold. If all resources were shareable (e.g. read-only files), deadlock wouldn’t occur because processes wouldn’t block each other.
- Hold and Wait: Some process holds one resource and is waiting for another resource that is currently held by some other process (Deadlock). In other words, processes hold resources they’ve already acquired while requesting new ones. If processes didn’t hold what they already have while waiting (i.e. if they released everything before requesting new resources), deadlocks would be avoidable.
- No Preemption: Resources cannot be forcibly removed (preempted) from a process; a resource is released only voluntarily by the process holding it when it’s done (Deadlock). Thus, the OS cannot simply snatch a resource (like memory or a lock) from one process to give to another. If preemption were possible (e.g. the OS could roll back or suspend a process and take its resource), we could break deadlocks by force.
- Circular Wait: There exists a circular chain of two or more processes where each process is waiting for a resource held by the next process in the circle (Deadlock). Formally, we have processes P1, P2, …, Pn such that P1 waits for a resource held by P2, P2 waits for a resource held by P3, …, and Pn waits for a resource held by P1 (Deadlock). This condition implies a cycle in the resource allocation graph (a directed graph of which process holds which resource and requests which resource).
If all four conditions hold, a deadlock situation is possible (Deadlock) (and if circular wait actually occurs, deadlock exists). If any one of these conditions is broken or prevented, deadlock cannot occur (Deadlock, Part 2: Deadlock Conditions – UIUC CS241 SystemProgramming Wiki) (Deadlock, Part 2: Deadlock Conditions – UIUC CS241 SystemProgramming Wiki). These conditions often indeed occur naturally: many resources are mutually exclusive (printers, disk blocks, mutex locks), processes do hold resources while requesting others, and OSes typically don’t preempt locks or other resources (preemption is easy for CPU but not for general resources). Thus the key to deadlock avoidance or prevention is to ensure a cycle cannot form or to conditionally violate one of the conditions.
Deadlock prevention strategies: Deadlock prevention means designing the system or resource allocation protocols so that deadlock is structurally impossible by negating one of the Coffman conditions (What is deadlock? – Lenovo). Some approaches:
- Eliminate Mutual Exclusion: If you can make resources sharable, no deadlock. In practice, certain resources can be made concurrent (e.g. read-only access to files by multiple processes), but many resources are inherently exclusive (you can’t have two threads simultaneously hold the same mutex or two processes exclusively write to the same file region without conflict). This is usually not a general solution except in special cases.
- Eliminate Hold and Wait: Require processes to request all the resources they will need up front, and if they can’t get them all, release what they have (or don’t hold anything until all can be obtained). This way, a process isn’t holding one resource while waiting for another. For example, a system could enforce that when a process starts, it must request everything it will need; or a thread must not acquire a lock if it already holds another (unless it can get both at once). Practically, this can lead to low resource utilization (processes may acquire resources long before they need them, tying them up) and the system needs to know in advance all resources needed, which is not always feasible.
- Allow Preemption: If a process is holding some resources and requests another that cannot be immediately allocated, it must release (preempt) its currently held resources and retry later. This breaks the hold-and-wait (and no-preemption) condition. For example, if Process A holds R1 and requests R2 which is held by B, the OS could force A to release R1. This is easier for some resources (e.g. CPU can be preempted, memory pages can be swapped out) but harder for others (can you preempt a printer in the middle of printing? Possibly by resetting it, but that wastes work). Some database systems preempt transactions (abort and restart them) to resolve deadlocks. In OS resource allocation, preemption is not always possible or desirable due to side effects.
- Eliminate Circular Wait: Impose a strict ordering of resource acquisition so that circular wait cannot occur (Deadlock, Part 2: Deadlock Conditions – UIUC CS241 SystemProgramming Wiki). For example, assign each resource type a unique global number. Require that each process can request resources only in increasing order of their numbers (and release in any order or all at once). This means it’s impossible to have a cycle, because if P1 is waiting for a resource held by P2, then P1 must be requesting a higher-numbered resource than it already holds, etc., which prevents a circular chain. This approach is common: e.g., all locks in a program might be given an order and code is written to always lock in a consistent sorted order. It’s a practical prevention technique in software (many multi-lock algorithms use a lock ordering to avoid deadlock). The OS can also enforce an ordering by design (for instance, always acquire mutexes in kernel in a certain address-based order). The trade-off is inflexibility and potential to reduce concurrency (maybe a process has to request a resource earlier than needed just to respect ordering).
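A minimal sketch of that ordering discipline with two pthread mutexes: whichever lock has the lower address is always taken first, so no two threads can end up holding the pair in opposite orders.

```c
/* A minimal sketch of eliminating circular wait by acquiring two locks in a
 * fixed (address-based) total order, regardless of which one the caller names first. */
#include <pthread.h>
#include <stdint.h>

void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {   /* impose a total order: lower address first */
        pthread_mutex_t *tmp = a; a = b; b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    /* release order does not affect deadlock freedom */
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```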
In short, deadlock prevention is proactive: the system is designed so that at least one Coffman condition is never true (Deadlock, Part 2: Deadlock Conditions – UIUC CS241 SystemProgramming Wiki). For example, some systems prevent circular wait by having a global lock ordering. A real-world analogy given by a resource ordering approach: “Two friends need a pen and paper. They agree to always pick up the pen before the paper – this total ordering (pen then paper) prevents circular wait. Or they grab both pen and paper together (eliminating hold-and-wait), or they politely ask the other to give up the resource if needed (allowing preemption).” (Deadlock, Part 2: Deadlock Conditions – UIUC CS241 SystemProgramming Wiki). Each measure breaks a condition (no circular wait, no hold-and-wait, allow preemption respectively) and thus avoids deadlock.
Deadlock avoidance (Banker’s algorithm): Avoidance is a more flexible approach where the OS tries to ensure the system never enters a dangerous deadlock-prone state. The classic example is Dijkstra’s Banker’s Algorithm, named by analogy to a banker who will only approve a loan if they can guarantee that the bank has enough resources for all customers to finish (not go bankrupt). In OS terms, the Banker’s algorithm assumes each process declares in advance its maximum need of each resource type. Then, every time a process requests resources, the OS checks if granting that request could lead to a deadlock in the future (it simulates allocation and sees if there’s a sequence of allocations that can satisfy all processes) (Banker’s algorithm – Wikipedia). If the request would put the system in an unsafe state (one where deadlock is possible if everyone maxes out their needs), the OS delays or denies the request (Banker’s algorithm – Wikipedia). Otherwise, it grants it. This requires the system to have a priori knowledge of maximum needs and to perform a “safety check” (which is essentially a form of graph/algorithm analysis to ensure a safe sequence exists). Banker’s algorithm is mainly of historical and pedagogical interest – it demonstrates deadlock avoidance in principle, but it’s not often used in general-purpose OS because processes don’t usually know their max resource usage in advance, and resource numbers vary dynamically. However, variants of avoidance appear in certain scenarios (like memory allocation in real-time systems where worst-case needs are known). The key idea: the OS does not simply grant every request – it sometimes intentionally withholds resources to avoid entering a deadlock state. This algorithm requires that resources be requested and released in a way that can be tracked, and that the total resources are fixed and known.
To illustrate, suppose we have resources and processes with declared maximum needs; the Banker’s algorithm will compute if there’s a safe sequence of fulfilling requests. If a request arrives that would leave no safe sequence, it is postponed. This algorithm can be seen as testing a potential allocation for safety by “simulating the allocation of all resources and then making an s-state check for possible deadlock conditions before deciding whether to allow it” (Banker’s algorithm – Wikipedia). A safe state guarantees no deadlock will occur if processes proceed to their max demands (there exists some order to finish everyone). An unsafe state is not necessarily a deadlock yet, but could lead to one; avoidance ensures we never even enter an unsafe state.
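The safety check at the heart of the Banker’s algorithm can be sketched directly; the matrices below are illustrative (4 processes, 3 resource types), and the function simply searches for some order in which every process can obtain its remaining need and finish.

```c
/* A minimal sketch of the Banker's safety check: a state is safe if there is
 * an order in which each process can be granted its remaining need and finish,
 * returning its allocation to the pool. All numbers here are illustrative. */
#include <stdio.h>
#include <stdbool.h>

#define P 4   /* processes */
#define R 3   /* resource types */

bool is_safe(const int avail[R], const int max[P][R], const int alloc[P][R]) {
    int work[R];
    bool finished[P] = { false };
    for (int j = 0; j < R; j++) work[j] = avail[j];

    for (int done = 0; done < P; ) {
        bool progressed = false;
        for (int i = 0; i < P; i++) {
            if (finished[i]) continue;
            bool can_run = true;                        /* need = max - alloc must fit in work */
            for (int j = 0; j < R; j++)
                if (max[i][j] - alloc[i][j] > work[j]) { can_run = false; break; }
            if (can_run) {
                for (int j = 0; j < R; j++) work[j] += alloc[i][j];  /* process finishes, releases */
                finished[i] = true;
                done++;
                progressed = true;
            }
        }
        if (!progressed) return false;                  /* nobody can finish: unsafe state */
    }
    return true;
}

int main(void) {
    int avail[R]    = { 3, 3, 2 };
    int max[P][R]   = { {7,5,3}, {3,2,2}, {9,0,2}, {2,2,2} };
    int alloc[P][R] = { {0,1,0}, {2,0,0}, {3,0,2}, {2,1,1} };
    printf("state is %s\n", is_safe(avail, max, alloc) ? "safe" : "unsafe");
    return 0;
}
```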
Deadlock detection and recovery: In many systems (especially general operating systems), it’s impractical to prevent or avoid deadlocks completely due to performance and incomplete information. Instead, an OS might let deadlocks occur but detect them and recover. Detection involves the OS periodically (or when resource requests time out) examining the resource allocation graph or using an algorithm to find cycles (for single-unit resources, cycle detection in a wait-for graph; for multi-instance resources, a more complex banker’s-like check can be used). If a deadlock is detected (cycle found), the system must recover by breaking the deadlock, typically by terminating or suspending one or more of the processes in the cycle, or forcibly preempting resources (Deadlock (computer science) – Wikipedia) (Deadlock (computer science) – Wikipedia). For example, the OS might choose the process that is least costly to abort (perhaps based on priority, how far along it is, or how many resources it holds) (Deadlock (computer science) – Wikipedia). Killing a process will free its resources, hopefully breaking the cycle. Alternatively, if the OS supports resource preemption, it could temporarily take a resource away from a process (e.g. swap out some memory, or rollback a transaction). Recovery via termination is drastic and can cause lost work (e.g. if a process was halfway through a task). Hence, it’s a last resort – but in some cases (like deadlock involving low-level system processes) it may be the only way.
Many general OSes do not actively detect deadlocks in resource allocation for user processes – a strategy known as the Ostrich Algorithm (ignore the problem, assuming deadlocks are rare) (Deadlock (computer science) – Wikipedia). For instance, most Unix/Linux systems simply rely on the fact that most deadlocks in user-level code are due to programming errors (like two threads locking in wrong order) and don’t have a global detector for that. They might rely on timeouts or administrator intervention (kill a process). However, some operating systems or subsystems have deadlock detection: e.g., databases (like SQL servers) often have a deadlock detector for transactions (if two transactions deadlock on locks, the DB will abort one with a specific error). In Windows, the system doesn’t generally auto-kill deadlocked processes, but developer tools or the OS Wait Chain Traversal can detect a deadlock among threads and help diagnose it (Wait chain traversal – Win32 apps | Microsoft Learn). For kernel-level deadlocks, OS developers enforce locking hierarchies and use dynamic checks (like lock order verification tools) to avoid them.
Real-world examples: A famous real-world deadlock example occurred in a resource-sharing scenario in traffic: two trains heading toward each other on a single-track railway can deadlock if each waits for the other to move. In operating systems, one classic example is dining philosophers problem – philosophers (processes) need two chopsticks (resources) to eat, and if each picks up one and waits for the other, they deadlock. In real OS terms, consider two processes each needing two locks (L1 and L2): if Process P holds L1 and waits for L2, while Q holds L2 and waits for L1, it’s a deadlock. This can happen in device drivers or kernel code – e.g., one thread holds a filesystem lock and waits for a disk controller, while another holds the disk controller lock and waits for the filesystem lock. Modern OS kernels have design rules to prevent this (like never call into a filesystem while holding a lower-level lock, etc.). Another example: on a single-core system with no preemption of locks, if a high-priority thread is waiting for a low-priority thread (which is deadlocked or stuck) to release a lock, the high-priority thread effectively deadlocks as well (though this edges into priority inversion territory). In systems like Windows, there is a deadlock detection tool for drivers (Driver Verifier’s deadlock detection) that watches for resource cycles in kernel mode. Windows and Linux will generally not automatically kill processes in a deadlock – it might appear as if the application hung (and indeed it has). In such a case, a user may use tools to see the wait chain (Windows Task Manager’s Analyze Wait Chain can show a thread wait dependency and highlight a deadlock chain (Some of you might be experiencing periodic and heavy frame drops …)). The recommended approach is for developers to avoid deadlocks through careful design, as OS-level resolution in user space is limited.
To summarize, deadlocks are a hazard of concurrent resource usage. We characterize them by the four Coffman conditions. Strategies to handle deadlocks include:
- Prevention: Structurally negate one of the conditions (e.g. enforce resource ordering, or use try-locks and back off).
- Avoidance: Dynamically decide on granting requests based on global state to ensure system stays in safe (no-deadlock) states – exemplified by Banker’s algorithm (Banker’s algorithm – Wikipedia).
- Detection and recovery: Allow deadlock to happen but have a mechanism to detect the circular wait (like cycle detection in wait-for graph) and then break the deadlock by terminating or rolling back one or more processes (Deadlock (computer science) – Wikipedia) (Deadlock (computer science) – Wikipedia).
Different OSes choose different methods. General-purpose OSes often rely on prevention in limited ways (lock ordering in kernel) and largely ignore deadlocks in user processes (recovery is manual – kill the app). Specialized systems (database, real-time) implement detection or avoidance because the cost of deadlock is high and must be mitigated automatically.
CPU & I/O Scheduling
CPU scheduling algorithms: The CPU scheduler (short-term scheduler) is the OS component that decides which ready process (or thread) to run next on the CPU. Its goal is to optimize certain performance metrics while ensuring fair allocation of CPU time. Key metrics include throughput (number of processes completed per unit time), turnaround time (time from process submission to completion), waiting time (time a process spends in the ready queue waiting), response time (time from request to first output, important for interactive jobs), and CPU utilization (keeping the CPU busy). Different scheduling algorithms balance these metrics differently (CPU Scheduling Criteria – GeeksforGeeks) (CPU Scheduling).
Classic scheduling algorithms:
- First-Come, First-Served (FCFS): Also known as First-In, First-Out (FIFO), it queues processes in order of arrival (Page replacement algorithm – Wikipedia). The scheduler runs the first process until it blocks or terminates, then moves to the next. FCFS is simple (just a queue) and non-preemptive (once a process has the CPU, it runs to completion or blocking). It’s fair in the sense of arrival order, but it can lead to very poor waiting times if short jobs wait behind long jobs (the convoy effect). For instance, one CPU-bound job can force all others to wait long. “FIFO is cheap and intuitive, but performs poorly in practice and can suffer from Bélády’s anomaly (in paging context) or convoy effects in scheduling.” (Page replacement algorithm – Wikipedia) (Page replacement algorithm – Wikipedia). Because of this, pure FCFS is rarely used in general-purpose OS for CPU scheduling, except perhaps at the very initial level of job entry in batch systems.
- Shortest Job First (SJF): SJF selects the process with the smallest next CPU burst (execution time) to run next. This can be proven optimal for minimizing average waiting time (it’s like a greedy algorithm minimizing wait). SJF can be non-preemptive (allow a running process to finish even if a shorter job arrives) or preemptive. The preemptive version is known as Shortest Remaining Time First (SRTF) – at any arrival of a new job, the scheduler may preempt if the new job’s remaining time is less than the current job’s remaining time. SJF/SRTF significantly reduce waiting time especially in workloads with varied burst lengths (Job Scheduling Algorithms: Which Is Best For Your Workflow?). However, SJF requires knowledge of the next CPU burst duration, which is not generally known; in practice, the scheduler must estimate it (often by observing past behavior). SJF also can cause starvation of longer jobs if short jobs keep arriving (long jobs might never get CPU). For example, if the system keeps getting a stream of short tasks, a heavy long task could wait indefinitely. Some systems implement SJF approximation via multi-level queues or dynamic priorities (aging to avoid starvation).
- Priority Scheduling: Each process is assigned a priority, and the CPU is allocated to the highest-priority ready process. Priorities can be static (assigned externally, e.g. nice value in Unix) or dynamic (changing based on behavior or aging). Priority scheduling can be preemptive or non-preemptive. It generalizes SJF if you treat “shortest CPU time” as highest priority. It’s common in real-time systems where certain tasks must have high priority. The risk is starvation of low-priority processes — if high-priority ones continually run, low ones may never execute. OSes implement aging (gradually increasing the priority of waiting processes) to prevent starvation. Windows and Linux use priority-based scheduling with time-slicing and dynamic priority adjustments. “Priority scheduling can be non-preemptive or preemptive. In one approach, jobs are executed in a priority queue; if jobs have the same priority, they run FCFS. It works well for ensuring important jobs run first, but can lead to starvation of low-priority jobs.” (Job Scheduling Algorithms: Which Is Best For Your Workflow?). For instance, Windows has dynamic priorities where I/O-bound (interactive) threads get boosted priority (to improve responsiveness), and if a high-priority thread goes to sleep, lower ones can run.
- Round Robin (RR): Round Robin is designed for time-sharing systems to give each process a fair share of the CPU in turn. The ready queue is treated circularly (like FCFS), but each process can only run for a limited time slice (quantum), e.g. 10ms or 50ms, before it is preempted and placed at the back of the queue, and the next process gets the CPU (Job Scheduling Algorithms: Which Is Best For Your Workflow?). RR improves response time for all processes – no process waits more than (N-1)*quantum for its next turn (where N is the number of ready processes). It’s essentially FCFS with preemption on a timer. The length of the quantum is crucial: too long and RR degenerates to FCFS (poor response), too short and too much time is wasted on context-switch overhead. RR is fair and simple and widely used for scheduling threads in general-purpose OSes (often with dynamic priorities layered on top). In practice, modern OSes use variants of RR (each queue or priority level round-robins among its processes). RR doesn’t optimize turnaround time as well as SJF, but it prevents starvation and is simple. “In round-robin scheduling, each job gets a time slice in turn in a cyclic manner. RR ensures no job monopolizes the CPU, though if time slices are very short, frequent context-switching can lower CPU efficiency.” (Job Scheduling Algorithms: Which Is Best For Your Workflow?). Typically, OS designers choose a quantum that is a few times larger than the context-switch time so that overhead is only a few percent.
- Multi-Level Queue (MLQ): This algorithm partitions the ready queue into multiple separate queues based on some criterion (process type or priority). For example, a common split is: interactive processes in one queue (given higher priority), batch/background processes in another lower-priority queue. Each queue can have its own scheduling algorithm (RR for interactive, FCFS for batch, etc.), and the scheduler picks from the highest-priority non-empty queue. This is a static class-based scheduling – e.g. foreground vs background. It’s simple but inflexible, since processes are assigned permanently to a queue.
- Multi-Level Feedback Queue (MLFQ): This is a more flexible mechanism that allows processes to move between queues based on their observed behavior (Job Scheduling Algorithms: Which Is Best For Your Workflow?) (Job Scheduling Algorithms: Which Is Best For Your Workflow?). The OS can have, say, several priority levels (multiple ready queues). New processes start in a high-priority queue (short time slices). If a process uses up its time slice (indicating CPU-bound behavior), it gets demoted to a lower priority queue with a longer time slice. If a process waits (e.g. does I/O or yields CPU before slice end), it stays at high priority (indicating interactive or I/O-bound behavior). This way, interactive tasks stay in high priority and get quick responses, while CPU-bound tasks gradually move to lower priority queues and get CPU in larger chunks (which is fine since they don’t need frequent preemption). MLFQ is widely used (Windows and *nix scheduling in the past can be seen as variants of this). It feeds back process behavior into scheduling decisions, hence “feedback”. It tries to approximate SJF: short jobs finish quickly at high priority; long-running jobs sink and run in lower priority in a round-robin fashion. Aging can be implemented by occasionally boosting priority of waiting tasks to avoid starvation (so even if a task sank to low queue, if it hasn’t run for a long time it might be boosted). This algorithm was exemplified in the BSD Unix scheduler and is conceptually how time-sharing OSes manage different workloads. In fact, modern Windows uses a form of MLFQ (called ideal multilevel feedback queue), and Linux’s older O(1) scheduler was similar. MLFQ essentially combines priority scheduling with round-robin, with priority adjusted on the fly. “The multilevel feedback queue (as used in macOS and Windows) dynamically categorizes jobs into multiple priority queues and allows jobs to move between queues based on runtime behavior (CPU burst, etc.), giving good response to interactive tasks while still handling batch tasks efficiently.” (Job Scheduling Algorithms: Which Is Best For Your Workflow?) (Job Scheduling Algorithms: Which Is Best For Your Workflow?).
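To make the waiting-time difference concrete, here is the minimal C sketch referenced from the SJF bullet above: it computes average waiting time for the same three CPU bursts under FCFS order and under non-preemptive SJF. The burst values (24, 3, 3) and the assumption that all jobs arrive at time 0 are illustrative simplifications, not a general scheduler.

```c
/* Sketch: compare average waiting time under FCFS vs. non-preemptive SJF.
 * Assumes all jobs arrive at time 0 and burst lengths are known exactly,
 * which real schedulers must estimate. Burst values are made up for illustration. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Average waiting time when jobs run in the given order:
 * each job waits for the sum of all bursts scheduled before it. */
static double avg_wait(const int *burst, int n) {
    double total_wait = 0;
    int elapsed = 0;
    for (int i = 0; i < n; i++) {
        total_wait += elapsed;
        elapsed += burst[i];
    }
    return total_wait / n;
}

int main(void) {
    int fcfs[] = {24, 3, 3};          /* arrival order */
    int sjf[]  = {24, 3, 3};
    int n = 3;

    qsort(sjf, n, sizeof(int), cmp_int);   /* SJF = run shortest burst first */

    printf("FCFS average wait: %.2f\n", avg_wait(fcfs, n));
    printf("SJF  average wait: %.2f\n", avg_wait(sjf, n));
    return 0;
}
```

With these numbers, FCFS averages 17 time units of waiting (0 + 24 + 27 over three jobs) versus 3 under SJF (0 + 3 + 6), which is why SJF is the usual benchmark for minimizing average wait.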
Most OSes actually use a variant of priority scheduling with round-robin time slicing. For example, Windows has 32 priority levels; it round-robins among threads of the same priority, and uses priority boosting for I/O-bound threads. Linux (as of kernel 2.6.23+) uses the Completely Fair Scheduler (CFS), which is not strictly round-robin or priority queues, but a proportional-share scheduler that tries to give each process a fair fraction of the CPU time based on weights (niceness) (Job Scheduling Algorithms: Which Is Best For Your Workflow?) (Job Scheduling Algorithms: Which Is Best For Your Workflow?). CFS uses a red-black tree to schedule tasks in a way that approximates an “ideal multi-tasking processor” where each runnable task gets an equal share of CPU over time (Job Scheduling Algorithms: Which Is Best For Your Workflow?) (Job Scheduling Algorithms: Which Is Best For Your Workflow?). CFS doesn’t use fixed time slices per se but a target latency and per-task virtual runtime to achieve fairness. However, conceptually it still preempts tasks to ensure fairness (effectively like an extremely fine-grained round-robin weighted by niceness). macOS and iOS use an MLFQ-based scheduler as well (with priorities and time-sharing). So the classical algorithms are building blocks or special cases of what real OSes implement.
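The following is a conceptual C sketch of the proportional-share idea behind CFS: always run the task with the smallest virtual runtime, and charge heavier tasks less virtual time per millisecond of real CPU. It deliberately uses a plain array and an invented weight scale instead of the kernel’s red-black tree and nice-to-weight tables, so it should be read as an illustration of the fairness mechanism, not as CFS itself.

```c
/* Conceptual sketch of proportional-share ("fair") scheduling by virtual runtime.
 * This is NOT the Linux CFS implementation: the real scheduler uses a red-black
 * tree keyed by vruntime and nice-to-weight tables; here a plain array and a
 * made-up weight scale stand in for illustration. */
#include <stdio.h>

#define NTASKS 3
#define BASE_WEIGHT 1024.0   /* weight of a "nice 0" task (assumed scale) */

struct task {
    const char *name;
    double weight;     /* higher weight = larger CPU share */
    double vruntime;   /* grows more slowly for heavier tasks */
};

int main(void) {
    struct task tasks[NTASKS] = {
        {"A", 1024.0, 0.0},
        {"B", 1024.0, 0.0},
        {"C", 2048.0, 0.0},   /* "heavier": should get about 2x the CPU of A or B */
    };
    double cpu_ms[NTASKS] = {0};

    for (int tick = 0; tick < 4000; tick++) {
        /* Pick the runnable task with the smallest virtual runtime. */
        int next = 0;
        for (int i = 1; i < NTASKS; i++)
            if (tasks[i].vruntime < tasks[next].vruntime)
                next = i;

        double slice = 1.0;                              /* run 1 ms of real time */
        cpu_ms[next] += slice;
        tasks[next].vruntime += slice * (BASE_WEIGHT / tasks[next].weight);
    }

    for (int i = 0; i < NTASKS; i++)
        printf("task %s got %.0f ms\n", tasks[i].name, cpu_ms[i]);
    return 0;
}
```

Running this gives roughly 1000 ms each to A and B and 2000 ms to C, showing how equalizing virtual runtime yields weighted CPU shares without fixed quanta.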
Scheduling evaluation metrics: When comparing or designing algorithms, common metrics used are:
- Throughput: Processes completed per second (want high).
- CPU utilization: % of time CPU is doing useful work (want high, ideally 100% on busy system).
- Turnaround time: For each process, completion time minus submission time (total time in system). We often minimize average turnaround.
- Waiting time: Total time a process spends waiting in ready queue (turnaround minus actual service time and I/O time). Minimizing average waiting time is a common goal – SJF is optimal for that in an ideal scenario.
- Response time: Particularly for interactive jobs, time from issuing a request (or start) to first response (not completion). RR and giving priority to interactive tasks help minimize response time.
Schedulers often face trade-offs. For instance, SJF minimizes average waiting and turnaround but can starve long jobs (bad fairness). RR is fair and good for response time but can increase turnaround for CPU-bound jobs due to frequent context switches. Priority scheduling meets external requirements (like real-time deadlines or user importance) but needs to handle starvation.
I/O (Disk) scheduling: Beyond CPU, I/O scheduling – especially disk scheduling – is crucial for performance. For traditional hard disks, seek time (moving the disk arm) is the major cost, and disk scheduling algorithms aim to reduce seek delays by optimizing the order of service for pending disk read/write requests.
Classic disk scheduling algorithms (assuming a moving-head disk with tracks; a small head-movement comparison sketch follows this list):
- FCFS (FIFO): Serve requests in arrival order. Fair and simple but can be very inefficient (the head might zigzag across the disk if requests are in unlucky order).
- SSTF (Shortest Seek Time First): Pick the request that is closest to the current head position (smallest seek distance) (SSTF). This is analogous to SJF – it reduces the average seek time significantly by always moving to the nearest request. However, SSTF can cause starvation: if there are always some requests near the current head position, far requests can be postponed for a long time (SSTF) (SSTF). It also tends to favor middle tracks over the extremes (if requests keep arriving, those near the edges may starve). Still, SSTF is a big improvement over FCFS in practice for disks, since disk access times heavily depend on seek distance. Modern drives’ internal scheduling often does something similar to SSTF.
- SCAN (Elevator algorithm): The disk head moves in one direction (say outward towards the highest-numbered track) servicing all requests along the way, until it reaches the end (or the last request in that direction), then reverses direction (SCAN). This back-and-forth sweeping is like an elevator serving floors in one direction then the other. SCAN ensures a form of fairness – no starvation, because even if new requests keep coming, the head will eventually come back and serve those waiting on the other side (SCAN). It tends to distribute wait times more evenly than SSTF. However, at the edges of the disk, SCAN will go to the end even if no requests are at the extreme end, which is one inefficiency (SCAN) (though the algorithm can be slightly optimized to not go to absolute end if unnecessary). SCAN greatly reduces the variance in response compared to SSTF (SCAN), because requests will be serviced within one sweep at most. It’s commonly used in multi-user systems historically.
- C-SCAN (Circular SCAN): A variant of SCAN that only services requests in one direction. When the head reaches the end of the disk (e.g. the highest track), it jumps back to the beginning (track 0) without servicing requests on the return (the return is effectively a fast seek with no servicing) (CSCAN). This treats the cylinder list as circular. C-SCAN provides more uniform wait times – like an elevator that, instead of reversing and quickly revisiting the cylinders it just served, always returns to the start and gives all cylinders equal service frequency (CSCAN). In C-SCAN, a request at the extreme end won’t get serviced again until the head goes back to the start and comes out again – which provides fairness: each track region is visited at a regular interval (the time of a full sweep). C-SCAN tends to reduce the wait for cylinders in the middle that might otherwise wait while the head goes back across in SCAN. It essentially ensures that waiting time is uniform independent of request location (CSCAN) (CSCAN). C-SCAN avoids the situation in SCAN where one end of the disk gets hit twice in quick succession (when the head turns around). Many OS disk schedulers implement circular scanning. For example, Linux’s old default was CFQ (Completely Fair Queuing), which used variants of SCAN within its fairness scheme.
- LOOK and C-LOOK: These are variants of SCAN/C-SCAN that “look” only as far as the last request in each direction, rather than going to the physical end of the disk. For example, LOOK will go in one direction but reverse as soon as there are no further requests in that direction (not necessarily the disk’s end). C-LOOK similarly jumps back to the earliest request when finishing the highest request. These optimizations prevent unnecessary travel to the disk edge if no request is there, saving time.
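As flagged above, here is a small C sketch comparing total head movement for FCFS, SSTF, and a LOOK-style sweep on one request queue; the cylinder numbers and starting head position (53) are arbitrary textbook-style values, not measurements.

```c
/* Sketch: total head movement for FCFS, SSTF, and an upward LOOK sweep on a
 * toy request queue. Cylinder numbers and the starting head position are
 * arbitrary illustration values. */
#include <stdio.h>
#include <stdlib.h>

#define NREQ 8

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static int fcfs(const int *req, int n, int head) {
    int moved = 0;
    for (int i = 0; i < n; i++) { moved += abs(req[i] - head); head = req[i]; }
    return moved;
}

static int sstf(const int *req, int n, int head) {
    int done[NREQ] = {0}, moved = 0;
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++)                 /* nearest unserved request */
            if (!done[i] && (best < 0 || abs(req[i] - head) < abs(req[best] - head)))
                best = i;
        moved += abs(req[best] - head);
        head = req[best];
        done[best] = 1;
    }
    return moved;
}

static int look_up(const int *req, int n, int start) {
    int sorted[NREQ], moved = 0, head = start;
    for (int i = 0; i < n; i++) sorted[i] = req[i];
    qsort(sorted, n, sizeof(int), cmp_int);
    for (int i = 0; i < n; i++)                     /* sweep up to the highest request */
        if (sorted[i] >= start) { moved += sorted[i] - head; head = sorted[i]; }
    for (int i = n - 1; i >= 0; i--)                /* reverse and sweep back down */
        if (sorted[i] < start) { moved += head - sorted[i]; head = sorted[i]; }
    return moved;
}

int main(void) {
    int req[NREQ] = {98, 183, 37, 122, 14, 124, 65, 67};   /* toy request queue */
    int head = 53;
    printf("FCFS total movement: %d cylinders\n", fcfs(req, NREQ, head));
    printf("SSTF total movement: %d cylinders\n", sstf(req, NREQ, head));
    printf("LOOK total movement: %d cylinders\n", look_up(req, NREQ, head));
    return 0;
}
```

On this queue, FCFS moves the head 640 cylinders while SSTF needs only 236, which is why seek-aware ordering matters so much for spinning disks.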
Comparison: SSTF is like a greedy strategy optimizing each move, good for throughput but can starve. SCAN and C-SCAN provide a good compromise: they significantly reduce seek times compared to FCFS by servicing in an order that minimizes back-and-forth, and they avoid starvation. Generally, C-SCAN is preferred in environments where uniform wait is desired, since it treats the disk as circular and gives equal opportunity (CSCAN) (CSCAN). In terms of performance, SCAN/C-SCAN typically have higher throughput (requests per second) and lower average seek time than FCFS, especially under heavy load.
Modern considerations: Modern storage is often SSD (solid-state drives), which have no mechanical seek – thus, the “distance” doesn’t matter for access time (all random accesses are similar). For SSDs, the scheduling focus shifts – NCQ (Native Command Queuing) in SSDs/HDDs and OS schedulers still batch requests for efficiency, but the penalty for ordering is different (it is more about parallelizing internal flash operations and wear leveling). For simplicity, OSes often still use elevator-like algorithms even on SSDs, or even revert to basically FIFO or a no-op (“noop”) scheduler, since reordering isn’t needed to reduce seeks. In Linux, one can choose an I/O scheduler: historically “deadline” (which is a variant of SCAN with deadlines to avoid starvation), “cfq” (which tries to be fair across processes), and now “bfq” or simply “mq-deadline” for multi-queue. For teaching, SCAN/C-SCAN remain relevant for understanding scheduling.
Evaluation metrics for I/O scheduling: Mainly average seek time or average latency per request. Also fairness (avoid starvation) and sometimes meeting deadlines (for real-time I/O, the Deadline Scheduler in Linux ensures no request waits beyond a certain time by having a deadline queue). Throughput (total MB/s or IOPS) is key. For HDDs, minimizing head movement is directly tied to throughput. For user experience, variance in I/O response can matter (avoid pathological cases where one request waits very long).
In summary, CPU scheduling chooses which process runs on CPU next. Algorithms like FCFS, SJF, Priority, RR, and MLFQ each have strengths: SJF optimal for average wait (but needs job lengths), RR for fairness and interactivity, etc. Modern OSes use hybrid approaches (like multilevel feedback queues or fair schedulers) to handle the mix of CPU-bound and I/O-bound workloads effectively. I/O scheduling, especially for disks, optimizes the order of servicing disk operations to minimize mechanical delays. Algorithms like SCAN and C-SCAN (elevator algorithms) significantly improve disk throughput by reducing needless seeking (SCAN) (CSCAN). Operating systems implement these strategies within disk drivers or storage subsystems to speed up file access. With the advent of SSDs, I/O scheduling focuses more on leveraging internal parallelism and fairness rather than seek optimization, but the classic algorithms still apply conceptually to any queued resource.
Memory Management & Virtual Memory
Modern operating systems employ virtual memory, a technique that gives processes the illusion of a large, contiguous memory space while allowing the OS to use physical memory (RAM) efficiently and even extend memory onto disk (swap). Memory management thus involves how the OS maps virtual addresses to physical memory, how it allocates physical memory to processes, and how it swaps or moves data between memory and secondary storage.
Paging: Paging is a memory management scheme that eliminates the need for contiguous physical memory allocation for processes and avoids external fragmentation. In paging, the virtual address space of a process is divided into fixed-size blocks called pages (commonly 4 KB in size), and physical memory is divided into frames of the same size. The OS keeps a page table for each process, which maps page numbers (virtual) to frame numbers (physical). When a process references an address, the hardware MMU uses the page table to find the corresponding physical frame. If the page is not in memory (page table entry marked invalid), a page fault occurs, and the OS must bring the page in from disk (the backing store, often called the swap or page file) (Demand paging – Wikipedia). Paging enables demand paging: only load a page when it’s needed (on first access) rather than loading the entire process memory at start (Demand paging – Wikipedia) (Demand paging – Wikipedia). This lazy loading optimizes memory usage and allows programs to run even if their total memory demand exceeds physical RAM (the OS will swap pages in and out as needed).
– Demand Paging: As described, in a demand-paged system a process starts with none (or few) of its pages in RAM (Demand paging – Wikipedia) (Demand paging – Wikipedia). When it accesses an address on an unmapped page, a page fault triggers the OS to load that page from disk into a free frame, update the page table (mark it valid and point to the frame) (Demand paging – Wikipedia) (Demand paging – Wikipedia), then resume the process. This allows huge virtual address spaces (e.g. sparse address usage or large heap) without wasting physical memory until data is actually used. Over time, a process will have a set of pages it frequently uses – this is called its working set. If physical memory can hold the working sets of all active processes, page faults will be rare and performance good. If not, the system may thrash (constant page faults). Demand paging relies on locality of reference – processes tend to use a subset of memory intensively at any given time, so caching those pages yields efficiency.
– Page table and TLB: Each memory reference in a paged system is translated via the page table. To speed this up, CPUs have a Translation Lookaside Buffer (TLB), a small cache of recent page table entries. A TLB hit means the virtual-to-physical translation is obtained quickly; a TLB miss means the page table in memory must be consulted (which is slower). Good locality in memory accesses also improves TLB hit rates. Modern page table structures are often hierarchical or inverted to manage large address spaces (e.g. 64-bit addresses) efficiently.
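A minimal C sketch of the translation path just described: split a virtual address into page number and offset, consult a one-entry stand-in for the TLB, fall back to a tiny page table, and report a fault for non-present pages. The page size, table contents, and probe addresses are invented for illustration; real hardware does this with multi-level tables and a much larger TLB.

```c
/* Sketch: virtual-to-physical translation with a tiny page table and a
 * one-entry "TLB". Page size, table contents and addresses are invented
 * for illustration; real MMUs do this in hardware with multi-level tables. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u                     /* 4 KB pages */
#define NPAGES    8

static int32_t page_table[NPAGES] = {       /* virtual page -> physical frame  */
    5, 2, -1, 7, -1, 1, 3, -1               /* -1 means "not present" (fault)  */
};

/* One cached translation standing in for the TLB. */
static uint32_t tlb_page = UINT32_MAX, tlb_frame;

static int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t page   = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;

    if (page >= NPAGES) return -1;          /* out of range */

    if (page == tlb_page) {                 /* TLB hit: no table walk needed */
        *paddr = tlb_frame * PAGE_SIZE + offset;
        return 0;
    }
    if (page_table[page] < 0) return -1;    /* page fault: OS would load the page */

    tlb_page  = page;                       /* TLB miss: walk the table, then cache it */
    tlb_frame = (uint32_t)page_table[page];
    *paddr = tlb_frame * PAGE_SIZE + offset;
    return 0;
}

int main(void) {
    uint32_t paddr;
    uint32_t probes[] = {0x0000, 0x0FFF, 0x1004, 0x2000};
    for (int i = 0; i < 4; i++) {
        if (translate(probes[i], &paddr) == 0)
            printf("vaddr 0x%04x -> paddr 0x%05x\n", (unsigned)probes[i], (unsigned)paddr);
        else
            printf("vaddr 0x%04x -> page fault\n", (unsigned)probes[i]);
    }
    return 0;
}
```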
Segmentation: Segmentation is an older scheme (or sometimes used in tandem with paging) where a process’s address space is divided into variable-sized segments according to logical divisions (like code, data, stack segments). Each segment has a base (starting physical address) and limit (length). To translate an address, the CPU finds the segment, checks that the offset is within the limit, and adds the base. Segmentation reflects logical program structure and allows different segments to grow independently. Its disadvantage is external fragmentation – as segments are allocated and freed, memory can become chopped into pieces, and satisfying a large segment request might require compaction. Some systems (like early Intel x86) combined segmentation and paging: segments provided logical separation and protection, but each segment was paged internally to avoid fragmentation.
In modern OSes, pure segmentation is rare; instead, paging (with possibly a straightforward segmentation into “kernel” vs “user” address spaces) is the norm. Paging handles fragmentation better by design (only internal fragmentation of a fraction of a page on average).
Swapping: Swapping refers broadly to moving an entire process out of RAM to disk to free memory, and later swapping it back in. In early systems, the OS might swap out a whole process (all its pages) when it was waiting or when memory was overcommitted. With demand paging, full-process swaps are less common; instead, individual pages are swapped (paged) in and out. However, the term “swap space” persists – the area of disk used to store pages that are evicted from RAM. If physical memory fills up, the OS will choose some page(s) currently in RAM and swap them out to disk (write to swap space) to free those frames for other pages. Later, if the evicted page is needed, it is read back from disk (page-in). This on-demand swapping of pages is exactly how demand paging operates. Swapping out entire processes might still be done in low-memory situations or hibernation (where all of a process’s pages are written out), but typically OSes handle memory contention by page-level replacement rather than all-or-nothing swaps.
Page replacement policies: When a page fault occurs and there is no free frame available (memory is full), the OS must choose a victim page to evict (replace) from memory to make room for the needed page (Page replacement algorithm – Wikipedia) (Page replacement algorithm – Wikipedia). A good replacement algorithm will choose a page that is not likely to be used in the near future, minimizing the chance of throwing out a page that will be referenced again soon (which would cause another fault). Classic algorithms:
- Optimal (OPT): Replace the page that will not be used for the longest time in the future. This is provably optimal (minimum page faults) (Page replacement algorithm – Wikipedia). Of course, it’s unrealizable in practice because it requires knowing the future reference string. It’s used as an ideal benchmark.
- FIFO: Replace the page that has been in memory the longest (the one that came in first) (Page replacement algorithm – Wikipedia). Implemented with a simple queue. FIFO is simple but can be suboptimal; it can even suffer from Belady’s Anomaly, where adding more memory frames increases the number of page faults for certain access patterns (Page replacement algorithm – Wikipedia). FIFO might evict pages that are heavily used just because they are old. For instance, a page that was loaded early and is frequently used could be removed while a rarely used newer page stays. “FIFO requires little bookkeeping but performs poorly in practice. It may remove pages that are still heavily used; some systems historically used FIFO (like early versions of VMS with tweaks) (Page replacement algorithm – Wikipedia) (Page replacement algorithm – Wikipedia).”
- Least Recently Used (LRU): Replace the page that has not been used for the longest time in the past (Page replacement algorithm – Wikipedia). This is based on the heuristic that pages unused for a long time are less likely to be used soon (temporal locality). LRU is a strong approximation of OPT in many workloads (Page replacement algorithm – Wikipedia) – if a page hasn’t been touched in a while, it might be a good candidate to evict. Implementing LRU exactly can be expensive because you’d have to update use information on every memory reference. However, many systems approximate LRU. For example, keeping a list of pages and moving a page to the front on every access (true LRU) is too slow if done in software for each reference, but hardware can help (e.g., maintaining a time stamp or setting a referenced bit per page which the OS can use to infer usage). LRU generally outperforms FIFO significantly in terms of fewer page faults (Page replacement algorithm – Wikipedia) (Page replacement algorithm – Wikipedia). It can still have pathological cases (e.g., cyclic access patterns larger than memory will defeat LRU, causing every access to fault, whereas FIFO might do better in that specific pattern (Page replacement algorithm – Wikipedia)). But overall, LRU is considered a good default policy.
- Clock (Second-Chance) algorithm: This is a popular approximation of LRU that is efficient to implement (Page replacement algorithm – Wikipedia). Each page entry has a reference bit (or “use” bit) that is set by hardware whenever the page is accessed. The pages are conceptually arranged in a circular list (clock). A “hand” pointer cycles through pages: for each page it encounters, if the page’s ref bit is 0 (meaning it hasn’t been used recently), that page is evicted; if the ref bit is 1, the algorithm gives the page a “second chance” by clearing its bit to 0 and moving on to the next page (Page replacement algorithm – Wikipedia) (Page replacement algorithm – Wikipedia). The hand keeps moving until it finds a page with ref bit 0 to evict. This approximates LRU because frequently used pages will likely have their bit set when first inspected and thus get their bit cleared instead of being immediately evicted (so they survive until at least one full rotation where they weren’t used). Clock is essentially a circular buffer implementation of the Second-Chance algorithm (Page replacement algorithm – Wikipedia) (Page replacement algorithm – Wikipedia). It is simple and efficient (just O(1) work per page fault on average, scanning until a victim is found, which typically is short). Many OSes, including older Unix, used Clock for page replacement. It’s space-efficient (doesn’t need full sorted list of ages, just one bit per page for usage). Modern variations add more information (like an 8-bit age counter approximating LRU more finely, e.g. the “Aging” algorithm which shifts reference bits). (A minimal sketch of Clock appears after this list.)
- LFU (Least Frequently Used): Replace the page that has the smallest count of accesses (perhaps over a long period). The idea is that pages that haven’t been used much might be less important. However, LFU can misjudge pages that were heavily used in the past but are no longer needed (they have a high count and stay in memory even if no longer actively used). Also, implementing it requires maintaining counts and possibly decaying them over time. LFU on its own isn’t widely used in OS page replacement because of these issues (it can behave like sticky caching of something that was hot long ago).
- NRU (Not Recently Used) and variants: Some systems use a combination of a referenced bit and a modified (dirty) bit to categorize pages into classes, e.g.: (0) not used, not dirty; (1) not used, dirty; (2) used, not dirty; (3) used, dirty. Then they choose a page from the lowest-numbered non-empty class at fault time (favoring not recently used pages, and among those prefer not dirty to avoid extra disk write). This NRU approach is a coarse LRU approximation used by simpler systems.
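As noted in the Clock bullet above, here is a minimal C sketch of second-chance replacement: frames arranged in a circle, a reference bit per frame, and a hand that clears bits until it finds a victim. The reference string and the three-frame memory are arbitrary illustration values.

```c
/* Sketch of the Clock (second-chance) replacement policy: a fixed set of
 * frames arranged in a circle, one reference bit per frame, and a hand that
 * clears bits until it finds a victim. Reference string and frame count are
 * arbitrary illustration values. */
#include <stdio.h>

#define NFRAMES 3

int main(void) {
    int refs[] = {1, 2, 3, 2, 4, 1, 5, 2, 1, 2, 3, 4, 5};
    int nrefs = sizeof(refs) / sizeof(refs[0]);

    int frame[NFRAMES];          /* which page each frame holds (-1 = empty) */
    int refbit[NFRAMES] = {0};   /* "recently used" bit, normally set by hardware */
    int hand = 0, faults = 0;

    for (int i = 0; i < NFRAMES; i++) frame[i] = -1;

    for (int r = 0; r < nrefs; r++) {
        int page = refs[r], hit = 0;

        for (int i = 0; i < NFRAMES; i++)          /* already resident? */
            if (frame[i] == page) { refbit[i] = 1; hit = 1; break; }
        if (hit) continue;

        faults++;
        while (refbit[hand]) {                     /* give used pages a second chance */
            refbit[hand] = 0;
            hand = (hand + 1) % NFRAMES;
        }
        frame[hand] = page;                        /* evict the victim, load the new page */
        refbit[hand] = 1;
        hand = (hand + 1) % NFRAMES;
    }
    printf("%d references, %d page faults\n", nrefs, faults);
    return 0;
}
```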
The working set and thrashing: In the 1960s, Peter Denning introduced the Working Set concept: the OS can track the set of pages each process has used in the last T time units (that’s the working set). If a process’s working set is bigger than available memory, it will cause frequent faults (thrashing). A thrashing system spends most of its time paging instead of executing useful work. To combat thrashing, OSes might use the working set model to decide how many processes to keep in memory. If memory is overcommitted, it might swap out (or suspend) some processes entirely to free enough frames for the remaining processes’ working sets. Windows, for example, has a working set manager that adjusts the number of frames per process and can even swap processes out of physical memory if needed.
Copy-on-Write (CoW): CoW, mentioned earlier in the context of processes, is a memory management optimization relevant here: when duplicating memory (like on fork), don’t actually copy pages; mark them read-only and let both processes share them. If either writes, a page fault occurs and the OS then actually copies the page (allocating a new frame for the child or parent) (Copy-on-write – Wikipedia) (Copy-on-write – Wikipedia). This technique reduces the cost of fork() significantly (as most pages are only copied if modified) and is a cornerstone of efficient process creation in Unix-like OSes. It relies on page protection and the page fault mechanism to detect writes.
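A short C sketch of the observable effect of CoW on fork(): parent and child behave as if they had private copies of memory, while (on Linux and most Unix-like systems) the physical page copy is deferred until the child actually writes. The page copy itself is invisible from user space, so the program only demonstrates the resulting isolation.

```c
/* Sketch: after fork(), parent and child logically have separate copies of
 * memory. On Linux and most Unix-like systems this is implemented with
 * copy-on-write pages, so the physical copy of value's page happens only
 * when the child writes to it. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int value = 42;
    pid_t pid = fork();

    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {                 /* child: this write triggers the actual page copy */
        value = 99;
        printf("child  sees value = %d\n", value);
        exit(0);
    }

    waitpid(pid, NULL, 0);          /* parent: still sees its own, unmodified page */
    printf("parent sees value = %d\n", value);
    return 0;
}
```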
Memory-Mapped Files: Instead of using explicit I/O calls (read, write) to file data, OSes allow mapping a file’s contents into a process’s address space. The file’s data is then accessed via memory loads and stores as if it were an array in memory. Underneath, the OS uses paging: when a mapped region is accessed, the corresponding file blocks are paged in (read from disk into memory) and mapped to the process’s pages (Memory-mapped file – Wikipedia). This can be more efficient and convenient (the OS will also typically write the pages back to disk when modified, either asynchronously or on unmap/exit). Memory-mapped files unify file I/O and memory – for example, the OS page cache is used, and multiple processes mapping the same file share the physical memory for it (saves memory). Also, this mechanism is used for loading executables and shared libraries: the OS maps the executable file into memory and the demand paging mechanism loads pages as code is executed. One advantage is zero-copy I/O for certain operations: e.g., if you want to load a file into memory to parse it, mmap() can avoid an extra copy compared to read() (which would copy data from kernel to user buffer). Memory mapping can also be used for IPC (if two processes both map the same file or shared memory object, they communicate via that memory). The OS ensures the mapping has a “direct byte-for-byte correlation” with the file on disk – treating file bytes as if they were memory bytes (Memory-mapped file – Wikipedia).
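A small C sketch of file access through mmap(): map the whole file read-only, let demand paging fault the data in as it is touched, and read it like an array. The default path is just a placeholder and error handling is minimal.

```c
/* Sketch: reading a file through a memory mapping instead of read().
 * The default path is a placeholder; error handling is minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "/etc/hostname";  /* placeholder file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* Map the whole file read-only; pages are faulted in on first access. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Touch the mapping like an ordinary array: count newline characters. */
    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n') lines++;
    printf("%s: %lld bytes, %ld lines\n", path, (long long)st.st_size, lines);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```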
Modern techniques: Besides CoW and memory-mapped files, other techniques include huge pages (using larger page sizes, like 2MB or 1GB pages on x86-64, to reduce overhead and TLB misses for large memory allocations), NUMA-aware allocation (on multiprocessor systems with Non-Uniform Memory Access, the OS tries to allocate memory from the local node of the CPU running the thread), and kernel same-page merging (on some OSes like Linux with KSM, identical memory pages across different processes (like many VMs each with same OS) can be merged to a single copy to save RAM; marked CoW so if a process writes, it gets separated – similar to copy-on-write but across processes for identical content).
Page Replacement in practice: Operating systems often use a variant of Clock or LRU with tweaks. For instance, Linux’s current implementation (since 2.6) is a variant of Clock-Pro (two-list active/inactive LRU list scheme) that approximates LRU but with better performance under certain patterns. Windows uses a similar approach with working set trimming and a clock algorithm per working set. Both will try to keep frequently used pages in memory and replace less used ones. They also consider the modified (dirty) status – replacing a dirty page is more expensive (requires write to disk) than a clean page (which can just be dropped if it’s unmodified copy of file or already saved to swap), so the algorithms prefer to replace clean pages when possible (unless the dirty page hasn’t been used in a long time and the clean page was very recently used).
In summary, memory management uses paging to flexibly and safely allocate memory, with demand paging to load pages on the fly (Demand paging – Wikipedia). The OS must decide which pages to keep in RAM via page replacement algorithms like FIFO, LRU, Clock, etc., aiming to minimize page faults (disk I/O). It also provides features like copy-on-write to optimize memory use on forks (Copy-on-write – Wikipedia), and memory-mapped files to treat file data as memory for efficiency (Memory-mapped file – Wikipedia). Successful memory management gives the illusion of a large memory to processes while in reality sharing and over-committing physical memory intelligently. If mismanaged (e.g., working sets exceed RAM by far), thrashing happens and performance plummets. Thus, OS memory managers also sometimes reduce the multiprogramming level (through swapping or admissions control) to regain stability.
File Systems
A file system manages how data is stored and retrieved on storage devices. It provides logical abstractions (files, directories) to users and programs, and handles the mapping to physical storage blocks. Key file system concepts include the on-disk data structures (like inodes or allocation tables), directory structure and naming, allocation algorithms, and metadata like permissions.
Architecture and data structures: Many traditional file systems (e.g. Linux’s ext4, ext3, or the earlier ext2) use an inode-based design. An inode (index node) is a data structure that stores metadata about a file – such as its size, owner, permissions, timestamps, and pointers to the disk blocks that hold the file’s data (Inodes in Linux: limit, usage and helpful commands) (Inodes in Linux: limit, usage and helpful commands). Each file has an inode (and each inode has a unique number on that volume). The directory is a special kind of file that contains entries mapping filenames to inode numbers ([PDF] File System Implementation – CS@Cornell). For example, a directory entry might say “resume.doc -> inode 12345”. When the OS looks up “resume.doc” in the directory, it gets inode 12345, then it loads inode 12345 from disk (if not already in memory) to find the file’s block pointers and other info. This separation means file names are only in directories, not in the inode itself (inodes don’t know which name(s) refer to them, enabling features like hard links – multiple directory entries pointing to the same inode). The inode has direct pointers to a few data blocks, and indirect pointers for larger files (single indirect blocks that point to many data blocks, double indirect blocks which point to blocks of pointers, etc.). This scheme allows efficient access for small files (direct pointers) while scaling to very large files via multi-level indexing.
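To make the inode layout concrete, here is a simplified C sketch: a struct with a dozen direct block pointers plus one single-indirect block, and a bmap()-style helper that maps a logical block index within a file to a disk block number. The field names, sizes, and toy “disk” are invented; real inodes hold more metadata and more indirection levels.

```c
/* Sketch of an ext2-style inode with a handful of direct block pointers and
 * one single-indirect block. Fields and sizes are simplified for illustration. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE   4096u
#define NDIRECT      12
#define PTRS_PER_BLK (BLOCK_SIZE / sizeof(uint32_t))   /* 1024 pointers per block */

struct inode {
    uint32_t size;                /* file length in bytes             */
    uint16_t mode;                /* type + permission bits           */
    uint16_t uid, gid;            /* owner and group                  */
    uint32_t direct[NDIRECT];     /* block numbers of first 12 blocks */
    uint32_t single_indirect;     /* block full of further pointers   */
};

/* Given a logical block index within the file, return the disk block number.
 * read_block stands in for the buffer-cache read a real FS would perform. */
static uint32_t bmap(const struct inode *ip, uint32_t logical,
                     const uint32_t *(*read_block)(uint32_t)) {
    if (logical < NDIRECT)
        return ip->direct[logical];
    logical -= NDIRECT;
    if (logical < PTRS_PER_BLK)
        return read_block(ip->single_indirect)[logical];
    return 0;                     /* would need double indirection beyond this */
}

/* Toy "disk": one indirect block's worth of pointers. */
static uint32_t fake_indirect[PTRS_PER_BLK];
static const uint32_t *fake_read_block(uint32_t blkno) {
    (void)blkno;
    return fake_indirect;
}

int main(void) {
    struct inode ino = { .size = 20 * BLOCK_SIZE, .single_indirect = 900 };
    for (uint32_t i = 0; i < NDIRECT; i++) ino.direct[i] = 100 + i;
    for (uint32_t i = 0; i < PTRS_PER_BLK; i++) fake_indirect[i] = 1000 + i;

    printf("logical block 3  -> disk block %u\n", bmap(&ino, 3,  fake_read_block));
    printf("logical block 15 -> disk block %u\n", bmap(&ino, 15, fake_read_block));
    return 0;
}
```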
Other file systems, like Microsoft’s FAT32 (File Allocation Table) or older systems, use a different structure: FAT keeps a table where each entry corresponds to a disk block and contains a link to the next block of the file (or end-of-file marker) – essentially a linked list of blocks structure, but the links are kept in a contiguous table for speed. Directories in FAT point to the first block of a file, and then the FAT table is used to follow the chain. This is simple but not as efficient for random access (following links) and the FAT table must be in memory for performance.
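A tiny C sketch of the FAT idea: the directory entry stores only the first block, and finding the n-th block of a file means chasing n links through the table. The table contents are invented; the point is that random access costs time linear in the chain length.

```c
/* Sketch: following a FAT chain to find the n-th block of a file. The table
 * contents are invented; a real FAT lives on disk (usually cached in memory),
 * with a special value marking end-of-file. */
#include <stdio.h>

#define FAT_EOF (-1)

/* fat[b] = next block in the file after block b, or FAT_EOF. */
static int fat[16] = {
    [4] = 7, [7] = 2, [2] = 10, [10] = FAT_EOF,   /* one file: 4 -> 7 -> 2 -> 10 */
};

static int nth_block(int first_block, int n) {
    int b = first_block;
    while (n-- > 0 && b != FAT_EOF)
        b = fat[b];               /* each step is one pointer chase */
    return b;
}

int main(void) {
    int first = 4;                /* the directory entry stores only this */
    for (int i = 0; i < 4; i++)
        printf("logical block %d -> disk block %d\n", i, nth_block(first, i));
    return 0;
}
```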
Newer file systems like NTFS (Windows NT File System) use a Master File Table (MFT) where each file has an MFT record (similar to an inode concept) that contains file attributes and pointers to content (NTFS is more complex, using a B-tree for directories, storing small files directly in the MFT record, etc.). ext4 (in Linux) is an evolved inode-based FS that also uses extents (a single descriptor for a range of contiguous blocks) rather than single-block pointers for large files, to reduce fragmentation and metadata overhead.
Directories and naming: A directory is essentially a file that lists name->file metadata mappings. In Unix, each directory entry contains a filename and an inode number. Directories can typically contain both files and other directories, forming a hierarchy (a tree or acyclic graph structure). The file system has a root directory (“/” in Unix, or a drive like “C:” in Windows). Pathnames are resolved by walking the directory tree, e.g. “/home/alice/doc.txt” means: in root, find “home” (get its inode), then in that directory find “alice” (inode), then find “doc.txt”. Permissions on directories control the ability to list contents (execute/search permission). Some file systems allow multiple names for a file (hard links in Unix; Windows shortcuts are not the same thing, though NTFS does support hard links and symbolic links as well).
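A toy C sketch of pathname resolution as just described: start at the root inode and do one directory lookup per path component. The in-memory table of (directory inode, name) pairs stands in for the on-disk directory files a real file system would read.

```c
/* Sketch: resolving "/home/alice/doc.txt" against a toy directory table.
 * Each entry maps (directory inode, name) -> inode number, standing in for
 * the on-disk directory files a real FS would read block by block. */
#include <stdio.h>
#include <string.h>

struct dirent_toy { int dir_ino; const char *name; int ino; };

static struct dirent_toy table[] = {
    {2, "home", 11},              /* inode 2 is the root directory "/" */
    {11, "alice", 23},
    {23, "doc.txt", 57},
};
#define NENTRIES (sizeof(table) / sizeof(table[0]))

static int lookup(int dir_ino, const char *name) {
    for (size_t i = 0; i < NENTRIES; i++)
        if (table[i].dir_ino == dir_ino && strcmp(table[i].name, name) == 0)
            return table[i].ino;
    return -1;                    /* "No such file or directory" */
}

static int resolve(const char *path) {
    char buf[256];
    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    int ino = 2;                  /* start at the root inode */
    for (char *part = strtok(buf, "/"); part != NULL; part = strtok(NULL, "/")) {
        ino = lookup(ino, part);  /* one directory read per path component */
        if (ino < 0) return -1;
    }
    return ino;
}

int main(void) {
    printf("/home/alice/doc.txt -> inode %d\n", resolve("/home/alice/doc.txt"));
    printf("/home/bob/x.txt     -> inode %d\n", resolve("/home/bob/x.txt"));
    return 0;
}
```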
Allocation methods: There are various methods to allocate space for files on disk:
- Contiguous allocation: Allocate a contiguous run of blocks for each file (like how an array is laid out). This gives excellent read performance for sequential access (just one seek and then continuous reading) and easy random access via arithmetic (offset + base). However, contiguous allocation suffers from external fragmentation over time as files grow, shrink, or are deleted, just like a contiguous memory allocation would. It also makes file growth tricky – if the immediately following blocks are occupied, the file must either be moved or handled with techniques like extent overflow. Contiguous allocation is used on CD-ROMs (where files don’t change after creation, so fragmentation isn’t an issue). Modern FS sometimes use extents as a compromise – an extent is a contiguous chunk, and a file can be a linked list of extents. This reduces fragmentation while not requiring one single run per file.
- Linked allocation: Each file is a linked list of blocks, not necessarily contiguous. Each block contains a pointer to the next block. This eliminates external fragmentation (any free block can be used) and files can grow easily. But random access is very slow (to get to block n, you must traverse n links). Also, pointers consume some space in each block. FAT is essentially a form of linked allocation where the linkage is kept in the FAT table in memory, which improves traversal speed (since the table is in RAM, you can follow pointers quickly, though it’s still linear in number of blocks for random access).
- Indexed allocation: Use an index block that contains an array of pointers to the file’s blocks. The inode method in Unix is a form of indexed allocation (with multi-level indexing). This allows direct access to blocks (given an index number and the index block, you can immediately get that block’s address). It handles file growth by allocating a new index block if needed or using multiple levels. The downside is the overhead of the index (for very small files, an entire index block might be partially unused – though systems like ext2/3/4 handle that with direct pointers in the inode for small files). Indexed allocation (with extent improvements) is most common now. NTFS uses a B-tree of extents internally, which is another form of indexed structure (extents as the entry, B-tree for scalability).
Journaling vs non-journaling: Journaling file systems write a journal (log) of changes that will be made, before applying the changes to the main file system structures (Journaling file system – Wikipedia). This is done to improve crash consistency. In a non-journaling FS (like ext2 or FAT), if a crash or power loss occurs during an update (say updating a directory entry and an inode), the data structures can become inconsistent (e.g. directory points to inode, but inode wasn’t updated, or free space bitmap not updated properly, etc.). After reboot, a slow fsck (file system check) must scan and repair inconsistencies – which can take a long time on large disks and may not always fully recover data.
A journaling FS (like ext3, ext4, NTFS, XFS, etc.) logs the intended updates to a circular log on disk. For example, if you create a file, the FS will write a journal entry describing “allocate inode X, update directory Y with new entry” before actually doing those things. Once it has written the journal entry (often called a transaction), it then applies the actual changes to the file system. After applying, it marks the journal entry as completed. If a crash occurs, on reboot the file system replays the journal: any transactions that were fully recorded but not marked completed are redone (since they might not have been fully applied), and any incomplete (partial) journal entries are discarded (those operations were never fully logged, so they are treated as if they never started). This way, the file system either has done an operation fully or not at all, avoiding inconsistency (Journaling file system – Wikipedia) (Journaling file system – Wikipedia). Journaling typically focuses on metadata changes (the structure of the FS) because journaling user data as well can double write costs (some file systems allow data journaling as an option for extra safety at a performance expense). For instance, ext3 offered three modes: data=journal (metadata and data both journaled), data=ordered (only metadata journaled, but data blocks are written out before the corresponding metadata commits), and data=writeback (only metadata journaled, with no ordering guarantee for data). “A journaling file system keeps track of changes not yet committed to the main part by recording the intent of such changes in a journal (log). In event of crash, the file system can be brought back online more quickly with lower chance of corruption (Journaling file system – Wikipedia).” By journaling only metadata, the performance cost is modest (the changes to directories, inodes, etc., are written twice – once to the journal, once to the actual location). Journaling data as well ensures user data integrity (no stale garbage if a crash occurs during a write), but typically OSes use ordered or writeback mode for data (ordered mode writes the data buffers out before the corresponding metadata commit, so metadata never points to uninitialized blocks, while the data itself is not logged).
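A toy C sketch of the write-ahead discipline described above: log every update of a transaction, then a commit record, and only then modify the main structures; on recovery, redo committed transactions and ignore a partial one. The in-memory arrays stand in for disk blocks; real journals additionally checkpoint, reclaim log space, and checksum records.

```c
/* Sketch of metadata journaling: changes are appended to a log and only
 * applied to the "main" on-disk structures after a commit record is written.
 * The arrays below are a toy stand-in for disk blocks. */
#include <stdio.h>

#define NBLOCKS 8
#define LOGCAP  16

struct logrec { int block; int value; int commit; };   /* commit marks txn end */

static int disk[NBLOCKS];                 /* the main file-system structures */
static struct logrec journal[LOGCAP];     /* the on-disk journal             */
static int log_len = 0;

/* Write a whole transaction: log every update, then the commit record,
 * and only then touch the main structures. */
static void do_txn(const int blocks[], const int values[], int n) {
    for (int i = 0; i < n; i++)
        journal[log_len++] = (struct logrec){ blocks[i], values[i], 0 };
    journal[log_len++] = (struct logrec){ -1, 0, 1 };       /* commit */
    for (int i = 0; i < n; i++)
        disk[blocks[i]] = values[i];                        /* apply in place */
}

/* Crash recovery: replay only fully committed transactions. */
static void recover(void) {
    int start = 0;
    for (int i = 0; i < log_len; i++) {
        if (journal[i].commit) {
            for (int j = start; j < i; j++)
                disk[journal[j].block] = journal[j].value;  /* redo */
            start = i + 1;
        }
    }
    /* records after the last commit belong to an uncommitted txn: ignored */
}

int main(void) {
    int blocks[] = {1, 4}, values[] = {111, 444};
    do_txn(blocks, values, 2);            /* e.g. "update inode 1 and dir block 4" */

    journal[log_len++] = (struct logrec){ 5, 555, 0 };  /* a txn cut short by a crash */

    disk[1] = disk[4] = 0;                /* pretend the in-place writes were lost */
    recover();
    printf("after recovery: disk[1]=%d disk[4]=%d disk[5]=%d\n",
           disk[1], disk[4], disk[5]);    /* committed txn redone, partial one ignored */
    return 0;
}
```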
Examples: NTFS is a journaling FS (it logs metadata). ext3/ext4 are journaling (ext4 even uses journal checksums to improve reliability). HFS+ (macOS) had journaling added. A non-journaling FS example is ext2 (if ext2 crashes, you must run fsck, which can take a while on large volumes). Journaling greatly reduces recovery time – after a crash, the OS can replay a few recent journal entries in seconds, rather than scanning the whole disk for issues (Journaling file system – Wikipedia) (Journaling file system – Wikipedia).
Security and access control: File systems enforce access permissions for security. In Unix-like systems, each inode has permission bits for owner, group, others (read, write, execute) and is associated with an owner user ID and group ID (Inodes in Linux: limit, usage and helpful commands). The OS checks these on each file access. These systems also support access control lists (ACLs) for finer-grained permissions (ext4, NTFS, etc. allow ACLs specifying permissions for specific users or groups beyond the owner/group/other scheme). Windows NTFS uses a full ACL model (each file/directory has a security descriptor with an ACL specifying which users/groups have what rights like read, write, execute, delete).
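A short C sketch showing the permission metadata an inode carries, read via stat(): owner and group IDs plus the mode bits, with a couple of bit tests. The default path is a placeholder.

```c
/* Sketch: reading the owner/group/other permission bits an inode stores,
 * via stat(). The default path is a placeholder. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "/etc/hostname";  /* placeholder */
    struct stat st;
    if (stat(path, &st) != 0) { perror("stat"); return 1; }

    printf("%s: uid=%d gid=%d mode=%04o\n",
           path, (int)st.st_uid, (int)st.st_gid, st.st_mode & 07777);
    printf("owner can write: %s\n", (st.st_mode & S_IWUSR) ? "yes" : "no");
    printf("others can read: %s\n", (st.st_mode & S_IROTH) ? "yes" : "no");
    return 0;
}
```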
File systems must also manage special attributes: executability, special files (like device nodes in Unix, which are entries that reference OS device drivers), etc. Additionally, FS may support encryption (e.g. NTFS has EFS for encryption, ext4 can have encryption on directories in modern Linux) to secure data at rest.
Performance considerations: File system performance involves both the on-disk layout and caching strategies. OSes use a buffer cache/page cache to keep recently used file data and metadata in memory, reducing disk I/O. Thus, reading a file sequentially may actually fetch blocks into cache and subsequent reads hit the cache, or writes may be buffered and written to disk later (write-back caching). The file system layout tries to allocate blocks that are logically sequential in a file to be physically near each other to improve throughput (less seek). Also, it tries to place related metadata near data (e.g. inode near the file’s data blocks) to reduce seek when opening a file.
Fragmentation can impact performance: Over time, as files are created and deleted, even with block-based allocation, files can end up fragmented (non-contiguous extents). Many file systems try to maintain some locality (ext4 uses extents and delayed allocation: it doesn’t immediately pick blocks for newly written data until it has a batch, so it can allocate a contiguous run if possible). Windows users historically had to defragment NTFS or FAT partitions periodically to maintain performance (though NTFS is better at avoiding fragmentation than FAT, and Windows auto-defrags in the background now).
Directory implementation can affect performance: directories implemented as linear lists are slow to search when they contain many files; many FS use hashed or B-tree directories (NTFS uses B-trees, ext4 has a hashed B-tree option for large directories) to speed up lookup in huge directories.
Advanced trends: Modern file systems (like ZFS by Sun/Oracle, or btrfs in Linux) integrate advanced features: they are journaling or copy-on-write (ZFS and btrfs always write new copy of data and metadata on change – never overwrite in place, so they are naturally consistent and maintain snapshots easily), they support snapshots (point-in-time copies of the file system for backups or versioning), checksums of data to detect disk errors, built-in RAID mechanisms, and more. These file systems treat the storage more abstractly (ZFS even manages its own pooling of disks rather than relying on OS volume manager). Another example is APFS (Apple File System) introduced in macOS/iOS, which is a COW filesystem with snapshots, clones, and strong encryption features, replacing HFS+.
For teaching core concepts, ext4 and NTFS are typical examples of widely-used FS with journaling.
Real-world differences:
- Linux ext4 uses inodes (and direct, indirect pointers), journaling (with options for data journaling), extent-based allocation (improving large file contiguous allocation), and delayed allocation to reduce fragmentation. It has the traditional Unix permission model plus optional POSIX ACLs.
- Windows NTFS uses a Master File Table (each file has a record). Small files or small directory entries can reside entirely in the MFT entry (for performance). It journals metadata (and can optionally log data in some cases). NTFS supports complex ACLs and file attributes (like compressed or encrypted). It also has features like alternate data streams (multiple forks of data in a single file), which is not commonly used in Unix.
- Performance: NTFS on very large directories or highly fragmented volumes might degrade, whereas ext4 tends to maintain performance with age quite well (until extremely full). NTFS relies on defragmentation for optimal performance long-term, whereas ext4’s design (and Linux’s readahead, etc.) often avoids needing an explicit defrag (though ext4 does have an offline defrag tool if needed).
In distributed/network file systems (like NFS, SMB), similar concepts apply but performance can be limited by network latency; they use caching and consistency protocols.
To conclude, file systems are the OS’s answer to long-term data storage. They provide a hierarchical namespace (files and directories), manage space on disk using structures like inodes or tables, handle how data is laid out (contiguity vs fragmentation trade-offs), and ensure reliability through mechanisms like journaling (Journaling file system – Wikipedia). They also enforce security policies for file access. Modern file systems emphasize robustness (to prevent data loss on crashes) and scalability (handling terabytes of data and millions of files) while balancing performance.
(Additional trends worth noting: the rise of solid-state drives has changed performance characteristics – random reads are much faster, no seeks; file systems like Microsoft’s ReFS and Linux’s f2fs are tailored for different storage. Also, unikernels sometimes embed simple file systems or treat everything as one image. But general-purpose OSes still use the above concepts. Another trend is object stores in the cloud, which forego some FS structure overhead for scalability – but on an OS level, the fundamental ideas of indexing and allocation remain.)