Introduction: This report provides a concise overview of fundamental topics in computer organization and architecture, aimed at undergraduate and graduate students in computer science or related fields. We cover how machine instructions are designed and addressed in memory, the structure of the CPU’s arithmetic/logic unit (ALU) and control mechanisms, the principles of instruction pipelining and how to handle hazards, the design of the memory hierarchy (from caches to secondary storage and virtual memory), and the basics of input/output interfaces including interrupts and direct memory access (DMA). Throughout, we include comparisons (e.g. RISC vs CISC architectures), trade-offs (such as pipeline depth vs. complexity, or cache size vs. speed), and real-world examples (like ARM vs. x86, and typical I/O device behaviors). References are provided for key concepts to ensure academic rigor.
Machine Instructions & Addressing Modes
Instruction Formats and Categories: Machine instructions are binary-encoded commands that a CPU can execute. An instruction format defines how the bits are organized (for example, portions for the opcode and operands). Some architectures use fixed-length instructions (e.g. 32-bit in MIPS RISC), while others (like x86 CISC) use variable-length formats (Comparison of instruction set architectures – Wikipedia). Instructions are often categorized by their function: arithmetic (e.g. ADD, SUB for integer addition/subtraction), logic (e.g. AND, OR, XOR bitwise operations), data transfer (e.g. LOAD and STORE to move data between memory and registers), and control (e.g. BRANCH or JUMP to alter the flow of execution). For instance, an ADD instruction will typically take two operands (from registers or memory) and produce a result, whereas a BRANCH instruction uses a condition to decide the next instruction’s address. Many CPUs also include category-specific instructions such as shifts/rotates, floating-point operations, and system control instructions. Real-world ISAs (Instruction Set Architectures) like ARM and Intel x86 define hundreds of such instructions, but they can be understood as combinations or extensions of these basic categories.
Addressing Modes: An addressing mode describes how an instruction identifies the operands (the data it will process). Common addressing modes include immediate values, direct or memory addresses, registers, and more complex calculated addresses. The table below summarizes some common addressing modes with descriptions and examples (Comparison of instruction set architectures – Wikipedia):
Addressing Mode | Description | Example Instruction |
---|---|---|
Immediate | The operand value is embedded immediately in the instruction itself (no memory lookup needed for the data). Useful for constants. | e.g. MOV R1, #5 — move the constant value 5 into register R1 (5 is encoded in the instruction). |
Direct (Absolute) | The instruction contains the direct memory address of the operand. The CPU fetches the operand from that memory location. | e.g. LOAD R1, 1000 — load into R1 the value from memory address 1000. Here 1000 is the effective address (EA) of the operand. |
Indirect | The instruction contains an address (or reference) that points to another memory location where the operand’s address is stored. In other words, one memory lookup leads to another. This can be thought of as a pointer. | e.g. LOAD R1, (1000) — first fetch the value at memory address 1000, interpret that value as a new address, then load from that address into R1. (If memory at 1000 contains 2000, the value from address 2000 is loaded.) (Microsoft Word – ch11aF08.doc) |
Register | The operand is stored in a CPU register specified by the instruction. This is very fast (no memory access for the operand). | e.g. ADD R3, R1, R2 — add contents of register R1 and R2, store result in R3. All operands are in registers. |
Indexed (Displacement) | The effective address of the operand is calculated by adding an offset (which can be immediate or in another register) to a base address in a register. This allows access to data structures like arrays. | e.g. STORE R5, [R4 + 12] — store the value from R5 into memory at address (contents of R4 + 12). If R4 is a base address and 12 an index or offset, the operand’s address is computed dynamically. |
Note: Many architectures support variants and combinations of these modes (for example, register indirect is a variant where the instruction specifies a register that contains the memory address of the operand, effectively a combination of Register and Indirect mode (Microsoft Word – ch11aF08.doc)). Another common mode for control instructions is PC-relative addressing, where the address of the next instruction (Program Counter) is used as a base for an offset – this is often how branch targets are specified, enabling position-independent code.
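To make the table above concrete, here is a minimal sketch of how a CPU might compute the operand value under each mode. The tiny register file, memory contents, and mode names are illustrative assumptions, not tied to any real ISA:

```python
# Illustrative only: a toy machine showing how each addressing mode locates its operand.
registers = {"R1": 0, "R2": 7, "R4": 100}          # hypothetical register file
memory = {100: 42, 112: 9, 1000: 2000, 2000: 55}   # hypothetical memory contents

def operand_value(mode, operand, offset=0):
    if mode == "immediate":   # the value is embedded in the instruction itself
        return operand
    if mode == "register":    # the value sits in a register
        return registers[operand]
    if mode == "direct":      # the operand field is a memory address
        return memory[operand]
    if mode == "indirect":    # memory holds a pointer to the real address
        return memory[memory[operand]]
    if mode == "indexed":     # effective address = base register + offset
        return memory[registers[operand] + offset]
    raise ValueError("unknown addressing mode")

print(operand_value("immediate", 5))        # 5
print(operand_value("direct", 1000))        # 2000 (the value stored at address 1000)
print(operand_value("indirect", 1000))      # 55   (follow the pointer to address 2000)
print(operand_value("indexed", "R4", 12))   # 9    (address 100 + 12 = 112)
```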
RISC vs. CISC Instruction Sets: A major architectural distinction is between RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer) philosophies. CISC architectures (like traditional x86) have many specialized instructions, some of which perform multi-step operations with a single instruction (e.g., an instruction that might fetch from memory, perform an arithmetic operation, and store the result back to memory all in one) (RISC vs. CISC). For example, Intel x86 has a “string move” instruction that can copy a block of memory, and some CISC machines had a single instruction to evaluate array indices or complex addressing modes. This can reduce the number of instructions a program needs, but each instruction can take multiple clock cycles to execute and the decoding logic is more complex. RISC architectures (like MIPS, ARM) simplify the instruction set: they use simple, fixed-size instructions that each perform a small unit of work (such as a single arithmetic operation or a load/store). A RISC design typically uses a load/store architecture, meaning only load and store instructions access memory, and all arithmetic/logic instructions operate on registers. For instance, instead of a single complex multiply instruction that reads from memory, a RISC program would do a `LOAD` from memory to a register, then a `MULT` on registers, then a `STORE` back to memory (RISC vs. CISC). This often means more instructions in the program, but each is very fast and optimizable. The list below outlines some classic differences:
- Instruction Complexity: RISC instructions are simple and execute in one (or a few) cycles; CISC instructions can be multi-cycle and more complex.
- Instruction Length: RISC uses fixed-length (e.g., all 4 bytes), simplifying decoding; CISC uses variable lengths (1-15 bytes in x86), which are compact but harder to decode (Comparison of instruction set architectures – Wikipedia).
- Memory Access: RISC only accesses memory via specific load/store instructions (arithmetic instructions do not directly fetch operands from memory), whereas CISC instructions may directly operate on memory operands (e.g., add two memory locations).
- Registers: RISC typically has a larger set of general-purpose registers (to facilitate operand storage between memory accesses), while older CISC designs had fewer registers and relied more on memory.
- Microcode vs Hardware: Many CISC implementations internally use microcode (see control units below) to break down complex instructions, whereas RISC designs often use hardwired control for simplicity and speed.
In practice, modern high-performance x86 processors translate complex CISC instructions into RISC-like micro-operations internally. ARM (Advanced RISC Machine) is a prominent RISC architecture used in mobile and embedded devices, emphasizing a load/store design and uniform fixed instruction size. Intel x86 (and x86-64) is the archetypal CISC, with a rich set of instructions for backwards compatibility. For example, adding two numbers in memory on x86 might use one instruction `ADD [addr1], [addr2]` (if it were allowed), while on ARM it would require `LDR R1, [addr1]; LDR R2, [addr2]; ADD R3, R1, R2; STR R3, [addr1]` to achieve the same final effect. Despite the higher instruction count, RISC designs gain efficiency through pipelining and simpler hardware, whereas CISC aims to reduce program size and instruction count at the cost of more complex hardware per instruction (RISC vs. CISC).
In summary, RISC vs CISC is a trade-off between software complexity and hardware complexity. RISC puts more burden on the software (compiler) to break tasks into simple steps, yielding hardware that is easier to optimize and pipeline. CISC puts more intelligence in hardware to do complex operations, which can simplify software in some cases. Modern CPU designs often borrow elements of both philosophies.
ALU, Data Path & Control Unit
ALU (Arithmetic Logic Unit) Design and Operations: The ALU is the part of the CPU that performs arithmetic and logical operations. It is a combinational digital circuit: given inputs (operands) and a code for which operation to perform, it produces an output and status flags (like zero, carry, overflow). Typical ALU operations include integer addition, subtraction, bitwise AND/OR/XOR, bit shifts, and sometimes multiplication or division (or support for them via multiple steps). For example, adding two 32-bit integers is done by the ALU using binary adders (with carry propagation). The design of an ALU might involve carry-lookahead or ripple-carry adders for arithmetic, and logic gates for bitwise operations. Some ALUs also handle comparisons (subtract and check zero/negative flag) and incrementing/decrementing addresses. The ALU is a fundamental building block and usually designed to be as fast as possible, since it directly affects the CPU’s execute stage performance. In many architectures, results from the ALU also set condition codes or flags (for example, a zero flag if the result is 0, a negative flag for sign, carry or overflow for arithmetic conditions), which the control unit can use for branching decisions.
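As a rough illustration of the behavior described above, the following sketch models a 32-bit ALU that produces a result plus zero/negative/carry/overflow flags. The operation names and flag layout are assumptions for illustration, not a specific CPU's ALU:

```python
# Toy 32-bit ALU: returns (result, flags) where flags are Z (zero), N (negative),
# C (unsigned carry-out), V (signed overflow).
MASK = 0xFFFFFFFF

def alu(op, a, b):
    a &= MASK
    b &= MASK
    carry = overflow = False
    if op == "ADD":
        full = a + b
        result = full & MASK
        carry = full > MASK
        overflow = bool((a ^ result) & (b ^ result) & 0x80000000)
    elif op == "SUB":
        full = a + ((~b + 1) & MASK)        # two's-complement subtraction
        result = full & MASK
        carry = full > MASK
        overflow = bool((a ^ b) & (a ^ result) & 0x80000000)
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "XOR":
        result = a ^ b
    elif op == "SHL":
        result = (a << (b & 31)) & MASK     # logical shift left
    else:
        raise ValueError("unsupported ALU operation")
    flags = {"Z": result == 0, "N": bool(result >> 31), "C": carry, "V": overflow}
    return result, flags

print(alu("ADD", 0xFFFFFFFF, 1))   # wraps to 0: Z set, carry-out set
print(alu("SUB", 5, 5))            # 0: Z set
```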
Data Path Components and Interaction: The datapath of a processor is the collection of functional units (like the ALU and possibly multipliers or shifters), registers, and buses that together process instructions and data (Datapath – Wikipedia). Key components include:
- Registers: These are small storage elements within the CPU that hold operands and results. There are typically general-purpose registers (for data/address values) and special registers (like the Program Counter (PC) that holds the address of the next instruction, the Instruction Register (IR) holding the current instruction, and others such as stack pointers or status registers). In the datapath, registers provide the inputs to the ALU and store outputs from the ALU. For instance, in a simple RISC datapath, two source registers (read during the decode stage) feed the ALU, and the ALU result can be written to a destination register in the write-back stage.
- Buses: Buses are shared pathways for data to travel between components. An internal CPU bus might connect registers to the ALU, or multiple ALUs. Many CPU designs use multiplexers (muxes) rather than a single bus, to feed different inputs to units. But conceptually, an internal bus is a set of wires that can carry a binary word from one part of the CPU to another. For example, the output of the ALU might be connected by a bus to both the register file (to store the result) and to other units or an output buffer.
- Multiplexers (MUXes): These are switches that select one of several data inputs to pass along. They are heavily used in datapaths to choose among different possible sources for a given input. For example, one multiplexer might select whether the second ALU operand comes from a register or an immediate value encoded in the instruction (implementing immediate addressing mode). Another mux might select whether the value to write-back to a register comes from the ALU or from memory (implementing the difference between an arithmetic instruction result vs. a load instruction’s result).
- Memory interfaces: The datapath interacts with memory during the instruction fetch and (for load/store instructions) during memory access stage. In a simple CPU, the PC register’s value goes onto an address bus to fetch an instruction from memory. During a load/store, the ALU computes an address (perhaps adding a base register and offset), and that address is sent to memory to fetch or store a data word.
A typical simple RISC datapath flows as follows: The fetch stage uses the PC to get an instruction from instruction memory; in the decode stage, the fetched instruction is decoded and the register operands are read from the register file; the ALU executes the operation in the execute stage (using operands passed from registers or immediate values and possibly computing an address for memory access); in the memory stage, if the instruction is a load or store, data memory is accessed; finally, in the write-back stage, the result (from ALU or memory) is written back into the destination register. During each step, control signals (from the control unit) guide the dataflow, e.g. enabling the appropriate registers and selecting correct multiplexer inputs. The datapath is designed such that all needed operations for a single instruction can flow through these components in an organized way.
Real CPU designs may have multiple ALUs or additional special-purpose functional units (like a floating-point unit or vector units), and more complex interconnects. But the principle is the same: the datapath is the “execution engine” of the processor where data moves through components to carry out instructions (Datapath – Wikipedia).
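The following sketch traces instructions through a highly simplified single-cycle datapath (a register file, a multiplexer choosing the second ALU operand, and a write-back multiplexer choosing between ALU and memory results). The instruction encoding as a dictionary and the three-operand format are assumptions made purely for illustration:

```python
# Simplified single-cycle datapath step for a toy load/store machine.
# An instruction looks like {"op": "ADD", "rd": 3, "rs1": 1, "rs2": 2}
# or {"op": "LOAD", "rd": 4, "rs1": 1, "imm": 8}.
regs = [0] * 8          # small register file
data_mem = {108: 99}    # toy data memory

def execute(instr):
    a = regs[instr["rs1"]]                                   # register read port 1
    # MUX: second ALU operand comes from a register or an immediate field
    b = regs[instr["rs2"]] if "rs2" in instr else instr["imm"]
    alu_out = a + b if instr["op"] in ("ADD", "LOAD", "STORE") else a - b
    if instr["op"] == "STORE":                               # MEM stage: write memory
        data_mem[alu_out] = regs[instr["rd"]]
        return
    # MUX: write-back value comes from memory (loads) or from the ALU
    wb = data_mem.get(alu_out, 0) if instr["op"] == "LOAD" else alu_out
    regs[instr["rd"]] = wb                                   # WB stage: register write

regs[1], regs[2] = 100, 20
execute({"op": "ADD", "rd": 3, "rs1": 1, "rs2": 2})   # R3 = 120
execute({"op": "LOAD", "rd": 4, "rs1": 1, "imm": 8})  # R4 = mem[100 + 8] = 99
print(regs[3], regs[4])
```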
Hardwired vs. Microprogrammed Control Units: The control unit is responsible for orchestrating the entire process of instruction execution by generating the necessary control signals in the correct sequence. There are two primary strategies for control unit design:
- Hardwired Control: In a hardwired control unit, the control logic is implemented with fixed electronic circuits (combinational logic, state machines). The instruction opcode bits feed into this logic, which is essentially a network of gates, decoders, and flip-flops that directly generate signals to control multiplexers, register loads, ALU operation codes, memory read/write, etc. The sequencing of micro-operations (steps like “load PC into MAR” or “enable ALU add”) is determined by the states of a finite state machine (which often corresponds to the pipeline stage or step of execution). Hardwired control units are typically very fast because the decision logic is composed of simple gates and next-state logic, with no extra level of interpretation. However, they can be inflexible: changing the instruction set or adding a new instruction requires redesigning the logic. They are also more complex to design if the instruction set is large or complex. As an example, many RISC CPUs use hardwired control for simplicity and speed, given their relatively simpler instruction formats and behavior.
- Microprogrammed Control: A microprogrammed control unit uses a small internal memory (control store) to hold micro-instructions (the “microcode”). Each micro-instruction describes one or a few control steps (control signals to assert, etc.). When an instruction is fetched, the control unit enters a microcode routine (a sequence of micro-instructions) that implements that instruction’s behavior. Each micro-instruction is essentially a bit pattern that drives various control lines (e.g., one bit might enable the ALU’s add operation, another bit might enable a register’s input, etc.), and usually also contains the address of the next micro-instruction. In other words, instead of hardwired combinational logic figuring out what to do for an opcode, the opcode is used to jump to a microprogram in the control memory, and then those micro-instructions emit the proper signals in sequence. Microprogrammed control thus sequences through a set of steps for each machine instruction, which is easier to design and modify: to add a new instruction, one can write a new microcode routine; to fix a bug, update the microcode (indeed, early microprogrammed machines could have their control store updated, which is akin to a firmware update). The downside is typically a slight decrease in speed – executing a microprogram (which might take several micro-instruction cycles per machine instruction) can be slower than a streamlined hardwired implementation (Difference between Hardwired and Microprogrammed Control Unit). In modern terms, a microcode control cycle might be on the order of a few clock ticks, so a complex CISC instruction could take many cycles as it steps through microinstructions. Many CISC processors like the IBM S/360 and early Intel x86 implementations were famously microprogrammed, which allowed them to implement very complex instructions (e.g. string operations, multi-step arithmetic) with manageable control complexity. RISC processors, in contrast, often avoided microcode to achieve single-cycle instruction execution, although some RISC designs still use microcode for complex or privileged operations.
Control Signal Sequencing: Whether hardwired or microprogrammed, the control unit must issue signals in a timely sequence for each instruction. Consider a simple example of a hardwired control sequence for a generic instruction fetch-execute cycle:
- Fetch step: Assert signals to load the PC’s content onto the address bus, read memory into the instruction register (IR), and increment the PC. These might be orchestrated in one or more clock cycles.
- Decode step: Enable the instruction decoder to examine IR and set up control lines (this could be combinational logic that immediately determines what the next steps are).
- Execute step: Depending on the decoded instruction type, different control signals are issued. For an ADD instruction, the control unit might signal the ALU to perform addition and the register file to latch the result into the destination register. For a memory load, it might signal the ALU to compute an address, then trigger a memory read, etc.
- Write-back step: If needed, signal the register file to write the result (from ALU or memory) into the target register, and update status flags if any.
In a hardwired system, these signals are generated by a state machine progressing through states corresponding to these steps. In a microprogrammed system, these signals are encoded in microinstructions. For example, a microinstruction might have fields such as `ALU_op = ADD, Reg_dest_enable = 1, Mem_read = 0, Next_micro_PC = X`. The microcontrol unit will load each microinstruction, output those control fields to the hardware, then go to the next microinstruction as directed (sequentially, or via a branch if the instruction needs to make a decision, e.g. for a conditional branch instruction’s microcode).
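A microprogrammed sequencer can be pictured as a loop over a control store. The sketch below uses invented field names (ALU_op, RegWrite, MemRead, next) and made-up micro-addresses purely to illustrate how micro-instructions drive control lines and chain to the next micro-address:

```python
# Toy control store: each micro-instruction is a dict of control fields
# plus the address of the next micro-instruction (None = routine finished).
control_store = {
    # microcode routine for a hypothetical ADD machine instruction
    0x10: {"ALU_op": "ADD", "RegWrite": 1, "MemRead": 0, "next": None},
    # microcode routine for a hypothetical LOAD: compute address, then read memory
    0x20: {"ALU_op": "ADD", "RegWrite": 0, "MemRead": 0, "next": 0x21},
    0x21: {"ALU_op": None,  "RegWrite": 1, "MemRead": 1, "next": None},
}

# Opcode-to-microcode dispatch: the opcode selects the entry point in the store.
dispatch = {"ADD": 0x10, "LOAD": 0x20}

def run_microprogram(opcode):
    upc = dispatch[opcode]              # jump into the control store
    while upc is not None:
        micro = control_store[upc]
        # In hardware these fields would be asserted as physical control lines.
        print(f"uPC={upc:#x}: ALU_op={micro['ALU_op']} "
              f"RegWrite={micro['RegWrite']} MemRead={micro['MemRead']}")
        upc = micro["next"]             # sequence to the next micro-instruction

run_microprogram("LOAD")
```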
Hardwired vs Microprogrammed Trade-offs: Hardwired control units tend to have higher performance (since there is no microinstruction decoding overhead) and can be optimal if the ISA is simple. They are prevalent in high-speed RISC designs and any design aiming for minimum cycle-per-instruction. Microprogrammed control units excel in complexity management: implementing a large CISC ISA or very complex behaviors is easier with microcode. They also allow late design changes (or even field updates) by changing microcode rather than physical wiring (Difference between Hardwired and Microprogrammed Control Unit). Historically, machines like DEC VAX and Intel IA-32 used microcode to handle many complex instructions. As technology evolved, even microprogrammed CPUs added techniques to speed them up (like caching decoded micro-ops, etc.).
In summary, the ALU and datapath perform the computations and data movements, while the control unit (hardwired or microprogrammed) issues signals to make those operations happen in the correct order. For example, when an ARM processor executes an `ADD r0, r1, r2` instruction, the control unit will ensure that in that cycle the values from r1 and r2 are fed to the ALU, the ALU is set to “add”, and the result is routed back into r0 — these control signals might be generated by simple logic (if opcode = ADD, then ALU_op = add, RegWrite = true, etc.). On an Intel x86, a single instruction like `REP MOVSB` (move a block of bytes, repeating) would trigger a microprogram that loops, copying bytes and updating pointers until a count is exhausted, because doing this in purely hardwired logic would be prohibitive. As a concrete example, the control unit of the MIPS R2000 (a classic RISC) was fully hardwired and pipeline-oriented, whereas the control unit of the Intel 8086 was microprogrammed. Modern x86-64 chips internally translate complex instructions to RISC-like micro-operations and use a combination of hardwired scheduling and microcode for certain instructions, illustrating a hybrid approach.
Instruction Pipelining & Hazards
Classic 5-Stage RISC Pipeline: Pipelining is a technique used in CPUs to improve instruction throughput by overlapping the execution of multiple instructions. A classic RISC pipeline (such as in MIPS architecture) has five stages, each handling a specific part of the instruction execution, and multiple instructions can be in different stages simultaneously (like an assembly line). The five stages are:
- IF – Instruction Fetch: The processor fetches the next instruction from memory (at the address in the program counter, PC). The instruction is stored in the instruction register (IR). Meanwhile, the PC is typically incremented to point to the following instruction.
- ID – Instruction Decode (and Register Fetch): The fetched instruction’s opcode and fields are decoded to determine what type of instruction it is and what needs to be done. In this stage, the register file is also read – the source register operands (if any) are read out into internal buffers. Also, the control signals for subsequent stages are prepared based on the decoded instruction. If the instruction uses an immediate value (constant), it is extracted, and if it’s a branch, the target address might be computed (e.g., PC + immediate offset).
- EX – Execute (or Effective Address calculation): In this stage, the ALU (or other execution unit) performs the required operation. For an arithmetic/logical instruction, the ALU will combine the register operands (and/or immediate) to produce a result (e.g. add two numbers). For a load or store, the ALU computes the effective memory address by adding a base register and offset. For a branch instruction, the ALU might compare registers (for a condition) and also compute the branch target address.
- MEM – Memory Access: If the instruction is a data access (load or store), this stage will perform the memory read or write using the address computed in EX. A load will retrieve data from memory into a pipeline register; a store will write data from a register to memory. If the instruction does not access memory (e.g., an ALU operation), this stage can be effectively idle or bypassed. (In some pipeline designs, this stage is also used for other activities like taking a branch – for example, updating the PC for a branch might happen here if the architecture has a branch delay of one cycle.)
- WB – Write-Back: In the final stage, the result of the instruction is written back to the register file, if needed. For an arithmetic or logic instruction, the ALU result is written to the destination register. For a load instruction, the data fetched from memory is written to the target register. Store instructions typically don’t need the WB stage for themselves (since they wrote to memory in MEM stage, and registers remain unchanged), but other instructions do. After write-back, the instruction is complete.
In a pipeline, while one instruction is in the EX stage, the next one can be in ID, and another in IF, etc. In an ideal case (and a simple RISC like MIPS was designed for this ideal), one instruction completes (and a new one starts) every clock cycle, achieving throughput of one instruction per cycle (IPC = 1). However, this ideal is disrupted by various hazards.
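To visualize the overlap, this small sketch prints which stage each instruction occupies in every cycle of an ideal (hazard-free) 5-stage pipeline; the instruction labels are arbitrary:

```python
# Ideal 5-stage pipeline timing diagram: instruction i enters IF in cycle i+1,
# so with no hazards one instruction completes per cycle once the pipe is full.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
instructions = ["I1", "I2", "I3", "I4"]

total_cycles = len(instructions) + len(STAGES) - 1
print("cycle:", " ".join(f"{c:>4}" for c in range(1, total_cycles + 1)))
for i, name in enumerate(instructions):
    row = []
    for cycle in range(total_cycles):
        stage = cycle - i                # which stage this instruction is in, if any
        row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "")
    print(f"{name:>5}:", " ".join(f"{s:>4}" for s in row))
```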
Pipeline Hazards: A hazard is any condition that causes the pipeline to stall or execute instructions in a way that deviates from the ideal one-per-cycle flow. There are three classic classes of hazards (13-14_PPT.pptx):
- Structural Hazards: These occur when hardware resources are insufficient to handle the overlap of instructions. For example, if two stages of the pipeline need to use the same piece of hardware in the same cycle, that’s a structural conflict. A common structural hazard would be if the CPU had a single memory for instructions and data – an instruction fetch and a data memory access could conflict if they happen in the same cycle. Another example is if there is only one ALU and one instruction is in EX stage (using the ALU) while another different stage also needs the ALU (some designs might have multiple operations requiring ALU). In well-designed RISC pipelines, structural hazards are avoided by design (e.g., having separate instruction and data memory (Harvard architecture) or enough duplicated resources). If they do occur, the pipeline must stall one of the instructions until the resource is free (13-14_PPT.pptx).
- Data Hazards: These happen when an instruction depends on the result of a prior instruction that is still in the pipeline and not yet available. For example, if instruction 1 computes a value and instruction 2 needs that value as an input, in a pipeline, instruction 2 might reach the execute stage before instruction 1’s result is written back. This Read-After-Write (RAW) dependency is the most common type of data hazard: the consumer tries to read a source before the producer writes it (13-14_PPT.pptx). Other types (Write-After-Write, Write-After-Read) can occur in more complex pipelines or with out-of-order execution, but in an in-order 5-stage pipeline, RAW is the primary concern. If not handled, a data hazard could result in using incorrect (old) operand values.
- Control Hazards: These are caused by instructions that change the flow of control, such as branches and jumps. The problem is that the pipeline may fetch a few instructions beyond a branch before it is known whether the branch will be taken or not. For example, a conditional branch’s outcome is typically determined in the EX stage (when a comparison is done), which is 2 stages after the instruction fetch of the branch. Meanwhile, the pipeline will have speculatively fetched the next sequential instructions. If the branch is taken, those fetched instructions are wrong and must be discarded, causing a stall or flush of the pipeline (13-14_PPT.pptx). Even for an unconditional jump or call, there is a hazard because the target address might not be known early in the pipeline. Control hazards thus deal with the latency between fetching an instruction that changes PC and knowing the correct next PC.
Hazard Resolution Techniques: Several techniques are used to mitigate or eliminate pipeline hazards, thereby improving performance:
- Stalling (Pipeline Interlock): The simplest approach to handle hazards is to stall the pipeline. This means inserting bubbles (no-ops) or delaying pipeline progress until the hazard is resolved. For a data hazard, if an instruction needs a value that isn’t ready, the control logic can pause the newer instruction in the decode stage (or insert a no-op into execution stage) until the older instruction completes and writes back the data. Stalling ensures correctness but comes at a performance cost (the pipeline isn’t doing useful work for those cycles). In our 5-stage example, a load-use hazard (an instruction trying to use data right after a load) might incur a one-cycle stall because the data from memory won’t be written to the register until the end of the MEM stage, which is one cycle after the next instruction’s EX stage. The control unit detects the hazard in the decode stage and delays the dependent instruction (13-14_PPT.pptx). For control hazards, stalling means the pipeline stops fetching new instructions until the branch decision is made (this can cost several cycles for each branch).
- Operand Forwarding (Data Forwarding/Bypassing): This is a hardware technique to reduce data hazard stalls by “forwarding” results to where they are needed before the normal write-back. If an instruction produces a result in the EX stage, normally that result is written to the register file in the WB stage (two cycles later). With forwarding, that result can be fed directly from the EX/MEM pipeline register into the ALU input of a subsequent instruction in the next cycle, without waiting for it to go to the register file (13-14_PPT.pptx). For example, consider `ADD R1, R2, R3` followed by `SUB R4, R1, R5`. The SUB needs R1, which is the result of the ADD. Without forwarding, the SUB would stall until the ADD completes WB. With forwarding, as soon as the ADD computes its sum in the EX stage, that sum is routed to the input of the ALU for the SUB (which will be in its EX stage one cycle later). Most RISC pipelines include forwarding paths for ALU results and possibly for data loaded from memory (although load-use hazards often still need a one-cycle stall because memory data is only ready in the MEM stage). Forwarding requires additional multiplexers and control logic, but significantly improves pipeline efficiency (13-14_PPT.pptx). In summary, forwarding passes the result directly from one pipeline stage to a previous stage’s input as soon as it’s available, solving many RAW hazards without stalling.
- Branch Prediction: For control hazards, one powerful solution is to predict the outcome of branches to avoid stalling while waiting for certainty. Static branch prediction might be a simple rule like “assume branches are not taken” (always continue sequentially; if the branch turns out to be taken, flush and fetch from the target), or “always taken”, or a choice based on profiling information. Dynamic branch prediction uses hardware that monitors the history of branches to guess future behavior (Lecture9_cda3101). For example, a 1-bit or 2-bit saturating counter scheme in a Branch History Table (BHT) records whether a branch was taken the last time(s) and predicts the same this time (Lecture9_cda3101); a minimal sketch of such a 2-bit predictor appears just after this list of techniques. Modern CPUs use sophisticated predictors (multi-bit counters, global histories, branch target buffers, etc.) to achieve high accuracy. With prediction, the pipeline will speculatively fetch and execute along the predicted path. If the prediction is correct, no time is lost (just as if there were no hazard). If the prediction is wrong, however, the speculatively executed instructions must be squashed (their results ignored) and the correct path fetched, incurring a penalty (the time spent on the wrong path is wasted). For a simple 5-stage pipeline, a misprediction might cost 2-3 cycles of flush. In deeper pipelines (like on modern processors with 15+ stages), misprediction penalties are much higher, which is why extremely accurate branch predictors are critical in those CPUs. Branch prediction thus converts control hazards into a scenario of usually no stall (for correct guesses) and occasional penalty (for mispredicts), improving overall flow.
- Delayed Branching: This is an architectural technique where the ISA is designed such that the instruction immediately following a branch is always executed, whether or not the branch is taken. This “delay slot” means the hardware will not waste that slot; it’s the compiler’s job to fill it with a useful instruction (one that should run regardless of branch outcome, or perhaps a safe no-op if nothing else) (Delay slot – Wikipedia). In effect, the branch takes effect after one instruction delay. For example, in MIPS, a branch instruction will execute and the next instruction in memory (the delay slot) executes too, then the PC jumps to the target. If the branch is not taken, the delay slot was just the next instruction anyway. This approach eliminates the control hazard for a single-cycle branch latency, as the pipeline doesn’t have to stall for that one cycle – it’s doing useful work in the delay slot. However, delayed branching complicates software and has limitations: it assumes a fixed pipeline timing (one delay slot) and if pipelines get deeper, one slot may not be enough to cover the delay of resolving a branch. Modern architectures have largely dropped the delay slot concept in favor of branch prediction, because with deeper pipelines and out-of-order execution, delay slots become awkward (e.g., no one wants 5 delay slots for a 20-stage pipeline). Early RISC machines like MIPS and SPARC used 1 delay slot. In practice, compilers could often fill around 60-80% of delay slots with useful instructions (by moving independent instructions into the slot), and only use NOPs when no independent work was available. Example: in MIPS assembly:
```
BEQ R1, R2, target   # if (R1 == R2) branch
ADD R3, R4, R5       # delay slot: this will execute regardless
```
Here, the ADD will execute even if the branch goes to target. If R3 is not needed or it’s something that makes sense to do anyway, then no cycle was lost. If branch was not taken, we just did ADD as normal. If branch was taken, we still did ADD but then jumped. This requires the compiler/assembler to be careful that the delay slot instruction does not harm program correctness. Delayed branching is a way to cope with control hazard by architectural means (the hazard still exists, but its impact is mitigated by doing useful work in the “delay” period). Modern CPUs, as mentioned, prefer to predict and fetch the likely next instruction rather than rely on delay slots (Delay slot – Wikipedia).
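Here is the minimal 2-bit saturating-counter predictor referenced above. A real BHT indexes a table of such counters by branch address, but a single counter is enough to show the state machine; the initial state and the example outcome sequence are assumptions:

```python
# 2-bit saturating counter: states 0-1 predict "not taken", states 2-3 predict "taken".
class TwoBitPredictor:
    def __init__(self, initial_state=1):
        self.state = initial_state          # start weakly not-taken (assumption)

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

# A loop-closing branch: taken nine times, then not taken once at loop exit.
outcomes = [True] * 9 + [False]
predictor = TwoBitPredictor()
correct = 0
for actual in outcomes:
    correct += predictor.predict() == actual
    predictor.update(actual)
print(f"correct predictions: {correct}/{len(outcomes)}")  # 8/10: mispredicts only twice
```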
Other techniques and considerations:
- Some pipelines also reorder instructions or have the compiler schedule instructions to avoid hazards (e.g., insert safe instructions in delay slots or between a load and its use to avoid load-use stall). This is a static scheduling approach.
- Out-of-order execution (used in many advanced CPUs) dynamically finds independent instructions to execute while waiting for hazards to resolve, but that goes beyond the simple in-order pipeline model here.
- Pipeline resource duplication: For structural hazards, one can add hardware. For example, the MIPS pipeline avoids an instruction-memory vs data-memory structural conflict by having separate instruction and data caches (so an IF stage memory access and a MEM stage access can happen in parallel). If a structural hazard still remains (say two EX-stage operations needed the single multiplier), designers might include multiple units or decide that such conflict is acceptable and will stall.
- Pipeline flushing on exceptions: Although not exactly a hazard, similar control logic is needed to flush the pipeline if an unexpected event (interrupt, exception) occurs in order to restart or handle it. This is a facet of control in pipelined design.
Summary of Hazards and Solutions: The following table summarizes the types of pipeline hazards and common solutions:
Hazard Type | Cause | Common Solutions |
---|---|---|
Structural Hazard | A required hardware resource is busy or unavailable (e.g., two instructions need one memory or functional unit in the same cycle). | Design-time solution: add duplicate hardware or separate resources (e.g. split caches) to avoid conflict. Runtime solution: Stall the pipeline until the resource is free (pipeline freeze for that cycle). In practice, RISC pipelines aim to eliminate structural hazards by adequate hardware provisioning (13-14_PPT.pptx). |
Data Hazard (RAW dependency) | An instruction needs data that is produced by a prior instruction still in the pipeline (not yet written to register). | Forwarding (Bypassing): Feed the data directly from producer to consumer as soon as it’s computed (13-14_PPT.pptx). Stalling: Pause the consumer until the producer writes the data (if forwarding can’t resolve it, e.g., load-use). Instruction Scheduling: The compiler or CPU can reorder instructions to avoid close dependencies (fill delay slots or separate dependent ops). |
Control Hazard | The next instruction address is uncertain due to a branch or jump. The pipeline may fetch the wrong instruction. | Stalling (sequential fetch): Wait until branch is resolved (simple but costly, e.g. 2-3 cycle stall per branch). Branch Prediction: Guess the branch direction (and possibly target) to continue fetching without pause; flush pipeline if guess was wrong (Lecture9_cda3101). Dynamic predictors greatly reduce stalls on average. Delayed Branch: Use architecture-defined delay slots; execute one or more instructions after the branch regardless of direction (Delay slot – Wikipedia) (places burden on compiler to supply useful work in those slots). Branch Target Buffers: Cache branch targets to quickly fetch correct instruction if taken (minimize lost cycles). |
It’s worth noting that pipelining improves throughput but not the latency of a single instruction. The efficiency of a pipeline is measured by CPI (cycles per instruction). In an ideal 5-stage pipeline with perfect hazard handling, CPI approaches 1. Hazards and stalls make CPI > 1; for example, a data hazard stall might make one instruction effectively take 2 cycles. One design trade-off is pipeline depth: deeper pipelines (more stages) can potentially increase clock speed (since each stage does less work per cycle), but they also increase the cost of hazards (especially branches, since there are more in-flight stages to flush or more delay slots to fill). For instance, early Pentium processors had pipelines of around 5 stages, while later ones like the Pentium 4 had extremely deep pipelines (20+ stages) to reach high clock rates – but mispredicted branches on those could cost a huge penalty, which made them heavily reliant on very good branch predictors. Meanwhile, some modern RISC designs like ARM chose more moderate pipeline depths to balance these factors.
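As a back-of-the-envelope illustration of how stalls push CPI above 1, the calculation below uses invented event frequencies and penalties, not measurements from any real machine:

```python
# Effective CPI = base CPI + sum over stall events of (frequency * stall cycles).
base_cpi = 1.0
load_use_freq, load_use_stall = 0.15, 1        # assumed: 15% of instructions stall 1 cycle
branch_freq, mispredict_rate, flush_penalty = 0.20, 0.10, 3  # assumed branch behavior

effective_cpi = (base_cpi
                 + load_use_freq * load_use_stall
                 + branch_freq * mispredict_rate * flush_penalty)
print(f"effective CPI = {effective_cpi:.2f}")  # 1.00 + 0.15 + 0.06 = 1.21
```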
In a real-world example, the MIPS R3000 pipeline is a classic 5-stage that relies on a simple branch delay slot and forwarding for data hazards, achieving close to 1 CPI on non-hazard code. Intel’s Pentium (P5) had a dual 5-stage pipeline (called U/V pipes) and performed some limited form of speculative execution with a simple branch predictor; by the time of Pentium II/III and beyond, the pipeline became deeper and dynamic scheduling was introduced to handle hazards more gracefully (out-of-order execution can avoid stalls by doing other work). The concepts of hazards and interlocks, however, remain foundational for understanding any pipelined design.
Memory Hierarchy
Modern computer systems employ a memory hierarchy to balance speed, cost, and capacity. The hierarchy ranges from small, fast storage (registers, caches) that the CPU can access quickly, to very large but slower storage (main memory, and beyond that disk storage). Organizing memory in this way exploits the principle of locality (programs tend to reuse data and instructions that are near in time or address) to achieve high performance at reasonable cost. Key aspects of the memory hierarchy include caches, main memory (RAM), secondary storage (HDD/SSD), and mechanisms like virtual memory that tie it all together.
Cache Design Overview: A cache is a small, fast memory (typically built with SRAM technology) that stores copies of frequently accessed data from a larger, slower memory (DRAM main memory). When the CPU needs to read or write data, it first checks the cache. If the data is in the cache (a cache hit), it can be accessed quickly; if not (a cache miss), the data must be fetched from main memory (incurring a longer delay), and usually that data is also loaded into the cache in anticipation of future accesses. Caches are typically organized in fixed-size units called cache lines or blocks (e.g. 32 or 64 bytes each), which are the chunks moved between cache and main memory.
Three important design parameters of caches are: placement (mapping) policy, replacement policy, and write policy.
- Mapping Techniques (Direct, Associative, Set-Associative): This refers to how memory addresses are mapped to positions in the cache. Because caches are much smaller than main memory, each cache line could potentially hold many different memory locations over time, but at any moment it holds one specific block of memory. Common mapping policies:
- Direct-Mapped Cache: Each memory block maps to exactly one cache line (position) determined by a simple formula (often `index = Block_Number mod Number_of_Lines`). For example, if a cache has 256 lines, then memory blocks 0, 256, 512, … would all map to line 0 in the cache (they can’t all be in cache at once; if one is there and another is accessed, it will replace the previous). Direct mapping is simple and fast to determine (bits of the address are used directly as the index), but it can lead to conflict misses: if a program alternates accesses to two addresses that map to the same line, the cache will continually evict one to bring in the other, even if the cache has lots of space free elsewhere (the so-called ping-pong effect) (Basics of Cache Memory – Computer Architecture).
- Fully Associative Cache: A memory block can be placed in any cache line. The cache must search all its lines to find a block (or an empty spot) because the placement is flexible (Basics of Cache Memory – Computer Architecture). Fully associative caches minimize conflict misses (since any block can go anywhere, the only reason a block wouldn’t be in cache is that the cache is full and some block had to be evicted). However, checking all lines for a match (tag comparison) is expensive in hardware for large caches (it requires many comparators operating in parallel) (Basics of Cache Memory – Computer Architecture). Fully associative caches are therefore typically used only for small structures (like TLBs or very small level-1 caches) due to hardware complexity.
- N-way Set Associative Cache: This is a compromise between direct and fully associative. The cache is divided into sets, each set containing N lines (where N is the associativity). A memory block maps to exactly one set (often by index bits like direct-mapped), but within that set, the block can occupy any of the N lines. So you need to check at most N entries on a lookup (an N-way associative search) (Basics of Cache Memory – Computer Architecture). For example, in a 4-way set-associative cache with 256 total lines, there would be 64 sets (since 4 lines per set), and each memory block maps to one set (say by bits of address). On a lookup, only the 4 lines in that set need to be checked for a match. This significantly reduces conflict misses compared to direct mapping, since if two blocks map to the same set, they can coexist as long as there are free lines in that set (in a 4-way cache, up to 4 blocks that share the same index can be in cache simultaneously). Hardware complexity is also moderate: checking 4 entries in parallel is feasible. Many L1 and L2 caches today are 4-way or 8-way associative, striking a balance in hit rate and access speed (Basics of Cache Memory – Computer Architecture). Direct-mapped caches are effectively 1-way associative; fully associative is N-way where N = number of lines.
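The address-to-set mapping described above can be shown with a few lines of bit arithmetic; the cache geometry below (64-byte lines, 4-way, 256 sets) is a typical but assumed configuration:

```python
# Split a physical address into tag, set index, and byte offset
# for a set-associative cache.
LINE_SIZE = 64          # bytes per cache line
WAYS = 4                # associativity
NUM_SETS = 256          # sets = total_lines / WAYS

def decompose(addr):
    offset = addr % LINE_SIZE
    block_number = addr // LINE_SIZE
    set_index = block_number % NUM_SETS     # a direct-mapped cache would use a line index here
    tag = block_number // NUM_SETS          # identifies which block currently occupies the line
    return tag, set_index, offset

for addr in (0x0000, 0x4000, 0x8000):       # these three addresses all map to set 0 ...
    print(hex(addr), "-> tag, set, offset =", decompose(addr))
    # ... but with 4 ways they can coexist in that set without conflict misses.
```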
- Replacement Policies: When a new block needs to be brought into a cache and the cache (or the target set) is full, a decision must be made on which existing cache block to evict (replace). Common policies:
- LRU (Least Recently Used): Evict the block that has not been used for the longest time, on the assumption that recently used blocks are more likely to be used again. LRU is a popular strategy for small associativities (like 2-way or 4-way caches) because it tends to work well with temporal locality.
- FIFO (First-In First-Out): Evict the oldest block (that has been in cache the longest, regardless of usage). This is easier to implement than true LRU and can work well in certain patterns.
- Random: Pick a random line in the set to evict. Surprisingly, random replacement can perform not too far from LRU in some cases, and is very simple to implement (no need to track usage history). Some processors use pseudo-random or simpler approximations to LRU to reduce hardware cost.
- MRU (Most Recently Used): In some workload patterns (like cyclic buffering), evicting the most recently used can be beneficial, but this is less common as a general policy.
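A minimal LRU policy for one cache set can be modeled with an ordered structure. This sketch (the set size and access pattern are arbitrary assumptions) evicts the least recently used tag when the set is full:

```python
from collections import OrderedDict

# LRU for a single 4-entry cache set: most recently used tags move to the end.
class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.lines = OrderedDict()          # tag -> placeholder data (order = recency)

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)     # hit: mark as most recently used
            return "hit"
        if len(self.lines) >= self.ways:    # miss on a full set: evict the LRU entry
            self.lines.popitem(last=False)
        self.lines[tag] = None              # bring the new block in
        return "miss"

s = LRUSet()
for tag in [1, 2, 3, 4, 1, 5, 2]:           # tag 5 evicts tag 2 (the least recently used)
    print(tag, s.access(tag))               # so the final access to tag 2 misses again
```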
- Write Policies (Write-Through vs Write-Back; Write Allocate): Caches also must handle write operations (store instructions). The policies here concern how and when a write to a cached value propagates to lower levels (like main memory):
- Write-Through: On every write to the cache, also immediately write the data through to the next level of memory (e.g., L1 to L2 cache, or L1 to main memory if it’s the last level). In a write-through cache, the cache and memory are always consistent (memory always has the latest copy of the data). This is simpler to implement and for coherence (in multi-CPU systems, it’s easier to snoop main memory or a shared lower level to see writes). However, write-through can generate a lot of memory traffic (every store instruction hits memory) (cpu architecture – Write-back vs Write-Through caching? – Stack Overflow). Often a write-through cache will use a write buffer to temporarily hold data to be written to memory, so the CPU can continue while the write buffer handles the slower write to memory.
- Write-Back: When a write occurs, update only the cache at that moment. Mark the cache line as “dirty” (modified). The data is written to main memory later, i.e., when that cache line is evicted/replaced. Write-back greatly reduces write traffic to lower levels, because multiple writes to the same cache line will be absorbed by the cache and only one memory write (when the line is evicted) is needed (cpu architecture – Write-back vs Write-Through caching? – Stack Overflow). The complexity is that the cache must keep track of which lines are dirty, and on eviction, perform the write-back. Also, memory is not up-to-date until eviction, which complicates coherence in multi-processor scenarios (cache-to-cache transfers or cache coherency protocols are used to manage this).
- Write Allocate vs No-Write-Allocate: This concerns what happens on a write miss (when the CPU writes to an address not currently in cache). Write allocate (also called fetch on write) will fetch the block into the cache first, then do the write. The idea is that if a CPU is writing to an address, it might write again soon or read it, so bring it to cache (temporal locality). No write allocate (write-no-allocate) will simply write directly to memory (in write-through fashion) and not load it into cache. This makes sense in some write-through caches where you might not want to disturb the cache with data that was just written once. Many systems pair write-back with write-allocate (on a write miss, bring the line in and mark it dirty immediately), and pair write-through with no-write-allocate (since write-through is writing to memory anyway, might not fetch).
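The difference between the two write policies can be summarized by counting memory writes. This sketch (a single-line cache with write-allocate, invented addresses) shows write-through issuing one memory write per store while write-back defers writes until eviction:

```python
# Compare memory-write traffic for write-through vs write-back on repeated stores.
def simulate(policy, stores):
    mem_writes = 0
    cached_addr, dirty = None, False
    for addr in stores:
        if addr != cached_addr:                  # miss: (write-allocate) bring the line in
            if policy == "write-back" and dirty:
                mem_writes += 1                  # evicting a dirty line costs one memory write
            cached_addr, dirty = addr, False
        if policy == "write-through":
            mem_writes += 1                      # every store also goes to memory
        else:
            dirty = True                         # write-back: just mark the cached line dirty
    if policy == "write-back" and dirty:
        mem_writes += 1                          # final eviction/flush of the dirty line
    return mem_writes

stores = [0x100] * 10 + [0x200] * 5              # 15 stores touching two cache lines
print("write-through:", simulate("write-through", stores))  # 15 memory writes
print("write-back:   ", simulate("write-back", stores))     # 2 memory writes
```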
In summary, caches greatly speed up access to hot data by exploiting spatial locality (bringing in a whole block so that nearby data is quick to access) and temporal locality (recently used data stays in cache for quick reuse). The effectiveness is measured by hit rate (fraction of accesses served by cache) and miss rate (= 1 – hit rate). A miss causes a miss penalty, which is the extra time to fetch from the next level (e.g., main memory), including the time to transfer the block. The average memory access time (AMAT) can be approximated as `AMAT = Hit_Time + Miss_Rate * Miss_Penalty`. For example, if the L1 cache hit time is 1 cycle, the miss rate is 5%, and the main memory penalty is 50 cycles, then AMAT = 1 + 0.05*50 = 3.5 cycles on average per access, which is much better than always paying 50 cycles.
DRAM vs. SRAM: The memory in the hierarchy is implemented with different technologies:
- SRAM (Static RAM): SRAM is a type of memory that uses flip-flops (transistors) to store each bit. It is fast (no need to refresh, and typically can operate at a few CPU cycles latency or less) and relatively expensive in terms of transistor count (e.g., 6 transistors per bit in a common design). Because of its cost and physical size, SRAM is used for smaller memory structures like caches and registers. SRAM is called “static” because it retains data as long as it’s powered (no refresh cycles needed).
- DRAM (Dynamic RAM): DRAM stores bits as charge in capacitors (one transistor and a tiny capacitor per bit). These capacitors leak charge over time, so DRAM requires regular refresh cycles (periodically re-reading and re-writing each cell to restore charge) – hence “dynamic.” DRAM is much denser (you can have many more bits per chip than SRAM for the same area) and cheaper per bit, but it’s slower to access. A DRAM access involves selecting a row (wordline) to charge sense amplifiers (which takes some time), then column selection, etc., and typically has higher latency (tens of nanoseconds, which is dozens to hundreds of CPU cycles, depending on speed). Typical main memory in a system (like DDR4 or DDR5 DIMMs in a PC) is DRAM. DRAM is optimized for density and cost, whereas SRAM is optimized for speed (Difference between SRAM and DRAM – GeeksforGeeks). In terms of performance: SRAM access might be 1-2 ns (nanoseconds) for cache, whereas DRAM might be 50-100 ns random access.
- Because SRAM is faster but expensive, caches (L1, L2, etc.) are SRAM. DRAM is used for the main memory (the large memory that programs see). For example, a computer might have 32 KB L1 cache (SRAM), 256 KB L2 cache (SRAM), 8 MB L3 cache (SRAM), and 16 GB main memory (DRAM).
- In terms of cost and integration: SRAM is usually on the same chip as the CPU for caches (to get the speed advantage and due to lower capacity needs), while DRAM is off-chip (on separate DIMMs). There have been some experiments with embedded DRAM on CPU chips, but generally the technology differences make it hard to manufacture on the same process.
- Performance metrics: We often distinguish latency vs bandwidth. Latency is the time to access one unit of data (e.g., how many cycles to get the first word from memory). Bandwidth is how much data can be transferred per unit time once the transfer begins (e.g., a DDR4 module might have a bandwidth of say 20 GB/s). DRAM has relatively high latency but can offer high bandwidth by transferring bursts of data (multiple cache lines) per request. Caches mitigate effective latency by providing data in 1-4 cycles on hits, and also by buffering writes and reads to optimize bandwidth usage.
Secondary Storage – HDD vs SSD: Beyond main memory (which might be a few GB to a few hundred GB of DRAM), computers use secondary storage for even larger capacity, non-volatile storage. Traditionally, this was magnetic hard disk drives (HDDs), and more recently solid-state drives (SSDs) based on flash memory are common.
- Hard Disk Drive (HDD): An HDD stores data on spinning magnetic platters with a moving read/write head. Accessing data involves mechanical motion: the disk must spin to the right position (rotational latency, e.g. a 7200 RPM drive has ~4ms average rotational latency) and the head must seek to the correct track (seek time, maybe ~5-10ms average). Thus random access to an HDD typically takes on the order of 5–15 milliseconds (which is 10^6 nanoseconds, vastly slower than DRAM). The bandwidth of an HDD (sequential read/write) can be quite good once the head is streaming a track – modern drives can achieve 100 MB/s or more sequentially. But the high latency for random seeks is the big drawback. HDDs are very large in capacity (terabytes are common) and relatively cheap per GB. They are used for bulk storage of files, etc., where not all data needs to be accessed constantly.
- Solid State Drive (SSD): An SSD uses NAND flash memory (electronically storing data in memory cells) and has no moving parts. This gives it much lower access latency – typical SSD random access latency might be ~50 to 100 microseconds (0.05–0.1 ms) for data, which is 100x or more faster than an HDD’s 5-10ms. High-end NVMe SSDs on PCIe can have even lower latencies (tens of microseconds). SSDs also have very high throughput, often hundreds of MB/s to several GB/s with modern interfaces. However, SSDs have some different characteristics: writes can be slower than reads (due to needing to erase blocks before writing), and flash cells wear out with many writes (so SSD controllers do wear leveling to prolong life). In terms of cost, SSDs are more expensive per GB than HDDs (though prices have been falling). Typically, a system might use an SSD for the operating system and active data (for speed) and perhaps an HDD for archival or large media storage where speed is less critical.
- From a memory hierarchy perspective, both HDD and SSD are far slower than DRAM, so they are considered secondary storage, usually accessed via the operating system’s I/O calls rather than directly by CPU load/store instructions. They rely on the operating system’s buffer cache (in RAM) to catch recently used disk data and avoid physical disk accesses if possible.
- Performance terms with HDD/SSD: Latency vs bandwidth is key. As noted, an SSD might have ~0.1 millisecond latency, an HDD ~10 milliseconds. That’s a factor of 100 difference (Why Latency Impacts SSD Performance More Than Bandwidth Does | Extremetech). Bandwidth wise, an SSD might be 500 MB/s, an HDD 100 MB/s (5x difference). But for many small accesses, the latency dominates (SSDs can perform tens of thousands of I/O operations per second (IOPS) vs a hard drive maybe a few hundred at best) (Why Latency Impacts SSD Performance More Than Bandwidth Does | Extremetech). This huge latency gap is why SSDs dramatically improve system responsiveness for random I/O. For sequential large transfers (like large file copy), the difference is less stark but still present.
- Real-world example: A 15k RPM enterprise HDD might have ~2ms average latency, whereas a SATA SSD ~0.1ms; the SSD is ~20x faster responding to a single request (Why Latency Impacts SSD Performance More Than Bandwidth Does | Extremetech). Modern NVMe SSDs with PCIe 4.0 can have 5-10 GB/s throughput for large transfers, which dwarfs any HDD (~0.2 GB/s) and even is higher than older DRAM bandwidth from decades ago.
Virtual Memory (Paging and Segmentation): Virtual memory is a system that allows programs to use addresses without regard to the actual amount of physical RAM, by automatically managing data movement between primary memory (RAM) and secondary storage (disk). It provides the illusion of a very large contiguous memory space (the virtual address space) for each process, while actually using potentially smaller physical memory. The two main mechanisms historically used are paging and segmentation (and often a combination).
- Paging: In a paging system, the virtual address space of a process is divided into fixed-size blocks called pages (commonly 4 KB, though other sizes exist such as 2MB huge pages, etc.), and physical memory is divided into equally sized frames. The operating system maintains a page table for each process, which is basically a map from virtual page number to physical frame number (or to a marker that the page is on disk if not in memory). When a program accesses a virtual address, the CPU (with the help of the Memory Management Unit, MMU) splits it into a virtual page number and an offset within the page. It then translates the page number via the page table to find the corresponding physical frame (if the page is in RAM) and the offset remains the same. This gives the physical address to actually access RAM. If the page table indicates the page is not currently in physical memory (it might be stored on disk in a swap file), this triggers a page fault exception: the OS will then find the page on disk (in the swap area), bring it into a free frame in RAM (possibly evicting another page to disk if RAM is full), update the page table, and then resume the program. This all is transparent to the program, except for the performance delay during the page fault service. Paging thus provides automatic memory extension: a process may have, say, 8 GB virtual address space but only 2 GB of physical memory; the less-used parts of its memory can reside on disk until needed. It also provides memory protection and isolation: each process has its own page table, so it cannot accidentally access another process’s memory (without a mapping set up by the OS). Additionally, it simplifies memory allocation from the OS perspective (because everything is a fixed size page frame). A critical component is the Translation Lookaside Buffer (TLB) – a small cache in the MMU that stores recently used virtual-to-physical address translations. Since every memory access needs an address translation, and doing a page table lookup (which may be multi-level) would be slow, the TLB speeds this up by caching the results of recent translations. A TLB hit allows virtual-to-physical mapping to be determined in one cycle or so; a TLB miss means the hardware (or OS in some systems) has to walk the page table in memory to find the mapping.
- Segmentation: Segmentation is an older concept in which a program’s address space is divided into variable-size segments (such as a “code segment”, “data segment”, “stack segment”, or even finer-grained units). Each segment has a base physical address and a length. A logical address is given as (segment number, offset). The offset is checked against the segment length for validity (providing protection), and the base is added to the offset to produce a physical address. Segmentation directly supports a view of memory as a collection of distinct regions (which can aid programming models). However, pure segmentation without paging can lead to external fragmentation: segments of different sizes loaded into physical memory leave holes when freed, making it hard to allocate new large segments. Some systems (like Intel x86) combined segmentation with paging – e.g., x86 uses segments for the software-visible view, but each segment is then paged underneath.
- Paged Virtual Memory: Most modern systems implement virtual memory with paging. The combination of large disk-based space and automatic transfer means processes can run without having all their code and data in physical RAM at once. For example, if a program uses only 100 MB of its 1 GB allocation actively, the rest might stay on disk until needed, allowing RAM to be used for other processes or caches.
Virtual memory’s role in system performance is both an enabler and a potential pitfall. It enables running large applications and many applications concurrently even if RAM is limited, by using disk as an extension. It also provides isolation: one process cannot see another’s memory (each has its own page table, unless sharing is set up intentionally), improving security and stability. However, when a system is low on physical memory and actively relying on virtual memory, it may swap heavily (move pages in and out of disk frequently). If a workload’s working set (the set of pages it actively needs) doesn’t fit in RAM, the system can start thrashing – spending most of its time swapping pages in and out rather than doing useful work, leading to severe slowdowns. Thus, while virtual memory gives the appearance of a large memory (Virtual Memory in Operating System – GeeksforGeeks), performance remains good only if accesses mostly hit in real RAM. The hierarchy thus goes: L1 cache (KB, ~1-4 cycles), L2/L3 cache (MB, ~10-50 cycles), main memory (GB, ~100+ cycles), then disk via virtual memory (GB-TB, millions of cycles on a page fault to disk).
A typical scenario: Suppose an application wants to access an array element at virtual address X. The MMU, with the help of the TLB, translates X to a physical address. Then the caches are consulted: if the data is in the L1 cache (a cache hit), it’s returned in a few cycles. If not, perhaps it’s in L2, costing a few more cycles. If it is in no cache, a main-memory access is made – hundreds of cycles. If the page was not resident in RAM (page not present, meaning it was swapped out), a page fault occurs: the CPU traps to the OS, the OS finds the page on disk (in the swap file), chooses a physical frame (perhaps evicting another page to disk if needed), and issues a disk read (many milliseconds), during which the process may be descheduled. Once the page is loaded into RAM, the page table is updated and the program continues, now able to access the data (which will also be brought into the cache). This chain shows how the memory hierarchy, while invisible to the programmer (except through performance), dramatically affects access latency.
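As a rough illustration of the translation step in that scenario, the sketch below models a single-level page table with a tiny direct-mapped TLB. Real MMUs walk multi-level tables in hardware, so this is only a conceptual model; the sizes, field names, and the frame/page numbers in `main` are all invented for the example:

```c
/* Minimal sketch of virtual-to-physical translation with a toy TLB (assumptions:
   4 KB pages, a flat single-level page table, 8 direct-mapped TLB entries). */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE   4096u
#define PAGE_SHIFT  12               /* 2^12 = 4096 */
#define NUM_PAGES   1024             /* toy 4 MB virtual address space */
#define TLB_ENTRIES 8

typedef struct { bool present; uint32_t frame; } pte_t;            /* page-table entry */
typedef struct { bool valid; uint32_t vpn, frame; } tlb_entry_t;

static pte_t page_table[NUM_PAGES];
static tlb_entry_t tlb[TLB_ENTRIES];

/* Translate a virtual address to a physical one; returns false on a page fault. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* offset is unchanged by translation */

    tlb_entry_t *slot = &tlb[vpn % TLB_ENTRIES];
    if (slot->valid && slot->vpn == vpn) {       /* TLB hit: fast path */
        *paddr = (slot->frame << PAGE_SHIFT) | offset;
        return true;
    }
    if (!page_table[vpn].present)                /* page fault: OS must bring page in */
        return false;

    *slot = (tlb_entry_t){ true, vpn, page_table[vpn].frame };  /* refill the TLB */
    *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return true;
}

int main(void) {
    page_table[3] = (pte_t){ true, 42 };         /* map virtual page 3 -> frame 42 */
    uint32_t pa;
    if (translate(3 * PAGE_SIZE + 100, &pa))
        printf("physical address: 0x%x\n", (unsigned)pa);   /* 42*4096 + 100 */
    if (!translate(7 * PAGE_SIZE, &pa))
        printf("page fault on virtual page 7\n");
    return 0;
}
```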
Performance Terms Recap: We use hit rate, miss rate, miss penalty for caches (and TLBs similarly). For disks, we use seek time, rotational latency, transfer rate. Bandwidth is often quoted for memory systems (e.g., DDR4-3200 has a theoretical bandwidth of 25.6 GB/s per channel) and for I/O (e.g., an SSD with SATA3 can do ~550 MB/s, with NVMe maybe >3000 MB/s). Latency is time to first data: in caches maybe a few nanoseconds, main memory ~50-100 ns, SSD ~100,000 ns (0.1ms), HDD ~10,000,000 ns (10ms). Good system design and software optimize to have as many accesses as possible serviced at the highest levels (cache hits) to avoid the long latency of lower levels.
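Plugging ballpark latencies like these into a nested average-access-time estimate shows how sensitive the average is to the lowest levels of the hierarchy. The hit rates below are assumptions chosen purely for illustration, not measurements:

```c
/* Effective (average) access time built from the ballpark latencies above.
   Hit rates are invented for illustration. */
#include <stdio.h>

int main(void) {
    /* latencies in nanoseconds */
    const double t_l1 = 1.0, t_l2 = 10.0, t_mem = 100.0, t_disk = 10e6;  /* 10 ms fault */

    /* assumed hit rates: L1, L2, and "page is resident in RAM" */
    const double h_l1 = 0.95, h_l2 = 0.90, h_resident = 0.9999;

    double amat = t_l1
                + (1.0 - h_l1) * (t_l2
                + (1.0 - h_l2) * (t_mem
                + (1.0 - h_resident) * t_disk));

    /* With these numbers the average is about 7 ns -- close to cache speed -- yet the
       page-fault term alone contributes 5 ns of that; raise the miss/fault rates a
       little and disk latency quickly dominates (thrashing). */
    printf("average access time: %.1f ns\n", amat);
    return 0;
}
```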
The memory hierarchy’s overarching goal is to provide the illusion of a large, fast, and cheap memory: large thanks to disk/DRAM, fast thanks to caches, and (overall) cheap because the bulk is in inexpensive technologies. This works due to locality (most accesses hit in caches) (Basics of Cache Memory – Computer Architecture) and OS policies that manage memory wisely.
I/O Interface (Interrupt and DMA)
Computers need to communicate with a variety of external devices (disks, keyboards, displays, network cards, etc.). The I/O subsystem includes both hardware (controllers, buses) and software (device drivers, interrupt handlers) to facilitate this communication. Two key concepts for efficient I/O are interrupts and DMA (Direct Memory Access), which contrast with simpler programmed I/O methods.
Programmed I/O (Polling): In the most straightforward form of I/O, the CPU directly controls a device and waits for it to be ready for data transfer. The CPU might repeatedly check a status register of the device in a loop (polling) until the device indicates it is ready (or the operation is complete). Then the CPU executes load/store instructions to read or write data to the device’s data register. This method is called programmed I/O because the CPU, running a program (the device driver), explicitly orchestrates every step of the transfer (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong). For example, imagine reading data from a serial port without interrupts: the CPU reads a status bit in a loop until it sees that a byte has arrived, then it reads the byte, then goes back to waiting for the next byte. During the waiting, the CPU is doing nothing useful (busy-waiting). Programmed I/O is simple but inefficient, especially for slow devices, because the CPU (which is much faster) wastes a lot of cycles idle or polling (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong). It also doesn’t easily allow overlapping computation and I/O – the CPU is tied up with the device.
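A minimal sketch of this busy-wait loop is shown below. The register names and the ready bit are hypothetical, and ordinary variables stand in for memory-mapped device registers so the snippet compiles on its own:

```c
/* Sketch of programmed I/O (polling) on a hypothetical serial device. */
#include <stdint.h>
#include <stdio.h>

#define READY_BIT 0x01

/* In real code these would be volatile pointers to memory-mapped I/O registers;
   plain variables are used here so the sketch is self-contained. */
static volatile uint8_t STATUS_REG = READY_BIT;
static volatile uint8_t DATA_REG   = 'x';

uint8_t poll_read_byte(void) {
    /* Busy-wait: the CPU spins, doing no useful work, until the device is ready. */
    while ((STATUS_REG & READY_BIT) == 0)
        ;                              /* burning cycles here is the whole problem */
    return DATA_REG;                   /* programmed transfer: the CPU moves the data */
}

int main(void) {
    printf("read byte: %c\n", poll_read_byte());
    return 0;
}
```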
Interrupt-Driven I/O: To improve efficiency, devices are often equipped with the ability to interrupt the CPU when they need attention. In interrupt-driven I/O, the CPU can issue a command to a device (e.g. “start reading a block from disk” or “send this data over the network”) and then continue with other work. When the device has completed the I/O or needs service (such as requiring more data to send, or having data to deliver), it sends an interrupt signal to the CPU. The CPU, upon finishing its current instruction, suspends the normal flow of execution and jumps to an interrupt service routine (ISR) – a special function (part of the OS or driver) that handles that device’s request (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong) (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong). The ISR quickly services the device: for input, it might read a data register and store the byte(s) in a memory buffer; for output, it might load the next chunk of data into the device’s buffer; or it may simply mark that the device is done. Once handled, the ISR returns, and the CPU resumes the process it was running before, as if nothing had happened (except for the time taken to service the interrupt).
Interrupts thus allow the CPU to react to I/O events asynchronously and avoid constant polling. For example, a keyboard controller will generate an interrupt for each key press, so the CPU doesn’t have to constantly check if a key was pressed – it’s notified when one occurs. Similarly, a disk controller can interrupt the CPU when a data transfer from disk to memory is complete, so the CPU can initiate a disk read and then go do other work until the data is ready (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong).
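The sketch below shows the typical shape of interrupt-driven input: a short ISR deposits a byte from a (hypothetical) device data register into a ring buffer, and the main program consumes bytes later without ever busy-waiting on the device. The register accessor, buffer size, and the simulated interrupt in `main` are all invented for illustration:

```c
/* Sketch of interrupt-driven input via a ring buffer. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BUF_SIZE 256
static volatile uint8_t rx_buf[BUF_SIZE];
static volatile unsigned head, tail;        /* ISR writes at head, consumer reads at tail */

static uint8_t READ_DATA_REG(void) { return 'A'; }   /* stand-in for a device register read */

/* Invoked (conceptually by hardware/OS) when the device raises its interrupt:
   grab the byte, buffer it, and return quickly. */
void uart_rx_isr(void) {
    uint8_t byte = READ_DATA_REG();
    unsigned next = (head + 1) % BUF_SIZE;
    if (next != tail) {                     /* if the buffer is full, the byte is dropped */
        rx_buf[head] = byte;
        head = next;
    }
}

/* The consumer checks its own in-memory buffer, not the device. */
bool get_byte(uint8_t *out) {
    if (tail == head) return false;         /* nothing buffered yet */
    *out = rx_buf[tail];
    tail = (tail + 1) % BUF_SIZE;
    return true;
}

int main(void) {
    uart_rx_isr();                          /* simulate one interrupt arriving */
    uint8_t b;
    if (get_byte(&b)) printf("received: %c\n", b);
    return 0;
}
```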
Key points about interrupts:
- There is typically hardware support: an interrupt controller (either separate or built into the CPU) that can prioritize multiple interrupt requests and vector the CPU to the correct service routine. Each device or interrupt source is assigned an interrupt vector or number. When the CPU acknowledges an interrupt, it uses this number to index into an interrupt vector table – a table of addresses of ISRs (Interrupt vector table – Wikipedia) (Interrupt Vector – an overview | ScienceDirect Topics). For instance, vector 5 might correspond to the keyboard handler, vector 6 to the disk, etc.
- Priority: Some interrupts are more urgent than others. Interrupt controllers (like the PIC – Programmable Interrupt Controller, or APIC in PCs) allow setting priorities. High-priority interrupts can preempt lower-priority ones or be serviced first if multiple occur. For example, a high-speed network card might be higher priority than a keyboard, to avoid losing data packets.
- Masking: The CPU or OS can mask (disable) certain interrupts temporarily when needed, typically to protect critical sections of code that must not be interrupted, or while handling an existing interrupt so as not to be overwhelmed.
- When an interrupt is serviced, the CPU state (program counter, registers) is saved (usually automatically on the stack by hardware or minimal ISR entry code) so that it can resume the pre-interrupt task accurately.
- Interrupt overhead includes the latency of switching to the ISR and back. So while interrupts are efficient for moderate-frequency events, if the interrupt rate becomes extremely high (many thousands per second), the overhead can start to affect CPU performance. In such cases, other techniques such as polling in batches or using DMA (discussed next) might be considered.
Interrupt-driven I/O significantly improves performance over pure polling: the CPU can handle other tasks or enter a low-power idle instead of busy-waiting. However, one downside is that the CPU still must handle the actual data transfer in many cases. For example, if a disk controller interrupts the CPU for every single byte of data received, the CPU will be doing a lot of interrupt handling and byte moving – which is essentially what programmed I/O did, just interrupt-driven one byte at a time. The solution to free the CPU from data movement is Direct Memory Access (DMA).
Direct Memory Access (DMA): DMA is a feature provided by hardware (a DMA controller, or logic built into the device controller) that allows devices to transfer blocks of data to or from main memory without continuous CPU intervention (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong). The idea is: the CPU sets up the DMA transfer by telling the DMA controller (or the device) the source address, destination address, and size of the transfer (plus other parameters such as the transfer direction and how completion is signaled). Then the DMA controller takes over the bus and performs the data transfer directly between the device and memory. The CPU is free to do other work during this time and is only interrupted by the DMA controller when the whole block transfer is finished (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong) (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong).
For example, consider reading a sector from disk (512 bytes). Without DMA, the CPU might have to read each byte (or word) from the disk controller’s I/O port and store it to memory, in a loop 512 times (this is how it was done on very old systems). With DMA, the CPU tells the disk controller/DMA: “transfer 512 bytes from disk controller’s data register to memory buffer address X”. The disk controller will then pump the bytes into memory through the DMA engine as they come from the disk, and perhaps after all 512 bytes, raise an interrupt to signal completion. The CPU only did a tiny setup and then handled one interrupt instead of 512 byte-handling operations.
The DMA controller essentially becomes a mini-processor dedicated to data transfer. It must be able to arbitrate for the memory bus (since it needs memory access). This introduces the concept of bus mastering – the DMA controller can become the “master” on the bus for a cycle, instead of the CPU, to perform a memory read/write cycle. There are two approaches: burst mode DMA, where the DMA grabs the bus and transfers the entire block in one go (CPU is momentarily halted from memory access during that burst), or cycle stealing mode, where the DMA interleaves single transfers with CPU activity, taking one bus cycle here and there (slowing the CPU somewhat but not halting it completely).
Typical devices that use DMA: Disk controllers, graphics cards (GPUs doing memory copies), network interface cards (receiving packets into memory), sound cards (streaming audio to/from memory), USB host controllers, etc. For instance, when a network card receives a packet, it can DMA the packet data directly into a memory buffer and then interrupt the CPU to say “a packet is ready at address X of length L”. The CPU then just processes the fully received packet. Without DMA, the CPU might have to read each network byte as it arrives (impractical at high speeds).
Interrupt handling with DMA: Even with DMA, interrupts still play a role in signaling completion or errors. DMA just means the CPU isn’t moving each piece of data. So typically, the sequence for a DMA-based I/O (sketched in code after this list) is:
- CPU programs the device/DMA controller (gives memory address, length, etc. in device registers).
- Device/DMA performs the transfer (device reading/writing memory directly).
- Device or DMA controller interrupts the CPU when the whole transfer is done or if something needs CPU attention (like buffer filled, or next chunk request).
- The interrupt service routine might then perhaps queue the next DMA request or signal to the operating system that the I/O is finished (so the waiting process can be awakened).
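Here is a minimal driver-side sketch of that sequence. The register layout (`struct dma_dev`) and names are hypothetical; real controllers (SATA AHCI, NVMe, network cards) each define their own descriptor formats and doorbell registers:

```c
/* Sketch of the driver-side DMA sequence: program the controller, let the device
   move the data, and handle one completion interrupt. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct dma_dev {                 /* pretend memory-mapped controller registers */
    uint64_t src_addr;           /* device-side source (e.g., disk buffer) */
    uint64_t dst_addr;           /* physical address of the RAM buffer */
    uint32_t length;             /* bytes to transfer */
    uint32_t start;              /* writing 1 kicks off the transfer */
};

static volatile bool transfer_done;      /* set by the completion ISR */

/* Step 1: the CPU programs the controller and returns immediately. */
void dma_start_read(struct dma_dev *dev, uint64_t dst, uint32_t len) {
    dev->dst_addr = dst;
    dev->length   = len;
    dev->start    = 1;           /* the device now moves the data on its own */
}

/* Step 3: the controller raises an interrupt when the whole block is in memory. */
void dma_complete_isr(void) {
    transfer_done = true;        /* step 4: signal the OS / wake the waiting process */
}

int main(void) {
    static uint8_t buffer[4096];
    struct dma_dev dev = {0};
    dma_start_read(&dev, (uint64_t)(uintptr_t)buffer, sizeof buffer);
    /* Step 2 happens in hardware; the CPU is free to run other work here. */
    dma_complete_isr();          /* simulate the completion interrupt */
    printf("done=%d\n", transfer_done);
    return 0;
}
```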
Real-World Examples:
- Disk Controller (e.g., SATA or NVMe disk): Uses DMA to transfer sectors of data to/from memory. For instance, when you read a file, the OS issues a command to the disk controller to read certain sectors into a memory buffer; the disk controller will use DMA to do so and then interrupt the CPU when those sectors are in memory. Without DMA, the CPU would waste time reading each word from the disk interface.
- USB Controller: USB devices (like a USB flash drive or a webcam) connect via a host controller. The host controller will typically use DMA to move data into main memory buffers. E.g., a read from a USB storage device will cause the USB controller to DMA the data into RAM and then notify the CPU via an interrupt. Similarly, for output, the CPU might prepare a buffer of data, and the USB controller DMAs it from memory to send out over the USB bus.
- Graphics (GPU): Modern GPUs can act as bus masters and DMA textures or frame buffers from system memory to their own memory or vice versa. This prevents the CPU from having to be the middleman for massive data transfers (which would be a bottleneck).
- Sound Card: To play audio, the CPU can instruct the sound hardware to DMA a chunk of audio samples from memory to the card’s buffer, rather than feeding the samples one by one in real time. The sound card then generates an interrupt when it needs the next buffer.
- Traditional Example: Older systems had a dedicated DMA controller chip (like the Intel 8237 in the IBM PC/AT) that could manage DMA for floppy disks, sound, etc. The OS would set up the controller with the memory address and count. That controller had limited channels and could only address certain memory ranges, but it offloaded byte-by-byte copying from the CPU. Modern integrated devices have their own DMA engines, which are often far more capable.
Interrupt Controllers and Priority/Vectoring: When multiple devices can interrupt, an interrupt controller chip/module (such as the APIC in x86 systems) handles inputs from devices, prioritizes them, and typically provides the CPU with an interrupt vector number identifying the device. Vectored interrupts mean the hardware directly identifies the correct service routine. For example, on an interrupt, the CPU might receive interrupt vector 0x2C, which is predetermined (by OS setup) to correspond to, say, the network card’s interrupt handler. In contrast, some simple systems use a single interrupt line for many devices, and software must query each device to find which one signaled (this is called polled or non-vectored interrupting). Modern systems avoid that overhead with vectored interrupts.
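Conceptually, vectored dispatch boils down to indexing a table of handler addresses by vector number. The sketch below models that in plain C; the vector assignments (5 for keyboard, 6 for disk) echo the illustrative numbers used earlier and are not real platform values, and real hardware consults the OS-configured table automatically:

```c
/* Toy model of an interrupt vector table and vectored dispatch. */
#include <stdint.h>
#include <stdio.h>

#define NUM_VECTORS 256

typedef void (*isr_t)(void);              /* an ISR takes no arguments, returns nothing */

static isr_t vector_table[NUM_VECTORS];   /* interrupt vector table: vector -> handler */

static void keyboard_isr(void) { puts("keyboard: read scancode, queue it"); }
static void disk_isr(void)     { puts("disk: transfer complete, wake waiting process"); }
static void default_isr(void)  { puts("spurious interrupt: log and ignore"); }

void ivt_init(void) {
    for (int v = 0; v < NUM_VECTORS; v++) vector_table[v] = default_isr;
    vector_table[5] = keyboard_isr;       /* illustrative vector assignments */
    vector_table[6] = disk_isr;
}

/* What the hardware conceptually does on an acknowledged interrupt:
   use the vector number to index the table and call the handler. */
void dispatch_interrupt(uint8_t vector) {
    vector_table[vector]();
}

int main(void) {
    ivt_init();
    dispatch_interrupt(5);                /* simulate a keyboard interrupt */
    dispatch_interrupt(6);                /* simulate a disk interrupt */
    return 0;
}
```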
Trade-offs and Modes:
- Programmed I/O (polling) is easiest but wastes CPU time and is only acceptable when devices are very fast or data is extremely low volume (or in very simple microcontrollers where hardware is minimal).
- Interrupt-driven I/O is efficient for moderate data rates (where the overhead of an interrupt per data chunk is acceptable) and where latency between device ready and CPU response needs to be short.
- DMA shines for high-volume data transfer, as it dramatically reduces CPU overhead. However, DMA hardware adds complexity, and for very small transfers the setup cost might not be worth it. There is also a concurrency cost: with DMA, the memory bus is being used by the device, which can slightly stall the CPU’s memory accesses (cycle stealing). But overall throughput improves because the CPU can compute in parallel; a rough cycle-count comparison follows this list.
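To get a rough feel for the numbers behind this trade-off, the comparison below uses assumed per-interrupt, per-byte, and setup costs; the cycle counts are illustrative round numbers, not measurements of any real system:

```c
/* Illustrative CPU overhead for moving one 4 KB block: an interrupt per byte
   versus programming a DMA transfer and taking a single completion interrupt. */
#include <stdio.h>

int main(void) {
    const double block_bytes      = 4096.0;
    const double cycles_per_irq   = 2000.0;  /* assumed: save state, run ISR, return */
    const double cycles_per_byte  = 10.0;    /* assumed: CPU load from device + store to RAM */
    const double dma_setup_cycles = 500.0;   /* assumed: programming the DMA registers */

    double per_byte_irq = block_bytes * (cycles_per_irq + cycles_per_byte);
    double dma_transfer = dma_setup_cycles + cycles_per_irq;   /* one completion interrupt */

    printf("interrupt per byte:  %.0f CPU cycles per block\n", per_byte_irq);
    printf("DMA + one interrupt: %.0f CPU cycles per block\n", dma_transfer);
    return 0;
}
```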
A real-world scenario illustrating all: Suppose you plug in a USB flash drive (which internally is like a slow SSD). The OS, via the USB host controller, might schedule a DMA transfer to read a block from the drive into memory. The host controller DMAs the data from the USB interface into a RAM buffer. Once done, it triggers an interrupt to the CPU. The CPU’s USB ISR runs, marking that the data is ready (perhaps waking up a process that was waiting for that data). While the DMA was happening, the CPU could have been doing other work (like running another process). If instead polling was used, the CPU would sit in a loop checking a USB status register for the data-ready flag, which would be a huge waste.
Conclusion / Key Takeaways: The I/O interface section highlights that efficient I/O requires asynchronicity (interrupts so CPU isn’t tied up waiting) and offloading (DMA so CPU isn’t tied up copying data). Nearly all modern peripheral I/O is interrupt-driven with DMA for bulk data. For example, network cards have circular DMA buffers for incoming packets and raise interrupts either periodically or when buffers are full; storage drives use DMA; even things like keyboard and mouse use interrupts (though those are low bandwidth and don’t need DMA). Without these, achieving multitasking and good performance would be impossible – the CPU would spend too much time on device minutiae.
Conclusion
In summary, this report covered essential concepts in computer organization and architecture:
- Machine Instructions & Addressing Modes: Machine instructions are the primitive operations executed by CPUs, with various formats and types (arithmetic, logic, control, etc.). Addressing modes define how these instructions refer to operands (immediate values, registers, memory addresses directly or indirectly). We saw how different architectures (RISC vs CISC) approach instruction set design: RISC simplifying instructions for efficiency and pipelining, CISC offering more complex operations per instruction (RISC vs. CISC). The choice of addressing modes affects both hardware complexity and software flexibility; for example, x86’s rich addressing modes give programmers convenience but require more decoding logic, whereas a RISC like MIPS keeps addressing modes simple (just base+offset, etc.) to streamline the pipeline.
- ALU, Datapath & Control: The ALU is the workhorse for calculations, integrated into a datapath alongside registers and buses to execute instructions. The control unit governs this process, either through hardwired logic or microprograms. Hardwired control offers speed (one cycle per simple instruction) (Difference between Hardwired and Microprogrammed Control Unit), whereas microprogrammed control offers flexibility and easier extensibility (Difference between Hardwired and Microprogrammed Control Unit). Understanding this helps explain why, for instance, certain CPUs can have microcode updates (they are microprogrammed, allowing fixes in firmware). The coordination between control signals and the datapath steps (fetch, decode, execute, etc.) is crucial for correct operation and is a fundamental design aspect of any CPU.
- Pipelining & Hazards: Pipelining improves throughput by overlapping instruction execution, but introduces hazards that must be mitigated for correctness. We identified structural, data, and control hazards (13-14_PPT.pptx) and typical solutions: stalling to resolve resource or dependency conflicts (13-14_PPT.pptx), forwarding to bypass data latency (13-14_PPT.pptx), and branch prediction or delay slots to handle control changes (Delay slot – Wikipedia). These concepts are key to why CPUs achieve the speeds they do and also why sometimes code scheduling and branch prediction matter for performance. A key takeaway is that the theoretical peak of one instruction per cycle is often reduced by these hazards and how well they are handled. Modern processors with deep pipelines rely heavily on advanced dynamic techniques (out-of-order execution, deep branch predictors, speculation) to mitigate hazards, whereas simpler in-order processors rely on the compiler and simpler hardware interlocks.
- Memory Hierarchy: Because different memory technologies have vastly different speeds, a hierarchy (registers → caches → DRAM → disk/SSD) is used to balance speed and size. Caches exploit locality to give the CPU much faster access to data than main memory would (Basics of Cache Memory – Computer Architecture). We looked at how caches are organized (direct vs associative mapping (Basics of Cache Memory – Computer Architecture) (Basics of Cache Memory – Computer Architecture)) and managed (replacement and write policies (cpu architecture – Write-back vs Write-Through caching? – Stack Overflow)). The difference between SRAM and DRAM highlights why caches are limited in size but very fast (Difference between SRAM and DRAM – GeeksforGeeks). Virtual memory extends the hierarchy further, allowing systems to run large applications by using disk as an extension of RAM – at the cost of speed if overly relied upon. The memory hierarchy’s success is evident in everyday computing: most memory accesses are served by caches (L1/L2 hits), making the average access time much closer to the speed of SRAM than DRAM. However, when programs don’t exhibit good locality or memory is overcommitted, performance can degrade (cache misses, TLB misses, page faults). Thus, architects and OS designers strive to make the common case fast (hits) and the worst case rare enough.
- I/O, Interrupts & DMA: Input/output operations are essential for any system, and managing them efficiently requires hardware support. Interrupts allow the CPU to be responsive to external events without constantly checking devices, improving CPU utilization (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong) (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong). DMA permits high-throughput device communication by offloading data movement, which is why modern systems can handle, say, 4K video streams from disk or gigabit network traffic – the CPU isn’t copying every byte, dedicated hardware is (Programmed I/O, Interrupt & Direct Memory Access (DMA) – Louie Wong). The interplay of interrupts and DMA is orchestrated by the OS through device drivers, which ensure devices and the CPU cooperate. Real-world examples like USB or disk operations illustrate these principles in action, making it clear that without interrupts and DMA, we couldn’t achieve the concurrency and performance in I/O that we expect today.
Across all these topics, a recurring theme is trade-offs: simplicity vs. complexity, hardware vs. software, speed vs. flexibility. RISC vs CISC is one trade-off space; hardwired vs microcoded control is another; deeper pipelines (higher frequency) vs. hazard cost is another; larger caches (better hit rate) vs. longer access time is yet another; and in I/O, latency vs throughput trade-offs (e.g., an interrupt per byte vs. DMA block transfer) must be balanced. Computer architects continuously evaluate such trade-offs with technology constraints and workload characteristics in mind.
Understanding these foundational concepts equips one to appreciate how a high-level program in C/C++ eventually triggers specific instructions, how those instructions are executed under the hood, how data is fetched and buffered in multiple layers of memory, and how external inputs/output are handled – all in a matter of nanoseconds to milliseconds. It is this complex coordination of components that enables the powerful performance and capabilities of modern computers.
References:
- Patterson, D. A., & Hennessy, J. L. (2013). Computer Organization and Design: The Hardware/Software Interface (5th ed.). Morgan Kaufmann. (General reference for instruction sets, pipelining, etc., including the classic 5-stage pipeline and hazard discussions.)
- Stallings, W. (2010). Computer Organization and Architecture: Designing for Performance (8th ed.). Prentice Hall. (Provides detailed explanations of addressing modes, ALU design, control unit types, and memory hierarchy concepts.)
- Tanenbaum, A. S., & Austin, T. (2012). Structured Computer Organization (6th ed.). Pearson. (Good coverage of microprogrammed vs hardwired control and other architectural trade-offs.)
- Wikipedia: Comparison of instruction set architectures – Addressing Modes. (Definitions of direct, immediate, indexed, and indirect addressing modes.)
- Stanford CS: RISC vs. CISC. (Differences between RISC and CISC illustrated with examples and a trade-off table.)
- Wikipedia: Datapath. (Definition of a datapath, including ALU, registers, and buses.)
- Byju’s: Difference between Hardwired and Microprogrammed Control Unit. (Tabular comparison highlighting speed vs. flexibility.)
- Lecture slides, n.d. (13-14_PPT.pptx): Pipelining Hazards. (Classification and definitions of structural, data, and control hazards.)
- Lecture slides, n.d. (13-14_PPT.pptx): Forwarding. (Explanation that forwarding passes a result directly to the functional unit that needs it, bypassing the register write/read.)
- Wikipedia: Delay slot. (Description of the branch delay slot and the note that modern CPUs favor branch prediction over delay slots.)
- U.Md.: Basics of Cache Memory. (Explanation of direct-mapped and set-associative caches, including the benefit of set-associative mapping in reducing conflicts.)
- Stack Overflow: Write-back vs. Write-Through caching? (Key difference: write-through writes immediately to memory; write-back defers the write, updating memory later.)
- GeeksforGeeks: Difference between SRAM and DRAM. (SRAM is faster and used for caches; DRAM is slower and used for main memory.)
- ExtremeTech: Why Latency Impacts SSD Performance More Than Bandwidth Does. (Quantification: ~70 µs latency for a SATA SSD vs. ~12.5 ms for an HDD, i.e., the SSD is ~100x faster in access latency.)
- Louie Wong (blog): Programmed I/O, Interrupt & Direct Memory Access (DMA). (Describes programmed-I/O polling overhead and how DMA lets a device read/write memory without CPU involvement, with the CPU active only at the start and end of a transfer.)
- Wikipedia: Interrupt vector table. (Interrupt vectors map interrupt requests to ISR addresses, enabling device-specific handling.)
- Tanenbaum, A. S., & Bos, H. (2014). Modern Operating Systems (4th ed.). Pearson. (Chapters on memory management and I/O give context on virtual memory paging, segmentation, and device I/O with interrupts/DMA from an OS perspective.)