The "Barrier" concept in the NEMA traffic signal controller’s Ring-Barrier-Structure ensures proper activation of different barrier groups, while the phase sequence within the same barrier group remains vehicle-actuated—all aimed at improving intersection throughput, reducing delays while ensuring safety.
Interestingly, this concept closely mirrors the Memory Barrier and Memory Order mechanisms in modern CPU pipelines, which control the execution order of CPU instructions to max-out instruction pipeline throughput and reduce latency. It's as if traffic and CPUs are on the same page.
在 NEMA 交通信号控制器的环障结构 (Ring-Barrier Structure) 中,“屏障” (Barrier) 概念至关重要。它确保不同屏障组 (Barrier Group) 的正确激活,从而在时间上分隔主路和从路的冲突流;同时,同一屏障组内的相位顺序 (Phase Sequence) 由车辆的感应事件 (vehicle-actuated) 动态调整。这一切都旨在提升路口通行效率、减少延误并确保安全。这一概念与现代 CPU 内存屏障 (Memory Barrier) 和 内存顺序 (Memory Order) 机制十分相似——它们通过调整指令执行顺序,最大化流水线吞吐量,降低延迟。换句话说,CPU 的指令调度与感应式信号控制,遵循着类似的策略。
CPU instructions do not advance in a simple linear sequence; instead, out-of-order scheduling dynamically searches for a more efficient execution path. As the Tao Te Ching says, "Reversal is the movement of the Tao": this "reversal" of linear order is not chaos, but the natural route to a better way of running.
Introduction
Memory ordering is crucial in multi-threaded programming, as modern CPU optimizations can introduce unexpected behavior.
This unpredictability stems from various CPU optimization techniques, including instruction-level parallelism (ILP) and speculative execution. These optimizations enable the CPU to execute instructions out of order to maximize pipeline throughput and reduce latency, but also create challenges in preserving the correct order of memory operations.
To address these complexities, C++ provides memory order semantics through the <atomic> library, allowing developers to control how memory operations are observed across threads and to ensure correct memory ordering.
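As a quick preview, here is a minimal sketch of what that API looks like. The variable and function names (ready, publish, poll) are purely illustrative; the key point is that each atomic operation accepts an explicit std::memory_order argument, with memory_order_seq_cst as the default:
#include <atomic>

std::atomic<bool> ready{false};   // illustrative flag shared between threads

void publish() {
    // Store with an explicit memory order instead of the seq_cst default.
    ready.store(true, std::memory_order_release);
}

bool poll() {
    // Load with the matching acquire order.
    return ready.load(std::memory_order_acquire);
}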
This article explores:
Why memory ordering is a critical concern in multi-threaded and multi-core programming.
The different memory order options available in C++.
Practical examples of memory_order_seq_cst, memory_order_acquire/release, and memory_order_relaxed.
Instruction-Level Parallelism (ILP)
Modern CPUs feature multiple execution units that operate simultaneously, enabling instruction-level parallelism (ILP). These units include:
Arithmetic Logic Units (ALUs) – Perform independent calculations in parallel.
Load/Store Units – Handle memory operations separately, improving efficiency.
Branch Prediction Units – Predict conditional operations to keep the pipeline full.
Floating Point Units – Execute floating-point calculations.
With these parallel execution units, CPUs reorder instructions to maximize pipeline throughput and reduce latency, ensuring continuous CPU execution without stalling.
Consider the following code snippet:
// Original code
int value = memory[addr1]; // Possible cache miss (high latency)
int result = value * 2; // Depends on previous instruction
memory[addr2] = 100; // Independent operation
The CPU might reorder the execution as follows:
// After reordering
int value_future = memory[addr1]; // Start memory load early
memory[addr2] = 100; // Do this while resolving cache misses
int value = value_future; // Complete the memory load
int result = value * 2; // Now compute the result
What's happening under the hood:
The CPU sends the memory load request to the load unit, which might cause a cache miss. If this happens, fetching the data will take additional CPU cycles.
While the load unit is waiting for the memory operation to complete, the CPU sends the store instruction to the store unit. The store operation can complete in parallel with the memory load since it doesn't depend on the value loaded from memory[addr1].
Once the memory load completes, the CPU retrieves the value and performs the multiplication (value * 2) using the ALU (Arithmetic Logic Unit).
This out-of-order execution (OoOE) is made possible by dynamic scheduling, where the CPU scheduler analyzes instruction dependencies and assigns them to available execution units.
Key Concepts:
Instruction-Level Parallelism (ILP): A single CPU core can execute multiple instructions in parallel, increasing pipeline throughput and reducing total execution time.
Out-of-Order Execution (OoOE) Engine: This engine analyzes instruction dependencies and transparently optimizes their execution order without altering program logic.
While ILP occurs within a single core, multi-core processors extend parallelism by running independent threads across cores. Keep in mind that even a single core can execute multiple instructions in parallel using its execution units.
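This is exactly where memory ordering enters the picture: once two threads on different cores communicate through shared memory, the reordering described above becomes observable. Below is a minimal sketch of the classic message-passing pattern; the names payload and flag are illustrative. With the release/acquire pair shown, a consumer that sees flag become true is also guaranteed to see the earlier write to payload, a guarantee that relaxed ordering would not provide:
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                   // plain, non-atomic data
std::atomic<bool> flag{false};     // synchronization flag

void producer() {
    payload = 42;                                     // (1) write the data
    flag.store(true, std::memory_order_release);      // (2) publish: the write in (1) cannot move past this store
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {}  // (3) wait: later reads cannot move before this load
    assert(payload == 42);                            // (4) guaranteed to observe the write from (1)
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}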
Speculative Execution
Speculative Execution is a performance optimization technique that allows a CPU to execute instructions before it knows for sure whether they will be needed. It predicts the outcome of conditional branches (using branch prediction) and proceeds as if the prediction is correct.
This technique is especially beneficial for conditional branches, such as if/else statements or loops, where the CPU would otherwise have to wait for the condition to be evaluated before proceeding. By speculating on the branch outcome, the CPU can continue executing subsequent instructions without waiting, effectively hiding the branch-resolution latency and improving performance.
Consider this code snippet:
if (condition) {
result1 = A * B;
result2 = C * D;
} else {
result1 = A + B;
result2 = C + D;
}
// Depends on result1 and result2
next_operation = result1 + result2 + 10;
Here’s how a modern CPU might handle this with speculative execution:
Branch Prediction: Upon encountering the conditional branch, the CPU doesn’t wait for the condition to be evaluated. Instead, it predicts which path is more likely, for example, assuming the 'if' path is taken.
Speculative Execution: The CPU starts executing instructions on the predicted path, such as result1 = A * B and result2 = C * D, without waiting for the branch condition evaluation. These two independent instructions can run in parallel, leveraging the CPU's multiple execution units.
Correct Prediction: If the prediction turns out to be correct, execution continues smoothly, improving performance.
Incorrect Prediction (Branch Misprediction/Branch Misses): If the prediction is wrong, the CPU discards the speculative results, rolls back, and executes the other path.
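One hedged way to observe the cost of mispredictions is the well-known sorted-versus-unsorted experiment sketched below. The array size, threshold, and timing code are all illustrative, and an optimizing compiler may replace the branch with a conditional move or vectorize the loop, in which case the difference disappears. With random data the branch is hard to predict; after sorting, the predictor is almost always right and the same loop typically runs noticeably faster:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sums the elements that pass the threshold test; the if-branch is what the predictor must guess.
long long sum_above(const std::vector<int>& data, int threshold) {
    long long sum = 0;
    for (int x : data) {
        if (x >= threshold)        // unpredictable for random data, trivial once data is sorted
            sum += x;
    }
    return sum;
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data) x = dist(rng);

    auto time_it = [&](const char* label) {
        auto start = std::chrono::steady_clock::now();
        volatile long long s = sum_above(data, 128);   // volatile keeps the call from being optimized away
        auto stop = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        std::printf("%-8s: %lld ms (sum=%lld)\n", label, (long long)ms, (long long)s);
    };

    time_it("unsorted");                   // many mispredictions expected
    std::sort(data.begin(), data.end());
    time_it("sorted");                     // branch becomes highly predictable
}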