x86 - The ordering of L1 cache controller to process memory requests from CPU

Monday 17 October 2016

Under the total store order (TSO) memory consistency model, an x86 CPU has a write buffer that holds write requests and can serve reordered read requests from it. The model also says that write requests leave the write buffer and are issued toward the cache hierarchy in FIFO order, which is the same as program order.

I am curious about:

When serving the write requests issued from the write buffer, does the L1 cache controller handle them, finish the cache coherence for them, and insert their data into the L1 cache in the same order as they were issued?

Answer

I think you're confused because your terminology (and probably also mental model) doesn't seem to match reality. You don't "finish the cache coherence" for a store, you get ownership of the cache line before you're allowed to modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.

(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the loads/stores from two other cores in the same order, but never mind that.)

I think you know some of this, but let me start from the basics.

L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).

Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.
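As a rough illustration of StoreLoad reordering and of what a full fence changes, here is a toy Python model. It is only a sketch under simplifying assumptions: the names `Core`, `litmus` and `mfence` are made up for this example, shared memory is a plain dict standing in for coherent cache, and a fence is modelled as simply draining the store buffer. It replays one specific interleaving rather than exploring every possible one.

```python
from collections import deque


class Core:
    """Toy core: a private FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory         # dict shared by all cores (stands in for coherent cache)
        self.store_buffer = deque()  # (address, value) pairs, oldest first

    def store(self, addr, value):
        # A store only enters the store buffer; other cores cannot see it yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Store forwarding: the youngest buffered store to this address wins.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def drain(self):
        # Commit buffered stores to shared memory, oldest first (program order).
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value

    def mfence(self):
        # Modelled as: wait until all earlier stores are globally visible.
        self.drain()


def litmus(use_fence):
    """Each core stores to its own flag, then loads the other core's flag."""
    memory = {"x": 0, "y": 0}
    c0, c1 = Core(memory), Core(memory)
    c0.store("x", 1)
    c1.store("y", 1)
    if use_fence:
        c0.mfence()
        c1.mfence()
    r0 = c0.load("y")
    r1 = c1.load("x")
    c0.drain()
    c1.drain()
    return r0, r1
```

In this model `litmus(False)` returns `(0, 0)`: both loads sampled memory while the other core's store was still sitting in its store buffer. With the fence, each store is committed before the following load, so `litmus(True)` returns `(1, 1)`.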

Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are:

It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).

It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.

Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.

All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.

The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.
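The conditions above can be written down as a small checklist. This is a hypothetical sketch, not how any real hardware is organised: `StoreEntry`, `blocking_reason` and `can_commit` are names invented here, the "all preceding operations are globally visible" condition is collapsed into a single boolean, and weakly-ordered NT stores are ignored.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    executed: bool = False  # address and data have been written into the entry
    retired: bool = False   # the instruction has retired, so the store is non-speculative
    line_state: str = "I"   # MESI state of the target line in this core's L1d


def blocking_reason(entry, older_all_visible):
    """Return why this store cannot commit to L1d yet, or None if it can."""
    if not entry.executed:
        return "not executed"
    if not entry.retired:
        return "not retired"
    if not older_all_visible:
        return "older operations pending"
    if entry.line_state not in ("M", "E"):
        return "line not owned"
    return None


def can_commit(entry, older_all_visible):
    return blocking_reason(entry, older_all_visible) is None
```

For example, `blocking_reason(StoreEntry(True, True, "S"), True)` gives `"line not owned"`: the store is executed, retired and next in line, but the core still has to finish an RFO before it may write the line.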

See Wikipedia's MESI article for diagrams of the allowed state transitions, and for details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.
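To make the "only one writer, and no other copies while writing" rule concrete, here is a minimal sketch of plain MESI for a single cache line. The `Line` class and its methods are invented for illustration; it tracks only states, not data or write-backs, and it ignores the MESIF/MOESI extensions.

```python
class Line:
    """MESI state of one cache line across several cores (toy model)."""

    def __init__(self, n_cores):
        self.state = ["I"] * n_cores

    def read(self, core):
        if self.state[core] != "I":
            return  # hit: M, E and S all permit reads
        others = [c for c, s in enumerate(self.state) if c != core and s != "I"]
        for c in others:
            self.state[c] = "S"  # an M or E copy elsewhere is downgraded to Shared
        self.state[core] = "S" if others else "E"

    def write(self, core):
        if self.state[core] not in ("M", "E"):
            # RFO (read for ownership): invalidate every other copy first.
            for c in range(len(self.state)):
                if c != core:
                    self.state[c] = "I"
        self.state[core] = "M"

    def coherent(self):
        # If any core holds the line in M or E, every other copy must be Invalid.
        owners = [c for c, s in enumerate(self.state) if s in ("M", "E")]
        valid = [c for c, s in enumerate(self.state) if s != "I"]
        return len(owners) == 0 or len(valid) == 1
```

Starting from `Line(3)`, the sequence `read(0)`, `read(1)`, `write(2)` moves the states through `["E", "I", "I"]`, `["S", "S", "I"]` and `["I", "I", "M"]`, and `coherent()` holds after every step.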

Intel CPUs actually use MESIF, while AMD CPUs actually use MOESI to allow cache->cache data transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.

Also note that modern Intel designs (before Skylake-AVX512) use a large shared inclusive L3 cache as a backstop for cache coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags, which contain extra metadata to track which core is caching what. (Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and thus are Invalid in L3. See this paper for more details of a simplified version of what Intel does.)

Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.

Back to the actual question:

Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.

Instead of "finish the cache coherence", you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocol".

The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.

The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.

The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)

On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)

The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.
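A back-of-the-envelope model shows why sending RFOs early matters even though commit stays in program order. This is a hypothetical sketch: `commit_times` and its parameters are invented here, latencies are in arbitrary cycles, the commit itself is treated as free, and the model assumes a line is never lost to another core between gaining ownership and committing.

```python
def commit_times(rfo_latencies, early_rfo=True):
    """Time at which each store (oldest first) commits to L1d, in program order.

    rfo_latencies[i] is how long store i's RFO takes once it has been sent.
    """
    times = []
    prev_commit = 0
    for latency in rfo_latencies:
        if early_rfo:
            # RFO sent at time 0, as soon as the store is in the buffer.
            owned_at = latency
        else:
            # RFO sent only once the store has become the oldest in the buffer.
            owned_at = prev_commit + latency
        # In-order commit: never earlier than the previous (older) store.
        prev_commit = max(prev_commit, owned_at)
        times.append(prev_commit)
    return times
```

With latencies `[100, 5, 100]` (miss, hit, miss), `commit_times([100, 5, 100])` gives `[100, 100, 100]` because the two misses overlap, while `commit_times([100, 5, 100], early_rfo=False)` gives `[100, 105, 205]`. In both cases the times never decrease, which is the in-order commit that x86 requires.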
