
Memory Barriers and Fencing

7 min read
Sanjeev Sarda
High Performance Developer

Notes on memory barriers and fencing.


The CPU and compiler can perform whatever optimizations they see fit as long as the result is the same from the point of view of the executing thread. Reordering of memory operations is a key technique used by CPUs to improve performance.

Memory barriers prevent reordering of memory operations both at compile time and at runtime.

Store ~ store to memory

Load ~ load from memory

This allows loads and stores to occur in a well-defined order (the program order is consistent when viewed from the perspective of different CPUs), which makes memory barriers a key technique for lock-free programming.

Barriers also ensure data is propagated to the cache sub-system, which is what makes the memory/state visible across CPUs.
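
To make the visibility problem concrete, here is a small hypothetical Java example (the class and fields are invented for illustration, not taken from any library): with plain fields and no barrier, atomic or volatile semantics, the reader may observe the flag before the data it is meant to guard.

// Hypothetical example of publication with no ordering guarantees.
public class UnsafePublication
{
    private int data = 0;          // plain field, no ordering guarantees
    private boolean ready = false; // plain flag, also no guarantees

    // Writer thread
    public void produce()
    {
        data = 42;    // (1) store the data
        ready = true; // (2) store the flag - may become visible before (1)
    }

    // Reader thread
    public void consume()
    {
        if (ready)                    // the flag may be observed...
        {
            System.out.println(data); // ...while data is still stale: printing 0 is legal
        }
    }
}

A store barrier between (1) and (2) on the writer, paired with a load barrier on the reader, is what rules this outcome out.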

Runtime Reordering

Let's say we write to 2 vars, a and b. They get written to the CPU cache, not main memory.

[Diagram: a and b are written to the CPU cache in program order, but reach main memory in the opposite order]

In the above image we see that whilst a was written to the CPU cache before b, when it comes to writing to main memory it was b that was written before a.

There is a difference between the program order and the observed order.

Load and Store Buffering

When we load or store something, there is a set of buffers that is written to or read from before the data makes its way to the cache subsystem.

[Diagram: load and store buffers sitting between the CPU and the cache subsystem]

These buffers are a type of queue with fast lookup - this means the CPU can also read from these buffers instead of the cache subsystem (if the data is in the store buffer and not yet in the cache subsystem) in order to improve performance.

These buffers are also an integral component of the async nature of cache coherency protocols.

SFence

The sfence instruction is how a store barrier is implemented on x86.

All stores prior (in program order) to the sfence instruction are globally visible before any stores that come after (in program order) the sfence.

All stores prior to the sfence instruction must be written from the store buffer to the L1 cache. This ensures visibility of state to other CPUs.

"The use of weakly-ordered memory types can be important under certain data sharing relationships (such as a producer-consumer relationship). Using weakly-ordered memory can make assembling the data more efficient, but care must be taken to ensure that the consumer obtains the data that the producer intended to see." - Intel architecture optimization manual.

LFence

The lfence instruction is how a load barrier is implemented on x86.

All loads prior (in program order) to the lfence instruction occur before any load instructions that come after the lfence instruction.

This ensures that loads which come before the lfence become globally visible before loads which come after the lfence.

It makes sure that state updates from other CPUs are visible to the current CPU by ensuring the load buffer is drained before execution continues.

MFence

An mfence or full barrier is a combination of a load and store barrier.
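
In Java (JDK 9+) these three flavours are exposed as static fence methods on VarHandle. How the JIT lowers them to sfence, lfence and mfence (or to cheaper instructions) is JVM- and architecture-dependent, so the x86 mapping in the comments below is only an approximation.

import java.lang.invoke.VarHandle;

public class FenceFlavours
{
    public static void main(final String[] args)
    {
        // Store-store ordering: stores before the fence are not reordered with stores after it
        // (the role described for sfence above).
        VarHandle.storeStoreFence();

        // Load-load ordering: loads before the fence are not reordered with loads after it
        // (the role described for lfence above).
        VarHandle.loadLoadFence();

        // Full barrier: no load or store may be reordered across the fence in either direction
        // (the role described for mfence above).
        VarHandle.fullFence();
    }
}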

Acquire and Release Barriers

No Barriers

Reads and writes can be reordered however the CPU wants.

[Diagram: with no barrier, reads and writes of a, b and c move freely around the atomic increment of x]

This diagram is taken from a presentation by Fedor Pikus (a Technical Fellow at Siemens Digital Industries) about C++ atomics. The yellow arrows marked with X indicate the atomic variable he is adding 1 to.

The colour-coded a, b, c and x boxes indicate the variables he is discussing and writing to or reading from.

The top part (above the blue memory box) is the program order.

The part below the blue memory box is the observed order.

The black arrows indicate memory reordering operations.

Acquire Barrier

Memory operations that occur after the barrier in program order become visible after the barrier. This means we can't reorder memory operations that occur after the barrier to now come before the barrier.

Anything before the barrier at x can be re-ordered to be after the barrier at x.

Anything after the barrier at x can not be re-ordered to be before the barrier at x.

[Diagram: acquire barrier at x - operations after the barrier cannot move above it, operations before it may move below it]

Release Barrier

Memory operations that occur before the barrier in program order become visible before the barrier. This means we can't reorder memory operations that occur before the barrier to now come after the barrier.

Anything after the barrier at x can be reordered to occur before the barrier at x.

Anything that was before the barrier at x can not be reordered to occur after the barrier at x.

[Diagram: release barrier at x - operations before the barrier cannot move below it, operations after it may move above it]

Using Acquire-Release Barriers

A common usage pattern for acquire and release barriers is when we have a producing thread (thread 1) and a consuming thread (thread 2).

Thread 1 writes then issues a release barrier - all writes before the barrier become visible to other threads after the barrier.

Thread 2 reads with an acquire barrier in order to sync with the published data.

[Diagram: thread 1 writes a and b, then issues a release-store fence to publish]

In the above diagram, there are some non-atomic variables (a and b) which are written to before the release-store fence issued by thread 1.

[Diagram: thread 2 performs an acquire-load, then reads a and b]

The reading thread (thread 2) issues an acquire-load and reads variables a and b after the acquire load (in program order).
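
A minimal Java sketch of this pattern (the class and field names are invented for illustration): the producer writes the plain fields a and b, issues a release fence, then publishes a flag with an opaque atomic store; the consumer reads the flag, issues an acquire fence, then reads a and b.

import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicBoolean;

public class AcquireReleaseExample
{
    private int a;
    private int b;
    private final AtomicBoolean published = new AtomicBoolean(false);

    // Thread 1: producer
    public void publish()
    {
        a = 1;                     // plain writes...
        b = 2;
        VarHandle.releaseFence();  // ...which may not be reordered after the publishing store
        published.setOpaque(true); // publish the flag (opaque: atomic, no extra ordering of its own)
    }

    // Thread 2: consumer
    public boolean tryConsume()
    {
        if (!published.getOpaque()) // observe the publication...
        {
            return false;           // not published yet
        }

        VarHandle.acquireFence();   // ...then forbid the reads below from moving above this point
        return a == 1 && b == 2;    // guaranteed once the flag has been observed as true
    }
}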

Examples of Memory Barrier Use

These are Java examples of how memory barriers are used in practice.

Volatile Fields in Java

When we write to a volatile variable, we start with an sfence to ensure the store buffers are flushed and state is being propagated. The variable is then written/stored. This is followed by a full barrier - the store buffer is flushed and the load buffer is drained.

A full barrier is required to ensure the load buffer is drained (cache coherency protocol messages?), not just that the store buffer is flushed to the L1 cache.
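
As a sketch of where those barriers sit in everyday Java code (the class and field names are hypothetical), a volatile flag is enough to publish a plain field to another thread:

public class VolatileExample
{
    private volatile boolean ready = false; // volatile: writes get the barriers described above
    private int payload;                    // plain field published via the volatile flag

    public void write()
    {
        payload = 42; // ordinary store
        ready = true; // volatile store: earlier writes may not be reordered after it
    }

    public int read()
    {
        if (ready)          // volatile load: later reads may not be reordered above it
        {
            return payload; // guaranteed to observe 42 once ready is seen as true
        }

        return -1; // not yet published
    }
}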

The LMAX Disruptor Sequence Impl

This is how the get() method is implemented:

public long get()
{
    long value = this.value;
    VarHandle.acquireFence();
    return value;
}

Remember that with an acquire barrier: "Memory operations that occur after the barrier in program order become visible after the barrier. This means we can't reorder memory operations that occur after the barrier to now come before the barrier."

This is how the set() method is implemented:

public void set(final long value)
{
    VarHandle.releaseFence();
    this.value = value;
}

Remember that with a release barrier: "Memory operations that occur before the barrier in program order become visible before the barrier. This means we can't reorder memory operations that occur before the barrier to now come after the barrier."

We see that the Sequence implementation uses the acquire-release protocol described above to ensure cross thread visibility and ordering.
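
A hypothetical usage sketch (not Disruptor code) of how that pairing gives a publisher/consumer hand-off, assuming a Sequence class that exposes the get() and set() shown above:

public class SequenceUsage
{
    private final long[] ring = new long[1024];     // plain data slots
    private final Sequence cursor = new Sequence(); // exposes the get()/set() shown above

    // Publisher thread: write the slot, then publish the sequence with set()
    // (release fence, then the store of the sequence value).
    public void publish(final long sequence, final long item)
    {
        ring[(int)(sequence & 1023)] = item;
        cursor.set(sequence);
    }

    // Consumer thread: read the sequence with get() (load, then acquire fence).
    // If it has reached the expected value, the slot read below is ordered after that load.
    public long poll(final long expectedSequence)
    {
        if (cursor.get() >= expectedSequence)
        {
            return ring[(int)(expectedSequence & 1023)];
        }

        return Long.MIN_VALUE; // not yet published
    }
}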

Aeron Frame Writing Algorithm

From the Aeron Transport Protocol Specification:

  • Write the length of the frame header atomically with a negative value so that space can be padded later on failure.

  • Readers check for a value greater than zero to see if the frame is ready to read.

  • Insert a memory ordering store fence, aka release fence.

  • Write the remainder of the header fields.

  • Copy in the frame body.

  • Insert a memory ordering store fence, aka release fence.

  • Atomically write the length field with its positive value so readers know the frame is available.

This occurs over a number of classes (I am not so hot on that codebase), but below we can see an example of the use of store fences in HeaderWriter:

public void write(final UnsafeBuffer termBuffer....)
{
    termBuffer.putLongOrdered(offset + FRAME_LENGTH_FIELD_OFFSET, versionFlagsType | ((-length) & 0xFFFF_FFFFL));

    VarHandle.storeStoreFence();

    termBuffer.putLong(offset + TERM_OFFSET_FIELD_OFFSET, sessionId | offset);

    termBuffer.putLong(offset + STREAM_ID_FIELD_OFFSET, (((long)termId) << 32) | streamId);
}
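
On the read side, the "check for a value greater than zero" step pairs with those fences. Below is a hypothetical sketch of the idea (not Aeron's actual poller, and the field offset is assumed) using Agrona's UnsafeBuffer:

import org.agrona.concurrent.UnsafeBuffer;

public final class FrameReader
{
    private static final int FRAME_LENGTH_FIELD_OFFSET = 0; // assumed layout for illustration

    public static int pollFrameLength(final UnsafeBuffer termBuffer, final int offset)
    {
        // Volatile (acquire) read of the length field.
        final int frameLength = termBuffer.getIntVolatile(offset + FRAME_LENGTH_FIELD_OFFSET);

        if (frameLength <= 0)
        {
            return 0; // 0 = not yet written, negative = claimed but body not yet published
        }

        // A positive length was observed: the acquire read orders the frame body reads after it,
        // pairing with the writer's store fence before the final positive-length write.
        return frameLength;
    }
}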

Java Fences

[Screenshot: JDK source showing the VarHandle fence methods delegating to Unsafe load/store fence operations]

Acquire and release fences are delegated to load and store fence operations in Unsafe.
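
A hedged illustration of that mapping (the comments paraphrase the OpenJDK delegation from memory rather than quoting the JDK source):

import java.lang.invoke.VarHandle;

public final class FenceDelegation
{
    public static void main(final String[] args)
    {
        VarHandle.acquireFence(); // in OpenJDK this bottoms out in Unsafe.loadFence()
        VarHandle.releaseFence(); // ...and this in Unsafe.storeFence()
        VarHandle.fullFence();    // ...and this in Unsafe.fullFence()
    }
}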

https://www.youtube.com/watch?v=qlkMbxUbKfw - Code Blacksmith, Memory Barriers

https://www.youtube.com/watch?v=ZQFzMfHIxng - Fedor Pikus (Mentor Graphics), "C++ Atomics, from basic to advanced. What do they really do?"

https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html - Martin Thompson, "Memory Barriers/Fences"

https://c9x.me/x86/html/file_module_x86_id_155.html - x86 Instruction set reference

https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html - Intel® 64 and IA-32 Architectures Optimization Reference Manual Volume 1

https://www.alibabacloud.com/blog/memory-model-and-synchronization-primitive---part-1-memory-barrier_597460 - Jeffle Xu, Alibaba Cloud blog