Distributed Shared Memory (Part 2 of 2) Notes

An example

Example of release consistency model


Key Words: Conditional variable, pthread_signal, pthread_wait

in the concrete example (screenshot below), P1 instructions that update memory (e.g. flag = 1) can be run in parallel with that of P2 because of release consistency model

Advantage of RC over SC



In a nutshell, we gain performance in a shared memory model using release consistency by overlapping computation with communication, because we no longer wait for coherence actions for every memory access

Lazy RC (release consistency)

Lazy Release Consistency


Key Words: Eager

The main idea here is that the “release consistency” is eager, in the sense that cache coherence traffic is generated immediately after unlock occurs. But with lazy RC, we defer that cache coherence traffic until the acquisition

Eager vs Lazy RC

Eager vs Lazy Consistency


Key Words: Eager, Lazy

Basically, eager and lazy goes boils down to a push (i.e. eager) versus pull (i.e. lazy) model. In the former, every time the lock is released, coherence traffic broadcasts to all other processes

Pros and Cons of Lazy and Eager


Advantage of lazy (over eager) is that there are less messages however there will be more latency during acquisition

Software DSM

Software DSM


Address space is partitioned, meaning each processor is responsible for a certain set of pages. This model of ownership is a distributed, and each node holds metadata about the page and is responsible for sending coherence traffic (at the software level)

Software DSM (Continued)

Software DSM (continued)


Key Words: false sharing

DSM software runs on each processor (cool idea) in a single writer multiple reader model. This model can be problematic because, coupled with false sharing, will cause significant bus traffic that ping pongs updates when multiple data structures live within the same cache line (or page)

LRC with Mutli-Writer Coherence Protocol

Lazy Release Consistency with multi-writer coherence


With lazy release consistency, a process will (during the critical section) generate a diff of the pages that have been modified, the diff later applied when another process performs updates to those same pages

LRC with Multi-Writer Coherence Protocol (Continued)


Need to be able to apply multiple diffs in a row, say Xd and Xd’ (i.e. prime)

LRC with Multi Writer Coherence Protocol (Continued)


Key Words: Multi-writer

The same page can be modified at the same time by multiple threads, just so as long as a separate lock is used


Implementation of LRC


Key Words: Run-length encoded

During a write operation (inside of a lock), a twin page will get created, essentially a copy of the original page. Then, during release, a run-length encoded diff is computed. Following this step, the memory access is then write protected

Implementation (continued)

LRC Implementation (Continued)


Key Words: Data Race, watermark, garbage collection

A daemon process (in every node) wakes up periodically and if the number of diffs exceed the watermark threshold, then daemon will apply diffs to original page. All in all, keep in mind that there’s overhead involved with this solution: overhead with space (for the twin page) and overhead in runtime (due to computing the run-length encoded diff)

Non Page Based DSM

Non-page-based DSM


Two types of library based that offer alternatives, both that do not require OS support. The two approaches are library-based (variable granularity) and structured DSM (API for structures that triggers coherence actions)




Do our (i.e. programmer’s) expectations get met as the number of processors increase: does performance increase accordingly as well? Yes, but there’s substantial overhead. To be fair, the same is true with true shared memory multiple processor