## Concurrency in the

 XXI century
## It's WEAK

université
PARIS竐DIDEROT $171 \%$

Gustavo Petri https://gpetri.github.io

# Introduction to Weak Consistency 

## Concurrency: Critical Sections



## Concurrency: Critical Sections



Many Solutions:

- Locks
- Semaphores
- Monitors
- Message Passing
- Actors ...


## Concurrency: Critical Sections



Many Solutions:

- Locks
- Semaphores
- Monitors
- Message Passing
- Actors ...

Many Implementations:

- Dekker
- Patterson
- Lamport (Bakery)
- Dijkstra (P/V) ...


## Concurrency: Critical Sections



Many Solutions:

- Locks
- Semaphores
- Monitors
- Message Passing
- Actors ...

Many Implementations:

- Dekker
- Patterson
- Lamport (Bakery)
- Dijkstra (P/V) ...


## Dekker's Mutual Exclusion



## Dekker's Mutual Exclusion



## Dekker's Mutual Exclusion

```
t1 = false; t2 = false;
t1 = true;
if (!t2) {
    // I'm alone in the
    // critical section
}
What went wrong?
public class Dekker \{
public static boolean \(t 1=\) false; public static boolean \(t 2=\) false; public static int nonNeg = 1;
public static void main(String[] args) \{
```

```
for (;;) {
```

for (;;) {
Dekker.t1 = false;
Dekker.t1 = false;
Dekker.t2 = false;
Dekker.t2 = false;
Dekker.nonNeg = 1;

```
    Dekker.nonNeg = 1;
```

    Thread t1 = new Thread() \{
        public void run() \{
    
## Dekker's Mutual Exclusion

$\mathrm{t} 1=$ false; $\mathrm{t} 2=\mathrm{false} ;$
$\mathrm{t} 1=$ true;
if $(!\mathrm{t} 2)\{$
$\quad / /$ I'm alone in the
$\quad$ // critical section
$\}$

$$
\begin{aligned}
& \text { t2 }=\text { true; } \\
& \text { if (!t1) \{ } \\
& \quad / / \text { I'm alone in the } \\
& \text { // critical section } \\
& \} \quad
\end{aligned}
$$

What went wrong?

- t1 and t2 aren't sync
public class Dekker \{
public static boolean t1 = false; public static boolean $t 2=$ false; public static int nonNeg = 1;
public static void main(String[] args) \{

```
for (;;) {
    Dekker.t1 = false;
    Dekker.t2 = false;
    Dekker.nonNeg = 1;
```

    Thread t1 = new Thread() \{
    public void run() \{
    
## Dekker's Mutual Exclusion

$$
\begin{aligned}
& \text { t1 }=\text { false; t2 }=\text { false; } \\
& \text { t1 }=\text { true; } \\
& \text { if (!t2) \{ } \\
& \quad / / \text { I'm alone in the } \\
& \text { // critical section } \\
& \}
\end{aligned}
$$

```
t2 = true;
if (!t1) {
    // I'm alone in the
    // critical section
}
                                    Dekker.java
```

What went wrong?

- t1 and t2 aren't sync
- nonNeg is not sync
public class Dekker \{
public static boolean $t 1=$ false; public static boolean $t 2=$ false; public static int nonNeg = 1;
public static void main(String[] args) \{

```
for (;;) {
    Dekker.t1 = false;
    Dekker.t2 = false;
    Dekker.nonNeg = 1;
```

    Thread t1 \(=\) new Thread() \{
        public void run() \{
    
## Dekker's Mutual Exclusion

$$
\begin{aligned}
& \text { t1 }=\text { false; t2 = false; } \\
& \text { t1 = true; } \\
& \text { if (!t2) \{ } \\
& \text { // I'm alone in the } \\
& \text { // critical section } \\
& \text { \} } \\
& \text { Dekker.java }
\end{aligned}
$$

What went wrong?

- t1 and t2 aren't sync
- nonNeg is not sync
- can we fix it?
public class Dekker \{
public static boolean $t 1=$ false; public static boolean $t 2=$ false; public static int nonNeg = 1;
public static void main(String[] args) \{

```
for (;;) {
    Dekker.t1 = false;
    Dekker.t2 = false;
    Dekker.nonNeg = 1;
```

    Thread t1 = new Thread() \{
        public void run() \{
    
## Dekker's Mutual Exclusion

$$
\begin{aligned}
& \text { t1 }=\text { false; t2 = false; } \\
& \text { t1 = true; } \\
& \text { if (!t2) \{ } \\
& \text { // I'm alone in the } \\
& \text { // critical section } \\
& \text { \} } \\
& \text { Dekker.java }
\end{aligned}
$$

What went wrong?

- t1 and t2 aren't sync
- nonNeg is not sync
- can we fix it?
- nonNeg--?
public class Dekker \{
public static boolean $t 1=$ false; public static boolean $t 2=$ false; public static int nonNeg = 1;
public static void main(String[] args) \{

```
for (;;) {
    Dekker.t1 = false;
    Dekker.t2 = false;
    Dekker.nonNeg = 1;
```

    Thread t1 = new Thread() \{
    public void run() \{
    
## Dekker's Mutual Exclusion

$$
\begin{aligned}
& \text { t1 }=\text { false; t2 }=\text { false; } \\
& \text { t1 }=\text { true; } \\
& \text { if (!t2) } \\
& \quad / / \text { I'm alone in the } \\
& \text { // critical section } \\
& \}
\end{aligned}
$$

```
t2 = true;
if (!t1) {
    // I'm alone in the
    // critical section
}
                                    Dekker.java
```

What went wrong?

- t1 and t2 aren't sync
- nonNeg is not sync
- can we fix it?
- nonNeg--?
- How do we know we fixed it?
public class Dekker \{
public static boolean $t 1=$ false; public static boolean $t 2=$ false; public static int nonNeg = 1;
public static void main(String[] args) \{

```
for (;;) {
    Dekker.t1 = false;
    Dekker.t2 = false;
    Dekker.nonNeg = 1;
```

    Thread t1 \(=\) new Thread() \{
    public void run() \{
    
## Where to from here?

## Where to from here?

What is a memory model?

- What are the possible results of a memory read operation


## Where to from here?

What is a memory model?

- What are the possible results of a memory read operation

Understanding Memory Models

- Testing
- Formalization
- Validation


## Where to from here?

What is a memory model?

- What are the possible results of a memory read operation

Understanding Memory Models Using Memory Models

- Testing
- Formalization
- Validation
- Programming
- Optimization
- Verification

How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs

LESLIE LAMPORT
Abstract-Many large sequential computers execute operations in a different order than is specified by the program. A correct execution is achieved if the results produced are the same as woul me produced
order. For a multiproces by executing the program steps in by each processor does computer, such a correct execution by entire program. Addition guarantee the correct execution of the that a computer correctly conditions are given which do guarantee that a computer correctly executes multiprocess programs.

Index Terms-Computer design, callel processing.
correctness, multiprocessing, parallel prencurent computing, hardware
A high-speed processor may exrogram. The correctness of order than is specified by the processor satisfies the foll the execution is guaranteed if thecution is the same as if the op condition: the result of an execurder specified by the program A tions had been executed in the dition will be called sequential. Conprocessor satisfying this cond of several such processors accessing a sider a computer composed ustomary approach to designing and common memory. The of multiprocess algorithms [1]-[3] for proving the correctness of that following condition is satisfied: such a computer assumes $n$ is the same as if the operations of all the result of any execution in some sequential order, and the the processors were execul processor appear in this sequence in A multiprocessor satislyme

## Sequential Consistency (SC)

- R1: Each processor issues memory requests in the order specified by its program
- R2: Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Entering a memory request consists of entering the request on this queue



## Sequential Consistency (SC)

- R1: Each processor issues memory requests in the order specified by its program
- R2: Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Entering a memory request consists of entering the request on this queue

Dekker is safe

```
t1 = false; t2 = false;
t1 = true;
if (!t2) {
    // I'm alone in the
    // critical section
}
```

```
t2 = true;
```

t2 = true;
if (!t1) {
if (!t1) {
// I'm alone in the
// I'm alone in the
// critical section
// critical section
}

```
}
```



## Sequential Consistency

## Sequential Consistency

Formal Tools

## Sequential Consistency

Formal Tools

Data Race Detection

## Sequential Consistency

Formal Tools

Model
Data Race Detection
Checking

## Sequential Consistency

Formal Tools

Model
Data Race Detection
Checking
Testing

## Sequential Consistency

Formal Tools

Model Data Race Detection

Static
Analysis

## Sequential Consistency

Formal Tools

Model
Checking
Data Race Detection

Program Logics
Testing

## Sequential Consistency

Formal Tools

Program Logics

Concurrent Separation Logic

## Sequential Consistency

Formal Tools

Model Checking

Data Race Detection

Program Logics
Testing

Static Analysis

Concurrent Separation Logic

## Sequential Consistency

Formal Tools

Model Checking

Data Race Detection

Program Logics
Testing

## Sequential Consistency

## Formal Tools

Model Cheokinn

Data Race Detection

## Yet most machines don't deliver SC!

Static
Analysis

Concurrent<br>Separation<br>Logic

Rely/<br>Guarantee

How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs

LESLIE LAMPORT
Abstract-Many large sequential computers execute operations in a different order than is specified by the program. A correct execution is achieved if the results produced are the same as woul me produced
order. For a multiproces by executing the program steps in by each processor does computer, such a correct execution by entire program. Addition guarantee the correct execution of the that a computer correctly conditions are given which do guarantee that a computer correctly executes multiprocess programs.

Index Terms-Computer design, callel processing.
correctness, multiprocessing, parallel prencurent computing, hardware
A high-speed processor may exrogram. The correctness of order than is specified by the processor satisfies the foll the execution is guaranteed if thecution is the same as if the op condition: the result of an execurder specified by the program A tions had been executed in the dition will be called sequential. Conprocessor satisfying this cond of several such processors accessing a sider a computer composed ustomary approach to designing and common memory. The of multiprocess algorithms [1]-[3] for proving the correctness of that following condition is satisfied: such a computer assumes $n$ is the same as if the operations of all the result of any execution in some sequential order, and the the processors were execul processor appear in this sequence in A multiprocessor satislyme

How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs

## LESLIE LAMPORT

 a different order than is specified by the programe. A correct executi is achieved if the results produceps in order. For a mocessor doe by executing the progract execution by each program. Add computer, such a correct execution of the ent a computer a guarantee the correct exch do guarantee that a computer conditions are given whitiprocess programs.executes mult

A high-speed processor may execute operations

 exendition: the result of an execution specified $r$

$$
\begin{aligned}
& \text { restricts the called a memort and efficient med memory abstract } \\
& \text { esigners use of manory consisto. program. }
\end{aligned}
$$

$$
\begin{aligned}
& \text { mulgners, thereby many perfonsistency mograms for such modion are bu } \\
& \text { moducoce }
\end{aligned}
$$ tions had been executed in condition will be call

$$
\begin{aligned}
& \text { systiprocessors supeducing the tronance optimizatio The most systems requing wide } \\
& \text { systems dise }
\end{aligned}
$$

each model differ from each more relaxed of using a commont intuitive model a formal accepted in man

mputer often leads other in subtl consistency multiprocecon used by unin sequential cification of areas of
that The purpoponse.
 sommon memory. The customultiprocess algo -
 such a computer assumes that is the same as if the operations or the the result of any execution is in some sequen in this sequence in the processors were execulal processor appearessor satisfying this
the processors were individual processoltiprocessor satisty

$$
\begin{aligned}
& \begin{array}{l}
{ }^{ \pm} \text {Western Research Laboratory } \\
\text { Digital Equipment Corporation } \\
\text { alo Alto, California } 94301-1616
\end{array} \\
& \begin{array}{l}
\begin{array}{l}
\text { Rice University ECE Technical Report } 9512
\end{array} \\
\text { Western Research Laboratory Research Report 95/7 } \\
\text { September 1995 }
\end{array}
\end{aligned}
$$

## Hardware Models

## The case for Relaxed

Sequential Program Optimizations

- Memory store takes 10 cycles*
- Memory load takes 1 cycle*
- Memory stores stall memory loads

$$
\begin{aligned}
& \mathrm{x}=1 ; \\
& \mathrm{r}_{1}=\mathrm{y} ; \\
& \mathrm{r}_{2}=\mathrm{z} ;
\end{aligned}
$$

## The case for Relaxed

Sequential Program Optimizations

- Memory store takes 10 cycles*
- Memory load takes 1 cycle*
- Memory stores stall memory loads

$$
\begin{aligned}
& \mathrm{x}=1 ; \\
& \mathrm{r}_{1}=\mathrm{y} ; \\
& \mathrm{r}_{2}=\mathrm{z} ;
\end{aligned}
$$

12 Cycles

## The case for Relaxed

Sequential Program Optimizations

- Memory store takes 10 cycles*
- Memory load takes 1 cycle*
- Memory stores stall memory loads


12 Cycles

* numbers made up for the example


## The case for Relaxed

Sequential Program Optimizations

- Memory store takes 10 cycles*
- Memory load takes 1 cycle*
- Memory stores stall memory loads


12 Cycles


10 Cycles

## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load $x$ :
- find last store to x into the processors buffer
- if none, lookup into the memory



## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory


## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ;
\end{gathered} \| \begin{aligned}
& \mathrm{y}=1 ; \\
& \mathrm{r}_{2}=\mathrm{x} ; \\
& \hline \mathrm{r}_{1}=0 \& \mathrm{r}_{2}=0
\end{aligned}
$$



## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \| \begin{array}{l}
\mathrm{y}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ;
\end{array} \left\lvert\, \begin{array}{l}
\mathrm{r}_{2}=\mathrm{x} ; \\
\hline \mathrm{r}_{1}=0 \& \mathrm{r}_{2}=0
\end{array}\right.
\end{gathered}
$$



## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \| \\
\mathrm{r}_{1}=\mathrm{y} ; \| \\
\hline \mathrm{r}_{2}=\mathrm{x} ; \\
\hline \mathrm{r}_{1}=0 \& \mathrm{r}_{2}=0
\end{gathered}
$$



## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \| \\
\mathrm{r}_{1}=\mathrm{y} ; \| \\
\hline \mathrm{r}_{2}=\mathrm{x} ; \\
\hline \mathrm{r}_{1}=0 \& \& \mathrm{r}_{2}=0
\end{gathered}
$$



$$
r_{1}=y ;
$$

$$
\begin{aligned}
& \mathrm{y}=1 \\
& \mathrm{r}_{2}=\mathrm{x}
\end{aligned}
$$

## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \| \begin{array}{l}
\mathrm{y}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ;
\end{array} \mathrm{r}_{2}=\mathrm{x} ; \\
\hline \mathrm{r}_{1}=0 \& \mathrm{r}_{2}=0
\end{gathered}
$$



$$
r_{1}=y ;
$$

$$
\begin{aligned}
& \mathrm{y}=1 \\
& \mathrm{r}_{2}=\mathrm{x}
\end{aligned}
$$

## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \| \\
\mathrm{r}_{1}=\mathrm{y} ; \| \\
\hline \mathrm{r}_{2}=\mathrm{x} ; \\
\hline \mathrm{r}_{1}=0 \& \& \mathrm{r}_{2}=0
\end{gathered}
$$



## The Case for Relaxed

## Store Buffers:

- Store x: enqueue into a FIFO buffer per processor
- Load x :
- find last store to x into the processors buffer
- if none, lookup into the memory

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \| \\
\mathrm{r}_{1}=\mathrm{y} ; \| \\
\hline \mathrm{r}_{2}=\mathrm{x} ; \\
\hline \mathrm{r}_{1}=0 \& \& \mathrm{r}_{2}=0
\end{gathered}
$$





## The Case for Relaxed

## Architectural Mechanisms

- Store Buffers
- Caches at different levels (L1, L2)
- Instruction Level Parallelism (ILP)
- Pipelines
- Branch Prediction
- Parallelization
- NUMA

- GPU
- etc.


## The Case for Relaxed

Architectural Mechanisms

- Store Buffers
- Caches at different levels (L1, L2)
- Instruction Level Parallelism (ILP)
- Pipelines
- Branch Prediction
- Parallelization
- NUMA
- GPU
- etc.


## Architectural Choices

- Manuals are explicitly obscure about the actual mechanisms
- Eg: x86 behaves as if it had store buffers
- Eg: Power behaves as if it had predictive caches
- The Manuals tend to be informal (at best)


## Total Store Ordering (TSO)

# A simple relaxed model TSO 

The Anomalies / Litmus test

## A simple relaxed model TSO

The Anomalies / Litmus test

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ;
\end{gathered} \| \begin{aligned}
& \mathrm{y}=1 ; \\
& \mathrm{r}_{2}=\mathrm{x} ; \\
& \hline \mathrm{r}_{1}=0 \& \& r_{2}=0
\end{aligned}
$$

## A simple relaxed model TSO

The Anomalies / Litmus test

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\hline \mathrm{x}=1 ; \| \mathrm{y}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ;
\end{gathered} \mathrm{r}_{2}=\mathrm{x} ;
$$



## A simple relaxed model TSO

The Anomalies / Litmus test

$$
\begin{aligned}
& \begin{array}{c}
x=0 \& y=0 \\
\hline x=1 ;\left\|\begin{array}{l}
y=1 ; \\
r_{1}=y ;
\end{array}\right\| r_{2}=x ; \\
\hline r_{1}=0 \& r_{2}=0
\end{array} \\
& x=0 \quad \& \quad y=0 \\
& \mathrm{x}=1 ; \quad \| \quad \mathrm{y}=1 \text {; } \\
& \begin{array}{l||l}
r_{1}=\mathrm{x} ; & \| \\
\mathrm{r}_{2}=\mathrm{y} ; & \begin{array}{l}
\mathrm{r}_{3}=\mathrm{y} ; \\
\mathrm{r}_{4}=\mathrm{x} ;
\end{array}
\end{array} \quad \| \begin{array}{l}
\mathrm{y}=2 ; \\
\mathrm{x}=2 ;
\end{array} \\
& r_{1}=r_{3}=2 \Rightarrow r_{2} \neq 0 \vee r_{4} \neq 0
\end{aligned}
$$



## Part I <br> Running tests with litmus7

- A tour of litmus7
- A simple run
- Cross compilation
- Running several tests at once
- Controlling test parameters
- Architecture of tests
- Affinity
- Controlling executable files
- Advanced control of test parameters
- Timebase synchronisation mode
- Advanced prefetch control
- Usage of litmus7
- Arguments
- Options
- Configuration files

Traditionally, a litmus test is a small parallel program designed to exercise the memory model of a parallel, shared-memory, computer. Given a litmus test in assembler (X86, Power or ARM) litmus7 runs the test.

Using litmus7 thus requires a parallel machine, which must additionally feature gcc and the pthreads library. Our tool litmus7 has some limitations especially as regards recognised instructions. Nevertheless, litmus7 should accept all tests produced by the companion test generators (see Part II) and has been successfully used on Linux, MacOS, AIX and Android.

## 1 A tour of litmus7

## Litmus

# Litmus 

- SB
- SB+rfi-pos
- SBB
- MP
- IRIW


## Operational Formalizations

## A simple language

$$
\begin{array}{cccc}
v \in \mathcal{V} a l & ::= & x|\lambda x e| t t|f f| 0 & \text { values } \\
e \in \mathcal{E x p r} & ::= & v|(v e)|(\operatorname{ref} v) & \text { express } \\
& \mid & (!v) \mid\left(v_{0}:=v_{1}\right) & \\
& \mid c a s v)|\langle\mathrm{wr} \mid \mathrm{rd}\rangle|\langle\mathrm{wr} \mid \mathrm{wr}\rangle & \\
& & e_{0} ; e_{1} \equiv \lambda x e_{1} e_{0} &
\end{array}
$$

## A simple language

$$
\begin{aligned}
& v \in \mathcal{V} a l::=x|\lambda x e| t t|f f| 0 \quad \text { values } \\
& e \in \mathcal{E x p r} \quad::=\quad v|(v e)|(\operatorname{ref} v) \quad \text { expressions } \\
& (!v) \mid\left(v_{0}:=v_{1}\right) \\
& \text { (mas } v \text { ) }|\langle\mathrm{wr} \mid \mathrm{rd}\rangle|\langle\mathrm{wr} \mid \mathrm{wr}\rangle \\
& e_{0} ; e_{1} \equiv \lambda x e_{1} e_{0} \\
& \text { values } \\
& \text { expressions }
\end{aligned}
$$

## A simple language

$$
\begin{aligned}
& v \in \mathcal{V a l} \quad::=\quad x|\lambda x e| t t|f f| 0 \quad \text { values } \\
& e \in \mathcal{E x p r} \quad::=\quad v|(v e)|(\operatorname{ref} v) \quad \text { expressions } \\
& (!v) \mid\left(v_{0}:=v_{1}\right) \\
& \text { (cas } v \text { ) | }\langle\mathrm{wr} \mid \mathrm{rd}\rangle \mid\langle\mathrm{wr} \mid \mathrm{wr}\rangle \\
& e_{0} ; e_{1} \equiv \lambda x e_{1} e_{0} \\
& \begin{array}{l}
\mathrm{x}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ;
\end{array}\left\|\begin{array}{l}
\mathrm{y}=1 ; \\
\mathrm{r}_{2}=\mathrm{x} ;
\end{array} \quad(x:=1) ;(!y)\right\|(y:=1) ;(!x)
\end{aligned}
$$

## A simple language

$$
\begin{aligned}
& v \in \mathcal{V} a l \quad::=\quad x|\lambda x e| t t|f f| 0 \quad \text { values } \\
& e \in \mathcal{E x p r} \quad::=\quad v|(v e)|(\operatorname{ref} v) \quad \text { expressions } \\
& (!v) \mid\left(v_{0}:=v_{1}\right) \\
& \text { (cis } v \text { ) }|\langle w r \mid r d\rangle|\langle w r \mid w r\rangle \\
& e_{0} ; e_{1} \equiv \lambda x e_{1} e_{0} \\
& \begin{array}{l}
\mathrm{x}=1 ;\left\|\begin{array}{l}
\mathrm{y}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ;
\end{array}\right\| \begin{array}{c}
(x:=1) ;(!y) \|(y:=1) ;(!x) \\
\mathrm{r}_{2}=\mathrm{x} ;
\end{array}(\lambda z!y)(x:=1) \|(\lambda z!x)(y:=1)
\end{array}
\end{aligned}
$$

## A simple language

$$
\begin{aligned}
& v \in \mathcal{V} a l \quad::=\quad x|\lambda x e| t t|f f| 0 \\
& e \in \mathcal{E x p r} \quad::=\quad v|(v e)|(\operatorname{ref} v) \\
& (!v) \mid\left(v_{0}:=v_{1}\right) \\
& \text { (cas } v \text { ) } \mid\langle\mathrm{wr} \mid \mathrm{rd}\rangle \\
& e_{0} ; e_{1} \equiv \lambda x e_{1} e_{0} \\
& r::=(\lambda x e v) \mid(\operatorname{ref} v) \quad \text { redexes } \\
& (!p) \mid(p:=v) \\
& \text { (cas } p \text { ) | }\langle\mathrm{wr} \mid \mathrm{rd}\rangle \mid\langle\mathrm{wr} \mid \mathrm{wr}\rangle \\
& \mathbf{E}::=[] \mid(v \mathbf{E}) \quad \text { evaluation contexts } \\
& p:=1 ;(!q) \equiv(\lambda x(!q)(p:=1)) \equiv(\underbrace{\lambda x(!q)}_{\mathbf{E}} \underbrace{(p:=1)}_{[r]})
\end{aligned}
$$

## Sequential Consistency

$(S, T \| \mathbf{E}[(\lambda x e v)])$


## Sequential Consistency

$$
(S, T \| \mathbf{E}[(\lambda x e v)]) \quad \xrightarrow{\beta} \quad(S, T \| \mathbf{E}[\{x / v\} e])
$$



## Sequential Consistency



## Write buffer semantics

$(S, T \|(B, \mathbf{E}[(\lambda x e v)]))$


## Write buffer semantics

$$
\begin{array}{rlll}
(S, T \|(B, \mathbf{E}[(\lambda x e v)])) & \xrightarrow{\beta} & (S, T \|(B, \mathbf{E}[\{x / v\} e])) \\
(S, T \|(B, \mathbf{E}[(\operatorname{ref} v)])) & \xrightarrow{\nu_{p, v}} & (S \cup\{p \mapsto v\}, T \|(B, \mathbf{E}[p])) p \notin \operatorname{dom}(S) \\
(S, T \|(B, \mathbf{E}[(p:=v)])) & \xrightarrow{\mathrm{wr}_{p, v}} & (S, T \|(B \triangleleft[p \mapsto v], \mathbf{E}[0])) \\
(S, T \|(B, \mathbf{E}[(!p)])) & \xrightarrow{\mathrm{rd}_{p, v}} & (S, T \|(B, \mathbf{E}[v])) & B(p)=\epsilon \& S(p)=v \\
(S, T \|(B, \mathbf{E}[(!p)])) & \xrightarrow{\mathrm{rd}_{p, v}} & (S, T \|(B, \mathbf{E}[v])) & B(p)=l s:: v \\
(S, T \|(B, \mathbf{E}[\langle\mathrm{wr} \mid \mathrm{rd}\rangle])) & \xrightarrow{\mathrm{wr}} & (S, T \|(B, \mathbf{E}[0])) & \forall p, B(p)=\epsilon
\end{array}
$$



## Write buffer semantics

$$
\begin{array}{rlll}
(S, T \|(B, \mathbf{E}[(\lambda x e v)])) & \xrightarrow{\beta} & (S, T \|(B, \mathbf{E}[\{x / v\} e])) \\
(S, T \|(B, \mathbf{E}[(\operatorname{ref} v)])) & \xrightarrow{\nu_{p, v}} & (S \cup\{p \mapsto v\}, T \|(B, \mathbf{E}[p])) \quad p \notin \operatorname{dom}(S) \\
(S, T \|(B, \mathbf{E}[(p:=v)])) & \xrightarrow{\mathrm{wr}_{p, v}} & (S, T \|(B \triangleleft[p \mapsto v], \mathbf{E}[0])) \\
(S, T \|(B, \mathbf{E}[(!p)])) & \xrightarrow{\mathrm{rd}_{p, v}} & (S, T \|(B, \mathbf{E}[v])) & B(p)=\epsilon \& S(p)=v \\
(S, T \|(B, \mathbf{E}[(!p)])) & \xrightarrow{\mathrm{rd}_{p, v}} & (S, T \|(B, \mathbf{E}[v])) & B(p)=l s:: v \\
(S, T \|(B, \mathbf{E}[\langle\mathrm{wr} \mid \mathrm{rd}\rangle]))) & \xrightarrow{\mathrm{wr}} & (S, T \|(B, \mathbf{E}[0])) & \forall p, B(p)=\epsilon \\
(S, T \|([p \mapsto v] \triangleright B, e)) & \xrightarrow{\mathrm{bu}_{p, v}} & (S[p \mapsto v], T \|(B, e)) & T S O
\end{array}
$$



Axiomatic Formalizations

$\because \Delta \Rightarrow$

- Writing simple models
- Sequential consistency
- Total Store Order (TSO).
- Sequential consistency,total order definition
- Computing coherence orders
- Producing pictures of executions
- Graph modes
- Showing forbidden executions
- Model definitions
- Overview
- Identifiers
- Expressions
- Instructions
- Bell extensions
- Models
- Primitives
- Library
- Usage of herd7
- Arguments
- Options
- Configuration files


## Partial Orders

- Strict Partial Orders
- irreflexive, transitive
- po or $\xrightarrow{\text { po }}$ for program order
- Operations
- inverse: $\mathrm{po}^{-1}$
- transitive closure: $\mathrm{po}^{+}$
- composition: po; rf
- set operations: po $\cup r f$ po $\cap r f$
- Conditions on Orders
- Acyclicity
- Irreflexivity
- Transitive
- Consistency

$$
(\mathrm{po} \cup \mathrm{rf})^{+} \cap \mathrm{id}=\emptyset
$$

# Burckhardt's cheatsheet 

| Property | Element-wise Definition $\forall x, y, z \in A:$ | Algebraic Definition |
| :---: | :---: | :---: |
| symmetric <br> reflexive <br> irreflexive <br> transitive <br> acyclic <br> total | $\begin{aligned} & x \xrightarrow{\text { rel }} y \Rightarrow y \xrightarrow{\text { rel }} x \\ & x \xrightarrow[\text { rel }]{\text { re }} x \\ & x \xrightarrow{\text { rel }} x \\ & (x \xrightarrow{\text { rel }} y \xrightarrow{\text { rel }} z) \Rightarrow(x \xrightarrow{\text { rel }} z) \\ & \neg(x \xrightarrow[\text { rel }]{ } \ldots \xrightarrow{\text { rel }} x) \\ & x \neq y \Rightarrow(x \xrightarrow{\text { rel }} y \vee y \xrightarrow{\text { rel }} x) \end{aligned}$ | $\begin{aligned} & \text { rel }=\mathrm{rel}^{-1} \\ & \mathrm{id}_{A} \subseteq \mathrm{rel} \\ & \mathrm{id}_{A} \cap \mathrm{rel}=\emptyset \\ & \left(\mathrm{rel} ; \mathrm{rel}^{\prime} \subseteq\right. \text { rel } \\ & \mathrm{id}_{A} \cap \mathrm{rel}^{+}=\emptyset \\ & \text { rel } \cup \mathrm{rel}^{-1} \cup \mathrm{id}_{A}=A \times A \end{aligned}$ |


| Property | Definition |
| :--- | :--- |
| natural | $\forall x \in A: \mid$ rel $^{-1}(x) \mid<\infty$ |
| partialorder | irreflexive $\wedge$ transitive |
| totalorder | partialorder $\wedge$ total |
| enumeration | totalorder $\wedge$ natural |
| equivalencerelation | reflexive $\wedge$ transitive $\wedge$ symmetric |

Figure 2.1: Definitions of common properties of a binary relation rel $\subseteq A \times A$.

## TSO - The Model

SC

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\mathrm{x}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ; \\
\mathrm{r}_{1}=1 \& \mathrm{r}_{2}=0
\end{gathered}
$$

## TSO - The Model

SC

$$
\begin{gathered}
\mathrm{x}=0 \& \mathrm{y}=0 \\
\operatorname{pol}_{{ }^{\mathrm{x}}=}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ; \\
\mathrm{r}_{1}=1 \& \mathrm{r}_{2}=1 ; \\
\mathrm{r}_{2}=\mathrm{x} ;
\end{gathered}
$$

$\xrightarrow{\mathrm{pO}}$ program order Total order on the actions of each process

## TSO - The Model

SC

$$
\begin{aligned}
& x=0 \& y=0
\end{aligned}
$$

$$
\begin{aligned}
& r_{1}=1 \& r_{2}=0
\end{aligned}
$$

program order Total order on the actions of each process reads from Relates a read to the write that stores its value

## TSO - The Model

SC

$$
\begin{aligned}
& r_{1}=1 \& r_{2}=0
\end{aligned}
$$

$\xrightarrow{\mathrm{po}}$ program order Total order on the actions of each process
$\xrightarrow{\mathrm{CO}}$ commit order Relates writes to the same address

## TSO - The Model

SC

program order Total order on the actions of each process
$\begin{array}{ll}\text { reads from } & \text { Relates a read to the write that stor } \\ \text { commit order } & \text { Relates writes to the same address }\end{array}$
from reads
Read to write order derived from rf and co

## TSO - The Model

SC

$$
x=0 \& y=0
$$


$\xrightarrow{\mathrm{pO}}$ program order Total order on the actions of each process

Relates writes to the same address
$\xrightarrow{\mathrm{fr}}$ from reads Read to write
happens before ${ }^{*}(\mathrm{po} \cup \mathrm{rf})^{+}$

## TSO - The Model

SC


SC

$$
\begin{gathered}
\mathrm{x}=0 \text { \& } \mathrm{y}=0 \\
\mathrm{x}=1 ; \\
\mathrm{r}_{1}=\mathrm{y} ; \\
\mathrm{r}_{1}=0 \& \& \mathrm{r}_{2}=0
\end{gathered}
$$

## TSO - The Model



## TSO - The Model



## TSO - The Model



## TSO - The Model



## TSO - The Model



## TSO - The Model

TSO


## TSO - The Model



## TSO - The Model

TSO


## TSO - The Model

TSO


TSO


## TSO - The Model

TSO


TSO


## TSO - The Model

TSO


TSO


## Formalizing MMs

Events: $\mathrm{e}=\mathrm{p}: \mathrm{R}[\mathrm{x}]=1 \quad|\mathrm{p}: W[\mathrm{x}]=1 \quad| \mathrm{p}:$ Fence
Execution: $\mathrm{E}=\langle\mathrm{P}, \mathrm{Ev}, \mathrm{po}\rangle$
Candidate Execution: $\mathrm{C}=\langle\mathrm{E}, \mathrm{rf}, \mathrm{cos}$
Memory Access Dec: $D=W|R| M$
Derived Relations: $\quad R=D D$

Constraints:
| ext | int | fr acyclic $\mid$ irreflexive

## Causality (Happens-Before)

In the simplest case (SC):

## Causality (Happens-Before)

In the simplest case (SC):

- events from the same process happen in the order of their program: po $\subseteq$ hb


## Causality (Happens-Before)

In the simplest case (SC):

- events from the same process happen in the order of their program: po $\subseteq$ hb
- If a read sees a value, the write storing that value happens before that read: rf $\subseteq$ hb


## Causality (Happens-Before)

In the simplest case (SC):

- events from the same process happen in the order of their program: po $\subseteq$ hb
- If a read sees a value, the write storing that value happens before that read: rf $\subseteq$ hb
- happens before is a transitive relation: $\mathrm{hb}^{*} \subseteq \mathrm{hb}$


## Causality (Happens-Before)

In the simplest case (SC):

- events from the same process happen in the order of their program: po $\subseteq$ hb
- If a read sees a value, the write storing that value happens before that read: rf $\subseteq$ hb
- happens before is a transitive relation: $\mathrm{hb}^{\star} \subseteq \mathrm{hb}$
- happens before is acyclic


## Causality (Happens-Before)

In the simplest case (SC):

- events from the same process happen in the order of their program: po $\subseteq$ hb
- If a read sees a value, the write storing that value happens before that read: if $\subseteq$ hb
- happens before is a transitive relation: $\mathrm{hb}^{*} \subseteq \mathrm{hb}$
- happens before is acyclic
- we can add fr and co to hb (rf $\subseteq$ hb and $\mathrm{cp} \subseteq \mathrm{hb}$ ) but it doesn't change anything


## Causality (Happens-Before)

TSO*:

## Causality (Happens-Before)

TSO*:

- Reads can bypass writes on the same processor


## Causality (Happens-Before)

TSO*:

- Reads can bypass writes on the same processor
- Define the preserved program order: ppo = po / WR


## Causality (Happens-Before)

TSO*:

- Reads can bypass writes on the same processor
- Define the preserved program order: ppo = po / WR
- events from the same process happen in their preserved program: $\mathrm{ppo} \subseteq \mathrm{hb}$


## Causality (Happens-Before)

TSO*:

- Reads can bypass writes on the same processor
- Define the preserved program order: ppo = po / WR
- events from the same process happen in their preserved program: $\mathrm{ppo} \subseteq \mathrm{hb}$
- If a read sees a value, the write storing that value happens before that read: rf $\subseteq$ hb
- fr and co are included in hb (rf $\subseteq \mathrm{hb}$ and $\mathrm{cp} \subseteq \mathrm{hb}$ )
- happens before is a transitive relation: $\mathrm{hb}^{*} \subseteq \mathrm{hb}$
- happens before is acyclic
tso.cat


# tso.cat 

- SB
- SB+rfi-pos
- SBB
- MP
- IRIW


Herding cats: Modeling, Simulation, Testing, and Data-mining for Weak Memory

| notation | name | nature | dirns | reference | description |
| :---: | :---: | :---: | :---: | :---: | :---: |
| po | program order | execution | any, any | §Relations over events | instruction order lifted to events |
| rf | read-from | execution | WR | §Relations over events | links a write $w$ to a read $r$ taking its value from $w$ |
| co | coherence | execution | WW | §Relations over events | total order over writes to the same memory location |
| ppo <br> ffence, ff | $\begin{aligned} & \text { preserved pro- } \\ & \text { gram order } \\ & \text { full fence } \end{aligned}$ | architecture architecture | any, any | §Architectures | program order maintained by the architecture |
|  |  |  | any, any | §Architectures | e.g. sync on Power, dmb and dsb on ARM |
| Iwfence, Iwf cfence | lightweight fence control fence | architecture architecture | any, any | §Architectures | e.g. Iwsync on Power |
|  |  |  | any, any | §Architectures | e.g. isync on Power, isb on ARM |
| fences | fences | architecture | any, any | §Architectures | architecture-dependent. |
| prop | propagation | architectur | Relax: we won't study |  |  |
|  |  |  |  |  | propagave, cyprany en forced by fences |
| po-loc | program order restricted to the same memory | derived | any, any | $\begin{aligned} & \text { §SC PER LO- } \\ & \text { CATION } \end{aligned}$ | $\begin{aligned} & \{(x, y) \mid \underset{ }{(x, y) \in} \in \text { po } \wedge \\ & \operatorname{addr}(x)=\operatorname{addr}(y)\} \end{aligned}$ |
| com | communications | derived | any, any | §Relations | co $\cup r f \cup f r$ |
| $f r$ | from-read | derived | RW | §Relations over events | links a read $r$ to a write $w^{\prime}$ co-after the write $w$ from which $r$ takes its value |
| hb | happens before | derived | any, any | §NO THIN AIR | ppo $\cup$ fences $\cup$ rfe |
| rdw | read different writes | derived | RR | Fig. 27 | two threads; first thread holds a write, second thread holds two reads |

Herding cats: Modeling, Simulation, Testing, and Data-mining for Weak Memory Alglave, Maranget, Tautschnig TOPLAS'14

## PSO - RMO

- Herd
- Other semantical styles
- TSO:


Figure 41-Memory Models from Least Restrictive (RMO) to Most Restrictive (TSO)

- Denotational semantics based on sequences
- Denotational semantics based on POSETs
- PSO \& RMO:
- Axiomatic and operational models are relatively simple
- Denotational? Not so much

A Better x86 Memory Model: x86-TSO

Scott Owens Susmit Sarkar Peter Sewell
University of Cambridge
http ://www.cI.cam.ac.uk/users/pes20/weakmemory

Abstract. Real multiprocessors do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, typically described in ambiguous prose, which lead to widespread confusion. We produced a targets for mechanized formalization. In previous AMD architecture specorous $x 86-C C$ model, formalizing the Int out to be unsound with respect ifications of the time, but those turned oly too weak to program above. to actual hardware, as well as arguab new $x 86$-TSO model that suffers We discuss these issues and present HOL4. We believe it is sound with from neither problem, formalized in better the vendor's intentions, and is respect to real processors, reflects b. We give two equivalent definitions of also better suited for programming. Wodel based on local write buffers, and x86-TSO: an intuitive operational model, similar to that of the SPARCv8. an axiomatic total store ordering mpecific features. We have implemented Both are adapted to handle x86-spents tool, which calculates the set of all the axiomatic model in our memeven and, for greater confidence, verify the valid executions of test program. directly, with code extracted from a third, witnesses of such executions version of the definition.

1 Introduction
Most previous research on the semantics and verification of concurrent programs assumes sequential consistency: that accesses by multiple real multiprocessors, however, inassures in a global-time linear order. These are typically unobservable by consequences for the be-

## x 86 is TSO

- Documentations are really imprecise
- So you say x86 is TSO ..., how do you know?
- Litmus Test
- No conclusive proof
- Errors in both the specification and implementations have been found (mostly ARM/Power)


## PowerPC / ARM

## Power? ARM

- The story is more complicated
- Operationally: Store Atomicity relaxations (co)
- Axiomatically: Many more axioms
- How do we restore assurance?
- Herd/CAT
- Operational Simulators
- Still ..., no guarantees


## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

Threads

Memory
Temporary Store

| $x$ | $y$ | $z$ |
| :---: | :---: | :---: |
| ff | tt | ff |



## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{aligned}
& x:=f f ; \\
& r_{0}:=(!y)
\end{aligned}\left\|\begin{array}{l}
z:=f f ;
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& z:=t t
\end{aligned} \quad \begin{aligned}
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| $x$ | $y$ | $z$ |
| :---: | :---: | :---: |
| ff | tt | ff |


code;
code;

## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{l}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |


code;
code;
code

## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{l}
z:=f f ;
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& z:=t t
\end{aligned} \quad \begin{aligned}
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |


code;
code;
code

## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{l}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |



## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{l}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |



Placeholder values for reads

## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{l}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |



Placeholder values for reads

## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{aligned}
& x:=f f ; \\
& r_{0}:=(!y)
\end{aligned}\left\|\begin{array}{l}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |



Placeholder values for reads

## The Semantic Framework in a Nutshell

Threads contribute operations to a pipeline-like temporary store

$$
\begin{aligned}
& x:=f f ; \\
& \left\|\begin{array}{l}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{array}{l}
y:=f f ; \\
r_{1}:=(!y)
\end{array}
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |

wrotitif

Placeholder values for reads

## The Semantic Framework in a Nutshell

The Memory Executes Operations Reordering as Allowed by the Memory Model

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{ll}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |

## The Semantic Framework in a Nutshell

The Memory Executes Operations Reordering as Allowed by the Memory Model

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{ll}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |

## The Semantic Framework in a Nutshell

The Memory Executes Operations Reordering as Allowed by the Memory Model

$$
\begin{array}{ll}
x:=f f ; \\
r_{0}:=(!y)
\end{array}\left\|\begin{array}{ll}
z:=f f ; \\
z:=t t
\end{array}\right\| \begin{aligned}
& y:=f f ; \\
& r_{1}:=(!y)
\end{aligned}
$$

Temporary Store

| x | y | z |
| :---: | :---: | :---: |
| ff | tt | ff |

## Examples

- We use a commutability relation constraining the permissible reordering

ISO $\left(t, \mathrm{wr}_{p, v}^{W}\right) \nrightarrow\left(t, \mathrm{rd}_{q, w}\right)$
$\operatorname{PSO}\left(t, \mathrm{wr}_{p, v}^{W}\right) \nrightarrow\left(t, \mathrm{rd}_{q, w}\right) \&\left(t, \mathrm{wr}_{p, v}^{W}\right) \nrightarrow\left(t, \mathrm{wr}_{q, w}\right)$
$\mathrm{RMO}\left(t, \mathrm{wr}_{p, v}^{W}\right) \dashv\left(t, \mathrm{rd}_{q, w}\right) \&\left(t, \mathrm{wr}_{p, v}^{W}\right) \dashv\left(t, \mathrm{wr}_{q, w}\right)$

$$
\left(t, \mathrm{rd}_{p, v}\right) \nrightarrow\left(t, \mathrm{rd}_{q, w}\right)
$$

## Store-Atomicity Relaxation



## Memory Write Rules

Normal Write

$$
\begin{aligned}
\left(S, \sigma_{0} \cdot\left(t, \mathrm{wr}_{p, v}^{W, I}\right) \cdot \sigma_{1}, T\right) \underset{\dashv, \mathcal{W}}{\longrightarrow} & \left(S[p:=v], \sigma_{0} \cdot \sigma_{1}, T\right) \\
& \text { if } \quad \sigma_{0} \dashv\left(t, \mathrm{wr}_{p, v}^{W, I}\right) \& v \in \mathcal{V}_{a l}
\end{aligned}
$$

## Early Write

$$
\left(S, \sigma_{0} \cdot\left(t, \operatorname{wr}_{\varrho, v}^{W, I}\right) \cdot \sigma_{1}, T\right) \underset{\dashv, \mathcal{W}}{\longrightarrow}\left(S, \sigma_{0} \cdot\left(t, \operatorname{wr}_{\varrho, v}^{W^{\prime}, I}\right) \cdot \sigma_{1}, T\right)
$$

$$
\text { if } \quad t \in W^{\prime} \& W \subset W^{\prime} \in \mathcal{W}
$$

## Memory Read Rules

Normal Read

$$
\begin{aligned}
\left(S, \sigma_{0} \cdot\left(t, \mathrm{rd}_{p, \iota}\right) \cdot \sigma_{1}, T\right) \underset{\substack{ \\
\hline \mathcal{W}}}{\longrightarrow} & \left(S,\{\iota \mapsto v\}\left(\sigma_{0} \cdot \sigma_{1}, T\right)\right) \\
\text { if } & \sigma_{0} \dashv\left(t, \mathrm{rd}_{p, \iota}\right) \& S(p)=v
\end{aligned}
$$

## Early Read

$$
\begin{aligned}
&\left(S, \sigma_{0} \cdot\left(t^{\prime}, \operatorname{wr}_{p, v}^{W, I}\right) \cdot \sigma_{1} \cdot\left(t, \mathrm{rd}_{p, \iota}\right) \cdot \sigma_{2}, T\right) \\
& \underset{\neg, \mathcal{W}}{\longrightarrow}\left(S,\{\iota \mapsto v\}\left(\sigma_{0} \cdot\left(t^{\prime}, \operatorname{wr}_{p, v}^{W, I \cup\{\iota\}}\right) \cdot \sigma_{1} \cdot \sigma_{2}, T\right)\right) \\
&\left.\quad \text { if } \quad t \in W \& \sigma_{1}\right\urcorner\left(t, \mathrm{rd}_{p, \iota}\right)
\end{aligned}
$$

## IRIW Example

IRIW $p:=t t\|q:=t t\|$\begin{tabular}{l}
$r_{0}:=!p ;$ <br>
$r_{1}:=!q$

$\|$

$r_{2}:=!q ;$ <br>
$r_{3}:=!p$
\end{tabular}

$r_{0}=r_{2}=t t \& r_{1}=r_{3}=f f$

## IRIW Example

$$
\text { IRIW } p:=t t\|q:=t t\| \begin{aligned}
& r_{0}:=!p ; \quad \begin{array}{l}
r_{2}:=!q ; \\
r_{1}:=!q
\end{array} r_{3}:=!p
\end{aligned}
$$

| $r_{0}=r_{2}=t t \& r_{1}=r_{3}=f f$ |  |  |  |
| :--- | :--- | :--- | :--- | :--- |
| $\left(t_{0}, \mathbf{w r}_{p, t t}^{\left\{t_{0}\right\}}\right)$ | $\left(t_{1}, \mathbf{w r}_{q, t t}^{\left\{t_{1}\right\}}\right)$ | $\left(t_{2}, \mathrm{rd}_{p, \nu}\right)$ | $\left(t_{2}, \mathrm{rd}_{q, \nu^{\prime}}\right)\left(t_{3}, \mathrm{rd}_{q, \mu}\right)\left(t_{3}, \operatorname{rd}_{p, \mu^{\prime}}\right)$ |

## VEarly writes $\downarrow$

| $\left(t_{0}, \mathrm{wr}_{p, t t}^{\left\{t_{0}, t_{2}\right\}}\right)$ | $\left(t_{1}, \mathrm{wr}_{q, t t}^{\left\{t_{1}, t_{3}\right\}}\right)$ | $\left(t_{2}, \mathrm{rd}_{p, \nu}\right)\left(t_{2}, \mathrm{rd}_{q, \nu^{\prime}}\right)\left(t_{3}, \mathrm{rd}_{q, \mu}\right)\left(t_{3}, \mathrm{rd}_{p, \mu^{\prime}}\right)$ |
| :--- | :--- | :--- | :--- | :--- |

## IRIW Example

IRIS $p:=t t\|q:=t t\| \begin{aligned} & r_{0}:=!p ; \\ & r_{1}:=!q\end{aligned} \| \begin{aligned} & r_{2}:=!q ; \\ & r_{3}:=!p\end{aligned}$

| $r_{0}=r_{2}=t t \& r_{1}=r_{3}=f f$ |  |  |  |
| :--- | :--- | :--- | :--- | :--- |
| $\left(t_{0}, \mathbf{w r}_{p, t t}^{\left\{t_{0}\right\}}\right)$ | $\left(t_{1}, \mathrm{wr}_{q, t t}^{\left\{t_{1}\right\}}\right)$ | $\left(t_{2}, \mathrm{rd}_{p, \nu}\right)$ | $\left(t_{2}, \mathrm{rd}_{q, \nu^{\prime}}\right)\left(t_{3}, \mathrm{rd}_{q, \mu}\right)\left(t_{3}, \operatorname{rd}_{p, \mu^{\prime}}\right)$ |

## $\downarrow$ Early writes $\downarrow$

| $\left(t_{0}, \mathrm{wr}_{p, t t}^{\left\{t_{0}, t_{2}\right\}}\right)\left(t_{1}, \mathrm{wr}_{q, t t}^{\left\{t_{1}, t_{3}\right\}}\right)$ | $\left(t_{2}, \mathrm{rd}_{p, \nu}\right)\left(t_{2}, \mathrm{rd}_{q, \nu^{\prime}}\right)\left(t_{3}, \mathrm{rd}_{q, \mu}\right)\left(t_{3}, \mathrm{rd}_{p, \mu^{\prime}}\right)$ |
| :---: | :---: | :---: | :---: |

Early reads
$\left(t_{0}, \mathrm{wr}_{p, t t}^{\left\{t_{0}, t_{2}\right\}}\right)\left(t_{1}, \mathrm{wr}_{q, t t}^{\left\{t_{1}, t_{3}\right\}}\right)$
$\left(t_{2}, \mathrm{rd}_{q, \nu^{\prime}}\right)$
$\left(t_{3}, \mathrm{rd}_{p, \mu^{\prime}}\right)$

PCC.cat

# PCC.cat 

- SB-PPC
- SB-PPC-Iwsync
- SB-PPC-sync
- WRC
- WRC+realdata
- WRC-Iwsync
- IRIW
- IRIW-Iwsync
- IRIW-sync


## POWER and ARM Litmus Tests

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental

| Coherence tests |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| CoRR1: rf,po,fr <br> forbidden <br> Thread 0 <br> Thread 1 <br> Test CoRR1 |  | CoWR: co,po, $\mathbf{r f}^{-1} \quad$ forbidden <br> Thread 0 <br> Thread 1 <br> Test CoWR | CoWW: po,co <br> Thread 0 <br> a: $W[x]=1$ <br> b: $W[x]=2$ <br> Test CoWW | forbidden |


| 4-edge 2-thread tests 5-edge extensions along one rf edge |  |  |
| :---: | :---: | :---: |
| One rf | Two rf | Preserved read-read program order |
|  | WRC: rf,rf,fr <br> needs lwsync+RRdep <br> Test WRC | PPO000-019: barrier,rf,intra-thread ${ }^{*}$,fr |
| S: rf,co needs lwsync + RWdep Thread 0 <br> Thread 1 <br> a: <br> Test S |  |  |
| No rf | One rf | 6-edge extensions along two rf edges |
|  |  |  |
| $\mathbf{R}$ : co,fr $\quad$ needs sync+sync | WRW+WR: rf,co,fr needs sync+sync | IRRWIW: rf,fr,rf,co needs sync+sync |

## Power Barriers

- sync: heavyweight barrier
- cumulative
- lwsync: lightweight barrier
- similar to sync
- does not prevent WR reordering
- Examples
- Herd


## Simple Spin-lock

| $\operatorname{lock}(l) ;$ | $\operatorname{lock}(l) ;$ |
| :---: | :---: |
| $r=x ;$ | $r=x ;$ |
| $x=r+1 ;$ | $x=r+1 ;$ |
| unlock $(l) ;$ | unlock(l); |



To fence or not to fence?

## Other Models

- And yet these are not the most complicated models
- NVIDIA
- Alpha (obsolete)
- We'll see Programming Languages models in the next lecture

