### Software Models

How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs LESLIE LAMPORT Abstract—Many large sequential computers execute operations i a different order than is specified by the program. A correct executi is achieved if the results produced are the same as would be produ by executing the program steps in order. For a multiproce computer, such a correct execution by each processor doe guarantee the correct execution of the entire program. Add conditions are given which do guarantee that a computer c executes multiprocess programs. Index Terms—Computer design, concurrent computing correctness, multiprocessing, parallel processing. A high-speed processor may execute operations order than is specified by the program. The corr execution is guaranteed if the processor satisfie condition: the result of an execution is the same tions had been executed in the order specified b designers, thereby reducing the benefit of using a multiprocessor. To alleviate this problem, many current more relaxed consistency models. Unfortunately, the models supported many current waves. Furthermore, medicated with the semantical of the multiprocessors support more relaxed consistency models. Unfortunately, the models supported by various systems differ from each other in subtle yet important ways. Furthermore, precisely defining the semantics of the model offen leads to complex specifications that are difficult to understand for twnical users and builders of processor satisfying this condition will be call systems differ from each other in subtle yet important ways. Furthermore, precisely defining the semantics of each model often leads to complex specifications that are difficult to understand for typical users and builders of sider a computer composed of several such pr In purpose of this tutorial paper is to describe issues related to memory consistency models in a way and he understandable to most computer professionals. We focus on consistency models promosed for had been to processor satisfying this concern sider a computer composed of several succern common memory. The customary approa the correctness of multiprocess algonization is consistent of multiprocess algonization is consistent of the operations the result of any execution is the same as if the operations of the processors were executed in some sequential order, and the the statisticitual processor appear in this sequence in A multiprocessor satisfying this

Shared Memory Consistency Models: A Tutorial \* Sarita V. Adve<sup>†</sup> and Kourosh Gharachorloo<sup>‡</sup> <sup>†</sup>Department of Electrical and Computer Engineering Houston, Texas 77251-1892 <sup>‡</sup>Western Research Laboratory Digital Equipment Corporation Palo Alto, California 94301-1616 Rice University ECE Technical Report 9512 Western Research Laboratory Research Report 95/7 September 1995 Parallel systems that support the shared memory abstraction are becoming widely accepted in many areas of writing correct and efficient programe for such evetame requires a formal enorification of mamory and the states of the Parallel systems that support the shared memory abstraction are becoming widely accepted in memory abstraction are becoming widely accepted in many accepted in model \_\_sequential consistency \_\_greatly computing. Writing correct and efficient programs for such systems requires a formal specification of memory consistency model. The most intuitive model—sequential specification of memory entries and commonly need by unincodes or hardware and commiler to the use of many performance ontimizations commonly need by unincodes or hardware and commiler to the use of many performance ontimizations commonly need by unincodes or hardware and commiler to the use of many performance ontimizations commonly need by unincodes or hardware and commiler to the use of many performance ontimizations commonly need by unincodes or hardware and commiler to the use of many performance ontimizations commonly need by unincodes or hardware and commiler to the use of many performance on the semantics, called a memory consistency model. The most intuitive model—sequential consistency model restricts the use of many performance optimizations commonly used by uniprocessor hardware and consistency—greatly designers, thereby reducing the benefit of using a multiprocessor To alleviate this problem many current restricts the use of many performance optimizations commonly used by uniprocessor hardware and designers, thereby reducing the benefit of using a multiprocessor. To alleviate this problem, many current multiprocessors support more relaxed consistency models. Unfortunately, the models supported by various various to the models supported by various to the models sup



memory model determines the transformations the compiler

are when producing bytecode, the trans-

control dependences that precludes some standard compiler transformations. insity of what is currently done in compil-

the proc

(To appear in Concurrency: Practice and Experience) The Java Memory Model is Fatally Flawed

> William Pugh Dept. of Computer Science Univ. of Maryland, College Park pugh@cs.umd.edu

#### Abstract

The Java memory model described in Chapter 17 of the Java Language Specification gives constraints on how threads interact through memory. This chapter is hard to interpret and poorly understood; it imposes constraints that prohibit common compiler optimizations and are expensive to implement on existing hardware. Most JVMs violate the constraints of the existing Java memory model; conforming to the existing specification would impose significant performance penalties.

In addition, programming idioms used by some programmers and used within Sun's Java Developexisting style of the specification will never be clear, and that attempts to patch the existing specification by adding new rules will make even harder to understand. If we decide to change the Java memory model, a completely new description of the memory model should be devised.

A number of terms are used in the Java memory model but not explicitly related to Java source programs nor the Java virtual machine. Some of these terms have been interpreted differently by various people. I have based my understanding of these terms on conversations with Guy Steele, Doug Lea and others.

A variable refers to a static variable of a loaded

(To appear in Concurrency: Practice and Experience) The Java Memory Model is Fatally Flawed

#### The Java Memory Model

Jeremy Manson and William Pugh Department of Computer Science University of Maryland, College Park College Park, MD {jmanson, pugh}@cs.umd.edu

#### ABSTRACT

This paper describes the new Java memory model, which has been revised as part of Java 5.0. The model specifies the legal behaviors for a multithreaded program; it defines the semantics of multithreaded Java programs and partially determines legal implementations of Java virtual machines and compilers.

The new Java model provides a simple interface for correctly synchronized programs - it guarantees sequential consistency to data-race-free programs. Its novel contribution is requiring that the behavior of incorrectly synchronized programs be bounded by a well defined notion of causality. The causality requirement is strong enough to respect the safety and security properties of Java and weak enough to allow standard compiler and hardware optimizations. To our knowledge, other models are either too weak because they do not provide for sufficient safety/security, or are too

Sarita V. Adve Department of Computer Science University of Illinois at Urbana-Champaign Urbana-Champaign, IL sadve@cs.uiuc.edu

Meanings of Programs]: Operational Semantics

General Terms: Design, Languages

Keywords: Concurrency, Java, Memory Model, Multithreading

#### 1. INTRODUCTION

The memory model for a multithreaded system specifies how memory actions (e.g., reads and writes) in a program will appear to execute to the programmer, and specifically, which value each read of a memory location may return. Every hardware and software interface of a system that admits multithreaded access to shared memory requires a memory model. The model determines the transformations that the system (compiler, virtual machine, or hardware) can apply to a program written at that interface. For example, given a program in machine language, the memory model for the machine language / hardware interface will determine the

Pugh uter Science College Park nd.edu

isting style of the specification will never be clear, d that attempts to patch the existing specification adding new rules will make even harder to unstand. If we decide to change the Java memory del, a completely new description of the memory del should be devised.

number of terms are used in the Java memory lel but not explicitly related to Java source prons nor the Java virtual machine. Some of these is have been interpreted differently by various le. I have based my understanding of these terms inversations with Guy Steele, Doug Lea and oth-

variable refers to a static variable of a loaded

(To appear in Concurrency: Practice and Experience) The Java Memory Model is Fatally Flawed

> Pugh uter Science I, College Park nd.edu

#### The Java Memory Model

Jeremy Manson and William Pugh Department of Computer Science University of Maryland, College Park College Park, MD {jmanson, pugh}@cs.umd.edu

#### ABSTRACT

This paper describes the new Java memory model, which has been revised as part of Java 5.0. The model specifies the legal behaviors for a multithreaded program; it defines the semantics of multithreaded Java programs and partially determines legal implementations of Java virtual machines and compilers.

The new Java model provides a simple interface for correctly synchronized programs – it guarantees sequential consistency to data-race-free programs. Its novel contribution is requiring that the behavior of incorrectly synchronized programs be bounded by a well defined notion of causality. The causality requirement is strong enough to respect the safety and security properties of Java and weak enough to allow standard compiler and hardware optimizations. To our knowledge, other models are either too weak because they do not provide for sufficient safety/security, or are too Dep Universi

#### Meanin Genera Keywor ing

1. If The

how m will ar which ery ha multi mode syste to a

a pr

mac

Java Memory Model Examples: Good, Bad and Ugly

David Aspinall and Jaroslav Ševčík

August 8, 2007

#### Abstract

We review a number of illustrative example programs for the Java Memory Model (JMM) [6, 3], relating them to the original design goals and giving intuitive explanations (which can be made precise). We consider good, bad and ugly examples. The good examples are allowed behaviours in the JMM, showing possibilities for non sequentially consistent executions and reordering optimisations. The bad examples are prohibited behaviours, which are clearly ruled out by the JMM. The ugly examples are most interesting: these are tricky cases which illustrate some problem areas for the current formulation of the memory model, where the anticipated design goals are not met. For some of these we mention possible fixes, drawing on knowledge we gained while formalising the memory model in the theorem prover leabelly [1]

(To appear in Concurrency: Practice and Experience) The Java Memory Model is Fatally Flawed

> Pugh uter Science College Park nd.edu

#### The Java Memory Model

Jeremy Manson and William Pugh Department of Computer Science University of Maryland, College Park College Park, MD {jmanson, pugh}@cs.umd.edu

Dep Universi

Java Memory Model Examples: d, Bad and Ugly

pinall and Jaroslav Ševčík

August 8, 2007

#### Abstract

illustrative example programs for the Java relating them to the original design goals ions (which can be made precise). We conmples. The good examples are allowed bepossibilities for non sequentially consistent nisations. The bad examples are prohibited uled out by the JMM. The ugly examples tricky cases which illustrate some probulation of the memory model, where the met. For some of these we mention possiwe gained while formalising the memory

#### On Validity of Program Transformations in the Java Memory Model

Jaroslav Ševčík and David Aspinall

LFCS, School of Informatics, University of Edinburgh

Abstract. We analyse the validity of several common program transformations in multi-threaded Java, as defined by the Java Memory Model (JMM) section of Chapter 17 of the Java Language Specification. The main design goal of the JMM was to allow as many optimisations as possible. However, we find that commonly used optimisations, such as common subexpression elimination, can introduce new behaviours and

ABS This has the le the s deter and c Threctly sisten is req progr The ( safety allow our ki

they c

**Atomic Operations** 

int x = 1; x++; double x = 1.0;

> monitorenter monitorexit

**Atomic Operations** 

int x = 1;

x++; double x = 1.0

monitorenter monitorexit Synchronizing Operations
volatile int x = 0;
 thread.start();
 thread.join();
 monitorenter
 monitorexit

**Atomic Operations** 

int x = 1;

x++; double x = 1.0

monitorenter monitorexit

Compiler Optimizations Reordering Eliminations (CSE) Introductions Synchronizing Operations
volatile int x = 0;
 thread.start();
 thread.join();
 monitorenter
 monitorexit

**Atomic Operations** 

int x = 1;

x++; double x = 1.0

monitorenter monitorexit

Compiler Optimizations Reordering Eliminations (CSE) Introductions volatile int x = 0; thread.start(); thread.join(); monitorenter monitorexit

Synchronizing Operations

VM Optimizations Just in time compilation Lock elision Biased Locking

#### On Validity of Program Transformations in the Java Memory Model

Jaroslav Ševčík and David Aspinall LFCS, School of Informatics, University of Edinburgh

Abstract. We analyse the validity of several common program transformations in multi-threaded Java, as defined by the Java Memory Model (JMM) section of Chapter 17 of the Java Language Specification. The main design goal of the JMM was to allow as many optimisations as possible. However, we find that commonly used optimisations, such as common subexpression elimination, can introduce new behaviours and so are invalid for Java. In this paper, we describe several kinds of transformations and explain the problems with a number of counterexamples. More positively, we also examine some valid transformations, and prove their validity. Our study contributes to the understanding of the JMM, and has the practical impact of revealing some cases where the Sun Hotspot JVM does not comply with the Java Memory Model.

#### Introduction

Although programmers generally assume an interleaved semantics, the Java Language Specification [11] defines more relaxed semantics, which is called the Java Memory Model [12, 19]. The reasons for having a weaker semantics become apparent from the following example:

 $\cap$ 

r1 = r2 = r3 = 1

 $\cap$ 

r1 = r2 = r3 = 1











Do we forbid these simple optimizations?

If we consider SC executions this program has **no data races** 

$$r_1 = r_3 = 42?$$

If we consider SC executions this program has **no data races** 

The value 42 appears nowhere in the program

$$r_1 = r_3 = 42?$$

If we consider SC executions this program has **no data races** 

The value 42 appears nowhere in the program

Explicitly forbidden by the JMM and C++11

 $\sim$ 



- Data Race
  - concurrent memory accesses on a single variable
  - at least one is a write

$$x = 1 \parallel x = 2$$
  $x = 1 \parallel r = x$ 

- Data Race
  - concurrent memory accesses on a single variable
  - at least one is a write

$$x = 1 \parallel x = 2 \qquad x = 1 \parallel r = x$$

- Data Race Freedom
  - Every SC execution is free of data races

- Data Race
  - concurrent memory accesses on a single variable
  - at least one is a write

$$x = 1 \parallel x = 2 \qquad x = 1 \parallel r = x$$

- Data Race Freedom
  - Every SC execution is free of data races
- DRF Guarantee
  - DRF programs do not exhibit relaxed behaviors

#### **Relaxed Executions**

#### **SC Executions**

#### **DRF Executions**

Relaxed Executions SC Executions DRF Executions Relaxed Executions SC Executions DRF Executions

• Remark: To check if a program is DRF we can use SC tools


- Remark: To check if a program is DRF we can use SC tools
- Corollary: If a program is DRF we can use any SC too



- Remark: To check if a program is DRF we can use SC tools
- Corollary: If a program is DRF we can use any SC too
- Verification Strategy:
  - 1. Check that the program is DRF
  - 2. Verify the program using any SC tool

#### Memory Models in Practice



# What if we define DRF to be the memory model?

- Someone has to implement the JVM
- Someone has to implement java.util.concurrent
- There are no complete DRF detection algorithms
- Performance critical lock-free data structures
- What happens if despite the programmers effort there are data races?

• Catch-fire semantics for data races

- Catch-fire semantics for data races
- Variables declared atomic cannot race std::atomic<int> x;
  - Memory Ordering Decorations: x.load(mo), x.store(v, mo) x.store(5, memory\_order\_release) x.load(memory\_order\_seq\_cst)
    - Release/Acquire: Message Passing
    - Release/Consume: Message Passing modulo dependencies
    - Relaxed: simply not a data race

bool t1; bool t2; int counter = 1; t1 = true; if (!t2) counter--;
bool t1; t2; t1 = true; if (!t1) counter--;

bool t1; bool t2; int counter = 1; t1 = true; if (!t2) counter--;
t2 = true; if (!t1) counter--;



atomic<bool> t1; atomic<bool> t2; int counter = 1; t1.store(true, relaxed); || t2.store(true, relaxed); if (!t2.load(relaxed)) || if (!t1.load(relaxed)) counter--;



atomic<bool> t1; atomic<bool> t2; int counter = 1; t1.store(true, release); || t2.store(true, release); if (!t2.load(acquire)) || if (!t1.load(acquire)) counter--;

atomic<bool> t1; atomic<bool> t2; int counter = 1; t1.store(true, release); || t2.store(true, release); if (!t2.load(acquire)) || if (!t1.load(acquire)) counter--;



atomic<bool> t1; atomic<bool> t2; int counter = 1; t1.store(true, seq\_cst); if (!t2.load(seq\_cst)) counter--;

#### C++11 demo



# The Java Memory Model (JMM)

#### The JMM by Intimidation

$$\begin{array}{cccc} x=0\\ \texttt{lock}(\ell); & \texttt{lock}(\ell); & \texttt{lock}(\ell); \\ x++; & \parallel & x++; & \parallel & x++; & \parallel & \texttt{while}(1) \ \texttt{print}(x); \\ \texttt{unlock}(\ell); & \texttt{unlock}(\ell); & \texttt{unlock}(\ell); \end{array}$$

0 0 0 1 1 2 2 2 2 3 3 ...

### The JMM by Intimidation



# The JMM by Intimidation



Of course, no reasonable implementation would do that!

- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$ |              |  |
|-------------------------------|--------------|--|
| 1:lock m                      | 5:lock m     |  |
| 2: $r1 = x$                   | 6: $r^2 = y$ |  |
| 3: $y = r1$                   | 7: $x = r^2$ |  |
| 4: unlock m                   | 8: unlock m  |  |

- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$                                                                                                   |              |  |
|---------------------------------------------------------------------------------------------------------------------------------|--------------|--|
| 1:lock m                                                                                                                        | 5:lock m     |  |
| 2: $r1 = x$                                                                                                                     | 6: $r^2 = y$ |  |
| 3: $y = r1$                                                                                                                     | 7: $x = r^2$ |  |
| 4: unlock m                                                                                                                     | 8: unlock m  |  |
| po: $1 \xrightarrow{po} 2 \xrightarrow{po} 3 \xrightarrow{po} 4$ ; $5 \xrightarrow{po} 6 \xrightarrow{po} 7 \xrightarrow{po} 8$ |              |  |

- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$                                                                                                    |              |  |
|----------------------------------------------------------------------------------------------------------------------------------|--------------|--|
| 1:lock m                                                                                                                         | 5:lock m     |  |
| 2: $r1 = x$                                                                                                                      | 6: $r^2 = y$ |  |
| 3: $y = r1$                                                                                                                      | 7: $x = r^2$ |  |
| 4: unlock m 8: unlock m                                                                                                          |              |  |
| po: $1 \xrightarrow{po} 2 \xrightarrow{po} 3 \xrightarrow{po} 4; 5 \xrightarrow{po} 6 \xrightarrow{po} 7 \xrightarrow{po} 8$     |              |  |
| so: $1 \xrightarrow{so} 4 \xrightarrow{so} 5 \xrightarrow{so} 8 \vee 5 \xrightarrow{so} 8 \xrightarrow{so} 1 \xrightarrow{so} 4$ |              |  |

- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$                                                                                                    |              |  |
|----------------------------------------------------------------------------------------------------------------------------------|--------------|--|
| 1:lock m                                                                                                                         | 5:lock m     |  |
| 2: $r1 = x$                                                                                                                      | 6: $r^2 = y$ |  |
| 3: $y = r1$                                                                                                                      | 7: $x = r^2$ |  |
| 4: unlock m                                                                                                                      | 8: unlock m  |  |
| po: $1 \xrightarrow{po} 2 \xrightarrow{po} 3 \xrightarrow{po} 4; 5 \xrightarrow{po} 6 \xrightarrow{po} 7 \xrightarrow{po} 8$     |              |  |
| so: $1 \xrightarrow{so} 4 \xrightarrow{so} 5 \xrightarrow{so} 8 \lor 5 \xrightarrow{so} 8 \xrightarrow{so} 1 \xrightarrow{so} 4$ |              |  |
| sw: $4 \xrightarrow{sw} 5 \lor 8 \xrightarrow{sw} 1$                                                                             |              |  |

- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$                                                                                                |                                                                                                                                                        |  |
|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| 1:lock m                                                                                                                     | 5:lock m                                                                                                                                               |  |
| 2: $r1 = x$                                                                                                                  | 6: $r^2 = y$                                                                                                                                           |  |
| 3: $y = r1$                                                                                                                  | 7: $x = r^2$                                                                                                                                           |  |
| 4: unlock m                                                                                                                  | 8: unlock m                                                                                                                                            |  |
| po: $1 \xrightarrow{po} 2 \xrightarrow{po} 3 \xrightarrow{po} 4; 5 \xrightarrow{po} 6 \xrightarrow{po} 7 \xrightarrow{po} 8$ |                                                                                                                                                        |  |
| so: $1 \xrightarrow{so} 4 \xrightarrow{so} 5 \xrightarrow{s}$                                                                | $\stackrel{o}{\rightarrow} 8 \lor 5 \stackrel{so}{\rightarrow} 8 \stackrel{so}{\rightarrow} 1 \stackrel{so}{\rightarrow} 4$                            |  |
| sw: $4 \xrightarrow{sw} 5 \lor 8 \xrightarrow{sw} 1$                                                                         |                                                                                                                                                        |  |
| hb: $1 \xrightarrow{hb} 2 \xrightarrow{hb} 3^{\frac{h}{2}}$                                                                  | $\stackrel{b}{\rightarrow} 4 \stackrel{hb}{\rightarrow} 5 \stackrel{hb}{\rightarrow} 6 \stackrel{hb}{\rightarrow} 7 \stackrel{hb}{\rightarrow} 8 \lor$ |  |
| hb: $5 \xrightarrow{hb} 6 \xrightarrow{hb} 7 \xrightarrow{h}$                                                                | $\stackrel{b}{\rightarrow} 8 \stackrel{hb}{\rightarrow} 1 \stackrel{hb}{\rightarrow} 2 \stackrel{hb}{\rightarrow} 3 \stackrel{hb}{\rightarrow} 4$      |  |

- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$                                 |                                                                                                                                                        |  |
|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| 1:lock m                                                      | 5:lock m                                                                                                                                               |  |
| 2: $r1 = x$                                                   | 6: $r^2 = y$                                                                                                                                           |  |
| 3: $y = r1$                                                   | 7: $x = r^2$                                                                                                                                           |  |
| 4: unlock m                                                   | 8: unlock m                                                                                                                                            |  |
| po: $1 \xrightarrow{po} 2 \xrightarrow{po} 3^{p}$             | $\stackrel{o}{\rightarrow}$ 4; 5 $\stackrel{po}{\rightarrow}$ 6 $\stackrel{po}{\rightarrow}$ 7 $\stackrel{po}{\rightarrow}$ 8                          |  |
| so: $1 \xrightarrow{so} 4 \xrightarrow{so} 5 \xrightarrow{s}$ | $\stackrel{o}{\rightarrow} 8 \lor 5 \stackrel{so}{\rightarrow} 8 \stackrel{so}{\rightarrow} 1 \stackrel{so}{\rightarrow} 4$                            |  |
| sw: $4 \xrightarrow{sw} 5 \vee 8 \xrightarrow{su}$            | <sup>₩</sup> 1                                                                                                                                         |  |
| hb: $1 \xrightarrow{hb} 2 \xrightarrow{hb} 3^{\frac{h}{2}}$   | $\stackrel{b}{\rightarrow} 4 \stackrel{hb}{\rightarrow} 5 \stackrel{hb}{\rightarrow} 6 \stackrel{hb}{\rightarrow} 7 \stackrel{hb}{\rightarrow} 8 \lor$ |  |
| hb: $5 \xrightarrow{hb} 6 \xrightarrow{hb} 7 \xrightarrow{h}$ | $\stackrel{b}{\rightarrow} 8 \stackrel{hb}{\rightarrow} 1 \stackrel{hb}{\rightarrow} 2 \stackrel{hb}{\rightarrow} 3 \stackrel{hb}{\rightarrow} 4$      |  |

Theorem: The JMM respects the DRF guarantee



- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$                                                                                                |                                                                                                                                                        |  |
|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| 1:lock m                                                                                                                     | 5:lock m                                                                                                                                               |  |
| 2: $r1 = x$                                                                                                                  | 6: $r^2 = y$                                                                                                                                           |  |
| 3: $y = r1$                                                                                                                  | 7: $x = r^2$                                                                                                                                           |  |
| 4: unlock m                                                                                                                  | 8: unlock m                                                                                                                                            |  |
| po: $1 \xrightarrow{po} 2 \xrightarrow{po} 3 \xrightarrow{po} 4; 5 \xrightarrow{po} 6 \xrightarrow{po} 7 \xrightarrow{po} 8$ |                                                                                                                                                        |  |
| so: $1 \xrightarrow{so} 4 \xrightarrow{so} 5 \xrightarrow{s}$                                                                | $\stackrel{o}{\rightarrow} 8 \lor 5 \stackrel{so}{\rightarrow} 8 \stackrel{so}{\rightarrow} 1 \stackrel{so}{\rightarrow} 4$                            |  |
| sw: $4 \xrightarrow{sw} 5 \vee 8 \xrightarrow{sw}$                                                                           | <sup>₩</sup> 1                                                                                                                                         |  |
| hb: $1 \xrightarrow{hb} 2 \xrightarrow{hb} 3^{\frac{h}{2}}$                                                                  | $\stackrel{b}{\rightarrow} 4 \stackrel{hb}{\rightarrow} 5 \stackrel{hb}{\rightarrow} 6 \stackrel{hb}{\rightarrow} 7 \stackrel{hb}{\rightarrow} 8 \lor$ |  |
| hb: $5 \xrightarrow{hb} 6 \xrightarrow{hb} 7 \xrightarrow{h}$                                                                | $\stackrel{b}{\rightarrow} 8 \stackrel{hb}{\rightarrow} 1 \stackrel{hb}{\rightarrow} 2 \stackrel{hb}{\rightarrow} 3 \stackrel{hb}{\rightarrow} 4$      |  |

Theorem: The JMM respects the DRF guarantee



#### Theorem:

Independent non-synchronizing statements can always be reordered

- happens-before is enforced through locks or volatile variables
- reads can see any non-hb related write, or
- any immediate predecessor in hb

| $\mathbf{x} = \mathbf{y} = 0$                                                                                                    |                                                                                                                                                        |  |
|----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| 1:lock m                                                                                                                         | 5:lock m                                                                                                                                               |  |
| 2: $r1 = x$                                                                                                                      | 6: $r^2 = y$                                                                                                                                           |  |
| 3: $y = r1$                                                                                                                      | 7: $x = r^2$                                                                                                                                           |  |
| 4: unlock m                                                                                                                      | 8: unlock m                                                                                                                                            |  |
| po: $1 \xrightarrow{po} 2 \xrightarrow{po} 3 \xrightarrow{po} 4; 5 \xrightarrow{po} 6 \xrightarrow{po} 7 \xrightarrow{po} 8$     |                                                                                                                                                        |  |
| so: $1 \xrightarrow{so} 4 \xrightarrow{so} 5 \xrightarrow{so} 8 \lor 5 \xrightarrow{so} 8 \xrightarrow{so} 1 \xrightarrow{so} 4$ |                                                                                                                                                        |  |
| sw: $4 \xrightarrow{sw} 5 \lor 8 \xrightarrow{sw} 1$                                                                             |                                                                                                                                                        |  |
| hb: $1 \xrightarrow{hb} 2 \xrightarrow{hb} 3^{\frac{h}{2}}$                                                                      | $\stackrel{b}{\rightarrow} 4 \stackrel{hb}{\rightarrow} 5 \stackrel{hb}{\rightarrow} 6 \stackrel{hb}{\rightarrow} 7 \stackrel{hb}{\rightarrow} 8 \lor$ |  |
| hb: $5 \xrightarrow{hb} 6 \xrightarrow{hb} 7 \xrightarrow{h}$                                                                    | $\stackrel{b}{\rightarrow} 8 \stackrel{hb}{\rightarrow} 1 \stackrel{hb}{\rightarrow} 2 \stackrel{hb}{\rightarrow} 3 \stackrel{hb}{\rightarrow} 4$      |  |

Theorem: The JMM respects the DRF guarantee



Sevcik, Aspinall 2008

#### Theorem:

Independent non-synchronizing statements can always be reordered

```
x = 1;
synchronized(this) {
    y = 1;
    r1 = x;
}
r2 = y;
```





















# **Compiler Optimizations**

| Transformation                           | SC           | JMM          | DRF          |
|------------------------------------------|--------------|--------------|--------------|
| Trace-preserving transformations         | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| Reordering normal memory accesses        | ×            | ×(√)         | $\checkmark$ |
| Redundant read after read elimination    | √*           | ×            | $\checkmark$ |
| Redundant read after write elimination   | √*           | $\checkmark$ | $\checkmark$ |
| Irrelevant read elimination              | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| Irrelevant read introduction             | $\checkmark$ | ×            | ×            |
| Redundant write before write elimination | √*           | $\checkmark$ | $\checkmark$ |
| Redundant write after read elimination   | √*           | ×            | $\checkmark$ |
| Roach-motel reordering                   | $\checkmark$ | ×            | $\checkmark$ |
| External action reordering               | ×            | ×            | $\checkmark$ |

 $\checkmark$  – correct,  $\times$  – incorrect,  $\checkmark$  \* – correct only for adjacent memory accesses.

Validity of Program Transformations in the Java Memory Model Sevcik & Aspinall'2008


• Unlock

Executions 
$$E = \langle P, A, \xrightarrow{po}, \xrightarrow{so}, W, V, \xrightarrow{sw}, \xrightarrow{hb} \rangle$$

$$\begin{array}{l} \operatorname{Program} \\ & \swarrow \\ \end{array}$$
Executions  $E = \langle P, A, \overset{po}{\longrightarrow}, \overset{so}{\longrightarrow}, W, V, \overset{sw}{\longrightarrow}, \overset{hb}{\longrightarrow} \rangle$ 

Program Actions  

$$\uparrow$$
  $\uparrow$   
Executions  $E = \langle P, A, \xrightarrow{po}, \xrightarrow{so}, W, V, \xrightarrow{sw}, \xrightarrow{hb} \rangle$ 

Program Actions  

$$\uparrow$$
  $\uparrow$   
Executions  $E = \langle P, A, \xrightarrow{po}, \xrightarrow{so}, W, V, \xrightarrow{sw}, \xrightarrow{hb} \rangle$   
Program Order

Actions 
$$a = \langle t, k, v, u \rangle$$



Actions 
$$a = \langle t, k, v, u \rangle$$



Actions 
$$a = \langle t, k, v, u \rangle$$



Actions 
$$a = \langle t, k, v, u \rangle$$



Actions 
$$a = \langle t, k, v, u \rangle$$



Well-Formedness

- for every read r, V(W(r)) = r.v
- $\xrightarrow{po}$  is consistent with  $\xrightarrow{so}$  ( $\xrightarrow{hb}$  is a partial order)
- lock are consistent with mutual exclusion
- the semantics of single threads is respected
- synchronization is consistent
- happens-before is consistent

Causality: we need to justify each of the actions of the execution

Commit actions of E step by step

1.  $C_i \subset A_i$ 2.  $\xrightarrow{hb_i} |_{C_i} = \xrightarrow{hb} |_{C_i}$ 3.  $\xrightarrow{so_i} |_{C_i} = \xrightarrow{so} |_{C_i}$ 4.  $V_i|_{C_i} = V|_{C_i}$ 5.  $\forall r \in Reads(A_i - C_{i-1}) : W_i(r) \xrightarrow{hb_i} r$ 6.  $W_i|_{C_{i-1}} = W|_{C_{i-1}}$ 7.  $\forall r \in Reads(C_i - C_{i-1}) : \{W_i(r), W(r)\} \subseteq C_{i-1}$ 

1.  $C_i \subset A_i$ 2.  $\xrightarrow{hb_i} |_{C_i} = \xrightarrow{hb} |_{C_i}$ 3.  $\xrightarrow{so_i} |_{C_i} = \xrightarrow{so} |_{C_i}$ 4.  $V_i|_{C_i} = V|_{C_i}$ 5.  $\forall r \in Reads(A_i - C_{i-1}) : W_i(r) \xrightarrow{hb_i} r$ 6.  $W_i|_{C_{i-1}} = W|_{C_{i-1}}$ 7.  $\forall r \in Reads(C_i - C_{i-1}) : \{W_i(r), W(r)\} \subseteq C_{i-1}$ 

**Causality Test Cases** 

# **Operational JMM?**

### **Generative Operational Semantics** for Relaxed Memory Models\*

Radha Jagadeesan, Corin Pitcher, and James Riely School of Computing, DePaul University

Abstract. The specification of the Java Memory Model (JMM) is phrased in terms of acceptors of execution sequences rather than the standard generative

view of operational semantics. This creates a mismatch with language-based techniques, such as simulation arguments and proofs of type safety. We describe a semantics for the IMM using standard programming language techniques that captures its full expressivity. For data-race-free programs, our model coincides with the JMM. For lockless programs, our model is more expressive than the JMM. The stratification properties required to avoid causality cycles are

derived, rather than mandated in the style of the JMM. The JMM is arguably non-canonical in its treatment of the interaction of data races and locks as it fails to validate roach-motel reorderings and various peephole optimizations. Our model differs from the JMM in these cases. We develop a theory of simulation and use it to validate the legality of the above optimizations

in any program context.

In the context of shared memory imperative programs, Sequential Consistency (SC) (Lamport 1979) enforces a global total order on memory operations that includes the program order of each individual thread in the program. SC may be realized by a traditional interleaving semantics where shared memory is represented as a map from locations to values. It has been observed that SC disables compiler optimizations such as reordering of independent statements. Despite arguments that SC does not impair efficiency (Kamil et al. 2005), this observation and others have motivated a body of work on relaxed memory models; Adve and Gharachorloo (1996) provide a tutorial

A first (conceptual, if not chronological) step in generalizing SC is to consider the Data Race Free (DRF) models. Informally, a program is DRF if no execution of the prointroduction with detailed bibliography. is which a write happens concurrently with another operation on

the programmer view of computation co-

### Partial solutions

**Remains elusive** 

# Implementing the JMM



### Given a Source Java Program P we expect



### Given a Source Java Program P we expect



### http://g.oswego.edu/dl/jmm/cookbook.html

### **The JSR-133 Cookbook for Compiler Writers**

by Doug Lea, with help from members of the JMM mailing list.

### dl@cs.oswego.edu.

Preface: Over the 10+ years since this was initially written, many processor and language memory model specifications and issues have become clearer and better understood. And many have not. While this guide is maintained to remain accurate, it is incomplete about some of these evolving details. For more extensive coverage, see especially the work of Peter Sewell and the <u>Cambridge Relaxed Memory Concurrency Group</u>

This is an unofficial guide to implementing the new Java Memory Model (JMM) specified by JSR-133. It provides at most brief backgrounds about why various rules exist, instead concentrating on their consequences for compilers and JVMs with respect to instruction reorderings, multiprocessor barrier instructions, and atomic operations. It includes a set of recommended recipes for complying to JSR-133. This guide is "unofficial" because it includes interpretations of particular processor properties and specifications. We cannot guarantee that the interpretations are correct. Also, processor specifications and implementations may change over time.

### Reorderings

For a compiler writer, the JMM mainly consists of rules disallowing reorderings of certain instructions that access fields (where "fields" include array elements) as well as monitors (locks).

### Volatiles and Monitors

The main JMM rules for volatiles and monitors can be viewed as a matrix with cells indicating that you cannot reorder instructions associated with particular sequences of bytecodes. This table is not itself the JMM specification; it is just a useful way of viewing its main consequences for compilers and runtime systems.

| Can Reorder   | 2nd operation |               |                |  |  |  |  |
|---------------|---------------|---------------|----------------|--|--|--|--|
| 1st operation | Normal Load   | Volatile Load | Volatile Store |  |  |  |  |
|               | Normal Store  | MonitorEnter  | MonitorExit    |  |  |  |  |
| Normal Load   |               |               | No             |  |  |  |  |
| Normal Store  |               |               | NO             |  |  |  |  |
| Volatile Load |               |               |                |  |  |  |  |

 A collection of informal arguments about what is possible under the JMM (Java) — *Roach-Motel Semantics* for Volatiles and Locks

| Can Reorder    | 2nd operation |               |                |  |  |  |
|----------------|---------------|---------------|----------------|--|--|--|
| 1st operation  | Normal Load   | Volatile Load | Volatile Store |  |  |  |
|                | Normal Store  | MonitorEnter  | MonitorExit    |  |  |  |
| Normal Load    |               |               | No             |  |  |  |
| Normal Store   |               |               |                |  |  |  |
| Volatile Load  | No            | No            | No             |  |  |  |
| MonitorEnter   | INO           | INO           | NO             |  |  |  |
| Volatile store |               | No            | No             |  |  |  |
| MonitorExit    |               | INO           | INO            |  |  |  |

 A collection of informal "recipes" indicating how disallow unwanted *reorderings* by adding generic barrier instructions

| Required<br>barriers          | 2nd operation  |                 |                                  |                                  |  |  |  |
|-------------------------------|----------------|-----------------|----------------------------------|----------------------------------|--|--|--|
| 1st operation                 | Normal<br>Load | Normal<br>Store | Volatile<br>Load<br>MonitorEnter | Volatile<br>Store<br>MonitorExit |  |  |  |
| Normal Load                   |                |                 |                                  | LoadStore                        |  |  |  |
| Normal Store                  |                |                 |                                  | StoreStore                       |  |  |  |
| Volatile Load<br>MonitorEnter | LoadLoad       | LoadStore       | LoadLoad                         | LoadStore                        |  |  |  |
| Volatile Store<br>MonitorExit |                |                 | StoreLoad                        | StoreStore                       |  |  |  |

• A final table indicating how to implement each of the generic barriers in the different architectures

| Processor | LoadStore                                     | LoadLoad              | StoreStore | StoreLoad                            | Data<br>dependency<br>orders loads? | Atomic<br>Conditional         | Other<br>Atomics     | Atomics<br>provide<br>barrier? |
|-----------|-----------------------------------------------|-----------------------|------------|--------------------------------------|-------------------------------------|-------------------------------|----------------------|--------------------------------|
| sparc-TSO | no-op                                         | no-op                 | no-op      | membar<br>(StoreLoad)                | yes                                 | CAS:<br>casa                  | swap,<br>ldstub      | full                           |
| x86       | no-op                                         | no-op                 | no-op      | mfence or<br>cpuid or<br>locked insn | yes                                 | CAS:<br>cmpxchg               | xchg,<br>locked insn | full                           |
| ia64      | <i>combine</i><br>with<br>st.rel or<br>ld.acq | ld.acq                | st.rel     | mf                                   | yes                                 | CAS:<br>cmpxchg               | xchg,<br>fetchadd    | target +<br>acq/rel            |
| arm       | dmb<br>(see below)                            | dmb<br>(see below)    | dmb-st     | dmb                                  | indirection only                    | LL/SC:<br>ldrex/strex         |                      | target<br>only                 |
| ррс       | lwsync<br>(see below)                         | lwsync<br>(see below) | lwsync     | hwsync                               | indirection only                    | LL/SC:<br>ldarx/stwcx         |                      | target<br>only                 |
| alpha     | mb                                            | mb                    | wmb        | mb                                   | no                                  | LL/SC:<br>ldx_l/stx_c         |                      | target<br>only                 |
| pa-risc   | no-op                                         | no-op                 | no-op      | no-op                                | yes                                 | <i>build<br/>from</i><br>ldcw | ldcw                 | (NA)                           |

• A final table indicating how to implement each of the generic barriers in the different architectures

|   | Processor | LoadStore                                            | LoadLoad              | StoreStore | StoreLoad                            | Data<br>dependency<br>orders loads? | Atomic<br>Conditional         | Other<br>Atomics     | Atomics<br>provide<br>barrier? |
|---|-----------|------------------------------------------------------|-----------------------|------------|--------------------------------------|-------------------------------------|-------------------------------|----------------------|--------------------------------|
| ſ | sparc-TSO | no-op                                                | no-op                 | no-op      | membar<br>(StoreLoad)                | yes                                 | CAS:<br>casa                  | swap,<br>ldstub      | full                           |
| ٤ | x86       | no-op                                                | no-op                 | no-op      | mfence or<br>cpuid or<br>locked insn | yes                                 | CAS:<br>cmpxchg               | xchg,<br>locked insn | full                           |
|   | ia64      | <i>combine</i><br><i>with</i><br>st.rel or<br>ld.acq | ld.acq                | st.rel     | mf                                   | yes                                 | CAS:<br>cmpxchg               | xchg,<br>fetchadd    | target +<br>acq/rel            |
|   | arm       | dmb<br>(see below)                                   | dmb<br>(see below)    | dmb-st     | dmb                                  | indirection<br>only                 | LL/SC:<br>ldrex/strex         |                      | target<br>only                 |
| { | ppc       | lwsync<br>(see below)                                | lwsync<br>(see below) | lwsync     | hwsync                               | indirection<br>only                 | LL/SC:<br>ldarx/stwcx         |                      | target<br>only                 |
|   | alpha     | mb                                                   | mb                    | wmb        | mb                                   | no                                  | LL/SC:<br>ldx_l/stx_c         |                      | target<br>only                 |
|   | pa-risc   | no-op                                                | no-op                 | no-op      | no-op                                | yes                                 | <i>build<br/>from</i><br>ldcw | ldcw                 | (NA)                           |











• We propagate the semantics of Power for Normal Variables

• We Give SC semantics to Volatiles and Locks



- Encode the different architectures: TSO, PPC
- Encode Cookbook Low and Cookbook High
- Prove simulation at every step
- Prove that Cookbook High is a correct implementation of JMM

# Power Barriers

- sync: Enforces ordering between all operations
  - In our pipeline no action can bypass a sync action of the same thread
  - It requires global consensus
- Iwsync: It enforces ordering between all operations except: wr Iwsync rd
  - It does not enforce consensus

$$\begin{bmatrix} x := 1; \end{bmatrix} \| \begin{bmatrix} y := 1; \end{bmatrix} \| \begin{bmatrix} r_1 := x; \\ \mathsf{lwsync}; \\ r_2 := y; \end{bmatrix} \| \begin{bmatrix} r_3 := y; \\ \mathsf{lwsync}; \\ r_4 := x; \end{bmatrix}$$
$$r_1 = r3 = 0 \& r_2 = r_4 = 1 \text{ is allowed}$$

• A final table indicating how to implement each of the generic barriers in the different architectures

| Processor | LoadStore                                     | LoadLoad              | StoreStore | StoreLoad                            | Data<br>dependency<br>orders loads? | Atomic<br>Conditional | Other<br>Atomics     | Atomics<br>provide<br>barrier? |
|-----------|-----------------------------------------------|-----------------------|------------|--------------------------------------|-------------------------------------|-----------------------|----------------------|--------------------------------|
| sparc-TSO | no-op                                         | no-op                 | no-op      | membar<br>(StoreLoad)                | yes                                 | CAS:<br>casa          | swap,<br>ldstub      | full                           |
| x86       | no-op                                         | no-op                 | no-op      | mfence or<br>cpuid or<br>locked insn | yes                                 | CAS:<br>cmpxchg       | xchg,<br>locked insn | full                           |
| ia64      | <i>combine</i><br>with<br>st.rel or<br>ld.acq | ld.acq                | st.rel     | mf                                   | yes                                 | CAS:<br>cmpxchg       | xchg,<br>fetchadd    | target +<br>acq/rel            |
| arm       | dmb<br>(see below)                            | dmb<br>(see below)    | dmb-st     | dmb                                  | indirection<br>only                 | LL/SC:<br>ldrex/strex |                      | target<br>only                 |
| ррс       | lwsync<br>(see below)                         | lwsync<br>(see below) | lwsync     | hwsync                               | indirection<br>only                 | LL/SC:<br>ldarx/stwcx |                      | target<br>only                 |
| alpha     | mb                                            | mb                    | wmb        | mb                                   | no                                  | LL/SC:<br>ldx_l/stx_c |                      | target<br>only                 |
| pa-risc   | no-op                                         | no-op                 | no-op      | no-op                                | yes                                 | build<br>from<br>ldcw | ldcw                 | (NA)                           |

# IRIW with Volatiles

$$v_{0} = v_{1} = 0 \text{ and } v_{0}, v_{1} \text{ are volatile}$$

$$\begin{bmatrix} v_{0} := 1; \end{bmatrix} \| \begin{bmatrix} v_{1} := 1; \end{bmatrix} \| \begin{bmatrix} r_{1} := v_{0}; \\ r_{2} := v_{1}; \end{bmatrix} \| \begin{bmatrix} r_{3} := v_{1}; \\ r_{4} := v_{0}; \end{bmatrix}$$

$$r_{1} = r_{3} = 0 \& r_{2} = r_{4} = 1 \text{ is allowed}$$

# IRIW with Volatiles

$$v_{0} = v_{1} = 0 \text{ and } v_{0}, v_{1} \text{ are volatile}$$

$$\begin{bmatrix} v_{0} := 1; \end{bmatrix} \| \begin{bmatrix} v_{1} := 1; \end{bmatrix} \| \begin{bmatrix} r_{1} := v_{0}; \\ r_{2} := v_{1}; \end{bmatrix} \| \begin{bmatrix} r_{3} := v_{1}; \\ r_{4} := v_{0}; \end{bmatrix}$$

$$r_{1} = r_{3} = 0 \& r_{2} = r_{4} = 1 \text{ is allowed}$$

# Surprises

- IRIW with volatiles is reproducible on a 32-core Power7 machine
  - With two different VM's
- We found bugs in the implementation of volatiles of the Fiji ahead of time compiler
- Many implementers of JVM's follow the Cookbook

# Guarantees


End of day 2