The Art of Multiprocessor Programming

Maurice Herlihy, Nir Shavit


Revised and updated with improvements conceived in parallel programming courses, The Art of Multiprocessor Programming is an authoritative guide to multicore programming. It introduces a higher-level set of software development skills than that needed for efficient single-core programming. The book provides comprehensive coverage of the new principles, algorithms, and tools necessary for effective multiprocessor programming; students and professionals alike will benefit from its thorough coverage of key multiprocessor programming issues. This revised edition incorporates much-demanded updates throughout, based on feedback and corrections reported from classrooms since 2008.

  • Learn the fundamentals of programming multiple threads accessing shared memory
  • Explore mainstream concurrent data structures and the key elements of their design, as well as synchronization techniques from simple locks to transactional memory systems
  • Visit the companion site and download source code, example Java programs, and materials to support and enhance the learning experience


Mentioned in questions and answers.

Do global pointers have a scope that exists between threads?

For instance, suppose I have two files, file1.c and file2.c:

file1.c:

#include <stdint.h>
#include <stddef.h>

uint64_t *g_ptr = NULL;

void modify_ptr(uint64_t **ptr) {
    // code to modify *ptr to point to a valid address
}

void read_from_addr(void) {
    // code which uses g_ptr to read values from the memory it points to
}

file2.c:

void read_from_addr(void); // defined in file1.c

void function2A(void) {
    read_from_addr();
}

So I have threadA which runs through file1.c and executes modify_ptr(&g_ptr) and also read_from_addr(). And then threadB runs, and it runs through file2.c executing function2A().

My question is: Does threadB see that g_ptr is modified? Or does it still see that it's pointing to NULL?

If that's not the case, what does it mean for a pointer to be global? And how do I ensure that this pointer is accessible between different threads?

Please let me know if I need to clarify anything. Thanks

This question is the textbook example of what makes concurrent programming difficult. A really thorough explanation could fill an entire book, as well as lots of articles of varying quality.

But we can summarize a little. A global variable is in a memory space visible to all the threads. (The alternative is thread-local storage, which only one thread can see.) So you would expect that if you have a global variable G, and thread A writes value x to it, then thread B will see x when it reads that variable later on. And in general, that is true -- eventually. The interesting parts are what happens before "eventually".

The biggest sources of trickiness are memory coherence and memory consistency.

Coherence describes what happens when thread A writes to G and thread B tries to read it at nearly the same moment. Imagine that threads A and B are on different processors (call the processors A and B as well, for simplicity). When A writes to a variable, there is a lot of circuitry between it and the memory that thread B sees. First, A will probably write to its own data cache, which will hold that value for a while before writing it back to main memory. Flushing the cache to main memory also takes time: there are signals that have to travel back and forth across wires, capacitors, and transistors, and a complicated conversation between the cache and the main memory unit. Meanwhile, B has its own cache, so when changes reach main memory, B may not see them right away, at least not until it refills the relevant cache line. All in all, it may be many microseconds before thread A's change is visible to B.

Consistency describes what happens when A writes to variable G and then variable H. If it reads back those variables, it will see the writes happening in that order. But thread B may see them in a different order, depending on whether H gets flushed from cache back to main RAM first. And what happens if both A and B write to G at the same time (by the wall clock), and then try to read back from it? Which value will they see?

Coherence and consistency are enforced on many processors with memory barrier operations. For example, the PowerPC has a sync instruction, which guarantees that any write made to memory before the sync will be visible to any read issued after it. On x86, much of this ordering happens automatically, and the rest is handled by the mfence instruction and by LOCK-prefixed read-modify-write operations.
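In Java, fences of this kind are exposed directly through java.lang.invoke.VarHandle (Java 9+). A minimal sketch, not taken from the answer above, of the usual release/acquire pairing:

```java
import java.lang.invoke.VarHandle;

class FenceDemo {
    static int data;
    static boolean ready;

    static void publish(int value) {
        data = value;
        VarHandle.releaseFence(); // the write to data cannot be reordered past this
        ready = true;
    }

    static int consume() {
        boolean r = ready;
        VarHandle.acquireFence(); // the read of data cannot be reordered before this
        return r ? data : -1;     // -1 means "not published yet"
    }
}
```

The fences only order the surrounding accesses; for a fully correct cross-thread handoff the ready flag itself would still need an atomic access mode (or volatile).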

Then you have the issue of compiler reordering. This is where the code

int foo( int *e, int *f, int *g, int *h) 
{
   *e = *g;
   *f = *h;
   // <-- another thread could theoretically write to g and h here
   return *g + *h ;
}

can be internally converted by the compiler into something more like

int bar( int *e, int *f, int *g, int *h) 
{
  int b = *h;
  int a = *g;
  *f = b ;
  int result = a + b;
  *e = a ;
  return result;
}

which could give you a completely different result if another thread performed a write at the point marked above! Also, notice how the writes occur in a different order in bar. This is the problem that volatile is supposed to solve: it prevents the compiler from caching the value of *g in a local, and instead forces it to reload that value from memory every time it sees *g.

As you can see, this is inadequate for enforcing memory coherence and consistency across many processors. It was really invented for cases where you had one processor that was trying to read from memory-mapped hardware -- like a serial port, where you want to look at a location in memory every n microseconds to see what value is currently on the wire. (That is really how I/O worked back when they invented C.)

What to do about this? Well, like I said, there are whole books on the subject. But the short answer is that you probably want to use the facilities your operating system / runtime platform provide for synchronized memory.

For example, Windows provides the Interlocked memory access APIs to give you a clear way of communicating memory between threads A and B. GCC exposes similar __sync/__atomic built-ins. Intel's Threading Building Blocks library gives you a nice interface for x86/x64 platforms, and the C++11 thread support library and std::atomic provide portable facilities as well.
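In Java specifically, the everyday face of these facilities is the java.util.concurrent.atomic package; a minimal sketch:

```java
import java.util.concurrent.atomic.AtomicLong;

class HitCounter {
    private final AtomicLong hits = new AtomicLong();

    // Atomic read-modify-write: compiles down to a LOCK-prefixed instruction
    // on x86 (or an LL/SC loop elsewhere), so both atomicity and cross-thread
    // visibility are handled for you.
    long increment() {
        return hits.incrementAndGet();
    }

    long get() {
        return hits.get();
    }
}
```

A thread that calls get() will see the updated value, with none of the cache-flushing bookkeeping described above left to you.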

I am reading Effective Java and in Chapter 10: Concurrency; Item 66: Synchronize access to shared mutable data, there is some code like this:

import java.util.concurrent.TimeUnit;

public class StopThread {
    private static boolean stopRequested;

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stopRequested);
        Thread backgroundThread = new Thread(new Runnable() {
            @Override
            public void run() {
                int i = 0;
                while (!stopRequested) {
                    i++;
                }
                System.out.println("done");
            }
        });
        backgroundThread.start();
        TimeUnit.SECONDS.sleep(1);
        stopRequested = true;
    }
}

First, I thought the thread should run for one second and then stop, since stopRequested is set to true afterwards. However, the program never stops; it never prints done. The author said

while (!stopRequested)
    i++;

will be transformed into this:

if (!stopRequested)
     while(true)
         i++;

Could someone explain this to me?

And another thing I find is that if I change the program to this:

import java.util.concurrent.TimeUnit;

public class StopThread {
    private static boolean stopRequested;

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stopRequested);
        Thread backgroundThread = new Thread(new Runnable() {
            @Override
            public void run() {
                int i = 0;
                while (!stopRequested) {
                    i++;
                    System.out.println(i);
                }
                System.out.println("done");
            }
        });
        backgroundThread.start();
        TimeUnit.SECONDS.sleep(1);
        stopRequested = true;
    }
}

The program runs 1 second and stops as expected. What's the difference here?

I think you are making Joshua Bloch (the author of that great book) say something that he did not say :-). To be precise, the book says the following (emphasis mine):

In the absence of synchronization, it's quite acceptable for the virtual machine to transform this code:

while (!done)
  i++;

into this code:

if (!done)
  while (true)
    i++;

To understand what he means (it's rather tough to explain it in a way better than he has himself done in pages 261-264, but I will try. Sorry Josh!) you should first try to run this program verbatim and see what is happening. With multithreading, anything is possible, but here's what I did:

  1. Coded up StopThread as is.
  2. Ran it on my Linux computer with JRE 1.8.0_72.
  3. It simply hung! So, the behavior has not changed from what he described.
  4. Then I took the 'thread dump' to see what is happening. You can simply send a kill -3 signal to the running JVM pid to see what the threads are doing. Here's what I observed (relevant portion of the thread dump):
"DestroyJavaVM" #10 prio=5 os_prio=0 tid=0x00007fd678009800 nid=0x1b35 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Thread-0" #9 prio=5 os_prio=0 tid=0x00007fd6780f6800 nid=0x1b43 runnable [0x00007fd64b5be000]
   java.lang.Thread.State: RUNNABLE
  at StopThread$1.run(StopThread.java:14)
  at java.lang.Thread.run(Thread.java:745)

"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007fd6780c9000 nid=0x1b41 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

As you can see, the background thread that we started is alive and doing something. I then looked at the diagnostic tool top, and here is what it shows:

(screenshot: top command output)

Can you see that one of my CPUs (it's a quad-core machine) is fully busy (100%!), and that it is the java process keeping it busy? Isn't that puzzling? Well, it is somewhat puzzling. When a CPU is busily doing something you don't understand, one very likely cause is that it is tirelessly checking the contents of some memory location. If you connect the dots here, that location is the stopRequested variable, whose value it is constantly reading. So, effectively, the CPU reads the value of the boolean, finds it false, and goes straight back to check whether it has changed. It never finds that it has (the program is still hanging on my machine as I write this :-)).

You'll say ... didn't the main thread (which, by the way, is long gone, since it does not appear in the thread dump) set stopRequested = true?

Yes, it did!

Naturally, you would then ask: why doesn't Thread-0 see it?

And therein lies the clue. In the presence of a data race, the value that one thread writes is not guaranteed to be visible to another thread that reads it.

Now we look at the declaration of that data, that variable that shows this peculiar behavior:

private static boolean stopRequested;

is what it is! This declaration is underspecified as far as its treatment by the various parties involved (the compiler, the just-in-time compiler and its optimizations, the hardware caches ...) is concerned. In the presence of such underspecification, anything may happen. In particular, the value the main thread (thought it) wrote may never actually be written to main memory for Thread-0 to read, sending it into an infinite loop.

Thus, this is a visibility issue. Without enough synchronization, it is not guaranteed that the value written by a thread will be seen by another thread.
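The fix the book goes on to give is one word: declare the field volatile (or synchronize both the read and the write), which supplies exactly the missing visibility guarantee. A sketch:

```java
import java.util.concurrent.TimeUnit;

class StopThread {
    // volatile: a write by one thread is visible to subsequent reads
    // by every other thread, so the loop below must observe the stop
    private static volatile boolean stopRequested;

    public static void main(String[] args) throws InterruptedException {
        Thread backgroundThread = new Thread(() -> {
            int i = 0;
            while (!stopRequested)
                i++;
            System.out.println("done");
        });
        backgroundThread.start();
        TimeUnit.SECONDS.sleep(1);
        stopRequested = true;     // now guaranteed to reach Thread-0
        backgroundThread.join();  // returns promptly: the loop really exits
    }
}
```

With this one-word change, the program runs for about a second, prints done, and exits.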

Does that explain it? For more details, we all need a better understanding of modern hardware. An excellent resource for this is The Art of Multiprocessor Programming by Herlihy and Shavit. The book helps a software engineer understand the intricacies of the hardware, and also explains why multithreading is so hard.

I'm reading chapter 2 of "The Art of Multiprocessor Programming" and I'm confused about Filter algorithm which looks like so:

class Filter implements Lock {
    int n;        // number of threads
    int[] level;
    int[] victim;

    public Filter(int n) {
        this.n = n;
        level = new int[n];   // level[i] == 0 for all i initially
        victim = new int[n];  // victim[i] used for i in 1..n-1
    }

    public void lock() {
        int me = ThreadID.get();
        for (int i = 1; i < n; i++) { // attempt level i
            level[me] = i;
            victim[i] = me;
            // spin while conflicts exist:
            // while ((∃ k != me) level[k] >= i && victim[i] == me) {};
            while (conflictExists(me, i) && victim[i] == me) {};
        }
    }

    private boolean conflictExists(int me, int i) {
        for (int k = 0; k < n; k++)
            if (k != me && level[k] >= i)
                return true;
        return false;
    }

    public void unlock() {
        int me = ThreadID.get();
        level[me] = 0;
    }
}

What looks strange to me is that the level and victim arrays are not declared volatile. Prior to this algorithm, the author presented the less general Peterson algorithm, where the variables are declared like so:

private volatile boolean[] flag = new boolean[2];
private volatile int victim;

So my question is why in a more general algorithm we do not specify level and victim as volatile?

Firstly, volatile is like final or static in that it applies only to the field itself, not to the array object it references. E.g.

volatile int[] level;

means that writes to level, not to level[0], are volatile.

In fact there is no way to do this in plain Java, which is why AtomicIntegerArray uses Unsafe under the hood to perform volatile, thread-safe operations on array elements.

In short, the only real solution is to use AtomicIntegerArray (or Unsafe directly).
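A sketch of the Filter lock ported onto AtomicIntegerArray; thread ids 0..n-1 are passed in explicitly here, since the book's ThreadID helper is not part of the JDK:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class AtomicFilter {
    private final int n;
    private final AtomicIntegerArray level;  // element accesses have volatile semantics
    private final AtomicIntegerArray victim;

    AtomicFilter(int n) {
        this.n = n;
        level = new AtomicIntegerArray(n);   // all zero, i.e. "not interested"
        victim = new AtomicIntegerArray(n);
    }

    void lock(int me) {
        for (int i = 1; i < n; i++) {
            level.set(me, i);
            victim.set(i, me);
            // spin while another thread is at my level or higher and I am the victim
            while (victim.get(i) == me && higherConflict(me, i)) { }
        }
    }

    private boolean higherConflict(int me, int i) {
        for (int k = 0; k < n; k++)
            if (k != me && level.get(k) >= i)
                return true;
        return false;
    }

    void unlock(int me) {
        level.set(me, 0);
    }
}
```

Because get and set on an AtomicIntegerArray have volatile semantics per element, this restores exactly the guarantees the plain int[] version silently drops.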

I've spent some time trying to understand how mutexes are implemented in several languages. There are multiple links describing the topic (*), but if I understand correctly, all the hardware provides are some atomic operations that may help to decide whose turn it is.

In software this is always used for busy waiting (try CAS or test-and-set, and spin in a while loop if not successful), but how does the scheduler know that it should take the thread off the CPU, given that all it does is wait? Is there some OS-provided support that, for example, Java synchronization uses to signal "I am blocked, please let other threads run instead"? I think there is, since busy-waiting is presented as an alternative to lock(), so they should not be the same.

*Source:

That's a book-level topic. Here's the book:

The Art of Multiprocessor Programming, Maurice Herlihy and Nir Shavit, ISBN-13: 978-0123973375

https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376/


And actually, here's another because there's more to user-level mutexes as provided by an operating system than just using the hardware primitives. User-level mutexes are intimately tied to the operating system's scheduling algorithms.

Understanding the Linux Kernel, Daniel P. Bovet and Marco Cesati, ISBN-13: 978-0596005658

http://www.amazon.com/Understanding-Linux-Kernel-Third-Daniel/dp/0596005652/
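To make the connection concrete: in Java, the OS-level "please deschedule me" facility surfaces as LockSupport.park()/unpark(). A sketch of the usual spin-then-block shape (the class name and spin count are my own, not from either book):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class SpinThenParkLock {
    private final AtomicBoolean held = new AtomicBoolean(false);
    private final Queue<Thread> waiters = new ConcurrentLinkedQueue<>();

    void lock() {
        // Phase 1: optimistic busy-wait, cheap if the lock is released soon
        for (int i = 0; i < 1000; i++)
            if (held.compareAndSet(false, true))
                return;
        // Phase 2: give the CPU back; the OS takes us off the run queue
        Thread me = Thread.currentThread();
        waiters.add(me);
        while (!held.compareAndSet(false, true))
            LockSupport.park(); // sleeps until unparked (or spuriously), then retries
        waiters.remove(me);
    }

    void unlock() {
        held.set(false);
        Thread next = waiters.peek();
        if (next != null)
            LockSupport.unpark(next); // ask the scheduler to make it runnable again
    }
}
```

Real implementations (futexes on Linux, java.util.concurrent's AbstractQueuedSynchronizer) are far more careful about fairness and lost wakeups, but the division of labor is the same: atomic instruction first, scheduler second.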

I want to run a multithreaded program on massive data. I usually create a class which is callable (or runnable) and pass the data needed for the process to the class.

import java.util.concurrent.Callable;

public class CallableTrainer implements Callable<PredictorResult> {

    dataType data;

    CallableTrainer(dataType massiveData) {
        this.data = massiveData;
    }

    @Override
    public PredictorResult call() throws Exception {
        // do something and return ...
    }
}

Based on the above implementation, I assume that the 'massiveData' is always copied for each thread (right?) If this is true, I am wasting lots of memory by copying this data for each thread. Is there any way to share the data between threads?

I assume that the 'massiveData' is always copied for each thread (right?) If this is true ...

Nope, false. Only the reference to massiveData is copied.

Java doesn't do magic copies of non-primitive types. If you want to copy something you have to do it explicitly.

If you didn't already know that, I'm guessing you're going to run into all sorts of other problems when you write this multithreaded code. For example, unless these threads only read massiveData, you really need some sort of synchronization or atomicity guarantee on any updates you make; otherwise you're going to end up with garbage.
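A sketch with an ExecutorService that makes the sharing visible: both tasks capture a reference to one and the same list, and nothing is copied (the names here are hypothetical stand-ins for the question's dataType):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SharedDataDemo {
    // Each task sums the same list; only the reference is captured by the lambda
    static int sumInTwoThreads(List<Integer> massiveData) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<Integer> task =
                () -> massiveData.stream().mapToInt(Integer::intValue).sum();
            Future<Integer> a = pool.submit(task);
            Future<Integer> b = pool.submit(task);
            return a.get() + b.get(); // both tasks read the identical list object
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the list is only read, this is safe without further locking; the moment the tasks start writing to it, the synchronization caveat above applies.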

Here's a good book on the topic (with Java examples): The Art of Multiprocessor Programming

So I have a script that accepts and processes requests from other scripts and/or applications. However, one of the tasks my script has to do is assign a unique, sequential ID to each request.

For example, let's say that application A sends 1000 requests to my script and, at the same time, application B sends 500 requests. I have to give them 1500 unique, sequential numbers, like 2001~3500.

The order between them, however, does not matter, so I can assign numbers like this:

#2001 for 1st request from A (henceforth, A1)
#2002 for A2
#2003 for B1
#2004 for A3
#2005 for B2
...and so on...

I've tried creating a file that stores that number, plus a separate lock file, with a function like this:

private function get_last_id()
{
    // Check if lock file exists...
    while (file_exists("LAST_ID_LOCKED")) {
        // Wait a little bit before checking again
        usleep(1000);
    }

    // Create the lock file
    touch("LAST_ID_LOCKED");

    // Create the ID file for the first time if required
    if (!file_exists("LAST_ID_INDICATOR")) {
        file_put_contents("LAST_ID_INDICATOR", 0);
    }

    // Get the last ID
    $last_id = file_get_contents("LAST_ID_INDICATOR");
    // Update the last ID
    file_put_contents("LAST_ID_INDICATOR", $last_id + 1);

    // Delete the lock file
    unlink("LAST_ID_LOCKED");

    return $last_id;
}

This code, however, has a race condition: if I send those 1500 requests, quite a few IDs go missing (e.g. the counter only reaches 3211 instead of 3500).

I've also tried using flock like this, but to no avail:

private function get_last_id()
{
    $f = fopen("LAST_ID_INDICATOR", "rw");

    while (true) {
        if (flock($f, LOCK_SH)) {
            $last_id = fread($f, 8192);
            flock($f, LOCK_UN);
            fclose($f);
            break;
        }
        usleep($this->config["waiting_time"]);
    }

    $f = fopen("LAST_ID_INDICATOR", "rw");

    while (true) {
        if (flock($f, LOCK_SH)) {
            $last_id = fread($f, 8192);
            $last_id++;
            ftruncate($f, 0);
            fwrite($f, $last_id);
            flock($f, LOCK_UN);
            fclose($f);
            break;
        }
        usleep($this->config["waiting_time"]);
    }

    return $last_id;
}

So, what else can I do to look for a solution for this situation?

Notes: Due to server limitations, I'm limited to PHP 5.2, without access to semaphores and the like.

Since no-one seems to be giving an answer, I'll give you a possible solution.

Use the Lamport's Bakery Algorithm as part of your solution.

Edit: The filter lock would work even better if you don't need the order preserved.

Obviously this will have its own challenges implementing but it's worth a try and if you get it right, it might just do the trick for what you want to do.

Since you mentioned semaphores, I assume you have enough background to understand the concept.

This can be found in chapter 2 of "The art of multiprocessor programming".
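For reference, here is the bakery algorithm as a Java sketch (the question is in PHP, where the shared arrays would have to be emulated through files; this is only meant to show the structure):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class Bakery {
    private final int n;
    private final AtomicIntegerArray choosing; // 1 while a thread picks its ticket
    private final AtomicIntegerArray ticket;   // 0 means "not interested"

    Bakery(int n) {
        this.n = n;
        choosing = new AtomicIntegerArray(n);
        ticket = new AtomicIntegerArray(n);
    }

    void lock(int me) {
        choosing.set(me, 1);
        int max = 0;
        for (int k = 0; k < n; k++)
            max = Math.max(max, ticket.get(k));
        ticket.set(me, max + 1); // take a number one higher than any seen
        choosing.set(me, 0);
        for (int k = 0; k < n; k++) {
            if (k == me) continue;
            while (choosing.get(k) == 1) { } // wait until k has its ticket
            // wait while k is in line with a smaller ticket (ties broken by id)
            while (ticket.get(k) != 0 &&
                   (ticket.get(k) < ticket.get(me) ||
                    (ticket.get(k) == ticket.get(me) && k < me))) { }
        }
    }

    void unlock(int me) {
        ticket.set(me, 0);
    }
}
```

Note the first-come-first-served flavor: tickets are handed out roughly in arrival order, which is close to what the question asks for, even though the IDs themselves would still come from the critical section.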

Trying my luck at making a lock-free singly linked list implementation.

typedef _Atomic struct _node
  {
    void *data;
    struct _node *next;
  } Node;

Does this make all members of the struct atomic as well?

void add_head ( Linked_list* list, void* data )
{
  if ( debugging )
  {
      printf ( "%s\n", __func__ );
  }
  Node *node = ( Node* ) calloc ( 1, sizeof (Node ) );
  //init_node_mutex(node);
  //lock_node_mutex(node);
  atomic_exchange ( &node->next, NULL );
  atomic_exchange ( &node->data, data );

  if ( list->head == NULL )
  {
      Node* the_tail = atomic_load ( &list->tail );
      //  lock_node_mutex ( the_tail );
      atomic_exchange ( &node->next, NULL );
      atomic_compare_exchange_weak ( &list->tail, the_tail, node );

      //unlock_node_mutex ( the_tail );

  }
  else
  {

      Node* the_next = atomic_load ( &node->next );
      // lock_node_mutex ( the_next );
      atomic_compare_exchange_weak ( &node->next, the_next, list->head );
      // unlock_node_mutex ( the_next );
  }

  Node* the_head = atomic_load ( & list->head );
  //lock_node_mutex ( the_head );
  atomic_store ( &list->head, node );
  atomic_store ( &list->current, node );
  //unlock_node_mutex ( the_head );
  //unlock_node_mutex(node);
  atomic_fetch_add ( &list->size, 1 );
}

Are my usages of atomic_load and atomic_store correct?

In addition to @MikeRobinson's comment, I would add that while your code is "lock-free" in the sense that it does not contain any explicit use of locks, it is (somewhat ironically) no longer thread-safe. Writing lock-free code is enormously difficult. I recommend reading through this to get a bird's-eye view of the field, then reading this for details, or Chapter 7 of this book (it's in C++). You can always look through the source of Boost.Lockfree for inspiration.
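For contrast, the canonical lock-free insertion pattern is: read the head, link the new node locally, publish it with a single CAS, and retry if another thread got there first. A sketch of a Treiber stack, the simplest lock-free linked structure, in Java:

```java
import java.util.concurrent.atomic.AtomicReference;

class LockFreeStack<T> {
    private static final class Node<T> {
        final T data;
        Node<T> next;
        Node(T data) { this.data = data; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> old = head.get();
            node.next = old;                   // link before publishing
            if (head.compareAndSet(old, node)) // retry if another thread won the race
                return;
        }
    }

    T pop() {
        while (true) {
            Node<T> old = head.get();
            if (old == null)
                return null;                   // empty stack
            if (head.compareAndSet(old, old.next))
                return old.data;
        }
    }
}
```

The crucial difference from the code in the question is that each update is a single CAS on a single shared location, and failure is handled by retrying the whole read-link-publish sequence rather than continuing with stale state.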

What are good introductory tutorials / online courses / books / approaches to learn the following topics:

  • process, threads
  • common concurrency issues
  • locks, mutexes, semaphores
  • context switching
  • deadlock and how to avoid
  • livelock and how to avoid
  • scheduling
  • multi-core technology

If you want a strong and solid understanding of concurrency, I advise you to read The Art of Multiprocessor Programming. It has become a classic.