The Art of Multiprocessor Programming

Maurice Herlihy, Nir Shavit


Multiprocessor machines, or Multicores, as they are known in the industry, are quickly taking over every aspect of computing. This volume provides a presentation of the guiding principles and algorithmic techniques necessary for effective multiprocessor programming.


Mentioned in questions and answers.

When do we use AtomicReference? Is it needed to create objects in all multithreaded programs? Can you provide a simple example where AtomicReference should be used?

AtomicReference should be used in a setting where you need to do simple atomic (i.e., thread-safe, non-trivial) operations on a reference, for which monitor-based synchronization is not appropriate. Suppose you want to update a cached value, but only if the state of the object remains as you last checked:

AtomicReference<Object> cache = new AtomicReference<Object>();

Object cachedValue = new Object();
cache.set(cachedValue);

//... time passes ...
Object cachedValueToUpdate = cache.get();
//... do some work to transform cachedValueToUpdate into a new version
Object newValue = someFunctionOfOld(cachedValueToUpdate);
boolean success = cache.compareAndSet(cachedValueToUpdate, newValue);

Because of the atomic reference semantics, you can do this even if the cache object is shared amongst threads, without using synchronized. In general, you're better off using synchronizers or the java.util.concurrent framework rather than bare Atomic* unless you know what you're doing.

Two excellent dead-tree references which will introduce you to this topic: Herlihy's excellent Art of Multiprocessor Programming and Java Concurrency in Practice.

Note that (I don't know if this has always been true) reference assignment (i.e., =) is itself atomic (updating primitive 64-bit types (long/double) may not be atomic, but updating a reference always is, even if it's 64-bit) without explicitly using an Atomic*. See the JLS 3rd ed., Section 17.7.

With the rise of multicore CPUs on the desktop, multithreading skills will become a valuable asset for programmers. Can you recommend some good resources (books, tutorials, websites, etc.) for a programmer who is looking to learn about threaded programming?

For a rich, thorough treatment of the subject, with a good balance between computer science and practice, I recommend The Art of Multiprocessor Programming. A lot of examples are in object-oriented code, i.e. Java, with other languages scattered throughout. It just depends on the topic being covered. What I really love about this book is that it discusses how common algorithms should be implemented in a concurrent design. Of course, there's so much more!

For general concepts and a treatment of pthreads, I really like Programming with POSIX Threads. Being the library and API that it is, it's in C.

For Windows and C# developers, check out Joe Duffy's blog. Joe works on parallel libraries, infrastructure, and programming models in Microsoft's Developer Division. He has a book coming in Nov. 2008 titled Concurrent Programming on Windows (Amazon link).

Also, don't miss the Godfather's blog: Herb Sutter's Sutter's Mill. He has links to all his articles in Dr. Dobb's Journal and more. Click his Concurrency category.

I've honestly never read it myself, but Concurrent Programming in Java is a book I've heard recommended by several people.

Can anyone recommend a good reference for multithreaded programming, covering concepts with good examples in C++/C#?

I need to implement a lock-free skip list. I tried to look for papers. Unfortunately, all I found was about lock-free singly linked lists (in many flavors). How do I implement a lock-free skip list?

Lock-free skip lists are described in the book The Art of Multiprocessor Programming, and the technical report Practical lock-freedom, which is based on a PhD thesis on the subject. The skip list discussion begins on page 53. An example implementation, based on these sources, is included in this google code project.

There are related discussions, links to literature and implementations (not necessarily lock-free) in the SO questions Skip List vs. Binary Tree, and Skip Lists - ever used them?.

I have built an application in C# that I would like to optimize for multiple cores. I already have some threads; should I use more?

Updated for more detail

  • C# 2.0
  • Run on Windows Vista and Windows Server 2003

Updated again

  • This code is running as a service
  • I do not want to post the complete code... my goal here is to get your experience and advice on how to start. As I said, I already use threads. What more can I do?

Understanding the parallelism (or potential for parallelism) in the problem(s) you are trying to solve, your application and its algorithms is much more important than any details of thread synchronization, libraries, etc.

Start by reading Patterns for Parallel Programming (which focuses on 'finding concurrency' and higher-level design issues), and then move on to The Art of Multiprocessor Programming (practical details starting from a theoretical basis).

Let's say I'm programming in a threading framework that does not have multiple-reader/single-writer mutexes. Can I implement their functionality with the following:

Create two mutexes: a recursive (lock counting) one for readers and a binary one for the writer.


To write:

  • acquire lock on binary mutex
  • wait until recursive mutex has lock count zero
  • actual write
  • release lock on binary mutex


To read:

  • acquire lock on binary mutex (so I know the writer is not active)
  • increment count of recursive mutex
  • release lock on binary mutex
  • actual read
  • decrement count of recursive mutex

This is not homework. I have no formal training in concurrent programming, and am trying to grasp the issues. If someone can point out a flaw, spell out the invariants or provide a better algorithm, I'd be very pleased. A good reference, either online or on dead trees, would also be appreciated.

  1. You may want to prevent write starvation; to accomplish this you can either give preference to writers or make the mutex fair.
    Java's ReadWriteLock interface documentation says writer preference is common.
    The ReentrantReadWriteLock class documentation says: "This class does not impose a reader or writer preference ordering for lock access. However, it does support an optional fairness policy."

  2. Note R..'s comment

    Rather than locking and unlocking the binary mutex for reading, you can just check the binary mutex state after incrementing the count on the recursive mutex, and wait (spin/yield/futex_wait/whatever) if it's locked until it becomes unlocked

  3. Recommended reading:
    Programming with POSIX Threads
    Perl's RWLock
    Java's ReadWriteLock documentation.
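The questioner's scheme can also be sketched in Java. This is a rough, illustrative sketch only: an AtomicInteger stands in for the "recursive mutex's lock count", a ReentrantLock plays the binary mutex, and a spin with Thread.yield() replaces a proper wait.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the two-mutex reader/writer scheme described above.
class TwoMutexRWLock {
    private final ReentrantLock binary = new ReentrantLock(); // the "binary mutex"
    private final AtomicInteger readers = new AtomicInteger(); // the "lock count"

    void readLock() {
        binary.lock();               // writer is not active
        readers.incrementAndGet();   // register as a reader
        binary.unlock();
    }

    void readUnlock() {
        readers.decrementAndGet();
    }

    void writeLock() {
        binary.lock();               // blocks new readers and other writers
        while (readers.get() != 0)   // wait until reader count is zero
            Thread.yield();
    }

    void writeUnlock() {
        binary.unlock();
    }
}
```

Note the write starvation issue from point 1 above still applies: a steady stream of readers arriving before writeLock() is called can delay a writer indefinitely.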

The following is taken directly from The Art of Multiprocessor Programming, which is a good book to learn about this stuff. There are actually two implementations presented: a simple version and a fair version. I'll go ahead and reproduce the fair version.

One of the requirements for this implementation is that you have a condition variable primitive. I'll try to figure out a way to remove it but that might take me a little while. Until then, this should still be better than nothing. Note that it's also possible to implement this primitive using only locks.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class FifoReadWriteLock {
    int readAcquires = 0, readReleases = 0;
    boolean writer = false;
    ReentrantLock lock = new ReentrantLock();
    Condition condition = lock.newCondition(); // This is the condition variable.

    void readLock() throws InterruptedException {
        lock.lock();
        try {
            while (writer)
                condition.await();
            readAcquires++;
        } finally {
            lock.unlock();
        }
    }

    void readUnlock() {
        lock.lock();
        try {
            readReleases++;
            if (readAcquires == readReleases)
                condition.signalAll();
        } finally {
            lock.unlock();
        }
    }

    void writeLock() throws InterruptedException {
        lock.lock();
        try {
            while (writer)
                condition.await();

            writer = true;

            while (readAcquires != readReleases)
                condition.await();
        } finally {
            lock.unlock();
        }
    }

    void writeUnlock() {
        writer = false;
        condition.signalAll();
    }
}

First off, I simplified the code a little but the algorithm remains the same. There also happens to be an error in the book for this algorithm which is corrected in the errata. If you plan on reading the book, keep the errata close by or you'll end up being very confused (like me a few minutes ago when I was trying to re-understand the algorithm). Note that on the bright side, this is a good thing since it keeps you on your toes and that's a requirement when you're dealing with concurrency.

Next, while this may be a Java implementation, only use it as pseudo code. When doing the actual implementation you'll have to be careful about the memory model of the language or you'll definitely end up with a headache. As an example, I think that the readAcquires, readReleases and writer variables all have to be declared as volatile in Java or the compiler is free to optimize them out of the loops. This is because in a strictly sequential program there's no point in continuously looping on a variable that is never changed inside the loop. Note that my Java is a little rusty so I might be wrong. There's also another issue with integer overflow of the readReleases and readAcquires variables, which is ignored in the algorithm.

One last note before I explain the algorithm. The condition variable is initialized using the lock. That means that when a thread calls condition.await(), it gives up its ownership of the lock. Once it's woken up by a call to condition.signalAll() the thread will resume once it has reacquired the lock.

Finally, here's how and why it works. The readReleases and readAcquires variables keep track of the number of threads that have acquired and released the read lock. When these are equal, no thread has the read lock. The writer variable indicates that a thread is trying to acquire the write lock or already has it.

The read lock part of the algorithm is fairly simple. When trying to lock, it first checks to see if a writer is holding the lock or is trying to acquire it. If so, it waits until the writer is done and then claims the lock for the readers by incrementing the readAcquires variable. When unlocking, a thread increases the readReleases variable and if there's no more readers, it notifies any writers that may be waiting.

The write lock part of the algorithm isn't much more complicated. To lock, a thread must first check whether any other writer is active. If they are, it has to wait until the other writer is done. It then indicates that it wants the lock by setting writer to true (note that it doesn't hold it yet). It then waits until there's no more readers before continuing. To unlock, it simply sets the variable writer to false and notifies any other threads that might be waiting.

This algorithm is fair because the readers can't block a writer indefinitely. Once a writer indicates that it wants to acquire the lock, no more readers can acquire the lock. After that the writer simply waits for the last remaining readers to finish up before continuing. Note that there's still the possibility of a writer indefinitely blocking another writer. That's a fairly rare case but the algorithm could be improved to take that into account.

So I re-read your question and realised that I partly (badly) answered it with the algorithm presented above. So here's my second attempt.

The algorithm you described is fairly similar to the simple version presented in the book I mentioned. The only problem is that A) it's not fair and B) I'm not sure how you would implement "wait until recursive mutex has lock count zero". For A), see above; for B), the book uses a single int to keep track of the readers and a condition variable to do the signalling.

Can anyone provide exhaustive explanation, please? I'm diving into concurrent programming and met those registers while trying to understand consensus.

From Lamport's "On interprocess communication": ...a regular register is atomic if two successive reads that overlap the same write cannot obtain the new then the old value....

Assume that thread0.write(0) comes first, with no overlapping. Basically, using Lamport's definition, one can say that thread1 can read first 1 and then 0 again, if both reads are consecutive and overlap with thread0.write(1). But how is that possible?

Reads and writes to a shared memory location take a finite period of time, so they may either overlap, or be completely distinct.


Thread 1:      wwwww     wwwww
Thread 2:   rrrrr              rrrrr
Thread 3:   rrrrr rrrrr

The first read from thread 2 overlaps with the first write from thread 1, whilst the second read and second write do not overlap. In thread 3, both reads overlap the first write.

A safe register is only safe with respect to reads that do not overlap writes. If a read does not overlap any writes then it must return the value written by the most recent write. Otherwise it may return any value that the register may hold. So, in thread 2, the second read must return the value written by the second write, but the first read can return any valid value.

A regular register adds the additional guarantee that if a read overlaps with a write then it will either read the old value or the new one, but multiple reads that overlap the write do not have to agree on which, and the value may appear to "flicker" back and forth. This means that two reads from the same thread (such as in thread 3 above) that both overlap the write may appear "out of order": the earlier read returning the new value, and the later returning the old value.

An atomic register guarantees that each read and write appears to happen at a single point in time. Reads that take effect before that point all read the old value, and reads that take effect after it all read the new value. In particular, if two reads from the same thread overlap a write, then the later read cannot return the old value if the earlier read returned the new one. Atomic registers are linearizable.

The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit gives a good description, along with examples and use cases.

Folks, I've been programming high-speed software for over 20 years and know virtually every trick in the book: micro-benchmarking, profiling, cooperative user-mode multitasking, tail recursion, you name it, for very high performance stuff on Linux, Windows, and more.

The problem is that I find myself befuddled by what happens when multiple threads of CPU-intensive work are run on a multi-core processor.

The performance results from micro-benchmarks of various ways of sharing data between threads (on different cores) don't seem to follow logic.

It's clear that there is some "hidden interaction" between the cores which isn't obvious from my own programming code. I hear of L1 cache and other issues but those are opaque to me.

The question is: where can I learn this stuff? I am looking for an in-depth book on how multi-core processors work, and on how to program to capitalize on their memory caches and other hardware architecture instead of being punished by them.

Any advice or great websites or books? After much Googling, I'm coming up empty.

Sincerely, Wayne

This book taught me a lot about these sorts of issues and why raw CPU power is not necessarily the only thing to pay attention to. I used it in grad school years ago, but I think all of the principles still apply:

And essentially a major issue in multi-processor configurations is synchronizing access to main memory; if you don't do this right, it can be a real performance bottleneck. It's pretty complex with the caches that have to be kept in sync.

My own question, with answer, on Stack Overflow's sister site:

I will copy the answer to avoid the need for click-through:

Quote Boris:

Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures

This is a book I recommend wholeheartedly.

It is:

New - published last year. Means you are not reading somewhat outdated practices.

Short - about 200+ pages, dense with information. These days there is too much to read and too little time to read 1000+ page books.

Easy to read - not only is it very well written, it introduces hard-to-grasp concepts in a really simple-to-read way.

Intended to teach - each chapter gives exercises to do. I know it is always beneficial to do these, but rarely do. This book gives very compelling and interesting tasks. Surprisingly I did most of them and enjoyed doing them.

Additionally, if you wish to learn more of the low-level details, this is the best resource I have found: "The Art of Multiprocessor Programming". It's written using Java for its code samples, which plays nicely with my C# background.

PS: I have about 5 years of "hard core" parallel programming experience (albeit using C#), so I hope you can trust me when I say that "The Art of Multiprocessor Programming" rocks.

I've been doing some research on STM (software transactional memory) implementations, specifically on algorithms that utilize locks and are not dependent on the presence of a garbage collector in order to maintain compatibility with non-managed languages like C/C++. I've read the STM chapter in Herlihy and Shavit's "The Art of Multiprocessor Programming", as well as read a couple of Shavit's papers that describe his "Transactional Locking" and "Transactional Locking II" STM implementations. Their basic approach is to utilize a hash-table that stores the values of a global version-clock and a lock to determine if a memory location has been touched by another thread's write. As I understand the algorithm, when a writing transaction is performed, the version-clock is read and stored in thread-local memory, and a read-set and write-set are also created in thread-local memory. Then the following steps are performed:

  1. The values of any addresses read are stored in the read-set. This means that the transaction checks that any locations being read are not locked, and that their version numbers are less than or equal to the locally stored version-clock value.
  2. The values of any addresses written are stored in the write-set, along with the values that are to be written to those locations.
  3. Once the entire write-transaction is complete (and this can include reading and writing to a number of locations), the transaction attempts to lock each address that is to be written to using the lock in the hash-table that is hashed against the address' value.
  4. When all the write-set addresses are locked, the global version clock is atomically incremented and the new incremented value is locally stored.
  5. The write-transaction checks again to make sure that the values in the read-set have not been updated with a new version-number or are not locked by another thread.
  6. The write-transaction updates the version-stamp for that memory location with the new value it stored from step #4, and commits the values in the write-set to memory
  7. The locks on the memory locations are released

If any of the above check-steps fail (i.e., steps #1, #3, and #5), then the write-transaction is aborted.

The process for a read-transaction is a lot simpler. According to Shavit's papers, we simply

  1. Read and locally store the global version-clock value
  2. Check to make sure the memory locations do not have a clock value greater than the currently stored global version-clock value and also make sure the memory locations are not currently locked
  3. Perform the read operations
  4. Repeat step #2 for validation

If either step #2 or #4 fail, then the read-transaction is aborted.
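The read-transaction steps above can be sketched roughly as follows. Everything here is an illustrative assumption rather than Shavit's actual data structures: a fixed array of int slots stands in for memory, and each slot has a versioned lock word where an odd value means "locked" and an even value is the version number.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Illustrative sketch of a TL2-style read transaction (simplified layout).
class Tl2ReadSketch {
    static final AtomicInteger globalClock = new AtomicInteger();
    // One versioned lock word per slot: even = unlocked version, odd = locked.
    static final AtomicIntegerArray lockWords = new AtomicIntegerArray(16);
    static final int[] memory = new int[16];

    // Returns the value read, or null if the transaction must abort.
    static Integer transactionalRead(int addr) {
        int rv = globalClock.get();                  // 1. sample the global version-clock
        int v1 = lockWords.get(addr);
        if ((v1 & 1) != 0 || v1 > rv) return null;   // 2. locked or too new: abort
        int value = memory[addr];                    // 3. perform the read
        int v2 = lockWords.get(addr);
        if (v2 != v1) return null;                   // 4. re-validate: abort if changed
        return value;
    }
}
```

The post-read validation in step 4 is what catches a concurrent writer that locked and updated the slot between steps 2 and 3.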

The question that I can't seem to resolve in my mind, though, is what happens when you attempt to read a memory location inside an object that is located on the heap, and another thread calls delete on a pointer to that object? In Shavit's papers, they go into detail to explain how there can be no writes to a memory location that has been recycled or freed, but it seems that inside a read-transaction, there is nothing preventing a possible timing scenario that would allow you to read from a memory location inside an object that has been freed by another thread. As an example, consider the following code:

Thread A executes the following inside of an atomic read-transaction: linked_list_node* next_node = node->next;

Thread B executes the following: delete node;

Since next_node is a thread-local variable, it's not a transactional object. The dereferencing operation required to assign it the value of node->next, though, actually requires two separate reads. In between those reads, delete could be called on node, so that the read from the member next is actually reading from a segment of memory that has already been freed. Since the reads are optimistic, the freeing of the memory pointed to by node in Thread B won't be detected in Thread A. Won't that cause a possible crash or segmentation fault? If it does, how could that be avoided without locking the memory locations for a read as well (something that both the textbook and the papers state is unnecessary)?

The simple answer is that delete is a side effect, and transactions do not play nice with side effects.

Logically, because transactions can be rolled back at any time, you can't deallocate memory in the middle of a transaction.

I don't think there is a single universal answer to "how this shall be handled", but a common approach is to defer the delete call until commit-time. The STM API should either do this automatically (for example, providing its own delete function and requiring you to use that), or give you a hook where you can register "actions to perform on commit". Then during your transaction you can register an object to be deleted if and when the transaction commits.

Any other transaction working on the deleted object should then fail the version check and roll back.

Hope that helps. There isn't a simple answer to side effects in general. It's something each individual implementation will have to come up with mechanisms to handle.
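The commit-time hook idea can be sketched like this. The Transaction class and its methods are hypothetical, not any real STM API; the point is only that side effects are queued during the transaction and run (or discarded) atomically with its outcome.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: deferring side effects (such as freeing memory)
// until the transaction commits. On abort, the deferred actions are dropped.
class Transaction {
    private final List<Runnable> onCommit = new ArrayList<>();

    // Called during the transaction instead of performing the side effect now.
    void deferUntilCommit(Runnable action) {
        onCommit.add(action);
    }

    void commit() {
        for (Runnable action : onCommit)  // run deferred deletes only now
            action.run();
        onCommit.clear();
    }

    void abort() {
        onCommit.clear();                 // deferred deletes never happen
    }
}
```

A reader that still holds a reference to the "deleted" object keeps a valid (if stale) object until commit, and its version check then fails, which is exactly the behavior described above.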

Recently I read some examples from Chapter 8 of The Art of Multiprocessor Programming, about "Monitors and Blocking Synchronization", that use signalAll() on a Condition object without acquiring the lock associated with that Condition.

Surprisingly I did not find any fix for those examples in the book’s errata. Moreover they propose a correction for the example of figure 8.12 of a FifoReadWriteLock, but they keep using the signalAll() without the lock held. That perturbed me and I tried to find other considerations about these examples to understand the reasons why these Java examples were written in this way.

For instance, the answer to the question “How does a read-write mutex/lock work?” shows the same example of the implementation of a FifoReadWriteLock, which implements the writeUnlock() as:

void writeUnlock() {
    writer = false;
    condition.signalAll();
}

About the absence of the lock acquisition you can read two different reasons:

  1. only use it as pseudo code
  2. some implementation of a condition variable doesn't require that the lock be held to signal.

It is difficult to accept the first argument since the book uses examples in Java and explicitly says:

The book uses the Java programming language.

About the second point, I know that the Java API in java.util.concurrent.locks.Condition states for signal() method:

An implementation may (and typically does) require that the current thread hold the lock associated with this Condition when this method is called.

If "an implementation may" only, that means it is NOT mandatory. Yet, to the best of my knowledge, I can't find any implementation that does NOT fulfill this requirement. So I would like to know: which implementations of Java's Condition do not require the current thread to hold the lock?

I'm not aware of any Condition implementation in the JDK that allows waiting or signaling without owning the monitor at the same time.

Practically all of the java.util.concurrent classes rely on AbstractQueuedSynchronizer which establishes the same contract as the built-in monitor methods wait()/notify()/notifyAll() for the condition variables it provides, i.e. it requires owning the internal lock in order to allow calling await()/signal()/signalAll().

If you try a simple example using the proposed FifoReadWriteLock, you'll find that it spews a serious amount of IllegalMonitorStateExceptions courtesy of its writeUnlock() method. These exceptions disappear if you apply the lock-try-finally approach from the other methods.

While owning the monitor is indeed not absolutely required to wait or signal, it's often the preferable approach: it saves you from racy condition reads, it shouldn't be too costly since the hand-off between the internal wait sets of the same monitor can still be done fairly efficiently, and most often you need the lock for both signaling and scheduling rather than just signaling.
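You can confirm the IllegalMonitorStateException behavior with a small self-contained check (my own snippet, not from the book):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Shows that a ReentrantLock Condition (backed by AbstractQueuedSynchronizer)
// refuses to signal when the calling thread does not hold the lock.
class SignalWithoutLock {
    static boolean signalFails() {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        try {
            cond.signalAll();   // calling thread does not hold the lock
            return false;       // would mean the signal was allowed
        } catch (IllegalMonitorStateException expected) {
            return true;        // the contract was enforced
        }
    }
}
```

Wrapping the signalAll() in lock()/try/finally, as in the book's other methods, makes the exception disappear.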

I am reading up on TM, and one of the papers I'm reading says[1]:

Indeed, it was two nonblocking algorithms, the obstruction-free DSTM and lock-free FSTM that reinvigorated STM research in the past decade.

I was under the impression that lock imply obstruction. Apparently, I was wrong...

What is the difference between the terms "lock-free" and "obstruction-free"?

Here are the definitions from Herlihy & Shavit's The Art of Multiprocessor Programming.

A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps.

A method is lock-free if it guarantees that infinitely often some method call finishes in a finite number of steps.

A method is obstruction-free if, from any point after which it executes in isolation, it finishes in a finite number of steps (method call executes in isolation if no other threads take steps).

All wait-free methods are lock-free, and all lock-free methods are obstruction-free.
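As a concrete illustration of lock-freedom (my example, not the book's): a CAS-based counter is lock-free because some thread's compareAndSet always succeeds, guaranteeing system-wide progress, even though any individual thread may retry indefinitely under contention (so it is not wait-free).

```java
import java.util.concurrent.atomic.AtomicInteger;

// A lock-free counter: each failed CAS implies some other thread's CAS
// succeeded, so the system as a whole always makes progress.
class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        while (true) {
            int current = value.get();
            if (value.compareAndSet(current, current + 1))
                return current + 1;   // our update took effect
            // CAS failed: another thread incremented first; retry.
        }
    }
}
```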

What is the ReentrantLock#tryLock(long,TimeUnit) implementation doing when it tries to acquire a lock? Assume Thread A actually owns the lock of myLock, and Thread B calls myLock.tryLock(10,SECONDS); is Thread B sleeping or waiting?

In other words, what is the difference between these 2 implementations:


while (true) {
    if (readLock.tryLock()) {
        // ... read ...
        break;
    }
}


while (true) {
    try {
        if (readLock.tryLock(5, MILLISECONDS)) {
            // ... read ...
            break;
        }
    } catch (InterruptedException e) {}
}

For a great reference on how locks and other concurrency primitives are implemented see Shavit and Herlihy's excellent The Art of Multiprocessor Programming.
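To answer the sleeping-or-waiting part directly, you can observe the blocked thread's state. A timed tryLock parks the thread (it shows as TIMED_WAITING) rather than busy-spinning; this small demo is mine, with illustrative names:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Demonstrates that a timed tryLock parks the waiting thread
// (TIMED_WAITING) instead of busy-spinning (RUNNABLE).
class TryLockDemo {
    static Thread.State blockedState() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();                            // current thread holds the lock
        Thread b = new Thread(() -> {
            try {
                lock.tryLock(10, TimeUnit.SECONDS);
            } catch (InterruptedException ignored) {}
        });
        b.start();
        Thread.sleep(500);                      // give b time to block
        Thread.State state = b.getState();      // sampled while b is parked
        b.interrupt();                          // release b from the timed wait
        b.join();
        return state;
    }
}
```

The bare tryLock() loop, by contrast, keeps the thread RUNNABLE and burns CPU until the lock becomes free.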

I have been searching lately for information on how to construct a lock-free priority queue in C#. I have yet to even find an implementation in any language, or a decent paper on the matter. I have found several papers which appear to be copies or at least referencing one particular paper which is not actually a paper on lock free priority queues, despite its name; it is in fact a paper on a priority queue which uses fine grained locks.

The responses I have been receiving from elsewhere include "use a single thread" and "you do not need it to be lock free" and "it is impossible". All three of these responses are incorrect.

If someone has some information on this, I would greatly appreciate it.

The Art of Multiprocessor Programming. Look at Chapter 15 - Priority Queues. The book is in Java, but it can be easily translated to C# since both have GC (which is important for most of the implementations in the book).

I have been learning C++ for a while and still haven't come across a good book that explains what these beasts are. Are they an integral C++ feature? If so, how is it that they are only mentioned in books like The C++ Programming Language by B.S.? If not, where can you get reliable information about them - preferably a book (I don't really like web tutorials): how to define them, how to use them, etc. Thank you for any valuable help.

Locks and Mutexes are concurrency constructs used to ensure two threads won't access the same shared data at the same time, thus achieving correctness.

The current C++ standard doesn't feature concurrency tools.

Although you mentioned you prefer books to online tutorials, Herb Sutter's Effective Concurrency column is definitely a must read.

There is also Anthony Williams's upcoming book called C++ Concurrency in Action. Anthony Williams is the author of the Boost.Thread library.

Another threading library worth a look is Intel's TBB.

Locks and mutexes (think: mutual exclusion) allow cooperating threads to synchronize access to shared resources. For a brief overview of the concept, read the Wikipedia article on mutual exclusion.

These concepts are not part of the C++ language. The O'Reilly pthreads book would be a good reference for you, assuming you're on a POSIX platform. For Windows, you might go with Windows System Programming from Addison-Wesley.

They are basic constructs used to ensure correctness in parallel programs. They are included in Boost and the new C++ standard.

I can recommend this book, although it doesn't focus on C++:

I need some recommendations of books/links that discuss design of multi-threaded data structures for an intermediate-level C++ developer who knows STL/Boost and pthreads individually but would now like to blend these 2 knowledge streams.

Any help appreciated.

When it comes out in Feb 2011:

Anthony Williams - C++ Concurrency in Action

High on my wishlist...

I need to implement (in C++) a thread safe container in such a way that only one thread is ever able to add or remove items from the container. I have done this kind of thing before by sharing a mutex between threads. This leads to a lot of mutex objects being littered throughout my code and makes things very messy and hard to maintain.

I was wondering if there is a neater and more object oriented way to do this. I thought of the following simple class wrapper around the container (semi-pseudo C++ code)

 class LockedList {
        std::list<MyClass> m_List;

        MutexObject Mutex;
 };

so that locking could be done in the following way

 LockedList lockableList;     //create instance
 lockableList.Mutex.Lock();    // Lock object

 ... // search and add or remove items

 lockableList.Mutex.Unlock();   // Unlock object

So my question really is to ask whether this is a good approach from a design perspective. I know that allowing public access to members is frowned upon, so does the above design have any serious flaws? If so, is there a better way to implement thread-safe container objects?

I have read a lot of books on design and C++ in general but there really does seem to be a shortage of literature regarding multithreaded programming and multithreaded software design.

If the above is a poor approach to solving my problem, could anyone suggest a way to improve it, or point me towards some information that explains good ways to design classes to be thread safe? Many thanks.

It's hard to say whether coarse-grained locking is a bad design decision; we'd need to know about the system the code lives in to talk about that. However, it is a good starting point if you don't yet know that it won't work. Do the simplest thing that could possibly work first.

You could improve that code by making it harder to leave a scope without unlocking, though:

struct ScopedLocker {
  ScopedLocker(MutexObject &mo_) : mo(mo_) { mo.Lock(); }
  ~ScopedLocker() { mo.Unlock(); }

  MutexObject &mo;
};

You could also hide the implementation from users.

class LockedList {
    std::list<MyClass> m_List;
    MutexObject Mutex;

public:
    struct ScopedLocker {
       ScopedLocker(LockedList &ll);
       ~ScopedLocker();
    };
};

Then you just pass the locked list to it without them having to worry about details of the MutexObject.

You can also have the list handle all the locking internally, which is alright in some cases. The design issue is iteration. If the list locks internally, then operations like this are much worse than letting the user of the list decide when to lock.

void foo(LockedList &list) {
  for (size_t i = 0; i < 100000000; i++) {
    list.push_back(MyClass());  // takes and releases the internal lock on every iteration
  }
}

Generally speaking, it's a hard topic to give advice on because of problems like this. More often than not, it's about how you use an object. There are a lot of leaky abstractions when you try to write code that solves multiprocessor programming problems. That is why you see more toolkits that let people compose solutions that meet their needs.

There are books that discuss multi-processor programming, though they are few. With all the new C++11 features coming out, there should be more literature coming within the next few years.

I have a shared tempfile resource that is divided into chunks of 4K (or some such value). Each 4K in the file is represented by an index starting from zero. For this shared resource, I track the 4K chunk indices in use and always return the lowest indexed 4K chunk not in use, or -1 if all are in use.

This ResourceSet class has public acquire and release methods for the indices, both of which use a synchronized lock whose duration is roughly that of generating 4 random numbers (expensive, CPU-wise).

Therefore, as you can see from the code that follows, I use an AtomicInteger as a counting semaphore to prevent a large number of threads from entering the critical section at the same time on acquire(), returning -1 (not available right now) if there are too many threads.

Currently, I am using the constant CAS_MAX_ATTEMPTS for the tight CAS loop that tries to increment the atomic integer in acquire, and a constant of 10 for the maximum number of threads then allowed into the critical section, which is long enough to create contention. My question is: what should these constants be for a moderately to highly loaded servlet engine that has several threads trying to get access to these 4K chunks?

public class ResourceSet {

    // ??? what should this be
    // maximum number of attempts to try to increment with CAS on acquire
    private static final int    CAS_MAX_ATTEMPTS = 50;

    // ??? what should this be
    // maximum number of threads contending for lock before returning -1 on acquire
    private static final int    CONTENTION_MAX = 10;

    private AtomicInteger        latch = new AtomicInteger(0);

    ... member variables to track free resources

    private boolean aquireLatchForAquire () {
        for (int i = 0; i < CAS_MAX_ATTEMPTS; i++) {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet");        // this means more threads than can exist on any system, so its a bug!
            if (val >= CONTENTION_MAX) {
                // added to fix BUG that comment pointed out, thanks!
                return false;                       // too much contention right now
            }
            if (latch.compareAndSet(val, val + 1))
                return true;                        // we hold a slot in the latch
        }
        return false;                               // gave up after CAS_MAX_ATTEMPTS
    }

    private void aquireLatchForRelease () {
        do {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet");    // this means more threads than can exist on any system, so its a bug!
            if (latch.compareAndSet(val, val + 1))
                return;
        } while (true);
    }

    public ResourceSet (int totalResources) {
        ... initialize
    }

    public int acquire (ResourceTracker owned) {
        if (!aquireLatchForAquire())
            return -1;

        try {
            synchronized (this) {
                ... algorithm to compute minimum free resource or return -1 if all in use
                return resourceindex;
            }
        } finally {
            latch.decrementAndGet();    // leave the latch
        }
    }

    public boolean release (ResourceIter iter) {
        aquireLatchForRelease();
        try {
            synchronized (this) {
                ... iterate and release all resources
            }
        } finally {
            latch.decrementAndGet();    // leave the latch
        }
    }
}
Writing a good and performant spinlock is actually pretty complicated and requires a good understanding of memory barriers. Merely picking a constant is not going to cut it and will definitely not be portable. Google's gperftools has an example that you can look at, but it is probably way more complicated than what you'd need.

If you really want to reduce contention on the lock, you might want to consider using a more fine-grained and optimistic scheme. A simple one would be to divide your chunks into n groups and associate a lock with each group (also called lock striping). This will help reduce contention and increase throughput, but it won't help reduce latency. You could also associate an AtomicBoolean with each chunk and CAS to acquire it (retrying in case of failure). Do be careful when dealing with lock-free algorithms because they tend to be tricky to get right. If you do get it right, it could considerably reduce the latency of acquiring a chunk.
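To make the striping idea concrete, here is a minimal sketch; the class and method names are mine, and the selection policy is deliberately simplified. Each chunk index is guarded by one of a fixed set of locks, so threads claiming chunks in different stripes never contend with each other:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of lock striping: chunk i is guarded by lock i % N_STRIPES,
// so threads working on different stripes never block each other.
class StripedChunkSet {
    private static final int N_STRIPES = 16;
    private final ReentrantLock[] stripes = new ReentrantLock[N_STRIPES];
    private final boolean[] inUse;

    StripedChunkSet(int totalChunks) {
        inUse = new boolean[totalChunks];
        for (int i = 0; i < N_STRIPES; i++) stripes[i] = new ReentrantLock();
    }

    // Try to claim a specific chunk; only the stripe that owns it is locked.
    boolean tryAcquire(int chunk) {
        ReentrantLock lock = stripes[chunk % N_STRIPES];
        lock.lock();
        try {
            if (inUse[chunk]) return false;
            inUse[chunk] = true;
            return true;
        } finally {
            lock.unlock();
        }
    }

    void release(int chunk) {
        ReentrantLock lock = stripes[chunk % N_STRIPES];
        lock.lock();
        try { inUse[chunk] = false; } finally { lock.unlock(); }
    }
}
```

Note that striping gives up the original "lowest free index" guarantee; if that policy matters, you would have to scan stripes in order, which reintroduces some contention on the low stripes.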

Note that it's difficult to propose a more fine-grained approach without knowing what your chunk selection algorithm looks like. I also assume that you really do have a performance problem (it's been profiled and everything).

While I'm at it: your spinlock implementation is flawed. You should never spin directly on a CAS, because you're spamming memory barriers. This will be incredibly slow under any serious amount of contention (related to the thundering-herd problem). At a minimum, first check the variable for availability before your CAS (a simple if on a barrier-free read will do). Even better would be to not have all your threads spinning on the same value; this avoids the associated cache line ping-ponging between your cores.

Note that I don't know what type of memory barriers are associated with atomic ops in Java so my above suggestions might not be optimal or correct.
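The "check before you CAS" advice above is the classic test-and-test-and-set (TTAS) lock from the book. In Java it might be sketched like this (a toy illustration, not production code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set (TTAS) lock sketch: spin on a plain read of the flag,
// and attempt the expensive CAS only when the lock appears to be free.
class TTASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (locked.get()) { }            // spin on a cheap read (served from cache)
            if (locked.compareAndSet(false, true))
                return;                         // CAS only when the lock looked free
        }
    }

    void unlock() {
        locked.set(false);
    }
}
```

Real spinlocks typically also add exponential back-off after a failed CAS; the TTAS lock is the usual starting point for exactly that refinement.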

Finally, The Art Of Multiprocessor Programming is a fun read to get better acquainted with all the nonsense I've been spewing in this answer.

Herlihy and Shavit's book (The Art of Multiprocessor Programming) solves memory reclamation using Java's AtomicStampedReference<T>.

To write one in C++ for x86_64, I imagine, requires at least a 12-byte compare-and-swap: 8 bytes for a 64-bit pointer and 4 for the int stamp.

Is there x86 hardware support for this and if not, any pointers on how to do wait-free memory reclamation without it?

Yes, there is hardware support, though I don't know whether it is exposed by C++ libraries. Anyway, if you don't mind some low-level, unportable assembly-language trickery, look up the CMPXCHG16B instruction in the Intel manuals.
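For reference, the Java construct the question mentions pairs a reference with an integer stamp that is bumped on every update, so a recycled node cannot be mistaken for the old one (the ABA problem). A minimal sketch of a stack top guarded this way (the Node and StampedTop names are mine, not the book's):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Sketch: the (reference, stamp) pair is swapped atomically; a CAS succeeds only
// if BOTH the reference and the stamp are unchanged, which defeats ABA reuse.
class Node { int value; Node next; Node(int v) { value = v; } }

class StampedTop {
    private final AtomicStampedReference<Node> top =
        new AtomicStampedReference<>(null, 0);

    void push(Node n) {
        int[] stampHolder = new int[1];
        while (true) {
            Node oldTop = top.get(stampHolder);     // reads reference and stamp together
            n.next = oldTop;
            if (top.compareAndSet(oldTop, n, stampHolder[0], stampHolder[0] + 1))
                return;
        }
    }

    Node pop() {
        int[] stampHolder = new int[1];
        while (true) {
            Node oldTop = top.get(stampHolder);
            if (oldTop == null) return null;
            if (top.compareAndSet(oldTop, oldTop.next, stampHolder[0], stampHolder[0] + 1))
                return oldTop;
        }
    }
}
```

This is exactly the pointer-plus-int pair that would need the 12-byte (in practice, 16-byte via CMPXCHG16B) atomic swap in a C++ port.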

If you can definitely prove that a method has no linearization points, does it necessarily mean that the method is not linearizable? Also, as a sub-question, how can you prove that a method has no linearization points?

To build upon the answers above: a method can indeed be described as linearizable. As referenced in the book that djoker mentioned:

on page 69, exercise 32, we see

[image: the text of exercise 32, omitted]

It should be noted that enq() is indeed a method, one that is described as possibly being linearizable or not.

Proving whether a point is a valid linearization point comes down to looking for examples that break linearizability. If you assume that a particular read/write memory operation in a method is its linearization point, and then prove by contradiction that a non-linearizable execution results from that assumption, you can declare that the operation is not a valid linearization point.

Take, for example, the following enq()/deq() methods, assuming they are part of a standard queue implementation with head/tail pointers and a backing array "arr":

public terribleQueue(){
  arr = new T[10];
  tail = 0;
  head = 0;
}

void enq(T x){
  int slot = tail;
  arr[slot] = x;
  tail = tail + 1;
}

T deq(){
  if( head == tail ) throw new EmptyQueueException();
  T temp = arr[head];
  head = head + 1;
  return temp;
}

In this terrible implementation, we can easily prove, for example, that the first line of enq is not a valid linearization point, by assuming that it is one and then finding a counterexample, as seen here:

Take the example two threads, A and B, and the example history:

A: enq( 1 )
A: slot = 0
B: enq( 2 )
B: slot = 0

(A and B are now past their linearization points, therefore we are not allowed to re-order them to fit our history)

A: arr[0] = 1
B: arr[0] = 2
A: tail = 1
B: tail = 2

C: deq()
C: temp = arr[0] = 2
C: head = 1
C: return 2

Now we see that, because of our choice of linearization point (which fixes the order of A and B), this execution is impossible to make linearizable, because we cannot make C's deq return 1 no matter where we place its linearization point.

Kind of a long-winded answer, but I hope this helps.

If you can definitely prove that a method has no linearization points, does it necessarily 
mean that that method is not linearizable? 

Firstly, linearizability is not a property of a method; it is a property of an execution sequence.

how can you prove that a method has no linearization points?

Whether we are able to find a linearization point for a method depends on the execution sequence.

For example, consider the sequence below for thread A on a FIFO queue, where t1, t2, t3 are time intervals:

A.enq(1)   A.enq(2)   A.deq(1)
   t1         t2         t3

We can choose linearization points (LPs) for the first two enq calls as any points in time intervals t1 and t2 respectively, and for the deq any point in t3. The points we choose are the LPs for these methods.

Now, consider a faulty implementation

A.enq(1)   A.enq(2)   A.deq(2)
   t1         t2         t3

Linearizability requires the LPs to respect the real-time ordering. Therefore the LPs of the methods should follow the time ordering, i.e. t1 < t2 < t3. However, since our implementation is incorrect, we clearly cannot do this. Hence, we cannot find a linearization point for the method A.deq(2), and in turn our sequence is not linearizable.

Hope this helps, if you need to know more you can read this book:

Why is the following snippet for deleting a node in a linked list not thread safe?

edit: note every node has a lock of its own

// ... lock acquisition here
// ... assumption found to be valid here
prev->next = p->next;
p->next = NULL;
p->deleted = 1;

You might wanna take a look at this presentation. From slide #39, it demonstrates how fine-grained linked list locking should be implemented, in a clear and figurative way (the slides' notes add some explanations as well). The presentation is based on (or taken from...) a book called The Art of Multiprocessor Programming.
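The fine-grained scheme those slides illustrate is usually called hand-over-hand (or lock coupling) locking: you lock a node's successor before releasing the node itself, so the pair being modified can never be unlinked out from under you. A simplified Java sketch (not the book's exact code; the add helper here is unsynchronized, for single-threaded setup only):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hand-over-hand (lock coupling) removal: pred and curr are always locked as a
// pair while we inspect or unlink, so no concurrent remove can invalidate them.
class FineGrainedList {
    static class Node {
        int key;
        Node next;
        final ReentrantLock lock = new ReentrantLock();
        Node(int key) { this.key = key; }
    }

    private final Node head = new Node(Integer.MIN_VALUE); // sentinel

    void add(int key) {              // setup helper only, NOT thread-safe
        Node n = new Node(key);
        n.next = head.next;
        head.next = n;
    }

    boolean remove(int key) {
        Node pred = head;
        pred.lock.lock();
        Node curr = pred.next;
        if (curr != null) curr.lock.lock();
        try {
            while (curr != null) {
                if (curr.key == key) {
                    pred.next = curr.next;      // both locks held: safe to unlink
                    return true;
                }
                pred.lock.unlock();             // advance hand-over-hand:
                pred = curr;                    // keep curr locked, release old pred
                curr = curr.next;
                if (curr != null) curr.lock.lock();
            }
            return false;
        } finally {
            pred.lock.unlock();
            if (curr != null) curr.lock.unlock();
        }
    }
}
```

Holding both locks at the unlink step is exactly what the broken snippet in the question lacks a guarantee for: with only per-node locks taken independently, a neighbor can be deleted between your check and your write.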

I am trying to learn about multithreading and how to use it to perform tasks on a set of data in parallel. For example if I have a array of numbers that I want to perform a rather long operation on, I have created the following code to process it:

mutex mm;
int nums[] = {10,20,30,40,50,60,70,80,90};
int index = 0;

void threadProc()
{
    while (true) {
        mm.lock();
        if (index == sizeof(nums)/sizeof(nums[0])) {   // reached end of array
            mm.unlock();
            break;
        }
        int num = nums[index++]; // Create local copy so we can unlock mutex for other threads
        mm.unlock();
        cout << num + 2;         // Replace with time-consuming function
    }
}

int main()
{
    // Create 2 threads
    thread t(threadProc);
    thread a(threadProc);
    t.join();
    a.join();
}
Since I created this code based on what seems logical to me, is this the proper way to do it? Of course I could add more threads based on the number of hardware threads the CPU has, but I am going for the general idea here. If there are any good resources on this (preferably C++-oriented), I would be glad to hear about them. Thanks!

In general, you have two options: thread-based parallelism or task-based parallelism. The first is the more traditional approach, and pthreads and OpenMP are good examples of it. In the second alternative, you have one more abstraction level, where you see your parallel program as a set of tasks that are mapped onto threads. A good reference for the model of computation is chapter 27 of Cormen's Introduction to Algorithms, and some tools for programming in this style are CilkPlus, Threading Building Blocks, OpenMP tasks, and Microsoft's Task Parallel Library.

Finally, you can read The Art of Multiprocessor Programming.
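Since the answer above spans several languages, here is a sketch of what the task-based style might look like in Java, which I use for concreteness: each array element becomes a task submitted to a fixed-size pool, and the runtime decides how tasks map onto threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Task-based parallelism sketch: one task per element, pool sized to the hardware.
class TaskDemo {
    static List<Integer> processAll(int[] nums) {
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
        List<Future<Integer>> futures = new ArrayList<>();
        for (int n : nums) {
            final int value = n;
            futures.add(pool.submit(() -> value + 2));  // stand-in for slow work
        }
        List<Integer> results = new ArrayList<>();
        try {
            for (Future<Integer> f : futures) results.add(f.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```

Compared with the question's hand-rolled loop, there is no shared index and no mutex: each task owns its input, which is the main point of the task-based model.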

I've recently acquired responsibility for a critical system which is a data store featuring memory-mapped IO, multi-threaded (local) clients, and an interesting mixture of constraints. It has to store a mix of serial and indexed data including sparse 3D data of varying levels of detail.

Performance is critical, including being able to work with gigabytes of data on Win32 including working on XP but I only have to worry about Windows implementations. It is mostly implemented but I have to worry about debugging and improving performance.

I'm looking for tips on where to study, book recommendations, themes etc. I have a varied 25+ years experience including writing Quicktime components and a lot of work with ISAM data stores but not having to worry about the finer points of performance and cache coherence in this kind of system. I don't have traditional computer science education so if there's a theoretical gap I need to fill, please tell me!

Areas I was considering are the data stores used for some games, file system implementations, git internals and books such as Maurice Herlihy's The Art of Multiprocessor Programming.

I am planning to attend a one week course on this subject. I am primarily involved in Java projects and have decent knowledge of C and C++ too. And, I am interested in learning more on concurrent programming and would like to get feedback on this course. Has someone read the book or found these concepts relevant in contemporary programming?

More information on the course:

I would definitely suggest you go with this. But I would like to add another really important resource, specific to Java - as you labeled the question 'java' - which is Java Concurrency in Practice.

If this code got no performance benefit at all from running multiple threads on multiple cores, I would not be scratching my head. But how can it actually run slower?

Look at the code first:

class ThreadSafeStack
  def initialize
    @s, @m = [], Mutex.new
  end

  def push(value)
    @m.synchronize { @s.push(value) }
  end

  def pop
    @m.synchronize { @s.pop }
  end

  def peek
    @m.synchronize { @s.last }
  end
end

The full benchmarking script is at Basically, I do a million pushes, a million peeks, and a million pops, divided between 1, 5, or 25 threads (running in parallel).

Results from a 4-core Mac Pro, running JRuby

Testing ThreadSafeStack with 1 thread, iterating 1000000x each
  1.575000   0.000000   1.575000 (  1.575000)
Testing ThreadSafeStack with 5 threads, iterating 200000x each
  4.838000   0.000000   4.838000 (  4.838000)
Testing ThreadSafeStack with 25 threads, iterating 40000x each
 11.409000   0.000000  11.409000 ( 11.409000)

What gives???

EDIT: One more piece of information which might be relevant -- this benchmark does run faster with multiple threads when I use a lockless stack (implemented with compare-and-swap operations).

I recommend you go over Scott Meyers's slides CPU Caches and Why You Care. Of special interest to you is slide 8, which shows how a naive approach to adding multithreading to an algorithm can need 16 physical CPU threads to match the performance of a single thread, with 2 threads about twice as slow as a single thread (much like your experiment). Herb Sutter also has many articles and seminars covering this topic, and the Software Optimization Cookbook is an excellent book on it. And there is, of course, The Art of Multiprocessor Programming. Note that nothing I mentioned above has anything specifically to do with Ruby. This is no accident; the topic/issue is fundamental and comes from the hardware.

What happens is that, even if your mutexes are lightweight and implemented purely in user space (no trip to kernel land), you are running up against the CPU cache-coherency protocol. Every time you see code that, in a concurrent environment, modifies shared state about as often as it reads it (hint: your stack-protecting Mutex is exactly such shared state, as is the stack itself), you should expect pretty much abysmal performance, much slower than a single thread. Basically, all your accesses to such shared state have to be served from main RAM instead of from cache, and that is about 100 times slower. A single thread only pays this penalty on first access; all subsequent accesses come from the L1/L2 cache.

This is why serious multi-threaded applications

  • don't share state between threads
  • use lock free structures

The art of achieving this exactly varies from case to case (I highly recommend the books linked above). Tricks include grabbing work in large batches instead of a single item at a time (so contention occurs far less often and is amortized across many items), partitioning the shared state (the stack) to reduce contention, and using a lock-free stack (not a trivial thing to implement).
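For illustration, the lock-free stack mentioned above is usually a Treiber-style stack: push and pop retry a single CAS on the top pointer instead of holding a mutex. A minimal Java sketch (in Java the garbage collector sidesteps the ABA problem that a C/C++ version would additionally have to handle):

```java
import java.util.concurrent.atomic.AtomicReference;

// Treiber-style lock-free stack: each operation snapshots the top pointer and
// retries its CAS on failure, so no thread ever blocks holding a lock.
class LockFreeStack<T> {
    private static class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    public void push(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> oldTop = top.get();
            node.next = oldTop;
            if (top.compareAndSet(oldTop, node)) return;
        }
    }

    public T pop() {
        while (true) {
            Node<T> oldTop = top.get();
            if (oldTop == null) return null;    // empty stack
            if (top.compareAndSet(oldTop, oldTop.next)) return oldTop.value;
        }
    }
}
```

Under contention this often beats a mutex-protected stack because a failed CAS just retries instead of descheduling the thread, which is consistent with the lockless result reported in the question's edit.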

I have been reading the book The Art of Multiprocessor Programming and noticed that a lot of the algorithms mentioned assume a fixed number of threads (e.g. the combining tree algorithm). Is there a straightforward way of generalizing such algorithms to a scenario where threads are created and destroyed in an unpredictable manner?

I used Task as shown below, but there is no performance gain. My method executes in 0-1 seconds on its own, but with 30 Tasks it takes 5-12 seconds. Can anyone tell me if I have made a mistake? I want to run 30 in parallel and expect all 30 to be done in at most 2 seconds.

Here is my code:

Task[] tasks = new Task[30];
try
{
    for (int p = 0; p <= dstable.Tables[0].Rows.Count - 1; p++)
    {
        MethodParameters newParameter = new MethodParameters();
        newParameter.Name = dstable.Tables[0].Rows[p]["Name"].ToString();

        tasks[p] = Task.Factory.StartNew(() => ParseUri(newParameter));
        //Console.Write("task completed");
    }
    Task.WaitAll(tasks);   // wait for all 30 tasks to finish
}
catch (AggregateException ae)
{
    throw ae.Flatten();
}

There are some major problems in your thinking.

  1. Does your PC have 30 cores, so that every core can take exactly one task? I don't think so.
  2. Starting a separate thread also takes some time.
  3. Every additional concurrent thread generates more overhead.
  4. Can your problem be solved faster by starting more threads? This is only the case when the threads do different kinds of work, like reading from disk, querying a database, computing something, etc. Ten threads that all do "high-performance" work on the CPU won't give a boost - quite the contrary - because every thread needs to clean up its mess before it can give some CPU time to the next thread, and that one needs to clean up its mess too.

Check this link out

You can use the TPL

It tries to guarantee the maximum effect from parallel threads. I also recommend this book

When you really want to solve your problem in under 2 seconds, buy more CPU power ;)