C++ Concurrency in Action

Anthony Williams

With the new C++ Standard and Technical Report 2 (TR2), multi-threading is coming to C++ in a big way. TR2 will provide higher-level synchronization facilities that allow for a much greater level of abstraction, and make programming multi-threaded applications simpler and safer. Concurrent programming is required if programmers are to take advantage of the multi-core microprocessors increasingly available from Intel and others. The new standard for C++ has extensions to the language that make concurrent programming more accessible to regular developers. As a guide and reference to the new concurrency features in the upcoming C++ Standard and TR2, this book is invaluable for existing programmers familiar with writing multi-threaded code in C++ using platform-specific APIs, or in other languages, as well as C++ programmers who have never written multithreaded code before.

Mentioned in questions and answers.

This question attempts to collect the few pearls among the dozens of bad C++ books that are published every year.

Unlike many other programming languages, which are often picked up on the go from tutorials found on the Internet, C++ is one that few people are able to pick up quickly without studying a well-written C++ book. It is way too big and complex for that. In fact, it is so big and complex that there are very many very bad C++ books out there. And we are not talking about bad style, but things like sporting glaringly obvious factual errors and promoting abysmally bad programming styles.

Please edit the accepted answer to provide quality books and an approximate skill level — preferably after discussing your addition in the C++ chat room. (The regulars might mercilessly undo your work if they disagree with a recommendation.) Add a short blurb/description about each book that you have personally read/benefited from. Feel free to debate quality, headings, etc. Books that meet the criteria will be added to the list. Books that have reviews by the Association of C and C++ Users (ACCU) have links to the review.

Note: FAQs and other resources can be found in the C++ tag info and under the c++-faq tag. There is also a similar post for C: The Definitive C Book Guide and List

Beginner

Introductory, no previous programming experience

  • Programming: Principles and Practice Using C++ (Bjarne Stroustrup) (updated for C++11/C++14) An introduction to programming using C++ by the creator of the language. A good read that assumes no previous programming experience, but is not only for beginners.

Introductory, with previous programming experience

  • C++ Primer * (Stanley Lippman, Josée Lajoie, and Barbara E. Moo) (updated for C++11) Coming in at around 1,000 pages, this is a very thorough introduction to C++ that covers just about everything in the language in a very accessible format and in great detail. The fifth edition (released August 16, 2012) covers C++11. [Review]

  • A Tour of C++ (Bjarne Stroustrup) (EBOOK) The “tour” is a quick (about 180 pages and 14 chapters) tutorial overview of all of standard C++ (language and standard library, and using C++11) at a moderately high level for people who already know C++ or at least are experienced programmers. This book is an extended version of the material that constitutes Chapters 2-5 of The C++ Programming Language, 4th edition.

  • Accelerated C++ (Andrew Koenig and Barbara Moo) This basically covers the same ground as the C++ Primer, but does so in a quarter of the space. This is largely because it does not attempt to be an introduction to programming, but an introduction to C++ for people who've previously programmed in some other language. It has a steeper learning curve, but, for those who can cope with this, it is a very compact introduction to the language. (Historically, it broke new ground by being the first beginner's book to use a modern approach to teaching the language.) [Review]

  • Thinking in C++ (Bruce Eckel) Two volumes; a free, tutorial-style set of intro-level books. Downloads: vol 1, vol 2. Unfortunately they're marred by a number of trivial errors (e.g. maintaining that temporaries are automatically const), with no official errata list. A partial third-party errata list is available at http://www.computersciencelab.com/Eckel.htm, but it's apparently not maintained.

* Not to be confused with C++ Primer Plus (Stephen Prata), with a significantly less favorable review.

Best practices

  • Effective C++ (Scott Meyers) This was written with the aim of being the best second book C++ programmers should read, and it succeeded. Earlier editions were aimed at programmers coming from C, the third edition changes this and targets programmers coming from languages like Java. It presents ~50 easy-to-remember rules of thumb along with their rationale in a very accessible (and enjoyable) style. For C++11 and C++14 the examples and a few issues are outdated and Effective Modern C++ should be preferred. [Review]

  • Effective Modern C++ (Scott Meyers) This is basically the new version of Effective C++, aimed at C++ programmers making the transition from C++03 to C++11 and C++14.

  • Effective STL (Scott Meyers) This aims to do for the part of the standard library that comes from the STL what Effective C++ did for the language as a whole: it presents rules of thumb along with their rationale. [Review]

Intermediate

  • More Effective C++ (Scott Meyers) Even more rules of thumb than Effective C++. Not as important as the ones in the first book, but still good to know.

  • Exceptional C++ (Herb Sutter) Presented as a set of puzzles, this has one of the best and most thorough discussions of proper resource management and exception safety in C++ through Resource Acquisition Is Initialization (RAII), in addition to in-depth coverage of a variety of other topics including the pimpl idiom, name lookup, good class design, and the C++ memory model. [Review]

  • More Exceptional C++ (Herb Sutter) Covers additional exception safety topics not covered in Exceptional C++, in addition to discussion of effective object oriented programming in C++ and correct use of the STL. [Review]

  • Exceptional C++ Style (Herb Sutter) Discusses generic programming, optimization, and resource management; this book also has an excellent exposition of how to write modular code in C++ by using nonmember functions and the single responsibility principle. [Review]

  • C++ Coding Standards (Herb Sutter and Andrei Alexandrescu) “Coding standards” here doesn't mean “how many spaces should I indent my code?” This book contains 101 best practices, idioms, and common pitfalls that can help you to write correct, understandable, and efficient C++ code. [Review]

  • C++ Templates: The Complete Guide (David Vandevoorde and Nicolai M. Josuttis) This is the book about templates as they existed before C++11. It covers everything from the very basics to some of the most advanced template metaprogramming and explains every detail of how templates work (both conceptually and in how they are implemented) and discusses many common pitfalls. Has excellent summaries of the One Definition Rule (ODR) and overload resolution in the appendices. A second edition is scheduled for 2017. [Review]


Advanced

  • Modern C++ Design (Andrei Alexandrescu) A groundbreaking book on advanced generic programming techniques. Introduces policy-based design, type lists, and fundamental generic programming idioms then explains how many useful design patterns (including small object allocators, functors, factories, visitors, and multimethods) can be implemented efficiently, modularly, and cleanly using generic programming. [Review]

  • C++ Template Metaprogramming (David Abrahams and Aleksey Gurtovoy)

  • C++ Concurrency In Action (Anthony Williams) A book covering C++11 concurrency support including the thread library, the atomics library, the C++ memory model, locks and mutexes, as well as issues of designing and debugging multithreaded applications.

  • Advanced C++ Metaprogramming (Davide Di Gennaro) A pre-C++11 manual of TMP techniques, focused more on practice than theory. There are a ton of snippets in this book, some of which are made obsolete by type traits, but the techniques are nonetheless useful to know. If you can put up with the quirky formatting/editing, it is easier to read than Alexandrescu and arguably more rewarding. For more experienced developers, there is a good chance that you may pick up something about a dark corner of C++ (a quirk) that usually only comes about through extensive experience.


Reference Style - All Levels

  • The C++ Programming Language (Bjarne Stroustrup) (updated for C++11) The classic introduction to C++ by its creator. Written to parallel the classic K&R, it indeed reads very much like it and covers just about everything from the core language to the standard library, to programming paradigms, to the language's philosophy. [Review]

  • C++ Standard Library Tutorial and Reference (Nicolai Josuttis) (updated for C++11) The introduction and reference for the C++ Standard Library. The second edition (released on April 9, 2012) covers C++11. [Review]

  • The C++ IO Streams and Locales (Angelika Langer and Klaus Kreft) There's very little to say about this book except that, if you want to know anything about streams and locales, then this is the one place to find definitive answers. [Review]

C++11/14 References:

  • The C++ Standard (INCITS/ISO/IEC 14882-2011) This, of course, is the final arbiter of all that is or isn't C++. Be aware, however, that it is intended purely as a reference for experienced users willing to devote considerable time and effort to its understanding. As usual, the first release was quite expensive ($300+ US), but it has now been released in electronic form for $60 US.

  • The C++14 standard is available, but seemingly not in an economical form – directly from the ISO it costs 198 Swiss Francs (about $200 US). For most people, the final draft before standardization is more than adequate (and free). Many will prefer an even newer draft, documenting new features that are likely to be included in C++17.

  • Overview of the New C++ (C++11/14) (PDF only) (Scott Meyers) (updated for C++1y/C++14) These are the presentation materials (slides and some lecture notes) of a three-day training course offered by Scott Meyers, who's a highly respected author on C++. Even though the list of items is short, the quality is high.

  • The C++ Core Guidelines (C++11/14/17/…) (edited by Bjarne Stroustrup and Herb Sutter) is an evolving online document consisting of a set of guidelines for using modern C++ well. The guidelines are focused on relatively higher-level issues, such as interfaces, resource management, memory management and concurrency affecting application architecture and library design. The project was announced at CppCon'15 by Bjarne Stroustrup and others and welcomes contributions from the community. Most guidelines are supplemented with a rationale and examples as well as discussions of possible tool support. Many rules are designed specifically to be automatically checkable by static analysis tools.

  • The C++ Super-FAQ (Marshall Cline, Bjarne Stroustrup and others) is an effort by the Standard C++ Foundation to unify the C++ FAQs previously maintained individually by Marshall Cline and Bjarne Stroustrup and also incorporating new contributions. The items mostly address issues at an intermediate level and are often written with a humorous tone. Not all items might be fully up to date with the latest edition of the C++ standard yet.

  • cppreference.com (C++03/11/14/17/…) (initiated by Nate Kohl) is a wiki that summarizes the basic core-language features and has extensive documentation of the C++ standard library. The documentation is very precise but is easier to read than the official standard document and provides better navigation due to its wiki nature. The project documents all versions of the C++ standard and the site allows filtering the display for a specific version. The project was presented by Nate Kohl at CppCon'14.


Classics / Older

Note: Some information contained within these books may not be up-to-date or no longer considered best practice.

  • The Design and Evolution of C++ (Bjarne Stroustrup) If you want to know why the language is the way it is, this book is where you find answers. This covers everything before the standardization of C++.

  • Ruminations on C++ (Andrew Koenig and Barbara Moo) [Review]

  • Advanced C++ Programming Styles and Idioms (James Coplien) A predecessor of the pattern movement, it describes many C++-specific “idioms”. It's certainly a very good book and might still be worth a read if you can spare the time, but quite old and not up-to-date with current C++.

  • Large Scale C++ Software Design (John Lakos) Lakos explains techniques to manage very big C++ software projects. Certainly a good read, if only it were up to date. It was written long before C++98 and misses many features (e.g. namespaces) important for large-scale projects. If you need to work on a big C++ software project, you might want to read it, although you need to take more than a grain of salt with it. The first volume of a new edition is expected in 2015.

  • Inside the C++ Object Model (Stanley Lippman) If you want to know how virtual member functions are commonly implemented and how base objects are commonly laid out in memory in a multi-inheritance scenario, and how all this affects performance, this is where you will find thorough discussions of such topics.

  • The Annotated C++ Reference Manual (Bjarne Stroustrup, Margaret A. Ellis) This book is quite outdated in that it explores the 1989 C++ 2.0 version; templates, exceptions, namespaces and new casts were not yet introduced. That said, this book goes through the entire C++ standard of the time, explaining the rationale, the possible implementations, and the features of the language. This is not a book to learn programming principles and patterns in C++, but one to understand every aspect of the C++ language.

I read the following article by Anthony Williams, and as I understand it, in addition to the atomic shared count in std::shared_ptr, in std::experimental::atomic_shared_ptr the actual pointer to the shared object is also atomic?

But when I read about the reference-counted version of lock_free_stack described in Anthony's book on C++ concurrency, it seems to me that the same applies to std::shared_ptr, because functions like std::atomic_load and std::atomic_compare_exchange_weak are applied to instances of std::shared_ptr.

#include <memory> // std::shared_ptr and its std::atomic_* free-function overloads

template <class T>
class lock_free_stack
{
public:
  void push(const T& data)
  {
    const std::shared_ptr<node> new_node = std::make_shared<node>(data);
    new_node->next = std::atomic_load(&head_);
    while (!std::atomic_compare_exchange_weak(&head_, &new_node->next, new_node));
  }

  std::shared_ptr<T> pop()
  {
    std::shared_ptr<node> old_head = std::atomic_load(&head_);
    while(old_head &&
          !std::atomic_compare_exchange_weak(&head_, &old_head, old_head->next));
    return old_head ? old_head->data : std::shared_ptr<T>();
  }

private:
  struct node
  {
    std::shared_ptr<T> data;
    std::shared_ptr<node> next;

    node(const T& data_) : data(std::make_shared<T>(data_)) {}
  };

private:
  std::shared_ptr<node> head_;
};

What is the exact difference between these two types of smart pointers, and if the pointer in a std::shared_ptr instance is not atomic, why is the above lock-free stack implementation possible?

The atomic "thing" in shared_ptr is not the shared pointer itself, but the control block it points to. Meaning that as long as you don't mutate the shared_ptr itself across multiple threads, you are OK. Do note that copying a shared_ptr only mutates the control block, not the shared_ptr itself.

std::shared_ptr<int> ptr = std::make_shared<int>(4);
for (auto i = 0; i < 10; i++) {
   std::thread([ptr] { auto copy = ptr; }).detach(); // OK, only mutates the control block
}

Mutating the shared pointer itself, such as assigning it different values from multiple threads, is a data race, for example:

std::shared_ptr<int> ptr = std::make_shared<int>(4);
std::thread threadA([&ptr]{
   ptr = std::make_shared<int>(10);
});
std::thread threadB([&ptr]{
   ptr = std::make_shared<int>(20);
});
threadA.join();
threadB.join();

Here, we are mutating the control block (which is OK) but also the shared pointer itself, by making it point to different values from multiple threads. This is not OK.

A solution to that problem is to wrap the shared_ptr with a lock, but that solution does not scale well under contention and, in a sense, loses the automatic feel of the standard shared pointer.

Another solution is to use the standard functions you quoted, such as std::atomic_compare_exchange_weak. This makes the work of synchronizing shared pointers a manual one, which we don't like.

This is where the atomic shared pointer comes into play. You can mutate the shared pointer from multiple threads without fearing a data race and without using any locks. The standalone functions become member functions, and their use is much more natural for the user. This kind of pointer is extremely useful for lock-free data structures.
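
To make the difference concrete, here is a minimal sketch. The free-function half is standard C++11; the member-function half shows the TS interface, which few standard libraries actually shipped, so it is left commented out and should be treated as illustrative:

#include <memory> // the std::atomic_* overloads for shared_ptr live here
// #include <experimental/atomic> // where atomic_shared_ptr lives, if your toolchain ships it

std::shared_ptr<int> data = std::make_shared<int>(1);

// C++11 way: free functions, applied manually to a plain shared_ptr.
void update_with_free_functions()
{
    std::shared_ptr<int> desired = std::make_shared<int>(2);
    std::shared_ptr<int> expected = std::atomic_load(&data);
    // expected is refreshed on each failed attempt; loop until the swap succeeds.
    while (!std::atomic_compare_exchange_weak(&data, &expected, desired))
        ;
}

// TS way: the same operation via member functions, so every access is
// forced through the atomic wrapper and nothing can be forgotten.
// std::experimental::atomic_shared_ptr<int> adata{std::make_shared<int>(1)};
// void update_with_member_functions()
// {
//     auto desired = std::make_shared<int>(2);
//     auto expected = adata.load();
//     while (!adata.compare_exchange_weak(expected, desired))
//         ;
// }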

I'm reading C++ Concurrency in Action. Chapter 2.4 describes a parallel_accumulate algorithm.

I tried - as a learning experiment - to replace the functor used there, with a generic lambda.

I've distilled the compilation error down to:

#include <thread>

template <typename T>
struct f {
    void operator() (T& result) { result = 1;}
};

int main() {
    int x = 0;
    auto g = [](auto& result) { result = 1; };

    std::thread(f<int>(), std::ref(x));  // COMPILES
    std::thread(g, std::ref(x));         // FAILS TO COMPILE
}

The error message:

 In file included from /usr/include/c++/4.9/thread:39:0,
                 from foo.cpp:1:
/usr/include/c++/4.9/functional: In instantiation of ‘struct std::_Bind_simple<main()::<lambda(auto:1&)>(std::reference_wrapper<int>)>’:
/usr/include/c++/4.9/thread:140:47:   required from ‘std::thread::thread(_Callable&&, _Args&& ...) [with _Callable = main()::<lambda(auto:1&)>&; _Args = {std::reference_wrapper<int>}]’
foo.cpp:13:31:   required from here
/usr/include/c++/4.9/functional:1665:61: error: no type named ‘type’ in ‘class std::result_of<main()::<lambda(auto:1&)>(std::reference_wrapper<int>)>’
       typedef typename result_of<_Callable(_Args...)>::type result_type;
                                                             ^
/usr/include/c++/4.9/functional:1695:9: error: no type named ‘type’ in ‘class std::result_of<main()::<lambda(auto:1&)>(std::reference_wrapper<int>)>’
         _M_invoke(_Index_tuple<_Indices...>)
         ^

My compiler version

$ g++ --version
g++ (Ubuntu 4.9.1-16ubuntu6) 4.9.1

Why does the compilation fail for the lambda but not for the functor?

EDIT: How can I achieve what the functor is doing (assigning to a ref) with a generic lambda?

Another variation on the same theme: template argument deduction doesn't look through conversions.

The operator() of f<int> is

void operator() (int& result);

when you pass a reference_wrapper<int> to it, the conversion function (operator int &) is called, yielding a reference that can be bound to result.

The operator() of your generic lambda is

template<class T> void operator() (T& result) const;

If it were passed a reference_wrapper lvalue, it would deduce T as a reference_wrapper and then fail to compile on the assignment. (Assignment to a reference_wrapper reseats the "reference" rather than affects the value.)

But it fails even before that, because the standard requires that what you pass to std::thread must be callable with prvalues - and a non-const lvalue reference doesn't bind to a prvalue. This is the error you see - result_of contains no type because your functor is not callable for the argument type. If you attempt to do g(std::ref(x));, clang produces a rather clear error:

main.cpp:16:5: error: no matching function for call to object of type '(lambda at main.cpp:11:14)'
    g(std::ref(x));
    ^
main.cpp:11:14: note: candidate function [with $auto-0-0 = std::__1::reference_wrapper<int>] not viable: expects an l-value for 1st argument
    auto g = [](auto& result) { result = 1; };         
    ^

You should probably consider just capturing the relevant local by reference:

auto g = [&x]() { x = 1; };

Or if, for whatever reason, you must use a generic lambda, then you might take a reference_wrapper by value (or by const reference), and then unwrap it using get():

 auto g = [](auto result) { result.get() = 1; };

or perhaps add a std::bind which will unwrap the reference_wrappers, which lets template argument deduction do the right thing (hat tip @Casey):

 std::thread(std::bind(g, std::ref(x)));

or perhaps dispense with this reference_wrapper nonsense and write your lambda to take a non-owning pointer instead:

auto g = [](auto* result) { *result = 1; };
std::thread(g, &x);

I have found the following 2 pieces of code:

  1. http://en.cppreference.com/w/cpp/thread/lock

    void assign_lunch_partner(Employee &e1, Employee &e2)                                                                                                  
    {   
        // use std::lock to acquire two locks without worrying about 
        // other calls to assign_lunch_partner deadlocking us
        {   
            // m is the std::mutex field
            std::unique_lock<std::mutex> lk1(e1.m, std::defer_lock);
            std::unique_lock<std::mutex> lk2(e2.m, std::defer_lock);
            std::lock(lk1, lk2);
            // ...
        }   
    }
    
  2. http://www.amazon.com/C-Concurrency-Action-Practical-Multithreading/dp/1933988770

    void swap(X& lhs, X& rhs) {
      if (&lhs == &rhs)
        return;
      // m is the std::mutex field
      std::lock(lhs.m, rhs.m);
      std::lock_guard<std::mutex> lock_a(lhs.m, std::adopt_lock);
      std::lock_guard<std::mutex> lock_b(rhs.m, std::adopt_lock);
      swap(lhs.some_detail, rhs.some_detail);
    }
    

I wanted to ask: what is the difference between, and the consequences of, using either of the two versions? (Lock first, or first create the lock_guard or unique_lock?)

There's actually a paragraph (3.2.6) in the book explaining that the code is virtually equivalent and you could replace one with the other. The only difference is that std::unique_lock tends to take more space and is a fraction slower than std::lock_guard.

Bottom line is whenever you don't need the additional flexibility that std::unique_lock provides, go with std::lock_guard.

Is there a way of running a function back on the main thread?

So if I called a function via async that downloaded a file and then parsed the data, it would then call a callback function which would run on my main UI thread and update the UI?

I know all threads are equal in the default C++ implementation, so would I have to create a shared pointer to my main thread? How would I do this and pass the async function not only the shared pointer to the main thread but also a pointer to the function I want to run on it, and then run it on that main thread?

I have been reading C++ Concurrency in Action and chapter four (AKA "The Chapter I Just Finished") describes a solution.

The Short Version

Have a shared std::deque<std::packaged_task<void()>> (or a similar sort of message/task queue). Your std::async-launched functions can push tasks to the queue, and your GUI thread can process them during its loop.

There Isn't Really a Long Version, but Here Is an Example

Shared Data

std::deque<std::packaged_task<void()>> tasks;
std::mutex tasks_mutex;
std::atomic<bool> gui_running;

The std::async Function

void one_off()
{
    std::packaged_task<void()> task(FUNCTION TO RUN ON GUI THREAD); //!!
    std::future<void> result = task.get_future();

    {
        std::lock_guard<std::mutex> lock(tasks_mutex);
        tasks.push_back(std::move(task));
    }

    // wait on result
    result.get();
}

The GUI Thread

void gui_thread()
{
    while (gui_running) {
        // process messages
        {
            std::unique_lock<std::mutex> lock(tasks_mutex);
            while (!tasks.empty()) {
                auto task(std::move(tasks.front()));
                tasks.pop_front();

                // unlock during the task
                lock.unlock();
                task();
                lock.lock();
            }
        }

        // "do gui work"
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

Notes:

  1. I am (always) learning, so there is a decent chance that my code is not great. The concept is at least sound though.

  2. The destructor of the return value from std::async (a std::future<>) will block until the operation launched with std::async completes (see std::async), so waiting on the result of a task (as I do in my example) in one_off might not be a brilliant idea.

  3. You may want to (I would, at least) create your own threadsafe MessageQueue type to improve code readability/maintainability/blah blah blah.

  4. I swear there was one more thing I wanted to point out, but it escapes me right now.

Full Example

#include <atomic>
#include <chrono>
#include <deque>
#include <iostream>
#include <mutex>
#include <future>
#include <thread>


// shared stuff:
std::deque<std::packaged_task<void()>> tasks;
std::mutex tasks_mutex;
std::atomic<bool> gui_running;


void message()
{
   std::cout << std::this_thread::get_id() << std::endl;
}


void one_off()
{
    std::packaged_task<void()> task(message);
    std::future<void> result = task.get_future();

    {
        std::lock_guard<std::mutex> lock(tasks_mutex);
        tasks.push_back(std::move(task));
    }

    // wait on result
    result.get();
}


void gui_thread()
{
    std::cout << "gui thread: "; message();

    while (gui_running) {
        // process messages
        {
            std::unique_lock<std::mutex> lock(tasks_mutex);
            while (!tasks.empty()) {
                auto task(std::move(tasks.front()));
                tasks.pop_front();

                // unlock during the task
                lock.unlock();
                task();
                lock.lock();
            }
        }

        // "do gui work"
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}


int main()
{
    gui_running = true;

    std::cout << "main thread: "; message();
    std::thread gt(gui_thread);

    for (unsigned i = 0; i < 5; ++i) {
        // note:
        // these will be launched sequentially because result's
        // destructor will block until one_off completes
        auto result = std::async(std::launch::async, one_off);

        // maybe do something with result if it is not void
    }

    // the for loop will not complete until all the tasks have been
    // processed by gui_thread

    // ...

    // cleanup
    gui_running = false;
    gt.join();
}

Dat Output

$ ./messages
main thread: 140299226687296
gui thread: 140299210073856
140299210073856
140299210073856
140299210073856
140299210073856
140299210073856

I have been learning C++ for a while and still haven't come across a good book which would explain what those beasts are. Are they an integral C++ feature? If so, how is it that they are only mentioned in a book like The C++ Programming Language by B.S.? If not, where can you get reliable information about them - preferably a book (I don't really like web tutorials) - how to define them, how to use them, etc.? Thank you for any valuable help.

Locks and Mutexes are concurrency constructs used to ensure two threads won't access the same shared data at the same time, thus achieving correctness.

The current C++ standard (C++03) doesn't feature concurrency tools.

Although you mentioned you prefer books to online tutorials, Herb Sutter's Effective Concurrency column is definitely a must read.

There is also Anthony Williams's upcoming book called C++ Concurrency in Action. Anthony Williams is the author of the Boost.Thread library.

Another threading library worth a look is Intel's TBB.

Locks and mutexes (think: mutual exclusion) allow cooperating threads to synchronize access to shared resources. For a brief overview of the concept, read the Wikipedia article on mutual exclusion.

These concepts are not part of the C++ language. The O'Reilly pthreads book would be a good reference for you, assuming you're on a POSIX platform. For Windows, you might go with Windows System Programming from Addison-Wesley.

They are basic constructs used to ensure correctness in parallel programs. They are included in Boost and in the new C++ standard.
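
To make that concrete, here is a minimal sketch using the interface the new standard provides (a toy example of mutual exclusion, not taken from any of the books mentioned here):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;     // protects counter
long counter = 0; // the shared resource

void work()
{
    for (int i = 0; i < 100000; ++i)
    {
        std::lock_guard<std::mutex> lock(m); // locks m; unlocks when lock leaves scope
        ++counter;                           // only one thread at a time executes this
    }
}

int main()
{
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    std::cout << counter << std::endl; // always 200000: no lost updates
}

Without the lock_guard, the two threads would race on counter and the final value would be unpredictable.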

I can recommend this book, although it doesn't focus on C++: http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916.

I need some recommendations for books/links that discuss the design of multi-threaded data structures for an intermediate-level C++ developer who knows STL/Boost and pthreads individually but would now like to blend these two knowledge streams.

Any help appreciated.

When it comes out in Feb 2011:

Anthony Williams - C++ Concurrency in Action

High on my wishlist...

I'm reading C++ Concurrency in Action by Anthony Williams. In the chapter about designing concurrent code there is a parallel version of the std::for_each algorithm. Here is slightly modified code from the book:

join_thread.hpp

#pragma once

#include <vector>
#include <thread>

class join_threads
{
public:
  explicit join_threads(std::vector<std::thread>& threads)
    : threads_(threads) {}

  ~join_threads()
  {
    for (size_t i = 0; i < threads_.size(); ++i)
    {
      if(threads_[i].joinable())
      {
        threads_[i].join();
      }
    }
  }

private:
  std::vector<std::thread>& threads_;
};

parallel_for_each.hpp

#pragma once

#include <future>
#include <algorithm>

#include "join_threads.hpp"

template<typename Iterator, typename Func>
void parallel_for_each(Iterator first, Iterator last, Func func)
{
  const auto length = std::distance(first, last);
  if (0 == length) return;

  const auto min_per_thread = 25u;
  const unsigned max_threads = (length + min_per_thread - 1) / min_per_thread;

  const auto hardware_threads = std::thread::hardware_concurrency();

  const auto num_threads= std::min(hardware_threads != 0 ?
        hardware_threads : 2u, max_threads);

  const auto block_size = length / num_threads;

  std::vector<std::future<void>> futures(num_threads - 1);
  std::vector<std::thread> threads(num_threads-1);
  join_threads joiner(threads);

  auto block_start = first;
  for (unsigned i = 0; i < num_threads - 1; ++i)
  {
    auto block_end = block_start;
    std::advance(block_end, block_size);
    std::packaged_task<void (void)> task([block_start, block_end, func]()
    {
      std::for_each(block_start, block_end, func);
    });
    futures[i] = task.get_future();
    threads[i] = std::thread(std::move(task));
    block_start = block_end;
  }

  std::for_each(block_start, last, func);

  for (size_t i = 0; i < num_threads - 1; ++i)
  {
    futures[i].get();
  }
}

I benchmarked it with sequential version of std::for_each using the following program:

main.cpp

#include <iostream>
#include <random>
#include <chrono>
#include <vector>
#include <atomic>
#include <cstdint>

#include "parallel_for_each.hpp"

using namespace std;

constexpr size_t ARRAY_SIZE = 500'000'000;
typedef std::vector<uint64_t> Array;

template <class FE, class F>
void test_for_each(const Array& a, FE fe, F f, atomic<uint64_t>& result)
{
  auto time_begin = chrono::high_resolution_clock::now();
  result = 0;
  fe(a.begin(), a.end(), f);
  auto time_end = chrono::high_resolution_clock::now();

  cout << "Result = " << result << endl;
  cout << "Time: " << chrono::duration_cast<chrono::milliseconds>(
            time_end - time_begin).count() << endl;
}

int main()
{
  random_device device;
  default_random_engine engine(device());
  uniform_int_distribution<uint8_t> distribution(0, 255);

  Array a;
  a.reserve(ARRAY_SIZE);

  cout << "Generating array ... " << endl;
  for (size_t i = 0; i < ARRAY_SIZE; ++i)
    a.push_back(distribution(engine));

  atomic<uint64_t> result;
  auto acc = [&result](uint64_t value) { result += value; };

  cout << "parallel_for_each ..." << endl;
  test_for_each(a, parallel_for_each<Array::const_iterator, decltype(acc)>, acc, result);
  cout << "for_each ..." << endl;
  test_for_each(a, for_each<Array::const_iterator, decltype(acc)>, acc, result);

  return 0;
}

The parallel version of the algorithm on my machine is more than two times slower than the sequential one:

parallel_for_each ...
Result = 63750301073
Time: 5448
for_each ...
Result = 63750301073
Time: 2496

I'm using GCC 6.2 compiler on Ubuntu Linux running on Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz.

How can such behavior be explained? Is this because of the sharing of the atomic<uint64_t> variable between threads and cache ping-pong?

I profiled both separately with perf. For the parallel version the stats are the following:

 1137982167      cache-references                                            
  247652893      cache-misses              #   21,762 % of all cache refs    
60868183996      cycles                                                      
27409239189      instructions              #    0,45  insns per cycle        
 3287117194      branches                                                    
      80895      faults                                                      
          4      migrations

And for the sequential one:

  402791485      cache-references                                            
  246561299      cache-misses              #   61,213 % of all cache refs    
40284812779      cycles                                                      
26515783790      instructions              #    0,66  insns per cycle
 3188784664      branches                                                    
      48179      faults
          3      migrations

It is obvious that the parallel version generates far more cache references, cycles and faults, but why?

You are sharing the same result variable: all the threads are accumulating on atomic<uint64_t> result, thrashing the cache!

Every time a thread writes to result, all the caches in the other cores are invalidated: this leads to cache line contention.

More information:

  • "Sharing Is the Root of All Contention".

    [...] to write to a memory location a core must additionally have exclusive ownership of the cache line containing that location. While one core has exclusive use, all other cores trying to write the same memory location must wait and take turns — that is, they must run serially. Conceptually, it's as if each cache line were protected by a hardware mutex, where only one core can hold the hardware lock on that cache line at a time.

  • This article on "false sharing", which covers a similar issue, explains more in depth what happens in the caches.


I made some modifications to your program and achieved the following results (on a machine with an i7-4770K [4 cores/8 threads with hyperthreading]):

Generating array ...
parallel_for_each ...
Result = 63748111806
Time: 195
for_each ...
Result = 63748111806
Time: 2727

The parallel version is roughly 92% faster than the serial version.


  1. std::future and std::packaged_task are heavyweight abstractions. In this case, an std::experimental::latch is sufficient.

  2. Every task is sent to a thread pool. This minimizes thread-creation overhead.

  3. Every task has its own accumulator. This eliminates sharing.

The code is available here on my GitHub. It uses some personal dependencies, but you should understand the changes regardless.


Here are the most important changes:

// A latch is being used instead of a vector of futures.
ecst::latch l(num_threads - 1);

l.execute_and_wait_until_zero([&]
{
    auto block_start = first;
    for (unsigned i = 0; i < num_threads - 1; ++i)
    {
        auto block_end = block_start;
        std::advance(block_end, block_size);

        // `p` is a thread pool.
        // Every task posted in the thread pool has its own `tempacc` accumulator.
        p.post([&, block_start, block_end, tempacc = 0ull]() mutable
        {
            // The task accumulator is filled up...
            std::for_each(block_start, block_end, [&tempacc](auto x){ tempacc += x; });

            // ...and then the atomic variable is incremented ONCE.
            func(tempacc);
            l.decrement_and_notify_all();
        });

        block_start = block_end;
    }

    // Same idea here: accumulate to local non-atomic counter, then
    // add the partial result to the atomic counter ONCE.
    auto tempacc2 = 0ull;
    std::for_each(block_start, last, [&tempacc2](auto x){ tempacc2 += x; });
    func(tempacc2);
});

I'm working on a C++ project that needs to run many jobs in a threadpool. The jobs are failure-prone, which means that I need to know how each job terminated after it completes. Being a Java programmer for the most part, I like the idea of using "futures" or a similar paradigm, akin to the various classes in Java's util.concurrent package.

I have two questions: first, does something like this already exist for C++ (I haven't found anything in Boost, but maybe I'm not looking hard enough); and second, is this even a sane idea for C++?

I found a brief example of what I'm trying to accomplish here:

http://www.boostcookbook.com/Recipe:/1234841

Does this approach make sense?

Futures are both present in the upcoming standard (C++0x) and inside boost. Note that while the main name future is the same, you will need to read into the documentation to locate other types and to understand the semantics. I don't know Java futures, so I cannot tell you where they differ, if they do.

The library in Boost was written by Anthony Williams, who I believe was also involved in the definition of that part of the standard. He has also written C++ Concurrency in Action, which includes a good description of futures, tasks, promises, and related objects. His company also sells a complete and up-to-date implementation of the C++0x threading libraries, if you are interested.
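
As a minimal sketch of how futures map to your failure-prone jobs (using the C++0x/C++11 interface; risky_job is an invented placeholder for one of your jobs):

#include <future>
#include <iostream>
#include <stdexcept>

// A hypothetical failure-prone job: it either returns a value or throws.
int risky_job(int input)
{
    if (input < 0)
        throw std::runtime_error("bad input");
    return input * 2;
}

int main()
{
    // The future stores either the job's result or the exception it threw.
    std::future<int> f = std::async(std::launch::async, risky_job, -1);

    try
    {
        std::cout << f.get() << std::endl; // get() rethrows the stored exception
    }
    catch (const std::exception& e)
    {
        std::cout << "job failed: " << e.what() << std::endl;
    }
}

This gives you exactly the "how did each job terminate" information: a normal return delivers the value, and a failure is rethrown at the point where you collect the result.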

In a multi-producer, multi-consumer situation, if producers are writing into int a and consumers are reading from int a, do I need memory barriers around int a?

We have all learned that shared resources should always be protected, and that the standard does not guarantee proper behavior otherwise.

However, on cache-coherent architectures visibility is ensured automatically, and atomicity of MOV operations on 8-, 16-, 32- and 64-bit variables is guaranteed.

Therefore, why protect int a at all?

However, on cache-coherent architectures visibility is ensured automatically, and atomicity of MOV operations on 8-, 16-, 32- and 64-bit variables is guaranteed.

Unless you strictly adhere to the requirements of the C++ spec to avoid data races, the compiler is not obligated to make your code function the way it appears to. For example:

int a = 0, b = 0; // shared variables, initialized to zero

a = 1;
b = 1;

Say you do this on your fully cache-coherent architecture. On such hardware it would seem that since a is written before b no thread will ever be able to see b with a value of 1 without a also having that value.

But this is not the case. If you have failed to strictly adhere to the requirements of the C++ memory model for avoiding data races, e.g. you read these variables without the correct synchronization primitives being inserted anywhere, then your program may in fact observe b being written before a. The reason is that you have introduced "undefined behavior" and the C++ implementation has no obligation to do anything that makes sense to you.

What may be going on in practice is that the compiler may reorder the writes even if the hardware works very hard to make it seem as if all writes occur in the order of the machine instructions performing them. You need the entire toolchain to cooperate, and cooperation from just the hardware, such as strong cache coherency, is not sufficient.
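
For illustration, a minimal sketch of one possible fix (not the only one): telling the whole toolchain about the intended ordering by making b an atomic with release/acquire semantics:

#include <atomic>

int a = 0;              // ordinary data, published through b
std::atomic<int> b{0};  // the synchronization flag

void writer()
{
    a = 1;                                 // ordered before the store below
    b.store(1, std::memory_order_release); // publish: earlier writes may not sink below this
}

void reader()
{
    if (b.load(std::memory_order_acquire) == 1) // later reads may not hoist above this
    {
        // The acquire load synchronizes with the release store,
        // so this is guaranteed to observe a == 1.
        int observed = a;
        (void)observed;
    }
}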


The book C++ Concurrency in Action is a good source if you care to learn about the details of the C++ memory model and writing portable, concurrent code in C++.

I have the following requirements:

I need a data structure with key/value pairs (keys are integers, if that helps).

I need the operations below:

  1. Iteration(most used)
  2. Insertion (2nd most used)
  3. Searching by key and deletion(least)

I plan to use multiple locks over the structure for concurrent access. What is the ideal data structure to use?

Map or an unordered map?

I think unordered_map makes sense, because I can insert in O(1) and delete in O(1). But I am not sure about iteration. How bad is the performance compared to map?

Also, I plan to use multiple locks on blocks instead of one lock over the whole structure. Any good implementation example of this?

Thanks

The speed of iterator incrementing is O(1) for both containers, although you might get somewhat better cache locality from std::unordered_map.

Apart from the slower O(log N) find/insert/erase functionality of std::map, one other difference is that std::map provides bidirectional iterators, whereas the faster (amortized O(1) element access) std::unordered_map only provides forward iterators.

The excellent book C++ Concurrency in Action: Practical Multithreading by Anthony Williams provides a code example of a multithreaded unordered_map with a lock per entry. This book is highly recommended if you are doing serious multithreaded coding.
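
The book's listing is too long to reproduce here, but a minimal sketch of the lock-striping idea it builds on (one mutex per bucket, so threads touching different buckets never contend; all names here are invented, this is not the book's code) looks roughly like this:

#include <array>
#include <cstddef>
#include <list>
#include <mutex>
#include <utility>

class striped_map
{
public:
    void insert(int key, int value)
    {
        bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lock(b.m); // only this bucket is locked
        b.items.push_back(std::make_pair(key, value));
    }

    bool find(int key, int& out)
    {
        bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lock(b.m);
        for (auto& kv : b.items)
        {
            if (kv.first == key) { out = kv.second; return true; }
        }
        return false;
    }

private:
    static const std::size_t num_buckets = 16;

    struct bucket
    {
        std::mutex m;
        std::list<std::pair<int, int>> items; // key -> value pairs in this bucket
    };

    bucket& bucket_for(int key)
    {
        return buckets_[static_cast<std::size_t>(key) % num_buckets];
    }

    std::array<bucket, num_buckets> buckets_;
};

Iteration is the hard part of such a design: you either lock buckets one at a time (seeing a possibly inconsistent snapshot) or lock them all (blocking writers), a trade-off worth weighing since iteration is your most-used operation.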

I'm currently reading the C++ Concurrency in Action book by Anthony Williams, and it contains several lock-free data structure implementations. In the introduction to the chapter about lock-free data structures, Anthony writes:

This brings us to another downside of lock-free and wait-free code: although it can increase the potential for concurrency of operations on a data structure and reduce the time an individual thread spends waiting, it may well decrease overall performance.

And indeed I tested all the lock-free stack implementations described in the book against the lock-based implementation from one of the previous chapters. And it seems the performance of the lock-free code is always lower than that of the lock-based stack.

In what circumstances are lock-free data structures more optimal and to be preferred?

One benefit of lock-free structures is that they do not require a context switch. However, in modern systems, uncontended locks are also context-switch-free. To benefit (performance-wise) from lock-free algorithms, several conditions have to be met:

  • Contention has to be high
  • There should be enough CPU cores so that a spinning thread can run uninterrupted (ideally, it should be pinned to its own core)

I'm reading C++ Concurrency in Action by Anthony Williams and I'm trying to put together the pieces of the last example in the chapter about lock-free data structures, which is a lock-free queue for multiple producers and multiple consumers. The code which I managed to put together crashes when I test it. I'm publishing the code for the MPMC queue and the test program which I'm using.

lock_free_queue_mpmc.hpp

#pragma once

#include <memory>

template <class T>
class lock_free_queue_mpmc
{
public:
  lock_free_queue_mpmc()
  {
    counted_node_ptr counted_node;
    counted_node.ptr = new node;
    counted_node.external_count = 1;

    head_.store(counted_node);
    tail_.store(head_);
  }

  lock_free_queue_mpmc(const lock_free_queue_mpmc& other) = delete;
  lock_free_queue_mpmc& operator=(const lock_free_queue_mpmc& other) = delete;

  ~lock_free_queue_mpmc()
  {
    counted_node_ptr old_head = head_.load();
    while (node* const old_node = old_head.ptr)
    {
      head_.store(old_node->next);
      delete old_node;
      old_head = head_.load();
    }
  }

  void push(const T& new_value)
  {
    std::unique_ptr<T> new_data(new T(new_value));
    counted_node_ptr new_next;
    new_next.ptr = new node;
    new_next.external_count = 1;
    counted_node_ptr old_tail = tail_.load();

    while (true)
    {
      increase_external_count(tail_, old_tail);
      T* old_data = nullptr;
      if (old_tail.ptr->data.compare_exchange_strong(old_data, new_data.get()))
      {
        counted_node_ptr old_next = {0};
        if (!old_tail.ptr->next.compare_exchange_strong(old_next, new_next))
        {
          delete new_next.ptr;
          new_next = old_next;
        }
        set_new_tail(old_tail, new_next);
        new_data.release();
        break;
      }
      else
      {
        counted_node_ptr old_next = {0};
        if(old_tail.ptr->next.compare_exchange_strong(old_next, new_next))
        {
          old_next = new_next;
          new_next.ptr = new node;
        }
        set_new_tail(old_tail, old_next);
      }
    }
  }

  std::unique_ptr<T> pop()
  {
    counted_node_ptr old_head = head_.load(std::memory_order_relaxed);
    while (true)
    {
      increase_external_count(head_, old_head);
      node* const ptr = old_head.ptr;
      if(ptr == tail_.load().ptr)
      {
        return std::unique_ptr<T>();
      }
      counted_node_ptr next = ptr->next.load();
      if (head_.compare_exchange_strong(old_head,next))
      {
        T* const res = ptr->data.exchange(nullptr);
        free_external_counter(old_head);
        return std::unique_ptr<T>(res);
      }
      ptr->release_ref();
    }
  }

private:
  struct node;

  struct counted_node_ptr
  {
    int external_count;
    node* ptr;
  };

  struct node_counter
  {
    unsigned internal_count    : 30;
    unsigned external_counters : 2;
  };

  struct node
  {
    std::atomic<T*> data;
    std::atomic<node_counter> count;
    std::atomic<counted_node_ptr> next;

    node()
    {
      node_counter new_count;
      new_count.internal_count    = 0;
      new_count.external_counters = 2;
      count.store(new_count);

      counted_node_ptr new_next;
      new_next.ptr            = nullptr;
      new_next.external_count = 0;
      next.store(new_next);
    }

    void release_ref()
    {
      node_counter old_counter = count.load(std::memory_order_relaxed);
      node_counter new_counter;

      do
      {
        new_counter=old_counter;
        --new_counter.internal_count;
      }
      while(!count.compare_exchange_strong(old_counter, new_counter,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed));

      if(!new_counter.internal_count && !new_counter.external_counters)
      {
        delete this;
      }
    }
  };

private:
  void set_new_tail(counted_node_ptr& old_tail,
                    const counted_node_ptr& new_tail)
  {
    node* const current_tail_ptr = old_tail.ptr;

    while (!tail_.compare_exchange_weak(old_tail, new_tail) &&
           old_tail.ptr == current_tail_ptr);

    if(old_tail.ptr == current_tail_ptr)
    {
      free_external_counter(old_tail);
    }
    else
    {
      current_tail_ptr->release_ref();
    }
  }

  static void increase_external_count(std::atomic<counted_node_ptr>& counter,
                                      counted_node_ptr& old_counter)
  {
    counted_node_ptr new_counter;

    do
    {
      new_counter = old_counter;
      ++new_counter.external_count;
    }
    while(!counter.compare_exchange_strong(old_counter, new_counter,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed));

    old_counter.external_count = new_counter.external_count;
  }

  static void free_external_counter(counted_node_ptr& old_node_ptr)
  {
    node* const ptr = old_node_ptr.ptr;
    const int count_increase = old_node_ptr.external_count - 2;
    node_counter old_counter= ptr->count.load(std::memory_order_relaxed);
    node_counter new_counter;

    do
    {
      new_counter = old_counter;
      --new_counter.external_counters;
      new_counter.internal_count += count_increase;
    }
    while(!ptr->count.compare_exchange_strong(old_counter, new_counter,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed));

    if(!new_counter.internal_count && !new_counter.external_counters)
    {
      delete ptr;
    }
  }

private:

  std::atomic<counted_node_ptr> head_;
  std::atomic<counted_node_ptr> tail_;

};

main.cpp

#include <iostream>
#include <vector>
#include <algorithm>
#include <thread>
#include <atomic>
#include <chrono> // for the timing in main()

#include "lock_free_queue_mpmc.hpp"

using namespace std;

constexpr size_t PRODUCER_THREADS_COUNT = 100;
constexpr size_t CONSUMER_THREADS_COUNT = 100;
constexpr size_t ELEMENTS_COUNT = 100'000'000;

std::atomic<size_t> valuesProduced = { 0 };
std::atomic<size_t> valuesConsumed = { 0 };
std::atomic<uint64_t> sum = { 0 };

lock_free_queue_mpmc<int> g_queue;

// -----------------------------------------------------

void producerFunc()
{
  while (valuesProduced < ELEMENTS_COUNT)
  {
    ++valuesProduced;
    g_queue.push(1);
  }
}

// -----------------------------------------------------

void consumerFunc()
{
  while (valuesConsumed < ELEMENTS_COUNT)
  {
    auto value = g_queue.pop();
    if (value)
    {
      sum += *value;
      ++valuesConsumed;
    }
  }
}

// -----------------------------------------------------

int main()
{
  auto timeBegin = chrono::high_resolution_clock::now();

// -----------------------------------------------------

  vector<thread> producerThreads;
  producerThreads.reserve(PRODUCER_THREADS_COUNT);

  for (size_t i = 0; i < PRODUCER_THREADS_COUNT; ++i)
  {
    producerThreads.emplace_back(producerFunc);
  }

// -----------------------------------------------------

  vector<thread> consumerThreads;
  consumerThreads.reserve(CONSUMER_THREADS_COUNT);

  for (size_t i = 0; i < CONSUMER_THREADS_COUNT; ++i)
  {
    consumerThreads.emplace_back(consumerFunc);
  }

// -----------------------------------------------------

  for_each(producerThreads.begin(), producerThreads.end(), mem_fn(&thread::join));
  for_each(consumerThreads.begin(), consumerThreads.end(), mem_fn(&thread::join));

// -----------------------------------------------------

  auto timeEnd = chrono::high_resolution_clock::now();
  cout << "Sum: " << sum << endl;
  cout << "Time: " << chrono::duration_cast<chrono::milliseconds>(
            timeEnd - timeBegin).count() << endl;

  return 0;
}

GDB backtrace

#0  0x00007ffff70fcc37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff7100028 in __GI_abort () at abort.c:89
#2  0x00007ffff71392a4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff72476b0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff714555e in malloc_printerr (ptr=<optimized out>, str=0x7ffff72477e0 "double free or corruption (out)", action=1) at malloc.c:4996
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3840
#5  0x00000000004029fb in std::default_delete<int>::operator() (this=0x7fffee5bde20, __ptr=0x7ffff00009c0) at /usr/include/c++/6/bits/unique_ptr.h:76
#6  0x0000000000401e0f in std::unique_ptr<int, std::default_delete<int> >::~unique_ptr (this=0x7fffee5bde20, __in_chrg=<optimized out>) at /usr/include/c++/6/bits/unique_ptr.h:236
#7  0x000000000040127f in consumerFunc () at /home/bobeff/data/Dropbox/sources/books_sources/cpp_concurrency/21_lock_free_queue_mpmc/main.cpp:38
#8  0x0000000000403fc7 in std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (this=0x619de8) at /usr/include/c++/6/functional:1400
#9  0x0000000000403f67 in std::_Bind_simple<void (*())()>::operator()() (this=0x619de8) at /usr/include/c++/6/functional:1389
#10 0x0000000000403ea2 in std::thread::_State_impl<std::_Bind_simple<void (*())()> >::_M_run() (this=0x619de0) at /usr/include/c++/6/thread:196
#11 0x00007ffff773287f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007ffff7bc4184 in start_thread (arg=0x7fffee5be700) at pthread_create.c:312
#13 0x00007ffff71c037d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Can someone find what is wrong with the code?

I'm updating my C++ skills to C++11. I'm up to threads, always a problem area. Consider this testing code:

// threaded.h
#include <atomic>
#include <thread>

class MyThreadedClass
{
public:
    MyThreadedClass();
    bool StartThread();
    bool IsThreadDone();
    inline void WorkThread();

private:
    std::thread* workThread;
    std::atomic<bool> threadDone;
};

// threaded.cpp
MyThreadedClass::MyThreadedClass() {
    workThread = nullptr;
    threadDone.store(true);
}

bool MyThreadedClass::StartThread() {
    if (!threadDone.load()) { return false; }
    threadDone.store(false);
    workThread = new std::thread(&MyThreadedClass::WorkThread, this);
    workThread->detach();
    return true;
}

bool MyThreadedClass::IsThreadDone() {
    return threadDone.load();
}

inline void MyThreadedClass::WorkThread() {
    while (some_condition) { /*Do work*/ }
    threadDone.store(true);
}

// main.cpp
int main() {
    MyThreadedClass testInstance;
    testInstance.StartThread();
    for (int x = 0; x < 10; x++) {
        do {
            // This is sometimes true:
            if (testInstance.StartThread()) { return 1;}
        } while (!testInstance.IsThreadDone());
    }
    return 0;
}

I wanted to look at a worst case scenario for this type of code therefore I'm pounding it continually in main while waiting for thread to terminate. Sometimes the failure condition in main is triggered. As with many threading problems it's not consistent and therefore not easy to debug.

The threadDone variable is used because I access a file in my actual code and don't want multiple threads accessing the same file.

Insight into what I'm missing or ways to redesign this with C++11 idioms welcome.

The best reference / learning book I know of for C++11 concurrency is here: C++ Concurrency in Action: Practical Multithreading by Anthony Williams.
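
That book covers this in depth. In the meantime, one common C++11 redesign (a sketch of the idea, not a drop-in replacement for your class) is to let a std::future own the worker instead of a detached thread plus a hand-rolled done-flag, which removes the race between StartThread and the flag update:

#include <chrono>
#include <future>

class MyThreadedClass
{
public:
    // Returns false if the previous run is still in flight.
    bool StartThread()
    {
        if (work.valid() && !IsThreadDone())
            return false;
        work = std::async(std::launch::async, &MyThreadedClass::WorkThread, this);
        return true;
    }

    bool IsThreadDone()
    {
        return !work.valid() ||
               work.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
    }

private:
    void WorkThread() { /* do work */ }

    std::future<void> work; // owns the worker; no detach, no atomic flag
};

Note that this still assumes StartThread and IsThreadDone are called from a single controlling thread, as in your main.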

I have implemented the Jacobi algorithm based on the routine described in the book Numerical Recipes, but since I plan to work with very large matrices, I am trying to parallelize it using OpenMP.

void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
    double g, h;
    g = a(i, j);
    h = a(k, l);
    a(i, j) = g - s * (h + g * tau);
    a(k, l) = h + s * (g - h * tau);
}

void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d )
{
int j,iq,ip,i;
double tresh,theta,tau,t,sm,s,h,g,c;

VectorXd b(n);
VectorXd z(n);

v.setIdentity();    
z.setZero();

#pragma omp parallel for 
for (ip=0;ip<n;ip++)
{   
    d(ip)=a(ip,ip);
    b(ip)=d(ip);
}

for (i=0;i<50;i++) 
{
    sm=0.0;
    for (ip=0;ip<n-1;ip++) 
    {
        #pragma omp parallel for reduction (+:sm)
        for (iq=ip+1;iq<n;iq++)
            sm += fabs(a(ip,iq));
    }
    if (sm == 0.0) {
        break;
    }

    if (i < 3)
    tresh=0.2*sm/(n*n); 
    else
    tresh=0.0;  

    #pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
    for (ip=0;ip<n-1;ip++)
    {
    //#pragma omp parallel for private (g,h,t,theta,c,s,tau)
        for (iq=ip+1;iq<n;iq++)
        {
            g=100.0*fabs(a(ip,iq));
            if (i > 3 && (fabs(d(ip))+g) == fabs(d[ip]) && (fabs(d[iq])+g) == fabs(d[iq]))
            a(ip,iq)=0.0;
            else if (fabs(a(ip,iq)) > tresh)
            {
                h=d(iq)-d(ip);
                if ((fabs(h)+g) == fabs(h))
                {
                    t=(a(ip,iq))/h;
                }   
                else 
                {
                    theta=0.5*h/(a(ip,iq));
                    t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
                    if (theta < 0.0)
                    {
                        t = -t;
                    }
                    c=1.0/sqrt(1+t*t);
                    s=t*c;
                    tau=s/(1.0+c);
                    h=t*a(ip,iq);

                   #pragma omp critical
                    {
                    z(ip)=z(ip)-h;
                    z(iq)=z(iq)+h;
                    d(ip)=d(ip)-h;
                    d(iq)=d(iq)+h;
                    a(ip,iq)=0.0;


                    for (j=0;j<ip;j++)
                        ROTATE(a,j,ip,j,iq,s,tau);
                    for (j=ip+1;j<iq;j++)
                        ROTATE(a,ip,j,j,iq,s,tau);
                    for (j=iq+1;j<n;j++)
                        ROTATE(a,ip,j,iq,j,s,tau);
                    for (j=0;j<n;j++)
                        ROTATE(v,j,ip,j,iq,s,tau);
                    }

                }
            } 
        }
    }


}

}

I wanted to parallelize the loop that does most of the calculations and both comments inserted in the code:

 //#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
 //#pragma omp parallel for private (g,h,t,theta,c,s,tau)

are my attempts at it. Unfortunately both of them end up producing incorrect results. I suspect the problem may be in this block:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

because usually this sort of accumulation would need a reduction, but since each thread accesses a different part of the array, I am not certain of this.

I am not really sure whether I am doing the parallelization correctly, because I have only recently started working with OpenMP, so any suggestions or recommendations would also be welcome.

Side note: I know there are faster algorithms for eigenvalue and eigenvector determination, including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors, and this algorithm is.

My thanks in advance.

Edit: I considered the correct answer to be the one provided by The Quantum Physicist, because what I did does not reduce the computation time for systems of sizes up to 4096x4096. In any case, I corrected the code in order to make it work, and maybe for big enough systems it could be of some use. I would advise the use of timers to test whether the

#pragma omp for

actually decreases the computation time.

I'll try to help, but I'm not sure this is the answer to your question.

There are tons of problems with your code. My friendly advice for you is: Don't do parallel things if you don't understand the implications of what you're doing.

For some reason, it looks like you think that putting everything in a parallel #pragma for will make it faster. This is VERY wrong. Spawning threads is an expensive thing to do and costs (relatively) a lot of memory and time. So if you redo that #pragma for for every loop, you'll respawn threads for every loop, which will significantly reduce the speed of your program... UNLESS your matrices are REALLY huge and the computation time is >> the cost of spawning them.

I fell into a similar issue when I wanted to multiply huge matrices element-wise (and then I needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices into linear arrays, distribute the array chunks to the threads, and then run a for loop where every iteration uses only elements that are guaranteed to be independent of the others, letting them all evolve independently. This was quite fast. Why? Because I never had to respawn the threads twice.
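A sketch of that chunking idea (illustrative code, not the original; it assumes a and b have equal size and nthreads >= 1): each thread gets one contiguous range and writes only its own slot of the partial-sum array, so no locks are needed and the threads are spawned exactly once:

#include <numeric>
#include <thread>
#include <vector>

double parallel_dot(const std::vector<double>& a, const std::vector<double>& b,
                    unsigned nthreads)
{
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = a.size() / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == nthreads) ? a.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            double s = 0.0;                 // thread-local accumulator
            for (std::size_t i = begin; i < end; ++i)
                s += a[i] * b[i];
            partial[t] = s;                 // each thread writes a distinct element
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}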

Why are you getting wrong results? I believe the reason is that you're not respecting the rules of shared memory. You have some variable(s) that are being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does the function z do? Does it take stuff by reference? What I see here:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

Looks VERY thread-unsafe, and I don't understand what you're doing. Are you returning a reference that you then modify? That is a recipe for thread non-safety. Why don't you create clean arrays and deal with those instead?

How to debug: start with a small example (a 2x2 matrix, maybe) and only 2 threads, and try to understand what's going on. Use a debugger, define breakpoints, and check what information is shared between the threads.

Also consider using a mutex to check what data gets ruined when it becomes shared. Here is how to do it.
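A sketch of that diagnostic idea, assuming a std::mutex visible to all worker threads: temporarily serialize the suspect updates with a scoped lock, and if the wrong results disappear, you have found your race.

#include <mutex>

std::mutex diag_mutex;   // diagnostic only: serializes the suspect updates

// ... inside the parallel region:
{
    std::lock_guard<std::mutex> guard(diag_mutex);
    z(ip) -= h;
    z(iq) += h;
    d(ip) -= h;
    d(iq) += h;
}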

My recommendation: don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back when C++ didn't have any native multi-threading implementation. So learn how to use std::thread, and use it; and if you need to run many things in threads, learn how to create a thread pool with std::thread. This is a good book for learning multithreading.
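A bare-bones sketch of that thread-pool idea (a simplification under my own assumptions, not code from the book): the workers are spawned once, then fed tasks through a locked queue.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lk(m);
            done = true;
        }
        cv.notify_all();                    // wake everyone so they can exit
        for (auto& w : workers) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();                         // run the task outside the lock
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};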

I use a thread in order to provide a shell to user, in an OpenGL application.

My problem is that I can't cancel the thread at the end of my main loop, because std::thread doesn't provide a cancel method, and my thread is blocked in a std::cin >> var call, so I can't use a boolean to signal that the thread should stop.

I would like to know if there is a good way of using std::cin in a thread (std::thread), or alternative solutions.

What you might want is an interruptible thread. C++ doesn't give you one, but C++ Concurrency in Action has a simple implementation of one. That might be what you need. I wouldn't be surprised if Boost also has one, since the book is written by the maintainer of the Boost thread library.

#include <future>
#include <thread>

class interrupt_flag
{
public:
    void set();
    bool is_set() const;
};
thread_local interrupt_flag this_thread_interrupt_flag;

class interruptible_thread
{
    std::thread internal_thread;
    interrupt_flag* flag;
public:
    template<typename FunctionType>
    interruptible_thread(FunctionType f)
    {
        std::promise<interrupt_flag*> p;
        internal_thread = std::thread([f, &p] {
            p.set_value(&this_thread_interrupt_flag);
            f();
        });
        flag = p.get_future().get();
    }
    void interrupt()
    {
        if (flag) {
            flag->set();
        }
    }
};

Now you can easily cancel the thread from your main. I put a link to the book, but all the code from the book is also available online for free. Here is a link to the source code from the book, though it may be hard to navigate without the book, which is a great read, BTW.
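A hedged usage sketch (shell_loop is an illustrative stand-in for your shell; note that the snippet above omits the join()/detach() forwarding the book's full class provides, which you would call before the interruptible_thread is destroyed):

void shell_loop() {
    while (!this_thread_interrupt_flag.is_set()) {
        // read a command, process it, then loop back and re-check the flag
    }
}

int main() {
    interruptible_thread shell(shell_loop);
    // ... run the OpenGL main loop ...
    shell.interrupt();   // asks the shell thread to stop at its next check
}

Keep in mind that interruption is cooperative: a thread blocked inside std::cin >> var will not notice the flag until the read returns, which is exactly why the book pairs this class with interruptible waits.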

Suppose I have a class that represents some data structure, called foo:

class foo{
  public:
    foo(){
      attr01 = 0;
    }
    void f(){
      attr01 += 5;
    }
  private:
    int attr01;
};

class fooSingleThreadUserClass{
    void usefoo(){
      fooAttr.f();
    }

    foo fooAttr;
};

Now suppose that later in development I find out that I need multithreading. Should I add the mutex in foo?

class foo{
  public:
    foo(){
      attr01 = 0;
    }
    void f(){
      attr01Mutex.lock();
      attr01 += 5;
      attr01Mutex.unlock();
    }
  private:
    int attr01;
    std::mutex attr01Mutex;
};

class fooMultiThreadUserClass{
    void usefoo(){
      std::thread t1(&fooMultiThreadUserClass::useFooWorker, this);
      std::thread t2(&fooMultiThreadUserClass::useFooWorker, this);
      std::thread t3(&fooMultiThreadUserClass::useFooWorker, this);
      std::thread t4(&fooMultiThreadUserClass::useFooWorker, this);

      t1.join();
      t2.join();
      t3.join();
      t4.join();
    }

    void useFooWorker(){
      fooAttr.f();
    }

    foo fooAttr;
};

I know that fooMultiThreadUserClass will now be able to use foo without races and with good performance, but will fooSingleThreadUserClass lose performance due to mutex overhead? I would be very interested to know. Or should I derive fooCC from foo for concurrency purposes, so fooSingleThreadUserClass can keep using foo without a mutex and fooMultiThreadUserClass uses fooCC with the mutex, as shown below?

class fooCC : public foo{
  public:
    fooCC() : foo() {}
    void f(){  // I assume that foo::f() is now a virtual function.
      attr01Mutex.lock();
      foo::f();
      attr01Mutex.unlock();
    }
  private:
    std::mutex attr01Mutex;
};

Also assume that compiler optimization has already taken care of the virtual dispatch. I would like an opinion on whether I should use inheritance or simply put the mutex lock in the original class.

I've searched through Stack Overflow already, but I guess my question is a little too specific.

Edit: Note, there doesn't have to be just one attribute; the question is meant to be abstract, with a class of n attributes.

Use an std::lock_guard. The lock_guard takes a mutex in its constructor. During construction, the lock_guard locks the mutex. When the lock_guard goes out of scope, its destructor automatically releases the lock.

class foo
{
private:
  std::mutex mutex;
  int attr01;

public:
  foo() {
    attr01 = 0;
  }

  void f(){
    std::lock_guard<std::mutex> lock (mutex);
    attr01 += 5;
  }
};

You can make the mutex mutable if you need to be able to lock or unlock it from const member functions. I usually leave mutable off the mutex until I specifically need it.
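For example, a small sketch:

#include <mutex>

class counter {
public:
    void increment() {
        std::lock_guard<std::mutex> guard(mutex);
        ++value;
    }
    int get() const {
        // const member function: locking compiles only because the mutex is mutable
        std::lock_guard<std::mutex> guard(mutex);
        return value;
    }
private:
    mutable std::mutex mutex;
    int value = 0;
};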

Will it lose performance? It depends. If you are calling the function millions of times, then the overhead of locking the mutex may become a problem (it is not free). If the function takes a long time to execute and is called frequently by many threads, then the contention may hinder performance. If you can't pinpoint a specific concern, just use std::lock_guard.

Hans Passant brings up a valid concern that is out of scope for your question. I think Herb Sutter (?) wrote about this in one of his website articles; unfortunately I can't find it right now. To understand why multi-threading is so hard, and why locks on single data fields are "not enough", read a book on multi-threaded programming like C++ Concurrency in Action: Practical Multithreading.

I'm looking for an example/tutorial of a real-time thread-safe buffering algorithm (i.e. wait-free queue algorithms?). Searching the web, I found some pretty heavy-duty master's thesis solutions. I need something (in plain 21st-century English) to solve a real-time application I'm working on. Thanks!

The book C++ Concurrency in Action has a chapter dedicated to implementations of such data structures, and since Objective-C can use C++, this works.
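As a starting point, here is a minimal lock-based queue in the spirit of the book's threadsafe_queue (a sketch, not the book's exact code; its lock-free versions are considerably more involved):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class threadsafe_queue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(std::move(value));
        }
        cv.notify_one();                    // wake one waiting consumer
    }
    T wait_and_pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });   // sleep until data arrives
        T value = std::move(q.front());
        q.pop();
        return value;
    }
private:
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
};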

I want to synchronize threads in C++ using pthreads in a smart way.

I have one global variable:

int Resources = 0;

I have two thread functions:

void *incResources(void *arg)
{
   while(1)
   {
     pthread_mutex_lock (&resourcesMutex);
     Resources += 2;
     pthread_mutex_unlock (&resourcesMutex);
   }

   pthread_exit((void*) 0);
}

void *consumeResources(void *arg)
{
   while(1)
   {
     pthread_mutex_lock (&resourcesMutex);
     Resources--;
     pthread_mutex_unlock (&resourcesMutex);
   }
   pthread_exit((void*) 0);
 }

And in the main function I initialize two consuming threads and one incrementing thread:

pthread_mutex_init(&resourcesMutex, NULL);

pthread_create(&callThd[0], &attr, incResources, (void *)i);
pthread_create(&callThd[1], &attr, consumeResources, (void *)i);
pthread_create(&callThd[2], &attr, consumeResources, (void *)i);

I feel this is so inefficient, and that it can be done better. Can you give me some ideas? I've tried to use wait but I don't get it :/ Thanks!

If you are looking for good, C++ ways, I strongly suggest reading C++ Concurrency in Action by Anthony Williams, and leaving pthreads behind to use futures and similar high-level things where you can. And if you must go with manual thread fiddling, you can find good examples there too.

Your problem statement is too vague for sensible advice -- the basic idea of good threading is to NOT have shared state at all, and for a handshake situation like yours likely is, to use a synchronized queue made for that very purpose.
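To illustrate the "wait" you were reaching for, here is a sketch of the same handshake in C++11 (a translation of the idea, not a drop-in replacement): a condition variable lets the consumers sleep until there is something to consume instead of spinning on the lock, and the predicate also stops the counter from going negative.

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
int resources = 0;

void producer() {
    for (;;) {
        {
            std::lock_guard<std::mutex> lk(m);
            resources += 2;
        }
        cv.notify_all();                    // wake consumers waiting for resources
    }
}

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return resources > 0; });   // sleeps until there is work
        --resources;
    }
}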

Trying my luck at making a lock-free singly linked list implementation.

typedef _Atomic struct _node
  {
    void *data;
    struct _node *next;
  } Node;

Does this make all the members of the struct atomic as well?

void add_head ( Linked_list* list, void* data )
{
  if ( debugging )
  {
      printf ( "%s\n", __func__ );
  }
  Node *node = ( Node* ) calloc ( 1, sizeof (Node ) );
  //init_node_mutex(node);
  //lock_node_mutex(node);
  atomic_exchange ( &node->next, NULL );
  atomic_exchange ( &node->data, data );

  if ( list->head == NULL )
  {
      Node* the_tail = atomic_load ( &list->tail );
      //  lock_node_mutex ( the_tail );
      atomic_exchange ( &node->next, NULL );
      atomic_compare_exchange_weak ( &list->tail, &the_tail, node );

      //unlock_node_mutex ( the_tail );

  }
  else
  {

      Node* the_next = atomic_load ( &node->next );
      // lock_node_mutex ( the_next );
      atomic_compare_exchange_weak ( &node->next, &the_next, list->head );
      // unlock_node_mutex ( the_next );
  }

  Node* the_head = atomic_load ( & list->head );
  //lock_node_mutex ( the_head );
  atomic_store ( &list->head, node );
  atomic_store ( &list->current, node );
  //unlock_node_mutex ( the_head );
  //unlock_node_mutex(node);
  atomic_fetch_add ( &list->size, 1 );
}

Are my usages of atomic_load and atomic_store correct?

In addition to @MikeRobinson's comment, I would add that while your code is "lock-free" in the sense that it does not contain any explicit use of locks, it is now (somewhat ironically) no longer thread-safe. Writing lock-free code is enormously difficult. I recommend reading through this to get a bird's-eye view of the world, and then reading this to get some details, or Chapter 7 of this book (it's in C++). You can always look through the source of Boost.LockFree for inspiration.

I need to share a map:

map <string, vector< pair <string,string> > > repoMap;

among the parent and forked child processes, so that each of them is aware when one of them adds/deletes an element in the map.

Please provide a small and quick solution with an example.

"Keeping track" is not something real in low-language programming. You can do the following:

  • Make a function that checks content, that is thread safe
  • Make a function that adds content and reports changes, that is thread safe
  • Make a function that removes content and reports changes, that is thread safe

I'm going to give a quick solution, but then you have to learn more to make it fancier.

The key to this in C++11 is using std::mutex, which provides a lock on the map. For example:

std::mutex lock; //must be accessible by all threads: either global (not the best, but it will work) or a member of your class
void addElementToMap(map<string, vector< pair <string,string> > >& myMap, string key, pair<string,string> element)
{
    std::lock_guard<std::mutex> guard(lock);
    myMap[key].push_back(element);
    //the guard unlocks the mutex when destroyed
}

now you can use std::thread to call this function, by for example:

std::thread myThread(addElementToMap, std::ref(mapToModify), key, element);

You use std::ref() here because std::thread copies its arguments by default.

It seems like you have to learn all the new facilities of C++11. This is a good book.

Feel free to ask questions, but please do some research about this, too.

Good luck!