Using Intel® TBB

Do Intel® Inspector XE and Intel® VTune™ Amplifier XE support TBB?



Yes. Applications threaded with Threading Building Blocks can be analyzed with both Intel® Inspector XE and Intel® VTune™ Amplifier XE.


If I’m writing new code, how do I choose between TBB, OpenMP, and MPI? Can TBB and OpenMP be used together in the same program?


First, you’ll want to look at the development environment. If the code is written in C++, it’s likely that TBB is the best fit. TBB matches especially well with code that is highly object-oriented and makes heavy use of C++ templates and user-defined types. If the code is written in C or FORTRAN, OpenMP may be the better solution: it fits more naturally into a structured coding style, and for simple cases it introduces less coding overhead. TBB and native threads don’t require specific compiler support; OpenMP does. Using OpenMP requires that you compile with a compiler that recognizes OpenMP pragmas.

Next, look at what you want to make parallel. Use OpenMP if the parallelism is primarily for bounded loops over built-in types, or if it is flat do-loop centric parallelism. OpenMP works especially well with large and predictable data parallel problems. It can be very challenging to match OpenMP performance with TBB for such problems. It is seldom worth the effort to bother – just use OpenMP. TBB excels at the common problem of having less structured or consistent parallelism in a program.

TBB relies on generic programming; use its loop parallelization patterns if you need to work with custom iteration spaces or complex reduction operations. Also, consider using TBB if you need to go beyond loop-based parallelism, since it provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sorts and prefixes.

Finally, what happens if you come upon a case where either TBB or OpenMP could be a usable option? Then look at the features within the APIs. If you need features exclusive to OpenMP, choose OpenMP. If you need features exclusive to TBB, use TBB. If the features you need are available in both, we recommend you use TBB. If you are already using OpenMP for such features and plan to add TBB moving forward, it is a good idea to replace the OpenMP code with TBB. This is because TBB is designed to anticipate incremental parallelization, allowing additional parallelism to be added without creating unnecessary threads that can lead to over-utilization.

And TBB and OpenMP can coexist: you can use TBB in one part of your application while using OpenMP in another. That way you can use the tool that best fits each problem.

What Intel® TBB synchronization primitive should I use if I want a waiting thread not to consume CPU cycles?


Use tbb::mutex, a cross-OS wrapper class that wraps a POSIX mutex on Linux and a CRITICAL_SECTION on Windows*. A thread blocked on it is suspended by the OS rather than spinning.

What if I already have code written using OpenMP… should I migrate that code to TBB?


In general, the answer is no, since TBB and OpenMP serve different needs based on the development environment and the actual need for parallelism and the associated algorithms. The exception is when you are already using OpenMP for certain features that overlap and plan to add TBB moving forward; then it is a good idea to replace the OpenMP code with TBB. This is because TBB is designed to anticipate incremental parallelization, allowing additional parallelism to be added without creating unnecessary threads that can lead to over-utilization.

Do I have to initialize task scheduler to use TBB concurrent containers?


If you want to use TBB concurrent containers only, you don’t have to initialize TBB task scheduler.

I write software of <a particular nature>. Is TBB use appropriate for me?


It depends on what your application profile is. TBB does not try to replace I/O threads or GUI threads or general Win Threads. TBB is best for computational tasks that are not prone to frequent waiting for I/O or events in order to proceed (this is an area the TBB team does want to tackle later).


Do TBB concurrent containers use OS synchronization objects?



No, they don’t. TBB concurrent containers utilize TBB user-level synchronization primitives and atomic operations.

Are there any books planned to help developers better understand how to use Intel® TBB?

In July 2012, Morgan Kaufmann released Structured Parallel Programming, written by parallel computing experts and industry insiders Michael McCool, Arch Robison, and James Reinders. It describes how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach, presents both theory and practice, and gives detailed concrete examples using multiple programming models, primarily two of the most popular and cutting-edge models for parallel programming: Threading Building Blocks and Cilk Plus. O’Reilly Media released a book in mid-July 2007 on Threading Building Blocks, written by James Reinders. It includes comments from many people, including a foreword by Alex Stepanov (father of the Standard Template Library) and an introduction by Arch Robison (architect of Intel® TBB). Half the book covers examples, which makes it an excellent resource for learning TBB.

How does this compare with Boost threads?

With TBB, we are attempting to address a problem that did not previously have an appropriate solution. We are not trying to replace Boost or to be solely a thread wrapper. With tasks, we do look to abstract the developer from low-level threading details. However, the true benefit of TBB is in its ability to provide a scalable solution for multi-core hardware in a way that is easy for C++ developers to implement.

Is it thread-safe to access and modify elements of tbb::concurrent_vector without locking?


No, you have to use locks explicitly.

Do I have to use Intel® compilers?

No. You should be able to use any ISO-compliant C++ compiler. We have tested it very well with the GNU (gcc) compiler, Intel’s C++ compiler, Microsoft’s compiler, and Apple’s gcc. We have also built successfully with a variety of other compilers. Check the web site for updates on experiences with different systems and compilers.

Should I expect Intel® TBB to outperform Intel® OpenMP and Intel® MPI?


No, Intel® TBB may offer a competitive alternative but in general Intel® TBB exists to help where OpenMP cannot, and to be far easier to program than Intel® MPI. With Intel® TBB, we’re looking to provide a unique solution that provides acceptable scalability for multi-core platforms. Intel® OpenMP and Intel® MPI continue to be good choices in High Performance Computing applications; Intel® TBB has been designed to be more conducive to application parallelization on client platforms such as laptops and desktops, going beyond data parallelism to be suitable for programs with nested parallelism, irregular parallelism and task parallelism.

What makes tbb::concurrent_vector a thread-safe container?


It is thread-safe to grow tbb::concurrent_vector concurrently.

Does TBB support non-Intel hardware, and/or operating systems other than Windows/Linux/Mac OS?

Yes it does. TBB is now truly cross platform and portable across different hardware platforms. One of the more significant processor issues has to do with the difference between weak and strong memory ordering. We have ported TBB to both types successfully, which paves the way for easy ports to other processors.

How did TBB come to be?


We actually researched several academic and historical models and developed TBB using proven methods that solved the problems we were trying to solve with TBB. Then we adapted the product during development to adjust to customer requests and preferences, and here we are. James Reinders covered this in more depth in Chapter 12 of his book on TBB.

Is it thread-safe to use tbb::concurrent_hash_map iterator or do I have to use locks if I want to iterate over tbb::concurrent_hash_map concurrently?


No, it’s not thread-safe. You have to use locks for whole-table operations such as iteration.

There seem to be a lot of disparate “pieces” in Intel® TBB. How do I organize the pieces in my head so as to understand the whole of TBB better?


Focus on the algorithms and containers/data structures that allow the quickest introduction of parallelism. Underneath that you have the task scheduler for parallel algorithms and the synchronization primitives, allowing you to build your own algorithms. Use of the scalable memory allocator is important, but can come later in your usage.



How will TBB help my software run on future processors?


The great part about TBB is that it will detect the number of cores on the hardware platform and make the necessary adjustments to allow your software to adapt. All you have to do is ensure that you use the latest versions of the TBB libraries, and you will have to do no new work as new platforms with more cores are introduced.

Does TBB task scheduler replace OS scheduler?


No, it does not. TBB task scheduler creates and manages a pool of OS worker threads.

Can I assign priority to TBB task?


No, the current version of the product does not support assigning priority to the tasks.


If I have an n-core system, and I have other important programs running on some of those cores, will TBB take over all n cores? Or will it leave some of the cores alone to do other things?


TBB creates a thread pool and subdivides tasks amongst threads, managing load balancing and cache efficiency. It co-exists with other threading packages, and the OS scheduler is ultimately what looks at the ordered tasks coming from TBB and the other threads and sends them to the hardware. In the future, we hope to see additional interfaces in operating systems to coordinate threaded applications including those built with TBB. We agree with those who have called for OSes to get out of the business of scheduling threads and focus instead on allocation of processors to applications. It’s an interesting topic to say the least.

What are the library components?

Threading Building Blocks contains the following library components:

Generic Parallel Algorithms

   * parallel_for
   * parallel_reduce
   * parallel_scan
   * parallel_sort
   * parallel_while
   * parallel_do
   * pipeline 

Assistant Classes to Use with Algorithms

   * blocked_range (for use with algorithms, containers, etc.)
   * blocked_range2d (for use with algorithms, containers, etc.)
   * blocked_range3d (for use with algorithms, containers, etc.) 

Thread-Safe Containers

   * concurrent_hash_map
   * concurrent_queue
   * concurrent_vector

Synchronization Primitives

   * atomic
   * spin_mutex
   * spin_rw_mutex (reader-writer spin mutex)
   * queuing_mutex
   * queuing_rw_mutex (reader-writer queuing mutex)
   * mutex

Task Scheduler

Memory Allocation

   * scalable_allocator
   * cache_aligned_allocator
   * aligned_space 

Timing

   * tick_count
Is there a way to make TBB task scheduler execute TBB task on a particular core?


No, the current version of the product does not support task affinity.


Where do you see Intel® TBB going in the future?

Intel® TBB will go where our customers take it – we’ll study the input we receive from developers based on their experience with their application use. We’ll encourage experimentation and look to fold the best ideas back into Intel® TBB regularly with at least a commercial release a year. Here is a partial wish list as of July 2006:

• Usage by open source projects, distributions

• More usage by closed source projects

• New ports get more experience/usage

• More ports, more binaries – truly ubiquitous

• More scalable memory allocator tuning

• Extend scheduler & algorithms for event based, and I/O rich, applications

• Affinity tuning

• Explore .NET and Java interest / ideas

Do I have to initialize the task scheduler to use TBB parallel algorithms?


Yes, you must initialize TBB task scheduler when you use TBB parallel algorithms.

How does TBB support recursive algorithms?

Please refer to the Fibonacci example at \examples\test_all\fibonacci\. There are several good examples in the TBB book as well, including the Quick Sort example in Chapter 11.

Why did you offer Intel® TBB on C/C++ instead of say C# or Java?


We saw the largest immediate need and usefulness in C++. C++ is the hotbed for concurrent application development. It supplies the underpinnings of the majority of programming, whether directly or indirectly. .NET and Java get a great deal of concurrency through concurrent tasks, which serve the current needs of customers. That being said, we are now learning more about the applicability of and need for TBB on .NET, and encourage feedback on the possibility of extending this project to other languages in the future. Right now, our focus is on C++ and making that a big success. Other languages will come later for us.

How do I know what “grain size” to choose for the parallel_for?


Since the TBB 1.1 release, you no longer have to choose the grain size; you can let TBB choose it for you using the auto_partitioner feature. TBB also allows you to set the grain size manually; please refer to the TBB Tutorial or the TBB book for more information and examples.

What libraries do I have to link with if I want to use scalable memory allocator?


If you want to use the TBB scalable memory allocator, you must link with tbbmalloc_debug.{dll,so,dylib} for debug builds or tbbmalloc.{dll,so,dylib} for release builds.

Is there a version of TBB that provides statically linked libraries?


TBB is not provided as a statically linked library, for the following reasons:

Most libraries operate locally. For example, an Intel® MKL FFT transforms an array. It is irrelevant how many copies of the FFT there are; multiple copies and versions can coexist without difficulty.

But some libraries control program-wide resources, such as memory and processors. For example, garbage collectors control memory allocation across a program. Analogously, TBB controls scheduling of tasks across a program. To do their job effectively, each of these must be a singleton; that is, have a sole instance that can coordinate activities across the entire program.

Allowing k instances of the TBB scheduler in a single program would cause there to be k times as many software threads as hardware threads. The program would operate inefficiently, because the machine would be oversubscribed by a factor of k, causing more context switching, cache contention, and memory consumption. Furthermore, TBB's efficient support for nested parallelism would be negated when nested parallelism arose from nested invocations of distinct schedulers.

The most practical solution for creating a program-wide singleton is a dynamic shared library that contains the singleton. Of course if the schedulers could cooperate, we would not need a singleton. But that cooperation requires a centralized agent to communicate through; that is, a singleton!

Our decision to omit a statically linkable version of TBB was strongly influenced by our OpenMP experience. Like TBB, OpenMP also tries to schedule across a program. A static version of the OpenMP run-time was once provided, and it has been a constant source of problems arising from duplicate schedulers. We think it best not to repeat that history. As an indirect proof of the validity of these considerations, we could point to the fact that Microsoft Visual C++ only provides OpenMP support via dynamic libraries.

Do I have to initialize task scheduler to use TBB pipeline?


Yes, you must initialize TBB task scheduler when you use TBB pipeline.

Do I have to initialize task scheduler to use the scalable memory allocator?


If you want to use TBB scalable memory allocator only, you don’t have to initialize TBB task scheduler.

What sort of performance scalability does Intel® TBB demonstrate? Where do you see this going in the future?

Intel® TBB itself can theoretically scale linearly with an increase in cores. A dual-socket quad-core platform with 8 cores would allow an 8x speedup, and Intel® TBB has been shown to scale close to that level. The more a software application is able to use Intel® TBB to perform the necessary work, the closer it can get to the theoretical limit. We cannot make specific future predictions at this time, and it is important to note that an application’s parallelism is bounded by the percentage of the application that is, or needs to remain, serial.

When I ran one tbb::pipeline after another, I noticed that they didn’t execute concurrently. Is there a way to launch multiple pipelines in parallel?


With the current version of Intel® TBB, the tbb::pipeline::run method doesn’t return until the pipeline finishes processing items. You can run multiple pipelines concurrently by launching them from different Intel® TBB tasks or OS threads.

Can I use scalable memory allocator as an allocator parameter for STL containers?


Yes, you can.

Have you thought about asking that TBB be part of the C++ standard?


Yes, we are working with the C++ standards group to consider what is appropriate to do. We think standards are important. We have made a proposal (open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2104.pdf). We do not expect the C++ standard to change quickly. Standards are always best if they are based on experiences, including mistakes which lead to key lessons and improvements before standardization. We think TBB is great, but we would prefer to let time and experience of developers guide us and the standards we will live with forever. Many think the C++ standard committee has a better track record in this regard than most, and we would agree. And we wouldn’t want to upset that track record.

Can I re-use tbb::pipeline object if I want to run the same pipeline multiple times?


Yes, you can call the run method on the same tbb::pipeline object multiple times. You don’t have to clear or re-add the stages.

Why and when should I expect Intel® TBB scalable memory allocator to perform better than the standard one?


Intel® TBB scalable memory allocator was designed to improve performance of multi-threaded applications. We have found that it performs especially well when threads allocate small memory blocks.

Isn’t Intel® TBB essentially a standard already?


You can think of it that way, because you can use it everywhere. We think of it as a useful tool for developers to use now which they can count on in the future. We hope to maintain the project as a solid standard base for everyone to build upon, as we do strongly believe that is important.

Can multiple threads or TBB tasks share tbb::pipeline object?


No, concurrent tasks should not share a tbb::pipeline object.

Is it possible to re-direct all my memory management calls to TBB scalable memory allocator? How can I do that?


TBB scalable memory allocator provides a simple interface that can help you accomplish this. There are three functions: scalable_malloc, scalable_free, and scalable_realloc.

This is covered in Chapter 11 (Memory Allocation) of James Reinders’ book “Intel Threading Building Blocks", as well as in the documentation for TBB.


Are there minimal memory requirements? Is there a recommended amount of memory?

Please refer to the release notes at intel.com/software/products/tbb. The current version of the product requires at least 512MB of RAM, though we recommend 1GB of RAM.

Do I have to initialize task scheduler to use TBB synchronization primitives?

If you want to use synchronization primitives only, you don’t have to initialize TBB task scheduler.

Is TBB scalable memory allocator portable? Can I just re-build it on my system and expect it to work properly?


Yes it is. Anywhere TBB builds, the scalable allocator builds. On new platforms it defaults to using malloc/free as the low-level allocator.

I only have a Dual Core, not a Quad Core, processor… will TBB help me?


Absolutely. The beauty of TBB is that it detects the number of cores, and facilitates scalable performance that maps to this core detection.

Many of Intel® TBB synchronization primitives are user-level. What does it really mean?

Intel® TBB spin_mutex, queuing_mutex, spin_rw_mutex, and queuing_rw_mutex are user-level synchronization primitives. This means that a waiting thread spin-waits in user space, and Intel® TBB doesn’t call any OS synchronization APIs.