Intel® Xeon Phi™

Can I use Intel® TBB to program an Intel® Xeon Phi™ coprocessor?

Yes!  Intel® TBB can be used for both data and task parallelism on a Xeon Phi coprocessor.

However, be aware that Intel® TBB assumes that your application executes in a single, unified address space.  Tasks that are offloaded to the Xeon Phi coprocessor execute in a different address space from the application running on the host processor, and the thread pools on the host processor and the Xeon Phi coprocessor are completely separate.  As a result, work cannot be stolen between the host processor and the Xeon Phi coprocessor.

How much parallelism do I need to take advantage of all of the cores on a Xeon Phi coprocessor?

A good rule of thumb is that an Intel® TBB application should have approximately 10 tasks for every worker thread.  This allows the work-stealing scheduler to redistribute work if one of the tasks is unexpectedly large.  A typical Xeon Phi coprocessor has 60 cores and 4 hyperthreads/core:

60 cores × 4 hyperthreads/core × 10 tasks/thread = 2400 tasks.

Now factor in the parallelism provided by the vector units, which can process 16 single-precision floating-point numbers simultaneously, and your loops should have a range of at least

2400 tasks × 16 lanes/instruction = 38,400 iterations.
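The arithmetic above can be written as a small helper, which makes it easy to redo the estimate for a different core count or vector width. This is only a sketch of the rule of thumb: the function names min_tasks, min_loop_range, and iterations_per_task are hypothetical and are not part of Intel® TBB.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of tasks suggested by the rule of thumb
// (tasks_per_thread tasks for every hardware thread).
constexpr std::size_t min_tasks(std::size_t cores,
                                std::size_t threads_per_core,
                                std::size_t tasks_per_thread) {
    return cores * threads_per_core * tasks_per_thread;
}

// Hypothetical helper: minimum loop range once each task is also expected
// to fill the vector unit (vector_lanes elements per instruction).
constexpr std::size_t min_loop_range(std::size_t cores,
                                     std::size_t threads_per_core,
                                     std::size_t tasks_per_thread,
                                     std::size_t vector_lanes) {
    return min_tasks(cores, threads_per_core, tasks_per_thread) * vector_lanes;
}

// Hypothetical helper: how many iterations each task receives if a loop of
// the given range is split evenly (integer division, remainder ignored).
inline std::size_t iterations_per_task(std::size_t range, std::size_t tasks) {
    assert(tasks > 0);
    return range / tasks;
}
```

With the figures used in the text, min_tasks(60, 4, 10) gives 2400 and min_loop_range(60, 4, 10, 16) gives 38,400. At that minimum, each task handles 16 iterations, which is one full single-precision vector.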

That’s a lot of parallelism.  

How important is vectorization on a Xeon Phi coprocessor?

Getting the most out of the vector units is a vital part of getting the maximum performance out of a Xeon Phi coprocessor.  The compiler provides a number of options to report on the automatic vectorization of your program.  You should review the compiler’s documentation on auto-vectorization as well as the vec-report option.  If the compiler cannot automatically vectorize your code, you should explore user-mandated vectorization and the Extensions for Array Notation.
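As an illustration of a loop that vectorizes well, here is a minimal sketch of a single-precision kernel with unit-stride accesses and no dependence between iterations. It uses the OpenMP `#pragma omp simd` directive as one portable way to request vectorization; whether that is the mechanism this FAQ means by user-mandated vectorization is an assumption, and the function name saxpy is illustrative only. Compilers that do not have OpenMP SIMD support enabled ignore the pragma and compile an ordinary loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative kernel: y[i] = a * x[i] + y[i] for every element.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = x.size();

    // Request vectorization of the loop. Each iteration reads and writes
    // only its own element, so there is no loop-carried dependence.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

After building, check the compiler's vectorization report to confirm that the loop was actually vectorized rather than assuming that the pragma took effect.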

How important is cache-locality on a Xeon Phi coprocessor?

Maximizing cache-locality is vital to maximizing the performance of a Xeon Phi coprocessor.

The current implementation of the Xeon Phi coprocessor uses cores that execute instructions in program order.  This means that a cache miss forces a hyperthread to stall until the data is available.  In contrast, a Xeon processor can execute instructions out of order, masking the effects of cache misses.
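A common source of avoidable cache misses is walking a matrix against its storage order. The sketch below sums the same row-major matrix in two ways; the function names and the flat std::vector layout are assumptions made for illustration. Both functions return the same value, but the first touches memory with unit stride, while the second jumps by a whole row on every access and so uses far less of each cache line it loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cache-friendly: the inner loop walks adjacent elements of a row.
double sum_by_rows(const std::vector<double>& m,
                   std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

// Cache-unfriendly: the inner loop strides by `cols` elements per access.
double sum_by_columns(const std::vector<double>& m,
                      std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

Timing the two functions on a matrix much larger than the cache is a simple way to see how much the traversal order matters on a given machine.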

What are the advantages and disadvantages of work-stealing and work-sharing?

In a work-stealing scheduler, idle worker threads randomly choose another worker and attempt to steal work from it.  This places the burden of locating available work on the threads that have no other useful work to perform.  As a general rule, applications should expose approximately 10 tasks for every worker thread.  This allows a work-stealing scheduler to redistribute work in a poorly balanced workload, or if a core gets bogged down with other tasks.  A work-stealing scheduler can provide near-optimal scheduling in a dynamic environment such as a general-purpose computer, or with a poorly balanced workload.

In a work-sharing scheduler, the available work is partitioned into tasks, and the tasks are assigned to each thread in the thread pool.  This is extremely efficient for well-balanced workloads, but the entire team may need to wait for a long task in a poorly balanced workload, or if one of the processors gets bogged down with other tasks.  A work-sharing scheduler can provide near-optimal scheduling in a dedicated environment with a well-balanced workload.  A work-sharing scheduler that assigns tasks in a round-robin fashion can also take advantage of cache-locality effects.
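The load-balancing difference between the two approaches can be seen in a very simplified model. The sketch below is not how Intel® TBB or any real scheduler is implemented: it ignores stealing overhead and cache effects, and the function names are hypothetical. It computes only the finish time (makespan) of a list of task costs when tasks are pre-assigned round-robin, as in work-sharing, and when each task goes to whichever worker is idle first, which approximates the effect of work-stealing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Work-sharing model: task i is assigned up front to worker i % workers.
long makespan_round_robin(const std::vector<long>& costs, std::size_t workers) {
    assert(workers > 0);
    std::vector<long> busy(workers, 0);
    for (std::size_t i = 0; i < costs.size(); ++i) {
        busy[i % workers] += costs[i];
    }
    return *std::max_element(busy.begin(), busy.end());
}

// Dynamic model: each task is taken by the worker that becomes idle first,
// standing in for the rebalancing that work-stealing achieves.
long makespan_dynamic(const std::vector<long>& costs, std::size_t workers) {
    assert(workers > 0);
    std::vector<long> busy(workers, 0);
    for (long cost : costs) {
        *std::min_element(busy.begin(), busy.end()) += cost;
    }
    return *std::max_element(busy.begin(), busy.end());
}
```

For example, with two workers and task costs {8, 1, 1, 1, 1, 1, 1, 1, 1}, the round-robin model finishes at time 12 because the worker holding the long task also receives four short ones, while the dynamic model finishes at time 8. With a well-balanced workload such as {2, 2, 2, 2}, both models finish at time 4.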

Why don’t I see linear speedup as the number of hyperthreads/core increases?

Hyperthreads share portions of the physical core that they run on.  As long as the shared resources aren’t a bottleneck, hyperthreads can improve performance.   But as you increase the number of hyperthreads on a core, the probability of a resource bottleneck increases, so you’ll see less improvement with each added hyperthread.