-- WilliamFedus - 2014/02/04

Intel Xeon Phi Coprocessor


This twiki provides an end-to-end description of running some basic examples on the Intel Xeon Phi processor and coprocessor which demonstrate various features of parallel computing. All examples are from the book, Intel Xeon Phi Coprocessor High-Performance Programming by Jeffers, Reinders. In addition to the basic examples, the twiki highlights a very simple use case of the parallel architecture in the seed processing procedure.

Setting Up

Logging Onto Processor and Coprocessor

Throughout the tutorial, $ will represent the host command prompt and % will represent the coprocessor command prompt.

The Host may be accessed via the root user,


and cd to my directory on the host, which holds all of the relevant code

$  cd liamb315

And from the host, we may access the 0th Many Integrated Core (MIC) chip via

$  ssh mic0


All of the following code lies in the directory, /root/liamb315

One Thread and One Core on Coprocessor

In this first example, we consider operation on a single core of the Intel Xeon Phi Coprocessor. The key piece of the code is the inner loop,

fa[k] = a * fa[k] + fb[k];

which maps to a fused multiply-add (FMA) instruction, the primary compute capability of the coprocessor: two floating-point operations in a single clock cycle, with no loss of precision.

To compile the helloflops1.c code using the Intel C compiler, use the following command

$  icc -mmic -vec-report=3 -O3 helloflops1.c -o helloflops1

Here, -mmic requests that code be generated for the Intel Many Integrated Core (MIC) architecture, -vec-report=3 tells the compiler to emit a vectorization report, and -O3 enables aggressive standard optimizations.

Now, copy the helloflops1 executable object over to the mic0 via the command

$  scp helloflops1 mic0:~/liamb315 

Now, ssh to the mic0 chip

$  ssh mic0

move to the relevant directory and launch the executable

%  cd liamb315
%  ./helloflops1

which results in the output

Starting Compute
GFlops =     25.600, Secs =      3.900, GFlops per sec =      6.563

This test remains an anomaly on our system, however: it achieves only 6.5 GFlop/s. Theoretically, we should expect 17.5 GFlop/s, half of the 34.9 GFlop/s peak of a single core, since the scheduler skips every other clock cycle when only one thread is running.


Automatic vectorization of code is critical to achieving maximum performance on this architecture. In the previous simple example, the compiler arranged chunks of the arrays to be loaded into machine registers and used up to 16 single-precision floating-point lanes for simultaneous calculation: the MIC has a 512-bit vector processing unit (VPU), and each single-precision number is 32 bits wide (512/32 = 16 lanes).

We can immediately see the impact of vectorization by disabling it via the compile option -no-vec

$  icc -mmic -no-vec -vec-report=3 -O3 helloflops1.c -o helloflops1novec

which, after copying the executable to mic0 as before, results in the output

%  ./helloflops1novec
Starting Compute
GFlops =     25.600, Secs =     62.399, GFlops per sec =      0.410

This is the expected factor-of-16 slowdown relative to the vectorized code.

Scaling to Multiple Cores

To utilize the full capability of our cores, we must run more than one thread per core so that an FMA calculation can issue on every clock cycle. Here, we use the OpenMP API for shared-memory multiprocessing.

* OpenMP Tutorials: Tutorial


A useful guide for enabling compiler vectorization in the Intel compiler:

Seed Algorithm

Topic revision: r7 - 2014/02/21 - 21:08:15 - WilliamFedus