-- WilliamFedus - 2014/02/04

Intel Xeon Phi Coprocessor


This twiki gives an end-to-end description of running some basic examples on the Intel Xeon Phi coprocessor and its host processor. The examples demonstrate various features of parallel computing, and all of them are taken from the book Intel Xeon Phi Coprocessor High-Performance Programming by Jeffers and Reinders. In addition to the basic examples, the twiki highlights a very simple use case of the parallel architecture in the seed processing procedure.

Setting Up

Logging Onto Processor and Coprocessor

Throughout the tutorial, $ will represent the host command prompt and % will represent the coprocessor command prompt.

The Host may be accessed via the root user,

ssh root@phiphi.t2.ucsd.edu

Then change to the directory on the host that holds all of the relevant code,

$  cd liamb315

And from the host, we may access the 0th Many Integrated Core (MIC) chip via

$  ssh mic0


All of the code used below lives in the directory /root/liamb315.

One Thread and One Core on Coprocessor

In this first example, we consider the operation of a single core of the Intel Xeon Phi coprocessor. The key piece of the code is the inner loop,

fa[k] = a * fa[k] + fb[k];

which maps to a fused multiply-add (FMA) instruction, the primary compute capability of the coprocessor. An FMA counts as two floating point operations (one multiply and one add) issued in a single clock cycle, and the intermediate product is not rounded before the add, so no precision is lost.
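
As a minimal sketch of this inner loop (not the helloflops1.c source itself; the function name fma_sweep and its parameters are hypothetical), the following C routine applies the update to every array element a given number of times and returns the number of floating point operations performed, counting two per update.

```c
#include <stddef.h>

/* Hypothetical helper: repeats the update fa[k] = a * fa[k] + fb[k]
 * over n elements, iters times. Each update is one multiply plus one
 * add, so it is counted as 2 floating point operations.
 * Returns the total operation count as a double. */
double fma_sweep(float *fa, const float *fb, float a, size_t n, size_t iters)
{
    for (size_t j = 0; j < iters; j++) {
        for (size_t k = 0; k < n; k++) {
            fa[k] = a * fa[k] + fb[k];   /* candidate for an FMA instruction */
        }
    }
    return 2.0 * (double)n * (double)iters;
}
```

Dividing the returned count by the elapsed wall-clock time gives a rate comparable to the "GFlops per sec" figure printed by the examples.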

To compile the helloflops1.c code using the Intel C compiler, use the following command

$  icc -mmic -vec-report=3 -O3 helloflops1.c -o helloflops1

Here, -mmic requests that code be generated for the Intel Many Integrated Core (MIC) architecture, -vec-report=3 requests a vectorization report, and -O3 enables the compiler's aggressive optimization level.

Now, copy the helloflops1 executable over to mic0 via the command

$  scp helloflops1 mic0:~/liamb315 

Now, ssh to the mic0 chip

$  ssh mic0

move to the relevant directory and launch the executable

%  cd liamb315
%  ./helloflops1

which results in the output

% ./helloflops1
Starting Compute
GFlops =     25.600, Secs =      3.900, GFlops per sec =      6.563

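This test remains an anomaly on our system, however: it achieves only 6.5 GFlop/s. Theoretically, we should expect 17.5 GFlop/s, which is half of the 34.9 GFlop/s single-core peak, because with only one thread running the core can issue an instruction from that thread only every other clock cycle. Note that 34.9 GFlop/s corresponds to a 1.091 GHz clock; at the 1.238 GHz used in the peak calculation later on this page, the single-core peak is 39.6 GFlop/s and the single-thread expectation is 19.8 GFlop/s.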


Automatic vectorization of code is critical to achieving the maximum performance with this architecture. In the simple example above, the compiler arranged for chunks of the arrays to be loaded into the vector registers and processed in up to 16 single precision floating point lanes simultaneously. The MIC has a 512-bit vector processing unit (VPU) and each single precision number is 32 bits, which gives 512/32 = 16 lanes.

We can immediately see the impact of vectorization by disabling it via the compile option -no-vec

$  icc -mmic -no-vec -vec-report=3 -O3 helloflops1.c -o helloflops1novec

which results in the output

%  ./helloflops1novec
Starting Compute
GFlops =     25.600, Secs =     62.399, GFlops per sec =      0.410

This is the expected factor-of-16 slow-down relative to our vectorized code (6.563 / 0.410 is approximately 16).
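
The lane count and the observed slow-down can be checked with a few lines of C. This is only an arithmetic sketch; the helper names vector_lanes and slowdown are hypothetical, and the rates passed in are the ones printed in the outputs above.

```c
/* Hypothetical helper: number of SIMD lanes in a vector unit that is
 * vpu_bits wide when each element occupies elem_bits. */
int vector_lanes(int vpu_bits, int elem_bits)
{
    return vpu_bits / elem_bits;
}

/* Hypothetical helper: how many times slower slow_rate is than
 * fast_rate (both in the same units, e.g. GFlop/s). */
double slowdown(double fast_rate, double slow_rate)
{
    return fast_rate / slow_rate;
}
```

With these, vector_lanes(512, 32) gives the 16 single precision lanes quoted above, and slowdown(6.563, 0.410) comes out very close to 16, matching the lane count.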

Two Threads and One Core on Coprocessor

To utilize the full capability of our cores, we must run more than one thread on each core so that an FMA calculation can be issued on every clock cycle. Here, we use the OpenMP API for shared memory multiprocessing; Lawrence Livermore National Laboratory maintains a set of excellent OpenMP tutorials.

To run on two threads, a few OpenMP directives and API calls are added so that the program scales with the number of OpenMP threads. With these additions, OpenMP runs another instance of the loop in parallel on the same arrays. The new program, helloflops2.c, relies on an OpenMP parallel directive ahead of the for loop. Each thread that is started works on a separate set of array elements, selected by an offset added to the array indices.
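
The structure described above might look roughly like the following sketch. It is not the helloflops2.c source: the names CHUNK, configure_threads and fma_parallel are hypothetical, the setup helper assumes the standard omp_set_num_threads routine plus the Intel-specific kmp_set_defaults extension (only compiled in under the Intel compiler), and the affinity value "compact" is an assumption. The preprocessor guards let the same file build without OpenMP, in which case the loop simply runs serially.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CHUNK 128  /* hypothetical number of array elements per thread */

/* Requests nthreads OpenMP threads and returns the thread count the
 * runtime reports for upcoming parallel regions (1 if the code was
 * compiled without OpenMP support). */
int configure_threads(int nthreads)
{
#ifdef _OPENMP
    omp_set_num_threads(nthreads);
#ifdef __INTEL_COMPILER
    /* Intel runtime extension; the "compact" value is an assumption. */
    kmp_set_defaults("KMP_AFFINITY=compact");
#endif
    return omp_get_max_threads();
#else
    (void)nthreads;
    return 1;
#endif
}

/* Iteration i of the outer loop owns the slice
 * [i * CHUNK, (i + 1) * CHUNK) of both arrays, so no two threads ever
 * write the same element. The arrays must hold nslices * CHUNK floats. */
void fma_parallel(float *fa, const float *fb, float a, int nslices, int iters)
{
    #pragma omp parallel for
    for (int i = 0; i < nslices; i++) {
        size_t offset = (size_t)i * CHUNK;
        for (int j = 0; j < iters; j++) {
            for (size_t k = 0; k < CHUNK; k++) {
                fa[k + offset] = a * fa[k + offset] + fb[k + offset];
            }
        }
    }
}
```

Calling configure_threads(2) followed by fma_parallel(fa, fb, a, 2, iters) gives each of the two threads its own CHUNK-sized slice, which is the offset scheme described above.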

In order to run on two threads, helloflops2.c makes two OpenMP setup calls. The first call sets the number of threads we will use and the second call sets the affinity variable, KMP_AFFINITY, which controls how the threads are placed across the cores. To compile the code, use the following command,

$ icc -openmp -mmic -vec-report=3 -O3 helloflops2.c -o helloflops2

Moving the executable over to the coprocessor and running it produces the following output,

%  ./helloflops2
Starting Compute on 2 threads
GFlops =     51.200, Secs =      1.301, GFlops per sec =     39.358

This result is sensible: it is almost exactly the maximum achievable computation rate of two threads on a single core in single precision. Specifically,

Peak Single Precision FLOPs = Clock Frequency x Number of Cores x 16 Lanes x 2 (FMA) FLOPs/cycle

and for a single core of our Intel Xeon Phi coprocessor,

1.238 GHz x 1 Core x 16 Lanes x 2 (FMA) FLOPs/cycle = 39.616 GFlops/s

and therefore, we have achieved 99.3% of the theoretical maximum computation rate.
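
The peak formula and the efficiency figure can be reproduced with a short C sketch. The function names peak_sp_gflops and percent_of_peak are hypothetical; the inputs are the clock frequency, lane count and measured rate quoted on this page.

```c
/* Hypothetical helper implementing the formula above:
 * peak single precision GFlop/s =
 *     clock (GHz) x cores x lanes x 2 flops per FMA per cycle. */
double peak_sp_gflops(double clock_ghz, int cores, int lanes)
{
    return clock_ghz * cores * lanes * 2.0;
}

/* Hypothetical helper: measured rate as a percentage of the peak. */
double percent_of_peak(double measured_gflops, double peak_gflops)
{
    return 100.0 * measured_gflops / peak_gflops;
}
```

For one core, peak_sp_gflops(1.238, 1, 16) evaluates to 39.616, and percent_of_peak(39.358, 39.616) is a little over 99.3, in line with the figure above. Halving the single-core peak gives the rate expected when only one thread runs on the core.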

122 Threads and 61 Cores on Coprocessor

This tutorial example code is optimized for two threads per core (the Intel Xeon Phi chip supports up to four hardware threads per core), so we maximize the computation rate by placing two OpenMP threads on each core. Using helloflops3.c, we can now seek the maximum computation rate for the whole coprocessor. To change the number of threads and the thread affinity, we may set environment variables rather than adding lines to the code.

So to proceed, compile the code on the host,

$ icc -openmp -mmic -vec-report=3 -O3 helloflops3.c -o helloflops3

and transfer the output to the coprocessor. Then on the coprocessor, we may change the environment variables with the following commands,

% export OMP_NUM_THREADS=122
% export KMP_AFFINITY=scatter

where we are requesting that the threads be "scattered" across the 61 available cores, rather than completely filling a single core with threads before proceeding to the next. Now, on the coprocessor, run the executable,

%  ./helloflops3
Starting Compute on 122 threads
GFlops =   3123.200, Secs =      1.388, GFlops per sec =   2250.830

which is 93.1% of the theoretical maximum calculation rate for single precision floats (1.238 GHz x 61 Cores x 16 Lanes x 2 (FMA) FLOPs/cycle = 2416.576 GFlops/s).
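
The difference between scattering threads across cores and filling one core at a time can be illustrated with a simplified placement model in C. This is an assumption-laden sketch, not the OpenMP runtime's actual algorithm: the function names are hypothetical, cores are simply numbered from 0, and details of the real hardware numbering are ignored.

```c
/* Simplified model of scattered placement: threads are dealt out
 * round-robin, one per core, wrapping around after the last core. */
int scatter_core(int thread, int ncores)
{
    return thread % ncores;
}

/* Simplified model of the alternative: all hardware threads of one
 * core are filled before moving on to the next core. */
int compact_core(int thread, int hw_threads_per_core)
{
    return thread / hw_threads_per_core;
}

/* Counts how many of nthreads land on the given core under the
 * scattered model. */
int threads_on_core_scatter(int core, int nthreads, int ncores)
{
    int count = 0;
    for (int t = 0; t < nthreads; t++) {
        if (scatter_core(t, ncores) == core) {
            count++;
        }
    }
    return count;
}
```

Under this model, 122 scattered threads put exactly two threads on each of the 61 cores, whereas filling four hardware threads per core would place the same 122 threads on only the first 31 cores.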


Useful guide for enabling compiler vectorization capability in the Intel compiler

Seed Algorithm


Topic revision: r10 - 2014/04/07 - 22:34:06 - WilliamFedus