-- WilliamFedus - 2014/02/04
Intel Xeon Phi Coprocessor
Overview
This twiki provides an end-to-end description of running some basic examples on the host processor and the Intel Xeon Phi coprocessor, which demonstrate various features of parallel computing. All examples are from the book Intel Xeon Phi Coprocessor High-Performance Programming by Jeffers and Reinders. In addition to the basic examples, the twiki highlights a very simple use case of the parallel architecture in the seed processing procedure.
Setting Up
Logging Onto Processor and Coprocessor
Throughout the tutorial, $ will represent the host command prompt and % will represent the coprocessor command prompt.
The host may be accessed as the root user,
$ ssh root@phiphi.t2.ucsd.edu
Then cd to my directory on the host, which holds all of the relevant code,
$ cd liamb315
And from the host, we may access the 0th Many Integrated Core (MIC) chip via
$ ssh mic0
Examples
All of the following code lies in the directory, /root/liamb315
One Thread and One Core on Coprocessor
In this first example, we consider operation on a single core of the Intel Xeon Phi coprocessor. The key piece of the code is the inner loop,
fa[k] = a * fa[k] + fb[k];
which the compiler can map to a fused multiply-add (FMA) instruction, the primary compute capability of the coprocessor. An FMA performs two floating point operations (a multiply and an add) in a single clock cycle, without loss of precision.
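As a minimal sketch of the kind of loop described above (not the book's helloflops1.c source), the following C function applies the update to every array element and counts two floating point operations per update. The function name fma_kernel and its parameters are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (not from helloflops1.c): applies
   fa[k] = a * fa[k] + fb[k] to all n elements, `iters` times over,
   and returns the number of floating point operations performed. */
double fma_kernel(float *fa, const float *fb, float a, size_t n, size_t iters)
{
    for (size_t j = 0; j < iters; j++) {
        /* Inner loop: one multiply and one add per element,
           which a vectorizing compiler can turn into FMA instructions. */
        for (size_t k = 0; k < n; k++) {
            fa[k] = a * fa[k] + fb[k];
        }
    }
    /* Two operations (multiply + add) per element update. */
    return 2.0 * (double)n * (double)iters;
}
```

Dividing the returned operation count by the elapsed wall-clock time would give a rate comparable to the "GFlops per sec" figure printed by the examples.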
To compile the helloflops1.c code using the Intel C compiler, use the following command,
$ icc -mmic -vec-report=3 -O3 helloflops1.c -o helloflops1
Here, -mmic requests that code be generated for the Intel Many Integrated Core (MIC) architecture, -vec-report=3 requests a vectorization report, and -O3 enables the compiler's aggressive optimization level.
Now, copy the helloflops1 executable over to mic0 via the command
$ scp helloflops1 mic0:~/liamb315
Now, ssh to the mic0 chip
$ ssh mic0
move to the relevant directory and launch the executable
% cd liamb315
% ./helloflops1
which results in the output
% ./helloflops1
Initializing
Starting Compute
GFlops = 25.600, Secs = 3.900, GFlops per sec = 6.563
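This test continues to be an anomaly for our system: it achieves only 6.5 GFlop/s. Theoretically, we should expect 17.5 GFlop/s (half the 34.9 GFlop/s peak of a single core, since the scheduler skips every other clock cycle when only one thread is running). Note that these expected figures correspond to a lower clock rate than the 1.238 GHz used in the two-thread peak calculation on this page; at 1.238 GHz the same reasoning gives a single-core peak of 39.6 GFlop/s and a one-thread expectation of 19.8 GFlop/s.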
Vectorization
Automatic vectorization of code is critical to achieving the maximum performance with this architecture. In the prior simple example, the compiler was able to arrange for chunks of the arrays to be loaded into the vector registers and processed in up to 16 single precision floating point lanes simultaneously. The MIC has a 512-bit vector processing unit (VPU) and each single precision number is 32 bits, giving 512/32 = 16 lanes.
We can immediately see the impact of vectorization by disabling it on the host via the compile option -no-vec,
$ icc -mmic -no-vec -vec-report=3 -O3 helloflops1.c -o helloflops1novec
Copying this executable to the coprocessor and running it results in the output
% ./helloflops1novec
Initializing
Starting Compute
GFlops = 25.600, Secs = 62.399, GFlops per sec = 0.410
This is the expected factor-of-16 slowdown relative to the vectorized code (62.399 s versus 3.900 s).
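The lane count and the observed slowdown can be checked with a few lines of C. This is only an arithmetic sketch; the function names vector_lanes and slowdown_factor are hypothetical, and the timings are the ones printed in the outputs above.

```c
/* Hypothetical helpers for checking the vectorization arithmetic. */

/* Number of elements that fit in one vector register. */
int vector_lanes(int vector_bits, int element_bits)
{
    return vector_bits / element_bits;
}

/* Ratio of the slower run time to the faster run time. */
double slowdown_factor(double slow_secs, double fast_secs)
{
    return slow_secs / fast_secs;
}
```

With a 512-bit VPU and 32-bit floats, vector_lanes(512, 32) gives 16, and slowdown_factor(62.399, 3.900) is very close to 16, consistent with all lanes being used in the vectorized run.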
Two Threads and One Core on Coprocessor
To utilize the full capability of our cores, we must run more than one thread on each core so that an FMA calculation is issued on every clock cycle. Here, we use the OpenMP API for shared memory multiprocessing; below, you will find a set of excellent tutorials maintained by the Lawrence Livermore National Laboratory.
To run on two threads, a few OpenMP directives and API calls are added so that the program scales using OpenMP threads. With these additions, OpenMP runs a second instance of the compute code in parallel, sharing the same arrays. This new program, helloflops2.c, relies on the OpenMP parallel directive ahead of the for loop. Each thread that is started works on a separate set of array elements, selected by an offset added to the code.
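The idea of giving each thread its own slice of the arrays can be sketched as follows. This is not the helloflops2.c source: it uses a parallel for over two slices (rather than a bare parallel region keyed on the thread number) so that it produces the same result even when compiled without OpenMP support. The function name update_two_slices and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch: the arrays hold 2 * per_slice elements, and each
   of the two slices is updated independently. With OpenMP enabled, the
   two slices are handed to two threads; without it, they run in turn. */
void update_two_slices(float *fa, const float *fb, float a, size_t per_slice)
{
#ifdef _OPENMP
#pragma omp parallel for num_threads(2)
#endif
    for (int t = 0; t < 2; t++) {
        /* Offset selects the slice of the arrays owned by this iteration. */
        size_t offset = (size_t)t * per_slice;
        for (size_t k = 0; k < per_slice; k++) {
            fa[offset + k] = a * fa[offset + k] + fb[offset + k];
        }
    }
}
```

Because the slices do not overlap, the threads never write to the same element and no synchronization is needed inside the loop.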
In order to run on two threads, we use the following OpenMP setup calls,
omp_set_num_threads(2);
kmp_set_defaults("KMP_AFFINITY=compact");
where the first call sets the number of threads we will use and the second call sets the affinity variable, KMP_AFFINITY, which controls the placement of the threads across the cores. To compile the code, use the following command,
$ icc -openmp -mmic -vec-report=3 -O3 helloflops2.c -o helloflops2
Moving this over to the coprocessor and executing it results in the following output,
% ./helloflops2
Initializing
Starting Compute on 2 threads
GFlops = 51.200, Secs = 1.301, GFlops per sec = 39.358
This result is sensible and closely matches the maximum achievable single precision computation rate of two threads on a single core. Specifically,
Peak Single Precision FLOPs = Clock Frequency x Number of Cores x 16 Lanes x 2 (FMA) FLOPs/cycle
and for our Intel Xeon Phi coprocessor,
1.238 GHz x 1 Core x 16 Lanes x 2 (FMA) FLOPs/cycle = 39.616 GFlops/s
and therefore, we have achieved 99.3% of the theoretical maximum computation rate.
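The peak formula above is simple enough to express directly in C, which makes it easy to recheck the percentages for other core counts. The function names peak_sp_gflops and efficiency_percent are hypothetical and exist only for this sketch.

```c
/* Hypothetical helpers implementing the peak formula from the text. */

/* Peak single precision rate in GFlop/s:
   clock (GHz) x cores x vector lanes x FLOPs per lane per cycle. */
double peak_sp_gflops(double clock_ghz, int cores, int lanes, int flops_per_lane)
{
    return clock_ghz * (double)cores * (double)lanes * (double)flops_per_lane;
}

/* Measured rate as a percentage of the peak rate. */
double efficiency_percent(double measured_gflops, double peak_gflops)
{
    return 100.0 * measured_gflops / peak_gflops;
}
```

For one core, peak_sp_gflops(1.238, 1, 16, 2) gives 39.616, and the measured 39.358 GFlop/s is about 99.3% of that. With 61 cores the same function gives 2416.576 GFlop/s.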
122 Threads and 61 Cores on Coprocessor
This tutorial example code is optimized for two threads per core (the Intel Xeon Phi chip supports up to four hardware threads per core), and therefore we will maximize the computation rate by placing two OpenMP threads on each core. Now, using helloflops3.c, we can seek the maximum computation rate for the coprocessor. To change the number of threads and the affinity variable, we may set these as environment variables rather than as lines in the code.
So to proceed, compile the code on the host,
$ icc -openmp -mmic -vec-report=3 -O3 helloflops3.c -o helloflops3
and transfer the executable to the coprocessor. Then, on the coprocessor, we may set the environment variables with the following commands,
% export OMP_NUM_THREADS=122
% export KMP_AFFINITY=scatter
where we are requesting that the threads be "scattered" across the 61 available cores, rather than completely filling a single core with threads before proceeding to the next. Now, on the coprocessor, run the executable,
% ./helloflops3
Initializing
Starting Compute on 122 threads
GFlops = 3123.200, Secs = 1.388, GFlops per sec = 2250.830
which is 93.1% of the theoretical maximum calculation rate for single precision floats (1.238 GHz x 61 cores x 16 lanes x 2 FLOPs/cycle = 2416.576 GFlop/s).
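To illustrate why scatter is the better affinity choice for 122 threads, here is a deliberately simplified model of the two placement policies. It is an assumption-laden sketch, not the Intel OpenMP runtime's actual logic, which has further options and details; the names placement, core_for_thread and cores_in_use are hypothetical.

```c
/* Simplified, hypothetical model of thread placement. The real runtime
   behavior behind KMP_AFFINITY is more detailed than this. */
enum placement { PLACE_COMPACT, PLACE_SCATTER };

/* Core index that thread `tid` lands on under the given policy. */
int core_for_thread(int tid, int cores, int hw_threads_per_core, enum placement p)
{
    if (p == PLACE_SCATTER) {
        /* Round-robin: one thread per core before doubling up. */
        return tid % cores;
    }
    /* Compact: fill every hardware thread of a core before moving on. */
    return (tid / hw_threads_per_core) % cores;
}

/* Number of distinct cores that receive at least one thread. */
int cores_in_use(int nthreads, int cores, int hw_threads_per_core, enum placement p)
{
    int used = 0;
    for (int c = 0; c < cores; c++) {
        for (int t = 0; t < nthreads; t++) {
            if (core_for_thread(t, cores, hw_threads_per_core, p) == c) {
                used++;
                break;
            }
        }
    }
    return used;
}
```

Under this model, 122 threads with scatter occupy all 61 cores at two threads each, whereas compact with four hardware threads per core would pack them onto only 31 cores and leave the rest idle.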
Vectorization
Useful guide for enabling compiler vectorization capability in the Intel compiler
Seed Algorithm
References