Difference: Xeonprocessor (9 vs. 10)

Revision 10 - 2014/04/07 - Main.WilliamFedus

-- WilliamFedus - 2014/02/04
  which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor. This enables two floating point operations in a single clock cycle, without loss of precision.
To compile the helloflops1.c code using the Intel C compiler, use the following command
 
$  icc -mmic -vec-report=3 -O3 helloflops1.c -o helloflops1
This is the expected factor-of-16 slow-down relative to our vectorized code.
Two Threads and One Core on Coprocessor

To utilize the full capability of our cores, we must run more than one thread on each core so that an FMA instruction can execute on every clock cycle. Here, we use the OpenMP API for shared-memory multiprocessing; below is an excellent set of tutorials maintained by the Lawrence Livermore National Laboratory.
  * OpenMP Tutorials: Tutorial
To run on two threads, a few OpenMP directives and API calls convert the program into one that scales across OpenMP threads. With these additions, OpenMP runs additional instances of the loop in parallel on the same set of data. The new program, helloflops2.c, places an OpenMP parallel directive ahead of the for loop; each thread then works on a separate set of array elements, selected by an offset added to the code.

In order to run on two threads, we use the following OpenMP setup calls,

omp_set_num_threads(2);
kmp_set_defaults("KMP_AFFINITY=compact");

where the first call sets the number of threads we will use and the second sets the affinity variable, KMP_AFFINITY, which controls how the threads are placed across the cores. To compile the code, use the following command,

$ icc -openmp -mmic -vec-report=3 -O3 helloflops2.c -o helloflops2

Moving the binary over to the coprocessor and executing it yields the following output,

%  ./helloflops2
Initializing
Starting Compute on 2 threads
GFlops =     51.200, Secs =      1.301, GFlops per sec =     39.358

This result is sensible: it is almost exactly the maximum achievable single-precision computation rate for two threads on a single core. Specifically,

Peak Single Precision FLOPs = Clock Frequency x Number of Cores x 16 Lanes x 2 (FMA) FLOPs/cycle

and for a single core of our Intel Xeon Phi coprocessor,

1.238 GHz x 1 Core x 16 Lanes x 2 (FMA) FLOPs/cycle = 39.616 GFlops/s

and therefore, we have achieved 99.3% of the theoretical maximum computation rate.

122 Threads and 61 Cores on Coprocessor

This first tutorial example code is optimized for two threads per core (the Intel Xeon Phi chip supports hyper-threading with up to four threads per core), so we maximize the computation rate by placing two OpenMP threads on each core. Now, using helloflops3.c, we can seek the maximum computation rate for the coprocessor. To change the number of threads and the affinity variable, we may set them as environment variables rather than as lines in the code.

So to proceed, compile the code on the host,

$ icc -openmp -mmic -vec-report=3 -O3 helloflops3.c -o helloflops3

and transfer the output to the coprocessor. Then on the coprocessor, we may change the environment variables with the following commands,

% export OMP_NUM_THREADS=122
% export KMP_AFFINITY=scatter

where we are requesting that the threads be "scattered" across the 61 available cores, rather than completely filling a single core with threads before proceeding to the next. Now, on the coprocessor, run the executable,

%  ./helloflops3
Initializing
Starting Compute on 122 threads
GFlops =   3123.200, Secs =      1.388, GFlops per sec =   2250.830

which is 93.1% of the theoretical maximum calculation rate for single precision floats.

 

Vectorization

Useful guide for enabling compiler vectorization capability in the Intel compiler

Seed Algorithm


References

 