Difference: Xeonprocessor (1 vs. 10)

Revision 10 - 2014/04/07 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
Line: 43 to 43
  which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor. This enables two floating point operations in a single clock cycle, without loss of precision.
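For reference, the kernel that exercises this FMA capability looks roughly like the following single-threaded loop. This is only a minimal sketch: the array length, iteration count, and initial values below are illustrative assumptions, not the actual helloflops1.c source.

#include <stdio.h>

#define ARRAY_SIZE 1024      /* assumed array length     */
#define NUM_ITERS  100000    /* assumed repetition count */

int main(void)
{
    static float fa[ARRAY_SIZE], fb[ARRAY_SIZE];
    float a = 1.1f;
    int j, k;

    /* initialize the data */
    for (k = 0; k < ARRAY_SIZE; k++) {
        fa[k] = (float)k + 0.1f;
        fb[k] = (float)k + 0.2f;
    }

    /* the FMA kernel: one multiply and one add per element, repeated */
    for (j = 0; j < NUM_ITERS; j++)
        for (k = 0; k < ARRAY_SIZE; k++)
            fa[k] = a * fa[k] + fb[k];

    /* print a result so the compiler cannot optimize the loop away */
    printf("fa[0] = %f\n", fa[0]);
    return 0;
}
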
Changed:
<
<
To compile the helloflops1.c code using the Intel C compiler, use the following command
>
>
To compile the helloflops1.c code using the Intel C compiler, use the following command
 
$  icc -mmic -vec-report=3 -O3 helloflops1.c -o helloflops1
Line: 99 to 99
  As we immediately see, this is the expected factor of 16 slow-down from our vectorized code.
Changed:
<
<

Scaling to Multiple Cores

To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP API for shared memory multiprocessing.
>
>

Two Threads and One Core on Coprocessor

To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP API for shared memory multiprocessing; below, you will find a set of excellent tutorials maintained by the Lawrence Livermore National Laboratory.
  * OpenMP Tutorials: Tutorial
Added:
>
>
To run on two threads, a few OpenMP directives and API calls are added to turn the program into one that scales using OpenMP threads. With these additions, OpenMP runs additional instances of the compute loop in parallel on the same set of data. The new program, helloflops2.c, relies on an OpenMP parallel directive placed ahead of the for loop; each thread then works on its own set of array elements, determined by an offset added to the code, as sketched below.
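Schematically, the threaded version is organized as in the following sketch; the array length, iteration count, and chunking here are illustrative assumptions, not the actual helloflops2.c source.

#include <omp.h>
#include <stdio.h>

#define ARRAY_SIZE 2048      /* assumed total array length */
#define NUM_ITERS  100000    /* assumed repetition count   */

static float fa[ARRAY_SIZE], fb[ARRAY_SIZE];

int main(void)
{
    float a = 1.1f;
    int i;

    for (i = 0; i < ARRAY_SIZE; i++) {
        fa[i] = (float)i + 0.1f;
        fb[i] = (float)i + 0.2f;
    }

    omp_set_num_threads(2);   /* the affinity setup call shown below would also go here */

#pragma omp parallel
    {
        /* each thread works on its own contiguous block of the arrays */
        int nthreads = omp_get_num_threads();
        int chunk    = ARRAY_SIZE / nthreads;
        int offset   = omp_get_thread_num() * chunk;
        int j, k;

        for (j = 0; j < NUM_ITERS; j++)
            for (k = offset; k < offset + chunk; k++)
                fa[k] = a * fa[k] + fb[k];
    }

    printf("fa[0] = %f\n", fa[0]);
    return 0;
}

With two threads pinned to a single core, the core can issue an FMA on every clock cycle instead of every other one, which is why two threads per core are needed to reach the peak rate.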

In order to run on two threads, we use the following OpenMP setup calls,

omp_set_num_threads(2);
kmp_set_defaults("KMP_AFFINITY=compact");

where the first call sets the number of threads we will use and the second sets the affinity variable, KMP_AFFINITY, which controls how the threads are placed across the cores (compact fills the thread contexts of one core before moving to the next). To compile the code, use the following command,

$ icc -openmp -mmic -vec-report=3 -O3 helloflops2.c -o helloflops2

Moving this over to the coprocessor and executing it produces the following output,

%  ./helloflops2
Initializing
Starting Compute on 2 threads
GFlops =     51.200, Secs =      1.301, GFlops per sec =     39.358

This result is sensible and nearly exactly matches the maximum achievable computation rate of two threads on a single core at single precision. Specifically,

Peak Single Precision FLOPs = Clock Frequency x Number of Cores x 16 Lanes x 2 (FMA) FLOPs/cycle

and for a single core of our Intel Xeon Phi coprocessor,

1.238 GHz x 1 Core x 16 Lanes x 2 (FMA) FLOPs/cycle = 39.616 GFlops/s

and therefore, we have achieved 99.3% of the theoretical maximum computation rate.

122 Threads and 61 Cores on Coprocessor

This tutorial example code is optimized for two threads per core (the Intel Xeon Phi supports hyper-threading with up to four hardware threads per core), so we maximize the computation rate by placing two OpenMP threads on each core. Using helloflops3.c, we can now seek the maximum computation rate of the coprocessor. Rather than hard-coding the number of threads and the affinity variable in the source, we set them as environment variables.
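Because helloflops3.c leaves the thread count and affinity to the OpenMP runtime, both can be controlled entirely from the shell. The following is only a minimal illustration of that pattern (not the actual helloflops3.c source): it hard-codes nothing and simply reports how many threads the runtime provides, as set by OMP_NUM_THREADS.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int nthreads = 0;

    /* no omp_set_num_threads() call: the runtime honours OMP_NUM_THREADS
       and KMP_AFFINITY taken from the environment */
#pragma omp parallel
    {
#pragma omp single
        nthreads = omp_get_num_threads();
    }

    printf("Starting Compute on %d threads\n", nthreads);
    return 0;
}
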

So to proceed, compile the code on the host,

$ icc -openmp -mmic -vec-report=3 -O3 helloflops3.c -o helloflops3

and transfer the executable to the coprocessor. Then, on the coprocessor, set the environment variables with the following commands,

% export OMP_NUM_THREADS=122
% export KMP_AFFINITY=scatter

where we are requesting that the threads be "scattered" across the 61 available cores, rather than completely filling a single core with threads before proceeding to the next. Now, on the coprocessor, run the executable,

%  ./helloflops3
Initializing
Starting Compute on 122 threads
GFlops =   3123.200, Secs =      1.388, GFlops per sec =   2250.830

which is 93.1% of the theoretical maximum calculation rate for single precision floats (1.238 GHz x 61 Cores x 16 Lanes x 2 (FMA) FLOPs/cycle = 2416.6 GFlops/s).

 

Vectorization

Useful guide for enabling compiler vectorization capability in the Intel compiler

Seed Algorithm

Added:
>
>

References

Revision 9 - 2014/03/05 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
Line: 100 to 100
 Which we immediately see is the expected factor of 16 slow-down from our vectorized code.

Scaling to Multiple Cores

Changed:
<
<
To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP? API for shared memory multiprocessing.
>
>
To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP API for shared memory multiprocessing.
 
Changed:
<
<
* OpenMP? Tutorials: Tutorial
>
>
* OpenMP Tutorials: Tutorial
 

Vectorization

Useful guide for enabling compiler vectorization capability in the Intel compiler

Revision 8 - 2014/02/28 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04

Revision 7 - 2014/02/21 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
Line: 102 to 102
 

Scaling to Multiple Cores

To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP? API for shared memory multiprocessing.
Changed:
<
<
* OpenMP? Tutorials: Tutorial
>
>
* OpenMP? Tutorials: Tutorial
 

Vectorization

Useful guide for enabling compiler vectorization capability in the Intel compiler

Revision 6 - 2014/02/20 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
Line: 100 to 100
 Which we immediately see is the expected factor of 16 slow-down from our vectorized code.

Scaling to Multiple Cores

Added:
>
>
To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP? API for shared memory multiprocessing.

* OpenMP? Tutorials: Tutorial

 

Vectorization

Useful guide for enabling compiler vectorization capability in the Intel compiler

Revision 5 - 2014/02/20 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
Line: 33 to 33
 

Examples

All of the following code lies in the directory, /root/liamb315
Changed:
<
<

Simple Calculation

>
>

One Thread and One Core on Coprocessor

  In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. This key piece of the code is the inner loop,
Line: 68 to 68
 % ./helloflops1
Added:
>
>
which results in the output

% ./helloflops1
Initializing
Starting Compute
GFlops =     25.600, Secs =      3.900, GFlops per sec =      6.563

This test continues to be an anomaly for our system, however, and only achieves 6.5 GFlop/s. Theoretically, we should expect 17.5 GFlop/s (half the 34.9 GFlop/s performance of a single core since scheduling skips every other clock cycle with only one thread running).

Vectorization

Automatic vectorization of code is critical to achieving the maximum performance with this architecture. For the prior simple example, the compiler was able to arrange chunks of the arrays to be loaded into the machine registers and to use up to 16 single precision floating point lanes for simultaneous calculation: the MIC has a 512-bit vector processing unit (VPU), and each single precision number is 32 bits wide (512/32 = 16 lanes).

We can immediately see the impact of vectorization by disabling it via the compile option -no-vec

%  icc -mmic -no-vec -vec-report=3 -O3 helloflops1.c -o helloflops1novec

which results in the output

%  ./helloflops1novec
Initializing
Starting Compute
GFlops =     25.600, Secs =     62.399, GFlops per sec =      0.410

Which we immediately see is the expected factor of 16 slow-down from our vectorized code.

Scaling to Multiple Cores

 

Vectorization

Useful guide for enabling compiler vectorization capability in the Intel compiler

Seed Algorithm

Revision 4 - 2014/02/20 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
Line: 6 to 6
 

Overview

Changed:
<
<
The main goal of this twiki is to document a series of training sessions to teach the basics of doing a particle physics analysis from a practical perspective. The loose set of topics that will be covered are:

  • ROOT -- the C++ based data analysis package that is used in HEP
  • CMSSW -- the software framework used by the collaboration
  • CMS2 -- the software sub-framework used by the UCSD/UCSB/FNAL group (a.k.a. SNT)
  • A full analysis example -- measuring the Z cross section

These topics are not necessarily ordered in any particular way and are only loosely related.

>
>
This twiki provides an end-to-end description of running some basic examples on the Intel Xeon Phi processor and coprocessor that demonstrate various features of parallel computing. All examples are from the book Intel Xeon Phi Coprocessor High-Performance Programming by Jeffers and Reinders. In addition to the basic examples, the twiki highlights a very simple use case of the parallel architecture in the seed processing procedure.
 
Line: 27 to 20
 ssh root@phiphi.t2.ucsd.edu
Changed:
<
<
And once logged in, the 0th coprocessor (mic0) may be accessed via
>
>
and cd to my directory on the Host which holds all of the relevant code
$  cd liamb315

And from the host, we may access the 0th Many Integrated Core (MIC) chip via

 
Changed:
<
<
ssh mic0
>
>
$ ssh mic0
 

Examples

Added:
>
>
All of the following code lies in the directory, /root/liamb315
 

Simple Calculation

Deleted:
<
<
In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. This key piece of the code is the inner loop,
 
Changed:
<
<
fa[k] = a * fa[k] + fb[k];
>
>
In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. This key piece of the code is the inner loop,

fa[k] = a * fa[k] + fb[k];

 
Changed:
<
<
which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor.
>
>
which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor. This enables two floating point operations in a single clock cycle, without loss of precision.

To compile the helloflops1.c code using the Intel C compiler, use the following command

$  icc -mmic -vec -report=3 -O3 helloflops1.c -o helloflops1

Here, -mmic requests that the code be generated for the Intel Many Integrated Core (MIC) architecture, -vec-report=3 asks the compiler to produce a vectorization report, and -O3 enables standard optimizations.

Now, copy the helloflops1 executable object over to the mic0 via the command

$  scp helloflops1 mic0:~/liamb315 

Now, ssh to the mic0 chip

$  ssh mic0

move to the relevant directory and launch the executable

%  cd liamb315
%  ./helloflops1

Vectorization

Useful guide for enabling compiler vectorization capability in the Intel compiler

Seed Algorithm

Revision 3 - 2014/02/19 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
Line: 17 to 17
 
Changed:
<
<

Order of topics (Subject to change)

ROOT

git clone https://github.com/kelleyrw/AnalysisTutorial

Lesson 1

  • Studied the basics of TTree and made efficiency plots for some tracking variables
  • Reading: ROOT user's guide: read ch 1-3,7,12
  • Example code: Lesson 1
TTree example
To facilitate a non-trivial example of making plots, a very simple TTree was constructed using CMSSW that contains the composite generated/simulated particle known as tracking particles. You can think of these tracking particles as the combined generator and simulated truth information of all the debris of p-p collision (e.g. Pythia6 status 1). These tracking particles are associated with reconstructed tracks by looking at the simulated energy deposits in the tracker (sim hits) and matching them to the reconstructed hits from the track reconstruction algorithms (rec hits). We will go into how this TTree was made in a later lesson.

This tree was filled per event and contains a unfiltered list (in the form of a std::vector) of TrackingParticles per event:

Events
  |
  --> list of TrackingParticles
             |
             --> Tracking particle information (p4, # sim hits, d0, dz, ...) 
             --> Matching reconstructed Track info (bogus values filled if no matching track).
>
>

Setting Up

Logging Onto Processor and Coprocessor

Throughout the tutorial, $ will represent the host command prompt and % will represent the coprocessor command prompt.
 
Changed:
<
<
The tree is small (1000 events) and I was able to check into the repository (https://github.com/kelleyrw/AnalysisTutorial/blob/master/week1/trees/tracking_ntuple.root). All the branches should be the same size:
>
>
The Host may be accessed via the root user,
ssh root@phiphi.t2.ucsd.edu
 
Changed:
<
<
// TrakingParticle info
std::vector<LorentzVector> tps_p4:  four momentum
std::vector<int> tps_pdgid:         pdg particle ID code: http://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf
std::vector<double> tps_d0:         transverse impact parameter
std::vector<double> tps_dz:         longitudinal impact parameter
std::vector<bool> tps_matched:      matched to track?  true/false 
std::vector<int> tps_charge:        charge
std::vector<int> tps_nhits:         # of simulated hits

// reco track info
std::vector<LorentzVector> trks_p4:  four momentum
std::vector<double> trks_tip:        transverse impact parameter  (from the TrackingParticle vertex)
std::vector<double> trks_lip:        longitudinal impact parameter  (from the TrackingParticle vertex)
std::vector<double> trks_d0:         transverse impact parameter (using the trajectory builder)
std::vector<double> trks_dz:         longitudinal impact parameter (using the trajectory builder)
std::vector<double> trks_pterr:      pt uncertainty
std::vector<double> trks_d0err:      d0 uncertainty
std::vector<double> trks_dzerr:      dz uncertainty
std::vector<double> trks_chi2:       chi^2 of the track's fit
std::vector<int> trks_ndof:          # degrees of freedom
std::vector<int> trks_nlayers:       # number of valid layers with a measurement
std::vector<bool> trks_high_purity:  # track passes high purity quality requirement
>
>
And once logged in, the 0th coprocessor (mic0) may be accessed via
ssh mic0
 
Added:
>
>

Examples

Simple Calculation

In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. This key piece of the code is the inner loop,

fa[k] = a * fa[k] + fb[k];

which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor.

Revision 2 - 2014/02/19 - Main.WilliamFedus

Line: 1 to 1
 
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04 \ No newline at end of file
Added:
>
>

Intel Xeon Phi Coprocessor

Overview

The main goal of this twiki is to document a series of training sessions to teach the basics of doing a particle physics analysis from a practical perspective. The loose set of topics that will be covered are:

  • ROOT -- the C++ based data analysis package that is used in HEP
  • CMSSW -- the software framework used by the collaboration
  • CMS2 -- the software sub-framework used by the UCSD/UCSB/FNAL group (a.k.a. SNT)
  • A full analysis example -- measuring the Z cross section

These topics are not necessarily ordered in any particular way and are only loosely related.

Order of topics (Subject to change)

ROOT

git clone https://github.com/kelleyrw/AnalysisTutorial

Lesson 1

  • Studied the basics of TTree and made efficiency plots for some tracking variables
  • Reading: ROOT user's guide: read ch 1-3,7,12
  • Example code: Lesson 1
TTree example
To facilitate a non-trivial example of making plots, a very simple TTree was constructed using CMSSW that contains the composite generated/simulated particle known as tracking particles. You can think of these tracking particles as the combined generator and simulated truth information of all the debris of p-p collision (e.g. Pythia6 status 1). These tracking particles are associated with reconstructed tracks by looking at the simulated energy deposits in the tracker (sim hits) and matching them to the reconstructed hits from the track reconstruction algorithms (rec hits). We will go into how this TTree was made in a later lesson.

This tree was filled per event and contains an unfiltered list (in the form of a std::vector) of TrackingParticles per event:

Events
  |
  --> list of TrackingParticles
             |
             --> Tracking particle information (p4, # sim hits, d0, dz, ...) 
             --> Matching reconstructed Track info (bogus values filled if no matching track).

The tree is small (1000 events) and I was able to check into the repository (https://github.com/kelleyrw/AnalysisTutorial/blob/master/week1/trees/tracking_ntuple.root). All the branches should be the same size:

// TrackingParticle info
std::vector<LorentzVector> tps_p4:  four momentum
std::vector<int> tps_pdgid:         pdg particle ID code: http://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf
std::vector<double> tps_d0:         transverse impact parameter
std::vector<double> tps_dz:         longitudinal impact parameter
std::vector<bool> tps_matched:      matched to track?  true/false 
std::vector<int> tps_charge:        charge
std::vector<int> tps_nhits:         # of simulated hits

// reco track info
std::vector<LorentzVector> trks_p4:  four momentum
std::vector<double> trks_tip:        transverse impact parameter  (from the TrackingParticle vertex)
std::vector<double> trks_lip:        longitudinal impact parameter  (from the TrackingParticle vertex)
std::vector<double> trks_d0:         transverse impact parameter (using the trajectory builder)
std::vector<double> trks_dz:         longitudinal impact parameter (using the trajectory builder)
std::vector<double> trks_pterr:      pt uncertainty
std::vector<double> trks_d0err:      d0 uncertainty
std::vector<double> trks_dzerr:      dz uncertainty
std::vector<double> trks_chi2:       chi^2 of the track's fit
std::vector<int> trks_ndof:          # degrees of freedom
std::vector<int> trks_nlayers:       # number of valid layers with a measurement
std::vector<bool> trks_high_purity:  # track passes high purity quality requirement

Revision 1 - 2014/02/04 - Main.WilliamFedus

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WebHome"
-- WilliamFedus - 2014/02/04
 