Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 43 to 43 | ||||||||
which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor. This enables two floating point operations in a single clock cycle, without loss of precision. | ||||||||
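As a rough illustration of the inner loop described here, the following C sketch applies the same update, fa[k] = a * fa[k] + fb[k], to every element of an array. It is not the helloflops1.c source; the function names fma_kernel and flop_count are hypothetical and only serve to show that each element costs one multiply and one add.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative kernel, NOT the actual helloflops1.c source.
 * Applies fa[k] = a * fa[k] + fb[k] to each of the n elements,
 * i.e. one multiply and one add (2 floating point operations) per element. */
void fma_kernel(float *fa, const float *fb, float a, size_t n)
{
    assert(fa != NULL && fb != NULL);
    for (size_t k = 0; k < n; k++) {
        fa[k] = a * fa[k] + fb[k];
    }
}

/* Hypothetical helper: floating point operations performed by
 * `iters` passes of fma_kernel over n elements (2 per element per pass). */
double flop_count(size_t n, size_t iters)
{
    return 2.0 * (double)n * (double)iters;
}
```

A loop of this shape has no dependence between iterations, which is what allows the compiler to map it onto vector FMA instructions.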
Changed: | ||||||||
< < | To compile the helloflops1.c code using the Intel C compiler, use the following command | |||||||
> > | To compile the helloflops1.c code using the Intel C compiler, use the following command | |||||||
$ icc -mmic -vec-report=3 -O3 helloflops1.c -o helloflops1 | ||||||||
Line: 99 to 99 | ||||||||
Which we immediately see is the expected factor of 16 slow-down from our vectorized code. | ||||||||
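The factor of 16 can be checked from the numbers reported for the vectorized and non-vectorized builds (3.900 s versus 62.399 s, and 6.563 versus 0.410 GFlop/s), and it matches the number of 32-bit lanes in a 512-bit vector unit. The arithmetic, written out:

```latex
\[
\frac{512\ \text{bits}}{32\ \text{bits per float}} = 16\ \text{lanes},
\qquad
\frac{62.399\ \text{s}}{3.900\ \text{s}} \approx 16.0,
\qquad
\frac{6.563\ \text{GFlop/s}}{0.410\ \text{GFlop/s}} \approx 16.0
\]
```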
Changed: | ||||||||
< < | Scaling to Multiple Cores
To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP API for shared memory multiprocessing. | |||||||
> > | Two Threads and One Core on Coprocessor
To utilize the full capability of our cores, we must run more than one thread on each core so that an FMA calculation is issued on every clock cycle. Here, we use the OpenMP API for shared memory multiprocessing; below, you will find a set of excellent tutorials maintained by the Lawrence Livermore National Laboratory. | |||||||
* OpenMP Tutorials: Tutorial | ||||||||
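A minimal sketch of how an FMA loop can be spread over several OpenMP threads is given below. It is not the helloflops2.c source: the function name fma_slices and its parameters are hypothetical, and the work is divided with a parallel for over a thread index, which is one possible arrangement. Each index t handles its own slice of the arrays, selected by an offset. If the compiler is not given an OpenMP flag, the pragma is ignored and the slices are simply processed one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch, NOT the actual helloflops2.c source.
 * Splits the arrays into n_threads slices of n_per_thread elements;
 * slice t starts at offset = t * n_per_thread and is updated with
 * fa[k] = a * fa[k] + fb[k]. */
void fma_slices(float *fa, const float *fb, float a,
                int n_threads, int n_per_thread)
{
    assert(fa != NULL && fb != NULL);
    assert(n_threads > 0 && n_per_thread > 0);

    #pragma omp parallel for num_threads(n_threads)
    for (int t = 0; t < n_threads; t++) {
        size_t offset = (size_t)t * (size_t)n_per_thread;
        for (int k = 0; k < n_per_thread; k++) {
            fa[offset + k] = a * fa[offset + k] + fb[offset + k];
        }
    }
}
```

Because the slices do not overlap, the threads never write to the same element and no synchronization is needed inside the loop.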
Added: | ||||||||
> > | To run on two threads, a few OpenMP directives and API calls are added so that the program scales using OpenMP threads. With these additions, OpenMP will run another instance of the code, in parallel and on the same set of data. This new program, helloflops2.c, relies on the OpenMP parallel directive ahead of the for loop. Each thread works on a separate set of array elements, selected by the offset added to the code.
In order to run on two threads, we use the following OpenMP setup calls,
omp_set_num_threads(2);
kmp_set_defaults("KMP_AFFINITY=compact");
where the first call sets the number of threads we will use and the second call sets the affinity variable, KMP_AFFINITY, of the threads across the cores. To compile the code, use the following command,
$ icc -openmp -mmic -vec-report=3 -O3 helloflops2.c -o helloflops2
Moving this over to the coprocessor and executing it results in the following output,
% ./helloflops2
Initializing
Starting Compute on 2 threads
GFlops = 51.200, Secs = 1.301, GFlops per sec = 39.358
This result is sensible and is almost exactly in line with the maximum achievable single precision computation rate of two threads on a single core. Specifically,
Peak Single Precision FLOPs = Clock Frequency x Number of Cores x 16 Lanes x 2 (FMA) FLOPs/cycle
and for our Intel Xeon Phi coprocessor,
1.238 GHz x 1 Core x 16 Lanes x 2 (FMA) FLOPs/cycle = 39.616 GFlops/s
and therefore, we have achieved 99.3% of the theoretical maximum computation rate.
122 Threads and 61 Cores on Coprocessor
This first tutorial example code is optimized for two threads per core (the Intel Xeon Phi chip supports up to four hardware threads per core) and therefore, we will maximize the computation rate by placing two OpenMP threads per core. Now, using helloflops3.c, we can seek the maximum computation rate for the coprocessor. To change the number of threads and the affinity variable, we may set these as environment variables, rather than as lines in the code. So to proceed, compile the code on the host,
$ icc -openmp -mmic -vec-report=3 -O3 helloflops3.c -o helloflops3
and transfer the output to the coprocessor. Then on the coprocessor, we may change the environment variables with the following commands,
% export OMP_NUM_THREADS=122
% export KMP_AFFINITY=scatter
where we are requesting that the threads be "scattered" across the 61 available cores, rather than completely filling a single core with threads before proceeding to the next.
Now, on the coprocessor, run the executable,
% ./helloflops3
Initializing
Starting Compute on 122 threads
GFlops = 3123.200, Secs = 1.388, GFlops per sec = 2250.830
which is 93.1% of the theoretical maximum calculation rate for single precision floats. | |||||||
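The peak-rate arithmetic quoted in this section can be reproduced with a small C helper. The function names peak_sp_gflops and percent_of_peak are hypothetical and do not come from the helloflops sources; the clock frequency (1.238 GHz), lane count (16), core counts and measured rates are the ones given in the text.

```c
#include <assert.h>

/* Theoretical peak single precision rate in GFlop/s:
 * clock (GHz) x number of cores x vector lanes x 2 FLOPs per FMA. */
double peak_sp_gflops(double clock_ghz, int n_cores, int n_lanes)
{
    assert(clock_ghz > 0.0 && n_cores > 0 && n_lanes > 0);
    return clock_ghz * n_cores * n_lanes * 2.0;
}

/* Percentage of the theoretical peak reached by a measured rate. */
double percent_of_peak(double measured_gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * measured_gflops / peak_gflops;
}
```

For example, peak_sp_gflops(1.238, 61, 16) gives the full-chip peak against which the 122-thread measurement of 2250.830 GFlop/s can be compared with percent_of_peak.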
Vectorization
Useful guide for enabling compiler vectorization capability in the Intel compiler
Seed Algorithm | ||||||||
Added: | ||||||||
> > | References |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 100 to 100 | ||||||||
Which we immediately see is the expected factor of 16 slow-down from our vectorized code.
Scaling to Multiple Cores | ||||||||
Changed: | ||||||||
< < | To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP? API for shared memory multiprocessing. | |||||||
> > | To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP API for shared memory multiprocessing. | |||||||
Changed: | ||||||||
< < | * OpenMP? Tutorials: Tutorial | |||||||
> > | * OpenMP Tutorials: Tutorial | |||||||
Vectorization
Useful guide for enabling compiler vectorization capability in the Intel compiler |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 102 to 102 | ||||||||
Scaling to Multiple Cores
To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP? API for shared memory multiprocessing. | ||||||||
Changed: | ||||||||
< < | * OpenMP? Tutorials: Tutorial | |||||||
> > | * OpenMP? Tutorials: Tutorial | |||||||
Vectorization
Useful guide for enabling compiler vectorization capability in the Intel compiler |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 100 to 100 | ||||||||
Which we immediately see is the expected factor of 16 slow-down from our vectorized code.
Scaling to Multiple Cores | ||||||||
Added: | ||||||||
> > | To utilize the full capability of our cores, we must run more than one thread on each core in order to execute the FMA calculation for each clock cycle. Here, we use the OpenMP? API for shared memory multiprocessing.
* OpenMP? Tutorials: Tutorial | |||||||
Vectorization
Useful guide for enabling compiler vectorization capability in the Intel compiler |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 33 to 33 | ||||||||
Examples
All of the following code lies in the directory /root/liamb315 | ||||||||
Changed: | ||||||||
< < | Simple Calculation | |||||||
> > | One Thread and One Core on Coprocessor | |||||||
In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. The key piece of the code is the inner loop, | ||||||||
Line: 68 to 68 | ||||||||
% ./helloflops1 | ||||||||
Added: | ||||||||
> > | which results in the output
move to the relevant directory and launch the executable
% ./helloflops1
Initializing
Starting Compute
GFlops = 25.600, Secs = 3.900, GFlops per sec = 6.563
This test continues to be an anomaly for our system, however, and only achieves 6.5 GFlop/s. Theoretically, we should expect 17.5 GFlop/s (half the 34.9 GFlop/s performance of a single core, since scheduling skips every other clock cycle with only one thread running).
Vectorization
Automatic vectorization of code is critical to achieving the maximum performance with this architecture. For the prior simple example, the compiler was able to arrange chunks of the arrays to be loaded into the machine registers and use up to 16 single precision floating point lanes for simultaneous calculation, since the MIC has a 512-bit vector processing unit (VPU) and each single precision number is 32 bits (512/32 = 16 lanes). We can immediately see the impact of vectorization by disabling it via the compile option -no-vec,
$ icc -mmic -no-vec -vec-report=3 -O3 helloflops1.c -o helloflops1novec
which results in the output
% ./helloflops1novec
Initializing
Starting Compute
GFlops = 25.600, Secs = 62.399, GFlops per sec = 0.410
which we immediately see is the expected factor of 16 slow-down from our vectorized code.
Scaling to Multiple Cores | |||||||
Vectorization
Useful guide for enabling compiler vectorization capability in the Intel compiler
Seed Algorithm |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 6 to 6 | ||||||||
Overview | ||||||||
Changed: | ||||||||
< < | The main goal of this twiki is to document a series of training sessions to teach the basics of doing a particle physics analysis from a practical perspective. The loose set of topics that will be covered are:
| |||||||
> > | This twiki provides an end-to-end description of running some basic examples on the Intel Xeon Phi processor and coprocessor that demonstrate various features of parallel computing. All examples are from the book Intel Xeon Phi Coprocessor High-Performance Programming by Jeffers and Reinders. In addition to the basic examples, the twiki highlights a very simple use case of the parallel architecture in the seed processing procedure. | |||||||
Line: 27 to 20 | ||||||||
ssh root@phiphi.t2.ucsd.edu | ||||||||
Changed: | ||||||||
< < | And once logged in, the 0th coprocessor (mic0) may be accessed via | |||||||
> > | and cd to my directory on the Host which holds all of the relevant code
$ cd liamb315
And from the host, we may access the 0th Many Integrated Core (MIC) chip via | |||||||
Changed: | ||||||||
< < | ssh mic0 | |||||||
> > | $ ssh mic0 | |||||||
Examples | ||||||||
Added: | ||||||||
> > | All of the following code lies in the directory, /root/liamb315 | |||||||
Simple Calculation | ||||||||
Deleted: | ||||||||
< < | In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. The key piece of the code is the inner loop, | |||||||
Changed: | ||||||||
< < | fa[k] = a * fa[k] + fb[k]; | |||||||
> > | In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. The key piece of the code is the inner loop,
fa[k] = a * fa[k] + fb[k]; | |||||||
Changed: | ||||||||
< < | which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor. | |||||||
> > | which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor. This enables two floating point operations in a single clock cycle, without loss of precision.
To compile the helloflops1.c code using the Intel C compiler, use the following command
$ icc -mmic -vec-report=3 -O3 helloflops1.c -o helloflops1
Here, -mmic requests that the code be generated for the Intel Many Integrated Core (MIC) architecture, -vec-report=3 requests a vectorization report, and -O3 requests aggressive compiler optimization.
Now, copy the helloflops1 executable over to mic0 via the command
$ scp helloflops1 mic0:~/liamb315
Now, ssh to the mic0 chip
$ ssh mic0
move to the relevant directory and launch the executable
% cd liamb315
% ./helloflops1
Vectorization
Useful guide for enabling compiler vectorization capability in the Intel compiler
Seed Algorithm |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Line: 17 to 17 | ||||||||
Changed: | ||||||||
< < | Order of topics (Subject to change)ROOT
git clone https://github.com/kelleyrw/AnalysisTutorial Lesson 1
TTree exampleTo facilitate a non-trivial example of making plots, a very simple TTree was constructed using CMSSW that contains the composite generated/simulated particle known as tracking particles. You can think of these tracking particles as the combined generator and simulated truth information of all the debris of p-p collision (e.g. Pythia6 status 1). These tracking particles are associated with reconstructed tracks by looking at the simulated energy deposits in the tracker (sim hits) and matching them to the reconstructed hits from the track reconstruction algorithms (rec hits). We will go into how this TTree was made in a later lesson. This tree was filled per event and contains a unfiltered list (in the form of a std::vector) of TrackingParticles per event:Events | --> list of TrackingParticles | --> Tracking particle information (p4, # sim hits, d0, dz, ...) --> Matching reconstructed Track info (bogus values filled if no matching track). | |||||||
> > | Setting Up
Logging Onto Processor and Coprocessor
Throughout the tutorial, $ will represent the host command prompt and % will represent the coprocessor command prompt. | |||||||
Changed: | ||||||||
< < | The tree is small (1000 events) and I was able to check into the repository (https://github.com/kelleyrw/AnalysisTutorial/blob/master/week1/trees/tracking_ntuple.root). All the branches should be the same size: | |||||||
> > | The Host may be accessed via the root user,
ssh root@phiphi.t2.ucsd.edu | |||||||
Changed: | ||||||||
< < | // TrakingParticle info std::vector<LorentzVector> tps_p4: four momentum std::vector<int> tps_pdgid: pdg particle ID code: http://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf std::vector<double> tps_d0: transverse impact parameter std::vector<double> tps_dz: longitudinal impact parameter std::vector<bool> tps_matched: matched to track? true/false std::vector<int> tps_charge: charge std::vector<int> tps_nhits: # of simulated hits // reco track info std::vector<LorentzVector> trks_p4: four momentum std::vector<double> trks_tip: transverse impact parameter (from the TrackingParticle vertex) std::vector<double> trks_lip: longitudinal impact parameter (from the TrackingParticle vertex) std::vector<double> trks_d0: transverse impact parameter (using the trajectory builder) std::vector<double> trks_dz: longitudinal impact parameter (using the trajectory builder) std::vector<double> trks_pterr: pt uncertainty std::vector<double> trks_d0err: d0 uncertainty std::vector<double> trks_dzerr: dz uncertainty std::vector<double> trks_chi2: chi^2 of the track's fit std::vector<int> trks_ndof: # degrees of freedom std::vector<int> trks_nlayers: # number of valid layers with a measurement std::vector<bool> trks_high_purity: # track passes high purity quality requirement | |||||||
> > | And once logged in, the 0th coprocessor (mic0) may be accessed via
ssh mic0 | |||||||
Added: | ||||||||
> > |
Examples
Simple Calculation
In this first example, we consider the operation on a single core of the Intel Xeon Phi Coprocessor. The key piece of the code is the inner loop,
fa[k] = a * fa[k] + fb[k];
which is a fused multiply and add (FMA) instruction, representing the primary compute capability of the coprocessor. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Added: | ||||||||
> > |
Intel Xeon Phi Coprocessor
Overview
The main goal of this twiki is to document a series of training sessions to teach the basics of doing a particle physics analysis from a practical perspective. The loose set of topics that will be covered are:
Contents:
Order of topics (Subject to change)
ROOT
git clone https://github.com/kelleyrw/AnalysisTutorial
Lesson 1
TTree example
To facilitate a non-trivial example of making plots, a very simple TTree was constructed using CMSSW that contains the composite generated/simulated particle known as tracking particles. You can think of these tracking particles as the combined generator and simulated truth information of all the debris of a p-p collision (e.g. Pythia6 status 1). These tracking particles are associated with reconstructed tracks by looking at the simulated energy deposits in the tracker (sim hits) and matching them to the reconstructed hits from the track reconstruction algorithms (rec hits). We will go into how this TTree was made in a later lesson. This tree was filled per event and contains an unfiltered list (in the form of a std::vector) of TrackingParticles per event:
Events
  |
  --> list of TrackingParticles
        |
        --> Tracking particle information (p4, # sim hits, d0, dz, ...)
        --> Matching reconstructed Track info (bogus values filled if no matching track).
The tree is small (1000 events) and I was able to check it into the repository (https://github.com/kelleyrw/AnalysisTutorial/blob/master/week1/trees/tracking_ntuple.root). All the branches should be the same size:
// TrackingParticle info
std::vector<LorentzVector> tps_p4: four momentum
std::vector<int> tps_pdgid: pdg particle ID code: http://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf
std::vector<double> tps_d0: transverse impact parameter
std::vector<double> tps_dz: longitudinal impact parameter
std::vector<bool> tps_matched: matched to track? true/false
std::vector<int> tps_charge: charge
std::vector<int> tps_nhits: # of simulated hits
// reco track info
std::vector<LorentzVector> trks_p4: four momentum
std::vector<double> trks_tip: transverse impact parameter (from the TrackingParticle vertex)
std::vector<double> trks_lip: longitudinal impact parameter (from the TrackingParticle vertex)
std::vector<double> trks_d0: transverse impact parameter (using the trajectory builder)
std::vector<double> trks_dz: longitudinal impact parameter (using the trajectory builder)
std::vector<double> trks_pterr: pt uncertainty
std::vector<double> trks_d0err: d0 uncertainty
std::vector<double> trks_dzerr: dz uncertainty
std::vector<double> trks_chi2: chi^2 of the track's fit
std::vector<int> trks_ndof: # degrees of freedom
std::vector<int> trks_nlayers: # number of valid layers with a measurement
std::vector<bool> trks_high_purity: # track passes high purity quality requirement |