Analysis Tutorial
Overview
The main goal of this twiki is to document a series of training sessions that teach the basics of doing a particle physics analysis from a practical perspective. The topics that will be covered, loosely, are:
- ROOT -- the C++ based data analysis package that is used in HEP
- CMSSW -- the software framework used by the collaboration
- CMS2 -- the software sub-framework used by the UCSD/UCSB/FNAL group (a.k.a. SNT)
- A full analysis example -- measuring the Z cross section
These topics are not necessarily ordered in any particular way and are only loosely related.
Order of topics (Subject to change)
ROOT
- Documentation
- To check out the example code:
git clone https://github.com/kelleyrw/AnalysisTutorial
Lesson 1
- Studied the basics of TTree and made efficiency plots for some tracking variables
- Reading: ROOT User's Guide, chapters 1-3, 7, and 12
- Example code: Lesson 1
TTree example
To facilitate a non-trivial example of making plots, a very simple TTree was constructed using CMSSW that contains the composite generated/simulated particles known as tracking particles. You can think of these tracking particles as the combined generator and simulated truth information for all the debris of a p-p collision (e.g. Pythia6 status 1). These tracking particles are associated with reconstructed tracks by looking at the simulated energy deposits in the tracker (sim hits) and matching them to the reconstructed hits from the track reconstruction algorithms (rec hits). We will go into how this TTree was made in a later lesson.
The tree is filled once per event and contains an unfiltered list (in the form of a std::vector) of TrackingParticles:
Events
|
--> list of TrackingParticles
|
--> Tracking particle information (p4, # sim hits, d0, dz, ...)
--> Matching reconstructed Track info (bogus values filled if no matching track).
The tree is small (1000 events), so I was able to check it into the repository (https://github.com/kelleyrw/AnalysisTutorial/blob/master/week1/trees/tracking_ntuple.root). All the branches should be the same size:
// TrackingParticle info
std::vector<LorentzVector> tps_p4: four momentum
std::vector<int> tps_pdgid: pdg particle ID code: http://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf
std::vector<double> tps_d0: transverse impact parameter
std::vector<double> tps_dz: longitudinal impact parameter
std::vector<bool> tps_matched: matched to track? true/false
std::vector<int> tps_charge: charge
std::vector<int> tps_nhits: # of simulated hits
// reco track info
std::vector<LorentzVector> trks_p4: four momentum
std::vector<double> trks_tip: transverse impact parameter (from the TrackingParticle vertex)
std::vector<double> trks_lip: longitudinal impact parameter (from the TrackingParticle vertex)
std::vector<double> trks_d0: transverse impact parameter (using the trajectory builder)
std::vector<double> trks_dz: longitudinal impact parameter (using the trajectory builder)
std::vector<double> trks_pterr: pt uncertainty
std::vector<double> trks_d0err: d0 uncertainty
std::vector<double> trks_dzerr: dz uncertainty
std::vector<double> trks_chi2: chi^2 of the track's fit
std::vector<int> trks_ndof: # degrees of freedom
std::vector<int> trks_nlayers: # of valid layers with a measurement
std::vector<bool> trks_high_purity: track passes the high purity quality requirement? true/false
Playing with TTree::Draw
TTree::Draw gives you basic plotting from the ROOT prompt. This is convenient for quick studies and on-the-fly plot making. In addition to chapter 12 on TTree from the ROOT User's Guide, you should also read the documentation on the class web page:
http://root.cern.ch/root/html532/TTree.html#TTree:Draw%2
Open the ROOT file and try playing with the following examples:
[rwk7t@tensor lesson1]$ root ../data/tracking_ntuple.root
root [0]
Attaching file ../data/tracking_ntuple.root as _file0...
root [1] .ls
List the branches in the tree:
root [2] tree->Print()
List the branches in the tree with a less verbose printout:
root [3] tree->GetListOfBranches()->ls()
Draw the tracking particles' pT:
root [4] tree->Draw("tps_p4.pt()")
On the previous plot, the automatic binning choice was suboptimal since it tries to include everything in the histogram range. To specify the binning explicitly:
root [5] tree->Draw("tps_p4.pt()>>(100,0,10)")
In order to keep a handle to the histogram for later manipulation, you can name the output histogram. Then you can do subsequent operations on it:
root [7] tree->Draw("tps_p4.pt()>>h1(100,0,10)")
root [8] h1->SetLineColor(kRed)
root [9] h1->SetTitle("tracking particle p_{T};p_{T} (GeV);A.U.")
root [10] h1->SetLineWidth(2)
root [11] h1->Draw()
To make a selection, use the second argument. This is also an example of how to overlay two plots.
root [12] tree->Draw("tps_p4.pt()>>h_pt_barrel(100,0,10)", "fabs(tps_p4.eta())<1.1");
root [13] h_pt_barrel->Draw();
root [14] h_pt_barrel->SetLineColor(kBlue);
root [15] h_pt_barrel->SetLineWidth(2);
root [16] h_pt_barrel->SetTitle("Tracking Particle p_{T} (Barrel);p_{T} (GeV)");
root [17] h_pt_barrel->Draw();
root [18] tree->Draw("tps_p4.pt()>>h_pt_endcap(100,0,10)", "fabs(tps_p4.eta())>1.1");
root [19] h_pt_endcap->SetLineColor(kRed);
root [20] h_pt_endcap->SetLineWidth(2);
root [21] h_pt_endcap->SetTitle("Tracking Particle p_{T} (Endcap);p_{T} (GeV)");
root [22] h_pt_endcap->Draw();
root [23] h_pt_barrel->Draw("sames");
root [24] TLegend leg(0.3, 0.8, 0.6, 0.5);
root [25] leg.AddEntry(h_pt_endcap, "Endcap");
root [26] leg.AddEntry(h_pt_barrel, "Barrel");
root [27] leg.SetFillStyle(0);
root [28] leg.Draw();
Now at this point, you may be sick of typing in commands every time. Time to use a macro (see chapter 7). Consider the following macro (https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/overlay.C), which is the same as the previous example:
{
tree->Draw("tps_p4.pt()>>h_pt_barrel(100,0,10)", "fabs(tps_p4.eta())<1.1");
h_pt_barrel->Draw();
h_pt_barrel->SetLineColor(kBlue);
h_pt_barrel->SetLineWidth(2);
h_pt_barrel->SetTitle("Tracking Particle p_{T} (Barrel);p_{T} (GeV)");
h_pt_barrel->Draw();
tree->Draw("tps_p4.pt()>>h_pt_endcap(100,0,10)", "fabs(tps_p4.eta())>1.1");
h_pt_endcap->SetLineColor(kRed);
h_pt_endcap->SetLineWidth(2);
h_pt_endcap->SetTitle("Tracking Particle p_{T} (Endcap);p_{T} (GeV)");
h_pt_endcap->Draw();
h_pt_barrel->Draw("sames");
TLegend leg(0.3, 0.8, 0.6, 0.5);
leg.AddEntry(h_pt_endcap, "Endcap");
leg.AddEntry(h_pt_barrel, "Barrel");
leg.SetFillStyle(0);
leg.Draw();
}
To run it, open the ROOT file and then execute the macro:
[rwk7t@tensor lesson1]$ root ../data/tracking_ntuple.root
root [0]
Attaching file ../data/tracking_ntuple.root as _file0...
root [1] .x macros/overlay.C
Info in <TCanvas::MakeDefCanvas>: created default TCanvas with name c1
Efficiency Plots
Our next example uses this simple tree to make an efficiency plot vs eta. We define the efficiency as a ratio:
ε = numerator count / denominator count
where the denominator counts tracking particles that have a reasonable chance of actually being reconstructed:
- non zero charge
- pT > 0.9 GeV
- |η| < 2.5
- |transverse impact parameter| < 3.5 cm
- |longitudinal impact parameter| < 30 cm
- # of simulated hits >= 3
The numerator selection is the same as the denominator selection except that we require the tracking particle to be matched to a reconstructed track.
The following macro produces this plot:
https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/eff.C
[rwk7t@tensor lesson1]$ root ../data/tracking_ntuple.root
root [0]
Attaching file ../data/tracking_ntuple.root as _file0...
root [1] .x macros/eff.C
Info in <TCanvas::MakeDefCanvas>: created default TCanvas with name c1
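If you want to reproduce the gist of what eff.C does directly at the ROOT prompt, here is a rough sketch using TTree::Draw and the cuts listed above. The binning, histogram names, and the choice of binomial errors are mine, not necessarily what the macro does:
{
    // denominator: tracking particles with a reasonable chance of being reconstructed
    const char* den_cut = "tps_charge != 0 && tps_p4.pt() > 0.9 && fabs(tps_p4.eta()) < 2.5 && fabs(tps_d0) < 3.5 && fabs(tps_dz) < 30 && tps_nhits >= 3";
    // numerator: the same selection, plus a match to a reconstructed track
    const char* num_cut = "tps_matched && tps_charge != 0 && tps_p4.pt() > 0.9 && fabs(tps_p4.eta()) < 2.5 && fabs(tps_d0) < 3.5 && fabs(tps_dz) < 30 && tps_nhits >= 3";

    tree->Draw("tps_p4.eta()>>h_den(25,-2.5,2.5)", den_cut, "goff");
    tree->Draw("tps_p4.eta()>>h_num(25,-2.5,2.5)", num_cut, "goff");

    TH1F* h_den = (TH1F*)gDirectory->Get("h_den");
    TH1F* h_num = (TH1F*)gDirectory->Get("h_num");

    // efficiency = numerator / denominator (the "B" option gives binomial errors)
    TH1F* h_eff = (TH1F*)h_num->Clone("h_eff");
    h_eff->Divide(h_num, h_den, 1.0, 1.0, "B");
    h_eff->SetTitle("Tracking efficiency;#eta;#varepsilon");
    h_eff->Draw();
}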
As an exercise, we're going to make this same plot two more times. In the next example, we're going to compile the macro. In general it is a good idea to compile your macros because the interpreter (CINT) is not robust and can incorrectly interpret even simple code. Also, compiling will greatly reduce the execution time if the macro is doing anything significant. See chapter 7 of the User's Guide for more details. The following macro produces the same plot as above but is meant to be run compiled:
https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/eff_compiled.C
[rwk7t@tensor lesson1]$ root
root [0] .L macros/eff_compiled.C++
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson1/./macros/eff_compiled_C.so
root [1] eff_compiled("../data/tracking_ntuple.root", "plots/eff_vs_eta.root", "png")
[eff_compiled] tree is opened with 1000 entries
[eff_compiled] printing plots to plots directory.
Info in <TCanvas::Print>: png file plots/h_num_vs_eta_compiled.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta_compiled.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta_compiled.png has been created
[eff_compiled] writing plots to plots/eff_vs_eta.root
The final example from this lesson is a very simple "looper". A looper is simply a macro that manually loops over the entries in a TTree rather than relying on TTree::Draw(). This has the advantage of speed, since there is only one loop over the tree instead of one per plot. It also has the flexibility to implement arbitrary logic, whereas TTree::Draw, while flexible, is still limited in what you can calculate and plot.
The following macro is a simple looper:
https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/eff_looper.C. It breaks the process into two steps: creating the numerator and denominator plots (CreatePlots) and doing the division to make the efficiency (FinalPlots). The point is that making the numerator/denominator histograms is the "slow" part -- you don't want to remake them if all you need is to change a label on the efficiency plot. This is a simple example of breaking up the workflow to keep the analysis running efficiently. You can, of course, write a function that simply calls both if you don't want to do it in two steps (exercise left to the reader).
root [2] .L macros/eff_looper.C++
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson1/./macros/eff_looper_C.so
root [3] CreatePlots("../data/tracking_ntuple.root", "plots/counts_vs_eta_looper.root")
root [4] FinalPlots("plots/counts_vs_eta_looper.root", "plots/eff_vs_eta_looper.root", "png")
Info in <TCanvas::Print>: png file plots/h_num_vs_eta_looper.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta_looper.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta_looper.png has been created
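For orientation, the heart of any looper is just SetBranchAddress plus a GetEntry loop over the tree. The stripped-down sketch below is illustrative (the function and histogram names are mine, not the actual eff_looper.C code):
#include "TFile.h"
#include "TTree.h"
#include "TH1F.h"
#include <vector>

// minimal looper sketch: open the tree, bind a couple of branches, loop, save a plot
void SimpleLoop(const char* input = "../data/tracking_ntuple.root")
{
    TFile file(input);
    TTree* tree = (TTree*)file.Get("tree");

    // hook up only the branches we need (object branches are read through pointers)
    std::vector<double>* tps_d0      = 0;
    std::vector<bool>*   tps_matched = 0;
    tree->SetBranchAddress("tps_d0",      &tps_d0);
    tree->SetBranchAddress("tps_matched", &tps_matched);

    TH1F h_d0("h_d0", "matched tracking particle d0;d0 (cm)", 100, -3.5, 3.5);
    h_d0.SetDirectory(0); // decouple the histogram from the input file

    // one pass over the tree can fill as many histograms as you like
    for (long long entry = 0; entry < tree->GetEntries(); ++entry)
    {
        tree->GetEntry(entry);
        for (unsigned int i = 0; i < tps_d0->size(); ++i)
        {
            if (tps_matched->at(i)) {h_d0.Fill(tps_d0->at(i));}
        }
    }

    TFile output("plots/simple_loop.root", "RECREATE");
    h_d0.Write();
    output.Close();
}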
Homework
- check out, understand, and run the macros from lesson1
- modify the macros to also produce the efficiency vs pT. The only change to the selection: since this is an efficiency vs pT, reduce the pT threshold to pT > 0.1 GeV.
- Create resolution plots vs eta and pT. They are made from x_{reco} - x_{truth}. Do this using whatever technique you feel is appropriate (a TTree::Draw hint is sketched after this list). These should look similar (but not identical) to:
- write a brief discussion of the features you see in the plots and try to explain them based on what we've gone over in the student-given lectures
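One quick way to start on the resolution exercise is again TTree::Draw, using the fact that the tps_* and trks_* branches are parallel (the same size in every event). The expression, cuts, and binning below are only illustrative guesses to get you going:
// relative pT residual for matched tracking particles, barrel only (cuts and binning are illustrative)
tree->Draw("(trks_p4.pt() - tps_p4.pt())/tps_p4.pt()>>h_ptres(100,-0.1,0.1)",
           "tps_matched && tps_charge != 0 && fabs(tps_p4.eta()) < 1.1");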
Lesson 2
This lesson is an example skeleton for organizing an analysis. It is a short lesson, and the main point is to show how to organize an analysis in a more structured way and to introduce some of the abstractions that will lead us to the conventions of CMSSW.
- Lesson 2 example code
- This example uses the same simple tracking tree from the previous lesson and produces the same plots, to show what I consider a less tedious way of doing it. There will always be some overhead to making plots and doing an analysis; however, with some smart choices it can be less tedious.
Code Layout
- In the previous example looper (eff_looper.C), everything was contained in the same file. Here the code will be refactored and extended to promote some code reuse. Below is a list of all the files and directories:
|____compile.C File to compile the code using ACLiC
|____compile.sh File to compile the code using straight up gcc
|____include Directory to include header files for reusable code
| |____HistTools.h Header file for example ROOT histogram tools
| |____TRKEFF.h Header file for the TTree branch wrapper class
|____lib Directory to contain the libraries
|____macros Directory to contain ACLiC compiled simple macros (usually throw away style code)
| |____FormattedPlots.C Simple macro to print the formatted plots
| |____run_all.C Simple macro to run the analysis
|____plots Directory to contain the output of the plots
|____source Directory to contain the main source code for the analysis
| |____HistTools.cc Source file for the example ROOT histogram tools
| |____TrackingEfficiencyAnalysis.cc Source file for the main analysis looper
| |____TRKEFF.cc Source file for the TTree branch wrapper class
- Note: I'm using the CMSSW convention that *.C files are intended to be ROOT compiled macros and *.cc files are supposed to be fully C++ compliant code.
The main analysis is done in TrackingEfficiencyAnalysis.cc. This is a C++ class that holds all of the metadata and runs the analysis. The main reason to make this a class is to keep all of the relevant variables and data together. If this were a set of functions, you would have to pass a bunch of parameters back and forth -- a class is better suited for this purpose. Also, in the class definition below,
class TrackingEfficiencyAnalysis
{
public:
// construct:
TrackingEfficiencyAnalysis
(
const std::string& output_file_name = "plots/counts_vs_eta_looper.root",
const std::string& suffix = "png",
const bool verbose = true
);
// destroy:
~TrackingEfficiencyAnalysis();
// method to run on the data:
void ScanChain(TChain& chain, long long num_events = std::numeric_limits<long long>::max());
private:
// analysis methods:
void BeginJob();
void EndJob();
void Analyze();
// members:
const std::string m_output_file_name;
const std::string m_suffix;
const bool m_verbose;
TH1Map m_hist_map;
};
the analysis methods were deliberately chosen. You will see this methodology again when we deal with CMSSW.
- The TrackingEfficiencyAnalysis::BeginJob function is to be called before the analysis loop begins. It can be used to setup the initial conditions for the analysis (open files, book histograms, set flags, etc.).
- The TrackingEfficiencyAnalysis::Analyze function is the main logic to be performed on an individual event.
- The TrackingEfficiencyAnalysis::EndJob function is to be called after the analysis loop ends. It can be used to save the results of the analysis (close files, save histograms, print tables, etc.).
- The TrackingEfficiencyAnalysis::ScanChain function is the main looper. Before it loops, it calls the BeginJob() function. It then loops over the whole TChain and calls the Analyze() method for each entry in the tree. Finally, after the loop it calls the EndJob() method.
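To make the flow concrete, here is a schematic of how ScanChain ties the pieces together. This is a simplified sketch, not the actual body of TrackingEfficiencyAnalysis.cc (the real code also hooks up the TRKEFF branch wrapper described below; that detail is omitted here):
// schematic of the looper flow -- simplified, not the real implementation
void TrackingEfficiencyAnalysis::ScanChain(TChain& chain, long long num_events)
{
    BeginJob();  // book histograms, set up the output, etc.

    long long entries = chain.GetEntries();
    if (num_events < entries) {entries = num_events;}

    for (long long entry = 0; entry < entries; ++entry)
    {
        chain.GetEntry(entry);  // the real code also updates the TRKEFF branch wrapper here (omitted)
        Analyze();              // per-event logic: apply selections, fill m_hist_map
    }

    EndJob();    // write/print the histograms, close the output
}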
HistTools -- a starter kit for histogram tools
In order to have some reusable ROOT tools, I've provided a "starter" set for people. I have a much more mature version here, but it may not be a bad idea to start writing some of these yourself to become familiar with ROOT. When you get comfortable, feel free to rip off code from my area or anyone else's, since the main goal is the analysis, not the tools.
HistTools provides a very basic set of histogram tools.
TRKEFF -- the TTree branch wrapper class
TRKEFF is a piece of code that was automatically generated to provide wrapper functions for the branches of this TTree. It only works on the tree in this example; if the tree itself changes, it will need to be regenerated. The main point is twofold:
- It provides a simple interface to access the branches of the tree in a looper without having to call the boilerplate functions every time (see the previous lesson's eff_looper.C).
- It provides "lazy evaluation" of the branches. By default, none of the branches are loaded. In your analysis logic, a branch is loaded the first time you access it and stays loaded until you move to the next entry (event) in the tree.
We will discuss this more when we talk about CMS2 in later lessons.
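The sketch below shows the lazy-loading pattern only; it is not the generated TRKEFF code (the class name and layout are my own, for illustration):
#include "TTree.h"
#include "TBranch.h"
#include <vector>

// one handle per branch: the branch is read the first time it is asked for in a given entry
class LazyBranch
{
public:
    LazyBranch() : m_value(0), m_branch(0), m_entry(-1), m_loaded(false) {}

    void Init(TTree& tree, const char* name)
    {
        m_branch = tree.GetBranch(name);
        m_branch->SetAddress(&m_value);
    }

    // called once per event by the looper: cheap, nothing is read yet
    void SetEntry(long long entry) {m_entry = entry; m_loaded = false;}

    // called by the analysis code: the branch is read only on first access
    const std::vector<double>& Get()
    {
        if (!m_loaded) {m_branch->GetEntry(m_entry); m_loaded = true;}
        return *m_value;
    }

private:
    std::vector<double>* m_value;
    TBranch* m_branch;
    long long m_entry;
    bool m_loaded;
};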
Building and Running the code
I've provided two ways to build the code. The first uses ROOT's wrapper around gcc, called ACLiC (the .L macro.C++ thing). The second uses GCC directly, to show you what is under the hood. There is no reason you have to stick with these two methods, and they each have their pluses and minuses; however, this is all that is really needed to get started on an analysis.
ACLiC
I've provided a simple macro to compile the code called compile.C. To compile this analysis code, you simply run the macro in ROOT:
root [0] .x compile.C+
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson2/./compile_C.so
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/Development/newbies/lesson2/lib/libHistTools.so
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/Development/newbies/lesson2/lib/libTRKEFF.so
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/Development/newbies/lesson2/lib/libTrackingEfficiencyAnalysis.so
(bool)1
When you are ready to run the code, there is a simple wrapper called run_all.C that compiles the code, creates a TrackingEfficiencyAnalysis object, and runs it:
root [0] .x macros/run_all.C
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson2/./compile_C.so
[TrackingEfficiencyAnalysis::ScanChain] finished processing 1000 events
------------------------------
CPU Time: 1.2
Real Time: 1.2
[TrackingEfficiencyAnalysis::EndJob] printing histograms:
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta_sigma.png has been created
GCC directly
To see what is really going on, I've provided a simple script that contains a single command to compile this code as a standalone program (rather than running it in ROOT/CINT). The main reason is to demonstrate that it is possible to compile ROOT objects and classes without the interpreter (CINT) at all. You will find the interpreter has limitations, and sometimes you may want to go around it. One could easily extend this using GNU Make or even a fully fleshed out build system or IDE (Eclipse, Boost Build, CMake, CMSSW's scram, etc.).
To compile from the command prompt:
[rwkelley@uaf-7 lesson2]$ ./compile.sh
g++ -O2 source/TrackingEfficiencyAnalysis.cc source/TRKEFF.cc source/HistTools.cc -o tracking_eff_analysis -pthread -m64 -I/code/osgcode/imacneill/root/05.34.07/include -L/code/osgcode/imacneill/root/05.34.07/lib -lCore -lCint -lRIO -lNet -lHist -lGraf -lGraf3d -lGpad -lTree -lRint -lPostscript -lMatrix -lPhysics -lMathCore -lThread -pthread -lm -ldl -rdynamic -lGenVector -Iinclude
This produces an executable program called "tracking_eff_analysis".
[rwk7t@tensor lesson2]$ ./tracking_eff_analysis
[TrackingEfficiencyAnalysis::ScanChain] finished processing 1000 events
------------------------------
CPU Time: 1.2
Real Time: 1.2
[TrackingEfficiencyAnalysis::EndJob] printing histograms:
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta_sigma.png has been created
Output plots
The output plots will be put in the plots directory. I've provided a macro called FormattedPlots.C to produce a nicer version of the residual plots. First you must load the HistTools library, since the macro uses it:
root [0] .L lib/libHistTools.so
root [1] .x macros/FormattedPlots.C+
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson2/./macros/FormattedPlots_C.so
Info in <TCanvas::Print>: pdf file plots/sigma_res.pdf has been created
This produces the formatted plots.
Homework
- check out, understand, and run the code from lesson2
- modify the code to also produce the resolution vs pT. The only change to the selection: since this is now a plot vs pT, reduce the pT threshold to pT > 0.1 GeV.
- modify the macros to produce the nicely formatted plots using a log scale (see the hint below).
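Hint for the log-scale item: the log scale is a property of the pad/canvas rather than of the histogram. A minimal example (the histogram name and output path are just placeholders):
// assumes a histogram pointer named h_eff_vs_pt is already in memory (e.g. loaded from the output file)
TCanvas c1("c1", "c1");
c1.SetLogy();               // use SetLogx() for a log x axis instead
h_eff_vs_pt->Draw();
c1.Print("plots/h_eff_vs_pt_logy.png");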
Overview of CMSSW
- Where is the code?
- Where is the LXR?
- Where is the Doxygen?
- Where is the data?
Selected WorkBook topics
Chapter 2
Chapter 3
- Ignore PAT
- work flow of an analysis: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookAnalysisOverviewIntroduction
- which release: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookWhichRelease
- CMSSW setup: put this in your .bash_profile:
source /code/osgcode/cmssoft/cms/cmsset_default.sh
export SCRAM_ARCH=slc5_amd64_gcc462
- for this exercise, use CMSSW_5_3_2_patch4 (this is what we currently use in cms2)
- set up a release by doing the following:
cmsrel CMSSW_5_3_2_patch4
pushd CMSSW_5_3_2_patch4/src
cmsenv
popd
- It is often convenient to create a CMSSW project with a special name, so that its contents are more easily recognized by you. For example, one could do:
scramv1 p -n CMSSW_5_3_2_patch4_Tutorial CMSSW CMSSW_5_3_2_patch4
pushd CMSSW_5_3_2_patch4_Tutorial/src
cmsenv
popd
- Get a ROOT file. You can find files in DBS (or DAS) and use xrootd to either open them or copy them locally:
- Find a file that you want to download:
dbsql find file where dataset=/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/Summer12_DR53X-PU_S10_START53_V7A-v1/AODSIM
- Open it with xrootd:
root root://xrootd.unl.edu//store/mc/Summer12_DR53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/AODSIM/PU_S10_START53_V7A-v1/0000/00037C53-AAD1-E111-B1BE-003048D45F38.root
- Or copy it locally:
xrdcp root://xrootd.unl.edu//store/mc/Summer12_DR53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/AODSIM/PU_S10_START53_V7A-v1/0000/00037C53-AAD1-E111-B1BE-003048D45F38.root dy.root
- If you don't have a grid certificate yet, then use the file I copied locally:
root /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root
- Open a ROOT file and play with TTree::Draw(): https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookFWLite
- Example EDM ROOT file: /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root
[rwkelley@uaf-7 lesson3]$ root /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root
root [0]
Attaching file /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root as _file0...
Warning in <TClass::TClass>: no dictionary for class edm::EventAuxiliary is available
Warning in <TClass::TClass>: no dictionary for class edm::Hash<2> is available
Warning in <TClass::TClass>: no dictionary for class edm::EventID is available
Warning in <TClass::TClass>: no dictionary for class edm::Timestamp is available
Warning in <TClass::TClass>: no dictionary for class edm::LuminosityBlockAuxiliary is available
Warning in <TClass::TClass>: no dictionary for class edm::LuminosityBlockID is available
...
Warning in <TClass::TClass>: no dictionary for class pair<unsigned int,string> is available
What are all these warnings? The cause is that CINT/ROOT is independent of CMSSW and thus does not natively understand any of the CMSSW-specific classes (reco::Track, edm::EventAuxiliary, TrackingParticle, etc.). If you are running fully compiled code this is not an issue; however, if you want to look at this file via CINT, you will need to load the dictionaries so that ROOT will understand these classes. Fortunately, this can be accomplished by setting up CMSSW's FWLite:
{
// Set up FW Lite for automatic loading of CMS libraries
// and data formats. As you may have other user-defined setup
// in your rootlogon.C, the CMS setup is executed only if the CMS
// environment is set up.
//
const TString cmsswbase = getenv("CMSSW_BASE");
if (cmsswbase.Length() > 0) {
//
// The CMSSW environment is defined (this is true even for FW Lite)
// so set up the rest.
//
cout << "Loading FW Lite setup." << endl;
gSystem->Load("libFWCoreFWLite.so");
AutoLibraryLoader::enable();
gSystem->Load("libDataFormatsFWLite.so");
gSystem->Load("libDataFormatsPatCandidates.so");
}
}
Load FWLite before you open the file, and the warnings should disappear:
[rwkelley@uaf-7 lesson3]$ root -b
root [0] .x macros/load_fwlite.C
Loading CMSSW FWLite
root [1] TFile* file = TFile::Open("/nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root")
root [2]
Often, people put this into their .rootlogon.C file so that it is loaded by ROOT whenever CMSSW is set up.
Chapter 4
CMSSW Z Analysis exercise
- Write an EDProducer to make a simple ntuple with some basic variables for a quick generator study of the Z boson
- keep the p4 of the Z and its daughter leptons
- gen label --> "genParticles"
- match the daughter leptons to their corresponding reco objects (if they exist).
- muon label --> "muons"
- electron label --> "gsfElectrons"
- Write an EDFilter to filter out Z --> tau tau events
- Write an EDAnalyzer to make some histograms of the quantities above (a skeleton sketch follows this list)
- Try to look at the difference between the generator-level quantities and the reco quantities to get a feel for the resolution with which these quantities are measured (e.g. m_Z vs m_l1l2).
- I'd like to see the following plots at least:
- mass of the Z
- Compare the dilepton invariant mass at gen level and reco level
- for ee, μμ and both together
- Compare pT and η of the leptons (gen and reco level)
- Resolution of the Z mass vs Z mass
- Resolution of the lepton pT vs pT (e's and μ's separately).
- Efficiency of reconstructing the lepton vs pT (e's and μ's separately).
- Please make a few slides summarizing your results. I will look at them and check your code on Friday.
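For the EDAnalyzer part, a skeleton along the following lines should be close to what you need in a 5_3_X release. Treat it as a sketch: the class name, the gen_label parameter, and the pdgId/status selection are illustrative assumptions (madgraph-style status 3 for the Z), and it assumes the TFileService is configured in your python cfg. Check the WorkBook and Doxygen for the authoritative interfaces.
// hypothetical module: a minimal gen-level Z mass analyzer (sketch, not a tested recipe)
#include "FWCore/Framework/interface/EDAnalyzer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/InputTag.h"
#include "FWCore/ServiceRegistry/interface/Service.h"
#include "CommonTools/UtilAlgos/interface/TFileService.h"
#include "DataFormats/Common/interface/Handle.h"
#include "DataFormats/HepMCCandidate/interface/GenParticle.h"
#include "DataFormats/HepMCCandidate/interface/GenParticleFwd.h"
#include "TH1D.h"

class ZGenAnalyzer : public edm::EDAnalyzer
{
public:
    explicit ZGenAnalyzer(const edm::ParameterSet& iConfig)
        : m_gen_label(iConfig.getParameter<edm::InputTag>("gen_label"))  // e.g. "genParticles"
    {
        edm::Service<TFileService> fs;
        m_h_gen_mz = fs->make<TH1D>("h_gen_mz", "generated Z mass;m_{Z} (GeV)", 60, 60, 120);
    }

private:
    virtual void analyze(const edm::Event& iEvent, const edm::EventSetup&)
    {
        edm::Handle<reco::GenParticleCollection> gen_particles;
        iEvent.getByLabel(m_gen_label, gen_particles);
        for (reco::GenParticleCollection::const_iterator p = gen_particles->begin(); p != gen_particles->end(); ++p)
        {
            // illustrative selection: a "documentation" (status 3) Z boson
            if (p->pdgId() == 23 && p->status() == 3) {m_h_gen_mz->Fill(p->mass());}
        }
    }

    edm::InputTag m_gen_label;
    TH1D* m_h_gen_mz;
};

DEFINE_FWK_MODULE(ZGenAnalyzer);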
Building Lesson
C was developed in the 1970s and C++ in the 1980s. Both predate the nice GUI Integrated Development Environments (IDEs) that are quite popular today, such as MS Visual Studio, Eclipse, and Xcode. NOTE: the IDE is not the compiler but merely a nice GUI around the compiler. Most of the time, the nitty-gritty details of compiling and linking code are handled by the IDE, but not always; for large systems, even in industry where IDEs are used, the build system is handled separately from the IDE. Here "build" refers to the process of compiling and linking your code into the final product (exe, libs, etc.).
In CMS, we use SCRAM (Source Configuration, Release, And Management tool). This short lesson motivates the need for a "build" system and then goes into how SCRAM solves the problem for CMSSW.
Why do we care?
C++ is object oriented. This encourages modular code that can be spread among multiple files. Unfortunately, nobody updated the old file management system inherited from C to handle this, so C++ is stuck with quite an antiquated system -- very crude compared to more recently developed languages. The next subsections illustrate some of the issues.
Simple one file program
Consider the following simple program in a file called hello.cpp in the src directory:
#include <iostream>
int main()
{
std::cout << "hello world!" << std::endl;
return 0;
}
Use g++ to compile
[rwk7t@tensor lesson4]$ g++ src/hello.cpp
This produces an exe called a.out. To run it:
[rwk7t@tensor lesson4]$ ./a.out
hello world!
You can control the name of the output exe with the -o option to g++. Here we put the exe in the bin directory:
[rwk7t@tensor lesson4]$ g++ src/hello.cpp -o bin/hello
[rwk7t@tensor lesson4]$ ./bin/hello
hello world!
Multiple File programs
Now what happens if we want to write some reusable functions and put them in separate files? How does this change our building procedure? Consider the function Module1::PrintHello in a separate file called module1.cpp:
#include <iostream>
namespace Module1
{
void PrintHello()
{
std::cout << "hello from module1!" << std::endl;
}
}
And the main program that goes with it is called test1.cpp:
#include "module1.cpp"
int main()
{
Module1::PrintHello();
return 0;
}
Now we compile into an exe in the bin directory:
[rwk7t@tensor lesson4]$ g++ src/test1.cpp -o bin/test1
[rwk7t@tensor lesson4]$ ./bin/test1
hello from module1!
Everything here looks fine. Now consider another function added to our program which uses the Module1::PrintHello function from module1.cpp -- call this module2.cpp:
#include <iostream>
#include "module1.cpp"
namespace Module2
{
void PrintHello()
{
std::cout << "hello from module2!" << std::endl;
Module1::PrintHello();
}
}
Now our main program changes to the following, which is saved in test2.cpp:
#include "module1.cpp"
#include "module2.cpp"
int main()
{
Module1::PrintHello();
Module2::PrintHello();
return 0;
}
Now compiling will give you an error:
[rwk7t@tensor lesson4]$ g++ src/test2.cpp -o bin/test2
In file included from src/module2.cpp:2:0,
from src/test2.cpp:2:
src/module1.cpp: In function 'void Module1::PrintHello()':
src/module1.cpp:5:10: error: redefinition of 'void Module1::PrintHello()'
In file included from src/test2.cpp:1:0:
src/module1.cpp:5:10: error: 'void Module1::PrintHello()' previously defined here
The reason for the error is that module1.cpp has been included twice: once in test2.cpp and once in module2.cpp. Recall from C++ that you can declare a function or class many times but you can only define it once. Because we are including the definition twice, we get a compile error.
To fix the problem, we forward declare the two functions in test3.cpp:
namespace Module1 {void PrintHello();}
namespace Module2 {void PrintHello();}
int main()
{
Module1::PrintHello();
Module2::PrintHello();
return 0;
}
We also need to forward declare the function in module2.cpp:
#include <iostream>
namespace Module1 {void PrintHello();}
namespace Module2
{
void PrintHello()
{
std::cout << "hello from module2!" << std::endl;
Module1::PrintHello();
}
}
And when we compile, we need to supply the definitions to the compiler. This is done by adding all three files to the g++ call; g++ then knows to compile all three files and link them together:
[rwk7t@tensor lesson4]$ g++ src/test3.cpp src/module1.cpp src/module2.cpp -o bin/test3
[rwk7t@tensor lesson4]$ ./bin/test3
hello from module1!
hello from module2!
hello from module1!
Now, editing the files and forward declaring everything by hand is quite cumbersome, so the convention is to factor out forward declarations into interface or "header" files, usually denoted by *.h or *.hpp. So, refactoring the modules and main program:
module1.h:
#ifndef MODULE1_H
#define MODULE1_H
namespace Module1
{
void PrintHello();
}
#endif //MODULE1_H
Notice that the header file has its code surrounded by C preprocessor commands (a "header guard"). This is really only needed for class definitions, since those are typically defined in header files and the guard protects against multiple class definitions. In this particular example you wouldn't need it, but it's good practice.
module1.cpp:
#include "module1.h"
#include <iostream>
namespace Module1
{
void PrintHello()
{
std::cout << "hello from module1!" << std::endl;
}
}
module2.h:
#ifndef MODULE2_H
#define MODULE2_H
namespace Module2
{
void PrintHello();
}
#endif //MODULE2_H
module2.cpp:
#include "module2.h"
#include "module1.h"
#include <iostream>
namespace Module2
{
void PrintHello()
{
std::cout << "hello from module2!" << std::endl;
Module1::PrintHello();
}
}
test4.cpp:
#include "module1.h"
#include "module2.h"
int main()
{
Module1::PrintHello();
Module2::PrintHello();
return 0;
}
It is customary to put the header files in a different folder than the source code. This facilitates a "user interface" that is easy to parse and not polluted with implementation details. It is merely a convention, but it is standard practice. Because the headers in this example live in the include directory, we need to tell g++ where to find them. This is done with the -I switch to g++:
[rwk7t@tensor lesson4]$ g++ src/test4.cpp -I include src/module1.cpp src/module2.cpp -o bin/test4
[rwk7t@tensor lesson4]$ ./bin/test4
hello from module1!
hello from module2!
hello from module1!
Libraries
In the previous section, we saw how to build an exe whose code is spread among multiple files. What happens, though, when you have 10, 100, or 10k files? You cannot possibly compile every file every time you wish to build your code; for intermediate and large systems it might take hours. For example, compiling CMSSW takes several hours and has to be scheduled. Even ROOT takes ~30 minutes.
To save time, you can compile each piece separately; then you only have to re-link, which saves build time:
[rwk7t@tensor lesson4]$ g++ -c src/module1.cpp -Iinclude -o obj/module1.o
[rwk7t@tensor lesson4]$ g++ -c src/module2.cpp -Iinclude -o obj/module2.o
[rwk7t@tensor lesson4]$ g++ -c src/test4.cpp -Iinclude -o obj/test4.o
[rwk7t@tensor lesson4]$ g++ obj/test4.o obj/module1.o obj/module2.o -o bin/test4
This compiles each file into a binary or "object" file using the -c option. An object file does nothing on its own, but the object files are linked together in the last line to create the exe. However, compiling every file into its own object file and listing them all at link time can still be cumbersome for intermediate to large systems, so the convention is to create a reusable library.
Static Libraries
A static library is essentially a collection of the object files for a set of related code, combined into one file that can be linked against to create your main program. The main program pulls the relevant binaries out of the archive and absorbs their content into the exe file. In the following example, we pull module1 and module2 into a single reusable static library:
[rwk7t@tensor lesson4]$ g++ -c src/module1.cpp -Iinclude -o obj/module1.o
[rwk7t@tensor lesson4]$ g++ -c src/module2.cpp -Iinclude -o obj/module2.o
[rwk7t@tensor lesson4]$ ar rvs lib/libmodule.a obj/module1.o obj/module2.o
ar: creating archive lib/libmodule.a
a - obj/module1.o
a - obj/module2.o
NOTE: the file extension convention for static libraries is .a for "archive".
Now you can link against this reusable static library:
[rwk7t@tensor lesson4]$ g++ -Iinclude src/test4.cpp lib/libmodule.a -o bin/test4
[rwk7t@tensor lesson4]$ ./bin/test4
hello from module1!
hello from module2!
hello from module1!
There is an interface that g++ uses to link against libraries. The following is equivalent to the above:
[rwk7t@tensor lesson4]$ g++ -Iinclude src/test4.cpp -L lib -l module -o bin/test4
[rwk7t@tensor lesson4]$ ./bin/test4
hello from module1!
hello from module2!
hello from module1!
Here, -L allows you to specify the path to the library and -l specifies the actual library you wish to link against. Notice the -l argument does not include the extension or the "lib" prefix of the archive file -- these are implied. This is because you may have a dynamic and a static version of the same library and, by default, the dynamic one is preferred. The advantage of this method is that you don't have to keep repeating the path to the libraries, which can make your command really long and unreadable.
Dynamic Libraries
Static libraries are great for small or intermediate size systems; however, since the binary code is copied into the exe, this can cause "code bloat" and make the exe very large (~GB). Also, if you use a hierarchical library scheme where some libs depend on other libs, then you are duplicating binary code and again bloating the binaries. The solution is to use dynamic libraries (also called shared libraries), which allow the exe to find the library at runtime. There is a slight initialization penalty, but it is not really noticeable on modern computers. The advantage is that the exe is small and you don't have to copy binaries into other libs or exes; for systems with "standard" shared libraries already installed, only the exe needs to be shipped. The downside is that you now have to keep track of where the dynamic libraries are, and if you wish to run the exe somewhere else (like the grid), you will have to ship the appropriate libraries as well if they are not already installed there.
To create a dynamic lib:
[rwk7t@tensor lesson4]$ g++ -shared obj/module*.o -o lib/libmodule.so
[rwk7t@tensor lesson4]$ g++ -I include src/test4.cpp -L lib -l module -o bin/test4
[rwk7t@tensor lesson4]$ ./bin/test4
hello from module1!
hello from module2!
hello from module1!
NOTE: the file extension convention varies by platform:
- Windows: *.dll for dynamic link library
- Linux: *.so for shared object
- Mac OSX: *.dylib for dynamic library
Finally, the exe will look for the library in its own directory or in the original place where the dynamic lib was at link time. If you move the library to some central place, you will need to ensure that the exe can find it at runtime. You can set an environment variable called LD_LIBRARY_PATH to include the paths of the libraries you use frequently, and the OS will search there for the shared lib. This is usually done for system-wide code or systems that the user sets up (like ROOT or CMSSW). Mac OSX also has an environment variable, DYLD_LIBRARY_PATH, which unfortunately causes mental friction when using a Mac.
This only scratches the surface of using libraries.
Building an exe with ROOT objects
There is no reason you have to use ROOT interactively through CINT -- it is really just a set of C++ libraries. Consider the following simple program called with_root.cpp:
#include "TH1F.h"
#include "TFile.h"
int main()
{
TFile file("file.root", "RECREATE");
TH1F h1("h1", "h1", 100, -5, 5);
h1.FillRandom("gaus", 10000);
h1.Write();
file.Close();
return 0;
}
If I try to compile this, I'll get an error since g++ doesn't know where to find the ROOT headers and libraries:
[rwk7t@tensor lesson4]$ g++ src/with_root.cpp -o bin/with_root
src/with_root.cpp:1:18: fatal error: TH1F.h: No such file or directory
compilation terminated.
ROOT ships with a nice command line utility that will give you the appropriate g++ options to pass to your command:
[rwk7t@tensor lesson4]$ root-config --cflags --libs
-pthread -std=c++0x -m64 -I/usr/local/cmssw/osx107_amd64_gcc472/cms/cmssw/CMSSW_6_1_2/external/osx107_amd64_gcc472/bin/../../../../../../lcg/root/5.34.03-cms2/include -L/usr/local/cmssw/osx107_amd64_gcc472/cms/cmssw/CMSSW_6_1_2/external/osx107_amd64_gcc472/bin/../../../../../../lcg/root/5.34.03-cms2/lib -lCore -lCint -lRIO -lNet -lHist -lGraf -lGraf3d -lGpad -lTree -lRint -lPostscript -lMatrix -lPhysics -lMathCore -lThread -lpthread -Wl,-rpath,/usr/local/cmssw/osx107_amd64_gcc472/cms/cmssw/CMSSW_6_1_2/external/osx107_amd64_gcc472/bin/../../../../../../lcg/root/5.34.03-cms2/lib -lm -ldl
So to build a standalone exe with ROOT objects, you can do the following:
[rwk7t@tensor lesson4]$ g++ src/with_root.cpp `root-config --cflags --libs` -o bin/with_root
[rwk7t@tensor lesson4]$ ./bin/with_root
The point is you are not stuck with ACLiC to run ROOT code.
Build Systems
To help manage the low-level details from the last section, programs and tools called "build systems" have been developed. These range from a simple bash script to a GUI-driven system (like Visual Studio). Learning a build system can increase your productivity and reduce the mental friction of managing the develop, build, test cycle.
GNU Make
GNU Make is considered a first step towards a build system. GNU Make is not a build system per se, but a tool that can be used to fashion a build system. This section will not go into any real detail other than to give the most basic example. Make is old technology from the 1970s and many more modern systems are available, but it is still widely used as a piece of the build system, so it is worth knowing a bit.
Make has a simple structure, consisting of a series of rules:
target: dependency1 dependency2 ...
<tab> command1
<tab> command2
...
The goal of each rule is to create the target. If the target is older than any of its dependencies, then the commands are executed. You can then build up a series of interconnected rules to build your code. Consider the following example:
# simple Makefile for test4
# target
all: bin/test4
# rule to make the exe
bin/test4: src/test4.cpp lib/libmodule.so
g++ -Iinclude src/test4.cpp -Llib -lmodule -o bin/test4
lib/libmodule.so: obj/module1.o obj/module2.o
g++ -shared obj/module1.o obj/module2.o -o lib/libmodule.so
obj/module1.o: include/module1.h src/module1.cpp
g++ -c src/module1.cpp -Iinclude -o obj/module1.o
obj/module2.o: include/module2.h src/module2.cpp include/module1.h src/module1.cpp
g++ -c src/module2.cpp -Iinclude -o obj/module2.o
# clean up
clean:
if [ -f lib/libmodule.so ]; then rm lib/libmodule.so; fi
if [ -f bin/test4 ]; then rm bin/test4; fi
if [ -f obj/module1.o ]; then rm obj/module1.o; fi
if [ -f obj/module2.o ]; then rm obj/module2.o; fi
if [ -f obj/test4.o ]; then rm obj/test4.o; fi
# PHONY means these rules are always out of date
.PHONY: all clean
The ultimate goal is to create the all rule:
[rwk7t@tensor lesson4]$ make all
g++ -c src/module1.cpp -Iinclude -o obj/module1.o
g++ -c src/module2.cpp -Iinclude -o obj/module2.o
g++ -shared obj/module1.o obj/module2.o -o lib/libmodule.so
g++ -Iinclude src/test4.cpp -Llib -lmodule -o bin/test4
Notice the rules get called in reverse order since nothing has been made yet. Also, note that if you just type make, the all rule is implied. This is convenient since whenever a file is out of date, make will call the appropriate rule, and so on, until the all rule is satisfied.
There is also a user-defined rule, clean, which deletes all of the binary and library files:
[rwk7t@tensor lesson4]$ make clean
if [ -f lib/libmodule.a ]; then rm lib/libmodule.a; fi
if [ -f bin/test4 ]; then rm bin/test4; fi
if [ -f obj/module1.o ]; then rm obj/module1.o; fi
if [ -f obj/module2.o ]; then rm obj/module2.o; fi
if [ -f obj/test4.o ]; then rm obj/test4.o; fi
This might seem magical, but even for intermediate size systems (like an analysis), it can get out of hand and become cumbersome. Some more digging into Make's features can alleviate some of this, but in the end there is still quite a bit of work to get Make working well.
See:
http://oreilly.com/catalog/make3/book/index.csp
ACLiC
For interactive work, ROOT uses a system that ships with CINT called the Automatic Compiler of Libraries for CINT (ACLiC):
root [0] .L macro.C++
ACLiC is nice for simple and mostly standalone code. Building larger systems can be troublesome. More on ACLiC can be found here:
http://root.cern.ch/download/doc/ROOTUsersGuideChapters/CINT.pdf.
The main disadvantage of ACLiC is that it has a hard time parsing complex code, especially code with templates -- hence the need for something more robust for complex systems.
SCRAM
CMSSW uses SCRAM (Source Configuration, Release, And Management tool). The following twikis explain it in more detail:
SCRAM BuildFiles -- Short Tutorial
SCRAM is a build system that is a wrapper around make. Recall that CMSSW is broken into Subsystem/Package units (e.g. Reconstruction/TrackReco or Demo/DemoZAnalysis). This is historical from the ORCA days (the predecessor to CMSSW). To first order, each package becomes a shared library (*.so) that lives in the $CMSSW_BASE/lib/$SCRAM_ARCH directory (e.g. on my mac -- CMSSW_6_1_2/lib/osx107_amd64_gcc472/).
Subsystem1 --> package 1 --> BuildFile.xml --> libSubsystem1Package1.so
           --> package 2 --> BuildFile.xml --> libSubsystem1Package2.so
           --> package 3 --> BuildFile.xml --> libSubsystem1Package3.so
....
Subsystem2 --> package 1 --> BuildFile.xml --> libSubsystem2Package1.so
....
You can do fancier things like build a standalone executable or a plugin. This is beyond the scope of this tutorial -- see the twikis if you want to go there. I've done some of this and it's not terrible.
EDM Plugin
To use SCRAM, you present to the system a BuildFile.xml (or CMS.BuildFile), which is a declarative statement of which libraries need to be included to build the package. Consider the following example of an analysis:
Analysis C++ class                  Subsystem/Package it lives in
|-------> reco::GenParticles --> DataFormats/HepMCCandidate
--> reco::Muons --> DataFormats/MuonReco
--> reco::GsfElectrons --> DataFormats/EgammaReco
--> edm::EDAnalyzer --> FWCore/Framework
--> edm::Event --> FWCore/Framework
--> edm::ParameterSet --> FWCore/ParameterSet
--> TH1D --> ROOT
--> TFileService --> PhysicsTools/UtilAlgos
....
To properly compile and LINK this analysis package, consider the following BuildFile.xml in $CMSSW_BASE/src/Ryan/Analysis:
<use name="FWCore/Framework"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ParameterSet"/>
<use name="DataFormats/HepMCCandidate"/>
<use name="DataFormats/MuonReco"/>
<use name="DataFormats/EgammaReco"/>
<use name="PhysicsTools/UtilAlgos"/>
<use name="root"/>
<flags EDM_PLUGIN="1"/>
This BuildFile declares a library called libRyanAnalysis.so, which is also an EDM plugin that cmsRun will understand (EDProducer, EDAnalyzer, EDFilter, ...). It is registered with the plugin manager at compile time.
- The lines with <use name="Subsystem/Package"/> mean to include and link against that library.
- The line <flags EDM_PLUGIN="1"/> means to promote this library to an EDM plugin.
- The "1" means to use the directory structure to resolve the Subsystem/Package (i.e. "Subsystem/Package"); you could have named it something different, but this is conventional.
So when you run cmsRun, the plugin can be referenced in the python configuration, loaded at runtime, and it does whatever it was designed to do.
These dependencies change as you write more or change your code. It takes some experience to build up knowledge of where all the different packages are in CMSSW -- the LXR and Doxygen are the best places to figure this out.
Reusable Library
Not all the code in CMSSW is an EDM plugin. Sometimes you just want to make a reusable library that other packages use (e.g. data formats or analysis tools). This is simple in SCRAM: you just don't declare an EDM_PLUGIN, and you "export" the lib:
<use name="rootcore"/>
<use name="boost"/>
<use name="gsl"/>
<use name="AnalysisTools/LanguageTools"/>
<export>
<lib name="1"/>
<use name="root"/>
</export>
Here I include the packages that are needed and export the library name. Again, the "1" for the lib name means to use the directory structure to resolve the Subsystem/Package (i.e. "Subsystem/Package"). By exporting, I'm saying there will be a library available to everyone under the name "Subsystem/Package" (e.g. AnalysisTools/RootTools).
Purpose of Exporting
Consider that we wrote a reusable library called MyTools that we want to expose to "clients" in our system. Consider the following awesome drawing:
Here, the implementation (below the squiggly line) of MyTools uses three different packages: ROOT, the Boost C++ libraries, and the GNU Scientific Library (GSL). The MyTools package uses these three packages in the implementation of the code but not necessarily in the interface. In the interface (above the squiggly line), it only uses ROOT. For example, consider the following pseudo-code from MyTools:
MyAwesomePlotMaker.h
#ifndef MYAWESOMEPLOTTOOLS_H
#define MYAWESOMEPLOTTOOLS_H
#include "TH1F.h"
namespace MyTools
{
TH1F MyAwesomePlotMaker();
}
#endif // MYAWESOMEPLOTTOOLS_H
MyAwesomePlotMaker.cc
#include "MyAwesomePlotMaker.h"
#include "TH1F.h"
#include "boost/shared_ptr.hpp"
#include "gsl/gsl_matrix.h"
namespace MyTools
{
TH1F MyAwesomePlotMaker()
{
// awesome code that does everything
...
}
}
To properly link against these, the following lines are needed in the BuildFile.xml:
<use name="rootcore"/>
<use name="boost"/>
<use name="gsl"/>
However, the user or "client" of the MyTools package does not need to know about Boost and GSL, since those binaries are linked dynamically into MyTools. However, whenever you use something from another package in the interface (header files) of your package, you do need to tell the client about it. For example, since MyTools includes "TH1F.h" in its header, a client of MyTools also has to know about ROOT; if he just includes the MyTools package, he will get a linker error. To fix this, he would have to explicitly include ROOT as a dependency in the client's BuildFile.xml. This is not sustainable, since every client would have to figure out that it needs ROOT whenever it uses MyTools. It is error prone and inelegant: what if MyTools adds another dependency? If there are 1000 "clients" of MyTools, are you going to change the 1000 BuildFiles that include MyTools?
The way SCRAM deals with this is the "export" block. When I export my lib, I also export the dependencies that the client needs in order to link against it:
<export>
<lib name="1"/>
<use name="root"/>
</export>
Here I am exposing my library, called Subsystem/Package, and declaring that whenever you use this library, you also get ROOT with it. So if the interface to MyTools changes and you need another package in the interface, you only have to change the BuildFile.xml for that one package. The picture of the build dependencies now looks like:
The client now gets the ROOT dependency implicitly. The final BuildFile.xml to realize this picture will look like:
<use name="rootcore"/>
<use name="boost"/>
<use name="gsl"/>
<export>
<use name="root"/>
<lib name="1"/>
</export>
Unfortunately, in practice, people are not aware of and/or disciplined enough to use the export block properly. But you should be aware of it and try to use this feature -- it is also standard practice in commercial C++ systems.
FYI, in my opinion, the Boost build system handles this problem particularly well and is not tied to CMSSW...
Other systems
There are other systems that can be used as a build system. One goal of a modern build system is to be "platform independent" -- i.e. not to depend on the system the code is compiled on (Windows, Linux, PS3, iOS, etc.). Examples of build systems include, but are not limited to, those on the wiki page:
http://en.wikipedia.org/wiki/List_of_build_automation_software
CMS2
CMS2 is the name of the ntupling and supporting framework used by the UCSD/UCSB/FNAL group (a.k.a. TAS or SNT). The workflow is to produce, for the whole group, a set of standard CMS2-style ntuples using CMSSW. The ntuples are analysis-ready, containing the relevant variables needed to get started on a physics analysis, but are general enough to remain flexible. An individual analysis (e.g. same-sign dileptons) takes these ntuples and makes further selections to produce the final analysis.
Where is the CMS2 code?
The code is now kept in github:
https://github.com/cmstas
how do I produce CMS2 ntuples?
The CMS2 ntuples are produced via cmsRun from CMSSW. The instructions are kept on the SNT twiki:
http://www.t2.ucsd.edu/tastwiki/bin/view/CMS/NtupleMakingInstructions
Where are they kept?
The ntuples are produced by various members of the group (usually students/postdocs) and are kept in a central place. If you are charged with creating ntuples, please ensure that this twiki is up to date with ALL the fields filled in. If you don't know the value to put in a field -- ask someone!
Example CMS2 Looper
This section outlines how to check out and compile the example CMS2 looper. Note that there is no "standard" way to build this code in the group -- the user is on his own to figure it out. This example uses SCRAM since it is supported by the CMS collaboration and you can get help via hypernews. If you decide to use something else (ACLiC, make, CMake, Boost Build, an IDE, etc.), then you are on your own.
checkout the code
First decide where to put the code in your file system. This example will assume that you created a directory called cms2_project_example.
Create a base level project directory:
mkdir cms2_project_example
cd cms2_project_example
Check out the AnalysisTutorial code:
git clone https://github.com/kelleyrw/AnalysisTutorial.git
Create a CMSSW release area and set up the environment:
scramv1 p -n CMSSW_5_3_2_patch4_cms2example CMSSW CMSSW_5_3_2_patch4
cd CMSSW_5_3_2_patch4_cms2example/src
cmsenv
Checkout and compile the CMS2 CORE and Dictionaries:
git clone https://github.com/cmstas/Dictionaries.git CMS2/Dictionaries
git clone https://github.com/cmstas/CORE.git CMS2/NtupleMacrosCore
cd CMS2/NtupleMacrosCore
git pull
git checkout scram_compliant
./setup/setupCoreForSCRAM.sh setup/cms2_ntuple_postprocess_05.03.23.root
cd $CMSSW_BASE/src
CORE is the set of common code the group shares to ensure reproducibility. It is mainly for implementing selections and corrections to physics objects (e.g. JEC). Also, to facilitate easy access to the ntuple's branches in looper code, CMS2.h/cc are automatically generated; CORE uses CMS2.h heavily.
Note: CMS2.cc is actually quite slow to compile since it is ~33k lines long, which is the reason it is made into its own library.
Copy the analysis code over to the CMSSW area (for your own code, depending on how you set it up, you could just check it out directly):
cp -r ../../AnalysisTutorial/cms2_examples/CMS2Project/* .
Compile the code one last time to make sure all the modules are compiled:
scram b -j20
code layout
There should be three directories in your analysis area:
- CMS2: this is the CMS2 code
- Dictionaries: class definitions so that ROOT can understand the non-ROOT native CMS2 types.
- NtupleMacrosCore: the "sandbox" for the CORE common code
- NtupleMacrosHeader: the "sandbox" for the CMS2.h/cc that was produced by the setup script from CORE.
- Packages: this is a "starter" reusable toolkit. Feel free to run with this...
- HistTools: A set of basic ROOT histogram helper functions and classes.
- LooperTools: A set of basic tools for writing a looper (not necessarily CMS2 specific)
- ... add whatever other packages you feel you need...
- Analysis: for the analysis specific code
- ExampleCMS2Looper: the actual example looper
- ExampleCMS2BabyMaker: an example baby maker (see next section)
- ...add as many analyses as you need...
.
|____CMS2
| |____Dictionaries
| |____NtupleMacrosCore
| |____NtupleMacrosHeader
|____Packages
| |____HistTools
| |____LooperTools
|____Analysis
| |____ExampleCMS2Looper
How do I run it?
There are two examples of running the looper. To run interactively, use the ROOT non-compiled script:
cd Analysis/ExampleCMS2Looper
root -b -q -l macros/run_sample.C
Also, there is an example of a fully compiled executable which is defined in bin/run_sample.cc:
run_sample
This should produce a ROOT file in the output directory that contains your histogram(s).
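If you want to quickly inspect the output, you can open it interactively. The file and histogram names below are placeholders, since they depend on how you configured the looper:
root -l output/example_output.root
root [0] .ls
root [1] h_example->Draw();  // hypothetical histogram name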
Example CMS2 Baby Maker
This builds off the concepts in the previous subsection on CMS2 Loopers. In many cases, it may make more sense to create a small ntuple (a.k.a. a "baby") that is a more cooked and less flexible version of the original CMS2 ntuple. These are usually very analysis specific and are usually both a skim and a translation of the general CMS2 variables into more specific analysis variables. The babies have the following advantages:
- the final variables you need for your analysis already calculated
- small and lightweight (e.g. all of my SS babies fit on my laptop)
- remaking plots with different binning is trivial
- great for specific studies
The main disadvantage of a baby is when you run into a situation where you need a variable from the CMS2 ntuple that you didn't store -- at that point you will have to re-run and re-create your baby ntuples.
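For concreteness, here is a minimal, generic sketch of what filling a baby tree looks like in ROOT. The variable and file names are placeholders, not the ones used in the example code:
// minimal baby-making sketch (placeholder names)
#include "TFile.h"
#include "TTree.h"

void make_baby_sketch()
{
    TFile output("baby.root", "RECREATE");
    TTree tree("tree", "example baby tree");

    // analysis-level variables, filled once per event (or per lepton, etc.)
    float dilep_mass = -999.0;
    float scale1fb   = 1.0;
    bool  is_ee      = false;
    tree.Branch("dilep_mass", &dilep_mass, "dilep_mass/F");
    tree.Branch("scale1fb"  , &scale1fb  , "scale1fb/F"  );
    tree.Branch("is_ee"     , &is_ee     , "is_ee/O"     );

    // ... in the event loop: compute the variables from the CMS2 branches, then ...
    tree.Fill();

    tree.Write();
    output.Close();
}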
checkout the code
This is the same workflow as before for the CMS2 Looper.
code layout
There should be three directories in your analysis area:
- CMS2: this is the CMS2 code
- Dictionaries: class definitions so that ROOT can understand the non-ROOT native CMS2 types.
- NtupleMacrosCore: the "sandbox" for the CORE common code
- NtupleMacrosHeader: the "sandbox" for the CMS2.h/cc that was produced by the setup script from CORE.
- Packages: this is a "starter" reusable toolkit. Feel free to run with this...
- HistTools: A set of basic ROOT histogram helper functions and classes.
- LooperTools: A set of basic tools for writing a looper (not necessarily CMS2 specific)
- ... add whatever other packages you feel you need...
- Analysis: for the analysis specific code
- ExampleCMS2Looper: an example looper (see previous section)
- ExampleCMS2BabyMaker: an example baby maker
- ...add as many analyses as you need...
.
|____CMS2
| |____Dictionaries
| |____NtupleMacrosCore
| |____NtupleMacrosHeader
|____Packages
| |____HistTools
| |____LooperTools
|____Analysis
| |____ExampleCMS2Looper
| |____ExampleCMS2BabyMaker
how does it work?
The code for the example baby maker is located:
The main logic to fill the tree is in the class CMS2BabyMaker:
running the code
To run interactively, use the non-compiled ROOT script create_baby.C:
https://github.com/kelleyrw/AnalysisTutorial/blob/master/cms2_examples/CMS2Project/Analysis/ExampleCMS2BabyMaker/macros/create_baby.C
The arguments to create_baby.C are:
- const std::string& sample_name --> sample name to allow for sample dependent logic
- const std::string& input_filename --> input TTree files
- const std::string& output_filename --> output baby TTree path
- const std::string& runlist_filename --> good run list
- const long num_events = -1 --> number of events to run on
- const double lumi = 1.0 --> luminosity in fb^-1
- const bool verbose = false --> toggle on/off extra debugging printout
The following is an example of how to run the baby maker:
root -b -q "macros/create_baby.C(\"dyll\", \"/nfs-7/userdata/rwkelley/cms2/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1.root\", \"babies/dyll.root\", \"json/Merged_190456-208686_8TeV_PromptReReco_Collisions12_goodruns.txt\", -1, 0.082, 0)"
running the code in parallel
The code can be run in parallel by using the script create_babies_all.sh:
https://github.com/kelleyrw/AnalysisTutorial/blob/master/cms2_examples/CMS2Project/Analysis/ExampleCMS2BabyMaker/scripts/create_babies_all.sh
Just call the script to execute the baby maker on all the datasets:
./scripts/create_babies_all.sh
examples on the ROOT interpreter
// raw
root [1] tree->Draw("gen_p4.mass()>>h1(150, 0, 150)", "is_gen_ee");
Info in <TCanvas::MakeDefCanvas>: created default TCanvas with name c1
root [2] cout << h1->Integral(0,-1) << endl;
238287
// scaled
root [2] TH1D h1("h1", "dielectron mass;m_{ee} (GeV)", 150, 0, 150);
root [3] tree->Draw("gen_p4.mass()>>h1", "0.082*scale1fb*is_gen_ee");
root [4] cout << h1->Integral(0,-1) << endl;
99815.7
Example Lepton Study
Say one needs to do a study on lepton ID and isolation. The simplest way conceptually (maybe not from a programming standpoint) is to study the individual leptons and not necessarily the whole event. In this sub-section, a simple skeleton single lepton study is performed. This will be done using the following steps:
- Define the interesting variables.
- Make a simple "lepton based" baby (i.e. TTree) that holds the variables and is filled per lepton (as opposed to per event).
- Define a set of selections on the variables and apply them one-by-one cumulatively while you observe the effect on the variables' distributions. These are the so-called N-1 plots (a short sketch of this cumulative-cut approach is given below).
- Summarize the selections in a "cut-flow".
- Present the results in slides (or whatever).
Again, this builds off the concepts in the previous subsection on CMS2 Loopers. The main exception is that the baby ntuple is filled per reconstructed lepton rather than per event. This choice gives us a simpler format to work with when studying per-lepton quantities. The disadvantage is that you give up the ability to explore event-level structure, but if you need that, you can redefine your format.
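As a rough illustration of the cumulative-cut approach mentioned above (a sketch only; the branch names, selections, and file name are placeholders rather than the ones used in the actual study):
// sketch: apply selections cumulatively and draw the same variable after each step
#include "TChain.h"
#include "TCut.h"
#include "TString.h"
#include <vector>

void cumulative_cut_sketch()
{
    TChain tree("tree");
    tree.Add("babies/dyll_baby.root");  // placeholder input

    // selections applied one-by-one on top of each other (placeholder variables)
    std::vector<TCut> cuts;
    cuts.push_back(TCut("base", "1"           ));  // no selection
    cuts.push_back(TCut("fid" , "abs(eta)<2.5"));
    cuts.push_back(TCut("pt20", "pt>20"       ));

    TCut cumulative("1");
    for (size_t i = 0; i < cuts.size(); ++i)
    {
        cumulative = cumulative && cuts[i];
        // one histogram per cumulative selection step
        tree.Draw(Form("pt>>h_pt_%s(50, 0, 100)", cuts[i].GetName()), cumulative, "goff");
    }
}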
checkout the code
This is the same workflow as before with one exception: this study was written in a newer CMSSW release area to take advantage of the newer compiler. The instructions are the same as here, except use the following CMSSW release:
Create a CMSSW release area and set up the environment:
export SCRAM_ARCH=slc5_amd64_gcc472
scramv1 p -n CMSSW_6_1_2_cms2example CMSSW CMSSW_6_1_2
cd CMSSW_6_1_2_cms2example/src
cmsenv
code layout
There should be three directories in your analysis area:
- CMS2: this is the CMS2 code
- Dictionaries: class definitions so that ROOT can understand the non-ROOT native CMS2 types.
- NtupleMacrosCore: the "sandbox" for the CORE common code
- NtupleMacrosHeader: the "sandbox" for the CMS2.h/cc that was produced by the setup script from CORE.
- Packages: this is a "starter" reusable toolkit. Feel free to run with this...
- HistTools: A set of basic ROOT histogram helper functions and classes.
- LooperTools: A set of basic tools for writing a looper (not necessarily CMS2 specific)
- ... add whatever other packages you feel you need...
- Analysis: for the analysis specific code
- ExampleCMS2Looper: an example looper (see previous sub-sections)
- ExampleCMS2BabyMaker: an example baby maker (see previous sub-sections)
- LeptonSelectionStudy: A single lepton analysis to study the selections.
.
|____CMS2
| |____Dictionaries
| |____NtupleMacrosCore
| |____NtupleMacrosHeader
|____Packages
| |____HistTools
| |____LooperTools
|____Analysis
| |____ExampleCMS2Looper
| |____ExampleCMS2BabyMaker
| |____LeptonSelectionStudy
how does it work?
The code for the single lepton baby maker is located:
Rather than spread the baby-making code over multiple files, it has been consolidated into one fully compiled executable program:
This will produce a program called create_singlelep_baby. The code was written in one file since it is probably not going to be re-used somewhere else. If you do decide to make this a re-usable class, you can always factor the pieces out into a separate source file (i.e. *.cc) and header file (i.e. *.h), and set up the library using the BuildFile.xml.
running the code
To compile the code, you use scram:
scram b -j20
This code defines an executable called create_singlelep_baby. It uses boost program_options to parse command line arguments. To see the available arguments, use --help.
[rwk7t@tensor LeptonSelectionStudy]$ create_singlelep_baby --help
Allowed options:
--help print this menu
--sample arg REQUIRED: Sample name to run on
--output arg output ROOT file
--nevts arg maximum number of events to run
--verbose arg verbosity toggle
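For reference, below is a minimal sketch of the kind of boost program_options setup that produces a menu like the one above. This is not the actual source of create_singlelep_baby, just the typical pattern; the defaults are placeholders.
// sketch of a boost program_options driver (placeholder defaults)
#include <boost/program_options.hpp>
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    namespace po = boost::program_options;

    std::string sample;
    std::string output = "babies/baby.root";
    long nevts         = -1;
    bool verbose       = false;

    po::options_description desc("Allowed options");
    desc.add_options()
        ("help"   ,                                               "print this menu"                )
        ("sample" , po::value<std::string>(&sample)->required(),  "REQUIRED: Sample name to run on")
        ("output" , po::value<std::string>(&output)             , "output ROOT file"               )
        ("nevts"  , po::value<long>(&nevts)                     , "maximum number of events to run")
        ("verbose", po::value<bool>(&verbose)                   , "verbosity toggle"               )
    ;

    po::variables_map vm;
    try
    {
        po::store(po::parse_command_line(argc, argv, desc), vm);
        if (vm.count("help")) {std::cout << desc << "\n"; return 0;}
        po::notify(vm);  // throws if a required option is missing
    }
    catch (const std::exception& e)
    {
        std::cerr << "[create_singlelep_baby] Error: " << e.what() << "\n" << desc << "\n";
        return 1;
    }

    // ... run the baby maker with the parsed arguments ...
    return 0;
}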
The following is an example of how to run the baby maker:
[rwk7t@tensor LeptonSelectionStudy]$ create_singlelep_baby --sample dyll
[create_singlep_baby] inputs:
sample = dyll
input = /nfs-7/userdata/rwkelley/cms2/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1.root
output = babies/dyll_baby.root
nevts = -1
verbose = 0
Loading CMSSW FWLite.
running single lepton baby maker...
[LeptonD0] WARNING - first good vertex index < 0. Returning bogus value 999999
[LeptonDz] WARNING - first good vertex index < 0. Returning bogus value 999999
---> 100.0% <---
[SingleLeptonNtupleMaker] Saving TTree to output file: babies/dyll_baby.root
62454 Events Processed
0 Duplicate Events
0 Bad Events
------------------------------
CPU Time: 14.1
Real Time: 14.2
So far, there are only two samples defined:
- "dyll" → Drell-Yan dataset
- "QCD" → QCD dataset
If you run on another, it should fail...
[rwk7t@tensor LeptonSelectionStudy]$ create_singlelep_baby --sample bogus
[create_singlep_baby] Error: failed...
[create_singlelp_baby] sample bogus is not valid.
One can of course make this code more general, but usually this is sufficient for a specific study like this one.
running the code in parallel
The code can be run in parallel by using the python script create_babies_all.py:
Just call the script to execute the baby maker on all the datasets:
./scripts/create_babies_all.py
Create the Plots
Load Interactive HistTools
In order to maximize code-reuse and make plotting easier, use the HistTools interactively in CINT. To load the tools, add the following lines to your .rootlogon.C (or call it in each session):
// Load HistTools
gROOT->ProcessLine(".x $CMSSW_BASE/src/Packages/HistTools/macros/load_histtools.C");
This is a macro that loads the library and sets the include statements in a ROOT session:
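The macro itself lives in the repository; a minimal sketch of what such a loader typically does is shown below. The library name and include path are assumptions -- check load_histtools.C for the real ones.
// load_histtools.C (sketch -- library and include names are placeholders)
{
    // make the package headers visible to the interpreter
    gSystem->AddIncludePath("-I$CMSSW_BASE/src");

    // load the compiled HistTools shared library built by scram
    gSystem->Load("libPackagesHistTools");
}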
run ROOT compiled plotting script
A compiled ROOT macro uses TTree::Draw statements to create all the plots for a set of selections.
This macro has several parts:
- GetSelections → a function returning a list of selections to be applied successively (i.e. a std::vector)
- BookHists → book the histograms
- CreateElectronHists → fill the electron histograms using TTree::Draw
- PrintOverlays and PrintCutflowOverlay → overlay the Drell-Yan and QCD plots
- CreatePlots → main program of sorts
Example:
[rwk7t@tensor LeptonSelectionStudy]$ root
Loading CMSSW FWLite.
Loading HistTools
root [1] .L macros/CreatePlots.C+
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/cms2_examples/CMS2Project/Analysis/LeptonSelectionStudy/./macros/CreatePlots_C.so
root [2] CreatePlots("test", "eps")
[CreatePlots] creating dyll plots
[CreatePlots] creating QCD plots
[CreatePlots] creating overlays plots
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_pt_base.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_pt_fid.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_pt_pt20.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_eta_base.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_eta_fid.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_eta_pt20.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_cutflow.eps has been created
root [3]
study and record results
Once you have the plots, record them (in ppt, Evernote, whatever...) and interpret.
fill in the rest
This is a starter-kit framework to do the study. Take this and do whatever you want with it and finish the study.
To do:
- Fill in the missing variables in the baby (do electron isolation as an example)
- Add the observables to the CreatePlots.C
- Add the observable to the study
Normalizing MC Samples
In order to get a proper prediction for the Monte Carlo samples (MC), we need to properly normalize the number of events to the number we would have expected from data given an integrated luminosity (L). Normally, we can calculate the expected number of events by using the theoretical cross-section:
N = L x σ
However, in MC, we usually don't have a sample with exactly N events. Since this is MC and we're really after the PDF(s) of N and all the other kinematical variables associated with this process (e.g. # jets, parton pT, etc.), we need as many events as practical. Some of these MC samples are quite large and can be O(TB) for AOD-level samples.
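As a quick worked example of N = L x σ (a sketch only; it uses the Drell-Yan cross section quoted later in this section, and the assumption that the DYJetsToLL sample splits evenly among ee, μμ, and ττ):
// expected number of Z/gamma* -> ee events for this exercise's luminosity
const double xsec   = 3532.81;         // pb, DYJetsToLL (ee + mumu + tautau), from the MC twiki below
const double lumi   = 0.082 * 1000.0;  // 0.082 fb^-1 expressed in pb^-1
const double n_dyll = lumi * xsec;     // ~290k Drell-Yan -> ll events expected
const double n_ee   = n_dyll / 3.0;    // ~97k expected in the ee channel (assuming equal lepton flavors)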
Since the number of events generated, N_gen, is usually much larger than N, we need a way to scale our observations from this MC sample to what we would have expected if it were real data. Consider the following:
N = N x (N_gen / N_gen)
  = N_gen x (N / N_gen)
  = N_gen x (L x σ / N_gen)
  = N_gen x L x (σ / N_gen)
  = N_gen x L x scale1fb
Basically, we rewrite the equation to factor the MC dependent variables into one number we call "scale1fb". This is the number we store in our CMS2 ntuples so that we can weight each event and give our distributions (histograms) the correct normalization.
Scale1fb TTree::Draw example
Here I show a couple of examples of using scale1fb with TTree::Draw statements. Consider the following example code fragment:
TChain chain("Events");
chain.Add("/nfs-7/userdata/rwkelley/cms2/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1.root");
// integrated luminosity
const double lumi = 0.082; //fb^-1
// get the raw number of entries for true e^+e^- events
const TCut selection = "Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1";
const double raw_count = chain.GetEntries(selection);
// information from twiki: http://www.t2.ucsd.edu/tastwiki/bin/view/CMS/Summer12MonteCarlo53X_Slim_Winter13#Drell_Yan
const double xsec = 3532.81; // pb
const double kfactor = 1.0; // kinematically dependent factor to scale to a higher order calc -- not really used as much
const double filt_eff = 1.0; // Filter efficiency for the generator level (done during MC generation at the CMSSW level)
const double nevts_file = chain.GetEntries();
// scale1fb = the factor used to weight each event to get the # of events per fb^-1
const double scale1fb = (xsec * kfactor * 1000.0f * filt_eff)/nevts_file;
// output
std::cout << "raw_count = " << raw_count << "\n";
std::cout << "scale1fb = " << scale1fb << "\n";
std::cout << "scaled count = lumi * scale1fb * raw_count = " << lumi*scale1fb*raw_count << "\n\n";
Here, I calculate the scale1fb by hand for demonstration purposes. Notice that there are other factors that went into the scale1fb:
variable | description
xsec | The production cross-section x BR for this MC sample.
kfactor | Kinematically dependent factor to scale the xsec to a higher order calculation -- not used as much now.
filt_eff | The filter efficiency at the generator level. A non-unity value indicates that not all generated events are kept and run through sim/reco.
nevts_file | The number of events in this cms2-level file.
The result of this for lumi = 0.082 fb^-1 is:
raw_count = 23332
scale1fb = 56.5367
scaled count = lumi * scale1fb * raw_count = 108167
This seems high. From our exercise earlier, we estimated the number of Z → e+e- events to be ~96k (if you include τ → eν decays). What did we do wrong?
We did not account for the fact that the MC sample was filtered when the ntuple was created from the AOD to the CMS2. Specifically, in general, N_AOD ≠ N_CMS2. To get an accurate scale1fb, you need the number of events before the CMS2 ntuple-making procedure filtered out some events. To correct for this, we need to get the CMS2 filter efficiency from the twiki. We correct the scale1fb by:
scale1fb' = scale1fb x (N_CMS2 / N_AOD)
In code:
const double nevts_aod = 30459503;
const double nevts_cms2 = 27137253;
const double cms2_filter_eff = nevts_cms2/nevts_aod;
std::cout << "properly scale count = lumi * scale1fb * raw_count * cms2_filter_eff = " << lumi*scale1fb*raw_count*cms2_filter_eff << "\n\n";
This gives:
properly scale count = lumi * scale1fb * raw_count * cms2_filter_eff = 96369.5
This is much more consistent.
In practice, when the CMS2 ntuples are generated, the scale1fb is calculated using N_AOD and is stored in the branch called evt_scale1fb. This and the following related branches are added to the ntuples in the "post-processing" step of generating the ntuples:
variable | description
evt_scale1fb | event weight to scale to the N expected for 1 fb^-1 of data
evt_nEvts | number of events generated, N_AOD
evt_kfactor | factor to scale LO to NLO+ -- a bit antiquated but historical
evt_filt_eff | filter efficiency at the generator level
evt_xsec_excl | exclusive cross-section
evt_xsec_incl | inclusive cross-section
Repeating the above exercise using the branches stored in the ntuple:
// scale1fb correction since we used a subset of the events
const double scale = nevts_cms2/nevts_file;
// get the scaled entries
const TCut scaled_selection = Form("%f*%f*evt_scale1fb*(Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1)", lumi, scale);
TH1D* const h_scale1fb_count = new TH1D("h_scale1fb_count", "h_scale1fb_count", 3, -0.5, 2.5);
chain.Draw("1>>h_scale1fb_count", scaled_selection, "goff");
std::cout << "hist entries = " << h_scale1fb_count->GetEntries() << "\n";
std::cout << "scale1fb = " << chain.GetMaximum("evt_scale1fb") << "\n";
std::cout << "applying weight = " << scaled_selection.GetTitle() << "\n";
std::cout << "scale1fb_count = " << h_scale1fb_count->Integral() << "\n";
Again -- when working on the full cms2 ntuples, you do not have to do the following; scaling with evt_scale1fb is sufficient. However, since we are using a subset of the cms2 ntuples, we still have to account for the cms2 filter efficiency and have to scale again, since evt_scale1fb was calculated using the whole sample -- not one file.
scale1fb for 1 file = (σ / N_file_if_aod)
where N_file_if_aod is the # of events in this file before it was skimmed by the CMS2 ntuple-making filter (which we didn't store anywhere).
N_file_if_aod = N_file x (N_AOD / N_CMS2)
              = N_AOD x (N_file / N_CMS2)
so
scale1fb for 1 file = (σ / N_AOD) x (N_CMS2 / N_file)
scale1fb for 1 file = evt_scale1fb x (N_CMS2 / N_file)
The result is:
hist entries = 23332
scale1fb = 0.115984
applying weight = 0.082000*434.286380*evt_scale1fb*(Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1)
scale1fb_count = 96369.6
The full example code can be found here.
Scale1fb TTree looper example
The above example is useful for quick studies on the command line. The same scaling can also be done in looper code when filling histograms:
void DrellYanLooper::Analyze(const long event)
{
// select e+e- events (NOTE: no test for Z)
bool contains_eplus = false;
bool contains_eminus = false;
bool is_ee = false;
for (size_t idx = 0; idx < tas::genps_id().size() && !is_ee; ++idx)
{
if (tas::genps_status().at(idx) != 3) {continue;}
if (tas::genps_id().at(idx) == 11 ) {contains_eminus = true;}
if (tas::genps_id().at(idx) == -11 ) {contains_eplus = true;}
is_ee = (contains_eplus && contains_eminus);
}
if (!is_ee) {return;}
// --------------------------------------------------------- //
// scale factor to scale assuming running on full MC sample
// --------------------------------------------------------- //
const double lumi = 0.082; //fb^-1
const double scale_full = lumi * tas::evt_scale1fb();
hc["h_ee_yield_full"]->Fill(1.0, scale_full);
// --------------------------------------------------------- //
// scale1fb correction since we used a subset of the events
// --------------------------------------------------------- //
// information from twiki: http://www.t2.ucsd.edu/tastwiki/bin/view/CMS/Summer12MonteCarlo53X_Slim_Winter13#Drell_Yan
const double nevts_aod = tas::evt_nEvts(); // number of events run in CMSSW job to make ntuple
const double nevts_cms2 = 27137253; // number of events after the CMS2 level filter
const double nevts_file = m_num_events; // number of events in the current job
const double nevts_scale = nevts_cms2/nevts_file; // scale up the weight to account for using subset of full dataset
const float scale_subset = scale_full * nevts_scale;
hc["h_ee_yield_sub"]->Fill(1.0, scale_subset);
}
The result is:
[rwkelley@uaf-7 DrellYan]$ dy_scale1fb_example
Loading CMSSW FWLite.
running drell-yan looper...
raw count = 23332
full count = 221.903
subset count = 96369.6
Here the "full count" is the yield assuming that the input was the full dataset. Since we ran on a subset of the full dataset, we had to scale it.
A full analysis example -- measuring the Z cross section
Goal: measure σ(pp → Z) x BR(Z → ee/μμ) for a subset of the 8 TeV data. We will use MC for all backgrounds and ignore details such as scale factors.
References
- Z → ee/μμ analysis note for 8 TeV Data AN2012_067_v6
- read sections 1-5, 8, 10
- The actual data yields are obscurely in the appendix: tables 44 and 45
- Some old but still useful references:
- Particle Flow Algorithms:
- Electron Reconstruction and Identification:
- Electron Isolation:
- Muon Reconstruction and Identification:
- Muon Isolation:
Datasets for Exercise
- We will use the 8 TeV dataset run2012A 06Aug2012 (82 pb^-1). This is small enough to finish interactively:
Single electron/muon data
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Data2012/CMSSW_5_3_2_patch4_V05-03-24/SingleMu_Run2012A-recover-06Aug2012-v1_AOD/merged/*.root");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Data2012/CMSSW_5_3_2_patch4_V05-03-24/SingleElectron_Run2012A-recover-06Aug2012-v1_AOD/merged/*.root");
Drell-Yan
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
W + Jets → l + ν
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WJetsToLNu_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-28/merged_ntuple_1[0-9].root");
tt(bar) → 2l + 2ν
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/TTJets_FullLeptMGDecays_8TeV-madgraph_Summer12_DR53X-PU_S10_START53_V7A-v2/V05-03-24/merged_ntuple_1[0-9].root");
tt(bar) → l + ν + jj
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/TTJets_SemiLeptMGDecays_8TeV-madgraph_Summer12_DR53X-PU_S10_START53_V7A_ext-v1/V05-03-24/merged_ntuple_1[0-9].root");
tt(bar) → hadronic
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/TTJets_HadronicMGDecays_8TeV-madgraph_Summer12_DR53X-PU_S10_START53_V7A_ext-v1/V05-03-24/merged_ntuple_1[0-9].root");
QCD muon enriched (use for μμ analysis). For now, let's ignore QCD for ee analysis since it requires more statistics than is practical without using the grid.
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/QCD_Pt_20_MuEnrichedPt_15_TuneZ2star_8TeV_pythia6_Summer12_DR53X-PU_S10_START53_V7A-v3/V05-03-18_slim/merged_ntuple_1[0-9].root");
WW → 2l + 2ν
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WWJetsTo2L2Nu_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
WZ → 2l + 2q
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WZJetsTo2L2Q_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
WZ → 3l + ν
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WZJetsTo3LNu_TuneZ2_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
ZZ → 2l + 2ν
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/ZZJetsTo2L2Nu_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v3/V05-03-23/merged_ntuple_1[0-9].root");
ZZ → 2l + 2q
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/ZZJetsTo2L2Q_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
ZZ → 4l
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/ZZJetsTo4L_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
Deliverables
- Due Friday 3/14: Set of slides to summarize your understanding of what we went over. Make sure you cover the following:
- How is a cross section measured --> what are the pieces?
- What is the signal?
- What is a "prompt" lepton?
- What are the major backgrounds for this analysis? Why are they backgrounds?
- Due Friday 3/14: A spreadsheet of the expected number of events for the signal and the background
- Due: Friday 4/25: m_{ll} plots and table of yields:
- A total of 4 plots
- gen level ee yields and m_{ee} plot
- count events with 2 true opposite-signed and same-flavor leptons (OSSF) normalized properly (luminosity*cross section)
- only select one hypothesis per event in processes that could have more than one OSSF pair
- numbers should be consistent with your spreadsheet numbers
- m_{mm} same as above
- reco m_{ee} plot with opposite-signed and same-flavor leptons (OSSF)
- no gen matching requirement
- Select all OSSF lepton pairs (no hypothesis choice yet)
- data overlaid with a stacked plot of all the backgrounds
- Background should be normalized to the lumi*cross-section
- reco m_{mm} plot same as above
- Due Friday 4/25: A Draft of your talk for the following deliverable.
- Due Tuesday 4/29 or Friday 5/2: Write up a 20 minute presentation on the e/μ ID or Isolation.
- ~3/4 of talk should be on: What is it and why do we need it? What are the variables involved?
- ~1/4 of talk should be on: How do I implement it using CMS2 ntuples.
- schedule:
- Dan: electron identification (Tuesday)
- Mark: muon identification (Tuesday)
- Elizabeth: electron Isolation (Friday)
- Liam: muon isolation (Friday)
- Due Friday 5/16: Implement a cut flow, kinematic and N-1 plots.
- Due Friday 5/27: Implement a cut flow, kinematic and N-1 plots (2nd iteration).
- Due Friday 6/3: Implement all selections
- Due Friday 6/10: Slides of the Results at the level of a "full status report"
Current Results
Place your latest version of the running slides here (PDF please). Update them as often as you wish, but have them updated with the above deliverables by the deadline.
Results slides
Place your talk slides here (PDF please). Update them as often as you wish, but have them updated before the meeting at which you wish to present.
ID/Iso talks:
The following questions have been repeatedly asked, so I wanted to write the answers down in a central place.
Do I chain the data into one chain or keep them separate?
You can of course chain them all up into one huge chain and use some logic to figure out which dataset is which; however, I would advise against it. The disadvantages of this technique are two-fold:
- The job will take much longer to run.
- The analysis code has to be more complicated to handle the different cases. More potential for bugs...
I would design the code to run on each dataset separately and produce a set of plots and/or a baby ntuple. This way each dataset can be a separate job, and if you need to regenerate a single dataset (e.g. add more files or change the MC you use), you do not have to run all the jobs again.
What is the most efficient work flow?
This is of course user preference and there is no canonical answer. I will describe what I like to do:
- Write a looper that will produce either a set of plots or a baby
- Write a main program that will call this looper and take as arguments dataset dependent metadata:
- dataset to run on
- input files
- output files
- number of events
- other dataset dependent issues...
- Write a script (bash, python, or whatever) to call the jobs in parallel on the uaf. I personally like to use python for this, but use whatever you are most comfortable with. The following is a pseudo-script to illustrate my point:
#!/bin/bash
nevts=-1
root -q -b -l "macros/run_sample.C ($nevts, \"data\" , \"output/data_baby.root\")" >& logs/data_example.log &
root -q -b -l "macros/run_sample.C ($nevts, \"dyll\" , \"output/dyll_baby.root\")" >& logs/dyll_example.log &
root -q -b -l "macros/run_sample.C ($nevts, \"wjets\", \"output/wjets_baby.root\")" >& logs/wjets_example.log &
# ...
- Write a compiled ROOT macro (i.e. .L macro.C++) that will make the yield tables from the output ROOT files above. I have this script provide an option to produce LaTeX code that I can cut and paste into LaTeX/LaTeXiT to make nice tables (a sketch of such a macro is given after this list).
- Write a compiled ROOT macro (i.e. .L macro.C++) that will make the overlay plots from the output ROOT files above. I have this script provide an option to print different plot types: png, eps, pdf.
- I find that eps make the nicest looking plots in ppt
- Most people use pdf for papers
- I like png for quick plots because I want to look at them without the delay it takes OSX's finder to convert and display them.
- Take the final plots and table and drop them into your ppt/paper.
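A minimal sketch of what the yield-table macro could look like (the file names, histogram name, and LaTeX layout are placeholders, not what the example code actually uses):
// yield_table.C -- sketch only; file and histogram names are placeholders
#include "TFile.h"
#include "TH1.h"
#include "TString.h"
#include <cstdio>

void yield_table(const bool print_latex = false)
{
    const char* const samples[] = {"data", "dyll", "wjets"};
    const int nsamples = sizeof(samples)/sizeof(samples[0]);

    for (int i = 0; i < nsamples; ++i)
    {
        TFile* const file = TFile::Open(Form("output/%s_baby.root", samples[i]));
        if (!file || file->IsZombie()) {printf("missing file for %s\n", samples[i]); continue;}

        // placeholder histogram name -- use whatever your looper/baby actually fills
        TH1* const hist = dynamic_cast<TH1*>(file->Get("h_mll"));
        if (!hist) {printf("missing histogram for %s\n", samples[i]); continue;}

        double error = 0.0;
        const double yield = hist->IntegralAndError(0, -1, error);
        if (print_latex) {printf("%s & %1.1f $\\pm$ %1.1f \\\\\n", samples[i], yield, error);}
        else             {printf("%-10s %1.1f +/- %1.1f\n"       , samples[i], yield, error);}

        file->Close();
    }
}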
How do I present the answer?
I would like you to append the results to your DY slides. The slides are meant to be a running documentation of your progress. Please add some meaningful observations and do not just dump numbers and plots in a table. Also, you should compare them to the numbers you expect from both the AN and your Excel exercise.
Where is the definition of the CMS2 branches?
This is a tough question. In an ideal world, we'd have a department to document our code and data formats; however, this is physics code, so that is never going to happen. You will basically need to learn to read the CMS2 ntuple maker code to interpret what each branch means:
https://github.com/cmstas/NtupleMaker/tree/master/src
I suggest (if you haven't already) checking out this code so that you can grep it for the branches you need to look at.
What branches should I use?
You can see what branches are available by:
Events->GetListOfAliases()->ls()
However, the following blocks will be important for this analysis:
- genps_* is the generator level information
- evt_* is the event level information such as scale1fb, run, event, etc.
- mus_* is the reco muon information
- els_* is the reco electron information
- hyp_* is the dilepton hypothesis block. These are convenience branches produced for every reco ee, eµ, or µµ pair in the event.
Should I make a baby ntuple?
I would suggest making a baby to facilitate this analysis. A baby is usually a simple ntuple with the "final" answers and high-level variables for making plots of the relevant quantities. It is usually not meant for detailed studies -- for those, you usually drop back down to the CMS2 level or even AOD.
Should I use TTree::Draw or should I use the looper code?
I would like you to at least do this with looper code. Looper code is more complicated initially but is more efficient and more flexible. This is why we went through it in so much detail. With that said, it is not a bad idea to also do this with TTree::Draw as a cross-check and to gain that facility. For quick studies, it can be much quicker to make a few plots with TTree::Draw.
Should I include uncertainties?
Yes, any measured number should at least have an associated statistical uncertainty. Systematic uncertainties are beyond the scope of this exercise and we shall ignore them.
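For histograms filled with event weights (e.g. lumi*scale1fb), one simple way to get the statistical uncertainty on a yield is sketched below. This is a generic ROOT example, not taken from the tutorial code:
// sketch: statistical uncertainty on a weighted yield
#include "TH1D.h"
#include <iostream>

void weighted_yield_sketch()
{
    TH1D h_yield("h_yield", "yield", 1, 0, 1);
    h_yield.Sumw2();  // track the sum of squared weights so bin errors are correct for weighted fills

    // ... in the event loop: h_yield.Fill(0.5, lumi*scale1fb); for example ...
    h_yield.Fill(0.5, 2.0);
    h_yield.Fill(0.5, 2.0);

    double error = 0.0;
    const double yield = h_yield.IntegralAndError(0, -1, error);  // include under/overflow, like Integral(0,-1)
    std::cout << "yield = " << yield << " +/- " << error << std::endl;
}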
How do I check my gen yield?
It is often a good idea (if you can) to check your results in more than one way. You can do this on the ROOT command line by doing the following:
{
// Drell-Yan MC sample
TChain chain("Events");
chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
const double nevts_cms2 = 27137253;
const double nevts_file = chain.GetEntries();
const double lumi = 0.082;
// selections
TCut raw_ee = "Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1";
TCut scaled_ee = Form("%f*evt_scale1fb*(Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1)", lumi*nevts_cms2/nevts_file);
cout << "raw_ee = " << raw_ee.GetTitle() << endl;
cout << "scaled_ee = " << scaled_ee.GetTitle() << endl;
// raw yields
const double y_raw_ee = chain.GetEntries(raw_ee);
cout << "raw ee yield = " << y_raw_ee << endl;
// scaled yields
TH1D* h_gen_yield = new TH1D("h_gen_yield", "h_gen_yield;m_{ee}(GeV);# events", 3, 0, 3);
chain.Draw("1>>h_gen_yield", scaled_ee, "goff");
h_gen_yield->Draw();
cout << "scaled ee yield = " << h_gen_yield->Integral(0, -1) << endl;
// another way to get scaled yields
const double evt_scale1fb = chain.GetMaximum("evt_scale1fb");
const double event_scale = lumi * evt_scale1fb * nevts_cms2/nevts_file;
cout << "scaled ee yield (v2) = " << y_raw_ee * event_scale << endl;
}
--
RyanKelley - 2014/05/20