Analysis Tutorial

Overview

The main goal of this twiki is to document a series of training sessions to teach the basics of doing a particle physics analysis from a practical perspective. The topics that will be covered, loosely, are:

  • ROOT -- the C++ based data analysis package that is used in HEP
  • CMSSW -- the software framework used by the collaboration
  • CMS2 -- the software sub-framework used by the UCSD/UCSB/FNAL group (a.k.a. SNT)
  • A full analysis example -- measuring the Z cross section

These topics are not necessarily ordered in any particular way and are only loosely related.

Contents:

Order of topics (Subject to change)

ROOT

git clone https://github.com/kelleyrw/AnalysisTutorial

Lesson 1

  • Studied the basics of TTree and made efficiency plots for some tracking variables
  • Reading: ROOT user's guide: read ch 1-3,7,12
  • Example code: Lesson 1
TTree example
To facilitate a non-trivial example of making plots, a very simple TTree was constructed using CMSSW that contains the composite generated/simulated particles known as tracking particles. You can think of these tracking particles as the combined generator and simulated truth information for all the debris of a p-p collision (e.g. Pythia6 status 1). These tracking particles are associated with reconstructed tracks by looking at the simulated energy deposits in the tracker (sim hits) and matching them to the reconstructed hits from the track reconstruction algorithms (rec hits). We will go into how this TTree was made in a later lesson.

The tree is filled once per event and contains an unfiltered list (in the form of a std::vector) of TrackingParticles:

Events
  |
  --> list of TrackingParticles
             |
             --> Tracking particle information (p4, # sim hits, d0, dz, ...) 
             --> Matching reconstructed Track info (bogus values filled if no matching track).

The tree is small (1000 events) and I was able to check it into the repository (https://github.com/kelleyrw/AnalysisTutorial/blob/master/week1/trees/tracking_ntuple.root). All the branches should be the same size:

// TrackingParticle info
std::vector<LorentzVector> tps_p4:  four momentum
std::vector<int> tps_pdgid:         pdg particle ID code: http://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf
std::vector<double> tps_d0:         transverse impact parameter
std::vector<double> tps_dz:         longitudinal impact parameter
std::vector<bool> tps_matched:      matched to track?  true/false 
std::vector<int> tps_charge:        charge
std::vector<int> tps_nhits:         # of simulated hits

// reco track info
std::vector<LorentzVector> trks_p4:  four momentum
std::vector<double> trks_tip:        transverse impact parameter  (from the TrackingParticle vertex)
std::vector<double> trks_lip:        longitudinal impact parameter  (from the TrackingParticle vertex)
std::vector<double> trks_d0:         transverse impact parameter (using the trajectory builder)
std::vector<double> trks_dz:         longitudinal impact parameter (using the trajectory builder)
std::vector<double> trks_pterr:      pt uncertainty
std::vector<double> trks_d0err:      d0 uncertainty
std::vector<double> trks_dzerr:      dz uncertainty
std::vector<double> trks_chi2:       chi^2 of the track's fit
std::vector<int> trks_ndof:          # degrees of freedom
std::vector<int> trks_nlayers:       # of valid layers with a measurement
std::vector<bool> trks_high_purity:  track passes the high purity quality requirement

Playing with TTree::Draw

TTree::Draw gives you basic plotting from the ROOT prompt. This is convenient for quick studies and on-the-fly plot making. In addition to chapter 12 on TTrees in the ROOT User's Guide, you should also read the documentation on the class webpage: http://root.cern.ch/root/html532/TTree.html#TTree:Draw%2

Open the ROOT file and try playing with the following examples:

[rwk7t@tensor lesson1]$  root ../data/tracking_ntuple.root 
root [0] 
Attaching file ../data/tracking_ntuple.root as _file0...
root [1] .ls

List the branches in the tree:

root [2] tree->Print()

List the branches in the tree with a less verbose printout:

root [3] tree->GetListOfBranches()->ls()

Draw the tracking particles' pT:

root [4] tree->Draw("tps_p4.pt()") 
tps_p4_pt_ex1.png

On the previous plot, the automatic binning choice was suboptimal since it tries to include everything in the histogram range. To specify the binning explicitly:

root [5] tree->Draw("tps_p4.pt()>>(100,0,10)")
tps_p4_pt_ex2.png

In order to keep a handle to the histogram for later manipulation, you can name the output histogram. Then you can perform subsequent operations on it:

root [7] tree->Draw("tps_p4.pt()>>h1(100,0,10)")
root [8] h1->SetLineColor(kRed)
root [9] h1->SetTitle("tracking particle p_{T};p_{T} (GeV);A.U.")
root [10] h1->SetLineWidth(2)
root [11] h1->Draw()
tps_p4_pt_ex3.png

To make a selection, use the 2nd field. This is also an example of how to overlay two plots.

root [12] tree->Draw("tps_p4.pt()>>h_pt_barrel(100,0,10)", "fabs(tps_p4.eta())<1.1");
root [13] h_pt_barrel->Draw();
root [14] h_pt_barrel->SetLineColor(kBlue);
root [15] h_pt_barrel->SetLineWidth(2);
root [16] h_pt_barrel->SetTitle("Tracking Particle p_{T} (Barrel);p_{T} (GeV)");
root [17] h_pt_barrel->Draw();
root [18] tree->Draw("tps_p4.pt()>>h_pt_endcap(100,0,10)", "fabs(tps_p4.eta())>1.1");
root [19] h_pt_endcap->SetLineColor(kRed);
root [20] h_pt_endcap->SetLineWidth(2);
root [21] h_pt_endcap->SetTitle("Tracking Particle p_{T} (Endcap);p_{T} (GeV)");
root [22] h_pt_endcap->Draw();
root [23] h_pt_barrel->Draw("sames");
root [24] TLegend leg(0.3, 0.8, 0.6, 0.5);
root [25] leg.AddEntry(h_pt_endcap, "Endcap");
root [26] leg.AddEntry(h_pt_barrel, "Barrel");
root [27] leg.SetFillStyle(0);
root [28] leg.Draw();
tps_p4_pt_overlay.png

Now at this point, you may be sick of typing in commands every time. Time to use a macro (see chapter 7). Consider the following macro (https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/overlay.C), which does the same as the previous example:

{
    tree->Draw("tps_p4.pt()>>h_pt_barrel(100,0,10)", "fabs(tps_p4.eta())<1.1");
    h_pt_barrel->Draw();
    h_pt_barrel->SetLineColor(kBlue);
    h_pt_barrel->SetLineWidth(2);
    h_pt_barrel->SetTitle("Tracking Particle p_{T} (Barrel);p_{T} (GeV)");
    h_pt_barrel->Draw();

    tree->Draw("tps_p4.pt()>>h_pt_endcap(100,0,10)", "fabs(tps_p4.eta())>1.1");
    h_pt_endcap->SetLineColor(kRed);
    h_pt_endcap->SetLineWidth(2);
    h_pt_endcap->SetTitle("Tracking Particle p_{T} (Endcap);p_{T} (GeV)");

    h_pt_endcap->Draw();
    h_pt_barrel->Draw("sames");
    TLegend leg(0.3, 0.8, 0.6, 0.5);
    leg.AddEntry(h_pt_endcap, "Endcap");
    leg.AddEntry(h_pt_barrel, "Barrel");
    leg.SetFillStyle(0);
    leg.Draw();
}

To run, open the ROOT file and then execute the macro:

[rwk7t@tensor lesson1]$  root ../data/tracking_ntuple.root 
root [0] 
Attaching file ../data/tracking_ntuple.root as _file0...
root [1] .x macros/overlay.C 
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1

Efficiency Plots

Our next example is to use this simple tree to make an efficiency plot vs eta. We define the efficiency as a ratio:

ε = numerator count / denominator count

where the denominator counts tracking particles that have a reasonable chance of actually being reconstructed:

  • non zero charge
  • pT > 0.9 GeV
  • |η| < 2.5
  • |transverse impact parameter| < 3.5 cm
  • |longitudinal impact parameter| < 30 cm
  • # of simulated hits >= 3

The numerator selection is the same as the denominator selection except that we require the tracking particle to be matched to a reconstructed track.
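
Before looking at the full macro, note that this selection can be written directly as TTree::Draw cuts. Here is a minimal sketch using the branch names listed above (the binning is arbitrary and the simple division ignores binomial errors):

// numerator/denominator vs eta straight from the ROOT prompt (sketch only)
const char* den_cuts = "tps_charge!=0 && tps_p4.pt()>0.9 && fabs(tps_p4.eta())<2.5 && fabs(tps_d0)<3.5 && fabs(tps_dz)<30 && tps_nhits>=3";
tree->Draw("tps_p4.eta()>>h_den(50,-2.5,2.5)", den_cuts);
tree->Draw("tps_p4.eta()>>h_num(50,-2.5,2.5)", Form("(%s) && tps_matched", den_cuts));
h_num->Divide(h_den);  // h_num now holds the efficiency vs eta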

The following macro produces this plot: https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/eff.C

[rwk7t@tensor lesson1]$  root ../data/tracking_ntuple.root 
root [0] 
Attaching file ../data/tracking_ntuple.root as _file0...
root [1] .x macros/eff.C 
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
eff_vs_eta.png

As an exercise, we're going to make this same plot two more times. In the next example, we're going to compile the macro. In general it is a good idea to compile your macros because the interpreter (CINT) is not robust and can incorrectly interpret even simple code. Also, compiling will greatly reduce the execution time if the macro is doing anything significant. See chapter 7 of the User's Guide for more details. The following macro produces the same plot as above but is meant to be run compiled: https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/eff_compiled.C

[rwk7t@tensor lesson1]$  root
root [0] .L macros/eff_compiled.C++
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson1/./macros/eff_compiled_C.so
root [1] eff_compiled("../data/tracking_ntuple.root", "plots/eff_vs_eta.root", "png")
[eff_compiled] tree is opened with 1000 entries
[eff_compiled] printing plots to plots directory.
Info in <TCanvas::Print>: png file plots/h_num_vs_eta_compiled.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta_compiled.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta_compiled.png has been created
[eff_compiled] writing plots to plots/eff_vs_eta.root
h_num_vs_eta_compiled.png h_den_vs_eta_compiled.png h_eff_vs_eta_compiled.png

The final example from this lesson is a very simple "looper". A looper is simply a macro that manually loops over the entries in a TTree rather than relying on TTree::Draw(). This has the advantage of speed since there will be only one loop over the tree instead of one for each plot. It also has the flexibility to implement arbitrary logic, whereas TTree::Draw, while flexible, is still limited in what you can calculate and plot.

The following macro is a simple looper: https://github.com/kelleyrw/AnalysisTutorial/tree/master/lesson1/macros/eff_looper.C. It breaks the process into two steps: creating the numerator and denominator plots (CreatePlots) and doing the division to make the efficiency (FinalPlots). The reason is that making the numerator/denominator is the "slow" part -- you don't want to remake these plots if all you need is to change a label or something on the efficiency plot. This is a simple example of breaking up the workflow to keep the analysis running efficiently. You can, of course, make a function that simply calls both if you don't always want to do it in two steps (exercise left to the reader).
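
Stripped of the bookkeeping, the core of any such looper is just the branch hookup plus a loop over the entries. Below is a minimal sketch (this is an illustration, not the actual eff_looper.C, and it reads only two of the branches listed earlier):

#include "TFile.h"
#include "TTree.h"
#include "TH1F.h"
#include <vector>

void simple_looper()
{
    TFile file("../data/tracking_ntuple.root");
    TTree* tree = (TTree*)file.Get("tree");

    // branch hookup -- the boilerplate that a wrapper class can hide (see lesson 2)
    std::vector<double>* tps_d0      = 0;
    std::vector<bool>*   tps_matched = 0;
    tree->SetBranchAddress("tps_d0",      &tps_d0);
    tree->SetBranchAddress("tps_matched", &tps_matched);

    TH1F* h_d0_matched = new TH1F("h_d0_matched", "d0 of matched tracking particles;d0 (cm)", 100, -3.5, 3.5);
    h_d0_matched->SetDirectory(0);  // keep the histogram alive after the file is closed

    // the event loop
    for (long long entry = 0; entry < tree->GetEntries(); ++entry)
    {
        tree->GetEntry(entry);
        for (size_t i = 0; i < tps_d0->size(); ++i)
        {
            if (tps_matched->at(i)) {h_d0_matched->Fill(tps_d0->at(i));}
        }
    }
    h_d0_matched->Draw();
}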

root [2] .L macros/eff_looper.C++
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson1/./macros/eff_looper_C.so
root [3] CreatePlots("../data/tracking_ntuple.root", "plots/counts_vs_eta_looper.root")
root [4] FinalPlots("plots/counts_vs_eta_looper.root", "plots/eff_vs_eta_looper.root", "png")
Info in <TCanvas::Print>: png file plots/h_num_vs_eta_looper.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta_looper.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta_looper.png has been created
h_num_vs_eta_looper.png h_den_vs_eta_looper.png h_eff_vs_eta_looper.png

Homework

Lesson 2

This lesson is an example skeleton for organizing an analysis. It is a short lesson, and its main point is to show how to organize an analysis in a more structured way and to introduce some of the abstractions that will lead us to the conventions of CMSSW.
  • Lesson 2 example code
  • This example uses the same simple tracking tree from the previous lesson and also produces the same plots, to show what I consider a less tedious way. There will always be some overhead to making plots and doing an analysis; however, with some smart choices it can be less tedious.

Code Layout
  • In the previous example looper (eff_looper.C), everything was contained in the same file. Here the code will be re-factored and extended to promote some code re-use. Below is a list of all the files and directories:

|____compile.C                              File to compile the code using ACLiC
|____compile.sh                             File to compile the code using straight up gcc
|____include                              Directory to include header files for reusable code
| |____HistTools.h                        Header file for example ROOT histogram tools
| |____TRKEFF.h                           Header file for the TTree branch wrapper class
|____lib                                  Directory to contain the libraries
|____macros                               Directory to contain ACLiC compiled simple macros (usually throw away style code)
| |____FormattedPlots.C                   Simple macro to print the formated plots
| |____run_all.C                          Simple macro to run the analysis
|____plots                                Directory to contain the output of the plots
|____source                               Directory to contain the main source code for the analysis
| |____HistTools.cc                       Source file for the example ROOT histogram tools
| |____TrackingEfficiencyAnalysis.cc      Source file for the main analysis looper
| |____TRKEFF.cc                          Source file for the TTree branch wrapper class

  • Note: I'm using the CMSSW convention that *.C files are intended to be ROOT compiled macros and *.cc files are supposed to be fully C++ compliant code.
The main analysis is done in TrackingEfficiencyAnalysis.cc. This is a C++ class that holds all of the metadata and runs the analysis. The main reason to make this a class is to keep all of the relevant variables and data together. If this were a set of functions, you would have to pass a bunch of parameters back and forth -- a class is better suited for this purpose. Also, in the class definition below,
class TrackingEfficiencyAnalysis
{
    public:
        // construct: 
        TrackingEfficiencyAnalysis
        (
             const std::string& output_file_name = "plots/counts_vs_eta_looper.root",
             const std::string& suffix = "png",
             const bool verbose = true
        );

        // destroy:
        ~TrackingEfficiencyAnalysis();

        // method to run on the data:
        void ScanChain(TChain& chain, long long num_events = std::numeric_limits<long long>::max());

    private:
        // analysis methods:
        void BeginJob();
        void EndJob();
        void Analyze();

        // members:
        const std::string m_output_file_name;
        const std::string m_suffix;
        const bool m_verbose;
        TH1Map m_hist_map;
};

the analysis methods were deliberately chosen. You will see this methodology again when we deal with CMSSW. A rough sketch of how these methods fit together follows the list below.

  • The TrackingEfficiencyAnalysis::BeginJob function is to be called before the analysis loop begins. It can be used to setup the initial conditions for the analysis (open files, book histograms, set flags, etc.).
  • The TrackingEfficiencyAnalysis::Analyze function is the main logic to be performed on an individual event.
  • The TrackingEfficiencyAnalysis::EndJob function is to be called after the analysis loop ends. It can be used to save the results of the analysis (close files, save histograms, print tables, etc.).
  • The TrackingEfficiencyAnalysis::ScanChain function is the main looper. Before it loops, it calls the BeginJob() function. It then loops over the whole TChain and calls the Analyze() method for each entry in the tree. Finally, after the loop it calls the EndJob() method.
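
Putting these together, the control flow of ScanChain boils down to something like the following sketch (simplified; the real implementation in the repository also handles the timing printout, verbosity, etc.):

// simplified sketch of the looper control flow (not the full implementation)
void TrackingEfficiencyAnalysis::ScanChain(TChain& chain, long long num_events)
{
    BeginJob();  // book the histograms, set up the output

    long long num_entries = chain.GetEntries();
    if (num_events < num_entries) {num_entries = num_events;}

    for (long long entry = 0; entry < num_entries; ++entry)
    {
        chain.GetEntry(entry);  // load this event's branches
        Analyze();              // fill the histograms for this event
    }

    EndJob();  // print and save the histograms
}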

HistTools -- a starter kit for histogram tools

In order to have some re-usable ROOT tools, I've provided a "starter" set for people. I have a much more mature version here, but it may not be a bad idea to start writing some of these yourself to familiarize yourself with ROOT. When you get comfortable, you should feel free to rip off code from my area or anyone else's, since the main goal is analysis, not tools. HistTools provides a very basic set of histogram tools.

TRKEFF class

This is a piece of code that was automatically generated to provide wrapper functions for the branches of this TTree. It only works on the tree in this example, and if the tree itself changes, it will need to be regenerated. The main point is twofold:

  1. It provides a simple interface to access the branches of the tree in a looper without having to call the boilerplate functions every time (see the previous lesson's eff_looper.C).
  2. It provides "lazy evaluation" of the branches. By default, none of the branches are loaded. In your analysis logic, it loads a branch the first time you access it and then leaves it loaded until you access the next entry (event) in the tree. A rough sketch of the pattern is shown below.
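
The idea behind the generated wrapper is roughly the following pattern (an illustrative sketch with a single branch; the actual generated code differs in the details):

#include "TTree.h"
#include "TBranch.h"
#include <vector>

// sketch of the lazy-loading pattern: the branch is read only on first access per entry
class BranchWrapper
{
    public:
        BranchWrapper() : m_tree(0), m_entry(0), m_tps_d0(0), m_tps_d0_loaded(false) {}

        void Init(TTree& tree)
        {
            m_tree = &tree;
            m_tree->SetBranchAddress("tps_d0", &m_tps_d0);
        }

        // call once per event: remember the entry and mark the branches as "not loaded yet"
        void GetEntry(long long entry)
        {
            m_entry = entry;
            m_tps_d0_loaded = false;
        }

        // accessor: load the branch from the file only if it hasn't been loaded for this entry
        const std::vector<double>& tps_d0()
        {
            if (!m_tps_d0_loaded)
            {
                m_tree->GetBranch("tps_d0")->GetEntry(m_entry);
                m_tps_d0_loaded = true;
            }
            return *m_tps_d0;
        }

    private:
        TTree* m_tree;
        long long m_entry;
        std::vector<double>* m_tps_d0;
        bool m_tps_d0_loaded;
};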

We will discuss this more when we talk about CMS2 in later lessons.

Building and Running the code
I've provided two ways to build the code. The first uses ROOT's wrapper around gcc called ACLiC (the .L macro.C++ thing). The second uses GCC directly to show you what is under the hood. There is no reason you have to stick with these two methods, and they each have their pluses and minuses; however, this is all that is really needed to get started on an analysis.

ACLiC
I've provided a simple macro to compile the code called compile.C. To compile this analysis code, you simply run the macro in ROOT:
root [0] .x compile.C+
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson2/./compile_C.so
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/Development/newbies/lesson2/lib/libHistTools.so
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/Development/newbies/lesson2/lib/libTRKEFF.so
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/Development/newbies/lesson2/lib/libTrackingEfficiencyAnalysis.so
(bool)1

When you're ready to run the code, there is a simple wrapper called run_all.C that compiles the code, creates a TrackingEfficiencyAnalysis object, and runs it:

root [0] .x macros/run_all.C 
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson2/./compile_C.so
[TrackingEfficiencyAnalysis::ScanChain] finished processing 1000 events
------------------------------
CPU  Time: 1.2
Real Time: 1.2

[TrackingEfficiencyAnalysis::EndJob] printing histograms:
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta_sigma.png has been created

GCC directly
To see what is really going on, I've provided a simple script that contains a single line to compile this code as a standalone program (rather than running it in ROOT/CINT). The main reason is to demonstrate that it is possible to compile ROOT objects and classes without the interpreter (CINT) at all. You will find the interpreter has limitations, and sometimes you may want to go around it. One could easily extend this using GNU Make or even a fully fleshed out build system or IDE (Eclipse, Boost Build, CMake, CMSSW's scram, etc.).

To compile from the command prompt:

[rwkelley@uaf-7 lesson2]$  ./compile.sh 
g++ -O2 source/TrackingEfficiencyAnalysis.cc source/TRKEFF.cc source/HistTools.cc -o tracking_eff_analysis -pthread -m64 -I/code/osgcode/imacneill/root/05.34.07/include -L/code/osgcode/imacneill/root/05.34.07/lib -lCore -lCint -lRIO -lNet -lHist -lGraf -lGraf3d -lGpad -lTree -lRint -lPostscript -lMatrix -lPhysics -lMathCore -lThread -pthread -lm -ldl -rdynamic -lGenVector -Iinclude

This produces an executable program called "tracking_eff_analysis".

[rwk7t@tensor lesson2]$  ./tracking_eff_analysis 
[TrackingEfficiencyAnalysis::ScanChain] finished processing 1000 events
------------------------------
CPU  Time: 1.2
Real Time: 1.2

[TrackingEfficiencyAnalysis::EndJob] printing histograms:
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_cotthetares_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_d0res_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_den_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_dzres_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_eff_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_num_vs_pt.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_phires_vs_eta_sigma.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta.png has been created
Info in <TCanvas::Print>: png file plots/h_ptres_vs_eta_sigma.png has been created

Output plots

The output plots will be put in the plots directory. I've provided a macro called FormattedPlots.C to produce a nicer version of the residual plots. First you must load the HistTools library so it can be used in the macro:

root [0] .L lib/libHistTools.so 
root [1] .x macros/FormattedPlots.C+
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/lesson2/./macros/FormattedPlots_C.so
Info in <TCanvas::Print>: pdf file plots/sigma_res.pdf has been created

This produces the formatted plots: sigma_res.png

Homework
  • check out, understand, and run the code from lesson2
  • modify the code to also produce the resolution vs pT. The only change to the selection: since this is an efficiency vs pT, reduce the pT threshold to pT > 0.1 GeV.
  • modify the macros to produce the nicely formatted plots using log scale.

CMSSW

Overview of CMSSW

Selected topics from the Workbook

Chapter 2
Chapter 3
  • Ignore PAT
  • work flow of an analysis: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookAnalysisOverviewIntroduction
  • which release: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookWhichRelease
    • CMSSW setup: put this in your .bash_profile
    • source /code/osgcode/cmssoft/cms/cmsset_default.sh
      export SCRAM_ARCH=slc5_amd64_gcc462
      
    • for this exercise, use CMSSW_5_3_2_patch4 (this is what we currently use in cms2)
    • setup a release by doing the following:
    • cmsrel CMSSW_5_3_2_patch4
      pushd CMSSW_5_3_2_patch4/src
      cmsenv
      popd
      
    • It is often convenient to create a CMSSW project with a special name, so that its contents are more easily recognized by you. For example, one could
    • scramv1 p -n CMSSW_5_3_2_patch4_Tutorial CMSSW CMSSW_5_3_2_patch4
      pushd CMSSW_5_3_2_patch4_Tutorial/src
      cmsenv
      popd
      
  • Get a ROOT file. You can find files in DBS (or DAS) and use xrootd to either open them or copy them:
    • Find a file that you want to download
    • dbsql find file where dataset=/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/Summer12_DR53X-PU_S10_START53_V7A-v1/AODSIM
      
    • Open it in xrootd
    • root root://xrootd.unl.edu//store/mc/Summer12_DR53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/AODSIM/PU_S10_START53_V7A-v1/0000/00037C53-AAD1-E111-B1BE-003048D45F38.root
      
    • Or copy it locally
    • xrdcp root://xrootd.unl.edu//store/mc/Summer12_DR53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball/AODSIM/PU_S10_START53_V7A-v1/0000/00037C53-AAD1-E111-B1BE-003048D45F38.root dy.root
      
    • If you don't have a grid certificate yet, then use the file I copied locally
    • root /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root
      
  • Open a root tree and play with TTree::Draw(): https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookFWLite
    • Example EDM ROOT file: /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root
    • [rwkelley@uaf-7 lesson3]$  root /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root
      root [0] 
      Attaching file /nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root as _file0...
      Warning in <TClass::TClass>: no dictionary for class edm::EventAuxiliary is available
      Warning in <TClass::TClass>: no dictionary for class edm::Hash<2> is available
      Warning in <TClass::TClass>: no dictionary for class edm::EventID is available
      Warning in <TClass::TClass>: no dictionary for class edm::Timestamp is available
      Warning in <TClass::TClass>: no dictionary for class edm::LuminosityBlockAuxiliary is available
      Warning in <TClass::TClass>: no dictionary for class edm::LuminosityBlockID is available
      ...
      Warning in <TClass::TClass>: no dictionary for class pair<unsigned int,string> is available
      

What were all these warnings? The cause is that CINT/ROOT is independent of CMSSW and thus does not natively understand any of the CMSSW-specific classes (reco::Track, edm::EventAuxiliary, TrackingParticle, etc.). If you are running fully compiled code this is not an issue; however, if you want to look at this file via CINT, you will need to load the dictionaries so that ROOT will understand these classes. Fortunately, this can be accomplished by setting up CMSSW's FWLite:

{
    // Set up FW Lite for automatic loading of CMS libraries
    // and data formats.   As you may have other user-defined setup
    // in your rootlogon.C, the CMS setup is executed only if the CMS
    // environment is set up.
    //
    const TString cmsswbase = getenv("CMSSW_BASE");
    if (cmsswbase.Length() > 0) {
        //
        // The CMSSW environment is defined (this is true even for FW Lite)
        // so set up the rest.
        //
        cout << "Loading FW Lite setup." << endl;
        gSystem->Load("libFWCoreFWLite.so");
        AutoLibraryLoader::enable();
        gSystem->Load("libDataFormatsFWLite.so");
        gSystem->Load("libDataFormatsPatCandidates.so");
    }
}

Load FWLite before you open the file; then the warnings should disappear.

[rwkelley@uaf-7 lesson3]$  root -b
root [0] .x macros/load_fwlite.C 
Loading CMSSW FWLite
root [1] TFile* file = TFile::Open("/nfs-7/userdata/edm/53X/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_AODSIM_PU_S10_START53_V7A-v1.root")
root [2] 

Often, people put this into their .rootlogon.C file so that it is loaded with ROOT whenever CMSSW is set up.
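
For reference, a minimal .rootlogon.C along these lines could be the following (this just repeats the FWLite setup above; tweak it to taste):

// ~/.rootlogon.C -- load FWLite automatically, but only when a CMSSW environment is set up
{
    if (getenv("CMSSW_BASE") != 0)
    {
        gSystem->Load("libFWCoreFWLite.so");
        AutoLibraryLoader::enable();
        gSystem->Load("libDataFormatsFWLite.so");
    }
}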

Chapter 4
CMSSW Z Analysis exercise
  • Write an EDProducer to make a simple ntuple with some basic variables for a quick generator study of the Z boson
    • keep the p4 of the Z and its daughter leptons
      • gen label --> "genParticles"
    • match the daughter leptons to their corresponding reco objects (if they exist).
      • muon label --> "muons"
      • electron label --> "gsfElectrons"
  • Write an EDFilter to filter out Z --> tau tau events
  • Write an EDAnalyzer to make some histograms of the quantities above (a bare skeleton is sketched after this list)
    • Try to look at the difference between the generator-level quantity and the reco quantity to get a feel for the resolution with which these quantities are measured (e.g. m_Z vs m_l1l2).
  • I'd like to see the following plots at least:
    • mass of the Z
    • Compare the dilepton invariant mass at gen level and reco level
      • for ee, μμ and both together
    • Compare pT and η of the leptons (gen and reco level)
    • Resolution of the Z mass vs Z mass
    • Resolution of the lepton pT vs pT (e's and μ's separately).
    • Efficiency of reconstructing the lepton vs pT (e's and μ's separately).
  • Please make a few slides summarizing your results. I will look at them and check your code on Friday.
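
To get started on the EDAnalyzer part, a bare skeleton looks roughly like the following (the class name, file path, and histogram details are placeholders for you to fill in; see the WorkBook chapters above for the full recipe):

// e.g. Demo/DemoZAnalysis/plugins/ZGenAnalyzer.cc -- bare skeleton, names are placeholders
#include "FWCore/Framework/interface/EDAnalyzer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

class ZGenAnalyzer : public edm::EDAnalyzer
{
    public:
        explicit ZGenAnalyzer(const edm::ParameterSet& iConfig) {}

    private:
        virtual void analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)
        {
            // get the genParticles, find the Z and its daughter leptons,
            // match them to the reco muons/gsfElectrons, and fill the histograms here
        }
};

// register the module with the framework so cmsRun can find it
DEFINE_FWK_MODULE(ZGenAnalyzer);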

Building Lesson

C was developed in the 1970s and C++ in the 1980s. Both predate the nice GUI Integrated Development Environments (IDEs) that are quite popular today, such as MS Visual Studio, Eclipse, XCode, etc. NOTE: the IDE is not the compiler but is merely a nice GUI around the compiler. Most of the time, the nitty-gritty details of managing compiling and linking code are handled by the IDE, but not always. For large systems, even in industry where IDEs are used, the build system is handled separately from the IDE. Here "build" refers to the process of compiling and linking your code into the final product (exe, libs, etc.).

In CMS, we use SCRAM (Source Configuration, Release, And Management tool). This short lesson motivates the need for a "build" system and then goes into how SCRAM solves the problem for CMSSW.

Why do we care?

C++ is object oriented. This encourages modular code that can be spread among multiple files. Unfortunately, nobody updated the old file management system from C to handle this, so C++ is stuck with quite an antiquated system -- very crude compared to more recently developed languages. The next subsections illustrate some of the issues.
Simple one file program
Consider the following simple program in a file called hello.cpp in the src directory.
#include <iostream>

int main()
{
    std::cout << "hello world!" << std::endl;
    return 0;
}

Use g++ to compile

[rwk7t@tensor lesson4]$ g++ src/hello.cpp 

This produces an exe called a.out. To run:

[rwk7t@tensor lesson4]$  ./a.out 
hello world!

You can control the name of the output exe with the option -o to g++. Here we put the exe in the bin directory:

[rwk7t@tensor lesson4]$  g++ src/hello.cpp -o bin/hello
[rwk7t@tensor lesson4]$  ./bin/hello 
hello world!

Multiple File programs

Now what happens if we want to write some reusable functions and put them in separate files? How does this change our building procedure? Consider the function Module1::PrintHello in a separate file called module1.cpp:

#include <iostream>

namespace Module1
{
    void PrintHello()
    {
        std::cout << "hello from module1!" << std::endl;
    }
}

And the main program that goes with it is called test1.cpp:

#include "module1.cpp"

int main()
{
    Module1::PrintHello();
    return 0;
}

Now we compile into an exe in the bin directory:

[rwk7t@tensor lesson4]$  g++ src/test1.cpp -o bin/test1
[rwk7t@tensor lesson4]$  ./bin/test1 
hello from module1!

Everything here looks fine. Now, consider another function added into our program which uses the Module1::PrintHello function from module1.cpp -- call this module2.cpp:

#include <iostream>
#include "module1.cpp" 

namespace Module2
{
    void PrintHello()
    {
        std::cout << "hello from module2!" << std::endl;
        Module1::PrintHello();
    }
}

Now our main program changes to the following which is saved in test2.cpp:

#include "module1.cpp"
#include "module2.cpp"

int main()
{
    Module1::PrintHello();
    Module2::PrintHello();
    return 0;
}

Now compiling will give you an error:

[rwk7t@tensor lesson4]$  g++ src/test2.cpp -o bin/test2
In file included from src/module2.cpp:2:0,
                 from src/test2.cpp:2:
src/module1.cpp: In function 'void Module1::PrintHello()':
src/module1.cpp:5:10: error: redefinition of 'void Module1::PrintHello()'
In file included from src/test2.cpp:1:0:
src/module1.cpp:5:10: error: 'void Module1::PrintHello()' previously defined here

The reason for the error is that module1.cpp has been included twice: once in test2.cpp and once in module2.cpp. Recall from C++ that you can declare a function or class many times but you can only define it once. Because we are including the definition twice, we get a compile error.

To fix the problem, we forward declare the two functions in test3.cpp:

namespace Module1 {void PrintHello();}
namespace Module2 {void PrintHello();}

int main()
{
    Module1::PrintHello();
    Module2::PrintHello();
    return 0;
}

We also need to forward declare the function in "module2.cpp":

#include <iostream>

namespace Module1 {void PrintHello();}
namespace Module2
{
    void PrintHello()
    {
        std::cout << "hello from module2!" << std::endl;
        Module1::PrintHello();
    }
}

And when we compile, we need to supply the definitions to the compiler. This is done by adding all three files to the g++ call and now g++ knows to compile all 3 files and then link them together:

[rwk7t@tensor lesson4]$  g++ src/test3.cpp src/module1.cpp src/module2.cpp -o bin/test3
[rwk7t@tensor lesson4]$  ./bin/test3 
hello from module1!
hello from module2!
hello from module1!

Now, editing the files and forward declaring everything is quite cumbersome, so the convention is to factor out the forward declarations into interface or "header" files, usually denoted by *.h or *.hpp. So, refactoring the modules and main program:

module1.h:

#ifndef MODULE1_H
#define MODULE1_H

namespace Module1
{
    void PrintHello();
}

#endif //MODULE1_H

Notice that the code in the header file is surrounded by C preprocessor commands (called a "header guard"). This is really only needed for class definitions, since these are typically defined in header files, and it protects against multiple definitions. In this particular example you wouldn't need them, but it's good practice.

module1.cpp:

#include "module1.h"
#include <iostream>

namespace Module1
{
    void PrintHello()
    {
        std::cout << "hello from module1!" << std::endl;
    }
}

module2.h:

#ifndef MODULE2_H
#define MODULE2_H

namespace Module2
{
    void PrintHello();
}

#endif //MODULE2_H

module2.cpp:

#include "module2.h" 
#include "module1.h" 
#include <iostream>

namespace Module2
{
    void PrintHello()
    {
        std::cout << "hello from module2!" << std::endl;
        Module1::PrintHello();
    }
}

test4.cpp:

#include "module1.h"
#include "module2.h"

int main()
{
    Module1::PrintHello();
    Module2::PrintHello();
    return 0;
}

It is customary to put the header files in a different folder than the source code. This is to provide a "user interface" that is easy to parse and not polluted with implementation details. This is merely a convention, but it is standard practice. Because the headers in this example live in the include directory, we need to tell g++ where to find them. This is done with the -I switch to g++:

[rwk7t@tensor lesson4]$  g++ src/test4.cpp -I include src/module1.cpp src/module2.cpp -o bin/test4
[rwk7t@tensor lesson4]$  ./bin/test4 
hello from module1!
hello from module2!
hello from module1!

Libraries

In the previous section, we saw how to build an exe from code spread among multiple files. What happens, though, when you have 10, 100, or 10k files? You cannot possibly compile every file every time you wish to build your code! For intermediate and large systems it might take hours to compile. For example, compiling CMSSW takes several hours and has to be scheduled. Even ROOT takes ~30 minutes.

To save time, you can compile each piece separately; then you only have to re-link them, which can save some build time:

[rwk7t@tensor lesson4]$  g++ -c src/module1.cpp -Iinclude -o obj/module1.o
[rwk7t@tensor lesson4]$  g++ -c src/module2.cpp -Iinclude -o obj/module2.o
[rwk7t@tensor lesson4]$  g++ -c src/test4.cpp -Iinclude -o obj/test4.o
[rwk7t@tensor lesson4]$  g++ obj/test4.o obj/module1.o obj/module2.o -o bin/test4 

This compiles each file into a binary or "object" file by using the -c option. An object file does nothing on its own, but it can be linked against (in the last line) to create the exe. However, compiling every file into its own object file and listing them all can still be cumbersome for intermediate to large systems, so the convention is to create a reusable library.

Static Libraries

A static library is essentially a collection of all the object files for a set of related code, combined into one file that can be linked against to create your main program. The main program pulls in the relevant binaries and absorbs the binary content into the exe file. In the following example, we pull module1 and module2 into a single reusable static library:

[rwk7t@tensor lesson4]$  g++ -c src/module1.cpp -Iinclude -o obj/module1.o
[rwk7t@tensor lesson4]$  g++ -c src/module2.cpp -Iinclude -o obj/module2.o
[rwk7t@tensor lesson4]$  ar rvs lib/libmodule.a obj/module1.o obj/module2.o 
ar: creating archive lib/libmodule.a
a - obj/module1.o
a - obj/module2.o

NOTE: the file extension convention for static libraries is .a for "archive".

Now you can link against this static library to use the reusable package:

[rwk7t@tensor lesson4]$  g++ -Iinclude src/test4.cpp lib/libmodule.a -o bin/test4
[rwk7t@tensor lesson4]$  ./bin/test4 
hello from module1!
hello from module2!
hello from module1!

There is an interface that g++ uses to link against libraries. The following is equivalent to the above:

[rwk7t@tensor lesson4]$  g++ -Iinclude src/test4.cpp -L lib -l module -o bin/test4
[rwk7t@tensor lesson4]$  ./bin/test4 
hello from module1!
hello from module2!
hello from module1!

Here -L allows you to specify the path to the library and -l allows you to specify the actual library you wish to link against. Notice the -l argument does not have the extension or the "lib" prefix of the archive file -- these are implied. This is because you may have a dynamic and a static version of the same library, and by default the dynamic one is preferred. The advantage of this method is that you don't have to keep repeating the path to the libraries, which can make your command really long and unreadable.

Dynamic Libraries

Static libraries are great for small or intermediate size systems; however, since the binary code is copied into the exe, this can cause "code bloat" and make the exe very large (~GB). Also, if you use a hierarchical library scheme where some libs depend on other libs, then you are duplicating binary code and again bloating the binaries. The solution is to use dynamic libraries (also called shared libraries). This allows the exe to find the library at runtime. There is a slight initialization penalty, but it is not really noticeable with modern computers. The advantage is that the exe will be small and you don't have to copy binaries into other libs or exes. This works well for systems with "standard" shared libraries installed, since only the exe needs to be shipped. However, the downside to shared libraries is that you now have to keep track of where the dynamic libraries are, and if you wish to run the exe somewhere else (like the grid), you will have to send the appropriate libraries as well if they are not already installed there.

To create a dynamic lib:

[rwk7t@tensor lesson4]$ g++ -shared obj/module*.o -o lib/libmodule.so
[rwk7t@tensor lesson4]$ g++ -I include src/test4.cpp -L lib -l module -o bin/test4
[rwk7t@tensor lesson4]$ ./bin/test4 
hello from module1!
hello from module2!
hello from module1!

NOTE: the file extension convention varies by platform:

  • Windows: *.dll for dynamic link library
  • Linux: *.so for shared object
  • Mac OSX: *.dylib for dynamic library

Finally, the exe will look for the library in its own directory or in the original place where the dynamic lib was during linking. If you move the library to some central place, you will need to ensure that the exe can find it at runtime. You can set an environment variable called LD_LIBRARY_PATH to include the paths of the libraries you use frequently, and the OS will search there for the shared lib. This is usually used for system-wide code or systems that the user sets up (like ROOT or CMSSW). Mac OSX has an analogous environment variable called DYLD_LIBRARY_PATH, which unfortunately causes mental friction when using a mac.

This only scratches the surface of using libraries.

Building an exe with ROOT objects

There is no reason you have to use ROOT interactively through CINT -- it is really just a set of C++ libraries. Consider the following simple program in a file called with_root.cpp:

#include "TH1F.h"
#include "TFile.h"

int main()
{
    TFile file("file.root", "RECREATE");
    TH1F h1("h1", "h1", 100, -5, 5);
    h1.FillRandom("gaus", 10000);
    h1.Write();
    file.Close();
    return 0;
}

If I try to compile this, I'll get an error since g++ doesn't know where to find the ROOT headers and libraries:

[rwk7t@tensor lesson4]$  g++ src/with_root.cpp -o bin/with_root                               
src/with_root.cpp:1:18: fatal error: TH1F.h: No such file or directory
compilation terminated.

ROOT ships with a nice command line utility that will give you the appropriate g++ options to pass to your command:

[rwk7t@tensor lesson4]$  root-config --cflags --libs
-pthread -std=c++0x -m64 -I/usr/local/cmssw/osx107_amd64_gcc472/cms/cmssw/CMSSW_6_1_2/external/osx107_amd64_gcc472/bin/../../../../../../lcg/root/5.34.03-cms2/include -L/usr/local/cmssw/osx107_amd64_gcc472/cms/cmssw/CMSSW_6_1_2/external/osx107_amd64_gcc472/bin/../../../../../../lcg/root/5.34.03-cms2/lib -lCore -lCint -lRIO -lNet -lHist -lGraf -lGraf3d -lGpad -lTree -lRint -lPostscript -lMatrix -lPhysics -lMathCore -lThread -lpthread -Wl,-rpath,/usr/local/cmssw/osx107_amd64_gcc472/cms/cmssw/CMSSW_6_1_2/external/osx107_amd64_gcc472/bin/../../../../../../lcg/root/5.34.03-cms2/lib -lm -ldl

So to build a standalone exe with ROOT objects, you can do the following:

[rwk7t@tensor lesson4]$  g++ src/with_root.cpp `root-config --cflags --libs` -o bin/with_root 
[rwk7t@tensor lesson4]$  ./bin/with_root 

The point is you are not stuck with ACLiC to run ROOT code.

Build Systems

To help manage the low level details from the last section, programs and tools called "build systems" have been developed. These can range from a simple bash script to a GUI-driven tool (like Visual Studio). Learning a build system can increase your productivity and reduce the mental friction of the develop, build, test cycle.

GNU Make

GNU Make is considered a first step towards a build system. GNU Make is not a build system per se, but a tool that can be used to fashion a build system. This section will not go into any real detail other than to give the most basic example. Make is old technology from the 1970s and many more modern systems are available. It is still widely used as a piece of larger build systems, so it is worth knowing a bit.

Make has a simple structure: a Makefile is a series of rules:

target:  dependency1 dependency2 ...
      <tab> command1
      <tab> command2
      ... 

The goal of each rule is to create the target. If the target is older than any of its dependencies, then the commands are executed. You can then build up a series of interconnected rules to build your code. Consider the following example:

# simple Makefile for test4

# target
all: bin/test4

# rule to make the exe
bin/test4: src/test4.cpp lib/libmodule.so
   g++ -Iinclude src/test4.cpp -Llib -lmodule -o bin/test4

lib/libmodule.so: obj/module1.o obj/module2.o
   g++ -shared obj/module1.o obj/module2.o -o lib/libmodule.so

obj/module1.o: include/module1.h src/module1.cpp
   g++ -c src/module1.cpp -Iinclude -o obj/module1.o

obj/module2.o: include/module2.h src/module2.cpp include/module1.h src/module1.cpp
   g++ -c src/module2.cpp -Iinclude -o obj/module2.o

# clean up
clean:
   if [ -f lib/libmodule.so ]; then rm lib/libmodule.so; fi
   if [ -f bin/test4 ]; then rm bin/test4; fi
   if [ -f obj/module1.o ]; then rm obj/module1.o; fi
   if [ -f obj/module2.o ]; then rm obj/module2.o; fi
   if [ -f obj/test4.o ]; then rm obj/test4.o; fi

# PHONY means these rules are always out of date
.PHONY: all clean

The ultimate goal is to create the all rule:

[rwk7t@tensor lesson4]$  make all
g++ -c src/module1.cpp -Iinclude -o obj/module1.o
g++ -c src/module2.cpp -Iinclude -o obj/module2.o
g++ -shared obj/module1.o obj/module2.o -o lib/libmodule.so
g++ -Iinclude src/test4.cpp -Llib -lmodule -o bin/test4

Notice the rules get called in reverse order since nothing has been made yet. Also, note that if you just type make, the all rule is implied. This is convenient: whenever a file is out of date, make will call the appropriate rule, and so on, until the all rule is satisfied.

Also, there is a user defined rule clean which deletes all of the binary and library files:

[rwk7t@tensor lesson4]$  make clean
if [ -f lib/libmodule.so ]; then rm lib/libmodule.so; fi
if [ -f bin/test4 ]; then rm bin/test4; fi
if [ -f obj/module1.o ]; then rm obj/module1.o; fi
if [ -f obj/module2.o ]; then rm obj/module2.o; fi
if [ -f obj/test4.o ]; then rm obj/test4.o; fi

This might seem magical, but even for intermediate size systems (like an analysis), this can get out of hand and become cumbersome. Some more digging into Make's features can alleviate some of this, but in the end there is still quite a bit of work to get Make to work well.

See: http://oreilly.com/catalog/make3/book/index.csp

ACLiC

For interactive work, ROOT ships with a compiler wrapper for CINT called the Automatic Compiler of Libraries for CINT (ACLiC):

root [0] .L macro.C++

ACLiC is nice for simple and mostly standalone code. Building larger systems can be troublesome. More on ACLiC can be found here: http://root.cern.ch/download/doc/ROOTUsersGuideChapters/CINT.pdf.

The main disadvantage of ACLiC is it has a hard time parsing complex code, especially with templates. Thus the need to use something more robust for complex systems.

SCRAM

CMSSW uses SCRAM (Source Configuration, Release, And Management tool). The following twikis explain it in more detail:

SCRAM BuildFiles -- Short Tutorial

SCRAM is a build system that is a wrapper around make. Recall that CMSSW is broken into Subsystem/Package (e.g. Reconstruction/TrackReco or Demo/DemoZAnalysis). This is historical from the ORCA days (the predecessor to CMSSW). To first order, each package becomes a shared library (*.so) that lives in the $CMSSW_BASE/lib/$SCRAM_ARCH directory (e.g. on my mac -- CMSSW_6_1_2/lib/osx107_amd64_gcc472/).

   Subsystem1 --> package 1 --> BuildFile.xml --> libSubsystem1Package1.so
              --> package 2 --> BuildFile.xml --> libSubsystem1Package2.so
              --> package 3 --> BuildFile.xml --> libSubsystem1Package3.so
            ....
   Subsystem2 --> package 1 --> BuildFile.xml --> libSubsystem2Package1.so
            ....

You can do fancier things like build a standalone executable or plugin. This is beyond the scope of this tutorial -- see the twikis if you want to go there. I've done some of this and it's not terrible.

EDM Plugin

To use SCRAM, you present to the system a BuildFile.xml (or CMS.BuildFile), which is a declarative statement of which libraries are needed to build this package. Consider the following example of an analysis:

   Analysis          C++ class           Subsystem/Package they live in
      |-------> reco::GenParticles --> DataFormats/HepMCCandidate 
              --> reco::Muons        --> DataFormats/MuonReco
              --> reco::GsfElectrons --> DataFormats/EGammaReco
              --> edm::EDAnalyzer    --> FWCore/Framework
              --> edm::Event         --> FWCore/Framework
              --> edm::ParameterSet  --> FWCore/ParameterSet
              --> TH1D               --> ROOT
              --> TFileService       --> PhysicsTools/UtilAlgos
              ....

To properly compile and LINK this analysis package, consider the following BuildFile.xml in $CMSSW_BASE/src/Ryan/Analysis:

<use name="FWCore/Framework"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ParameterSet"/>
<use name="DataFormats/HepMCCandidate"/>
<use name="DataFormats/MuonReco"/>
<use name="DataFormats/EgammaReco"/>
<use name="PhysicsTools/UtilAlgos"/>
<use name="root"/>
<flags EDM_PLUGIN="1"/>

This BuildFile declares a library called libRyanAnalysis.so, which is also an EDM plugin that cmsRun will understand (EDProducer, EDAnalyzer, EDFilter, ...). It is registered with the plugin manager at compile time.

  • The lines with <use name="Subsystem/Package"/> mean to include and link against that library
  • The line with <flags EDM_PLUGIN="1"/> means to promote this library to an EDM plugin
  • The "1" means to use the directory structure to resolve the Subsystem/Package name (i.e. "Subsystem/Package")
    • you could have named it something different, but this is the convention.

So when you run cmsRun, the plugin can be referenced in the python configuration, loaded at runtime, and do whatever it was designed to do.

These dependencies change as you write more or change your code. It takes some experience to build up knowledge of where all the different packages live in CMSSW -- LXR and Doxygen are the best places to figure this out.

Reusable Library

Not all the code in CMSSW is an EDM plugin. Sometimes you just want to make a reusable library that other packages use (e.g. data formats or analysis tools). This is simple in SCRAM: you just don't declare an EDM_PLUGIN, and you "export" the lib:

<use name="rootcore"/>
<use name="boost"/>
<use name="gsl"/>
<use name="AnalysisTools/LanguageTools"/>
<export>
  <lib name="1"/>
  <use name="root"/>
</export>

Here I include the packages that are needed and export the library name. Again, the "1" for the lib name means to use the directory structure to resolve the Subsystem/Package name (i.e. "Subsystem/Package"). By exporting, I'm saying there will be a library available to everyone under the name "Subsystem/Package" (e.g. AnalysisTools/RootTools).

Purpose of Exporting

Consider that we wrote a reusable library called MyTools that we want to expose to "clients" in our system. Consider the following awesome drawing:

export_example.png

Here, the implementation of MyTools (below the squiggly line) uses three different packages: ROOT, the Boost C++ libraries, and the GNU Scientific Library (GSL). The MyTools package uses these three packages in its implementation but not necessarily in its interface. The interface (above the squiggly line) only uses ROOT. For example, consider the following pseudo-code from MyTools:

MyAwesomePlotMaker.h

#ifndef MYAWESOMEPLOTTOOLS_H
#define MYAWESOMEPLOTTOOLS_H
#include "TH1F.h"

namespace MyTools
{
   TH1F MyAwesomePlotMaker();
}
#endif // MYAWESOMEPLOTTOOLS_H

MyAwesomePlotMaker.cc

#include "MyAwesomePlotMaker.h"
#include "TH1F.h"
#include "boost/shared_ptr.hpp"
#include "gsl/gsl_matrix.h"

namespace MyTools
{
   TH1F MyAwesomePlotMaker()
   {
      // awesome code that does everything
      ...
   }
}

To properly link against these, the following lines are needed in the BuildFile.xml:

<use name="rootcore"/>
<use name="boost"/>
<use name="gsl"/>

However, the user or "client" of the MyTools package does not need to know about Boost and GSL, since those binaries are linked dynamically to MyTools. On the other hand, whenever you use something from another package in the interface (header files) of your package, you do need to tell the client about it. For example, since in MyTools I included

#include "TH1F.h"

the client of MyTools also has to know about ROOT. If he just includes the MyTools package, he will get a linker error. To fix this he needs to explicitly add ROOT as a dependency in the client's BuildFile.xml. This is not sustainable, since every client would have to figure out that ROOT is needed whenever MyTools is used. This is error prone and inelegant: what if MyTools adds another dependency to its interface -- if there are 1000 "clients" of MyTools, are you going to change the 1000 BuildFiles that included MyTools?

The way SCRAM deals with this is the "export" block. When I export my lib, I also export the dependencies that the client needs in order to link against this lib:

<export>
  <lib name="1"/>
  <use name="root"/>
</export>

Here, I am exposing my library called Subsystem/Package, and also saying that whenever you use this library, you will get ROOT with it. So if the interface to MyTools changes and another package is needed in the interface, you just have to change the BuildFile.xml for that one package. The picture of the build dependencies will now look like:

export_example2.png

The client now gets the ROOT dependency implicitly. The final BuildFile.xml to realize this picture looks like:

<use name="rootcore"/>
<use name="boost"/>
<use name="gsl"/>
<export>
  <use name="root"/>
  <lib name="1"/>
</export>

Unfortunately, in practice, people are not aware of and/or disciplined enough to use the export properly. But you should be aware of it and try to use this feature. This is also standard practice in commercial C++ systems.

FYI, in my opinion, the Boost build system handles this problem particularly well and is not tied to CMSSW...

Other systems

There are other systems that can be used as a build system. One goal of a modern build system is to be "platform independent" -- i.e. not to depend on the system that the code is compiled on (Windows, Linux, PS3, iOS, etc.). Examples of build systems include, but are not limited to, the ones listed on the wiki page below.

Check out the wiki page on this: http://en.wikipedia.org/wiki/List_of_build_automation_software

CMS2

CMS2 is the name of the ntupling and supporting framework used by the UCSD/UCSB/FNAL group (a.k.a. TAS or SNT). The workflow is to produce, for the whole group, a set of standard CMS2-style ntuples using CMSSW. The ntuples are analysis ready, with the relevant variables needed to get started on a physics analysis, but are general enough to be flexible. An individual analysis (e.g. same-sign dileptons) would take these ntuples and make further selections to produce the final analysis.

Where is the CMS2 code?

The code is now kept in github: https://github.com/cmstas

How do I produce CMS2 ntuples?

The CMS2 ntuples are produced via cmsRun from CMSSW. The instructions are kept on the SNT twiki: http://www.t2.ucsd.edu/tastwiki/bin/view/CMS/NtupleMakingInstructions

Where are they kept?

The ntuples are produced by various members of the group (usually students/postdocs) and are kept in a central place:

If you are charged with creating ntuples, please ensure that this twiki is up to date with ALL the fields filled in. If you don't know the value to put in a field -- ask someone!

Example CMS2 Looper

This section outlines how to check out and compile the example CMS2 looper. Note that there is no "standard" way to build this code in this group -- the user is on his own to figure it out. This example uses SCRAM since it is supported by the CMS collaboration and you can get help via hypernews. If you decide to use something else (ACLiC, make, cmake, boost build, an IDE, etc.), then you are on your own.

checkout the code

First decide where to put the code in your file system. This example will assume that you created a directory called cms2_project_example.

Create a base level project directory:

mkdir cms2_project_example
cd cms2_project_example

Check out the AnalysisTutorial code:

git clone https://github.com/kelleyrw/AnalysisTutorial.git 

Create a CMSSW release area and setup the environment:

scramv1 p -n CMSSW_5_3_2_patch4_cms2example CMSSW CMSSW_5_3_2_patch4
cd CMSSW_5_3_2_patch4_cms2example/src
cmsenv

Checkout and compile the CMS2 CORE and Dictionaries:

git clone https://github.com/cmstas/Dictionaries.git CMS2/Dictionaries
git clone https://github.com/cmstas/CORE.git CMS2/NtupleMacrosCore
cd CMS2/NtupleMacrosCore
git pull
git checkout scram_compliant
./setup/setupCoreForSCRAM.sh setup/cms2_ntuple_postprocess_05.03.23.root 
cd $CMSSW_BASE/src

CORE is the set of common code the group shares to ensure reproducibility. It is mainly for implementing selections and corrections to physics objects (e.g. JEC). Also, to facilitate easy access to the ntuple's branches in the looper code, CMS2.h/cc are automatically generated. CORE uses CMS2.h heavily.

Note: CMS2.cc is actually quite slow to compile since it is ~33k lines long, thus the reason for it being made its own library.
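
To give a flavor of what CMS2.h buys you: inside a looper's event loop, branch contents are available through simple accessor functions in the tas namespace. A small fragment (the full looper example later in this twiki uses the same pattern; it assumes the current event has already been loaded):

// count the status-3 electrons in the current event using the auto-generated accessors
int num_gen_electrons = 0;
for (size_t idx = 0; idx < tas::genps_id().size(); ++idx)
{
    if (tas::genps_status().at(idx) != 3) {continue;}
    if (abs(tas::genps_id().at(idx)) == 11) {num_gen_electrons++;}
}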

Copy the Analysis code over to the CMSSW area (For your code, depending on how you set it up, you could just check it out directly).

cp -r ../../AnalysisTutorial/cms2_examples/CMS2Project/* .

Compile the code one last time to make sure all the modules are compiled:

scram b -j20

code layout

There should be three directories in your analysis area:

  1. CMS2: this is the CMS2 code
    • Dictionaries: class definitions so that ROOT can understand the non-ROOT native CMS2 types.
    • NtupleMacrosCore: the "sandbox" for the CORE common code
    • NtupleMacrosHeader: the "sandbox" for the CMS2.h/cc that was produced by the setup script from CORE.
  2. Packages: this is a "starter" reusable toolkit. Feel free to run with this...
    • HistTools: A set of basic ROOT histogram helper functions and classes.
    • LooperTools: A set of basic tools for writing a looper (not necessarily CMS2 specific)
    • ... add whatever other packages you feel you need...
  3. Analysis: for the analysis specific code
    • ExampleCMS2Looper: the actual example looper
    • ExampleCMS2BabyMaker: an example baby maker (see next section)
    • ...add as many analyses as you need...

.
|____CMS2
| |____Dictionaries
| |____NtupleMacrosCore
| |____NtupleMacrosHeader
|____Packages
| |____HistTools
| |____LooperTools
|____Analysis
| |____ExampleCMS2Looper

How do I run it?

There are two examples of running the looper. To run interactively, use the non-compiled ROOT script:

cd Analysis/ExampleCMS2Looper
root -b -q -l macros/run_sample.C 

Also, there is an example of a fully compiled executable which is defined in bin/run_sample.cc:

run_sample

This should produce a ROOT file in the output directory that contains your histogram(s).

Example CMS2 Baby Maker

This builds off the concepts in the previous subsection on CMS2 Loopers. In many cases, it may make more sense to create a small ntuple (a.k.a. a "baby") that is a more cooked and less flexible version of the original CMS2 ntuple. Babies are usually very analysis specific and are usually both a skim and a translation of the general CMS2 variables into more specific analysis variables. The babies have the following advantages:

  • the final variables you need for your analysis are already calculated
  • small and lightweight (e.g. all of my SS babies fit on my laptop)
  • remaking plots with different binning is trivial
  • great for specific studies

The main disadvantage of a baby is when you run into a situation where you need a variable from the CMS2 ntuple that you didn't store -- at that point you have to re-run and re-create your baby ntuples.
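
Mechanically, a baby is just a flat TTree with one branch per final analysis variable. A minimal sketch of the idea (the branch and file names here are illustrative, not the actual example baby maker):

#include "TFile.h"
#include "TTree.h"

void make_baby_sketch()
{
    // output file and flat tree for the "cooked" variables
    TFile output_file("babies/example_baby.root", "RECREATE");
    TTree tree("tree", "example baby tree");

    // one branch per final analysis variable
    float dilep_mass = -999.0f;
    float scale1fb   = 1.0f;
    bool  is_ee      = false;
    tree.Branch("dilep_mass", &dilep_mass, "dilep_mass/F");
    tree.Branch("scale1fb"  , &scale1fb  , "scale1fb/F"  );
    tree.Branch("is_ee"     , &is_ee     , "is_ee/O"     );

    // in a real baby maker you would loop over the CMS2 events here,
    // apply the skim selection, compute the final variables, and Fill() once per event
    dilep_mass = 91.2f;
    scale1fb   = 0.116f;
    is_ee      = true;
    tree.Fill();

    tree.Write();
    output_file.Close();
}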

checkout the code

This is the same workflow as before for the CMS2 Looper.

code layout

There should be three directories in your analysis area:

  1. CMS2: this is the CMS2 code
    • Dictionaries: class definitions so that ROOT can understand the non-ROOT native CMS2 types.
    • NtupleMacrosCore: the "sandbox" for the CORE common code
    • NtupleMacrosHeader: the "sandbox" for the CMS2.h/cc that was produced by the setup script from CORE.
  2. Packages: this is a "starter" reusable toolkit. Feel free to run with this...
    • HistTools: A set of basic ROOT histogram helper functions and classes.
    • LooperTools: A set of basic tools for writing a looper (not necessarily CMS2 specific)
    • ... add whatever other packages you feel you need...
  3. Analysis: for the analysis specific code
    • ExampleCMS2Looper: an example looper (see previous section)
    • ExampleCMS2BabyMaker: an example baby maker
    • ...add as many analyses as you need...

.
|____CMS2
| |____Dictionaries
| |____NtupleMacrosCore
| |____NtupleMacrosHeader
|____Packages
| |____HistTools
| |____LooperTools
|____Analysis
| |____ExampleCMS2Looper
| |____ExampleCMS2BabyMaker

how does it work?

The code for the example baby maker is located:

The main logic to fill the tree is in the class CMS2BabyMaker:

running the code

To run interactively use the ROOT non-compiled script create_baby.C: https://github.com/kelleyrw/AnalysisTutorial/blob/master/cms2_examples/CMS2Project/Analysis/ExampleCMS2BabyMaker/macros/create_baby.C

The arguments to create_baby.C are:

  • const std::string& sample_name --> sample name to allow for sample dependent logic
  • const std::string& input_filename --> input TTree files
  • const std::string& output_filename --> output baby TTree path
  • const std::string& runlist_filename --> good run list
  • const long num_events = -1 --> number of events to run on
  • const double lumi = 1.0 --> luminosity in fb^-1
  • const bool verbose = false --> toggle on/off extra debugging printout

The following is an example of how to run the baby maker:

root -b -q "macros/create_baby.C(\"dyll\", \"/nfs-7/userdata/rwkelley/cms2/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1.root\", \"babies/dyll.root\", \"json/Merged_190456-208686_8TeV_PromptReReco_Collisions12_goodruns.txt\", -1, 0.082, 0)"

running the code in parallel

The code can be run in parallel by using the script create_babies_all.sh: https://github.com/kelleyrw/AnalysisTutorial/blob/master/cms2_examples/CMS2Project/Analysis/ExampleCMS2BabyMaker/scripts/create_babies_all.sh

Just call the script to execute the baby maker on all the datasets:

./scripts/create_babies_all.sh

examples on the ROOT interpreter

// raw 
root [1] tree->Draw("gen_p4.mass()>>h1(150, 0, 150)", "is_gen_ee");
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
root [2] cout << h1->Integral(0,-1) << endl;
238287

// scaled
root [2] TH1D h1("h1", "dielectron mass;m_{ee} (GeV)", 150, 0, 150);
root [3] tree->Draw("gen_p4.mass()>>h1", "0.082*scale1fb*is_gen_ee");
root [4] cout << h1->Integral(0,-1) << endl;
99815.7

Example Lepton Study

Say one needs to do a study on lepton ID and isolation. The simplest way conceptually (maybe not from a programming standpoint) is to study the individual leptons and not necessarily the whole event. In this sub-section, a simple skeleton single lepton study is performed. This will be done using the following steps:

  • Define the interesting variables.
  • Make a simple "lepton based" baby (i.e. TTree) that holds the variables and is filled per lepton (as opposed to per event).
  • Define a set of selections on the variables and apply them one-by-one cumulatively while observing the effect on the variables' distributions. These are the so-called N-1 plots.
  • Summarize the selections in a "cut-flow".
  • Present the results in slides (or whatever)

Again, this builds off the concepts in the previous subsection on CMS2 Loopers. The main difference is that the baby ntuple is filled per reconstructed lepton rather than per event. This choice gives us a simpler format to work with when studying per-lepton quantities. The disadvantage is that you give up the ability to explore event-level structure, but if you need that, you can redefine your format.

checkout the code

This is the same workflow as before with one exception: this example was written in a newer CMSSW release area to take advantage of the newer compiler. The instructions are the same as before except use the following CMSSW release:

Create a CMSSW release area and setup the environment:

export SCRAM_ARCH=slc5_amd64_gcc472
scramv1 p -n CMSSW_6_1_2_cms2example CMSSW CMSSW_6_1_2
cd CMSSW_6_1_2_cms2example/src
cmsenv

code layout

There should be three directories in your analysis area:

  1. CMS2: this is the CMS2 code
    • Dictionaries: class definitions so that ROOT can understand the non-ROOT native CMS2 types.
    • NtupleMacrosCore: the "sandbox" for the CORE common code
    • NtupleMacrosHeader: the "sandbox" for the CMS2.h/cc that was produced by the setup script from CORE.
  2. Packages: this is a "starter" reusable toolkit. Feel free to run with this...
    • HistTools: A set of basic ROOT histogram helper functions and classes.
    • LooperTools: A set of basic tools for writing a looper (not necessarily CMS2 specific)
    • ... add whatever other packages you feel you need...
  3. Analysis: for the analysis specific code
    • ExampleCMS2Looper: an example looper (see previous sub-sections)
    • ExampleCMS2BabyMaker: an example baby maker (see previous sub-sections)
    • LeptonSelectionStudy: A single lepton analysis to study the selections.

.
|____CMS2
| |____Dictionaries
| |____NtupleMacrosCore
| |____NtupleMacrosHeader
|____Packages
| |____HistTools
| |____LooperTools
|____Analysis
| |____ExampleCMS2Looper
| |____ExampleCMS2BabyMaker
| |____LeptonSelectionStudy

how does it work?

The code for the single lepton baby maker is located:

Rather than spread the baby making code over multiple files, it has been consolidated into one fully executable program:

This will produce a program called create_singlelep_baby. The code was written in one file since it is probably not going to be re-used somewhere else. If one does decide to make this a re-usable class, then you can always factor out the pieces into a separate source file (i.e. *.cc), a header file (i.e. *.h), and set up the library using the BuildFile.xml.

running the code

To compile the code, you use scram:

scram b -j20

This code defines an executable called create_singlelep_baby. It uses boost program_options to parse the command-line arguments (a sketch of such a parser follows the help printout below). To see the available arguments, use --help.

[rwk7t@tensor LeptonSelectionStudy]$  create_singlelep_baby --help
Allowed options:
  --help                print this menu
  --sample arg          REQUIRED: Sample name to run on
  --output arg          output ROOT file
  --nevts arg           maximum number of events to run
  --verbose arg         verbosity toggle
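
For reference, here is a minimal sketch of what such a command-line parser looks like with boost::program_options (this is illustrative only, not the actual create_singlelep_baby source):

#include "boost/program_options.hpp"
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
    namespace po = boost::program_options;

    std::string sample;
    std::string output = "babies/baby.root";
    long nevts         = -1;
    bool verbose       = false;

    po::options_description desc("Allowed options");
    desc.add_options()
        ("help"   ,                                 "print this menu"                )
        ("sample" , po::value(&sample)->required(), "REQUIRED: Sample name to run on")
        ("output" , po::value(&output)            , "output ROOT file"               )
        ("nevts"  , po::value(&nevts)             , "maximum number of events to run")
        ("verbose", po::value(&verbose)           , "verbosity toggle"               )
    ;

    po::variables_map vm;
    try
    {
        po::store(po::parse_command_line(argc, argv, desc), vm);
        if (vm.count("help")) {std::cout << desc << "\n"; return 0;}
        po::notify(vm);  // throws if --sample is missing
    }
    catch (const std::exception& e)
    {
        std::cerr << "Error: " << e.what() << "\n" << desc << "\n";
        return 1;
    }

    // ... hand the parsed values to the baby maker ...
    return 0;
}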

The following is an example of how to run the baby maker:

[rwk7t@tensor LeptonSelectionStudy]$  create_singlelep_baby --sample dyll
[create_singlep_baby] inputs:
sample  = dyll
input   = /nfs-7/userdata/rwkelley/cms2/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1.root
output  = babies/dyll_baby.root
nevts   = -1
verbose = 0

Loading CMSSW FWLite.
running single lepton baby maker...
[LeptonD0] WARNING - first good vertex index < 0.  Returning bogus value 999999
[LeptonDz] WARNING - first good vertex index < 0.  Returning bogus value 999999
 ---> 100.0% <---
[SingleLeptonNtupleMaker] Saving TTree to output file: babies/dyll_baby.root

62454 Events Processed
0 Duplicate Events
0 Bad Events
------------------------------
CPU  Time:   14.1
Real Time:   14.2

So far, there are only two samples defined:

  • "dyll" → Drell-Yan dataset
  • "QCD" → QCD dataset
If you run on another, it should fail...

[rwk7t@tensor LeptonSelectionStudy]$  create_singlelep_baby --sample bogus
[create_singlep_baby] Error: failed...
[create_singlelp_baby] sample bogus is not valid.

One can of course make this code more general, but usually this is sufficient for a specific study like this one.

running the code in parallel

The code can be run in parallel by using the python script create_babies_all.py:

Just call the script to execute the baby maker on all the datasets:

./scripts/create_babies_all.py

Create the Plots

Load Interactive HistTools

In order to maximize code-reuse and make plotting easier, use the HistTools interactively in CINT. To load the tools, add the following lines to your .rootlogon.C (or call it in each session):

    // Load HistTools
    gROOT->ProcessLine(".x $CMSSW_BASE/src/Packages/HistTools/macros/load_histtools.C");

This is a macro that loads the library and sets the include statements in a ROOT session:

run ROOT compiled plotting script

A compiled ROOT macro uses TTree::Draw statements to create all the plots for a set of selections.

This macro has several parts

  • GetSelections → a function returning a list of selections to be applied successively (i.e. a std::vector)
  • BookHists → book the histograms
  • CreateElectronHists → fill the electron histograms using TTree::Draw
  • PrintOverlays and PrintCutflowOverlay → overlay the Drell-Yan and QCD plots
  • CreatePlots → main program of sorts

Example:

[rwk7t@tensor LeptonSelectionStudy]$  root
Loading CMSSW FWLite.
Loading HistTools
root [1] .L macros/CreatePlots.C+
Info in <TUnixSystem::ACLiC>: creating shared library /Users/rwk7t/temp/.rootbuild//Users/rwk7t/Development/newbies/cms2_examples/CMS2Project/Analysis/LeptonSelectionStudy/./macros/CreatePlots_C.so
root [2] CreatePlots("test", "eps")
[CreatePlots] creating dyll plots
[CreatePlots] creating QCD plots
[CreatePlots] creating overlays plots
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_pt_base.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_pt_fid.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_pt_pt20.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_eta_base.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_eta_fid.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_eta_pt20.eps has been created
Info in <TCanvas::Print>: eps file plots/test/overlays/eps/p_el_cutflow.eps has been created
root [3] 

study and record results

Once you have the plots, record (in ppt, evernotes, whatever...) and interpret.

fill in the rest

This is a starter-kit framework to do the study. Take this and do whatever you want with it and finish the study.

To do:

  • Fill in the missing variables in the baby (do electron isolation as an example)
  • Add the observables to the CreatePlots.C
  • Add the observable to the study

Normalizing MC Samples

In order to get a proper prediction for the Monte Carlo samples (MC), we need to properly normalize the number of events to the number we would have expected from data given an integrated luminosity (L). Normally, we can calculate the expected number of events by using the theoretical cross-section:

N = L x σ

However, in MC, we usually don't have a sample that has N events. Since this is MC and we're really after the PDF(s) of the N and all the other kinematical variables associated with this process (e.g. # jets, parton pT, etc.), we need as many events as practical. Some of these MC samples are quite large and can be O(TB) for AOD level samples.

Since the number of events generated, Ngen, is usually much larger than N, we need a way to scale our observations from this MC sample to what we would have expected if it were real data. Consider the following:

N = N x (Ngen / Ngen)

= Ngen x (N / Ngen)

= Ngen x (L x σ/Ngen)

= Ngen x L x (σ/Ngen)

= Ngen x L x scale1fb

Basically, we rewrite the equation to factor the MC dependent variables into one number we call "scale1fb". This is the number we store in our CMS2 ntuples so that we can weight each event so that our distributions (histograms) have the correct normalization.

Scale1fb TTree::Draw example

Here I show a couple of examples of using scale1fb with TTree::Draw statements. Consider the following example code fragment:

TChain chain("Events");
chain.Add("/nfs-7/userdata/rwkelley/cms2/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1.root");

// integrated luminosity
const double lumi = 0.082; //fb^-1

// get the raw number of entries for true e^+e^- events
const TCut selection   = "Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1";
const double raw_count = chain.GetEntries(selection);

// information from twiki: http://www.t2.ucsd.edu/tastwiki/bin/view/CMS/Summer12MonteCarlo53X_Slim_Winter13#Drell_Yan
const double xsec       = 3532.81; // pb
const double kfactor    = 1.0;     // kinematically dependent factor to scale to a higher order calc -- not really used as much
const double filt_eff   = 1.0;     // Filter efficiency for the generator level (done during MC generation at the CMSSW level)
const double nevts_file = chain.GetEntries();

// scale1fb = the factor used to weight each event to get the expected # of events per fb^-1
const double scale1fb = (xsec * kfactor * 1000.0f * filt_eff)/nevts_file;

// output
std::cout << "raw_count    = " << raw_count << "\n";
std::cout << "scale1fb     = " << scale1fb << "\n";
std::cout << "scaled count = lumi * scale1fb * raw_count = " << lumi*scale1fb*raw_count << "\n\n";

Here, I calculate the scale1fb by hand for demonstration purposes. Notice that there are other factors that went into the scale1fb:

variable      description
xsec          The production cross-section x BR for this MC sample.
kfactor       Kinematically dependent factor to scale the xsec to a higher order calculation -- not used as much now.
filt_eff      The filter efficiency at the generator level. A non-unity value indicates that not all generated events are kept and run through sim/reco.
nevts_file    The number of events in this cms2-level file.

The result of this for lumi = 0.082 fb^-1 is:

raw_count    = 23332
scale1fb     = 56.5367
scaled count = lumi * scale1fb * raw_count = 108167

This seems high. From our exercise earlier, we estimated the number of Z → e+e- events to be ~96k (if you include τ → eν decays). What did we do wrong?

We did not account for the fact that the MC sample was filtered when the ntuple was created from the AOD to the CMS2 level. Specifically, in general, N_CMS2 ≠ N_AOD. To get an accurate scale1fb, you need the number of events before the CMS2 ntuple making procedure filtered out some events. To correct for this, we need to get the CMS2 filter efficiency from the twiki. We correct the scale1fb by:

scale1fb' = scale1fb x (N_CMS2 / N_AOD)

In code:

const double nevts_aod  = 30459503; 
const double nevts_cms2 = 27137253;
const double cms2_filter_eff = nevts_cms2/nevts_aod;
std::cout << "properly scale count = lumi * scale1fb * raw_count * cms2_filter_eff = " << lumi*scale1fb*raw_count*cms2_filter_eff << "\n\n";

This gives:

properly scale count = lumi * scale1fb * raw_count * cms2_filter_eff = 96369.5

This is much more consistent.

In practice, the scale1fb is computed using N_AOD when the CMS2 ntuples are generated and is stored in the branch called evt_scale1fb. This and the following related branches are added to the ntuples in the "post-processing" step of generating the ntuples:

variable        description
evt_scale1fb    Event weight to scale to the number of events expected for 1 fb^-1 of data
evt_nEvts       Number of events generated, N_AOD
evt_kfactor     Factor to scale LO to NLO+ -- a bit antiquated but historical
evt_filt_eff    Filter efficiency at the generator level
evt_xsec_excl   Exclusive cross-section
evt_xsec_incl   Inclusive cross-section

Repeating the above exercise using the branches stored in the ntuple:

// scale1fb correction since we used a subset of the events
const double scale = nevts_cms2/nevts_file; 

// get the scaled entries
const TCut scaled_selection  = Form("%f*%f*evt_scale1fb*(Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1)", lumi, scale);
TH1D* const h_scale1fb_count = new TH1D("h_scale1fb_count", "h_scale1fb_count", 3, -0.5, 2.5);
chain.Draw("1>>h_scale1fb_count", scaled_selection, "goff");

std::cout << "hist entries    = " << h_scale1fb_count->GetEntries()   << "\n";
std::cout << "scale1fb        = " << chain.GetMaximum("evt_scale1fb") << "\n";
std::cout << "applying weight = " << scaled_selection.GetTitle()      << "\n";
std::cout << "scale1fb_count  = " << h_scale1fb_count->Integral()     << "\n";

Again -- when working on the full cms2 ntuples, you do not have to do the following correction and scaling with evt_scale1fb is sufficient. However, since we are using a subset of the cms2 ntuples, we still have to account for the cms2 filter efficiency and scale again, because evt_scale1fb was calculated using the whole sample -- not one file.

scale1fb_1file = σ / N_file_if_aod

where N_file_if_aod is the # of events in this file before it was skimmed by the CMS2 ntuple making filter (which we didn't store anywhere):

N_file_if_aod = N_file x (N_AOD / N_CMS2) = N_AOD x (N_file / N_CMS2)

so

scale1fb_1file = (σ / N_AOD) x (N_CMS2 / N_file)

scale1fb_1file = evt_scale1fb x (N_CMS2 / N_file)

The result is:

hist entries    = 23332
scale1fb        = 0.115984
applying weight = 0.082000*434.286380*evt_scale1fb*(Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1)
scale1fb_count  = 96369.6

The full example code can be found here.

Scale1fb TTree looper example

The above example is useful for quick studies on the command line. Also, this can be done in looper code when filling histograms.

void DrellYanLooper::Analyze(const long event)
{
    // select e+e- events (NOTE: no test for Z)
    bool contains_eplus  = false;
    bool contains_eminus = false;
    bool is_ee           = false;
    for (size_t idx = 0; idx < tas::genps_id().size() && !is_ee; ++idx)
    {
        if (tas::genps_status().at(idx) != 3) {continue;}
        if (tas::genps_id().at(idx) ==  11  ) {contains_eminus = true;}
        if (tas::genps_id().at(idx) == -11  ) {contains_eplus  = true;}
        is_ee = (contains_eplus && contains_eminus);
    }
    if (!is_ee) {return;}

    // --------------------------------------------------------- //
    // scale factor to scale assuming running on full MC sample 
    // --------------------------------------------------------- //

    const double lumi = 0.082; //fb^-1
    const double scale_full = lumi * tas::evt_scale1fb();
    hc["h_ee_yield_full"]->Fill(1.0, scale_full);

    // --------------------------------------------------------- //
    // scale1fb correction since we used a subset of the events
    // --------------------------------------------------------- //

    // information from twiki: http://www.t2.ucsd.edu/tastwiki/bin/view/CMS/Summer12MonteCarlo53X_Slim_Winter13#Drell_Yan
    const double nevts_aod   = tas::evt_nEvts();       // number of events run in CMSSW job to make ntuple
    const double nevts_cms2  = 27137253;               // number of events after the CMS2 level filter
    const double nevts_file  = m_num_events;           // number of events in the current job
    const double nevts_scale = nevts_cms2/nevts_file;  // scale up the weight to account for using subset of full dataset 

    const float scale_subset = scale_full * nevts_scale;
    hc["h_ee_yield_sub"]->Fill(1.0, scale_subset);
}

The result is:

[rwkelley@uaf-7 DrellYan]$  dy_scale1fb_example 
Loading CMSSW FWLite.
running drell-yan looper...
raw  count   = 23332
full count   = 221.903
subset count = 96369.6

Here the "full count" is the yield assuming that the input was the full dataset. Since we ran on a subset of the full dataset, we had to scale it.

A full analysis example -- measuring the Z cross section

Goal: measure σ(pp → Z) x BR(Z → ee/μμ) for a subset of the 8 TeV data. We will use MC for all backgrounds and ignore details such as scale factors.
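
Schematically, a counting-experiment cross section measurement boils down to:

σ x BR = (Nobs - Nbkg) / (A x ε x L)

where Nobs is the observed dilepton yield in data, Nbkg is the estimated background yield, A x ε is the acceptance times selection efficiency (taken from MC in this exercise), and L is the integrated luminosity.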

References

Datasets for Exercise

Single electron/muon data

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Data2012/CMSSW_5_3_2_patch4_V05-03-24/SingleMu_Run2012A-recover-06Aug2012-v1_AOD/merged/*.root");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Data2012/CMSSW_5_3_2_patch4_V05-03-24/SingleElectron_Run2012A-recover-06Aug2012-v1_AOD/merged/*.root");

Drell-Yan

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");

W + Jets → l + ν

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WJetsToLNu_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-28/merged_ntuple_1[0-9].root");

tt(bar) → 2l + 2ν

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/TTJets_FullLeptMGDecays_8TeV-madgraph_Summer12_DR53X-PU_S10_START53_V7A-v2/V05-03-24/merged_ntuple_1[0-9].root");

tt(bar) → l + ν + jj

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/TTJets_SemiLeptMGDecays_8TeV-madgraph_Summer12_DR53X-PU_S10_START53_V7A_ext-v1/V05-03-24/merged_ntuple_1[0-9].root");

tt(bar) → hadronic

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/TTJets_HadronicMGDecays_8TeV-madgraph_Summer12_DR53X-PU_S10_START53_V7A_ext-v1/V05-03-24/merged_ntuple_1[0-9].root");

QCD muon enriched (use for μμ analysis). For now, let's ignore QCD for ee analysis since it requires more statistics than is practical without using the grid.

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/QCD_Pt_20_MuEnrichedPt_15_TuneZ2star_8TeV_pythia6_Summer12_DR53X-PU_S10_START53_V7A-v3/V05-03-18_slim/merged_ntuple_1[0-9].root");

WW → 2l + 2ν

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WWJetsTo2L2Nu_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");

WZ → 2l + 2q

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WZJetsTo2L2Q_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");

WZ → 3l + ν

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/WZJetsTo3LNu_TuneZ2_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");

ZZ → 2l + 2ν

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/ZZJetsTo2L2Nu_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v3/V05-03-23/merged_ntuple_1[0-9].root");

ZZ → 2l + 2q

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/ZZJetsTo2L2Q_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");

ZZ → 4l

    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/ZZJetsTo4L_TuneZ2star_8TeV-madgraph-tauola_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");

Deliverables

  1. Due Friday 3/14: Set of slides to summarize their understanding of what we went over. Make sure you cover the following:
    • How is a cross section measured --> what are the pieces?
    • What is the signal?
    • What is a "prompt" lepton?
    • What are the major backgrounds for this analysis? Why are they backgrounds?
  2. Due Friday 3/14: A spreadsheet of the expected number of events for the signal and the background
  3. Due: Friday 4/25: m_{ll} plots and table of yields:
    • A total of 4 plots
      • gen level ee yields and m_{ee} plot
        • count events with 2 true opposite-signed and same-flavor leptons (OSSF) normalized properly (luminosity*cross section)
        • only select one hypothesis per event in processes that could have more than one OSSF pair
        • numbers should be consistent with your spreadsheet numbers
      • m_{mm} same as above
      • reco m_{ee} plot with opposite-signed and same-flavor leptons (OSSF)
        • no gen matching requirement
        • Select all OSSF lepton pairs (no hypothesis choice yet)
        • data overlaid with a stacked plot of all the backgrounds
        • Background should be normalized to the lumi*cross-section
      • reco m_{mm} plot same as above
  4. Due Friday 4/25: A Draft of your talk for the following deliverable.
  5. Due Tuesday 4/29 or Friday 5/2: Write up a 20 minute presentation on the e/μ ID or Isolation.
    • ~3/4 of talk should be on: What is it and why do we need it? What are the variables involved?
    • ~1/4 of talk should be on: How do I implement it using CMS2 ntuples.
    • schedule:
      • Dan: electron identification (Tuesday)
      • Mark: muon identification (Tuesday)
      • Elizabeth: electron Isolation (Friday)
      • Liam: muon isolation (Friday)
  6. Due Friday 5/16: Implement a cut flow, kinematic and N-1 plots.
  7. Due Friday 5/27: Implement a cut flow, kinematic and N-1 plots (2nd iteration).
  8. Due Friday 6/13: Implement all selections
  9. Due Friday 6/20: Slides of the Results at the level of a "full status report"

Current Results

Place your latest version of the running slides here (PDF please). Update them as often as you wish, but have them updated with the above deliverables by the deadline.

Results slides

Place your talk slides here (PDF please). Update them as often as you wish, but have them updated before the meeting at which you wish to present.

ID/Iso talks:

FAQ

The following questions have been repeatedly asked so I wanted to write the answer down in a central place.

Do I chain the data into one chain or keep them separate?

You can of course chain them all up into one huge chain and use some logic to parse which dataset is which; however, I would advise against it. The disadvantage to this technique is two-fold:

  1. The job will take much longer to run.
  2. The analysis code has to be more complicated to handle the different cases. More potential for bugs...

I would design the code to run on each dataset separately and produce either a set of plots and/or a baby ntuple. This way each dataset can be a separate job and if you need to regenerate a single dataset (e.g. add more files or change the MC you use), you do not have to run all the jobs again.

What is the most efficient work flow?

This is of course user preference and there is no canonical answer. I will describe what I like to do:

  • Write a looper that will produce either a set of plots or a baby
  • Write a main program that will call this looper and take as arguments dataset dependent metadata:
    • dataset to run on
    • input files
    • output files
    • number of events
    • other dataset dependent issues...
  • Write a script (bash, python, or whatever) to call the jobs in parallel on the uaf. I personally like to use python for this, but use whatever you are most comfortable with. The following is a pseudo-script to illustrate my point:

#!/bin/bash

nevts=-1

root -q -b -l "macros/run_sample.C ($nevts, \"data\" , \"output/data_baby.root\")"  >& logs/data_example.log &
root -q -b -l "macros/run_sample.C ($nevts, \"dyll\" , \"output/dyll_baby.root\")"  >& logs/dyll_example.log &
root -q -b -l "macros/run_sample.C ($nevts, \"wjets\", \"output/wjets_baby.root\")" >& logs/wjets_example.log &
# ...

  • write a ROOT compiled macro (i.e. .L macro.C++) that will make the yield tables from the output ROOT files above. I have this script include an option to produce LaTeX code that I can cut and paste into LaTeX/LaTeXiT to make nice tables.
  • write a ROOT compiled macro (i.e. .L macro.C++) that will make the overlay plots from the output ROOT files above (see the sketch after this list). I have this script include an option to print different plot types: png, eps, pdf.
    • I find that eps make the nicest looking plots in ppt
    • Most people use pdf for papers
    • I like png for quick plots because I want to look at them without the delay it takes OSX's finder to convert and display them.
  • Take the final plots and table and drop them into your ppt/paper.
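
As an illustration of the overlay step, here is a minimal sketch of such a plotting macro (the file, histogram, and directory names are made up, not the tutorial's actual macro):

#include "TFile.h"
#include "TH1.h"
#include "TCanvas.h"
#include "TLegend.h"

void overlay_sketch()
{
    // open the per-dataset outputs produced by the looper jobs (hypothetical names)
    TFile* const f_data = TFile::Open("output/data_baby.root");
    TFile* const f_dyll = TFile::Open("output/dyll_baby.root");

    TH1* const h_data = dynamic_cast<TH1*>(f_data->Get("h_dilep_mass"));
    TH1* const h_dyll = dynamic_cast<TH1*>(f_dyll->Get("h_dilep_mass"));

    h_dyll->SetLineColor(kRed);

    TCanvas canvas("c1", "dilepton mass overlay");
    h_data->Draw("e");          // data with error bars
    h_dyll->Draw("hist same");  // MC overlaid on top

    TLegend legend(0.65, 0.75, 0.88, 0.88);
    legend.AddEntry(h_data, "data" , "lep");
    legend.AddEntry(h_dyll, "DY MC", "l"  );
    legend.Draw();

    // print whichever formats you need
    canvas.Print("plots/dilep_mass_overlay.png");
    canvas.Print("plots/dilep_mass_overlay.eps");
    canvas.Print("plots/dilep_mass_overlay.pdf");
}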

How do I present the answer?

I would like you to append the results to your DY slides. The slides are meant to be a running documentation of your progress. Please add some meaningful observations and do not just dump numbers and plots into a table. Also, you should compare them to the numbers you expect from both the AN and your Excel exercise.

Where is the definition of the CMS2 branches?

This is a tough question. In an ideal world, we'd have a department to document our code and dataformats; however, this is physics code so that is never going to happen. You will basically need to learn to read the CMS2 ntuple maker code to interpret what each branch means:

https://github.com/cmstas/NtupleMaker/tree/master/src

I suggest (if you haven't already) checking out this code so that you can grep it for the branches you need to look at.

What branches should I use?
You can see what branches are available by:

Events->GetListOfAliases()->ls()

However, the following blocks will be important for this analysis (see the example after the list below):

  • genps_* is the generator level block
  • evt_* is the event level information such as scale1fb, run, event, etc.
  • mus_* reco muon information
  • els_* reco electron information
  • hyp_* the dilepton hypothesis block. These are convenience branches produced for every reco ee, eµ, or µµ pair in the event.
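
For example, a quick interactive look at the generator block (using the same branch names as the scale1fb examples above):

// how many events contain at least one status-3 electron?
Events->GetEntries("Sum$(genps_status==3 && abs(genps_id)==11)>=1");

// draw the pdg IDs of all status-3 particles
Events->Draw("genps_id", "genps_status==3");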

Should I make a baby ntuple?

I would suggest making a baby to facilitate this analysis. A baby is usually a simple ntuple with the "final" answer and high-level variables to make plots of relevant quantities. It is usually not meant for detailed studies -- for those, you usually drop back down to the CMS2 level or even AOD.

Should I use TTree::Draw or should I use the looper code?

I would like you to at least do this with looper code. Looper code is more complicated initially but is more efficient and more flexible, which is why we went through it in so much detail. With that said, it is not a bad idea to also do this with TTree::Draw as a cross-check and to gain that facility. For quick studies, it can be much quicker to make a few plots with TTree::Draw.

Should I include uncertainties?

Yes, any measured number should at least have an associated statistical uncertainty. Systematic uncertainties are beyond the scope of this exercise, and we shall ignore them.
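
One practical note: when filling histograms with weights (e.g. lumi*evt_scale1fb), call TH1::Sumw2() before filling so that ROOT keeps track of the sum of squared weights and GetBinError() returns the correct statistical uncertainty. A small illustration (the numbers are made up):

#include "TH1D.h"
#include <iostream>

void sumw2_sketch()
{
    TH1D hist("h_yield", "weighted yield", 1, 0.0, 2.0);
    hist.Sumw2();  // store sum of squared weights so bin errors are correct

    // fill some events with an example per-event weight
    const double weight = 0.082 * 0.116;  // e.g. lumi * scale1fb (illustrative)
    for (int i = 0; i < 1000; ++i) {hist.Fill(1.0, weight);}

    std::cout << "yield = " << hist.GetBinContent(1)
              << " +/- "    << hist.GetBinError(1) << std::endl;
}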

How do I check my gen yield?

It is often a good idea (if you can), to check your results in more than one way. You can do this on the ROOT command line doing the following:

{
    // Drell-Yan MC sample
    TChain chain("Events");
    chain.Add("/hadoop/cms/store/group/snt/papers2012/Summer12_53X_MC/DYJetsToLL_M-50_TuneZ2Star_8TeV-madgraph-tarball_Summer12_DR53X-PU_S10_START53_V7A-v1/V05-03-23/merged_ntuple_1[0-9].root");
    const double nevts_cms2 = 27137253;
    const double nevts_file = chain.GetEntries(); 
    const double lumi       = 0.082;

    // selections
    TCut raw_ee    = "Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1";
    TCut scaled_ee = Form("%f*evt_scale1fb*(Sum$(genps_status==3 && genps_id==11)>=1 && Sum$(genps_status==3 && genps_id==-11)>=1)", lumi*nevts_cms2/nevts_file);
    cout << "raw_ee = " << raw_ee.GetTitle() << endl;
    cout << "scaled_ee = " << scaled_ee.GetTitle() << endl;

    // raw yields
    const double y_raw_ee = chain.GetEntries(raw_ee);
    cout << "raw ee yield = " << y_raw_ee << endl;

    // scaled yields
    TH1D* h_gen_yield = new TH1D("h_gen_yield", "h_gen_yield;m_{ee}(GeV);# events", 3, 0, 3);
    chain.Draw("1>>h_gen_yield", scaled_ee, "goff");
    h_gen_yield->Draw();
    cout << "scaled ee yield = " << h_gen_yield->Integral(0, -1) << endl;

    // another way to get scaled yields
    const double evt_scale1fb = chain.GetMaximum("evt_scale1fb");
    const double event_scale = lumi * evt_scale1fb * nevts_cms2/nevts_file;
    cout << "scaled ee yield (v2) = " << y_raw_ee * event_scale << endl;
}

-- RyanKelley - 2014/05/20

Topic attachments

Attachment                                                      Size       Date         Who             Comment
Cert_190782-190949_8TeV_06Aug2012ReReco_Collisions12.json       0.2 K      2014/03/31   RyanKelley      JSON file
Cert_190782-190949_8TeV_06Aug2012ReReco_Collisions12.lumi       1.4 K      2014/03/31   RyanKelley      Luminosity measurement
AN2008_098_v1.pdf                                               3881.9 K   2014/03/31   RyanKelley      Muon ID
AN2012_067_v6.pdf                                               1811.9 K   2014/03/07   RyanKelley      Z --> ll analysis note for 8 TeV data
CMS_drellyanslides.pdf                                          290.6 K    2014/05/09   ElizabethWicks  Elizabeth's Drell-Yan slides
MuonReco.pdf                                                    5456.4 K   2014/03/31   RyanKelley      Muon reconstruction
NOTE2006_040.pdf                                                3780.4 K   2014/03/31   RyanKelley      Electron reco and ID
StandardModelCrossSectionsat8TeVTwiki.pdf                       497.4 K    2014/03/10   RyanKelley
dan_dyslides.pdf                                                254.1 K    2014/04/26   DanielKlein
dan_elid.pdf                                                    317.5 K    2014/04/29   DanielKlein
elizabeth_iso1.pdf                                              502.4 K    2014/05/02   RyanKelley
liam_iso2.pdf                                                   482.8 K    2014/05/02   RyanKelley
mark_dyslides.pdf                                               519.7 K    2014/04/26   RyanKelley
mark_muid.pdf                                                   1120.5 K   2014/05/23   DanielKlein
ryan_dyslides.pdf                                               1298.6 K   2014/04/26   RyanKelley
eff_vs_eta.png                                                  16.7 K     2013/12/09   RyanKelley
export_example.png                                              307.2 K    2014/02/14   RyanKelley
export_example2.png                                             311.7 K    2014/02/14   RyanKelley
h_den_vs_eta_compiled.png                                       18.5 K     2013/12/09   RyanKelley
h_den_vs_eta_looper.png                                         18.5 K     2013/12/09   RyanKelley
h_eff_vs_eta_compiled.png                                       17.7 K     2013/12/09   RyanKelley
h_eff_vs_eta_looper.png                                         17.7 K     2013/12/09   RyanKelley
h_num_vs_eta_compiled.png                                       18.4 K     2013/12/09   RyanKelley
h_num_vs_eta_looper.png                                         18.4 K     2013/12/09   RyanKelley
sigma_res.png                                                   45.6 K     2013/12/18   RyanKelley
tps_p4_pt_ex1.png                                               13.7 K     2013/12/09   RyanKelley
tps_p4_pt_ex2.png                                               13.9 K     2013/12/09   RyanKelley
tps_p4_pt_ex3.png                                               15.2 K     2013/12/09   RyanKelley
tps_p4_pt_overlay.png                                           16.8 K     2013/12/09   RyanKelley
Cert_190782-190949_8TeV_06Aug2012ReReco_Collisions12_cms2.txt   0.2 K      2014/03/31   RyanKelley      Flattened CMS2 friendly JSON
ZAnalysis_Expected_Events.xlsx                                  10.2 K     2014/04/14   RyanKelley