Summer Project for Nathan Manohar 2011
Context:
When we search for new physics at the LHC, we invariably design different signal regions that are sensitive to different types of new physics.
Typically these different signal regions differ substantially in the amount of background they receive from Standard Model sources.
Generally a given type of new physics is expected to contribute different amounts to these different signal regions.
We thus need to combine all signal regions statistically, properly taking into account how much signal the new physics deposits into each.
E.g. it is immediately obvious that a region with a lot of background but little signal should enter the combined result with a different weight than a region with
lots of signal and little background.
Goals of the project:
The project slices up into different steps as follows:
- Write software based on the statistical ideas implemented in LandS that combines N signal regions using statistical errors only.
- Write a toy Monte Carlo to prove that your software works as it should. This entails generating many fictitious experiments drawn from random numbers, and showing that your software has the correct statistical behaviour (see the sketch after this list).
- Modify the software to be able to take into account systematic errors that are correlated across the different signal regions. While you are at it, also implement the easier case of uncorrelated systematic errors. It's ok to cover only two cases, 100% correlated and 100% uncorrelated.
- Modify the toy Monte Carlo to prove that your modified software works as it should.
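To make the toy Monte Carlo bullet concrete, here is a minimal sketch (my own illustration, not project code; it assumes numpy, and the function name and test statistic are placeholders): draw pseudo-experiments from the example background expectations used further below, compute a background-only p-value per pseudo-experiment, and check that the p-values come out roughly uniform, as they must if the statistical behaviour is correct. The test statistic here is just the total event count and should be replaced by whatever your combination code actually computes.

import numpy as np

rng = np.random.default_rng(42)

def toy_pvalues(bkg, n_toys=5000, n_inner=2000):
    """Background-only p-values for n_toys pseudo-experiments over N regions."""
    bkg = np.asarray(bkg, dtype=float)
    pvals = []
    for _ in range(n_toys):
        n_obs = rng.poisson(bkg)                        # fake "data observed"
        t_obs = n_obs.sum()                             # placeholder test statistic
        t_ref = rng.poisson(bkg, size=(n_inner, len(bkg))).sum(axis=1)
        pvals.append((t_ref >= t_obs).mean())
    return np.array(pvals)

pvals = toy_pvalues([2.3, 4.0, 16.0])
# up to the discreteness of the counts, this histogram should come out roughly flat
print(np.histogram(pvals, bins=10, range=(0.0, 1.0))[0])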
In the end, we will want to use your software in the same-sign dilepton search we are doing, and we expect to publish with the full 2011 CMS dataset.
In that analysis, we will most likely distinguish 5 signal regions, as follows:
- "high" lepton pT, i.e. a leading lepton with pT .gt. 20 GeV? and a trailing lepton .gt. 10 GeV? .
- High MET low HT
- High MET high HT
- low MET high HT
- "low" lepton pT and .NOT. high lepton pT, i.e. electrons with pT as low as 10 GeV? and muons all the way down to 5 GeV? .
- High MET high HT
- low MET high HT
The low pT leptons always require high HT because of event triggering.
I'll add some example numbers for you later, and either Aneesh, Yanjun, or fkw can explain what all of this means at some point.
For now, all you need to know is that we want to use your software to combine roughly 5 signal regions for a publication
in late 2011 or early 2012.
Implementation Suggestion:
Your statistics program will be driven by external inputs.
It's probably easiest to drive it from just one file, with different flags identifying different portions of the input directives.
E.g. imagine something like this:
[data observed]
#Region1 ....
0 3 16
[bkg data predicted]
# Region1 errorRegion1 Region2 errorRegion2 ...
2.3 0.9 4.0 2.0 16.0 3.3
[signal data]
# x-scan y-scan Region1 errorRegion1 Region2 errorRegion2 ...
250 250 23.0 0.3 40.0 2.0 1.2 0.3
300 250 .... you get the idea
[error correlation on signal]
#relative error in % correlation yes/no for each region
20.0 1 1 1 0 0
25.0 0 0 0 1 1
10.0 1 1 1 1 1
... etc. ... you get the idea ...
[error correlation on bkg]
... same idea as for signal ...
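For what it's worth, reading a file in this format is only a few lines of code. The sketch below (Python, purely illustrative; the function name and the returned structure are my own choices) collects each bracketed section into a list of numeric rows, skipping comments and blank lines:

def read_input(path):
    """Return a dict mapping section name -> list of rows of floats."""
    sections = {}
    current = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):            # skip blanks and comments
                continue
            if line.startswith("[") and line.endswith("]"):
                current = line[1:-1]                         # e.g. "data observed"
                sections[current] = []
            elif current is not None:
                sections[current].append([float(x) for x in line.split()])
    return sections

# e.g. sections["data observed"][0] would be [0.0, 3.0, 16.0], and each row of
# sections["signal data"] would be [x, y, yield1, err1, yield2, err2, ...]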
The basic ideas here are:
- you have an observation in each region that is an integer number of events
- you have a predicted number of bkg events that is a float and has an error for each region
- you have a family of models that are defined via two parameters x-scan and y-scan.
- for a given (x, y) point you have a prediction of the signal yield for each region, and a corresponding error.
This is all you need for the first two parts of the exercise, i.e. before we deal with correlated systematic errors on signal and background.
The last two sets of entries then deal with the introduction of systematic errors and their correlations.
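To fix ideas (my notation, not anything official): for a given scan point, the natural counting-experiment likelihood across the N regions is

  L(\mu) = \prod_{i=1}^{N} \frac{(\mu s_i + b_i)^{n_i}}{n_i!} \, e^{-(\mu s_i + b_i)}

where n_i is the observed count in region i, b_i the predicted background, s_i the predicted signal yield at that scan point, and \mu an overall signal-strength multiplier. The systematic errors from the last two input sections then enter as nuisance parameters on the s_i and b_i.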
Your software will read the input file, do its thing, and write out an output file that is identical to the input file, except that to each line of "signal data"
you append the probability that a signal like this is consistent with the "data observed" given the "bkg expected" across the N signal regions.
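As an illustration of the output step (again just a sketch; the function name and formatting are my own assumptions), one can copy the input file line by line and append the computed probability to every data row inside the [signal data] section:

def write_output(in_path, out_path, probabilities):
    """probabilities: one value per data row of the [signal data] section, in order."""
    probs = iter(probabilities)
    in_signal = False
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            stripped = line.strip()
            if stripped.startswith("[") and stripped.endswith("]"):
                in_signal = (stripped == "[signal data]")   # track which section we are in
            elif in_signal and stripped and not stripped.startswith("#"):
                line = line.rstrip("\n") + "  %.4g\n" % next(probs)
            fout.write(line)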
Now you should know everything you need to know except the statistical algorithm that you are supposed to implement.
I still need to look that up. I think the statistics we need is what underlies the LandS program that is used for the Higgs to WW analysis
to combine the 0-, 1-, and 2-jet bins. There they have three signal regions with different bkg levels, and a one-dimensional scan across the Higgs mass.
So in principle, the statistics used there ought to be applicable to our case as well, probably with modifications in how the inputs are presented, and
possibly in how the output probabilities per scan point are presented.
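While the exact algorithm still needs to be pinned down, the flavour of the calculation is roughly the following sketch. It assumes a Bayesian counting-experiment combination with a flat prior on the signal-strength multiplier mu and statistical errors only (i.e. the first two project steps); the function names are my own, and this is an illustration of the idea, not the LandS algorithm. Depending on what LandS actually does, the number appended per scan point could instead be a posterior probability or a CLs value rather than the upper limit computed here.

import math

def log_likelihood(n_obs, sig, bkg, mu):
    """Combined Poisson log-likelihood over the N regions for signal strength mu."""
    logL = 0.0
    for n, s, b in zip(n_obs, sig, bkg):
        lam = mu * s + b                                 # expected yield in this region
        logL += n * math.log(lam) - lam - math.lgamma(n + 1)
    return logL

def bayes_upper_limit(n_obs, sig, bkg, cl=0.95, mu_max=10.0, steps=2000):
    """Bayesian upper limit on mu at credibility level cl, flat prior on [0, mu_max]."""
    mus = [mu_max * i / steps for i in range(steps + 1)]
    logs = [log_likelihood(n_obs, sig, bkg, mu) for mu in mus]
    lmax = max(logs)                                     # avoid underflow when exponentiating
    post = [math.exp(l - lmax) for l in logs]
    total = sum(post)
    running = 0.0
    for mu, p in zip(mus, post):
        running += p
        if running >= cl * total:
            return mu
    return mu_max

# using the example numbers from the input file sketch above; a scan point would
# be excluded at 95% CL if the limit on mu comes out below 1
print(bayes_upper_limit([0, 3, 16], [23.0, 40.0, 1.2], [2.3, 4.0, 16.0]))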
LandS installation instructions and documentation
From cvs at CERN:
export CVSROOT=:ext:fkw@cmscvs.cern.ch:/cvs_server/repositories/CMSSW
export CVS_RSH=ssh
#the above should just work for you too. The cms cvs is world readable.
#the below needs to be modified to point to your ROOT installation
export ROOTSYS=/code/osgcode/UCSD_root/root_v5.28.00
export MANPATH=$MANPATH:$ROOTSYS/man
export PATH=$ROOTSYS/bin:$PATH
export LD_LIBRARY_PATH=$ROOTSYS/lib
#the below is the LandS installation
cvs co -d LandS UserCode/mschen/LandS
cd LandS
cvs up -r V2011-04-BinnedShapeAnalysis
make all
test/lands.exe --help
scp uaf-2.t2.ucsd.edu:~fkw/inputLimits.tgz .   #copy the example inputs from the uaf; adjust to however you normally fetch files
tar -xvzf inputLimits.tgz
#and now you can test the code
test/lands.exe -M Bayesian -d inputLimits/inputs_0j_200pb_cut/hww-SM-mH160.txt --doExpectation 1 -t 10000
The documentation for LandS is supposedly here.
However, I've never read those pages. No idea if they are at all useful.
--
FkW - 2011/05/31