Difference: GlideinWMSCrabSSC6 (3 vs. 4)

Revision 4 2012/08/28 - Main.JamesLetts

Line: 1 to 1
 
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Line: 27 to 27
 condor_hold uscmsxxx
Changed:
<
<
Remove the local userid from the /etc/passwd file on all submitter nodes. This will block any further submissions.
>
>
Remove the local userid from the /etc/passwd file on all submitter nodes. This will block any further submissions.
 

Collecting Information

Information an operator should collect:

Changed:
<
<
  • find out which sites jobs ran on
  • incoming IP address from which jobs were submitted (Do we have this information in the CRAB SERVER, and if so, where?)
>
>
  • which sites jobs from the banned user ran on
  • names of pilots which ran jobs from the banned user
  • incoming IP address from which jobs were submitted
 

Detailed Procedures

Line: 47 to 48
 Cluster = 31607 JOB_Site = "JINR"
Added:
>
>
Pilot names are also available in the EventLog. From this information it should be possible to determine which other jobs ran on the same pilots that may have been compromised, if any.
 
Changed:
<
<
IP address from which jobs were submitted? - STEFANO
>
>
The IP address from which jobs were submitted is more difficult to determine. In principle, this information is in two logs in $PRODAGENT_WORKDIR/CommandManager:
  • ComponentLog says that there was e.g. a request to submit a new task.
  • FrontendLog says that IP n connected at time t.
However, there is no guaranteed relationship between the entries in the two logs.

FrontendLog is written by $CRABSERVER_ROOT/src/python/CommandManager/server_side/server2.c. S.B. briefly looked at whether it could easily be changed to add the task name to the IP connection message (the user's DN does not seem to be available there, but the task name would do). The code looks too complicated to understand in a short time, and we should not make very extensive changes to CRAB2 at this time.
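Since the two logs can only be related by time, one option is to list, for each submit request, the IP addresses that connected shortly before it. The sketch below is only an illustration: the real format of ComponentLog and FrontendLog is not described here, so it assumes the entries have already been parsed by hand into (time, IP) and (time, task name) pairs. The function name, the 5 second window, the IP addresses, and the task names are all hypothetical. The result is a list of candidates, not a proof of which IP submitted which task.

```python
from datetime import datetime, timedelta


def candidate_ips(connections, requests, window_seconds=5):
    """For each task request, list the IPs that connected shortly before it.

    connections: list of (datetime, ip) pairs taken from FrontendLog (assumed parsed).
    requests:    list of (datetime, task_name) pairs taken from ComponentLog (assumed parsed).
    """
    window = timedelta(seconds=window_seconds)
    result = {}
    for req_time, task in requests:
        # Keep connections made at most `window` before the request.
        ips = sorted({ip for conn_time, ip in connections
                      if timedelta(0) <= req_time - conn_time <= window})
        result[task] = ips
    return result


# Made-up example entries (documentation IP range, hypothetical task names).
connections = [
    (datetime(2012, 8, 27, 10, 0, 0), "192.0.2.10"),
    (datetime(2012, 8, 27, 10, 0, 3), "192.0.2.20"),
    (datetime(2012, 8, 27, 11, 0, 0), "192.0.2.30"),
]
requests = [
    (datetime(2012, 8, 27, 10, 0, 4), "task_A"),
    (datetime(2012, 8, 27, 11, 0, 2), "task_B"),
]
matches = candidate_ips(connections, requests)
for task, ips in matches.items():
    print(task, ips)
```

When several IPs connect within the window, as for task_A here, all of them are returned and the operator has to decide which is plausible.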

 

Other actions based on information collected:

  • Notify sites where jobs ran. Note that individual jobs could have run on more than one site!
Line: 94 to 102
 ./frontend_startup reconfig ../instance_v5_4.cfg/frontend.xml
Changed:
<
<
how to kill all pilots with DN=X
>
>
To remove all running and queued pilots with a particular DN, it is necessary to contact the Factory Operators (osg-gfactory-support@physics.ucsd.edu). Also ask them for a history of pilots that ran at sites with that DN since the time of the incident (pilot name of the form "glidein_15640@node20-9.wn.iihe.ac.be", site).
 

Collecting Information

Changed:
<
<
  • find out which sites pilot jobs ran on using this proxy and notify them
>
>
  • find out which sites pilot jobs ran on using this proxy (above) and notify them
 
  • find out which users had jobs which ran on pilots with a compromised proxy

Detailed Procedures

Line: 140 to 149
  This is 33 out of 39 sites running glideins at this time.
Changed:
<
<
WRITE HOW TO GET THIS LIST OF SITES, USERS.
>
>
To get detailed information about pilot history, ask Factory Ops for a history of pilots that ran at sites with that DN since the time of the incident (pilot name of the form "glidein_15640@node20-9.wn.iihe.ac.be", site). You can then cross-reference the list of pilots against the output of condor_history (potentially a lot of work).
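The cross-referencing step could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the condor_history output has already been exported into one dictionary per job, and the job attribute LastRemoteHost holds the pilot name, possibly prefixed by a slot name such as "slot1@". A job that ran more than once only records its last host in that attribute, so such jobs may be missed. The helper names and the user names in the example are hypothetical.

```python
def pilot_of(remote_host):
    """Strip an optional leading 'slotN@' so that only the pilot name remains."""
    parts = remote_host.split("@")
    if len(parts) == 3:  # e.g. slot1@glidein_15640@node20-9.wn.iihe.ac.be
        return "@".join(parts[1:])
    return remote_host


def jobs_on_pilots(job_records, pilot_names):
    """Return {owner: set of cluster ids} for jobs whose last host was a listed pilot."""
    pilots = set(pilot_names)
    hits = {}
    for job in job_records:
        pilot = pilot_of(job.get("LastRemoteHost", ""))
        if pilot in pilots:
            hits.setdefault(job["Owner"], set()).add(job["ClusterId"])
    return hits


# Made-up job records standing in for parsed condor_history output.
jobs = [
    {"Owner": "uscms0001", "ClusterId": 31607,
     "LastRemoteHost": "slot1@glidein_15640@node20-9.wn.iihe.ac.be"},
    {"Owner": "uscms0002", "ClusterId": 31608,
     "LastRemoteHost": "glidein_999@wn.example.org"},
]
# Pilot list as it might be supplied by Factory Ops.
compromised_pilots = ["glidein_15640@node20-9.wn.iihe.ac.be"]

affected = jobs_on_pilots(jobs, compromised_pilots)
print(affected)
```

The resulting dictionary gives the users to notify and the clusters to inspect further.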
 

Other Actions

  • Notify the sites and users whose jobs ran with pilots with a compromised credential
  • Report the results to CMS Security Contacts (Ian and Mine)
Changed:
<
<

Action Items

>
>

General Observation

Note that if a compromise is thought to spread from pilot to user DN and vice versa, the entire system could be considered compromised in short order, given that user tasks have O(1000) jobs and there are only 10 pilot proxies. Assuming each job lands on one of the 10 pilot proxies independently and with equal probability, the probability that a task of 1000 jobs that have already run or started avoided a particular pilot proxy is (9/10)^1000, which is very small (about 1.7 x 10^-46).
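The number quoted above can be checked with a few lines of Python. The calculation assumes, as a simplification, that each job is assigned to one of the 10 pilot proxies independently and with equal probability; the variable names are only illustrative.

```python
n_proxies = 10   # number of pilot proxies in use
n_jobs = 1000    # typical number of jobs in a user task

# Probability that a single job avoids one particular proxy is 9/10,
# so all n_jobs avoid it with probability (9/10)**n_jobs.
p_avoid = ((n_proxies - 1) / n_proxies) ** n_jobs
print(f"P(task avoided a given proxy) = {p_avoid:.1e}")
```

Even for a task of only 100 jobs the same formula gives a probability below 10^-4, so the conclusion does not depend strongly on the task size.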

To consider (Igor, Stefano, Lola, James):

 
Changed:
<
<
  1. Document to be REVIEWED BY IGOR - DONE
  2. Does CRAB log IP addresses where submissions come from? (Stefano)
  3. Implement condor logging level changes and document how the information should be used or extracted. (James/Igor?) - IGOR AGREED TO DO IT. DONE on submit-4.
  4. Document how to get a list of sites and users that ran on a pilot with a particular pilot certificate since a particular time (JAMES) - probably a complicated looking condor command.
  5. Who do we report incidents to? Oli? Ian? Any designated CMS Security contact person? Sites/users? - ASKED OLI: Ans. Ian and Mine.
>
>
  • Step to renew all the pilot proxies?
  • Are the pilot credentials themselves compromised or just the proxy hijacked (simpler situation)?
  -- JamesLetts - 2012/08/27
 