Difference: GlideinWMSCrabSSC6 (1 vs. 7)

Revision 7 - 2012/08/31 - Main.JamesLetts

Line: 1 to 1
 
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Line: 27 to 27
 condor_hold uscmsxxx
Changed:
<
<
As root, block the local userid in the /etc/passwd file on all submitter nodes by appending something to the userid like uscmsxxxBLOCKED. This will help in cleanup later. Effectively this will block any further submissions by denying the ability of the compromised DN to use gridftp or glexec on the server.
>
>
As root, block the local userid in the /etc/passwd file on all submitter nodes by appending something to the userid like uscmsxxxBLOCKED. This will help in cleanup later. Effectively this will block any further submissions by denying the ability of the compromised DN to use gridftp or glexec on the server. Also, send an e-mail to Terrence at UCSD to block the user DN in the UCSD GUMS server. While this is the best way to block a user DN completely, depending on the time of day it may not be the quickest.
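The renaming step can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not a tested site procedure: the function name block_userid is hypothetical, and it works on whatever passwd-format file it is given, so it can be tried on a copy before anything is done to /etc/passwd itself.

```shell
# Hypothetical helper (not part of the original procedure).
# Usage: block_userid USERID PASSWD_FILE
# Appends BLOCKED to USERID in the first field of a passwd-format file.
# A backup of the unmodified file is kept as PASSWD_FILE.bak.
block_userid() {
    userid="$1"
    passwd_file="$2"
    # Anchor on "^userid:" so that uscms123 does not also match uscms1234.
    sed -i.bak "s/^${userid}:/${userid}BLOCKED:/" "$passwd_file"
}
```

For example, block_userid uscms1234 /tmp/passwd.copy edits only the copy; the result can be inspected before the same change is made as root on each submitter node.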
 

Collecting Information

Revision 6 - 2012/08/30 - Main.JamesLetts

Line: 1 to 1
 
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Line: 27 to 27
 condor_hold uscmsxxx
Changed:
<
<
As root, block the local userid in the /etc/passwd file on all submitter nodes by appending something to the userid like uscmsxxxBLOCKED. This will help in cleanup later. Effectively this will block any further submissions by denying the ability of the compromised DN to use gridftp on the server.
>
>
As root, block the local userid in the /etc/passwd file on all submitter nodes by appending something to the userid like uscmsxxxBLOCKED. This will help in cleanup later. Effectively this will block any further submissions by denying the ability of the compromised DN to use gridftp or glexec on the server.
 

Collecting Information

Line: 66 to 66
 ...
Changed:
<
<
Pilot startd names are also available in the EventLog. From this information it should be possible to determine which other jobs ran on the same pilots that may have been compromised, if any.
>
>
Pilot startd names are also available in the EventLog. From this information it should be possible to determine which other jobs ran on the same pilots that may have been compromised, if any. The command condor_userlog -attr can query any attribute available in the EventLog. The ClassAd attribute names are the same as in condor_q and condor_history.
  The IP addresses from which jobs were submitted are more difficult to determine. In principle, this information is in two logs in $PRODAGENT_WORKDIR/CommandManager:
  • ComponentLog says that there was e.g. a request to submit a new task.

Revision 5 - 2012/08/29 - Main.JamesLetts

Line: 1 to 1
 
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Changed:
<
<

Banning Users

>
>

Banning Users

  During the CMS Security Challenge, glideinWMS CRAB SERVER operators may be asked to ban a particular DN and provide certain information about the "attack". In particular, given a particular user DN, admins may be asked to take action to:
Line: 17 to 17
 

Detailed Procedures

The procedure for banning a user starts with mapping the certificate DN to a local userid on the UCSD CRAB Servers. This can be done by looking at the list of mappings. The command condor_q can also give you the same information, but only if the user still has jobs pending or running.

Changed:
<
<
The local UNIX userid typically has the form uscmsxxx.
>
>
The local UNIX userid typically has the form uscmsxxx. However, if a priority user's DN is compromised then it will have the form cmspaxxx.
 
condor_q -format '%s ' Owner -format '%s\n' x509userproxysubject | sort | uniq -c
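To look up a single DN rather than scanning the whole list, the output of the condor_q command above can be filtered. The following is a minimal sketch: owner_for_dn is a hypothetical helper name, and it assumes each input line has the form "Owner DN" exactly as printed by the two -format options.

```shell
# Hypothetical helper: print the local userid(s) mapped to one DN.
# Reads lines of the form "Owner DN" on standard input, as produced by the
# condor_q -format command above, and prints each matching Owner once.
owner_for_dn() {
    dn="$1" awk '
        {
            owner = $1
            line = $0
            sub(/^[^ ]+ /, "", line)      # drop the Owner field, keep the DN
            if (line == ENVIRON["dn"]) print owner
        }' | sort -u
}
```

It would be used by piping the condor_q output into owner_for_dn with the DN in question as the only argument. As with condor_q itself, this only finds users who still have jobs pending or running.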
Line: 27 to 27
 condor_hold uscmsxxx
Changed:
<
<
Remove the local userid from the /etc/passwd file on all submitter nodes. This will block any further submissions.
>
>
As root, block the local userid in the /etc/passwd file on all submitter nodes by appending something to the userid like uscmsxxxBLOCKED. This will help in cleanup later. Effectively this will block any further submissions by denying the ability of the compromised DN to use gridftp on the server.
 

Collecting Information

Line: 38 to 38
 

Detailed Procedures

Changed:
<
<
Information on which sites jobs ran at is (or soon will be) in the condor logs on the submission nodes in the file /opt/glidecondor/condor_local/log/EventLog. Eventually we will have a tool to parse this log, but the information is available in terms of condor Cluster ID (the same information you get from condor_q or condor_history) and JOB_Site:
>
>
Information on which sites jobs ran at is in the condor logs on the submission nodes in the file /opt/glidecondor/condor_local/log/EventLog. The log contains event information in terms of condor cluster IDs, GLIDEIN_Site, time stamps etc.

We now have a tool to parse this log, courtesy of I. Sfiligoi. To use the tool (currently only installed on submit-4):

 
Changed:
<
<
letts@submit-4 /opt/glidecondor/condor_local/log$ tail -100 /opt/glidecondor/condor_local/log/EventLog |egrep '^Cluster|^JOB_Site'
Cluster = 31675
JOB_Site = "CERN"
Cluster = 31675
JOB_Site = "CERN"
Cluster = 31607
JOB_Site = "JINR"
>
>
source /opt/condor_igor_3214/condor.sh
condor_userlog -rotated -fullname -attr Owner,JOB_GLIDEIN_Site /opt/glidecondor/condor_local/log/EventLog
Job       Host               Start Time  Evict Time  Wall Time Good Time CPU Usage
31805.0   uscms3649,IFCA      8/28 20:43  8/28 20:44   0+00:01   0+00:01   0+00:00
31142.51  uscms4150,Louvain   8/28 20:29  8/28 23:01   0+02:32   0+00:00   0+00:00
31142.12  uscms4150,Louvain   8/28 20:59  8/28 23:01   0+02:02   0+00:00   0+00:00
31142.46  uscms4150,Louvain   8/28 20:58  8/28 23:02   0+02:03   0+00:00   0+00:00
31142.27  uscms4150,Louvain   8/28 21:01  8/28 23:02   0+02:01   0+00:00   0+00:00
31142.13  uscms4150,Louvain   8/28 21:02  8/28 23:02   0+02:00   0+00:00   0+00:00
...

To query a particular user:

condor_userlog -rotated -const 'Owner=="uscms2330"' -fullname -attr Owner,JOB_GLIDEIN_Site /opt/glidecondor/condor_local/log/EventLog
Job      Host            Start Time  Evict Time  Wall Time Good Time CPU Usage
31869.4  uscms2330,IFCA   8/29 00:20  8/29 00:21   0+00:01   0+00:01   0+00:00
31869.8  uscms2330,IFCA   8/29 00:20  8/29 00:21   0+00:01   0+00:01   0+00:00
31869.7  uscms2330,IFCA   8/29 00:20  8/29 00:21   0+00:01   0+00:01   0+00:00
31869.5  uscms2330,IFCA   8/29 00:20  8/29 00:21   0+00:01   0+00:01   0+00:00
31869.6  uscms2330,IFCA   8/29 00:20  8/29 00:21   0+00:01   0+00:01   0+00:00
...
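For reporting, the per-user listing above can be reduced to a list of sites. The following is a minimal sketch: sites_from_userlog is a hypothetical helper name, and it assumes the second column holds the "Owner,Site" pair exactly as in the listing above.

```shell
# Hypothetical helper: count execution records per site from condor_userlog
# output. Expects the "Owner,Site" pair in the second column; the header
# line and any other line without a comma there is skipped.
sites_from_userlog() {
    awk '
        $2 ~ /,/ {
            split($2, pair, ",")          # pair[1] = Owner, pair[2] = site
            count[pair[2]]++
        }
        END {
            for (site in count) print site, count[site]
        }' | sort
}
```

Piping the condor_userlog command for a particular user into sites_from_userlog gives one line per site with the number of records seen there. The counts are execution records, not unique jobs, since a job may appear more than once.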
 
Changed:
<
<
Pilot names are also available in the EventLog. From this information it should be possible to determine which other jobs ran on the same pilots that may have been compromised, if any.
>
>
Pilot startd names are also available in the EventLog. From this information it should be possible to determine which other jobs ran on the same pilots that may have been compromised, if any.
  The IP addresses from which jobs were submitted are more difficult to determine. In principle, this information is in two logs in $PRODAGENT_WORKDIR/CommandManager:
  • ComponentLog says that there was e.g. a request to submit a new task.
  • FrontendLog says that IP n connected at time t.
Changed:
<
<
However, there is no guaranteed relationship.

FrontendLog is written by $CRABSERVER_ROOT/src/python/CommandManager/server_side/server2.c. S.B. looked briefly at whether the task name could easily be added to the IP connection message (the user's DN does not seem to be available there, but the task name would do), but the code is too complicated to understand in a short time, and we should not make extensive changes to CRAB2 at this time.

>
>
However, there is no guaranteed relationship between the entries in the two logs. FrontendLog is written by $CRABSERVER_ROOT/src/python/CommandManager/server_side/server2.c. S. Belforte looked briefly at whether the task name could easily be added to the IP connection message (the user's DN does not seem to be available there, but the task name would do), but the code is too complicated to understand in a short time, and we should not make extensive changes to CRAB2 at this time.
 

Other actions based on information collected:

Changed:
<
<
  • Notify sites where jobs ran. Note that individual jobs could have run on more than one site!
  • Report the results to CMS Security Contacts (Ian and Mine)
>
>
  • Notify sites where jobs ran. (Note that individual jobs could have run on more than one site! The new EventLog gives you this information, since it tracks condor events, not clusters.)
  • Report the results to CMS Security Contacts (Ian and Mine, and the cms-comp-security mailing list).
 
Changed:
<
<

Compromised Pilot Certificate

>
>

Compromised Pilot Certificate

 
Changed:
<
<
The compromise of a pilot certificate is much more complicated than the case of a compromised user certificate, since there are only O(10) pilot certificates which are cycled round-robin to run glideinWMS pilots. User jobs will then connect to startd's run by these pilots for executing the user jobs. If a pilot certificate is compromised, then potentially every site and every user of glideinWMS for CMS analysis during the time since the compromise can be affected.
>
>
The compromise of a pilot certificate is much more complicated than the case of a compromised user certificate, since there are only O(10) pilot certificates which are cycled round-robin to run glideinWMS pilots. User jobs will then connect to startd's run by these pilots for executing the user jobs. If a pilot certificate is compromised, then potentially every site and every user of glideinWMS for CMS analysis during the time since the compromise can be affected. The time and effort to determine which, if any, proxies were not compromised might be prohibitive. In this case, it may be more efficient to shut down the entire system, clean up, and restart with uncompromised proxies. However, for the purposes of SSC6, we will not halt glidein CRAB operations or kill pilots. Simply make sure that the information needed to carry out such an operation is obtainable and that communication lines are working.
 
Changed:
<
<
How do you know a pilot proxy was compromised? GOOD QUESTION!
>
>
How do you know a pilot proxy was compromised? While this is a good question, for the purposes of SSC6 we will simply be told.
 

Initial Actions

If a glideinWMS pilot DN is compromised, admins will have to:

Changed:
<
<
  • Remove the particular pilot proxy from the rotation in the glideinWMS frontend and replace it with another of the 50 we have available.
  • Kill any running pilots with the banned proxy
  • BAN COMPROMISED PILOT DN AT THE COLLECTOR
  • Contact Factory Ops to kill any queued pilots
>
>
  • Remove the particular pilot proxy from the rotation in the glideinWMS frontend and replace it with another of the 50 we have available. (N.B. As of Wednesday August 29, 2012 the additional proxies are not yet registered with the CMS VO.)
  • Ask Factory Ops to kill any running pilots with the banned proxy and remove any queued pilots.
  • Ban the compromised pilot DN at the condor collector
 

Detailed Procedures

Line: 95 to 109
 

Remove the compromised proxy from the list and replace it with another that is not being used already in this frontend or in any other running frontend on the machine.

Changed:
<
<
Other certificates can be found in ~/.globus.
>
>
Other certificates can be found in ~/.globus (but they are not yet registered with the CMS VO).
  Reconfigure the frontend:
./frontend_startup reconfig ../instance_v5_4.cfg/frontend.xml
Changed:
<
<
To remove all running and queued pilots with a particular DN, it is necessary to contact the Factory Operators (osg-gfactory-support@physics.ucsd.edu). Also ask them for a history of pilots that ran at sites with that DN since the time of the incident (pilot name of the form "glidein_15640@node20-9.wn.iihe.ac.be", site).
>
>
To remove all running and queued pilots with a particular DN, it is necessary to contact the Factory Operators (osg-gfactory-support@physics.ucsd.edu).
 

Collecting Information

  • find out which sites pilot jobs ran on using this proxy (above) and notify them
Line: 110 to 124
 

Detailed Procedures

Changed:
<
<
Given the large number of pilots running at any given time O(10000) and the small number of proxies O(10), every site and every user who ran a job in the glideinWMS analysis system since the time of the compromise of a pilot certificate will have been affected. To make this point, look at every site where pilots are currently running using one certificate:
>
>
Given the large number of pilots running at any given time O(10000) and the small number of proxies O(10), every site and every user who ran a job in the glideinWMS analysis system since the time of the compromise of a pilot certificate may have been affected. To make this point, look at every site where pilots are currently running using one certificate:
 
letts@submit-4 ~$ condor_status -const '(GLIDEIN_X509_GRIDMAP_DNS=?="/DC=org/DC=doegrids/OU=Services/CN=glidein-collector.t2.ucsd.edu,/DC=org/DC=doegrids/OU=Services/CN=glidein-frontend.t2.ucsd.edu,/DC=org/DC=doegrids/OU=Services/CN=uscmspilot05/glidein-1.t2.ucsd.edu")' -l | grep ^GLIDEIN_CMSSite | sort | uniq -c
     20 GLIDEIN_CMSSite = "T1_CH_CERN"
Line: 147 to 161
 
      1 GLIDEIN_CMSSite = "T3_US_TTU"
      7 GLIDEIN_CMSSite = "T3_US_UMD"
Changed:
<
<
This is 33 out of 39 sites running glideins at this time.

To get detailed information about pilot history, ask Factory Ops for a history of pilots that ran at sites with that DN since the time of the incident (pilot name of the form "glidein_15640@node20-9.wn.iihe.ac.be", site). Then you can cross-reference the list of pilots against the condor_history (potentially a lot of work).

>
>
This is 33 out of 39 sites running glideins at this time. Over the course of a few hours, this would quickly encompass all sites. Therefore, it is likely that every site running a glidein since the time of the compromise has been affected.
 

Other Actions

Changed:
<
<
  • Notify the sites and users whose jobs ran with pilots with a compromised credential
  • Report the results to CMS Security Contacts (Ian and Mine)
>
>
  • Notify the sites where jobs ran with pilots with a compromised credential (effectively all sites).
  • Report the results to CMS Security Contacts (Ian and Mine, and the cms-comp-security mailing list)
 

General Observation

Changed:
<
<
Note that if a compromise is thought to spread from pilot to user DN and vice versa, the entire system could be considered compromised in short order, given that user tasks have O(1000) jobs and there are only 10 pilot proxies. The probability that a task of 1000 jobs that have already run or started avoided a particular pilot proxy is very small (1.7 x 10^-46).

To consider (Igor, Stefano, Lola, James):

  • Step to renew all the pilot proxies?
  • Are the pilot credentials themselves compromised or just the proxy hijacked (simpler situation)?
>
>
Note that if a compromise is thought to spread from pilot to user DN and vice versa, the entire system could be considered compromised in short order, given that user tasks have O(1000) jobs and there are only 10 pilot proxies. The probability that a task of 1000 jobs that have already run or started avoided every pilot using a particular pilot proxy is very small (1.7 x 10^-46). Therefore, in case of this kind of attack, there may be nothing to do other than holding all user jobs, removing all running and queued pilots, banning the compromised pilot certificates as above, and starting over. (What about compromised user proxies? When would it be safe to let user jobs run again?)
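The quoted probability can be reproduced with a one-line calculation. This is a sketch under an assumption that is not stated in the text: each job is taken to land independently, with equal probability, on a pilot using one of the P proxies, so a single job avoids a given proxy with probability (P-1)/P and a task of N jobs avoids it with probability ((P-1)/P)^N. The function name p_avoid is hypothetical.

```shell
# Probability that all N jobs of a task avoid one particular proxy,
# ASSUMING each job lands independently and uniformly on one of P proxies.
# Usage: p_avoid P N
p_avoid() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2e\n", ((p - 1) / p) ^ n }'
}

p_avoid 10 1000    # prints 1.75e-46
```

With 10 proxies and 1000 jobs this gives 0.9^1000, about 1.75 x 10^-46, consistent with the figure quoted above.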
 
Deleted:
<
<
-- JamesLetts - 2012/08/27
Added:
>
>
-- JamesLetts - 2012/08/29

Revision 4 - 2012/08/28 - Main.JamesLetts

Line: 1 to 1
 
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Line: 27 to 27
 condor_hold uscmsxxx
Changed:
<
<
Remove the local userid from the /etc/passwd file on all submitter nodes. This will block any further submissions.
>
>
Remove the local userid from the /etc/passwd file on all submitter nodes. This will block any further submissions.
 

Collecting Information

Information an operator should collect:

Changed:
<
<
  • find out which sites jobs ran on
  • incoming IP address from which jobs were submitted (Do we have this information in the CRAB SERVER, and if so, where?)
>
>
  • which sites jobs from the banned user ran on
  • names of pilots which ran jobs from the banned user
  • incoming IP address from which jobs were submitted
 

Detailed Procedures

Line: 47 to 48
 Cluster = 31607 JOB_Site = "JINR"
Added:
>
>
Pilot names are also available in the EventLog. From this information it should be possible to determine which other jobs ran on the same pilots that may have been compromised, if any.
 
Changed:
<
<
IP address from which jobs were submitted? - STEFANO
>
>
The IP addresses from which jobs were submitted are more difficult to determine. In principle, this information is in two logs in $PRODAGENT_WORKDIR/CommandManager:
  • ComponentLog says that there was e.g. a request to submit a new task.
  • FrontendLog says that IP n connected at time t.
However, there is no guaranteed relationship.

FrontendLog is written by $CRABSERVER_ROOT/src/python/CommandManager/server_side/server2.c. S.B. looked briefly at whether the task name could easily be added to the IP connection message (the user's DN does not seem to be available there, but the task name would do), but the code is too complicated to understand in a short time, and we should not make extensive changes to CRAB2 at this time.

 

Other actions based on information collected:

  • Notify sites where jobs ran. Note that individual jobs could have run on more than one site!
Line: 94 to 102
 ./frontend_startup reconfig ../instance_v5_4.cfg/frontend.xml
Changed:
<
<
how to kill all pilots with DN=X
>
>
To remove all running and queued pilots with a particular DN, it is necessary to contact the Factory Operators (osg-gfactory-support@physics.ucsd.edu). Also ask them for a history of pilots that ran at sites with that DN since the time of the incident (pilot name of the form "glidein_15640@node20-9.wn.iihe.ac.be", site).
 

Collecting Information

Changed:
<
<
  • find out which sites pilot jobs ran on using this proxy and notify them
>
>
  • find out which sites pilot jobs ran on using this proxy (above) and notify them
 
  • find out which users had jobs which ran on pilots with a compromised proxy

Detailed Procedures

Line: 140 to 149
  This is 33 out of 39 sites running glideins at this time.
Changed:
<
<
WRITE HOW TO GET THIS LIST OF SITES, USERS.
>
>
To get detailed information about pilot history, ask Factory Ops for a history of pilots that ran at sites with that DN since the time of the incident (pilot name of the form "glidein_15640@node20-9.wn.iihe.ac.be", site). Then you can cross-reference the list of pilots against the condor_history (potentially a lot of work).
 

Other Actions

  • Notify the sites and users whose jobs ran with pilots with a compromised credential
  • Report the results to CMS Security Contacts (Ian and Mine)
Changed:
<
<

Action Items

>
>

General Observation

Note that if a compromise is thought to spread from pilot to user DN and vice versa, the entire system could be considered compromised in short order, given that user tasks have O(1000) jobs and there are only 10 pilot proxies. The probability that a task of 1000 jobs that have already run or started avoided a particular pilot proxy is very small (1.7 x 10^-46).

To consider (Igor, Stefano, Lola, James):

 
Changed:
<
<
  1. Document to be REVIEWED BY IGOR - DONE
  2. Does CRAB log IP addresses where submissions come from? (Stefano)
  3. Implement condor logging level changes and document how the information should be used or extracted. (James/Igor?) - IGOR AGREED TO DO IT. DONE on submit-4.
  4. Document how to get a list of sites and users that ran on a pilot with a particular pilot certificate since a particular time (JAMES) - probably a complicated looking condor command.
  5. Who do we report incidents to? Oli? Ian? Any designated CMS Security contact person? Sites/users? - ASKED OLI: Ans. Ian and Mine.
>
>
  • Step to renew all the pilot proxies?
  • Are the pilot credentials themselves compromised or just the proxy hijacked (simpler situation)?
  -- JamesLetts - 2012/08/27

Revision 3 - 2012/08/28 - Main.JamesLetts

Line: 1 to 1
 
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Line: 16 to 16
 

Detailed Procedures

Changed:
<
<
The procedure for banning a user starts with mapping the certificate DN to a local userid on the UCSD CRAB SERVERS. This can be done by looking at the list of mappings. The command condor_q can also give you the same information, but only if the user still has jobs pending or running.
>
>
The procedure for banning a user starts with mapping the certificate DN to a local userid on the UCSD CRAB Servers. This can be done by looking at the list of mappings. The command condor_q can also give you the same information, but only if the user still has jobs pending or running. The local UNIX userid typically has the form uscmsxxx.
 
condor_q -format '%s ' Owner -format '%s\n' x509userproxysubject | sort | uniq -c
Deleted:
<
<
The local UNIX userid typically has the form uscmsxxx.
 On each of the submitter nodes (glidein-2.t2.ucsd.edu, submit-[1-4].t2.ucsd.edu), HOLD any pending or running jobs from this user by running:
condor_hold uscmsxxx
Line: 34 to 33
  Information an operator should collect:
  • find out which sites jobs ran on
Changed:
<
<
  • incoming IP address from which jobs were submitted (do we even have this information in the CRAB SERVER, and if so, where?)
>
>
  • incoming IP address from which jobs were submitted (Do we have this information in the CRAB SERVER, and if so, where?)
 

Detailed Procedures

Changed:
<
<
Information on which sites jobs ran at is (or soon will be) in the condor logs on the submission nodes in the file /opt/glidecondor/condor_local/log/EventLog.
>
>
Information on which sites jobs ran at is (or soon will be) in the condor logs on the submission nodes in the file /opt/glidecondor/condor_local/log/EventLog. Eventually we will have a tool to parse this log, but the information is available in terms of condor Cluster ID (the same information you get from condor_q or condor_history) and JOB_Site:
letts@submit-4 /opt/glidecondor/condor_local/log$ tail -100 /opt/glidecondor/condor_local/log/EventLog |egrep '^Cluster|^JOB_Site'
Cluster = 31675
JOB_Site = "CERN"
Cluster = 31675
JOB_Site = "CERN"
Cluster = 31607
JOB_Site = "JINR"

IP address from which jobs were submitted? - STEFANO

 

Other actions based on information collected:

  • Notify sites where jobs ran. Note that individual jobs could have run on more than one site!
Changed:
<
<
  • Report the results to CMS Computing Management?
>
>
  • Report the results to CMS Security Contacts (Ian and Mine)
 

Compromised Pilot Certificate

Line: 135 to 145
 

Other Actions

  • Notify the sites and users whose jobs ran with pilots with a compromised credential
Changed:
<
<
  • Report the results to CMS Computing Management?
>
>
  • Report the results to CMS Security Contacts (Ian and Mine)
 

Action Items

  1. Document to be REVIEWED BY IGOR - DONE
  2. Does CRAB log IP addresses where submissions come from? (Stefano)
Changed:
<
<
  1. Implement condor logging level changes and document how the information should be used or extracted. (James/Igor?) - IGOR AGREED TO DO IT.
>
>
  1. Implement condor logging level changes and document how the information should be used or extracted. (James/Igor?) - IGOR AGREED TO DO IT. DONE on submit-4.
 
  1. Document how to get a list of sites and users that ran on a pilot with a particular pilot certificate since a particular time (JAMES) - probably a complicated looking condor command.
Changed:
<
<
  1. Who do we report incidents to? Oli? Ian? Any designated CMS Security contact person? Sites/users? - ASKED OLI
>
>
  1. Who do we report incidents to? Oli? Ian? Any designated CMS Security contact person? Sites/users? - ASKED OLI: Ans. Ian and Mine.
  -- JamesLetts - 2012/08/27

Revision 2 - 2012/08/28 - Main.JamesLetts

Line: 1 to 1
 
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Line: 16 to 16
 

Detailed Procedures

Changed:
<
<
The procedure for banning a user starts with mapping the certificate DN to a local userid on the UCSD CRAB SERVERS. This can be done by looking at the list of mappings. The local UNIX userid typically has the form uscmsxxx.
>
>
The procedure for banning a user starts with mapping the certificate DN to a local userid on the UCSD CRAB SERVERS. This can be done by looking at the list of mappings. The command condor_q can also give you the same information, but only if the user still has jobs pending or running.
condor_q -format '%s ' Owner -format '%s\n' x509userproxysubject | sort | uniq -c

The local UNIX userid typically has the form uscmsxxx.

  On each of the submitter nodes (glidein-2.t2.ucsd.edu, submit-[1-4].t2.ucsd.edu), HOLD any pending or running jobs from this user by running:
Line: 33 to 38
 

Detailed Procedures

Changed:
<
<
Information on which sites jobs ran at is (or soon will be) in the condor logs on the submission nodes. NEED DETAILS, AND TO IMPLEMENT IGOR'S CONFIG MODIFICATIONS
>
>
Information on which sites jobs ran at is (or soon will be) in the condor logs on the submission nodes in the file /opt/glidecondor/condor_local/log/EventLog.
 

Other actions based on information collected:

  • Notify sites where jobs ran. Note that individual jobs could have run on more than one site!
Line: 43 to 48
  The compromise of a pilot certificate is much more complicated than the case of a compromised user certificate, since there are only O(10) pilot certificates which are cycled round-robin to run glideinWMS pilots. User jobs will then connect to startd's run by these pilots for executing the user jobs. If a pilot certificate is compromised, then potentially every site and every user of glideinWMS for CMS analysis during the time since the compromise can be affected.
Added:
>
>
How do you know a pilot proxy was compromised? GOOD QUESTION!
 

Initial Actions

If a glideinWMS pilot DN is compromised, admins will have to:

Changed:
<
<
  • remove the particular pilot proxy from the rotation in the glideinWMS frontend and replace it with another of the 50 we have available.
  • kill any running pilots with the banned proxy
>
>
  • Remove the particular pilot proxy from the rotation in the glideinWMS frontend and replace it with another of the 50 we have available.
  • Kill any running pilots with the banned proxy
  • BAN COMPROMISED PILOT DN AT THE COLLECTOR
  • Contact Factory Ops to kill any queued pilots
 

Detailed Procedures

Line: 130 to 139
 

Action Items

Changed:
<
<
  1. Document to be REVIEWED BY IGOR
>
>
  1. Document to be REVIEWED BY IGOR - DONE
 
  1. Does CRAB log IP addresses where submissions come from? (Stefano)
Changed:
<
<
  1. Implement condor logging level changes and document how the information should be used or extracted. (James/Igor?)
>
>
  1. Implement condor logging level changes and document how the information should be used or extracted. (James/Igor?) - IGOR AGREED TO DO IT.
 
  1. Document how to get a list of sites and users that ran on a pilot with a particular pilot certificate since a particular time (JAMES) - probably a complicated looking condor command.
Changed:
<
<
  1. Who do we report incidents to? Oli? Ian? Any designated CMS Security contact person? Sites/users?
>
>
  1. Who do we report incidents to? Oli? Ian? Any designated CMS Security contact person? Sites/users? - ASKED OLI
  -- JamesLetts - 2012/08/27

Revision 1 - 2012/08/27 - Main.JamesLetts

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="GlideinWMSCrab"

PROCEDURES FOR GLIDEINWMS CRAB SERVER DURING THE CMS SECURITY CHALLENGE SSC6

Banning Users

During the CMS Security Challenge, glideinWMS CRAB SERVER operators may be asked to ban a particular DN and provide certain information about the "attack". In particular, given a particular user DN, admins may be asked to take action to:

Initial Actions

  • hold any running jobs
  • hold any queued jobs
  • block the user from further submissions

Detailed Procedures

The procedure for banning a user starts with mapping the certificate DN to a local userid on the UCSD CRAB SERVERS. This can be done by looking at the list of mappings. The local UNIX userid typically has the form uscmsxxx.

On each of the submitter nodes (glidein-2.t2.ucsd.edu, submit-[1-4].t2.ucsd.edu), HOLD any pending or running jobs from this user by running:

condor_hold uscmsxxx

Remove the local userid from the /etc/passwd file on all submitter nodes. This will block any further submissions.

Collecting Information

Information an operator should collect:

  • find out which sites jobs ran on
  • incoming IP address from which jobs were submitted (do we even have this information in the CRAB SERVER, and if so, where?)

Detailed Procedures

Information on which sites jobs ran at is (or soon will be) in the condor logs on the submission nodes. NEED DETAILS, AND TO IMPLEMENT IGOR'S CONFIG MODIFICATIONS

Other actions based on information collected:

  • Notify sites where jobs ran. Note that individual jobs could have run on more than one site!
  • Report the results to CMS Computing Management?

Compromised Pilot Certificate

The compromise of a pilot certificate is much more complicated than the case of a compromised user certificate, since there are only O(10) pilot certificates which are cycled round-robin to run glideinWMS pilots. User jobs will then connect to startd's run by these pilots for executing the user jobs. If a pilot certificate is compromised, then potentially every site and every user of glideinWMS for CMS analysis during the time since the compromise can be affected.

Initial Actions

If a glideinWMS pilot DN is compromised, admins will have to:

  • remove the particular pilot proxy from the rotation in the glideinWMS frontend and replace it with another of the 50 we have available.
  • kill any running pilots with the banned proxy

Detailed Procedures

There are two frontend instances running on glidein-frontend.t2.ucsd.edu under user frontend, instance_v5_4 for general usage and instance_o5_4 for xrootd overflow. These procedures could apply to either frontend.

Pilot certificates are removed from the configuration file frontend.xml in the CMS frontend, in ~/frontstage/instance_[ov]5_4.cfg in the section under security. For example, there is a list of pilot certificates used:

         <security>
            <proxies>
               <proxy absfname="/home/frontend/.globus/x509_pilot05_cms_prio.proxy" security_class="cmsprio"/>
               <proxy absfname="/home/frontend/.globus/x509_pilot06_cms_prio.proxy" security_class="cmsprio"/>
               <proxy absfname="/home/frontend/.globus/x509_pilot07_cms_prio.proxy" security_class="cmsprio"/>
               <proxy absfname="/home/frontend/.globus/x509_pilot08_cms_prio.proxy" security_class="cmsprio"/>
               <proxy absfname="/home/frontend/.globus/x509_pilot09_cms_prio.proxy" security_class="cmsprio"/>
               <proxy absfname="/home/frontend/.globus/x509_pilot10_cms_prio.proxy" security_class="cmsprio"/>
            </proxies>
         </security>

Remove the compromised proxy from the list and replace it with another that is not being used already in this frontend or in any other running frontend on the machine. Other certificates can be found in ~/.globus.
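Before editing, it can help to list which proxy files a frontend configuration currently references, for instance to check that a replacement is not already in use by the other instance. The following is a minimal sketch: list_frontend_proxies is a hypothetical helper name, and it assumes one <proxy .../> element per line, as in the excerpt above.

```shell
# Hypothetical helper: print the proxy file names listed in a frontend.xml.
# Assumes one <proxy .../> element per line, as in the excerpt above.
# Usage: list_frontend_proxies FRONTEND_XML
list_frontend_proxies() {
    sed -n 's/.*<proxy absfname="\([^"]*\)".*/\1/p' "$1"
}
```

Running it against the frontend.xml of both instance_v5_4 and instance_o5_4 and comparing the two lists shows which proxies are already taken.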

Reconfigure the frontend:

./frontend_startup reconfig ../instance_v5_4.cfg/frontend.xml

how to kill all pilots with DN=X

Collecting Information

  • find out which sites pilot jobs ran on using this proxy and notify them
  • find out which users had jobs which ran on pilots with a compromised proxy

Detailed Procedures

Given the large number of pilots running at any given time O(10000) and the small number of proxies O(10), every site and every user who ran a job in the glideinWMS analysis system since the time of the compromise of a pilot certificate will have been affected. To make this point, look at every site where pilots are currently running using one certificate:

letts@submit-4 ~$ condor_status -const '(GLIDEIN_X509_GRIDMAP_DNS=?="/DC=org/DC=doegrids/OU=Services/CN=glidein-collector.t2.ucsd.edu,/DC=org/DC=doegrids/OU=Services/CN=glidein-frontend.t2.ucsd.edu,/DC=org/DC=doegrids/OU=Services/CN=uscmspilot05/glidein-1.t2.ucsd.edu")' -l | grep ^GLIDEIN_CMSSite | sort | uniq -c
     20 GLIDEIN_CMSSite = "T1_CH_CERN"
      3 GLIDEIN_CMSSite = "T1_US_FNAL"
      5 GLIDEIN_CMSSite = "T2_BE_IIHE"
     36 GLIDEIN_CMSSite = "T2_BE_UCL"
      5 GLIDEIN_CMSSite = "T2_BR_SPRACE"
      4 GLIDEIN_CMSSite = "T2_BR_UERJ"
     11 GLIDEIN_CMSSite = "T2_CH_CERN"
      2 GLIDEIN_CMSSite = "T2_CH_CSCS"
     39 GLIDEIN_CMSSite = "T2_DE_DESY"
      6 GLIDEIN_CMSSite = "T2_DE_RWTH"
      3 GLIDEIN_CMSSite = "T2_ES_IFCA"
     30 GLIDEIN_CMSSite = "T2_FR_GRIF_LLR"
      4 GLIDEIN_CMSSite = "T2_HU_Budapest"
     37 GLIDEIN_CMSSite = "T2_IT_Bari"
     67 GLIDEIN_CMSSite = "T2_IT_Legnaro"
      9 GLIDEIN_CMSSite = "T2_IT_Pisa"
     18 GLIDEIN_CMSSite = "T2_RU_JINR"
      2 GLIDEIN_CMSSite = "T2_UA_KIPT"
      8 GLIDEIN_CMSSite = "T2_UK_London_Brunel"
      7 GLIDEIN_CMSSite = "T2_UK_London_IC"
     18 GLIDEIN_CMSSite = "T2_UK_SGrid_RALPP"
      1 GLIDEIN_CMSSite = "T2_US_Caltech"
    130 GLIDEIN_CMSSite = "T2_US_Florida"
     13 GLIDEIN_CMSSite = "T2_US_MIT"
     26 GLIDEIN_CMSSite = "T2_US_Nebraska"
      3 GLIDEIN_CMSSite = "T2_US_Purdue"
     41 GLIDEIN_CMSSite = "T2_US_UCSD"
     26 GLIDEIN_CMSSite = "T2_US_Wisconsin"
     64 GLIDEIN_CMSSite = "T3_US_Colorado"
     50 GLIDEIN_CMSSite = "T3_US_Omaha"
     10 GLIDEIN_CMSSite = "T3_US_OSU"
      1 GLIDEIN_CMSSite = "T3_US_TTU"
      7 GLIDEIN_CMSSite = "T3_US_UMD"
This is 33 out of 39 sites running glideins at this time.

WRITE HOW TO GET THIS LIST OF SITES, USERS.

Other Actions

  • Notify the sites and users whose jobs ran with pilots with a compromised credential
  • Report the results to CMS Computing Management?

Action Items

  1. Document to be REVIEWED BY IGOR
  2. Does CRAB log IP addresses where submissions come from? (Stefano)
  3. Implement condor logging level changes and document how the information should be used or extracted. (James/Igor?)
  4. Document how to get a list of sites and users that ran on a pilot with a particular pilot certificate since a particular time (JAMES) - probably a complicated looking condor command.
  5. Who do we report incidents to? Oli? Ian? Any designated CMS Security contact person? Sites/users?

-- JamesLetts - 2012/08/27

 