Replay attack feature when same proxy used on multiple components
--
JohnWeigand - 2009/11/25
This is a hard one to describe and is related to the replay attack feature in v2.2. This problem occurs only if you are using the same proxy/cert (the key is the issuer/subject value) for the VO Frontend and the Factory instances.
If you are using the same proxy or certificate for both the VO Frontend and the Factory, the CONDOR_LOCATION/certs/condor_mapfile is populated as below:
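(The actual file contents did not survive here; the following is only an illustrative sketch, with a hypothetical DN and mapped names, of what such a mapfile looks like when the factory and VO Frontend share one certificate. Since the condor_mapfile is matched first-match-wins, both daemons end up authenticating as whatever the first GSI line says.)
GSI "/DC=org/DC=doegrids/OU=People/CN=John Weigand 123456" weigand
GSI "/DC=org/DC=doegrids/OU=People/CN=John Weigand 123456" cms_frontend
GSI (.*) anonymous
FS (.*) \1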
If the factory (weigand@cms-xen21.fnal.gov) is the first one in the file, then when the VOFrontend (cms_frontend@cms-xen22.fnal.gov) requests a glidein from the factory, this error occurs in the factory_err.yyyymmdd.log:
- [2009-11-25T14:58:31-05:00 1389] Client ress_GRATIA_TEST_32@v2_2@factory@cms_frontend-v2_2.main provided invalid ReqEncIdentity(cms_frontend@cms-xen22.fnal.gov!=weigand@cms-xen21.fnal.gov). Skipping for security reasons
If I then change the condor_mapfile (as below) and put the VOFrontend ahead of the factory, the glidein request is accepted:
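(Again only a sketch with a hypothetical DN; the change is simply moving the VO Frontend line above the factory line, so the first match now yields the frontend identity:)
GSI "/DC=org/DC=doegrids/OU=People/CN=John Weigand 123456" cms_frontend
GSI "/DC=org/DC=doegrids/OU=People/CN=John Weigand 123456" weigand
GSI (.*) anonymous
FS (.*) \1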
Another anomaly of note when attempting to resolve this is that I had to restart the WMS Collector in order for it to recognize the changed condor_mapfile. It was my understanding that this was not required. I even allowed the collector to run for 1 hour and 50 minutes before I gave up and recycled the collector.
Update -- JohnWeigand - 2009/12/01: At the glidein status meeting on Monday (11/30), I was advised that the classad_identity in the security element should not contain the hostname of the WMS collector and that the collector's condor_mapfile should not have the hostname in the last token. Note that this is still the use case where I am using the same user cert on both the Factory and VOFrontend. Under these conditions, the only way I can get anything working (as in the requested glideins started) is using this configuration:
- frontend.xml
<security classad_identity="cms_frontend@cms-xen22.fnal.gov".... >
If I do not fully qualify either of the 2 files, I get errors like this in the Factory error log:
- [2009-12-01T07:58:11-05:00 27198] Client ress_GRATIA_TEST_32@v2_2@factory@cms_frontend-v2_2.main provided invalid ReqEncIdentity (cms_frontend!=cms_frontend@cms-xen21.fnal.gov). Skipping for security reasons.
If I qualify the security element classad_identity with "cms_frontend@cms-xen22.fnal.gov", I get this:
- [2009-12-01T09:50:14-05:00 27198] Client ress_GRATIA_TEST_32@v2_2@factory@cms_frontend-v2_2.main provided invalid ReqEncIdentity (cms_frontend@cms-xen22.fnal.gov!=cms_frontend@cms-xen21.fnal.gov). Skipping for security reasons.
If I qualify both (the frontend config and the condor_mapfile), it works.
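(A sketch of the "qualify both" combination, assuming a hypothetical DN and assuming the qualification takes this form: the last token of the mapfile line and the security classad_identity both carry the frontend hostname.)
- condor_mapfile on the WMS collector:
GSI "/DC=org/DC=doegrids/OU=People/CN=John Weigand 123456" cms_frontend@cms-xen22.fnal.gov
- frontend.xml:
<security classad_identity="cms_frontend@cms-xen22.fnal.gov".... >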
Update 2: Here is what I use at UCSD, and this works:
gfactory - glidein-1.t2.ucsd.edu
[0858] gfactory@glidein-1 ~/glidecondor/certs$ cat condor_mapfile
GSI "/DC=org/DC=doegrids/OU=Services/CN=glidein-1.t2.ucsd.edu" condor
GSI "/DC=org/DC=doegrids/OU=Services/CN=glidein-frontend.t2.ucsd.edu" frontend
GSI "/DC=org/DC=doegrids/OU=People/CN=Dan Bradley 268953" glowfe
GSI (.*) anonymous
FS (.*) \1
frontend - glidein-frontend.t2.ucsd.edu
[0900] frontend@glidein-frontend ~/frontstage/instance_v2_1.cfg$ grep ident frontend.xml
<collector classad_identity="gfactory@glidein-1.t2.ucsd.edu" node="glidein-1.t2.ucsd.edu"/>
<security classad_identity="frontend@glidein-1.t2.ucsd.edu" classad_proxy="/home/frontend/.globus/x509_service_proxy" proxy_selection_plugin="ProxyUserMapWRecycling" sym_key="aes_256_cbc">
Update 3 -- JohnWeigand - 2009/12/01:
If the frontend is actually on glidein-frontend.t2.ucsd.edu, why then is the classad_identity populated with frontend@glidein-1.t2.ucsd.edu?
VOFrontend config file: classad_identity population
--
JohnWeigand - 2009/12/01
In the VOFrontend XML configuration file, the collector and security elements both have a classad_identity attribute.
- <security proxy_selection_plugin="ProxyAll" classad_proxy="/home/cms/grid-security/x509_cms_pilot_proxy" classad_identity="cms_frontend">
- <collectors>
<collector node="cms-xen21.fnal.gov" classad_identity="glidein@cms-xen21.fnal.gov" comment="Define factory collector globally for simplicity"/>
</collectors>
The security element classad_identity is not supposed to have the hostname appended. However, the collector element classad_identity has to have the hostname appended. When the collector element does not have the hostname appended, the factory never appears to get the glidein requests from the VOFrontend. There is nothing in any log files on the VOFrontend, factory, or WMS collector indicating a problem.
The question is: why are the classad_identity attributes populated differently?
ReSS / GIP / resource / resource group
--
JohnWeigand - 2010/03/11
This one is not directly related to glidein per se. It is indirectly affected by how OIM/MyOSG defines the OSG topology for my environment.
An example: I have the following defined in OIM/MyOSG:
resource group: ITB_GRATIA_TEST containing 2 resources:
- resource: ITB_GRATIA_TEST_1 with a CE service running on gr6x3.fnal.gov
- resource: ITB_GRATIA_TEST_2 with a CE service running on gr6x4.fnal.gov
Using the latest CE config.ini (OSG 1.2.8), I have defined resource_group and resource on each CE with the respective names shown above.
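For reference, the relevant piece of config.ini on gr6x3 would look roughly like this (the section and option names are those of the OSG 1.2.x config.ini [Site Information] section; the values are just illustrative):
[Site Information]
resource = ITB_GRATIA_TEST_1
resource_group = ITB_GRATIA_TEST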
When GIP publishes data to ReSS and BDII, it uses the resource_group of the config.ini. So, for both CEs, data is published as ITB_GRATIA_TEST.
During the Factory installation process, my queries of the ReSS service (in this case, the ITB osg-ress-4.fnal.gov) bring back the following potential entry points:
- [ress_ITB_GRATIA_TEST_1] gr6x4.fnal.gov/jobmanager-condor((queue=default)(jobtype=single))
- [ress_ITB_GRATIA_TEST_2] gr6x3.fnal.gov/jobmanager-condor((queue=default)(jobtype=single))
Notice above that ITB_GRATIA_TEST_1 now appears to be associated with gr6x4 and not the real one, gr6x3.
The reason for this is that the query brings back 2 sets of data with a name of ITB_GRATIA_TEST (the resource group). The installation appends a '_<counter>' to the resource_group. In this case, gr6x4 came back before gr6x3 in the query results, so it then looks like just a transposition.
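One way to double-check which gatekeeper really goes with which published name is to query the ReSS collector directly. This is only a sketch and assumes ReSS publishes the usual flattened Glue attributes (GlueSiteName, GlueCEInfoContactString):
condor_status -any -pool osg-ress-4.fnal.gov \
  -constraint 'GlueSiteName == "ITB_GRATIA_TEST"' \
  -format '%s ' GlueSiteName -format '%s\n' GlueCEInfoContactString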
This could just be an unfortunate (for me) problem caused by my naming the resources with an appended "_<number>".
The true reason for appending a counter to the resource group, at least as I am deducing it, is to handle the case where a resource (CE) has multiple job managers that can be used. In the example below, it also looks like there may be 3 resources in the same resource group.
- [ress_FNAL_FERMIGRID_ITB_1] fgitbgkc1.fnal.gov:2119/jobmanager-condor((queue=group_cms)(jobtype=single))
- [ress_FNAL_FERMIGRID_ITB_2] fgitbgkc1.fnal.gov:2119/jobmanager-condor((queue=group_us_cms)(jobtype=single))
- [ress_FNAL_FERMIGRID_ITB_3] fgitbgkc2.fnal.gov/jobmanager-condor((queue=group_cms)(jobtype=single))
- [ress_FNAL_FERMIGRID_ITB_4] fgitbgkc2.fnal.gov/jobmanager-condor((queue=group_us_cms)(jobtype=single))
- [ress_FNAL_FERMIGRID_ITB_5] fgitbgkp2.fnal.gov/jobmanager-pbs((queue=batch)(jobtype=single))
At this point the "I'm not sure I know what I am talking about and may be rambling" stuff starts, in bullet points:
- If we are attempting to align the OIM/MyOSG resource with the Gratia site name, it seems we will have a disconnect with how we reference glidein entry points. Contact information is at the resource level in OIM, not the resource group level.
- Several production resources are currently defined with the appended counter, as I have in ITB_GRATIA_TEST_1/2. So the only clue as to the real resource is through the node name of the gatekeeper.
- If, in the future, there is an intent to integrate something like planned maintenance/downtime queries of OIM/MyOSG, there is nothing available in glidein to do this.
As mentioned in the beginning, this is not necessarily a glidein problem but is something all should be aware of.
Expired user job proxy causes glexec to hang (v2.2 / glexec v0.6.8.3)
--
JohnWeigand - 2010/04/21
Description: If the user proxy used to run a job expires before the job completes, the glexec authorization process causes the job to "hang", thus tying up the WN resource.
Software versions:
- GlideinWMS v2.2
- glexec 0.6.8.3-osg1 (lcas 1.3.11.3 and lcmaps 1.4.8.4)
- VDT 2.0.99p16
This is the scenario used to test this problem:
- Created a proxy using voms-proxy-init with a '-valid 00:05' argument so the proxy would expire in 5 minutes
- Submitted a job that would run for 10 minutes (a minimal sketch of this test is shown after this list)
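A minimal sketch of that test (the submit file name is hypothetical; the job just sleeps for 600 seconds):
voms-proxy-init -voms cms -valid 00:05    # proxy good for only 5 minutes
condor_submit sleep600.submit             # job guaranteed to run ~10 minutes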
These were the results:
- User job was submitted
- User job was pulled down by the glidein pilot
- glexec authorized the job to run
- It ran for the 10 minutes and then glexec needed to do another authorization on it (for what reason I am clueless) and it recognized that the proxy had expired. It then went into an endless retry loop until I terminated the user job on the submit node 50 minutes later.
The user job's log file shows this:
000 (006.000.000) 04/20 13:59:51 Job submitted from host: <131.225.206.81:52721>
...
001 (006.000.000) 04/20 14:00:12 Job executing on host: <131.225.204.144:53728>
...
006 (006.000.000) 04/20 14:05:23 Image size of job updated: 9476
...
007 (006.000.000) 04/20 14:10:15 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to condor
0 - Run Bytes Sent By Job
4096 - Run Bytes Received By Job
...
007 (006.000.000) 04/20 14:10:18 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to the user
0 - Run Bytes Sent By Job
4096 - Run Bytes Received By Job
...
007 (006.000.000) 04/20 14:10:19 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to the user
0 - Run Bytes Sent By Job
4096 - Run Bytes Received By Job
I killed it here..
...
009 (006.000.000) 04/20 14:49:22 Job was aborted by the user.
via condor_rm (by user weigand)
The /var/log/glexec/lcas_lcmaps.log shows this (I have omitted what I consider irrelevant lines):
Start of job...
LCMAPS 0: 2010-04-20.14:00:13-29105 : lcmaps_plugin_gums-plugin_run(): gums plugin succeeded
LCMAPS 0: 2010-04-20.14:00:13-29105 : lcmaps.mod-lcmaps_run_with_pem_and_return_account(): succeeded
LCMAPS 7: 2010-04-20.14:00:13-29105 : Termination LCMAPS
LCMAPS 1: 2010-04-20.14:00:13-29105 : lcmaps.mod-lcmaps_term(): terminating
LCMAPS 7: 2010-04-20.19:00:13 : Termination LCMAPS
LCMAPS 1: 2010-04-20.19:00:13 : lcmaps.mod-lcmaps_term(): terminating
Job would have completed execution here..
LCAS 1: 2010-04-20.14:10:15-01823 :
LCAS 1: 2010-04-20.14:10:15-01823 : Initialization LCAS version 1.3.11.3
LCMAPS 1: 2010-04-20.14:10:15-01823 :
LCMAPS 7: 2010-04-20.14:10:15-01823 : Initialization LCMAPS version 1.4.8-4
LCMAPS 1: 2010-04-20.14:10:15-01823 : lcmaps.mod-startPluginManager(): Reading LCMAPS database /etc/glexec/lcmaps/lcmaps-suexec.db
LCAS 0: 2010-04-20.14:10:15-01823 : LCAS already initialized
LCAS 2: 2010-04-20.14:10:15-01823 : LCAS authorization request
LCAS 1: 2010-04-20.14:10:15-01823 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /etc/glexec/lcas/ban_users.db
LCAS 1: 2010-04-20.14:10:15-01823 : lcas.mod-lcas_run_va(): succeeded
LCAS 1: 2010-04-20.14:10:15-01823 : Termination LCAS
LCAS 1: Termination LCAS
LCMAPS 0: 2010-04-20.14:10:15-01823 : LCMAPS already initialized
LCMAPS 5: 2010-04-20.14:10:15-01823 : LCMAPS credential mapping request
LCMAPS 1: 2010-04-20.14:10:15-01823 : lcmaps.mod-runPlugin(): found plugin /usr/local/osg-wn-client/glexec-osg/lib/modules/lcmaps_verify_proxy.mod
LCMAPS 1: 2010-04-20.14:10:15-01823 : lcmaps.mod-runPlugin(): running plugin /usr/local/osg-wn-client/glexec-osg/lib/modules/lcmaps_verify_proxy.mod
LCMAPS 1: 2010-04-20.14:10:15-01823 : Error: Verifying proxy: Proxy certificate expired.
LCMAPS 1: 2010-04-20.14:10:15-01823 : Error: Verifying proxy: Proxy certificate expired.
LCMAPS 1: 2010-04-20.14:10:15-01823 : Error: Verifying certificate chain: certificate has expired
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps_plugin_verify_proxy-plugin_run(): verify proxy plugin failed
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps.mod-runPluginManager(): Error running evaluation manager
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps.mod-lcmaps_run_with_pem_and_return_account() error: could not run plugin manager
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps.mod-lcmaps_run_with_pem_and_return_account(): failed
LCMAPS 1: 2010-04-20.14:10:15-01830 : LCMAPS failed to do mapping and return account information
LCMAPS 7: 2010-04-20.14:10:15-01830 : Termination LCMAPS
LCMAPS 1: 2010-04-20.14:10:15-01830 : lcmaps.mod-lcmaps_term(): terminating
LCMAPS 7: 2010-04-20.19:10:15 : Termination LCMAPS
LCMAPS 1: 2010-04-20.19:10:15 : lcmaps.mod-lcmaps_term(): terminating
LCAS 1: 2010-04-20.14:10:15-01834 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /etc/glexec/lcas/ban_users.db
LCAS 1: 2010-04-20.14:10:15-01834 : lcas.mod-lcas_run_va(): succeeded
LCAS 1: 2010-04-20.14:10:15-01834 : Termination LCAS
LCAS 1: Termination LCAS
This then goes into an endless loop until the job was killed on the submit node.
It appears to retry twice every 2 minutes and the logs fill up very fast.
Update 1 -- JohnWeigand - 2010/04/26:
Feedback from Igor on 2010/04/21 is that this is a known problem that Condor is aware of and has plans to fix but there are no hard dates for its resolution.
However, I don't quite understand how this is a Condor issue. This may be due to my ignorance, since the problem appears to be in glexec/lcas/lcmaps. Unless it is how Condor handles the detected error, simply restarting the failed job rather than letting it die.
Regardless, I decided to change the test a little to see how it would handle other jobs in the submit node queue that had valid non-expired proxies.
- I started 2 jobs with a CMS proxy with a life of 5 minutes guaranteed to run 10 minutes. This consumed the 2 slots (and pilots) available on the test cluster.
- I then submitted 4 more jobs using a dzero proxy with a 180 hour lifetime guaranteed to run 2 minutes. These sat idle in the submit queue while the CMS jobs ran.
- To my surprise, when the 2 CMS jobs completed after 10 minutes and glexec/lcas/lcmaps went into the failure mode related to the expired proxies, the pilots brought down the remaining jobs (with dzero proxies) in the submit queue and processed them successfully.
- When these completed, glexec/lcas/lcmaps continued to error on the 2 CMS jobs.
Now out of curiosity, I renewed the proxies the 2 CMS jobs were using and then, again to my surprise, they authorized and completed successfully.
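("Renewed" here just means regenerating the proxy file that the submitted jobs point to (e.g. via x509userproxy); a sketch with a hypothetical output path:)
voms-proxy-init -voms cms -valid 24:00 -out /tmp/x509up_u5555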
This may be the expected behavior but I do not understand the "why".
Update 2 -- JohnWeigand - 2010/05/03 (from Igor's reply):
1. Related to understanding why this is a Condor issue.
"Condor is the one that is calling glexec.
And it is expecting that glexec will always succeed.
So it starts the job, when the proxy is valid.
When the job finishes it tries to use glexec to fetch the results and do the cleanup...
if the proxy is not valid anymore, it will fail"
2. Related to why, when the CMS jobs completed and failed the authorization on cleanup, the dzero jobs in the submit queue started.
"Not too suprising... Condor simply gave up on the CMS jobs (after a timeout?),
since it cannot do anything about them.
I suppose at this point the dzero jobs had a better priority, so the were matched to that job slot."
3. Related to why, when all jobs in the submit queue were processed, the pilot continued to attempt the cleanup authorization on the CMS jobs.
"So condor just restarted them?
If this is the case, it is not too surprising either...
Condor guarantees "job runs at least once" policy...
given that it was not able to fetch the result from the first run, it is not counting that as a run, so it tries again"
4. Related to why the CMS jobs then completed successfully after their proxy was renewed on the submit node.
"This is reasonable...
Condor will delegate the proxy from the schedd to that startd (or shadow to starter) every time it tries to start the job...
so the moment the schedd had a valid proxy, it was delegated to the glidein side where it was used to call glexec... and things started to work"
Update 3 -- JohnWeigand - 2010/05/18:
This issue was brought to the attention of Condor support in an effort to escalate the priority. Related to this was how to handle a similar issue of "banned/blacklisted" users.
Handling of "blacklisted" users
--
JohnWeigand - 2010/05/18
This was identified by Burt Holzman and is analogous to the issue of expired proxies.
"If a user is banned from a site (but not the pilot), essentially the same thing happens -- the user jobs match, start, fail the initial glexec, get rescheduled, etc."