
Task List

Convention

  • Task is complete, tested and merged into the release branch
  • Task is merged into the release branch, but may not be complete or may need further testing or review
  • Tasks that are not coloured are in individual developer branch or workspace and not merged into the release branch. Completion status doesn't matter.
  • The name of the team member working on the feature is listed next to the task

Issues that need stakeholder input

  • Changing license to new one from BSD
  • SL4 support and Python version
  • Support for SL6

Releases

Release v2.3

v2.3 Release Page

Release v2.4

v2.4 Release Page

Release v2.5

  • Installer creates a malformed glideinWMS.xml

    • This happens when factory is configured to use a default factory. There is a trailing character 'n' at the end of <security ...>

  • Improvements to the Monitoring

    • Expanding the monitoring by adding additional attributes to monitor. This changes the RRD structure enough that it is not backwards compatible (i.e. the RRDs need to be recreated)
    • I am changing the matching logic. Nothing changes from the user's point of view, though; it should just be much more efficient
  • BUG: Frontend keeps on submitting Idle glideins

    • Check email thread on glideinwms mailing list
    • This only affects CREAM CEs that are not configured correctly: glideins go into the held state. Auto-releasing the glideins for CREAM does not work, since we cannot distinguish between recoverable and non-recoverable glideins. Condor does not pass back HeldStatusSubCode to make this distinction.
  • Frontend should try to recover the crashed group before it gives up and shuts down (Parag)

  • Add support for use of TCP in condor_advertise (IS)

  • Add support for condor_advertise -multiple (Condor v7.5.4+) (IS)

  • Improved factory stopping procedure (IS)

  • Installer can now install gridFTP and VOMS certs needed by CREAM (IS/JW?)

  • Improved Documentation (Doug)

    • Merge Doug's changes to documentation into branch_v2plus
  • BugFix: Factory entry sometimes stops reporting when it gets an exception for any reason

    • Exception handling bypasses the advertising of the entry classad
  • Top-level schedd_status.xml malformed Total data (Jeffery Dost)

    Subject: Bug report: Top-level schedd_status.xml malformed Total data
    Date: Tue, 07 Sep 2010 12:06:37 -0700

    Dear glideinWMS team.

    We just noticed that the top-level schedd_status.xml does not properly aggregate the numbers.
    The total section is essentially empty!
    See: http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v3_1/schedd_status.xml as an example.

    Igor

    PS: We are not using that for anything right now, but we may in the future.
  • Graceful shutdown of the glidein by trapping signals in glidein_startup (Doug)

    • Do scale testing on sleeper pool.
  • CorralWMS requests

    • Support for GLIDEIN_Monitoring_Enabled to disable monitoring slots requested by Corral project. Defaults to True
    • Support for GLIDEIN_Max_Walltime attr (which is used to calculate the GLIDEIN_Retire_Time)
    • Documentation for all the new features
    • Should be backward compatible with reasonable fallback defaults
    • Try to see if we can do a simple reconfig on v2.4.3 config and be in business
  • Split the glideinWMS release into factory and frontend release (Parag)

    • Fallout from CorralWMS meeting
    • This way factory can be installed independently and the Frontend can be distributed separately.
    • Make an independent package gwms_release_manager to do this and much more
  • Factory should be smarter about handling held glideins (Parag)

    • So we should be a little smarter how we handle them in the code.
    • We should at least distinguish temporary problems from the permanent ones.

    | Globus Error Code | Held Reason | Job is Recoverable |
    | 10 | globus_xio_gsi: Token size exceeds limit. Usually happens when someone tries to establish an insecure connection with a secure endpoint, e.g. when someone sends plain HTTP to an HTTPS endpoint without | No |
    | 121 | the job state file doesn't exist | No |
    | 126 | it is unknown if the job was submitted | Yes |
    | 12 | the connection to the server failed (check host and port) | Yes |
    | 131 | the user proxy expired (job is still running) | Maybe |
    | 17 | the job failed when the job manager attempted to run it | No |
    | 22 | the job manager failed to create an internal script argument file | No |
    | 31 | the job manager failed to cancel the job as requested | No |
    | 3 | an I/O operation failed | Yes |
    | 47 | the gatekeeper failed to run the job manager | No |
    | 48 | the provided RSL could not be properly parsed | No |
    | 4 | jobmanager unable to set default to the directory requested | No |
    | 76 | cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space | Maybe (short term: No) |
    | 79 | connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ... | No |
    | 7 | an authorization operation failed | Yes |
    | 7 | authentication with the remote server failed | Yes |
    | 8 | the user cancelled the job | No |
    | 94 | the jobmanager does not accept any new requests (shutting down) | Yes |
    | 9 | the system cancelled the job | No |
    | ? | Job failed, no reason given by GRAM server | No |
    | 122 | could not read the job state file | Maybe (short term: No) |
    | 132 | the job was not submitted by original jobmanager | No (likely to be fatal) |
  • Provide the ability to black/white list VOs in the factory on a per-entry basis (Doug)

    6/10/10 from Igor Sfiligoi - I have a feature request for the gfactory. Provide the capability to blacklist (or maybe whitelist or both) which VO can use which entry. Right now the only control is at the collector level, so I can only black/whitelist the whole VO, but I cannot do it on an entry-by-entry level. I don't need to do this most of the time (the VO is free to choose which entry to use, and if it does not work for them.... oh well), but there are some edge cases when I do want to prevent a VO from going to an entry (it may not work, for example).
  • Publish glideinWMS code version in classads (Parag)

    • Version should be dynamically computed and matched against the stock released version
    • Append patched/updated to the version string if there are any changes determined in the code base.
    • Preferably, use hashing algorithms to control the version
    • Create release tools that automatically generate a release with version file.
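The version check described above could be sketched roughly as follows. This is only an illustration, not the release tool itself: the function names, the manifest layout (relative path mapped to a SHA-256 digest) and the " (patched)" suffix are all assumptions.

```python
import hashlib
import os


def hash_tree(root):
    """Return {relative_path: sha256 hex digest} for every file under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def effective_version(base_version, released_manifest, root):
    """Return base_version, with a hypothetical ' (patched)' suffix appended
    when the files under root differ from the manifest of the stock release."""
    if hash_tree(root) == released_manifest:
        return base_version
    return base_version + " (patched)"
```

A release tool would write the output of hash_tree() into the version file at release time; the running daemon would then call effective_version() and publish the result in its classad.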
  • Possible Bug in Factory Monitoring (Igor)

    • Found and fixed in v2.5.0.
    • The history log parser would not always augment a field.

    -------- Forwarded Message --------
    From: Igor Sfiligoi 
    To: 
    Subject: Monitoring problems
    Date: Tue, 07 Sep 2010 11:21:33 -0700
    
    Hi Chris.
    
    Thanks for finding this out.
    Dear glideinWMS team: Seems we have a bug in the glideinWMS code (v2_4_2)!
    
    Igor
    
    
    On 09/07/2010 11:12 AM, cmurphy@physics.ucsd.edu wrote:
    > Today's error log for this directory
    > ~/glideinsubmit/glidein_Production_v3_1/log/entry_HCC_BR_UNESP give the
    > following error
    >
    > Exception at Tue Sep  7 07:25:22 2010: Traceback (most recent call last):
    >    File
    > "/glidein01/systems/gfactory/glideinWMS/factory/glideFactoryEntry.py",
    > line 447, in iterate
    >      write_stats()
    >    File
    > "/glidein01/systems/gfactory/glideinWMS/factory/glideFactoryEntry.py",
    > line 351, in write_stats
    >      glideFactoryLib.factoryConfig.log_stats.write_file()
    >    File
    > "/glidein01/systems/gfactory/glideinWMS/factory/glideFactoryMonitoring.py",
    > line 854, in write_file
    >      xml_str=('<?xml version="1.0" encoding="ISO-8859-1"?>\n\n'+
    >    File
    > "/glidein01/systems/gfactory/glideinWMS/factory/glideFactoryMonitoring.py",
    > line 741, in get_xml_data
    >      data=self.get_data_summary()
    >   File
    > "/glidein01/systems/gfactory/glideinWMS/factory/glideFactoryMonitoring.py",
    > line 734, in get_data_summary
    >      completed_stats=self.get_completed_stats(entered_list)
    >    File
    > "/glidein01/systems/gfactory/glideinWMS/factory/glideFactoryMonitoring.py",
    > line 589, in get_completed_stats
    >      enle_jobs_duration=enle_condor_stats['Total']['secs']
    > KeyError: 'Total'
    >
    >
    > Both BR_UNESP and Michigan have this error:
    >
    >   File
    > "/glidein01/systems/gfactory/glideinWMS/factory/glideFactoryMonitoring.py",
    > line 789, in get_diff_summary
    >      sdel[4]['username']=username
    > IndexError: tuple index out of range
  • Aggregation problem for rates in Factory Monitoring (John)

    Igor - Date: Fri, 08 Oct 2010 17:13:07 -0700
    I think we have a problem in the monitoring when aggregating rates; 
    compare http://sfiligoi-desktop.ucsd.edu/tmp/gwms_1/glogs_total.png 
    and http://sfiligoi-desktop.ucsd.edu/tmp/gwms_1/glogs_one.png  
    Both show the same period on the same factory, but one is aggregated 
    and the other is the source (the only active entry... two others were
     not active in that period). 
    As you can see, the right scale is completely different.  I am pretty sure 
    the non-aggregated value is the correct one.  
    

    -- JohnWeigand - 2010/10/15
    History: Back in May 2009, in glideFactoryMonitoring.py revision 1.294, a new method was added called write_rrd_multi_hetero that allowed for specifying different rrd data types for rates ("ABSOLUTE") versus normal ("GAUGE") data. Prior to that all types were specified as GAUGE. The non-aggregate (glideFactoryMonitoring.py) module was changed to use this method. However, the aggregate (glideFactoryMonitorAggregator.py) module was never changed to use this new method. This was causing the right side y-axis for rates to show different scales.

    The only change then was to the glideFactoryMonitorAggregator.py module for changing:

    glideFactoryMonitoring.monitoringConfig.write_rrd_multi("%s/Log_Counts"%fe_dir, "GAUGE",updated,val_dict_counts)
    to
    glideFactoryMonitoring.monitoringConfig.write_rrd_multi_hetero("%s/Log_Counts"%fe_dir, val_dict_counts_desc,updated,val_dict_counts)

    Committed in branches: branch_v2plus, branch_v2_4plus on 10/19/10.

    Important: At this time, we have not figured out how to modify the existing RRD file for this data (./totals/Log_Counts.rrd) displayed on the factoryLogStatus.html page when the entry of 'total' is specified. So, if consistent data between the 'total' and all entry points is desired, history will have to be sacrificed and the ./total/LogCounts.rrd files should be removed at the top level and in all entry point sub-directories. The ./total/LogCounts.rrd file affected by this change is the top level one so another alternative is to just remove that file and leave the individual entry point files as is.
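To see why mixing the two RRD data source types changes the y-axis scale, here is a small stand-alone model of how rrdtool treats a submitted value. It does not call rrdtool; the function name and the numbers (60 events in a 300 second interval) are made up for illustration.

```python
def stored_rate(ds_type, value, seconds_since_last_update):
    """Model the per-second value rrdtool derives from one update.

    GAUGE:    the submitted value is stored as it is.
    ABSOLUTE: the submitted value is a count accumulated since the last
              update, so it is divided by the elapsed time.
    """
    if ds_type == "GAUGE":
        return float(value)
    if ds_type == "ABSOLUTE":
        return value / float(seconds_since_last_update)
    raise ValueError("unsupported data source type: %r" % ds_type)


# The same update lands on very different scales depending on the type.
as_gauge = stored_rate("GAUGE", 60, 300)        # stored as 60
as_absolute = stored_rate("ABSOLUTE", 60, 300)  # stored as 60/300 per second
```

With the aggregator writing rates as GAUGE and the per-entry code writing them as ABSOLUTE, the two graphs differ by a factor equal to the update interval, which matches the mismatch in scale reported above.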

Release v2.5.1

  • Handle SIGHUP to glideins

    • If a site sends the glideins a SIGHUP, we should indeed ignore it. Dying on SIGHUP is not an appropriate answer, and there is nothing for us to reload after Condor started.
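As an illustration of the intended behaviour only (the actual glidein startup code may be written in a different language), ignoring SIGHUP in a Python process looks like this. The helper name is hypothetical.

```python
import os
import signal


def ignore_sighup():
    """Tell the OS to discard SIGHUP for this process.

    Returns the previous handler so that a caller can restore it later.
    """
    return signal.signal(signal.SIGHUP, signal.SIG_IGN)
```

After ignore_sighup() has been called, a SIGHUP sent by the site's batch system is dropped instead of terminating the process; the ignored disposition is also inherited by child processes started afterwards.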

  • Add support for Java universe in glideinWMS

    I was looking at this and fooling around on the sleeper pool. All that is needed to support a "Hello world" job (via both class files and jar files) was the JAVA parameter. The others are mainly for setting the CLASSPATH (which will depend on if the application needs some native java libs or not).

    I suppose we could change the glidein_startup to look in an obvious place for java, try to run a simple test classfile and add the JAVA line to the startd config and publish the version if it succeeds? I think there may be many sites that don't have things properly configured so we want to insulate ourselves .. but I haven't tried to see what's out there.

    - B

    Just setting JAVA is not enough; it seems it needs the Condor-specific jar as well:

  • Factory & Frontend: frontend can request removal of glideins (Igor)

    Implemented in 2.5.1.
    • Frontend asks factory to remove glideins based on some internal criteria
    • Factory performs removal (need to confirm that the frontend asking for removal is the same one that asked the factory to create them)

  • Factory should remove glideins related to old jobs

    • Implemented in 2.5.1, as part of the protocol for the frontends to remove idle glideins. Frontends now also can remove old glideins.
    • Concerns to Address:
      • How to relate a glidein to a specific job? Can we even do that right now?
      • Maybe the job is really useful to the user even if it's old. Just that the user has low priority and the job couldn't start
      • What if the user submits more jobs that are really useful? Or is the user tricking the system to get CPU cycles when they are really needed?

  • Configuration File based installer (JW)

    • Document the installer
    • Provide sample configurations

Release v2.5.2

  • Add following to the user schedd's config by default (Parag)

    #
    # Add default attributes
    #
    JOB_Site = "$$(GLIDEIN_Site:Unknown)"
    JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
    JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
    JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
    JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
    JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:Unknown)"
    JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
    JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
    
    SUBMIT_EXPRS = $(SUBMIT_EXPRS) JOB_Site JOB_GLIDEIN_Entry_Name JOB_GLIDEIN_Name JOB_GLIDEIN_Factory JOB_GLIDEIN_Schedd JOB_GLIDEIN_ClusterId JOB_GLIDEIN_ProcId JOB_GLIDEIN_Site
    
  • Package jar file to support java (Parag)

    • When the starter begins, it tries to execute a benchmark from scimark2lib.jar -- if successful, it then populates JAVA_MFLOPS in the machine ClassAd. We should add this jar to the condor tarball.
  • Default behaviour: Config should not overwrite the config file. (Parag)

  • Monitor glidein pre-emption through Condor logs (Parag)

    Hi.

    I recently noticed that Condor-G now logs preemptions at remote sites; I talked to Jaime and he confirms Condor-G has been doing it for a while now. (although it is not 100% reliable, yet)

    The gfactory right now does not monitor and/or report these. Can this be added, please? It would make debugging MUCH easier.

    Thanks,
       Igor
       for the OSG gfactory team

    PS: Here is an example of a job on a "misbehaving" site:
    000 (82725.000.000) 07/26 10:08:22 Job submitted from host: <169.228.130.10:45733>
    017 (82725.000.000) 07/26 10:08:34 Job submitted to Globus
    027 (82725.000.000) 07/26 10:08:34 Job submitted to grid resource
    001 (82725.000.000) 07/26 10:10:46 Job executing on host: gt2 osg-gw.clemson.edu:2119/jobmanager-condor
    004 (82725.000.000) 07/26 10:16:46 Job was evicted.
    001 (82725.000.000) 07/26 10:19:46 Job executing on host: gt2 osg-gw.clemson.edu:2119/jobmanager-condor
    004 (82725.000.000) 07/26 10:50:19 Job was evicted.
    001 (82725.000.000) 07/26 10:51:18 Job executing on host: gt2 osg-gw.clemson.edu:2119/jobmanager-condor
    005 (82725.000.000) 07/26 11:20:00 Job terminated.
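A rough sketch of how the factory could count evictions from log lines shaped like the ones in the example above. The regular expression is derived only from that example, and the function and constant names are hypothetical; a real parser would also have to deal with multi-line event bodies.

```python
import re
from collections import Counter

# "004 (82725.000.000) 07/26 10:16:46 Job was evicted."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)
EVICTED = "004"  # event code shown next to "Job was evicted." in the example


def count_evictions(lines):
    """Return a Counter mapping 'cluster.proc' to its number of eviction events."""
    evictions = Counter()
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match and match.group(1) == EVICTED:
            job_id = "%d.%d" % (int(match.group(2)), int(match.group(3)))
            evictions[job_id] += 1
    return evictions
```

Lines that do not look like an event header are skipped, so the function can be fed a whole log file line by line.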

  • Accounting error in Factory/Frontend (Parag)

    • This happens when two or more entries share same site name
    • Update the User Running here in rrd
  • Factory reusing the old keypair after restart (Parag)

    Date: Fri, 08 Oct 2010 17:51:28 -0700

    Hi all.

    The current gfactory creates a new public/private key pair at each restart (which includes reconfigs).

    While this is good, as it keeps it fresh, it has a nasty side effect; any existing requests from frontends are ignored (since they use the old key).

    I propose we keep the old key around for at least 10 or 20 cycles, and accept frontend adds with either old or new key. (After 20 or so cycles, we should throw away the key and ignore any old frontend adds... they are obviously stale)

    What do you think?

    Igor
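The proposal above could be modelled with a small key ring that remembers one previous key for a limited number of factory cycles. This is a sketch under assumptions: the class, its method names and the default of 20 cycles are illustrative, and key identifiers are plain strings here.

```python
class KeyRing:
    """Track the current key plus one old key that expires after N cycles."""

    def __init__(self, max_old_cycles=20):
        self.max_old_cycles = max_old_cycles
        self.current = None
        self.old = None
        self.old_age = 0

    def rotate(self, new_key_id):
        """Called at restart/reconfig: the current key becomes the old key."""
        self.old = self.current
        self.old_age = 0
        self.current = new_key_id

    def end_cycle(self):
        """Called once per factory cycle; drops the old key when it is stale."""
        if self.old is not None:
            self.old_age += 1
            if self.old_age >= self.max_old_cycles:
                self.old = None

    def accepts(self, key_id):
        """True if a frontend request encrypted with key_id should be honoured."""
        return key_id is not None and key_id in (self.current, self.old)
```

Requests that still use the previous key are honoured during the grace period and ignored as stale afterwards.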
  • Publish site info known by Frontend to the Userpool (Parag)

    • Frontend should publish the entry info it got from the factory to the user pool
    • This has advantage of letting the user know what is available and they can make some smart decisions
    • It should not be used for anything else on the frontend side
    5/11/10 from Igor Sfiligoi - I would like to see the frontend publish a classad with a summary of what  entries are available. The users need to know what is available before asking for it! And asking them  to query the factory(s) does not make sense. I found I needed something like this when looking after  the CMS CRAB pool at UCSD. Does it seem reasonable? (at least at this level... details would need  to be ironed out)  
  • BUG: Improper termination of glidein causes condor_started=false in monitoring (Doug)

    • SIGHUP/preemption at Purdue

  • Limit the max number of glideins per frontend (Doug)

    • This is just specific to the frontend.

  • Use DAEMON_SHUTDOWN to shutdown glidein daemons (Doug)

    • Only supported in condor 7.4+
       Relevant info from Dan's email:
       shutdown fast - disregard MaxJobRetirementTime and hard-kill jobs immediately
       shutdown graceful - respect MaxJobRetirementTime and when that expires, soft-kill jobs; if SHUTDOWN_GRACEFUL_TIMEOUT time passes, then stop respecting MaxJobRetirementTime and elevate to a fast shutdown
       shutdown peaceful - same as graceful, but MaxJobRetirementTime=infinity
       And recall that MaxJobRetirementTime is counted from when the job began running, not from when the eviction happened. So your policy of MaxJobRetirementTime=30 means any job that has already run for more than 30 seconds will be evicted immediately when entering graceful shutdown mode.
       I agree that what glideinWMS probably wants is peaceful shutdown, not graceful shutdown. As Igor suggested, this can be achieved by using DAEMON_SHUTDOWN. I think one would need to adjust the START expression to stop accepting new jobs after some amount of time and then adjust the STARTD.DAEMON_SHUTDOWN expression to shut down the startd once the jobs go away. The MASTER.DAEMON_SHUTDOWN expression can be set to shut down the master when the startd goes away.
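The difference between the three shutdown modes can be made concrete with a small model of how long a running job survives after the shutdown starts. It only encodes the description quoted from Dan's email; the function and argument names are hypothetical and all times are in seconds.

```python
import math


def seconds_until_eviction(mode, job_runtime, max_job_retirement_time,
                           graceful_timeout=math.inf):
    """How long a running job is left alone once the shutdown begins.

    job_runtime is counted from when the job began running, which is why a
    short MaxJobRetirementTime evicts long-running jobs immediately.
    """
    if mode == "fast":
        return 0
    if mode == "peaceful":
        # Same as graceful, but with an infinite retirement time.
        return math.inf
    if mode == "graceful":
        remaining = max(0, max_job_retirement_time - job_runtime)
        # After the graceful timeout the shutdown is elevated to fast.
        return min(remaining, graceful_timeout)
    raise ValueError("unknown shutdown mode: %r" % mode)
```

For example, with MaxJobRetirementTime=30 a job that has already run for 45 seconds gets 0 seconds under a graceful shutdown, while a peaceful shutdown lets it run to completion.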

  • Allow factory to specify if an entry point (CE) requires voms proxies only for pilot and user jobs (Doug)

    7/1/10 from Igor Sfiligoi - Some sites (entry points) allow only jobs with voms proxies to access their resources. The glexec-enabled glidein currently requires that user jobs have just grid proxies. This needs to be expanded to allow the factories to specify, additionally, if voms proxies are required on user jobs for an entry point, and to apply that criterion in the glidein job selection process.
  • BUG: Daylight Saving possibly messing up the factory accounting (Doug)

    Check Igor's two emails sent to glideinwms@fnal.gov on Sun, 07 Nov 2010 09:38:12 -0800

  • BUG: Factory reports glideins as completed several times, even after a long time (Doug)

    • Happens when the glidein proxies are refreshed (?)
    
    Date: 	Tue, 12 Apr 2011 15:56:17 -0700 (04/12/2011 05:56:17 PM)
    
    Hello GlideinWMS Team,
    
    Today we discovered an error in which it appears that glideins that have already 
    terminated are appearing multiple times later in the completed_jobs logs (see below 
    output).  The problem does not only happen here but must be affecting whatever the 
    completed_jobs logs get their info from.  We suspect the equivalent Log_Completed rrd's 
    also have the duplicated information because our analyze_entries tool is also reporting 
    duplicates.
    
    The case that made this apparent today was due to a glidein that failed because of its 
    proxy expiring.  The glidein really failed at 9:05 am PST yesterday the 11th.  Since 
    then the proxy has been renewed.  However if you see below you will see that the same 
    glidein is reported to have finished 6 more times today.
    
    This caused confusion because both the completed_jobs log and the analyze_entries email 
    continued to report startup errors today when really no glideins have failed since the 
    proxy was renewed yesterday at 11:30 am PST.  Also the actual job error log is 
    timestamped with the original time:
    
    $ ls -l 
    /home/gfactory/glideinsubmit/glidein_Production_v4_0/client_log/user_fehcc/entry_CMS_T2_US_UCSD_gw2/job.36506.2.err
    -rw-r--r-- 1 fehcc fehcc 8558 Apr 11 09:05 
    /home/gfactory/glideinsubmit/glidein_Production_v4_0/client_log/user_fehcc/entry_CMS_T2_US_UCSD_gw2/job.36506.2.err
    
    Can you please have a look and see why this might be happening?  In the meantime I will 
    try and come up with a way to consistently reproduce the problem.
    
    It is urgent that this is addressed soon because we rely on dependable monitoring 
    information for the daily operations.
    
    Thank you,
    Jeff Dost
    OSG Glidein Factory Operations
    
    $ grep 36506.002  log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_2011041*.log
    log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110411.log:
    log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110411.log:
    log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:
    log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:
    log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:
    log/entry_CMS_T2_US_UCSD_gw2/completed_jobs_20110412.log:
    

  • Factory & Frontend: Check proxy validity (Dennis)

    • Currently only the glideins check the validity of the proxy
    • This means the factory can submit many glideins that can either fail at the gatekeeper or even at the worker node.
    • This is all work that could be avoided

  • Frontend Monitoring: Monitor RequestedIdle (Igor + student)

    I just noticed we do not monitor how many glideins have been requested, on the frontend web page. This would be very useful to know, when debugging problems.
    

  • New feature: Frontend sets global limits (Igor)

    • Derek requested the ability to set a system-wide limit for the number of glideins that a frontend requests.
    • CMS has a similar need.
    • Implemented as per-group limit, on May 26th 2011

  • Bug: Frontend double counts glideins when multiple groups are used (Igor)

    • Derek reports that he sees double counting of glideins in the monitoring
    • Seems to be due to the use of multiple groups
    • Fixed on May 26th 2011, revision 1.52.2.22

  • Bug: GLIDEIN_Max_Walltime not used (Igor)

    • Code is looking for it in the environment, but there is no guarantee it will be there.
    • Fix: Extract the values from the config file.

  • Better checks to handle misconfigured glexec worker nodes (Igor)

    Hi guys.

    Yesterday we had a problematic node at UCSD that had glexec misconfigured; all calls to it would fail. A glidein started on that node and started receiving user jobs; of course, none managed to start! Plus, due to a user periodic remove in the user jobs, jobs were aborted due to excessive restarts... creating real damage.

    We should protect the system from this; one obvious way to do it is to test for basic glexec functionality in glexec_setup.sh. My proposal is to try to execute a basic command (like /usr/bin/id) using the pilot proxy itself; if that fails, we should be pretty much guaranteed that nothing will work.

    Any comments/objections? Any volunteers? ;)

    Cheers,
       Igor

    • On 5/18/2011, committed patch to glexec_setup.sh
      • Execute glexec with pilot proxy
      • Fail if glexec call fails
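The committed patch lives in glexec_setup.sh; the Python below is only a sketch of the same idea, namely run one trivial command through the wrapper and treat any failure as fatal. The helper name is hypothetical, and the exact glexec command line and environment are left to the caller because they are not given here.

```python
import subprocess


def basic_command_works(command, env=None):
    """Return True only if the command starts and exits with status 0.

    command is an argument list, for example a glexec binary followed by
    /usr/bin/id; env, if given, is the environment that points glexec at the
    pilot proxy (how that is done is an assumption left to the caller).
    """
    try:
        result = subprocess.run(
            command,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # Binary missing or not executable: glexec cannot work on this node.
        return False
    return result.returncode == 0
```

A validation script would call this once at glidein startup and abort before any user job is accepted if it returns False.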
  • Improve graphics labels in monitoring (Igor + student)

    5/11/10 from Igor Sfiligoi: Terrence pointed out that the labels of the gfactory Web graphs are cryptic. While they have to be short during display to save space, the two of us agreed that it would be nice to have a longer description somewhere else.

  • Frontend improvement - better malformed matching handling

    Today I was experimenting with the v2_4 frontend, and I used a
    match expression that was valid for only a subset of the jobs.
    
    The current frontend behavior is to abort the matchmaking altogether;
    while the daemon does not die (the exception is caught down the line and reported),
    it does not do any good either.
    
    We should change the behavior so that any exception is caught as soon as possible
    and the semantics becomes that "exceptional" jobs and factories simply don't match...
    the well behaving ones instead still do.
    This will just slow down the matchmaking process, but will still work for most of the jobs.
    We should still report something is amiss, so that the frontend admin can do something about it,
    but should be in an aggregated way.
    Say: "233 out of 5000 matches threw an exception... here is one..."
    
    The line to look for is
    glideinWMS/frontend/glideinFrontendLib.py:165
    
    What do you think?
    
    Cheers,
       Igor
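The behaviour proposed in this email could look roughly like the sketch below: each job/factory evaluation is guarded individually, failing pairs simply do not match, and a single aggregated message is produced. The function name and the dictionary-based jobs and factories are assumptions for illustration, not the frontend's actual data structures.

```python
def count_matches(jobs, factories, match_fn):
    """Evaluate match_fn for every (job, factory) pair.

    Pairs whose evaluation raises are treated as non-matching. Returns
    (number_of_matches, summary), where summary is None if nothing failed.
    """
    matches = 0
    attempts = 0
    failures = 0
    first_error = None
    for job in jobs:
        for factory in factories:
            attempts += 1
            try:
                if match_fn(job, factory):
                    matches += 1
            except Exception as err:  # one bad pair must not abort the rest
                failures += 1
                if first_error is None:
                    first_error = err
    summary = None
    if failures:
        summary = "%d out of %d matches threw an exception... here is one: %r" % (
            failures, attempts, first_error)
    return matches, summary
```

The summary string is meant to be logged once per matchmaking pass, so the frontend admin sees that something is amiss without being flooded by one message per job.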
    

  • Frontend Monitoring: mapping of entry sites in FE monitoring (Igor + student)

    The problem I have is that I get absolutely no information about which site the jobs were running (or even are requesting glideins) from the frontend web monitoring.
    All the information is in the classads flying back and forth, but it gets lost when exposed through the Web interface. 
    
  • Miscounting Error Time (maybe 2.5.2?)

    From: Igor Sfiligoi <sfiligoi@fnal.gov>
    Over the week-end FNAL was having problems with glexec, yet there was no easy way to tell from the monitoring that we had any problems at all.
    The gfactory reported only "a large fraction of idle time" for the glideins.
    
    We should do better than that.
    We should be able to easily distinguish between time spent due to VO problems (i.e. idle) and time wasted due to the glideins not doing their jobs (e.g. failing to start the job).
    

Release v2.5.3

  • New Configuration format / Use default XML libraries to handle existing configuration

  • CorralWMS requests

    • Chunksize --- Not going in v2.5 until we have it supported by other grid types. Might move to other release versions.
  • Consolidate the condor_tarball to read in a list of os, arch, version

    It would be much better if we could provide a list; like "default,7.4,7.4.3"  i.e. the above would be represented by a single line:          And the reconfig would automatically create the whole matrix (actually, a cube). Possibly even use symlinks or hardlinks between them. 

  • Provide the ability to specify the RSL on a VO-by-VO basis

    6/14/10 from Igor Sfiligoi - Several sites (i.e. most non-condor sites) require a different RSL for each VO submitting to them. Having a complete new entry for each VO for each site is annoying. The site is functionally identical from the gfactory point of view. It would be nice to have an option to massage the RSL on a VO-by-VO basis (i.e. frontend-by-frontend basis). Possibly not the whole RSL but just the relevant part.

  • Factory: GUI for maintenance of glideinWMS.xml configuration

    • glideinWMS.xml maintenance via text editor for many entry points is difficult

Release v2.6

  • Factory: dynamic entry point update from information services (Igor + student)

    • In a pluggable fashion, allow factories to dynamically reconfigure entry points
    • Write plugins for ReSS, BDII and document API
  • Factory: allow factory admin to test sites with frontend-provided proxy (Igor + student)

  • Factory: put in place a mechanism in the factory to automatically detect bad entry points (Igor + student)

    • Due to glideins not starting (Condor-G/GRAM problems)
    • Glideins failing during validation (single node? fraction of site? all glideins?)
    • Glideins running fine, but jobs failing. (may need help from users)
    • Glideins running for a while and being killed/preempted. (right now we don't even know if and how often this happens)
  • Factory: periodically test sites with per-factory pluggable scripts (Igor + student)

    • Run tests on all sites, even if no frontends are using them
    • Disable sites that fail tests
    • Enable previously disabled sites that succeed
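One possible shape for the pluggable test cycle listed above, as a sketch: every entry is tested on each pass regardless of whether a frontend uses it, and the enabled state simply follows the test outcome. The function name and the convention that a test is a callable taking an entry name and returning a boolean are assumptions.

```python
def update_entry_states(entries, tests):
    """Run all tests against all entries and return the new enabled states.

    entries: dict mapping entry name to its current enabled flag.
    tests:   list of callables; each takes an entry name and returns True on
             success. A test that raises counts as a failure.
    """
    new_states = {}
    for name in entries:
        try:
            passed = all(test(name) for test in tests)
        except Exception:
            passed = False
        # Failing entries are disabled; previously disabled entries that
        # pass again are re-enabled.
        new_states[name] = passed
    return new_states
```

The factory would call this periodically and apply the returned states, for example by putting failing entries into downtime.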

Release v3.0

  • Code cleanup using pylint (Parag + Burt)

    • Informally enforce some structure and look for problems in the code using pylint
  • Simplify how we specify condor_tarball config param

    Date: Thu, 28 Oct 2010 15:58:08 -0700
    
    Hi guys.
    
    At UCSD we often tag the same tarball multiple times, because of the need to handle defaults in an easy way.
    This then results in multiple tarball lines.
    
    For example:
           <condor_tarball arch="default" os="default" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4.3"/>
           <condor_tarball arch="default" os="default" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4"/>
           <condor_tarball arch="default" os="default" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="default"/>
           <condor_tarball arch="x86" os="default" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4.3"/>
           <condor_tarball arch="x86" os="default" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4"/>
           <condor_tarball arch="x86" os="default" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="default"/>
           <condor_tarball arch="default" os="rhel5" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4.3"/>
           <condor_tarball arch="default" os="rhel5" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4"/>
           <condor_tarball arch="default" os="rhel5" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="default"/>
           <condor_tarball arch="x86" os="rhel5" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4.3"/>
           <condor_tarball arch="x86" os="rhel5" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="7.4"/>
           <condor_tarball arch="x86" os="rhel5" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="default"/>
    
    
    It would be much better if we could provide a list;
    like "default,7.4,7.4.3"
    
    i.e. the above would be represented by a single line:
           <condor_tarball arch="default,x86" os="default,rhel5" tar_file="/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz" version="default,7.4,7.4.3"/>
    
    And the reconfig would automatically create the whole matrix (actually, a cube).
    Possibly even use symlinks or hardlinks between them.
    
    What do you think?
    Would it be reasonable?
    
    It would certainly make life much easier for us.
    
    Thanks,
       Igor
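The expansion requested in this email is a Cartesian product over the three comma-separated attributes. A minimal sketch, assuming the attribute values have already been read from the condor_tarball element (the function name and the dictionary output are illustrative):

```python
import itertools


def expand_tarball(arch, os_name, version, tar_file):
    """Expand comma-separated arch/os/version lists into one entry per combination."""
    combos = itertools.product(
        arch.split(","), os_name.split(","), version.split(",")
    )
    return [
        {"arch": a, "os": o, "version": v, "tar_file": tar_file}
        for a, o, v in combos
    ]


entries = expand_tarball(
    "default,x86",
    "default,rhel5",
    "default,7.4,7.4.3",
    "/home/gfactory/Prestage/gfactory-2.4.2-condor-7.4.3-linux-x86-rhel5-dynamic.tar.gz",
)
# 2 arch values x 2 os values x 3 versions = the 12 lines of the long form
```

Since all twelve entries point at the same tar_file, the reconfig step could unpack it once and create links for the remaining combinations, as suggested in the email.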
    

Release v3.1

  • Factory: never forget about a submitted glidein

    • Currently, if the frontend goes away, the factory forgets about the glideins it has submitted for that frontend
    • This is not critical, since the glideins will eventually finish by themselves, but we lose monitoring and can forget held jobs in the queue
  • Factory: Refactor: multiple entry points per factory process

    • Single process per entry point does not scale well for hundreds of entry points (schedds get pounded with many more condor_qs than we need, for instance)
  • Frontend: Tool for declaring downtime

    • Switch into mode where no new glideins are requested (but old glideins are not requested to be removed)
  • Frontend: Discover and act on jobs that do not match any factory entry (Igor + student)

    • There is no easy way to tell whether there are jobs in the queues that do not match any factory entry, and why.
    • This results in jobs sitting in the queues forever!
  • Frontend: request different number of glideins for matching entry points

    • Frontend currently requests the same pressure for all matching entry points
    • For multiple matching entry points, slower sites will have many idle glideins long after the jobs have all started running (or have already finished)
    • The frontend could monitor the glidein start rate and regulate the pressure -- or this could be customizable
    • Should help to reduce wasted walltime at misconfigured grid sites (e.g. glideins start, but never register with the collector)
  • Factory should push more monitoring stats to frontend

    • Right now the factory has much more monitoring info than the frontend, in particular info about glidein success/failure rates
    • Once we have multi-VO factories, this does not work too well; it is difficult to understand who does what
    • The factory should push most of the monitoring info to the frontend
  • Factory monitoring should have links to frontend web pages

    • The frontends have per-VO info that a factory admin may want to see
    • It does not make sense for the factory to host the data categorized by VO; the factory is loaded enough
    • So we need an easy way for a factory admin to discover the VO frontend web monitoring
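
For the "request different number of glideins for matching entry points" item above, one possible regulation rule is sketched here. It assumes the frontend has a measured glidein start rate per entry point and scales each request relative to the fastest entry. The function regulate_pressure, the floor parameter and the rate units are hypothetical; this is not the frontend's current logic.

```python
def regulate_pressure(base_idle, start_rates, floor=1):
    """Scale the idle-glidein request per entry by its observed start rate.

    base_idle   -- pressure the frontend would request from every entry today
    start_rates -- {entry_name: glideins started per hour} (assumed to be measured)
    floor       -- minimum request, so a slow entry is still probed
    """
    fastest = max(start_rates.values(), default=0)
    if fastest <= 0:
        # No start-rate history yet: keep the same pressure everywhere.
        return {name: base_idle for name in start_rates}
    return {
        name: max(floor, round(base_idle * rate / fastest))
        for name, rate in start_rates.items()
    }


requests = regulate_pressure(20, {"fast_site": 2.0, "slow_site": 0.5, "dead_site": 0.0})
print(requests)
```

The fastest entry keeps the full pressure, slower entries get proportionally less, and an entry that never starts glideins drops to the floor instead of accumulating idle glideins.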

New requests (not slotted for a release)

  • Log history archiving

    Date: Mon, 11 Oct 2010 09:20:36 -0700

    I would like to request a feature to be added to the history file rotation component of the glidein-wms.

    Currently it is my understanding that glidein WMS periodically removes (deletes) history files older than a specific number of days. Removal is important to prevent excessive storage requirements regarding these logs and probably other reasons.

    At the same time there is a desire to maintain much longer history of these transactions so that if necessary for auditing they can be retrieved. Storage requirements are a concern with such a feature. A secondary concern is excessive numbers of small files after a compression step.

    One suggestion might be for the glidein-wms rotation function to move files instead of delete them. This move would relocate any history logs older than a certain number of days to a directory other than the original, rather than performing a deletion step. The advantage to this feature would be that it would allow an archiving process, separate from glidein-wms to gather all history from the temporary directory in one step, archive and then compress it. Once archiving was completed any processed files would be deleted from this temporary area by the archiving process.

    By combining several history files into one archive it should be possible to achieve more efficient compression. It could also significantly reduce the total file count. Each archive can be labeled by date for quick retrieval and expansion as necessary.

    An alternative would be for glidein-wms to perform a compression, instead of deletion step. An archiving process would then simply remove any compressed histories from the log directory on a daily or weekly basis and store them in an off system area similar to the first method. Compressing many smaller files is less efficient than compressing a single large file typically, due to additional header information etc...

    By adding one of the above, or similar, feature to the glidein-wms history rotation it should make transaction archival easier, more efficient and reliable.

    If you have any questions please let me know.

    Terrence
  • Try to eliminate Factory as a single point of failure.

    • If a factory goes down, all frontends are affected.
    • Understand if we can run factory in HA mode. What are the security, load related implications?
    • Do we need to apply similar feature for frontend?
  • All: replace GSI_DAEMON_NAME with ALLOW_CLIENT

    • Cannot be done without changes to Condor
    • Once that works, ALLOW_CLIENT accepts wildcards, which makes life easier when running factories with many pilot certificates
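
For the log history archiving request at the top of this list, the first suggestion (move instead of delete, then archive separately) could look roughly like the sketch below. The function names, the directory layout and the use of file modification time as the age criterion are all assumptions made for illustration; nothing here is taken from the glideinWMS cleanup code.

```python
import os
import shutil
import tarfile
import time


def rotate_history(log_dir, archive_dir, max_age_days, now=None):
    """Move (instead of delete) files older than max_age_days into archive_dir."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            shutil.move(path, os.path.join(archive_dir, name))
            moved.append(name)
    return moved


def bundle_archive(archive_dir, tar_path):
    """Separate archiving step: pack all moved files into one tar.gz, then remove them.

    Assumes archive_dir is flat (files only) and tar_path lies outside it.
    """
    names = sorted(os.listdir(archive_dir))
    with tarfile.open(tar_path, "w:gz") as tar:
        for name in names:
            tar.add(os.path.join(archive_dir, name), arcname=name)
    for name in names:
        os.remove(os.path.join(archive_dir, name))
    return names
```

Keeping the two steps in separate functions mirrors the request: the rotation only relocates files, and an independent process decides when to bundle and compress them, so many small history files end up in a single dated archive.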

Bugs

Unconfirmed/Not reproducible bugs

  • Frontend likely bug - not respecting the factory match_attr

    • Parag: Tried with v2.4.1 and v2.4.1. Cannot be reproduced.

    I am running the v2_4 frontend and have set
             <match match_expr='(job["NeedBLAST"]!=1) or (glidein["attrs"]["HasBLAST"]!=0)'>
                 <factory query_expr="HasBLAST=!=UNDEFINED">
                    <match_attrs>
                       <match_attr name="HasBLAST" type="int"/>
                    </match_attrs>
                    <collectors>
                    </collectors>
                 </factory>
                 <job query_expr="True">
                    <match_attrs>
                       <match_attr name="NeedBLAST" type="int"/>
                    </match_attrs>
                    <schedds>
                    </schedds>
                 </job>
              </match>
    
    and have been struggling with why the matching was not working.
    
    So I added a printout in the frontend code
    glideinWMS/frontend/glideinFrontendLib.py:164
    log_files.logDebug("Matching: job %s gliein %s"%(job,glidein["attrs"]))
    
    and this is the result:
    [2010-07-14T22:12:22-05:00 16100] Matching: job {u'EnteredCurrentStatus': 1279163537, u'NeedBLAST': 1, u'ServerTime': 1279163542, u'JobStatus': 2} gliein {u'GLIDEIN_In_Downtime': u'False', u'GlideinRequirex509_Proxy': True, u'EntryName': u'Wisc_OSG_Edu', u'GLIDEIN_Site': u'Wisc', u'USE_CCB': u'True', u'PubKeyValue': u'-----BEGIN PUBLIC 
    KEY-----\\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA26ir/K8q2gvM6c/30j/B\\nFUDdYvUuvN8Zn17sw5p1f6sCeVy/1CpEJURYcNmdxbdWce6pJ9dwGKXnXyI7a4mk\\nyO2TjBzYSBVdnTqj4vEdqmi8sOTGFC44WsZsRgf/zsXoVgSJSDJVHk5RMflRLCdi\\nepw/NWzRy8UvTMAmTV0T1rsdsp3D7YZD80P5XjkuVb6TJY+OJYEGYBhXNcgODc8w\\nZkIXixcEyVmPXUvtIsReSlmFcaQdaiYP1w9tC7CBWnLRkm4T8IzojnczCf8MaV7i\\n31yaMOal4m0KICwgQYZ3/Ho1WY54MQNomB3X9rHoCyM97uWxDq+97319X//xa3WD\\nGwIDAQAB\\n-----END PUBLIC KEY-----\\n', 'PubKeyObj': <pubCrypto.PubRSAKey instance at 0x2af8a7f598c0>, u'PubKeyType': u'RSA', u'AuthenticatedIdentity': u'gfactory@vdt-itb.cs.wisc.edu', u'GLIDEIN_Gatekeeper': u'osg-edu.cs.wisc.edu/jobmanager-condor', u'GLIDEIN_GridType': u'gt2', u'GLEXEC_BIN': u'NONE', 
    u'GCB_ORDER': u'NONE', u'GlideinName': u'v1_0', u'LastHeardFrom': 1279163526, u'FactoryName': u'OSGSchool', u'PubKeyID': u'20ef93c63fbf5d552ca1786f08720053', u'SupportedSignTypes': u'sha1', u'GlideinAllowx509_Proxy': True, u'HasBLAST': u'True'}
    
    The job match_attr seems to have been honored, but the factory one was ignored.
    
    I think we have a bug in the frontend code, somewhere.
    Any volunteers to find and fix it?
    
    Igor
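
In the debug output above, HasBLAST reaches the matching code as the string u'True', even though the factory match_attr declares type="int". The sketch below evaluates the same match_expr against hand-built dictionaries to show what that implies. It is only an illustration: the helper matches is hypothetical, and evaluating the expression with eval is an assumption about how the frontend applies it, not a copy of the frontend code.

```python
match_expr = '(job["NeedBLAST"]!=1) or (glidein["attrs"]["HasBLAST"]!=0)'

job = {"NeedBLAST": 1}  # job attribute arrived as an int, as in the log


def matches(has_blast):
    """Evaluate match_expr for one glidein whose HasBLAST has the given value."""
    glidein = {"attrs": {"HasBLAST": has_blast}}
    return eval(match_expr, {"job": job, "glidein": glidein})


print(matches("True"))   # string value, as seen in the debug output
print(matches("False"))  # any string differs from the int 0
print(matches(0))        # the int value that type="int" would deliver
```

With a string value the comparison against 0 is always true, so a job with NeedBLAST=1 matches regardless of what the factory advertises; only an int 0 makes the expression false.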
    
  • Factory does not submit glideins (cannot be reliably reproduced)

    • Observed in the v1.6.x and v2.x
    • Was not able to reproduce successfully
    • Igor thinks this could be related to an exception occurring but not being handled correctly; instead, the exception is blindly ignored. His student may be looking into this issue.
    • Factory does not correctly determine the number of idle glideins in the system. Sometimes it reports that zero glideins are in the system and submits a bunch of new glideins, thus overloading the system.
    • In case there are errors/exceptions while running condor_q, we should just bypass the entire cycle and try again during the next cycle. This way if the condor_schedd has gone down, the entry will not advertise to the collector and the classad will eventually expire. This seems to be a safer operation.
    • Email from Burt:
      I noticed this in the CMS production installation:
      [2009-09-18T15:30:35-05:00 32407] Client 'cmssrv86', schedd status {1: 104, 2: 1659, 1100: 1, 1002: 104}, collector running ?
      [2009-09-18T15:31:58-05:00 32407] Client 'cmssrv86', schedd status {1: 0}, collector running ?
      [2009-09-18T15:33:07-05:00 32407] Client 'cmssrv86', schedd status {1: 100, 2: 1658, 1002: 100}, collector
      Note the 15:31:58 status of 1:0 -- that's not real. Is there some error condition with the condor_q output that defaults to marking as "zero jobs idle" {1: 0}?
  • Factory sometimes crashes when it reaches the maximum number of glideins that can be submitted.

    • Email from Joe:
      -------- Forwarded Message --------
      From: Joe Boyd <xxxx@fnal.gov>
      To: Parag Mhashilkar <xxxx@fnal.gov>
      Cc: Federica Moscato <xxxx@fnal.gov>, Dennis D Box <xxxx@fnal.gov>
      Subject: a different factory died
      Date: Wed, 30 Sep 2009 20:38:00 -0500

      Hi Parag,

      This was a completely different installation than the last one where the factory
      died on me a couple of times. Again though, the factory died when a configured
      condor limit was reached. This was glideinwms 1.5.1 so maybe something is fixed
      in a later release. I can't even remember what I was testing before. This was
      a different limit than before. I had one entry point open and 8000 jobs
      submitted. I hadn't realized that condor was setup with this:

      [gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1] condor_config_val -dump
      | grep 5000
      GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 5000
      GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5000
      SEC_DEFAULT_SESSION_DURATION = 50000
      [gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1]

      This is the glideinwms factory condor pool and once it submitted 5000 glideins
      it wouldn't send any more I guess. At this point, the factory died. There is
      the error file. The factory_info file doesn't have any error in it. The last
      entry is just a regular loop entry with the same timestamp as this file.

      [gfactory@fcdfhead42dev ~/glideinsubmit/glidein_v1_5_1/log] cat
      factory_err.20090930.log
      [2009-09-30T15:53:51-05:00 29724] Exception at Wed Sep 30 15:53:51 2009:
      ['Traceback (most recent call last):\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 176,
      in main\n glideinDescript,entries)\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 121,
      in spawn\n time.sleep(sleep_time)\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 192,
      in termsignal\n raise KeyboardInterrupt, "Received signal %s"%signr\n',
      'KeyboardInterrupt: Received signal 15\n']
      [2009-09-30T16:03:08-05:00 32504] Exception at Wed Sep 30 16:03:08 2009:
      ['Traceback (most recent call last):\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 176,
      in main\n glideinDescript,entries)\n', ' File
      "/cdf/local/home/gfactory/glideinWMS.v1_5_1/factory/glideFactory.py", line 115,
      in spawn\n raise RuntimeError,"Entry \'%s\' exited, quit the whole
      factory:\\n%s\\n%s"%(entry_name,tempOut,tempErr)\n', "RuntimeError: Entry
      'osgt2' exited, quit the whole factory:\n[]\n[]\n"]

      joe
    • At UCSD we reach the max_jobs all the time, and never had a problem.
      The trace shows that the program was sent a SIGTERM.
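
Returning to the "Factory does not submit glideins" item above: the proposal there is to skip the whole cycle when condor_q fails, rather than treating the failure as "zero idle glideins". A minimal sketch of that rule follows. The function glideins_to_submit and the query_schedd callable are hypothetical stand-ins for the factory entry loop and its condor_q wrapper.

```python
def glideins_to_submit(query_schedd, wanted_idle):
    """Return how many glideins to submit this cycle, or None to skip the cycle.

    query_schedd -- callable returning {JobStatus: count}; expected to raise on
                    any condor_q error (hypothetical wrapper)
    wanted_idle  -- number of idle glideins the frontend asked for
    """
    try:
        status_counts = query_schedd()
    except Exception as err:
        # Do not fall back to "0 idle": that is what triggers the oversubmission.
        print("condor_q failed, skipping this cycle: %s" % err)
        return None
    idle = status_counts.get(1, 0)  # JobStatus 1 is Idle
    return max(0, wanted_idle - idle)


def healthy_query():
    return {1: 104, 2: 1659, 1100: 1, 1002: 104}


def broken_query():
    raise RuntimeError("schedd not responding")


print(glideins_to_submit(healthy_query, 150))
print(glideins_to_submit(broken_query, 150))
```

Returning None instead of a number forces the caller to distinguish "nothing to submit" from "state unknown", which is the distinction the {1: 0} status in Burt's log appears to lose.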

WMS Collector 2.2

Replay attack feature when same proxy used on multiple components

-- JohnWeigand - 2009/11/25

This is a hard one to describe and is related to the replay attack feature in v2.2. This problem will occur only if you are using the same proxy/cert (key is issuer/subject value) for the VO Frontend and the Factory instances.

If you are using the same proxy or certificate for both the VO Frontend and the Factory, the CONDOR_LOCATION/certs/condor_mapfile is populated as below:

If the factory (weigand@cms-xen21.fnal.gov) is the first one in the file, when the VOFrontend (cms_frontend@cms-xen22.fnal.gov) requests a glidein from the factory, this error occurs in the factory_err.yyyymmdd.log:

  • [2009-11-25T14:58:31-05:00 1389] Client ress_GRATIA_TEST_32@v2_2@factory@cms_frontend-v2_2.main provided invalid
    ReqEncIdentity(cms_frontend@cms-xen22.fnal.gov!=weigand@cms-xen21.fnal.gov).
    Skipping for security reasons

If I then change the condor_mapfile (as below) and put the VOFrontend ahead of the factory, the glidein request is accepted:

Another anomaly of note when attempting to resolve this is that I had to restart the WMS Collector in order for it to recognize the changed condor_mapfile. It was my understanding that this was not required. I even allowed the collector to run for 1 hour and 50 minutes before I gave up and recycled the collector.
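
The order dependence described above is what one would expect if the map file is read top to bottom and the first matching line wins. The sketch below assumes that reading and models only the three-field lines (method, pattern, canonical name). The helper map_identity and the DN are hypothetical; the canonical names weigand and cms_frontend are the ones from the error message.

```python
import re
import shlex


def map_identity(mapfile_lines, method, principal):
    """Return the canonical name from the first line that matches, else None."""
    for line in mapfile_lines:
        fields = shlex.split(line)
        if len(fields) != 3:
            continue
        line_method, pattern, canonical = fields
        if line_method == method and re.fullmatch(pattern, principal):
            return canonical
    return None


SHARED_DN = "/DC=org/DC=example/CN=Shared Cert 12345"  # hypothetical DN

factory_first = [
    'GSI "%s" weigand' % SHARED_DN,
    'GSI "%s" cms_frontend' % SHARED_DN,
    'GSI (.*) anonymous',
]
frontend_first = [factory_first[1], factory_first[0], factory_first[2]]

print(map_identity(factory_first, "GSI", SHARED_DN))   # weigand
print(map_identity(frontend_first, "GSI", SHARED_DN))  # cms_frontend
```

With one certificate shared by both services, only the first of the two lines can ever be reached, so the authenticated identity the factory sees depends purely on line order.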

Update -- JohnWeigand - 2009/12/01: At the glidein status meeting on Monday (11/30), I was advised that the classad_identity in the security element should not contain the hostname of the WMS collector and that the collector's condor_mapfile should not have the hostname in the last token. Note that this is still the use case where I am using the same user cert on both the Factory and VOFrontend. Under these conditions, the only way I can get anything working (as in the requested glideins started) is using this configuration:

  • frontend.xml
    <security classad_identity="cms_frontend@cms-xen22.fnal.gov".... >

If I do not fully qualify either of the 2 files, I get errors like this in the Factory error log:

  • [2009-12-01T07:58:11-05:00 27198] Client ress_GRATIA_TEST_32@v2_2@factory@cms_frontend-v2_2.main provided invalid ReqEncIdentity(cms_frontend!=cms_frontend@cms-xen21.fnal.gov). Skipping for security reasons.

If I qualify the security element classad_identity with "cms_frontend@cms-xen22.fnal.gov", I get this:

  • [2009-12-01T09:50:14-05:00 27198] Client ress_GRATIA_TEST_32@v2_2@factory@cms_frontend-v2_2.main provided invalid ReqEncIdentity(cms_frontend@cms-xen22.fnal.gov!=cms_frontend@cms-xen21.fnal.gov). Skipping for security reasons.

If I qualify both (the frontend config and the condor_mapfile), it works.

Update 2: Here is what I use at UCSD, and this works:

gfactory - glidein-1.t2.ucsd.edu

[0858] gfactory@glidein-1 ~/glidecondor/certs$ cat condor_mapfile
GSI "/DC=org/DC=doegrids/OU=Services/CN=glidein-1.t2.ucsd.edu" condor
GSI "/DC=org/DC=doegrids/OU=Services/CN=glidein-frontend.t2.ucsd.edu" frontend
GSI "/DC=org/DC=doegrids/OU=People/CN=Dan Bradley 268953" glowfe
GSI (.*) anonymous
FS (.*) \1

frontend - glidein-frontend.t2.ucsd.edu

[0900] frontend@glidein-frontend ~/frontstage/instance_v2_1.cfg$ grep ident frontend.xml
<collector classad_identity="gfactory@glidein-1.t2.ucsd.edu" node="glidein-1.t2.ucsd.edu"/>
<security classad_identity="frontend@glidein-1.t2.ucsd.edu" classad_proxy="/home/frontend/.globus/x509_service_proxy" proxy_selection_plugin="ProxyUserMapWRecycling" sym_key="aes_256_cbc">

Update 3 -- JohnWeigand - 2009/12/01:

If the frontend is actually on glidein-frontend.t2.ucsd.edu, why then is the classad_identity populated with frontend@glidein-1.t2.ucsd.edu?

VOFrontend config file: classad_identity population

-- JohnWeigand - 2009/12/01

In the VOfrontend xml configuration file, the collector and security elements both have a classad_identity attribute.

  • <security proxy_selection_plugin="ProxyAll" classad_proxy="/home/cms/grid-security/x509_cms_pilot_proxy"
    classad_identity="cms_frontend">
  • <collectors>
    <collector node="cms-xen21.fnal.gov" classad_identity="glidein@cms-xen21.fnal.gov"
    comment="Define factory collector globally for simplicity"/>
    </collectors>

The security element classad_identity is not supposed to have the hostname appended. However, the collector element classad_identity has to have the hostname appended. When the collector element does not have the hostname appended, the factory never appears to get the glidein requests from the VOFrontend. There is nothing in any log files on the VOFrontend, factory or WMS collector indicating a problem.

The question is: why are the classad_identity attributes populated differently?

ReSS / GIP / resource / resource group

-- JohnWeigand - 2010/03/11

This one is not directly related to glidein per se. It is indirectly affected by how OIM/MyOSG defines the OSG topology for my environment.

An example: I have the following defined in OIM/MyOSG:

resource group: ITB_GRATIA_TEST containing 2 resources:

  • resource: ITB_GRATIA_TEST_1 with a CE service running on gr6x3.fnal.gov
  • resource: ITB_GRATIA_TEST_2 with a CE service running on gr6x4.fnal.gov

Using the latest CE config.ini (OSG 1.2.8), I have defined resource_group and resource on each CE with the respective names shown above.

When GIP publishes data to ReSS and BDII, it uses the resource_group of the config.ini. So, for both CEs, data is published as ITB_GRATIA_TEST.

During the Factory installation process, my queries of the ReSS service (in this case, the ITB osg-ress-4.fnal.gov) bring back the following potential entry points:

  • [ress_ITB_GRATIA_TEST_1] gr6x4.fnal.gov/jobmanager-condor((queue=default)(jobtype=single))
  • [ress_ITB_GRATIA_TEST_2] gr6x3.fnal.gov/jobmanager-condor((queue=default)(jobtype=single))

Notice above, that ITB_GRATIA_TEST_1 appears now to be associated with gr6x4 and not the real one, gr6x3.

The reason for this is that the query brings back 2 sets of data with a name of ITB_GRATIA_TEST (the resource group). The installation appends a '_<counter>' to the resource_group. In this case, the gr6x4 came back before the gr6x3 on the query and it then looks like just a transposition.
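
The naming behaviour deduced above can be reproduced with a few lines. The sketch assumes the installer names entries as ress_<resource_group>_<counter>, with the counter following the order in which ReSS returns the records. The function entry_names is a hypothetical reconstruction, not the installer code.

```python
from collections import defaultdict


def entry_names(query_results):
    """Assign ress_<group>_<n> names in the order the query returned the records."""
    counters = defaultdict(int)
    names = []
    for group, gatekeeper in query_results:
        counters[group] += 1
        names.append(("ress_%s_%d" % (group, counters[group]), gatekeeper))
    return names


# Both CEs publish the resource group name; gr6x4 happened to come back first.
ress_reply = [
    ("ITB_GRATIA_TEST", "gr6x4.fnal.gov/jobmanager-condor"),
    ("ITB_GRATIA_TEST", "gr6x3.fnal.gov/jobmanager-condor"),
]

for name, gatekeeper in entry_names(ress_reply):
    print("[%s] %s" % (name, gatekeeper))
```

The counter carries no information about the OIM resource name, so ress_ITB_GRATIA_TEST_1 pointing at gr6x4 is a coincidence of return order rather than a lookup error.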

This could just be an unfortunate, for me, problem caused by my naming the resources with an appended "_<number>".

The true reason for the appending of a counter to the resource group, at least as I am deducing it, is to handle the case where a resource (CE) has multiple job managers that can be used. This also looks like there may be 3 resources in the same resource group.

  • [ress_FNAL_FERMIGRID_ITB_1] fgitbgkc1.fnal.gov:2119/jobmanager-condor((queue=group_cms)(jobtype=single))
  • [ress_FNAL_FERMIGRID_ITB_2] fgitbgkc1.fnal.gov:2119/jobmanager-condor((queue=group_us_cms)(jobtype=single))
  • [ress_FNAL_FERMIGRID_ITB_3] fgitbgkc2.fnal.gov/jobmanager-condor((queue=group_cms)(jobtype=single))
  • [ress_FNAL_FERMIGRID_ITB_4] fgitbgkc2.fnal.gov/jobmanager-condor((queue=group_us_cms)(jobtype=single))
  • [ress_FNAL_FERMIGRID_ITB_5] fgitbgkp2.fnal.gov/jobmanager-pbs((queue=batch)(jobtype=single))

At this point the "I'm not sure I know what I am talking about and may be rambling" stuff starts, to be done in bullet points:

  1. If we are attempting to align OIM/MyOSG resource with Gratia site name, it seems we will have a disconnect with how we reference glidein entry points. Contact information is at the resource level in OIM, not the resource group level.
  2. Several production resources are currently defined with the appended counter as I have in ITB_GRATIA_TEST_1/2. So the only clue as to the real resource is through the node name of gatekeeper.
  3. If, in the future, there is an intent to integrate something like planned maintenance/downtime queries of OIM/MyOSG, there is nothing available in glidein to do this.

As mentioned in the beginning, this is not necessarily a glidein problem but is something all should be aware of.

Expired user job proxy causes glexec to hang (v2.2 /glexec v0.6.8.3)

-- JohnWeigand - 2010/04/21

Description: If the user proxy used to run a job expires before the job completes, the glexec authorization process causes the job to "hang", thus tying up the WN resource.

Software versions:

  • GlideinWMS v2.2
  • glexec 0.6.8.3-osg1 (lcas 1.3.11.3 and lcmaps 1.4.8.4)
  • VDT 2.0.99p16

This is the scenario used to test this problem:

  1. Created a proxy using voms-proxy-init with a '-valid 00:05' argument so the proxy would expire in 5 minutes
  2. Submitted a job that would run for 10 minutes

These were the results:

  1. User job was submitted
  2. User job was pulled down by the glidein pilot
  3. glexec authorized the job to run
  4. It ran for the 10 minutes and then glexec needed to do another authorization on it (for what reason I am clueless) and it recognized that the proxy had expired. It then went into an endless retry loop until I terminated the user job on the submit node 50 minutes later.

The user job's log file shows this:

000 (006.000.000) 04/20 13:59:51 Job submitted from host: <131.225.206.81:52721>
...
001 (006.000.000) 04/20 14:00:12 Job executing on host: <131.225.204.144:53728>
...
006 (006.000.000) 04/20 14:05:23 Image size of job updated: 9476
...
007 (006.000.000) 04/20 14:10:15 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to condor
0 - Run Bytes Sent By Job
4096 - Run Bytes Received By Job
...
007 (006.000.000) 04/20 14:10:18 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to the user
0 - Run Bytes Sent By Job
4096 - Run Bytes Received By Job
...
007 (006.000.000) 04/20 14:10:19 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to the user
0 - Run Bytes Sent By Job
4096 - Run Bytes Received By Job

I killed it here..

...
009 (006.000.000) 04/20 14:49:22 Job was aborted by the user.
via condor_rm (by user weigand)
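
A job stuck in this retry loop shows up in its user log as a growing run of 007 (shadow exception) events. A small sketch for counting them per job is given below, assuming the event header format seen in the log above. The helper shadow_exception_counts is hypothetical and the sample text is abridged from that log.

```python
import re

# Event header: "<3-digit event> (<cluster.proc.subproc>) <MM/DD HH:MM:SS> <text>"
EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\) \d\d/\d\d \d\d:\d\d:\d\d ")


def shadow_exception_counts(log_text):
    """Count 007 (shadow exception) events per job id in a Condor user log."""
    counts = {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m and m.group(1) == "007":
            counts[m.group(2)] = counts.get(m.group(2), 0) + 1
    return counts


SAMPLE = """\
001 (006.000.000) 04/20 14:00:12 Job executing on host: <131.225.204.144:53728>
...
007 (006.000.000) 04/20 14:10:15 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to condor
...
007 (006.000.000) 04/20 14:10:18 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to the user
...
007 (006.000.000) 04/20 14:10:19 Shadow exception!
Error from glidein_9185@cms-xen11.fnal.gov: error changing sandbox ownership to the user
"""

print(shadow_exception_counts(SAMPLE))
```

A count that keeps rising for one job while its proxy is known to be expired is a cheap signal that the job should be held or removed on the submit node.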

The /var/log/glexec/lcas_lcmaps.log shows this (I have omitted what I consider irrelevant lines):

Start of job...

LCMAPS 0: 2010-04-20.14:00:13-29105 : lcmaps_plugin_gums-plugin_run(): gums plugin succeeded
LCMAPS 0: 2010-04-20.14:00:13-29105 : lcmaps.mod-lcmaps_run_with_pem_and_return_account(): succeeded
LCMAPS 7: 2010-04-20.14:00:13-29105 : Termination LCMAPS
LCMAPS 1: 2010-04-20.14:00:13-29105 : lcmaps.mod-lcmaps_term(): terminating
LCMAPS 7: 2010-04-20.19:00:13 : Termination LCMAPS
LCMAPS 1: 2010-04-20.19:00:13 : lcmaps.mod-lcmaps_term(): terminating
Job would have completed execution here..

LCAS 1: 2010-04-20.14:10:15-01823 :
LCAS 1: 2010-04-20.14:10:15-01823 : Initialization LCAS version 1.3.11.3
LCMAPS 1: 2010-04-20.14:10:15-01823 :
LCMAPS 7: 2010-04-20.14:10:15-01823 : Initialization LCMAPS version 1.4.8-4
LCMAPS 1: 2010-04-20.14:10:15-01823 : lcmaps.mod-startPluginManager(): Reading LCMAPS database /etc/glexec/lcmaps/lcmaps-suexec.db
LCAS 0: 2010-04-20.14:10:15-01823 : LCAS already initialized
LCAS 2: 2010-04-20.14:10:15-01823 : LCAS authorization request
LCAS 1: 2010-04-20.14:10:15-01823 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /etc/glexec/lcas/ban_users.db
LCAS 1: 2010-04-20.14:10:15-01823 : lcas.mod-lcas_run_va(): succeeded
LCAS 1: 2010-04-20.14:10:15-01823 : Termination LCAS
LCAS 1: Termination LCAS
LCMAPS 0: 2010-04-20.14:10:15-01823 : LCMAPS already initialized
LCMAPS 5: 2010-04-20.14:10:15-01823 : LCMAPS credential mapping request
LCMAPS 1: 2010-04-20.14:10:15-01823 : lcmaps.mod-runPlugin(): found plugin /usr/local/osg-wn-client/glexec-osg/lib/modules/lcmaps_verify_proxy.mod
LCMAPS 1: 2010-04-20.14:10:15-01823 : lcmaps.mod-runPlugin(): running plugin /usr/local/osg-wn-client/glexec-osg/lib/modules/lcmaps_verify_proxy.mod
LCMAPS 1: 2010-04-20.14:10:15-01823 : Error: Verifying proxy: Proxy certificate expired.
LCMAPS 1: 2010-04-20.14:10:15-01823 : Error: Verifying proxy: Proxy certificate expired.
LCMAPS 1: 2010-04-20.14:10:15-01823 : Error: Verifying certificate chain: certificate has expired
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps_plugin_verify_proxy-plugin_run(): verify proxy plugin failed
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps.mod-runPluginManager(): Error running evaluation manager
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps.mod-lcmaps_run_with_pem_and_return_account() error: could not run plugin manager
LCMAPS 0: 2010-04-20.14:10:15-01830 : lcmaps.mod-lcmaps_run_with_pem_and_return_account(): failed
LCMAPS 1: 2010-04-20.14:10:15-01830 : LCMAPS failed to do mapping and return account information
LCMAPS 7: 2010-04-20.14:10:15-01830 : Termination LCMAPS
LCMAPS 1: 2010-04-20.14:10:15-01830 : lcmaps.mod-lcmaps_term(): terminating
LCMAPS 7: 2010-04-20.19:10:15 : Termination LCMAPS
LCMAPS 1: 2010-04-20.19:10:15 : lcmaps.mod-lcmaps_term(): terminating
LCAS 1: 2010-04-20.14:10:15-01834 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /etc/glexec/lcas/ban_users.db
LCAS 1: 2010-04-20.14:10:15-01834 : lcas.mod-lcas_run_va(): succeeded
LCAS 1: 2010-04-20.14:10:15-01834 : Termination LCAS
LCAS 1: Termination LCAS

This then goes into an endless loop until the job was killed on the submit node.

It appears to retry twice every 2 minutes and the logs fill up very fast.

Update 1 -- JohnWeigand - 2010/04/26:

Feedback from Igor on 2010/04/21 is that this is a known problem that Condor is aware of and has plans to fix but there are no hard dates for its resolution.

However, I don't quite understand how this is a Condor issue. This may be due to my ignorance since the problem appears to be in glexec/lcas/lcmaps. Unless it is how Condor handles the error detected and is simply restarting the failed job rather than letting it die.

Regardless, I decided to change the test a little to see how it would handle other jobs in the submit node queue that had valid non-expired proxies.

  • I started 2 jobs with a CMS proxy with a life of 5 minutes guaranteed to run 10 minutes. This consumed the 2 slots (and pilots) available on the test cluster.
  • I then submitted 4 more jobs using a dzero proxy with a 180 hour lifetime guaranteed to run 2 minutes. These sat idle in the submit queue while the CMS jobs ran.
  • To my surprise, when the 2 CMS jobs completed after 10 minutes and glexec/lcas/lcmaps went into the failure mode related to the expired proxies, the pilots brought down the remaining jobs (with dzero proxies) in the submit queue and processed them successfully.
  • When these completed, glexec/lcas/lcmaps continued to error on the 2 CMS jobs.

Now out of curiosity, I renewed the proxies the 2 CMS jobs were using and then, again to my surprise, they authorized and completed successfully.

This may be the expected behavior but I do not understand the "why".

Update 2 -- JohnWeigand - 2010/05/03 (from Igor's reply):

1. Related to understanding why this is a Condor issue.

"Condor is the one that is calling glexec.
And it is expecting that glexec will always succeed.
So it starts the job, when the proxy is valid.
When the job finishes it tries to use glexec to fetch the results and do the cleanup...
if the proxy is not valid anymore, it will fail"

2. Related to why, when the CMS jobs completed and failed the authorization on cleanup, the dzero jobs in the submit queue started.

"Not too surprising... Condor simply gave up on the CMS jobs (after a timeout?),
since it cannot do anything about them.
I suppose at this point the dzero jobs had a better priority, so they were matched to that job slot."

3. Related to why, when all jobs in the submit queue were processed, the pilot continued to attempt the cleanup authorization on the CMS jobs.

"So condor just restarted them?
If this is the case, it is not too surprising either...
Condor guarantees "job runs at least once" policy...
given that it was not able to fetch the result from the first run, it is not counting that as a run, so it tries again"

4. Related to why the CMS jobs then completed successfully after their proxy was renewed on the submit node.

"This is reasonable...
Condor will delegate the proxy from the schedd to that startd (or shadow to starter) every time it tries to start the job...
so the moment the schedd had a valid proxy, it was delegated to the glidein side where it was used to call glexec... and things started to work"

Update 3 -- JohnWeigand -2010/05/18:

This issue was brought to the attention of Condor support in an effort to escalate the priority. Related to this was how to handle a similar issue of "banned/blacklisted" users.

Handling of "blacklisted" users

-- JohnWeigand - 2010/05/18

This was identified by Burt Holzman and is analogous to the issue of expired proxies.

"If a user is banned from a site (but not the pilot), essentially the same thing happens -- the user jobs match, start, fail the initial glexec, get rescheduled, etc."

General Questions v2.4+

Privilege Separation

1. JohnWeigand - 2010/05/26: When privilege separation is in effect, it appears the WMS Collector and Factory always have to be co-located. This appears to be true since the Factory create step apparently invokes the condor switchboard to create the individual VO frontend user directories. True/False?

Or is privilege separation not a factor and, therefore the WMS Collector and Factory always have to be co-located?

Topic revision: r111 - 2011/06/02 - 21:43:27 - ParagMhashilkar
UCSDTier2.TaskList moved from UCSDTier2.ReleaseTaskList on 2011/05/24 - 18:16 by ParagMhashilkar - put it back
 