Difference: UafSubmissionInfrastructure (4 vs. 5)

Revision 5 - 2015/10/26 - Main.FkW

  To avoid this particular problem, you will want to extend your proxy lifetime to 72h with "voms-proxy-init -H 72".
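To check whether your current proxy already has enough lifetime left before submitting, something like the following could be used. This is a minimal sketch: it assumes `voms-proxy-info` is installed on the UAF and that its `-timeleft` option prints the remaining lifetime in seconds. The function names `proxy_seconds_left` and `proxy_ok` and the `min_hours` variable are illustrative only.

```shell
#!/bin/bash
# Minimal sketch: warn if the current proxy has less than 72 hours left.
# Assumes voms-proxy-info is installed and that -timeleft prints seconds.
min_hours=72

# Print the remaining proxy lifetime in seconds (fails if there is no proxy).
proxy_seconds_left() {
  voms-proxy-info -timeleft 2>/dev/null
}

# Succeed only if the proxy has at least $min_hours hours left.
proxy_ok() {
  local seconds_left
  seconds_left=$(proxy_seconds_left) || return 1
  # a non-numeric answer makes [ fail, which counts as "not ok"
  [ "${seconds_left:-0}" -ge $(( min_hours * 3600 )) ] 2>/dev/null
}

if ! proxy_ok; then
  echo "proxy has less than ${min_hours}h left; renew with: voms-proxy-init -H ${min_hours}" >&2
fi
```

The renewal hint simply reuses the `voms-proxy-init -H 72` command given above.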

How to figure out why your job is running much longer than it should

Here we need your help to help you, because figuring this out most likely requires root privileges on the cluster. So here's what you should do:

  • Pick a job that has been running for far too long and do the following:
         condor_q -l 45638.0 | grep GridJobId
         # this will give you something like:
         GridJobId = "condor cmssubmit-r1.t2.ucsd.edu glidein-collector.t2.ucsd.edu 845562.0"
         # the last number is the ID of this job on cmssubmit-r1, so next you do:
         condor_q -n cmssubmit-r1.t2.ucsd.edu -pool glidein-collector.t2.ucsd.edu -l 845562.0 | grep MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot
         # this will give you something like:
         MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot = "slot2@cabinet-3-3-2.t2.ucsd.edu"
         # at this point, send an email to t2support saying that you think cabinet-3-3-2 has a broken hadoop fuse mount,
         # and give them all the information from above: the UAF your jobs were submitted from, an example job ID there,
         # the corresponding job ID on cmssubmit-r1, and the slot (slot2@cabinet-3-3-2.t2.ucsd.edu) you found above.
      
  • If I'm awake, or somebody else is, we will then try to fix the broken fuse mount. In the meantime, just leave the long-running job hanging and resubmit another one just like it.
  • Here's what we will do:
    • log into the node in question
    • use pstree -p guser2 (or equivalent) to find the process ID of the hung job
    • cd /proc/<processId>
    • cat cmdline
    • this tells us what the job is doing, and why it's hung.
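The two condor_q lookups from the first bullet, plus the /proc step from the admin list, can be sketched as a few shell functions. This is a minimal sketch, assuming HTCondor's `condor_q -af` (autoformat) option is available on the UAF; the function names `remote_job_id`, `slot_host`, `find_slot` and `show_cmdline` are hypothetical and not part of any existing tool.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not existing UCSD tooling).

# Print the last whitespace-separated field of a GridJobId value, e.g.
#   "condor cmssubmit-r1.t2.ucsd.edu glidein-collector.t2.ucsd.edu 845562.0"
# yields 845562.0, the job ID on cmssubmit-r1.
remote_job_id() {
  local value
  value=$(printf '%s' "$1" | tr -d '"')  # drop any double quotes
  echo "${value##* }"                    # keep everything after the last space
}

# Print the worker node of a glidein slot, e.g.
#   slot2@cabinet-3-3-2.t2.ucsd.edu -> cabinet-3-3-2.t2.ucsd.edu
slot_host() {
  local value
  value=$(printf '%s' "$1" | tr -d '"')
  echo "${value#*@}"                     # drop everything up to the '@'
}

# Run on the UAF: map a local job ID to its cmssubmit-r1 job ID and slot.
find_slot() {
  local grid_job_id remote_id slot
  grid_job_id=$(condor_q "$1" -af GridJobId)
  remote_id=$(remote_job_id "$grid_job_id")
  slot=$(condor_q -n cmssubmit-r1.t2.ucsd.edu -pool glidein-collector.t2.ucsd.edu \
    "$remote_id" -af MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot)
  echo "local job:        $1"
  echo "cmssubmit-r1 job: $remote_id"
  echo "slot:             $slot"
  echo "node:             $(slot_host "$slot")"
}

# Run on the worker node (admins only): print a process's command line.
# /proc/<pid>/cmdline separates the argv entries with NUL bytes.
show_cmdline() {
  tr '\0' ' ' < "/proc/$1/cmdline"
  echo
}
```

For example, `find_slot 45638.0` on the UAF would print the four pieces of information to include in the email to t2support.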
 

How to query the schedd on cmssubmit-r1

  • condor_q -n cmssubmit-r1.t2.ucsd.edu -pool glidein-collector.t2.ucsd.edu
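Since the -n/-pool pair has to be repeated on every query against this schedd, a small shell function can save typing. This is a sketch; the name `condor_q_r1` is hypothetical, and any extra arguments are passed straight through to `condor_q`.

```shell
#!/bin/bash
# Hypothetical convenience wrapper: run condor_q against the cmssubmit-r1
# schedd via the glidein collector, passing any extra arguments through.
condor_q_r1() {
  condor_q -n cmssubmit-r1.t2.ucsd.edu -pool glidein-collector.t2.ucsd.edu "$@"
}

# Example usage (job ID taken from the example above):
#   condor_q_r1                   # summary of the schedd's queue
#   condor_q_r1 -l 845562.0       # full ClassAd of one job
```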
 