Hi guys. I tried to install the collector+schedd on a test machine, installing as root and using a user account for the condor daemons, and the installer did not like it; it fails in the last step, when trying to start the daemons:

...
How many secondary schedds do you want?: [9] 4
Error determining who should own the Condor-related directories.
Either create a "condor" account, or set the CONDOR_IDS environment variable
to the valid uid.gid pair that should be used by Condor.
Traceback (most recent call last):
  File "./glideinWMS_install", line 4238, in ?
    main()
  File "./glideinWMS_install", line 117, in main
    return installer(install_options)
  File "./glideinWMS_install", line 94, in installer
    install_options[k]["proc"]()
  File "./glideinWMS_install", line 301, in schedd_node_install
    configure_secondary_schedd(schedd_name)
  File "./glideinWMS_install", line 3431, in configure_secondary_schedd
    raise RuntimeError, "Failed to initialize schedd '%s'!"%schedd_name
RuntimeError: Failed to initialize schedd 'jobs1'!
...
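The error message itself suggests the workaround: either create a "condor" account or point CONDOR_IDS at the account that should own the Condor directories. A minimal sketch of exporting the uid.gid pair before re-running the installer, assuming a placeholder daemon account named "condoruser" (substitute the real one):

import os
import pwd
import subprocess

# "condoruser" is a placeholder for the unprivileged account the daemons run as
acct = pwd.getpwnam("condoruser")
# CONDOR_IDS must be the "uid.gid" pair Condor should use for its directories
os.environ["CONDOR_IDS"] = "%d.%d" % (acct.pw_uid, acct.pw_gid)
# re-run the installer with CONDOR_IDS in its environment
subprocess.call(["./glideinWMS_install"])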
Actual implementation: If an entry is configured with a GLEXEC_BIN value other than NONE, it is assumed that the site has GLEXEC available to be used. The frontend can mandate/override the use of GLEXEC by setting the attr GLIDEIN_Glexec_Use. If set to NONE, the glidein will not use GLEXEC; if set to OPTIONAL, the glidein will use GLEXEC only if the site has it; if set to REQUIRED, the glidein will enforce the use of GLEXEC. Setting it to REQUIRED also restricts the factory to sending glideins only to sites that have GLEXEC configured.
frontend   factory     result
--------   -------     ------
required   glexec      ran/used glexec
never      glexec      ran/did not use glexec
optional   glexec      ran/used glexec
optional   no glexec   ran/did not use glexec
never      no glexec   ran/did not use glexec
required   no glexec   should have no glideins start

job367 - authorized, but no user job ran; the user is left clueless, with nothing in any log saying the request will never be satisfied.
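A minimal sketch (not the actual glideinWMS code) of the decision described above, combining the frontend's GLIDEIN_Glexec_Use value with the entry's GLEXEC_BIN; the function name and error handling are illustrative only:

def glidein_uses_glexec(glexec_use, glexec_bin):
    # glexec_use: frontend attr GLIDEIN_Glexec_Use (NONE/NEVER, OPTIONAL, REQUIRED)
    # glexec_bin: entry attr GLEXEC_BIN; any value other than NONE means the
    #             site is assumed to have gLExec available
    site_has_glexec = (glexec_bin.upper() != "NONE")
    use = glexec_use.upper()
    if use in ("NONE", "NEVER"):
        return False                   # never invoke gLExec
    if use == "OPTIONAL":
        return site_has_glexec         # use gLExec only where the site has it
    if use == "REQUIRED":
        if not site_has_glexec:
            # matching should keep glideins away from such entries entirely,
            # which is the "should have no glideins start" row above
            raise RuntimeError("gLExec required but not available at the site")
        return True
    raise ValueError("unknown GLIDEIN_Glexec_Use value: %r" % glexec_use)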
The HCC group requested another feature: since they use preemption (for which I added an explicit knob in v2_4 a while ago), they also need to specify MaxJobRetirementTime, or the job will never go away when preempted. I defined PREEMPT_GRACE_TIME, which is then used to define MaxJobRetirementTime, in my UCSD gfactory, and they are testing it now. The reason for a different name is so it is not confused with GLIDEIN_Job_Max_Time.
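For illustration, a minimal sketch of turning a PREEMPT_GRACE_TIME value in seconds into the retirement setting written to the glidein's condor_config; the translation step and file name here are assumptions, only MaxJobRetirementTime itself is a standard Condor startd knob:

def retirement_config_line(preempt_grace_time):
    # MaxJobRetirementTime bounds how long a preempted job may keep running
    # before the startd evicts it; without it the job never goes away
    return "MaxJobRetirementTime = %d\n" % int(preempt_grace_time)

# e.g. append a one-hour grace period to a hypothetical config fragment
fd = open("condor_config.preempt", "a")
fd.write(retirement_config_line(3600))
fd.close()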
[2010-08-10T13:45:21-05:00 32385] condor_submit failed: Error running 'export X509_USER_PROXY=/home/gfactory/.grid/pilot.dbox.proxy;./job_submit.sh "gpgeneral" "gpgeneral@v1_0@if01@fnal_if01" 20 -- GLIDEIN_Collector if01.dot,fnal.dot,gov' code 256:['\n', 'ERROR: proxy has expired\n']
[2010-08-10T13:45:21-05:00 32385] Exception at Tue Aug 10 13:45:21 2010:
Traceback (most recent call last):
  File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 311, in iterate
    glideinDescript,jobDescript,jobAttributes,jobParams)
  File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 281, in iterate_one
    done_something = find_and_perform_work(in_downtime,glideinDescript,jobDescript,jobParams)
  File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 181, in find_and_perform_work
    jobDescript,x509_proxy_fname,params)
  File "/home/gfactory/glideinWMS/factory/glideFactoryEntry.py", line 91, in perform_work
    nr_submitted=glideFactoryLib.keepIdleGlideins(condorQ,idle_glideins,max_running,max_held,submit_attrs,x509_proxy_fname,params)
  File "/home/gfactory/glideinWMS/factory/glideFactoryLib.py", line 286, in keepIdleGlideins
    submitGlideins(condorq.entry_name,condorq.schedd_name,condorq.client_name,min_nr_idle-idle_glideins,submit_attrs,x509_proxy_fname,params)
  File "/home/gfactory/glideinWMS/factory/glideFactoryLib.py", line 664, in submitGlideins
    cluster,count=extractJobId(submit_out)
  File "/home/gfactory/glideinWMS/factory/glideFactoryLib.py", line 587, in extractJobId
    raise condorExe.ExeError, "Could not find cluster info!"
ExeError: Could not find cluster info!
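The root cause here is the expired proxy; the "Could not find cluster info!" exception is just the downstream symptom. A rough sketch (not the glideinWMS implementation) of what extractJobId has to parse: condor_submit normally prints a line like "3 job(s) submitted to cluster 1234.", and when the submit wrapper fails first, as with "ERROR: proxy has expired" above, that line is missing and the only thing left to do is raise the error seen in the traceback:

import re

def extract_job_id(submit_out_lines):
    # scan the captured condor_submit output for the cluster line
    for line in submit_out_lines:
        m = re.search(r"(\d+) job\(s\) submitted to cluster (\d+)", line)
        if m:
            return int(m.group(2)), int(m.group(1))   # (cluster, count)
    raise RuntimeError("Could not find cluster info!")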
In submit_20100921_cms_jgw-v2_4_3.main.log:

026 (004.000.000) 09/21 08:59:38 Detected Down Grid Resource
    GridResource: gt2 gr9x0.fnal.gov/jobmanager-condor

However, at 09:38, the 'tuple index out of range' stack trace below appeared in the log files and never appeared again over a 01:30:00 period. I do not believe it is related to this issue. The log files it appeared in were the factory logs (not the client logs):
[2010-09-21T09:38:37-05:00 14896] WARNING: Exception occurred:
Traceback (most recent call last):
  File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryEntry.py", line 453, in iterate
    write_stats()
  File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryEntry.py", line 357, in write_stats
    glideFactoryLib.factoryConfig.log_stats.write_file()
  File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryMonitoring.py", line 870, in write_file
    diff_summary=self.get_diff_summary()
  File "/home/weigand/glidein/glideinWMS.v2_4_3_alpha_1/factory/glideFactoryMonitoring.py", line 795, in get_diff_summary
    sdel[4]['username']=username
IndexError: tuple index out of range

2. Started the entry CE globus-gatekeeper at 10:40.
Results: It processed the user jobs successfully. I then shut down the entry CE gatekeeper. It recognized the down resource correctly. The glidein pilots on the WMS collector continued running. I did not get the warning message again.
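The IndexError comes from the sdel[4]['username']=username assignment in get_diff_summary: that code expects each deleted-job record to be a tuple with at least five elements, the fifth being a dict. A minimal, hypothetical illustration of a defensive variant (the names and handling are assumptions, not the actual fix):

def attach_username(record, username):
    # only touch records that really have a dict in position 4
    if len(record) > 4 and isinstance(record[4], dict):
        record[4]['username'] = username
    # otherwise skip the malformed record instead of crashing write_stats()
    return record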