The MIS Get CPUS script caused an overload of the CE. The cron job was disabled and nominal loads were restored on the CE. During the problems we were able to avoid purging the CE schedd queue, which allowed us to delay, rather than cancel, pending user jobs. The exception was CDF jobs, which are glide-ins and are best removed if you need to stop the actual job on the pool.
An update of the CA Certificates package on the GUMS server corrupted the GUMS VDT install. This required a full re-install of the GUMS infrastructure and about an hour of downtime. Existing jobs were not affected, but authentication for new jobs was down during that time. The cause of the problem within pacman is still being investigated as of Nov 1 2006.
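One way to keep a periodic script from piling onto an already busy CE is to have the cron entry check the load average before running. The following is a minimal sketch of that idea, not the fix applied here (the cron job was simply disabled). The function names and the threshold are hypothetical, and `os.getloadavg` is assumed to be available (Unix hosts only).

```python
import os
import subprocess


def should_run(max_load, loadavg=os.getloadavg):
    """Return True when the 1-minute load average is below max_load."""
    one_minute, _, _ = loadavg()
    return one_minute < max_load


def run_if_idle(argv, max_load, loadavg=os.getloadavg):
    """Run argv only when the host is below max_load.

    Returns the command's exit code, or None when the run was skipped.
    """
    if not should_run(max_load, loadavg):
        # Skip this cycle instead of adding load to a busy host.
        return None
    return subprocess.call(argv)
```

A cron wrapper could call `run_if_idle` with the real command line and a threshold chosen for the host; skipping a cycle only delays the next report.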
Oct 2 2006
Some NFS mounts were not properly refreshed after last week's changes, and at least one node had not received the proper auto.* files. These problems have been fixed.
Applying OpenSSH security patches to nodes and restarting OpenSSH
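A node with stale auto.* files, as in the entry above, can be spotted by comparing its copies against a reference set. The sketch below assumes both sets are reachable as local directories (for example, the node's files copied to a scratch area). The function names and directory layout are hypothetical and not part of the site's tooling.

```python
import hashlib
from pathlib import Path


def digest_map(directory, pattern="auto.*"):
    """Map each matching file name to the SHA-256 of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(directory).glob(pattern))
        if path.is_file()
    }


def stale_files(reference_dir, node_dir, pattern="auto.*"):
    """Return reference file names that are missing or differ on the node."""
    reference = digest_map(reference_dir, pattern)
    node = digest_map(node_dir, pattern)
    return sorted(
        name for name, digest in reference.items() if node.get(name) != digest
    )
```

An empty list means the node's auto.* files match the reference. Files that exist only on the node are not reported.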
Sept 21 2006
Restarted node 5-7 remotely. No clear errors in the log files on the central log manager; still investigating.
Restarted node 5-11 remotely. No clear errors in the log files on the central log manager; still investigating.
Sept 20 2006
Power cycled esf1 and patched it
Restarted most internal condor pools. There is a problem with the pools and schedd getting disconnected. Dan Bradley suggests upgrading to address a known bug related to network errors and schedd/startd communication problems.
Possible intermittent problem with the jobmanager on osg-gw-2.t2.ucsd.edu with VO CompBioGrid. Local submission is successful; waiting for VO retest.
Sept 15 2006
Power cycled esf1.t2
Sept 14 2006
Sept 13 2006
Restarted MonALISA on osg-gw-2.t2 after it crashed
Restarted node 4-30, which was down for unknown reasons
Remotely power cycled esf1.t2.ucsd.edu
Restarted node 5-19. It had not come back up because the BIOS had been reset to see the disks as IDE and not MMIO, as is required by the SATA driver
Attempted to restart node 4-11 but discovered it had a failed disk. It will be re-installed on a subsequent visit