UCSD Tier 2 Event Log

Oct 27 2006

  • The MIS Get CPUS script caused an overload of the CE. The cron job was disabled and nominal load was restored on the CE. During the problem we avoided purging the CE Schedd queue, which allowed us to delay, rather than cancel, pending user jobs. The exception was CDF jobs, which are glide-ins and are best removed outright if you need to stop the actual job on the pool (see the sketch after this list).
  • An update of the CA Certificates package on the GUMS server corrupted the GUMS VDT install. This required a full re-install of the GUMS infrastructure and about an hour of downtime. Existing jobs were not affected, but authentication for new jobs was down during that time. The cause of the problem within pacman is still being investigated as of Nov 1 2006.
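
A minimal sketch of the queue-handling approach described in the first item above, assuming only standard HTCondor tools (condor_hold, condor_rm). It is a hypothetical illustration rather than the script actually used: the cron file path and the CDF owner name are assumptions.

    import subprocess

    def run(cmd):
        # Print and execute a command, raising if it fails.
        print(" ".join(cmd))
        subprocess.check_call(cmd)

    # Disable the offending MIS Get CPUS cron job
    # (the /etc/cron.d path is an assumption).
    run(["mv", "/etc/cron.d/mis-get-cpus", "/etc/cron.d/mis-get-cpus.disabled"])

    # Hold idle jobs in the CE Schedd queue so they are delayed rather than cancelled.
    run(["condor_hold", "-constraint", "JobStatus == 1"])

    # CDF glide-ins are the exception: removing them is the only way to stop
    # the actual job on the pool. The owner name "cdf" is a placeholder.
    run(["condor_rm", "-constraint", 'Owner == "cdf"'])
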

Oct 2 2006

  • Some NFS mounts were not properly refreshed after last week's changes and at least one node had not received the proper auto.* files. These problems have been fixed.
  • Applied OpenSSH security patches to nodes and restarted OpenSSH.

Sept 21 2006

  • Restarted node 5-7 remotely. No clear errors in the log files on the central log manager; still investigating.
  • Restarted node 5-11 remotely. No clear errors in the log files on the central log manager; still investigating.

Sept 20 2006

  • Power cycled esf1 and patched it.
  • Restarted most internal Condor pools. There is a problem with the pools and the schedd getting disconnected. Dan Bradley suggests upgrading to address a known bug related to network errors and schedd-startd communication problems (see the sketch after this list).
  • Possible intermittent problem with the jobmanager on osg-gw-2.t2.ucsd.edu for the VO CompBioGrid. Local submission is successful; waiting for the VO to retest.
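
A minimal sketch of the pool restart mentioned above, assuming the Condor daemons are restarted over ssh with condor_restart until the upgrade Dan Bradley suggested is in place. The node names are placeholders, not the actual pool inventory.

    import subprocess

    # Placeholder node names; the real internal pool hosts differ.
    POOL_NODES = ["condor-pool-1.t2.ucsd.edu", "condor-pool-2.t2.ucsd.edu"]

    for node in POOL_NODES:
        # condor_restart asks the condor_master on each host to restart its
        # daemons, clearing stale schedd-startd connections in the meantime.
        subprocess.check_call(["ssh", node, "condor_restart"])
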

Sept 15 2006

  • Power cycled esf1.t2

Sept 14 2006

Sept 13 2006

  • Restarted MonaLisa on osg-gw-2.t2 after it crashed
  • Restarted node 4-30. It was down for unknown reasons.
  • Remotely power cycled esf1.t2.ucsd.edu
  • Attempted to restart node 5-19; it did not come back up because the BIOS had reset to present the disks as IDE rather than MMIO, as required by the SATA driver.
  • Attempted to restart node 4-11 but discovered it had a failed disk. It will be re-installed on a subsequent visit.

-- TerrenceMartin - 14 Sep 2006
