Procedures for Identifying, Avoiding and Recovering from Data Loss at the UCSD CMS T2 Center
Introduction
This living document outlines policies and procedures for ensuring data safety at the UCSD CMS T2 Center. This information is based on operational experience at the UCSD T2 Center and is subject to change.
dCache Node
Identification of Current or Imminent Node Failure
System log of a disk failure
When a disk is going bad, it will often start to log single-block or sector errors to the kernel or system log. These errors can be detected manually or automatically; if they are monitored automatically, that information can be pushed into a central error detection and monitoring system (e.g. Nagios).
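As an illustration, errors like those in the examples below can be caught with a small log-scanning script run from cron or wrapped as a Nagios-style plugin. The following is only a sketch: the log file path and the error patterns (taken from the example messages in this document) are assumptions that must be adapted to the local syslog configuration.

#!/usr/bin/env python
# Sketch of an automated disk-error log check, usable as a Nagios-style
# plugin. The log path and patterns are assumptions; adapt them locally.
# A production check would also remember its last read position rather
# than rescanning the whole file every run.

import re
import sys

LOG_FILE = "/var/log/messages"            # assumed syslog location
PATTERNS = [                              # taken from the examples below
    re.compile(r"I/O error.*sector \d+"),
    re.compile(r"UncorrectableError"),
    re.compile(r"SCSI disk error"),
    re.compile(r"abnormal status 0x[0-9A-Fa-f]+ on port"),
]

def scan(path):
    """Return log lines matching any known disk-error pattern."""
    hits = []
    with open(path) as log:
        for line in log:
            if any(p.search(line) for p in PATTERNS):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    try:
        errors = scan(LOG_FILE)
    except IOError as exc:
        print("UNKNOWN: cannot read %s: %s" % (LOG_FILE, exc))
        sys.exit(3)                       # Nagios UNKNOWN
    if errors:
        print("CRITICAL: %d disk error line(s), e.g.: %s"
              % (len(errors), errors[0]))
        sys.exit(2)                       # Nagios CRITICAL
    print("OK: no disk errors found in %s" % LOG_FILE)
    sys.exit(0)                           # Nagios OK

The exit codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN), so a script along these lines can be dropped into an existing check framework.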
Examples of logs of disk failures
SATA Disk error
Oct 1 04:08:13 cabinet-5-5-12 kernel: ata3: command 0x25 timeout, stat 0xd0 host_stat 0x61
Oct 1 04:08:13 cabinet-5-5-12 kernel: ata3: status=0xd0 { Busy }
Oct 1 04:08:13 cabinet-5-5-12 kernel: SCSI disk error : host 2 channel 0 id 0 lun 0 return code = 8000002
Oct 1 04:08:13 cabinet-5-5-12 kernel: Current sd08:25: sense key Aborted Command
Oct 1 04:08:13 cabinet-5-5-12 kernel: Additional sense indicates Scsi parity error
Oct 1 04:08:13 cabinet-5-5-12 kernel: I/O error: dev 08:25, sector 160576768
Oct 1 04:08:13 cabinet-5-5-12 kernel: ATA: abnormal status 0xD0 on port 0xFFFFFF000001921C
PATA Disk error
Oct 2 10:30:36 cabinet-4-4-21.local kernel: hdb: dma_timer_expiry: dma status == 0x61
Oct 2 10:30:36 cabinet-4-4-21.local kernel: hdb: error waiting for DMA
Oct 2 10:30:36 cabinet-4-4-21.local kernel: hdb: dma timeout retry: status=0xd0 { Busy }
Oct 2 10:30:36 cabinet-4-4-21.local kernel:
Oct 2 10:30:36 cabinet-4-4-21.local kernel: hda: DMA disabled
Oct 2 10:30:36 cabinet-4-4-21.local kernel: hdb: DMA disabled
Oct 2 10:30:36 cabinet-4-4-21.local kernel: ide0: reset: success
Oct 2 10:30:36 cabinet-4-4-21.local kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 2 10:30:36 cabinet-4-4-21.local kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=46128471, high=2, low=12574039, sector=46128408
Oct 2 10:30:36 cabinet-4-4-21.local kernel: end_request: I/O error, dev 03:41 (hdb), sector 46128408
Oct 2 10:31:39 cabinet-4-4-21.local kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 2 10:31:39 cabinet-4-4-21.local kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=45933575, high=2, low=12379143, sector=45933512
Oct 2 10:31:39 cabinet-4-4-21.local kernel: end_request: I/O error, dev 03:41 (hdb), sector 45933512
Oct 2 10:31:39 cabinet-4-4-21.local kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 2 10:31:39 cabinet-4-4-21.local kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=45939407, high=2, low=12384975, sector=45939344
Oct 2 10:31:39 cabinet-4-4-21.local kernel: end_request: I/O error, dev 03:41 (hdb), sector 45939344
Oct 2 10:31:39 cabinet-4-4-21.local kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 2 10:31:39 cabinet-4-4-21.local kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=45933575, high=2, low=12379143, sector=45933512
Oct 2 10:31:39 cabinet-4-4-21.local kernel: end_request: I/O error, dev 03:41 (hdb), sector 45933512
Oct 2 10:31:53 cabinet-4-4-21.local kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 2 10:31:53 cabinet-4-4-21.local kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=45945935, high=2, low=12391503, sector=45945872
Oct 2 10:31:53 cabinet-4-4-21.local kernel: end_request: I/O error, dev 03:41 (hdb), sector 45945872
Disk Errors Do Not Guarantee Complete Disk Failure
Errors of these types can occur long before the disk fails completely, if it fails at all; a disk can continue to run for months after the first few errors are detected. Interestingly, disks that show these errors will often either check out cleanly when run through the vendor's disk testing utilities, or can be repaired with those same utilities.
It is nevertheless UCSD CMS T2 policy to replace such disks as soon as is practical once the errors start to occur. Removed disks are tested with vendor-supplied disk testing software and sent out for warranty replacement where possible.
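A quick scripted health query can also help confirm a suspect disk before it is pulled. Below is a minimal sketch using smartctl from the smartmontools package; smartctl itself is a real tool, but the device list and the simple output parsing here are assumptions for illustration, and this is not a substitute for the vendor testing utility mentioned above.

#!/usr/bin/env python
# Sketch: query each drive's SMART overall-health assessment with
# smartctl (smartmontools). Device names are assumptions; the parsing
# just looks for the word PASSED in the health report.

import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]        # assumed device names

for dev in DEVICES:
    # 'smartctl -H <device>' prints the overall-health self-assessment.
    proc = subprocess.Popen(["smartctl", "-H", dev],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out = proc.communicate()[0].decode("utf-8", "replace")
    if "PASSED" in out:
        print("%s: SMART overall-health PASSED" % dev)
    else:
        print("%s: SMART health FAILED or unavailable" % dev)
        print(out)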
If the disk errors coincide with the failure of a dCache pool, the disk is removed as soon as possible after the pool has been drained and replication of all files on the pool has been confirmed.
dCache Pool Failure
dCache has been observed to be sensitive to certain types of disk failures, which can cause the pools themselves to crash or become unresponsive. When a dCache pool becomes unresponsive, one of the first things to check is whether any recent disk failures have been logged, as described above.
Response to dCache Pool Disk Problems
Once a pool's disks are determined to be no longer serviceable, the response is to:
- Configure dCache to start draining the pool. Draining causes the replica manager to ensure that there are at least two more copies of every file on the pool, so that the loss of the pool does not result in loss of data. Draining the pool also instructs dCache not to put new files there.
- Run the file checker tool developed at UCSD to determine whether the files have copies on nodes other than the node with the failed disk (the logic is sketched after this list).
- Once the files are confirmed to be replicated to at least two other pools, the pool is taken offline, the disk is replaced, and the operating system is re-installed. After a disk replacement the pool starts life again as an empty pool.
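The logic behind the replication check is sketched below. This is not the actual UCSD file checker tool; it only illustrates the idea, assuming a pre-built mapping from each file's pnfs ID to the pools currently holding a replica (the IDs and pool names shown are made up).

# Sketch of the replication check done before retiring a pool.
# NOT the actual UCSD file checker tool; assumes replica_map has
# already been gathered from the dCache bookkeeping.

def files_needing_replication(replica_map, failing_pool, min_copies=2):
    """Return pnfs IDs with fewer than min_copies replicas on pools
    other than the failing pool."""
    at_risk = []
    for pnfsid, pools in replica_map.items():
        healthy = [p for p in pools if p != failing_pool]
        if len(healthy) < min_copies:
            at_risk.append(pnfsid)
    return at_risk

if __name__ == "__main__":
    # Toy data: one file is safely replicated, the other is not.
    replica_map = {
        "0001000000000000000010A0": ["pool1", "pool2", "pool3"],
        "0001000000000000000010B0": ["pool1", "pool2"],
    }
    for pnfsid in files_needing_replication(replica_map, "pool1"):
        print("pool1 cannot be retired yet: %s needs more replicas" % pnfsid)

Only when this list is empty for the failing pool is it safe to take the pool offline.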
Shared User Disk for the UAF
All shared disk for the UAF is placed on a shared NFS server. The NFS server itself uses a 16-disk RAID 5 array that is resilient against a single disk failure. In the event of a failure on the device, support at raidinc.com should be contacted; the email should include the device model, the firmware version, and the error encountered. Depending on the problem, Raid Inc. will suggest a remedy, which may include a media scan or possibly a disk replacement.
Replacement disks from Raid Inc. are usually sent out by overnight delivery.
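For intuition on why the array survives a single disk failure: RAID 5 parity is the XOR of the data blocks in each stripe, so any one lost block can be rebuilt by XORing the surviving blocks with the parity. The toy sketch below (a single stripe on a hypothetical 4-disk array; real arrays rotate parity across the members) demonstrates the reconstruction.

# Toy RAID 5 stripe: 3 data blocks + 1 parity block (parity = XOR of
# the data). Losing any single block is recoverable by XORing the rest.

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return result

# One stripe across a hypothetical 4-disk array: 3 data + 1 parity.
data = [bytearray(b"AAAA"), bytearray(b"BBBB"), bytearray(b"CCCC")]
parity = xor_blocks(data)

# Simulate losing the second data block; rebuild it from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt block: %r" % bytes(rebuilt))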
Replacing a Disk in the RAID Array
To replace a disk in the Falcon array, remove the failed disk from its slot, remove the old disk from the cradle, insert the new disk into the cradle, and re-insert the disk into the slot. Once inserted, the array should automatically detect the new disk and start a rebuild. You can determine the status of the rebuild by logging into the RAID device and checking the rebuild status.
Periodic Media Scans
Raid Inc. recommends performing periodic media scans on the RAID array.
To perform a media scan:
- Telnet to the RAID device and log in, entering the password when prompted
- Select Terminal (VT100 Mode) and hit enter, entering the password when prompted
- Select view and edit Logical drives and hit enter
- Select the drive to run the scan on and hit enter
- Select Media scan and hit escape
- Select Yes and hit enter
The media scan takes several hours; generally you can check back the next day. You can determine the progress of the media scan by selecting the drive from under the logical drives menu and selecting Media scan again.
Logging out from the Falcon
Press ctrl-shift-] to return to the telnet prompt, then type quit.
Authors
--
TerrenceMartin - 02 Oct 2006