We get servers whose disks pass SMART tests but still log many ATA errors.
I wrote a quick script, based on check_mcelog, that greps the kernel log for ATA errors, which can point to a failing disk or a loose cable.
# cat /usr/local/bin/check_ata
#!/bin/bash
# Modified from check_mcelog.
# Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
LOGFILE=/var/log/kern.log
if [ ! -f "$LOGFILE" ]
then
    echo "UNKNOWN: logfile $LOGFILE does not exist"
    exit 3
else
    # Count the log lines that report an ATA error
    ERRORS=$( grep -c -e "ATA bus error" -e "failed command" "$LOGFILE" )
    if [ "$ERRORS" -lt 1 ]
    then
        echo "OK: $ERRORS ATA errors found"
        exit 0
    elif [ "$ERRORS" -lt 100 ]
    then
        echo "WARNING: $ERRORS ATA errors found"
        exit 1
    else
        echo "CRITICAL: $ERRORS ATA errors found"
        exit 2
    fi
fi
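The script reports a single total, which does not say which disk is affected. As a minimal sketch, assuming the matching log lines carry an `ataN` or `ataN.NN` field followed by a colon (as in the kernel errors quoted in this post), the count can be broken down per device. The function name `count_ata_errors_by_device` is made up for this illustration and is not part of check_mcelog.

```shell
#!/bin/bash
# Sketch only: per-device breakdown of the lines check_ata counts.
# Reads kernel log lines on stdin, prints "<count> <device>" per ATA device.
count_ata_errors_by_device() {
    grep -e "ATA bus error" -e "failed command" \
        | awk '{
            # Find the field that looks like "ata4:" or "ata4.00:"
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                    sub(/:$/, "", $i)   # drop the trailing colon
                    print $i
                    break
                }
          }' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}
```

It could be run as `count_ata_errors_by_device < /var/log/kern.log`. Matching on a whole field rather than on a bare `ata` substring avoids false hits on hostnames that happen to contain "ata".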
For example, we see kernel errors like these:
Jun 25 16:19:50 e206 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 25 16:19:50 e206 kernel: ata4.00: BMDMA stat 0x24
Jun 25 16:19:50 e206 kernel: ata4.00: failed command: READ DMA EXT
Jun 25 16:19:50 e206 kernel: ata4.00: cmd 25/00:a0:00:d0:13/00:03:5f:00:00/e0 tag 20 dma 475136 in
Jun 25 16:19:50 e206 kernel: res 51/40:00:00:d2:13/40:00:5f:00:00/00 Emask 0x9 (media error)
Jun 25 16:19:50 e206 kernel: ata4.00: status: { DRDY ERR }
Jun 25 16:19:50 e206 kernel: ata4.00: error: { UNC }
Jun 25 16:19:51 e206 kernel: ata4.00: configured for UDMA/133
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Jun 25 16:19:51 e206 kernel: Descriptor sense data with sense descriptors (in hex):
Jun 25 16:19:51 e206 kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 25 16:19:51 e206 kernel: 5f 13 d2 00
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 5f 13 d0 00 00 03 a0 00
Jun 25 16:19:51 e206 kernel: end_request: I/O error, dev sdb, sector 1595134464
Jun 25 16:19:51 e206 kernel: ata4: EH complete
And these warnings from smartd:
Aug 12 10:04:31 host smartd[3720]: Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors
Aug 12 10:04:31 host smartd[3720]: Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors
Yet the device passes the SMART overall health check:
# smartctl -H /dev/sdb
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-2.6.32-504.16.2.el6.x86_64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
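The overall health result can stay at PASSED while individual attributes already show trouble, as the smartd warnings above suggest. As a rough sketch, assuming the usual `smartctl -A` attribute table (attribute name in the second column, raw value in the tenth) and assuming the drive names these attributes `Current_Pending_Sector` and `Offline_Uncorrectable` (names vary between vendors), the raw counts can be pulled out like this. The function name `nonzero_sector_attrs` is hypothetical.

```shell
#!/bin/bash
# Sketch only: reads "smartctl -A" output on stdin and prints
# "<attribute> <raw value>" for pending/uncorrectable sector
# attributes whose raw value is greater than zero.
nonzero_sector_attrs() {
    awk '($2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable") \
         && ($10 + 0) > 0 { print $2, $10 + 0 }'
}
```

Usage would be along the lines of `smartctl -A /dev/sdb | nonzero_sector_attrs`; any output means the drive is reporting bad sectors regardless of the PASSED verdict.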
The check_ata script was tested on a Sun Fire X2250 running Scientific Linux 6.6.
Based on:
https://github.com/solarkennedy/nagios-plugins/blob/master/check_mcelog