Wednesday, August 12, 2015

Nagios check_ata

We get servers with disks that pass SMART tests but log many ATA errors.

I wrote a quick script, based on check_mcelog, that greps the kernel log for ATA errors, which can point to a failing disk or a loose cable.

# cat /usr/local/bin/check_ata
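#!/bin/bash
# Modified from check_mcelog.
# Counts ATA error lines in the kernel log and maps the count to a Nagios status.
LOGFILE=/var/log/kern.log

if [ ! -f "$LOGFILE" ]
then
    # Exit code 3 = UNKNOWN
    echo "No logfile exists"
    exit 3
else
    # grep -c prints the number of matching lines (0 if none match)
    ERRORS=$( grep -c -e "ATA bus error" -e "failed command" "$LOGFILE" )
    if [ "$ERRORS" -lt 1 ]
    then
        echo "OK: $ERRORS ATA errors found"
        exit 0
    elif [ "$ERRORS" -lt 100 ]
    then
        echo "WARNING: $ERRORS ATA errors found"
        exit 1
    else
        # Exit code 2 = CRITICAL (exit 1 would be reported as WARNING)
        echo "CRITICAL: $ERRORS ATA errors found"
        exit 2
    fi
fi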
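
The thresholds in the script are hard-coded. As a minimal sketch, the same counting logic can be wrapped in a function that takes the log path and thresholds as arguments and returns the standard Nagios plugin codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name check_ata_log and its WARN/CRIT arguments are hypothetical and not part of the original plugin:

```shell
# Hypothetical helper: check_ata_log LOGFILE [WARN] [CRIT]
# WARN and CRIT are assumed thresholds; the defaults mirror the 1 and 100
# used in the script above.
check_ata_log() {
    local logfile warn crit errors
    logfile=$1
    warn=${2:-1}
    crit=${3:-100}

    if [ ! -f "$logfile" ]; then
        echo "UNKNOWN: $logfile does not exist"
        return 3
    fi

    # grep -c prints the number of matching lines (0 when nothing matches).
    errors=$(grep -c -e "ATA bus error" -e "failed command" "$logfile")

    if [ "$errors" -lt "$warn" ]; then
        echo "OK: $errors ATA errors found"
        return 0
    elif [ "$errors" -lt "$crit" ]; then
        echo "WARNING: $errors ATA errors found"
        return 1
    else
        echo "CRITICAL: $errors ATA errors found"
        return 2
    fi
}
```

For example, check_ata_log /var/log/kern.log 1 50 would go CRITICAL at 50 matching lines instead of 100.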



For example, we see these kernel errors:

Jun 25 16:19:50 e206 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 25 16:19:50 e206 kernel: ata4.00: BMDMA stat 0x24
Jun 25 16:19:50 e206 kernel: ata4.00: failed command: READ DMA EXT
Jun 25 16:19:50 e206 kernel: ata4.00: cmd 25/00:a0:00:d0:13/00:03:5f:00:00/e0 tag 20 dma 475136 in
Jun 25 16:19:50 e206 kernel:         res 51/40:00:00:d2:13/40:00:5f:00:00/00 Emask 0x9 (media error)
Jun 25 16:19:50 e206 kernel: ata4.00: status: { DRDY ERR }
Jun 25 16:19:50 e206 kernel: ata4.00: error: { UNC }
Jun 25 16:19:51 e206 kernel: ata4.00: configured for UDMA/133
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb]  Sense Key : Medium Error [current] [descriptor]
Jun 25 16:19:51 e206 kernel: Descriptor sense data with sense descriptors (in hex):
Jun 25 16:19:51 e206 kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 25 16:19:51 e206 kernel:        5f 13 d2 00
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb]  Add. Sense: Unrecovered read error - auto reallocate failed
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 5f 13 d0 00 00 03 a0 00
Jun 25 16:19:51 e206 kernel: end_request: I/O error, dev sdb, sector 1595134464
Jun 25 16:19:51 e206 kernel: ata4: EH complete
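
A total count does not say which disk is affected. A rough sketch for breaking the "failed command" lines down by ATA port follows; the function name count_ata_errors_by_port is hypothetical. It only looks at "failed command" lines because, as in the sample above, the "res ..." continuation line that carries the error description has no port prefix:

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<count> <port>" for each ATA port with failed commands, busiest first.
count_ata_errors_by_port() {
    awk '/failed command/ {
        # Find the token that looks like "ata4.00:" or "ata4:".
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                p = $i
                sub(/:$/, "", p)    # drop the trailing colon
                count[p]++
                break
            }
        }
    }
    END { for (p in count) print count[p], p }' | sort -rn
}
```

It can be fed the log directly, for example: count_ata_errors_by_port < /var/log/kern.log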

And these SMART errors from smartd:
Aug 12 10:04:31 host smartd[3720]: Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors
Aug 12 10:04:31 host smartd[3720]: Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors

Yet the device still passes the SMART overall health check:

# smartctl -H /dev/sdb
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-2.6.32-504.16.2.el6.x86_64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
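
The PASSED result from smartctl -H is only the drive's overall self-assessment, so the pending and uncorrectable sector counters that smartd reports can be nonzero while the drive still passes. A minimal sketch for reading those counters directly, assuming the usual ten-column "smartctl -A" attribute table with the raw value in the last column (the function name smart_sector_counters is hypothetical):

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# attribute name and raw value for the two sector counters smartd warns about.
smart_sector_counters() {
    awk '$2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" {
        print $2, $10    # column 2 = attribute name, column 10 = RAW_VALUE
    }'
}
```

Usage would be: smartctl -A /dev/sdb | smart_sector_counters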

This was tested on a Sun Fire X2250 running Scientific Linux 6.6.

Based on:
https://github.com/solarkennedy/nagios-plugins/blob/master/check_mcelog
