Showing posts with label Nagios. Show all posts

Wednesday, August 12, 2015

Dell OpenManage Kill Switch



This kill switch protects machines when the cooling systems go down: it stops condor once the chassis temperature exceeds 28C and shuts the node down above 32C. It runs from /etc/cron.hourly and requires Dell's OpenManage software to be installed. Tested on Dell R410 and R710 servers running SL6.6.

# cat omreportkillswitch.sh
#!/bin/bash
#Kill Switch using chassis temp
temp=$(/opt/dell/srvadmin/bin/omreport chassis temps|grep Reading|awk '{ print $3}'|cut -d'.' -f1)
#disable condor killswitch
if [ $temp -gt 28 ]
  then #WARNING temps over 28C stop condor
    /etc/init.d/condor stop
fi
#shutdown node killswitch
if [ $temp -gt 32 ]
  then #CRITICAL temps over 32C shutdown node
    /sbin/shutdown -h now
fi
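
The script above compares a single number, so it assumes omreport prints exactly one "Reading" line. On a chassis that reports several probes the test would fail. A minimal sketch of one way around that, taking the hottest probe, is below. The helper name max_reading and the sample probe names are made up for illustration; the "Reading : 21.0 C" layout is assumed from the awk '{ print $3}' used above.

```shell
#!/bin/bash
# Sketch: pick the hottest probe from `omreport chassis temps` output on stdin.
# max_reading is a hypothetical helper, not part of OpenManage.
max_reading() {
    # Keep only "Reading : NN.N C" lines, truncate to whole degrees,
    # and print the largest value (prints nothing if no reading is found).
    awk '$1 == "Reading" { t = int($3); if (!seen || t > max) { max = t; seen = 1 } }
         END { if (seen) print max }'
}

# Illustrative input only: probe names and values are assumptions.
sample='Probe Name                : Probe A
Reading                   : 21.0 C
Probe Name                : Probe B
Reading                   : 34.5 C'

hottest=$(printf '%s\n' "$sample" | max_reading)
echo "hottest probe: ${hottest}C"
```

On a real node the same function would be fed with /opt/dell/srvadmin/bin/omreport chassis temps | max_reading, and the result used in place of $temp.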

Nagios check_ata

We get servers whose disks pass SMART tests but log many ATA errors.

I wrote a quick script, based on check_mcelog, that greps the kernel log for ATA errors, which can point to a failing disk or a loose cable.

# cat /usr/local/bin/check_ata
#!/bin/bash
#modified from check_mcelog
LOGFILE=/var/log/kern.log

if [ ! -f "$LOGFILE" ]
then
    echo "No logfile exists"
    exit 3
else
    ERRORS=$( grep -c -e "ATA bus error" -e "failed command" $LOGFILE )
    if [ $ERRORS -lt 1 ]
    then
        echo "OK: $ERRORS ATA errors found"
        exit 0
    elif [ $ERRORS -lt 100 ]
    then
        echo "WARNING: $ERRORS ATA errors found"
        exit 1
    else
        echo "CRITICAL: $ERRORS ATA errors found"
        exit 2
    fi
fi
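
Nagios plugins report state through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below wraps the same grep in a function so the thresholds can be passed as arguments and the exit codes are easy to test. The function name classify_ata_errors and the argument order are my own choices for illustration, not part of check_mcelog.

```shell
#!/bin/bash
# Sketch: same counting logic as check_ata, with the thresholds as arguments.
# classify_ata_errors is a hypothetical helper name.
classify_ata_errors() {
    local logfile=$1 warn=${2:-1} crit=${3:-100} errors
    if [ ! -f "$logfile" ]; then
        echo "UNKNOWN: $logfile not found"
        return 3
    fi
    # grep -c prints the number of matching lines (0 when nothing matches).
    errors=$(grep -c -e "ATA bus error" -e "failed command" "$logfile")
    if [ "$errors" -lt "$warn" ]; then
        echo "OK: $errors ATA errors found"
        return 0
    elif [ "$errors" -lt "$crit" ]; then
        echo "WARNING: $errors ATA errors found"
        return 1
    else
        echo "CRITICAL: $errors ATA errors found"
        return 2
    fi
}

# Usage against a throwaway log containing one matching line.
log=$(mktemp)
echo "Jun 25 16:19:50 e206 kernel: ata4.00: failed command: READ DMA EXT" > "$log"
classify_ata_errors "$log" 1 100
echo "exit status: $?"
rm -f "$log"
```

With one matching line and the default thresholds this reports WARNING and returns 1.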



For example, we see these kernel errors:

Jun 25 16:19:50 e206 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 25 16:19:50 e206 kernel: ata4.00: BMDMA stat 0x24
Jun 25 16:19:50 e206 kernel: ata4.00: failed command: READ DMA EXT
Jun 25 16:19:50 e206 kernel: ata4.00: cmd 25/00:a0:00:d0:13/00:03:5f:00:00/e0 tag 20 dma 475136 in
Jun 25 16:19:50 e206 kernel:         res 51/40:00:00:d2:13/40:00:5f:00:00/00 Emask 0x9 (media error)
Jun 25 16:19:50 e206 kernel: ata4.00: status: { DRDY ERR }
Jun 25 16:19:50 e206 kernel: ata4.00: error: { UNC }
Jun 25 16:19:51 e206 kernel: ata4.00: configured for UDMA/133
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb]  Sense Key : Medium Error [current] [descriptor]
Jun 25 16:19:51 e206 kernel: Descriptor sense data with sense descriptors (in hex):
Jun 25 16:19:51 e206 kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 25 16:19:51 e206 kernel:        5f 13 d2 00
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb]  Add. Sense: Unrecovered read error - auto reallocate failed
Jun 25 16:19:51 e206 kernel: sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 5f 13 d0 00 00 03 a0 00
Jun 25 16:19:51 e206 kernel: end_request: I/O error, dev sdb, sector 1595134464
Jun 25 16:19:51 e206 kernel: ata4: EH complete

And these smartd warnings:
Aug 12 10:04:31 host smartd[3720]: Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors
Aug 12 10:04:31 host smartd[3720]: Device: /dev/sdb [SAT], 2 Offline uncorrectable sectors

Yet the device still passes the SMART overall health check:

# smartctl -H /dev/sdb
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-2.6.32-504.16.2.el6.x86_64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
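
As far as I understand it, the overall health result is the drive's own verdict and can stay PASSED while individual attributes already show trouble. A rough sketch for looking at the two attributes smartd complained about (197 Current_Pending_Sector and 198 Offline_Uncorrectable) in smartctl -A output follows. The helper name pending_sectors is made up, and the sample rows are illustrative, modelled on the smartd messages above rather than captured from the disk.

```shell
#!/bin/bash
# Sketch: report nonzero raw values for SMART attributes 197 and 198.
# Reads `smartctl -A` output on stdin; pending_sectors is a hypothetical name.
pending_sectors() {
    # Attribute rows start with the numeric ID; the raw value is field 10.
    # Exit status is 1 if either attribute is nonzero, 0 otherwise.
    awk '($1 == 197 || $1 == 198) && $10 + 0 > 0 { print $2 "=" $10; bad = 1 }
         END { exit bad + 0 }'
}

# Illustrative attribute rows (flags and normalized values are assumptions).
sample='197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2'

printf '%s\n' "$sample" | pending_sectors
echo "exit status: $?"
```

On a live system the input would come from smartctl -A /dev/sdb | pending_sectors.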

This was tested on a Sun Fire X2250 running Scientific Linux 6.6.

Based on:
https://github.com/solarkennedy/nagios-plugins/blob/master/check_mcelog