Hardware RAID and SMART disk monitoring

Install the smartmontools package and set up disk monitoring.

First, the case of an Adaptec 2405 hardware RAID controller.

The physical disks appear in the system as /dev/sgX, but only if the sg module is loaded.
If there are no /dev/sgX devices, try loading the sg module:

modprobe sg

Check:

ls -la /dev/sg*
crw------- 1 root root 21, 0 Jul  5 14:41 /dev/sg0
crw------- 1 root root 21, 1 Jul  5 14:41 /dev/sg1
crw------- 1 root root 21, 2 Jul  5 14:41 /dev/sg2
crw------- 1 root root 21, 3 Jul  5 14:41 /dev/sg3
crw------- 1 root root 21, 4 Jul  5 14:41 /dev/sg4

All is well, the disks are visible.
/dev/sg0 is the controller itself; sg1-sg4 are our disks.
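A module loaded with modprobe is gone after a reboot. A minimal sketch for making it persistent, assuming a Debian-like system that loads the modules listed in /etc/modules at boot (the /etc/default/smartmontools path used later suggests one, but that is an assumption); the function name is hypothetical:

```shell
# Append a module name to a modules list file unless it is already listed.
# Assumption: the distribution loads modules named in /etc/modules at boot
# (Debian/Ubuntu convention). ensure_module_line is a hypothetical helper.
ensure_module_line() {
    file=$1
    module=$2
    # -q: silent, -x: match the whole line, -F: treat the name as a fixed string
    if ! grep -qxF "$module" "$file" 2>/dev/null; then
        printf '%s\n' "$module" >> "$file"
    fi
}

# Usage (as root): ensure_module_line /etc/modules sg
```

Running it twice leaves a single "sg" line in the file, so it is safe to call from a provisioning script.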

Now configure smartmontools.
Edit /etc/smartd.conf and comment out the line:

# DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

and add below it:

/dev/sg1 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg2 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg3 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg4 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

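where:
-m root – who receives the monitoring notifications
-d sat – the device type (SAT, SCSI-to-ATA Translation: SATA disks addressed through a SCSI layer)
-n standby – do not wake a disk that is in standby mode just to poll it
-M exec – run the given script instead of the default mail command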
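With more disks, the per-disk lines can be generated rather than typed. A small sketch, assuming the disks sit on consecutive /dev/sgN nodes as in the listing above; gen_sat_lines is a hypothetical helper name:

```shell
# Print one smartd.conf line per physical disk behind the controller.
# Arguments: first and last sg index (1 and 4 in the setup described here).
# Assumption: the disks occupy consecutive /dev/sgN nodes.
gen_sat_lines() {
    i=$1
    last=$2
    while [ "$i" -le "$last" ]; do
        printf '/dev/sg%s -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner\n' "$i"
        i=$((i + 1))
    done
}

# Usage (as root): gen_sat_lines 1 4 >> /etc/smartd.conf
```

Review the output before appending it; the function only formats text and does not check that the devices exist.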

Enable the daemon in /etc/default/smartmontools and start smartmontools.
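On Debian-based releases of that era, enabling the daemon means setting start_smartd=yes in that file. A hedged sketch, assuming this variable name and the /etc/init.d/smartmontools init script apply to your release; enable_smartd is a hypothetical helper:

```shell
# Make sure the defaults file contains an active "start_smartd=yes" line.
# Assumption: the init script reads start_smartd from this file, as older
# Debian/Ubuntu packages do. A commented-out "#start_smartd=yes" is left alone.
enable_smartd() {
    file=$1
    if grep -q '^start_smartd=yes' "$file"; then
        return 0    # already enabled, nothing to do
    fi
    printf '%s\n' 'start_smartd=yes' >> "$file"
}

# Usage (as root):
#   enable_smartd /etc/default/smartmontools
#   /etc/init.d/smartmontools start
```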

Now the same, but with an LSI 9260 hardware RAID controller.

Edit /etc/smartd.conf and comment out the line:

# DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

and add below it:

/dev/sg0 -d megaraid,0 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg0 -d megaraid,1 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg0 -d megaraid,2 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg0 -d megaraid,3 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

where:
/dev/sg0 – our RAID controller,
megaraid,X – the number of the physical disk on the controller,
-m root – who receives the monitoring notifications
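The disk numbers are not always 0..3; on some setups they start elsewhere or have gaps. A sketch for finding the numbers that answer, assuming smartctl exits with a non-zero status when it cannot open the device; probe_megaraid_ids and the SMARTCTL override variable are hypothetical names:

```shell
# Print every megaraid disk number in 0..MAX for which "smartctl -i" succeeds.
# Assumption: smartctl returns a non-zero exit status if the device cannot be opened.
# SMARTCTL may point at another binary (or a stub for dry runs); it defaults to smartctl.
probe_megaraid_ids() {
    dev=$1
    max=$2
    n=0
    while [ "$n" -le "$max" ]; do
        if "${SMARTCTL:-smartctl}" -i -d "megaraid,$n" "$dev" >/dev/null 2>&1; then
            echo "$n"
        fi
        n=$((n + 1))
    done
}

# Usage (as root): probe_megaraid_ids /dev/sg0 31
```

Each number printed can then be used in a "-d megaraid,N" line in /etc/smartd.conf.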

After that, start the smartmontools daemon and check the log. If everything is fine, the log will contain something like this:

Jul 12 11:28:45 node1 smartd[273479]: smartd 5.40 2010-07-12 r3124 [x86_64-unknown--gnu] (local build)#012Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net#012
Jul 12 11:28:45 node1 smartd[273479]: Opened configuration file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 22 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 23 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 24 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 25 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Configuration file /etc/smartd.conf parsed.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1006xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1002xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0993xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0981xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]:  4 ATA and 0 SCSI devices
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1006xxx.ata.state
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1002xxx.ata.state
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0993xxx.ata.state
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0981xxx.ata.state
Jul 12 11:29:05 node1 smartd[273489]: smartd has fork()ed into background mode. New PID=273489.
Jul 12 11:29:05 node1 smartd[273489]: file /var/run/smartd.pid written containing PID 273489

From now on the SMART status of our disks is monitored, and if a problem appears, a notification is sent by email (here, to root).
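One way to confirm that notifications really arrive is smartd's "-M test" directive, which sends a single test message when the daemon starts. A sketch for one of the Adaptec entries, assuming the smartd-runner script on your system passes the message on to mail; remove "-M test" again once the message has arrived:

```conf
# /etc/smartd.conf fragment: the same entry as before, plus a one-off test message
/dev/sg1 -d sat -n standby -m root -M test -M exec /usr/share/smartmontools/smartd-runner
```

Restart the daemon after the change and check root's mailbox.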

The current SMART state can be viewed like this.
For Adaptec:

# smartctl  -A -d sat /dev/sg1
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   091   091   016    Pre-fail  Always       -       12893
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       93
  3 Spin_Up_Time            0x0007   115   115   024    Pre-fail  Always       -       200 (Average 200)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   138   138   020    Pre-fail  Offline      -       31
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5099
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Lifetime Min/Max 25/33)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

For LSI:

# smartctl -A -d megaraid,0 /dev/sg0
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sg0 [megaraid_disk_00] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   143   143   021    Pre-fail  Always       -       3808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9314
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   118   109   000    Old_age   Always       -       6169 (0 0 0 26)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

For HP P222:
# smartctl -a -d sat+cciss,X /dev/sda
where X is the number of the physical disk in the array
and /dev/sda is the logical disk

# smartctl -a -d sat+cciss,1 /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-39-] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     MB2000GCVBR
Serial Number:    WCC1P0573706
LU WWN Device Id: 5 0014ee 25e116efc
Firmware Version: HPG1
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Wed Feb 24 08:18:49 2016 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (24960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   206   206   021    Pre-fail  Always       -       4691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       20354
 10 Spin_Retry_Count        0x0033   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0033   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
180 Unused_Rsvd_Blk_Cnt_Tot 0x002f   200   200   100    Pre-fail  Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   097   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       17180131335
190 Airflow_Temperature_Cel 0x0022   069   060   045    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       296
194 Temperature_Celsius     0x0022   119   110   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
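For scripting, a single value can be pulled out of attribute tables like the ones above. A minimal sketch, assuming the ten-column layout shown in these listings; smart_raw_value is a hypothetical helper, and attributes such as Reallocated_Sector_Ct or Current_Pending_Sector are only common choices to watch, not a recommendation from the smartmontools authors:

```shell
# Read "smartctl -A" output on stdin and print the RAW_VALUE of one attribute.
# Only the first token of RAW_VALUE is printed, so "31 (Lifetime Min/Max 25/33)"
# yields "31". Assumption: the ten-column table layout shown in this article.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Usage (as root):
#   smartctl -A -d sat /dev/sg1 | smart_raw_value Reallocated_Sector_Ct
#   smartctl -A -d megaraid,0 /dev/sg0 | smart_raw_value Current_Pending_Sector
```

If the attribute is not in the table, nothing is printed, which a calling script can treat as "unknown".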

As you can see, nothing complicated 🙂
