Install the smartmontools package and set up disk monitoring.
With the Adaptec 2405 hardware RAID controller.
The physical disks appear in the system as /dev/sgX, but only if the sg module is loaded.
If there are no /dev/sgX devices, try loading the sg module:
modprobe sg
Check:
ls -la /dev/sg*
crw------- 1 root root 21, 0 Jul 5 14:41 /dev/sg0
crw------- 1 root root 21, 1 Jul 5 14:41 /dev/sg1
crw------- 1 root root 21, 2 Jul 5 14:41 /dev/sg2
crw------- 1 root root 21, 3 Jul 5 14:41 /dev/sg3
crw------- 1 root root 21, 4 Jul 5 14:41 /dev/sg4
Everything is fine, the disks are visible.
/dev/sg0 is the controller itself, sg1-sg4 are our disks.
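The module check described above can be scripted. This is only a rough sketch: the function names module_loaded and ensure_sg are made up for illustration, and it assumes sg is built as a loadable module (a driver compiled into the kernel does not show up in /proc/modules).

```shell
#!/bin/sh
# module_loaded NAME [FILE]: succeed if NAME is listed in a /proc/modules-style
# file. FILE defaults to /proc/modules; the parameter exists only so the
# function can be tried against a copy of that file.
module_loaded() {
    name=$1
    file=${2:-/proc/modules}
    # Every line of /proc/modules starts with the module name and a space.
    grep -q "^${name} " "$file"
}

# ensure_sg: load sg only when it is not listed yet (needs root).
# Hypothetical helper, not called automatically by this script.
ensure_sg() {
    module_loaded sg || modprobe sg
}
```

Run ensure_sg as root before looking at /dev/sg*. On Debian-like systems, adding a line with sg to /etc/modules should make the module load at boot.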
Now configure smartmontools.
Edit the config /etc/smartd.conf. Comment out the line:
# DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
and add below it:
/dev/sg1 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg2 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg3 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg4 -d sat -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
where:
-m root – who receives the monitoring notifications
-d sat – the device type (SCSI to ATA Translation, i.e. SATA disks behind the controller)
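With more disks, typing these entries by hand gets tedious. A minimal sketch that prints them in a loop; gen_sat_lines and SMARTD_OPTS are names invented here, and the options are simply copied from the entries above.

```shell
#!/bin/sh
# Options copied from the smartd.conf entries above; adjust to taste.
SMARTD_OPTS='-n standby -m root -M exec /usr/share/smartmontools/smartd-runner'

# gen_sat_lines FIRST LAST: print one smartd.conf entry for each /dev/sgN,
# with N running from FIRST to LAST inclusive.
gen_sat_lines() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf '/dev/sg%d -d sat %s\n' "$i" "$SMARTD_OPTS"
        i=$((i + 1))
    done
}
```

For the four disks above, gen_sat_lines 1 4 prints the same four entries; review the output before appending it to /etc/smartd.conf.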
Enable the daemon in /etc/default/smartmontools and start smartmontools.
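Enabling the daemon usually means uncommenting one line. The sketch below assumes the Debian packaging of that era, where /etc/default/smartmontools ships a commented-out start_smartd=yes line; if your file looks different, edit it by hand. The function name enable_smartd is made up for illustration.

```shell
#!/bin/sh
# enable_smartd [FILE]: uncomment the start_smartd=yes line in FILE.
# FILE defaults to /etc/default/smartmontools (assumed Debian layout).
enable_smartd() {
    file=${1:-/etc/default/smartmontools}
    tmp=$(mktemp) || return 1
    # Strip a leading "#" (and optional spaces) in front of start_smartd=yes.
    sed 's/^#[[:space:]]*start_smartd=yes/start_smartd=yes/' "$file" > "$tmp" &&
        cat "$tmp" > "$file"   # cat keeps the owner and mode of the original file
    status=$?
    rm -f "$tmp"
    return $status
}
```

After that the daemon is started with the distribution's init script, which on such systems is probably /etc/init.d/smartmontools start.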
Now the same thing, but with the LSI 9260 hardware RAID controller.
Edit the config /etc/smartd.conf. Comment out the line:
# DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
and add below it:
/dev/sg0 -d megaraid,0 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg0 -d megaraid,1 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg0 -d megaraid,2 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sg0 -d megaraid,3 -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
where:
/dev/sg0 – our RAID controller,
megaraid,X – the number of the physical disk behind the controller,
-m root – who receives the monitoring notifications
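The disk numbers behind the controller do not have to be 0 to 3, so it can help to probe which ones answer before writing the config. A rough sketch: find_megaraid_ids and the PROBE variable are invented for this example, and it assumes smartctl -i returns a non-zero exit status for a number with no disk behind it.

```shell
#!/bin/sh
# Command used to probe one disk number; overridable for a dry run.
PROBE=${PROBE:-smartctl}

# find_megaraid_ids CTRL_DEV MAX_ID: print every number N in 0..MAX_ID for
# which "smartctl -i -d megaraid,N CTRL_DEV" exits successfully.
find_megaraid_ids() {
    id=0
    while [ "$id" -le "$2" ]; do
        if "$PROBE" -i -d "megaraid,$id" "$1" >/dev/null 2>&1; then
            echo "$id"
        fi
        id=$((id + 1))
    done
}
```

For example, find_megaraid_ids /dev/sg0 15 (run as root) lists the numbers to put after megaraid, in smartd.conf. The upper bound 15 is arbitrary.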
After that, start the smartmontools daemon and look at the log. If everything is fine, the log will contain something like this:
Jul 12 11:28:45 node1 smartd[273479]: smartd 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)#012Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net#012
Jul 12 11:28:45 node1 smartd[273479]: Opened configuration file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 22 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 23 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 24 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Drive: /dev/sg0, implied '-a' Directive on line 25 of file /etc/smartd.conf
Jul 12 11:28:45 node1 smartd[273479]: Configuration file /etc/smartd.conf parsed.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1006xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1002xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0993xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0, type changed from 'megaraid' to 'sat'
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], opened
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], not found in smartd database.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], is SMART capable. Adding to "monitor" list.
Jul 12 11:28:45 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0981xxx.ata.state
Jul 12 11:28:45 node1 smartd[273479]: monitoring 4 ATA and 0 SCSI devices
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_00] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1006xxx.ata.state
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_01] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP1002xxx.ata.state
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_02] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0993xxx.ata.state
Jul 12 11:29:05 node1 smartd[273479]: Device: /dev/sg0 [megaraid_disk_03] [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD5003ABYX_18WERA0-WD_WMAYP0981xxx.ata.state
Jul 12 11:29:05 node1 smartd[273489]: smartd has fork()ed into background mode. New PID=273489.
Jul 12 11:29:05 node1 smartd[273489]: file /var/run/smartd.pid written containing PID 273489
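The line worth checking in this log is "monitoring 4 ATA and 0 SCSI devices": the numbers should match the disks listed in the config. A small sketch that pulls them out; monitored_count is a made-up name, the pattern only matches the message format shown above (other smartd versions may word it differently), and the log path in the usage note is an assumption.

```shell
#!/bin/sh
# monitored_count: read syslog lines on stdin and print "ATA SCSI" counts
# taken from the most recent "monitoring N ATA and M SCSI devices" message.
monitored_count() {
    sed -n 's/.*smartd\[[0-9]*\]: monitoring \([0-9]*\) ATA and \([0-9]*\) SCSI devices.*/\1 \2/p' |
        tail -n 1
}
```

Typical use would be monitored_count < /var/log/syslog, assuming smartd logs there on your system.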
Now the SMART status of our disks is monitored, and if there are problems a notification is sent by email (in this case, to root).
The current SMART state can be viewed like this:
For Adaptec:
# smartctl -A -d sat /dev/sg1
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   091   091   016    Pre-fail  Always       -       12893
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       93
  3 Spin_Up_Time            0x0007   115   115   024    Pre-fail  Always       -       200 (Average 200)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   138   138   020    Pre-fail  Offline      -       31
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5099
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Lifetime Min/Max 25/33)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
For LSI:
# smartctl -A -d megaraid,0 /dev/sg0
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
/dev/sg0 [megaraid_disk_00] [SAT]: Device open changed type from 'megaraid' to 'sat'
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   143   143   021    Pre-fail  Always       -       3808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9314
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   118   109   000    Old_age   Always       -       6169 (0 0 0 26)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
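Reading these tables by eye gets old quickly, so here is a sketch that extracts single values from smartctl -A output. The names attr_raw and disk_suspect are invented, and treating non-zero Reallocated_Sector_Ct or Current_Pending_Sector as "suspect" is my own rule of thumb, not something smartmontools defines.

```shell
#!/bin/sh
# attr_raw NAME: read `smartctl -A` output on stdin and print the RAW_VALUE
# (10th whitespace-separated column) of the attribute called NAME.
attr_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# disk_suspect: read `smartctl -A` output on stdin; exit 0 when the raw count
# of reallocated or pending sectors is above zero, exit 1 otherwise.
disk_suspect() {
    out=$(cat)
    realloc=$(printf '%s\n' "$out" | attr_raw Reallocated_Sector_Ct)
    pending=$(printf '%s\n' "$out" | attr_raw Current_Pending_Sector)
    # Missing attributes are treated as 0.
    [ "${realloc:-0}" -gt 0 ] || [ "${pending:-0}" -gt 0 ]
}
```

For example: smartctl -A -d megaraid,0 /dev/sg0 | attr_raw Power_On_Hours prints the power-on hours of the first disk.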
For HP P222:
# smartctl -a -d sat+cciss,X /dev/sda
where X – the number of the physical disk in the array
/dev/sda – the logical disk
# smartctl -a -d sat+cciss,1 /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-39-pve] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model:     MB2000GCVBR
Serial Number:    WCC1P0573706
LU WWN Device Id: 5 0014ee 25e116efc
Firmware Version: HPG1
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Wed Feb 24 08:18:49 2016 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (24960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   206   206   021    Pre-fail  Always       -       4691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       20354
 10 Spin_Retry_Count        0x0033   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0033   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
180 Unused_Rsvd_Blk_Cnt_Tot 0x002f   200   200   100    Pre-fail  Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   097   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       17180131335
190 Airflow_Temperature_Cel 0x0022   069   060   045    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       296
194 Temperature_Celsius     0x0022   119   110   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
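For a quick look across several disks, the overall verdict line is usually enough. A minimal sketch, with health_result as a made-up name; it only recognizes the ATA-style line shown in the output above.

```shell
#!/bin/sh
# health_result: read `smartctl -H` or `smartctl -a` output on stdin and print
# the overall verdict (for example PASSED); prints nothing if the line is absent.
health_result() {
    sed -n 's/^SMART overall-health self-assessment test result: //p'
}
```

It can be used in a loop such as: for i in 0 1 2 3; do smartctl -H -d sat+cciss,$i /dev/sda | health_result; done. The disk numbers 0 to 3 here are just an example, use the ones present behind your controller.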
As you can see, nothing complicated 🙂