smartctl-health

Understanding the cause of the error when smartctl reports FAILED


Change log

item      note
20170120  First version


Smartctl Reallocated_Sector_Ct

This attribute means that S.M.A.R.T. has already reported physical problems: damaged physical sectors have been detected on the disk.

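When the hard drive finds a read/write/verification error, it marks that sector as “reallocated” and transfers the data to a special reserved area (the spare area).
In other words, the drive verifies data by reading it back after writing it, and when that check fails it falls back on the spare area inside the disk.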

RAW_VALUE
Records the actual number of damaged sectors.
The raw value normally represents a count of the bad sectors that have been found and remapped.

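  • The attributes below indicate serious errors (when they fall outside the safe range)
ID   Attribute                     Description
1    Read Error Rate               Rate of low-level data read errors.
5    Reallocated Sector Count      Count of remapped sectors. Once it passes a certain value and the spare sectors are used up, remapping is no longer possible; the bad sectors then become visible and the drive cannot repair them itself.
10   Spin Retry Count              Spin-up retries. Frequent attempts by the spindle motor to start suggest that the drive may be close to the end of its life.
196  Reallocation Event Count      Count of remap events, covering sectors already remapped and sectors that may be remapped.
197  Current Pending Sector Count  Sectors waiting to be remapped, that is, the number of unstable sectors.
198  Uncorrectable Sector Count    Number of sectors that are definitely bad.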
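
To keep an eye on just the attributes listed above, the attribute table printed by `smartctl -A` can be filtered down to their raw values. The following is a minimal sketch, not a definitive tool: the function name `critical_attrs` is hypothetical, and it assumes the ten-column ATA attribute layout shown elsewhere on this page (ID#, ATTRIBUTE_NAME, FLAG, ..., RAW_VALUE).

```shell
#!/bin/sh
# Hypothetical helper: print "ID NAME RAW_VALUE" for the attributes in the
# table above. Reads `smartctl -A` style output on stdin.
# Assumption: attribute rows have at least 10 whitespace-separated columns
# and the third column (FLAG) starts with "0x".
critical_attrs() {
    awk '$1 ~ /^(1|5|10|196|197|198)$/ && $3 ~ /^0x/ && NF >= 10 {
             print $1, $2, $10
         }'
}

# Usage (assumes /dev/sda is the disk of interest):
#   smartctl -A /dev/sda | critical_attrs
```

A raw value that keeps growing between runs for ID 5, 196, 197 or 198 is usually a stronger warning sign than any single reading.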

smartctl health

smartctl procedure

  • Enable SMART
    smartctl -s on -d ata /dev/sda

    -s VALUE, --smart=VALUE
    Enable/disable SMART on device (on/off)

    -d TYPE, --device=TYPE
    Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test
  • Check health at regular intervals
    smartctl -H /dev/sda -d auto

    -H, --health
    Show device SMART health status
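
For a periodic check it helps to reduce the `smartctl -H` output to a single word that a script can compare. This is a sketch under stated assumptions: `smart_health_status` is a hypothetical name, it only recognizes the ATA-style "overall-health self-assessment test result" line, and the one-hour interval in the usage comment is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -H` output on stdin and print
# PASSED, FAILED, or UNKNOWN (when no result line was found).
smart_health_status() {
    awk '/overall-health self-assessment test result:/ {
             # The last word is "PASSED" or "FAILED!"
             status = ($NF == "PASSED") ? "PASSED" : "FAILED"
         }
         END { print (status == "" ? "UNKNOWN" : status) }'
}

# Usage (device name and interval are assumptions):
#   while :; do
#       s=$(smartctl -H -d auto /dev/sda | smart_health_status)
#       [ "$s" = PASSED ] || echo "SMART health on /dev/sda: $s" >&2
#       sleep 3600
#   done
```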

smartctl health messages

  • PASSED

    # smartctl -H /dev/sda -d auto
    smartctl 6.2 2013-07-26 r3841 [armv7l-linux-3.4.35_hi3535] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    #
  • FAILED

    # smartctl -H /dev/sdb -d auto
    smartctl 6.2 2013-07-26 r3841 [armv7l-linux-3.4.35_hi3535] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: FAILED!
    Drive failure expected in less than 24 hours. SAVE ALL DATA.
    Failed Attributes:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    5 Reallocated_Sector_Ct 0x0033 135 135 140 Pre-fail Always FAILING_NOW 27
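
In the FAILED output, Reallocated_Sector_Ct is marked FAILING_NOW because its normalized VALUE (135) is at or below THRESH (140). The same comparison can be applied to every row of the attribute table. This is only a sketch: `failing_attrs` is a hypothetical name, and skipping rows whose threshold is 0 rests on the assumption that such attributes are informational and never fail.

```shell
#!/bin/sh
# Hypothetical helper: list attributes whose normalized VALUE has dropped
# to or below THRESH. Reads the `smartctl -A` attribute table on stdin.
# Columns: 1=ID# 2=NAME 3=FLAG 4=VALUE 5=WORST 6=THRESH ... 10=RAW_VALUE
failing_attrs() {
    awk '$3 ~ /^0x/ && NF >= 10 && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) {
             # "+ 0" forces numeric comparison and drops leading zeros
             print $1, $2, "value=" ($4 + 0), "thresh=" ($6 + 0), "raw=" $10
         }'
}

# Usage (assumes /dev/sdb is the failing disk from the example above):
#   smartctl -A /dev/sdb | failing_attrs
```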

smartctl test

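Reallocated_Event_Count: count of remap events, covering sectors already remapped and sectors that may be remapped
Current_Pending_Sector: count of sectors waiting to be remapped, that is, the number of unstable sectors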

196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       647
197 Current_Pending_Sector 0x0032 001 001 000 Old_age Always - 65532
  • smartctl -t short /dev/sdb
    # smartctl -t short /dev/sdb 
    smartctl 6.2 2013-07-26 r3841 [armv7l-linux-3.4.35_hi3535] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
    Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 2 minutes for test to complete.
    Test will complete after Fri Jan 20 13:42:35 2017

    Use smartctl -X to abort test.
    # date
    Fri Jan 20 13:41:07 UTC 2017
    # date
    Fri Jan 20 13:41:44 UTC 2017
    # date
    Fri Jan 20 13:43:13 UTC 2017
    #
    =========================================================================================
    # smartctl -a /dev/sdb
    smartctl 6.2 2013-07-26 r3841 [armv7l-linux-3.4.35_hi3535] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Device Model: WDC WD10JUCT-63CYNY0
    Serial Number: WD-WX31AC4PVPHA
    LU WWN Device Id: 5 0014ee 65aae4265
    Firmware Version: 01.01A01
    User Capacity: 1,000,204,886,016 bytes [1.00 TB]
    Sector Sizes: 512 bytes logical, 4096 bytes physical
    Rotation Rate: 5400 rpm
    Device is: Not in smartctl database [for details use: -P showall]
    ATA Version is: ACS-2 (minor revision not indicated)
    SATA Version is: SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
    Local Time is: Fri Jan 20 13:43:41 2017 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: FAILED!
    Drive failure expected in less than 24 hours. SAVE ALL DATA.
    See vendor-specific Attribute list for failed Attributes.

    General SMART Values:
    Offline data collection status: (0x00) Offline data collection activity
    was never started.
    Auto Offline Data Collection: Disabled.
    Self-test execution status: ( 241) Self-test routine in progress...
    10% of test remaining.
    Total time to complete Offline
    data collection: (18420) seconds.
    Offline data collection
    capabilities: (0x7b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine
    recommended polling time: ( 2) minutes.
    Extended self-test routine
    recommended polling time: ( 206) minutes.
    Conveyance self-test routine
    recommended polling time: ( 5) minutes.
    SCT capabilities: (0x7035) SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
    3 Spin_Up_Time 0x0027 183 179 021 Pre-fail Always - 1833
    4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 441
    5 Reallocated_Sector_Ct 0x0033 135 135 140 Pre-fail Always FAILING_NOW 2770
    7 Seek_Error_Rate 0x002e 200 194 000 Old_age Always - 4
    9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 9398
    10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
    11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
    12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 441
    192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 439
    193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 140
    194 Temperature_Celsius 0x0022 114 099 000 Old_age Always - 33
    196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 647
    197 Current_Pending_Sector 0x0032 001 001 000 Old_age Always - 65532
    198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
    199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
    200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

    SMART Error Log Version: 1
    ATA Error Count: 9762 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 9762 occurred at disk power-on lifetime: 9396 hours (391 days + 12 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 61 46 00 00 00 a0 Device Fault; Error: ABRT

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    ef 03 46 00 00 00 a0 08 24d+01:45:47.162 SET FEATURES [Set transfer mode]
    ec 00 00 00 00 00 a0 08 24d+01:45:47.162 IDENTIFY DEVICE
    c8 00 08 00 00 00 e0 08 24d+01:45:47.128 READ DMA
    ec 00 00 00 00 00 a0 08 24d+01:45:47.116 IDENTIFY DEVICE
    ef 03 46 00 00 00 a0 08 24d+01:45:47.116 SET FEATURES [Set transfer mode]

    Error 9761 occurred at disk power-on lifetime: 9396 hours (391 days + 12 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 61 08 00 00 00 e0 Device Fault; Error: ABRT 8 sectors at LBA = 0x00000000 = 0

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    c8 00 08 00 00 00 e0 08 24d+01:45:47.128 READ DMA
    ec 00 00 00 00 00 a0 08 24d+01:45:47.116 IDENTIFY DEVICE
    ef 03 46 00 00 00 a0 08 24d+01:45:47.116 SET FEATURES [Set transfer mode]
    ec 00 00 00 00 00 a0 08 24d+01:45:47.115 IDENTIFY DEVICE
    c8 00 08 00 00 00 e0 08 24d+01:45:47.079 READ DMA

    Error 9760 occurred at disk power-on lifetime: 9396 hours (391 days + 12 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 61 46 00 00 00 a0 Device Fault; Error: ABRT

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    ef 03 46 00 00 00 a0 08 24d+01:45:47.116 SET FEATURES [Set transfer mode]
    ec 00 00 00 00 00 a0 08 24d+01:45:47.115 IDENTIFY DEVICE
    c8 00 08 00 00 00 e0 08 24d+01:45:47.079 READ DMA
    ec 00 00 00 00 00 a0 08 24d+01:45:47.071 IDENTIFY DEVICE
    ef 03 46 00 00 00 a0 08 24d+01:45:47.070 SET FEATURES [Set transfer mode]

    Error 9759 occurred at disk power-on lifetime: 9396 hours (391 days + 12 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 61 08 00 00 00 e0 Device Fault; Error: ABRT 8 sectors at LBA = 0x00000000 = 0

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    c8 00 08 00 00 00 e0 08 24d+01:45:47.079 READ DMA
    ec 00 00 00 00 00 a0 08 24d+01:45:47.071 IDENTIFY DEVICE
    ef 03 46 00 00 00 a0 08 24d+01:45:47.070 SET FEATURES [Set transfer mode]
    ec 00 00 00 00 00 a0 08 24d+01:45:47.070 IDENTIFY DEVICE
    c8 00 08 00 00 00 e0 08 24d+01:45:47.026 READ DMA

    Error 9758 occurred at disk power-on lifetime: 9396 hours (391 days + 12 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    04 61 46 00 00 00 a0 Device Fault; Error: ABRT

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
    -- -- -- -- -- -- -- -- ---------------- --------------------
    ef 03 46 00 00 00 a0 08 24d+01:45:47.070 SET FEATURES [Set transfer mode]
    ec 00 00 00 00 00 a0 08 24d+01:45:47.070 IDENTIFY DEVICE
    c8 00 08 00 00 00 e0 08 24d+01:45:47.026 READ DMA
    ec 00 00 00 00 00 a0 08 24d+01:45:47.015 IDENTIFY DEVICE
    ef 03 46 00 00 00 a0 08 24d+01:45:47.014 SET FEATURES [Set transfer mode]

    SMART Self-test log structure revision number 1
    No self-tests have been logged. [To run self-tests, use: smartctl -t]


    SMART Selective self-test log data structure revision number 1
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    #
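
Instead of calling `date` repeatedly while the short test runs, the progress can be read from the "Self-test execution status" value in the `smartctl -a` output. In the run above the value is 241 (0xF1): the high nibble 15 means a self-test is in progress, and the low nibble 1 means 10% of the test remains. The sketch below decodes that byte; both function names are hypothetical, and only states 0 and 15 are spelled out.

```shell
#!/bin/sh
# Hypothetical helper: pull the numeric status out of smartctl output,
# e.g. 241 from "Self-test execution status: ( 241) ...". Reads stdin.
selftest_status_code() {
    sed -n 's/^[[:space:]]*Self-test execution status:[^(]*( *\([0-9][0-9]*\)).*/\1/p'
}

# Hypothetical helper: decode the status byte given as $1.
decode_selftest_status() {
    code=$1
    state=$((code / 16))            # high nibble: test state
    remaining=$((code % 16 * 10))   # low nibble: remaining, in steps of 10%
    if [ "$state" -eq 15 ]; then
        echo "in progress, ${remaining}% remaining"
    elif [ "$state" -eq 0 ]; then
        echo "completed without error or never run"
    else
        echo "state ${state}, see the self-test log"
    fi
}

# Usage (assumes /dev/sdb, as in the run above):
#   decode_selftest_status "$(smartctl -a /dev/sdb | selftest_status_code)"
```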

Other references