SMART data & Self tests, not sure if my SSD is on its last gasp

Joshua Judson Rosen rozzin at hackerposse.com
Wed Dec 30 15:53:35 EST 2020


Storage still scares me, just as a general principle,
so I'm basically never going to say "you really have nothing to worry about",
but I think I _might_ be able to settle your nerves a little:

On 12/30/20 2:04 PM, Bruce Labitt wrote:
> I think I have a SSD on the way out.  Last reboot took a REALLY long
> time.  Like 30 minutes.
Are you sure your computer wasn't just running an extensive fsck during that boot?

Assuming you're running one of the "ext" filesystem variants (ext4, ext3...),
you can try running dumpe2fs on each of your filesystems and looking at the "Last checked" field.
If that matches the time of that slow boot, there you go.
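
If you'd rather script that comparison, here is a rough Python sketch. It assumes
you have already saved the text printed by `dumpe2fs -h` (which usually needs root)
and that you know roughly when the slow boot started (`who -b` is one place to look).
The sample text, the helper names, and the five-minute tolerance are all made up
for illustration.

```python
from datetime import datetime, timedelta

# Made-up sample of the relevant lines from `dumpe2fs -h` output.
SAMPLE = """\
Last mount time:          Wed Dec 30 13:20:05 2020
Last write time:          Wed Dec 30 13:20:05 2020
Last checked:             Wed Dec 30 12:50:11 2020
"""

def parse_fields(text):
    """Turn 'Name:   value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields

def parse_time(stamp):
    """Parse a ctime-style timestamp such as 'Wed Dec 30 12:50:11 2020'."""
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")

def fsck_ran_during_boot(fields, boot_time, tolerance=timedelta(minutes=5)):
    """True if 'Last checked' is no earlier than the boot time (minus a tolerance)."""
    return parse_time(fields["Last checked"]) >= boot_time - tolerance

fields = parse_fields(SAMPLE)
print(fsck_ran_during_boot(fields, datetime(2020, 12, 30, 12, 48, 0)))
```

A True here only says that a check happened on that boot, not how long it took.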

IIRC ext3 used to force a periodic full fsck by default. I'm not sure what the intervals were,
what the current defaults are, or when they might have changed. A lot of people liked to
disable them, though, because otherwise the lengthy fsck always seemed to come at the
most unexpected and inopportune times (especially on laptops that might be running battery-only).
The relevant fields in the dumpe2fs output here are "Maximum mount count" and "Check interval".
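
As a sketch of how those two fields can be read: as far as I know, a maximum mount
count of 0 or -1 and a check interval of 0 both mean "never force a check on that
basis" (those are the values `tune2fs -c` and `tune2fs -i` use to turn the checks off).
The sample text below is made up and the function names are hypothetical.

```python
import re

# Made-up sample lines in the format printed by `dumpe2fs -h`.
SAMPLE = """\
Mount count:              12
Maximum mount count:      -1
Check interval:           0 (<none>)
"""

def superblock_field(text, name):
    """Return the value of a 'Name:   value' line, or None if it is absent."""
    match = re.search(rf"^{re.escape(name)}:\s*(.*)$", text, re.MULTILINE)
    return match.group(1).strip() if match else None

def periodic_fsck_settings(text):
    """Report whether mount-count-based and time-based checks look enabled."""
    max_mounts = int(superblock_field(text, "Maximum mount count"))
    # The interval is printed as seconds followed by a human-readable form.
    interval_seconds = int(superblock_field(text, "Check interval").split()[0])
    return {
        "by_mount_count": max_mounts > 0,
        "by_time": interval_seconds > 0,
    }

print(periodic_fsck_settings(SAMPLE))
```

If both come back False, a periodic fsck is not what slowed down the boot.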

Your smartctl output actually doesn't sound any alarms for me:

> I ran the smart data and self test and the SSD
> passes.  Overall assessment is disk is ok.  I really don't know how to
> interpret what the results are.
> 
> I think the disk is in pre-fail based on the smartctl output below

I think you're misreading the `attribute TYPE' column as an `attribute value summary interpretation'.

"Pre-Fail" doesn't mean "this drive *is* about to fail according to the current value of this attribute";
it just means "this drive *would be* about to fail if the current value were
at or below the value in the THRESH column".

The relevant paragraph from the smartctl manual:

         The Attribute table printed  out  by  smartctl  also  shows  the
         "TYPE"  of  the  Attribute.   Attributes are one of two possible
         types: Pre-failure or Old age.  Pre-failure Attributes are  ones
         which, if less than or equal to their threshold values, indicate
         pending disk failure.  Old age, or usage  Attributes,  are  ones
         which  indicate end-of-product life from old-age or normal aging
         and wearout, if the Attribute value is less than or equal to the
         threshold.   Please  note: the fact that an Attribute is of type
         'Pre-fail' does not mean that your disk is about  to  fail!   It
         only  has  this  meaning  if  the Attribute's current Normalized
         value is less than or equal to the threshold value.
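
Spelled out as code, the rule in that paragraph is a single comparison. This is
only my reading of the manual text, not how smartctl itself is implemented; the
function name and the status strings are made up. The first three rows use numbers
from your report and the last one is an invented failing case.

```python
def attribute_status(attr_type, value, thresh):
    """Apply the manual's rule: an attribute only signals trouble when VALUE <= THRESH."""
    if value > thresh:
        return "ok"
    # At or below the threshold, the TYPE column says which kind of trouble it is.
    return "failing now" if attr_type == "Pre-fail" else "worn out"

rows = [
    ("Raw_Read_Error_Rate", "Pre-fail", 100, 50),   # from the report
    ("Wear_Leveling_Count", "Pre-fail", 98, 10),    # from the report
    ("Power_On_Hours", "Old_age", 100, 1),          # from the report
    ("Invented_Attribute", "Pre-fail", 9, 10),      # made up, to show a failure
]

for name, attr_type, value, thresh in rows:
    print(f"{name:<24} {attr_type:<9} {attribute_status(attr_type, value, thresh)}")
```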


Just going by your smartctl report, this drive looks `practically new' to me:
the current and `worst ever seen' values are all at 100 (or 98, for the two wear-related attributes),
and the closest pre-fail indicator is `not until it gets down to 50' (the others are
either `not until it gets down to 10' or `not until it gets down to 1').
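
To put numbers on that headroom, here is a small Python sketch that takes the
Pre-fail rows from your report (pasted as one attribute per line) and subtracts
THRESH from VALUE. The column positions assume the usual ten-column `smartctl -A`
layout, and the function name is made up.

```python
# Pre-fail rows copied from the quoted report, one attribute per line.
REPORT = """\
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
173 Wear_Leveling_Count     0x0033   098   098   010    Pre-fail  Always       -       66
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
"""

def headroom(report):
    """Map attribute name -> (VALUE - THRESH) for each attribute row."""
    margins = {}
    for line in report.splitlines():
        cols = line.split()
        if len(cols) < 10 or not cols[0].isdigit():
            continue  # skip headers, blank lines, and anything malformed
        name, value, thresh = cols[1], int(cols[3]), int(cols[5])
        margins[name] = value - thresh
    return margins

# Smallest margin first: those are the attributes closest to their threshold.
for name, margin in sorted(headroom(REPORT).items(), key=lambda item: item[1]):
    print(f"{name:<24} {margin:>3} points above threshold")
```

The smallest margin in that list is 50 points, which is a long way from a warning.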

The Power_On_Hours and Power_Cycle_Count figures show that the drive has probably been
in use in a laptop (with typical sleep/wake/powercycle frequency) for a couple of years,
but that's all I see.
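
The back-of-the-envelope arithmetic behind that guess, as a Python sketch. The two
raw values come from your report; the ten-hours-a-day figure is purely my assumption
about how a laptop gets used.

```python
power_on_hours = 7294   # raw value of attribute 9 (Power_On_Hours) in the report
power_cycles = 2511     # raw value of attribute 12 (Power_Cycle_Count) in the report

# Average length of one powered-on session.
hours_per_cycle = power_on_hours / power_cycles

# How long the drive would have been running if it had never been switched off.
days_if_always_on = power_on_hours / 24

# Hypothetical usage pattern, not something the report tells us.
ASSUMED_HOURS_PER_DAY = 10
years_at_assumed_usage = power_on_hours / ASSUMED_HOURS_PER_DAY / 365

print(f"{hours_per_cycle:.1f} hours per power cycle")          # 2.9
print(f"{days_if_always_on:.0f} days of continuous running")   # 304
print(f"{years_at_assumed_usage:.1f} years at {ASSUMED_HOURS_PER_DAY} h/day")  # 2.0
```

Sessions of about three hours each are what you'd expect from a laptop that gets suspended or shut down a lot.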

If you haven't taken a backup recently..., you should do _that_... just because... backups.

It's been a while since I researched `SSD failure modes', but my recollection is
that `suddenly, completely, and without a lot of warning' was pretty typical,
as opposed to the old spinning-platter disk drives, for which `first they get hot and noisy'
and `you lose a few sectors first and can then recover the rest' were more normal
(someone who's more up-to-date on this than me, please jump in!). So, yeah: backups.

And if it's a couple years old, it might be out of its warranty period--
so consider whether that bothers you, I guess?


> 
> /snip
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Crucial/Micron RealSSD m4/C400/P400
> Device Model:     M4-CT256M4SSD2
> Serial Number:    000000001247091DC2FF
> LU WWN Device Id: 5 00a075 1091dc2ff
> Firmware Version: 040H
> User Capacity:    256,060,514,304 bytes [256 GB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    Solid State Device
> Form Factor:      2.5 inches
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Wed Dec 30 13:49:17 2020 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> /snip
> 
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       7294
>  12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       2511
> 170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
> 171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       0
> 172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
> 173 Wear_Leveling_Count     0x0033   098   098   010    Pre-fail  Always       -       66
> 174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       87
> 181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       10250 5047 5203
> 183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
> 184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
> 189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       81
> 194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
> 195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
> 196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
> 202 Perc_Rated_Life_Used    0x0018   098   098   001    Old_age   Offline      -       2
> 206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       0


-- 
Connect with me on the GNU social network: <https://status.hackerposse.com/rozzin>
Not on the network? Ask me for an invitation to a social hub!


More information about the gnhlug-discuss mailing list