As MLC-based SSD cards rise in popularity, there is also a rising concern about how long they can survive. As we know, an MLC NAND module can handle 5,000-10,000 erase cycles, after which it becomes unusable. So obviously an SSD card based on MLC NAND has a limited lifetime. There are a lot of misconceptions and misunderstandings about how long such a card can last, so I want to show some calculations to shed light on this question.

As a baseline I will take the Virident FlashMAX M1400 (1.4TB) card. Virident guarantees 15PB (PB as in petabytes) of writes on this card.
15PB sounds impressive, but how many years does it correspond to? Of course it depends on your workload, mainly on how write-intensive it is. But there are some facts that can help you estimate.

On Linux you can look into the /proc/diskstats file, which shows something like:

where 8492649856 is the number of sectors written since the reboot (a sector is 512 bytes).
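The raw output line did not survive formatting here, so as a minimal sketch, this is how you can pull the sectors-written field (the 10th per-device field, always in 512-byte sectors) out of a /proc/diskstats line. Only the 8492649856 value comes from the post; the device name and all other fields in the sample line are made up for illustration.

```python
# Minimal sketch: extract bytes written from a /proc/diskstats line.
# Fields per device: major, minor, name, reads, reads merged, sectors read,
# ms reading, writes, writes merged, sectors written, ms writing, ...
SECTOR_SIZE = 512  # /proc/diskstats always counts 512-byte sectors

def bytes_written(diskstats_line: str) -> int:
    fields = diskstats_line.split()
    sectors_written = int(fields[9])  # 10th field: sectors written
    return sectors_written * SECTOR_SIZE

# Illustrative line; only the sectors-written value is from the post.
sample = "252 0 vgca0 1200000 300 96000000 450000 9000000 1500 8492649856 780000 0 600000 1230000"
print(bytes_written(sample))  # 4348236726272 bytes written
```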

Now you might say that we can sample /proc/diskstats at 1-hour intervals, see how many bytes per hour we write, and calculate the potential lifetime that way.
That is only partially correct. There is a factor called Write Amplification, which is very well described on Wikipedia, but basically SSD cards, due to their internal organization, write more data than comes from the application.
Usually the write amplification is equal or very close to 1 (meaning there is no overhead) for sequential writes, and it reaches its maximum for fully random writes. That value can be 2-5 or more, and depends on many factors such as the used capacity and the space reserved for over-provisioning.

Basically it means you should look into the card's own statistics to get the exact number of bytes written.
For Virident FlashMAX, this is reported by the vgc-monitor tool.

Having this info, let's take a look at what lifetime we can expect under a tpcc-mysql workload.
I ran 32 user threads against a 5000W dataset (about 500GB of data on disk) for 1 hour.

After 1 hour, /proc/diskstats shows 984,442,441,728 bytes written, which is 984.44GB, while the Virident stats show 1,125,653,692,416 bytes written, which is 1,125.65GB.
This allows us to calculate the write amplification factor, which in our case is
1,125,653,692,416 / 984,442,441,728 = 1.143. That looks very decent, but remember we are using only 500GB out of 1400GB, and the factor will grow as we fill more space.
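The write amplification arithmetic above is just the device-level bytes divided by the host-level bytes; both measurements are the ones quoted in the post:

```python
# Write amplification = bytes the card physically wrote / bytes the host sent.
host_bytes   = 984_442_441_728     # from /proc/diskstats
device_bytes = 1_125_653_692_416   # from the Virident card statistics

write_amplification = device_bytes / host_bytes
print(round(write_amplification, 3))  # 1.143
```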

Please note we put quite an intensive write load on the card during this hour.
MySQL handled 25,000 updates/sec, 20,000 inserts/sec and 1,500 deletes/sec, which corresponds to
a write throughput of 273.45MB/sec from MySQL to disk.
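For the record, the throughput figure follows from the /proc/diskstats total over the hour; the 273.45MB/sec in the text works out if 1 MB is taken as 10^6 bytes, which is the assumption in this sketch:

```python
# Host-side write throughput over the 1-hour run, from the
# /proc/diskstats total quoted in the post (decimal MB = 10**6 bytes).
host_bytes = 984_442_441_728
seconds = 3600

mb_per_sec = host_bytes / seconds / 1_000_000
print(round(mb_per_sec, 2))  # 273.46 (the post truncates to 273.45)
```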

This lets us calculate the lifetime of the card if we ran such a workload 24/7 non-stop.
15PB (of total writes) / 1125.65GB (per hour) = 13,325.634 hours = 555.23 days = 1.52 years
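The same arithmetic, reproduced in decimal units (15PB taken as 15,000,000 GB, matching how the post's numbers work out):

```python
# Lifetime estimate: warranted writes divided by the measured
# device-level write rate, both in decimal gigabytes.
warranted_gb = 15_000_000   # 15PB of warranted writes
gb_per_hour  = 1125.65      # Virident stats for the 1-hour tpcc-mysql run

hours = warranted_gb / gb_per_hour
days  = hours / 24
years = days / 365

print(round(hours, 3), round(days, 2), round(years, 2))  # 13325.634 555.23 1.52
```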

That is, under a non-stop tpcc-mysql workload we can expect the card to last 1.52 years. However, in real production you do not have a uniform load every hour, so you may base your estimate on daily or weekly stats.

Unfortunately there is no easy way to predict this number until you run your workload on the SSD.
You can look into /proc/diskstats, but:
1. There is a write amplification factor, which you do not know in advance.
2. Throughput on a regular RAID is much lower than on an SSD, so you do not know what your throughput will be once you move the workload to the SSD.


8 Comments
Jake

I'd like to know what happens when the SSD starts failing due to hitting these limits. Does the whole thing stop working catastrophically, or something else? How does MySQL perform on a dying SSD?

Edmar

Absolutely great post, something I’ve been wondering about for some time. Thanks!

Do you have any idea how the end-of-life of such a card would manifest itself? A hard instantaneous failure, or maybe progressively severe performance degradation as less and less hidden reserve capacity (reserved MLC modules) is available?

Is there a way to measure/monitor failed MLC module count from vgc-monitor -d output?

Kep

Intel has the Solid-State Drive Toolbox, which shows drive health and estimated drive life remaining.
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=18455

Shirish Jamthe

Hello Jake, Edmar

Let me respond to your questions in two parts.
First let me introduce you to the output of vgc-monitor that Vadim is looking at for lifetime.
In my second comment I will describe how Virident defines ‘End of Life’ for Virident’s FlashMAX cards.

As you can see from the output below, we not only tell you the bytes written but also the remaining life as a percentage, so you can monitor it closely.

Shirish Jamthe

Hello Jake, Edmar

In my above post the formatting didn’t come through. But the key things you may want to focus on are
1. The drive status, which shows ‘GOOD’ here.
2. RAID status, it shows ‘enabled’ here.
3. Remaining Life : 99.60%
4. Writes as mentioned by Vadim.

As you may already know, all SSDs come with over-provisioning. As blocks start to go bad they are replaced from the reserve.
Virident has defined 'End of Life' as a threshold or watermark for the bad block count (or reserve capacity) such that the card's performance does not degrade from what it was spec'd at.

So it is definitely possible to use the card after the EOL of warrantied writes, especially if the application is read-mostly.
We have another watermark or threshold beyond which the card will not be able to sustain writes. At that point the card will go into 'READ ONLY' mode, as reflected by the card status, and you will still be able to recover data from it.

So in summary, there is no catastrophic failure after you have written beyond the warrantied writes. You will get a warning from the life-left reading in the monitoring tool, as well as the ability to continue for a while, so you can manage a timely replacement or migration.

I am happy to chat more on this if you have further questions. Again, I have described behavior specific to Virident SSDs, not generic SSDs.

-Shirish

Vojtech Kurka

Vadim: Yes, but you can watch the drive’s health using smartctl.

/usr/sbin/smartctl -a /dev/sdd

232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       –       0
233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       –       0

Attribute 225 (E1) = total host writes; multiply the raw value by 32MB.
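As a tiny sketch of the conversion Vojtech describes (the raw value here is hypothetical, and I am assuming 32MB means binary megabytes):

```python
# Convert the Intel SMART attribute 225 (E1) raw value to bytes:
# each raw unit represents 32MB of host writes, per the note above.
UNIT = 32 * 1024**2   # 32MB per raw unit (assumed binary megabytes)

raw_value = 100_000   # hypothetical E1 raw reading, not from the post
host_writes_bytes = raw_value * UNIT
print(host_writes_bytes)  # 3355443200000
```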

George

I suppose you could use pt-diskstats https://www.percona.com/doc/percona-toolkit/pt-diskstats.html in place of /proc/diskstats?