HDD Operational Vs Theoretical MTBF
Corporate Reliability Engineering
August 1996
Overview
The specified Mean Time Between Failure
(MTBF) ratings of hard disk drives have risen dramatically over
the past few years and many OEMs and others who purchase these
drives have now reached the point where they question the published
numbers. Today, it is common to see MTBF specifications for high-performance
drives at 800,000 hours and even one million hours.
The purpose of this paper is to detail
an approach that Quantum Corporate Reliability Engineering is
now taking to address this issue. In addition to the specified
Theoretical MTBF, an Operational MTBF, which more closely represents
how the drive will actually perform in the real world, can be
provided for each of Quantum's hard disk drive products.
Theoretical MTBF
To estimate the reliability
of new products during the early phases of design and development,
mathematical models of their reliability characteristics are built
from empirical field data. The results derived
from these models are the Theoretical (or predicted) MTBF. Once
volume production begins and a significant number of run-hours
has accumulated, the resultant field data is used to validate
the model.
The failure rates for common Printed
Circuit Board Assembly (PCBA) and Head-Disk Assembly (HDA) components
are established based upon actual field-returned HDA and PCBA
repairs from the same product family. This database represents
literally billions of HDD operating hours. At Quantum, it is
derived from a population of one million drives running 71-100
percent power-on-hours in workstation or server environments over
a period of 12 to 18 months.
The detailed repair data for the above
period is processed to eliminate defects that appear to be the
result of integration damage or mishandling. These include: dings
or scratches on the media, scratched or chipped heads, cut flex
cables and other defective parts listed as showing physical damage.
The resultant data is used to establish the Drive-, PCBA- and
HDA-level failure rates. Quantum compares this field assessment
data to its OEM-returned drive Failure Analysis Summary Report
for further weighted MTBF analysis and for consideration of other
related failure factors.
The Theoretical or predicted MTBF typically
published in product specifications cannot accurately take into
account unexpected design faults, firmware bugs or manufacturing-induced
failures. Likewise, mishandling by the customer during installation,
shipping damage and the ever-elusive "No Trouble Found"
category of failures are not included in the Theoretical MTBF
calculation.
The predicted or theoretical MTBF stated
on a Quantum data sheet represents a drive's steady state
failure rate. A stated MTBF of 800,000 hours, for example,
means that across a given large population of drives the average
operating time accumulated per failure will be 800,000 hours. MTBF
(or MTTF) is not a means of predicting the life of an individual
drive or a small group of drives. To achieve this number, a drive
would run until it reaches its end-of-life period or fails, and would
then be replaced by a new drive of similar reliability, and so on.
In this case it is theoretically possible that an average of 800,000
hours would elapse between failures.
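
As a rough illustration of what a population-level MTBF implies, the sketch below converts an MTBF into an expected number of failures per year. It assumes a constant (steady state) failure rate and continuous operation; the function name, the population of 1,000 drives and the duty-cycle parameter are illustrative assumptions, not figures from Quantum.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days, i.e. 100 percent power-on hours


def expected_failures_per_year(mtbf_hours, population, duty_cycle=1.0):
    """Expected failures per year in a population of drives.

    Assumes a constant failure rate of 1 / MTBF per operating hour, which
    only holds for the steady state portion of a drive's life.
    """
    run_hours = population * HOURS_PER_YEAR * duty_cycle
    return run_hours / mtbf_hours


# Hypothetical population: 1,000 drives powered on all year.
failures = expected_failures_per_year(800_000, population=1_000)
print(f"{failures:.2f} expected failures per year per 1,000 drives")
```

Under these assumptions an 800,000-hour MTBF corresponds to about 11 failures per year in every 1,000 drives, which is a statement about the population and not about how long any single drive will last.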
Limitations of Theoretical MTBF Prediction Modeling
The Part Count/Part Stress models and
techniques used to provide data to these theoretical models are
estimates and thus have their limitations. The models:
- Cannot predict device, drive, firmware or equipment design errors.
- Cannot predict unanticipated defects induced during manufacturing.
- Are limited to a mathematically modeled relationship within the basic laws of physics.
- Cannot predict the "Humanware" or "Human" element.
- Cannot predict the environment where the drive/system may be placed (e.g., humidity or temperature outside of specification).
Many of these limitations may be characterized
as design, component or process immaturity or infancy.
Since these represent the early portion of the reliability growth
curve, they also represent an unfortunate characteristic of the
industry that inhibits a drive from meeting its MTBF goal: the
relatively short product life span.
The MTBF growth curve depends heavily
upon an installed base and accumulated run hours to meet its goals.
With today's high MTBF drives and short product life cycle, the
growth curve does not begin to level out (achieve steady state)
until it reaches a point where it nears the end of its production
life.
Operational MTBF
Technically, a failure for Operational MTBF purposes
is defined as a system-level failure caused by a customer-perceived
drive fault that requires a replacement. At Quantum, the Operational
MTBF is calculated by counting all field-returned drives as
failures, excluding handling damage, upgrades and drives returned
for credit. Across different system platforms and applications,
the Operational MTBF is usually considerably lower than the Theoretical
MTBF, especially during the early production phase.
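
The counting rule described above can be sketched as a simple filter over return records. The reason codes below are hypothetical labels chosen for illustration; the paper does not list the codes Quantum actually uses.

```python
# Hypothetical reason codes for returns that are NOT counted as failures:
# handling damage, upgrades and drives returned for credit.
EXCLUDED_REASONS = {"handling_damage", "upgrade", "credit_return"}


def count_operational_failures(return_reasons):
    """Count field returns that are treated as Operational MTBF failures."""
    return sum(1 for reason in return_reasons if reason not in EXCLUDED_REASONS)


# Illustrative batch of returns; "ntf" (No Trouble Found) is still counted.
returns = ["mechanics", "ntf", "handling_damage", "firmware",
           "upgrade", "credit_return", "mfg_process"]
print(count_operational_failures(returns))
```

Note that a No Trouble Found return remains in the count, since it is still a customer-perceived fault that led to a replacement.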
Because many customers purchase these
early units, the Operational MTBF is a means of calculating what
the actual reliability performance of the drive may be. It differs
primarily from the Theoretical MTBF in that it includes many of
the failure mechanisms that are not included in the original theoretical
calculation, as shown in Table 1.
Failure Mode | Theoretical MTBF | |
Handling Damage | Not Included | |
NTF Returns | Not Included | |
Infancy Failures (of any type) | Not Included | |
Mfg. Process-induced Failures | Not Included | |
Mfg. Test Process Escapes | Included | |
Random Steady State Failures | Included | |
Other Failures * | Partially Weighted |
Table 1
One final note on the operational calculation
is that this is a dynamic value and will change over time as various
defects are analyzed to root cause and corrected. This contrasts
with the theoretical number that represents a steady state failure
rate and remains constant over the useful life of the drive.
Example
In the following example, a recent high-end Quantum disk drive is used to compare the Theoretical MTBF (Table 2) with an estimated Operational MTBF (Table 3). The specified MTBF for this drive was 800,000 hours.
Component | Failure Rate (FITs) |
Mechanics | 585 |
Electronics | 486 |
Flex Assembly | 30 |
Mfg. Test Process Escapes | 300 |
Firmware | - |
Handling | - |
Total Failure Rate | 1401 |
Theoretical MTBF | 713,000 hours |
Table 2 -- Theoretical MTBF
* Other Failures (Table 1) = unknown failures and failures that are not included in the other Table 1 categories.
Component | Failure Rate (FITs) |
Mechanics | 1440 |
Electronics | 47 |
Flex Assembly | 24 |
Mfg. Process | 2030 |
Firmware | 432 |
Handling | - |
NTF Returns | 1346 |
Total Failure Rate | 5319 |
Operational MTBF | 188,005 hours |
Table 3 -- Operational MTBF
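
The totals in the Theoretical and Operational tables above can be checked with a few lines of arithmetic. The sketch relies on the standard definition of a FIT as one failure per 10^9 device-hours, so MTBF in hours is 10^9 divided by the total FIT rate; the function and dictionary names are illustrative.

```python
FIT_HOURS = 1_000_000_000  # one FIT = one failure per 10^9 device-hours


def mtbf_from_fits(fits_by_component):
    """Return (total FITs, MTBF in hours) for a table of component FIT rates."""
    total_fits = sum(fits_by_component.values())
    return total_fits, FIT_HOURS / total_fits


# Component rates copied from the two tables; "-" entries are omitted.
theoretical = {"Mechanics": 585, "Electronics": 486, "Flex Assembly": 30,
               "Mfg. Test Process Escapes": 300}
operational = {"Mechanics": 1440, "Electronics": 47, "Flex Assembly": 24,
               "Mfg. Process": 2030, "Firmware": 432, "NTF Returns": 1346}

for label, table in (("Theoretical", theoretical), ("Operational", operational)):
    total, mtbf = mtbf_from_fits(table)
    print(f"{label}: {total} FITs, MTBF of about {mtbf:,.0f} hours")
```

The component rates sum to 1401 and 5319 FITs, giving roughly 713,776 and 188,005 hours. The theoretical table states the first figure as 713,000 hours.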
The Operational MTBF was calculated
after this product had been in mass production for approximately
10 months and represented an installed base of approximately 68,000
units. The data for this calculation is based upon detailed failure
analysis of approximately 600 drives.
Annualized Field Return Rate
An Operational MTBF by itself is not
sufficient to provide the customer with all the information necessary
to adequately plan for maintenance costs, spares and other expenses.
The Annualized Field Return Rate (AFRR) is another metric that can
help to provide the customer with an overall picture of a drive's
reliability in the field. The intent of this metric is to capture
all field return information for trend analysis and repair supply
planning purposes. DOA (dead-on-arrival) units and integration
fallout at the OEM site are excluded from the AFRR; they are addressed
in our Defects per Million program as a quality issue.
AFRR is calculated by using a moving
average of the previous three months' field returns (Rn) vs. the installed
base (Mn) during that time. Note that AFRR is calculated only when
there are sufficient shipments and return data to provide a meaningful
value.
Some adjustment is made to the Return value to account for drives classified as No Trouble Found (NTF) and for those with handling damage. On certain products, the Service and Reliability groups have estimated that these two categories may constitute as much as 40 percent of all returned drives.
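
The AFRR is described above only in words, so the sketch below is one plausible reading and not Quantum's published formula. It assumes that the three-month return rate is the sum of monthly returns divided by the sum of monthly installed-base figures, that annualizing means multiplying by 12, and that the NTF and handling adjustment is applied by discounting returns by a fixed fraction. All input numbers are hypothetical.

```python
def afrr_percent(returns, installed_base, ntf_handling_fraction=0.0):
    """Annualized field return rate, in percent, over a three-month window.

    returns:        field returns for the previous three months
    installed_base: installed base for the same three months
    ntf_handling_fraction: assumed share of returns to discount as
        No Trouble Found or handling damage (0.0 means no adjustment)
    """
    if len(returns) != 3 or len(installed_base) != 3:
        raise ValueError("expected exactly three months of data")
    adjusted = [r * (1.0 - ntf_handling_fraction) for r in returns]
    monthly_rate = sum(adjusted) / sum(installed_base)
    return monthly_rate * 12 * 100


# Hypothetical three-month window.
monthly_returns = [40, 50, 60]
monthly_base = [50_000, 60_000, 70_000]
print(f"{afrr_percent(monthly_returns, monthly_base):.2f} percent, unadjusted")
print(f"{afrr_percent(monthly_returns, monthly_base, 0.4):.2f} percent, "
      "with 40 percent of returns discounted")
```

Sliding the window forward one month at a time gives the moving average, and the trend in that series matters more than any single value.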
Run Time Accumulative Models
Metrics such as MTBF and Annualized
Failure Rates (AFR) are somewhat static in that they show an estimate
for only a point in time or over a fixed period. As such, they
may not adequately show incremental improvements in product reliability.
A run time model may provide more meaningful data, especially
in high-end drives where the duty cycle is at or close to 100
percent.
Rather than calculate a failure rate
based upon a population of drives, the number of failures is compared
to the total accumulated run-hours of the installed base. The
main advantage of this model is that it can be updated continuously
as hours accumulate and provides a better overview of product performance
than would otherwise be available with other models. In
this way it is possible to analyze the reliability of a drive
over longer periods of time and even beyond the point at which the product
is no longer shipped.
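
A run time accumulative model of this kind can be sketched as a running tally of run-hours and failures. The class name, the monthly update interval and the sample figures below are assumptions made for illustration only.

```python
class RunTimeModel:
    """Running tally of installed-base run-hours and observed failures."""

    def __init__(self):
        self.run_hours = 0.0
        self.failures = 0

    def update(self, drives_in_service, hours, failures, duty_cycle=1.0):
        """Add one reporting period to the cumulative totals."""
        self.run_hours += drives_in_service * hours * duty_cycle
        self.failures += failures

    def mtbf_hours(self):
        """Cumulative run-hours per failure (infinite if no failures yet)."""
        if self.failures == 0:
            return float("inf")
        return self.run_hours / self.failures


# Hypothetical data: two 720-hour months with a growing installed base.
model = RunTimeModel()
model.update(drives_in_service=10_000, hours=720, failures=30)
model.update(drives_in_service=20_000, hours=720, failures=42)
print(f"{model.run_hours:,.0f} run-hours, {model.failures} failures, "
      f"MTBF of about {model.mtbf_hours():,.0f} hours")
```

Because the totals only ever grow, the estimate can keep being refined after shipments stop, as long as failures and run-hours continue to be reported.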
Conclusion
Because there is currently no industry
standard for the methods used to determine HDD reliability, the
predicted or theoretical MTBF that forms the basis for these predictions
may differ from vendor to vendor as their assumptions vary. These
predictions have been used primarily as marketing tools and have
steadily moved farther from the reality of how an individual drive
will actually perform. Moreover, the rapidity at which new products
are introduced and the shortened production cycle of current products
make it highly unlikely that these values will ever be realized.
The Operational MTBF is a more realistic
number but still requires a substantial installed base in order
to provide an accurate value. However, given the extensive and
detailed data base of HDD products that Quantum has, it is now
possible to project an Operational MTBF for products under development
using this historical data.
Quantum is currently working with IDEMA
to develop a standard that will be more satisfactory to its customers.
The company's Corporate Reliability Engineering group is actively involved
in IDEMA's committee on reliability metrics and methodology standardization.
Until a standard is established, however, Quantum will continue
to use both the Operational and Theoretical MTBF as a measure
of drive reliability in the field.