
HDD Operational Vs Theoretical MTBF





Corporate Reliability Engineering

August 1996

Overview

The specified Mean Time Between Failure (MTBF) ratings of hard disk drives have risen dramatically over the past few years, and many OEMs and other purchasers of these drives have now reached the point where they question the published numbers. Today it is common to see MTBF specifications of 800,000 hours, and even one million hours, for high-performance drives.

The purpose of this paper is to detail an approach that Quantum Corporate Reliability Engineering is now taking to address this issue. In addition to the specified Theoretical MTBF, an Operational MTBF, which more closely represents how the drive will actually perform in the real world, can be provided for each of Quantum's hard disk drive products.

Theoretical MTBF

In order to estimate the reliability of new products during the early phases of design and development, mathematical models are built from empirical field data to estimate their reliability characteristics. The results derived from these models are the Theoretical (or predicted) MTBF. Once volume production begins and a significant number of run-hours has accumulated, the resulting field data is used to validate the model.

The failure rates for common Printed Circuit Board Assembly (PCBA) and Head-Disk Assembly (HDA) components are established from actual field-returned HDA and PCBA repairs within the same product family. This database represents literally billions of HDD operating hours: at Quantum, it is derived from a population of one million drives running 71-100 percent power-on hours in workstation or server environments over a period of 12 to 18 months.

The detailed repair data for the above period is processed to eliminate defects that appear to be the result of integration damage or mishandling. These include dings or scratches on the media, scratched or chipped heads, cut flex cables and other defective parts showing physical damage. The resulting data is used to establish the Drive-, PCBA- and HDA-level failure rates. Quantum compares this field assessment data to its OEM-returned drive Failure Analysis Summary Report for further weighted MTBF analysis and for consideration of other related failure factors.

The Theoretical or predicted MTBF typically published in product specifications cannot accurately account for unexpected design faults, firmware bugs or manufacturing-induced failures. Likewise, mishandling by the customer during installation, shipping damage and the ever-elusive "No Trouble Found" category of failures are not included in the Theoretical MTBF calculation.

The predicted or theoretical MTBF stated on a Quantum data sheet represents a drive's steady-state failure rate. A stated MTBF of 800,000 hours, for example, means that in a given large population of drives the average time between failures will be 800,000 hours. MTBF (or MTTF) is not a means of predicting the life of an individual drive or a small group of drives. To achieve this number, a drive would run until it reached its end-of-life period or failed, and would then be replaced by a new drive of similar reliability, and so on. Under these conditions it is theoretically possible that, across a large population, an average of 800,000 hours would elapse before a failure occurred.
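To illustrate what a steady-state MTBF implies for a single drive, the figure can be converted into an annual failure probability under the usual constant-failure-rate (exponential) assumption. The sketch below is illustrative and is not taken from the paper:

```python
import math

def annual_failure_rate(mtbf_hours, hours_per_year=8760):
    """Probability that one drive fails within a year of continuous
    operation, assuming a constant failure rate (exponential model) --
    the standard reading of a steady-state MTBF."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)

# For the 800,000-hour MTBF above, a drive running 24x7 has roughly
# a 1.1 percent chance of failing in any given year.
print(round(annual_failure_rate(800_000) * 100, 2))  # -> 1.09
```

This is why a high MTBF describes population behavior rather than an individual drive's service life: across 1,000 such drives, roughly 11 failures per year would be expected.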

Limitations of Theoretical MTBF Prediction Modeling

The Part Count/Part Stress models and techniques used to feed these theoretical predictions are estimates and thus have their limitations. The models:

  • Cannot predict device, drive, firmware or equipment design errors.
  • Cannot predict unanticipated defects induced during manufacturing.
  • Are limited to mathematically modeled relationships within the basic laws of physics.
  • Cannot predict the "humanware," or human, element.
  • Cannot predict the environment in which the drive or system may be placed (e.g. humidity or temperature outside of specification).
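The parts-count style of prediction these models rely on can be sketched as a quantity-weighted sum of per-component base failure rates in FITs (failures per 10^9 device-hours). The component classes and rates below are invented for illustration and are not Quantum's figures:

```python
# Generic parts-count prediction in the style of MIL-HDBK-217 or
# Bellcore TR-332: system failure rate = sum over part classes of
# (quantity x base failure rate). All values here are hypothetical.
PARTS = {
    # part class:        (quantity, base failure rate in FITs)
    "microcontroller":   (1, 50.0),
    "dram":              (2, 20.0),
    "ceramic_capacitor": (40, 0.5),
    "connector":         (3, 5.0),
}

def parts_count_fits(parts):
    """Sum quantity x failure-rate over all part classes (FITs)."""
    return sum(qty * fits for qty, fits in parts.values())

total = parts_count_fits(PARTS)
print(total, "FITs ->", round(1e9 / total), "hours MTBF")  # 125 FITs -> 8,000,000 hours
```

Note what such a sum can never contain: there is no term for a firmware bug, a mishandled drive or an out-of-spec operating environment, which is exactly the gap the limitations above describe.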

Many of these limitations may be characterized as design, component or process immaturity or infancy. Since these represent the early portion of the reliability growth curve, they also highlight an unfortunate characteristic of the industry that inhibits a drive from meeting its MTBF goal: the relatively short product life span.

The MTBF growth curve depends heavily upon an installed base and accumulated run-hours to meet its goals. With today's high-MTBF drives and short product life cycles, the growth curve does not begin to level out (achieve steady state) until the drive nears the end of its production life.

Operational MTBF

Technically, an operational failure is defined as a system-level failure, caused by a customer-perceived drive fault, that requires replacement of the drive. At Quantum, the Operational MTBF is calculated by counting all field-returned drives as failures, excluding handling damage, upgrades and drives returned for credit. Across different system platforms and applications, the Operational MTBF is usually considerably lower than the Theoretical MTBF, especially during the early production phase.

Because many customers purchase these early units, the Operational MTBF is a means of calculating what the actual reliability performance of the drive may be. It differs from the Theoretical MTBF primarily in that it includes many failure mechanisms that are not part of the original theoretical calculation, as shown in Table 1.


The actual parameters used in the two calculations are summarized in Table 1:

Failure Mode                     Operational MTBF    Theoretical MTBF
Handling Damage                  Not Included        Not Included
NTF Returns                      Included            Not Included
Infancy Failures (of any type)   Included            Not Included
Mfg. Process-induced Failures    Included            Not Included
Mfg. Test Process Escapes        Included            Included
Random Steady State Failures     Included            Included
Other Failures *                 Included            Partially Weighted

Table 1

One final note on the operational calculation: it is a dynamic value and will change over time as various defects are analyzed to root cause and corrected. This contrasts with the theoretical number, which represents a steady-state failure rate and remains constant over the useful life of the drive.

Example

In the following example, a recent high-end Quantum disk drive is used to compare the Theoretical MTBF (Table 2) with an estimated Operational MTBF (Table 3). The specified MTBF for this drive was 800,000 hours.

Component                    Failure Rate
Mechanics                    585 FITs
Electronics                  486
Flex Assembly                30
Mfg. Test Process Escapes    300
Firmware                     -
Handling                     -
Total Failure Rate           1401 FITs
Theoretical MTBF             713,000 hours

Table 2 -- Theoretical MTBF

* Other failures (Table 1) = unknown failures and failures not included in the above categories.

Component                    Failure Rate
Mechanics                    1440 FITs
Electronics                  47
Flex Assembly                24
Mfg. Process                 2030
Firmware                     432
Handling                     -
NTF Returns                  1346
Total Failure Rate           5319 FITs
Operational MTBF             188,005 hours

Table 3 -- Operational MTBF

The Operational MTBF was calculated after this product had been in mass production for approximately 10 months and represented an installed base of approximately 68,000 units. The data for this calculation is based upon detailed failure analysis of approximately 600 drives.
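The arithmetic behind both tables is a simple sum: component failure rates in FITs (failures per 10^9 device-hours) are added, and the MTBF is the reciprocal of the total. A small sketch using the FIT budgets from the two tables above:

```python
def mtbf_from_fits(fit_rates):
    """Convert component failure rates in FITs (failures per
    10^9 device-hours) to an aggregate MTBF in hours."""
    return 1e9 / sum(fit_rates)

# Theoretical budget: mechanics, electronics, flex assembly,
# manufacturing test process escapes
print(round(mtbf_from_fits([585, 486, 30, 300])))             # -> 713776

# Operational estimate: mechanics, electronics, flex assembly,
# mfg. process, firmware, NTF returns
print(round(mtbf_from_fits([1440, 47, 24, 2030, 432, 1346]))) # -> 188005
```

The theoretical result rounds to the 713,000-hour figure quoted above; the operational total, dominated by manufacturing-process failures and NTF returns that the theoretical model omits, comes out nearly four times lower.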


Annualized Field Return Rate

An Operational MTBF by itself does not give the customer all the information necessary to adequately plan for maintenance costs, spares and other expenses. The Annualized Field Return Rate (AFRR) is another metric that helps provide the customer with an overall picture of a drive's reliability in the field. The intent of this metric is to capture all field return information for trend analysis and repair supply planning purposes. Dead-on-arrival (DOA) units and integration fallout at the OEM site are excluded from the AFRR; these are addressed as quality issues in our Defects per Million program.

AFRR is calculated using a moving average of the previous three months' returns versus the installed base during that time. Note that AFRR is calculated only when there are sufficient shipment and return data to provide a meaningful value:

AFRR = (R1 + R2 + R3) / (M1 + M2 + M3) x 4 x 100%

where Rn = field returns and Mn = installed base for each of the three months.

Some adjustment is made to the Return value to account for drives classified as No Trouble Found (NTF) and for those with handling damage. On certain products, the Service and Reliability groups have estimated that this may constitute as much as 40 percent of all returned drives.
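One plausible reading of the AFRR calculation described above, with the NTF/handling adjustment applied as a simple fractional discount; the exact adjustment and annualization Quantum uses are not spelled out in the paper, so both are assumptions of this sketch:

```python
def afrr(returns, installed, ntf_fraction=0.0):
    """Annualized Field Return Rate over a three-month window.

    returns      -- field returns for each of the last three months (Rn)
    installed    -- installed base in each of those months (Mn)
    ntf_fraction -- assumed share of returns that are No Trouble Found
                    or handling damage (the paper notes this can reach
                    40 percent on some products)
    """
    adjusted_returns = sum(returns) * (1.0 - ntf_fraction)
    quarterly_rate = adjusted_returns / sum(installed)
    return quarterly_rate * 4 * 100  # annualize, express as a percent
```

For example, 330 returns against a roughly 193,000-drive installed base over a quarter gives an AFRR of about 0.68 percent, dropping to about 0.41 percent once a 40 percent NTF share is discounted.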

Run Time Accumulative Models

Metrics such as MTBF and Annualized Failure Rates (AFR) are somewhat static in that they show an estimate for only a point in time or over a fixed period. As such, they may not adequately show incremental improvements in product reliability. A run time model may provide more meaningful data especially in high-end drives where the duty cycle is at or close to 100 percent.

Rather than calculate a failure rate based upon a population of drives, the number of failures is compared to the total accumulated run-hours of the installed base. The main advantage of this model is that it can be updated continuously as hours accumulate, giving a better overview of product performance than point-in-time metrics can. In this way it is possible to analyze the reliability of a drive over long periods, even beyond the point where the product is no longer shipped.
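A run-time accumulative model of the kind described here can be kept as a pair of running totals; the class below is a minimal sketch (the names are ours, not Quantum's):

```python
class RunTimeModel:
    """Cumulative run-hour reliability tracker: the MTBF estimate is
    total accumulated drive-hours divided by observed failures, and
    it can be refreshed continuously as the installed base ages."""

    def __init__(self):
        self.hours = 0.0
        self.failures = 0

    def record(self, drive_hours, failures=0):
        """Fold one reporting period's hours and failures into the totals."""
        self.hours += drive_hours
        self.failures += failures

    def mtbf(self):
        return self.hours / self.failures if self.failures else float("inf")
```

Because the totals simply accumulate, the estimate stays meaningful after shipments stop, which is the advantage claimed over population-based snapshots.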

Conclusion

Because there is currently no industry standard for the methods used to determine HDD reliability, the predicted or theoretical MTBF that forms the basis for these predictions may differ from vendor to vendor as their assumptions vary. These predictions have been used primarily as marketing tools and have steadily moved farther from the reality of how an individual drive will actually perform. Moreover, the rapid pace at which new products are introduced and the shortened production cycle of current products make it highly unlikely that these values will ever be realized.

The Operational MTBF is a more realistic number but still requires a substantial installed base in order to provide an accurate value. However, given the extensive and detailed database of HDD products that Quantum has, it is now possible to project an Operational MTBF for products under development using this historical data.

Quantum is currently working with IDEMA to develop a standard that will be more satisfactory to its customers. The company's Corporate Reliability Engineering group is actively involved in IDEMA's committee on reliability metrics and methodology standardization. Until a standard is established, however, Quantum will continue to use both the Operational and Theoretical MTBF as measures of drive reliability in the field.


References:

1. Graham, J. & Yang, J., "Hard Disk Drive Reliability", a Quantum Corporation white paper, March 1996

2. MIL-STD-721C & IEC STD 271, "Reliability & Maintainability Terms, Definitions and Related Mathematics", MIL-STD handbook, Jun, 1981

3. Kececioglu, D., "Reliability Engineering Handbook", Volume I & II, PTR Prentice Hall, Englewood Cliffs, NJ, 1991, chapter 1, pp. 1-20.

4. Bellcore, "Reliability Prediction Procedure for Electronic Equipment", Technical Reference TR-NWT-000332, issue 5, Dec. 1995, Bell Communications Research, Livingston, NJ

5. MIL-HDBK-217F, "Reliability Prediction of Electronic Equipment", MIL-STD Handbook, Version F, Notice 1, Jul. 1992.

6. Bellcore, "Field Reliability Performance Study Handbook", Special Report SR-NWT-000821, issue 3, Dec. 1990, Bell Communications Research, Livingston, NJ

7. International Standard IEC 1014, IEC 56 (central Office) 150 "Programmes for Reliability Growth Management, Model", International Electrotechnical Commission, Geneva, 1989.
