
S.M.A.R.T. Advanced Drive Diagnostics


Overview

Is there such a thing as enough reliability? In industries that provide capabilities crucial to the day-to-day productivity and survival of their clients, the answer is clearly no. And while many companies are investing significant resources in raising the reliability of individual components and devices, overall system reliability is not always addressed.

By broadening the scope of reliability enhancements from the device level to the system level, system designers and integrators can leverage the full capabilities of the system, not just its components, to devise more complete, intelligent solutions that benefit overall system reliability.

The S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) System was designed with precisely this approach in mind. By using S.M.A.R.T. technology, virtually any intelligent component or device within a computer can communicate its predicted reliability status to its user and system administrator, providing comprehensive protection that can prevent system downtime, productivity loss, and even the loss of valuable data.

Since user data is not always as easy to replace as hardware devices, most computer users will argue that their data is the most valuable element of their computer system. In recognizing the need to elevate the protection of user data, Quantum has pioneered the implementation of the S.M.A.R.T. System for hard disk drives.

Through the S.M.A.R.T. System, Quantum disk drives incorporate a suite of advanced diagnostics that monitor the internal operations of a drive and provide an early warning for many types of potential problems. If a problem is detected, the drive can be scheduled for replacement before any data is lost. The result: higher productivity and increased security of data.

What is S.M.A.R.T.?

The S.M.A.R.T. system for disk drives, illustrated in Figure 1, is designed to revolutionize overall system and data reliability. The system consists of software that resides both on the disk drive and on the host computer. The disk drive software monitors the internal performance of the motors, media, heads, and electronics of the drive, while the host software monitors the overall reliability status of the drive. The reliability status is determined by analyzing the drive's internal performance levels and comparing them to predetermined threshold limits.

In some configurations the host may play a more active role and direct the drive to perform diagnostic tests and report the results. The host may then make comparisons with previous results, or against values set for that particular drive model, and take appropriate action.

Upon determination of a critical reliability condition where a device failure has been predicted, the host software warns the user of the impending condition and advises the user to take appropriate action to protect the data being stored. In more advanced implementations, the host could notify a network administrator and automatically reduce the workload on the device, relocate key files, and even begin a backup of the data to tape or other disk drives.
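The division of labor described above can be sketched in a few lines. This is only an illustration under stated assumptions: the attribute names, the normalized scale, the "higher is healthier" convention, and the wording of the host response are all hypothetical and are not taken from any S.M.A.R.T. specification or Quantum firmware.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """One monitored measure reported by the drive (hypothetical layout)."""
    name: str
    value: float      # current normalized reading; higher is healthier (assumption)
    threshold: float  # predetermined limit at or below which failure is predicted


def failing_attributes(attributes):
    """Return the names of attributes that have reached their threshold."""
    return [a.name for a in attributes if a.value <= a.threshold]


def host_action(attributes):
    """Map the drive's reliability status to a host-side response."""
    failing = failing_attributes(attributes)
    if not failing:
        return "ok"
    return "warn user and start backup: " + ", ".join(failing)


# Illustrative readings only; not real drive data.
readings = [
    Attribute("spin_up_time", value=92, threshold=60),
    Attribute("seek_error_rate", value=41, threshold=45),
]
print(host_action(readings))
```

A more advanced host could replace the returned string with the actions the text mentions, such as notifying an administrator or reducing the workload on the device.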


Figure 1. The S.M.A.R.T. System

Three functions shared across the host and the drive


The S.M.A.R.T. System and Data Reliability

In recent years, disk drives have gone from being reliable to being extremely reliable. With product quality levels reaching 99.96 percent and actual field failure levels as low as 0.27 percent, Quantum hard disk drives are among the most reliable in the world.

One measure of this increasing reliability is that the MTBF (Mean Time Between Failures) rating for a typical drive has climbed to 300,000 hours, and MTBFs of 800,000 hours or more are often found in high-end drives. By comparison, the MTBF ratings for hard disk drives are ten to thirty times higher than those of typical floppy disk drives and CD-ROM drives.

Yet, despite significant gains in MTBF ratings, system managers and end users have continued to press for improvements in reliability. The increase in MTBF rating alone has not been enough to meet this need for three main reasons:

  • The number of drives per system continues to increase
  • The capacity of the typical drive continues to grow
  • MTBF is a statistical rating that applies to a large population of devices, not to a specific device in a particular system

Today, a computer installation may include multiple servers attached to several RAID cabinets. The number of drives installed may have increased from five or ten a decade ago to more than a hundred today.

With this increase in the number of drives has come a dilution of overall system reliability. An installation with a hundred drives, each rated at 300,000 hours MTBF, can be expected to have a failure every 3,000 hours: two to three times a year, depending on how many hours the drives run.
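The arithmetic behind that estimate can be checked with a short sketch. It assumes that drive failures are independent and occur at a constant rate, so the expected time between failures for the whole installation is the per-drive MTBF divided by the number of drives. The function names and the 70 percent duty cycle are illustrative assumptions, not figures from the source.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def system_mtbf(drive_mtbf_hours, drive_count):
    """Expected hours between failures across all drives (independent failures assumed)."""
    return drive_mtbf_hours / drive_count


def failures_per_year(drive_mtbf_hours, drive_count, duty_cycle=1.0):
    """Expected failures per calendar year for drives powered on duty_cycle of the time."""
    powered_hours = HOURS_PER_YEAR * duty_cycle
    return powered_hours / system_mtbf(drive_mtbf_hours, drive_count)


print(system_mtbf(300_000, 100))                                  # 3000.0
print(round(failures_per_year(300_000, 100), 2))                  # 2.92
print(round(failures_per_year(300_000, 100, duty_cycle=0.7), 2))  # 2.04
```

Running around the clock, the installation would see close to three failures a year; at the assumed 70 percent duty cycle the figure is about two.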

And, today's drives are considerably larger than those of just a few years ago. Each potential drive failure now puts more data at risk. So, ensuring the reliability of each drive has become more important.

Similarly, the MTBF values calculated by manufacturers apply to a large population of drives: all of the drives produced of that model. This average over all drives can only broadly reflect the reliability to be expected from a particular drive.

With the S.M.A.R.T. system, data reliability will evolve from the general and statistical to the specific and individual.

How Does S.M.A.R.T. Work?

Part of what makes the S.M.A.R.T. System possible is that disk drive reliability has been intensely studied for many years. Though difficulties remain and new technologies are continually being introduced, the key vital areas, shown in Figure 2, have been well explored. By analyzing the vital functions of disk drive components and understanding their typical failure mechanisms, disk drive designers can not only develop more reliable products, but also apply their knowledge to the prediction of device failures.

Through research and monitoring of vital functions, performance thresholds which correlate to imminent failure can be determined. By applying these thresholds to the monitoring of each individual device, the S.M.A.R.T. System achieves the goal of effective failure prediction.


Figure 2. Vital Functional Areas


In addition to handling damage, assembly defects, or material defects, environmental conditions can also contribute to device failure or the loss of data. The mobile computing environment, for example, exposes systems to a broad range of adverse conditions such as shock, vibration, and extremes of temperature and humidity. Exposure to these conditions can promote a variety of failure mechanisms, including those listed in Figure 2.

The specific measures and techniques used in Quantum's drives are selected individually for each design. They can vary by model and change over time as the drive architecture changes and diagnostic techniques improve.

Figure 3 presents the results of some of Quantum's reliability research, and briefly outlines some of the measures that have been evaluated. It is important to note that no one measure is effective for all problem areas. S.M.A.R.T. is truly a suite of diagnostics.


Figure 3. Reliability Predictors

Type of Failure                  | Symptom/Cause                                     | Predictor
---------------------------------+---------------------------------------------------+----------------------------------
Excessive bad sectors            | Grown defect list, media defects, handling damage | Number of defects, growth rate
Excessive run-out                | Noisy bearings, motor, handling damage            | Run-out, bias force diagnostics
Excessive soft errors            | Cracked/broken head, contamination                | High retries, ECC invocations
Motor failure, bearings          | Drive not ready, no platter spin, handling damage | Spin-up retries, spin-up time
Drive not responding, no connect | Bad electronics module                            | None, typically catastrophic
Bad servo positioning            | High servo errors, handling damage                | Seek errors, calibration retries
Head failure, resonance          | High soft errors, servo retries, handling damage  | Read error rate, servo error rate

Failure Predictability

Regardless of the failure area and the components involved, failures can be identified as one of two broad types: predictable or non-predictable, as illustrated in Figure 4. Predictable failures show a gradual and detectable decline in performance, and there is a known threshold for acceptable performance. The challenge in designing an early warning algorithm is in identifying the threshold and detecting the decline.
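One simple way to picture "detecting the decline" is to fit a straight line to recent readings of a measure and extrapolate to the threshold. The sketch below does this with an ordinary least-squares fit. It is not the algorithm used in any drive: the linear model, the function names, and the sample data are assumptions chosen for illustration, and the samples must contain at least two distinct times.

```python
def fit_line(samples):
    """Least-squares slope and intercept for a list of (hours, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = num / den
    return slope, mean_v - slope * mean_t


def hours_until_threshold(samples, threshold):
    """Estimate hours from the last sample until the fitted line reaches threshold.

    Returns None when the readings show no decline.
    """
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    last_time = samples[-1][0]
    return max(0.0, crossing_time - last_time)


# Hypothetical readings: (power-on hours, normalized performance value).
history = [(0, 100), (100, 98), (200, 96), (300, 94)]
print(round(hours_until_threshold(history, threshold=80)))
```

With these made-up readings the value drops two points every hundred hours, so the estimate is about 700 hours before the threshold of 80 is reached. A failure with no gradual decline gives such a fit nothing to extrapolate from.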


Figure 4. Predictable vs. Non-Predictable


Some measures, like power-on hours or the number of contact start/stops, are easy to measure but have no clear limit. They are somewhat like the number of miles on a car's odometer: 100,000 miles is a high figure, but it does not mean that the particular car will fail anytime soon.

Non-predictable failures either show no gradual decline in performance, or the measurements needed to detect them cannot be made by the drive. Some failures of integrated circuits could perhaps be predicted by monitoring microscopic cracks in the substrate and circuits, but doing so would require an electron microscope.

As drive technologies advance, some non-predictable failures may become predictable. But for now it is important to acknowledge the limits of the S.M.A.R.T. System: the advanced diagnostics can provide an early warning for many failures, but not all.

The Benefits of the S.M.A.R.T. System

In many ways the S.M.A.R.T. system can be thought of as a set of diagnostic software that is built into the drive. Mainframe computers and minicomputers have used disk drive diagnostic routines for many years. In the PC world, many users are familiar with CHKDSK, SCANDISK, and other utilities that can provide early detection of some disk drive problems.

The S.M.A.R.T. system on Quantum drives extends this technology by designing the diagnostics into the drive. There, the diagnostic routines can be more precise because they are designed for a specific drive design. They can also be more effective because they have access to the internal performance and calibration measurements collected by the drive's controller. Through its use of internal performance indicators and real-time monitoring and analysis, the S.M.A.R.T. system is designed to extend its data protection capability beyond that of traditional diagnostic software.

To ensure compatibility of software and hardware implementations of the S.M.A.R.T. system, Quantum is actively pursuing standardization for both ATA and SCSI interfaces. For both interfaces, the implementation of S.M.A.R.T. requires the introduction of new firmware that specifies the drive activities to be monitored, the diagnostics to be performed, and the values to be used as thresholds or for analysis.

The key feature of the open S.M.A.R.T. specification being promoted by Quantum is that any drive that is S.M.A.R.T. compliant can communicate with any host that is also compliant. Though the specific parameters measured and the limits and analysis used may vary, the communications framework is usable across vendors.

The S.M.A.R.T. System from Quantum

With the S.M.A.R.T. System, Quantum is pioneering an open standard that can bring a new level of data security to the industry. The key features of S.M.A.R.T. include:

  • It is an open industry standard, developed and endorsed by Compaq Computer and other industry leaders. The specification has been published by the disk drive industry's Small Form Factor Committee under document number SFF-8035.
  • The technology can be extended to include a variety of devices: tape drives, CD-ROM drives, communications devices, etc.
  • S.M.A.R.T. drives can be automatically monitored for impending failure conditions.
  • Drive manufacturers can use the specific internal diagnostics best suited to their drive.
  • S.M.A.R.T.-compliant hosts can take early action to protect the data on a drive: automated backup, load reduction, etc.
  • S.M.A.R.T. is designed to detect up to 70 percent of all predictable device failures to extend data and overall system reliability.
  • S.M.A.R.T. can detect and report failure conditions that originate in the field due to shock, vibration, temperature, and voltage extremes.

In Summary

As the access to electronic information becomes more and more vital in business and at home, system designers can now reach beyond the traditional boundaries of product reliability and extend their protection of valuable user data to new and more sophisticated levels. By combining internal device monitoring with a system level software interface, the strengths of the entire system can be leveraged to surpass the capabilities of the individual pieces.
