Playing it S.M.A.R.T.
By: Erik Ottem, Judy Plummer
Date: June, 1995
The Emergence of Reliability Prediction Technology
Reliability is a concept we seek in our daily lives. We want cars that are reliable. Reliable people make good friends and good employees. We naturally expect things to perform well. Disc drive manufacturers are dedicated to improving the reliability of their products, with the purpose of achieving customer satisfaction. Although people cannot always predict reliability in other people or the cars they drive, disc drive manufacturers have taken a giant step towards predicting reliability in disc drives, a step marked by the emergence of Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.).
Computer users today have great expectations of data storage reliability. Many users do not even consider the possibility of losing data due to a hard disc drive failure. Even though continual improvements in technology make data loss uncommon, it is not impossible. Reliability prediction technology is a way to anticipate the failure of a disc drive with sufficient notice to allow a system, or user, to back up data prior to a drive's failure. S.M.A.R.T. is reliability prediction technology for both ATA/IDE and SCSI environments. Pioneered by Compaq, S.M.A.R.T. is under continued development by the top five disc drive manufacturers in the world: Seagate Technology Inc., IBM, Conner Peripherals Inc., Western Digital Corporation, and Quantum Corporation.
The evolution of S.M.A.R.T.
Reliability prediction technology emerged from a widely recognized need to protect mission-critical information stored on disc drives. As system storage capacity requirements increased and multiple disc array systems started to appear, industry leaders identified the importance of creating an early warning system that would allow enough lead time to back up data, should a failure become imminent. To understand how S.M.A.R.T. evolved, it helps to look at its roots in technology developed by IBM and Compaq.
IBM's reliability prediction technology is called Predictive Failure Analysis (PFA®). PFA measures several attributes, including head flying height, to predict failures. The disc drive, upon sensing degradation of an attribute, such as flying height, sends a notice to the host that a failure may occur. Upon receiving notice, users can take steps to protect their data.
Some time later, Compaq announced a breakthrough in diagnostic design called IntelliSafe™. This technology, which was developed in conjunction with Seagate, Quantum, and Conner, monitors a range of attributes and sends attribute and threshold information to host software. The disc drive then decides if an alert is warranted, and sends that message to the system, along with the attribute and threshold information. The attribute and threshold level implementation of IntelliSafe varies with each disc drive vendor, but the interface, and the way in which status is sent to the host, are consistent across all vendors.
Compaq placed IntelliSafe in the public domain by presenting its specification for the ATA/IDE environment, SFF-8035, to the Small Form Factor Committee on May 12, 1995. Seagate quickly recognized that reliability prediction technology offered tremendous benefits to customers, and researched the possibility of making a version available to other system OEMs, integrators, and independent software vendors. Seagate was joined by Conner, IBM, Quantum and Western Digital in the development of this new version, appropriately named S.M.A.R.T., which combines conceptual elements of Compaq's IntelliSafe and IBM's PFA.
S.M.A.R.T. technology features a series of attributes, or diagnostics, chosen specifically for each individual drive model. This per-model tailoring is important because drive architectures vary from model to model. Attributes and thresholds that detect failure for one model may not be meaningful for another. Comparing different models of cars helps illustrate this point. Some cars are equipped with four-wheel drive, but others, like a Cadillac, are not, so a diagnostic designed for a four-wheel-drive system would tell the Cadillac owner nothing. In other words, the architecture of the drive determines which attributes to measure, and which thresholds to employ. Although not all failures will be predicted, we can expect S.M.A.R.T. to evolve as technology and experience sharpen our ability to predict reliability. Attributes and thresholds will also change as field experience allows improvements to the prediction technology.
Some failures are predictable, and some are not
A disc drive must be able to monitor many elements in order to have a comprehensive reliability management capability. One of the most crucial elements is understanding failures. Failures can be seen from two standpoints: predictable, and unpredictable.
Unpredictable failures occur suddenly and without warning. They include electronic and mechanical problems, such as a power surge that causes a chip or circuit to fail. Improvements in quality, design, process, and manufacturing can reduce the incidence of unpredictable failures. For example, the development of steel-belted radial tires reduced the occurrence of the blowouts common among older flatwall "rag" tire designs.
Predictable failures are characterized by degradation of an attribute over time, before the disc drive fails. Because the degradation is gradual, the attribute can be monitored, which makes predictive failure analysis possible. Many mechanical failures are considered predictable, such as the degradation of head flying height, which would indicate a potential head crash. Certain electronic failures may show degradation before failing, but more commonly it is mechanical problems that are gradual and predictable. For instance, oil level is a function, or "attribute," of most cars that can be monitored. When a car's diagnostic system senses that the oil is low, an oil light comes on. The driver can stop the car and save the engine. In the same manner, S.M.A.R.T. gives enough notice to start the backup procedure and save the user's data.
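The difference between the two kinds of failure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any drive firmware works: the function name `first_warning_index`, the sample readings, and the threshold value are all hypothetical, and the sketch assumes that a lower reading means a less healthy drive.

```python
from typing import Optional, Sequence


def first_warning_index(readings: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first reading at or below the threshold.

    Hypothetical helper: assumes lower readings mean worse health.
    Returns None if no reading ever reaches the threshold.
    """
    for i, reading in enumerate(readings):
        if reading <= threshold:
            return i
    return None


# A gradually degrading attribute (made-up flying-height samples):
# the threshold is crossed while the drive is still working.
degrading = [100, 96, 91, 85, 78]
print(first_warning_index(degrading, threshold=80))  # 4

# A drive that fails suddenly never shows a degrading reading,
# so there is nothing to warn about in advance.
sudden = [100, 100, 100]
print(first_warning_index(sudden, threshold=80))  # None
```

In the first case the warning arrives before the drive fails, which is the lead time a backup needs. In the second case monitoring cannot help, which is why unpredictable failures have to be addressed through design and manufacturing improvements instead.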
The chart below shows different types of failures, and corresponding levels of occurrence.
Mechanical failures, which are mainly predictable failures, account for 60 percent of drive failures. This number is significant because it demonstrates a great opportunity for reliability prediction technology. As S.M.A.R.T. technology matures, an increasing share of predictable failures will be predicted, and data loss will be avoided.
How attributes are determined
S.M.A.R.T. technology is like a jigsaw puzzle; it takes many pieces, put together in the right way, to make a pattern. As previously discussed, understanding failures is one piece of the puzzle. Another piece of the puzzle is the way in which attributes are determined. Attributes are reliability prediction parameters, customized by the manufacturer for different types of drives. To determine attributes, Seagate design engineers review returned drives, consider the design points, and create attributes to signal the types of failures that they are seeing. Information gained from field experience can be used to predict reliability exposures and, over time, attributes can be incorporated into the new reliability architecture.
Though attributes are drive-specific, a variety of typical characteristics can be identified:
- head flying height
- data throughput performance
- spin-up time
- re-allocated sector count
- seek error rate
- seek time performance
- spin retry count
- drive calibration retry count
The attributes listed above illustrate typical kinds of reliability indicators. Ultimately, the disc drive design determines which attributes the manufacturer will choose. Attributes are therefore considered proprietary, since they depend on drive design.
The two S.M.A.R.T. specifications
S.M.A.R.T. emerged for the ATA/IDE environment when SFF-8035 was placed in the public domain. SCSI drives incorporate a different industry standard specification, as defined in the ANSI-SCSI Informational Exception Control (IEC) document X3T10/94-190. Seagate's S.M.A.R.T. System program includes both industry standards, thereby making S.M.A.R.T. technology available for products with either ATA/IDE or SCSI interfaces.
The S.M.A.R.T. system technology of attributes and thresholds is similar in ATA/IDE and SCSI environments, but the reporting of information differs, as illustrated in the diagram below.
In an ATA/IDE environment, software on the host interprets the alarm signal that the drive generates in response to the S.M.A.R.T. "report status" command. The host polls the drive on a regular basis to check the status returned by this command and, if it signals imminent failure, sends an alarm to the end user or system administrator. The system administrator can then schedule downtime to back up data and replace the drive. This structure also leaves room for future enhancements, which might allow reporting of information other than drive conditions, such as thermal alarms, CD-ROM, tape, or other I/O reporting. In addition to the result of the "report status" command, the host system can evaluate the attributes and alarms reported by the disc.
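The host-side polling described above might look something like the following Python sketch. It does not issue any real ATA command: `report_status` is a hypothetical callable standing in for whatever driver routine sends the "report status" command and returns True when the drive predicts failure, and `alert`, `interval_s`, and `max_polls` are likewise names assumed for illustration.

```python
import time
from typing import Callable


def poll_drive(report_status: Callable[[], bool],
               alert: Callable[[str], None],
               interval_s: float = 0.0,
               max_polls: int = 10) -> bool:
    """Poll a drive's status and raise one alert if failure is predicted.

    report_status: hypothetical stand-in for the drive's "report status"
                   command; returns True when failure is imminent.
    alert:         called once with a message for the user or administrator.
    Returns True if an alert was raised, False if every poll came back healthy.
    """
    for _ in range(max_polls):
        if report_status():
            alert("Drive predicts imminent failure: back up data and schedule replacement")
            return True
        time.sleep(interval_s)  # wait before the next regular poll
    return False


# Usage with a simulated drive that reports a problem on the third poll.
simulated_statuses = iter([False, False, True])
raised = poll_drive(lambda: next(simulated_statuses), print)
```

A real host would poll indefinitely at an interval of its choosing; the `max_polls` limit is there only so the sketch terminates.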
Generally speaking, SCSI drives with reliability prediction capability only communicate a reliability condition as either good or failing. In a SCSI environment, the failure decision occurs at the disc drive, and the host notifies the user for action. The SCSI specification provides for a sense bit to be flagged if the disc drive determines that a reliability issue exists. The system then alerts the end user/system manager.
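To contrast this with the ATA/IDE case, the sketch below models a drive that makes the failure decision internally and exposes only a good-or-failing flag to the host. It is a simulation in Python, not the SCSI protocol: the class `SimulatedScsiDrive`, its `reliability_flag` method, the `host_check` function, and the attribute names and numbers are all hypothetical, and lower readings are assumed to mean worse health.

```python
from typing import Callable, Dict


class SimulatedScsiDrive:
    """Hypothetical drive model: attributes and thresholds stay inside the drive."""

    def __init__(self, readings: Dict[str, float], thresholds: Dict[str, float]):
        self._readings = readings
        self._thresholds = thresholds

    def reliability_flag(self) -> bool:
        # The decision is made here, in the drive, not on the host.
        # Assumes every thresholded attribute has a reading.
        return any(self._readings[name] <= limit
                   for name, limit in self._thresholds.items())


def host_check(drive: SimulatedScsiDrive, notify: Callable[[str], None]) -> str:
    """The host sees only the flag and passes the condition on to the user."""
    if drive.reliability_flag():
        notify("Drive reports a reliability issue")
        return "failing"
    return "good"


healthy = SimulatedScsiDrive({"spin_up": 90}, {"spin_up": 50})
worn = SimulatedScsiDrive({"spin_up": 40}, {"spin_up": 50})
print(host_check(healthy, print))  # good
print(host_check(worn, print))     # prints the notice, then: failing
```

The host code never touches the readings or thresholds, which mirrors the point that in a SCSI environment only the good or failing condition is communicated.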
An ounce of prevention is worth a pound of cure
Seagate is dedicated to customer satisfaction. That commitment is shown in the way Seagate has addressed concerns about data loss and responded with a plan. Seagate's strategy for implementing S.M.A.R.T. will include the use of industry standard interfaces for both ATA/IDE and SCSI environments. The emergence of reliability prediction technology is another way that Seagate leverages new technology to meet the needs of its customers.