What should we think of manufacturers who claim 99.999% availability for their hardware?

Hardware availability is just one, albeit important, factor in server availability. Over the past few years, hardware availability has increased because of technology improvements. Key factors in this are the increasing level of integration (which reduces the component count necessary to implement a system, as well as the number of connectors needed), improvements in data integrity, and the use of formal methods during the design of hardware components.

Systems based around standard technologies increasingly integrate functionality—such as system partitioning or the ability to swap subsystems “online,” i.e., without requiring that the system be brought down—that was until very recently the prerogative of mainframe systems and of “Fault Tolerant” systems. As a result of such advancements, it is possible to reach very high levels of hardware availability without entering the domain of specialized, expensive machinery.

On the other hand, it must be realized that software failures are more frequent than hardware failures, and the gap is widening. Hardware reliability keeps improving, but the amount of software in a system keeps increasing while its quality (as measured, for example, by the number of defects per thousand lines of code) shows little sign of improving. What matters for a system is total availability: a combination of hardware quality, the quality of software written or provided by the manufacturer, the quality of third-party software, and finally the quality of any applications and operating procedures developed within the company. This last factor deserves special attention: all too often, the quality of in-house applications and operating procedures is underestimated, and that neglect shows up directly in the system's failure rate.
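To see why the weakest factor dominates, a simplified model helps: if each layer must be working for the service to be up, and failures are assumed independent, total availability is the product of the individual availabilities. The figures below are purely illustrative, not taken from the text:

```python
# Simplified series-availability model: the system is up only when
# every layer is up, and layer failures are assumed independent.
# All per-layer figures here are illustrative assumptions.
hardware = 0.99999        # "five nines" hardware
system_software = 0.9995  # OS and manufacturer-supplied software
third_party = 0.999       # third-party software
in_house_apps = 0.995     # in-house applications and procedures

total = hardware * system_software * third_party * in_house_apps
print(f"Total availability: {total:.5f}")
```

Even with five-nines hardware, the total here comes out near 99.35%: the in-house layer, the least reliable factor, sets the ceiling for the whole system.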

It is appropriate to be very precise in any calculation concerning availability. To illustrate this point, consider a system that works 24 hours a day, 7 days a week, with an availability of 99.999%. This figure implies that the system is down no more than about five minutes a year. Whether planned downtime is included in this budget makes a big difference in the difficulty of achieving the objective. Finally, we must not forget that any use of redundancy to improve availability tends to carry a cost in performance.
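The five-minutes-a-year figure follows directly from the definition of availability; a short sketch makes the arithmetic explicit for the usual "number of nines" levels:

```python
# Annual downtime budget implied by a given availability level.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # about 525,960 minutes

def downtime_minutes_per_year(availability: float) -> float:
    """Maximum annual downtime (in minutes) for a 24x7 system."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for availability in (0.999, 0.9999, 0.99999):
    budget = downtime_minutes_per_year(availability)
    print(f"{availability:.3%} -> {budget:7.1f} min/year")
```

At 99.999%, the budget works out to roughly 5.3 minutes a year; at 99.9%, it is nearly nine hours, which is why each additional "nine" is dramatically harder to achieve.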


Source of Information: Elsevier, Server Architectures, 2005

