Software Reliability is defined as the probability of failure-free software operation for a specified period of time in a specified environment.
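As a rough numerical illustration of this definition, the sketch below assumes a constant failure rate, which gives the exponential model R(t) = exp(-rate * t). That model is an assumption made for illustration only; the definition itself does not imply it, and the function name and the example rate are hypothetical.

```python
import math


def reliability(failure_rate: float, duration: float) -> float:
    """Probability of failure-free operation over `duration`.

    ASSUMPTION: the failure rate is constant (exponential model).
    `failure_rate` and `duration` must use the same time unit.
    """
    if failure_rate < 0 or duration < 0:
        raise ValueError("failure_rate and duration must be non-negative")
    return math.exp(-failure_rate * duration)


# Hypothetical example: 0.001 failures per hour over a 100-hour period.
print(round(reliability(0.001, 100.0), 4))
```

The "specified environment" part of the definition is hidden inside the failure rate here: a different operating environment would generally mean a different rate.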
Software and hardware differ fundamentally in their failure mechanisms. Hardware faults are mostly physical faults, while software faults are design faults, which are harder to visualize, classify, detect, and correct. Design faults are closely related to fuzzy human factors and to the design process, neither of which we understand well. Hardware may also contain design faults, but physical faults usually dominate. Software has no strict counterpart to the hardware manufacturing process, unless the simple action of uploading software modules into place counts. Therefore, the quality of software does not change once it is uploaded into storage and starts running. Trying to achieve higher reliability by simply duplicating the same software modules will not work, because every copy contains the same design faults, and voting cannot mask a fault that all voters share.
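The point about duplication and voting can be shown with a small sketch. The module `buggy_abs` and its fault are hypothetical, invented only for this illustration: three identical copies are run and a majority vote is taken, yet the vote returns the same wrong answer because every copy shares the design fault.

```python
from collections import Counter


def buggy_abs(x: int) -> int:
    """Hypothetical module with a design fault: wrong for negative input."""
    return x  # a correct version would return -x when x < 0


def majority_vote(outputs):
    """Return the most common output among the replicas."""
    return Counter(outputs).most_common(1)[0][0]


def replicated(x: int, copies: int = 3) -> int:
    """Run identical copies of the same module and vote on the result."""
    return majority_vote([buggy_abs(x) for _ in range(copies)])


print(replicated(5))   # input that does not trigger the fault
print(replicated(-5))  # every copy agrees on the same wrong answer
```

Voting helps only when replicas fail independently, as physically separate hardware units tend to do. Identical software copies fail together on the same inputs.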
Reliability Testing has to be planned. The critical questions to be answered while planning for reliability testing are:
- What is my time to market?
- Is this a safety critical product where people's lives may be at stake upon a failure?
- What is the life expectancy of my product?
- Does my manufacturing process use accelerated test techniques to find process failures?
- How costly to me are field failures?
- Am I having reliability problems with an existing product?
Fast time-to-market products that have short deployment lives due to fast technology advancements will require accelerated life testing techniques. Why? Given the fast pace of these bleeding-edge, high-technology products, the end of the development program is too late to make changes or, worse yet, to redesign the product just before it should have shipped. Products with long development cycles and long deployment lives will require life tests run under extreme user conditions. Products that are safety critical will need a combination of accelerated and life testing techniques. Products that have reliability and stability problems will need some form of reliability growth testing. High quality products of any type will use various combinations of all of the available reliability testing techniques.
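The mapping from product characteristics to test techniques described above can be written down as a simple lookup. This is only a sketch of the guidance in the preceding paragraph; the function name, the parameter names, and the reduction of each planning question to a yes/no flag are simplifying assumptions.

```python
def suggest_tests(fast_time_to_market: bool,
                  long_deployment_life: bool,
                  safety_critical: bool,
                  has_reliability_problems: bool) -> list:
    """Return a sorted list of suggested reliability test techniques."""
    accelerated = "accelerated life testing"
    life = "life testing under extreme user conditions"
    growth = "reliability growth testing"

    techniques = set()
    if fast_time_to_market and not long_deployment_life:
        techniques.add(accelerated)
    if long_deployment_life:
        techniques.add(life)
    if safety_critical:
        # Safety-critical products need a combination of both.
        techniques.update({accelerated, life})
    if has_reliability_problems:
        techniques.add(growth)
    return sorted(techniques)


# Hypothetical product: short-lived, fast to market, known stability issues.
print(suggest_tests(True, False, False, True))
```

A real test plan would also weigh the cost of field failures and the manufacturing process, which this sketch leaves out.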
In general, accelerated life tests (ALT) and accelerated stress tests (AST) find problems very quickly and, provided that root cause failure analysis is done, the fixes can also be tested quickly. It is never too early in a program to perform an AST-type test on a new technology or early prototype. The earlier a problem is found, the less costly it will be to fix. AST tests are also good for establishing the robustness of a product, giving assurance that latent defects are not hiding in the design, only to cause failures at some future date.
HALT (Highly Accelerated Life Test) testing is a specialized version of AST testing. HALT testing is a step-stress test regimen that uses temperature and random vibration to eliminate design and latent defects from a product. HALT testing is also a prerequisite to setting manufacturing ESS (Environmental Stress Screening) and HASS (Highly Accelerated Stress Screen) levels, to ensure that the stress screens remove only the manufacturing defects and do not damage good products.
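The step-stress idea can be sketched as a loop that raises one stress level in fixed increments until the unit under test fails. This is a deliberately simplified illustration: a real HALT combines temperature and vibration and follows each failure with root cause analysis. The function name, the `unit_survives` callback, and all numeric levels are hypothetical.

```python
from typing import Callable, Optional


def step_stress(start: float, step: float, max_level: float,
                unit_survives: Callable[[float], bool]) -> Optional[float]:
    """Raise the stress level in fixed steps until the unit fails.

    Returns the first level at which the unit failed, or None if it
    survived every level up to and including max_level.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    level = start
    while level <= max_level:
        if not unit_survives(level):
            return level
        level += step
    return None


# Hypothetical unit that stops working above a stress level of 85.
first_failure = step_stress(20.0, 10.0, 120.0, lambda level: level <= 85.0)
print(first_failure)
```

The level just below the first failure gives a rough operating limit, which is the kind of information needed before screening levels are chosen.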
While AST-type tests are an excellent means of flushing out design problems, there are still many failure modes that, due to the physics of failure, will simply not occur in a short period of time. For these types of failure modes, especially on long-lived products, reliability demonstration testing (RDT) and reliability qualification testing (RQT) are required. In RDTs/RQTs, the equipment is operational and is subjected, over a long period of time, to environments that cycle between the maximum use conditions a customer is expected to see.
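One common way to size a demonstration test, which the text above does not prescribe, is the zero-failure (success-run) calculation based on the binomial distribution: if n units all complete the test without failure, a reliability R is demonstrated at confidence C when R^n <= 1 - C. The sketch below solves that inequality for n; the function name and the example targets are assumptions for illustration.

```python
import math


def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Smallest n such that n failure-free units demonstrate `reliability`
    at `confidence`, i.e. reliability ** n <= 1 - confidence.

    ASSUMPTION: units pass or fail independently (binomial model).
    """
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


# Hypothetical target: demonstrate 90% reliability at 90% confidence.
print(zero_failure_sample_size(0.90, 0.90))
```

The required sample size climbs steeply as the reliability target approaches 1, which is one reason demonstration tests on long-lived products are lengthy and expensive.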
If one is having reliability problems with existing equipment, or wants a more stable piece of equipment entering an RDT/RQT and a much better chance of passing the test requirement, a reliability growth test (RGT) is the methodology to use. An RGT is a TAAF (Test, Analyze, and Fix) program that, like the AST, finds, fixes, and eliminates failure modes. In fact, an AST is a form of RGT.
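Progress during a reliability growth test is often tracked with a growth model. The sketch below uses the Duane model, which is an assumption here since the text names no specific model: cumulative MTBF plotted against cumulative test time on log-log axes is roughly a straight line, and its slope is the growth rate. The function name and the failure data are hypothetical.

```python
import math


def duane_growth_slope(failure_times):
    """Least-squares slope of log(cumulative MTBF) vs. log(test time).

    `failure_times` holds the cumulative test time at each successive
    failure, in ascending order. Cumulative MTBF at the i-th failure
    is failure_times[i - 1] / i.
    """
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to fit a slope")
    xs = [math.log(t) for t in failure_times]
    ys = [math.log(t / i) for i, t in enumerate(failure_times, start=1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical data: failures at 1, 4, 9, 16 and 25 hours of testing.
times = [float(i * i) for i in range(1, 6)]
print(round(duane_growth_slope(times), 3))
```

A positive slope means failures are arriving less often as fixes are incorporated. A slope near zero suggests the test, analyze, and fix loop is not improving the design.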
In summary, reliability testing needs to be tailored to one's business, technology, and manufacturing needs. No one size fits all.