The Bathtub Curve and Product Failure Behavior
Part One - The Bathtub Curve, Infant Mortality and Burn-in

by Dennis J. Wilkins
Retired Hewlett-Packard Senior Reliability Specialist, currently a ReliaSoft Reliability Field Consultant
This paper is adapted with permission from work done while at Hewlett-Packard.

Reliability specialists often describe the lifetime of a population of products using a graphical representation called the bathtub curve. The bathtub curve consists of three periods: an infant mortality period with a decreasing failure rate, followed by a normal life period (also known as "useful life") with a low, relatively constant failure rate, and concluding with a wear-out period that exhibits an increasing failure rate. This article provides an overview of how infant mortality, normal life failures and wear-out modes combine to create the overall product failure distributions. It describes methods to reduce failures at each stage of product life and shows how burn-in, when appropriate, can significantly reduce operational failure rate by screening out infant mortality failures. The material will be presented in two parts. Part One (presented in this issue) introduces the bathtub curve and covers infant mortality and burn-in. Part Two (presented in next month's HotWire) will address the remaining two periods of the bathtub curve: normal life failures and end of life wear-out.

Figure 1: The Reliability Bathtub Curve

The bathtub curve, displayed in Figure 1 above, does not depict the failure rate of a single item, but describes the relative failure rate of an entire population of products over time. Some individual units will fail relatively early (infant mortality failures), others (we hope most) will last until wear-out, and some will fail during the relatively long period typically called normal life. Failures during infant mortality are highly undesirable and are always caused by defects and blunders: material defects, design blunders, errors in assembly, etc. Normal life failures are normally considered to be random cases of "stress exceeding strength." However, as we'll see, many failures often considered normal life failures are actually infant mortality failures. Wear-out is a fact of life due to fatigue or depletion of materials (such as lubrication depletion in bearings). A product's useful life is limited by its shortest-lived component. A product manufacturer must assure that all specified materials are adequate to function through the intended product life.

Note that the bathtub curve is typically used as a visual model to illustrate the three key periods of product failure and is not calibrated to depict a graph of the expected behavior for a particular product family. It is rare to have enough short-term and long-term failure data to actually model a population of products with a calibrated bathtub curve.

Also note that the actual time periods for these three characteristic failure distributions can vary greatly. Infant mortality does not mean "products that fail within 90 days" or any other defined time period. Infant mortality is the time over which the failure rate of a product is decreasing, and may last for years. Conversely, wear-out will not always happen long after the expected product life. It is a period when the failure rate is increasing, and has been observed in products after just a few months of use. This, of course, is a disaster from a warranty standpoint!

We are interested in the characteristics illustrated by the entire bathtub curve. The infant mortality period is a time when the failure rate is dropping, but is undesirable because a significant number of failures occur in a short time, causing early customer dissatisfaction and warranty expense. Theoretically, the failures during normal life occur at random but with a relatively constant rate when measured over a long period of time. Because these failures may incur warranty expense or create service support costs, we want the bottom of the bathtub to be as low as possible. And we don't want any wear-out failures to occur during the expected useful lifetime of the product.

Infant Mortality - What Causes It and What to Do About It?
From a customer satisfaction viewpoint, infant mortalities are unacceptable. They cause "dead-on-arrival" products and undermine customer confidence. They are caused by defects designed into or built into a product. Therefore, to avoid infant mortalities, the product manufacturer must determine methods to eliminate the defects. Appropriate specifications, adequate design tolerance and sufficient component derating can help, and should always be used, but even the best design intent can fail to cover all possible interactions of components in operation. In addition to the best design approaches, stress testing should be started at the earliest development phases and used to evaluate design weaknesses and uncover specific assembly and materials problems. Tests like these are called HALT (Highly Accelerated Life Test) or HAST (Highly Accelerated Stress Test) and should be applied, with increasing stress levels as needed, until failures are precipitated. The failures should be investigated and design improvements should be made to improve product robustness. Such an approach can help to eliminate design and material defects that would otherwise show up as product failures in the field.

After manufacturing of a product begins, a stress test can still be valuable. There are two distinct uses for stress testing in production. One purpose (often called HASA, Highly Accelerated Stress Audit) is to identify defects caused by assembly or material variations that can lead to failure and to take action to remove the root causes of these defects. The other purpose (often called burn-in) is to use stress tests as an ongoing 100% screen to weed out defects in a product where the root causes cannot be eliminated.

The first approach, eliminating root causes, is generally the best approach and can significantly reduce infant mortalities. It is usually most cost-effective to run 100% stress screens only for early production, then reduce the screen to an audit (or entirely eliminate it) as root causes are identified, the process/design is corrected and significant problems are removed. Unfortunately, some companies put 100% burn-in processes in place and keep using them, addressing the symptoms rather than identifying the root causes. They just keep scrapping and/or reworking the same defects over and over. For most products, this is not effective from a cost standpoint or from a reliability improvement standpoint.

There is a class of products where ongoing 100% burn-in has proven to be effective. This is with technology that is "state-of-the-art," such as leading edge semiconductor chips. There are bulk defects in silicon and minute fabrication variances that cannot be designed out with the current state of technology. These defects can cause some parts to fail very early relative to the majority of the population. Burn-in can be an effective way to screen out these weak parts. This will be addressed later in this article.

A Quantitative Look at Infant Mortality Failures Using the Weibull Distribution
The Weibull distribution is a very flexible life distribution model that can be used to characterize failure distributions in all three phases of the bathtub curve. The basic Weibull distribution has two parameters: a shape parameter, often termed beta (β), and a scale parameter, often termed eta (η). The scale parameter, eta, determines when, in time, a given portion of the population will fail (i.e., 63.2%). The shape parameter, beta, is the key feature of the Weibull distribution that enables it to be applied to any phase of the bathtub curve. A beta less than 1 models a failure rate that decreases with time, as in the infant mortality period. A beta equal to 1 models a constant failure rate, as in the normal life period. And a beta greater than 1 models an increasing failure rate, as during wear-out. There are several ways to view this distribution, including probability plots, survival plots and failure rate versus time plots. The bathtub curve is a failure rate vs. time plot.
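As a minimal sketch of how the shape parameter sets the failure rate trend, the snippet below evaluates the standard two-parameter Weibull failure rate, h(t) = (β/η)(t/η)^(β-1), for a beta below, equal to and above 1. The function names and the eta of 10 years are illustrative assumptions and do not come from the plots in this article.

```python
import math

def weibull_failure_rate(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def weibull_unreliability(t, beta, eta):
    """Fraction of the population failed by time t: F(t) = 1 - exp(-(t/eta)**beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

ETA_YEARS = 10.0  # assumed scale parameter, for illustration only

for beta in (0.5, 1.0, 3.0):
    rates = [weibull_failure_rate(t, beta, ETA_YEARS) for t in (1.0, 5.0, 9.0)]
    print(f"beta={beta}: failure rate at 1, 5, 9 years = "
          + ", ".join(f"{r:.4f}" for r in rates))

# Whatever the beta, about 63.2% of the population has failed at t = eta.
print(f"F(eta) = {weibull_unreliability(ETA_YEARS, 0.5, ETA_YEARS):.3f}")
```

The three printed rows fall, stay flat and rise respectively, which is the left wall, floor and right wall of the bathtub.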

Typical infant mortality distributions for state-of-the-art semiconductor chips follow a Weibull model with a beta in the range of 0.2 to 0.6. If such a distribution is viewed in terms of failure rate versus time, it looks like the plot in Figure 2.

Figure 2: Infant Mortality Curve - Failure Rate vs. Time

This plot shows ten years (87,600 hours) of time on the x-axis with failure rate on the y-axis. It looks a lot like the infant mortality and normal life portions of the bathtub curve in Figure 1, but this curve models only infant mortality (decreasing failure rate). Dots on this plot represent failure times typical of an infant mortality with Weibull beta = 0.2. As you can see, there are 27 failures before one year, and only six failures from one to ten years. People observing this curve, and the failure points plotted, could not be blamed for thinking it represents both infant mortality failures (in the first year or so), and normal life failures after that. But these are only infant mortality failures - all the way out to ten years!

This plot shows the distribution for a beta value typical of complex, high-density integrated circuits (VLSI or Very Large Scale Integrated circuits). Parts such as CPUs, interface controllers and video processing chips often exhibit this kind of failure distribution over time. A look at this plot shows that if you could run these parts for the equivalent of three years and discard the failed parts, the reliability of the surviving parts would be much higher out to ten years. In fact, until a wear-out mode occurs, the reliability would continue to improve over time. If there are mechanisms that can produce normal life failures (theoretically a constant failure rate) mixed in with the defects that cause the infant mortalities shown above, burn-in can still provide significant improvement as long as the constant failure rate is relatively low.
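The effect of discarding early failures can be sketched with the conditional Weibull reliability, R(T + t) / R(T), which gives the chance that a part that has already survived T years survives a further t years. The example below uses the beta of 0.2 quoted above, but the eta value is an assumption chosen only to give a failure fraction of a few percent, so the printed percentages are illustrative and are not the values behind the figures in this article.

```python
import math

def reliability(t, beta, eta):
    """Weibull survival function R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def failure_prob_after_burn_in(burn_in, mission, beta, eta):
    """Probability that a unit which survived `burn_in` fails during the next `mission`."""
    return 1.0 - reliability(burn_in + mission, beta, eta) / reliability(burn_in, beta, eta)

BETA = 0.2          # infant mortality shape value quoted in the text
ETA_YEARS = 1.0e8   # assumed scale parameter, for illustration only

raw = failure_prob_after_burn_in(0.0, 10.0, BETA, ETA_YEARS)
screened = failure_prob_after_burn_in(3.0, 10.0, BETA, ETA_YEARS)
print(f"10-year failures, no burn-in:             {raw:.2%}")
print(f"10-year failures after 3 years' burn-in:  {screened:.2%}")
```

With a beta below 1 the screened figure is always the smaller of the two, because the parts most likely to fail have already been removed.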

Burn-In for Leading Edge Technologies
To see how burn-in can improve the reliability of high tech parts, we'll use a chart that looks somewhat like the failure rate vs. time curve in Figure 2, but is more useful. This is a survival plot that directly shows how many units from a population have survived to a given time. Figure 3 is a plot for a typical VLSI process with a small "weak" sub-population (defective parts that will fail as infant mortalities) and a larger sub-population of parts that will fail randomly at a very low rate over the normal operating life. The x-axis scale is in years of use (zero to 100 years!) and the y-axis is percent of parts still operating to spec (starting at 100% and dropping to 50%).

Figure 3 shows that, of the failures that occur in the first 20 years (about 4%), most failures occur in the first year or so, just like we observed in the infant mortality example above. Because there is a low level, constant failure rate, this plot shows failures continuing for a hundred years. Of course, there could be a wear-out mode that comes into play before a hundred years has elapsed, but no wear-out distribution is considered here. Electronic components, unlike mechanical assemblies, rarely have wear-out mechanisms that are significant before many decades of operation.

Figure 3: Mixed Infant Mortality and Normal Life Survival Plot

We're not really interested in the failures much beyond ten years, so let's look at this same model for just the first ten years. In Figure 4, we have included sample failure points from the simulation model used to create the plot. These enable us to view which population (infant mortality or normal life) each failure came from.

Figure 4: Mixed Infant Mortality and Normal Life Failures

We see that the plot in Figure 4 looks like the early life and normal life portions of the bathtub curve, and in fact includes both distributions. We see that over 2% of the units fail in the first year, but it takes ten years for 3% to fail. In actuality, there are still "infant" mortalities occurring well beyond ten years in this model, but at an ever-decreasing rate. In fact, in the ten year span of this model there would be very few normal life failures. Only two failures (~5% of all failures) in this example (large blue dots) come from the normal life failure population. About 95% of the failures plotted above (small red dots) are infant mortality failures! This is what the integrated circuits (IC) industry has observed with complex solid-state devices. Even after ten years of operation the primary failure cause for ICs is still infant mortality. In other words, failures are still driven primarily by defects.

In such cases, burn-in can help. In the plot above you can see that if you could get three years of operation on this part before you shipped it, you would have screened out over 80% (2% divided by 3%) of the parts that would fail in ten years. So if we were to come up with a method to effectively "age" the parts the equivalent of three years and eliminate most of the infant mortalities, the remaining parts would be more reliable than the original population. Of course, the parts that go through the 3-year "burn-in" would have to last an additional ten years in the field, for a total of thirteen years. Let's see what this looks like in Figure 5.

Figure 5: Comparison of Failures from Raw and Burned-in Parts

Above, we see 14 years of failure distribution for the original parts (not burned-in) and 11 years of expected failure distribution for parts that received 3 years of burn-in. In this example, the total cumulative failures between 3 years and 13 years for the original parts (or from zero to ten years for burned-in parts) is about 0.6%. Without burn-in, the first ten years would have had about 3% cumulative failures. This is about a five times reduction in cumulative failures by using burn-in, or in terms of a change, we would have about 2% fewer cumulative failures in ten years with burn-in if a dominant infant mortality failure mode exists. Note that in the first year or two, the relative improvement in reliability is even greater. At two years, only about 0.1% failures are expected after burn-in but almost 2% without burn-in; a ratio of almost 25:1!

In reality, manufacturers don't have two to three years to spend on burn-in. They need an accelerated stress test. In the IC industry there are usually two stresses that are used to accelerate the effective time of burn-in: temperature and voltage. Increased temperature (relative to normal operating temperatures) can provide an acceleration of tens of times (10x to 30x is typical). Increased voltages (relative to normal operating levels) can provide even higher acceleration factors on many types of ICs. Combined acceleration factors in the range of 1000:1, or more, are typical for many IC burn-in processes. Therefore, burn-in times of tens of hours can provide effective operating times of one to five years, significantly reducing the proportion of parts with infant mortality defects.
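The arithmetic behind "tens of hours" can be sketched as follows, assuming that an acceleration factor simply multiplies elapsed test time into equivalent field time. The function names are hypothetical; the 1000:1 factor and the 8,760 hours per year (87,600 hours in ten years) are taken from the text.

```python
HOURS_PER_YEAR = 8760  # 87,600 hours per ten years, as used for Figure 2

def equivalent_field_years(burn_in_hours, acceleration_factor):
    """Years of normal operation represented by an accelerated burn-in."""
    return burn_in_hours * acceleration_factor / HOURS_PER_YEAR

def burn_in_hours_needed(target_years, acceleration_factor):
    """Burn-in duration (hours) needed to represent `target_years` of normal use."""
    return target_years * HOURS_PER_YEAR / acceleration_factor

ACCELERATION = 1000  # combined acceleration factor of 1000:1 quoted in the text

for hours in (10, 24, 44):
    years = equivalent_field_years(hours, ACCELERATION)
    print(f"{hours} h of burn-in ~ {years:.1f} years of use")

print(f"3 years of use ~ {burn_in_hours_needed(3, ACCELERATION):.1f} h of burn-in")
```

At 1000:1, roughly 9 to 44 hours of burn-in span the one to five year range mentioned above.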

What if we try burn-in on a product with no dominant infant mortality problems? The survival plot for an assembly with a 1% per year "constant" failure rate (normal life period) is shown below in Figure 6.

Figure 6: Survival Plot for Constant Failure Rate

It's pretty easy to see that burn-in for two years would find ~2% failures, but operation for an additional two years would find another ~2%. At ten years, we would have found about 10%. Note, the line is not actually a straight line because a constant failure rate (equivalent to the normal life part of the bathtub) acts on the remaining population and the remaining population is decreasing as units fail. Looking at the same burn-in conditions as in the last example, if we were to provide three years of operation on these parts and then use them for an additional ten years, what results would we have? The cumulative failures of the units that passed this screen would be very close to 9.5%. Without burn-in, the cumulative failures in ten years would be the same, about 9.5%. There is no advantage to burn-in with a constant (normal life) failure rate.
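These figures can be checked with a short calculation, assuming the constant failure rate is modeled by the exponential distribution (the Weibull with beta equal to 1). The function names are hypothetical; the 1% per year rate and the three-year and ten-year intervals come from the text.

```python
import math

RATE_PER_YEAR = 0.01  # 1% per year constant failure rate from the text

def cumulative_failures(years, rate=RATE_PER_YEAR):
    """Fraction failed by `years` under a constant rate: F(t) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * years)

def failures_after_burn_in(burn_in, mission, rate=RATE_PER_YEAR):
    """Fraction of burn-in survivors that fail during the following `mission` years."""
    survive_burn_in = math.exp(-rate * burn_in)
    survive_total = math.exp(-rate * (burn_in + mission))
    return 1.0 - survive_total / survive_burn_in

print(f"Failed after 2 years:                  {cumulative_failures(2):.2%}")
print(f"Failed after 10 years, no burn-in:     {cumulative_failures(10):.2%}")
print(f"Failed in 10 years after 3-yr burn-in: {failures_after_burn_in(3, 10):.2%}")
```

The last two lines print the same value, about 9.5%, because a constant failure rate has no memory of how long a unit has already run.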

It should be obvious that burn-in of an assembly that is failing due to a wear-out failure mode (failure rate increasing with time) will actually yield assemblies that are worse than units that did not go through burn-in. This is simply because the probability of failure is increasing for every hour the parts run. Adding operating time just increases the possibility of a failure in any future period of time!
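The same conditional-reliability calculation shows the penalty for a wear-out mode. This is only a sketch: the beta of 3 and eta of 20 years are assumed values for a hypothetical wear-out mechanism and do not come from this article.

```python
import math

def reliability(t, beta, eta):
    """Weibull survival function R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def failure_prob_after_burn_in(burn_in, mission, beta, eta):
    """Probability that a unit which survived `burn_in` fails during the next `mission`."""
    return 1.0 - reliability(burn_in + mission, beta, eta) / reliability(burn_in, beta, eta)

BETA = 3.0        # assumed wear-out shape parameter (beta > 1)
ETA_YEARS = 20.0  # assumed scale parameter, for illustration only

raw = failure_prob_after_burn_in(0.0, 10.0, BETA, ETA_YEARS)
aged = failure_prob_after_burn_in(3.0, 10.0, BETA, ETA_YEARS)
print(f"10-year failures, no burn-in:             {raw:.1%}")
print(f"10-year failures after 3 years' burn-in:  {aged:.1%}")
```

Under these assumptions the burned-in units fail roughly twice as often over the following ten years, since burn-in has only consumed part of their life.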

Conclusion
In this issue, Part One, we have introduced the concept of the bathtub curve and discussed issues related to the first period, infant mortality, as well as the practices, such as burn-in, that are used to address failures of this type. As this article demonstrates, although burn-in practices are not usually a practical economic method of reducing infant mortality failures, burn-in has proven to be effective for state-of-the-art semiconductors where root cause defects cannot be eliminated. For most products, stress testing, such as HALT/HAST, should be used during design and early production phases to precipitate failures, followed by analysis of the resulting failures and corrective action through redesign to eliminate the root causes. In Part Two (presented in next month's HotWire), we will examine the final two periods of the bathtub curve: normal life failures and end of life wear-out.

Source: https://www.weibull.com/hotwire/issue21/hottopics21.htm
