Failure Mode and Effects Analysis (FMEA)

Failure Mode and Effects Analysis:
Failure mode and effects analysis (FMEA; often written with "failure modes" in plural) is the process of reviewing as many components, assemblies, and subsystems as possible to identify potential failure modes in a system and their causes and effects.
For each component, the failure modes and their resulting effects on the rest of the system are recorded in a specific FMEA worksheet. There are numerous variations of such worksheets. An FMEA can be a qualitative analysis, but may be put on a quantitative basis when mathematical failure rate models are combined with a statistical failure mode ratio database. It was one of the first highly structured, systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems. An FMEA is often the first step of a system reliability study.
A few different types of FMEA analyses exist, such as:
- Functional
- Design
- Process
Sometimes FMEA is extended to FMECA (failure mode, effects, and criticality analysis) to indicate that criticality analysis is performed too.
FMEA is an inductive reasoning (forward logic) single point of failure analysis and is a core task in reliability engineering, safety engineering and quality engineering.
A successful FMEA activity helps identify potential failure modes based on experience with similar products and processes—or based on common physics of failure logic. It is widely used in development and manufacturing industries in various phases of the product life cycle. Effects analysis refers to studying the consequences of those failures on different system levels.
Functional analyses are needed as an input to determine correct failure modes at all system levels, both for functional FMEA and for piece-part (hardware) FMEA. An FMEA is used to structure mitigation for risk reduction, based on reducing the severity of the failure (mode) effect, lowering the probability of failure, or both. The FMEA is in principle a fully inductive (forward logic) analysis; however, the failure probability can only be estimated or reduced by understanding the failure mechanism. Hence, FMEA may include information on causes of failure (deductive analysis) to reduce the possibility of occurrence by eliminating identified (root) causes.
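As a rough illustration of the quantitative basis mentioned above, a part failure rate can be apportioned over its failure modes using failure mode ratios. The sketch below assumes a hypothetical relay; the rate, the mode names and the ratios are made-up values for illustration, not figures from any database.

```python
# Hypothetical part failure rate in failures per million hours (illustrative only).
PART_FAILURE_RATE = 2.0

# Hypothetical failure mode ratios: the fraction of part failures seen in each mode.
MODE_RATIOS = {
    "contacts fail open": 0.5,
    "contacts fail closed": 0.3,
    "coil open circuit": 0.2,
}


def mode_failure_rates(part_rate, ratios):
    """Split a part failure rate over its failure modes."""
    total = sum(ratios.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("failure mode ratios should sum to 1")
    return {mode: part_rate * ratio for mode, ratio in ratios.items()}


rates = mode_failure_rates(PART_FAILURE_RATE, MODE_RATIOS)
for mode, rate in rates.items():
    print(f"{mode}: {rate:.2f} failures per million hours")
```

Each mode then carries its own failure rate, which can feed the probability column of the worksheet.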

Introduction:
The FME(C)A is a design tool used to systematically analyze postulated component failures and identify the resultant effects on system operations. The analysis is sometimes characterized as consisting of two sub-analyses, the first being the failure modes and effects analysis (FMEA), and the second, the criticality analysis (CA).
Successful development of an FMEA requires that the analyst include all significant failure modes for each contributing element or part in the system. FMEAs can be performed at the system, subsystem, assembly, subassembly or part level. The FMECA should be a living document during development of a hardware design. It should be scheduled and completed concurrently with the design. If completed in a timely manner, the FMECA can help guide design decisions. The usefulness of the FMECA as a design tool and in the decision making process is dependent on the effectiveness and timeliness with which design problems are identified. Timeliness is probably the most important consideration. In the extreme case, the FMECA would be of little value to the design decision process if the analysis is performed after the hardware is built. While the FMECA identifies all part failure modes, its primary benefit is the early identification of all critical and catastrophic subsystem or system failure modes so they can be eliminated or minimized through design modification at the earliest point in the development effort; therefore, the FMECA should be performed at the system level as soon as preliminary design information is available and extended to the lower levels as the detail design progresses.

Remark: For more complete scenario modelling, another type of reliability analysis may be considered, for example fault tree analysis (FTA), a deductive (backward logic) failure analysis that may handle multiple failures within the item and/or external to the item, including maintenance and logistics. It starts at a higher functional / system level. An FTA may use the basic failure mode FMEA records or an effect summary as one of its inputs (the basic events). Interface hazard analysis, human error analysis and others may be added to complete the scenario modelling.

Functional Failure Mode and Effects Analysis:
The analysis may be performed at the functional level until the design has matured sufficiently to identify specific hardware that will perform the functions; then the analysis should be extended to the hardware level. When performing the hardware level FMECA, interfacing hardware is considered to be operating within specification. In addition, each part failure postulated is considered to be the only failure in the system (i.e., it is a single failure analysis).
In addition to the FMEAs done on systems to evaluate the impact lower level failures have on system operation, several other FMEAs are done. Special attention is paid to interfaces between systems and, in fact, to all functional interfaces. The purpose of these FMEAs is to assure that irreversible physical and/or functional damage is not propagated across the interface as a result of failures in one of the interfacing units. These analyses are done to the piece part level for the circuits that directly interface with the other units. The FMEA can be accomplished without a CA, but a CA requires that the FMEA has previously identified system level critical failures. When both steps are done, the total process is called an FMECA.
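The single failure assumption can be sketched in a few lines: fail one component at a time, keep everything else nominal, and check whether the system still meets its success criterion. The system below (two redundant pumps feeding one shared valve) and all of its names are hypothetical, chosen only to show the idea.

```python
# Hypothetical system: two redundant pumps feeding one shared valve.
COMPONENTS = ["pump_a", "pump_b", "valve"]


def system_works(state):
    """Toy success criterion: at least one pump and the valve must work."""
    return (state["pump_a"] or state["pump_b"]) and state["valve"]


def single_failure_points(components, works):
    """Fail each component alone, with all other components operating nominally."""
    points = []
    for failed in components:
        # True means the component works; only `failed` is set to False.
        state = {name: name != failed for name in components}
        if not works(state):
            points.append(failed)
    return points


sfps = single_failure_points(COMPONENTS, system_works)
print(sfps)
```

In this toy model only the valve is reported, because either pump can cover for the other. Combinations of failures are outside the scope of such a single failure analysis.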

Types:
Functional: before design solutions are provided (or only at a high level), functions can be evaluated for potential functional failure effects. General mitigations ("design to" requirements) can be proposed to limit the consequences of functional failures or to limit the probability of occurrence in this early development phase. It is based on a functional breakdown of a system. This type may also be used for software evaluation.
Concept Design / Hardware: analysis of systems or subsystems in the early design concept stages to analyse the failure mechanisms and lower level functional failures, especially to compare different concept solutions in more detail. It may be used in trade-off studies.
Detailed Design / Hardware: analysis of products prior to production. These are the most detailed FMEAs (called Piece-Part or Hardware FMEA in MIL-STD-1629) and are used to identify any possible hardware (or other) failure mode down to the lowest part level. It should be based on a hardware breakdown (e.g. the BoM, Bill of Materials). Failure effect severity, failure prevention (mitigation), failure detection and diagnostics may all be fully analyzed in this FMEA.
Process: analysis of manufacturing and assembly processes. Both quality and reliability may be affected by process faults. The input for this FMEA includes, among others, a work process / task breakdown.

Ground rules
The ground rules of each FMEA include a set of project selected procedures; the assumptions on which the analysis is based; the hardware that has been included and excluded from the analysis and the rationale for the exclusions. The ground rules also describe the indenture level of the analysis (i.e. the level in the hierarchy of the part to the sub-system, sub-system to the system, etc.), the basic hardware status, and the criteria for system and mission success. Every effort should be made to define all ground rules before the FMEA begins; however, the ground rules may be expanded and clarified as the analysis proceeds. A typical set of ground rules (assumptions) follows:
1. Only one failure mode exists at a time.
2. All inputs (including software commands) to the item being analyzed are present and at nominal values.
3. All consumables are present in sufficient quantities.
4. Nominal power is available.

Benefits
Major benefits derived from a properly implemented FMECA effort are as follows:
1. It provides a documented method for selecting a design with a high probability of successful operation and safety.
2. A documented uniform method of assessing potential failure mechanisms, failure modes and their impact on system operation, resulting in a list of failure modes ranked according to the seriousness of their system impact and likelihood of occurrence.
3. Early identification of single failure points (SFPs) and system interface problems, which may be critical to mission success and/or safety. It also provides a method of verifying that switching between redundant elements is not jeopardized by postulated single failures.
4. An effective method for evaluating the effect of proposed changes to the design and/or operational procedures on mission success and safety.
5. A basis for in-flight troubleshooting procedures and for locating performance monitoring and fault-detection devices.
6. Criteria for early planning of tests.

From the above list, early identification of SFPs, input to the troubleshooting procedure, and the locating of performance monitoring / fault detection devices are probably the most important benefits of the FMECA. In addition, the FMECA procedures are straightforward and allow orderly evaluation of the design.

Basic terms:
The following covers some basic FMEA terminology.
Failure
The loss of a function under stated conditions.
Failure mode
The specific manner or way in which a failure occurs, in terms of failure of the function of the item under investigation (a part or a (sub)system); it may generally describe the way the failure occurs. It shall at least clearly describe an (end) failure state of the item (or function, in the case of a Functional FMEA) under consideration. It is the result of the failure mechanism (the cause of the failure mode). For example, a fully fractured axle, a deformed axle, and a fully open or fully closed electrical contact are each a separate failure mode in a design FMEA (DFMEA); they would not be failure modes in a process FMEA (PFMEA). In a PFMEA the process itself is examined: for process step x, "insert drill bit", the failure mode would be "insert wrong drill bit", and the effect would be a hole that is too big or too small.

Failure cause and/or mechanism
Defects in requirements, design, process, quality control, handling or part application, which are the underlying cause or sequence of causes that initiate a process (mechanism) that leads to a failure mode over a certain time. A failure mode may have more than one cause. For example, "fatigue or corrosion of a structural beam" or "fretting corrosion in an electrical contact" is a failure mechanism and in itself (likely) not a failure mode. The related failure mode (end state) is a "full fracture of structural beam" or "an open electrical contact". The initial cause might have been "improper application of corrosion protection layer (paint)" and/or "(abnormal) vibration input from another (possibly failed) system".

Failure effect
Immediate consequences of a failure on operation, function or functionality, or status of some item.
Indenture levels (bill of material or functional breakdown)
An identifier for system level and thereby item complexity. Complexity increases as levels are closer to one.
Local effect
The failure effect as it applies to the item under analysis.
Next higher level effect
The failure effect as it applies at the next higher indenture level.
End effect
The failure effect at the highest indenture level or total system.
Detection
The means of detection of the failure mode by the maintainer, operator or built-in detection system, including the estimated dormancy period (if applicable).
Probability
The likelihood of the failure occurring.
Risk Priority Number (RPN)
Severity (of the event) × Probability (of the event occurring) × Detection (probability that the event would not be detected before the user was aware of it).
Severity
The consequences of a failure mode. Severity considers the worst potential consequence of a failure, determined by the degree of injury, property damage, system damage and/or time lost to repair the failure.
Remarks / mitigation / actions
Additional info, including the proposed mitigation or actions used to lower a risk or justify a risk level or scenario.
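
The terms above map naturally onto the columns of a worksheet row. The following is a minimal sketch, assuming each of Severity, Probability and Detection is ranked on a 1 to 10 scale; the scale, the field names and the example rows are assumptions for illustration, and a project's ground rules would define the real ones.

```python
from dataclasses import dataclass


@dataclass
class WorksheetRow:
    """One hypothetical FMEA worksheet row (field names are illustrative)."""
    item: str
    failure_mode: str
    end_effect: str
    severity: int      # assumed scale: 1 (least severe) to 10 (most severe)
    probability: int   # assumed scale: 1 (least likely) to 10 (most likely)
    detection: int     # assumed scale: 1 (almost surely detected) to 10 (undetected)

    @property
    def rpn(self) -> int:
        # RPN = Severity x Probability x Detection
        return self.severity * self.probability * self.detection


rows = [
    WorksheetRow("seal", "leak", "degraded performance", 3, 5, 2),
    WorksheetRow("axle", "full fracture", "loss of function", 10, 2, 6),
    WorksheetRow("electrical contact", "fails open", "loss of function", 6, 4, 3),
]

# Rank the failure modes so the highest RPN is reviewed first.
ranked = sorted(rows, key=lambda row: row.rpn, reverse=True)
for row in ranked:
    print(f"{row.item} / {row.failure_mode}: RPN = {row.rpn}")
```

Sorting by RPN only prioritizes the review; a high Severity row may still deserve attention even when its RPN is modest.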

Probability (P)
It is necessary to look at the cause of a failure mode and the likelihood of occurrence. This can be done by analysis, calculations / FEM, or by looking at similar items or processes and the failure modes that have been documented for them in the past. A failure cause is looked upon as a design weakness. All the potential causes for a failure mode should be identified and documented in technical terms. Examples of causes are: human errors in handling, manufacturing-induced faults, fatigue, creep, abrasive wear, erroneous algorithms, excessive voltage, or improper operating conditions or use (depending on the ground rules used). A failure mode is given a Probability Ranking.
Severity (S)
Determine the Severity for the worst-case scenario adverse end effect (state). It is convenient to write these effects down in terms of what the user might see or experience in terms of functional failures. Examples of these end effects are: full loss of function x, degraded performance, functions in reversed mode, too late functioning, erratic functioning, etc. Each end effect is given a Severity number (S) from, say, I (no effect) to V (catastrophic), based on cost and/or loss of life or quality of life. These numbers prioritize the failure modes (together with probability and detectability). Below a typical classification is given. Other classifications are possible. See also hazard analysis.
Detection (D)
The means or method by which a failure is detected and isolated by the operator and/or maintainer, and the time this may take. This is important for maintainability control (availability of the system) and it is especially important for multiple failure scenarios. This may involve dormant failure modes (e.g. no direct system effect while a redundant system / item automatically takes over, or a failure that is only problematic during specific mission or system states) or latent failures (e.g. deterioration failure mechanisms, such as a crack growing in metal that has not yet reached critical length). It should be made clear how the failure mode or cause can be discovered by an operator under normal system operation, or whether it can be discovered by the maintenance crew through some diagnostic action or automatic built-in system test. A dormancy and/or latency period may be entered.
Dormancy or Latency Period
The average time that a failure mode may be undetected may be entered if known. For example:
- Seconds, auto detected by maintenance computer
- 8 hours, detected by turn-around inspection
- 2 months, detected by scheduled maintenance block X
- 2 years, detected by overhaul task x

Indication
If the undetected failure allows the system to remain in a safe / working state, a second failure situation should be explored to determine whether or not an indication will be evident to all operators and what corrective action they may or should take.
Indications to the operator should be described as follows:
- Normal. An indication that is evident to an operator when the system or equipment is operating normally.
- Abnormal. An indication that is evident to an operator when the system has malfunctioned or failed.
- Incorrect. An erroneous indication to an operator due to the malfunction or failure of an indicator (i.e., instruments, sensing devices, visual or audible warning devices, etc.).

Risk level (P×S) and (D)
Risk is the combination of end effect probability and severity, where probability and severity include the effect of non-detectability (dormancy time). Non-detectability may influence the end effect probability of failure or the worst-case effect severity. The exact calculation may not be easy in all cases, such as those where multiple scenarios (with multiple events) are possible and detectability / dormancy plays a crucial role (as for redundant systems). In that case fault tree analysis and/or event trees may be needed to determine exact probability and risk levels.
Preliminary risk levels can be selected from a risk matrix, for example one based on MIL-STD-882. The higher the risk level, the more justification and mitigation is needed to provide evidence and lower the risk to an acceptable level. High risk should be indicated to higher level management, who are responsible for final decision-making.
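
A risk matrix of this kind is essentially a lookup table indexed by probability and severity. The sketch below uses the I to V severity classes mentioned earlier and five hypothetical probability levels (A = extremely unlikely to E = frequent). The cell values are illustrative assumptions, not copied from MIL-STD-882 or any other standard.

```python
# Severity classes: I (no effect) to V (catastrophic).
SEVERITY = ["I", "II", "III", "IV", "V"]

# Hypothetical matrix: one row per probability level (A = extremely unlikely,
# E = frequent), one column per severity class in the order of SEVERITY.
RISK_MATRIX = {
    "A": ["low", "low", "low", "low", "moderate"],
    "B": ["low", "low", "moderate", "moderate", "moderate"],
    "C": ["low", "moderate", "moderate", "high", "high"],
    "D": ["low", "moderate", "high", "high", "high"],
    "E": ["low", "moderate", "high", "high", "high"],
}


def risk_level(probability, severity):
    """Look up the preliminary risk level for one failure mode end effect."""
    return RISK_MATRIX[probability][SEVERITY.index(severity)]


print(risk_level("C", "IV"))
print(risk_level("A", "V"))
```

Keeping the matrix as plain data makes it easy to swap in the classification agreed for a given project without touching the lookup logic.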

Timing:
The FMEA should be updated whenever:
- A new cycle begins (new product/process)
- Changes are made to the operating conditions
- A change is made in the design
- New regulations are instituted
- Customer feedback indicates a problem

Uses:
- Development of system requirements that minimize the likelihood of failures.
- Development of designs and test systems to ensure that the failures have been eliminated or the risk is reduced to an acceptable level.
- Development and evaluation of diagnostic systems.
- To help with design choices (trade-off analysis).

Advantages:
- Catalyst for teamwork and idea exchange between functions
- Collect information to reduce future failures, capture engineering knowledge
- Early identification and elimination of potential failure modes
- Emphasize problem prevention
- Improve company image and competitiveness
- Improve production yield
- Improve the quality, reliability, and safety of a product/process
- Increase user satisfaction
- Maximize profit
- Minimize late changes and associated cost
- Reduce impact on company profit margin
- Reduce system development time and cost
- Reduce the possibility of the same kind of failure in the future
- Reduce the potential for warranty concerns.
