<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/" >

<channel>
	<title>SIL Safe</title>
	<atom:link href="https://silsafe.net/feed/" rel="self" type="application/rss+xml" />
	<link>https://silsafe.net</link>
	<description>Safer Communities, Resilient Operations.</description>
	<lastBuildDate>Tue, 07 Apr 2026 15:42:03 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://silsafe.net/wp-content/uploads/2025/07/cropped-SIL-Safe-Logo-master-white-favicon-square-150x150.png</url>
	<title>SIL Safe</title>
	<link>https://silsafe.net</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Hazard and Risk Assessment (H&#038;RA): The Foundation of Functional Safety</title>
		<link>https://silsafe.net/hazard-and-risk-assessment-hra/</link>
					<comments>https://silsafe.net/hazard-and-risk-assessment-hra/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Sat, 04 Apr 2026 21:07:09 +0000</pubDate>
				<category><![CDATA[Beginner]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=6213</guid>

					<description><![CDATA[A hazard and risk assessment (H&#038;RA) is the foundation of every IEC 61511 safety case. This guide covers what it is, how it works, the methodologies used to conduct one, and what PHA means in the context of OSHA PSM and EPA RMP — written for process safety engineers and functional safety practitioners.]]></description>
										<content:encoded><![CDATA[
<p>If there is one activity in the functional safety life-cycle that sets the tone for everything that follows, it is the Hazard &amp; Risk Assessment (H&amp;RA). Get it right and you have a solid technical foundation for your Safety Instrumented System (SIS). Get it wrong — or skip it — and every decision downstream is built on sand.</p>



<h2 class="wp-block-heading">What Is a Hazard &amp; Risk Assessment (H&amp;RA)?</h2>



<p>A <strong>hazard</strong> is a physical situation with the potential to cause harm. A <strong>risk</strong> is what you get when you combine two things: the <em>probability</em> that the hazard leads to a harmful event, and the <em>severity</em> of the consequences. Risk is never just one of those dimensions. A high-severity outcome with negligible probability may be entirely tolerable. A low-severity outcome that happens constantly may not be. Both dimensions must be assessed together — always.</p>



<p>The H&amp;RA is the structured process of identifying hazards, evaluating the associated risks, and determining whether those risks are tolerable. It is the foundation of the <strong>functional safety life-cycle</strong> — the end-to-end engineering process defined in IEC 61511 that governs how Safety Instrumented Systems are designed, implemented, operated, and maintained. Without the H&amp;RA, there is no technical basis for any of what follows.</p>



<p><strong>A note on terminology</strong> — you will encounter several names and acronyms for this activity depending on the standard or industry context:</p>



<ul class="wp-block-list">
<li><strong>H&amp;RA</strong> (Hazard and Risk Assessment) — the term used in IEC 61511</li>



<li><strong>HRA</strong> and <strong>HARA</strong> — also widely used within the functional safety community; same activity, different shorthand</li>



<li><strong>PHA</strong> (Process Hazard Analysis) — the equivalent term under the OSHA PSM regulation (29 CFR 1910.119) and the EPA RMP regulation; same concept, different regulatory language</li>



<li>Other safety disciplines — machinery safety, for example — may use different terms again</li>
</ul>



<p>The variation is mostly regulatory and organizational preference. The underlying activity is the same.</p>



<h2 class="wp-block-heading">When a Hazard and Risk Assessment Fits in the IEC 61511 Safety Life-Cycle</h2>



<p>A hazard and risk assessment must be conducted early in the safety life-cycle — specifically under Clause 8 — before SIS design begins.</p>



<p>The accepted practice is to conduct the H&amp;RA when the <strong>P&amp;IDs (Piping and Instrumentation Diagrams) are at Rev 0</strong> — the point at which the process design is sufficiently defined to support a meaningful hazard study, but not so advanced that changes identified during the study become costly or impractical to implement. Too early and the hazard picture is incomplete; too late and the window to influence design has closed.</p>



<p>A poorly timed H&amp;RA compounds every problem that follows.</p>



<h2 class="wp-block-heading">Key Steps in Conducting an H&amp;RA</h2>



<p>The H&amp;RA is not a single task — it is a structured sequence of activities. The order matters.</p>



<p><strong>Step 1 – Determine tolerable risk.</strong> Before you can assess whether any risk is acceptable, you need to define what &#8220;acceptable&#8221; means for your organization. This benchmark must be established first. It governs every risk judgment that follows. (See the next section for details.)</p>



<p><strong>Step 2 – Define the scope and boundaries.</strong> What process units, equipment, and operating modes are included in this H&amp;RA? Scope creep and scope gaps are both problems. A clearly documented boundary prevents both.</p>



<p><strong>Step 3 – Identify hazards.</strong> What physical situations exist that have the potential to cause harm? This is where the structured identification methodology — HAZOP, What-If, or similar — is applied.</p>



<p><strong>Step 4 – Identify hazardous events and demand scenarios.</strong> A hazard becomes a hazardous <em>event</em> when a specific initiating cause triggers it — the conditions under which a safety function would be called upon to act.</p>



<p><strong>Step 5 – Assess consequences.</strong> For each hazardous event, what is the worst credible outcome? Consequences are typically assessed in terms of harm to people, environmental impact, and asset damage.</p>



<p><strong>Step 6 – Assess likelihood/frequency.</strong> How often is the hazardous event expected to occur, without any protection layers in place? This is the <em>unmitigated</em> or <em>inherent</em> demand rate.</p>



<p><strong>Step 7 – Identify the risk gap.</strong> With both consequence severity and likelihood established, the assessed risk can be compared against the tolerable risk criteria and the calibrated risk matrix. Where the assessed risk exceeds tolerable risk, a gap exists. That gap is the direct trigger for a SIS.</p>
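<p>To make Steps 6 and 7 concrete, here is a minimal sketch in Python. The frequencies are purely hypothetical; neither the numbers nor the function come from IEC 61511:</p>

```python
def required_risk_reduction(unmitigated_freq_per_yr: float,
                            tolerable_freq_per_yr: float) -> float:
    """Ratio of assessed (unmitigated) frequency to tolerable frequency.

    1.0 means no gap; anything above 1.0 is the risk reduction
    the protection layers must deliver (Step 7).
    """
    return max(1.0, unmitigated_freq_per_yr / tolerable_freq_per_yr)

# Hypothetical event: expected once per 10 years, tolerable once per 100,000 years
gap = required_risk_reduction(1e-1, 1e-5)
print(round(gap))   # 10000: four orders of magnitude of risk reduction needed
```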



<h2 class="wp-block-heading">Determining Tolerable Risk — The Foundation of Step 1</h2>



<p>Under IEC 61511, tolerable risk is the level of risk accepted in a given context based on the current values of society. It is <em>not</em> zero risk — every industrial process carries some inherent risk, and the goal of IEC 61511 is to ensure that risk is <em>identified</em>, <em>evaluated</em>, and <em>reduced to a tolerable level</em>, not eliminated entirely. The instinct to demand &#8220;no risk&#8221; is understandable but neither achievable nor the intent of the standard.</p>



<p>Tolerable risk criteria must be documented — a specific IEC 61511 requirement and a common gap at smaller PSM and RMP facilities. Without documented criteria, every risk judgment becomes subjective and the study loses its technical defensibility.</p>



<p>A <strong>calibrated risk matrix</strong> is the standard tool for establishing and communicating tolerable risk criteria. It is a specific type of risk matrix in which the axis boundaries are anchored to actual numerical frequency and consequence values — not vague qualitative descriptors like &#8220;frequent&#8221; or &#8220;catastrophic.&#8221; Calibration reduces subjectivity as much as possible, improves consistency across the study, and makes the risk criteria defensible under regulatory or third-party scrutiny. Larger organizations and multi-site facilities often maintain multiple calibrated risk matrices — one for each consequence category or tailored to specific process units. The PEAR Model, discussed further in the Related Topics section, provides a useful framework for structuring those consequence categories.</p>
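<p>As an illustration of what calibration means in practice, the sketch below anchors a hypothetical 4&#215;4 matrix to numerical frequency and severity boundaries. The boundary values and rankings are invented for the example, not taken from any standard:</p>

```python
import bisect

# Hypothetical calibrated 4x4 risk matrix: axis boundaries are anchored
# to numbers rather than words like "frequent" or "catastrophic".
FREQ_BOUNDS = [1e-4, 1e-2, 1e-1]   # events per year
SEV_BOUNDS = [0.01, 0.1, 1.0]      # expected fatalities per event
RANKING = [  # rows: frequency category, cols: severity category
    ["A", "A", "B", "C"],
    ["A", "B", "C", "D"],
    ["B", "C", "D", "D"],
    ["C", "D", "D", "D"],          # "D" exceeds tolerable risk
]

def risk_rank(freq_per_yr: float, severity: float) -> str:
    """Locate the matrix cell for a numerically assessed scenario."""
    i = bisect.bisect_right(FREQ_BOUNDS, freq_per_yr)
    j = bisect.bisect_right(SEV_BOUNDS, severity)
    return RANKING[i][j]

print(risk_rank(0.05, 0.5))   # D -> exceeds tolerable risk
```

<p>Because the boundaries are numeric, two analysts assessing the same scenario land in the same cell, which is the point of calibration.</p>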



<h2 class="wp-block-heading">The Risk Gap and the Case for a SIS</h2>



<p>The risk gap is the difference between the <em>unmitigated risk</em> of a hazardous event and the <em>tolerable risk</em> threshold. When a gap exists — when the assessed risk exceeds what is tolerable — a protection layer is required.</p>



<p>Protection layers can take many forms:</p>



<ul class="wp-block-list">
<li>Inherently safer design</li>



<li>Basic process control</li>



<li>Physical relief devices</li>



<li>Operator response</li>



<li>Safety Instrumented Functions (SIFs)</li>
</ul>



<p>When a hazard cannot be reduced to a tolerable level through other means, it requires a <strong>Safety Instrumented Function (SIF)</strong> — a specific automated safety action that brings the process to a safe state in response to a defined demand. A process will typically have multiple hazards that each require their own SIF. All of those SIFs together constitute the <strong>Safety Instrumented System (SIS)</strong> — the SIS does not exist independently of the SIFs that define it; it is the sum of them.</p>



<p>This is why the hazard and risk assessment is the primary input to SIS design. Without it, there is no basis for knowing which SIFs are needed or what they must do. The H&amp;RA outputs flow directly into other life-cycle documents, such as the Safety Requirements Specification (SRS).</p>



<h2 class="wp-block-heading">The Link Between H&amp;RA and SIL Determination</h2>



<p>Identifying that a SIF is needed is only the first step. The H&amp;RA also provides the information needed to allocate a SIL: the size and nature of the risk gap determine how much risk reduction each SIF must deliver.</p>



<p><strong>Safety Integrity Level (SIL)</strong> is a discrete measure of the required risk reduction performance of a SIF. IEC 61511 defines four SILs (SIL 1 through SIL 4), each representing an order-of-magnitude increase in performance; in practice, SIL 4 is rarely targeted in the process sector. The required SIL is determined by the size of the risk gap: the larger the gap, the higher the SIL required to close it.</p>
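<p>The relationship between the risk gap and the SIL target can be sketched as a simple lookup. The PFDavg bands in the comments are the familiar IEC 61511 low-demand bands; the function itself is only an illustration:</p>

```python
def sil_for_rrf(rrf: float) -> int:
    """Map a required risk reduction factor to a SIL target using
    the IEC 61511 low-demand PFDavg bands (illustrative lookup)."""
    if rrf <= 10:
        return 0          # no SIL-rated function required
    if rrf <= 100:
        return 1          # PFDavg 1e-2 to 1e-1
    if rrf <= 1_000:
        return 2          # PFDavg 1e-3 to 1e-2
    if rrf <= 10_000:
        return 3          # PFDavg 1e-4 to 1e-3
    raise ValueError("RRF above 10,000 implies SIL 4 or a design change")

print(sil_for_rrf(500))   # 2
```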



<p><strong>Layer of Protection Analysis (LOPA)</strong> is the most widely used SIL determination methodology in the process industry. LOPA is <em>not</em> an H&amp;RA methodology — it is a SIL allocation tool that operates later in the life-cycle, using the hazardous event data and demand rates produced by the H&amp;RA as its primary inputs.</p>
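<p>A single-scenario LOPA can be sketched in a few lines. All of the numbers below (initiating event frequency, IPL credits, tolerable frequency) are hypothetical:</p>

```python
from math import prod

# Hypothetical single-scenario LOPA; numbers are illustrative only.
initiating_event_freq = 0.1     # per year (from the H&RA demand data)
ipl_pfds = [0.1, 0.1]           # credits for two non-SIS protection layers
tolerable_freq = 1e-5           # per year (from the risk criteria)

# Frequency after crediting the non-SIS layers:
mitigated_freq = initiating_event_freq * prod(ipl_pfds)       # ~1e-3 /yr
# Remaining gap the SIF must close:
required_sif_pfd = tolerable_freq / mitigated_freq            # ~1e-2
print(round(mitigated_freq, 6), round(required_sif_pfd, 6))   # 0.001 0.01
```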



<h2 class="wp-block-heading">H&amp;RA Methodologies</h2>



<p>IEC 61511 does not prescribe a specific H&amp;RA methodology. The right choice depends on process complexity, the stage of design, available data, and the level of rigor required. What the standard does require is that the methodology is appropriate and applied systematically.</p>



<p>All H&amp;RA methodologies fall into one of three umbrella categories:</p>



<p><strong>Qualitative</strong> — judgment-based and descriptive. No numerical failure frequencies or consequence magnitudes are required. The output is a structured list of hazards, causes, consequences, and existing safeguards. Qualitative methods are the most widely used in the process industry for H&amp;RA purposes.</p>



<p><strong>Semi-quantitative</strong> — structured scoring or ranking. Numbers are used to characterize risk, but a full probabilistic analysis is not performed. The output provides more consistency and defensibility than a purely qualitative approach.</p>



<p><strong>Quantitative</strong> — numerical frequency and consequence analysis with full probabilistic treatment. The output is a calculated risk value that can be compared directly against a numerical tolerable risk criterion.</p>



<h3 class="wp-block-heading">Qualitative Methods</h3>



<p><strong>HAZID (Hazard Identification Study)</strong> is an upstream screening tool used to identify major hazards early in a project, before detailed design is available. It is typically applied at the conceptual or FEED stage and is less structured than a HAZOP. The primary reference standard is ISO 17776.</p>



<p><strong>HAZOP (Hazard and Operability Study)</strong> is the dominant H&amp;RA methodology in the process industry. It uses a systematic, guide-word-driven approach applied by a multi-disciplinary team to identify deviations from design intent and evaluate their causes and consequences. HAZOP produces a structured, auditable record that serves as the primary H&amp;RA documentation. The governing standard is IEC 61882.</p>



<p><strong>What-If Analysis</strong> is a less formal brainstorming technique useful for simpler processes, preliminary reviews, or as a supplement to a more structured study.</p>



<p><strong>FMEA (Failure Mode and Effects Analysis)</strong> examines individual components to identify failure modes and their system-level effects. It is a bottom-up analysis most commonly used in equipment-focused assessments.</p>



<h3 class="wp-block-heading">Semi-Quantitative Methods</h3>



<p><strong>Risk Graph</strong> is a structured method that uses defined parameters — consequence severity, occupancy, probability of avoiding harm, and demand rate — to assign a SIL target. It is widely used as an initial SIL targeting tool where a full LOPA is not warranted.</p>



<h3 class="wp-block-heading">Quantitative Methods</h3>



<p><strong>Event Tree Analysis (ETA)</strong> models the sequences of events following an initiating cause, branching at each point where a safety barrier succeeds or fails. It is used to quantify outcome frequencies and is commonly paired with fault tree analysis in broader QRA work.</p>



<p><strong>Quantitative Risk Analysis (QRA)</strong> is a formal methodology combining frequency analysis and consequence modeling to produce a quantified risk picture — typically expressed as individual risk contours or F-N curves. QRA draws on ETA, fault tree analysis, and dispersion analysis as inputs and is the most rigorous and resource-intensive approach.</p>



<p><strong>FMEDA (Failure Mode, Effects and Diagnostic Analysis)</strong> is primarily a <strong>manufacturer and device certification tool</strong> conducted against IEC 61508. It produces quantitative failure rate data — including dangerous undetected failure rate and diagnostic coverage — that process facilities consume from equipment supplier safety manuals rather than generating themselves.</p>



<h2 class="wp-block-heading">Related Topics and Tools</h2>



<p><strong>Bow-Tie Analysis</strong> places the top event at the center, with threat pathways on the left and consequence pathways on the right. Barriers appear on both sides. Bow-tie diagrams are effective for communicating H&amp;RA outputs to management, regulators, and operating teams.</p>



<p><strong>Fault Tree Analysis (FTA)</strong> is a top-down tool that works backward from an undesired top event to identify the combinations of failures that could cause it. FTA can be used qualitatively as a logic map or quantitatively with failure rate data. It is often paired with ETA in QRA studies.</p>
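<p>Quantitatively, FTA reduces to combining basic-event probabilities through AND and OR gates. A minimal sketch, with hypothetical probabilities and assuming all basic events are independent:</p>

```python
# Minimal quantitative fault-tree sketch; failure probabilities are
# hypothetical and all basic events are assumed independent.
def or_gate(*probs: float) -> float:
    """Output fails if ANY input fails: 1 - product of survivals."""
    survive = 1.0
    for p in probs:
        survive *= (1.0 - p)
    return 1.0 - survive

def and_gate(*probs: float) -> float:
    """Output fails only if ALL inputs fail: product of failures."""
    fail = 1.0
    for p in probs:
        fail *= p
    return fail

# Top event "overpressure unrelieved" = relief valve fails
# AND (pressure sensor fails OR trip logic fails)
p_top = and_gate(1e-2, or_gate(5e-3, 1e-3))
print(p_top)   # roughly 6e-5
```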



<p><strong>Dispersion Analysis</strong> models the physical spread of a hazardous material release — gas cloud, toxic plume, or flammable vapor — to quantify the geographic extent and frequency of harm. It is a quantitative calculation that supports consequence assessment and is an essential input to QRA at facilities handling hazardous materials.</p>



<p><strong>The PEAR Model</strong> structures consequence assessment around four dimensions: <em>People</em>, <em>Environment</em>, <em>Assets</em>, and <em>Reputation</em>. It is particularly relevant where tolerable risk criteria extend beyond personnel safety — and in practice, many facilities maintain a separate calibrated risk matrix for each PEAR dimension.</p>



<p><strong>Hazardous Area Classification</strong> is a discipline that should be completed prior to the H&amp;RA. The already-established hazardous areas (or zones) can serve as an independent protection layer and may be credited as such during the risk assessment process.</p>



<p><strong>Security Risk Assessment (SRA)</strong> is a parallel requirement under IEC 61511 Clause 8 — the same clause that governs the H&amp;RA. Where the H&amp;RA addresses process hazards, the SRA addresses intentional threats and cybersecurity vulnerabilities to the SIS. Both are required to meet full Clause 8 obligations.</p>



<h2 class="wp-block-heading">Common Mistakes and Pitfalls</h2>



<p><strong>Timing it wrong.</strong> Too early and the hazard picture is incomplete; too late and design changes to address identified hazards may already be impractical. The H&amp;RA should be conducted when the P&amp;IDs are at Rev 0.</p>



<p><strong>No documented tolerable risk criteria.</strong> Without a defined, documented calibrated risk matrix, every risk judgment in the study is a personal opinion — and the study cannot be defended or verified by a third party.</p>



<p><strong>Poorly defined hazardous events.</strong> Vague events produce incorrect consequence and frequency assessments, which produce incorrect risk gap calculations. Each hazardous event needs a defined initiating cause, affected equipment, and demand scenario.</p>



<p><strong>Ignoring human factors and demand rates.</strong> Human error is a significant source of SIF demand. Underestimating it produces an overly optimistic risk picture and may result in incorrect SIL targets.</p>



<p><strong>Treating the H&amp;RA as a one-time exercise.</strong> The H&amp;RA is a living document. When the process changes, it must be reviewed and updated.</p>



<p><strong>Confusing inherent risk with residual risk.</strong> Inherent risk is assessed <em>before</em> protection layers. Residual risk is what remains <em>after</em>. Conflating the two leads to incorrect safeguard crediting and inaccurate risk gap calculations.</p>



<h2 class="wp-block-heading">Keeping Your Hazard and Risk Assessment Current (Management of Change)</h2>



<p>IEC 61511 is explicit: the H&amp;RA must be reviewed whenever changes occur that could affect its validity. Triggers include process modifications, near misses, new or modified equipment, regulatory updates, and periodic revalidation obligations under the safety life-cycle. A formal Management of Change (MOC) process that flags these triggers is the most reliable way to keep the H&amp;RA aligned with the actual process.</p>



<h2 class="wp-block-heading">Frequently Asked Questions</h2>



<p><strong>Q1: I keep hearing different terms — PHA, HARA, HRA, H&amp;RA. Are these all the same thing?</strong></p>



<p>H&amp;RA, HRA, and HARA all refer to the same IEC 61511 activity. Process Hazard Analysis (PHA) is the equivalent term under the OSHA PSM and EPA RMP regulations — different regulatory language, same concept. Other safety disciplines, such as machinery safety, may use different terms again — it can be a bit confusing.</p>



<p><strong>Q2: I&#8217;ve heard that LOPA and HAZOP are often done together in the same study. Is that correct, and if so, how does that work?</strong></p>



<p>Yes — many organizations run HAZOP and LOPA back-to-back in the same workshop series for efficiency, completing the HAZOP for each node before immediately running the LOPA on the identified scenarios. They remain technically distinct activities: HAZOP is the H&amp;RA, LOPA is SIL determination — combining the workshops is a logistical choice, not a technical one.</p>



<p><strong>Q3: Is HAZOP the standard way to conduct an H&amp;RA in the process industry?</strong></p>



<p>In practice, yes — HAZOP is the de facto H&amp;RA methodology for most process facilities operating under IEC 61511, ISA 84, PSM, and RMP. That said, IEC 61511 does not mandate it; other methodologies such as What-If or QRA are appropriate depending on the project stage and complexity.</p>



<p><strong>Q4: Does IEC 61511 require a specific H&amp;RA methodology?</strong></p>



<p>No — IEC 61511 requires that an appropriate methodology is applied systematically, but does not mandate a specific one. HAZOP is the most common choice in the process sector, but What-If and other approaches are all valid depending on the context.</p>



<p><strong>Q5: What is the difference between inherent risk and residual risk?</strong></p>



<p>Inherent risk is the risk before any protection layers are applied. Residual risk is what remains after all protection layers — including the SIS — have been credited.</p>



<p><strong>Q6: Who should be involved in conducting an H&amp;RA?</strong></p>



<p>An H&amp;RA is a multi-disciplinary team activity requiring input from process, operations, instrumentation, and safety disciplines at a minimum. These studies can be demanding, multi-day affairs, particularly when LOPA is run back-to-back with the HAZOP. IEC 61511 requires that at least one team member holds functional safety competence; in practice, this role is often the Team Leader or HAZOP Leader.</p>



<p><strong>Q7: How often does an H&amp;RA need to be revalidated?</strong></p>



<p>IEC 61511 requires H&amp;RA review whenever changes occur that could affect its validity — process modifications, near misses, new equipment, or regulatory changes. Periodic revalidation is also required as part of the broader safety life-cycle review obligations.</p>



<h2 class="wp-block-heading">Further Reading</h2>



<p><strong>Internal — related articles on this site:</strong></p>



<ul class="wp-block-list">
<li><a href="https://silsafe.net/functional-safety-for-the-process-industry/" data-type="post" data-id="6100">What is Functional Safety?</a></li>



<li><a href="https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/" data-type="post" data-id="2784">Proof Testing and Management of Change</a></li>



<li><a href="https://silsafe.net/glossary/stated-risk/" data-type="glossary" data-id="898">Stated Risk</a> vs. <a href="https://silsafe.net/glossary/revealed-risk/" data-type="glossary" data-id="814">Revealed Risk</a></li>



<li>SIL Safe Glossary &#8211; <a href="https://silsafe.net/glossary-cat/hazard-and-risk-assessment/" data-type="page" data-id="3787">H&amp;RA Terms</a></li>



<li>HAZOP — A Deep Dive &#8211; <em>coming soon</em></li>



<li>Layer of Protection Analysis (LOPA) and SIL Determination &#8211; <em>coming soon</em></li>
</ul>



<p><strong>External authoritative references:</strong></p>



<ul class="wp-block-list">
<li><a href="https://www.isa.org/" target="_blank" rel="noopener">ISA — ISA 84 / IEC 61511 resources</a></li>



<li><a href="https://www.iec.ch/" target="_blank" rel="noopener">IEC — IEC 61511 standard</a></li>



<li><a href="https://www.osha.gov/" target="_blank" rel="noopener">OSHA — Process Safety Management regulation (29 CFR 1910.119)</a></li>



<li><a href="https://www.epa.gov/" target="_blank" rel="noopener">EPA — Risk Management Program (RMP)</a></li>
</ul>



<h2 class="wp-block-heading">Conclusion</h2>



<p>A hazard and risk assessment is not a compliance checkbox. It is the technical foundation on which every downstream SIS decision is built — from SIF definition to SIL determination to verification. A well-conducted H&amp;RA, performed at the appropriate time, gives you a defensible, documented basis for your safety case. A poor one — or one conducted too early or too late — leaves your entire SIS design without a credible technical basis.</p>



<p>For facilities operating under IEC 61511, PSM, or RMP, the message is straightforward: invest in getting the H&amp;RA right, time it correctly, and make sure it is conducted by people who understand both the process and the standard.</p>



<p>Functional safety is complex, and the stakes are high. If you have questions about your SIS design, SIL verification, or where to start with IEC 61511, the team at SIL Safe is here to help. Reach out to us today.</p>



<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "I keep hearing different terms — PHA, HARA, HRA, H&RA. Are these all the same thing?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "H&RA, HRA, and HARA all refer to the same IEC 61511 activity. Process Hazard Analysis (PHA) is the equivalent term under the OSHA PSM and EPA RMP regulations — different regulatory language, same concept. Other safety disciplines, such as machinery safety, may use different terms again — it can be a bit confusing."
      }
    },
    {
      "@type": "Question",
      "name": "I've heard that LOPA and HAZOP are often done together in the same study. Is that correct, and if so, how does that work?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes — many organizations run HAZOP and LOPA back-to-back in the same workshop series for efficiency, completing the HAZOP for each node before immediately running the LOPA on the identified scenarios. They remain technically distinct activities: HAZOP is the H&RA, LOPA is SIL determination — combining the workshops is a logistical choice, not a technical one."
      }
    },
    {
      "@type": "Question",
      "name": "Is HAZOP the standard way to conduct an H&RA in the process industry?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "In practice, yes — HAZOP is the de facto H&RA methodology for most process facilities operating under IEC 61511, ISA 84, PSM, and RMP. That said, IEC 61511 does not mandate it; other methodologies such as What-If or QRA are appropriate depending on the project stage and complexity."
      }
    },
    {
      "@type": "Question",
      "name": "Does IEC 61511 require a specific H&RA methodology?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "No — IEC 61511 requires that an appropriate methodology is applied systematically, but does not mandate a specific one. HAZOP is the most common choice in the process sector, but What-If and other approaches are all valid depending on the context."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between inherent risk and residual risk?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Inherent risk is the risk before any protection layers are applied. Residual risk is what remains after all protection layers — including the SIS — have been credited."
      }
    },
    {
      "@type": "Question",
      "name": "Who should be involved in conducting an H&RA?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "An H&RA is a multi-disciplinary team activity requiring input from process, operations, instrumentation, and safety disciplines at a minimum. These studies can be demanding, multi-day affairs, particularly when LOPA is run back-to-back with the HAZOP. IEC 61511 requires that at least one team member holds functional safety competence; in practice, this role is often the Team Leader or HAZOP Leader."
      }
    },
    {
      "@type": "Question",
      "name": "How often does an H&RA need to be revalidated?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "IEC 61511 requires H&RA review whenever changes occur that could affect its validity — process modifications, near misses, new equipment, or regulatory changes. Periodic revalidation is also required as part of the broader safety life-cycle review obligations."
      }
    }
  ]
}
</script>
]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/hazard-and-risk-assessment-hra/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Functional Safety for the Process Industry: 10 Core Concepts Every Engineer Should Know</title>
		<link>https://silsafe.net/functional-safety-for-the-process-industry/</link>
					<comments>https://silsafe.net/functional-safety-for-the-process-industry/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 01:12:24 +0000</pubDate>
				<category><![CDATA[Beginner]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=6100</guid>

					<description><![CDATA[Functional safety in the process industry is more than calculations—it's a lifecycle approach to reducing risk. This guide explains IEC 61511, SIS, SIL, LOPA, and how safety systems actually work in practice, helping engineers move from basic understanding to real-world application.]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Functional Safety at the Most Basic Level</h2>



<p>Functional safety is the engineering discipline concerned with making sure that when something goes wrong, the systems intended to protect people and the environment work when they are needed.</p>



<p>A simple illustration helps.</p>



<p>Over roughly 20 years, I have had about six driver’s side window regulators fail. Annoying, but not dangerous. If the window is used about three times per week, that is roughly 3,120 operations over those 20 years, and six failures corresponds to a probability of failure of about 1.92E-3 per demand. Is that acceptable? For me it was merely annoying and never a safety issue; it was not even a cost risk for the manufacturer, since every failure occurred post-warranty. Now imagine the regulator is a life-safety device: that failure rate would be far too high. Functional safety is the program that takes that failure rate and lowers it by multiple orders of magnitude, to perhaps 1.92E-5. One cannot get there by building a “better” regulator alone; you get there through a layer of controls applied throughout the lifecycle of the regulator.</p>
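<p>The arithmetic is easy to check (the usage pattern is, of course, a rough assumption):</p>

```python
# Checking the window-regulator numbers from the paragraph above.
operations = 3 * 52 * 20            # ~3 uses/week, 52 weeks/yr, 20 years
failures = 6
pfd = failures / operations         # probability of failure per demand
print(operations, round(pfd, 5))    # 3120 0.00192
target = pfd / 100                  # two orders of magnitude lower, ~1.92e-5
```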



<p>Driving failure probabilities down by orders of magnitude is the core driver of functional safety. It is not achieved by a single “better” device. It requires coordinated engineering: design choices, architecture, testing, diagnostics, maintenance, and governance across the life of the system. In the process industry, this structured approach is most commonly defined and implemented through IEC 61511, which provides the framework for how functional safety is applied across the lifecycle of a facility.</p>



<p>Functional safety is implemented through active protection functions that span the full loop—sensor, logic solver, and final element. When a hazardous condition is detected, this chain must act correctly to move the process to a safe state, whether that means closing a valve, stopping a motor, or isolating energy.</p>



<p>Just as important, that performance has to hold over time. Functional safety is not about working once—it is about continuing to meet required performance over years of operation, testing, maintenance, and change.</p>



<h2 class="wp-block-heading">Why Functional Safety is Applied in the Process Industry</h2>



<p>Consider a reactor that begins to over‑pressurize.</p>



<p>If nothing intervenes, the vessel could rupture, leading to injury, environmental release, or significant damage. Operators may not respond in time. Standard control systems can help, but systems designed for normal control are typically not sufficient for safety-critical action when consequences are severe.</p>



<p>A system that fails 1% of the time may be acceptable for control. It is often unacceptable for protection.</p>



<p>Functional safety via IEC 61511 is applied to reduce the probability that the protection fails when it is needed, ensuring the process is moved to a safe state under abnormal conditions.</p>



<h2 class="wp-block-heading">What Is Functional Safety?</h2>



<p>Functional safety is an engineering discipline that ensures safety‑significant functions perform correctly when required. It is not a single device, calculation, or tool. It is a structured approach to designing, implementing, and managing systems so that their performance is sufficiently reliable for the hazards they control.</p>



<p>The core concept to understand is risk. A process will have hazards (such as a tank rupture), and each hazard carries an associated level of risk. Risk is always the combination of probability and severity.</p>



<p>Functional safety contributes to risk reduction by lowering the probability that a hazardous outcome occurs and, in some designs, by limiting its severity.</p>



<h2 class="wp-block-heading">How Functional Safety Reduces Risk</h2>



<p>Every facility has hazards. But before the hazards are evaluated, the facility needs to determine what risks are acceptable. This is an odd concept for people new to process safety, because one has to decide what an acceptable amount of death or injury is. This decision making is documented in a calibrated risk matrix. The next task is to understand the hazards, determine what risks they hold, decide whether those risks are tolerable, and reduce risk wherever it exceeds the tolerable level.</p>



<p>Functional safety provides the method for reducing that risk.</p>



<p>The central metric is the average probability of failure on demand (PFDavg). If a safety function has a PFDavg of 0.01, it will fail about 1% of the time when demanded and succeed about 99% of the time. This corresponds to a risk reduction factor (RRF) of 100, since RRF = 1/PFDavg.</p>



<p>In practical terms, this safety function reduces the probability of a hazardous outcome by a factor of 100. This quantified reduction is the mechanism by which functional safety works.</p>
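<p>The relationship is easy to sanity-check in code. A minimal sketch (the function name is mine, not from any standard or tool):</p>

```python
def rrf(pfd_avg: float) -> float:
    """Risk Reduction Factor: the factor by which the safety function
    divides the frequency of the hazardous outcome."""
    return 1.0 / pfd_avg

print(rrf(0.01))  # 100.0: a PFDavg of 0.01 yields an RRF of 100
```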



<h2 class="wp-block-heading">Standards Governing Functional Safety</h2>



<p>Functional safety is applied across multiple industries, each with standards built on the same underlying principles. Examples include ISO 26262 for automotive systems, IEC 62061 for machinery, and EN 50126/50128/50129 for rail.</p>



<p>At the foundation is IEC 61508, which defines the general framework for functional safety of electrical, electronic, and programmable electronic systems.</p>



<p>IEC 61511 applies these principles to the process industry. It is not a separate concept; it is a sector-specific implementation of the IEC 61508 framework tailored to process facilities.</p>



<p>Note that in my experience as an engineer, almost all codes and standards apply only to a certain area, such as a single country or perhaps the European Union. But functional safety is truly one of the few global standards. In almost all countries, if a process safety approach is implemented, it will be functional safety. This is one of the reasons SIL Safe exists.</p>



<h2 class="wp-block-heading">Hazard and Risk Assessment (H&amp;RA)</h2>



<p>The functional safety lifecycle begins with understanding the hazards in your process. This is done through the hazard and risk assessment (H&amp;RA), which identifies scenarios that could lead to harm, estimates their risk (probability and severity), and determines whether that risk is tolerable. Under IEC 61511, the H&amp;RA is the starting point of the safety lifecycle, and it directly drives the identification of Safety Instrumented Functions (SIFs) and their required performance.</p>



<p>Typical activities in an H&amp;RA include identifying credible scenarios, estimating probability and severity, and deciding where additional protection is required.</p>



<p>Common methods in the process industry include HAZOP and risk matrices. The outcome of this work will establish whether SIFs are needed.</p>



<p>When this occurs in the design process is important. The H&amp;RA can be neither too early nor too late. If done too early, many of the hazards would likely change, not yet exist, or amount to only a partial list. If done too late, you are asking for challenges, as changes may need to be made to existing equipment. Think cutting pipe to add a SIF. That is never fun.</p>



<p>See this full article for <a href="https://silsafe.net/hazard-and-risk-assessment-hra/" data-type="post" data-id="6213">a deeper dive into an H&amp;RA</a>.</p>



<h2 class="wp-block-heading">Independent Protection Layers (IPLs)</h2>



<p>Facilities rely on multiple layers of protection rather than a single safeguard, meaning safeguards should be put in place BEFORE a SIF is considered. With those layers in place, a SIF may not be needed at all, or may be needed only at a lower SIL level.</p>



<p>These layers can include the basic process control system (BPCS), relief devices, operator response, and safety instrumented systems. For layers to count independently, they must not fail for the same reasons. These independent safeguards are called Independent Protection Layers (IPLs).</p>



<p>The key question is whether the existing layers reduce risk to tolerable levels. If they do not, additional protection is required.</p>



<p>This evaluation is often formalized through a Layer of Protection Analysis (LOPA), which builds directly on the H&amp;RA results. It is one of the most common methods used within IEC 61511 programs to determine the required SIL for each SIF. A LOPA asks:</p>



<ul class="wp-block-list">
<li>Is the risk associated with each hazard tolerable? This assesses the risk of each hazard against the calibrated risk matrix.</li>

<li>If not, are there IPLs present, or can they be added? These could be extra pressure relief valves or other instruments in the BPCS.</li>

<li>Does each hazard scenario need a SIF? This compares the risk, with IPLs credited, against the tolerable risk. If the risk is still not tolerable, a SIF is added to mitigate the hazard.</li>

<li>How much risk must each SIF reduce? Generally, this is thought of in orders of magnitude via the calibrated risk matrix. Each box the hazard moves is considered one order of magnitude.</li>

<li>What is the SIL level of each SIF? This is how many orders of magnitude the risk must be reduced to become tolerable.</li>
</ul>
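<p>The order-of-magnitude bookkeeping above can be sketched in a few lines of Python. This is a simplified illustration (the function name and the example frequencies are assumptions of mine, not values from IEC 61511); a real LOPA follows site-specific procedures and IPL credit rules.</p>

```python
def required_rrf(unmitigated_freq, tolerable_freq, ipl_pfds):
    """Risk reduction a SIF must still provide after crediting IPLs.

    unmitigated_freq: hazard frequency with no safeguards (per year)
    tolerable_freq:   tolerable frequency from the calibrated risk matrix
    ipl_pfds:         probability of failure on demand of each credited IPL
    """
    mitigated = unmitigated_freq
    for pfd in ipl_pfds:
        mitigated *= pfd  # each IPL lowers the frequency by its PFD
    return max(mitigated / tolerable_freq, 1.0)  # 1.0 means no SIF is needed

# Example: 0.1/yr initiating event, one IPL with PFD 0.1, target 1E-3/yr
print(required_rrf(0.1, 1e-3, [0.1]))  # ~10, about one order of magnitude
```

<p>The required RRF then maps to a SIL through the standard low-demand table (an RRF of 10 to 100 corresponds to SIL 1, and so on).</p>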



<h2 class="wp-block-heading">Safety Instrumented Systems (SIS)</h2>



<p>When existing protection is insufficient, a Safety Instrumented System (SIS) comes into play. The SIS is the core system addressed by IEC 61511, although the standard also imposes lifecycle requirements, meaning functional safety is not just the SIS.</p>



<p>A SIS is an independent system designed to detect hazardous conditions and move the process to a safe state. It operates separately from the basic process control system (BPCS), which manages normal operation. A typical SIS consists of sensors, a logic solver, and final elements such as shutdown valves or motor isolation devices.&nbsp;&nbsp;</p>



<p>A Safety Instrumented Function (SIF) is a single protection loop within the SIS. A facility may have one SIF or many, depending on the number of scenarios requiring protection.&nbsp; Again, it is the LOPA that dictates where the SIFs are and the SIL needed per SIF.</p>



<p>For example, a simple SIF could be a pressure switch sensing high pressure and shutting off a pump through a contactor, with the logic in a PLC. Another SIF might sense a high level and open a dump valve to prevent an overflow, using the same PLC as the first SIF. This facility would have two SIFs, both using the same PLC, and together they would form the SIS.</p>



<h2 class="wp-block-heading">Safety Integrity Levels (SIL)</h2>



<p>Each SIF must meet the required level of reliability, expressed as a Safety Integrity Level (SIL).</p>



<p>SIL is determined after the H&amp;RA, during the allocation of safety layers, typically in the LOPA (which itself usually takes place directly after the H&amp;RA, often in the same long meeting). In the process industry, most applications are SIL 1 or SIL 2, with occasional SIL 3. SIL 4 is rarely used.</p>



<p>SIL is often associated with equipment ratings, which is technically correct. But more importantly, it defines the required performance of the safety function across design, implementation, and maintenance. For example, higher SIL levels require more redundancy, tighter controls, and stricter rules about software. These details are complicated and best discussed elsewhere.</p>



<p>Below is the standard relationship between SIL, Risk Reduction Factor (RRF), and PFDavg for low demand mode:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td><strong>SIL</strong></td><td><strong>RRF (Risk Reduction Factor)</strong></td><td><strong>PFDavg Range</strong></td><td><strong>Interpretation</strong></td></tr></thead><tbody><tr><td>1</td><td>10 to 100</td><td>1E-1 to 1E-2</td><td>Reduces risk by 1–2 orders of magnitude</td></tr><tr><td>2</td><td>100 to 1,000</td><td>1E-2 to 1E-3</td><td>Reduces risk by 2–3 orders of magnitude</td></tr><tr><td>3</td><td>1,000 to 10,000</td><td>1E-3 to 1E-4</td><td>Reduces risk by 3–4 orders of magnitude</td></tr><tr><td>4</td><td>10,000 to 100,000</td><td>1E-4 to 1E-5</td><td>Rare in process industry; extreme risk reduction</td></tr></tbody></table></figure>
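<p>The table can be expressed as a small lookup, handy for quick checks. This is a sketch only; boundary cases on a real project are settled by the documented verification calculation, not a code snippet.</p>

```python
def sil_band(pfd_avg: float) -> int:
    """Low-demand SIL band for an achieved PFDavg, per the table above."""
    if not (1e-5 <= pfd_avg < 1e-1):
        raise ValueError("PFDavg outside the SIL 1-4 low-demand range")
    if pfd_avg >= 1e-2:
        return 1
    if pfd_avg >= 1e-3:
        return 2
    if pfd_avg >= 1e-4:
        return 3
    return 4

print(sil_band(5e-3))  # 2: a PFDavg of 5E-3 falls in the SIL 2 band
```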



<h2 class="wp-block-heading">Other Design Considerations in Functional Safety</h2>



<p>Designing a SIF involves more than selecting components.&nbsp; There are many other concepts that need to be considered, understood, and decided during the design phase.</p>



<ul class="wp-block-list">
<li>hardware fault tolerance (HFT) &#8211; Reliability requirements</li>



<li>voting architecture &#8211; Such as 1oo1 voting, 2oo3, and others</li>



<li>proof test interval (TI) &#8211; Duration between proof tests.  See <a href="https://silsafe.net/proof-testing-of-sifs/" data-type="post" data-id="33">this in-depth article</a>.</li>



<li>proof test coverage (Cpt) &#8211; How good proof testing is at detecting failures.  See this <a href="https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/" data-type="post" data-id="2784">deeper dive on Cpt</a>.</li>



<li>mean time to restore (MTTR) &#8211; Facility&#8217;s time to repair a problem, relates to available spare parts.  Read more about <a href="https://silsafe.net/mean-time-to-restore-mttr/" data-type="post" data-id="1129">MTTR</a>.</li>



<li>spurious trip rate (STR) &#8211; The balance between safety and facility uptime</li>



<li>diagnostic coverage (DC) &#8211; How good the on-board diagnostics are at detecting failures</li>



<li>common cause failure (beta factor) &#8211; Redundant systems may jointly fail due to a common design flaw.  Learn more about <a href="https://silsafe.net/how-to-apply-beta-factor-for-common-cause-failure/" data-type="post" data-id="3628">CCF in this article</a>.</li>
</ul>



<p>These parameters all interact. There is a push and pull between them, and even between departments in a facility. Decisions about testing, architecture, spare parts, and maintenance can materially change the achieved performance via PFDavg. Each topic is substantial and better explored individually.</p>



<p>These concepts receive significant attention in functional safety engineering and certification exams.&nbsp; These will also take significant effort to decide and work through during the detailed design phase of the SIS.</p>



<h2 class="wp-block-heading">Verifying Safety Instrumented Function Performance (PFDavg)</h2>



<p>Once a SIF is designed, IEC 61511 requires that each SIF be verified against its SIL requirement. PFDavg calculations quantify whether the design meets the target. They account for the failure rates of the SIF components, architecture, testing intervals, coverage, and repair assumptions.</p>



<p>The process of doing the calculations can be complicated. There are simple equations and complex ones. Think of it this way: a specific PFDavg calculation stems from a series of engineering decisions. For example, whether the SIF is tested only when the unit is shut down versus bypassed impacts PFDavg in different ways. Generally, practitioners use dedicated software or Excel to facilitate the calculations.</p>



<p>There is a related approach called Markov Analysis which we will not get into here as it is a more advanced approach.</p>



<p>While central, PFDavg is only one step in a broader discipline. It verifies performance; it does not define the entire program.&nbsp; Inexperienced users may think PFDavg is all functional safety is about.&nbsp; But that is an over-simplification.</p>



<p>The simplest PFDavg equation is shown below.&nbsp; This is for a 1oo1 architecture.&nbsp;</p>



<figure class="wp-block-image size-full"><img decoding="async" width="212" height="58" src="https://silsafe.net/wp-content/uploads/2025/04/pfdavg-basic-equation.png" alt="PFDavg basic equation" class="wp-image-188"/></figure>



<ul class="wp-block-list">
<li>TI &#8211; proof test interval</li>



<li>λ<sub>DU</sub> &#8211; the dangerous undetected failure rate.</li>
</ul>
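<p>As a quick numeric check of the equation, here is the 1oo1 calculation with assumed (illustrative) values rather than numbers from any real device:</p>

```python
lambda_du = 2e-7   # dangerous undetected failure rate per hour (assumed)
ti = 8760          # proof test interval in hours (annual proof test)

pfd_avg = lambda_du * ti / 2   # PFDavg = lambda_DU * TI / 2 for 1oo1
print(f"PFDavg = {pfd_avg:.2e}")  # PFDavg = 8.76e-04
```

<p>Note how directly the proof test interval drives the result: for a 1oo1 architecture, halving TI roughly halves PFDavg.</p>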



<h2 class="wp-block-heading">The Functional Safety Life-Cycle</h2>



<p>Functional safety is governed by a structured life-cycle, and this is one of the most important concepts to understand. The structure of this lifecycle is defined in IEC 61511, and following it is what distinguishes a complete functional safety program from isolated design efforts.&nbsp; Many engineers initially approach functional safety as a design activity, but that is only one portion of the overall process.</p>



<p>The life-cycle defines how safety is managed from the earliest concept of a facility through to its eventual decommissioning. It ensures that safety functions are not only designed correctly, but also installed, operated, maintained, and periodically assessed in a consistent and auditable manner.</p>



<p>Typical phases include:</p>



<ul class="wp-block-list">
<li>Hazard and risk assessment (H&amp;RA) such as a HAZOP</li>



<li>Allocation of safety functions (such as via a LOPA)</li>



<li>Design and engineering &#8211; detailed design of the SIFs and SIS</li>



<li>Verification and validation</li>



<li>Operation and maintenance</li>



<li>Functional safety assessment (FSA)</li>



<li>Decommissioning</li>
</ul>



<p>Each of these phases has specific deliverables and expectations. For example, the design phase may define the architecture of a SIF, but the operation and maintenance phase ensures that proof testing is performed and that failures are addressed correctly.</p>



<p>The key point is continuity. Functional safety is not a one-time effort. A system that is properly designed but poorly maintained will not achieve its required performance over time.</p>



<p>An all-too-common scenario is that a compliant SIS is installed in a facility. The company is then bought by a firm in an adjacent industry that has never implemented, nor understands, IEC 61511. Over time, budgets can be cut, or key people can move to other departments or companies, or retire. In this scenario, various things could happen that impact the SIS design. For example, proof tests might be done at the wrong intervals or not in accordance with the requirements. Components could be replaced with non-SIL-qualified versions, or the competency and training program could weaken. All of this is why the functional safety lifecycle is so important.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://silsafe.net/wp-content/uploads/2025/05/SIS-safety-lifecycle.webp"><img fetchpriority="high" decoding="async" width="757" height="981" src="https://silsafe.net/wp-content/uploads/2025/05/SIS-safety-lifecycle.webp" alt="SIS Safety Lifecycle. The very important diagram from IEC 61511 figure 7 which overlays the entire Functional Safety Process" class="wp-image-760" style="width:354px;height:auto" srcset="https://silsafe.net/wp-content/uploads/2025/05/SIS-safety-lifecycle.webp 757w, https://silsafe.net/wp-content/uploads/2025/05/SIS-safety-lifecycle-231x300.webp 231w" sizes="(max-width: 757px) 100vw, 757px" /></a></figure>
</div>


<h2 class="wp-block-heading">Regulatory Context</h2>



<p>In the United States, OSHA&#8217;s Process Safety Management (PSM) and the EPA&#8217;s Risk Management Program (RMP) are the two core regulations. They are triggered when a certain quantity of specified materials is on site (the threshold quantity). These regulations do not prescribe exact methods for managing risk; instead, they require that facilities follow sound engineering practices. IEC 61511 is not listed by name in the Code of Federal Regulations (CFR), but both OSHA and EPA have stated in writing that using the standard is a sufficient and preferred way to meet the regulations. In other words, it is considered RAGAGEP (recognized and generally accepted good engineering practice).</p>



<p>Therefore, IEC 61511 is often used to demonstrate that a facility is meeting those regulations. It provides a structured and well-understood approach that regulators, auditors, and engineers recognize.&nbsp; Other countries have similar laws and regulations requiring the standard.</p>



<p>In practice, this means that even though IEC 61511 is not mandatory by law, it is frequently treated as if it were, because it defines what &#8220;good&#8221; looks like for functional safety in the process industry.</p>



<p>In addition to RMP and PSM, other things can trigger the use of IEC 61511, such as contracts between parties or insurance requirements. At times, projects will not invoke IEC 61511 in full but will require that certain instruments or automated valves maintain a SIL certification. These SIL-only requirements have become more common as SIL-rated components become more widely available. SIL Safe welcomes this change in the industry.</p>



<h2 class="wp-block-heading">Common Misconceptions About Functional Safety</h2>



<p>Functional safety via IEC 61511 is often misunderstood, particularly by those who are new to the discipline or who have only been exposed to portions of it.</p>



<p>One common misconception is that functional safety is primarily about calculations, especially PFDavg. While calculations are important, they are only one part of the overall process. Without proper hazard assessment, design, testing, and maintenance, calculations alone do not ensure safety.</p>



<p>Another misconception is that functional safety is only relevant during design. In reality, long-term performance depends heavily on proof testing, maintenance practices, and how changes are managed over time.</p>



<p>A third misconception is that higher SIL automatically means a better system. In practice, SIL is determined by the risk of the hazard along with the IPLs that exist. A higher SIL requirement often indicates a more severe hazard rather than a superior design choice.</p>



<p>Understanding these misconceptions is important because they often lead to incomplete or ineffective implementations of functional safety programs.</p>



<h2 class="wp-block-heading">Why Functional Safety Programs Matter — and When Expertise Is Needed</h2>



<p>Functional safety programs provide a structured way to manage risk across a facility. At a high level, they help ensure that hazards are identified, risks are evaluated, and appropriate protections are implemented and maintained over time.</p>



<p>Effective programs reduce the probability and severity of major accidents, support regulatory compliance, and provide confidence that safety systems will perform when required.</p>



<p>In practice, many organizations require additional expertise at certain points, such as:</p>



<ul class="wp-block-list">
<li>Implementing IEC 61511 for the first time</li>



<li>Performing SIL determination or verification</li>



<li>Preparing for and conducting functional safety assessments</li>



<li>Modifying or upgrading existing SIS implementations</li>
</ul>



<p>These situations often involve complex decisions, tradeoffs, and documentation requirements that benefit from experienced practitioners.</p>



<h2 class="wp-block-heading">Q&amp;A Section</h2>



<ol start="1" class="wp-block-list">
<li>What is functional safety in simple terms?</li>
</ol>



<p>Functional safety is the part of overall facility safety that depends on safety functions operating correctly when required. It focuses on ensuring that protection systems perform reliably enough to reduce risk to tolerable levels. It can take a typical safety system failure probability of perhaps 0.01 (1%) down by multiple orders of magnitude. It does this through layered requirements applied throughout the lifecycle of the system.</p>



<p>Functional Safety is applied to various industries.&nbsp; SIL Safe focuses on its application to the process industry via IEC 61511.</p>



<ol start="2" class="wp-block-list">
<li>What is the difference between a SIF and a SIS?</li>
</ol>



<p>A Safety Instrumented System (SIS) is the overall system that performs safety functions. A Safety Instrumented Function (SIF) is a single protection loop within that system, typically consisting of a sensor, logic solver (think a PLC), and final element (like a valve or contactor).</p>



<ol start="3" class="wp-block-list">
<li>If I have hazardous scenarios, why can&#8217;t I just add an extra instrument?</li>
</ol>



<p>Adding an extra instrument and connecting that to your BPCS does not necessarily reduce risk enough. Risk reduction must be quantified, and the resulting protection must meet the required performance level. Without it, the hazard may still exceed tolerable risk.</p>



<p>For example, at times a SIF will need a 2oo3 architecture (meaning three instruments at one measurement point). One would not know this was needed unless the process was followed and a PFDavg was calculated.</p>



<ol start="4" class="wp-block-list">
<li>I&#8217;ve worked on projects where SIL 2 instruments were specified, but the facility was not doing functional safety in its entirety. What is happening there?</li>
</ol>



<p>Some projects contractually require certain instruments or &#8220;safety instruments&#8221; to be SIL rated (for example SIL 2 transmitters or valves). This does not mean that the facility is implementing the full functional safety lifecycle. In many cases, companies attempt a compromise where equipment meets SIL capability requirements even if the broader IEC 61511 functional safety program is not fully implemented.</p>



<ol start="5" class="wp-block-list">
<li>What standards govern functional safety?</li>
</ol>



<p>The foundational standard is IEC 61508, which defines the general framework for functional safety of electrical, electronic, and programmable electronic systems. For the process industry specifically, IEC 61511 defines how those principles are applied to Safety Instrumented Systems.</p>



<ol start="6" class="wp-block-list">
<li>What about machinery safety?</li>
</ol>



<p>Machinery safety is important, of course, but it is distinct. Functional safety for the process industry focuses on reducing the risk (probability and severity) of a major accident. Machinery safety focuses on the user of the machine.</p>



<p>However, as SIL ratings become more common, machinery safety risk assessments now often require a SIL-rated instrument. SIL Safe fully supports this excellent use of SIL instruments. However, this should not be construed as functional safety.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Functional safety via IEC 61511 combines hazard analysis, engineered protection systems, and structured life-cycle management to reduce the probability and severity of hazardous events.</p>



<p>It is not a single calculation or device, but a coordinated engineering approach that spans the entire life of a facility.</p>



<p>For engineers working in the process industry, understanding these concepts is essential to designing and operating safe systems.</p>



<h2 class="wp-block-heading">Call to Action</h2>



<p>If your facility is implementing or improving a functional safety program, expert guidance can make the process significantly more effective.</p>



<p>Contact SIL Safe to discuss consulting services for IEC 61511 programs, SIS design, and functional safety assessments.</p>



<h2 class="wp-block-heading">Additional Resources:</h2>



<ul class="wp-block-list">
<li><a href="https://webstore.iec.ch/en/publication/24241" target="_blank" rel="noopener">IEC 61511 Official Standard</a></li>



<li><a href="https://www.isa.org" target="_blank" rel="noopener">International Society of Automation (ISA)</a></li>



<li><a href="https://www.hse.gov.uk" target="_blank" rel="noopener">UK Health and Safety Executive (HSE)</a></li>



<li><a href="https://www.aiche.org/ccps" target="_blank" rel="noopener">CCPS Guidelines</a></li>
</ul>



<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is functional safety in simple terms?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Functional safety is the part of overall facility safety that depends on safety functions operating correctly when required. It focuses on ensuring that protection systems perform reliably enough to reduce risk to tolerable levels. It has the ability to take a typical failure rate of a safety system of perhaps 0.01 (1%) down multiple orders of magnitude. It does this through a layer of requirements throughout the lifecycle of the system. Functional safety is applied to various industries. SIL Safe focuses on its application to the process industry via IEC 61511."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between a SIF and a SIS?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A Safety Instrumented System (SIS) is the overall system that performs safety functions. A Safety Instrumented Function (SIF) is a single protection loop within that system, typically consisting of a sensor, logic solver (think a PLC), and final element (like a valve or contactor)."
      }
    },
    {
      "@type": "Question",
      "name": "If I have hazardous scenarios, why can't I just add an extra instrument?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Adding an extra instrument and connecting that to your BPCS does not necessarily reduce risk enough. Risk reduction must be quantified, and the resulting protection must meet the required performance level. Without it, the hazard may still exceed tolerable risk. For example, at times a SIF will have to have an architecture of 2oo3, meaning three instruments at one point. One would not know that it was needed unless the process was followed and a PFDavg was calculated."
      }
    },
    {
      "@type": "Question",
      "name": "I've worked on projects where SIL 2 instruments were specified, but the facility was not doing functional safety in its entirety. What is happening there?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Some projects contractually require certain instruments or safety instruments to be SIL rated, for example SIL 2 transmitters or valves. This does not mean that the facility is implementing the full functional safety lifecycle. In many cases, companies attempt a compromise where equipment meets SIL capability requirements even if the broader IEC 61511 functional safety program is not fully implemented."
      }
    },
    {
      "@type": "Question",
      "name": "What standards govern functional safety?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The foundational standard is IEC 61508, which defines the general framework for functional safety of electrical, electronic, and programmable electronic systems. For the process industry specifically, IEC 61511 defines how those principles are applied to Safety Instrumented Systems."
      }
    },
    {
      "@type": "Question",
      "name": "What about machinery safety?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Machinery safety is important, of course, but it is distinct. Functional safety for the process industry is focused on reducing the risk, meaning probability and severity, of a major accident. Machinery safety focuses on the user of the machine. However, as SIL ratings are more common, machinery safety risk assessment will often require a SIL rated instrument. SIL Safe fully supports this excellent use of SIL instruments. However, this should not be construed as functional safety."
      }
    }
  ]
}
</script>



]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/functional-safety-for-the-process-industry/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Failure Rates in Functional Safety: A Practical Guide for Working Engineers</title>
		<link>https://silsafe.net/failure-rates-in-functional-safety/</link>
					<comments>https://silsafe.net/failure-rates-in-functional-safety/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Sun, 07 Dec 2025 23:07:26 +0000</pubDate>
				<category><![CDATA[Beginner]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=4221</guid>

					<description><![CDATA[Failure rates are central to SIL verification, device selection, diagnostics, and proof testing. This guide explains λDU, λDD, λSU, and λSD, where the values come from, and how functional safety engineers use them correctly under IEC 61511.]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading" id="introduction-why-failure-rates-matter-in-functional-safety">Introduction: Why Failure Rates Matter in Functional Safety</h2>



<p>Failure rates sit at the center of how we evaluate, verify, and maintain safety instrumented systems (SIS) under IEC 61511. They show up in SIL verification, PFDavg and STR calculations, equipment selection, and proof test strategy. If you understand the failure‑rate categories and how to obtain them correctly, you can avoid many of the mistakes that derail SIL verification or misrepresent SIS performance.</p>



<p>This article explains the four failure‑rate categories, where the values come from, how to interpret them, and how a functional safety engineer uses them in practice.</p>



<h2 class="wp-block-heading" id="the-big-picture-what-we-mean-by-a-failure-rate">The Big Picture: What We Mean by a &#8220;Failure Rate&#8221;</h2>



<p>In functional safety, <em>failure rate</em> is represented by <strong>λ (lambda)</strong>, typically shown in units of 1/hour or in FITs (failures per 1E9 hours). It represents the frequency of <em>random hardware failures</em>—the only failures that can be mathematically modeled.</p>



<h3 class="wp-block-heading" id="random-vs-systematic-failures">Random vs. Systematic Failures</h3>



<p>Random hardware failures are the only failures that can be described by a rate. Systematic failures absolutely matter, but because they arise from design or process weaknesses, they cannot be represented by λ. You must manage them through quality processes and functional safety management—not statistics.</p>



<h3 class="wp-block-heading" id="constant-failure-rate-assumption">Constant Failure Rate Assumption</h3>



<p>IEC 61511 modeling assumes a <em>constant</em> failure rate. Real‑world devices follow a classic bathtub curve: higher failures early in life (infant mortality), a long flat useful‑life period, and then increasing failures late in life. Failure‑rate data used for SIL verification assumes you are in that useful‑life region.</p>



<h3 class="wp-block-heading" id="statistical-nature-of-λ-values">Statistical Nature of λ Values</h3>



<p>Certification bodies and data handbooks treat λ values as statistical estimates with confidence bounds. The SIL certificate condenses this into a single number, but every published λ carries uncertainty. Most of that statistical work, however, stays behind the scenes for functional safety engineers.</p>



<h2 class="wp-block-heading" id="the-four-failure-rate-categories-in-functional-safety">The Four Failure Rate Categories in Functional Safety</h2>



<p>Failure modes in functional safety fall into four buckets based on whether the failure is safe or dangerous, and whether it is detected or undetected by diagnostics:</p>



<ul class="wp-block-list">
<li><strong>λSD – Safe Detected</strong></li>



<li><strong>λSU – Safe Undetected</strong></li>



<li><strong>λDD – Dangerous Detected</strong></li>



<li><strong>λDU – Dangerous Undetected</strong></li>
</ul>



<p>These categories determine whether the failure affects safety, reliability, or uptime—and how it appears in SIL and STR calculations.&nbsp; Note that &#8220;detected&#8221; means detected by diagnostics, not by proof tests.</p>



<h2 class="wp-block-heading" id="how-the-four-failure-rates-feed-functional-safety-calculations">How the Four Failure Rates Feed Functional Safety Calculations</h2>



<p><strong>λDU</strong> is the largest driver of safety risk. It represents failures that prevent the SIF from acting and are <em>not</em> discovered by diagnostics. This value always enters <strong>PFDavg</strong>. <strong>λDD</strong> may also enter PFDavg if the detected failure only raises a notification (does not trip the SIF).</p>



<p><strong>λSU</strong> always contributes to the <strong>spurious trip rate (STR)</strong>. <strong>λDD and λSD</strong> contribute to <strong>STR</strong> if the control logic forces a safe‑state action when diagnostics detect a failure. See <a href="https://silsafe.net/spurious-trip-rate-explained/" data-type="post" data-id="3315">this article on STR</a> for more background.</p>



<p><strong>λSU and λSD</strong> influence reliability and uptime but do not affect PFDavg.</p>



<p>Finally, proof tests exist specifically to reveal <em>undetected dangerous failures</em>—the λDU portion. Some engineers misunderstand this and assume proof tests simply “detect failures,” but in functional safety terms, proof tests are how you manage the DU accumulation.&nbsp; See this article on <a href="https://silsafe.net/proof-testing-of-sifs/" data-type="post" data-id="33">proof testing</a>.</p>
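<p>The idea of DU accumulation between proof tests can be sketched numerically. This is a simplified 1oo1 illustration with assumed values, not a full IEC 61511 calculation:</p>

```python
import math

# Sketch: how undetected dangerous failures accumulate between proof tests.
# lam_du and TI are assumed example values for a single (1oo1) element.
lam_du = 2e-6      # dangerous undetected failure rate, per hour
TI = 8760          # proof test interval, hours (1 year)

# Probability a DU failure is present just before the proof test
# (exponential model with constant failure rate):
p_end = 1 - math.exp(-lam_du * TI)      # ~0.0174

# Averaged over the interval, the familiar simplified approximation:
pfd_avg = lam_du * TI / 2               # 8.76e-3

print(p_end, pfd_avg)
```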



<h2 class="wp-block-heading" id="where-failure-rates-come-from">Where Failure Rates Come From</h2>



<h3 class="wp-block-heading" id="where-the-functional-safety-engineer-actually-gets-failure-rates">Where the Functional Safety Engineer Actually Gets Failure Rates</h3>



<p>In real SIS design work, most failure‑rate data comes from <strong>certified products</strong>, where a certification body (CB) has already performed a detailed IEC 61508 assessment. The FS engineer reads the <strong>SIL certificate</strong> and the <strong>Safety Manual</strong>, which provide the extracted λDU, λDD, λSU, and λSD values. These two documents are the authoritative sources for day‑to‑day engineering. The underlying FMEDA exists, but it is not normally reviewed or needed by the practitioner.</p>



<p>When a certified device is not available, several alternate data routes exist. Each route has tradeoffs and requires engineering judgment:</p>



<ul class="wp-block-list">
<li><strong>Manufacturer‑supplied reliability data</strong> – useful when transparent and well‑supported, but assumptions must be confirmed.</li>



<li><strong>Validated site or company datasets</strong> – often the most realistic if maintenance and failure tracking are strong.</li>



<li><strong>User‑generated field data</strong> – applicable for legacy equipment with a long operating history.</li>



<li><strong>Industry sources such as OREDA</strong> – helpful when carefully matched to device type, service, and environment.</li>
</ul>



<p>These alternatives are less typical in functional safety practice, and they require more scrutiny than certified data.</p>



<h3 class="wp-block-heading" id="how-failure-rates-are-determined-typical-scenario">How Failure Rates Are Determined (Typical Scenario)</h3>



<p>For certified devices, IEC 61508 defines the process for establishing failure rates. Behind the scenes, the CB reviews or performs:</p>



<ul class="wp-block-list">
<li>FMEDA (failure‑mode analysis and diagnostic modeling)</li>



<li>Test campaigns and empirical validation</li>



<li>Diagnostic behavior evaluation</li>



<li>Environmental and installation assumption checks</li>
</ul>



<p>The FS engineer does <strong>not</strong> redo this work. Instead, their responsibility is to:</p>



<ul class="wp-block-list">
<li>Use the published λ values correctly</li>



<li>Ensure the application matches the assumptions in the Safety Manual</li>



<li>Integrate diagnostics the way the certification expects</li>
</ul>



<p>This is where many real‑world errors occur—not because the values are wrong, but because the application does not match the assumptions behind them.</p>



<h2 class="wp-block-heading" id="broader-reliability-concepts">Broader Reliability Concepts</h2>



<p>Failure rates are not standalone constants; they are shaped by reliability principles that sit behind the λ numbers. A functional safety engineer must understand these broader ideas to avoid misusing published data.</p>



<p><strong>Systematic failures are not described by λ values.</strong> Random hardware failures can be modeled with rates; systematic failures cannot. They arise from design issues, configuration errors, software defects, or procedure gaps. They must be controlled through functional safety management (such as what IEC 61508 does), not reliability math.</p>



<p><strong>Failure‑rate uncertainty is always present.</strong> λ values are statistical estimates derived from limited testing, modeling, or field data. Certification bodies select a representative value for the SIL certificate, but there is natural variability behind every λ. The published number is not a perfect truth—it is a useful engineering approximation.</p>



<p><strong>Application and environment can change the true failure rate.</strong> A device used in corrosive service, high vibration, or aggressive cycling may experience a higher effective λ than the certified value. Likewise, poor installation, improper mounting, or low‑quality air supply (for valves) can shift failure behavior. The published λ applies only when the Safety Manual conditions are met.</p>



<p><strong>The Safety Manual controls the validity of the data.</strong> A λ value is only valid if the equipment is installed, wired, maintained, and operated according to the Safety Manual. If diagnostics are not used, if limits are exceeded, or if maintenance intervals differ from expectations, the certified failure rates no longer describe the real system.</p>



<h2 class="wp-block-heading" id="automated-valve-assemblies-and-how-their-failure-rates-combine">Automated Valve Assemblies and How Their Failure Rates Combine</h2>



<p>Automated valves used as final elements are not single devices—they are assemblies made of several components, each with its own failure behavior. A functional safety engineer must gather λ values for <strong>each</strong> sub‑component and understand how they combine to represent the full final element.</p>



<p>Typical valve‑assembly components include the valve body, actuator, solenoid, positioner, and any boosters or air relays. Each component contributes its own λDU, λDD, λSU, and λSD. Because the SIF fails if <strong>any one</strong> of these components cannot perform its intended action, the failure rates are combined using <strong>Boolean OR logic</strong>:</p>



<p><strong>λ_total ≈ λ₁ + λ₂ + λ₃ + …</strong></p>
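<p>The OR-logic combination amounts to a simple sum for small rates. A minimal sketch, using hypothetical per-hour λDU values rather than certified data:</p>

```python
# Sketch: combining sub-component lambda_DU values for a valve assembly.
# The assembly fails dangerously if ANY component fails dangerously (OR logic),
# so for small rates the totals simply add. All values are hypothetical (per hour).
components_lam_du = {
    "valve body": 8e-7,
    "actuator":   6e-7,
    "solenoid":   3e-7,
    "positioner": 1e-7,
}

lam_du_total = sum(components_lam_du.values())   # 1.8e-6 per hour
print(lam_du_total)
```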



<p>In practice, most λDU comes from mechanical components such as the actuator and valve body. These parts typically lack meaningful diagnostics, so λDU remains the dominant contributor for the final element. Electronic components—like smart positioners—may convert some λDU into λDD by improving diagnostic coverage, but they seldom reduce λDU in a significant way.</p>



<p>Some manufacturers have begun certifying complete valve assemblies under IEC 61508. When available, this simplifies the engineer’s task: the assembly‑level λ values are already validated and consolidated under a single device boundary. See this example from Emerson: <a href="https://www.emerson.com/en-us/automation/valves/controlvalves/digital-isolation-solutions" target="_blank" rel="noopener">https://www.emerson.com/en-us/automation/valves/controlvalves/digital-isolation-solutions</a>. We at SIL Safe expect and hope this trend continues.</p>



<h2 class="wp-block-heading" id="practical-examples">Practical Examples</h2>



<h3 class="wp-block-heading" id="sensor-example-how-the-engineer-obtains-λ-values">Sensor Example: How the Engineer Obtains λ Values</h3>



<p>A functional safety engineer begins by locating the λDU, λDD, λSU, and λSD values published in the device’s SIL certificate or Safety Manual. These documents reflect the certification body’s IEC 61508 assessment and define how the device behaves under expected diagnostic, installation, and environmental conditions. The engineer then confirms that the plant’s SIS logic and wiring actually use the diagnostic features assumed in the certification. Once these steps are complete, the engineer has the correct λ values that will be applied in later PFDavg or STR calculations.</p>



<h3 class="wp-block-heading" id="final-element-example-how-the-engineer-obtains-λ-values">Final Element Example: How the Engineer Obtains λ Values</h3>



<p>For an automated valve assembly, the process is more involved because a final element is made of multiple components that must all function correctly. The engineer identifies each sub-component—such as the valve body, actuator, solenoid, positioner, and boosters—and retrieves λ values from each component’s SIL certificate or Safety Manual. The installation and diagnostic assumptions must match the application for the values to be valid. Because a failure in any single sub-component prevents the valve from performing its safety function, the engineer combines the λ values using OR logic to produce the total assembly failure rate. This assembled λ dataset will be used in downstream PFDavg and STR calculations.</p>



<h2 class="wp-block-heading" id="common-mistakes-engineers-make-with-failure-rates">Common Mistakes Engineers Make with Failure Rates</h2>



<p>Even experienced engineers can misapply failure‑rate data if the context behind the numbers is not fully understood. A few issues show up repeatedly in real SIS design and verification work.</p>



<p><strong>Misinterpreting λDU vs. λDD.</strong> These two values behave very differently. λDU always goes into PFDavg because it represents failures that diagnostics cannot find. λDD may or may not impact STR or PFDavg depending on how diagnostics are integrated. Treating DD like DU—or assuming DD never matters—produces incorrect verification results.</p>



<p><strong>Using generic values without validating assumptions.</strong> Generic data tables, old spreadsheets, or handbook values can be misleading if the assumptions behind them do not match your application. Certified values come with defined conditions; generic values usually do not.</p>



<p><strong>Ignoring diagnostics.</strong> Sometimes diagnostics exist on the device but do not make it into the SIS logic or maintenance workflow. If a diagnostic bit is unwired, unmapped, filtered out, or simply ignored in operations, detected dangerous failures behave like undetected failures. In this case, λDD effectively becomes λDU.</p>



<p><strong>Treating λ values as universal constants.</strong> A λ value from a certificate is not automatically valid everywhere. Installation, environment, cycling, mounting, and maintenance determine whether the published λ truly reflects your plant’s conditions. Failure rates must be applied with engineering judgment, not copied blindly.</p>



<h2 class="wp-block-heading" id="when-diagnostics-exist-but-are-not-used">When Diagnostics Exist but Are Not Used</h2>



<p>Diagnostics only add value when the SIS actually acts on them. A device may have excellent internal diagnostics, but if they are not used by the controls, the failure behaves as if it were <em>undetected</em>. In this situation, λDD is effectively added to λDU for purposes of SIL verification because the SIF remains impaired until someone actively responds. This can significantly worsen PFDavg.</p>



<p>This scenario is more common in brownfield facilities, older installations, poorly integrated SIS/BPCS architectures, or sites where diagnostics alarm but no work process exists to ensure timely repair. The lesson for the engineer is simple: <strong>diagnostics only help if the entire chain—from device to logic to maintenance—uses them correctly.</strong></p>



<h2 class="wp-block-heading" id="where-failure-rates-influence-sis-design-decisions">Where Failure Rates Influence SIS Design Decisions</h2>



<p>Failure‑rate data influences several real‑world engineering choices throughout the SIS life‑cycle. Understanding how λ values behave helps the engineer select architectures, manage proof‑test strategies, and apply diagnostics intentionally—not blindly.</p>



<p><strong>Architecture selection.</strong> If λDU is high, additional redundancy may be required to achieve the target SIL. Failure‑rate data helps determine whether 1oo1, 1oo2, or 2oo3 architectures are appropriate for the SIF.</p>



<p><strong>Choosing the proof‑test interval (TI).</strong> Proof tests exist to reveal λDU—the part diagnostics cannot see. A higher λDU or lower proof‑test coverage (Cpt) typically requires a shorter TI. Failure‑rate data directly shapes the proof‑test strategy.&nbsp; See <a href="https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/" data-type="post" data-id="2784">this other article about CPT.</a></p>
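<p>Under the simple 1oo1 approximation PFDavg ≈ λDU × TI / 2, the longest acceptable interval for a given PFDavg budget can be sketched as below. This assumes perfect proof-test coverage, and both input values are hypothetical:</p>

```python
# Sketch: longest proof-test interval for a PFDavg budget, using the
# simplified 1oo1 relation PFDavg ~ lam_du * TI / 2. Perfect proof-test
# coverage is assumed; both inputs are hypothetical.
lam_du = 2e-6        # per hour
pfd_budget = 5e-3    # PFDavg allocated to this element

ti_max_hours = 2 * pfd_budget / lam_du   # 5000 hours
ti_max_years = ti_max_hours / 8760       # ~0.57 years

print(ti_max_hours, ti_max_years)
```

<p>A higher λDU or a smaller budget shrinks the interval proportionally, which is exactly why failure-rate data shapes proof-test strategy.</p>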



<p><strong>Partial‑stroke testing for final elements.</strong> For valves that dominate λDU, partial‑stroke testing may reduce the exposure time of dangerous failures. This decision depends on understanding which failure modes are found by diagnostics versus proof tests.</p>



<p><strong>Diagnostic selection and integration.</strong> λDD and λSD only help if diagnostics are wired, mapped, and acted on. Understanding the diagnostic coverage of a device and the assumptions in the Safety Manual helps engineers design logic and maintenance workflows that truly reduce risk.</p>



<h2 class="wp-block-heading" id="summary-a-practical-way-to-think-about-failure-rates">Summary: A Practical Way to Think About Failure Rates</h2>



<p>Failure rates are the foundation of how we model and manage random hardware failures in functional safety. λDU represents the portion of failures that silently erode the ability of a SIF to act when needed. λDD and λSD describe failures that diagnostics can reveal, informing how often a SIF may trip unnecessarily and how reliably it stays available. λSU affects reliability and uptime but does not influence safety risk directly.</p>



<p>These four failure‑rate categories show up throughout the IEC 61511 safety life‑cycle: in equipment selection, architectural decisions, proof‑test strategy, diagnostic design, and SIL verification. If the failure‑rate assumptions in the SIL certificate and Safety Manual are respected—and if diagnostics are used correctly—then λ values become powerful tools for designing and maintaining a dependable SIS.</p>



<h2 class="wp-block-heading" id="more-help">More Help</h2>



<p>For more help applying failure‑rate data correctly—or for third‑party SIS verification—reach out through SIL Safe’s contact page.</p>



<ul class="wp-block-list">
<li>Explore the SIL Safe <a href="https://silsafe.net/functional-safety-glossary/" data-type="page" data-id="391">glossary </a>for clear explanations of related terms.</li>



<li>See <a href="https://www.isa.org/intech-home/2020/may-june/departments/updated-guidelines-for-using-isa-iec-61511-funct" target="_blank" rel="noopener">ISA&#8217;s main guidelines</a></li>



<li>Fully <a href="https://www.emerson.com/en-us/automation/valves/controlvalves/digital-isolation-solutions" target="_blank" rel="noopener">certified automated valves</a></li>



<li>Deeper blog article on <a href="https://silsafe.net/proof-testing-of-sifs/" data-type="post" data-id="33">proof testing</a></li>
</ul>



<h2 class="wp-block-heading" id="q-a-section">Q&amp;A Section</h2>



<ul class="wp-block-list">
<li><strong>What’s the practical difference between λDU and λDD?</strong><br>λDU drives PFDavg because the failure is both dangerous and <em>undetected</em>. λDD is dangerous but <em>detected</em>, so it typically results in an alarm or forced-safe trip and may contribute to STR depending on configuration.</li>



<li><strong>Do failure rates change over time in my facility?</strong><br>Yes. The published λ values assume controlled conditions; real‑world factors like environment, cycling, installation quality, and maintenance can shift the true failure rate up or down. For modeling purposes, though, we assume a constant rate.</li>



<li><strong>Why do certified products help for accurate λ values?</strong><br>Certified devices provide validated λ values and clearly defined DU/DD/SU/SD splits under IEC 61508. This reduces interpretation errors and ensures the assumptions behind the numbers are understood and controlled.</li>



<li><strong>Is it okay to use generic failure rates?</strong><br>Only if you confirm that they match your device and application. Generic values may not reflect your environment, proof-test strategy, or diagnostic coverage.</li>



<li><strong>What if the device doesn’t have a SIL certificate?</strong><br>You can still use it, but you must rely on credible manufacturer reliability data, validated site history, or reputable sources such as OREDA. These represent the alternate data routes when a certified data path is not available. Assumptions must match the application, and justification must be documented.</li>



<li><strong>Why do final elements almost always have higher λDU than sensors?</strong><br>Most λDU in a final element comes from mechanical components like the actuator and valve body, which lack strong diagnostics. Sensors generally have better diagnostic coverage and fewer mechanical wear points, so their λDU values are typically much lower.</li>



<li><strong>I understand why λDU impacts PFDavg, but why would λDD impact PFDavg?</strong><br>It depends on how the controls and diagnostics are set up. In many low-demand configurations, only λDU enters PFDavg. But if a dangerous detected failure leaves the SIF unable to perform its function—and the system does not act on or repair that diagnostic—then that portion of λDD effectively behaves like λDU and may need to be included in the PFDavg analysis.</li>



<li><strong>How are λ values used differently in high-demand or continuous modes?</strong><br>Failure rate is highly relevant but used differently. In low-demand mode (most common), we use λDU to calculate <strong>PFDavg</strong>. In continuous or high-demand modes, <strong>PFDavg is not used</strong>. Instead, SIL is based on <strong>PFH</strong> — the rate of dangerous failure per hour. In these cases, λDU is used to calculate PFH through a different series of equations than PFDavg.</li>
</ul>



<!-- FAQPage Schema (JSON-LD) — Verbatim conversion -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What’s the practical difference between λDU and λDD?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "λDU drives PFDavg because the failure is both dangerous and undetected. λDD is dangerous but detected, so it typically results in an alarm or forced-safe trip and may contribute to STR depending on configuration."
      }
    },
    {
      "@type": "Question",
      "name": "Do failure rates change over time in my facility?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes. The published λ values assume controlled conditions; real-world factors like environment, cycling, installation quality, and maintenance can shift the true failure rate up or down. For modeling purposes, though, we assume a constant rate."
      }
    },
    {
      "@type": "Question",
      "name": "Why do certified products help for accurate λ values?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Certified devices provide validated λ values and clearly defined DU/DD/SU/SD splits under IEC 61508. This reduces interpretation errors and ensures the assumptions behind the numbers are understood and controlled."
      }
    },
    {
      "@type": "Question",
      "name": "Is it okay to use generic failure rates?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Only if you confirm that they match your device and application. Generic values may not reflect your environment, proof-test strategy, or diagnostic coverage."
      }
    },
    {
      "@type": "Question",
      "name": "What if the device doesn’t have a SIL certificate?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "You can still use it, but you must rely on credible manufacturer reliability data, validated site history, or reputable sources such as OREDA. These represent the alternate data routes when a certified data path is not available. Assumptions must match the application, and justification must be documented."
      }
    },
    {
      "@type": "Question",
      "name": "Why do final elements almost always have higher λDU than sensors?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Most λDU in a final element comes from mechanical components like the actuator and valve body, which lack strong diagnostics. Sensors generally have better diagnostic coverage and fewer mechanical wear points, so their λDU values are typically much lower."
      }
    },
    {
      "@type": "Question",
      "name": "I understand why λDU impacts PFDavg, but why would λDD impact PFDavg?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "It depends on how the controls and diagnostics are set up. In many low-demand configurations, only λDU enters PFDavg. But if a dangerous detected failure leaves the SIF unable to perform its function—and the system does not act on or repair that diagnostic—then that portion of λDD effectively behaves like λDU and may need to be included in the PFDavg analysis."
      }
    },
    {
      "@type": "Question",
      "name": "How are λ values used differently in high-demand or continuous modes?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Failure rate is highly relevant but used differently. In low-demand mode (most common), we use λDU to calculate PFDavg. In continuous or high-demand modes, PFDavg is not used. Instead, SIL is based on PFH—the rate of dangerous failure per hour. In these cases, λDU is used to calculate PFH through a different series of equations than PFDavg."
      }
    }
  ]
}
</script>
]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/failure-rates-in-functional-safety/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>How to Apply the Beta Factor: A Practical Guide to Common Cause Failures in SIL Verification</title>
		<link>https://silsafe.net/how-to-apply-beta-factor-for-common-cause-failure/</link>
					<comments>https://silsafe.net/how-to-apply-beta-factor-for-common-cause-failure/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Sun, 19 Oct 2025 16:44:46 +0000</pubDate>
				<category><![CDATA[Math Related]]></category>
		<category><![CDATA[PFDavg]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=3628</guid>

					<description><![CDATA[Learn how to apply the beta factor (β) in SIL verification and understand its link to common cause failures. This guide walks through where β values come from, how they affect PFDavg, and why small changes can shift your SIL rating.]]></description>
										<content:encoded><![CDATA[
<p>When engineers perform SIL verification, most of the attention goes toward failure rates, proof test intervals, or diagnostics. But one input that carries enormous influence is <strong>beta factor (β)</strong> — the number that represents how much of your redundancy can be trusted to behave independently. Misunderstanding it can make even a perfect SIL calculation look better on paper than it really is in the field. This article walks through what β is, how it links to <strong>common cause failure (CCF)</strong>, and how to apply it correctly with practical math and examples.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">What Is It? (CCF vs. Beta Factor and Why It Matters)</h2>



<p><strong>Common cause failure (CCF)</strong> is the phenomenon of multiple things failing for the same core reason. The concept appears across many industries and practices. In a Safety Instrumented System (SIS), it means multiple channels in redundant architectures fail together because of a shared cause. For example:</p>



<ul class="wp-block-list">
<li><strong>Sensors:</strong> Two pressure transmitters mounted side‑by‑side exposed to the same vibration or plugged impulse lines.</li>



<li><strong>Logic solver:</strong> Both processor cards affected by the same unknown software bug.</li>



<li><strong>Final element:</strong> Two solenoids sharing the same instrument air header that fail simultaneously.</li>
</ul>



<p>The <strong>beta factor (β)</strong> is the fraction of all dangerous failures that occur from a CCF. It is the mathematical representation of the CCF concept. It converts the qualitative idea of “common cause” into a quantifiable term used in equations. Ignoring β will almost always make your <strong>average probability of failure on demand (PFDavg)</strong> appear lower than it really is — giving a false sense of security.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Where to Get Beta Factor Values</h2>



<p>There are two primary sources:</p>



<ol class="wp-block-list">
<li><strong>SIL certificates or FMEDA reports</strong> from manufacturers. These often include β values based on test data and modeling assumptions.</li>



<li><strong>IEC 61508‑6 Annex D</strong> – the formal method for determining β using dependent failure analysis. This process is complex and beyond the scope of this article, but it is the standard reference.</li>
</ol>



<p>Other useful references include <strong>ISA TR84.00.02</strong>, <strong>OREDA</strong>, and <strong>CCPS</strong> reliability publications. Some companies maintain internal databases based on field performance.</p>



<p>A practical approach:</p>



<ul class="wp-block-list">
<li>Start with the β from the SIL certificate.</li>



<li>Review Annex D factors to see if your installation justifies adjustment.</li>



<li>Document and justify the final value in your <strong>Safety Requirements Specification (SRS)</strong>.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">How Architecture Affects β</h2>



<p>The β‑factor is not constant; it changes with architecture. This is a common point of confusion within Functional Safety. Think of redundancy as a team of <strong>security guards protecting a building</strong>:</p>



<ul class="wp-block-list">
<li>In a <strong>1oo2</strong> architecture, either guard can respond to stop a robbery. If both guards eat the same lunch and get food poisoning, that’s a CCF — a shared vulnerability.</li>



<li>In a <strong>1oo4</strong> architecture, four guards must all get sick for the robbery to succeed. The common cause event would have to be much stronger or more universal, or &#8220;more common,&#8221; to take them all down. Think of a really widespread case of food poisoning. Therefore, the effective β is lower.</li>
</ul>



<p>This creates a feedback loop: changing architecture alters the PFDavg equation, but it also justifies a new β — which again changes the PFDavg. It is a bit odd, but it is correct. IEC 61508‑6 Annex D Table D.5 gives architectural correction factors (roughly 0.3 – 1.74) that help account for this effect. Most β values published in SIL certificates are assumed for 1oo2 configurations.</p>
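<p>The architectural adjustment can be sketched as a simple multiplication. The certificate β and the correction factor below are hypothetical examples; the actual factor for your voting arrangement must come from IEC 61508-6 Annex D:</p>

```python
# Sketch: adjusting a certificate beta (assumed published for 1oo2) with an
# architectural correction factor in the spirit of IEC 61508-6 Table D.5.
# Both numbers are hypothetical; the factor merely sits inside the rough
# 0.3-1.74 range mentioned in the text.
beta_1oo2 = 0.05          # from a SIL certificate (hypothetical)
correction_1oo4 = 0.4     # hypothetical factor for a more redundant vote

beta_effective = beta_1oo2 * correction_1oo4   # 0.02
print(beta_effective)
```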



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">NooN Architectures and Why β Is Not Present</h2>



<p>For <strong>NooN</strong> architectures (e.g., 2oo2 or 3oo3), β is normally not present in the mathematics. Any single channel failure will cause the function to fail because all channels must work. The total λDU (dangerous undetected failure rate) already includes both independent and common‑cause contributions. Applying a separate β term would double‑count the effect.</p>



<p>Returning to the guard analogy: if three guards must all respond correctly and any one failure causes system failure, it doesn’t matter if their failures were independent or shared — the result is the same. For this reason, modeling tools and standards do not include β for NooN designs.</p>



<p>Note that this can be thought of as &#8220;β is set to 0 for NooN&#8221; but that is not how we at SIL Safe think of this. Yes, you can use the MooN PFDavg equations, set&nbsp;β = 0, and get the same results. But as we discussed above,&nbsp;β is accounted for in λDU.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Typical Values and Influencing Factors</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Subsystem</th><th>Typical β Range</th></tr></thead><tbody><tr><td>Sensors</td><td>0.02 – 0.10</td></tr><tr><td>Logic Solvers</td><td>0.01 – 0.05</td></tr><tr><td>Final Elements</td><td>0.05 – 0.15</td></tr></tbody></table></figure>



<p>Factors that increase β:</p>



<ul class="wp-block-list">
<li>Common environment (same enclosure, same power, same impulse lines)</li>



<li>Identical design and firmware</li>



<li>Shared maintenance procedures or simultaneous testing</li>
</ul>



<p>Factors that decrease β:</p>



<ul class="wp-block-list">
<li>Physical separation and shielding</li>



<li>Vendor or technology diversity</li>



<li>Independent power and utilities</li>



<li>Separate calibration schedules</li>



<li>Staggered proof testing</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">PFDavg Equation Discussion</h2>



<p>In a <strong>1oo2</strong> system, the total PFDavg is made up of independent and common‑cause terms. Using a simplified version of the equation:</p>



<figure class="wp-block-image size-full is-resized"><img decoding="async" width="560" height="78" src="https://silsafe.net/wp-content/uploads/2025/10/PFDavg-1oo2-basic.webp" alt="Equation showing how the beta factor (β) influences PFDavg in a 1oo2 safety architecture by combining independent and common cause failure probabilities." class="wp-image-3626" style="aspect-ratio:7.180925666199158;width:438px;height:auto" srcset="https://silsafe.net/wp-content/uploads/2025/10/PFDavg-1oo2-basic.webp 560w, https://silsafe.net/wp-content/uploads/2025/10/PFDavg-1oo2-basic-300x42.webp 300w" sizes="(max-width: 560px) 100vw, 560px" /></figure>



<p>Where:</p>



<ul class="wp-block-list">
<li>λ<sub>DU</sub> = dangerous undetected failure rate (per hour)</li>



<li>TI = proof test interval (hours)</li>



<li>β = fraction of failures due to common causes</li>
</ul>



<p>The first term (squared) represents two independent latent failures. The second (linear) term represents the probability that both channels fail from a shared cause. Because the independent term is squared, it becomes much smaller — making β a powerful driver in redundant designs.</p>
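<p>As a minimal sketch (in Python; the function name is ours), the simplified 1oo2 equation can be written as:</p>

```python
def pfd_avg_1oo2(lam_du, ti_hours, beta):
    """Simplified 1oo2 PFDavg: independent term plus common-cause term."""
    independent = ((1 - beta) * lam_du * ti_hours) ** 2 / 3
    common_cause = beta * lam_du * ti_hours / 2
    return independent + common_cause

# Example: lambda_DU = 2E-6/hr, TI = 1 year (8,760 hr), beta = 3%
pfd = pfd_avg_1oo2(2e-6, 8760, 0.03)  # ≈ 3.6E-4
```

<p>Note how the common-cause term scales linearly with β while the independent term is squared; this is why β dominates in well-designed redundant systems.</p>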



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">PFDavg Calculation Examples (1oo2)</h2>



<h3 class="wp-block-heading">Case A – Good β Factor (Target: SIL 2)</h3>



<p><strong>Given:</strong> β = 0.03, λ<sub>DU</sub> = 2E-6/hr, TI = 8,760 hr (1 year)</p>



<ol class="wp-block-list">
<li>Independent term:<br>(1/3) × [(1 − 0.03) × 2E-6 × 8,760]² = 9.63E-5</li>



<li>Common‑cause term:<br>(1/2) × 0.03 × 2E-6 × 8,760 = 2.63E-4</li>



<li><strong>Total PFDavg </strong>= 3.59E-4</li>



<li><strong>RRF = 1 / PFDavg </strong>= 2,785 → meets the SIL 2 target (PFDavg falls within the SIL 3 band)</li>
</ol>



<p><strong>STR (spurious trip rate):</strong> assume λ<sub>SP</sub> = 1E-5/hr per channel. For 1oo2 (trip on any channel):<br>STR ≈ 2 × λ<sub>SP</sub> = 2E-5/hr → 0.175 trips per year ≈ a trip every 5.7 years.</p>
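<p>The STR arithmetic above can be checked with a short Python sketch (variable names are ours; λ<sub>SP</sub> is the assumed per-channel spurious failure rate):</p>

```python
HOURS_PER_YEAR = 8760

lam_sp = 1e-5                        # assumed spurious failure rate per channel, /hr
str_per_hour = 2 * lam_sp            # 1oo2: either channel can trip the system
trips_per_year = str_per_hour * HOURS_PER_YEAR   # ≈ 0.175
years_between_trips = 1 / trips_per_year         # ≈ 5.7
```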



<h3 class="wp-block-heading">Case B – Poor β Factor (Target: SIL 1)</h3>



<p><strong>Given:</strong> β = 0.15, λ<sub>DU</sub> = 2E-6/hr, TI = 8,760 hr (same assumptions)</p>



<ol class="wp-block-list">
<li>Independent term:<br>(1/3) × [(1 − 0.15) × 2E-6 × 8,760]² = 7.4E-5</li>



<li>Common‑cause term:<br>(1/2) × 0.15 × 2E-6 × 8,760 = 1.31E-3</li>



<li><strong>Total PFDavg </strong>= 1.38E-3</li>



<li><strong>RRF = 1 / PFDavg </strong>= 725 → meets the SIL 1 target (PFDavg falls within the SIL 2 band)</li>
</ol>



<p><strong>STR (same λ<sub>SP</sub>):</strong> 2E-5/hr → 0.175 trips/year or one trip every 5.7 years (same as Case A).</p>



<p><strong>Comparison:</strong> Raising β from 0.03 to 0.15 increased total PFDavg by ≈ 4×, dropping the achieved performance from the SIL 3 band to the SIL 2 band.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Impact of Common Cause Portions of PFDavg</h2>



<p>Because the independent term is squared, it diminishes as reliability increases, while the common‑cause term remains linear. The result: CCF often dominates total PFDavg in redundant architectures.</p>



<ul class="wp-block-list">
<li><strong>Case A (β = 0.03)</strong>: PFDavg_ind = 9.63E-5, PFDavg_ccf = 2.63E-4 → CCF ≈ 73% of total.</li>



<li><strong>Case B (β = 0.15)</strong>: PFDavg_ind = 7.4E-5, PFDavg_ccf = 1.31E-3 → CCF ≈ 95% of total.</li>
</ul>



<p>Even at modest β values, most of the total unavailability stems from common causes — which is why reducing β through independence and diversity is far more effective than upgrading the architecture. The same pattern holds true for higher architectures like 2oo3.</p>
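<p>The common-cause share can be computed directly from the simplified 1oo2 terms (a Python sketch; names are ours):</p>

```python
def ccf_share(lam_du, ti_hours, beta):
    """Fraction of total 1oo2 PFDavg contributed by common cause."""
    independent = ((1 - beta) * lam_du * ti_hours) ** 2 / 3
    common_cause = beta * lam_du * ti_hours / 2
    return common_cause / (independent + common_cause)

for beta in (0.03, 0.15):
    print(f"beta={beta}: CCF share = {ccf_share(2e-6, 8760, beta):.0%}")
# prints 73% and 95%, matching the two cases above
```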



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Common Mistakes</h2>



<ul class="wp-block-list">
<li>Applying the same β to every subsystem without justification.</li>



<li>Assuming diagnostics lower β (they don’t; they reduce λ<sub>DU</sub>).</li>



<li>Forgetting to justify β selection in the SRS.</li>



<li>Focusing only on adding redundancy instead of lowering β.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Summary</h2>



<p>The β‑factor is what connects real‑world dependencies to your math. A small change in β can shift a design from SIL 2 to SIL 1 — even when every other parameter is identical. Real independence, not just redundancy, drives reliability.</p>



<p>Document your β assumptions, revisit them throughout the SIS life‑cycle, and when in doubt, verify them against IEC 61508‑6 Annex D or manufacturer data.</p>



<p><em>If you’d like a second set of eyes on your β assumptions or SIL verification model, contact <strong>SIL Safe</strong> for a practical review or training session.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Q&amp;A</h2>



<p><strong>1. Does beta factor (β) apply only to 1oo2 architectures?</strong><br>No. The concept applies to any redundant system where channels can fail together, though it’s most visible in 1oo2.</p>



<p><strong>2. Can β ever be zero?</strong><br>Not realistically. There’s always some shared factor — environment, maintenance, or design — that introduces correlation. For NooN scenarios, <strong>β</strong> does not appear in the mathematics, so some say β is zero, but it is still included in the λ<sub>DU</sub> term.</p>



<p><strong>3. What does β actually represent?</strong><br>It’s the percentage of dangerous failures that are shared between channels rather than truly independent.</p>



<p><strong>4. How often should β be revisited?</strong><br>Any time design, environment, or procedures change — typically during periodic review or FSA Stage 4.&nbsp; It is common for&nbsp;β to change over the safety lifecycle.</p>



<p><strong>5. Can one simply apply the β value from the SIL certificate?</strong><br>Use it as a starting point. Then compare against Annex D’s checklist and justify any adjustment.</p>



<p><strong>6. What portion can common cause contribute to total PFDavg?</strong><br>Often the majority. For 1oo2 with β ≈ 0.1–0.2, common cause failures can make up 70–95 % of total PFDavg — the same trend appears in higher architectures.</p>



<p><strong>7. I&#8217;ve heard that for NooN architectures, β is set to 0, but I thought that is not possible?</strong><br>β is involved in NooN, but it is already included in the λ<sub>DU</sub> term, so applying a separate β would double count. The NooN equations therefore contain no β terms. It can be thought of as β being set to zero, though that is not technically correct.</p>



<h2 class="wp-block-heading">Further Reading</h2>



<p>Here are additional resources:</p>



<ol class="wp-block-list">
<li><a href="https://silsafe.net/functional-safety-glossary/" data-type="page" data-id="391">SIL Safe Glossary </a>&#8211; This is SIL Safe&#8217;s glossary of 100+ functional safety terms.</li>



<li>Guidance on quantification of common cause failure&nbsp; <a href="https://ez.analog.com/ez-blogs/b/engineerzone-spotlight/posts/how-to-quantify-common-cause-failures" target="_blank" rel="noopener">https://ez.analog.com/ez-blogs/b/engineerzone-spotlight/posts/how-to-quantify-common-cause-failures</a></li>



<li>Nuclear Regulatory Commission (NRC) Guidance &#8211; The US nuclear industry does not follow IEC 61511-1, but they care a LOT about common cause failure.&nbsp; Here is their guidance.&nbsp;<a href="https://www.nrc.gov/docs/ml0729/ml072970404.pdf" target="_blank" rel="noopener">https://www.nrc.gov/docs/ml0729/ml072970404.pdf</a></li>



<li>More NRC Guidance &#8211;&nbsp;<a href="https://www.nrc.gov/reading-rm/doc-collections/nuregs/staff/sr2225/index" target="_blank" rel="noopener">https://www.nrc.gov/reading-rm/doc-collections/nuregs/staff/sr2225/index</a></li>
</ol>



<!-- FAQPage Schema (JSON-LD) — Verbatim conversion -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Does beta factor (β) apply only to 1oo2 architectures?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "No. The concept applies to any redundant system where channels can fail together, though it’s most visible in 1oo2."
      }
    },
    {
      "@type": "Question",
      "name": "Can β ever be zero?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Not realistically. There’s always some shared factor — environment, maintenance, or design — that introduces correlation. For NooN scenarios, β does not appear in the mathematics, so some say β is zero, but it is still included in the λDU term."
      }
    },
    {
      "@type": "Question",
      "name": "What does β actually represent?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "It’s the percentage of dangerous failures that are shared between channels rather than truly independent."
      }
    },
    {
      "@type": "Question",
      "name": "How often should β be revisited?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Any time design, environment, or procedures change — typically during periodic review or FSA Stage 4. It is common for β to change over the safety lifecycle."
      }
    },
    {
      "@type": "Question",
      "name": "Can one simply apply the β in the SIL certificate?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Use it as a starting point. Then compare against Annex D’s checklist and justify any adjustment."
      }
    },
    {
      "@type": "Question",
      "name": "What portion can common cause contribute to total PFDavg?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Often the majority. For 1oo2 with β ≈ 0.1–0.2, common cause failures can make up 70–95 % of total PFDavg — the same trend appears in higher architectures."
      }
    },
    {
      "@type": "Question",
      "name": "I’ve heard that for NooN architectures, β is set to 0, but I thought that is not possible?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "β is involved in NooN, but it is already included in the λDU term. Meaning we don't want to double count. The NooN equations will not have any β terms. Thus, it can be thought of as they were set to zero, but that is not technically correct."
      }
    }
  ]
}
</script>

]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/how-to-apply-beta-factor-for-common-cause-failure/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Spurious Trip Rate Explained: 6 Facts Every Functional Safety Engineer Should Know</title>
		<link>https://silsafe.net/spurious-trip-rate-explained/</link>
					<comments>https://silsafe.net/spurious-trip-rate-explained/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Wed, 03 Sep 2025 02:24:51 +0000</pubDate>
				<category><![CDATA[Math Related]]></category>
		<category><![CDATA[Beginner]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=3315</guid>

					<description><![CDATA[Spurious Trip Rate (STR) is often overlooked in functional safety. Learn what STR is, how it’s calculated, and why balancing STR with PFDavg matters.]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Introduction</h2>



<p>When engineers think about <strong>functional safety</strong> and IEC 61511, they often focus on <em>probability of failure on demand (PFDavg)</em>. But another metric—<strong>spurious trip rate (STR)</strong>—directly affects plant uptime, productivity, and costs. Unlike PFDavg, STR doesn’t measure safety performance. Instead, it reflects how often a Safety Instrumented Function (SIF) causes unwanted shutdowns. In this article, we’ll break down STR, introduce its inverse (MTTFsp), show how it’s calculated, and explain why balancing STR with PFDavg is essential for effective SIF design.</p>



<h2 class="wp-block-heading">What Is Spurious Trip Rate?</h2>



<p>Spurious trip rate (STR) is the <strong>frequency of unwanted or false trips</strong> of a Safety Instrumented Function (SIF).</p>



<p>These aren’t dangerous failures—they’re <em>safe failures</em> that cause a SIF to shut down the process when no actual demand exists. STR is usually expressed in failures per hour (1/hr) and then converted to years between trips for practical interpretation.</p>



<p>To clarify:</p>



<ul class="wp-block-list">
<li><strong>Dangerous failures</strong> → drive safety risk and are addressed through PFDavg.</li>



<li><strong>Safe failures (that force shutdowns)</strong> → drive nuisance trips and are captured in STR.</li>
</ul>



<h2 class="wp-block-heading">Why STR Matters in Functional Safety</h2>



<p>Unnecessary shutdowns don’t just reduce production—they can create safety hazards of their own. Restarting equipment introduces operational risk, and frequent nuisance trips erode operators’ trust in SIS performance. IEC 61511 requires that designers consider both <strong>safety integrity (via PFDavg)</strong> and <strong>availability (via STR)</strong>, even though the standard does not prescribe numeric STR limits.</p>



<p>A SIF with an overly high STR might meet its SIL target but still be unacceptable to operations because of lost uptime. Spurious trip rate matters because it links functional safety design to real-world economics.</p>



<p>Alternatively, some operations are so automated that recovering from a SIF trip is only a minor problem. It just depends.</p>



<h2 class="wp-block-heading">STR and Its Inverse: MTTFsp</h2>



<p>STR is expressed in trips per unit time—often failures per hour. But that can be difficult to interpret. So, engineers often use its inverse: <strong>MTTFsp (Mean Time to Spurious Trip)</strong>.</p>



<ul class="wp-block-list">
<li><strong>MTTFsp = 1 / STR</strong></li>



<li>Example: STR = 0.01 trips/year → MTTFsp = 100 years</li>
</ul>
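<p>The conversion is a one-liner in Python (a minimal sketch; the function names are ours):</p>

```python
HOURS_PER_YEAR = 8760

def mttf_sp_years(str_per_hour):
    """Mean time to spurious trip, in years, from an STR given in /hr."""
    return 1.0 / (str_per_hour * HOURS_PER_YEAR)

def mttf_from_trips_per_year(trips_per_year):
    """Same conversion when STR is already expressed in trips/year."""
    return 1.0 / trips_per_year

mttf_from_trips_per_year(0.01)   # → 100.0 years, matching the example above
```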



<p>Both terms are useful:</p>



<ul class="wp-block-list">
<li><strong>STR</strong> highlights the frequency of nuisance trips.</li>



<li><strong>MTTFsp</strong> highlights how long, on average, a SIF runs without a false trip.</li>
</ul>



<p>At SIL Safe, we always calculate both terms in verification reports. This makes it easier for both engineers and managers to understand the balance between reliability and uptime.</p>



<h2 class="wp-block-heading">How STR Is Calculated</h2>



<p>The first step is defining the <strong>λsp (spurious failure rate)</strong>. This aggregates all relevant failure modes in the SIF path that can drive a false trip:</p>



<ul class="wp-block-list">
<li>If a detected failure is configured to <strong>vote-to-trip</strong>, it contributes to λsp.</li>



<li>If a detected failure is configured to <strong>notify-only</strong>, it does not contribute.</li>



<li>If there are no diagnostics, the detected-failure terms (SD, DD) are simply 0.</li>
</ul>



<p>Thus, the voting philosophy directly affects which categories—safe detected (SD), safe undetected (SU), dangerous detected (DD)—feed into λsp. Dangerous undetected (DU) never contributes to STR.</p>



<ul class="wp-block-list">
<li><strong>1oo1 SIF:</strong> STR = λsp (per unit time).</li>



<li><strong>1ooN SIF:</strong> STR increases with more elements in parallel (any one channel can trip the system).
<ul class="wp-block-list">
<li>STR = N × λsp</li>
</ul>
</li>



<li><strong>MooN SIF (M &gt; 1):</strong> STR decreases because multiple channels must trip simultaneously (e.g., 2oo3 cuts nuisance trip probability). This is done through the often-confusing MooN equation, which is beyond the scope here.</li>
</ul>



<p>Contrast: PFDavg calculations focus on dangerous undetected failures (DU), while STR calculations focus on safe or detected failures that cause spurious trips.</p>



<h2 class="wp-block-heading">Example Calculations</h2>



<h3 class="wp-block-heading">Case A – 1oo1, No Diagnostics</h3>



<ul class="wp-block-list">
<li><strong>Assumptions:</strong> Single-channel SIF, no diagnostics.</li>



<li><strong>Failure rates per channel:</strong>
<ul class="wp-block-list">
<li>λ<sub>DU</sub> = 1E-6 /hr (listed for completeness; not in calculation)</li>



<li>λ<sub>DD</sub> = 0 /hr</li>



<li>λ<sub>SU</sub> = 2E-6 /hr</li>



<li>λ<sub>SD</sub> = 0 /hr</li>
</ul>
</li>



<li><strong>λsp definition:</strong>&nbsp;As there are no diagnostics, DD and SD are not applicable.&nbsp; DU is never applicable.&nbsp; Thus, λsp = λ<sub>SU</sub> = 2E-6 /hr</li>



<li><strong>Results:</strong><br>STR = <strong>2E-6 /hr</strong><br>MTTFsp = 1 / (λ<sub>sp</sub> × 8,760) <strong>≈ 57.1 years</strong> (≈ 1 trip every 57 years per channel)</li>
</ul>



<h3 class="wp-block-heading">Case B – 1oo2 Sensors, Diagnostics with Vote-to-Trip</h3>



<ul class="wp-block-list">
<li><strong>Assumptions:</strong> Dual-channel 1oo2 sensor architecture; online diagnostics present; detected faults (SD and DD) vote-to-trip. Only the sensor portion is considered.</li>



<li><strong>Failure rates per channel:</strong>
<ul class="wp-block-list">
<li>λ<sub>DU</sub> = 1.0E-6 /hr (listed for completeness; not in calculation)</li>



<li>λ<sub>DD</sub> = 2.0E-7 /hr</li>



<li>λ<sub>SU</sub> = 4.0E-7 /hr</li>



<li>λ<sub>SD</sub> = 8.0E-7 /hr</li>
</ul>
</li>



<li><strong>λsp per channel:</strong> λ<sub>sp</sub> = λ<sub>SD</sub> + λ<sub>SU</sub> + λ<sub>DD</sub> = 1.4E-6 /hr</li>



<li><strong>System results:</strong><br>STR = 2 × 1.4E-6 = <strong>2.8E-6 /hr</strong><br>MTTFsp = 1 / (STR × 8,760) <strong>≈ 40.8 years</strong> (≈ 1 trip every 41 years for sensor portion)</li>
</ul>



<p><strong>Key takeaway:</strong> architecture, diagnostics, and voting logic can drastically change nuisance trip rates. Presenting both STR and MTTFsp highlights the trade-offs clearly.</p>
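<p>To make the bookkeeping concrete, here is a minimal Python sketch of the two cases above (function and variable names are ours; failure rates come from the cases):</p>

```python
HOURS_PER_YEAR = 8760

def lambda_sp(l_su, l_sd=0.0, l_dd=0.0, detected_vote_to_trip=False):
    """Spurious failure rate per channel: SU always counts; SD and DD
    count only if detected faults are configured to vote-to-trip."""
    return l_su + ((l_sd + l_dd) if detected_vote_to_trip else 0.0)

# Case A: 1oo1, no diagnostics
str_a = 1 * lambda_sp(l_su=2e-6)              # 2.0E-6 /hr
mttf_a = 1 / (str_a * HOURS_PER_YEAR)         # ≈ 57.1 years

# Case B: 1oo2 sensors, detected faults vote-to-trip
lam_b = lambda_sp(l_su=4e-7, l_sd=8e-7, l_dd=2e-7, detected_vote_to_trip=True)
str_b = 2 * lam_b                             # 2.8E-6 /hr
mttf_b = 1 / (str_b * HOURS_PER_YEAR)         # ≈ 40.8 years
```

<p>Changing <code>detected_vote_to_trip</code> to notify-only immediately shows how voting philosophy moves the numbers.</p>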



<h2 class="wp-block-heading">Who Decides What STR Is Acceptable?</h2>



<p>Unlike SIL targets, there is <strong>no universal STR requirement</strong>. Acceptability is decided by plant management and operations during safety design reviews, and many plants rely on guidance from firms like SIL Safe to set realistic expectations. It’s typically documented in the <strong>Safety Requirements Specification (SRS)</strong>.</p>



<ul class="wp-block-list">
<li><strong>Typical ranges:</strong> MTTFsp of 10–100 years per channel is often considered reasonable in industry.</li>



<li><strong>Higher MTTFsp expectations:</strong> High-availability processes (e.g., refineries, offshore platforms).</li>



<li><strong>Lower MTTFsp expectations:</strong> Non-critical utilities or batch processes.</li>



<li><strong>Unrealistic targets:</strong> “Once every million years” is neither achievable nor useful, though such targets do get proposed as newcomers are introduced to functional safety.</li>
</ul>



<p>The point: STR goals are practical business decisions, not dictated by IEC 61511. IEC 61511-1 only requires the team to weigh spurious trips against safety and PFDavg.</p>



<h2 class="wp-block-heading">Balancing Safety Integrity and Availability</h2>



<p>The art of SIF design is balancing <strong>safety integrity (low PFDavg)</strong> with <strong>availability (low STR)</strong>:</p>



<ul class="wp-block-list">
<li>Use redundancy wisely (2oo3 voting can cut STR).&nbsp; This is the primary reason 2oo3 is so common.</li>



<li>Apply diagnostics carefully (vote-to-trip vs notify-only matters).</li>



<li>Base calculations on realistic failure data, not overly optimistic assumptions.</li>
</ul>



<p>A SIS with perfect safety but terrible availability—or vice versa—fails its mission.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Spurious Trip Rate doesn’t affect whether a plant is safe, but it absolutely affects whether it runs. That’s why STR is as important as PFDavg in practice. Engineers must present both STR and MTTFsp to give operators a realistic picture of system performance. The best designs find the balance: safe enough <strong>and</strong> reliable enough.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Q&amp;A Section</h2>



<p><strong>1. Is a low STR always better?</strong><br>Yes for uptime, but not if it compromises safety. STR must be balanced against PFDavg.</p>



<p><strong>2. Does IEC 61511 require specific STR values?</strong><br>No. It requires that availability and spurious trips be considered, but it does not prescribe limits.</p>



<p><strong>3. Can proof testing increase STR?</strong><br>Yes. Poorly planned tests (e.g., cycling valves without bypass) can cause nuisance trips. This temporarily raises STR and lowers MTTFsp.</p>



<p><strong>4. Which devices dominate STR?</strong><br>Final elements (valves) often contribute the most to spurious trips because of high safe failure rates.</p>



<p><strong>5. How does redundancy reduce STR?</strong><br>Voting architectures (like 2oo3) allow one sensor to fail without tripping the system, lowering STR.</p>



<p><strong>6. Where is acceptable STR documented?</strong><br>Usually in the Safety Requirements Specification (SRS) or reliability design basis documents.</p>



<p><strong>7. Can STR goals cause conflict between engineers and operations?</strong><br>Absolutely. Operations may push for fewer nuisance trips, while safety engineers emphasize conservatism. Finding balance is key.&nbsp; The stakeholders need to discuss it together.</p>



<h2 class="wp-block-heading">Learn More</h2>



<ul class="wp-block-list">
<li>SIL Safe has a <a href="https://silsafe.net/functional-safety-glossary/">full glossary</a> and a much shorter entry for <a href="https://silsafe.net/glossary/spurious-trip-rate/" data-type="glossary" data-id="895">spurious trip rate</a>.  </li>



<li><a href="https://webstore.iec.ch/en/publication/61289" target="_blank" rel="noopener">IEC 61511-1 page</a></li>
</ul>



<!-- FAQPage Schema (JSON-LD) — Verbatim conversion -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Is a low STR always better?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes for uptime, but not if it compromises safety. STR must be balanced against PFDavg."
      }
    },
    {
      "@type": "Question",
      "name": "Does IEC 61511 require specific STR values?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "No. It requires that availability and spurious trips be considered, but it does not prescribe limits."
      }
    },
    {
      "@type": "Question",
      "name": "Can proof testing increase STR?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes. Poorly planned tests (e.g., cycling valves without bypass) can cause nuisance trips. This temporarily raises STR and lowers MTTFsp."
      }
    },
    {
      "@type": "Question",
      "name": "Which devices dominate STR?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Final elements (valves) often contribute the most to spurious trips because of high safe failure rates."
      }
    },
    {
      "@type": "Question",
      "name": "How does redundancy reduce STR?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Voting architectures (like 2oo3) allow one sensor to fail without tripping the system, lowering STR."
      }
    },
    {
      "@type": "Question",
      "name": "Where is acceptable STR documented?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Usually in the Safety Requirements Specification (SRS) or reliability design basis documents."
      }
    },
    {
      "@type": "Question",
      "name": "Can STR goals cause conflict between engineers and operations?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Absolutely. Operations may push for fewer nuisance trips, while safety engineers emphasize conservatism. Finding balance is key, and stakeholders need to discuss it together."
      }
    }
  ]
}
</script>
]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/spurious-trip-rate-explained/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>How to Use Proof Test Coverage (Cpt) to Improve PFDavg</title>
		<link>https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/</link>
					<comments>https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Sun, 03 Aug 2025 22:15:25 +0000</pubDate>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Math Related]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=2784</guid>

					<description><![CDATA[Proof test coverage (Cpt) plays a major role in determining whether a SIF meets its target SIL. This article explains what Cpt is, how it affects PFDavg, and how even a small change in coverage can shift your system from SIL 1 to SIL 2—or vice versa. Includes examples, tips, and FAQs.]]></description>
										<content:encoded><![CDATA[
<p>When designing or verifying a Safety Instrumented Function (SIF), it’s common to hear terms like PFDavg, SIL level, and test interval. But one factor that’s often misunderstood — or just overlooked — is <strong>proof test coverage (Cpt)</strong>. This is a critical element that directly impacts how effectively your testing finds dangerous failures.</p>



<p>If your facility is working toward compliance with <strong>IEC 61511-1</strong>, understanding how Cpt works — and how to apply it — can make the difference between an overly optimistic SIL claim and a realistic, defensible safety case.</p>



<p>Let’s walk through what Cpt is, how it affects your calculations, and how to apply it the right way.</p>



<h2 class="wp-block-heading">What Is Proof Test Coverage (Cpt)?</h2>



<p><strong>Proof test coverage (Cpt)</strong> is the fraction of dangerous undetected (DU) failures that your proof test is capable of finding.</p>



<ul class="wp-block-list">
<li>A Cpt of 1.0 (or 100%) means your test detects <strong>all</strong> dangerous undetected failures.</li>



<li>A Cpt of 0.7 (or 70%) means your test only finds <strong>70%</strong> of those failures.</li>
</ul>



<p>This is important because any dangerous failures that your test doesn’t catch will accumulate over time, increasing the <strong>average probability of failure on demand (PFDavg)</strong>.</p>



<p>Cpt is often used alongside another key term: <strong>proof test interval (TI)</strong> — how often you do the testing. But the test interval doesn’t matter much if your test isn’t catching what matters.</p>



<p>Also worth noting: Cpt is <em>not</em> the same thing as <strong>diagnostic coverage (DC)</strong> — though they both relate to detecting failures, they’re measured differently and come from different sources.</p>



<p>Note that proof tests can and do catch more than DU failures. They can also reveal safe failures (SU, SD). But the main <span style="text-decoration: underline;">purpose of</span> a proof test is to find DU failures, and DU is the <span style="text-decoration: underline;">only failure category</span> Cpt is associated with.</p>



<h2 class="wp-block-heading">How Cpt Affects PFDavg</h2>



<p>The most common form of the PFDavg equation used in training looks like this:</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="212" height="58" src="https://silsafe.net/wp-content/uploads/2025/04/pfdavg-basic-equation.png" alt="PFDavg basic equation" class="wp-image-188" style="aspect-ratio:3.6554492127199754;width:160px;height:auto"/></figure>



<p>But this assumes you catch <em>all</em> dangerous undetected (DU) failures — which is rarely true. A more accurate form includes <strong>Cpt</strong>:</p>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" width="516" height="83" src="https://silsafe.net/wp-content/uploads/2025/08/PVDavg-with-Cpt.webp" alt="Mathematical equation for calculating PFDavg using proof test coverage (Cpt), test interval (TI), and SIS lifetime (LT) for SIL verification." class="wp-image-2781" style="aspect-ratio:6.218258624735977;width:340px;height:auto" srcset="https://silsafe.net/wp-content/uploads/2025/08/PVDavg-with-Cpt.webp 516w, https://silsafe.net/wp-content/uploads/2025/08/PVDavg-with-Cpt-300x48.webp 300w" sizes="auto, (max-width: 516px) 100vw, 516px" /></figure>



<p>Where:</p>



<ul class="wp-block-list">
<li><strong>λ<sub>DU</sub></strong> is the dangerous undetected failure rate</li>



<li><strong>TI</strong> is the proof test interval</li>



<li><strong>LT</strong> is the SIS lifetime (e.g., 15 or 20 years)</li>
</ul>



<p>The two terms in the equation represent different contributions to PFDavg, as explained below:</p>



<ul class="wp-block-list">
<li>The first term is the contribution <strong>between tests</strong>.</li>



<li>The second term is the contribution of failures the proof test never finds; these remain hidden for the full SIS lifetime.</li>
</ul>



<p>In many training or spreadsheet tools, the second term is omitted if the <strong>lifetime is similar to the test interval</strong>. But when the lifetime is significantly longer (e.g., TI = 1 year, LT = 15 years), <strong>ignoring it underestimates risk</strong>.</p>
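<p>A minimal Python sketch of the two-term equation above (the function name is ours):</p>

```python
def pfd_avg_with_cpt(lam_du, cpt, ti_hours, lt_hours):
    """PFDavg with imperfect proof testing: the covered fraction is
    averaged over the test interval, the uncovered fraction over the
    SIS lifetime."""
    between_tests = cpt * lam_du * ti_hours / 2
    never_found = (1 - cpt) * lam_du * lt_hours / 2
    return between_tests + never_found
```

<p>Because the lifetime is usually an order of magnitude longer than the test interval, even a modest uncovered fraction (1 − Cpt) can dominate the result.</p>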



<p>Let’s make a quick comparison:</p>



<p><strong>Example:</strong></p>



<ul class="wp-block-list">
<li>λ<sub>DU</sub> = 2E-6 per hour</li>



<li>TI = 1 year (8,760 hours)</li>



<li>LT = 15 years (131,400 hours)</li>



<li>Case A: Cpt = 0.55</li>



<li>Case B: Cpt = 0.95</li>
</ul>



<p><strong>Case A: PFDavg ≈ (2E-6 × 0.55 × 8760)/2 + (2E-6 × 0.45 × 131400)/2</strong> = <strong>6.40E-2 → RRF ≈ 16</strong> (low end of SIL 1)</p>



<p><strong>Case B: PFDavg ≈ (2E-6 × 0.95 × 8760)/2 + (2E-6 × 0.05 × 131400)/2</strong> = <strong>1.49E-2 → RRF ≈ 67</strong> (solidly SIL 1, approaching SIL 2)</p>



<p>👉 Raising Cpt from 0.55 to 0.95 improved PFDavg by more than a factor of four, driven entirely by proof test coverage.</p>



<p>Even though both cases used the same failure rate, test interval, and SIS lifetime, the lower test coverage in Case A pulled the risk performance down by more than a factor of four. This is a powerful reminder that <strong>increasing test frequency is not enough</strong> if the test itself isn’t catching the right failure modes.</p>



<h2 class="wp-block-heading">What’s a Realistic Cpt?</h2>



<p>You’ll often see vendors or safety books quote generic Cpt ranges. Here’s a quick cheat sheet to get you started:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Component</th><th>Typical Cpt</th><th>Notes</th></tr></thead><tbody><tr><td>Pressure Transmitter</td><td>85–95%</td><td>Depends on how it&#8217;s tested</td></tr><tr><td>Logic Solver</td><td>95–99%</td><td>High diagnostic coverage helps</td></tr><tr><td>Final Element (valve)</td><td>50–95%</td><td>Greatly depends on stroke testing</td></tr></tbody></table></figure>



<p>Cpt is influenced by:</p>



<ul class="wp-block-list">
<li><strong>Test method</strong> (partial stroke, full stroke, leak test, etc.)</li>



<li><strong>Equipment design</strong> (some valves are inherently testable)</li>



<li><strong>Human factors</strong> (procedures, training, consistency)</li>
</ul>



<h2 class="wp-block-heading">How to Determine Proof Test Coverage (Cpt)</h2>



<h3 class="wp-block-heading">If Using IEC 61508 Certified Equipment</h3>



<p>If you’re using components that are <strong>certified per IEC 61508</strong>, your job is easier. Look at the <strong>SIL certificate</strong> or <strong>safety manual</strong>. Most will include Cpt values based on an FMEDA (Failure Modes, Effects, and Diagnostic Analysis).</p>



<ul class="wp-block-list">
<li>Example: A final element might claim 65% for partial-stroke testing and 90% for full-stroke testing.</li>



<li>You need to match <strong>your test procedure</strong> to what was assumed in the FMEDA.</li>
</ul>



<p>This is especially important with valves. A partial valve stroke test (PVST) might not catch failure modes that a full valve stroke test (FVST) would — and the difference in Cpt can be dramatic.</p>



<h3 class="wp-block-heading">If Using Non-61508 Equipment (Route 2H or 2S)</h3>



<p>If your hardware isn’t certified, you’ll need to gather data and use the <strong>proven in use</strong> method. This takes the <strong>Route 2H or 2S</strong> approach (the routes are confusing and will be discussed elsewhere):</p>



<ul class="wp-block-list">
<li>Use industry databases like <strong>OREDA</strong></li>



<li>Refer to books like <em>Safety Instrumented System Verification</em> by Goble</li>



<li>Review ISA technical reports and peer-reviewed FMEDAs</li>



<li>Document your <strong>engineering judgment</strong> and <strong>conservatism</strong></li>
</ul>



<p>Example: You might assign a Cpt of 70% to a test routine that checks for mechanical failure in a solenoid but can’t detect seat leakage. Be transparent about assumptions — auditors and assessors will ask.</p>



<h2 class="wp-block-heading">Common Proof Test Coverage Misunderstandings</h2>



<ul class="wp-block-list">
<li><strong>Cpt ≠ diagnostic coverage</strong>: Diagnostic coverage comes from built-in self-checks. Cpt is about your <strong>manual or automatic testing</strong> procedures.</li>



<li><strong>You can’t just assume 100%</strong>: Even a full-stroke test may not catch all dangerous failures, especially in actuators and valve internals.</li>



<li><strong>Test frequency doesn’t override poor Cpt</strong>: Doing a weak test more often doesn’t give the same benefit as a strong test less frequently.</li>



<li>Copying vendor Cpt while using a much weaker plant test.</li>



<li>Assuming partial-stroke test Cpt equals full functional test Cpt.</li>



<li>Not coordinating the proof test type with plant operations. For example, a functional safety engineer establishes a high Cpt assuming an FVST every six months, which would require a plant shutdown at their facility. But the facility cannot shut down every six months. This is a classic example of not including all stakeholders in decisions.</li>
</ul>



<h2 class="wp-block-heading">Practical Tips for Beginners</h2>



<ul class="wp-block-list">
<li>If possible, <strong>use certified equipment</strong> — it saves work and improves defensibility.</li>



<li>For valves and final elements, be clear with your operations team: A test that’s easy to perform (like PVST) often has lower coverage.</li>



<li>Document exactly <strong>what your test does and doesn’t detect</strong>.</li>



<li>For new designs, <strong>select devices that are easier to proof test</strong>.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Related Reading and Resources</h2>



<p><strong>Internal Links:</strong></p>



<ul class="wp-block-list">
<li>Glossary: <a href="https://silsafe.net/glossary/proof-test-coverage-cpt/" data-type="glossary" data-id="794">Cpt</a>, <a href="https://silsafe.net/glossary/proof-test/" data-type="glossary" data-id="788">Proof Test</a>, <a href="https://silsafe.net/glossary/proof-test-interval/" data-type="glossary" data-id="790">Proof Test Interval</a></li>



<li>Blog: Intro course on <a href="https://silsafe.net/functional-safety-for-the-process-industry/" data-type="post" data-id="6100">Functional Safety for the Process Industry</a></li>



<li>Blog: <a href="https://silsafe.net/proof-testing-of-sifs/" data-type="post" data-id="33">Understanding Proof Testing</a></li>



<li>Blog: Deep dive on <a href="https://silsafe.net/failure-rates-in-functional-safety/" data-type="post" data-id="4221">failure rate</a></li>
</ul>



<p><strong>External Resources:</strong></p>



<ul class="wp-block-list">
<li><a href="https://oreda.com/" target="_blank" rel="noopener">OREDA Failure Database</a></li>



<li>Exida&#8217;s Safety Equipment Reliability Handbook (<a href="https://www.exida.com/books/safety-equipment-reliability-handbook-4th-edition" target="_blank" rel="noopener">SERH</a>) &#8211; includes proof test coverage data</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Q&amp;A</h2>



<p><strong>1. How do I figure out what Cpt to use in my SIFs?</strong><br>Start with the equipment documentation. If certified, use the FMEDA. If not, use judgment, external databases, and document everything.</p>



<p><strong>2. Can I assume 100% Cpt if I fully test a valve via a FVST?</strong><br>Not quite. While FVST gets close, it might miss failure modes like sticking during partial actuation or internal bypass.</p>



<p><strong>3. How is Cpt different from diagnostic coverage?</strong><br>Cpt measures what your manually performed or manually initiated <strong>test</strong> can catch. Diagnostic coverage measures what the device’s <strong>self-checks</strong> can catch.</p>



<p><strong>4. Does increasing test frequency help more than increasing Cpt?</strong><br>They both help — but increasing Cpt often gives more impact with fewer operational interruptions.</p>



<p><strong>5. What’s the best way to improve Cpt without changing the system?</strong><br>Upgrade your test method. Add leak testing, position feedback, or combine manual and automated routines.</p>



<p><strong>6. Does Cpt apply to safe failures as well?</strong><br>No. Cpt is defined only on dangerous undetected failure modes. Proof tests may reveal safe failures, but those affect spurious trip rate and availability, not the Cpt value used in PFDavg.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Want to go deeper? Check out our full article on <a href="https://silsafe.net/proof-testing-of-sifs/" data-type="post" data-id="33">Proof Testing</a> here or visit the <a href="https://silsafe.net/functional-safety-glossary/" data-type="page" data-id="391">glossary </a>for more functional safety terms.</p>



<!-- FAQPage Schema (JSON-LD) — Verbatim conversion -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do I figure out what Cpt to use in my SIFs?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Start with the equipment documentation. If certified, use the FMEDA. If not, use judgment, external databases, and document everything."
      }
    },
    {
      "@type": "Question",
      "name": "Can I assume 100% Cpt if I fully test a valve via a FVST?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Not quite. While FVST gets close, it might miss failure modes like sticking during partial actuation or internal bypass."
      }
    },
    {
      "@type": "Question",
      "name": "How is Cpt different from diagnostic coverage?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "It measures what your manually performed or manually initiated test can catch. Diagnostic coverage measures what the device’s self-checks can catch."
      }
    },
    {
      "@type": "Question",
      "name": "Does increasing test frequency help more than increasing Cpt?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "They both help — but increasing Cpt often gives more impact with fewer operational interruptions."
      }
    },
    {
      "@type": "Question",
      "name": "What’s the best way to improve Cpt without changing the system?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Upgrade your test method. Add leak testing, position feedback, or combine manual and automated routines."
      }
    },
    {
      "@type": "Question",
      "name": "Does Cpt apply to safe failures as well?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "No. Cpt is defined only on dangerous undetected failure modes. Proof tests may reveal safe failures, but those affect spurious trip rate and availability, not the Cpt value used in PFDavg."
      }
    }
  ]
}
</script>
]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>How MTTR Affects Your SIL Calculations: A Beginner’s Guide to Mean Time to Restore</title>
		<link>https://silsafe.net/mean-time-to-restore-mttr/</link>
					<comments>https://silsafe.net/mean-time-to-restore-mttr/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Fri, 09 May 2025 00:55:19 +0000</pubDate>
				<category><![CDATA[Math Related]]></category>
		<category><![CDATA[Advanced]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=1129</guid>

					<description><![CDATA[Learn how MTTR (Mean Time to Restore) fits into PFDavg and STR in IEC 61511 compliance. A clear, beginner-friendly guide for functional safety.]]></description>
										<content:encoded><![CDATA[
<p>When functional safety calculations are discussed, engineers often focus on things like hardware fault rates, test intervals, or architectural constraints. One input that is often misunderstood — or misused — is MTTR: <strong>Mean Time to Restore</strong>. While it sounds simple, this term carries specific meaning in the context of IEC 61511 and can significantly impact SIL verification and spurious trip rate (STR) evaluations.</p>



<p>This article is a beginner’s guide to what MTTR actually is, when it should (and should not) be used in probability of failure on demand (PFDavg) calculations, and how getting it wrong could lead to overconfident or unrealistic safety claims.</p>



<h2 class="wp-block-heading">What Is MTTR (Mean Time to Restore)?</h2>



<p>Per IEC 61511, <strong>MTTR</strong> refers to the time it takes to restore a safety instrumented function (SIF) to its proper operating state after a failure has occurred.&nbsp; Importantly — and often overlooked — this is <strong>different from</strong>&nbsp;mean repair time (MRT).</p>



<p><strong>This includes the following four components:</strong></p>



<ul class="wp-block-list">
<li><strong>(a)</strong> Time to detect the failure &#8211; from the point of failure to when the diagnostic alert is triggered</li>



<li><strong>(b)</strong> Time spent before starting the repair, such as paperwork and ordering parts</li>



<li><strong>(c)</strong> Time to perform the repair itself</li>



<li><strong>(d)</strong> Time to perform the restoration and return the function to service</li>
</ul>



<p>So:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="531" height="80" src="https://silsafe.net/wp-content/uploads/2025/06/MTTR-equation.webp" alt="Equation showing MTTR equals a plus b plus c plus d, representing Mean Time to Restore" class="wp-image-2485" style="width:229px;height:auto" srcset="https://silsafe.net/wp-content/uploads/2025/06/MTTR-equation.webp 531w, https://silsafe.net/wp-content/uploads/2025/06/MTTR-equation-300x45.webp 300w" sizes="auto, (max-width: 531px) 100vw, 531px" /></figure>
</div>


<p>In contrast, <strong>MRT (Mean Repair Time)</strong> is:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="432" height="73" src="https://silsafe.net/wp-content/uploads/2025/06/MRT-equation.webp" alt="Equation showing MRT equals b plus c plus d, representing Mean Repair Time" class="wp-image-2484" style="width:168px;height:auto" srcset="https://silsafe.net/wp-content/uploads/2025/06/MRT-equation.webp 432w, https://silsafe.net/wp-content/uploads/2025/06/MRT-equation-300x51.webp 300w" sizes="auto, (max-width: 432px) 100vw, 432px" /></figure>
</div>


<p>This distinction matters. MTTR as defined in IEC 61511 considers the full window of vulnerability — from initial failure to full recovery.&nbsp; Both MTTR and MRT are used in PFDavg calculations.</p>



<p>This is important because in safety calculations, you&#8217;re modeling how long a function is unavailable and how much risk accumulates during that time.</p>



<p><strong>Typical Ranges:</strong></p>



<ul class="wp-block-list">
<li><strong>Low-end </strong>(e.g., redundant sensor swap with on-site spares): 2–4 hours</li>



<li><strong>High-end </strong>(e.g., submerged valve with specialist access): weeks or even months</li>
</ul>



<h2 class="wp-block-heading">When MTTR Is (and Isn’t) Used in PFDavg</h2>



<p>Mean time to restore is NOT always used in PFDavg.&nbsp;</p>



<p><strong>Rule of thumb:</strong> MTTR only affects PFDavg when a dangerous failure is <strong>detected</strong> by diagnostics and is reported only &#8211; meaning it doesn’t cause an immediate trip of the SIF, nor does it vote to trip. In essence, the SIF is still online but <strong>functionally unavailable</strong> and has effectively already &#8220;failed.&#8221;</p>



<p><strong>Not used in PFDavg:</strong></p>



<p>The following are examples where PFDavg would not include MTTR contributions:</p>



<ul class="wp-block-list">
<li>Example &#8211; A 2oo3 instrument SIF where one instrument has a DD error, and that instrument votes TRUE in the 2oo3 logic. This is a common scenario and a typical way SIFs are designed.</li>



<li>Example &#8211; A 2oo3 instrument SIF where one instrument has a DD error, and the system is designed to trip the ENTIRE SIF. This is a rare scenario but could happen.</li>
</ul>



<p><strong>Incorrect usage example:</strong> Applying MTTR to a demand-mode final element that fails silently and has no on-board diagnostics.&nbsp; This underestimates PFDavg.</p>



<h2 class="wp-block-heading">MTTR Impact on PFDavg</h2>



<p>MTTR can influence PFDavg calculations in some subtle ways. The first is the classic case of a dangerous detected (DD) failure, which has the most impact. The second is included because it sits right next to MTTR conceptually, but actually involves another term called mean repair time (MRT).</p>



<h3 class="wp-block-heading">1. DD Failures that are Reported only</h3>



<p>In this case, a component within the SIF experiences a <strong>dangerous detected failure (DD)</strong> that is reported but does not cause a trip or a vote-to-trip. The SIF remains active, but risk is accumulating because the failure has not yet been remedied. During this window the SIF is assumed to be non-operational.</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="512" height="71" src="https://silsafe.net/wp-content/uploads/2025/06/PFDavg-DD.webp" alt="Equation showing PFDavg equals lambda DD times MTTR" class="wp-image-2487" style="width:211px;height:auto" srcset="https://silsafe.net/wp-content/uploads/2025/06/PFDavg-DD.webp 512w, https://silsafe.net/wp-content/uploads/2025/06/PFDavg-DD-300x42.webp 300w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p><strong>Where:</strong></p>



<ul class="wp-block-list">
<li>λ<sub>DD</sub> = dangerous detected failure rate</li>



<li>MTTR = mean time to restore</li>
</ul>
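<p>This contribution is just a product, so it is easy to sanity-check. A minimal sketch with hypothetical values (λ<sub>DD</sub> = 1E-6 per hour, MTTR = 72 hours — neither is from a real device):</p>

```python
# PFDavg contribution from a dangerous detected (DD) failure that is
# reported only: the function is unavailable for the full MTTR window.
lam_dd = 1e-6   # hypothetical dangerous detected failure rate (per hour)
mttr = 72.0     # hypothetical mean time to restore (hours)

pfd_dd = lam_dd * mttr
print(f"PFDavg (DD term) = {pfd_dd:.1e}")  # 7.2e-05
```

<p>Note the term scales linearly with MTTR: tripling the restore time triples this contribution.</p>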



<h3 class="wp-block-heading">2. Very Closely Related &#8211; SIF Bypass Periods (e.g., Proof Testing or Maintenance)</h3>



<p>When the entire SIF is placed in bypass — such as during <strong>proof testing</strong> or <strong>temporary overrides</strong> — MRT (not MTTR) contributes. The logic is that if a failure is discovered during testing, the SIF must be restored, so MRT becomes relevant.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" width="673" height="130" src="https://silsafe.net/wp-content/uploads/2025/06/PFDavg-bypass-term.webp" alt="Equation showing PFDavg equals PTD over TI plus lambda DU times MRT" class="wp-image-2486" style="width:292px;height:auto" srcset="https://silsafe.net/wp-content/uploads/2025/06/PFDavg-bypass-term.webp 673w, https://silsafe.net/wp-content/uploads/2025/06/PFDavg-bypass-term-300x58.webp 300w" sizes="auto, (max-width: 673px) 100vw, 673px" /></figure>
</div>


<p><strong>Where:</strong></p>



<ul class="wp-block-list">
<li>PTD = proof test duration</li>



<li>TI = test interval</li>



<li>λ<sub>DU</sub> = dangerous undetected failure rate</li>



<li>MRT = mean repair time</li>
</ul>



<p>It can be confusing that λ<sub>DU</sub> is included even though the failure was &#8220;detected&#8221; during proof testing. Remember: &#8220;detected&#8221; versus &#8220;undetected&#8221; refers to automatic diagnostics, not proof tests. If a proof test finds a seized solenoid valve, that failure falls under λ<sub>DU</sub>. It can also be confusing that MTTR and MRT can be nearly equal when diagnostics are fast, since the detection time (a) approaches zero.</p>
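<p>Putting the bypass term into numbers — a sketch with assumed values (an 8-hour annual proof test, λ<sub>DU</sub> = 5E-7 per hour, MRT = 24 hours; all illustrative):</p>

```python
# PFDavg contribution from SIF bypass: time spent in bypass during
# proof testing (PTD/TI) plus restoration of failures the test reveals.
ptd = 8.0        # hypothetical proof test duration (hours)
ti = 8760.0      # annual test interval (hours)
lam_du = 5e-7    # hypothetical dangerous undetected failure rate (per hour)
mrt = 24.0       # hypothetical mean repair time (hours)

pfd_bypass = ptd / ti + lam_du * mrt
print(f"PFDavg (bypass term) = {pfd_bypass:.2e}")
```

<p>With these assumptions the PTD/TI term dominates, so shortening the bypass window does more good here than shaving hours off MRT.</p>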



<h2 class="wp-block-heading">How Much Does MTTR Contribute to PFDavg?</h2>



<p>Now that we&#8217;ve seen where Mean Time to Restore enters the PFDavg equations, the next logical question is: <em>how much of the total PFDavg is typically due to this value?</em></p>



<p>The answer depends on the architecture and diagnostic design of the SIF, but in many cases the MTTR-related terms can be a <strong>major portion</strong> of the overall PFDavg.</p>



<h3 class="wp-block-heading">Example Breakdown:</h3>



<p>Suppose a SIF has the following contributors:</p>



<ul class="wp-block-list">
<li>Dangerous undetected term (TI/2 component): <strong>6.00E-3</strong></li>



<li>Diagnostic-only (λ<sub>DD</sub> × MTTR): <strong>2.00E-3</strong></li>



<li>Bypass and restoration term: <strong>1.00E-3</strong></li>
</ul>



<p><strong>Total PFDavg = 9.00E-3</strong></p>



<p>In this example, <strong>MTTR-based contributions account for 3.00E-3</strong>, or <strong>33%</strong> of the total PFDavg.</p>
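<p>The breakdown can be verified in a few lines, using the contributor values from the example above:</p>

```python
# Contribution breakdown from the example above.
du_term = 6.00e-3        # dangerous undetected (TI/2) component
dd_mttr_term = 2.00e-3   # diagnostic-only (lambda_DD x MTTR)
bypass_term = 1.00e-3    # bypass and restoration

total = du_term + dd_mttr_term + bypass_term
mttr_related = dd_mttr_term + bypass_term
print(f"Total PFDavg = {total:.2e}")                     # 9.00e-03
print(f"MTTR-related share = {mttr_related / total:.0%}")  # 33%
```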



<h3 class="wp-block-heading">Sensitivity to MTTR Changes</h3>



<p>If the time the SIF is unavailable increases from 8 hours to 72 hours or beyond — as can easily happen with inaccessible equipment — that 33% could grow to <strong>50% or more</strong> of the total risk profile.</p>



<p>This reinforces a key takeaway:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>It isn&#8217;t just a footnote — it&#8217;s a meaningful design and reliability driver.</strong></p>
</blockquote>



<p>When conducting SIL verification, always explore how much of your calculated PFDavg comes from MTTR-related sources. If it’s significant, you may want to revisit response logistics, spares availability, or automated recovery strategies.&nbsp; Designing a SIS is a balance of many things.</p>



<h2 class="wp-block-heading">Impact on Spurious Trip Rate (STR)</h2>



<p>A longer MTTR increases the <strong>exposure window</strong> for nuisance or spurious trips. Consider a logic solver that falsely detects a dangerous condition. If the system cannot be restored quickly, operations may suffer unnecessary downtime or escalation.</p>



<p>Designs that balance fault tolerance, diagnostic alerts, and <strong>realistic restore times</strong> can minimize the impact of STR on both safety and availability.&nbsp; Different facilities will have completely different tolerances of STR.&nbsp; Perhaps one per year is acceptable, perhaps not.</p>



<h2 class="wp-block-heading">What IEC 61511 Says</h2>



<p>IEC 61511 emphasizes the need to use <strong>realistic</strong> — not idealized — values in SIL verification.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>“The MTTR shall take into account all delays including diagnosis, personnel response, spares availability, and repair.” <em>(paraphrased for clarity)</em></p>
</blockquote>



<p><strong>Be cautious of defaulting to 8 hours.</strong> Unless supported by site history, vendor specs, or service agreements, this assumption could invalidate a SIL claim.</p>



<h2 class="wp-block-heading">Common Mistakes </h2>



<ul class="wp-block-list">
<li>Confusing MTTR with MRT</li>



<li><strong>Using overly optimistic values</strong> — e.g., assuming parts, tools, and staff are instantly available</li>



<li><strong>Ignoring logistics</strong> — A poorly managed work order system may introduce delays of several days</li>



<li><strong>Assuming the same value for all elements</strong> — A sensor may be quick to restore, but a final element might require complex isolation and drainage</li>
</ul>



<h2 class="wp-block-heading">Best Practices for MTTR in SIL Verification</h2>



<ul class="wp-block-list">
<li><strong>Document separately </strong>for each device type (sensor, logic solver, final element)</li>



<li><strong>Source your data</strong>: field data, OEM specs, or reliability databases (OREDA, etc.)</li>



<li><strong>Challenge vendor-provided </strong>values if they seem unrealistic</li>



<li><strong>Responsibility</strong>: SIS designers typically own the MTTR assumption, but input from operations and maintenance is essential</li>



<li><strong>Record all assumptions</strong> in the Safety Requirements Specification (SRS) and in SIL verification reports</li>
</ul>



<h2 class="wp-block-heading">Q&amp;A</h2>



<p><strong>1. When does mean time to restore impact PFDavg?</strong><br>Only when a SIF component&#8217;s failure has been detected by diagnostics and is reported only.&nbsp; Thus, the SIF is unavailable but not tripped.</p>



<p><strong>2. How does MTTR differ from MRT?</strong><br>MTTR includes detection time (a); MRT begins after detection, at (b). MTTR is therefore longer and more conservative.</p>



<p><strong>3. Do these values affect high-demand SIFs?</strong><br>Generally no — high-demand systems use different metrics like PFH (probability of failure per hour).</p>



<p><strong>4. Should MTTR differ by component?</strong><br>Yes. Sensors, logic solvers, and final elements can have drastically different restore times.</p>



<p><strong>5. Can MTTR be impacted by facility bureaucracy or procurement delays?</strong><br>Absolutely. Delays in approvals, permits, or spare part procurement can dramatically increase the time a SIF is unavailable.&nbsp; These should be discussed and accounted for in determining these values.&nbsp; It could be that the architecture needs to change if the facility is unwilling to have working spares.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Mean time to restore might seem like a small parameter, but it has outsized influence on how we evaluate and justify risk reduction in a functional safety system. Whether you&#8217;re calculating PFDavg or trying to keep STR under control, using accurate, documented, and conservative MTTR values helps ensure your SIL claims are realistic — and defensible.</p>



<p>Audit your assumptions. Align with IEC 61511. And don&#8217;t let this &#8220;minor&#8221; input undermine your entire safety case.</p>



<p>Limble has a <a href="https://limble.com/learn/metrics/failure-metrics/" target="_blank" rel="noopener">great article</a> discussing MTTR and comparing it with other similar concepts.</p>



<!-- FAQPage Schema (JSON-LD) — paste into a WordPress “Custom HTML” block -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "When does mean time to restore impact PFDavg?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Only when a component of a SIF has detected a failure and reported it. In this case, the SIF is unavailable but not tripped."
      }
    },
    {
      "@type": "Question",
      "name": "How does MTTR differ from MRT?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "MTTR includes detection time (a), while MRT starts with diagnosis (b). MTTR is therefore longer and more conservative."
      }
    },
    {
      "@type": "Question",
      "name": "Do these values affect high-demand SIFs?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Generally no — high-demand systems use different metrics such as PFH (probability of failure per hour)."
      }
    },
    {
      "@type": "Question",
      "name": "Should MTTR differ by component?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes. Sensors, logic solvers, and final elements can have drastically different restore times."
      }
    },
    {
      "@type": "Question",
      "name": "Can MTTR be impacted by facility bureaucracy or procurement delays?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Absolutely. Delays in approvals, permits, or spare part procurement can dramatically increase the time a SIF is unavailable. These should be discussed and accounted for when determining MTTR. In some cases, the architecture may need to change if the facility is unwilling to maintain working spares."
      }
    }
  ]
}
</script>

]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/mean-time-to-restore-mttr/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>PFDavg Explained: 6 Essentials for Getting Started with SIL Calculations</title>
		<link>https://silsafe.net/pfdavg-explained/</link>
					<comments>https://silsafe.net/pfdavg-explained/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Sat, 26 Apr 2025 19:50:21 +0000</pubDate>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Math Related]]></category>
		<category><![CDATA[PFDavg]]></category>
		<category><![CDATA[proof testing]]></category>
		<category><![CDATA[SIFs]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=184</guid>

					<description><![CDATA[This beginner-friendly guide explains what PFDavg (Probability of Failure on Demand) means in functional safety, how it's calculated, and why it matters under IEC 61511. With clear examples, key definitions, and a breakdown of the safety loop, this article is your starting point for designing effective Safety Instrumented Functions (SIFs).]]></description>
										<content:encoded><![CDATA[
<p>Understanding <strong>PFDavg</strong>—short for <em>Probability of Failure on Demand, average</em>—is foundational if you&#8217;re new to functional safety and the IEC 61511 framework. Whether you&#8217;re supporting a Safety Instrumented System (SIS) in oil &amp; gas, biogas, chemicals, or any other process industry, developing a solid understanding of this metric helps you design safer systems and comply with regulatory expectations.</p>



<p>This guide walks you through six essential concepts about PFDavg that every engineer should understand when evaluating or designing Safety Instrumented Functions (SIFs).</p>



<h2 class="wp-block-heading">1. What Is PFDavg and Why It Matters</h2>



<p>A Safety Instrumented Function (SIF) is a specific set of equipment—usually a sensor, logic solver, and final element—designed to take a process to a safe state when a defined hazardous condition is detected.</p>



<p>PFDavg is the average likelihood that a Safety Instrumented Function (SIF) will fail <em>when</em> it&#8217;s needed. It quantifies the chance that your safeguard won&#8217;t respond in a dangerous situation—basically, the &#8220;gap&#8221; in protection.</p>



<p>Most SIS applications operate in <strong>low-demand mode</strong>, meaning they’re called upon infrequently (e.g., fewer than once per year). For these systems, PFDavg is the go-to metric for quantifying performance.</p>



<p>Why does it matter? Because PFDavg is directly tied to your Safety Integrity Level (SIL) target. If your value is too high, your SIF doesn’t meet the required SIL—and that means your process risk isn&#8217;t sufficiently reduced.</p>



<p>External resource: <a href="https://www.isa.org/standards-and-publications/isa-standards/isa-standards-committees/isa84" target="_blank" rel="noopener">ISA Functional Safety</a></p>



<h2 class="wp-block-heading">2. How PFDavg Determines Safety Integrity Level (SIL)</h2>



<p>The <strong>Safety Integrity Level (SIL)</strong> is a measure of how much risk reduction a SIF provides. It’s defined using PFDavg thresholds.&nbsp; The following table summarizes the PFDavg ranges and associated risk reduction factor (RRF) for each SIL level:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>SIL Level</th><th>PFDavg Range</th><th>RRF Range</th></tr></thead><tbody><tr><td>SIL 1</td><td>≥1.0E-2 to &lt;1.0E-1</td><td>10 to &lt;100</td></tr><tr><td>SIL 2</td><td>≥1.0E-3 to &lt;1.0E-2</td><td>100 to &lt;1,000</td></tr><tr><td>SIL 3</td><td>≥1.0E-4 to &lt;1.0E-3</td><td>1,000 to &lt;10,000</td></tr><tr><td>SIL 4</td><td>≥1.0E-5 to &lt;1.0E-4</td><td>10,000 to &lt;100,000</td></tr></tbody></table></figure>



<p>In the process industry, you’ll typically see SIL 1 to SIL 3 used. SIL 4 is rare and generally reserved for nuclear or aerospace applications.</p>
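<p>The table above maps directly to a simple lookup. Here is a sketch (the function name is illustrative, not from any standard library):</p>

```python
# Map a PFDavg value to its SIL band (low-demand mode), per the
# table above. Returns 0 if the value doesn't reach SIL 1.
def sil_from_pfdavg(pfd):
    if 1e-5 <= pfd < 1e-4:
        return 4
    if 1e-4 <= pfd < 1e-3:
        return 3
    if 1e-3 <= pfd < 1e-2:
        return 2
    if 1e-2 <= pfd < 1e-1:
        return 1
    return 0

print(sil_from_pfdavg(4.38e-3))  # 2
print(sil_from_pfdavg(2.5e-4))   # 3
```

<p>Recall that RRF is just the reciprocal of PFDavg, so the RRF column in the table is 1/PFDavg at each band boundary.</p>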



<p>External reference: <a href="https://www.iec.ch/functionalsafety/" target="_blank" rel="noopener">IEC Functional Safety</a></p>



<h2 class="wp-block-heading">3. The Core PFDavg Equation</h2>



<p>For low-demand systems, a simplified PFDavg formula looks like this:</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="212" height="58" src="https://silsafe.net/wp-content/uploads/2025/04/pfdavg-basic-equation.png" alt="PFDavg basic equation" class="wp-image-188"/></figure>



<p>Where:</p>



<ul class="wp-block-list">
<li>λdu: Dangerous undetected failure rate (failures per hour)</li>



<li>TI: Proof test interval (hours)</li>
</ul>



<p>This equation assumes perfect testing, no redundancy, and no diagnostics. It&#8217;s a good starting point, but real-world systems require more advanced modeling.</p>



<p>More complete equations may also include:</p>



<ul class="wp-block-list">
<li><strong>Proof Test Coverage (Cpt)</strong></li>



<li><strong>Mean Time to Restore (MTTR)</strong></li>



<li><strong>Common Cause Failure, beta factor (β)</strong></li>
</ul>



<p>These concepts—<em>Cpt</em>, <em>MTTR</em>, and <em>β</em>—will be covered in future posts.</p>



<h2 class="wp-block-heading">4. Key Terms You Need to Know</h2>



<h3 class="wp-block-heading">Proof Test Interval (TI)</h3>



<p>How often the system is tested to reveal hidden failures. A longer TI increases PFDavg because failures remain latent for longer.</p>



<h3 class="wp-block-heading">Mean Time to Restore (MTTR)</h3>



<p>The average time it takes to restore the system to a working state once a failure is discovered. This influences overall system unavailability. <em>(Future article topic)</em></p>



<h3 class="wp-block-heading">Proof Test Coverage (Cpt)</h3>



<p>Represents the fraction of dangerous undetected failures that are revealed by a proof test. The higher the Cpt, the more effective the test, and the lower your average probability of failure on demand. This is especially critical in systems that lack built-in diagnostics. <em>(Future article topic)</em></p>



<h3 class="wp-block-heading">Common Cause Failure</h3>



<p>Common cause failure occurs when two or more components that are supposed to provide redundancy fail simultaneously due to a shared cause. These causes might include shared environmental factors (like temperature or humidity), a common power supply, or even human error.&nbsp; In SIL verification calculations, the beta factor (β) represents the portion of failure that cannot be assumed to be independent. Properly accounting for β is critical in ensuring that your redundant architecture isn&#8217;t giving a false sense of risk reduction. <em>(Future article topic)</em></p>



<h2 class="wp-block-heading">5. A Simple Example</h2>



<p>Let’s say you have a single-channel SIF configured in a 1oo1 architecture, using a sensor that has a dangerous undetected failure rate of 1E-6 per hour. The system is proof tested every year (8760 hours).</p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="314" height="42" src="https://silsafe.net/wp-content/uploads/2025/04/pfdavg-calculation-example.png" alt="Example calculation of PFDavg (PFD Average)" class="wp-image-190" srcset="https://silsafe.net/wp-content/uploads/2025/04/pfdavg-calculation-example.png 314w, https://silsafe.net/wp-content/uploads/2025/04/pfdavg-calculation-example-300x40.png 300w" sizes="auto, (max-width: 314px) 100vw, 314px" /></figure>



<p>This value falls within the SIL 2 range and corresponds to a Risk Reduction Factor (RRF) of 228.</p>
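<p>As a quick numeric check, the simplified 1oo1 arithmetic above can be reproduced in a few lines:</p>

```python
# Reproducing the worked example: PFDavg = lambda_DU * TI / 2 (1oo1 only).

lambda_du = 1e-6   # dangerous undetected failure rate, per hour
ti = 8760          # proof test interval: one year, in hours

pfd_avg = lambda_du * ti / 2
rrf = 1 / pfd_avg  # Risk Reduction Factor

print(pfd_avg)  # ~0.00438, within the SIL 2 band (0.001 to 0.01)
print(rrf)      # ~228
```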



<p>Now, if you want to increase this to SIL 3, you will need to&#8230;</p>



<ul class="wp-block-list">
<li>Reduce the failure rate</li>



<li>Reduce the TI</li>



<li>Add redundancy by adjusting the architecture (which would use another equation)</li>



<li>Increase diagnostic coverage</li>
</ul>



<p>Note that this formula is only valid for a 1oo1 architecture and represents a simplified approach to these calculations.</p>



<h2 class="wp-block-heading">6. How Design Choices Affect PFDavg</h2>



<p>Designers can manipulate several variables to drive the value down:</p>



<ul class="wp-block-list">
<li><strong>Shorter TI</strong>: More frequent testing catches failures earlier.</li>



<li><strong>Redundant architecture</strong>: 1oo2 or 2oo3 configurations use different equations.</li>



<li><strong>Improved diagnostics</strong>: Increases Cpt, reducing undetected failure exposure.</li>



<li><strong>Better components</strong>: Lower intrinsic failure rates (λD).</li>
</ul>



<p>Each choice comes with trade-offs in cost, complexity, and operational downtime. The goal is to achieve <em>just enough</em> safety—not overdesign.</p>
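<p>A small sketch (hypothetical values, the simplified 1oo1 formula only) shows how two of these levers scale the result:</p>

```python
# Sensitivity sketch: in the simplified 1oo1 formula, PFDavg = lambda_DU * TI / 2,
# so halving either the test interval or the failure rate halves the result.
# Values are hypothetical.

def pfd_avg_1oo1(lambda_du, ti_hours):
    return lambda_du * ti_hours / 2

base = pfd_avg_1oo1(1e-6, 8760)      # annual test, baseline
print(base)                          # ~0.00438
print(pfd_avg_1oo1(1e-6, 8760 / 2))  # semi-annual testing: halves PFDavg
print(pfd_avg_1oo1(5e-7, 8760))      # better component: same effect
```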



<h2 class="wp-block-heading">In Summary</h2>



<p>PFDavg is a key performance metric in functional safety for low-demand SIFs. It links directly to SIL and guides the engineering design and validation process. Understanding the basics of how it’s calculated—and what affects it—helps ensure you’re designing systems that meet the risk reduction targets set by IEC 61511.</p>



<p>Future articles will dive deeper into Cpt, MTTR, and β, each of which plays a role in real-world SIL verification.</p>



<h2 class="wp-block-heading">Quick Q&amp;A</h2>



<p><strong>What is the difference between PFDavg and PFH?</strong><br>PFDavg applies to low-demand systems; PFH (Probability of Failure per Hour) applies to high/continuous-demand systems.</p>



<p><strong>Can PFDavg be too low?</strong><br>Yes. Overdesign increases cost and complexity without necessarily increasing safety proportionally.</p>



<p><strong>Does IEC 61511 require PFDavg calculations?</strong><br>Yes—for each SIF, PFDavg (or PFH) must be calculated to demonstrate compliance with the required SIL.</p>



<p><strong>Is there a standard way to calculate it?</strong><br>Yes and no. IEC 61508 and IEC 61511 provide methods and allow both simplified and detailed probabilistic approaches, so the engineer can choose among different levels of equation complexity.</p>



<p><strong>What determines what the PFDavg needs to be?</strong><br>The required value is dictated by your hazard and risk assessment (H&amp;RA)—typically performed through a Layer of Protection Analysis (LOPA). This is what tells you how much risk reduction is needed, which in turn defines your SIL target.</p>



<p><strong>What role does proof testing play?</strong><br>It reveals hidden failures. The longer the proof test interval (TI), the higher the PFDavg.</p>



<p><strong>Does the math formula change with different voting logic (e.g., 1oo2, 2oo3)?</strong><br>Absolutely. The basic formula you&#8217;ve seen assumes a 1oo1 architecture. Different configurations use different math, and these differences can significantly change the resulting calculation.</p>



<!-- FAQPage Schema (JSON-LD) — VERBATIM, EXACT REPLICATION -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the difference between PFDavg and PFH?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "PFDavg applies to low-demand systems; PFH (Probability of Failure per Hour) applies to high/continuous-demand systems."
      }
    },
    {
      "@type": "Question",
      "name": "Can PFDavg be too low?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes. Overdesign increases cost and complexity without necessarily increasing safety proportionally."
      }
    },
    {
      "@type": "Question",
      "name": "Does IEC 61511 require PFDavg calculations?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes—for each SIF, PFDavg (or PFH) must be calculated to demonstrate compliance with the required SIL."
      }
    },
    {
      "@type": "Question",
      "name": "Is there a standard way to calculate it?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes and no. IEC 61508 and IEC 61511 provide methods and allow both simplified and detailed probabilistic approaches, so the engineer can choose among different levels of equation complexity."
      }
    },
    {
      "@type": "Question",
      "name": "What determines what the PFDavg needs to be?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The required value is dictated by your hazard and risk assessment (H&RA)—typically performed through a Layer of Protection Analysis (LOPA). This is what tells you how much risk reduction is needed, which in turn defines your SIL target."
      }
    },
    {
      "@type": "Question",
      "name": "What role does proof testing play?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "It reveals hidden failures. The longer the proof test interval (TI), the higher the PFDavg."
      }
    },
    {
      "@type": "Question",
      "name": "Does the math formula change with different voting logic (e.g., 1oo2, 2oo3)?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Absolutely. The basic formula you’ve seen assumes a 1oo1 architecture. Different configurations use different math, and these differences can significantly change the resulting calculation."
      }
    }
  ]
}
</script>

]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/pfdavg-explained/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Proof Testing of SIFs: Understanding Its 3 Purposes and Its Importance</title>
		<link>https://silsafe.net/proof-testing-of-sifs/</link>
					<comments>https://silsafe.net/proof-testing-of-sifs/#respond</comments>
		
		<dc:creator><![CDATA[mamerten]]></dc:creator>
		<pubDate>Sat, 15 Mar 2025 19:02:46 +0000</pubDate>
				<category><![CDATA[Beginner]]></category>
		<category><![CDATA[Math Related]]></category>
		<category><![CDATA[PFDavg]]></category>
		<category><![CDATA[proof testing]]></category>
		<guid isPermaLink="false">https://silsafe.net/?p=33</guid>

					<description><![CDATA[Proof testing ensures your Safety Instrumented Functions (SIFs) work when needed by uncovering hidden faults. Learn why proof tests are crucial for lowering PFDavg and maintaining Safety Integrity Level (SIL) targets under IEC 61511.]]></description>
										<content:encoded><![CDATA[
<p>If you&#8217;re new to functional safety, you&#8217;ve likely encountered terms like proof testing, Safety Instrumented Functions (SIF), and Probability of Failure on Demand (PFDavg). Understanding these concepts is crucial for ensuring the reliability of safety systems in process industries. In this article, we&#8217;ll clearly explain <strong>proof testing</strong>, why it&#8217;s essential, how it impacts the PFDavg calculation, and its relationship with Safety Integrity Level (SIL).</p>



<h2 class="wp-block-heading">What Exactly is Proof Testing?</h2>



<p>Proof testing refers to manually performed or manually initiated periodic tests conducted on Safety Instrumented Systems (SIS) to reveal hidden or dormant faults. The main goal is ensuring each SIF—which consists of sensors, logic solvers, and final elements—can reliably activate when needed. These tests simulate conditions that activate the safety system, confirming functionality and identifying issues that could compromise performance during an actual emergency.</p>



<p>IEC 61511, the international standard guiding functional safety in the process industry, explicitly mandates proof tests to maintain compliance and assure safety integrity.</p>



<p>Proof testing is not just a regulatory checkbox—it&#8217;s a central activity in any process safety lifecycle. When done properly, it enhances equipment reliability, informs maintenance strategies, and helps avoid unnecessary spurious trips or overlooked latent failures. These tests are especially important in facilities with aging infrastructure or complex safety systems, where regular inspections can be challenging.</p>



<h2 class="wp-block-heading">Why is it Essential?</h2>



<p>Regular proof testing helps:</p>



<ul class="wp-block-list">
<li><strong>Ensure Reliability:</strong> Detects hidden faults, increasing confidence that safety systems respond correctly when required.</li>



<li><strong>Achieve and Maintain SIL Ratings:</strong> Directly influences SIL by lowering PFDavg values, ensuring a system meets its designated safety objectives.</li>
</ul>



<h2 class="wp-block-heading">Connecting to PFDavg</h2>



<p>PFDavg is a metric used to gauge the likelihood that a safety system will fail to respond correctly upon demand. Frequent and thorough testing significantly reduces the PFDavg by detecting hidden failures and correcting them promptly.</p>



<h3 class="wp-block-heading">Understanding PFDavg Through a Simple Calculation:</h3>



<p>A simplified formula for calculating PFDavg is:</p>



<figure class="wp-block-image size-full is-resized"><img loading="lazy" decoding="async" width="212" height="58" src="https://silsafe.net/wp-content/uploads/2025/04/pfdavg-basic-equation.png" alt="PFDavg basic equation showing impact of proof testing interval (TI)" class="wp-image-188" style="width:216px;height:auto"/></figure>



<ul class="wp-block-list">
<li>λ<sub>DU</sub> is the dangerous undetected failure rate, often in 1/hour</li>



<li><em>TI</em> is the test interval between proof tests, expressed in the same time units as λ<sub>DU</sub></li>
</ul>



<p>Shorter test intervals result in a lower PFDavg, directly supporting higher SIL ratings.</p>



<h3 class="wp-block-heading">Practical Example:</h3>



<p>Imagine a SIF component with a dangerous undetected failure rate of 0.002 per year. With annual testing, your calculation would be:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="223" height="41" src="https://silsafe.net/wp-content/uploads/2025/03/pfdavg-with-proof-test.png" alt="PFDavg example calculation showing proof test" class="wp-image-181"/></figure>



<p>This PFDavg of 0.001 complies with SIL 2 requirements (ranging from 0.001 to 0.01). Extending test intervals increases PFDavg, reducing safety system reliability and potentially affecting the SIL rating.</p>
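<p>In code, with everything in per-year units, the example above is one line of arithmetic:</p>

```python
# The practical example: PFDavg = lambda_DU * TI / 2, using per-year units.

lambda_du = 0.002  # dangerous undetected failures per year
ti = 1.0           # proof test interval: annual testing, in years

pfd_avg = lambda_du * ti / 2
print(pfd_avg)  # 0.001, the lower edge of the SIL 2 band (0.001 to 0.01)
```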



<h2 class="wp-block-heading">Types of Tests and Operational Impact</h2>



<p>Different types of tests have distinct impacts on plant operations:</p>



<h3 class="wp-block-heading">Full Functional Test</h3>



<p>A complete system test, involving sensors, logic solvers, and actuators.</p>



<ul class="wp-block-list">
<li><strong>Operational Impact:</strong> Usually requires full shutdown.</li>



<li><strong>Advantages:</strong> Highest accuracy, confirms full loop functionality.</li>



<li><strong>Drawbacks:</strong> Production downtime and increased costs.</li>
</ul>



<h3 class="wp-block-heading">Partial-Stroke Test</h3>



<p>Typically used for valves, it partially moves the valve to test performance without shutting down the process.</p>



<ul class="wp-block-list">
<li><strong>Operational Impact:</strong> Minimal interruption, production continues.</li>



<li><strong>Advantages:</strong> Frequent monitoring, limited disruption.</li>



<li><strong>Drawbacks:</strong> Less comprehensive, may miss certain faults.</li>
</ul>



<h3 class="wp-block-heading">Diagnostic Testing &#8211; is Great &#8211; but is not Proof Testing</h3>



<p>Automated tests performed by components to detect faults continuously or at short intervals are a common feature in functional safety. However, these are NOT considered proof tests. Remember, proof tests are manually performed or manually initiated tests.</p>



<p>Automatic diagnostic testing is a powerful way to help ensure safety, but it is a distinct concept from proof testing and enters the math through different terms. Some notes on diagnostics:</p>



<ul class="wp-block-list">
<li><strong>Operational Impact:</strong> No direct operational impact.</li>



<li><strong>Advantages:</strong> Early fault detection, reduced manual testing frequency.</li>



<li><strong>Drawbacks:</strong> Can miss faults undetectable by automated methods.</li>
</ul>



<h2 class="wp-block-heading">Balancing Test Frequency and Practicality</h2>



<h3 class="wp-block-heading">Typical Test Intervals (TI) for SIFs</h3>



<p>The appropriate Test Interval (TI) for a Safety Instrumented Function (SIF) depends on many factors, including the required Safety Integrity Level (SIL), failure rate of components, Proof Test Coverage (Cpt), and the risk reduction targets.</p>



<p>In general practice:</p>



<ul class="wp-block-list">
<li><strong>SIL 1 SIFs</strong> often have TIs ranging from 1 to 5 years</li>



<li><strong>SIL 2 SIFs</strong> typically fall within a 1 to 2 year TI</li>



<li><strong>SIL 3 SIFs</strong> usually require TIs of 6 months to 1 year or even more frequent</li>
</ul>



<p>These ranges are not hard rules and must be validated by PFDavg calculations, which incorporate actual device data and proof test protocol effectiveness. For high-demand or continuous mode operations, the calculation basis and interval definitions change, often requiring more sophisticated modeling.</p>
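<p>One way to see where such ranges come from is to invert the simplified 1oo1 formula and ask for the longest TI that still meets a PFDavg target. This is a sketch with hypothetical values; as noted above, real intervals must be validated with full calculations and actual device data:</p>

```python
# Longest test interval that still meets a PFDavg target, from the
# simplified 1oo1 relation PFDavg = lambda_DU * TI / 2.
# The failure rate below is hypothetical.

def max_ti_years(pfd_target, lambda_du_per_year):
    return 2 * pfd_target / lambda_du_per_year

lam = 0.002  # dangerous undetected failures per year (hypothetical)

print(max_ti_years(0.01, lam))   # SIL 1 boundary target: ~10 years
print(max_ti_years(0.001, lam))  # SIL 2 boundary target: ~1 year
```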



<p>Determining testing frequency involves balancing safety, reliability, and operational practicality. More frequent testing reduces PFDavg but increases operational costs and downtime, while less frequent testing can increase risks due to undetected faults. The key is aligning testing frequency with safety goals and operational realities.</p>



<h3 class="wp-block-heading">Proof Test Coverage (Cpt)</h3>



<p>A related topic, Proof Test Coverage (Cpt), refers to the effectiveness of proof tests in detecting hidden faults, or more specifically, dangerous undetected (DU) failures. While this article focuses on the broader concept of proof testing, it&#8217;s worth noting that Cpt plays a significant role in how PFDavg is calculated. In essence, if a test is only capable of detecting a fraction (Cpt) of possible dangerous undetected failures, that limited effectiveness must be factored into the safety equations. This complicates the math and emphasizes the importance of having well-defined proof test protocols that clearly state what is and isn&#8217;t being tested.</p>
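<p>As a hedged sketch of one common way Cpt enters the equations (the dedicated article covers this properly), failures the test can reveal are limited by the test interval, while failures it cannot reveal persist until a full mission-time overhaul; all values here are hypothetical:</p>

```python
# Cpt-adjusted sketch of the simplified 1oo1 formula: a fraction Cpt of
# dangerous undetected failures is reset at every proof test (interval TI);
# the remainder persists until a full overhaul at mission time LT.
# Per-year units; all numbers are hypothetical.

def pfd_avg_with_cpt(lambda_du, ti, lifetime, cpt):
    covered = cpt * lambda_du * ti / 2
    uncovered = (1 - cpt) * lambda_du * lifetime / 2
    return covered + uncovered

lam, ti, lt = 0.002, 1.0, 15.0  # per-year rate, 1-yr TI, 15-yr mission

print(pfd_avg_with_cpt(lam, ti, lt, cpt=1.0))  # perfect test: 0.001
print(pfd_avg_with_cpt(lam, ti, lt, cpt=0.9))  # 90% coverage: ~0.0024
```

<p>Notice how an imperfect test more than doubles the PFDavg in this sketch, even though only 10% of the DU failures escape it.</p>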



<p>To learn more about Cpt, see <a href="https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/" data-type="post" data-id="2784">this article</a> for more detail.</p>



<h3 class="wp-block-heading">Balancing Testing with Operational Requirements</h3>



<p>While proof testing is essential for maintaining the integrity of SIFs, it often competes with operational demands. Some tests, especially full functional tests, require shutting down part or all of a process—a decision that carries significant production and cost implications. In other cases, testing may require bypassing safety functions, temporarily reducing the facility’s protection layers.</p>



<p>Therefore, organizations must carefully weigh the benefits of frequent and thorough testing against the impact on production schedules, safety availability, and maintenance workload. This balance often involves coordination across operations, safety, and engineering teams to align test intervals with planned outages, maintenance windows, and risk tolerance levels.</p>



<h2 class="wp-block-heading">Common Testing Pitfalls</h2>



<p>While not exhaustive, here are some common pitfalls organizations encounter:</p>



<ul class="wp-block-list">
<li>Poor planning leading to unnecessary operational disruptions.</li>



<li>Overreliance on partial-stroke and diagnostic testing without periodic full functional tests.</li>
</ul>



<p>Avoiding these issues enhances both safety and efficiency.</p>



<h2 class="wp-block-heading">Wrap-Up</h2>



<p>Proof testing of SIFs is fundamental in maintaining the reliability and effectiveness of safety systems. Regular and thorough tests directly influence PFDavg calculations, help maintain SIL ratings, and ensure compliance with IEC 61511. By clearly understanding and implementing these practices, your organization significantly enhances safety, reduces risks, and sustains operational efficiency.</p>



<h2 class="wp-block-heading">Quick Q&amp;A:</h2>



<p><strong>Q1:</strong> What is the primary purpose of a proof test protocol?<br><strong>A:</strong> To detect hidden failures within the SIF, ensuring reliability and readiness. Or more specifically, dangerous undetected (DU) failures.</p>



<p><strong>Q2:</strong> How does testing frequency influence functional safety?<br><strong>A:</strong> More frequent testing reduces the PFDavg, directly improving the SIL rating. </p>



<p><strong>Q3:</strong> Can partial-stroke testing replace full functional testing?<br><strong>A:</strong> No, partial-stroke testing complements full tests but can&#8217;t entirely replace them, as it&#8217;s not comprehensive enough to detect all types of faults.</p>



<p><strong>Q4:</strong> Is diagnostic testing (done automatically by newer components) a type of proof testing?<br><strong>A:</strong> No, diagnostic testing and proof testing are different concepts in the functional safety ecosystem. Proof testing is always operator-performed or operator-initiated. Both concepts appear in PFDavg calculations, but the math treats them differently.</p>



<p><strong>Q5:</strong>&nbsp;If my SIF already uses devices with high diagnostic coverage, do I still need frequent proof testing?<br><strong>A:</strong>&nbsp;Yes. However, with very strong diagnostics you may be able to use a longer TI and still achieve the needed PFDavg.</p>



<h2 class="wp-block-heading">Ready to Learn More?</h2>



<p>Stay informed and up-to-date on functional safety and industry best practices by subscribing to our newsletter. You&#8217;ll receive the latest insights directly in your inbox to help improve your safety management practices.</p>



<h2 class="wp-block-heading">Additional Resources:</h2>



<ul class="wp-block-list">
<li><a href="https://webstore.iec.ch/en/publication/24241" target="_blank" rel="noopener">IEC 61511 Official Standard</a></li>



<li><a href="https://www.isa.org" target="_blank" rel="noopener">International Society of Automation (ISA)</a></li>



<li><a href="https://www.hse.gov.uk" target="_blank" rel="noopener">UK Health and Safety Executive (HSE)</a></li>



<li><a href="https://www.aiche.org/ccps" target="_blank" rel="noopener">CCPS Guidelines</a></li>



<li>Internal <a href="https://silsafe.net/proof-test-coverage-cpt-to-improve-pfdavg/" data-type="post" data-id="2784">blog post</a> on Cpt</li>



<li>See our <a href="https://silsafe.net/functional-safety-glossary/" data-type="page" data-id="391">glossary </a>for loads more terms</li>



<li>Internal <a href="https://silsafe.net/failure-rates-in-functional-safety/" data-type="post" data-id="4221">article </a>on failure rate</li>



<li>Internal <a href="https://silsafe.net/functional-safety-for-the-process-industry/" data-type="post" data-id="6100">article </a>on the basics of functional safety</li>
</ul>



<!-- FAQPage Schema (JSON-LD) — VERBATIM, EXACT REPLICATION -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Q1: What is the primary purpose of a proof test protocol?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A: To detect hidden failures within the SIF, ensuring reliability and readiness. Or more specifically, dangerous undetected (DU) failures."
      }
    },
    {
      "@type": "Question",
      "name": "Q2: How does testing frequency influence functional safety?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A: More frequent testing reduces the PFDavg, directly improving the SIL rating."
      }
    },
    {
      "@type": "Question",
      "name": "Q3: Can partial-stroke testing replace full functional testing?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A: No, partial-stroke testing complements full tests but can't entirely replace them, as it's not comprehensive enough to detect all types of faults."
      }
    },
    {
      "@type": "Question",
      "name": "Q4: Is diagnostic testing (done automatically by newer components) a type of proof testing?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A: No, diagnostic testing and proof testing are different concepts in the functional safety ecosystem. Proof testing is always operator-performed or operator-initiated. Both concepts appear in PFDavg calculations, but the math treats them differently."
      }
    },
    {
      "@type": "Question",
      "name": "Q5: If my SIF already uses devices with high diagnostic coverage, do I still need frequent proof testing?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A: Yes. However, with very strong diagnostics you may be able to use a longer TI and still achieve the needed PFDavg."
      }
    }
  ]
}
</script>
]]></content:encoded>
					
					<wfw:commentRss>https://silsafe.net/proof-testing-of-sifs/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
