Fallback Engine — Continuity & Resilience

Automated Service Continuity

Fallback in inference-relay is not a backup plan. It is automatic service continuity — a guarantee that inference requests complete successfully even when individual providers experience failures.

The cascade engine maintains a priority-ordered list of providers. When a request fails on the primary provider, it is transparently retried on the next provider in the cascade. The calling application receives a successful response with no indication that a failover occurred, except for an optional metadata field recording the cascade path.

Priority cascade: providers are tried in priority order (lower number = higher priority). If a provider fails, the next provider handles the request. If all providers fail, the error from the last provider is returned.
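A minimal sketch of the cascade loop in TypeScript. The Provider interface and the RelayRequest/RelayResult shapes are illustrative assumptions, not the library's actual API; AbortError handling is covered under Error Classification below.

  type RelayRequest = { prompt: string };
  type RelayResult = { text: string; fallbackChain: string[] };

  interface Provider {
    name: string;
    priority: number;                            // lower number = higher priority
    execute(request: RelayRequest): Promise<RelayResult>;
  }

  async function cascade(providers: Provider[], request: RelayRequest): Promise<RelayResult> {
    if (providers.length === 0) throw new Error('no providers configured');
    // Try providers in ascending priority order.
    const ordered = [...providers].sort((a, b) => a.priority - b.priority);
    let lastError: unknown;
    for (const provider of ordered) {
      try {
        return await provider.execute(request);  // first success wins
      } catch (err) {
        lastError = err;                         // record the failure, fall through to the next provider
      }
    }
    throw lastError;                             // all providers failed: surface the last error
  }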

Health State Management

Each provider is tracked in one of three health states. The cascade engine uses these states to skip providers that are unlikely to succeed, reducing latency and avoiding redundant failures.

  • Healthy — Accepting requests normally. Triggered by successful execution.
  • Degraded — Accepting requests but flagged for monitoring. Triggered by rate limit (429) or overload (529). Recovers on next successful request.
  • Unhealthy — Skipped entirely in cascade. Triggered by auth failure (401) or repeated consecutive errors. Recovers via Self-Healing Protocol.

State Transitions

  • A healthy provider that returns a 429 or 529 transitions to degraded.
  • A degraded provider that completes a request successfully transitions back to healthy.
  • A degraded provider that returns a 401 transitions to unhealthy.
  • An unhealthy provider is excluded from the cascade until the Self-Healing Protocol restores it.
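The states and transitions above reduce to a small pure function. A sketch, with HealthState and the outcome encoding as assumed names; the consecutive-error counter that can also drive a provider to unhealthy is omitted for brevity.

  type HealthState = 'healthy' | 'degraded' | 'unhealthy';
  type Outcome = 'success' | 429 | 529 | 401;

  // Map an observed outcome onto the provider's next health state.
  function nextState(outcome: Outcome): HealthState {
    if (outcome === 'success') return 'healthy';   // degraded recovers on success
    if (outcome === 401) return 'unhealthy';       // skipped until the Self-Healing Protocol restores it
    return 'degraded';                             // 429 / 529: flagged for monitoring
  }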

Self-Healing Protocol

Unhealthy providers automatically recover after a 5-minute cooldown period. No manual intervention, no restart, no configuration change is required.

This protocol exists because most provider failures are transient:

  • A 401 during key rotation resolves once the new key propagates.
  • Repeated timeouts during a service incident resolve once the provider stabilizes.
  • Network partitions heal on their own.

Without self-healing, a single transient failure would permanently exclude a provider from the cascade for the lifetime of the process. The 5-minute window is long enough to avoid hammering a failing provider but short enough to restore capacity promptly.

After the cooldown expires, the provider is moved back to degraded state and included in the cascade on the next request. If that request succeeds, it transitions to healthy. If it fails again, it returns to unhealthy and the cooldown resets.
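A sketch of the cooldown bookkeeping, assuming the engine records the timestamp at which a provider entered the unhealthy state (the field names are illustrative):

  const COOLDOWN_MS = 5 * 60 * 1000;   // the 5-minute self-healing window

  interface ProviderHealth {
    state: 'healthy' | 'degraded' | 'unhealthy';
    unhealthySince?: number;           // epoch ms, set on the transition to unhealthy
  }

  // Run before each cascade pass. An expired cooldown promotes the provider to degraded
  // so the next request probes it: success -> healthy, failure -> unhealthy with a fresh cooldown.
  function applySelfHealing(health: ProviderHealth, now: number = Date.now()): void {
    if (health.state !== 'unhealthy' || health.unhealthySince === undefined) return;
    if (now - health.unhealthySince >= COOLDOWN_MS) {
      health.state = 'degraded';
      health.unhealthySince = undefined;
    }
  }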

Error Classification

The cascade engine classifies errors to determine the correct action and health impact for each failure type.

  • Rate limit (429) — Cascade to next provider, mark degraded
  • Auth failure (401) — Cascade to next provider, mark unhealthy
  • Overloaded (529) — Cascade to next provider, mark degraded
  • Timeout — Cascade to next provider, no health change
  • AbortError — Re-throw immediately, no cascade

AbortError — The Exception

AbortError is the only error type that halts the cascade. This error means the caller explicitly cancelled the request (e.g., user navigated away, component unmounted, abort controller signaled). Cascading to another provider would be wasteful — the caller no longer wants a response.

All other errors trigger automatic failover to the next available provider.
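A sketch of the classification step that ties the table to the AbortError rule. The FallbackAction type and the numeric status check are assumptions; the real engine may inspect provider-specific error shapes.

  type FallbackAction =
    | { kind: 'cascade'; markAs?: 'degraded' | 'unhealthy' }      // try the next provider
    | { kind: 'rethrow' };                                        // halt the cascade entirely

  function classifyError(err: unknown): FallbackAction {
    if (err instanceof Error && err.name === 'AbortError') {
      return { kind: 'rethrow' };                                 // caller cancelled: no provider can help
    }
    const status = (err as { status?: number }).status;
    switch (status) {
      case 429:                                                   // rate limit
      case 529: return { kind: 'cascade', markAs: 'degraded' };   // overloaded
      case 401: return { kind: 'cascade', markAs: 'unhealthy' };  // auth failure
      default:  return { kind: 'cascade' };                       // timeouts etc.: no health change
    }
  }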

Cascade Chain Record

Every RelayResult includes a fallbackChain field that records the full cascade path — which providers were tried, in what order, and what happened at each step. For example:

native-gateway:failed:rate_limited → api-provider:success
native-gateway:failed:unhealthy → api-provider-1:failed:timeout → api-provider-2:success
api-provider:success

This provides complete observability into routing decisions without requiring external logging infrastructure. The chain record is available in the response metadata and can be forwarded to your telemetry system, logged, or displayed in debug UI.
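Consuming the chain might look like the following, assuming fallbackChain is an array of provider:outcome[:reason] step strings; relay.complete is a stand-in for the actual request method.

  // Stand-in for the configured relay instance.
  declare const relay: { complete(req: { prompt: string }): Promise<{ fallbackChain: string[] }> };

  const result = await relay.complete({ prompt: 'Summarize this document.' });

  // e.g. ['native-gateway:failed:rate_limited', 'api-provider:success']
  for (const step of result.fallbackChain) {
    const [provider, outcome, reason] = step.split(':');
    console.log(`${provider} -> ${outcome}${reason ? ` (${reason})` : ''}`);
  }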

Atomic Session Continuity

Standard fallback handles failures that occur before a response begins. Atomic Session Continuity handles the harder case: failures that occur mid-stream, after the provider has already begun sending tokens.

The Problem

A provider accepts a streaming request and begins returning tokens. Partway through the response, the connection drops — the provider hits a rate limit, the network blips, or the service restarts. Without mid-stream protection, the user sees a partial response followed by an error.

The Solution — Stateful Transition Between Gateways

The relay buffers the logical state of every active stream. As tokens arrive from the primary provider, the cascade engine accumulates a structured execution state alongside the user-facing token feed. This buffered state is what makes a stateful transition between providers possible.

inference-relay separates stream connection success from stream content consumption:

  • A successful connection opens the stream, but the provider is not marked healthy yet.
  • Tokens are consumed from the stream and forwarded to the caller. The cascade engine simultaneously buffers the logic state of the in-flight session.
  • Only after the stream is fully consumed without error is the provider marked healthy.

This is Deferred Health Marking — the provider's health state reflects complete request success, not just the initial handshake.

If the stream fails mid-consumption, the cascade engine injects the buffered partial execution state into the secondary gateway. The buffered state is formatted as a structured precondition for the next provider in the cascade, enabling that provider to continue generation from the exact point of failure rather than restarting the entire request from scratch.

The result: users never see a raw “Connection Lost” error. The relay either completes the request transparently on another provider — with the accumulated content carried forward through the stateful transition — or returns a structured error with the partial content preserved.
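A condensed sketch of both mechanisms, Deferred Health Marking and the stateful transition. The stream shape, markHealthy, and the resumeFrom field are illustrative assumptions, not the library's actual API.

  interface StreamingProvider {
    name: string;
    stream(request: StreamRequest): AsyncIterable<string>;
  }
  type StreamRequest = { prompt: string; resumeFrom?: string };  // 'resumeFrom' is hypothetical

  declare function markHealthy(provider: StreamingProvider): void;

  async function relayStream(
    chain: StreamingProvider[],
    request: StreamRequest,
    emit: (token: string) => void,
  ): Promise<void> {
    if (chain.length === 0) throw new Error('no streaming providers available');
    const [primary, ...rest] = chain;
    const buffered: string[] = [];                // the accumulated logic state of the session
    try {
      for await (const token of primary.stream(request)) {
        buffered.push(token);                     // buffer alongside the user-facing feed
        emit(token);
      }
      markHealthy(primary);                       // deferred: marked healthy only after full consumption
    } catch (err) {
      if (err instanceof Error && err.name === 'AbortError') throw err;  // caller cancelled
      if (rest.length === 0) throw err;           // nothing left: structured error with partial content
      // Stateful transition: inject the partial execution state into the next
      // provider's request so generation continues from the point of failure.
      await relayStream(rest, { ...request, resumeFrom: buffered.join('') }, emit);
    }
  }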

Capability Filtering

Before the cascade even begins, providers are filtered by their declared capabilities. This prevents wasted attempts on providers that cannot possibly handle the request.

  • Contains image content → Skip providers where supportsMultimodal = false
  • Requests streaming → Skip providers where supportsStreaming = false
  • Running on web platform → Skip providers where platformRequirement = 'desktop'
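A sketch of the filter using the capability flag names from the list above; the request shape is an assumption.

  interface ProviderCapabilities {
    supportsMultimodal: boolean;
    supportsStreaming: boolean;
    platformRequirement?: 'desktop';
  }

  interface CandidateProvider {
    name: string;
    capabilities: ProviderCapabilities;
  }

  // Remove providers that cannot possibly satisfy the request before the cascade starts.
  function filterByCapability(
    providers: CandidateProvider[],
    req: { hasImages: boolean; stream: boolean; platform: 'web' | 'desktop' },
  ): CandidateProvider[] {
    return providers.filter((p) => {
      if (req.hasImages && !p.capabilities.supportsMultimodal) return false;
      if (req.stream && !p.capabilities.supportsStreaming) return false;
      if (req.platform === 'web' && p.capabilities.platformRequirement === 'desktop') return false;
      return true;
    });
  }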

Auto-Cascade for Multimodal

The Native Subscription Gateway does not support multimodal inputs. When a request contains images, the gateway is automatically skipped and the request cascades directly to the first multimodal-capable API provider. This happens silently — no error is generated, no fallback is recorded. The gateway simply is not a candidate for that request.

This means applications can send multimodal requests without checking provider capabilities first. The relay handles the routing automatically.

Smart Provider Pre-Skip (v1.4.0)

Before v1.4.0, the cascade engine would attempt every provider in order, relying on the provider's guard to reject incompatible models. This worked but wasted a try/catch round-trip per mismatched provider.

In v1.4.0, the engine consults the MODEL_REGISTRY before calling buildRequestBody(). If the registry knows a model belongs to a different provider, that provider is skipped immediately.

Pre-skipped providers appear in the fallback chain as:

openai-api:skipped:wrong_provider → anthropic-api:success

The v1.3.1 guards remain as defense-in-depth — if the registry is stale or a custom provider is used, the guard fires and the cascade continues normally. Custom and mock providers are never pre-skipped.
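A sketch of the pre-skip check. MODEL_REGISTRY and buildRequestBody() are named in the text above, but the registry's shape and the kind field are assumptions here.

  // Assumed shape: model name -> owning provider name.
  declare const MODEL_REGISTRY: Record<string, string | undefined>;

  interface RegisteredProvider {
    name: string;
    kind: 'builtin' | 'custom' | 'mock';
  }

  // Consulted before buildRequestBody(): true when the registry proves the model
  // belongs to a different provider. Appears in the chain as skipped:wrong_provider.
  function shouldPreSkip(provider: RegisteredProvider, model: string): boolean {
    if (provider.kind === 'custom' || provider.kind === 'mock') return false;  // never pre-skipped
    const owner = MODEL_REGISTRY[model];
    if (owner === undefined) return false;   // unknown model: let the v1.3.1 guard decide
    return owner !== provider.name;
  }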

Monitoring

FallbackTrace Dashboard

The @inference-relay/dashboard package provides a FallbackTrace React component that visualizes cascade activity in real time. It displays:

  • Active provider health states (healthy / degraded / unhealthy)
  • Recent cascade chains with timing data
  • Self-Healing Protocol countdowns for unhealthy providers
  • Cost attribution per provider
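Embedding the component might look like this; the relay prop is an assumed name, so consult the package's types for the actual interface.

  import { FallbackTrace } from '@inference-relay/dashboard';

  // Renders the live cascade view for a relay instance ('relay' prop is illustrative).
  export function RelayDebugPanel({ relay }: { relay: unknown }) {
    return <FallbackTrace relay={relay} />;
  }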

MCP Server Integration

The inference-relay MCP server exposes two tools for programmatic access to fallback data:

  • list_fallback_events — Returns recent fallback events with timestamps, providers, and error types
  • explain_fallback_chain — Takes a request ID and returns a human-readable explanation of why each routing decision was made
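Calling the tools from an MCP client might look like the following sketch; client setup is omitted and the argument names are assumptions.

  // Stand-in for a connected MCP client.
  declare const client: {
    callTool(input: { name: string; arguments: Record<string, unknown> }): Promise<unknown>;
  };

  const events = await client.callTool({
    name: 'list_fallback_events',
    arguments: { limit: 20 },               // 'limit' is an assumed argument
  });

  const explanation = await client.callTool({
    name: 'explain_fallback_chain',
    arguments: { requestId: 'req_123' },    // argument name is illustrative
  });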

Activity Log

The relay maintains an internal activity log that captures every fallback event. This log is available through the relay's public API and can be forwarded to external systems for fleet-wide pattern analysis — identifying which providers fail most often, at what times, and for which request types.
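Forwarding the log might look like this; getActivityLog and the event fields are assumed names.

  declare const relay: {
    getActivityLog(): Array<{ provider: string; errorType: string; timestamp: number }>;
  };

  // Ship every fallback event to an external system for fleet-wide analysis.
  for (const event of relay.getActivityLog()) {
    console.log(JSON.stringify({ kind: 'fallback', ...event }));
  }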

Continue reading: Streaming for the Universal Protocol Decoder and stream interface.