Provider Architecture — The Resource Gateway Guide
Overview
inference-relay ships with four built-in providers and supports any number of custom providers through an extensible architecture. Each provider implements the BaseProvider interface, which defines six abstract methods that govern health probing and reporting, synchronous and streaming execution, cost estimation, and rate-limit handling.
The provider system is designed so that no single provider is a hard dependency: any provider can be added, removed, or swapped without affecting the relay's core behavior. The cascade engine treats all providers as interchangeable resource endpoints, differentiated only by capability flags and priority.
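To make that interchangeability concrete, here is a minimal sketch of how a cascade might walk such endpoints. All names below are illustrative assumptions, not the relay's actual API.

```typescript
// Illustrative sketch of the cascade's view of providers: interchangeable
// endpoints differentiated only by capability flags and priority.
interface ProviderLike {
  priority: number; // lower = tried first
  health(): "healthy" | "degraded" | "unhealthy";
  execute(params: unknown): Promise<unknown>;
}

async function runCascade(providers: ProviderLike[], params: unknown): Promise<unknown> {
  const ordered = [...providers].sort((a, b) => a.priority - b.priority);
  for (const provider of ordered) {
    if (provider.health() === "unhealthy") continue; // skip dead endpoints
    try {
      return await provider.execute(params); // any provider can fulfill the request
    } catch {
      // failure is not fatal: advance to the next endpoint in the cascade
    }
  }
  throw new Error("all providers exhausted");
}
```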
Native Subscription Gateway
The Native Subscription Gateway routes inference through the user's existing subscription, leveraging their authenticated desktop session rather than consuming developer API credits.
- Cost to Developer — $0.00. The user's subscription absorbs all compute.
- Platform — Desktop only (macOS, Linux, Windows)
- Streaming — Yes
- Multimodal — Auto-cascades to API providers for image inputs
Hardware-Bound Security
Authentication is handled entirely through Hardware-Authorized Secure Enclaves — platform-native credential stores that bind secrets to the physical device. Credentials never transit application memory in plaintext and cannot be extracted by other processes.
How It Works
The gateway opens a channel to the local subscription runtime. Responses are decoded via Asynchronous Stream Decoding, which processes the output as a continuous byte stream rather than as discrete HTTP responses. A set of Logic Mappings then translates the subscription output format into the standardized RelayResult interface.
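As an illustration, a byte-stream decoder of this kind can be written as an async generator. The sketch below is a hypothetical stand-in, assuming the channel yields raw bytes; it is not the gateway's actual decoder.

```typescript
// Hypothetical sketch of Asynchronous Stream Decoding: consume the channel
// as a continuous byte stream and yield decoded text chunks as they arrive.
async function* decodeStream(
  byteStream: AsyncIterable<Uint8Array>
): AsyncIterable<string> {
  const decoder = new TextDecoder("utf-8");
  for await (const chunk of byteStream) {
    // stream: true buffers partial multi-byte characters across chunk boundaries
    yield decoder.decode(chunk, { stream: true });
  }
  const tail = decoder.decode(); // flush anything still buffered
  if (tail) yield tail;
}
```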
Advantages
- Zero marginal cost — Every request through this gateway is free to the developer.
- Context-Preservation Parity — The local runtime has direct knowledge of the remaining context budget, unlike cloud-throttled API endpoints, which report limits only after rejecting a request.
- No network latency — Inference happens on the user's machine or through their authenticated session.
Cloud-based API endpoints often apply aggressive context compression to manage global compute load. The Native Gateway maintains high-fidelity synchronization with the model's native context window, yielding approximately 80% more structured detail in complex extractions than equivalent direct API calls at the same token budget.
Limitations
- Desktop-only — not available in web or server environments.
- Multimodal requests (images) are automatically cascaded to an API provider.
Anthropic API Provider
Direct integration with the Anthropic Messages API.
- Cost — Standard Anthropic pricing
- Platform — Any (web, server, desktop)
- Streaming — Yes (Server-Sent Events)
- Multimodal — Yes
Features
- Full support for streaming, multimodal inputs, and tool use
- Native request format — no translation layer required
Error Classification
- 401 Unauthorized — Mark unhealthy (invalid or revoked key)
- 429 Rate Limited — Mark degraded (key valid but throttled)
- 529 Overloaded — Mark degraded (key valid, service under load)
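A minimal sketch of that mapping, assuming a classifier function of this shape (the function name and HealthState type are illustrative, not the relay's actual API):

```typescript
// Maps Anthropic API error statuses to provider health states, per the
// classification above.
type HealthState = "healthy" | "degraded" | "unhealthy";

function classifyAnthropicStatus(status: number): HealthState {
  switch (status) {
    case 401: return "unhealthy"; // invalid or revoked key
    case 429: return "degraded";  // key valid but throttled
    case 529: return "degraded";  // key valid, service under load
    default:  return "healthy";   // assumption: other statuses do not change health here
  }
}
```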
OpenAI API Provider
Translates requests from the Anthropic message format into OpenAI Chat Completions format, and translates responses back. This allows inference-relay users to access OpenAI models without changing their application code.
- Cost — Standard OpenAI pricing
- Platform — Any (web, server, desktop)
- Streaming — Yes (Server-Sent Events)
- Multimodal — Yes
Automatic Format Translation
The provider handles bidirectional translation of:
- Message structure — Anthropic role/content blocks ↔ OpenAI messages array
- Tool calls — Anthropic tool_use/tool_result blocks ↔ OpenAI function_call/tool_calls
- Stop reasons — Anthropic end_turn/tool_use ↔ OpenAI stop/tool_calls
No application code changes are needed when routing shifts between Anthropic and OpenAI providers.
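For illustration, the stop-reason leg of that translation could look like the following. This is a sketch assuming a simple mapping function, not the provider's actual code; the default branch is an assumption.

```typescript
// Maps OpenAI finish reasons back to Anthropic stop reasons, per the
// translation list above.
function toAnthropicStopReason(finishReason: string): "end_turn" | "tool_use" {
  switch (finishReason) {
    case "tool_calls": return "tool_use"; // model requested a tool invocation
    case "stop":
    default:           return "end_turn"; // normal completion
  }
}
```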
Supported Models (v1.4.0)
- gpt-4.1 — Flagship ($2.00 / $8.00 per 1M tokens)
- gpt-4.1-mini — Efficient ($0.40 / $1.60)
- gpt-4.1-nano — Minimal ($0.10 / $0.40)
- gpt-4o — Multimodal ($2.50 / $10.00)
- gpt-4o-mini — Efficient multimodal ($0.15 / $0.60)
- o4-mini — Reasoning ($1.10 / $4.40)
- o3 — Reasoning ($2.00 / $8.00)
- o3-mini — Efficient reasoning ($1.10 / $4.40)
- o1 — Reasoning, legacy ($15.00 / $60.00)
Legacy models (gpt-4-turbo, gpt-4, gpt-3.5-turbo) are retained for backward compatibility. Use modelsForProvider('openai-api') for the current list. See the Model Registry for full details.
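For example, assuming modelsForProvider is importable from the package root (the import path is an assumption):

```typescript
// Query the current OpenAI model list at runtime instead of hard-coding it.
import { modelsForProvider } from "inference-relay"; // import path assumed

const openaiModels = modelsForProvider("openai-api");
console.log(openaiModels); // e.g. ["gpt-4.1", "gpt-4.1-mini", ...]
```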
Ollama Provider
Local inference through an Ollama instance running on the user's machine.
- Cost — $0.00 (runs entirely on user hardware)
- Platform — Any (requires Ollama installed and running)
- Streaming — Yes
- Multimodal — Model-dependent (llava supports images)
- Tool Use — Model-dependent (v1.4.0 — detected at runtime)
- Auth — None required
Connection
Connects to Ollama's local API at localhost:11434 by default. The port is configurable via provider options.
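A liveness check against that endpoint can be as simple as listing installed models via Ollama's GET /api/tags route. The function below is an illustrative sketch, not the relay's built-in probe.

```typescript
// Probe a local Ollama instance by listing installed models (GET /api/tags).
// The baseUrl default mirrors the documented localhost:11434 endpoint.
async function probeOllama(baseUrl = "http://localhost:11434"): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);
    return res.ok;
  } catch {
    return false; // Ollama not installed or not running
  }
}
```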
Tool Support Detection (v1.4.0)
The relay detects tool-capable Ollama models at runtime by examining the Modelfile template. Models that include tool markers in their template support native function calling.
- llama3.1 — Tool support verified
- mistral — Tool support verified
- mistral-nemo — Tool support verified
- qwen2:0.5b — No tool support
When tool support is confirmed, the relay translates Anthropic-format tool definitions to Ollama's native format automatically. When a model lacks tool support and params.tools is set, the cascade advances to the next provider.
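A sketch of that detection, assuming the template is fetched through Ollama's /api/show route and that tool-capable templates reference the Tools variable (both the marker string and the function name are assumptions):

```typescript
// Detect tool support at runtime by inspecting the model's template.
async function detectToolSupport(
  model: string,
  baseUrl = "http://localhost:11434"
): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/show`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model }),
  });
  if (!res.ok) return false;
  const info = (await res.json()) as { template?: string };
  // Assumption: tool-capable templates contain a marker like "{{ if .Tools }}"
  return info.template?.includes(".Tools") ?? false;
}
```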
Use Cases
- Development and testing — Free, fast iteration without burning API credits
- Air-gapped environments — No network access required
- Privacy-sensitive workloads — Data never leaves the machine
- Cost elimination — Pair with cloud providers in a cascade; Ollama handles what it can
System Compatibility Matrix
- Native Gateway — Desktop, Secure Enclave auth, $0.00, streaming, cascades multimodal
- Anthropic API — Any platform, API key auth, standard pricing, SSE streaming, multimodal
- OpenAI API — Any platform, API key auth, standard pricing, SSE streaming, multimodal
- Ollama — Any platform, no auth, $0.00, streaming, model-dependent multimodal
Protocol Extensibility — Custom Providers
Any inference endpoint can be integrated by implementing the BaseProvider abstract class.
Required Capabilities
- Health Probe — Check if the provider is available and authenticated
- Synchronous Execution — Send a non-streaming request and return a complete response
- Stream Execution — Send a streaming request and return an async iterable
- Predictive Resource Allocation — Estimate cost for a given request before execution
- Health Reporting — Return current health state (healthy, degraded, unhealthy)
- Rate Limit Acceptance — Accept rate-limit signals from the cascade engine
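Put together, the contract might look like the skeleton below. Method names and signatures are assumptions; only the six responsibilities come from this guide.

```typescript
// A skeleton of the BaseProvider contract, one abstract member per
// required capability above.
type HealthState = "healthy" | "degraded" | "unhealthy";
type RelayResult = unknown; // standardized shape; see "Return Type" below
interface RelayParams { model: string; messages: unknown[]; tools?: unknown[] }

abstract class BaseProvider {
  abstract probeHealth(): Promise<boolean>;                           // Health Probe
  abstract execute(params: RelayParams): Promise<RelayResult>;        // Synchronous Execution
  abstract executeStream(params: RelayParams): AsyncIterable<string>; // Stream Execution
  abstract estimateCost(params: RelayParams): number;                 // Predictive Resource Allocation
  abstract health(): HealthState;                                     // Health Reporting
  abstract acceptRateLimit(retryAfterMs: number): void;               // Rate Limit Acceptance
}
```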
Capability Flags
Custom providers declare their capabilities through configuration:
- Streaming support — Whether the provider can return async iterables
- Multimodal support — Whether the provider handles image inputs
- Platform requirement — any, desktop, or server
- Priority — Controls cascade ordering (lower number = higher priority)
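A hypothetical capability declaration follows; the field names mirror the flags above, but the config shape is an assumption.

```typescript
// Capability flags for a custom provider registration (shape assumed).
const myProviderCapabilities = {
  streaming: true,             // can return async iterables
  multimodal: false,           // does not accept image inputs
  platform: "server" as const, // "any" | "desktop" | "server"
  priority: 25,                // lower number = higher cascade priority
};
```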
Return Type
All providers must return a standardized response interface that includes response data, token usage, cost, and provider metadata. This ensures that downstream consumers never need to know which provider fulfilled a request.
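As a sketch, the standardized shape could be expressed like this. The field names are assumptions, since the guide specifies only the four categories of data.

```typescript
// Standardized provider return shape (field names assumed).
interface RelayResult {
  content: unknown;                                     // response data
  usage: { inputTokens: number; outputTokens: number }; // token usage
  costUsd: number;                                      // cost
  provider: { name: string; model: string };            // provider metadata
}
```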
Priority
The priority field controls where a custom provider sits in the cascade ordering. Built-in providers are ordered: Native Gateway (highest), Anthropic API, OpenAI API, Ollama (lowest). Custom providers can insert at any position.
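For example, with hypothetical priority values (the numbers are assumptions; only the built-in ordering comes from this guide), a custom provider could slot in between the Native Gateway and the Anthropic API:

```typescript
// Hypothetical cascade ordering; lower number = higher priority.
const cascade = [
  { name: "native-gateway", priority: 10 },     // highest priority
  { name: "my-custom-provider", priority: 15 }, // custom insert point
  { name: "anthropic-api", priority: 20 },
  { name: "openai-api", priority: 30 },
  { name: "ollama", priority: 40 },             // lowest priority
];
```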
Continue reading: Routing Engine for cost-based routing and the Condition Logic Engine.