Provider Architecture — The Resource Gateway Guide
Overview
inference-relay ships with four built-in providers and supports any number of custom providers through an extensible architecture. Each provider implements the BaseProvider interface, which defines six abstract methods that govern health probing and reporting, synchronous and streaming execution, cost estimation, and rate-limit handling.
The provider system is designed so that no single provider is a hard dependency: any provider can be added, removed, or swapped without affecting the relay's core behavior. The cascade engine treats all providers as interchangeable resource endpoints, differentiated only by capability flags and priority.
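To make that interchangeability concrete, here is a minimal sketch of how a cascade might walk such endpoints. All names below are illustrative assumptions, not the relay's actual API.

```typescript
// Illustrative sketch of the cascade's view of providers: interchangeable
// endpoints differentiated only by capability flags and priority.
interface ProviderLike {
  priority: number; // lower = tried first
  health(): "healthy" | "degraded" | "unhealthy";
  execute(params: unknown): Promise<unknown>;
}

async function runCascade(providers: ProviderLike[], params: unknown): Promise<unknown> {
  const ordered = [...providers].sort((a, b) => a.priority - b.priority);
  for (const provider of ordered) {
    if (provider.health() === "unhealthy") continue; // skip dead endpoints
    try {
      return await provider.execute(params); // any provider can fulfill the request
    } catch {
      // failure is not fatal: advance to the next endpoint in the cascade
    }
  }
  throw new Error("all providers exhausted");
}
```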
Native Subscription Gateway
The Native Subscription Gateway routes inference through the user's existing subscription, leveraging their authenticated desktop session rather than consuming developer API credits.
- Cost to Developer — $0.00. The user's subscription absorbs all compute.
- Platform — Desktop only (macOS, Linux, Windows)
- Streaming — Yes
- Multimodal — Auto-cascades to API providers for image inputs
Hardware-Bound Security
Authentication is handled entirely through Hardware-Authorized Secure Enclaves — platform-native credential stores that bind secrets to the physical device. Credentials never transit application memory in plaintext and cannot be extracted by other processes.
How It Works
The gateway opens a channel to the local subscription runtime. Responses are decoded via Asynchronous Stream Decoding, which processes the output as a continuous byte stream rather than as discrete HTTP responses. A set of Logic Mappings then translates the subscription output format into the standardized RelayResult interface.
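As an illustration, a byte-stream decoder of this kind can be written as an async generator. The sketch below is a hypothetical stand-in, assuming the channel yields raw bytes; it is not the gateway's actual decoder.

```typescript
// Hypothetical sketch of Asynchronous Stream Decoding: consume the channel
// as a continuous byte stream and yield decoded text chunks as they arrive.
async function* decodeStream(
  byteStream: AsyncIterable<Uint8Array>
): AsyncIterable<string> {
  const decoder = new TextDecoder("utf-8");
  for await (const chunk of byteStream) {
    // stream: true buffers partial multi-byte characters across chunk boundaries
    yield decoder.decode(chunk, { stream: true });
  }
  const tail = decoder.decode(); // flush anything still buffered
  if (tail) yield tail;
}
```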
Advantages
- Zero marginal cost — Every request through this gateway is free to the developer.
- Context-Preservation Parity — The local runtime has direct knowledge of the remaining context budget, unlike cloud-throttled API endpoints, which report limits only after rejecting a request.
- No network latency — Inference happens on the user's machine or through their authenticated session.
Cloud-based API endpoints often apply aggressive context compression to manage global compute load. The Native Gateway maintains high-fidelity synchronization with the model's native context window, yielding approximately 80% more structured detail in complex extractions than equivalent direct API calls at the same token budget.
Limitations
- Desktop-only — not available in web or server environments.
- Multimodal requests (images) are automatically cascaded to an API provider.
Anthropic API Provider
Direct integration with the Anthropic Messages API.
- Cost — Standard Anthropic pricing
- Platform — Any (web, server, desktop)
- Streaming — Yes (Server-Sent Events)
- Multimodal — Yes
Features
- Full support for streaming, multimodal inputs, and tool use
- Native request format — no translation layer required
Error Classification
- 401 Unauthorized — Mark unhealthy (invalid or revoked key)
- 429 Rate Limited — Mark degraded (key valid but throttled)
- 529 Overloaded — Mark degraded (key valid, service under load)
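A minimal sketch of that mapping, assuming a classifier function of this shape (the function name and HealthState type are illustrative, not the relay's actual API):

```typescript
// Maps Anthropic API error statuses to provider health states, per the
// classification above.
type HealthState = "healthy" | "degraded" | "unhealthy";

function classifyAnthropicStatus(status: number): HealthState {
  switch (status) {
    case 401: return "unhealthy"; // invalid or revoked key
    case 429: return "degraded";  // key valid but throttled
    case 529: return "degraded";  // key valid, service under load
    default:  return "healthy";   // assumption: other statuses do not change health here
  }
}
```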
OpenAI API Provider
Translates requests from the Anthropic message format into OpenAI Chat Completions format, and translates responses back. This allows inference-relay users to access OpenAI models without changing their application code.
- Cost — Standard OpenAI pricing
- Platform — Any (web, server, desktop)
- Streaming — Yes (Server-Sent Events)
- Multimodal — Yes
Automatic Format Translation
The provider handles bidirectional translation of:
- Message structure — Anthropic role/content blocks ↔ OpenAI messages array
- Tool calls — Anthropic tool_use/tool_result blocks ↔ OpenAI function_call/tool_calls
- Stop reasons — Anthropic end_turn/tool_use ↔ OpenAI stop/tool_calls
No application code changes are needed when routing shifts between Anthropic and OpenAI providers.
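For illustration, the stop-reason leg of that translation could look like the following. This is a sketch assuming a simple mapping function, not the provider's actual code; the default branch is an assumption.

```typescript
// Maps OpenAI finish reasons back to Anthropic stop reasons, per the
// translation list above.
function toAnthropicStopReason(finishReason: string): "end_turn" | "tool_use" {
  switch (finishReason) {
    case "tool_calls": return "tool_use"; // model requested a tool invocation
    case "stop":
    default:           return "end_turn"; // normal completion
  }
}
```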
Supported Models (v1.4.0)
- gpt-4.1 — Flagship ($2.00 / $8.00 per 1M tokens)
- gpt-4.1-mini — Efficient ($0.40 / $1.60)
- gpt-4.1-nano — Minimal ($0.10 / $0.40)
- gpt-4o — Multimodal ($2.50 / $10.00)
- gpt-4o-mini — Efficient multimodal ($0.15 / $0.60)
- o4-mini — Reasoning ($1.10 / $4.40)
- o3 — Reasoning ($2.00 / $8.00)
- o3-mini — Efficient reasoning ($1.10 / $4.40)
- o1 — Reasoning, legacy ($15.00 / $60.00)
Legacy models (gpt-4-turbo, gpt-4, gpt-3.5-turbo) are retained for backward compatibility. Use modelsForProvider('openai-api') for the current list. See the Model Registry for full details.
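For example, assuming modelsForProvider is importable from the package root (the import path is an assumption):

```typescript
// Query the current OpenAI model list at runtime instead of hard-coding it.
import { modelsForProvider } from "inference-relay"; // import path assumed

const openaiModels = modelsForProvider("openai-api");
console.log(openaiModels); // e.g. ["gpt-4.1", "gpt-4.1-mini", ...]
```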
Ollama Provider
Local inference through an Ollama instance running on the user's machine.
- Cost — $0.00 (runs entirely on user hardware)
- Platform — Any (requires Ollama installed and running)
- Streaming — Yes
- Multimodal — Model-dependent (llava supports images)
- Tool Use — Model-dependent (v1.4.0 — detected at runtime)
- Auth — None required
Connection
Connects to Ollama's local API at localhost:11434 by default. The port is configurable via provider options.
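A liveness check against that endpoint can be as simple as listing installed models via Ollama's GET /api/tags route. The function below is an illustrative sketch, not the relay's built-in probe.

```typescript
// Probe a local Ollama instance by listing installed models (GET /api/tags).
// The baseUrl default mirrors the documented localhost:11434 endpoint.
async function probeOllama(baseUrl = "http://localhost:11434"): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);
    return res.ok;
  } catch {
    return false; // Ollama not installed or not running
  }
}
```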
Tool Support Detection (v1.4.0)
The relay detects tool-capable Ollama models at runtime by examining the Modelfile template. Models that include tool markers in their template support native function calling.
- llama3.1 — Tool support verified
- mistral — Tool support verified
- mistral-nemo — Tool support verified
- qwen2:0.5b — No tool support
When tool support is confirmed, the relay translates Anthropic-format tool definitions to Ollama's native format automatically. When a model lacks tool support and params.tools is set, the cascade advances to the next provider.
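A sketch of that detection, assuming the template is fetched through Ollama's /api/show route and that tool-capable templates reference the Tools variable (both the marker string and the function name are assumptions):

```typescript
// Detect tool support at runtime by inspecting the model's template.
async function detectToolSupport(
  model: string,
  baseUrl = "http://localhost:11434"
): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/show`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model }),
  });
  if (!res.ok) return false;
  const info = (await res.json()) as { template?: string };
  // Assumption: tool-capable templates contain a marker like "{{ if .Tools }}"
  return info.template?.includes(".Tools") ?? false;
}
```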
Use Cases
- Development and testing — Free, fast iteration without burning API credits
- Air-gapped environments — No network access required
- Privacy-sensitive workloads — Data never leaves the machine
- Cost elimination — Pair with cloud providers in a cascade; Ollama handles what it can
System Compatibility Matrix
- Native Gateway — Desktop, Secure Enclave auth, $0.00, streaming, cascades multimodal
- Anthropic API — Any platform, API key auth, standard pricing, SSE streaming, multimodal
- OpenAI API — Any platform, API key auth, standard pricing, SSE streaming, multimodal
- Ollama — Any platform, no auth, $0.00, streaming, model-dependent multimodal
Protocol Extensibility — Custom Providers
Any inference endpoint can be integrated by implementing the BaseProvider abstract class.
Required Capabilities
- Health Probe — Check if the provider is available and authenticated
- Synchronous Execution — Send a non-streaming request and return a complete response
- Stream Execution — Send a streaming request and return an async iterable
- Predictive Resource Allocation — Estimate cost for a given request before execution
- Health Reporting — Return current health state (healthy, degraded, unhealthy)
- Rate Limit Acceptance — Accept rate-limit signals from the cascade engine
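Put together, the contract might look like the skeleton below. Method names and signatures are assumptions; only the six responsibilities come from this guide.

```typescript
// A skeleton of the BaseProvider contract, one abstract member per
// required capability above.
type HealthState = "healthy" | "degraded" | "unhealthy";
type RelayResult = unknown; // standardized shape; see "Return Type" below
interface RelayParams { model: string; messages: unknown[]; tools?: unknown[] }

abstract class BaseProvider {
  abstract probeHealth(): Promise<boolean>;                           // Health Probe
  abstract execute(params: RelayParams): Promise<RelayResult>;        // Synchronous Execution
  abstract executeStream(params: RelayParams): AsyncIterable<string>; // Stream Execution
  abstract estimateCost(params: RelayParams): number;                 // Predictive Resource Allocation
  abstract health(): HealthState;                                     // Health Reporting
  abstract acceptRateLimit(retryAfterMs: number): void;               // Rate Limit Acceptance
}
```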
Capability Flags
Custom providers declare their capabilities through configuration:
- Streaming support — Whether the provider can return async iterables
- Multimodal support — Whether the provider handles image inputs
- Platform requirement — any, desktop, or server
- Priority — Controls cascade ordering (lower number = higher priority)
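A hypothetical capability declaration follows; the field names mirror the flags above, but the config shape is an assumption.

```typescript
// Capability flags for a custom provider registration (shape assumed).
const myProviderCapabilities = {
  streaming: true,             // can return async iterables
  multimodal: false,           // does not accept image inputs
  platform: "server" as const, // "any" | "desktop" | "server"
  priority: 25,                // lower number = higher cascade priority
};
```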
Return Type
All providers must return a standardized response interface that includes response data, token usage, cost, and provider metadata. This ensures that downstream consumers never need to know which provider fulfilled a request.
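As a sketch, the standardized shape could be expressed like this. The field names are assumptions, since the guide specifies only the four categories of data.

```typescript
// Standardized provider return shape (field names assumed).
interface RelayResult {
  content: unknown;                                     // response data
  usage: { inputTokens: number; outputTokens: number }; // token usage
  costUsd: number;                                      // cost
  provider: { name: string; model: string };            // provider metadata
}
```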
Priority
The priority field controls where a custom provider sits in the cascade ordering. Built-in providers are ordered: Native Gateway (highest), Anthropic API, OpenAI API, Ollama (lowest). Custom providers can insert at any position.
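For example, with hypothetical priority values (the numbers are assumptions; only the built-in ordering comes from this guide), a custom provider could slot in between the Native Gateway and the Anthropic API:

```typescript
// Hypothetical cascade ordering; lower number = higher priority.
const cascade = [
  { name: "native-gateway", priority: 10 },     // highest priority
  { name: "my-custom-provider", priority: 15 }, // custom insert point
  { name: "anthropic-api", priority: 20 },
  { name: "openai-api", priority: 30 },
  { name: "ollama", priority: 40 },             // lowest priority
];
```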
Continue reading: Routing Engine for cost-based routing and the Condition Logic Engine.