Provider Architecture — The Resource Gateway Guide

Overview

inference-relay ships with four built-in providers and supports unlimited custom providers through an extensible architecture. Each provider implements the BaseProvider interface, which defines six abstract methods that govern health reporting, execution, streaming, and cost estimation.

The provider system is designed so that no single provider is a dependency. Any provider can be removed, added, or swapped without affecting the relay's core behavior. The cascade engine treats all providers as interchangeable resource endpoints differentiated only by capability flags and priority.
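The interchangeable-endpoint contract above can be sketched as an abstract class. This is a minimal illustration, not inference-relay's actual source: the member names (probeHealth, executeStream, etc.), the RelayRequest/RelayResult field sets, and the StubProvider are assumptions chosen to mirror the six capabilities described later in this guide.

```typescript
// Hypothetical sketch of the BaseProvider contract; names are illustrative.
interface RelayRequest {
  model: string;
  messages: Array<{ role: string; content: string }>;
}

interface RelayResult {
  text: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  provider: string;
}

type HealthState = "healthy" | "degraded" | "unhealthy";

abstract class BaseProvider {
  // Health Probe: is the endpoint reachable and authenticated?
  abstract probeHealth(): Promise<boolean>;
  // Synchronous Execution: one request, one complete response.
  abstract execute(req: RelayRequest): Promise<RelayResult>;
  // Stream Execution: incremental output as an async iterable.
  abstract executeStream(req: RelayRequest): AsyncIterable<string>;
  // Predictive Resource Allocation: estimated cost before sending.
  abstract estimateCost(req: RelayRequest): number;
  // Health Reporting: current state for the cascade engine.
  abstract health(): HealthState;
  // Rate Limit Acceptance: the cascade engine pushes throttle signals in.
  abstract acceptRateLimit(retryAfterMs: number): void;
}

// Tiny stub showing how a concrete provider satisfies the contract.
class StubProvider extends BaseProvider {
  async probeHealth() { return true; }
  async execute(_req: RelayRequest): Promise<RelayResult> {
    return { text: "ok", inputTokens: 0, outputTokens: 1, costUsd: 0, provider: "stub" };
  }
  async *executeStream(_req: RelayRequest) { yield "ok"; }
  estimateCost(_req: RelayRequest) { return 0; }
  health(): HealthState { return "healthy"; }
  acceptRateLimit(_retryAfterMs: number) {}
}
```

Because every provider satisfies the same contract, the cascade engine can reorder or replace providers freely.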

Native Subscription Gateway

The Native Subscription Gateway routes inference through the user's existing subscription, leveraging their authenticated desktop session rather than consuming developer API credits.

  • Cost to Developer — $0.00. The user's subscription absorbs all compute.
  • Platform — Desktop only (macOS, Linux, Windows)
  • Streaming — Yes
  • Multimodal — Auto-cascades to API providers for image inputs

Hardware-Bound Security

Authentication is handled entirely through Hardware-Authorized Secure Enclaves — platform-native credential stores that bind secrets to the physical device. Credentials never transit application memory in plaintext and cannot be extracted by other processes.

How It Works

The gateway opens a channel to the local subscription runtime. Responses are decoded via Asynchronous Stream Decoding, which processes the output as a continuous byte stream rather than discrete HTTP responses. A set of Logic Mappings then translates the subscription output format into the standardized RelayResult interface.

Advantages

  • Zero marginal cost — Every request through this gateway is free to the developer.
  • Context-Preservation Parity — The local runtime has direct knowledge of remaining context budget, unlike cloud-throttled API endpoints that report limits only after rejection.
  • No third-party round trip — Inference runs on the user's machine or through their already-authenticated session, with no intermediate API hop.

Cloud-based API endpoints often apply aggressive context-compression to manage global compute load. The Native Gateway maintains high-fidelity synchronization with the model's native context window, resulting in approximately 80% more structured detail in complex extractions compared to equivalent direct API calls at the same token budget.

Limitations

  • Desktop-only — not available in web or server environments.
  • Multimodal requests (images) are automatically cascaded to an API provider.

Anthropic API Provider

Direct integration with the Anthropic Messages API.

  • Cost — Standard Anthropic pricing
  • Platform — Any (web, server, desktop)
  • Streaming — Yes (Server-Sent Events)
  • Multimodal — Yes

Features

  • Full support for streaming, multimodal inputs, and tool use
  • Native request format — no translation layer required

Error Classification

  • 401 Unauthorized — Mark unhealthy (invalid or revoked key)
  • 429 Rate Limited — Mark degraded (key valid but throttled)
  • 529 Overloaded — Mark degraded (key valid, service under load)
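The classification above amounts to a small status-to-health mapping. A sketch, assuming a standalone classifier function (the name and the behavior for unlisted statuses are illustrative assumptions):

```typescript
type HealthState = "healthy" | "degraded" | "unhealthy";

// Hypothetical classifier mirroring the Anthropic error table above.
function classifyAnthropicStatus(status: number): HealthState {
  if (status === 401) return "unhealthy";            // invalid or revoked key
  if (status === 429 || status === 529) return "degraded"; // valid key, throttled or overloaded
  return "healthy"; // other statuses leave health unchanged in this sketch
}
```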

OpenAI API Provider

Translates requests from the Anthropic message format into OpenAI Chat Completions format, and translates responses back. This allows inference-relay users to access OpenAI models without changing their application code.

  • Cost — Standard OpenAI pricing
  • Platform — Any (web, server, desktop)
  • Streaming — Yes (Server-Sent Events)
  • Multimodal — Yes

Automatic Format Translation

The provider handles bidirectional translation of:

  • Message structure — Anthropic role/content blocks ↔ OpenAI messages array
  • Tool calls — Anthropic tool_use/tool_result blocks ↔ OpenAI function_call/tool_calls
  • Stop reasons — Anthropic end_turn/tool_use ↔ OpenAI stop/tool_calls

No application code changes are needed when routing shifts between Anthropic and OpenAI providers.
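The translation can be sketched for the simplest case, a text-only message plus stop-reason mapping. The type shapes and function names below are simplified assumptions; the real provider covers tool calls and multimodal blocks as well.

```typescript
// Anthropic content is an array of typed blocks; OpenAI text content
// is a single string.
interface AnthropicMessage {
  role: "user" | "assistant";
  content: Array<{ type: "text"; text: string }>;
}
interface OpenAIMessage {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical one-way mapping for text-only messages.
function toOpenAI(msg: AnthropicMessage): OpenAIMessage {
  const content = msg.content
    .filter(b => b.type === "text")
    .map(b => b.text)
    .join("\n");
  return { role: msg.role, content };
}

// Stop-reason mapping from the bullet list above.
function mapStopReason(anthropic: "end_turn" | "tool_use"): "stop" | "tool_calls" {
  return anthropic === "end_turn" ? "stop" : "tool_calls";
}
```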

Supported Models (v1.4.0)

  • gpt-4.1 — Flagship ($2.00 / $8.00 per 1M tokens)
  • gpt-4.1-mini — Efficient ($0.40 / $1.60)
  • gpt-4.1-nano — Minimal ($0.10 / $0.40)
  • gpt-4o — Multimodal ($2.50 / $10.00)
  • gpt-4o-mini — Efficient multimodal ($0.15 / $0.60)
  • o4-mini — Reasoning ($1.10 / $4.40)
  • o3 — Reasoning ($2.00 / $8.00)
  • o3-mini — Efficient reasoning ($1.10 / $4.40)
  • o1 — Reasoning, legacy ($15.00 / $60.00)

Legacy models (gpt-4-turbo, gpt-4, gpt-3.5-turbo) are retained for backward compatibility. Use modelsForProvider('openai-api') for the current list. See the Model Registry for full details.
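A worked cost example using the price table above (USD per 1M tokens, input / output). The pricing-map shape and function name are assumptions for illustration, not the library's cost estimator.

```typescript
// Subset of the v1.4.0 price table, USD per 1M tokens.
const OPENAI_PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4.1":      { input: 2.00, output: 8.00 },
  "gpt-4.1-mini": { input: 0.40, output: 1.60 },
  "gpt-4o-mini":  { input: 0.15, output: 0.60 },
};

function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = OPENAI_PRICES[model];
  if (!p) throw new Error(`unknown model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// 10k input + 2k output tokens on gpt-4.1-mini:
// 10_000 * 0.40/1M + 2_000 * 1.60/1M = 0.004 + 0.0032 ≈ 0.0072 USD
```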

Ollama Provider

Local inference through an Ollama instance running on the user's machine.

  • Cost — $0.00 (runs entirely on user hardware)
  • Platform — Any (requires Ollama installed and running)
  • Streaming — Yes
  • Multimodal — Model-dependent (llava supports images)
  • Tool Use — Model-dependent (v1.4.0 — detected at runtime)
  • Auth — None required

Connection

Connects to Ollama's local API at localhost:11434 by default. The port is configurable via provider options.
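A sketch of deriving the endpoint from provider options. The option names (host, port) are assumptions about the options shape; only the default port 11434 comes from the text above.

```typescript
interface OllamaOptions {
  host?: string;
  port?: number;
}

// Hypothetical helper: build the base URL, falling back to Ollama's
// default local API address.
function ollamaBaseUrl(opts: OllamaOptions = {}): string {
  const host = opts.host ?? "localhost";
  const port = opts.port ?? 11434;
  return `http://${host}:${port}`;
}
```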

Tool Support Detection (v1.4.0)

The relay detects tool-capable Ollama models at runtime by examining the Modelfile template. Models that include tool markers in their template support native function calling.

  • llama3.1 — Tool support verified
  • mistral — Tool support verified
  • mistral-nemo — Tool support verified
  • qwen2:0.5b — No tool support

When tool support is confirmed, the relay translates Anthropic-format tool definitions to Ollama's native format automatically. When a model lacks tool support and params.tools is set, the cascade advances to the next provider.
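The detect-then-cascade logic can be sketched as two predicates. The specific marker strings checked here are assumptions about what a tool-capable Modelfile template contains, not the relay's actual detection rules.

```typescript
// Hypothetical check: tool-capable Ollama templates reference tool
// variables or emit tool_call markers.
function templateSupportsTools(template: string): boolean {
  return template.includes(".Tools") || template.includes("tool_call");
}

// If tools were requested but the model's template lacks support,
// the cascade advances to the next provider.
function shouldCascadePast(toolsRequested: boolean, template: string): boolean {
  return toolsRequested && !templateSupportsTools(template);
}
```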

Use Cases

  • Development and testing — Free, fast iteration without burning API credits
  • Air-gapped environments — No network access required
  • Privacy-sensitive workloads — Data never leaves the machine
  • Cost elimination — Pair with cloud providers in a cascade; Ollama handles what it can

System Compatibility Matrix

  • Native Gateway — Desktop, Secure Enclave auth, $0.00, streaming, cascades multimodal
  • Anthropic API — Any platform, API key auth, standard pricing, SSE streaming, multimodal
  • OpenAI API — Any platform, API key auth, standard pricing, SSE streaming, multimodal
  • Ollama — Any platform, no auth, $0.00, streaming, model-dependent multimodal

Protocol Extensibility — Custom Providers

Any inference endpoint can be integrated by implementing the BaseProvider abstract class.

Required Capabilities

  • Health Probe — Check if the provider is available and authenticated
  • Synchronous Execution — Send a non-streaming request and return a complete response
  • Stream Execution — Send a streaming request and return an async iterable
  • Predictive Resource Allocation — Estimate cost for a given request before execution
  • Health Reporting — Return current health state (healthy, degraded, unhealthy)
  • Rate Limit Acceptance — Accept rate-limit signals from the cascade engine

Capability Flags

Custom providers declare their capabilities through configuration:

  • Streaming support — Whether the provider can return async iterables
  • Multimodal support — Whether the provider handles image inputs
  • Platform requirement — any, desktop, or server
  • Priority — Controls cascade ordering (lower number = higher priority)
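A capability declaration might look like the following. The field names mirror the list above, but the exact configuration schema is an assumption, as is the example priority value.

```typescript
interface ProviderCapabilities {
  streaming: boolean;                     // can return async iterables
  multimodal: boolean;                    // handles image inputs
  platform: "any" | "desktop" | "server"; // where the provider can run
  priority: number;                       // lower number = higher priority
}

// Hypothetical custom provider: server-only, text-only, streaming.
const myProviderCaps: ProviderCapabilities = {
  streaming: true,
  multimodal: false,
  platform: "server",
  priority: 25,
};
```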

Return Type

All providers must return a standardized response interface that includes response data, token usage, cost, and provider metadata. This ensures that downstream consumers never need to know which provider fulfilled a request.

Priority

The priority field controls where a custom provider sits in the cascade ordering. Built-in providers are ordered: Native Gateway (highest), Anthropic API, OpenAI API, Ollama (lowest). Custom providers can insert at any position.
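The ordering above can be sketched as a sort over priority values. The numeric priorities assigned to the built-in providers here are invented for illustration; only their relative order comes from the text.

```typescript
interface Ranked {
  name: string;
  priority: number; // lower = tried first
}

// Built-in order from the text, with a custom provider inserted
// between the Native Gateway and the Anthropic API.
const providers: Ranked[] = [
  { name: "anthropic-api",  priority: 20 },
  { name: "ollama",         priority: 40 },
  { name: "native-gateway", priority: 10 },
  { name: "custom-proxy",   priority: 15 }, // hypothetical custom provider
  { name: "openai-api",     priority: 30 },
];

const cascadeOrder = [...providers]
  .sort((a, b) => a.priority - b.priority)
  .map(p => p.name);
```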

Continue reading: Routing Engine for cost-based routing and the Condition Logic Engine.