Getting Started

From install to first successful relay in under five minutes.

Prerequisites

  • Node.js 18+
  • Claude Code CLI installed: npm install -g @anthropic-ai/claude-code
  • Active Claude subscription (Pro, Max, or Team) — this is your Authorized Execution Environment
  • inference-relay license key — get one at inference-relay.com/pricing

Install

npm install inference-relay

Configure

export IR_LICENSE_KEY=ir_live_xxxx

Replace ir_live_xxxx with the license key from your dashboard.

Integration Levels

inference-relay supports three integration levels. Pick the one that matches how much control you need over your provider configuration. All three are functionally equivalent — they just trade ergonomics for granularity.

  • Level 1: import 'inference-relay/auto'. Existing Anthropic SDK code; zero changes elsewhere.
  • Level 2: new InferenceRelay({...}). Multiple SDK instances, custom routing, granular fallback control.
  • Level 3: IR_PROVIDER=claude-cli. Zero source changes; CI/CD, ephemeral deploys, contractor handoffs.

Quick Start (Level 1 — Auto-patch)

import 'inference-relay/auto';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const msg = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Analyze this contract clause...' }],
});

console.log(msg.content[0].text);
// These fields are added by inference-relay (not part of the standard SDK response)
console.log('Provider:', msg.provider);  // 'claude-cli'
console.log('Cost:', msg.costUsd);       // $0.00

The import 'inference-relay/auto' line patches the Anthropic SDK at load time, which is why it appears before the SDK import. No other code changes are required: your existing Anthropic SDK calls route through the Native Subscription Gateway automatically.

Streaming

import 'inference-relay/auto';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const stream = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  stream: true,
  messages: [{ role: 'user', content: 'Summarize this document...' }],
});

for await (const event of stream) {
  if (event.type === 'content_block_delta') {
    process.stdout.write(event.delta.text);
  }
}

Asynchronous Stream Decoding handles chunked responses from the Native Subscription Gateway with the same event interface as the standard Anthropic SDK.
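
If you need the complete response as one string rather than writing to stdout, the usual accumulation pattern carries over unchanged. A minimal sketch, assuming text-only output (the text_delta guard skips any non-text deltas):

import 'inference-relay/auto';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const stream = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  stream: true,
  messages: [{ role: 'user', content: 'Summarize this document...' }],
});

// Accumulate streamed deltas into a single string
let text = '';
for await (const event of stream) {
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    text += event.delta.text;
  }
}
console.log(text);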

Quick Start (Level 2 — Explicit)

Wrap a specific client when you need granular control: multiple SDK instances with different routing rules, an explicit fallback chain, custom provider priorities, or a routing strategy that depends on the request shape. The explicit form takes the same configuration object documented on the Routing and Fallback pages.

import { InferenceRelay } from 'inference-relay';

const relay = new InferenceRelay({
  licenseKey: process.env.IR_LICENSE_KEY,
  providers: [
    { type: 'claude-cli', priority: 1 },     // Subscription-routed (free)
    { type: 'anthropic-api', priority: 2 },  // Direct API fallback
  ],
  fallback: true,
});

const msg = await relay.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Analyze this contract clause...' }],
});

console.log(msg.content[0].text);
console.log('Provider:', msg.provider);   // 'claude-cli' (or 'anthropic-api' on fallback)
console.log('Cost:', msg.costUsd);        // $0.00 via gateway

The explicit form is the right choice when you have multiple Anthropic SDK clients with different requirements (e.g., one for production routing through the gateway, one for evaluation routing directly to the API for parity testing), or when your application needs to inspect msg.provider and msg.fallbackChain to make routing decisions based on which provider served a given call.
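
A sketch of that two-client pattern, using only the configuration options shown above (treating fallback: false for the evaluation client as an assumption):

import { InferenceRelay } from 'inference-relay';

// Production: gateway first, direct API as fallback
const production = new InferenceRelay({
  licenseKey: process.env.IR_LICENSE_KEY,
  providers: [
    { type: 'claude-cli', priority: 1 },
    { type: 'anthropic-api', priority: 2 },
  ],
  fallback: true,
});

// Evaluation: direct API only, for parity testing against the gateway
const evaluation = new InferenceRelay({
  licenseKey: process.env.IR_LICENSE_KEY,
  providers: [{ type: 'anthropic-api', priority: 1 }],
  fallback: false,
});

const msg = await production.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Analyze this contract clause...' }],
});

// Route-aware logic: flag calls the gateway did not serve
if (msg.provider !== 'claude-cli') {
  console.warn('Served via fallback:', msg.fallbackChain);
}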

Quick Start (Level 3 — Environment Variable)

Set IR_PROVIDER=claude-cli in your environment and inference-relay routes through the Native Subscription Gateway with zero source changes to your application. This is the right choice for CI/CD pipelines, ephemeral deploys, contractor handoffs where you don't want to touch the codebase, or any context where the provider decision is made by the deploy environment rather than the code.

# .env or shell export — no source changes
export IR_LICENSE_KEY=ir_live_xxxx
export IR_PROVIDER=claude-cli

Then run your existing application unchanged. Any Anthropic SDK call inside the process — whether it's using @anthropic-ai/sdk directly or via a framework like LangChain or Mastra — routes through the gateway automatically. To revert a process to direct API calls, remove the variable from its environment.

When to use which: reach for Level 1 if you control the source and want the smallest possible diff. Reach for Level 2 if you have multiple SDK clients or need custom routing. Reach for Level 3 if you can't modify the source at all (a sealed binary, a third-party application, a Docker image you don't own).

Verify Installation

npx inference-relay verify

Expected output:

[ir] IDENTITY:   VALIDATED
[ir] TIER:       PRO
[ir] RELAY:      ACTIVE
     Claude CLI: v2.1.96
     Tier:       max
     Usage:      19 / 5,000 (0.4%)

Three checks: IDENTITY means your license key passed the JWS handshake; TIER reflects what tier the validated key resolves to; RELAY: ACTIVE means a Native Subscription Gateway provider (Claude CLI) is installed and authenticated. If RELAY shows INACTIVE, the relay still works — it falls back to direct API providers — but the 98.9% savings only apply when the gateway is active.

Exit codes: 0 all green, 1 license invalid or relay inactive, 2 missing or malformed IR_LICENSE_KEY. Suitable for use in CI scripts.
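
For example, a minimal CI gate in Node, assuming only the exit codes above (the wrapper script itself is illustrative):

import { spawnSync } from 'node:child_process';

// Run the verify command and stream its [ir] output to the CI log
const result = spawnSync('npx', ['inference-relay', 'verify'], { stdio: 'inherit' });

// 0 = all green, 1 = license invalid or relay inactive, 2 = bad IR_LICENSE_KEY
if (result.status !== 0) {
  console.error(`verify failed with exit code ${result.status}`);
  process.exit(result.status ?? 1);
}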

MCP Server (optional)

Add inference-relay to Claude Code, Claude Desktop, or Cursor and query your relay fleet from inside your IDE. The 19-tool catalog covers financial intelligence, operational health, security verification, manifest sync, and fleet management. See the MCP setup guide for the full walkthrough. The one-line install for Claude Code:

claude mcp add inference-relay \
  --env IR_LICENSE_KEY=ir_live_xxxx \
  -- npx -y @inference-relay/mcp

How It Works

  • Handshake — The library performs an asymmetric handshake with the Protocol Authority to synchronize execution parameters. RS256-signed payloads establish a Signed Trust Chain between your application and the relay.
  • Routing — The Native Subscription Gateway executes inference on the user's machine using their existing Claude subscription. Heavy models (Sonnet, Opus) run at $0.00 to the developer.
  • Orchestration — Your application pays only for lightweight orchestration. A Haiku-class call for instruction compilation and routing costs approximately $0.0003 (see the cost sketch after this list). The gateway handles everything else.
  • Fallback — If the Native Subscription Gateway is unavailable, the relay falls back to direct API with your configured keys. No request is ever dropped.
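
A back-of-the-envelope sketch of that cost model; the per-call figure is the one quoted in the list, and the monthly volume is illustrative:

// Orchestration is the only metered cost; gateway inference runs at $0.00
const orchestrationCostUsd = 0.0003; // approximate Haiku-class routing call
const callsPerMonth = 100_000;       // illustrative volume

// 100,000 calls × $0.0003 ≈ $30/month of orchestration spend
console.log(`~$${(orchestrationCostUsd * callsPerMonth).toFixed(2)}/month`);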

Context-Preservation Parity

Cloud-based API endpoints often apply aggressive context compression to manage global compute load. The Native Subscription Gateway maintains high-fidelity synchronization with the model's native context window, operating within the Hardware-Authorized Secure Enclave without the rate limiting and context compression applied to high-volume API traffic.

Testing across production workloads shows approximately 80% more structured detail in complex extractions via the Native Gateway compared to equivalent direct API calls at the same token budget.

Multi-Provider Configuration (v1.4.0)

The auto-patch pattern routes through the Native Subscription Gateway. For explicit multi-provider routing — Anthropic API, OpenAI, and Ollama — construct the relay directly:

import { InferenceRelay, modelsForProvider } from 'inference-relay';

const relay = new InferenceRelay({
  licenseKey: process.env.IR_LICENSE_KEY,
  providers: [
    { type: 'anthropic-api', apiKey: process.env.ANTHROPIC_API_KEY, priority: 1 },
    { type: 'openai-api', apiKey: process.env.OPENAI_API_KEY, priority: 2 },
    { type: 'ollama', baseUrl: 'http://localhost:11434', priority: 3 },
  ],
  telemetry: false,
});

// List available models per provider
console.log(modelsForProvider('anthropic-api'));
// → [{ canonical: 'claude-haiku-4-5', ... }, ...]

const result = await relay.messages.create({
  model: 'gpt-4o-mini',  // routes to OpenAI automatically
  max_tokens: 100,
  messages: [{ role: 'user', content: 'Hello' }],
});

The relay routes each request to the provider that owns the model. If the primary provider fails, the cascade falls through to the next. See the Model Registry for the full model list and Fallback for cascade behavior.
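
A sketch of that model-ownership routing, continuing from the relay configured above ('claude-haiku-4-5' appears in the registry listing earlier; 'llama3.1' is an assumed locally pulled Ollama model):

// Claude model name → anthropic-api provider
const viaAnthropic = await relay.messages.create({
  model: 'claude-haiku-4-5',
  max_tokens: 100,
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(viaAnthropic.provider); // 'anthropic-api'

// Ollama-hosted model name → ollama provider (model must be pulled locally)
const viaOllama = await relay.messages.create({
  model: 'llama3.1',
  max_tokens: 100,
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(viaOllama.provider); // 'ollama'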

Next Steps