Run your agents (or your end users') on a Claude subscription, not API rates.

inference-relay: ≈95% gross margins, because your bill stops scaling with users. Your system prompts, tool definitions, and orchestration logic never enter the relay. It also shortens enterprise sales cycles, because no new data processor enters your DPA stack.

Any agent, any language, any process, one baseURL.

$ curl -X POST http://localhost:7421/v1/messages \
    -H "Content-Type: application/json" \
    -d '{"model":"claude-sonnet-4-6","max_tokens":50,
         "messages":[{"role":"user","content":"hello"}]}'

Drop in from Python orchestrators, Go services, Rust agents, CI runners, or curl — the daemon translates each Anthropic SDK call through the relevant Claude subscription at ~10 ms warm-pool overhead.
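The same request can be sent from JavaScript without any SDK. The sketch below mirrors the curl command above; the helper names `buildMessagesRequest` and `sendMessage` are illustrative and not part of inference-relay, and it assumes the daemon is running locally and answers with JSON.

```javascript
// Base URL of the local daemon, taken from the curl example above.
const RELAY_BASE_URL = 'http://localhost:7421';

// Build the same request the curl command sends. Kept separate from the
// network call so it can be inspected or logged without sending anything.
// (Hypothetical helper, not an inference-relay API.)
function buildMessagesRequest(userText, maxTokens = 50) {
  return {
    url: `${RELAY_BASE_URL}/v1/messages`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'claude-sonnet-4-6',
        max_tokens: maxTokens,
        messages: [{ role: 'user', content: userText }],
      }),
    },
  };
}

// Send the request with the built-in fetch (Node 18+ or a browser).
// Assumes the daemon is already running and replies with a JSON body.
async function sendMessage(userText) {
  const { url, init } = buildMessagesRequest(userText);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Relay returned HTTP ${response.status}`);
  }
  return response.json();
}
```

Calling `sendMessage('hello').then(console.log)` should be roughly equivalent to the curl command, provided the daemon is listening on port 7421.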

Get Started | Documentation →
[Architecture diagram, patent pending. Components: application process / orchestration domain (102); inference relay, state-pruning active (104); content envelope, zero interception (106); logic envelope, RS256-verified (108); native gateway (110); protocol authority (112); hardware secure enclave, volatile (114); audit chain (116). v1.1 any-language entry (Python, Node, Go, Rust, curl) reaches the daemon (200) at 127.0.0.1:7421, loopback-only and JWS-signed, which forwards into the orchestration domain.]
The Billing Boundary

Effective June 15, 2026: Anthropic splits Claude subscription billing into two pools — interactive Claude Code stays subsidized; Agent SDK and programmatic third-party calls move to a metered credit pool ($20 / $100 / $200 per month) at full API rates beyond that. Serious users will deplete that pool in a day or two. inference-relay drives a real interactive Claude Code session under the hood, so your agents stay on the subscription pool that everyone else just left.

On April 4, 2026, Anthropic neutralized legacy tools that relied on unstable subscription bypasses. inference-relay establishes an authorized bridge between your application and native compute resources.

The New Standard for AI Infrastructure
For Developers: ≈95% Gross Margins; Your IP/Trade Secrets Stay Protected.
Shift the cost of high-volume inference to the user's subscription, a 98% cost reduction for the application developer. Point your Anthropic SDK at localhost:7421 from any language: Python, Node, Go, Rust, or curl. The daemon translates each call through the user's Claude subscription with ~10 ms warm-pool overhead (~2 s on the first cold spawn). Your system prompts, tool definitions, and orchestration logic never enter the relay. Users can inspect network traffic and processes on their own machine, but they cannot see your secret sauce. The mechanism is transparent (JWS-signed); the authority is not. The relay is a dumb pipe.
For Enterprise: Immediate Procurement.
Deploys via MDM, SCCM, Intune, JAMF, or any existing endpoint manager. Installs in user space: no registry writes, no system services. Bypass the six-month security review: execution stays inside your already-vetted security boundary, and the daemon binds to loopback only, so nothing on your network can reach it. Compliant by default: your existing Data Processing Agreements (DPAs) still apply, and no new data processor enters your DPA stack.
For End Users: Absolute Data Sovereignty.
The "Dumb Pipe" architecture ensures prompt and completion data never transits the management plane. Users utilize the computational resources they already pay for, ensuring their data stays local and their privacy remains absolute. Loopback-only binding: nothing on your network can reach the daemon.
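Because the daemon binds to loopback only, a client can cheaply confirm that its configured base URL really points at the local machine before sending anything. The following is a minimal client-side sketch, not a description of how the daemon itself binds; `isLoopbackUrl` and `assertLoopback` are hypothetical helper names.

```javascript
// Return true when a base URL points at the local machine only.
// (Hypothetical helper; inference-relay may or may not ship such a check.)
function isLoopbackUrl(baseUrl) {
  const { hostname } = new URL(baseUrl);
  return (
    hostname === 'localhost' ||
    hostname === '[::1]' || // IPv6 loopback; URL.hostname keeps the brackets
    /^127(\.\d{1,3}){3}$/.test(hostname) // any address in 127.0.0.0/8
  );
}

// Throw early if a misconfigured base URL would send prompts off-machine.
function assertLoopback(baseUrl) {
  if (!isLoopbackUrl(baseUrl)) {
    throw new Error(`Refusing non-loopback relay URL: ${baseUrl}`);
  }
  return baseUrl;
}
```

For example, `assertLoopback('http://127.0.0.1:7421')` returns the URL unchanged, while `assertLoopback('http://192.168.1.20:7421')` throws.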
Orchestration Domain (App Key)
The application layer manages Intent Resolution and Schema Synthesis. This involves lightweight heuristic calls to map user requests into structured instructions. Your proprietary methodologies and system prompts are isolated here, never entering the relay.
Execution Domain (User Subscription)
The user's subscription provides the Computational Throughput for High-Density Execution. Heavy analytical tasks and large-context processing are offloaded to the native gateway. Data remains within the user's authorized account boundary.
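One way to picture the two domains is as two differently addressed requests. This is only a sketch under stated assumptions: the function names, the `max_tokens` values, and the idea of passing a pre-compiled instruction string are illustrative, and the actual library may split the work differently. The Anthropic API URL and headers are the public ones; the local URL comes from the curl example earlier on this page.

```javascript
// Orchestration domain: a lightweight call paid for by the developer's API key.
// Proprietary system prompts would live here and never reach the relay.
function orchestrationRequest(appApiKey, model, instruction) {
  return {
    url: 'https://api.anthropic.com/v1/messages',
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-api-key': appApiKey,
        'anthropic-version': '2023-06-01',
      },
      body: JSON.stringify({
        model,
        max_tokens: 256, // illustrative value for a small mapping call
        messages: [{ role: 'user', content: instruction }],
      }),
    },
  };
}

// Execution domain: the heavy call goes to the local daemon and is funded by
// the user's subscription, so no developer API key is attached.
function executionRequest(model, structuredInstruction, documentText) {
  return {
    url: 'http://localhost:7421/v1/messages',
    init: {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        model,
        max_tokens: 4096, // illustrative value for a long extraction
        messages: [
          { role: 'user', content: `${structuredInstruction}\n\n${documentText}` },
        ],
      }),
    },
  };
}
```

Either descriptor can be passed to `fetch(url, init)`. The point of the split is that only the second request ever touches the relay.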
Providers
Provider        Auth           Platform
Claude CLI      Subscription   Desktop
Anthropic API   API Key        Any
OpenAI          API Key        Any
Ollama          None           Desktop
Atomic Session Continuity

Your stream never dies.

Token 200 of a 500-token response. Your AI provider goes down mid-paragraph. Your user sees... nothing. One continuous response. Two providers behind the curtain. Zero visible disruption.

STREAMING
Tokens flowing normally from your primary provider.
PROVIDER FAILURE
Connection drops at token 200. The relay captures everything delivered so far.
SEAMLESS CONTINUATION
Second provider picks up mid-sentence. Same async iterator. Your code never knows.
// ASC is automatic. Zero configuration.
// `relay` is assumed to be an already-initialized inference-relay client.
const stream = await relay.messages.create({
  model: 'claude-sonnet-4-6',
  stream: true,
  messages: [{ role: 'user', content: doc }],
});

// If your provider 529s mid-stream, the relay
// stitches to the next one. Your code never knows.
for await (const event of stream) {
  process.stdout.write(event.delta?.text ?? '');
}
Recovery: <50ms
Failovers: up to 2
Token loss: zero

Not retry-on-failure. Mid-stream state reconciliation across provider boundaries.
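To make the idea concrete, here is a minimal sketch of mid-stream stitching. It is not the relay's actual (patent-pending) implementation. It assumes each provider is a function `(prompt, prefix)` returning an async iterable of text chunks, where `prefix` is the text already delivered; that provider shape and the name `stitchStreams` are hypothetical.

```javascript
// Stitch several provider streams into one continuous stream.
// On a mid-stream failure, the next provider receives everything delivered
// so far and continues from there instead of restarting the response.
async function* stitchStreams(providers, prompt, maxFailovers = 2) {
  let delivered = ''; // text the consumer has already seen
  let failovers = 0;

  for (const provider of providers) {
    try {
      for await (const chunk of provider(prompt, delivered)) {
        delivered += chunk;
        yield chunk;
      }
      return; // this provider finished the response normally
    } catch (err) {
      failovers += 1;
      if (failovers > maxFailovers) throw err; // failover budget exhausted
      // otherwise fall through to the next provider in the list
    }
  }
  throw new Error('All providers failed before the response completed');
}
```

The consumer iterates `stitchStreams([...], prompt)` with the same `for await` loop as before, so a provider switch is invisible to it. A real implementation would also need to handle details this sketch ignores, such as chunk boundaries that split a token and differences in model behavior between providers.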

PATENT PENDING
v1.4.0 — Cross-Family Continuity
A stream that starts on one provider can finish on another — with automatic model equivalence mapping. Your user gets one continuous response regardless of which providers produced it.
Enterprise: Safe by Design

Data stays within your company's existing, approved Claude subscription. No new vendor approvals. No data processing reviews. No security overhead. Your app becomes compliant by default — skip the 6-month procurement cycle.

We never see your prompts. The library is a dumb pipe: metadata only.
promptContent: false

Regulatory & Privilege Sovereignty

The Relay does not process your data. It orchestrates your resources.

Preserve AC / AWP Privilege

Moving data to a third-party developer's API key can constitute a waiver of Attorney-Client (AC) Privilege and Attorney Work Product (AWP) protection. inference-relay maintains the execution within your organization's authorized security boundary, ensuring the legal nexus remains exclusively between you and your AI provider.

Maintain HIPAA & Industry Compliance

Regulated industries utilize AI-powered applications without the risk of PHI exfiltration. Because prompt content flows directly to a vetted AI provider via the Native Gateway, the application developer is never a data processor. The existing Business Associate Agreement (BAA) or Data Processing Agreement (DPA) between the user and the AI provider remains the sole governing framework.

Self-Hosted Automation

Stop paying high monthly API credits for private research engines or personal agentic pipelines. inference-relay routes heavy workloads to an existing flat-rate subscription. This enables enterprise-grade performance for the cost of a single subscription.

Internal tools at zero marginal cost.
Claude Max ($100/mo) + inference-relay ($50/mo) = unlimited private automation for $150/mo
The New Deployment Model

Route the inference.
Keep the margin.

AI apps don't need an API budget anymore.

Every AI app today buys tokens wholesale and resells them retail. Margins crash. Procurement reviews drag. Costs scale with every user you add. inference-relay routes execution through subscriptions and keys your users already own. Your margin moves from ~15% to ~95%. Your API bill stops scaling. Three deployment patterns cover every product shape:

Solo Dev
Self-Testing Sandbox

Ordinarily, every test run during development charges your API key for code that hasn’t even shipped. inference-relay routes those calls to the Claude subscription you already pay for. Iterate as many times as you want; the API tab stops growing the moment you import the library.

Dev API Spend → $0
Desktop Apps
Subscription-Distributed Inference

Desktop features that need 150,000-token contexts die on the API price card. inference-relay routes execution through each user’s own Claude subscription — fundamentally cheaper than the API, and paid by the user. The feature ships. Your per-user cost is zero.

Per-User Cost → $0
Web SaaS
Key-Distributed Browser Routing

Selling AI to a regulated enterprise means a six-month procurement review to decide if you’re a “data processor.” inference-relay routes execution through the user’s own browser, with their own provider keys, encrypted client-side. Your server never touches the data. You never need to qualify.

Data-Processor Status → NEVER
Three deployment patterns. Three ways to never resell another token.
Your IDE Is the Dashboard

inference-relay ships an MCP server that turns your IDE into a live operational console. Query costs, monitor provider health, and manage your fleet — without leaving your editor.

Claude Desktop — MCP Tools
Financial Intelligence
Per-provider cost breakdown, projected burn rate, real-time savings tracking
Operational Health
Duration benchmarks (p50/p95/p99), fallback monitoring, provider availability
Security & Compliance
Audit trail, JWS handshake validation, telemetry leak scanning
Fleet Management
Multi-key status, automated rotation, activity log with type filtering
Claude Code (recommended)
claude mcp add inference-relay \
  --env IR_LICENSE_KEY=ir_live_xxxx \
  -- npx -y @inference-relay/mcp
Claude Desktop / Cursor / other MCP clients
{
  "mcpServers": {
    "inference-relay": {
      "command": "npx",
      "args": ["-y", "@inference-relay/mcp"],
      "env": {
        "IR_LICENSE_KEY": "ir_live_xxxx"
      }
    }
  }
}
Full MCP setup guide →
Superior Inference Economics

Shift computational liability from your balance sheet to the user's subscription.

Workflow                                               Traditional   Relay    Savings
High-Context Analysis (large document processing)      $1.20         $0.02    98.3%
Iterative Research (multi-step chained queries)        $0.50         $0.01    98.0%
Multi-Step Audit (fact-checking & cross-referencing)   $0.07         $0.005   92.9%
Standard Chat (single-turn responses)                  $0.04         $0.003   92.5%
Comparative Synthesis (50+ Sonnet calls per run)       $1.20         $0.02    98.3%

Break-even: 2–3 active users
Annual savings (100 users): $10,000+
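The savings column follows directly from the two cost columns: savings = (traditional − relay) / traditional. The short sketch below recomputes it from the per-run figures in the table; the function name `savingsPercent` is illustrative.

```javascript
// Savings as a percentage of the traditional cost, rounded to one decimal.
function savingsPercent(traditional, relay) {
  return (((traditional - relay) / traditional) * 100).toFixed(1);
}

// Per-run costs in USD, copied from the table above.
const workflows = [
  ['High-Context Analysis', 1.20, 0.02],
  ['Iterative Research', 0.50, 0.01],
  ['Multi-Step Audit', 0.07, 0.005],
  ['Standard Chat', 0.04, 0.003],
  ['Comparative Synthesis', 1.20, 0.02],
];

for (const [name, traditional, relay] of workflows) {
  console.log(`${name}: ${savingsPercent(traditional, relay)}%`);
}
// High-Context Analysis: 98.3%
// Iterative Research: 98.0%
// Multi-Step Audit: 92.9%
// Standard Chat: 92.5%
// Comparative Synthesis: 98.3%
```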
The SaaS Economics

AI margins are notoriously thin. By moving execution costs to the user's flat-rate subscription, your gross margin moves from ~20% to 95%+.

Benchmark: High-Context Document Analysis
DATASET: LEGAL_MEMO ≈ 15,000 WORDS
Metric               Direct API    inference-relay
Orchestration Cost   $0.0009       $0.0009
Execution Cost       $0.0856       $0.0000
Extraction Quality   Flat lists    Structured tables
Output Volume        3,721 chars   6,707 chars (+80%)
Gross Margin         ~15%          98.9%
High-Context Superiority

Direct API calls often truncate or over-summarize large documents to manage compute. Because inference-relay utilizes the official Claude Code binary, it inherits native prompt caching and optimized context management. In our benchmarks, the relay produced 80% more detailed extractions with structured cross-references — not flat summaries.

IP Protection Verified

We stress-tested the Two-Envelope protocol by processing a document with six proprietary trade-secret terms embedded in the orchestration layer.

Proprietary terms in relay logs: 0 / 6
The logic stayed in the app. The cost moved to the user.
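A check of this kind is straightforward to reproduce on your own logs. The sketch below shows one possible approach, not the test harness used for the result above: it scans log lines for a list of marker terms and reports which ones appear. The function name, the sample terms, and the sample log lines are all made up for illustration.

```javascript
// Return the subset of `terms` that appear anywhere in `logLines`.
// Matching is case-insensitive and substring-based.
function findLeakedTerms(logLines, terms) {
  const haystack = logLines.join('\n').toLowerCase();
  return terms.filter((term) => haystack.includes(term.toLowerCase()));
}

// Hypothetical marker terms planted in the orchestration layer.
const secretTerms = ['ZephyrScore', 'lattice-ranking', 'Project Heron'];

// Hypothetical relay log lines containing metadata only.
const relayLog = [
  'call_id=1 provider=claude-cli duration_ms=812 status=ok',
  'call_id=2 provider=claude-cli duration_ms=640 status=ok',
];

const leaked = findLeakedTerms(relayLog, secretTerms);
console.log(`Proprietary terms in relay logs: ${leaked.length} / ${secretTerms.length}`);
// Proprietary terms in relay logs: 0 / 3
```

A substring scan only catches verbatim leaks. Paraphrased or encoded content would need a different check.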
Pricing
Solo
$50/mo
All providers, fallback cascade, streaming, auto-patch, analytics. 3,000 calls/month.
Pro
$100/mo
15,000 calls/month. Warm process pool, advanced routing DSL, tamper-evident audit trail.
Enterprise
Custom
Fleet policy via MDM. Org-wide key management. SSO/SCIM. Dedicated support.
The Post-April 4 Billing Boundary

Anthropic recently restricted third-party “harnesses” that utilized subscription tokens for direct API access. inference-relay maintains a stable, compliant architecture by utilizing official binary protocols rather than deprecated scraping methods.

Orchestration — App Key
The developer's API key funds instruction compilation for relay to the Native Subscription Gateway.
Execution — User Subscription
The user's subscription funds high-volume inference via the official claude CLI binary.
Revenue Alignment: Anthropic receives revenue from both the developer (API) and the end-user (Subscription). This architecture preserves platform integrity and utilizes official caching infrastructure.
Decouple intelligence from inference costs.
Get Started | Read the Docs →