inference-relay: ≈95% gross margins — your bill stops scaling with users. Your system prompts, tool definitions, and orchestration logic never enter the relay. And it's an enterprise sales shortcut because no new data processor enters your DPA stack.
Any agent, any language, any process, one baseURL.
$ curl -X POST http://localhost:7421/v1/messages \ -H "Content-Type: application/json" \ -d '{"model":"claude-sonnet-4-6","max_tokens":50, "messages":[{"role":"user","content":"hello"}]}'
Drop in from Python orchestrators, Go services, Rust agents, CI runners, or curl — the daemon translates each Anthropic SDK call through the relevant Claude subscription at ~10 ms warm-pool overhead.
Effective June 15, 2026: Anthropic splits Claude subscription billing into two pools — interactive Claude Code stays subsidized; Agent SDK and programmatic third-party calls move to a metered credit pool ($20 / $100 / $200 per month) at full API rates beyond that. Serious users will deplete that pool in a day or two. inference-relay drives a real interactive Claude Code session under the hood, so your agents stay on the subscription pool that everyone else just left.
On April 4, 2026, Anthropic neutralized legacy tools that relied on unstable subscription bypasses. inference-relay establishes an authorized bridge between your application and native compute resources.
localhost:7421 from any language — Python, Node, Go, Rust, curl. The daemon translates each call through the user's Claude subscription with ~10 ms warm-pool overhead (~2 s on first cold-spawn). Your system prompts, tool definitions, and orchestration logic never enter the relay. Users can inspect network traffic and processes on their own machine, but they cannot see your Secret Sauce. The mechanism is transparent (JWS-signed); the authority is not — the relay is a dumb pipe.Token 200 of a 500-token response. Your AI provider goes down mid-paragraph. Your user sees... nothing. One continuous response. Two providers behind the curtain. Zero visible disruption.
// ASC is automatic. Zero configuration. const stream = await relay.messages.create({ model: 'claude-sonnet-4-6', stream: true, messages: [{ role: 'user', content: doc }], }); // If your provider 529s mid-stream, the relay // stitches to the next one. Your code never knows. for await (const event of stream) { process.stdout.write(event.delta?.text ?? ''); }
Not retry-on-failure. Mid-stream state reconciliation across provider boundaries.
Data stays within your company's existing, approved Claude subscription. No new vendor approvals. No data processing reviews. No security overhead. Your app becomes compliant by default — skip the 6-month procurement cycle.
We never see your prompts. The library is a dumb pipe — metadata only.promptContent: false
The Relay does not process your data. It orchestrates your resources.
Moving data to a third-party developer's API key can constitute a waiver of Attorney-Client (AC) Privilege and Attorney Work Product (AWP) protection. inference-relay maintains the execution within your organization's authorized security boundary, ensuring the legal nexus remains exclusively between you and your AI provider.
Regulated industries utilize AI-powered applications without the risk of PHI exfiltration. Because prompt content flows directly to a vetted AI provider via the Native Gateway, the application developer is never a data processor. The existing Business Associate Agreement (BAA) or Data Processing Agreement (DPA) between the user and the AI provider remains the sole governing framework.
Stop paying high monthly API credits for private research engines or personal agentic pipelines. inference-relay routes heavy workloads to an existing flat-rate subscription. This enables enterprise-grade performance for the cost of a single subscription.
Every AI app today buys tokens wholesale and resells them retail. Margins crash. Procurement reviews drag. Costs scale with every user you add. inference-relay routes execution through subscriptions and keys your users already own. Your margin moves from ~15% to ~95%. Your API bill stops scaling. Three deployment patterns cover every product shape:
Ordinarily, every test run during development charges your API key for code that hasn’t even shipped. inference-relay routes those calls to the Claude subscription you already pay for. Iterate as many times as you want; the API tab stops growing the moment you import the library.
Desktop features that need 150,000-token contexts die on the API price card. inference-relay routes execution through each user’s own Claude subscription — fundamentally cheaper than the API, and paid by the user. The feature ships. Your per-user cost is zero.
Selling AI to a regulated enterprise means a six-month procurement review to decide if you’re a “data processor.” inference-relay routes execution through the user’s own browser, with their own provider keys, encrypted client-side. Your server never touches the data. You never need to qualify.
inference-relay ships an MCP server that turns your IDE into a live operational console. Query costs, monitor provider health, and manage your fleet — without leaving your editor.
claude mcp add inference-relay \ --env IR_LICENSE_KEY=ir_live_xxxx \ -- npx -y @inference-relay/mcp
{
"mcpServers": {
"inference-relay": {
"command": "npx",
"args": ["-y", "@inference-relay/mcp"],
"env": {
"IR_LICENSE_KEY": "ir_live_xxxx"
}
}
}
}Shift computational liability from your balance sheet to the user's subscription.
| Workflow | Traditional | Relay | Savings |
|---|---|---|---|
| High-Context Analysis Large document processing | $1.20 | $0.02 | 98.3% |
| Iterative Research Multi-step chained queries | $0.50 | $0.01 | 98.0% |
| Multi-Step Audit Fact-checking & cross-referencing | $0.07 | $0.005 | 92.9% |
| Standard Chat Single-turn responses | $0.04 | $0.003 | 92.5% |
| Comparative Synthesis 50+ Sonnet calls per run | $1.20 | $0.02 | 98.3% |
AI margins are notoriously thin. By moving execution costs to the user's flat-rate subscription, your gross margin moves from ~20% to 95%+.
DATASET: LEGAL_MEMO ≈ 15,000 WORDS| Metric | Direct API | inference-relay |
|---|---|---|
| Orchestration Cost | $0.0009 | $0.0009 |
| Execution Cost | $0.0856 | $0.0000 |
| Extraction Quality | Flat lists | Structured tables |
| Output Volume | 3,721 chars | 6,707 chars (+80%) |
Direct API calls often truncate or over-summarize large documents to manage compute. Because inference-relay utilizes the official Claude Code binary, it inherits native prompt caching and optimized context management. In our benchmarks, the relay produced 80% more detailed extractions with structured cross-references — not flat summaries.
We stress-tested the Two-Envelope protocol by processing a document with six proprietary trade-secret terms embedded in the orchestration layer.
Anthropic recently restricted third-party “harnesses” that utilized subscription tokens for direct API access. inference-relay maintains a stable, compliant architecture by utilizing official binary protocols rather than deprecated scraping methods.
claude CLI binary.