Run your agents (or your end users') on a Claude subscription, not API rates.

inference-relay: ≈95% gross margins — your bill stops scaling with users. Your system prompts, tool definitions, and orchestration logic never enter the relay. And it's an enterprise sales shortcut because no new data processor enters your DPA stack.

Any agent, any language, any process, one baseURL.

A Blecher.ai project · Built by JDBlecher

$ curl -X POST http://localhost:7421/v1/messages \
    -H "Content-Type: application/json" \
    -d '{"model":"claude-sonnet-4-6","max_tokens":50,
         "messages":[{"role":"user","content":"hello"}]}'

Drop in from Python orchestrators, Go services, Rust agents, CI runners, or curl — the daemon translates each Anthropic SDK call through the relevant Claude subscription at ~10 ms warm-pool overhead.

Get Started Documentation →

PATENT PENDING

102APPLICATION PROCESS
ORCHESTRATION_DOMAIN
114HARDWARE
SECURE ENCLAVE
VOLATILE
104INFERENCE RELAY
STATE-PRUNING ACTIVE
106CONTENT
ENVELOPE
ZERO_INTERCEPTION
110NATIVE
GATEWAY
108LOGIC
ENVELOPE
RS256_VERIFIED
112PROTOCOL
AUTHORITY
116AUDIT
CHAIN

V1.1 — ANY-LANGUAGE ENTRY
PYTHON
NODE
GO
RUST
CURL
200DAEMON @ 127.0.0.1:7421
LOOPBACK-ONLY  •  JWS-SIGNED
↓ forwards into orchestration above

The Billing Boundary

Effective June 15, 2026: Anthropic splits Claude subscription billing into two pools — interactive Claude Code stays subsidized; Agent SDK and programmatic third-party calls move to a metered credit pool ($20 / $100 / $200 per month) at full API rates beyond that. Serious users will deplete that pool in a day or two. inference-relay drives a real interactive Claude Code session under the hood, so your agents stay on the subscription pool that everyone else just left.

On April 4, 2026, Anthropic neutralized legacy tools that relied on unstable subscription bypasses. inference-relay establishes an authorized bridge between your application and native compute resources.

The New Standard for AI Infrastructure

For Developers: ≈95% Gross Margins; Your IP/Trade Secrets Stay Protected.

Shift the liability of high-volume inference to the user's subscription, delivering 98% cost reduction for the application developer. Drop your Anthropic SDK against localhost:7421 from any language — Python, Node, Go, Rust, curl. The daemon translates each call through the user's Claude subscription with ~10 ms warm-pool overhead (~2 s on first cold-spawn). Your system prompts, tool definitions, and orchestration logic never enter the relay. Users can inspect network traffic and processes on their own machine, but they cannot see your Secret Sauce. The mechanism is transparent (JWS-signed); the authority is not — the relay is a dumb pipe.

For Enterprise: Immediate Procurement.

Deploys via MDM, SCCM, Intune, JAMF, or any existing endpoint manager. Installs in user space; no registry writes, no system services, no new data processor entering your DPA stack. Bypass the six-month security review: execution stays inside your already-vetted security boundary; the daemon binds loopback only, so nothing on your network can reach it. Compliant by Default. Utilize existing Data Processing Agreements (DPAs) without introducing a new data processor.

For End Users: Absolute Data Sovereignty.

The "Dumb Pipe" architecture ensures prompt and completion data never transits the management plane. Users utilize the computational resources they already pay for, ensuring their data stays local and their privacy remains absolute. Loopback-only binding: nothing on your network can reach the daemon.

Orchestration Domain (App Key)

The application layer manages Intent Resolution and Schema Synthesis. This involves lightweight heuristic calls to map user requests into structured instructions. Your proprietary methodologies and system prompts are isolated here, never entering the relay.

Execution Domain (User Subscription)

The user's subscription provides the Computational Throughput for High-Density Execution. Heavy analytical tasks and large-context processing are offloaded to the native gateway. Data remains within the user's authorized account boundary.

Providers

Claude CLI

Subscription

Desktop

Anthropic API

API Key

Any

OpenAI

API Key

Any

Ollama

None

Desktop

Atomic Session Continuity

Your stream never dies.

Token 200 of a 500-token response. Your AI provider goes down mid-paragraph. Your user sees... nothing. One continuous response. Two providers behind the curtain. Zero visible disruption.

●

STREAMING

Tokens flowing normally from your primary provider.

✗

PROVIDER FAILURE

Connection drops at token 200. The relay captures everything delivered so far.

●

SEAMLESS CONTINUATION

Second provider picks up mid-sentence. Same async iterator. Your code never knows.

// ASC is automatic. Zero configuration.
const stream = await relay.messages.create({
  model: 'claude-sonnet-4-6',
  stream: true,
  messages: [{ role: 'user', content: doc }],
});

// If your provider 529s mid-stream, the relay
// stitches to the next one. Your code never knows.
for await (const event of stream) {
  process.stdout.write(event.delta?.text ?? '');
}

<50ms

Recovery

up to 2

Failovers

zero

Token loss

Not retry-on-failure. Mid-stream state reconciliation across provider boundaries.

PATENT PENDING

v1.4.0 — Cross-Family Continuity

A stream that starts on one provider can finish on another — with automatic model equivalence mapping. Your user gets one continuous response regardless of which providers produced it.

◆

Enterprise: Safe by Design

Data stays within your company's existing, approved Claude subscription. No new vendor approvals. No data processing reviews. No security overhead. Your app becomes compliant by default — skip the 6-month procurement cycle.

We never see your prompts. The library is a dumb pipe — metadata only.promptContent: false

Regulatory & Privilege Sovereignty

The Relay does not process your data. It orchestrates your resources.

Preserve AC / AWP Privilege

Moving data to a third-party developer's API key can constitute a waiver of Attorney-Client (AC) Privilege and Attorney Work Product (AWP) protection. inference-relay maintains the execution within your organization's authorized security boundary, ensuring the legal nexus remains exclusively between you and your AI provider.

Maintain HIPAA & Industry Compliance

Regulated industries utilize AI-powered applications without the risk of PHI exfiltration. Because prompt content flows directly to a vetted AI provider via the Native Gateway, the application developer is never a data processor. The existing Business Associate Agreement (BAA) or Data Processing Agreement (DPA) between the user and the AI provider remains the sole governing framework.

Self-Hosted Automation

Stop paying high monthly API credits for private research engines or personal agentic pipelines. inference-relay routes heavy workloads to an existing flat-rate subscription. This enables enterprise-grade performance for the cost of a single subscription.

Internal tools at zero marginal cost.

Claude Max ($100/mo) + inference-relay ($50/mo) = unlimited private automation for $150/mo

The New Deployment Model

Route the inference.
Keep the margin.

AI apps don't need an API budget anymore.

Every AI app today buys tokens wholesale and resells them retail. Margins crash. Procurement reviews drag. Costs scale with every user you add. inference-relay routes execution through subscriptions and keys your users already own. Your margin moves from ~15% to ~95%. Your API bill stops scaling. Three deployment patterns cover every product shape:

Solo Dev

Self-Testing Sandbox

Ordinarily, every test run during development charges your API key for code that hasn’t even shipped. inference-relay routes those calls to the Claude subscription you already pay for. Iterate as many times as you want; the API tab stops growing the moment you import the library.

Dev API Spend

→ $0

Desktop Apps

Subscription-Distributed Inference

Desktop features that need 150,000-token contexts die on the API price card. inference-relay routes execution through each user’s own Claude subscription — fundamentally cheaper than the API, and paid by the user. The feature ships. Your per-user cost is zero.

Per-User Cost

→ $0

Web SaaS

Key-Distributed Browser Routing

Selling AI to a regulated enterprise means a six-month procurement review to decide if you’re a “data processor.” inference-relay routes execution through the user’s own browser, with their own provider keys, encrypted client-side. Your server never touches the data. You never need to qualify.

Data-Processor Status

→ NEVER

Three deployment patterns. Three ways to never resell another token.

Your IDE Is the Dashboard

inference-relay ships an MCP server that turns your IDE into a live operational console. Query costs, monitor provider health, and manage your fleet — without leaving your editor.

Claude Desktop — MCP Tools

Financial Intelligence

Per-provider cost breakdown, projected burn rate, real-time savings tracking

Operational Health

Duration benchmarks (p50/p95/p99), fallback monitoring, provider availability

Security & Compliance

Audit trail, JWS handshake validation, telemetry leak scanning

Fleet Management

Multi-key status, automated rotation, activity log with type filtering

Claude Code (recommended)

claude mcp add inference-relay \
  --env IR_LICENSE_KEY=ir_live_xxxx \
  -- npx -y @inference-relay/mcp

Claude Desktop / Cursor / other MCP clients

{
  "mcpServers": {
    "inference-relay": {
      "command": "npx",
      "args": ["-y", "@inference-relay/mcp"],
      "env": {
        "IR_LICENSE_KEY": "ir_live_xxxx"
      }
    }
  }
}

Full MCP setup guide →

Superior Inference Economics

Shift computational liability from your balance sheet to the user's subscription.

Workflow	Traditional	Relay	Savings
High-Context Analysis Large document processing	$1.20	$0.02	98.3%
Iterative Research Multi-step chained queries	$0.50	$0.01	98.0%
Multi-Step Audit Fact-checking & cross-referencing	$0.07	$0.005	92.9%
Standard Chat Single-turn responses	$0.04	$0.003	92.5%
Comparative Synthesis 50+ Sonnet calls per run	$1.20	$0.02	98.3%

Break-even: 2–3 active usersAnnual savings (100 users): $10,000+

Break-even: 2–3 active users
Annual savings (100 users): $10,000+

The SaaS Economics

AI margins are notoriously thin. By moving execution costs to the user's flat-rate subscription, your gross margin moves from ~20% to 95%+.

Benchmark: High-Context Document Analysis

DATASET: LEGAL_MEMO ≈ 15,000 WORDS

Metric	Direct API	inference-relay
Orchestration Cost	$0.0009	$0.0009
Execution Cost	$0.0856	$0.0000
Extraction Quality	Flat lists	Structured tables
Output Volume	3,721 chars	6,707 chars (+80%)

Gross Margin: ~15% → 98.9%

High-Context Superiority

Direct API calls often truncate or over-summarize large documents to manage compute. Because inference-relay utilizes the official Claude Code binary, it inherits native prompt caching and optimized context management. In our benchmarks, the relay produced 80% more detailed extractions with structured cross-references — not flat summaries.

IP Protection Verified

We stress-tested the Two-Envelope protocol by processing a document with six proprietary trade-secret terms embedded in the orchestration layer.

Proprietary terms in relay logs: 0 / 6
The logic stayed in the app. The cost moved to the user.

Pricing

Solo

$50/mo

All providers, fallback cascade, streaming, auto-patch, analytics. 3,000 calls/month.

Get Started

Pro

$100/mo

15,000 calls/month. Warm process pool, advanced routing DSL, tamper-evident audit trail.

Get Started

Enterprise

Custom

Fleet policy via MDM. Org-wide key management. SSO/SCIM. Dedicated support.

Get Started

The Post-April 4 Billing Boundary

Anthropic recently restricted third-party “harnesses” that utilized subscription tokens for direct API access. inference-relay maintains a stable, compliant architecture by utilizing official binary protocols rather than deprecated scraping methods.

Orchestration — App Key

The developer's API key funds instruction compilation for relay to the Native Subscription Gateway.

Execution — User Subscription

The user's subscription funds high-volume inference via the official claude CLI binary.

Revenue Alignment: Anthropic receives revenue from both the developer (API) and the end-user (Subscription). This architecture preserves platform integrity and utilizes official caching infrastructure.

Decouple intelligence from inference costs.

Get Started Read the Docs →

Run your agents (or your end users') on a Claude subscription, not API rates.

Your stream never dies.

Route the inference.Keep the margin.

Route the inference.
Keep the margin.