Decouple intelligence from inference costs.

Route expensive AI inference to your users' (or your own) existing Claude subscriptions. Your app maintains the orchestration logic to protect your proprietary software and IP. Their subscription handles the compute.

import 'inference-relay/auto';

// Done. Relay is active.
Get Started · Documentation →
[Figure: Dual-Domain Architecture (patent pending). Orchestration Domain → Inference Relay (hardware secure enclave, volatile state, state-pruning active) → Content Envelope (zero interception) and Logic Envelope (RS256-verified) → Native Gateway → Protocol Authority → Audit Chain.]
The Billing Boundary

On April 4, 2026, Anthropic restricted legacy access methods that bypassed official platform integrity. inference-relay utilizes a Dual-Domain Architecture, establishing a compliant billing boundary that isolates proprietary logic from high-volume execution.

Orchestration Domain (App Key)
The application layer manages Intent Resolution and Schema Synthesis. This involves lightweight heuristic calls to map user requests into structured instructions. Your proprietary methodologies and system prompts are isolated here, never entering the relay.
Execution Domain (User Subscription)
The user's subscription provides the Computational Throughput for High-Density Execution. Heavy analytical tasks and large-context processing are offloaded to the native gateway. Data remains within the user's authorized account boundary.
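The split between the two domains can be sketched as a routing decision. This is an illustrative sketch only, assuming a simple token-count heuristic; none of these names come from the inference-relay API.

```typescript
// Hypothetical sketch of the dual-domain split. The domain names mirror
// the architecture described above; the function and field names are
// illustrative, not the actual inference-relay API.

type Domain = "orchestration" | "execution";

interface RoutedCall {
  domain: Domain;
  billing: "app-api-key" | "user-subscription";
  prompt: string;
}

// Lightweight heuristic: short triage/classification prompts stay in the
// orchestration domain (funded by the app's API key); large-context work
// goes to the execution domain (the user's subscription, via the gateway).
function route(prompt: string, estimatedTokens: number): RoutedCall {
  const heavy = estimatedTokens > 4_000; // assumed threshold
  return {
    domain: heavy ? "execution" : "orchestration",
    billing: heavy ? "user-subscription" : "app-api-key",
    prompt,
  };
}

const triage = route("Classify this request: refund or support?", 120);
const analysis = route("Analyze this 15,000-word legal memo…", 22_000);
console.log(triage.billing);   // "app-api-key"
console.log(analysis.billing); // "user-subscription"
```

The proprietary part (how prompts are classified and compiled) never leaves the orchestration side; only the compiled instruction crosses the boundary.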
Providers

Provider        Auth           Platform
Claude CLI      Subscription   Desktop
Anthropic API   API Key        Any
OpenAI          API Key        Any
Ollama          None           Desktop
Enterprise: Safe by Design

Data stays within your company's existing, approved Claude subscription. No new vendor approvals. No data processing reviews. No security overhead. Your app becomes compliant by default — skip the 6-month procurement cycle.

We never see your prompts. The library is a dumb pipe: metadata only (promptContent: false).
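What "metadata only" means in practice can be sketched as a record type. The field names here are assumptions for illustration, not the library's actual telemetry schema; the point is what is absent: no prompt text, no completion text.

```typescript
// Illustrative shape of a metadata-only telemetry record. Field names are
// assumptions; what matters is that no content field can exist at all.
interface RelayTelemetry {
  provider: string;     // e.g. "claude-cli"
  durationMs: number;   // wall-clock latency
  inputTokens: number;  // token counts only, never content
  outputTokens: number;
  promptContent: false; // literal type: content can never be attached
}

const sample: RelayTelemetry = {
  provider: "claude-cli",
  durationMs: 1840,
  inputTokens: 21_500,
  outputTokens: 1_200,
  promptContent: false,
};
console.log(sample.promptContent); // false
```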

Self-Hosted Automation

Stop paying steep monthly API bills for private research engines or personal agentic pipelines. inference-relay routes heavy workloads to an existing flat-rate subscription. This enables enterprise-grade performance for the cost of a single seat.

Internal tools at zero marginal cost.
Claude Max ($100/mo) + inference-relay ($50/mo) = unlimited private automation for $150/mo
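The flat-rate arithmetic above can be made concrete. The $0.0865 per-call cost comes from the benchmark later in this document (orchestration plus execution cost of one high-context analysis); the 5,000-call monthly workload is an assumed volume.

```typescript
// Flat-rate vs. metered cost for a private automation pipeline.
// $0.0865/call is the document's benchmark cost for one high-context
// analysis; callsPerMonth is an assumed workload.
const callsPerMonth = 5_000;
const apiCostPerCall = 0.0865;
const directApiMonthly = callsPerMonth * apiCostPerCall;
const flatRateMonthly = 100 + 50; // Claude Max + inference-relay

console.log(directApiMonthly.toFixed(2)); // "432.50"
console.log(flatRateMonthly);             // 150
```

At this volume the metered bill is roughly three times the flat rate, and the gap widens with every additional call.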
Your IDE Is the Dashboard

inference-relay ships an MCP server that turns your IDE into a live operational console. Query costs, monitor provider health, and manage your fleet — without leaving your editor.

Claude Desktop — MCP Tools
Financial Intelligence
Per-provider cost breakdown, projected burn rate, real-time savings tracking
Operational Health
Duration benchmarks (p50/p95/p99), fallback monitoring, provider availability
Security & Compliance
Audit trail, JWS handshake validation, telemetry leak scanning
Fleet Management
Multi-key status, automated rotation, activity log with type filtering
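The "JWS handshake validation" item can be illustrated with Node's built-in crypto. This is a generic RS256 compact-JWS sign/verify roundtrip showing what an RS256-verified handshake involves mechanically; the payload and key handling here are illustrative, not inference-relay's actual handshake.

```typescript
import { generateKeyPairSync, createSign, createVerify } from "node:crypto";

// Generic RS256 compact-JWS roundtrip (header.payload.signature, all
// base64url). Illustrative only; not the library's handshake payload.
const b64url = (s: string) => Buffer.from(s).toString("base64url");

const { publicKey, privateKey } = generateKeyPairSync("rsa", {
  modulusLength: 2048,
});

const header = b64url(JSON.stringify({ alg: "RS256", typ: "JWT" }));
const payload = b64url(JSON.stringify({ sub: "relay-handshake", iat: 0 }));
const signingInput = `${header}.${payload}`;

const signature = createSign("RSA-SHA256")
  .update(signingInput)
  .sign(privateKey, "base64url");
const jws = `${signingInput}.${signature}`;

// Verification: recompute over header.payload and check the signature
// against the sender's public key.
const [h, p, s] = jws.split(".");
const ok = createVerify("RSA-SHA256")
  .update(`${h}.${p}`)
  .verify(publicKey, s, "base64url");
console.log(ok); // true
```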
Add to Claude Desktop / Cursor
{
  "mcpServers": {
    "inference-relay": {
      "command": "npx",
      "args": ["@inference-relay/mcp"],
      "env": {
        "IR_LICENSE_KEY": "ir_live_xxxx"
      }
    }
  }
}
Financial Orchestration
Workflow                                      Traditional   Relay    Savings
High-Context Analysis (large documents)       $1.20         $0.02    98.3%
Iterative Research (multi-step chains)        $0.50         $0.01    98.0%
Multi-Step Audit (fact-checking)              $0.07         $0.005   92.9%
Standard Chat (single-turn responses)         $0.04         $0.003   92.5%
Comparative Synthesis (50+ Sonnet calls/run)  $1.20         $0.02    98.3%
Break-even: 2–3 active users
Annual savings (100 users): $10,000+
The SaaS Economics

AI margins are notoriously thin. By moving execution costs to the user's flat-rate subscription, your gross margin moves from ~20% to 95%+.

Benchmark: High-Context Document Analysis
DATASET: LEGAL_MEMO ≈ 15,000 WORDS
Metric               Direct API    inference-relay
Orchestration Cost   $0.0009       $0.0009
Execution Cost       $0.0856       $0.0000
Extraction Quality   Flat lists    Structured tables
Output Volume        3,721 chars   6,707 chars (+80%)
Gross Margin         ~15%          98.9%
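The margin shift follows directly from the benchmark's two cost columns. The $0.10 per-analysis sale price below is our assumption for illustration (the benchmark does not state a price), so the resulting percentages are ballpark figures, not the table's exact ones.

```typescript
// Gross-margin arithmetic from the benchmark's cost columns. Only the
// cost figures come from the benchmark; the $0.10 price is assumed.
const price = 0.10;
const directCost = 0.0009 + 0.0856; // orchestration + execution, direct API
const relayCost = 0.0009 + 0.0;     // execution moves to user subscription

const margin = (cost: number) => (price - cost) / price;
console.log((margin(directCost) * 100).toFixed(1) + "%"); // "13.5%"
console.log((margin(relayCost) * 100).toFixed(1) + "%");  // "99.1%"
```

The only cost left on the developer's bill is the sub-cent orchestration call.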
High-Context Superiority

Direct API calls often truncate or over-summarize large documents to manage compute. Because inference-relay utilizes the official Claude Code binary, it inherits native prompt caching and optimized context management. In our benchmarks, the relay produced 80% more detailed extractions with structured cross-references — not flat summaries.

IP Protection Verified

We stress-tested the Two-Envelope protocol by processing a document with six proprietary trade-secret terms embedded in the orchestration layer.

Proprietary terms in relay logs: 0 / 6
The logic stayed in the app. The cost moved to the user.
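The zero-leak check described above amounts to scanning relay-side logs for terms that only the orchestration layer should ever see. This is a sketch of that check under placeholder terms and a placeholder log line; it is not the actual stress-test harness.

```typescript
// Sketch of the leak check: count how many orchestration-side secret
// terms appear anywhere in the relay's logs. Terms and log content are
// placeholders; the real test embedded six trade-secret terms.
function countLeaks(relayLog: string, secretTerms: string[]): number {
  return secretTerms.filter((t) => relayLog.includes(t)).length;
}

const secretTerms = ["TERM_A", "TERM_B", "TERM_C"]; // placeholders
const relayLog = "provider=claude-cli tokens_in=21500 tokens_out=1200";
console.log(`${countLeaks(relayLog, secretTerms)} / ${secretTerms.length}`); // "0 / 3"
```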
Pricing
Solo
$50/mo
1 developer seat. All providers, fallback cascade, streaming, auto-patch, analytics. 1,000 calls/month.
Pro
$100/mo
5 developer seats. 5,000 calls/month. Warm process pool, advanced routing DSL, tamper-evident audit trail.
Enterprise
Custom
Fleet policy via MDM. Org-wide key management. SSO/SCIM. SLA. Dedicated support.
The Post-April 4 Billing Boundary

Anthropic recently restricted third-party “harnesses” that utilized subscription tokens for direct API access. inference-relay maintains a stable, compliant architecture by utilizing official binary protocols rather than deprecated scraping methods.

Orchestration — App Key
The developer's API key funds triage, classification, and instruction compilation.
Execution — User Subscription
The user's subscription funds high-volume inference via the official claude CLI binary.
Revenue Alignment: Anthropic receives revenue from both the developer (API) and the end-user (Subscription). This architecture preserves platform integrity and utilizes official caching infrastructure.
Decouple intelligence from inference costs.
Get Started · Read the Docs →