Enterprise Deployment — The CFO/CIO Manual
This document addresses the two questions every enterprise buyer asks: “What does this cost?” and “What risk does this introduce?” The answers are, respectively, “almost nothing” and “less than what you have today.”
1. The Shadow AI Problem
Your engineers are already using AI. The question is whether you know about it.
Today, employees across your organization use personal Claude and ChatGPT subscriptions to write code, draft documents, and analyze data. Every one of those interactions sends corporate intellectual property through a personal account with no corporate visibility, no audit trail, and no policy enforcement.
The traditional fix is expensive: provision API keys per team, negotiate enterprise agreements, and absorb inference costs that scale linearly with headcount. For a team of 20 engineers using Claude Sonnet at moderate volume, this runs $50,000+ per year in raw API costs alone — before platform fees, key management overhead, or compliance review.
inference-relay solves this differently. Instead of replacing personal subscriptions with corporate API keys, it routes through the subscriptions your employees already pay for. The inference cost to your organization is $0.00, because the employee's existing subscription covers it. What the relay adds is visibility, governance, and audit.
2. Direct Subscription Utilization (DSU)
This is the architectural decision that changes your procurement conversation entirely.
inference-relay is a Local Binary Dependency: client-side software that runs on the developer's machine. It is not a cloud service. It does not receive, store, or process prompt or completion data. The data processing relationship remains exclusively between the user and their AI provider (Anthropic, OpenAI), governed by the provider's existing Data Processing Agreement.
Because inference-relay is a Local Binary Dependency and not a Data Processor, it falls outside the scope of traditional cloud-service DPA requirements.
What This Means for Procurement
- New vendor onboarding → npm install
- New DPA negotiation → No DPA required
- New data processor registration → Not a data processor
- Security questionnaire (weeks) → Client-side software review (days)
- API key provisioning per team → Zero API keys needed (auto-patch)
- Per-token inference billing → Flat subscription, already paid
The procurement shortcut: inference-relay is client-side software, not a cloud service. It belongs in the same category as a linter, a formatter, or a build tool. It transforms how AI requests are routed. It never touches what those requests contain.
3. Compliance Positioning
SOC 2
inference-relay stores no customer data. There is no database, no object store, no log file containing user content. The relay transmits operational metadata (provider, model, token counts, cost) and enforces at the type level that content fields are structurally excluded. No customer data stored means no SOC 2 scope for the relay itself.
GDPR
The relay does not process personal data. It does not know who the user is beyond a license identifier. It does not track behavior, build profiles, or store any data that could identify a natural person. No personal data processed means no DPIA required.
HIPAA
The relay has zero exposure to Protected Health Information. Prompt and completion content (the only place PHI could appear) are structurally excluded from all relay data flows via TypeScript literal types (see Security Architecture). No PHI exposure means a BAA is not applicable.
Summary
The library is a routing layer, not a data processor. Compliance obligations attach to entities that store, process, or transmit regulated data. inference-relay does none of these things — by design, by implementation, and by compiler-enforced guarantee.
4. Governance at the Edge — Fleet Policy
@inference-relay/pro enables centralized fleet management for organizations that need policy enforcement across multiple developers and machines.
Key Rotation
Available on Pro and Enterprise tiers. License keys can be rotated without downtime — the new key activates immediately, and the old key enters a grace window before revocation. Server-side storage retains only the last 4 characters of any key for identification purposes. Full keys are never stored on relay infrastructure.
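As a rough illustration of the storage rule above, the sketch below keeps only the last 4 characters of each key when a rotation is recorded. The record shape, the function names (`toStoredRecord`, `rotate`), and the example key strings are hypothetical and are not taken from the relay's source.

```typescript
// Hypothetical server-side record: only a 4-character hint, never the full key.
type KeyStatus = "active" | "grace";

interface StoredKeyRecord {
  last4: string;     // last 4 characters, used only for identification
  status: KeyStatus; // "grace" = old key awaiting revocation
}

// Reduce a full key to the only part that is retained.
function toStoredRecord(fullKey: string, status: KeyStatus): StoredKeyRecord {
  return { last4: fullKey.slice(-4), status };
}

// Rotation: the new key is active immediately, the old key enters its grace window.
function rotate(oldKey: string, newKey: string): StoredKeyRecord[] {
  return [toStoredRecord(newKey, "active"), toStoredRecord(oldKey, "grace")];
}

// Example key strings are made up for this sketch.
const records = rotate("ir_example_old_key_1234", "ir_example_new_key_9876");
console.log(records);
```

The full key strings exist only as function arguments here; nothing derived from them beyond the 4-character hint reaches the returned records.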
Activity Log
Operational events are recorded for administrative oversight:
- Key rotation — New key issued, old key scheduled for revocation
- Tier change — Subscription tier upgraded or downgraded
- Cap warning — Usage cap exceeded (Solo or Pro tier)
- Validation rejection — Unsigned or invalid authorization attempt blocked
- Grace period entry — Payment failure detected, 7-day grace window started
Usage Caps
- Solo — 3,000 calls/month, 5% soft buffer before enforcement
- Pro — 15,000 calls/month, 5% soft buffer before enforcement
- Enterprise — Custom call volume and provisioned seats, per-contract terms
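One way to read the 5% soft buffer is sketched below: calls above the cap trigger a warning, and enforcement starts only once usage passes cap plus 5%. This interpretation, along with the names `capState` and `MONTHLY_CAP`, is an assumption for illustration; Enterprise is omitted because its limits are per-contract.

```typescript
type Tier = "solo" | "pro";
type CapState = "ok" | "warning" | "enforced";

// Monthly call caps from the tier list above.
const MONTHLY_CAP: Record<Tier, number> = { solo: 3000, pro: 15000 };
const SOFT_BUFFER_PERCENT = 5;

// Assumed semantics: warn once the cap is exceeded, enforce past cap + 5%.
function capState(tier: Tier, callsThisMonth: number): CapState {
  const cap = MONTHLY_CAP[tier];
  // Integer arithmetic avoids floating-point drift at the boundary.
  const hardLimit = cap + Math.floor((cap * SOFT_BUFFER_PERCENT) / 100);
  if (callsThisMonth <= cap) return "ok";
  if (callsThisMonth <= hardLimit) return "warning";
  return "enforced";
}

console.log(capState("solo", 3000)); // "ok"
console.log(capState("solo", 3150)); // "warning" (inside the 150-call buffer)
console.log(capState("solo", 3151)); // "enforced"
```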
Payment Grace Period
On payment failure, the relay enters a 7-day grace period during which full functionality is maintained. This prevents a billing hiccup from disrupting active development work. After 7 days without resolution, the key is auto-revoked and the relay falls back to free-tier behavior.
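The grace-period rule can be expressed as a small state function. This is a minimal sketch, assuming the window is measured from the moment the payment failure is detected and that revocation applies at exactly 7 days; `licenseState` and its parameters are hypothetical names.

```typescript
type LicenseState = "active" | "grace" | "revoked";

const GRACE_MS = 7 * 24 * 60 * 60 * 1000; // 7-day grace window

// paymentFailedAt is null while billing is healthy (assumed representation).
function licenseState(paymentFailedAt: Date | null, now: Date): LicenseState {
  if (paymentFailedAt === null) return "active";
  const elapsed = now.getTime() - paymentFailedAt.getTime();
  // Full functionality during the window; afterwards the key is revoked
  // and the relay would fall back to free-tier behavior.
  return elapsed < GRACE_MS ? "grace" : "revoked";
}

const failedAt = new Date("2024-01-01T00:00:00Z");
console.log(licenseState(failedAt, new Date("2024-01-07T23:59:59Z"))); // "grace"
console.log(licenseState(failedAt, new Date("2024-01-08T00:00:00Z"))); // "revoked"
```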
5. Audit Trail
Every inference call generates an audit event. These events form a SHA-256 hash chain — each event includes the hash of the previous event, creating a tamper-evident sequence. Any modification to a historical event breaks the chain, making tampering detectable.
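The hash-chain idea can be sketched in a few lines using Node's built-in `crypto` module. The event shape, the all-zero genesis value, and the way the previous hash and payload are concatenated are assumptions for illustration, not the relay's actual format.

```typescript
import { createHash } from "node:crypto";

interface ChainedEvent {
  payload: string;  // serialized operational metadata (assumed to be a JSON string)
  prevHash: string; // hash of the previous event, or GENESIS for the first one
  hash: string;     // SHA-256 over prevHash + "\n" + payload
}

const GENESIS = "0".repeat(64); // assumed placeholder for "no previous event"

function sha256Hex(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// Append an event whose hash commits to the previous event's hash.
function appendEvent(chain: ChainedEvent[], payload: string): ChainedEvent[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  return [...chain, { payload, prevHash, hash: sha256Hex(`${prevHash}\n${payload}`) }];
}

// Returns the index of the first event that fails verification, or -1 if intact.
function firstBrokenIndex(chain: ChainedEvent[]): number {
  let expectedPrev = GENESIS;
  return chain.findIndex((e) => {
    const ok = e.prevHash === expectedPrev && e.hash === sha256Hex(`${e.prevHash}\n${e.payload}`);
    expectedPrev = e.hash;
    return !ok;
  });
}

const payloads = ['{"model":"a","inputTokens":10}', '{"model":"a","inputTokens":20}', '{"model":"b","inputTokens":30}'];
const chain = payloads.reduce<ChainedEvent[]>((acc, p) => appendEvent(acc, p), []);

// Simulate tampering with the second event's payload.
const tampered = chain.map((e, i) => (i === 1 ? { ...e, payload: '{"model":"a","inputTokens":999}' } : e));

console.log(firstBrokenIndex(chain));    // -1
console.log(firstBrokenIndex(tampered)); // 1
```

Editing a historical payload changes what its hash should be, so verification fails at that event. Recomputing the hash to hide the edit would instead break the link stored in the next event.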
Audit Event Contents
- Provider — Yes (which AI service was called)
- Model — Yes (which model was invoked)
- Input tokens — Yes (usage metering)
- Output tokens — Yes (usage metering)
- Estimated cost — Yes (cost attribution)
- Duration — Yes (performance monitoring)
- Prompt content — No (literal false, structurally excluded)
- Completion content — No (literal false, structurally excluded)
Output Formats
Audit events can be consumed via Asynchronous Stream Decoding output for integration with existing log aggregation pipelines (Splunk, Datadog, ELK), or through a custom handler function for bespoke processing.
Content Exclusion Guarantee
The promptContent: false and completionContent: false fields are not configuration options — they are TypeScript literal types. Assigning any value other than false to these fields causes a compilation failure. This guarantee is verified on every build and cannot be overridden at runtime. See Security Architecture for the full technical explanation.
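To make the literal-type mechanism concrete, here is a minimal sketch of what such an event type could look like. Only `promptContent` and `completionContent` come from this document; the other field names, the `makeAuditEvent` helper, and the sample values are assumptions for illustration.

```typescript
// Hypothetical audit event shape mirroring the fields listed under
// "Audit Event Contents". The two content fields use the literal type `false`.
interface AuditEvent {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  estimatedCostUsd: number;
  durationMs: number;
  promptContent: false;     // literal type: only the value `false` type-checks
  completionContent: false; // literal type: only the value `false` type-checks
}

// Callers supply metadata only; the content fields are fixed by the helper.
function makeAuditEvent(
  fields: Omit<AuditEvent, "promptContent" | "completionContent">
): AuditEvent {
  return { ...fields, promptContent: false, completionContent: false };
}

const event = makeAuditEvent({
  provider: "anthropic",
  model: "example-model",
  inputTokens: 1200,
  outputTokens: 350,
  estimatedCostUsd: 0.012,
  durationMs: 840,
});
console.log(event.promptContent, event.completionContent);

// Uncommenting the next line makes the TypeScript compiler reject the file,
// because type 'string' is not assignable to type 'false':
// const leaky: AuditEvent = { ...event, promptContent: "secret prompt" };
```

The check shown here happens when the compiler runs, which is why it can be verified on every build.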
6. MCP Server — IDE Integration
inference-relay ships with a Model Context Protocol (MCP) server that exposes 19 tools across 5 categories, enabling developers to query relay status and manage operations directly from their IDE.
Tool Categories
- Financial Intelligence — Query real-time cost data, token usage breakdowns, cost-per-model analysis, budget burn rate
- Operational Health — Provider status, fallback frequency, latency percentiles, error rates
- Security & Compliance — Verification state, audit chain integrity, credential store status
- Logic Management — Configuration state, active provider routing, model availability
- Fleet Management — License key status, usage against caps, fleet activity (Enterprise)
Supported Clients
The MCP server works with any MCP-compatible client:
- Claude Desktop — Native integration
- Cursor — IDE-embedded AI with relay visibility
- VS Code + MCP extension — Standard editor integration
- Any MCP client — Protocol-compliant tooling
Usage
Developers interact through natural language. Instead of navigating a dashboard, they ask their AI assistant: “What's my inference spend this week?” or “Is the Anthropic provider healthy?” The MCP server translates these into precise queries and returns structured responses.
7. Cost Model
Pricing Tiers
- Solo — $50/mo, 3,000 calls. Auto-patch, audit trail, MCP server.
- Pro — $100/mo, 15,000 calls. Warm process pool, advanced routing DSL, key rotation, priority support.
- Enterprise — Custom. Multi-developer provisioning, fleet policy, org management, dedicated onboarding.
Cost Comparison by Usage Profile
The right tier depends on monthly call volume. For typical Claude Sonnet workloads at code-context token sizes (~$0.10–$0.20 per call), here's how the math breaks down:
Light use — ~1,500 calls/month
Occasional Claude Code queries, light scripting, personal automation.
- Direct Anthropic API: ~$200/mo (~$2,400/yr)
- Claude Max + Solo Relay: $150/mo flat ($1,800/yr) — saves ~$50/mo (~$600/yr)
Active use — ~5,000 calls/month
Claude Code as primary IDE assistant plus background automation.
- Direct Anthropic API: ~$800/mo (~$9,600/yr)
- Claude Max + Pro Relay: $200/mo flat ($2,400/yr) — saves ~$600/mo (~$7,200/yr)
Power use — ~12,000 calls/month
Heavy iteration, agentic loops, document analysis at scale.
- Direct Anthropic API: ~$1,800/mo (~$21,600/yr)
- Claude Max + Pro Relay: $200/mo flat ($2,400/yr) — saves ~$1,600/mo (~$19,200/yr)
Team use — multiple developers
Org-wide deployment with fleet policy, multi-developer provisioning, audit trail, and SSO. Talk to us about Enterprise.
The pattern is the same at every tier: relay cost is fixed, API cost scales linearly. Heavy users save the most.
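The arithmetic behind these comparisons is simple enough to check directly. The sketch below reuses the monthly figures quoted above and the ~$0.10–$0.20 per-call range; the type and function names are hypothetical.

```typescript
interface UsageProfile {
  name: string;
  callsPerMonth: number;
  apiMonthlyUsd: number;  // approximate direct API cost quoted above
  flatMonthlyUsd: number; // Claude Max plus relay tier, as quoted above
}

const PROFILES: UsageProfile[] = [
  { name: "Light", callsPerMonth: 1500, apiMonthlyUsd: 200, flatMonthlyUsd: 150 },
  { name: "Active", callsPerMonth: 5000, apiMonthlyUsd: 800, flatMonthlyUsd: 200 },
  { name: "Power", callsPerMonth: 12000, apiMonthlyUsd: 1800, flatMonthlyUsd: 200 },
];

// API cost range at 10 to 20 cents per call, computed in cents to stay exact.
function apiCostRangeUsd(calls: number): [number, number] {
  return [(calls * 10) / 100, (calls * 20) / 100];
}

function savings(p: UsageProfile): { monthlyUsd: number; yearlyUsd: number } {
  const monthlyUsd = p.apiMonthlyUsd - p.flatMonthlyUsd;
  return { monthlyUsd, yearlyUsd: monthlyUsd * 12 };
}

for (const p of PROFILES) {
  const [low, high] = apiCostRangeUsd(p.callsPerMonth);
  const s = savings(p);
  console.log(`${p.name}: API range $${low}-$${high}/mo, saves $${s.monthlyUsd}/mo ($${s.yearlyUsd}/yr)`);
}
```

Each quoted API figure falls inside the range implied by the per-call estimate: for example, 5,000 calls cost between $500 and $1,000 per month, and the document uses ~$800.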
What You Pay For
- Logic Synchronization (configuration delivery and signature verification)
- Audit infrastructure (hash chain, telemetry pipeline)
- Fleet management (Pro/Enterprise)
- MCP server and tooling
- Priority support and updates
What you do not pay for: inference. The AI provider bills the user's subscription directly. The relay is not in the billing path.
8. Security Audit Access
For enterprise prospects requiring source-level security review before procurement:
NDA Workflow
- Contact enterprise@inference-relay.com with your organization name and security team contact
- Mutual NDA executed (standard form or your template)
- 48-hour read-only repository access granted to your designated security reviewers
- Findings discussion scheduled with inference-relay security team
What Reviewers Will Find
- TypeScript source with full type definitions (including the literal false content types)
- CI pipeline configuration including Binary String Entropy Scan
- Credential isolation implementation per platform
- RS256 signature verification logic
- Audit hash chain implementation
- No obfuscation, no compiled-only modules, no hidden network calls
Standing Offer
This audit access is a standing offer, not a special accommodation. We believe the security architecture speaks for itself and encourage rigorous review. Every enterprise customer to date has completed their security review within the 48-hour window.