Enterprise Deployment — The CFO/CIO Manual

This document addresses the two questions every enterprise buyer asks: “What does this cost?” and “What risk does this introduce?” The answers are, respectively, “almost nothing” and “less than what you have today.”

1. The Shadow AI Problem

Your engineers are already using AI. The question is whether you know about it.

Today, employees across your organization use personal Claude and ChatGPT subscriptions to write code, draft documents, and analyze data. Each of those interactions can send corporate intellectual property through a personal account with no corporate visibility, no audit trail, and no policy enforcement.

The traditional fix is expensive: provision API keys per team, negotiate enterprise agreements, and absorb inference costs that scale linearly with headcount. For a team of 20 engineers using Claude Sonnet at moderate volume, this runs $50,000+ per year in raw API costs alone — before platform fees, key management overhead, or compliance review.

inference-relay solves this differently. Instead of replacing personal subscriptions with corporate API keys, it routes through the subscriptions your employees already pay for. The inference cost to your organization is $0.00, because the employee's existing subscription covers it. What the relay adds is visibility, governance, and audit.

2. Direct Subscription Utilization (DSU)

This is the architectural decision that changes your procurement conversation entirely.

inference-relay is a Local Binary Dependency: client-side software that runs on the developer's machine. It is not a cloud service. It does not receive, store, or process prompt or completion data. The data processing relationship remains exclusively between the user and their AI provider (Anthropic, OpenAI), governed by the provider's existing Data Processing Agreement.

Because inference-relay is a Local Binary Dependency and not a Data Processor, it falls outside the scope of traditional cloud-service DPA requirements.

What This Means for Procurement

Traditional requirement → With inference-relay
New vendor onboarding → npm install
New DPA negotiation → No DPA required
New data processor registration → Not a data processor
Security questionnaire (weeks) → Client-side software review (days)
API key provisioning per team → Zero API keys needed (auto-patch)
Per-token inference billing → Flat subscription, already paid

The procurement shortcut: inference-relay is client-side software, not a cloud service. It belongs in the same category as a linter, a formatter, or a build tool. It transforms how AI requests are routed. It never touches what those requests contain.

3. Compliance Positioning

SOC 2

inference-relay stores no customer data. There is no database, no object store, no log file containing user content. The relay transmits operational metadata (provider, model, token counts, cost) and enforces at the type level that content fields are structurally excluded. No customer data stored means no SOC 2 scope for the relay itself.

GDPR

The relay does not process personal data. It does not know who the user is beyond a license identifier. It does not track behavior, build profiles, or store any data that could identify a natural person. No personal data processed means no DPIA required.

HIPAA

The relay has zero exposure to Protected Health Information. Prompt and completion content — the only place PHI could appear — are structurally excluded from all relay data flows via TypeScript literal types (see Security Architecture). No PHI exposure means BAA is not applicable.

Summary

The library is a routing layer, not a data processor. Compliance obligations attach to entities that store, process, or transmit regulated data. inference-relay does none of these things — by design, by implementation, and by compiler-enforced guarantee.

4. Governance at the Edge — Fleet Policy

@inference-relay/pro enables centralized fleet management for organizations that need policy enforcement across multiple developers and machines.

Key Rotation

Available on Pro and Enterprise tiers. License keys can be rotated without downtime — the new key activates immediately, and the old key enters a grace window before revocation. Server-side storage retains only the last 4 characters of any key for identification purposes. Full keys are never stored on relay infrastructure.
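
The storage rule above can be illustrated with a minimal sketch. The record shape, the function names, and the example key strings are hypothetical and chosen for illustration only; they are not the relay's actual API. The sketch assumes only what the paragraph states: the server keeps the last 4 characters, the new key is active immediately, and the old key moves into a grace window.

```typescript
// Hypothetical record shape: what a server could retain about a license key.
type KeyStatus = "active" | "grace" | "revoked";

interface StoredKeyRecord {
  last4: string; // identification only; the full key is never kept
  status: KeyStatus;
}

// Reduce a full key to the only part that is persisted.
function toStoredRecord(fullKey: string, status: KeyStatus = "active"): StoredKeyRecord {
  if (fullKey.length < 4) {
    throw new Error("license key is too short to fingerprint");
  }
  return { last4: fullKey.slice(-4), status };
}

// Rotation: the new key is active at once, the old key enters a grace window.
function rotateKey(oldRecord: StoredKeyRecord, newFullKey: string): StoredKeyRecord[] {
  return [
    { ...oldRecord, status: "grace" },
    toStoredRecord(newFullKey),
  ];
}
```

Because only the 4-character suffix is retained in this sketch, a stored record is enough to tell an administrator which key an event refers to, but not enough to reconstruct the key.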

Activity Log

Operational events are recorded for administrative oversight:

Key rotation — New key issued, old key scheduled for revocation
Tier change — Subscription tier upgraded or downgraded
Cap warning — Usage cap exceeded (Solo or Pro tier)
Validation rejection — Unsigned or invalid authorization attempt blocked
Grace period entry — Payment failure detected, 7-day grace window started

Usage Caps

Solo — 3,000 calls/month, 5% soft buffer before enforcement
Pro — 15,000 calls/month, 5% soft buffer before enforcement
Enterprise — Custom call volume and provisioned seats, per-contract terms
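
As a worked example of the soft buffer, the sketch below assumes that "5% soft buffer before enforcement" means calls are still served up to 105% of the monthly cap, with a warning once the cap itself is passed. That reading, the state names, and the function name are assumptions made for illustration, not documented relay behavior.

```typescript
type CappedTier = "solo" | "pro";
type CapState = "ok" | "warning" | "enforced";

// Monthly call caps as listed above.
const MONTHLY_CAP: Record<CappedTier, number> = { solo: 3000, pro: 15000 };
const SOFT_BUFFER_PERCENT = 5;

// Integer arithmetic keeps the threshold exact (3,150 for Solo, 15,750 for Pro).
function enforcementThreshold(tier: CappedTier): number {
  const cap = MONTHLY_CAP[tier];
  return cap + Math.floor((cap * SOFT_BUFFER_PERCENT) / 100);
}

// Assumed semantics: warn above the cap, enforce above cap + buffer.
function capState(tier: CappedTier, callsThisMonth: number): CapState {
  if (callsThisMonth <= MONTHLY_CAP[tier]) return "ok";
  if (callsThisMonth <= enforcementThreshold(tier)) return "warning";
  return "enforced";
}
```

Under these assumptions a Solo user receives a warning from call 3,001 and is capped from call 3,151; the corresponding Pro figures are 15,001 and 15,751.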

Payment Grace Period

On payment failure, the relay enters a 7-day grace period during which full functionality is maintained. This prevents a billing hiccup from disrupting active development work. After 7 days without resolution, the key is auto-revoked and the relay falls back to free-tier behavior.
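
The grace-period rule reduces to a single time comparison. The sketch below is illustrative only: the function and type names are hypothetical, and it assumes the window is measured as exactly 7 × 24 hours from the recorded payment failure, with revocation taking effect at the 7-day mark.

```typescript
const GRACE_PERIOD_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

type LicenseMode = "full" | "free-tier";

// paymentFailedAt is null while billing is healthy.
function licenseMode(paymentFailedAt: Date | null, now: Date): LicenseMode {
  if (paymentFailedAt === null) return "full";
  const elapsedMs = now.getTime() - paymentFailedAt.getTime();
  // Inside the window: full functionality. At or past 7 days: free-tier fallback.
  return elapsedMs < GRACE_PERIOD_DAYS * MS_PER_DAY ? "full" : "free-tier";
}
```

For a failure recorded at midnight UTC on 1 January, this sketch keeps full functionality through the end of 7 January and falls back to free-tier behavior from midnight UTC on 8 January.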

5. Audit Trail

Every inference call generates an audit event. These events form a SHA-256 hash chain — each event includes the hash of the previous event, creating a tamper-evident sequence. Any modification to a historical event breaks the chain, making tampering detectable.
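
To show why a hash chain is tamper-evident, here is a minimal sketch using Node's built-in crypto module. The event shape, the all-zero genesis value, and the function names are assumptions for illustration; the relay's real event format and chaining details may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical chained event: serialized metadata plus two hashes.
interface ChainedEvent {
  payload: string;  // operational metadata only, already serialized
  prevHash: string; // hash of the previous event
  hash: string;     // SHA-256 over prevHash and payload
}

// Assumed starting value for the first event in a chain.
const GENESIS_HASH = "0".repeat(64);

function hashEvent(payload: string, prevHash: string): string {
  return createHash("sha256").update(prevHash).update("\n").update(payload).digest("hex");
}

// Returns a new chain with the event appended and linked to its predecessor.
function appendEvent(chain: ChainedEvent[], payload: string): ChainedEvent[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS_HASH;
  return [...chain, { payload, prevHash, hash: hashEvent(payload, prevHash) }];
}

// Recomputes every hash and checks every link; any edit breaks one of them.
function verifyChain(chain: ChainedEvent[]): boolean {
  let expectedPrev = GENESIS_HASH;
  for (const event of chain) {
    if (event.prevHash !== expectedPrev) return false;
    if (event.hash !== hashEvent(event.payload, event.prevHash)) return false;
    expectedPrev = event.hash;
  }
  return true;
}
```

Changing the payload of any historical event makes its recomputed hash differ from the stored one, and replacing the stored hash as well breaks the link held by the next event, so verification fails in either case.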

Audit Event Contents

Provider — Yes (which AI service was called)
Model — Yes (which model was invoked)
Input tokens — Yes (usage metering)
Output tokens — Yes (usage metering)
Estimated cost — Yes (cost attribution)
Duration — Yes (performance monitoring)
Prompt content — No (literal false, structurally excluded)
Completion content — No (literal false, structurally excluded)

Output Formats

Audit events can be consumed via Asynchronous Stream Decoding output for integration with existing log aggregation pipelines (Splunk, Datadog, ELK), or through a custom handler function for bespoke processing.

Content Exclusion Guarantee

The promptContent: false and completionContent: false fields are not configuration options. They are TypeScript literal types, so assigning any value other than false to these fields causes a compilation failure. The check runs on every build, and because these are types rather than settings, there is no runtime configuration option that re-enables content capture. See Security Architecture for the full technical explanation.
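
A minimal sketch of how literal types produce this guarantee is shown below. The interface name and the metadata field names are assumptions based on the Audit Event Contents list; only promptContent and completionContent are taken from the text above.

```typescript
// Hypothetical audit event shape. The last two fields use the literal type
// `false`, so `false` is the only value the compiler accepts for them.
interface AuditEvent {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  estimatedCostUsd: number;
  durationMs: number;
  promptContent: false;
  completionContent: false;
}

const auditEvent: AuditEvent = {
  provider: "anthropic",
  model: "claude-sonnet",
  inputTokens: 1200,
  outputTokens: 350,
  estimatedCostUsd: 0.0089,
  durationMs: 2140,
  promptContent: false,
  completionContent: false,
};

// The compiler rejects the following line if it is uncommented, because
// a string is not assignable to the literal type `false`:
// const leaky: AuditEvent = { ...auditEvent, promptContent: "user text" };
```

The design choice here is that the exclusion lives in the type definition rather than in a flag, so a reviewer can confirm it by reading one interface instead of tracing configuration paths.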

6. MCP Server — IDE Integration

inference-relay ships with a Model Context Protocol (MCP) server that exposes 19 tools across 5 categories, enabling developers to query relay status and manage operations directly from their IDE.

Tool Categories

Financial Intelligence — Query real-time cost data, token usage breakdowns, cost-per-model analysis, budget burn rate
Operational Health — Provider status, fallback frequency, latency percentiles, error rates
Security & Compliance — Verification state, audit chain integrity, credential store status
Logic Management — Configuration state, active provider routing, model availability
Fleet Management — License key status, usage against caps, fleet activity (Enterprise)

Supported Clients

The MCP server works with any MCP-compatible client:

Claude Desktop — Native integration
Cursor — IDE-embedded AI with relay visibility
VS Code + MCP extension — Standard editor integration
Any MCP client — Protocol-compliant tooling

Usage

Developers interact through natural language. Instead of navigating a dashboard, they ask their AI assistant: “What's my inference spend this week?” or “Is the Anthropic provider healthy?” The MCP server translates these into precise queries and returns structured responses.

7. Cost Model

Pricing Tiers

Solo — $50/mo, 3,000 calls. Auto-patch, audit trail, MCP server.
Pro — $100/mo, 15,000 calls. Warm process pool, advanced routing DSL, key rotation, priority support.
Enterprise — Custom. Multi-developer provisioning, fleet policy, org management, dedicated onboarding.

Cost Comparison by Usage Profile

The right tier depends on monthly call volume. For typical Claude Sonnet workloads at code-context token sizes (~$0.10–$0.20 per call), here's how the math breaks down:

Light use — ~1,500 calls/month

Occasional Claude Code queries, light scripting, personal automation.

Direct Anthropic API: ~$200/mo (~$2,400/yr)
Claude Max + Solo Relay: $150/mo flat ($1,800/yr) — saves ~$50/mo (~$600/yr)

Active use — ~5,000 calls/month

Claude Code as primary IDE assistant plus background automation.

Direct Anthropic API: ~$800/mo (~$9,600/yr)
Claude Max + Pro Relay: $200/mo flat ($2,400/yr) — saves ~$600/mo (~$7,200/yr)

Power use — ~12,000 calls/month

Heavy iteration, agentic loops, document analysis at scale.

Direct Anthropic API: ~$1,800/mo (~$21,600/yr)
Claude Max + Pro Relay: $200/mo flat ($2,400/yr) — saves ~$1,600/mo (~$19,200/yr)

Team use — multiple developers

Org-wide deployment with fleet policy, multi-developer provisioning, audit trail, and SSO. Talk to us about Enterprise.

The pattern is the same at every tier: relay cost is fixed, API cost scales linearly. Heavy users save the most.
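
The arithmetic behind the three profiles can be checked with a short sketch. The monthly API figures and relay tier prices come from the profiles above; the $100/mo Claude Max price is inferred from the stated flat totals ($150 with Solo, $200 with Pro) rather than quoted directly, and the type and function names are illustrative.

```typescript
// Inferred from the totals above, not stated directly in this document.
const CLAUDE_MAX_MONTHLY = 100;

interface UsageProfile {
  name: string;
  directApiMonthly: number; // approximate direct API spend, USD
  relayTierMonthly: number; // Solo = 50, Pro = 100, USD
}

const profiles: UsageProfile[] = [
  { name: "Light", directApiMonthly: 200, relayTierMonthly: 50 },
  { name: "Active", directApiMonthly: 800, relayTierMonthly: 100 },
  { name: "Power", directApiMonthly: 1800, relayTierMonthly: 100 },
];

interface Savings {
  flatMonthly: number;
  savedMonthly: number;
  savedYearly: number;
}

// Flat cost is subscription plus relay tier; savings is the gap to direct API spend.
function savings(p: UsageProfile): Savings {
  const flatMonthly = CLAUDE_MAX_MONTHLY + p.relayTierMonthly;
  const savedMonthly = p.directApiMonthly - flatMonthly;
  return { flatMonthly, savedMonthly, savedYearly: savedMonthly * 12 };
}

for (const p of profiles) {
  const s = savings(p);
  console.log(`${p.name}: $${s.flatMonthly}/mo flat, saves ~$${s.savedMonthly}/mo (~$${s.savedYearly}/yr)`);
}
```

The flat side of the comparison changes only when the tier changes, which is why the saving grows from roughly $50/mo at light use to roughly $1,600/mo at power use.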

What You Pay For

Logic Synchronization (configuration delivery and signature verification)
Audit infrastructure (hash chain, telemetry pipeline)
Fleet management (Pro/Enterprise)
MCP server and tooling
Priority support and updates

What you do not pay for: inference. The AI provider bills the user's subscription directly. The relay is not in the billing path.

8. Security Audit Access

For enterprise prospects requiring source-level security review before procurement:

NDA Workflow

Contact enterprise@inference-relay.com with your organization name and security team contact
Mutual NDA executed (standard form or your template)
48-hour read-only repository access granted to your designated security reviewers
Findings discussion scheduled with inference-relay security team

What Reviewers Will Find

TypeScript source with full type definitions (including the literal false content types)
CI pipeline configuration including Binary String Entropy Scan
Credential isolation implementation per platform
RS256 signature verification logic
Audit hash chain implementation
No obfuscation, no compiled-only modules, no hidden network calls

Standing Offer

This audit access is a standing offer, not a special accommodation. We believe the security architecture speaks for itself and encourage rigorous review. Every enterprise customer to date has completed their security review within the 48-hour window.