Technical Whitepaper — v1.1 (Daemon)

Classification

This document describes the technical architecture of inference-relay v1.1 — the standalone Rust daemon — for enterprise security review, integration evaluation, and audit. It supersedes the v1.0 (npm library) whitepaper for customers deploying the daemon distribution. The v1.0 npm library remains supported; see the v1.0 whitepaper for that distribution's architecture.

Prepared: May 17, 2026
Author: inference-relay engineering
Scope: Daemon runtime, license trust chain, Session Pool, sticky-session model, update chain, network surface, threat model
Version: 1.1.11
PATENT PENDING
Abstract

inference-relay v1.1 is a single ~12 MB statically-linked Rust binary that exposes Anthropic-shape HTTP at 127.0.0.1:7421 and routes every request through a real interactive Claude Code session. Same load-bearing posture as a database connection pool — local routing, zero content through the relay, the user's own subscription paying for the inference.

Most consequential properties
  • Memory-safe by language guarantee. Rust runtime, zero unsafe blocks in the request path. An entire vulnerability class is eliminated at the compiler level.
  • Loopback-only network surface. The daemon refuses non-loopback bind at the socket layer. Nothing on the LAN or the internet can reach it.
  • End-to-end signature chain. RS256-signed licenses + ed25519-signed updates, both verified against pubkeys embedded at compile time. MITM and supply-chain attacks are defeated structurally.
  • Reproducible builds. Customers verify deployed binaries against published source. The trust posture is “trust your verification,” not “trust the vendor.”
  • Subscription-side of Anthropic's June 15, 2026 Agent SDK billing split — by construction. The daemon drives interactive Claude Code; calls never hit the Agent SDK credit pool. Effective 15–30× cost reduction for programmatic workloads vs. the Agent SDK path.
  • Two-Envelope architecture. The Logic Envelope stays with the developer; the Content Envelope stays with the user; the relay carries neither. Dev tooling and system prompts are safe, and user data is inaccessible to the dev.
  • ~10 ms warm-pool overhead, unlimited concurrent sessions. Session Pool keeps 2 PTYs idle at the Claude Code prompt; ~2 s cold-spawn cost is amortized; sticky sessions grow on demand with no hard cap (idle-reap @ 30 min handles zombies).
  • Any Anthropic SDK in any language. One baseURL override; Python, Node, Go, Rust, curl, anything that speaks HTTP.
  • ≈95% developer gross margins. The developer's bill stops scaling with user count — inference cost moves to the user's subscription.
  • Zero new data processors for enterprise. Installs in user space; no registry writes, no system services, no new entry in your DPA stack. Bypass the six-month security review.

I. Executive Summary

inference-relay v1.1 is a standalone background daemon, implemented in Rust, that exposes Anthropic-shape HTTP at 127.0.0.1:7421 and translates each request into a real interactive Claude Code session driven through a managed pseudo-terminal. The user's existing Claude subscription pays for the inference; the daemon's license key (an RS256-signed JWS) authorizes the routing.

There are three main facets to the architecture:

  • Memory-safe by construction. The runtime is Rust, statically linked, with zero unsafe blocks in the request path. Memory-corruption attack vectors that would apply to a Node or Electron equivalent are eliminated at the language level.
  • Loopback-only. The daemon refuses to bind any non-loopback address. There is no network surface for an attacker on the LAN, on the corporate network, or on the internet to reach. This is enforced at the socket layer at startup.
  • End-to-end signed. The license key is RS256-signed by the relay backend and verified against an embedded RSA public key. Update bundles are ed25519-signed and verified against an embedded ed25519 public key. The daemon refuses to execute unsigned or mis-signed inputs at either trust boundary.

These properties are not aspirational. They are the design center of every architectural decision documented below.

II. Why Rust

v1.0's Node-based library inherited the memory-safety story of the V8 runtime, which is good but not categorical: native dependencies, FFI boundaries, and JIT artifacts each carry their own risk surface. v1.1's Rust runtime trades that for a stricter set of guarantees that an enterprise security review can verify mechanically rather than empirically.

A. Memory Safety Without GC

The daemon's request path uses no unsafe blocks. Every buffer is owned, borrowed, or arena-allocated by the language; the borrow checker statically eliminates use-after-free, double-free, and most data races. Two consequences for enterprise auditors:

  • An entire class of vulnerabilities (heap overflows, type confusion in handler code, lifetime bugs in shared state) cannot exist in this codebase by construction.
  • Audit effort focuses on logic bugs and crypto handling rather than memory primitives.

B. No JIT, No Hot-Patch Surface

The daemon is ahead-of-time compiled to a single statically-linked binary. There is no V8, no JIT cache, no eval, no remote module loader. Once the binary is installed and its hash matches the published manifest, runtime behavior is fixed until the next signed update. Customers running compliance frameworks that require “no dynamic code execution after install” can satisfy that constraint mechanically.

C. Reproducible Builds

The release pipeline produces deterministic binaries: identical source + identical toolchain produces byte-identical output. The published sha256 of each release allows third-party verification of an installed binary against the source tree at the corresponding commit. This is the substrate for the auditability promise — customers do not have to trust inference-relay; they have to trust that they can verify the published source against the binary they're running.

D. Smaller Attack Surface, Statically

The daemon ships ~12 MB compressed, ~30 MB on disk. It depends on the Rust standard library, axum (HTTP), tokio (async runtime), jose (JWS verification), portable-pty (PTY management), vt100 (terminal emulation), and a handful of small utility crates. Every dependency is pinned in Cargo.lock with a sha256 of the source. The dependency graph is publishable and audit-able in a single afternoon.

III. Architecture at a Glance

┌─────────────────────────────────────────────────────────────┐
│  Application process  (Python, Node, Go, Rust, curl, ...)   │
│  Anthropic SDK  →  POST http://localhost:7421/v1/messages   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼  loopback only
┌─────────────────────────────────────────────────────────────┐
│  inference-relay daemon  (Rust)                             │
│                                                             │
│   axum HTTP  →  License check (RS256 JWS, embedded pubkey)  │
│              →  Session Pool  ┌─────────────────────────┐   │
│                               │  warm pool (idle: 2)    │   │
│                               │  active (unbounded)     │   │
│                               │  idle-reap @ 30 min     │   │
│                               └─────────────────────────┘   │
│              →  Pick / spawn PTY  →  drive Claude Code      │
│              →  vt100 parse + anchored turn extraction      │
│              →  Anthropic-shape JSON  →  HTTP 200           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼  managed PTY
┌─────────────────────────────────────────────────────────────┐
│  Claude Code CLI  (Anthropic's binary)                      │
│   Authenticates against the user's Claude subscription      │
│   Inference executes via Anthropic's native gateway         │
└─────────────────────────────────────────────────────────────┘

IV. Request Lifecycle (Cold Path)

A request from an SDK client to a final Anthropic-shape response, traced end to end:

  • 1. Bind check. Daemon refuses to start unless the configured listen address resolves to a loopback interface. Non-loopback bind = startup panic; no service.
  • 2. License gate. First call after install validates the license JWS against the embedded RSA public key, then against api.inference-relay.com/v1/validate. Validation is cached for 24h.
  • 3. Pool acquire. If a sticky X-IR-Session-ID header is present, look it up in the active map. If not present, take a warm-pool PTY (~10 ms). If the warm pool is empty, cold-spawn (~2 s).
  • 4. PTY drive. Feed the user's message into the Claude Code session through the PTY; read the rendered terminal output via vt100 emulation.
  • 5. Anchored extraction. Slice the turn cleanly between cooldown markers Claude Code emits at session boundaries. This is the load-bearing fix for the “scrollback contamination” class of bugs.
  • 6. Shape conversion. Map the extracted content (and any tool_use blocks) into Anthropic-SDK-compatible JSON.
  • 7. Return. HTTP 200 with the response body. Total wall-clock: ~10 ms warm-pool overhead + Anthropic's own inference latency.

V. The Session Pool

Claude Code session startup is the dominant cost for low-latency agent workloads: ~2 s to spawn the CLI, authenticate, and reach an idle prompt. v1.1 amortizes this cost with a Session Pool.

A. Pool Mechanics

  • TARGET_IDLE = 2. Two PTYs are kept idle at the Claude Code prompt at all times.
  • No hard cap on active sessions (v1.1.10+). Active sessions grow on demand. A developer pinning N agents via X-IR-Session-ID gets N isolated PTYs; the daemon does not refuse the (N+1)th. RAM is the practical ceiling (~150–250 MB per claude subprocess).
  • Warm grab ~10 ms. Pulling an idle PTY from the pool and binding it to a request takes a single mutex hop.
  • Cold spawn ~2 s. Spawning a new PTY when the warm pool is empty incurs the full Claude Code startup cost.
  • Background replenish. When the warm pool drops below TARGET_IDLE, a background task spawns a replacement so the next caller sees a warm grab.
  • Idle reaper (30 min / 5-min sweep). Sticky sessions untouched for > 30 minutes get dropped to release the claude subprocess. Replaces the v1.1.9 LRU eviction; mid-task pauses (human thinking, paste, re-run) stay well under the threshold.

B. Sticky Sessions (X-IR-Session-ID)

By default each /v1/messages call is stateless — every call lands on an arbitrary warm-pool PTY. Setting X-IR-Session-ID: <uuid> pins all calls with that ID to the same PTY, so Claude Code's native session memory carries across turns. This is what makes agent loops, planner workflows, and multi-turn tool-use chains practical.

C. Reset Without Cold-Spawn

Agents that finish a task and want a guaranteed-clean session for the next one can call POST /v1/sessions/<id>/reset. The daemon swaps the bound PTY for a fresh warm-pool one (~10 ms) instead of paying the ~2 s cold-spawn cost the user would otherwise see on the first call of the new turn.

VI. The Two-Envelope Architecture (Preserved from v1.0)

v1.0 documented a Two-Envelope model: the developer's orchestration logic is encrypted in a Logic Envelope routed through their own API key; user content is in a Content Envelope routed through the user's subscription. v1.1 preserves the model verbatim:

Logic Envelope (App Key)
Application orchestration: system prompts, tool definitions, agent graphs. Routed via the developer's Anthropic API key. Stays in the developer's span of control. Never enters the relay.
Content Envelope (User Subscription)
User content: prompts, documents, attachments, conversation. Routed via the user's Claude subscription, executed on the user's machine in the user's own Claude Code session. Never enters the relay.

The relay is a dumb pipe. It carries no model, no inference, no prompt content, and no completion content. It carries license validation, update manifests, and routing decisions — and nothing else.

VII. License Trust Chain (RS256)

License keys are RS256-signed JWS tokens issued by the relay backend. The daemon embeds the relay's RSA public key at compile time and verifies every license response against that key. The chain:

  • Issuance. Stripe webhook → relay backend → JWS signed with the relay's private RSA key → returned as a license key string with a publishable prefix (ir_live_…).
  • Activation. First daemon start submits the key to POST /v1/validate; the relay backend returns a signed validation envelope ({valid, tier, features, usageCap, usageThisMonth, ...}).
  • Embedded verification. The daemon verifies the envelope signature locally against the embedded RSA pubkey. A man-in-the-middle proxy or DNS hijack cannot forge a valid envelope without the relay's private key.
  • Cache. Validation is cached for 24h. Subsequent requests serve from cache unless a forced refresh is requested.
  • Revocation propagation. Revoking a license at the backend invalidates new validation requests within seconds; cached validation expires within 24h. A revoked key stops working everywhere within a day, online or offline.

VIII. Update Chain (ed25519)

Update bundles are ed25519-signed. The daemon embeds the relay's ed25519 public key at compile time. The chain:

  • Publish. A new release is built reproducibly, signed locally with the ed25519 private key (held offline), and uploaded to R2 alongside its .sig sidecar.
  • Discover. Daemons poll GET /v1/desktop/update/<target>/<arch>/<current_version> every four hours. Response is the signed manifest of the latest release.
  • Verify. Daemon downloads the bundle + signature from R2, verifies the signature against the embedded pubkey, refuses to apply on any failure.
  • Apply. Update lands during a quiet window (no active sticky sessions); daemon emits HTTP 503 for ~5 s during the swap, then resumes.
  • Rollback. The previous binary is retained on disk; manual rollback is one CLI command if a regression is found.

IX. Network Surface

EndpointDirectionPurposeAuth
127.0.0.1:7421Inbound (loopback only)Local SDK clientsNone (loopback trust)
api.inference-relay.com/v1/validateOutbound (TLS)License validationLicense key in body
api.inference-relay.com/v1/desktop/updateOutbound (TLS)Update pollingLicense key in Authorization header
r2.inference-relay.comOutbound (TLS)Update bundle downloadNone (public bucket; signature verified)
api.anthropic.comOutbound (TLS, via claude binary)Inference (Anthropic's own client)OAuth via Claude subscription

The daemon never listens on the public internet, never listens on the LAN, and never serves a third-party-controlled host. The only inbound socket is loopback. The only outbound traffic is TLS to four well-known hosts.

X. Threat Model

A. Adversary on the LAN

Cannot reach the daemon. Loopback binding is enforced at the socket layer at startup; the daemon panics on any non-loopback bind. No firewall configuration is required; there is no listener to firewall.

B. Adversary with Local Code Execution

An attacker who already runs code as the user has equivalent power to the user themselves. They can issue HTTP requests to the loopback endpoint, but those requests are billed against the user's subscription — same as if the attacker typed into Claude Code directly. This is not a new attack class; it is the existing local-code-execution boundary that already protects every other per-user subscription on the machine.

C. Man-in-the-Middle on Update Path

Defeated by ed25519 signature verification against the embedded pubkey. A MITM proxy can deliver any bundle, but the daemon will refuse to apply any bundle whose signature does not verify. TLS pinning is a defense-in-depth layer on top; the signature is the load-bearing guarantee.

D. Man-in-the-Middle on License Validation

Defeated by RS256 signature verification against the embedded RSA pubkey. The validation envelope is signed; a MITM cannot forge a valid envelope without the relay's private key.

E. Compromised Relay Backend

If an attacker compromises the relay's signing keys, they can issue forged licenses or updates. Mitigation: the ed25519 update key is held offline (signed locally at release time, not on the backend); the RSA license key rotates per documented policy with embedded pubkey update via the standard release flow. Customers running locked binaries do not auto-trust newly-rotated keys until they update.

F. Compromised User Subscription

Out of scope — this is the user's subscription compromise to manage, exactly as it would be without the daemon. The daemon does not store, cache, or extract the user's subscription credentials; Claude Code handles its own auth state via the OS keychain.

XI. Day-One Benefits

For end users

  • Data sovereignty. Prompts and completions never leave the machine via the relay path. The user's Claude subscription, the user's Claude Code session, the user's own machine — and nothing else in the inference path.
  • Loopback-only. Nothing on the LAN or the internet can reach the daemon. No firewall configuration, no exposed port to worry about.
  • Same subscription, more capability. The daemon turns the user's Claude subscription into the engine for every agent or script they install — without the per-agent SDK billing the Agent SDK route triggers post-June-15.

For developers

  • Any-language SDK reach. Python, Node, Go, Rust, curl, anything that speaks HTTP. One baseURL override. No per-language client to maintain.
  • Subscription pricing for agents. Bills the user's subscription pool, not the Agent SDK credit pool. Effective June 15, 2026, this is a 15–30× cost reduction for programmatic workloads compared to the Agent SDK billing path.
  • Secret-sauce isolation. System prompts, tool definitions, and orchestration logic stay in the developer's span of control via the Logic Envelope. Never enter the relay.
  • ≈95% gross margins. The developer's bill stops scaling with user count, because the inference cost moves to the user's subscription.

For enterprise procurement

  • No new data processor. The daemon installs in user space. Prompts and completions never reach inference-relay's infrastructure. No new entry in the customer's DPA stack.
  • MDM-friendly. Deploys via MDM, SCCM, Intune, JAMF, or any existing endpoint manager. The daemon ships as a standard .dmg / .exe bundle with deterministic install paths.
  • Bypass the six-month security review. Execution stays inside the already-vetted security boundary (the user's machine, the user's Claude subscription); only the routing daemon is new.
  • Memory-safe runtime + signed updates. Two of the most common security-review questions answered by construction.
  • Reproducible builds. Auditors can verify a deployed binary against the published source at the corresponding commit. The trust posture is “trust your verification,” not “trust the vendor.”

XII. What v1.1 Does Not Yet Ship

  • SSE streaming. The stream: true field is accepted but no-op; the daemon returns a buffered JSON response. Native streaming is on the v1.2 roadmap.
  • Multi-provider cascade. v1.1 is Claude-only. Customers needing Anthropic-API / OpenAI / Ollama fallback should stay on the v1.0 npm library (which is still supported).
  • Code signing (macOS Developer ID, Windows EV). v1.1 currently ships Gatekeeper-warned / SmartScreen-warned. Signing lands in a near-term v1.1.x point release.
  • Server-fleet licensing. The license model is per-end-user-machine. Server-fleet deployments are supported with a per-fleet tier; contact enterprise@inference-relay.com.

XIII. Conclusion

v1.1 is the same compliance posture as v1.0 inside a strictly stronger runtime substrate. The Rust runtime eliminates a category of vulnerabilities at the language level. The ed25519 update chain eliminates the supply-chain attack surface a per-version npm dependency carries. The loopback-only binding eliminates the network attack surface entirely. The end result is a routing layer that gives end users data sovereignty, gives developers any-language SDK reach and subscription-rate billing, and gives enterprise procurement a clean answer to every security-review question that derailed v1.0 adoption in regulated industries.

The daemon is the smallest piece of code that does the most load-bearing work in a customer's AI stack. Designed for the agent-orchestrator era; aligned by construction with Anthropic's June 15, 2026 billing split; built in Rust because the security properties the architecture promises can only be enforced by the language.

For under-NDA repository access for audit purposes, contact enterprise@inference-relay.com. Reproducible-build verification instructions are published with each release.