Overview
What inference-relay is
inference-relay is a local daemon that translates Anthropic-SDK-shape HTTP calls and routes them through your Claude Code subscription instead of the metered Anthropic API. It is the Native Subscription Gateway from v1.0, repackaged as a standalone daemon.
The daemon listens on 127.0.0.1:7421. Your code points its Anthropic SDK at that address by overriding baseURL. Any process on the same machine — Python, Node, Go, Rust, curl — can hit it. The daemon translates each call to the Claude Code CLI's interactive protocol through a managed PTY (a virtualized shell connection to your claude install), bills against your subscription pool, and returns SDK-shape JSON.
You install the desktop app once. The daemon comes up on first launch, binds the license key, and serves indefinitely. No code stays running inside your application process. No npm package gets imported. The daemon is infrastructure, not a library.
The mental model in four boxes
┌────────────┐ ┌──────────────┐ ┌─────────────────────────┐ ┌──────────────┐ │ Your code │──▶│ Anthropic SDK│──▶│ inference-relay daemon │──▶│ Claude Code │ │ (any lang)│ │ baseURL = │ │ ┌───────────────────┐ │ │ subprocess │ │ │ │ localhost:7421│ │ │ Session Pool │ │ │ (your sub) │ │ │ │ │ │ │ 2 idle / 5 max │ │ │ │ │ │ │ │ │ └───────────────────┘ │ │ │ └────────────┘ └──────────────┘ └─────────────────────────┘ └──────────────┘
- Your code writes against the standard Anthropic SDK. No IR imports, no auth changes. The only edit is
baseURL. - The daemon owns the Session Pool — a set of Pre-warmed PTYs each running a Claude Code subprocess. It accepts
POST /v1/messagesand emits the same content blocks the API would. - The Pool keeps ~2 PTYs idle so the per-call overhead stays around 10 ms. Spawn cost is paid in the background, not on the request.
- The subprocess is your existing Claude Code install — same model access, same subscription, same login. inference-relay drives it, doesn't replace it.
What it does NOT do (yet)
- Multi-provider routing. v2 is Claude-first. The fallback cascade to Anthropic API / OpenAI / Ollama that v1.0 supports is on the roadmap; not in v2 today. v1.0 customers who depend on cascade should stay on v1.0 docs.
- Streaming with arbitrary backpressure. The daemon streams SSE using the same shape the Anthropic SDK emits, but doesn't expose IR-specific stream-control primitives.
What it gives you that v1.0 (the npm library) didn't
- Any language. Python, Go, Ruby, Rust, curl — anything with an Anthropic SDK or HTTP client. v1.0 was Node-only.
- Out-of-process. Doesn't compete with your application's event loop, doesn't bundle into your container image, doesn't add npm dependencies you have to audit.
- Headless operation. After first-run license entry, the daemon runs as a Windows Task Scheduler entry, a launchd plist, or a systemd unit. No window, no tray icon. See Headless.
- SDK-stateless per call. Two consecutive
/v1/messagescalls share zero state — same contract as the Anthropic API. Sticky multi-turn is opt-in via header. See Sessions. - Tool use, vision, attachments. Standard Anthropic SDK shape works through the daemon —
tools[],imageblocks, PDF and text attachments. See Tools and Attachments.
Where to go next
- New to inference-relay → Quickstart (5 min)
- Running headless / as a service → Headless
- Integrating from Python / Go / curl → SDK Integration
- Building an agent orchestrator → Agents Cookbook