For Agents — Install to Operation

A single comprehensive walkthrough for developers building agents on inference-relay v1.1+: install the daemon, point your SDK at it, run your first agent, and operate it in production. Read top to bottom; each section assumes the previous one is working.

If you want a one-page recipe instead of a walkthrough, see the Agents Cookbook.

0. Who this is for

You're writing code that calls an Anthropic SDK from a server, a long-running process, a CI runner, a desktop application, or a multi-process orchestrator. You want subscription-rate billing instead of metered API rates, and you do not want to absorb the Agent SDK billing that takes effect June 15, 2026: $20 / $100 / $200 monthly credit pools, then full API rates.

inference-relay drives a real interactive Claude Code session under the hood. Your agent calls route through the subscription-subsidized side of Anthropic's new split — not the Agent SDK credit pool.

1. Install the daemon

macOS — Apple Silicon

# Download and open the disk image
curl -L https://r2.inference-relay.com/desktop/1.1.14/macos-aarch64/inference-relay_1.1.14_aarch64.dmg \
  -o ~/Downloads/inference-relay.dmg
open ~/Downloads/inference-relay.dmg
# Drag inference-relay.app to /Applications, then launch once from there.

macOS — Intel

curl -L https://r2.inference-relay.com/desktop/1.1.14/macos-x86_64/inference-relay.app.tar.gz \
  -o ~/Downloads/inference-relay.app.tar.gz
tar -xzf ~/Downloads/inference-relay.app.tar.gz -C /Applications
open /Applications/inference-relay.app

Windows

Download inference-relay-1.1.14-setup.exe and run it. SmartScreen warns on first launch; choose More info → Run anyway. We don't sign with Microsoft EV or Apple Developer ID, by design; see the changelog for the reasoning.

Linux

Three packaging formats: portable AppImage (no install), Debian/Ubuntu .deb, and Fedora/RHEL/openSUSE .rpm.

# AppImage (any distro, no install required)
curl -L https://r2.inference-relay.com/desktop/1.1.14/linux-x86_64/inference-relay_1.1.14_amd64.AppImage \
  -o ~/inference-relay.AppImage
chmod +x ~/inference-relay.AppImage
~/inference-relay.AppImage

# Debian / Ubuntu (.deb)
curl -L https://r2.inference-relay.com/desktop/1.1.14/linux-x86_64/inference-relay_1.1.14_amd64.deb -o /tmp/ir.deb
sudo dpkg -i /tmp/ir.deb

# Fedora / RHEL / openSUSE (.rpm)
curl -L https://r2.inference-relay.com/desktop/1.1.14/linux-x86_64/inference-relay-1.1.14-1.x86_64.rpm -o /tmp/ir.rpm
sudo rpm -i /tmp/ir.rpm

2. Activate your license

On first launch the daemon prompts for your license key. You can also set it via environment variable so the daemon boots non-interactively:

# macOS / Linux
export IR_LICENSE_KEY=ir_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
inference-relay activate

# Windows (PowerShell)
$env:IR_LICENSE_KEY = 'ir_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
inference-relay activate

Activation validates the JWS-signed license against api.inference-relay.com/v1/validate, caches the response for 24h, and starts the daemon listener on 127.0.0.1:7421. Cached responses are verified against an embedded public key (RS256 signatures), so the cache works offline. License revocations propagate within 24h: once the cache expires, a revoked key stops working even on a machine that stays offline.
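
The 24-hour cache window can be pictured as a simple freshness check. The sketch below is illustrative only: the function name and the idea of a stored validation timestamp are assumptions, not the daemon's actual implementation. The 24h figure is the only value taken from the text.

```python
from datetime import datetime, timedelta, timezone

# 24 h cache window, as stated in the text. Everything else is hypothetical.
CACHE_TTL = timedelta(hours=24)

def cached_license_usable(validated_at: datetime, now: datetime) -> bool:
    """Return True while a cached validation is still inside the 24 h window."""
    return timedelta(0) <= now - validated_at < CACHE_TTL

validated = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
print(cached_license_usable(validated, validated + timedelta(hours=23)))  # True
print(cached_license_usable(validated, validated + timedelta(hours=25)))  # False
```

Once the check fails, the only way forward is a fresh online validation, which is why a revoked key cannot outlive the cache.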

3. Verify the daemon is up

curl http://localhost:7421/v1/health
# → {"status":"ok","version":"1.1.14","pool":{"idle":2,"active":0,"spawning":0}}

idle: 2 means the Session Pool has two pre-warmed Claude Code sessions ready. active: 0 means no requests are in flight. Active sessions are unbounded (v1.1.10+); an idle reaper drops sessions untouched for 30 minutes.
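
If your agent starts alongside the daemon (for example at boot), it can help to wait for the health endpoint before sending the first request. This is a minimal sketch using only the standard library. Treating "at least one idle session" as the readiness signal is an assumption of this example, and pool_ready and wait_for_daemon are illustrative names, not part of inference-relay.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:7421/v1/health"

def pool_ready(payload: dict) -> bool:
    """True when the daemon reports ok and at least one idle session exists."""
    pool = payload.get("pool", {})
    return payload.get("status") == "ok" and pool.get("idle", 0) > 0

def wait_for_daemon(timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll /v1/health until the pool is ready or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                if pool_ready(json.load(resp)):
                    return True
        except (OSError, ValueError):
            pass  # daemon not listening yet, or the body was not valid JSON
        time.sleep(interval_s)
    return False
```

Call wait_for_daemon() once at startup and fail fast if it returns False, rather than letting the first SDK call time out.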

4. Make your first agent call

Python

from anthropic import Anthropic

client = Anthropic(
    api_key="unused",                        # required by SDK, ignored by daemon
    base_url="http://localhost:7421",        # the only line that changes
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Plan a 3-step refactor of fn render(): ..."},
    ],
)
print(response.content[0].text)

Node / TypeScript

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: 'unused',
  baseURL: 'http://localhost:7421',
});

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [{ role: 'user', content: '...' }],
});

Go

client := anthropic.NewClient(
    option.WithAPIKey("unused"),
    option.WithBaseURL("http://localhost:7421"),
)

resp, err := client.Messages.New(ctx, anthropic.MessageNewParams{
    Model: anthropic.F("claude-sonnet-4-6"),
    MaxTokens: anthropic.F(int64(1024)),
    Messages: anthropic.F([]anthropic.MessageParam{
        anthropic.NewUserMessage(anthropic.NewTextBlock("...")),
    }),
})

Rust

use anthropic_sdk::Client;

let client = Client::builder()
    .api_key("unused")
    .base_url("http://localhost:7421")
    .build()?;

let resp = client.messages()
    .model("claude-sonnet-4-6")
    .max_tokens(1024)
    .add_message("user", "...")
    .create()
    .await?;

5. Multi-turn agents: sticky sessions

By default every call is stateless — same Anthropic-SDK contract as the metered API. For multi-turn agent loops where you want one Claude Code session to retain context across calls, set the X-IR-Session-ID header. The daemon pins all calls with the same session ID to the same warm-pool PTY:

import uuid
from anthropic import Anthropic

session_id = str(uuid.uuid4())

client = Anthropic(
    api_key="unused",
    base_url="http://localhost:7421",
    default_headers={"X-IR-Session-ID": session_id},
)

# Every call in this loop shares one Claude Code session.
# plan_steps() and record() stand in for your own code.
for step in plan_steps():
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": step}],
    )
    record(response)

See Sessions for the full sticky-session semantics, idle-reap policy (30 min), and reset endpoint.

6. Tools

Pass the tools argument exactly as you would with the metered API. The daemon forwards tool definitions to Claude Code and returns tool_use blocks in the response. Your code handles the loop:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    tools=[{
        "name": "read_file",
        "description": "Read the contents of a file on disk.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }],
    messages=[{"role": "user", "content": "Summarize README.md"}],
)

for block in response.content:
    if block.type == "tool_use" and block.name == "read_file":
        # Read the requested file and close the handle when done
        with open(block.input["path"], encoding="utf-8") as f:
            contents = f.read()
        # Send the tool result back in the next turn
        ...

See Tools for caller-defined tools and MCP server integration.
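
Sending the tool result back means appending two turns: the assistant turn that contained the tool_use block, and a user turn carrying a tool_result block with the matching tool_use_id. The sketch below builds those turns as plain dicts in the standard Anthropic message shape. The helper names are illustrative, and it assumes the stateless contract where you resend the full history on each call.

```python
def tool_result_block(tool_use_id: str, output: str, is_error: bool = False) -> dict:
    """Build one tool_result content block for a finished tool call."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": output,
        "is_error": is_error,
    }

def next_messages(history: list, assistant_content: list, results: list) -> list:
    """Append the assistant turn, then one user turn holding all tool results."""
    return history + [
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": results},
    ]

history = [{"role": "user", "content": "Summarize README.md"}]
assistant_content = [{
    "type": "tool_use",
    "id": "toolu_01",  # example ID; real IDs come from the response
    "name": "read_file",
    "input": {"path": "README.md"},
}]
messages = next_messages(
    history,
    assistant_content,
    [tool_result_block("toolu_01", "file contents go here")],
)
```

With the Python SDK you can pass response.content as assistant_content and block.id as the tool_use_id, then call client.messages.create again with the extended messages list and the same tools argument.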

7. Vision & attachments

Image and PDF inputs work unchanged. Pass them as image or document blocks in the message content array. The daemon forwards base64 payloads to Claude Code; no upload to inference-relay servers.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64_encoded_screenshot,
            }},
            {"type": "text", "text": "What's broken in this UI?"},
        ],
    }],
)

See Attachments for size limits and supported formats.
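
A small helper keeps the base64 and media-type bookkeeping out of your agent loop. This is a sketch: image_block is an illustrative name, and the accepted media types listed here are the ones the Anthropic API documents for images. The daemon's own limits may differ, so check Attachments.

```python
import base64
import mimetypes

# Image types documented for the Anthropic API; assumed to apply here too.
SUPPORTED = {"image/png", "image/jpeg", "image/gif", "image/webp"}

def image_block(filename: str, data: bytes) -> dict:
    """Build an image content block from raw bytes, inferring the media type."""
    media_type, _ = mimetypes.guess_type(filename)
    if media_type not in SUPPORTED:
        raise ValueError(f"unsupported image type for {filename!r}: {media_type}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```

Usage: read the file in binary mode and pass the bytes, for example image_block("screenshot.png", raw_bytes), then place the returned dict in the message content list next to your text block.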

8. Streaming

v1.1.9+ supports native SSE streaming. SDK calls with stream: true receive standard Anthropic content_block_delta events as Claude renders, at ~100 ms intervals (the daemon's PTY poll cadence). Tool-use blocks are not streamed incrementally; they arrive in one burst after the cooldown.

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{"role": "user", "content": "..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
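
If you consume the stream without an SDK (for example from a plain HTTP client), each event arrives as an "event:" line followed by a "data:" line holding JSON. The sketch below extracts text from content_block_delta events in that standard Anthropic format. text_deltas is an illustrative helper, and the sample lines are hand-written, not captured daemon output.

```python
import json

def text_deltas(sse_lines):
    """Yield text fragments from content_block_delta events in an SSE stream."""
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta["text"]

sample = [
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hel"}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"lo"}}',
    '',
    'event: message_stop',
    'data: {"type":"message_stop"}',
]
print("".join(text_deltas(sample)))  # Hello
```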

9. Headless operation

Long-running agents need the daemon to survive application restarts and machine reboots. Install as a system service:

# macOS (launchd)
inference-relay install-service --user
# Loads ~/Library/LaunchAgents/com.inference-relay.daemon.plist

# Windows (Task Scheduler)
inference-relay install-service --user
# Registers an "At log on" trigger under the current user

# Linux (systemd --user)
inference-relay install-service --user
# Writes ~/.config/systemd/user/inference-relay.service

See Headless Operation for service management, log locations, and unattended-restart semantics.

10. Error handling

The daemon returns the standard Anthropic HTTP error shape, so SDK retry/back-off paths work unchanged:

  • 401 — invalid or revoked license; check /v1/validate
  • 429 — downstream Anthropic rate-limit (daemon itself doesn't throttle; concurrent sticky sessions are unbounded)
  • 500 (transient) — Claude Code session crashed; retry once with the same session ID and the daemon will rebuild from the warm pool
  • 503 — daemon is shutting down (auto-update in progress); retry after 5 s

A single except APIError handler with exponential back-off is enough; the daemon's Session Pool absorbs most transient failures itself, in the ~10 ms range.
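
The status codes above map onto a small retry policy. The official Python SDK already retries some failures on its own (max_retries, default 2), so treat this as a sketch of the decision logic rather than something you must add. RelayError is a hypothetical stand-in for the SDK's anthropic.APIStatusError (which exposes status_code), and the base delay, cap, and attempt count are assumptions.

```python
import random
import time

RETRYABLE = {429, 500, 503}  # 401 is a license problem; retrying will not help

class RelayError(Exception):
    """Hypothetical stand-in for an SDK status error; carries the HTTP status."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def backoff_delay(status: int, attempt: int,
                  base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if status == 503:
        return 5.0  # daemon is mid-update; the list above suggests 5 s
    return min(cap_s, base_s * 2 ** attempt)

def call_with_retry(fn, max_attempts: int = 4, sleep=time.sleep):
    """Call fn(), retrying retryable statuses with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RelayError as err:
            is_last = attempt == max_attempts - 1
            if err.status not in RETRYABLE or is_last:
                raise
            delay = backoff_delay(err.status, attempt)
            sleep(delay + random.uniform(0, delay / 10))  # small jitter
```

For sticky sessions, keep the same X-IR-Session-ID across retries so the daemon can rebuild the session from the warm pool.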

11. Reset between conversations

For agents that need a guaranteed clean slate (long-running planner → new task), call the reset endpoint instead of waiting for the 30-minute idle reaper:

curl -X POST http://localhost:7421/v1/sessions/$SESSION_ID/reset
# → {"swapped": true, "took_ms": 11}

Reset swaps the PTY for a fresh warm-pool session in ~10 ms — much faster than the ~2 s cold-spawn cost.
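
From Python, the same reset can be issued with the standard library. This is a sketch: reset_url and reset_session are illustrative names, and the response is assumed to be the JSON object shown in the curl example.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7421"

def reset_url(session_id: str, base_url: str = BASE_URL) -> str:
    """Build the reset endpoint URL, escaping the session ID for use in a path."""
    quoted = urllib.parse.quote(session_id, safe="")
    return f"{base_url}/v1/sessions/{quoted}/reset"

def reset_session(session_id: str) -> dict:
    """POST to the reset endpoint and return the parsed JSON body."""
    req = urllib.request.Request(reset_url(session_id), data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

Call reset_session(session_id) between tasks and keep reusing the same X-IR-Session-ID header afterwards; the ID stays the same while the underlying PTY is swapped.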

12. Deploy at scale

Single-machine, single-user

Default. Daemon runs on the user's laptop or workstation. Warm pool keeps 2 PTYs idle; active sessions grow on demand. Suitable for solo developers, single-seat agent runs, CI runners with 1 active workflow.

Single-machine, multi-process (parallel agents)

Same daemon, multiple SDK callers. Set distinct X-IR-Session-ID values per orchestrator to keep their contexts isolated. No hard cap on active sessions — each gets its own PTY with no cross-contamination. The 30-minute idle reaper cleans up abandoned sessions; RAM (~150–250 MB per claude subprocess) is the practical ceiling.
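
Because RAM is the practical ceiling, a quick budget calculation helps size a parallel run. The sketch below uses the upper figure from the text (250 MB per claude subprocess); the 1 GB reserve for the rest of the system and both helper names are assumptions of this example.

```python
import uuid

def max_parallel_sessions(free_ram_mb: int,
                          per_session_mb: int = 250,
                          reserve_mb: int = 1024) -> int:
    """Estimate how many sessions fit, leaving reserve_mb for everything else."""
    return max(0, (free_ram_mb - reserve_mb) // per_session_mb)

def session_headers(n: int) -> list:
    """One distinct X-IR-Session-ID header dict per orchestrator."""
    return [{"X-IR-Session-ID": str(uuid.uuid4())} for _ in range(n)]

budget = max_parallel_sessions(free_ram_mb=8192)
print(budget)  # 28
headers = session_headers(3)
```

Pass each dict as default_headers when constructing that orchestrator's SDK client, so every orchestrator gets its own PTY.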

Server fleet (advanced)

v1.1 is designed for end-user machines (your developers, your end users). Server-fleet deployment — many machines each running the daemon under your service account — is supported but requires a per-fleet license tier. Contact enterprise@inference-relay.com.

13. Monitor & debug

  • GET /v1/health — pool state + version
  • GET /v1/sessions — list active sessions (ID, age, last activity)
  • GET /v1/recent_calls — last N requests (in-memory; cleared on restart)
  • tail -f ~/Library/Logs/inference-relay/daemon.log — full daemon log (macOS path; see Troubleshooting for other platforms)
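
One use for the session list is spotting sessions that are about to hit the 30-minute idle reaper. The sketch below is hypothetical in its details: the field names id and idle_seconds are assumptions, since the exact response shape of GET /v1/sessions is documented in the API Reference rather than here. Adapt the keys to the real payload.

```python
REAPER_LIMIT_S = 30 * 60  # 30-minute idle reaper, from the text

def close_to_reaping(sessions: list, warn_after_s: int = 25 * 60) -> list:
    """Return IDs of sessions idle long enough to be reaped soon.

    Assumes each entry has hypothetical keys "id" and "idle_seconds".
    """
    return [
        s["id"]
        for s in sessions
        if warn_after_s <= s["idle_seconds"] < REAPER_LIMIT_S
    ]

# Hand-written sample data, not real daemon output.
sample_sessions = [
    {"id": "planner", "idle_seconds": 40},
    {"id": "reviewer", "idle_seconds": 26 * 60},
]
print(close_to_reaping(sample_sessions))  # ['reviewer']
```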

14. Update

The daemon polls api.inference-relay.com/v1/desktop/update every four hours. New releases ship as ed25519-signed bundles; the daemon verifies the signature against an embedded public key before applying. Updates install in-place during a quiet window (no active sessions); the daemon emits a 503 for ~5 s during the swap.

To disable auto-update, launch the daemon with the --no-auto-update flag: inference-relay --no-auto-update. To check for an update manually, run inference-relay update --check.

15. Common gotchas

  • Forgot the baseURL override — calls go to api.anthropic.com at metered rates instead of the daemon. Check your SDK constructor.
  • License not activated — /v1/validate returns 401 and the daemon refuses to serve. Run inference-relay activate.
  • Sticky-session header not sent — multi-turn calls don't share context. Confirm X-IR-Session-ID reaches the daemon via GET /v1/recent_calls.
  • Claude Code not logged in — the underlying CLI prompts for browser login on first session spawn. Run claude once interactively to authenticate, then the daemon's Session Pool picks up the cached auth.
  • Loopback firewall — corporate machines sometimes block 127.0.0.1:7421. Change the port via IR_LISTEN_PORT; loopback-only binding is non-negotiable.
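
Two of these gotchas (a missing base URL override and a changed port) can be caught with a preflight check before the client is built. This is a sketch under assumptions: IR_LISTEN_PORT is the daemon's setting, and having your client code read the same variable is a convention of this example, not documented behaviour. Both function names are illustrative.

```python
import os
from urllib.parse import urlsplit

def daemon_base_url(env=os.environ) -> str:
    """Derive the daemon URL, honouring IR_LISTEN_PORT if it is set."""
    port = int(env.get("IR_LISTEN_PORT", "7421"))
    return f"http://127.0.0.1:{port}"

def is_loopback(url: str) -> bool:
    """True when the URL points at this machine's loopback interface."""
    host = urlsplit(url).hostname or ""
    return host in {"127.0.0.1", "localhost", "::1"}

base_url = daemon_base_url({})
if not is_loopback(base_url):
    raise RuntimeError(f"refusing to send agent traffic to {base_url}")
```

Pass base_url to the SDK constructor; the guard fails loudly if a misconfiguration would send calls to a non-local endpoint at metered rates.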

16. Where to go next

  • Agents Cookbook — five recipes (single-shot batch, planner loop, tool-using agent, vision extraction, reset-between-conversations).
  • API Reference — all 12 endpoints, request/response shapes, error codes.
  • Security — loopback binding, JWS license trust chain, ed25519 update signing.
  • Technical Whitepaper — full architecture deep-dive: Rust runtime, Session Pool, two-envelope mechanism.