API Reference

Conventions

The daemon listens on 127.0.0.1:7421 by default. Override with the SR_DAEMON_PORT environment variable when launching.
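
A client can resolve the daemon address the same way. This is a minimal sketch, assuming the client process sees the same SR_DAEMON_PORT value the daemon was launched with and that the daemon speaks plain HTTP; the helper name daemon_base_url is hypothetical.

```python
import os

DEFAULT_PORT = 7421  # documented default port


def daemon_base_url(env=None):
    """Build the daemon base URL, honoring SR_DAEMON_PORT when it is set."""
    env = os.environ if env is None else env
    port = int(env.get("SR_DAEMON_PORT", DEFAULT_PORT))
    # Assumption: plain HTTP on the loopback interface.
    return f"http://127.0.0.1:{port}"
```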

All endpoints accept and return Content-Type: application/json unless noted. Errors are JSON-shaped: {"error": "<message>"} for 4xx/5xx responses.

License gating applies to /v1/messages and /v1/chat/completions. License validation runs every 5 minutes in a background loop; the per-call gate checks cached state.

GET /v1/health

Daemon liveness + diagnostics. Unauthenticated.

Response (200):

{
  "port": 7421,
  "sessions": 0,
  "status": "healthy",
  "uptime": 18472
}

uptime is milliseconds since daemon start.

GET /v1/version

Daemon + environment version probe.

Response (200):

{
  "ir": "1.1.11",
  "platform": "macos",
  "arch": "aarch64",
  "claudeAvailable": true,
  "claudeMissing": [],
  "claudeVersion": null,
  "substrate": "rust-pty",
  "builtinToolsEnabled": false
}

GET /v1/availability

Probe whether the daemon can serve requests right now (i.e., the claude binary is reachable).

Response (200):

{"ok": true, "missing": []}

When claude is missing:

{"ok": false, "missing": ["claude"]}

GET /v1/license

Current license validation state. Cached; re-validated every 5 minutes in the background, or on demand via /v1/license/refresh.

Response (200):

{
  "configured": true,
  "license": {
    "valid": true,
    "tier": "solo",
    "features": ["streaming","auto_patch","dashboard"],
    "usageThisMonth": 42,
    "usageCap": 5000,
    "capExceeded": false,
    "lastCapChange": 0,
    "topUpGrants": 0
  },
  "refreshedAt": 1778952693048
}

license.valid: false indicates the daemon will refuse /v1/messages with 402 Payment Required.
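
A client can inspect the cached state before sending traffic. The sketch below is illustrative only: predicted_gate is a hypothetical helper, and mapping capExceeded to the cap_exceeded 402 is an assumption based on the error shapes documented under POST /v1/messages.

```python
def predicted_gate(license_response):
    """Guess how the /v1/messages gate would treat a GET /v1/license payload."""
    lic = license_response.get("license") or {}
    if not lic.get("valid", False):
        return "license_invalid"  # documented: 402 Payment Required
    if lic.get("capExceeded", False):
        return "cap_exceeded"  # assumption: corresponds to the cap_exceeded 402
    return "ok"
```

The daemon remains the source of truth; a stale client-side check can disagree with the gate until the next refresh.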

POST /v1/license/refresh

Force a fresh license validation against api.inference-relay.com. Returns the same shape as GET /v1/license.

POST /v1/messages

The main inference endpoint. Anthropic-SDK-shape request and response.

Headers:

  • Content-Type: application/json (required)
  • X-IR-Session-ID: <stable-id> (optional). Enables sticky-Session routing. Without it, the daemon mints a UUID per request, so each call runs in a fresh, stateless Session.

Request body:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": "Optional system prompt",
  "messages": [
    {"role": "user", "content": "string OR array of content blocks"}
  ],
  "stream": false,
  "tools": [/* optional Anthropic tool definitions */]
}
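
The headers and body above can be combined into a request as follows. This is a sketch using only the Python standard library; BASE_URL (plain HTTP on the default port) and the helper name messages_request are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7421"  # assumed: plain HTTP on the default port


def messages_request(body, session_id=None):
    """Build (but do not send) a POST /v1/messages request."""
    headers = {"Content-Type": "application/json"}
    if session_id is not None:
        headers["X-IR-Session-ID"] = session_id  # opt in to sticky-Session routing
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it; for 4xx/5xx responses urlopen raises urllib.error.HTTPError, whose body carries the error JSON.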

Content blocks supported:

  • {"type": "text", "text": "..."}
  • {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
  • {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": "..."}}
  • {"type": "tool_use", "id": "...", "name": "...", "input": {...}}
  • {"type": "tool_result", "tool_use_id": "...", "content": "..."}
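
As an illustration of the array form of content, the sketch below builds a user message that mixes a text block with a base64 image block. The helper names image_block and user_message are hypothetical.

```python
import base64


def image_block(data, media_type="image/png"):
    """Wrap raw image bytes in a base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }


def user_message(text, *blocks):
    """Build a user message whose content is an array of content blocks."""
    return {"role": "user", "content": [{"type": "text", "text": text}, *blocks]}
```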

Response (200):

{
  "content": [{"type": "text", "text": "..."}],
  "model": "claude-sonnet-4-6",
  "usage": {"input_tokens": 12, "output_tokens": 47},
  "stop_reason": "end_turn",
  "provider": "claude-pty",
  "durationMs": 3247,
  "raw_transcript": "..."
}

durationMs, provider, and raw_transcript are inference-relay extensions. Standard SDKs ignore them.

stop_reason values: end_turn, tool_use, max_tokens.
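
Reading a response usually means joining the text blocks and checking stop_reason. A minimal sketch; the helper names response_text and was_truncated are hypothetical.

```python
def response_text(response):
    """Concatenate the text of every text block in a /v1/messages response."""
    return "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )


def was_truncated(response):
    """True when generation stopped because max_tokens was reached."""
    return response.get("stop_reason") == "max_tokens"
```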

Errors:

400 — invalid request body:

{"error": "invalid request body: missing field `messages`"}

402 — license invalid:

{
  "error": "License invalid",
  "code": "license_invalid",
  "dashboardUrl": "https://inference-relay.com/dashboard"
}

402 — cap exceeded:

{
  "error": "Monthly usage cap reached (5001 / 5000). Top up to continue.",
  "code": "cap_exceeded",
  "tier": "solo",
  "used": 5001,
  "cap": 5000,
  "topUpUrl": "https://inference-relay.com/dashboard/billing?topup=1"
}

500 — provider acquire failed:

{
  "error": "provider acquire failed: claude binary not found on PATH or known locations",
  "session_id": "<request-uuid-or-header-value>"
}
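
The error shapes above can be folded into a single client-side handler. This is a sketch, not the daemon's own logic: classify_error and its return shape are hypothetical, and any 4xx other than 402 is treated as a bad request.

```python
import json


def classify_error(status, body):
    """Map a non-200 /v1/messages response to a coarse kind plus a follow-up URL."""
    payload = json.loads(body)
    message = payload.get("error", "")
    if status == 402:
        # The two documented 402 variants are told apart by "code".
        if payload.get("code") == "cap_exceeded":
            return {"kind": "cap_exceeded", "message": message, "url": payload.get("topUpUrl")}
        return {"kind": "license_invalid", "message": message, "url": payload.get("dashboardUrl")}
    kind = "bad_request" if 400 <= status < 500 else "server_error"
    return {"kind": kind, "message": message, "url": None}
```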

Streaming note: the stream: true request field is parsed but not implemented in v1.1 — the daemon returns a buffered JSON response regardless. Native SSE is on the v1.2 roadmap.

Supported model strings (recommended):

  • claude-sonnet-4-6
  • claude-opus-4-6
  • claude-haiku-4-5

The daemon's cost-routing uses substring matching on opus/sonnet/haiku, so versioned aliases (e.g., claude-sonnet-4-20250514) also work — they just inherit the generic sonnet cost band.
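
Substring matching of this kind can be sketched as follows. The function name, the check order, and the case-insensitive comparison are assumptions; the daemon's actual implementation may differ.

```python
def cost_band(model):
    """Return the model family whose name appears in the model string, if any."""
    lowered = model.lower()
    for family in ("opus", "sonnet", "haiku"):
        if family in lowered:
            return family
    return None  # no family matched; how the daemon handles this is not covered here
```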

POST /v1/messages/count_tokens

Pre-flight token estimate for a request body. Same shape as /v1/messages but no model call happens — the daemon estimates input tokens via a 4-bytes-per-token approximation.

Response (200):

{"input_tokens": 142}

Use for budget gating before submitting expensive calls. Accuracy is within ~10% of true tokenization for typical English; outliers exist for non-Latin scripts and dense code.
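
The approximation is simple enough to reproduce client-side for rough budgeting. A sketch, assuming the count is taken over the UTF-8 bytes of the text and rounded up; which parts of the request body the daemon actually counts is not specified here.

```python
import math


def estimate_tokens(text):
    """Approximate token count as UTF-8 byte length divided by 4, rounded up."""
    return math.ceil(len(text.encode("utf-8")) / 4)
```

Non-Latin text shows why outliers occur: each character in "日本語" takes 3 bytes in UTF-8, so three characters already count as 9 bytes.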

POST /v1/chat/completions (v1.1.15+)

OpenAI Chat Completions inbound shape. Accepts the OpenAI request body (string-or-array content; system/user/assistant/tool messages; tools; tool_choice), translates internally to the Anthropic-shape pipeline, returns an OpenAI chat.completion (or SSE chunks when stream: true). Same license gate, same cascade, same sticky-session header (X-IR-Session-ID) as /v1/messages.

Request:

POST /v1/chat/completions
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "Reply with: OK"}
  ],
  "max_tokens": 30,
  "stream": false,
  "tools": [...]
}

Response (200):

{
  "id": "chatcmpl-<uuid>",
  "object": "chat.completion",
  "created": 1779094486,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "OK",
      "tool_calls": [...]  // present when claude calls a tool
    },
    "finish_reason": "stop" | "length" | "tool_calls"
  }],
  "usage": {"prompt_tokens": 19, "completion_tokens": 2, "total_tokens": 21}
}

Streaming (stream: true) emits the standard OpenAI SSE shape: each chunk is data: {"choices": [{"delta": {...}}]}, terminated by data: [DONE]. Tool-call arguments stream across multiple delta.tool_calls[N].function.arguments chunks matching the OpenAI wire spec.
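
A consumer has to concatenate the deltas itself. The sketch below folds a sequence of SSE lines into the final text and the per-index tool-call argument strings; assemble_stream is a hypothetical helper and assumes each chunk arrives as a single data: line.

```python
import json


def assemble_stream(lines):
    """Fold OpenAI-style SSE lines into (content, {tool_call_index: arguments})."""
    content = []
    tool_args = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and comments carry no payload
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content.append(delta["content"])
            for call in delta.get("tool_calls") or []:
                idx = call.get("index", 0)
                piece = (call.get("function") or {}).get("arguments") or ""
                tool_args[idx] = tool_args.get(idx, "") + piece
    return "".join(content), tool_args
```

Argument strings are only valid JSON once the stream has finished, so parse them after [DONE], not per chunk.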

Vision (v1.1.16+) — image_url blocks accepted with data:image/<type>;base64,... URLs. Remote http(s) URLs are rejected (would add an SSRF surface). Legacy function_call field is not supported (use tools); audio modalities are not supported.
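
Building an accepted image_url block from raw bytes might look like this. The block layout follows the OpenAI Chat Completions content-part shape; the helper names and the client-side check are hypothetical and only approximate the daemon's validation.

```python
import base64


def image_url_block(data, subtype="png"):
    """Encode raw image bytes as an image_url block carrying a data: URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/{subtype};base64,{encoded}"},
    }


def is_inline_image_url(url):
    """Client-side pre-check: only base64 data:image URLs, never http(s)."""
    return url.startswith("data:image/") and ";base64," in url
```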

GET /v1/activity/stream (v1.1.14+)

Server-Sent Events. Each successful or errored /v1/messages (or /v1/chat/completions) call emits one call event carrying the persisted CallRecord shape from /v1/recent-calls. Used by the dashboard's live Recent Activity panel; SDK consumers can also subscribe for real-time observability.

Event types:

  • event: call, where data: is a JSON CallRecord (id, sessionId, status, model, inputTokens, outputTokens, costAvoidedUsd, promptPreview, rawTranscript, error, etc.).
  • event: lagged, where data: is {"dropped": N}. Broadcast channel capacity is 256; consumers that fall behind get this marker and should refetch /v1/recent-calls to reconcile.

Keep-alive comments fire every 15 s so reverse proxies don't idle-close the connection.
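
A subscriber needs to split the stream into frames, skip keep-alive comments, and react to the lagged marker. A minimal sketch, assuming each frame is an event: line followed by a data: line and terminated by a blank line, and that keep-alives are standard SSE comment lines starting with a colon; parse_sse is a hypothetical helper.

```python
import json


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current frame.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # keep-alive comment, nothing to dispatch
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

On a lagged event, a consumer would discard its local view and refetch /v1/recent-calls, as described above.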

POST /v1/sessions/:id/clear-prompt

Clear the prompt input buffer + attachment state for a sticky Session WITHOUT wiping conversation memory. Use between programmatic calls when you want to guarantee a clean composition state.

Response (200):

{
  "ok": true,
  "session_id": "planner-1",
  "cleared": {"prompt": true, "staged": 0, "attachments": 0}
}

If the Session id doesn't exist:

{"ok": true, "cleared": {...}, "note": "no active session for this id"}

POST /v1/sessions/:id/reset

Hard reset: drop the existing Session's PTY and bind a fresh pre-warmed replacement from the Pool. Wipes conversation memory.

Response (200):

{"ok": true, "session_id": "planner-1", "latency_ms": 12}

Expect ~10 ms in the happy path (warm pool grab) and up to ~2 s if the pool is empty (synchronous spawn).

DELETE /v1/sessions/:id

Drop the Session entirely. The next call with that id spawns a fresh Session from the Pool.

Response (200):

{"ok": true, "session_id": "planner-1"}

Response (404):

{"ok": false, "session_id": "unknown-id"}
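
The three Session-management routes (clear-prompt, reset, and DELETE) differ only in method and path suffix. The sketch below builds the requests without sending them; BASE_URL and the session_request helper are assumptions.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:7421"  # assumed: plain HTTP on the default port

# action name -> (HTTP method, path suffix after /v1/sessions/:id)
ROUTES = {
    "clear-prompt": ("POST", "/clear-prompt"),
    "reset": ("POST", "/reset"),
    "delete": ("DELETE", ""),
}


def session_request(session_id, action):
    """Build (but do not send) a Session-management request."""
    method, suffix = ROUTES[action]
    # Percent-encode the id so it is safe as a single path segment.
    url = f"{BASE_URL}/v1/sessions/{quote(session_id, safe='')}{suffix}"
    return urllib.request.Request(url, method=method)
```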

GET /v1/sessions

Pool snapshot.

Response (200):

{
  "sessions": [],
  "pool": {"idle": 2, "active": 1, "spawning": 0}
}

idle = pre-warmed PTYs.
active = currently bound to a sticky session id.
spawning = background replenisher in flight.
sessions[] is intentionally empty — sticky-Session ids are not exposed by the public endpoint.

GET /v1/recent-calls

Returns every call record held in memory since daemon start (or since the last POST /v1/recent-calls/clear). The in-memory list is currently unbounded, so restart the daemon or clear it via the endpoint to reclaim memory if it grows on long-running installs.

The persisted file at ~/.inference-relay/recent-calls.jsonl rotates at approximately 1,000 entries via a byte-threshold trigger (~1.5 MB).

Response (200):

{
  "calls": [
    {
      "id": "uuid",
      "session_id": "uuid",
      "started_at": 1778952693048,
      "completed_at": 1778952696295,
      "model": "claude-sonnet-4-6",
      "input_tokens": 12,
      "output_tokens": 47,
      "status": "success",
      "error": null
    }
  ]
}

status: success | error.
License key never appears in this log.
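
The records are enough to derive simple health metrics. A sketch, assuming started_at and completed_at are epoch milliseconds (the sample values suggest so); summarize_calls is a hypothetical helper.

```python
def summarize_calls(payload):
    """Summarize a GET /v1/recent-calls payload: count, errors, mean duration."""
    calls = payload.get("calls", [])
    durations = [
        call["completed_at"] - call["started_at"]
        for call in calls
        if call.get("completed_at") is not None and call.get("started_at") is not None
    ]
    return {
        "count": len(calls),
        "errors": sum(1 for call in calls if call.get("status") == "error"),
        "avg_duration_ms": sum(durations) / len(durations) if durations else None,
    }
```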

POST /v1/recent-calls/clear

Clear the in-memory call list and the persisted JSONL file.

Response (200): {"ok": true}

GET /v1/settings / PATCH /v1/settings

Read or update daemon settings. License key in GET responses is redacted to ••••••••<last-4-chars>.
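
The redacted form is useful for checking which key is configured without exposing it. The sketch below mirrors the documented pattern (eight bullet characters plus the last four characters of the key); redact_license_key is a hypothetical client-side helper, not the daemon's code.

```python
def redact_license_key(key):
    """Mask a license key as eight bullets followed by its last four characters."""
    return "\u2022" * 8 + key[-4:]


def matches_redacted(key, redacted):
    """True when a locally held key is consistent with a redacted GET value."""
    return redact_license_key(key) == redacted
```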

PATCH accepts {licenseKey, workingDir, builtinToolsEnabled, ...}. Setting licenseKey triggers an immediate validation against api.inference-relay.com.

GET /v1/debug-bundle

Diagnostic snapshot: redacted settings, recent calls, license state, pool snapshot, daemon version. Useful for support tickets.

Response (200):

{
  "version": {...},
  "settings": {/* license redacted */},
  "license": {...},
  "pool": {...},
  "recent_calls_count": 42
}

Stub endpoints (reserved, not yet implemented)

These routes exist in the daemon but return stub responses today. Don't build on them — the shape will change when the implementation lands.

  • POST /v1/warmup — currently returns 503 with {"error": "TODO Days 5-7"}. Intended to let operators pre-warm the Session Pool before issuing real traffic. Use a real /v1/messages call as a warmup probe for now.
  • GET /v1/events — currently returns {"events": []}. Intended for a daemon-side event stream (turn events, license refresh, pool state changes).
  • POST /v1/events/clear — currently returns {"ok": true}. No-op stub.
  • POST /v1/conversations/:id/clear — currently returns {"ok": true}. No-op stub; predates POST /v1/sessions/:id/reset. Use /sessions/:id/reset for actual conversation-memory wipes.

Internal endpoints (not for SDK consumers)

These exist for the bundled MCP server to call back into the daemon during tool round-trips. Don't call them from your code.

  • POST /v1/internal/tool_call — staging endpoint used by the MCP server to register caller-defined tool invocations

Where to go next