API Reference

Conventions

The daemon listens on 127.0.0.1:7421 by default. Override with the SR_DAEMON_PORT environment variable when launching.

All endpoints accept and return Content-Type: application/json unless noted. Errors are JSON-shaped: {"error": "<message>"} for 4xx/5xx responses.

License gating applies to /v1/messages. License validation runs in a 5-minute background loop; the per-call gate checks cached state.

Backend caveats (read this before integrating)

The claude-cli backend (the default, free-of-charge subscription path) ships with two well-known quirks integrators must handle. Both are also reported by the daemon in the notes array of GET /v1/version so your code can probe them programmatically.

1. JSON output may include markdown code fences

Claude Code's CLI agent wraps JSON-mode responses in ```json … ``` fences even when the system prompt explicitly forbids them. Use a fence-tolerant parser: strip triple-backtick lines, then extract the first balanced JSON object. Strict JSON.parse on the raw response will intermittently throw.

Notes-array key: claude_cli_may_wrap_json_in_markdown_fences

2. Forced `tool_choice` returns 400

The CLI tool protocol can't honor a forced specific-tool selection. Sending tool_choice: {type: "tool", name: "..."} or tool_choice: \"required\" to /v1/messages returns 400 with:

{
  "code": "TOOL_CHOICE_UNSUPPORTED",
  "type": "invalid_request_error",
  "message": "Forced tool_choice is not supported on the claude-cli (subscription) backend. Use prompt-based JSON mode or route this call to anthropic-api with your own API key.",
  "tool_choice_received": {...}
}

For structured output, use prompt-based JSON mode (instruct the model to respond with a JSON object — the fence-tolerant parser from Caveat 1 handles the response). For genuine forced tool-use, route around the daemon to anthropic-api with your own API key.

tool_choice values that do pass through: null, unset, "auto", {type: "auto"}, {type: "any"}.

Notes-array key: claude_cli_forced_tool_choice_unsupported

`GET /v1/health`

Daemon liveness + diagnostics. Unauthenticated.

Response (200):

{
  "port": 7421,
  "sessions": 0,
  "status": "healthy",
  "uptime": 18472
}

uptime is milliseconds since daemon start.

`GET /v1/version`

Daemon + environment version probe.

Response (200):

{
  "ir": "1.1.11",
  "platform": "macos",
  "arch": "aarch64",
  "claudeAvailable": true,
  "claudeMissing": [],
  "claudeVersion": null,
  "substrate": "rust-pty",
  "builtinToolsEnabled": false
}

`GET /v1/availability`

Probe whether the daemon can serve requests right now (i.e., the claude binary is reachable).

Response (200):

{"ok": true, "missing": []}

When claude is missing:

{"ok": false, "missing": ["claude"]}

`GET /v1/license`

Current license validation state. Cached; re-validated every 5 minutes in the background, or on demand via /v1/license/refresh.

Response (200):

{
  "configured": true,
  "license": {
    "valid": true,
    "tier": "solo",
    "features": ["streaming","auto_patch","dashboard"],
    "usageThisMonth": 42,
    "usageCap": 5000,
    "capExceeded": false,
    "lastCapChange": 0,
    "topUpGrants": 0
  },
  "refreshedAt": 1778952693048
}

license.valid: false indicates the daemon will refuse /v1/messages with 402 Payment Required.

`POST /v1/license/refresh`

Force a fresh license validation against api.inference-relay.com. Returns the same shape as GET /v1/license.

`POST /v1/messages`

The main inference endpoint. Anthropic-SDK-shape request and response.

Headers:

Content-Type: application/json (required)
X-IR-Session-ID: <stable-id> (optional) — sticky-Session routing. Without it, the daemon mints a UUID per request → fresh stateless Session per call.

Request body:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": "Optional system prompt",
  "messages": [
    {"role": "user", "content": "string OR array of content blocks"}
  ],
  "stream": false,
  "tools": [/* optional Anthropic tool definitions */]
}

Content blocks supported:

{"type": "text", "text": "..."}
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
{"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": "..."}}
{"type": "tool_use", "id": "...", "name": "...", "input": {...}}
{"type": "tool_result", "tool_use_id": "...", "content": "..."}

Response (200):

{
  "content": [{"type": "text", "text": "..."}],
  "model": "claude-sonnet-4-6",
  "usage": {"input_tokens": 12, "output_tokens": 47},
  "stop_reason": "end_turn",
  "provider": "claude-pty",
  "durationMs": 3247,
  "raw_transcript": "..."
}

durationMs, provider, and raw_transcript are inference-relay extensions. Standard SDKs ignore them.

stop_reason values: end_turn, tool_use, max_tokens.

Errors:

400 — invalid request body:

{"error": "invalid request body: missing field `messages`"}

402 — license invalid:

{
  "error": "License invalid",
  "code": "license_invalid",
  "dashboardUrl": "https://inference-relay.com/dashboard"
}

402 — cap exceeded:

{
  "error": "Monthly usage cap reached (5001 / 5000). Top up to continue.",
  "code": "cap_exceeded",
  "tier": "solo",
  "used": 5001,
  "cap": 5000,
  "topUpUrl": "https://inference-relay.com/dashboard/billing?topup=1"
}

500 — provider acquire failed:

{
  "error": "provider acquire failed: claude binary not found on PATH or known locations",
  "session_id": "<request-uuid-or-header-value>"
}

Streaming note: the stream: true request field is parsed but not implemented in v1.1 — the daemon returns a buffered JSON response regardless. Native SSE is on the v1.2 roadmap.

Supported model strings (recommended):

claude-sonnet-4-6
claude-opus-4-6
claude-haiku-4-5

The daemon's cost-routing uses substring matching on opus/sonnet/haiku, so versioned aliases (e.g., claude-sonnet-4-20250514) also work — they just inherit the generic sonnet cost band.

`POST /v1/messages/count_tokens`

Pre-flight token estimate for a request body. Same shape as /v1/messages but no model call happens — the daemon estimates input tokens via a 4-bytes-per-token approximation.

Response (200):

{"input_tokens": 142}

Use for budget gating before submitting expensive calls. Accuracy is within ~10% of true tokenization for typical English; outliers exist for non-Latin scripts and dense code.

`POST /v1/chat/completions` (v1.1.15+)

OpenAI Chat Completions inbound shape. Accepts the OpenAI request body (string-or-array content; system/user/assistant/tool messages; tools; tool_choice), translates internally to the Anthropic-shape pipeline, returns an OpenAI chat.completion (or SSE chunks when stream: true). Same license gate, same cascade, same sticky-session header (X-IR-Session-ID) as /v1/messages.

Request:

POST /v1/chat/completions
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "Reply with: OK"}
  ],
  "max_tokens": 30,
  "stream": false,
  "tools": [...]
}

Response (200):

{
  "id": "chatcmpl-<uuid>",
  "object": "chat.completion",
  "created": 1779094486,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "OK",
      "tool_calls": [...]  // present when claude calls a tool
    },
    "finish_reason": "stop" | "length" | "tool_calls"
  }],
  "usage": {"prompt_tokens": 19, "completion_tokens": 2, "total_tokens": 21}
}

Streaming (stream: true) emits the standard OpenAI SSE shape: each chunk is data: {"choices": [{"delta": {...}}]}, terminated by data: [DONE]. Tool-call arguments stream across multiple delta.tool_calls[N].function.arguments chunks matching the OpenAI wire spec.

Vision (v1.1.16+) — image_url blocks accepted with data:image/<type>;base64,... URLs. Remote http(s) URLs are rejected (would add an SSRF surface). Legacy function_call field is not supported (use tools); audio modalities are not supported.

`GET /v1/activity/stream` (v1.1.14+)

Server-Sent Events. Each successful or errored /v1/messages (or /v1/chat/completions) call emits one call event carrying the persisted CallRecord shape from /v1/recent-calls. Used by the dashboard's live Recent Activity panel; SDK consumers can also subscribe for real-time observability.

Event types:

event: call — data: is a JSON CallRecord (id, sessionId, status, model, inputTokens, outputTokens, costAvoidedUsd, promptPreview, rawTranscript, error, etc.).
event: lagged — data: {"dropped": N}". Broadcast channel capacity is 256; consumers that fall behind get this marker and should refetch /v1/recent-calls to reconcile.

Keep-alive comments fire every 15 s so reverse proxies don't idle-close the connection.

`POST /v1/sessions/:id/clear-prompt`

Clear the prompt input buffer + attachment state for a sticky Session WITHOUT wiping conversation memory. Use between programmatic calls when you want to guarantee a clean composition state.

Response (200):

{
  "ok": true,
  "session_id": "planner-1",
  "cleared": {"prompt": true, "staged": 0, "attachments": 0}
}

If the Session id doesn't exist:

{"ok": true, "cleared": {...}, "note": "no active session for this id"}

`POST /v1/sessions/:id/reset`

Hard reset: drop the existing Session's PTY and bind a fresh Pre-warmed replacement from the Pool. Wipes conversation memory.

Response (200):

{"ok": true, "session_id": "planner-1", "latency_ms": 12}

~10 ms in the happy path (warm pool grab). Up to ~2 s if the pool is empty (synchronous spawn).

`DELETE /v1/sessions/:id`

Drop the Session entirely. The next call with that id spawns a fresh Session from the Pool.

Response (200):

{"ok": true, "session_id": "planner-1"}

Response (404):

{"ok": false, "session_id": "unknown-id"}

`GET /v1/sessions`

Pool snapshot.

Response (200):

{
  "sessions": [],
  "pool": {"idle": 2, "active": 1, "spawning": 0}
}

idle = pre-warmed PTYs.
active = currently bound to a sticky session id.
spawning = background replenisher in flight.
sessions[] is intentionally empty — sticky-Session ids are not exposed by the public endpoint.

`GET /v1/recent-calls`

Returns every call record in the in-memory ring since daemon start (or since the last POST /v1/recent-calls/clear). The in-memory list is currently unbounded — restart the daemon or clear via the endpoint to reclaim memory if it grows on long-running installs.

The persisted file at ~/.inference-relay/recent-calls.jsonl rotates at approximately 1,000 entries via a byte-threshold trigger (~1.5 MB).

Response (200):

{
  "calls": [
    {
      "id": "uuid",
      "session_id": "uuid",
      "started_at": 1778952693048,
      "completed_at": 1778952696295,
      "model": "claude-sonnet-4-6",
      "input_tokens": 12,
      "output_tokens": 47,
      "status": "success",
      "error": null
    }
  ]
}

status: success | error.
License key never appears in this log.

`POST /v1/recent-calls/clear`

Clear the in-memory ring buffer + the persisted JSONL file.

Response (200): {"ok": true}

`GET /v1/settings` / `PATCH /v1/settings`

Read or update daemon settings. License key in GET responses is redacted to ••••••••<last-4-chars>.

PATCH accepts {licenseKey, workingDir, builtinToolsEnabled, ...}. Setting licenseKey triggers an immediate validation against api.inference-relay.com.

`GET /v1/debug-bundle`

Diagnostic snapshot: redacted settings, recent calls, license state, pool snapshot, daemon version. Useful for support tickets.

Response (200):

{
  "version": {...},
  "settings": {/* license redacted */},
  "license": {...},
  "pool": {...},
  "recent_calls_count": 42
}

Stub endpoints (reserved, not yet implemented)

These routes exist in the daemon but return stub responses today. Don't build on them — the shape will change when the implementation lands.

POST /v1/warmup — currently returns 503 with {"error": "TODO Days 5-7"}. Intended to let operators pre-warm the Session Pool before issuing real traffic. Use a real /v1/messages call as a warmup probe for now.
GET /v1/events — currently returns {"events": []}. Intended for a daemon-side event stream (turn events, license refresh, pool state changes).
POST /v1/events/clear — currently returns {"ok": true}. No-op stub.
POST /v1/conversations/:id/clear — currently returns {"ok": true}. No-op stub; predates POST /v1/sessions/:id/reset. Use /sessions/:id/reset for actual conversation-memory wipes.

Internal endpoints (not for SDK consumers)

These exist for the bundled MCP server to call back into the daemon during tool round-trips. Don't call them from your code.

POST /v1/internal/tool_call — staging endpoint used by the MCP server to register caller-defined tool invocations

Where to go next

Quickstart → Quickstart
SDK examples per language → SDK Integration
Session lifecycle → Sessions

API Reference

Conventions

Backend caveats (read this before integrating)

1. JSON output may include markdown code fences

2. Forced tool_choice returns 400

GET /v1/health

GET /v1/version

GET /v1/availability

GET /v1/license

POST /v1/license/refresh

POST /v1/messages

POST /v1/messages/count_tokens

POST /v1/chat/completions (v1.1.15+)

GET /v1/activity/stream (v1.1.14+)

POST /v1/sessions/:id/clear-prompt

POST /v1/sessions/:id/reset

DELETE /v1/sessions/:id

GET /v1/sessions

GET /v1/recent-calls

POST /v1/recent-calls/clear

GET /v1/settings / PATCH /v1/settings

GET /v1/debug-bundle

Stub endpoints (reserved, not yet implemented)

Internal endpoints (not for SDK consumers)

Where to go next

2. Forced `tool_choice` returns 400

`GET /v1/health`

`GET /v1/version`

`GET /v1/availability`

`GET /v1/license`

`POST /v1/license/refresh`

`POST /v1/messages`

`POST /v1/messages/count_tokens`

`POST /v1/chat/completions` (v1.1.15+)

`GET /v1/activity/stream` (v1.1.14+)

`POST /v1/sessions/:id/clear-prompt`

`POST /v1/sessions/:id/reset`

`DELETE /v1/sessions/:id`

`GET /v1/sessions`

`GET /v1/recent-calls`

`POST /v1/recent-calls/clear`

`GET /v1/settings` / `PATCH /v1/settings`

`GET /v1/debug-bundle`