API Reference
Conventions
The daemon listens on 127.0.0.1:7421 by default. Override with the SR_DAEMON_PORT environment variable when launching.
All endpoints accept and return Content-Type: application/json unless noted. Errors are JSON-shaped: {"error": "<message>"} for 4xx/5xx responses.
License gating applies to /v1/messages. License validation runs in a 5-minute background loop; the per-call gate checks cached state.
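As a rough illustration of these conventions, the Python sketch below derives the daemon's base URL and pulls the message out of an error body. It rests on assumptions not stated above: that the client process sees the same SR_DAEMON_PORT value used at launch, and that the daemon speaks plain HTTP. The helper names `daemon_base_url` and `error_message` are hypothetical.

```python
import json
import os

DEFAULT_PORT = 7421  # documented default port


def daemon_base_url(env=None):
    """Return the daemon base URL, honoring SR_DAEMON_PORT when it is set.

    Assumption: the client sees the same SR_DAEMON_PORT the daemon was
    launched with, and the daemon is served over plain HTTP.
    """
    env = os.environ if env is None else env
    port = int(env.get("SR_DAEMON_PORT", DEFAULT_PORT))
    return f"http://127.0.0.1:{port}"


def error_message(status, body_text):
    """Return the "error" string of a 4xx/5xx body, or None for other statuses."""
    if status < 400:
        return None
    try:
        return json.loads(body_text).get("error")
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object: fall back to the raw text.
        return body_text
```

Passing a dict as `env` keeps the helper easy to exercise without touching the real environment.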
GET /v1/health
Daemon liveness + diagnostics. Unauthenticated.
Response (200):
{
"port": 7421,
"sessions": 0,
"status": "healthy",
"uptime": 18472
}
uptime is milliseconds since daemon start.
GET /v1/version
Daemon + environment version probe.
Response (200):
{
"ir": "1.1.11",
"platform": "macos",
"arch": "aarch64",
"claudeAvailable": true,
"claudeMissing": [],
"claudeVersion": null,
"substrate": "rust-pty",
"builtinToolsEnabled": false
}
GET /v1/availability
Probe whether the daemon can serve requests right now (i.e., the claude binary is reachable).
Response (200):
{"ok": true, "missing": []}
When claude is missing:
{"ok": false, "missing": ["claude"]}
GET /v1/license
Current license validation state. Cached; re-validated every 5 minutes in the background, or on demand via /v1/license/refresh.
Response (200):
{
"configured": true,
"license": {
"valid": true,
"tier": "solo",
"features": ["streaming","auto_patch","dashboard"],
"usageThisMonth": 42,
"usageCap": 5000,
"capExceeded": false,
"lastCapChange": 0,
"topUpGrants": 0
},
"refreshedAt": 1778952693048
}
license.valid: false indicates the daemon will refuse /v1/messages with 402 Payment Required.
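A client can read this state before sending traffic. The sketch below predicts how the per-call gate might treat /v1/messages based on the cached license response. The function names are hypothetical, and two details are assumptions: that an unconfigured license is treated like an invalid one, and that the invalid check runs before the cap check.

```python
def messages_gate(license_response):
    """Predict the gate outcome: "ok", "license_invalid", or "cap_exceeded".

    The two failure labels mirror the documented 402 error codes.
    Assumption: configured == false behaves like an invalid license.
    """
    lic = license_response.get("license") or {}
    if not license_response.get("configured") or not lic.get("valid"):
        return "license_invalid"
    if lic.get("capExceeded"):
        return "cap_exceeded"
    return "ok"


def remaining_usage(license_response):
    """Usage left under the monthly cap, never negative."""
    lic = license_response.get("license") or {}
    return max(0, lic.get("usageCap", 0) - lic.get("usageThisMonth", 0))
```

With the sample response above, `messages_gate` returns "ok" and `remaining_usage` returns 4958.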
POST /v1/license/refresh
Force a fresh license validation against api.inference-relay.com. Returns the same shape as GET /v1/license.
POST /v1/messages
The main inference endpoint. Anthropic-SDK-shape request and response.
Headers:
- Content-Type: application/json (required)
- X-IR-Session-ID: <stable-id> (optional): sticky-Session routing. Without it, the daemon mints a UUID per request, so each call gets a fresh stateless Session.
Request body:
{
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"system": "Optional system prompt",
"messages": [
{"role": "user", "content": "string OR array of content blocks"}
],
"stream": false,
"tools": [/* optional Anthropic tool definitions */]
}
Content blocks supported:
- {"type": "text", "text": "..."}
- {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
- {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": "..."}}
- {"type": "tool_use", "id": "...", "name": "...", "input": {...}}
- {"type": "tool_result", "tool_use_id": "...", "content": "..."}
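To make the request shape concrete, here is a minimal sketch that assembles the headers and body for POST /v1/messages from text and image content blocks. The helper names are hypothetical and the defaults simply reuse the sample values from this section; sending the request is left to whichever HTTP client you use.

```python
import base64
import json


def text_block(text):
    """Build a documented text content block."""
    return {"type": "text", "text": text}


def image_block(png_bytes):
    """Wrap raw PNG bytes in the documented base64 image block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.b64encode(png_bytes).decode("ascii"),
        },
    }


def build_messages_request(blocks, session_id=None,
                           model="claude-sonnet-4-6", max_tokens=1024):
    """Return (headers, body_bytes) for POST /v1/messages."""
    headers = {"Content-Type": "application/json"}
    if session_id is not None:
        # Opt in to sticky-Session routing with a stable id.
        headers["X-IR-Session-ID"] = session_id
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": blocks}],
    }
    return headers, json.dumps(body).encode("utf-8")
```

Omitting `session_id` leaves the header out, which per the Headers description gives each call a fresh stateless Session.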
Response (200):
{
"content": [{"type": "text", "text": "..."}],
"model": "claude-sonnet-4-6",
"usage": {"input_tokens": 12, "output_tokens": 47},
"stop_reason": "end_turn",
"provider": "claude-pty",
"durationMs": 3247,
"raw_transcript": "..."
}
durationMs, provider, and raw_transcript are inference-relay extensions. Standard SDKs ignore them.
stop_reason values: end_turn, tool_use, max_tokens.
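A caller typically branches on stop_reason after reading the content array. The sketch below shows one way to do that. The function names and the returned labels are hypothetical, and it assumes tool requests arrive as tool_use blocks inside content, as in the Anthropic response shape.

```python
def response_text(response):
    """Concatenate the text of every text block in a /v1/messages response."""
    return "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )


def pending_tool_calls(response):
    """Return the tool_use blocks the caller is expected to execute."""
    return [b for b in response.get("content", []) if b.get("type") == "tool_use"]


def next_step(response):
    """Map the documented stop_reason values to a coarse caller action."""
    return {
        "end_turn": "done",
        "tool_use": "run_tools",
        "max_tokens": "truncated",
    }.get(response.get("stop_reason"), "unknown")
```

A "truncated" result suggests retrying with a larger max_tokens; "run_tools" means replying with tool_result blocks for each pending call.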
Errors:
400 (invalid request body):
{"error": "invalid request body: missing field `messages`"}
402 (license invalid):
{
"error": "License invalid",
"code": "license_invalid",
"dashboardUrl": "https://inference-relay.com/dashboard"
}
402 (cap exceeded):
{
"error": "Monthly usage cap reached (5001 / 5000). Top up to continue.",
"code": "cap_exceeded",
"tier": "solo",
"used": 5001,
"cap": 5000,
"topUpUrl": "https://inference-relay.com/dashboard/billing?topup=1"
}
500 (provider acquire failed):
{
"error": "provider acquire failed: claude binary not found on PATH or known locations",
"session_id": "<request-uuid-or-header-value>"
}
Streaming note: the stream: true request field is parsed but not implemented in v1.1; the daemon returns a buffered JSON response regardless. Native SSE is on the v1.2 roadmap.
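The documented error shapes lend themselves to a small dispatcher. The following is a sketch only: `classify_error` and its action labels are hypothetical, and it covers just the four error cases listed above.

```python
def classify_error(status, body):
    """Map a /v1/messages failure to (action, url).

    `body` is the parsed JSON error object. `url` is the dashboard or
    top-up link when the daemon supplies one, otherwise None.
    """
    code = body.get("code")
    if status == 400:
        return "fix_request", None
    if status == 402 and code == "cap_exceeded":
        return "top_up", body.get("topUpUrl")
    if status == 402:
        return "fix_license", body.get("dashboardUrl")
    if status >= 500:
        # For example, the claude binary could not be found.
        return "check_install", None
    return "unknown", None
```

Checking `code` before falling back to the generic 402 branch keeps the two Payment Required cases distinct.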
Supported model strings (recommended):
- claude-sonnet-4-6
- claude-opus-4-6
- claude-haiku-4-5
The daemon's cost-routing uses substring matching on opus/sonnet/haiku, so versioned aliases (e.g., claude-sonnet-4-20250514) also work — they just inherit the generic sonnet cost band.
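The substring rule can be pictured with a short sketch. This is not the daemon's code: `cost_band` is a hypothetical name, and both the case-insensitive comparison and the order in which the three families are checked are assumptions.

```python
def cost_band(model):
    """Return "opus", "sonnet", or "haiku" for a model string, else None.

    Assumptions: matching is case-insensitive and families are checked
    in the order opus, sonnet, haiku.
    """
    lowered = model.lower()
    for family in ("opus", "sonnet", "haiku"):
        if family in lowered:
            return family
    return None
```

Under this rule, claude-sonnet-4-20250514 and claude-sonnet-4-6 land in the same band, which is why versioned aliases inherit the generic cost band.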
POST /v1/messages/count_tokens
Pre-flight token estimate for a request body. Same shape as /v1/messages but no model call happens — the daemon estimates input tokens via a 4-bytes-per-token approximation.
Response (200):
{"input_tokens": 142}
Use for budget gating before submitting expensive calls. Accuracy is within ~10% of true tokenization for typical English; outliers exist for non-Latin scripts and dense code.
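The 4-bytes-per-token approximation is easy to reproduce client-side. The sketch below is an assumption-laden illustration: it counts UTF-8 bytes of a single string and rounds up, whereas the daemon's exact rounding and which parts of the request body it counts are not specified here. Both function names are hypothetical.

```python
import math


def estimate_tokens(text):
    """Approximate token count as UTF-8 byte length divided by 4, rounded up."""
    return math.ceil(len(text.encode("utf-8")) / 4)


def within_budget(text, max_input_tokens):
    """Budget gate: True when the estimate fits under the caller's limit."""
    return estimate_tokens(text) <= max_input_tokens
```

Because the estimate works on bytes, a three-character string such as "日本語" occupies nine UTF-8 bytes and counts as three tokens under this rule, which illustrates why non-Latin scripts are outliers.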
POST /v1/chat/completions (v1.1.15+)
OpenAI Chat Completions inbound shape. Accepts the OpenAI request body (string-or-array content; system/user/assistant/tool messages; tools; tool_choice), translates internally to the Anthropic-shape pipeline, returns an OpenAI chat.completion (or SSE chunks when stream: true). Same license gate, same cascade, same sticky-session header (X-IR-Session-ID) as /v1/messages.
Request:
POST /v1/chat/completions
Content-Type: application/json
{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "Be terse."},
{"role": "user", "content": "Reply with: OK"}
],
"max_tokens": 30,
"stream": false,
"tools": [...]
}
Response (200):
{
"id": "chatcmpl-<uuid>",
"object": "chat.completion",
"created": 1779094486,
"model": "gpt-4o",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "OK",
"tool_calls": [...] // present when claude calls a tool
},
"finish_reason": "stop" | "length" | "tool_calls"
}],
"usage": {"prompt_tokens": 19, "completion_tokens": 2, "total_tokens": 21}
}
Streaming (stream: true) emits the standard OpenAI SSE shape: each chunk is data: {"choices": [{"delta": {...}}]}, terminated by data: [DONE]. Tool-call arguments stream across multiple delta.tool_calls[N].function.arguments chunks, matching the OpenAI wire spec.
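Reassembling a streamed reply means concatenating the content deltas and, per tool call, the argument fragments. The sketch below folds already-received SSE lines into a final result. `accumulate_stream` is a hypothetical helper, and it assumes each tool-call delta carries the `index` field defined by the OpenAI wire format.

```python
import json


def accumulate_stream(lines):
    """Fold OpenAI-style SSE lines into (content, tool_calls).

    tool_calls maps each tool-call index to a dict with "name" and
    "arguments", the latter built by concatenating streamed fragments.
    """
    content = []
    tools = {}
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content.append(delta["content"])
            for call in delta.get("tool_calls") or []:
                slot = tools.setdefault(call.get("index", 0),
                                        {"name": "", "arguments": ""})
                fn = call.get("function", {})
                slot["name"] += fn.get("name") or ""
                slot["arguments"] += fn.get("arguments") or ""
    return "".join(content), tools
```

The accumulated `arguments` string is only valid JSON once the stream has finished, so parse it after the loop rather than per chunk.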
Vision (v1.1.16+) — image_url blocks accepted with data:image/<type>;base64,... URLs. Remote http(s) URLs are rejected (would add an SSRF surface). Legacy function_call field is not supported (use tools); audio modalities are not supported.
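Since only data: URLs are accepted, images have to be inlined before sending. The sketch below builds such a block and pre-checks a URL client-side. The helper names are hypothetical, the block layout assumes the standard OpenAI `image_url` content part, and the check is a rough approximation of the rule described above rather than the daemon's actual validation.

```python
import base64


def image_url_block(image_bytes, subtype="png"):
    """Build an OpenAI-style image_url content part with an inline data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/{subtype};base64,{encoded}"},
    }


def looks_accepted(url):
    """Rough client-side check: base64 data:image URLs only, no http(s)."""
    return url.startswith("data:image/") and ";base64," in url
```

Running `looks_accepted` before submitting avoids a round trip for remote URLs that would be rejected.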
GET /v1/activity/stream (v1.1.14+)
Server-Sent Events. Each successful or errored /v1/messages (or /v1/chat/completions) call emits one call event carrying the persisted CallRecord shape from /v1/recent-calls. Used by the dashboard's live Recent Activity panel; SDK consumers can also subscribe for real-time observability.
Event types:
- event: call. The data: field is a JSON CallRecord (id, sessionId, status, model, inputTokens, outputTokens, costAvoidedUsd, promptPreview, rawTranscript, error, etc.).
- event: lagged. The data: field is {"dropped": N}. Broadcast channel capacity is 256; consumers that fall behind get this marker and should refetch /v1/recent-calls to reconcile.
Keep-alive comments fire every 15 s so reverse proxies don't idle-close the connection.
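A subscriber needs to separate call events, lagged markers, and keep-alive comments. The sketch below parses an iterable of SSE lines following the general Server-Sent Events framing (blank line ends an event, lines starting with a colon are comments). The function names and callbacks are hypothetical, and the exact text of the daemon's keep-alive comment is not assumed.

```python
import json


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the buffered event, if any.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, e.g. a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


def handle_activity(lines, on_call, on_lagged):
    """Route call records and lagged markers to the given callbacks."""
    for event, data in parse_sse(lines):
        if event == "call":
            on_call(json.loads(data))
        elif event == "lagged":
            on_lagged(json.loads(data)["dropped"])
```

An `on_lagged` callback is the natural place to refetch /v1/recent-calls and reconcile local state.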
POST /v1/sessions/:id/clear-prompt
Clear the prompt input buffer + attachment state for a sticky Session WITHOUT wiping conversation memory. Use between programmatic calls when you want to guarantee a clean composition state.
Response (200):
{
"ok": true,
"session_id": "planner-1",
"cleared": {"prompt": true, "staged": 0, "attachments": 0}
}
If the Session id doesn't exist:
{"ok": true, "cleared": {...}, "note": "no active session for this id"}
POST /v1/sessions/:id/reset
Hard reset: drop the existing Session's PTY and bind a fresh pre-warmed replacement from the Pool. Wipes conversation memory.
Response (200):
{"ok": true, "session_id": "planner-1", "latency_ms": 12}
Latency is ~10 ms in the happy path (warm pool grab) and up to ~2 s if the pool is empty (synchronous spawn).
DELETE /v1/sessions/:id
Drop the Session entirely. The next call with that id spawns a fresh Session from the Pool.
Response (200):
{"ok": true, "session_id": "planner-1"}
Response (404):
{"ok": false, "session_id": "unknown-id"}
GET /v1/sessions
Pool snapshot.
Response (200):
{
"sessions": [],
"pool": {"idle": 2, "active": 1, "spawning": 0}
}
- idle = pre-warmed PTYs.
- active = currently bound to a sticky session id.
- spawning = background replenisher in flight.
- sessions[] is intentionally empty; sticky-Session ids are not exposed by the public endpoint.
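The session routes and the pool snapshot can be combined in a small client helper. This is a sketch under assumptions: the names are hypothetical, percent-encoding the id is a defensive choice rather than a documented requirement, and the warm/cold prediction simply applies the reset latency description (warm pool grab versus synchronous spawn) to the idle count.

```python
from urllib.parse import quote


def session_request(action, session_id):
    """Return (method, path) for a documented session operation."""
    sid = quote(session_id, safe="")  # assumption: ids are percent-encoded
    routes = {
        "clear-prompt": ("POST", f"/v1/sessions/{sid}/clear-prompt"),
        "reset": ("POST", f"/v1/sessions/{sid}/reset"),
        "delete": ("DELETE", f"/v1/sessions/{sid}"),
    }
    return routes[action]


def expected_reset(snapshot):
    """Guess whether a reset would hit a warm PTY or trigger a spawn."""
    return "warm" if snapshot["pool"]["idle"] > 0 else "cold"
```

Calling `expected_reset` on a fresh GET /v1/sessions response gives a hint about whether the next reset is likely to take milliseconds or seconds.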
GET /v1/recent-calls
Returns every call record in the in-memory ring since daemon start (or since the last POST /v1/recent-calls/clear). The in-memory list is currently unbounded — restart the daemon or clear via the endpoint to reclaim memory if it grows on long-running installs.
The persisted file at ~/.inference-relay/recent-calls.jsonl rotates at approximately 1,000 entries via a byte-threshold trigger (~1.5 MB).
Response (200):
{
"calls": [
{
"id": "uuid",
"session_id": "uuid",
"started_at": 1778952693048,
"completed_at": 1778952696295,
"model": "claude-sonnet-4-6",
"input_tokens": 12,
"output_tokens": 47,
"status": "success",
"error": null
}
]
}
status: success | error.
License key never appears in this log.
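For quick observability, the call list can be reduced to a few aggregates. The sketch below is illustrative: `summarize_calls` is a hypothetical name, and it assumes started_at and completed_at are epoch milliseconds, which the sample values suggest but the text does not state.

```python
def summarize_calls(payload):
    """Aggregate a GET /v1/recent-calls response into simple totals."""
    calls = payload.get("calls", [])
    durations = [
        c["completed_at"] - c["started_at"]
        for c in calls
        if c.get("completed_at") is not None and c.get("started_at") is not None
    ]
    return {
        "total": len(calls),
        "errors": sum(1 for c in calls if c.get("status") == "error"),
        "input_tokens": sum(c.get("input_tokens") or 0 for c in calls),
        "output_tokens": sum(c.get("output_tokens") or 0 for c in calls),
        # Assumption: timestamps are epoch milliseconds.
        "mean_duration_ms": sum(durations) / len(durations) if durations else None,
    }
```

For the sample record above, the duration works out to 3247 ms.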
POST /v1/recent-calls/clear
Clear the in-memory ring buffer + the persisted JSONL file.
Response (200): {"ok": true}
GET /v1/settings / PATCH /v1/settings
Read or update daemon settings. License key in GET responses is redacted to ••••••••<last-4-chars>.
PATCH accepts {licenseKey, workingDir, builtinToolsEnabled, ...}. Setting licenseKey triggers an immediate validation against api.inference-relay.com.
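One practical consequence of redaction is that a settings object read via GET should not be written back verbatim. The sketch below mirrors the documented mask format and strips a masked licenseKey from a PATCH body. All names are hypothetical, and how the daemon would react to receiving the masked value is not documented, so dropping it is a cautious assumption.

```python
MASK = "\u2022" * 8  # eight bullet characters, as in the GET response


def redact_license_key(key):
    """Mirror the documented redaction: mask plus the last four characters."""
    return MASK + key[-4:]


def is_redacted(value):
    """True when a value looks like the masked form returned by GET."""
    return value.startswith(MASK)


def settings_patch(**changes):
    """Build a PATCH /v1/settings body, dropping a masked licenseKey echo."""
    body = dict(changes)
    if is_redacted(body.get("licenseKey", "")):
        del body["licenseKey"]
    return body
```

Only a real, unmasked licenseKey is passed through, which is the case that triggers validation.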
GET /v1/debug-bundle
Diagnostic snapshot: redacted settings, recent calls, license state, pool snapshot, daemon version. Useful for support tickets.
Response (200):
{
"version": {...},
"settings": {/* license redacted */},
"license": {...},
"pool": {...},
"recent_calls_count": 42
}
Stub endpoints (reserved, not yet implemented)
These routes exist in the daemon but return stub responses today. Don't build on them — the shape will change when the implementation lands.
- POST /v1/warmup: currently returns 503 with {"error": "TODO Days 5-7"}. Intended to let operators pre-warm the Session Pool before issuing real traffic. Use a real /v1/messages call as a warmup probe for now.
- GET /v1/events: currently returns {"events": []}. Intended for a daemon-side event stream (turn events, license refresh, pool state changes).
- POST /v1/events/clear: currently returns {"ok": true}. No-op stub.
- POST /v1/conversations/:id/clear: currently returns {"ok": true}. No-op stub; predates POST /v1/sessions/:id/reset. Use POST /v1/sessions/:id/reset for actual conversation-memory wipes.
Internal endpoints (not for SDK consumers)
These exist for the bundled MCP server to call back into the daemon during tool round-trips. Don't call them from your code.
- POST /v1/internal/tool_call: staging endpoint used by the MCP server to register caller-defined tool invocations.
Where to go next
- Quickstart → Quickstart
- SDK examples per language → SDK Integration
- Session lifecycle → Sessions