Routing Engine — The Economic Balancing Guide
Default Behavior
The default routing strategy is prefer-user, which sends requests through the Native Subscription Gateway first and falls back to API providers only when the gateway is unavailable or unhealthy.
This default exists because it maximizes savings: every request that routes through the user's subscription costs the developer nothing. API providers serve as Automated Service Continuity — they absorb traffic only when the primary gateway cannot.
```typescript
const relay = new InferenceRelay({
  routing: 'prefer-user', // This is the default — shown for clarity
});
```

Gross Margin Optimization
Cost-based routing is not a feature for saving a few cents. It is a deterministic shift from 15% to 98% gross margins on AI-powered applications.
The core insight: not all inference calls carry the same economic weight. Lightweight orchestration calls (classification, triage, routing) cost fractions of a cent. Heavy execution calls (drafting, analysis, long-context reasoning) cost 10–100x more. Routing the expensive calls through user subscriptions while keeping cheap calls on your own API key is the difference between a viable business and a cash furnace.
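To make the economics concrete, here is a back-of-the-envelope sketch. The per-call prices and the 80/20 traffic mix are illustrative assumptions, not figures from any real pricing table:

```typescript
// Illustrative per-call costs (USD); hypothetical numbers for the sketch.
const CHEAP_CALL = 0.0005; // triage / tool selection
const HEAVY_CALL = 0.15;   // drafting / analysis
// Assume 100 calls per user: 80 cheap orchestration calls, 20 heavy ones.

// Without routing: the app's API key pays for everything.
const allOnAppKey = 80 * CHEAP_CALL + 20 * HEAVY_CALL;

// With cost-based routing: heavy calls go through the user's subscription.
const withRouting = 80 * CHEAP_CALL;

console.log(allOnAppKey.toFixed(2), withRouting.toFixed(2)); // prints "3.04 0.04"
```

Under these assumptions, per-user inference spend drops by roughly two orders of magnitude, which is where the margin shift comes from.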
The maxCostPerCall Threshold
```typescript
const relay = new InferenceRelay({
  routing: {
    maxCostPerCall: 0.01, // USD
  },
});
```

When the Predictive Resource Allocation engine estimates that a request will exceed the threshold, it routes to the user's subscription. When the estimate falls below the threshold, it stays on the app's API key.
Example: Real-World Split
- Triage / classification (Haiku, ~$0.0003) → App API key
- Tool selection (Haiku, ~$0.0005) → App API key
- Document analysis (Sonnet, ~$0.12) → User subscription
- Long-form drafting (Sonnet, ~$0.18) → User subscription
- Image understanding (GPT-4o, ~$0.08) → User subscription
The cheap calls stay on your key (fast, no dependency on user auth). The expensive calls route through the user's subscription (zero cost to you).
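With a 0.01 USD threshold, the split above falls out mechanically. A minimal sketch (call names and costs taken from the list; the classifier is an illustration, not the library's internal logic):

```typescript
const THRESHOLD_USD = 0.01; // maxCostPerCall from the config above

// [call name, estimated cost in USD], from the example split
const calls: Array<[string, number]> = [
  ['triage', 0.0003],
  ['tool-selection', 0.0005],
  ['document-analysis', 0.12],
  ['long-form-drafting', 0.18],
  ['image-understanding', 0.08],
];

// At or below the threshold: stay on the app key.
// Above the threshold: route through the user's subscription.
const route = (costUsd: number) =>
  costUsd > THRESHOLD_USD ? 'user-subscription' : 'app-api-key';

for (const [name, cost] of calls) {
  console.log(`${name} -> ${route(cost)}`);
}
```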
Predictive Resource Allocation
Every routing decision is made before the request is sent, not after. The estimateCost() method performs pre-flight token estimation and USD calculation.
How Estimation Works
- Input tokens — Approximate from content length (characters / 4)
- Output tokens — Use the lesser of max_tokens and a reasonable ceiling (4096)
- USD calculation — Apply the model's pricing from the built-in pricing table
```typescript
// The relay does this internally before every request:
const estimate = provider.estimateCost({
  model: 'claude-sonnet-4-20250514',
  messages: [...],
  max_tokens: 4096,
});
// estimate.totalCostUsd → 0.087
```

The estimate is intentionally conservative — it assumes near-maximum output generation. This means the routing engine will occasionally route a cheap request to the user subscription, but it will almost never route an expensive request to the app key by mistake. The bias is toward protecting margins, not toward precision.
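The estimation steps can be sketched as a standalone function. The per-million-token prices and the shape of the pricing table here are illustrative assumptions; the library's built-in table is the source of truth:

```typescript
// Hypothetical per-million-token prices (USD); not the library's real table.
const PRICING: Record<string, { inputPerMTok: number; outputPerMTok: number }> = {
  'claude-sonnet-4-20250514': { inputPerMTok: 3, outputPerMTok: 15 },
};

const OUTPUT_CEILING = 4096; // the "reasonable ceiling" from the steps above

function estimateCostUsd(model: string, content: string, maxTokens: number): number {
  const pricing = PRICING[model];
  if (!pricing) throw new Error(`no pricing entry for ${model}`);

  // Step 1: approximate input tokens from content length (characters / 4)
  const inputTokens = Math.ceil(content.length / 4);

  // Step 2: assume near-maximum output, the lesser of max_tokens and the ceiling
  const outputTokens = Math.min(maxTokens, OUTPUT_CEILING);

  // Step 3: apply the model's per-token pricing
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMTok +
    (outputTokens / 1_000_000) * pricing.outputPerMTok
  );
}
```

Because output is always assumed at its ceiling, the estimate errs high, which is exactly the margin-protecting bias described above.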
Custom Routing Functions
For full control, pass a function that receives the request parameters and the cost estimate, and returns a routing strategy.
```typescript
const relay = new InferenceRelay({
  routing: (params, estimate) => {
    // Route expensive calls through user subscription
    if (estimate.totalCostUsd > 0.01) return 'prefer-user';
    // Keep cheap calls on the app key
    return 'prefer-app';
  },
});
```

The function is called synchronously before each request. It has access to the request params (model, messages, max_tokens, tools, metadata) and the estimate (estimated input/output tokens and total USD).
Return Values
- 'prefer-user' — Try Native Gateway first, then API providers
- 'prefer-app' — Try app API providers first, then Native Gateway
- 'user-only' — Native Gateway only, fail if unavailable
- 'app-only' — API providers only, never use Native Gateway
- string (provider ID) — Route to a specific named provider
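These return values can be combined in a single function, for instance to pin certain traffic to a named provider. The metadata flag below is hypothetical, introduced only for illustration:

```typescript
type Estimate = { totalCostUsd: number };
type Params = { metadata?: { privacySensitive?: boolean } };

const chooseRoute = (params: Params, estimate: Estimate): string => {
  // Hypothetical metadata flag: keep privacy-sensitive traffic on a local provider
  if (params.metadata?.privacySensitive) return 'ollama'; // specific provider ID
  // Never let very expensive calls fall back to the app key
  if (estimate.totalCostUsd > 0.50) return 'user-only';
  // Moderately expensive calls: try the user subscription first
  if (estimate.totalCostUsd > 0.01) return 'prefer-user';
  // Cheap orchestration stays on the app key
  return 'prefer-app';
};

// Passed as: new InferenceRelay({ routing: chooseRoute })
```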
Local-First Routing
For development environments or privacy-sensitive deployments, route to Ollama whenever it is available.
```typescript
const relay = new InferenceRelay({
  routing: { preferLocal: true },
});
```

This checks the Ollama provider's health first. If Ollama is running and healthy, all requests go there. If Ollama is unavailable, the standard cascade takes over.
Condition Logic Engine
The Condition Logic Engine is a declarative rules system for environment-aware routing. It replaces imperative routing functions with a structured DSL that is easier to audit, version, and deploy across environments.
Available as a separate package:
```typescript
import { advancedRouting } from '@inference-relay/pro';
```

Rule Structure
Each rule has a match object (conditions) and a route (target). The engine evaluates rules top-to-bottom and uses first-match-wins semantics.
```typescript
const route = advancedRouting({
  rules: [
    // High-cost requests always go through user subscription
    { match: { estimatedCost: { gt: 0.50 } }, route: 'prefer-user' },
    // OpenAI models route to the OpenAI provider (pattern matching)
    { match: { model: 'gpt-*' }, route: 'openai-api' },
    // Image requests must use a multimodal-capable API provider
    { match: { hasImages: true }, route: 'anthropic-api' },
    // Everything else defaults to user subscription
    { default: 'prefer-user' },
  ],
});

const relay = new InferenceRelay({ routing: route });
```

Available Match Conditions
- model — String or pattern match against the model identifier
- estimatedCost — Numeric comparison against estimated USD (gt, lt, gte, lte)
- hasImages — Boolean: whether the request contains image content
- tokenCount — Numeric comparison against estimated total tokens
- metadata — Match against custom metadata fields
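First-match-wins evaluation over these conditions can be sketched as a plain function. This is a simplified mental model of the engine, not its actual implementation; only model patterns (trailing-`*` globs) and estimatedCost comparisons are shown:

```typescript
type Cmp = { gt?: number; lt?: number; gte?: number; lte?: number };
type Rule =
  | { match: { model?: string; estimatedCost?: Cmp; hasImages?: boolean }; route: string }
  | { default: string };

// Pattern support limited to a trailing '*', as in 'gpt-*'
const matchesModel = (pattern: string, model: string) =>
  pattern.endsWith('*') ? model.startsWith(pattern.slice(0, -1)) : model === pattern;

const matchesCmp = (cmp: Cmp, value: number) =>
  (cmp.gt === undefined || value > cmp.gt) &&
  (cmp.lt === undefined || value < cmp.lt) &&
  (cmp.gte === undefined || value >= cmp.gte) &&
  (cmp.lte === undefined || value <= cmp.lte);

function evaluate(
  rules: Rule[],
  req: { model: string; estimatedCost: number; hasImages: boolean },
): string | undefined {
  for (const rule of rules) {
    if ('default' in rule) return rule.default; // default always matches
    const { model, estimatedCost, hasImages } = rule.match;
    if (model !== undefined && !matchesModel(model, req.model)) continue;
    if (estimatedCost !== undefined && !matchesCmp(estimatedCost, req.estimatedCost)) continue;
    if (hasImages !== undefined && hasImages !== req.hasImages) continue;
    return rule.route; // first match wins
  }
  return undefined; // no rule and no default matched
}
```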
Environment-Aware Configuration
The Condition Logic Engine is particularly useful for maintaining different routing policies across environments:
```typescript
const route = advancedRouting({
  rules: process.env.NODE_ENV === 'production'
    ? [
        { match: { estimatedCost: { gt: 0.01 } }, route: 'prefer-user' },
        { default: 'prefer-app' },
      ]
    : [
        // Development: use local inference for everything possible
        { match: { model: 'claude-*' }, route: 'ollama' },
        { default: 'prefer-app' },
      ],
});
```

This keeps routing logic declarative and auditable while adapting to deployment context.
Continue reading: Fallback Engine for cascade behavior, health states, and atomic session preservation.