Routing Engine — The Economic Balancing Guide
Default Behavior
The default routing strategy is prefer-user, which sends requests through the Native Subscription Gateway first and falls back to API providers only when the gateway is unavailable or unhealthy.
This default exists because it maximizes savings: every request that routes through the user's subscription costs the developer nothing. API providers serve as Automated Service Continuity — they absorb traffic only when the primary gateway cannot.
```typescript
const relay = new InferenceRelay({
  routing: 'prefer-user', // This is the default — shown for clarity
});
```

Gross Margin Optimization
Cost-based routing is not a feature for saving a few cents. It is a deterministic shift from 15% to 98% gross margins on AI-powered applications.
The core insight: not all inference calls carry the same economic weight. Lightweight orchestration calls (classification, triage, routing) cost fractions of a cent. Heavy execution calls (drafting, analysis, long-context reasoning) cost 10–100x more. Routing the expensive calls through user subscriptions while keeping cheap calls on your own API key is the difference between a viable business and a cash furnace.
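To make the economics concrete, here is a back-of-the-envelope sketch. The per-call prices and the 80/20 traffic mix are illustrative assumptions, not figures from any real pricing table:

```typescript
// Illustrative per-call costs (USD); hypothetical numbers for the sketch.
const CHEAP_CALL = 0.0005; // triage / tool selection
const HEAVY_CALL = 0.15;   // drafting / analysis
// Assume 100 calls per user: 80 cheap orchestration calls, 20 heavy ones.

// Without routing: the app's API key pays for everything.
const allOnAppKey = 80 * CHEAP_CALL + 20 * HEAVY_CALL;

// With cost-based routing: heavy calls go through the user's subscription.
const withRouting = 80 * CHEAP_CALL;

console.log(allOnAppKey.toFixed(2), withRouting.toFixed(2)); // prints "3.04 0.04"
```

Under these assumptions, per-user inference spend drops by roughly two orders of magnitude, which is where the margin shift comes from.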
The maxCostPerCall Threshold
```typescript
const relay = new InferenceRelay({
  routing: {
    maxCostPerCall: 0.01, // USD
  },
});
```

When the Predictive Resource Allocation engine estimates that a request will exceed the threshold, it routes to the user's subscription. When the estimate falls below the threshold, it stays on the app's API key.
Example: Real-World Split
- Triage / classification (Haiku, ~$0.0003) → App API key
- Tool selection (Haiku, ~$0.0005) → App API key
- Document analysis (Sonnet, ~$0.12) → User subscription
- Long-form drafting (Sonnet, ~$0.18) → User subscription
- Image understanding (GPT-4o, ~$0.08) → User subscription
The cheap calls stay on your key (fast, no dependency on user auth). The expensive calls route through the user's subscription (zero cost to you).
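With a 0.01 USD threshold, the split above falls out mechanically. A minimal sketch (call names and costs taken from the list; the classifier is an illustration, not the library's internal logic):

```typescript
const THRESHOLD_USD = 0.01; // maxCostPerCall from the config above

// [call name, estimated cost in USD], from the example split
const calls: Array<[string, number]> = [
  ['triage', 0.0003],
  ['tool-selection', 0.0005],
  ['document-analysis', 0.12],
  ['long-form-drafting', 0.18],
  ['image-understanding', 0.08],
];

// At or below the threshold: stay on the app key.
// Above the threshold: route through the user's subscription.
const route = (costUsd: number) =>
  costUsd > THRESHOLD_USD ? 'user-subscription' : 'app-api-key';

for (const [name, cost] of calls) {
  console.log(`${name} -> ${route(cost)}`);
}
```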
Predictive Resource Allocation
Every routing decision is made before the request is sent, not after. The estimateCost() method performs pre-flight token estimation and USD calculation.
How Estimation Works
- Input tokens — Approximate from content length (characters / 4)
- Output tokens — Use the lesser of max_tokens and a reasonable ceiling (4096)
- USD calculation — Apply the model's pricing from the built-in pricing table
```typescript
// The relay does this internally before every request:
const estimate = provider.estimateCost({
  model: 'claude-sonnet-4-20250514',
  messages: [...],
  max_tokens: 4096,
});
// estimate.totalCostUsd → 0.087
```

The estimate is intentionally conservative — it assumes near-maximum output generation. This means the routing engine will occasionally route a cheap request to the user subscription, but it will almost never route an expensive request to the app key by mistake. The bias is toward protecting margins, not toward precision.
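The estimation steps can be sketched as a standalone function. The per-million-token prices and the shape of the pricing table here are illustrative assumptions; the library's built-in table is the source of truth:

```typescript
// Hypothetical per-million-token prices (USD); not the library's real table.
const PRICING: Record<string, { inputPerMTok: number; outputPerMTok: number }> = {
  'claude-sonnet-4-20250514': { inputPerMTok: 3, outputPerMTok: 15 },
};

const OUTPUT_CEILING = 4096; // the "reasonable ceiling" from the steps above

function estimateCostUsd(model: string, content: string, maxTokens: number): number {
  const pricing = PRICING[model];
  if (!pricing) throw new Error(`no pricing entry for ${model}`);

  // Step 1: approximate input tokens from content length (characters / 4)
  const inputTokens = Math.ceil(content.length / 4);

  // Step 2: assume near-maximum output, the lesser of max_tokens and the ceiling
  const outputTokens = Math.min(maxTokens, OUTPUT_CEILING);

  // Step 3: apply the model's per-token pricing
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMTok +
    (outputTokens / 1_000_000) * pricing.outputPerMTok
  );
}
```

Because output is always assumed at its ceiling, the estimate errs high, which is exactly the margin-protecting bias described above.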
Custom Routing Functions
For full control, pass a function that receives the request parameters and the cost estimate, and returns a routing strategy.
```typescript
const relay = new InferenceRelay({
  routing: (params, estimate) => {
    // Route expensive calls through user subscription
    if (estimate.totalCostUsd > 0.01) return 'prefer-user';
    // Keep cheap calls on the app key
    return 'prefer-app';
  },
});
```

The function is called synchronously before each request. It has access to the request params (model, messages, max_tokens, tools, metadata) and the estimate (estimated input/output tokens and total USD).
Return Values
- 'prefer-user' — Try Native Gateway first, then API providers
- 'prefer-app' — Try app API providers first, then Native Gateway
- 'user-only' — Native Gateway only, fail if unavailable
- 'app-only' — API providers only, never use Native Gateway
- string (provider ID) — Route to a specific named provider
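These return values can be combined in a single function, for instance to pin certain traffic to a named provider. The metadata flag below is hypothetical, introduced only for illustration:

```typescript
type Estimate = { totalCostUsd: number };
type Params = { metadata?: { privacySensitive?: boolean } };

const chooseRoute = (params: Params, estimate: Estimate): string => {
  // Hypothetical metadata flag: keep privacy-sensitive traffic on a local provider
  if (params.metadata?.privacySensitive) return 'ollama'; // specific provider ID
  // Never let very expensive calls fall back to the app key
  if (estimate.totalCostUsd > 0.50) return 'user-only';
  // Moderately expensive calls: try the user subscription first
  if (estimate.totalCostUsd > 0.01) return 'prefer-user';
  // Cheap orchestration stays on the app key
  return 'prefer-app';
};

// Passed as: new InferenceRelay({ routing: chooseRoute })
```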
Local-First Routing
For development environments or privacy-sensitive deployments, route to Ollama whenever it is available.
```typescript
const relay = new InferenceRelay({
  routing: { preferLocal: true },
});
```

This checks the Ollama provider's health first. If Ollama is running and healthy, all requests go there. If Ollama is unavailable, the standard cascade takes over.
Condition Logic Engine
The Condition Logic Engine is a declarative rules system for environment-aware routing. It replaces imperative routing functions with a structured DSL that is easier to audit, version, and deploy across environments.
Available as a separate package:
```typescript
import { advancedRouting } from '@inference-relay/pro';
```

Rule Structure
Each rule has a match object (conditions) and a route (target). The engine evaluates rules top-to-bottom and uses first-match-wins semantics.
```typescript
const route = advancedRouting({
  rules: [
    // High-cost requests always go through user subscription
    { match: { estimatedCost: { gt: 0.50 } }, route: 'prefer-user' },
    // OpenAI models route to the OpenAI provider (pattern matching)
    { match: { model: 'gpt-*' }, route: 'openai-api' },
    // Image requests must use a multimodal-capable API provider
    { match: { hasImages: true }, route: 'anthropic-api' },
    // Everything else defaults to user subscription
    { default: 'prefer-user' },
  ],
});

const relay = new InferenceRelay({ routing: route });
```

Available Match Conditions
- model — String or pattern match against the model identifier
- estimatedCost — Numeric comparison against estimated USD (gt, lt, gte, lte)
- hasImages — Boolean: whether the request contains image content
- tokenCount — Numeric comparison against estimated total tokens
- metadata — Match against custom metadata fields
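First-match-wins evaluation over these conditions can be sketched as a plain function. This is a simplified mental model of the engine, not its actual implementation; only model patterns (trailing-`*` globs) and estimatedCost comparisons are shown:

```typescript
type Cmp = { gt?: number; lt?: number; gte?: number; lte?: number };
type Rule =
  | { match: { model?: string; estimatedCost?: Cmp; hasImages?: boolean }; route: string }
  | { default: string };

// Pattern support limited to a trailing '*', as in 'gpt-*'
const matchesModel = (pattern: string, model: string) =>
  pattern.endsWith('*') ? model.startsWith(pattern.slice(0, -1)) : model === pattern;

const matchesCmp = (cmp: Cmp, value: number) =>
  (cmp.gt === undefined || value > cmp.gt) &&
  (cmp.lt === undefined || value < cmp.lt) &&
  (cmp.gte === undefined || value >= cmp.gte) &&
  (cmp.lte === undefined || value <= cmp.lte);

function evaluate(
  rules: Rule[],
  req: { model: string; estimatedCost: number; hasImages: boolean },
): string | undefined {
  for (const rule of rules) {
    if ('default' in rule) return rule.default; // default always matches
    const { model, estimatedCost, hasImages } = rule.match;
    if (model !== undefined && !matchesModel(model, req.model)) continue;
    if (estimatedCost !== undefined && !matchesCmp(estimatedCost, req.estimatedCost)) continue;
    if (hasImages !== undefined && hasImages !== req.hasImages) continue;
    return rule.route; // first match wins
  }
  return undefined; // no rule and no default matched
}
```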
Environment-Aware Configuration
The Condition Logic Engine is particularly useful for maintaining different routing policies across environments:
```typescript
const route = advancedRouting({
  rules: process.env.NODE_ENV === 'production'
    ? [
        { match: { estimatedCost: { gt: 0.01 } }, route: 'prefer-user' },
        { default: 'prefer-app' },
      ]
    : [
        // Development: use local inference for everything possible
        { match: { model: 'claude-*' }, route: 'ollama' },
        { default: 'prefer-app' },
      ],
});
```

This keeps routing logic declarative and auditable while adapting to deployment context.
Continue reading: Fallback Engine for cascade behavior, health states, and atomic session preservation.