# Smart routing: picking the cheapest model that can still do the job
Most agent turns don't need Opus. The problem is figuring out which ones do — at runtime, before you've spent the money. Here's how Tutti's SmartProvider does it, what we got wrong on the first try, and where it still falls short.
If you build a real agent on top of a frontier model, the first thing that surprises you is the bill. The second thing is that most of it is waste.
A typical multi-agent run spends 80% of its turns on tasks that a small model would handle perfectly: classify a user's intent, format a list of tools, summarise a search result, decide whether the loop is done. The remaining 20% — code generation, multi-step reasoning, novel problem-solving — is where you actually need the smart model. But because there's no way to tell which turn is which until *after* you've made the call, every turn pays Opus prices, and your bill is 5× what it needs to be.
Tutti v0.23 ships SmartProvider — a meta-provider that classifies each turn and dispatches it to whichever tier can handle it. This post is the design story: what we tried, what didn't work, and the trade-offs we settled on.
## The core idea
SmartProvider sits where any other provider would in your score file:
```ts
import { SmartProvider, AnthropicProvider } from '@tuttiai/core'

provider: new SmartProvider({
  tiers: [
    { tier: 'small',  provider: new AnthropicProvider(), model: 'claude-haiku-4-5-20251001' },
    { tier: 'medium', provider: new AnthropicProvider(), model: 'claude-sonnet-4-6' },
    { tier: 'large',  provider: new AnthropicProvider(), model: 'claude-opus-4-7' },
  ],
  classifier: 'heuristic',
  policy: 'cost-optimised',
})
```
Three tiers, one classifier, one policy. For every turn the agent loop is about to make, SmartProvider runs the classifier first, picks a tier, and forwards the call to the underlying provider. From the agent's perspective nothing has changed — it's just a faster, cheaper turn.
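To make that flow concrete, here's a minimal sketch of the per-turn dispatch. It's the shape, not the shipped code: `TierConfig`, `routeTurn`, and `recordDecision` are names assumed for illustration.

```ts
// Sketch only: the shipped SmartProvider internals may differ.
type Tier = 'small' | 'medium' | 'large'

interface TierConfig {
  tier: Tier
  model: string
  chat: (input: string, model: string) => Promise<string>
}

async function routeTurn(
  tiers: TierConfig[],
  classify: (input: string) => Tier, // heuristic or LLM, per config
  recordDecision: (d: { tier: Tier; model: string }) => void, // assumed tracing hook
  input: string,
): Promise<string> {
  const tier = classify(input)                      // 1. classify the turn
  const target = tiers.find(t => t.tier === tier)!  // 2. pick the matching tier
  recordDecision({ tier, model: target.model })     // 3. leave an auditable span
  return target.chat(input, target.model)           // 4. forward the call
}
```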
## Classifier strategy 1: free heuristics
The first version was an LLM classifier. We sent each turn to Haiku with a prompt like *"How hard is this request — easy / medium / hard?"* and dispatched on the answer. It worked. It also doubled the latency on every cheap turn we were trying to optimise. Spending Haiku-cost on the classifier to save Sonnet-cost on the dispatch was negative-EV at small request sizes.
So we built a heuristic classifier instead. Zero cost. Pure regex + counting:
```ts
type Tier = 'small' | 'medium' | 'large'
interface Tool { destructive?: boolean }

function classify(input: string, tools: Tool[]): Tier {
  if (input.length < 200 && tools.length === 0) return 'small'
  if (/code|implement|refactor|debug/i.test(input)) return 'large'
  if (tools.some(t => t.destructive)) return 'large' // safety bias
  return 'medium'
}
```

That's most of it, sketched. The full implementation has more rules — tool-count thresholds, length bands, a destructive-tool premium — but the shape is the same. It costs nothing to run, adds no latency, and its decisions are auditable in plain code.
The heuristic gets ~85% of dispatches right on our internal benchmark (a workload of ~500 traces from the dogfood agent). The 15% it gets wrong are mostly "looked simple but wasn't" — short prompts that hide a complex multi-step task. Those failures are cheap (we tried Haiku, it produced something weak, and the next turn corrected it) but they exist.
## Classifier strategy 2: LLM, when you want it
For workloads where a 15% misroute rate isn't acceptable, the LLM classifier is still available — you just opt in with `classifier: 'llm'`. Internally it asks a small, cheap model (Haiku by default, or any provider you configure via `classifier_provider`) for a one-word difficulty label per turn, then dispatches on the answer.
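In sketch form, opting in looks something like this. The `classifier: 'llm'` and `classifier_provider` fields are the ones described above, and the tiers mirror the earlier config; the exact value `classifier_provider` accepts is an assumption here.

```ts
provider: new SmartProvider({
  tiers: [
    { tier: 'small',  provider: new AnthropicProvider(), model: 'claude-haiku-4-5-20251001' },
    { tier: 'medium', provider: new AnthropicProvider(), model: 'claude-sonnet-4-6' },
    { tier: 'large',  provider: new AnthropicProvider(), model: 'claude-opus-4-7' },
  ],
  classifier: 'llm',                            // opt in to the LLM classifier
  classifier_provider: new AnthropicProvider(), // Haiku by default if omitted
  policy: 'cost-optimised',
})
```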
This is the right choice if your turns are heterogeneous and high-stakes — say, customer support where a wrong dispatch is a wrong answer. It's the wrong choice if your turns are mostly cheap and you're trying to *save* money: you'd be spending Haiku-cost on a classifier that picks Haiku 80% of the time.
Pick the classifier that matches your workload's actual cost-per-error.
## The destructive-tool premium
Here's the rule we wrestled with the longest: when an agent has access to a destructive tool — `post_tweet`, `create_refund`, a raw SQL `execute` — should the router be more or less aggressive about routing to a small model?
The intuition is "more aggressive — small models are usually fine." That's wrong. A small model is more likely to misread a request, more likely to hallucinate a tool argument, more likely to call a tool the user didn't ask for. When the worst case of a wrong call is "the answer was slightly off," routing aggressively is fine. When the worst case is "we tweeted something we shouldn't have" or "we voided an invoice," routing aggressively is reckless.
So the heuristic classifier carries a *destructive-tool premium*: if any of the loaded tools are marked `destructive: true` (the same flag the runtime uses for HITL gating — see the [previous post](/blog/hitl-by-default-not-opt-in)), the classifier biases toward larger tiers. The exact bias is configurable per policy, but the default is: agents with destructive tools start one tier higher than they would without them.
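As a sketch, the default premium is a one-step bump up the tier ladder. This is an assumed shape for illustration, not the shipped rule set:

```ts
// Assumed sketch of the default destructive-tool premium:
// start one tier higher when any loaded tool is marked destructive.
const LADDER = ['small', 'medium', 'large'] as const
type Tier = (typeof LADDER)[number]
interface Tool { destructive?: boolean }

function withDestructivePremium(base: Tier, tools: Tool[]): Tier {
  if (!tools.some(t => t.destructive)) return base
  const bumped = LADDER.indexOf(base) + 1
  return LADDER[Math.min(bumped, LADDER.length - 1)]
}
```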
This is the kind of safety-aware decision that's only possible because Tutti's voices declare destructiveness as runtime metadata. A framework where tools are unannotated functions can't do this without per-call wiring.
## Budget-aware downgrade
The other rule that earned its place: if a planned call would push the run past its `max_cost_usd` cap, the router downgrades to the small tier instead of letting the budget enforcer throw post-hoc.
```ts
agents: {
  triage: {
    name: 'triage',
    model: 'auto', // ← per-agent opt-in to routing
    voices: [],
    budget: { max_cost_usd: 0.50 },
  },
}
```

Without budget-aware routing: the agent runs five turns on Sonnet, the sixth turn gets classified as 'large' and goes to Opus, that one turn pushes the run past 50¢, `BudgetExceededError` throws, the user sees a partial result. With budget-aware routing: the sixth turn is classified as 'large', SmartProvider checks `TokenBudget.canAfford()` against the projected cost, sees the cap would be breached, and forces the call onto 'small' with `reason: 'budget-forced'`. The user gets a complete answer, slightly worse on that one turn, with a routing-decision span explaining why.
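A minimal sketch of that decision, assuming the `TokenBudget.canAfford()` check mentioned above; the surrounding signatures are illustrative.

```ts
// Illustrative sketch of the budget-forced downgrade; names assumed
// except TokenBudget.canAfford(), which is described above.
type Tier = 'small' | 'medium' | 'large'
interface TokenBudget { canAfford(projectedCostUsd: number): boolean }

function finalTier(
  planned: Tier,
  budget: TokenBudget,
  projectedCostUsd: number,
): { tier: Tier; reason: 'classifier' | 'budget-forced' } {
  if (!budget.canAfford(projectedCostUsd)) {
    // Downgrade rather than let the budget enforcer throw post-hoc.
    return { tier: 'small', reason: 'budget-forced' }
  }
  return { tier: planned, reason: 'classifier' }
}
```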
This is the right default. Falling back to a smaller model is almost always better than failing the run.
## Fallbacks for everything
The final piece is the fallback chain. When the chosen tier's `chat` throws — Anthropic 529, OpenAI rate-limit, network blip — SmartProvider retries on the configured `fallback` tier, emits a `router:fallback` event, and records a second `RoutingDecision` with `reason: 'fallback after error: …'` so the trace shows both the primary attempt and the eventual dispatch.
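In sketch form, with the `router:fallback` event and reason string from above and everything else assumed:

```ts
// Assumed sketch of the fallback path; retry/backoff details omitted.
interface TierConfig {
  tier: string
  model: string
  chat: (input: string) => Promise<string>
}

async function chatWithFallback(
  primary: TierConfig,
  fallback: TierConfig,
  emit: (event: string, data: object) => void, // assumed event hook
  recordDecision: (d: { tier: string; reason: string }) => void,
  input: string,
): Promise<string> {
  try {
    return await primary.chat(input)
  } catch (err) {
    emit('router:fallback', { from: primary.model, to: fallback.model })
    recordDecision({ tier: fallback.tier, reason: `fallback after error: ${err}` })
    return fallback.chat(input)
  }
}
```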
Streaming has no fallback because chunks may already have been yielded to the caller — switching providers mid-stream would produce a corrupt response. We document this limitation rather than work around it badly.
## What you can audit
Every routing decision is a span. `tutti-ai traces router` prints them for a trace:

```
Routing decisions for trace 4b7e9c2f

● small   heuristic  claude-haiku-4-5   $0.0008  short input, no tools
● small   heuristic  claude-haiku-4-5   $0.0011  short input, no tools
● large   heuristic  claude-opus-4-7    $0.0241  /code|implement/
● small   heuristic  claude-haiku-4-5   $0.0009  short input, no tools
↩ fallback after error: 529 — claude-opus-4-7 → claude-sonnet-4-6
● medium  heuristic  claude-sonnet-4-6  $0.0048  budget-forced

Total: 6 decisions · $0.0317 routed
```
If you can't explain why a routing decision happened, the router didn't earn its place in the runtime. The audit surface is non-negotiable.
## What this saves
On the dogfood agent — a TypeScript coding assistant that reads files, writes code, runs tests — switching from Opus-everywhere to SmartProvider with the cost-optimised policy cut total cost by 62% with no measurable drop in pass-rate on our golden eval set. About 70% of turns route to Haiku (small), 22% to Sonnet (medium), 8% to Opus (large). The Opus turns are the actual code generation; everything else — file-listing, test-running, summarising — turns out to be Haiku-shaped.
Your numbers will differ. The right thing to do is flip it on, run `tutti-ai analyze costs` after a few days to see what the classifier's actually picking, and tune from there.
## Where it falls short
A few honest weak spots worth flagging:
- Embedding classifier is a placeholder. The 0.23 release lists it as a strategy but the underlying logic is a stub. We're collecting representative training data from real workloads before shipping it, because a bad embedding classifier is worse than no embedding classifier — it would route confidently and wrongly.
- No multi-provider routing in one tier yet. A tier is one provider. If you want to route between Anthropic and OpenAI based on availability or pricing, that's two SmartProviders chained, not a single one. The infrastructure is there, the ergonomics aren't.
- Classification doesn't see history. We classify each turn in isolation. A turn that's "short input, no tools" but follows five turns of complex code generation is probably *not* a Haiku turn — but the classifier treats it as one. Adding turn-history awareness is the next iteration.
## Why this isn't a SaaS gateway
Most cost-routing products in this space are hosted gateways: you point your SDK at their endpoint, they classify and dispatch, you pay them a percentage. That's a fine product for some workloads. It's not the right architecture for Tutti.
A hosted gateway means vendor lock-in (your traffic goes through their pipes), a latency floor (one extra hop on every call), a privacy concern (your inputs and outputs are visible to a third party), and a billing surface (your costs are now also their costs). SmartProvider runs in your process, its decisions are emitted as OpenTelemetry spans you already collect, and the classifier is open-source code you can read.
If you want to fork it and write your own classifier, that's a 200-line file. The whole package is small on purpose — routing should be simple, auditable, and yours.