Post

Multi-Model Orchestration — SLM + LLM Hybrid Architectures

The 2026 pattern for cost-efficient production agents: small language models handle simple tasks cheaply, large models handle complex tasks well. How to design the routing layer, what to measure, and what the cost-quality tradeoffs actually look like.

Multi-Model Orchestration — SLM + LLM Hybrid Architectures

The LLM pricing collapse I covered on Day 6 created a tiered market: efficient small models at a fraction of the cost of frontier reasoning models. The question is how to use both appropriately.

The emerging pattern in production 2026 systems is hybrid SLM + LLM orchestration — small language models for high-volume, low-complexity tasks and large models for the tasks that need frontier reasoning.

flowchart TD
    A[Incoming Task] --> B{Task Router}
    B -->|classify_intent<br/>extract_entities<br/>summarise| C[Efficient Model<br/>claude-haiku-4-5<br/>$0.10-0.25/M tokens]
    B -->|generate_response<br/>code_review| D[Frontier Model<br/>claude-sonnet-4-6<br/>$3-5/M tokens]
    B -->|debug_complex<br/>plan_architecture| E[Reasoning Model<br/>claude-opus-4-8<br/>$15-75/M tokens]
    C & D & E --> F[Output]

Why Not Just Use the Frontier Model for Everything?

If the frontier model produces better output, why not use it for every operation?

Cost. At $3–5/M input tokens for frontier models vs. $0.10–0.25/M for efficient models, the price ratio is 15–50x. An agent making 100 LLM calls per task at frontier prices has very different economics from the same agent at efficient-tier prices.

Latency. Frontier models are slower. A call to Claude Opus 4.8 takes 3–8 seconds; a call to Claude Haiku 4.5 takes 0.5–1.5 seconds. For agent workflows with many sequential steps, this compounds significantly.

Unnecessary capability. Using a frontier model to classify a document as one of five categories is equivalent to hiring a PhD for data entry. The quality doesn’t improve; you’re just paying more for the same output.


The Task Taxonomy

Before designing the routing layer, classify your agent’s tasks by model requirements:

Efficient-model tasks:

  • Intent classification (“is this a billing question or a technical question?”)
  • Entity extraction (order number, date, product name from text)
  • Summarisation of structured content
  • Standard generation from clear templates
  • Formatting and transformation (JSON → natural language)
  • Binary decisions with clear criteria

Frontier-model tasks:

  • Complex reasoning across multiple documents
  • Ambiguous situation interpretation requiring judgment
  • Novel problem solving without clear patterns
  • Code generation for complex logic
  • Synthesis of conflicting information
  • High-stakes decisions where errors are expensive

Reasoning-model tasks (the most expensive tier):

  • Multi-step logical deduction
  • Complex debugging with non-obvious root causes
  • Architectural planning with many interacting constraints
  • Mathematical reasoning

The Routing Architecture

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
from enum import Enum

class ModelTier(Enum):
    EFFICIENT = "claude-haiku-4-5-20251001"
    FRONTIER = "claude-sonnet-4-6"
    REASONING = "claude-opus-4-8-20261001"  # with extended thinking

TASK_ROUTING = {
    "classify_intent": ModelTier.EFFICIENT,
    "extract_entities": ModelTier.EFFICIENT,
    "generate_summary": ModelTier.EFFICIENT,
    "answer_faq": ModelTier.EFFICIENT,
    "analyse_sentiment": ModelTier.EFFICIENT,
    
    "generate_response": ModelTier.FRONTIER,
    "interpret_ambiguous_request": ModelTier.FRONTIER,
    "write_technical_explanation": ModelTier.FRONTIER,
    "review_code_logic": ModelTier.FRONTIER,
    
    "debug_complex_error": ModelTier.REASONING,
    "plan_architecture": ModelTier.REASONING,
    "evaluate_tradeoffs": ModelTier.REASONING,
}

async def execute_task(task_type: str, prompt: str, context: dict) -> str:
    tier = TASK_ROUTING.get(task_type, ModelTier.FRONTIER)  # default to frontier
    
    model = tier.value
    use_reasoning = (tier == ModelTier.REASONING)
    
    return await llm_call(model=model, prompt=prompt, use_reasoning=use_reasoning)

The routing table is a simple lookup, but the value is in having made the classification decision deliberately rather than defaulting to one model for everything.


Dynamic Routing: Confidence-Based Escalation

Static routing works well for well-defined tasks. For tasks where the complexity varies with the input, confidence-based escalation is more precise.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
async def adaptive_generate(task_type: str, prompt: str) -> str:
    # First attempt with efficient model
    response = await llm_call(model=ModelTier.EFFICIENT.value, prompt=prompt)
    
    # Check confidence
    confidence_check = await llm_call(
        model=ModelTier.EFFICIENT.value,  # cheap confidence check
        prompt=f"""
        Task: {task_type}
        Response generated: {response}
        
        On a scale of 1-10, how confident are you this response is accurate 
        and complete? Respond with just the number.
        """
    )
    
    confidence = int(confidence_check.strip())
    
    if confidence >= 8:
        return response  # efficient model was sufficient
    
    # Escalate to frontier model
    return await llm_call(model=ModelTier.FRONTIER.value, prompt=prompt)

This pattern uses the efficient model for the task and for the confidence check (two cheap calls), only escalating to the frontier when the efficient model itself signals uncertainty.


Measuring the Tradeoff

Build a measurement framework before deploying hybrid routing:

  1. Sample 200 tasks across your task distribution
  2. Run each task on all three tiers
  3. Have human reviewers (or an LLM judge) assess quality for each result
  4. Calculate: quality score per tier, cost per tier, latency per tier
  5. Plot quality vs. cost for each task type

This empirical data tells you exactly which tasks benefit from escalation and by how much. The routing table becomes evidence-based rather than intuition-based.

In my experience, this measurement typically shows 60–70% of agent tasks can use the efficient tier with negligible quality loss, 25–35% benefit meaningfully from the frontier tier, and 5–10% genuinely need reasoning models.


Day 16 of the Production Agentic AI series. Previous: Reasoning Models for Agents

This post is licensed under CC BY 4.0 by the author.