BackendRouter

Table of contents

  1. How routing works
  2. Registering backends
    1. Add parameters
  3. Basic example - chat + embedding
  4. Multi-model example - several chat backends
  5. Disposal
  6. Using with OpenAIBackend

BackendRouter composes multiple ILLMBackend instances into a single backend and routes each call to the right one. The most common use case is pairing a large chat model with a small specialised embedding model, but any number of named backends can be registered.


How routing works

Call type                      Backend used
RespondAsync(model: "name")    The backend registered under "name"
RespondAsync(model: null)      The default backend
RespondStreamingAsync(...)     Same rules as RespondAsync
EmbedAsync(...)                The embedding backend; falls back to the default if none is designated
EmbedBatchAsync(...)           Same as EmbedAsync
PingAsync()                    Pings all registered backends; returns true only when every one succeeds
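
For example, with a router configured as in the Basic example below, the rules above play out as follows (a sketch; lm is assumed to be a BackendRouter with a default chat backend and a designated embedding backend):

// Goes to the default backend (model is null)
var a = await lm.RespondAsync("Hello", model: null);

// Goes to the backend registered under "chat"
var b = await lm.RespondAsync("Hello", model: "chat");

// Goes to the embedding backend (or the default if none is designated)
var v = await lm.EmbedAsync("Hello");

// Pings every registered backend; true only when all of them succeed
bool healthy = await lm.PingAsync();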

Registering backends

Use the fluent Add method to build the router. Calls can be chained.

var lm = new BackendRouter()
    .Add("qwen-9b",    chatBackend,  isDefault: true)
    .Add("embed-300m", embedBackend, isEmbedding: true);

Add parameters

Parameter     Default       Description
name          (required)    Routing key. Pass as the model argument to RespondAsync / RespondStreamingAsync to target this backend explicitly
backend       (required)    The ILLMBackend to register
isDefault     false         This backend handles requests where model is null. The first non-embedding backend is auto-designated default; pass isDefault: true to override (see the example after this table)
isEmbedding   false         This backend handles EmbedAsync / EmbedBatchAsync
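
For example, when no isDefault flag is passed, the first non-embedding backend registered wins (a sketch; fastBackend and embedBackend are placeholder ILLMBackend instances):

// "fast" is the first non-embedding backend, so it is auto-designated default
var lm = new BackendRouter()
    .Add("fast",  fastBackend)
    .Add("embed", embedBackend, isEmbedding: true);

// Both calls are handled by fastBackend
await lm.RespondAsync("Hello", model: null);
await lm.RespondAsync("Hello", model: "fast");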

Basic example - chat + embedding

The most common pattern: a large model for reasoning, a small model for vector embeddings.

using Agentic;
using Agentic.Runtime.Core;
using Mantle = Agentic.Runtime.Mantle;

var chatOptions = new Mantle.LmSessionOptions
{
    ModelPath     = @"/models/qwen3.5-9b-q4.gguf",
    ContextTokens = 8192,
    MaxToolRounds = 32,
    DefaultRequest = new Mantle.ResponseRequest
    {
        MaxOutputTokens = 1024,
        EnableThinking  = false,
    },
};

var embedOptions = new Mantle.LmSessionOptions
{
    ModelPath     = @"/models/embeddinggemma-300m-qat-q4.gguf",
    ContextTokens = 2048,
    BatchTokens   = 512,
};

await using var chatBackend  = new NativeBackend(chatOptions,  LlamaBackend.Cuda);
await using var embedBackend = new NativeBackend(embedOptions, LlamaBackend.Cuda);

await using var lm = new BackendRouter()
    .Add("chat",  chatBackend,  isDefault: true)
    .Add("embed", embedBackend, isEmbedding: true);

// Chat calls go to the default chat backend
var agent = new Agent(lm, new AgentOptions { SystemPrompt = "You are a helpful assistant." });
await agent.ChatStreamAsync("Summarise this document.");

// Embedding calls go to the embedding backend
var vector = await lm.EmbedAsync("What is the warranty period?");

// You can also route chat explicitly by name when multiple chat backends are registered
var response = await lm.RespondAsync("Explain this code.", model: "chat");

Multi-model example - several chat backends

Register as many backends as needed and select between them at call time.

await using var lm = new BackendRouter()
    .Add("fast",      fastBackend,      isDefault: true)   // quick answers, small model
    .Add("smart",     smartBackend)                        // complex reasoning
    .Add("ocr",       ocrBackend)                          // vision-heavy tasks
    .Add("embed-300m", embedBackend, isEmbedding: true);

// Default - routes to "fast"
await agent.ChatStreamAsync("What time is it?");

// Explicit routing via the model parameter
await agent.ChatStreamAsync("Refactor this entire module.", model: "smart");
await agent.ChatStreamAsync("Extract text from this receipt.", model: "ocr");

// Embeddings always go to "embed-300m"
var vector = await lm.EmbedAsync("search query");

The model string is matched case-insensitively against registered names. Passing an unrecognised name throws an InvalidOperationException whose message lists the valid names.
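
If model names come from configuration or user input, the exception can be caught to surface a friendlier error (a sketch; the exact message text is the router's own):

try
{
    await lm.RespondAsync("Hello", model: "does-not-exist");
}
catch (InvalidOperationException ex)
{
    // ex.Message lists the registered names, e.g. fast, smart, ocr, embed-300m
    Console.WriteLine(ex.Message);
}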


Disposal

BackendRouter disposes all registered backends when it is disposed. If backends are also wrapped in await using at the call site, they end up being disposed twice; this is safe because NativeBackend disposal is idempotent.

// Both patterns are safe
await using var chatBackend  = new NativeBackend(chatOptions,  LlamaBackend.Cuda);
await using var embedBackend = new NativeBackend(embedOptions, LlamaBackend.Cuda);
await using var lm = new BackendRouter()
    .Add("chat",  chatBackend,  isDefault: true)
    .Add("embed", embedBackend, isEmbedding: true);

Using with OpenAIBackend

BackendRouter works with any ILLMBackend - mix local and remote backends freely.

var remoteBackend = new OpenAIBackend(new LMConfig
{
    Endpoint  = "https://api.openai.com",
    ModelName = "gpt-4o",
    ApiKey    = "sk-...",
});

await using var localEmbed = new NativeBackend(embedOptions, LlamaBackend.Cpu);

await using var lm = new BackendRouter()
    .Add("gpt-4o", remoteBackend,  isDefault: true)
    .Add("embed",  localEmbed, isEmbedding: true);