Native Backend
Table of contents
- Quick Start
- Constructor Overloads
- LmSessionOptions Reference
- LlamaRuntimeInstaller
- Full Example with Workflow
- Dual-model setup with BackendRouter
NativeBackend runs inference locally through Agentic.Runtime (llama.cpp). It implements ILLMBackend so it is a drop-in replacement for OpenAIBackend. The session is initialized lazily on first use, and when constructed with a LlamaBackend the runtime binaries are downloaded and installed automatically if not already present.
Quick Start
```csharp
using Agentic;
using Agentic.Runtime.Core;
using Agentic.Runtime.Mantle;

var sessionOptions = new LmSessionOptions
{
    ModelPath = @"/path/to/model.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(MaxInputTokens: 4096),
};

await using var lm = new NativeBackend(
    sessionOptions,
    backend: LlamaBackend.Cuda,
    cudaVersion: "12.4",
    installProgress: new Progress<(string msg, double pct)>(
        p => Console.Write($"\r[{p.pct:F0}%] {p.msg}")));

var agent = new Agent(lm, new AgentOptions
{
    SystemPrompt = "You are a helpful assistant.",
    OnEvent = e => { if (e.Kind == AgentEventKind.TextDelta) Console.Write(e.Text); },
});

await agent.ChatStreamAsync("Hello!");
```
On first run the correct llama.cpp release is downloaded and extracted. Every subsequent run skips the download entirely.
Constructor Overloads
Explicit backend directory — you already have the binaries
```csharp
var sessionOptions = new LmSessionOptions
{
    BackendDirectory = @"C:\llama-runtime\cuda-b8269",
    ModelPath = @"C:\models\qwen.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(MaxInputTokens: 4096),
    ContextTokens = 8192,
    MaxToolRounds = 32,
};

await using var lm = new NativeBackend(sessionOptions);
```
Auto-install — downloads the runtime on first run
```csharp
var sessionOptions = new LmSessionOptions
{
    ModelPath = @"C:\models\qwen.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(MaxInputTokens: 4096),
    ContextTokens = 8192,
    MaxToolRounds = 32,
    // BackendDirectory is omitted; it is resolved automatically
};

await using var lm = new NativeBackend(
    sessionOptions,
    backend: LlamaBackend.Cuda,
    cudaVersion: "12.4",   // null = pick highest CUDA 12.x available
    releaseTag: "b8269",   // null = latest release
    installProgress: new Progress<(string msg, double pct)>(
        p => Console.Write($"\r[{p.pct:F0}%] {p.msg}")));
```
NativeBackend members
| Member | Description |
|---|---|
| BackendDirectory | Resolved path to the native binaries after the session is first initialized; null before first use |
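Because the session is initialized lazily, BackendDirectory is a convenient way to confirm which runtime was picked up. The fragment below is a minimal sketch that continues the Quick Start: it assumes the `lm` and `agent` variables defined there, and it assumes that the first chat call is what triggers initialization.

```csharp
// Before any request, the session has not been initialized yet.
Console.WriteLine(lm.BackendDirectory ?? "(not initialized)");

await agent.ChatStreamAsync("Hello!");

// After first use, the resolved binary directory is available.
Console.WriteLine();
Console.WriteLine(lm.BackendDirectory);
```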
LmSessionOptions Reference
| Property | Default | Description |
|---|---|---|
| ModelPath | (required) | Full path to the GGUF model file |
| ToolRegistry | (required) | Tool registry available to the session |
| Compaction | (required) | Conversation compaction policy (see Context Compaction) |
| BackendDirectory | "" | Directory containing llama.cpp binaries. Omit when using auto-install |
| ContextTokens | 8192 | Total KV cache token capacity |
| ResetContextTokens | 2048 | Reserved context for context-reset operations |
| BatchTokens | 1024 | Prompt evaluation batch size |
| MicroBatchTokens | 1024 | Internal llama.cpp micro-batch size |
| MaxToolRounds | 10 | Maximum tool-call rounds per turn |
| Threads | null | CPU thread count (null = llama.cpp default) |
| FlashAttention | false | Enable flash attention when supported |
| OffloadKvCacheToGpu | true | Offload KV cache to GPU |
| UseMmap | true | Memory-map the model file |
| DefaultRequest | null | Default generation settings (temperature, top-p, etc.) for every turn |
| Logger | null | ILogger for session and engine diagnostics |
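As an illustration of how the properties above combine, here is a sketch of a CPU-only configuration. The property names come from the table; their types (int, nullable int, bool) and all of the values are assumptions chosen for illustration, not recommended settings.

```csharp
using Agentic;
using Agentic.Runtime.Core;
using Agentic.Runtime.Mantle;

var sessionOptions = new LmSessionOptions
{
    ModelPath = @"/path/to/model.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(MaxInputTokens: 2048),
    ContextTokens = 4096,                  // smaller KV cache than the 8192 default
    BatchTokens = 512,                     // illustrative value
    MicroBatchTokens = 512,                // illustrative value
    Threads = Environment.ProcessorCount,  // override the llama.cpp default
    OffloadKvCacheToGpu = false,           // no GPU on this machine
};

await using var lm = new NativeBackend(sessionOptions, backend: LlamaBackend.Cpu);
```

Properties left out keep the defaults listed in the table.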
LlamaRuntimeInstaller
LlamaRuntimeInstaller downloads and extracts llama.cpp native binaries from ggml-org/llama.cpp GitHub releases. Installed runtimes are cached under DefaultInstallRoot and reused on subsequent runs with no network call.
```csharp
using Agentic.Runtime.Core;

// Ensure installed, return the binary directory
string backendDir = await LlamaRuntimeInstaller.EnsureInstalledAsync(
    backend: LlamaBackend.Cuda,
    cudaVersion: "12.4",   // null = pick the highest CUDA 12.x asset automatically
    releaseTag: "b8269",   // null = always use the latest published release
    progress: new Progress<(string msg, double pct)>(
        p => Console.Write($"\r[{p.pct:F0}%] {p.msg}")));

// Check what is already installed (no download)
string? existing = LlamaRuntimeInstaller.FindInstalled(LlamaBackend.Cuda, releaseTag: "b8269");
```
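The two calls can be combined to resolve a directory once and hand it to the explicit-directory constructor. This is a sketch that uses only the members shown in this document; the model path is a placeholder.

```csharp
using Agentic;
using Agentic.Runtime.Core;
using Agentic.Runtime.Mantle;

// Reuse an installed runtime when there is one; otherwise install it.
string backendDir =
    LlamaRuntimeInstaller.FindInstalled(LlamaBackend.Cuda, releaseTag: "b8269")
    ?? await LlamaRuntimeInstaller.EnsureInstalledAsync(LlamaBackend.Cuda, releaseTag: "b8269");

var sessionOptions = new LmSessionOptions
{
    BackendDirectory = backendDir,   // explicit directory, so no auto-install arguments are needed
    ModelPath = @"/path/to/model.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(MaxInputTokens: 4096),
};

await using var lm = new NativeBackend(sessionOptions);
```

EnsureInstalledAsync on its own would also work here, since it returns immediately when the release is present. FindInstalled is the better fit when you want to check for a runtime without ever starting a download.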
LlamaBackend enum
| Value | Description |
|---|---|
| Cpu | CPU-only inference (AVX2 preferred, falls back to AVX / noavx) |
| Cuda | NVIDIA CUDA GPU acceleration |
| Vulkan | Vulkan GPU acceleration (AMD / Intel / NVIDIA) |
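To see which CPU variant a machine is likely to get, you can query the instruction sets from .NET directly. The sketch below mirrors the preference order in the table using the standard System.Runtime.Intrinsics.X86 types. How the installer itself detects CPU features is not described here, so treat this as an approximation; the variant strings are illustrative labels.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

// Same preference order as the Cpu row above: AVX2, then AVX, then no AVX.
// IsSupported is false on non-x86 hardware, which yields "noavx" here.
string variant = Avx2.IsSupported ? "avx2"
               : Avx.IsSupported  ? "avx"
               : "noavx";

Console.WriteLine($"Expected CPU variant: {variant}");
```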
Install root
| Platform | Default path |
|---|---|
| Windows | %LOCALAPPDATA%\Agentic\llama-runtime\ |
| Linux / macOS | ~/.local/share/Agentic/llama-runtime/ |
Each installed release occupies its own subdirectory named {backend}-{tag} (e.g. cuda-b8269), so multiple versions coexist safely.
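The layout can be reproduced in a few lines, which is handy when cleaning up old runtimes. RuntimeDirectory below is a hypothetical helper, not part of the Agentic API. It rebuilds the documented paths by hand and assumes the backend name is lower-cased, as in the cuda-b8269 example.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper: rebuilds the documented layout by hand.
static string RuntimeDirectory(string backend, string tag)
{
    string root = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
        // %LOCALAPPDATA% on Windows
        ? Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData)
        // ~/.local/share on Linux and macOS, as listed in the table
        : Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.UserProfile),
            ".local", "share");

    return Path.Combine(root, "Agentic", "llama-runtime", $"{backend}-{tag}");
}

// Two releases of the same backend map to different directories.
Console.WriteLine(RuntimeDirectory("cuda", "b8269"));
Console.WriteLine(RuntimeDirectory("cuda", "b8270"));
```

In application code, prefer LlamaRuntimeInstaller.FindInstalled over a hand-built path, since the installer owns the actual layout.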
Release pinning
By default (releaseTag: null) the installer targets the latest published release. To pin a specific build, pass releaseTag:
```csharp
// Always use b8269, regardless of what is latest on GitHub
await LlamaRuntimeInstaller.EnsureInstalledAsync(LlamaBackend.Cuda, releaseTag: "b8269");
```
If b8269 is already installed the call returns immediately. Pass forceReinstall: true to re-download even when the directory exists.
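A forced reinstall might look like the following sketch. The forceReinstall parameter name comes from the sentence above; passing it as a named argument alongside releaseTag is an assumption about the signature.

```csharp
using Agentic.Runtime.Core;

// Re-download b8269 even though its directory already exists,
// for example after an interrupted or corrupted extraction.
string backendDir = await LlamaRuntimeInstaller.EnsureInstalledAsync(
    LlamaBackend.Cuda,
    releaseTag: "b8269",
    forceReinstall: true);

Console.WriteLine(backendDir);
```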
Full Example with Workflow
```csharp
using Agentic;
using Agentic.Runtime.Core;
using Agentic.Runtime.Mantle;

var sessionOptions = new LmSessionOptions
{
    ModelPath = @"/models/qwen3.5-9b-q4.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(
        MaxInputTokens: 8192,
        ReservedForGeneration: 256),
    ContextTokens = 8192,
    MaxToolRounds = 32,
    DefaultRequest = new ResponseRequest
    {
        MaxOutputTokens = 1024,
        EnableThinking = false,
    },
};

await using var lm = new NativeBackend(
    sessionOptions,
    backend: LlamaBackend.Cuda,
    cudaVersion: "12.4",
    installProgress: new Progress<(string msg, double pct)>(
        p => Console.Write($"\r[{p.pct:F0}%] {p.msg}")));

var agent = new Agent(lm, new AgentOptions
{
    SystemPrompt = "You are a helpful assistant.",
    OnEvent = e => { if (e.Kind == AgentEventKind.TextDelta) Console.Write(e.Text); },
});

await agent.ChatStreamAsync("Explain the Pythagorean theorem.");
```
Dual-model setup with BackendRouter
A common local pattern is to pair a large model for chat and reasoning with a small specialised model for vector embeddings. BackendRouter wires them together as a single ILLMBackend:
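```csharp
using Agentic;
using Agentic.Runtime.Core;
using Agentic.Runtime.Mantle;

// ToolRegistry and Compaction are set on both sessions because the
// LmSessionOptions reference lists them as required. The MaxInputTokens
// values are placeholders.
var chatOptions = new LmSessionOptions
{
    ModelPath = @"/models/qwen3.5-9b-q4.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(MaxInputTokens: 4096),
    ContextTokens = 8192,
    MaxToolRounds = 32,
};

var embedOptions = new LmSessionOptions
{
    ModelPath = @"/models/embeddinggemma-300m-qat-q4.gguf",
    ToolRegistry = new ToolRegistry(),
    Compaction = new ConversationCompactionOptions(MaxInputTokens: 2048),
    ContextTokens = 2048,
    BatchTokens = 512,
};

await using var chatBackend = new NativeBackend(chatOptions, LlamaBackend.Cuda);
await using var embedBackend = new NativeBackend(embedOptions, LlamaBackend.Cuda);

await using var lm = new BackendRouter()
    .Add("qwen-9b", chatBackend, isDefault: true)
    .Add("embed-300m", embedBackend, isEmbedding: true);
```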
All RespondAsync / RespondStreamingAsync calls go to the chat model; all EmbedAsync / EmbedBatchAsync calls go to the embedding model. See BackendRouter for full routing rules and multi-model examples.