LlamaCppBackend — llama.cpp C API integration.

#include <entropic/inference/backend.h>
#include <entropic/inference/sampler.h>
#include <entropic/inference/tokenizer.h>
#include "prompt_cache.h"
#include <llama.h>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <vector>


Classes
class	entropic::LlamaCppBackend
	LlamaCppBackend — common llama.cpp patterns (15% layer).

struct	entropic::LlamaCppBackend::CommonChatResult
	Result of a common_chat parse: native tool calls + split content.

struct	entropic::LlamaCppBackend::BatchSeq
	Per-sequence state for the gh#98 multi-seq batched decode.

Namespaces
namespace	entropic
	Activate model on GPU (WARM → ACTIVE).

Detailed Description

LlamaCppBackend — llama.cpp C API integration.

Versioned subclass pattern: LlamaCppBackend provides common llama.cpp patterns (decode loop, sampler chain, tokenization). The pinned-commit subclass (LlamaCppBackend_b8420) overrides API-version-specific calls.
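
The versioned subclass pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual class layout: SketchBackend, SketchBackend_b8420, decode_one and eos_token are hypothetical names, and the "decode" step is a stand-in that does not call llama.cpp. The base class owns the shared loop; the pinned subclass supplies only the version-specific hooks.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: none of these names come from llama_cpp_backend.h.
class SketchBackend {
public:
    virtual ~SketchBackend() = default;

    // Common pattern shared by every pinned version: a simple decode loop
    // that stops at the end-of-sequence token or after max_tokens steps.
    std::vector<int> generate(int first_token, int max_tokens) {
        std::vector<int> out;
        int tok = first_token;
        for (int i = 0; i < max_tokens; ++i) {
            tok = decode_one(tok);           // version-specific hook
            if (tok == eos_token()) break;   // version-specific hook
            out.push_back(tok);
        }
        return out;
    }

    virtual std::string api_version() const = 0;

protected:
    // Hooks a pinned-commit subclass overrides (assumed shape).
    virtual int decode_one(int prev_token) = 0;
    virtual int eos_token() const = 0;
};

// Stand-in for a subclass pinned to one llama.cpp build.
class SketchBackend_b8420 final : public SketchBackend {
public:
    std::string api_version() const override { return "b8420"; }

protected:
    // Toy decode: the next token is the previous one plus one.
    int decode_one(int prev_token) override { return prev_token + 1; }
    int eos_token() const override { return 5; }
};
```

With this split, moving to a newer pinned commit would mean adding another subclass while leaving the shared loop untouched.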

VRAM lifecycle mapping

COLD: nothing allocated
WARM: llama_model loaded (CPU mmap+mlock, n_gpu_layers=0)
ACTIVE: model reloaded with gpu_layers, llama_context created
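
The lifecycle mapping above can be read as a small state machine. The sketch below is an assumption-laden illustration: VramState and LifecycleSketch are hypothetical names, the boolean flags stand in for the real llama_model and llama_context handles, and the unload() transition back to COLD is assumed rather than documented here.

```cpp
#include <stdexcept>

// Hypothetical state names mirroring COLD / WARM / ACTIVE.
enum class VramState { Cold, Warm, Active };

struct LifecycleSketch {
    VramState state = VramState::Cold;
    bool model_loaded = false;     // stands in for a llama_model handle
    bool context_created = false;  // stands in for a llama_context handle
    int n_gpu_layers = 0;

    // COLD -> WARM: model loaded on the CPU side with n_gpu_layers = 0.
    void load() {
        if (state != VramState::Cold)
            throw std::logic_error("load() requires COLD");
        model_loaded = true;
        n_gpu_layers = 0;
        state = VramState::Warm;
    }

    // WARM -> ACTIVE: model reloaded with GPU layers, context created.
    void activate(int gpu_layers) {
        if (state != VramState::Warm)
            throw std::logic_error("activate() requires WARM");
        n_gpu_layers = gpu_layers;
        context_created = true;
        state = VramState::Active;
    }

    // Any state -> COLD: release everything (assumed transition).
    void unload() {
        model_loaded = false;
        context_created = false;
        n_gpu_layers = 0;
        state = VramState::Cold;
    }
};
```

Rejecting out-of-order calls with an exception is one possible design choice; the real backend may report such errors differently.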

Key differences from Python LlamaCppBackend

Direct llama.cpp C API (not llama-cpp-python wrapper)
No Python GIL — generation runs natively
No asyncio bridge — streaming is synchronous with callback
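
To illustrate what "synchronous with callback" can look like in practice, here is a minimal sketch. It assumes a std::function callback invoked on the caller's thread and a std::atomic<bool> cancel flag; TokenCallback and stream_sketch are hypothetical names, and the pre-split pieces vector stands in for tokens produced by a real decode loop.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback type: receives each generated text piece.
using TokenCallback = std::function<void(const std::string&)>;

// Emits pieces one by one on the calling thread. The cancel flag is
// checked before each piece, so a callback (or another thread) can stop
// the stream. Returns the number of pieces delivered.
inline std::size_t stream_sketch(const std::vector<std::string>& pieces,
                                 const TokenCallback& on_token,
                                 const std::atomic<bool>& cancel) {
    std::size_t emitted = 0;
    for (const std::string& piece : pieces) {
        if (cancel.load(std::memory_order_relaxed)) break;
        on_token(piece);  // runs synchronously, no event loop involved
        ++emitted;
    }
    return emitted;
}
```

Because the callback runs inline, the caller sees each piece in order and the function returns only once generation has finished or been cancelled.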

Internal to the inference shared library (.so); not exposed across library boundaries.

Version: 1.9.13

Definition in file llama_cpp_backend.h.
