Entropic 2.3.8
Local-first agentic inference engine
entropic::LlamaCppBackend Class Reference

LlamaCppBackend: common llama.cpp patterns (15% layer).

#include </home/runner/work/entropic/entropic/src/inference/llama_cpp_backend.h>


Public Member Functions

 ~LlamaCppBackend () override
 Free llama.cpp + mtmd resources on destruction.
 
void set_prompt_cache_config (const PromptCacheConfig &config)
 Set prompt cache configuration.
 
void clear_prompt_cache () override
 Drop every cached prefix so the next prefill re-seeds.
 
std::vector< int32_t > tokenize_text (const std::string &text) const override
 Tokenize text to token IDs using model vocabulary.
 
llama_model * llama_model_ptr ()
 Get the loaded llama_model pointer.
 
llama_context * llama_context_ptr ()
 Get the active llama_context pointer.
 
GenerationResult generate_speculative_with_draft (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel, LlamaCppBackend &draft, int n_draft_max, const std::string &draft_path)
 Speculative-decoding kernel with explicit draft backend.
 
Public Member Functions inherited from entropic::InferenceBackend
bool load (const ModelConfig &config)
 Load model into CPU RAM (COLD → WARM).
 
bool activate ()
 Promote to GPU (WARM → ACTIVE).
 
void deactivate ()
 Release GPU layers (ACTIVE → WARM).
 
void unload ()
 Full unload (→ COLD).
 
bool load_and_activate (const ModelConfig &config)
 Convenience: load() + activate().
 
GenerationResult generate (const std::vector< Message > &messages, const GenerationParams &params)
 Generate a complete response.
 
GenerationResult generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel)
 Generate with per-token streaming callback.
 
GenerationResult generate_speculative (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel)
 Generate via the speculative-decoding kernel (v2.1.11).
 
GenerationResult complete (const std::string &prompt, const GenerationParams &params)
 Raw text completion without chat template.
 
LogprobResult evaluate_logprobs (const int32_t *tokens, int n_tokens)
 Evaluate per-token log-probabilities for a token sequence.
 
float compute_perplexity (const int32_t *tokens, int n_tokens)
 Compute perplexity for a token sequence.
 
ModelState state () const
 Current lifecycle state (lock-free read).
 
bool is_active () const
 True when state is ACTIVE.
 
bool is_loaded () const
 True when state is WARM or ACTIVE.
 
int count_tokens (const std::string &text) const
 Count tokens using model's tokenizer.
 
int context_length () const
 Model's context window size.
 
const ModelConfig & config () const
 Stored model config.
 
bool supports (BackendCapability cap) const
 Query whether this backend supports a capability.
 
std::vector< BackendCapability > capabilities () const
 Get all supported capabilities as a vector.
 
BackendInfo info () const
 Get backend metadata.
 
bool save_state (int seq_id, std::vector< uint8_t > &buffer) const
 Save model state to buffer.
 
bool restore_state (int seq_id, const std::vector< uint8_t > &buffer)
 Restore model state from buffer.
 
bool clear_state (int seq_id=-1)
 Clear/reset model state for a sequence.
 
GenerationResult generate_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params)
 Generate with explicit sequence ID.
 
GenerationResult generate_streaming_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel)
 Streaming generation with explicit sequence ID.
 

Protected Member Functions

bool do_load (const ModelConfig &config) override
 Load model into CPU RAM (COLD → WARM).
 
bool do_activate () override
 Activate model on GPU (WARM → ACTIVE).
 
void do_deactivate () override
 Deactivate: free context, reload model CPU-only.
 
void do_unload () override
 Full unload — free all resources, clear prompt cache.
 
GenerationResult do_generate (const std::vector< Message > &messages, const GenerationParams &params) override
 Generate a complete response using chat template.
 
GenerationResult do_generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) override
 Streaming generation with per-token callback.
 
GenerationResult do_generate_speculative (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) override
 Speculative streaming via the abstract InferenceBackend interface (kept as NOT_SUPPORTED — see kernel entry below).
 
GenerationResult do_complete (const std::string &prompt, const GenerationParams &params) override
 Raw text completion without chat template.
 
int do_count_tokens (const std::string &text) const override
 Count tokens in text.
 
LogprobResult do_evaluate_logprobs (const int32_t *tokens, int n_tokens) override
 Evaluate per-token log-probabilities via sequential decode.
 
bool do_supports (BackendCapability cap) const override
 Declare llama.cpp backend capabilities.
 
std::string do_backend_name () const override
 Return backend name.
 
BackendInfo do_info () const override
 Populate backend metadata from llama.cpp model.
 
bool do_clear_state (int seq_id) override
 Clear KV cache or recurrent hidden state.
 
std::vector< llama_token > tokenize (const std::string &text, bool add_special) const
 Tokenize text using model vocabulary.
 
std::string detokenize (llama_token token) const
 Detokenize a single token.
 
std::string apply_chat_template (const std::vector< Message > &messages, const GenerationParams &params) const
 Apply chat template to messages.
 
GenerationResult decode_loop (const std::vector< llama_token > &tokens, const GenerationParams &params, std::function< void(std::string_view)> on_token, std::atomic< bool > *cancel)
 Core decode loop — shared by generate and streaming.
 
bool run_prefill (const std::vector< llama_token > &tokens)
 Run batched prefill on input tokens.
 
std::string step_token (llama_sampler *sampler, std::string &generated, std::function< void(std::string_view)> &on_token, const std::vector< std::string > &stop)
 Generate one token and append to output.
 
llama_sampler * create_sampler (const GenerationParams &params) const
 Create sampler chain from generation params.
 
bool run_prefill_cached (const std::vector< llama_token > &tokens, const std::string &system_prompt, const std::vector< Message > &messages, const GenerationParams &params)
 Run prefill with prompt cache integration.
 
bool decode_tokens_from (const std::vector< llama_token > &tokens, int start_offset)
 Decode tokens starting at a given offset.
 
bool restore_cached_prefix (const CacheEntry *cached, const std::vector< llama_token > &tokens)
 Restore KV state from cache and decode remaining tokens.
 
bool prefill_and_cache_prefix (const std::vector< llama_token > &tokens, int prefix_tokens, const CacheKey &key)
 Two-pass prefill: prefix-only prefill → save → rest.
 
void save_prefix_to_cache (const CacheKey &key, int prefix_tokens)
 Capture seq 0 KV state and store under the given key.
 
int compute_prefix_token_count (const std::vector< Message > &messages, const GenerationParams &params)
 Compute token count of system messages only.
 
llama_seq_id allocate_temp_seq_id ()
 Allocate a temporary sequence ID for evaluation.
 
void release_temp_seq_id (llama_seq_id seq_id)
 Release a temporary sequence ID back to the pool.
 
bool is_recurrent () const
 Check if loaded model is recurrent.
 
GenerationResult generate_multimodal (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > *cancel)
 Multimodal generation core (v1.9.11 Phases 5–7).
 
void init_mmproj_if_configured ()
 Initialize the libmtmd context if mmproj is configured.
 
bool load_gpu_model ()
 Load the GGUF model onto the GPU (do_activate step 1).
 
bool create_inference_context ()
 Create the llama context + prompt cache (do_activate step 2).
 
entropic_error_t mtmd_prefill (const std::string &prompt, const std::vector<::mtmd_bitmap * > &bitmaps, std::string &err_msg)
 Run mtmd_tokenize + mtmd_helper_eval_chunks on a prompt.
 
GenerationResult run_sampling_loop (const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > *cancel, const std::chrono::steady_clock::time_point &t0)
 Sample tokens until stop / max_tokens / cancel.
 
GenerationResult do_generate_text_only (const std::vector< Message > &messages, const GenerationParams &params)
 Text-only batch generation (extracted from do_generate).
 
GenerationResult do_generate_streaming_text_only (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel)
 Text-only streaming generation (extracted from streaming).
 
Protected Member Functions inherited from entropic::InferenceBackend
virtual bool do_save_state (int seq_id, std::vector< uint8_t > &buffer) const
 Save model state (KV cache or hidden state).
 
virtual bool do_restore_state (int seq_id, const std::vector< uint8_t > &buffer)
 Restore model state.
 
virtual GenerationResult do_generate_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params)
 Generate with sequence ID.
 
virtual GenerationResult do_generate_streaming_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel)
 Streaming generation with sequence ID.
 
bool fire_model_load_hook (const ModelConfig &config)
 Fire ON_MODEL_LOAD pre-hook.
 
void set_hooks (const HookInterface &hooks)
 Set the hook dispatch interface.
 

Static Protected Member Functions

static std::string extract_system_prompt (const std::vector< Message > &messages)
 Extract the system prompt from messages.
 
static float extract_token_logprob (const float *logits, int32_t next_token, int n_vocab)
 Extract log-probability for a token from logits.
 

Protected Attributes

llama_model * model_ = nullptr
 Loaded model (WARM+)
 
llama_context * ctx_ = nullptr
 Inference context (ACTIVE)
 
const llama_vocab * vocab_ = nullptr
 Vocabulary (from model_)
 
PromptCacheConfig prompt_cache_config_
 Cache config (v1.8.3)
 
std::unique_ptr< PromptCache > prompt_cache_
 KV prefix cache (v1.8.3)
 
std::mutex seq_id_mutex_
 Guards temp seq_id pool (v1.9.10)
 
std::vector< llama_seq_id > free_seq_ids_
 Available temporary seq_ids (v1.9.10)
 
bool is_recurrent_ = false
 True if loaded model is recurrent (GDN/Mamba/RWKV).
 
::mtmd_context * mtmd_ctx_ = nullptr
 libmtmd context, or nullptr if no mmproj loaded.
 
bool has_vision_ = false
 Cached mtmd_support_vision(mtmd_ctx_) result.
 
Protected Attributes inherited from entropic::InferenceBackend
std::string last_error_
 Last error message for diagnostics.
 

Detailed Description

LlamaCppBackend — common llama.cpp patterns (15% layer).

Provides decode loop, sampler chain creation, tokenization helpers. Pinned-version subclass overrides do_load/do_activate with version-specific API calls.

Version
1.8.3

Definition at line 63 of file llama_cpp_backend.h.

Constructor & Destructor Documentation

◆ ~LlamaCppBackend()

entropic::LlamaCppBackend::~LlamaCppBackend ( )
override

Free llama.cpp + mtmd resources on destruction.

Destructor: routes to do_unload() so GPU buffers don't leak.

gh#58 v2.2.7 follow-up: the base InferenceBackend destructor is defaulted, so without this override the raw model_ / ctx_ / mtmd_ctx_ pointers are never freed when the backend goes out of scope. GPU buffers held by those objects stay allocated for the remainder of the process, and the next handle's GPU model load fails because llama.cpp's CUDA pool sees the prior allocations as stale-but-occupied.

Version
2.2.8

Definition at line 390 of file llama_cpp_backend.cpp.

Member Function Documentation

◆ allocate_temp_seq_id()

llama_seq_id entropic::LlamaCppBackend::allocate_temp_seq_id ( )
protected

Allocate a temporary sequence ID for evaluation.

Returns
Unused seq_id (starts at 1; 0 is the generation sequence), or -1 if the pool is exhausted.
Version
1.9.10

Definition at line 584 of file llama_cpp_backend.cpp.
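
The pool behavior documented above (IDs starting at 1, -1 on exhaustion, a mutex-guarded free list) can be sketched as follows. This is a minimal illustration and not the backend's source: the TempSeqIdPool class, its capacity parameter, and the lowest-ID-first ordering are assumptions introduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical pool; class name and capacity parameter are illustrative only.
class TempSeqIdPool {
public:
    // IDs 1..max_temp_ids are available; 0 stays reserved for generation.
    explicit TempSeqIdPool(int32_t max_temp_ids) {
        assert(max_temp_ids >= 0);
        for (int32_t id = max_temp_ids; id >= 1; --id) {
            free_ids_.push_back(id);  // lowest ID ends up at the back
        }
    }

    // Returns an unused ID, or -1 when the pool is exhausted.
    int32_t allocate() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_ids_.empty()) return -1;
        const int32_t id = free_ids_.back();
        free_ids_.pop_back();
        return id;
    }

    // Hands an ID back so a later allocate() can reuse it.
    void release(int32_t id) {
        assert(id >= 1);
        std::lock_guard<std::mutex> lock(mutex_);
        free_ids_.push_back(id);
    }

private:
    std::mutex mutex_;               // guards free_ids_
    std::vector<int32_t> free_ids_;  // IDs currently available
};
```

A caller would pair every successful allocate() with a release() once the temporary evaluation finishes, and treat -1 as "no sequence slot available".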

◆ apply_chat_template()

std::string entropic::LlamaCppBackend::apply_chat_template ( const std::vector< Message > &  messages,
const GenerationParams &  params
) const
protected

Apply chat template to messages.

Render the chat template for messages (fallback on failure).

Parameters
messages: Conversation history.
params: Generation parameters (for enable_thinking).
Returns
Formatted prompt string.
Version
1.8.2

Definition at line 690 of file llama_cpp_backend.cpp.

◆ clear_prompt_cache()

void entropic::LlamaCppBackend::clear_prompt_cache ( )
inline override virtual

Drop every cached prefix so the next prefill re-seeds.

Called by the orchestrator on identity/prompt-prefix changes. No-op when the cache has not been constructed yet. (P1-7, 2.0.6-rc16)

Version
2.0.6-rc16

Reimplemented from entropic::InferenceBackend.

Definition at line 105 of file llama_cpp_backend.h.

◆ compute_prefix_token_count()

int entropic::LlamaCppBackend::compute_prefix_token_count ( const std::vector< Message > &  messages,
const GenerationParams &  params
)
protected

Compute token count of system messages only.

Compute system prefix token count from messages.

Parameters
messages: Original message list.
params: Generation params (for template).
Returns
Token count of the system prefix, 0 if no system message.
Version
1.8.3

Definition at line 1032 of file llama_cpp_backend.cpp.

◆ create_inference_context()

bool entropic::LlamaCppBackend::create_inference_context ( )
protected

Create the llama context + prompt cache (do_activate step 2).

Extracted from do_activate. Builds ctx_ from model_ and the configured context params, then lazily creates the prompt cache.

Returns
true on success; sets last_error_ on failure.

Definition at line 282 of file llama_cpp_backend.cpp.

◆ create_sampler()

llama_sampler * entropic::LlamaCppBackend::create_sampler ( const GenerationParams &  params) const
protected

Create sampler chain from generation params.

Parameters
params: Generation parameters.
Returns
Sampler chain (caller frees via llama_sampler_free).
Version
1.8.2

Chain order per llama.cpp convention: grammar → penalties → temperature → top-k → top-p → dist

seed < 0 maps to LLAMA_DEFAULT_SEED (random). (P2-14)

Definition at line 732 of file llama_cpp_backend.cpp.

◆ decode_loop()

GenerationResult entropic::LlamaCppBackend::decode_loop ( const std::vector< llama_token > &  tokens,
const GenerationParams &  params,
std::function< void(std::string_view)>  on_token,
std::atomic< bool > *  cancel 
)
protected

Core decode loop shared by generate, streaming, and complete.

Parameters
tokens: Input token sequence.
params: Generation parameters.
on_token: Per-token callback (nullptr for batch).
cancel: Cancel flag (nullptr for batch).
Returns
GenerationResult.
Version
1.8.2

Definition at line 857 of file llama_cpp_backend.cpp.

◆ decode_tokens_from()

bool entropic::LlamaCppBackend::decode_tokens_from ( const std::vector< llama_token > &  tokens,
int  start_offset 
)
protected

Decode tokens starting at a given offset.

Decode remaining tokens starting at start_offset.

Parameters
tokens: Full token sequence.
start_offset: Index of the first token to decode.
Returns
true on success, false on decode failure.
Version
2.0.6

Assumes seq_pos_max(0) == start_offset - 1 so that llama_batch_get_one auto-positions tokens at start_offset onward.

Definition at line 939 of file llama_cpp_backend.cpp.

◆ detokenize()

std::string entropic::LlamaCppBackend::detokenize ( llama_token  token) const
protected

Detokenize a single token.

Detokenize a single token to string.

Parameters
token: Token ID.
Returns
String representation.
Version
1.8.2

Definition at line 460 of file llama_cpp_backend.cpp.

◆ do_activate()

bool entropic::LlamaCppBackend::do_activate ( )
override protected virtual

Activate model on GPU (WARM → ACTIVE).

Reloads model with n_gpu_layers from config, then creates inference context with KV cache. v2.2.7 (gh#61) wired cache_type_k/v. v2.2.8 (gh#58 follow-up) added the diagnostic-rich error message. v2.2.9 extracted the cparams builder to satisfy the SLOC gate.

Returns
true on success.

Implements entropic::InferenceBackend.

Definition at line 230 of file llama_cpp_backend.cpp.

◆ do_backend_name()

std::string entropic::LlamaCppBackend::do_backend_name ( ) const
override protected virtual

Return backend name.

Returns
"llama.cpp".

Implements entropic::InferenceBackend.

Definition at line 2431 of file llama_cpp_backend.cpp.

◆ do_clear_state()

bool entropic::LlamaCppBackend::do_clear_state ( int  seq_id)
override protected virtual

Clear KV cache or recurrent hidden state.

Parameters
seq_id: Sequence ID, or -1 for all sequences.
Returns
true on success.

Reimplemented from entropic::InferenceBackend.

Definition at line 2475 of file llama_cpp_backend.cpp.

◆ do_complete()

GenerationResult entropic::LlamaCppBackend::do_complete ( const std::string &  prompt,
const GenerationParams &  params
)
override protected virtual

Raw text completion without chat template.

Parameters
prompt: Raw prompt string.
params: Generation parameters.
Returns
GenerationResult.

Implements entropic::InferenceBackend.

Definition at line 2357 of file llama_cpp_backend.cpp.

◆ do_count_tokens()

int entropic::LlamaCppBackend::do_count_tokens ( const std::string &  text) const
override protected virtual

Count tokens in text.

Parameters
text: Input text.
Returns
Token count.

Implements entropic::InferenceBackend.

Definition at line 504 of file llama_cpp_backend.cpp.

◆ do_deactivate()

void entropic::LlamaCppBackend::do_deactivate ( )
override protected virtual

Deactivate: free context, reload model CPU-only.

Implements entropic::InferenceBackend.

Definition at line 349 of file llama_cpp_backend.cpp.

◆ do_evaluate_logprobs()

LogprobResult entropic::LlamaCppBackend::do_evaluate_logprobs ( const int32_t *  tokens,
int  n_tokens 
)
override protected virtual

Evaluate per-token log-probabilities via sequential decode.

Clears memory, then processes tokens one at a time using the same decode path as generation. After each token, extracts logits for the next-token prediction. Compatible with recurrent/hybrid models that only support single-output-position batches.

Parameters
tokens: Token IDs to evaluate.
n_tokens: Number of tokens (minimum 2).
Returns
LogprobResult with per-transition logprobs and perplexity.

Implements entropic::InferenceBackend.

Definition at line 538 of file llama_cpp_backend.cpp.
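
LogprobResult reports a perplexity alongside the per-transition log-probabilities. Assuming the conventional definition (the exponential of the negative mean natural-log probability), which this page does not confirm, the relationship can be sketched as below. perplexity_sketch is a hypothetical helper and not part of the Entropic API.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Conventional perplexity: exp of the negative mean log-probability.
// Assumes natural-log probabilities, one entry per token transition.
float perplexity_sketch(const std::vector<float>& logprobs) {
    assert(!logprobs.empty());  // n_tokens >= 2 yields at least one transition
    double total = 0.0;         // accumulate in double to limit rounding error
    for (float lp : logprobs) total += lp;
    const double mean = total / static_cast<double>(logprobs.size());
    return static_cast<float>(std::exp(-mean));
}
```

Under this definition a sequence whose every transition has probability 0.5 has perplexity 2, and lower values indicate the model found the sequence more predictable.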

◆ do_generate()

GenerationResult entropic::LlamaCppBackend::do_generate ( const std::vector< Message > &  messages,
const GenerationParams &  params
)
override protected virtual

Generate a complete response using chat template.

v2.1.8 (gh#37 / v1.9.11 Phases 5–7): dispatches to generate_multimodal() when any message carries IMAGE content_parts AND the backend has vision (mmproj loaded). When images arrive but vision is not available, image parts are stripped with a warning and generation proceeds text-only.

Parameters
messages: Conversation history.
params: Generation parameters.
Returns
GenerationResult.

Implements entropic::InferenceBackend.

Definition at line 1388 of file llama_cpp_backend.cpp.

◆ do_generate_speculative()

GenerationResult entropic::LlamaCppBackend::do_generate_speculative ( const std::vector< Message > &  messages,
const GenerationParams &  params,
std::function< void(std::string_view token)>  on_token,
std::atomic< bool > &  cancel 
)
override protected virtual

Speculative streaming via the abstract InferenceBackend interface (kept as NOT_SUPPORTED; see generate_speculative_with_draft()).

Abstract speculative entry point.

The actual draft-pair-aware kernel lives in generate_speculative_with_draft and is called by the orchestrator after it has resolved the draft backend. This abstract override exists for backends with implicit draft resolution; LlamaCppBackend requires an explicit draft handle, so the abstract single-backend variant returns NOT_SUPPORTED. The orchestrator dynamic_casts both target and draft and calls generate_speculative_with_draft directly. (v2.1.11, gh#36)

Returns
GenerationResult with NOT_SUPPORTED.
Version
2.1.11

Reimplemented from entropic::InferenceBackend.

Definition at line 1556 of file llama_cpp_backend.cpp.

◆ do_generate_streaming()

GenerationResult entropic::LlamaCppBackend::do_generate_streaming ( const std::vector< Message > &  messages,
const GenerationParams &  params,
std::function< void(std::string_view token)>  on_token,
std::atomic< bool > &  cancel 
)
override protected virtual

Streaming generation with per-token callback.

Parameters
messages: Conversation history.
params: Generation parameters.
on_token: Per-token callback.
cancel: Atomic cancel flag.
Returns
GenerationResult.

Implements entropic::InferenceBackend.

Definition at line 1469 of file llama_cpp_backend.cpp.

◆ do_generate_streaming_text_only()

GenerationResult entropic::LlamaCppBackend::do_generate_streaming_text_only ( const std::vector< Message > &  messages,
const GenerationParams &  params,
std::function< void(std::string_view token)>  on_token,
std::atomic< bool > &  cancel 
)
protected

Text-only streaming generation (extracted from streaming).

Text-only streaming body (v2.1.8, extracted for knots SLOC).

Version
2.1.8

Definition at line 1493 of file llama_cpp_backend.cpp.

◆ do_generate_text_only()

GenerationResult entropic::LlamaCppBackend::do_generate_text_only ( const std::vector< Message > &  messages,
const GenerationParams &  params
)
protected

Text-only batch generation (extracted from do_generate).

Text-only generate body (v2.1.8, extracted for knots SLOC).

Version
2.1.8

Definition at line 1408 of file llama_cpp_backend.cpp.

◆ do_info()

BackendInfo entropic::LlamaCppBackend::do_info ( ) const
override protected virtual

Populate backend metadata from llama.cpp model.

Returns
BackendInfo with model-specific details.

Reimplemented from entropic::InferenceBackend.

Definition at line 2441 of file llama_cpp_backend.cpp.

◆ do_load()

bool entropic::LlamaCppBackend::do_load ( const ModelConfig &  config)
override protected virtual

Load model into CPU RAM (COLD → WARM).

Uses llama_model_load_from_file with n_gpu_layers=0 for CPU-only mmap+mlock loading. Model stays in page cache for fast reactivation.

Parameters
config: Validated model config.
Returns
true on success.

Implements entropic::InferenceBackend.

Definition at line 160 of file llama_cpp_backend.cpp.

◆ do_supports()

bool entropic::LlamaCppBackend::do_supports ( BackendCapability  cap) const
override protected virtual

Declare llama.cpp backend capabilities.

Parameters
cap: Capability to check.
Returns
true if this backend supports the capability.

Reimplemented from entropic::InferenceBackend.

Definition at line 2393 of file llama_cpp_backend.cpp.

◆ do_unload()

void entropic::LlamaCppBackend::do_unload ( )
override protected virtual

Full unload — free all resources, clear prompt cache.

Implements entropic::InferenceBackend.

Definition at line 399 of file llama_cpp_backend.cpp.

◆ extract_system_prompt()

std::string entropic::LlamaCppBackend::extract_system_prompt ( const std::vector< Message > &  messages)
static protected

Extract the system prompt from messages.

Extract system prompt text from message list.

Parameters
messages: Conversation history.
Returns
System prompt text, empty if no system message.
Version
1.8.3

Definition at line 916 of file llama_cpp_backend.cpp.

◆ extract_token_logprob()

float entropic::LlamaCppBackend::extract_token_logprob ( const float *  logits,
int32_t  next_token,
int  n_vocab 
)
static protected

Extract log-probability for a token from logits.

Computes log_softmax(logits)[next_token] using the numerically stable form: logits[t] - max - log(sum(exp(logits - max))).

Parameters
logits: Raw logits array from llama_get_logits_ith().
next_token: The token to score.
n_vocab: Vocabulary size.
Returns
log P(next_token | context).
Version
1.9.10

Definition at line 618 of file llama_cpp_backend.cpp.
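
The max-shifted log-softmax described for this function can be sketched in isolation as below. This is an illustrative reimplementation written for this page, not the code at the referenced definition; the name token_logprob_sketch and the double-precision accumulator are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for extract_token_logprob; not the library's source.
// Returns log_softmax(logits)[next_token] using the max-shifted form.
float token_logprob_sketch(const float* logits, int32_t next_token, int n_vocab) {
    assert(logits != nullptr && n_vocab > 0);
    assert(next_token >= 0 && next_token < n_vocab);

    // Subtracting the maximum keeps every exp() argument <= 0, so no overflow.
    const float max_logit = *std::max_element(logits, logits + n_vocab);

    double sum_exp = 0.0;  // accumulate in double to limit rounding error
    for (int i = 0; i < n_vocab; ++i) {
        sum_exp += std::exp(static_cast<double>(logits[i] - max_logit));
    }

    // logits[t] - max - log(sum(exp(logits - max)))
    return static_cast<float>((logits[next_token] - max_logit) - std::log(sum_exp));
}
```

The shift matters for large logits: exponentiating a raw value such as 1000 overflows a float, while the shifted form stays finite and gives the same result mathematically.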

◆ generate_multimodal()

GenerationResult entropic::LlamaCppBackend::generate_multimodal ( const std::vector< Message > &  messages,
const GenerationParams &  params,
std::function< void(std::string_view token)>  on_token,
std::atomic< bool > *  cancel 
)
protected

Multimodal generation core (v1.9.11 Phases 5–7).

Multimodal generation core (v2.1.8, gh#37 / v1.9.11 Phase 6).

Runs the libmtmd-backed prefill + decode for messages whose content_parts contain image entries. Builds an mtmd_bitmap list from ContentPart paths, inserts media markers in the chat-formatted prompt, then calls mtmd_helper_eval_chunks to encode images and decode all chunks in order.

After eval, sampling proceeds via the normal step_token loop — the cache state is positioned past the multimodal prefill.

Parameters
messages: Conversation history (must contain images).
params: Generation parameters.
on_token: Per-token callback (nullptr for batch mode).
cancel: Cancel atomic (nullptr for batch mode).
Returns
GenerationResult.
Version
2.1.8

Definition at line 1338 of file llama_cpp_backend.cpp.

◆ generate_speculative_with_draft()

GenerationResult entropic::LlamaCppBackend::generate_speculative_with_draft ( const std::vector< Message > &  messages,
const GenerationParams &  params,
std::function< void(std::string_view token)>  on_token,
std::atomic< bool > &  cancel,
LlamaCppBackend &  draft,
int  n_draft_max,
const std::string &  draft_path 
)

Speculative-decoding kernel with explicit draft backend.

Speculative generation against a draft model (gh#36).

Adapts the upstream speculative-simple reference loop at pin 253ba110b to entropic's idioms: drives a draft LlamaCppBackend through common_speculative_*, verifies in batch on the target, and emits one on_token callback per accepted token (not per proposed) — preserving the standard streaming contract. Honors cancel between accept rounds; latency is one accept round (typically 1–8 tokens).

Correctness contract: output distribution bit-identical to plain decode on rejection cases. Verified by test_speculative_correctness.cpp against a Qwen3.6-A3B target + Qwen3.5-0.8B draft.

Constraints (v2.1.11):

  • Both target and draft must report common_context_can_seq_rm == COMMON_CONTEXT_SEQ_RM_TYPE_FULL. Falls back to NOT_SUPPORTED otherwise (partial-acceptance checkpoint restore is deferred — see decision log #41).
  • Both backends must be ACTIVE.
  • Caller (orchestrator) is responsible for compat verification before calling — this entry trusts the pair.
Parameters
messages: Conversation history.
params: Generation parameters (samplers + max_tokens + seed).
on_token: Callback fired once per accepted token.
cancel: Cancellation flag (polled between accept rounds).
draft: Draft backend (must be ACTIVE).
Returns
GenerationResult with content, token_count, finish_reason.
Version
2.1.11

Definition at line 2312 of file llama_cpp_backend.cpp.
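
As a rough illustration of the accept-round structure described above, the toy below runs greedy speculative decoding over plain integer tokens. It is a sketch under strong assumptions: the NextToken callbacks stand in for the draft and target models, verification is done token by token instead of in one batch, sampling is greedy only, and none of the names come from Entropic or from llama.cpp's common_speculative_* helpers.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical model stand-in: maps a token context to the next token.
using NextToken = std::function<int(const std::vector<int>&)>;

std::vector<int> speculative_greedy_sketch(
    std::vector<int> context, const NextToken& target, const NextToken& draft,
    int n_draft_max, int max_tokens, int eos,
    const std::function<void(int)>& on_token, const std::atomic<bool>& cancel) {
    assert(n_draft_max >= 1 && max_tokens >= 0);
    std::vector<int> out;
    bool done = false;
    // Cancel is polled once per accept round, never mid-round.
    while (!done && static_cast<int>(out.size()) < max_tokens && !cancel.load()) {
        // 1. Draft proposes n_draft_max tokens autoregressively.
        std::vector<int> draft_ctx = context;
        std::vector<int> proposed;
        for (int i = 0; i < n_draft_max; ++i) {
            proposed.push_back(draft(draft_ctx));
            draft_ctx.push_back(proposed.back());
        }
        // 2. Target verifies each position (a real kernel batches this step).
        //    The emitted token is always the target's own choice, so the
        //    output matches plain greedy decoding on the target.
        for (int i = 0; i <= n_draft_max && !done; ++i) {
            const int verified = target(context);
            context.push_back(verified);
            out.push_back(verified);
            if (on_token) on_token(verified);  // one callback per accepted token
            if (verified == eos || static_cast<int>(out.size()) >= max_tokens) {
                done = true;
            } else if (i == n_draft_max || verified != proposed[i]) {
                break;  // bonus token emitted, or draft rejected: end the round
            }
        }
    }
    return out;
}
```

Because the callback only fires for tokens the target has confirmed, a consumer never sees a proposed token that is later rejected, which is the streaming contract the description refers to.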

◆ init_mmproj_if_configured()

void entropic::LlamaCppBackend::init_mmproj_if_configured ( )
protected

Initialize the libmtmd context if mmproj is configured.

Initialize libmtmd context if mmproj is configured (v2.1.8).

Version
2.1.8

Extracted from do_activate to keep that function under the knots SLOC threshold. mtmd holds a reference to the live model_, so init runs after the GPU reload and before any generation. Failure is non-fatal — the engine falls back to text-only with a logged diagnostic.

Definition at line 319 of file llama_cpp_backend.cpp.

◆ is_recurrent()

bool entropic::LlamaCppBackend::is_recurrent ( ) const
protected

Check if loaded model is recurrent.

Returns
true if GDN/Mamba/RWKV architecture.
Version
1.9.13

Definition at line 2380 of file llama_cpp_backend.cpp.

◆ llama_context_ptr()

llama_context * entropic::LlamaCppBackend::llama_context_ptr ( )
inline

Get the active llama_context pointer.

Returns
nullptr if state is not ACTIVE.
Version
1.9.2

Definition at line 134 of file llama_cpp_backend.h.

◆ llama_model_ptr()

llama_model * entropic::LlamaCppBackend::llama_model_ptr ( )
inline

Get the loaded llama_model pointer.

Returns
nullptr if state is COLD.
Version
1.9.2

Definition at line 126 of file llama_cpp_backend.h.

◆ load_gpu_model()

bool entropic::LlamaCppBackend::load_gpu_model ( )
protected

Load the GGUF model onto the GPU (do_activate step 1).

Extracted from do_activate to keep it knots-clean. Sets model_ + vocab_ on success; sets last_error_ and returns false otherwise.

Returns
true on success; sets last_error_ on failure.

Definition at line 243 of file llama_cpp_backend.cpp.

◆ mtmd_prefill()

entropic_error_t entropic::LlamaCppBackend::mtmd_prefill ( const std::string &  prompt,
const std::vector<::mtmd_bitmap * > &  bitmaps,
std::string &  err_msg 
)
protected

Run mtmd_tokenize + mtmd_helper_eval_chunks on a prompt.

mtmd prefill helper (v2.1.8) — tokenize + eval chunks.

Parameters
prompt: Marker-substituted chat-formatted prompt.
bitmaps: Loaded image bitmaps in marker order (borrowed).
[out] err_msg: Filled on failure.
Returns
ENTROPIC_OK on success, error code on failure.
Version
2.1.8

Wraps mtmd_tokenize + mtmd_helper_eval_chunks. KV cache is cleared first so prefill always starts at seq position 0. Bitmap ownership stays with the caller (mtmd_tokenize borrows for the call).

Definition at line 1253 of file llama_cpp_backend.cpp.

◆ prefill_and_cache_prefix()

bool entropic::LlamaCppBackend::prefill_and_cache_prefix ( const std::vector< llama_token > &  tokens,
int  prefix_tokens,
const CacheKey &  key
)
protected

Two-pass prefill: prefix-only prefill → save → rest.

Prefill in two passes: prefix → save → remainder.

Parameters
tokensFull token sequence.
prefix_tokensSystem prefix token count.
keyCache key to store under.
Returns
true on success.
Version
2.0.6

The v2.0.6 correctness fix. llama_state_seq_get_data has no range parameter, so any save captures whatever KV state happens to be in seq 0 at save time. By prefilling ONLY the system prefix first, saving, and then continuing with the rest of the prompt, we guarantee the cache entry covers exactly prefix_tokens positions — no residue from later conversation tokens can leak into a subsequent delegation's cache hit.

If prefix_tokens is 0 or >= total tokens, falls back to a plain full prefill without caching (nothing meaningful to cache).

Definition at line 1072 of file llama_cpp_backend.cpp.
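
The two-pass ordering described above can be sketched without llama.cpp. In this illustration FakeKv, FakeCache and two_pass_prefill are hypothetical stand-ins, not part of the backend: FakeKv only records decoded tokens, and its snapshot() mimics a whole-sequence save that takes no range argument. The point is that saving between the two passes yields an entry of exactly prefix_tokens positions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for seq 0 of a KV cache: it only records which
// tokens have been decoded so far.
struct FakeKv {
    std::vector<int> positions;

    // Decode tokens[begin, end) and append them to the sequence.
    void decode(const std::vector<int>& tokens, std::size_t begin, std::size_t end) {
        assert(begin <= end && end <= tokens.size());
        positions.insert(positions.end(),
                         tokens.begin() + static_cast<std::ptrdiff_t>(begin),
                         tokens.begin() + static_cast<std::ptrdiff_t>(end));
    }

    // Whole-sequence snapshot: like a save call with no range argument,
    // it captures everything currently in the sequence.
    std::vector<int> snapshot() const { return positions; }
};

// Hypothetical cache: key -> saved sequence state.
using FakeCache = std::map<std::string, std::vector<int>>;

// Two-pass prefill: prefix -> save -> remainder.
bool two_pass_prefill(FakeKv& kv, FakeCache& cache, const std::vector<int>& tokens,
                      int prefix_tokens, const std::string& key) {
    const std::size_t total = tokens.size();
    if (prefix_tokens <= 0 || static_cast<std::size_t>(prefix_tokens) >= total) {
        kv.decode(tokens, 0, total);   // plain full prefill, nothing to cache
        return true;
    }
    const auto prefix = static_cast<std::size_t>(prefix_tokens);
    kv.decode(tokens, 0, prefix);      // pass 1: system prefix only
    cache[key] = kv.snapshot();        // entry holds exactly `prefix` positions
    kv.decode(tokens, prefix, total);  // pass 2: remainder of the prompt
    return true;
}
```

Saving after a single full prefill would instead store all positions under the prefix key, which is the leak the two-pass order avoids.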

◆ release_temp_seq_id()

void entropic::LlamaCppBackend::release_temp_seq_id ( llama_seq_id  seq_id)
protected

Release a temporary sequence ID back to the pool.

Parameters
seq_id  The seq_id to release.
Version
1.9.10

Definition at line 600 of file llama_cpp_backend.cpp.
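
A minimal sketch of a mutex-guarded pool of temporary sequence IDs, in the spirit of free_seq_ids_ and seq_id_mutex_. SeqIdPool, its method names, and the choice to hand out IDs starting at 1 (leaving seq 0 for the primary sequence) are assumptions made for illustration; the real acquire/release logic is not shown in this reference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>
#include <vector>

using seq_id_t = std::int32_t;  // stand-in for llama_seq_id

// Hypothetical pool of temporary sequence IDs.
class SeqIdPool {
public:
    // Fills the pool with IDs 1..count (seq 0 is assumed to be reserved).
    explicit SeqIdPool(seq_id_t count) {
        assert(count >= 0);
        for (seq_id_t id = count; id >= 1; --id) {
            free_.push_back(id);
        }
    }

    // Take an ID from the pool, or std::nullopt if none is free.
    std::optional<seq_id_t> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) {
            return std::nullopt;
        }
        const seq_id_t id = free_.back();
        free_.pop_back();
        return id;
    }

    // Return an ID to the pool so a later acquire() can reuse it.
    void release(seq_id_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(id);
    }

    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    mutable std::mutex mutex_;     // guards free_
    std::vector<seq_id_t> free_;   // currently unused IDs
};
```

Holding the lock for the whole push or pop keeps concurrent callers from receiving the same ID.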

◆ restore_cached_prefix()

bool entropic::LlamaCppBackend::restore_cached_prefix ( const CacheEntry cached,
const std::vector< llama_token > &  tokens 
)
protected

Restore KV state from cache and decode remaining tokens.

Parameters
cached  Cache entry to restore.
tokens  Full token sequence.
Returns
true on success, false to fall back to full prefill.
Version
2.0.6

After v2.0.6 the cached state contains ONLY the system prefix (two-pass prefill saves at the prefix boundary, see prefill_and_cache_prefix). Restore is therefore clean by construction: llama_state_seq_set_data leaves seq 0 with exactly cached->token_count positions filled, and llama_batch_get_one auto-positions subsequent decodes at that boundary.

Definition at line 978 of file llama_cpp_backend.cpp.

◆ run_prefill()

bool entropic::LlamaCppBackend::run_prefill ( const std::vector< llama_token > &  tokens)
protected

Run batched prefill on input tokens.

Parameters
tokens  Input token sequence.
Returns
true on success.
Version
1.8.2

Definition at line 790 of file llama_cpp_backend.cpp.
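
As an illustration of what batched prefill generally involves, the sketch below splits a token sequence into fixed-size chunks and stops at the first failed chunk. batched_prefill, the n_batch parameter and the decode_chunk callback are hypothetical; how run_prefill actually chunks and submits tokens is not documented here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// decode_chunk stands in for the backend's decode call: it receives a
// pointer to the first token of a chunk and the chunk length, and returns
// false on failure.
using DecodeChunk = std::function<bool(const int*, std::size_t)>;

// Feed `tokens` to decode_chunk in chunks of at most n_batch tokens.
bool batched_prefill(const std::vector<int>& tokens, std::size_t n_batch,
                     const DecodeChunk& decode_chunk) {
    assert(decode_chunk);
    if (n_batch == 0) {
        return false;  // a zero batch size can never make progress
    }
    for (std::size_t offset = 0; offset < tokens.size(); offset += n_batch) {
        const std::size_t n = std::min(n_batch, tokens.size() - offset);
        if (!decode_chunk(tokens.data() + offset, n)) {
            return false;  // abort on the first failed chunk
        }
    }
    return true;
}
```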

◆ run_prefill_cached()

bool entropic::LlamaCppBackend::run_prefill_cached ( const std::vector< llama_token > &  tokens,
const std::string &  system_prompt,
const std::vector< Message > &  messages,
const GenerationParams params 
)
protected

Run prefill with prompt cache integration.

Parameters
tokens  Full token sequence.
system_prompt  System prompt text for cache key.
messages  Original messages (for prefix boundary).
params  Generation parameters.
Returns
true on success.
Version
1.8.3

On cache hit: restore prefix KV and decode the remainder. On cache miss: two-pass prefill (prefix → save → remainder) so the stored cache entry contains prefix-only state.

Definition at line 1114 of file llama_cpp_backend.cpp.
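
The hit/miss dispatch can be sketched as follows. PrefillOps, PrefillPath and prefill_with_cache are hypothetical names, and the callbacks stand in for the cache lookup, restore_cached_prefix, run_prefill and prefill_and_cache_prefix. Falling back to a full prefill when restore fails is an assumption, based only on the documented return value of restore_cached_prefix.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Which path the dispatcher took (for illustration and testing only).
enum class PrefillPath { Restored, FullFallback, TwoPass };

// Hypothetical callbacks standing in for the backend's helpers.
struct PrefillOps {
    std::function<bool(const std::string&)> has_entry;  // cache lookup
    std::function<bool(const std::string&)> restore;    // restore + decode remainder
    std::function<bool()> full_prefill;                 // plain prefill, no caching
    std::function<bool(const std::string&)> two_pass;   // prefix -> save -> remainder
};

bool prefill_with_cache(const PrefillOps& ops, const std::string& key, PrefillPath& path) {
    assert(ops.has_entry && ops.restore && ops.full_prefill && ops.two_pass);
    if (ops.has_entry(key)) {
        if (ops.restore(key)) {
            path = PrefillPath::Restored;  // cache hit, restore succeeded
            return true;
        }
        path = PrefillPath::FullFallback;  // cache hit, but restore failed
        return ops.full_prefill();
    }
    path = PrefillPath::TwoPass;           // cache miss: seed the cache
    return ops.two_pass(key);
}
```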

◆ run_sampling_loop()

GenerationResult entropic::LlamaCppBackend::run_sampling_loop ( const GenerationParams params,
std::function< void(std::string_view token)>  on_token,
std::atomic< bool > *  cancel,
const std::chrono::steady_clock::time_point &  t0 
)
protected

Sample tokens until stop / max_tokens / cancel.

Shared by generate_multimodal and the text-only paths after prefill. Operates on the already-positioned ctx_ KV cache.

Parameters
params  Generation parameters.
on_token  Per-token callback (nullable).
cancel  Atomic cancel flag (nullable).
t0  Generation start time for finalize_result.
Returns
GenerationResult.
Version
2.1.8

Mirrors the body of both text-only generation variants, factored out so generate_multimodal can reuse it after mtmd_prefill.

Definition at line 1296 of file llama_cpp_backend.cpp.
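
A minimal sketch of a loop that ends on stop, max_tokens, or cancel, with a nullable cancel flag as in the signature above. LoopResult, sampling_loop, the step callback and the finish_reason strings "length" and "cancelled" are assumptions for illustration; only the four step statuses come from the step_token documentation.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result type: why the loop ended and how many tokens it produced.
struct LoopResult {
    std::string finish_reason;
    int tokens;
};

// `step` stands in for one sampling step and returns "continue", "stop",
// "eos" or "error". `cancel` may be nullptr.
LoopResult sampling_loop(int max_tokens, const std::atomic<bool>* cancel,
                         const std::function<std::string()>& step) {
    assert(step);
    LoopResult result{"length", 0};  // default: token budget exhausted
    while (result.tokens < max_tokens) {
        if (cancel != nullptr && cancel->load()) {
            result.finish_reason = "cancelled";
            return result;
        }
        const std::string status = step();
        if (status != "continue" && status != "stop") {
            result.finish_reason = status;  // "eos" or "error": no token counted
            return result;
        }
        ++result.tokens;
        if (status == "stop") {
            result.finish_reason = "stop";  // a stop sequence was completed
            return result;
        }
    }
    return result;
}
```

Checking the flag once per iteration bounds cancellation latency to a single decode step.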

◆ save_prefix_to_cache()

void entropic::LlamaCppBackend::save_prefix_to_cache ( const CacheKey key,
int  prefix_tokens 
)
protected

Capture seq 0 KV state and store under the given key.

Parameters
key  Cache key.
prefix_tokens  Token count to record with the entry.
Version
2.0.6

The caller is responsible for ensuring seq 0 currently contains exactly prefix_tokens positions — this function trusts that contract and serializes whatever is there.

Definition at line 1007 of file llama_cpp_backend.cpp.

◆ set_prompt_cache_config()

void entropic::LlamaCppBackend::set_prompt_cache_config ( const PromptCacheConfig config)
inline

Set prompt cache configuration.

Must be called before activate(). The config is consumed when the cache is constructed during do_activate().

Parameters
config  Prompt cache configuration.
Version
1.8.3

Definition at line 91 of file llama_cpp_backend.h.

◆ step_token()

std::string entropic::LlamaCppBackend::step_token ( llama_sampler *  sampler,
std::string &  generated,
std::function< void(std::string_view)> &  on_token,
const std::vector< std::string > &  stop 
)
protected

Generate one token and append to output.

Parameters
sampler  Sampler chain.
generated  Accumulated output string (mutated).
on_token  Streaming callback (may be nullptr).
stop  Stop sequences.
Returns
"continue", "stop", "eos", or "error".
Version
1.8.2

Definition at line 820 of file llama_cpp_backend.cpp.
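
To illustrate the four status strings, here is a sketch of a single step with sampling replaced by an already-chosen token and its text piece. step_once, ends_with and kFakeEos are hypothetical, as is treating negative token IDs as errors. Whether the real step_token strips a matched stop sequence from the output is not stated here; this sketch leaves it in place.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

constexpr int kFakeEos = 0;  // hypothetical end-of-sequence token ID

// True if `text` ends with `suffix`.
bool ends_with(const std::string& text, const std::string& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// One generation step over an already-sampled token.
std::string step_once(int token, std::string_view piece, std::string& generated,
                      const std::function<void(std::string_view)>& on_token,
                      const std::vector<std::string>& stop) {
    if (token < 0) {
        return "error";          // invalid token (assumed convention)
    }
    if (token == kFakeEos) {
        return "eos";            // nothing is appended for end-of-sequence
    }
    generated.append(piece);
    if (on_token) {
        on_token(piece);         // stream the piece if a callback is set
    }
    for (const std::string& s : stop) {
        if (!s.empty() && ends_with(generated, s)) {
            return "stop";       // output now ends with a stop sequence
        }
    }
    return "continue";
}

// Usage sketch: the second step completes the stop sequence.
inline void step_once_demo() {
    std::string out;
    const std::vector<std::string> stop{"</s>"};
    assert(step_once(5, "Hi", out, nullptr, stop) == "continue");
    assert(step_once(6, "</s>", out, nullptr, stop) == "stop");
}
```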

◆ tokenize()

std::vector< llama_token > entropic::LlamaCppBackend::tokenize ( const std::string &  text,
bool  add_special 
) const
protected

Tokenize text using model vocabulary.

Parameters
text  Input text.
add_special  Add special tokens (BOS/EOS).
Returns
Vector of token IDs.
Version
1.8.2

Definition at line 430 of file llama_cpp_backend.cpp.

◆ tokenize_text()

std::vector< int32_t > entropic::LlamaCppBackend::tokenize_text ( const std::string &  text) const
overridevirtual

Tokenize text to token IDs using model vocabulary.

Parameters
text  Input text.
Returns
Token ID vector with BOS.
Version
1.10.2

Reimplemented from entropic::InferenceBackend.

Definition at line 516 of file llama_cpp_backend.cpp.

Member Data Documentation

◆ ctx_

llama_context* entropic::LlamaCppBackend::ctx_ = nullptr
protected

Inference context (ACTIVE)

Definition at line 241 of file llama_cpp_backend.h.

◆ free_seq_ids_

std::vector<llama_seq_id> entropic::LlamaCppBackend::free_seq_ids_
protected

Available temporary seq_ids (v1.9.10)

Definition at line 437 of file llama_cpp_backend.h.

◆ has_vision_

bool entropic::LlamaCppBackend::has_vision_ = false
protected

Cached mtmd_support_vision(mtmd_ctx_) result.

Version
2.1.8

Definition at line 468 of file llama_cpp_backend.h.

◆ is_recurrent_

bool entropic::LlamaCppBackend::is_recurrent_ = false
protected

True if loaded model is recurrent (GDN/Mamba/RWKV).

Set during do_load() from llama_model_is_recurrent(). Drives capability reporting (KV_CACHE vs HIDDEN_STATE, speculative decoding compatibility, etc.).

Version
1.9.13

Definition at line 446 of file llama_cpp_backend.h.

◆ model_

llama_model* entropic::LlamaCppBackend::model_ = nullptr
protected

Loaded model (WARM+)

Definition at line 240 of file llama_cpp_backend.h.

◆ mtmd_ctx_

::mtmd_context* entropic::LlamaCppBackend::mtmd_ctx_ = nullptr
protected

libmtmd context, or nullptr if no mmproj loaded.

Allocated in do_activate() when ModelConfig::mmproj_path is set; freed in do_deactivate()/do_unload(). The leading :: is required — without it mtmd_context resolves into the entropic:: namespace and the forward declaration becomes an incompatible incomplete type at the call sites.

Version
2.1.8

Definition at line 464 of file llama_cpp_backend.h.
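
The name-lookup issue behind the leading :: can be reproduced in isolation. The sketch below assumes the problem arises from a forward declaration written inside a namespace; the namespaces demo and shadowed and the Holder structs are hypothetical, and only the type name mtmd_context is taken from the text above.

```cpp
#include <type_traits>

struct mtmd_context;  // global forward declaration, as a C library would provide

namespace demo {
struct Holder {
    ::mtmd_context* ptr = nullptr;  // explicitly the global type
};
}  // namespace demo

namespace shadowed {
struct mtmd_context;                // declares a NEW type: shadowed::mtmd_context
struct Holder {
    mtmd_context* ptr = nullptr;    // unqualified lookup finds shadowed::mtmd_context
};
}  // namespace shadowed

// The qualified member has the library's pointer type...
static_assert(std::is_same<decltype(demo::Holder::ptr), ::mtmd_context*>::value,
              "qualified name refers to the global type");
// ...while the unqualified one is a distinct, incompatible incomplete type.
static_assert(!std::is_same<decltype(shadowed::Holder::ptr), ::mtmd_context*>::value,
              "unqualified name refers to a different type");
```

Passing shadowed::Holder::ptr to a function that expects ::mtmd_context* would fail to compile, which matches the incompatibility described above.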

◆ prompt_cache_

std::unique_ptr<PromptCache> entropic::LlamaCppBackend::prompt_cache_
protected

KV prefix cache (v1.8.3)

Definition at line 247 of file llama_cpp_backend.h.

◆ prompt_cache_config_

PromptCacheConfig entropic::LlamaCppBackend::prompt_cache_config_
protected

Cache config (v1.8.3)

Definition at line 246 of file llama_cpp_backend.h.

◆ seq_id_mutex_

std::mutex entropic::LlamaCppBackend::seq_id_mutex_
protected

Guards temp seq_id pool (v1.9.10)

Definition at line 436 of file llama_cpp_backend.h.

◆ vocab_

const llama_vocab* entropic::LlamaCppBackend::vocab_ = nullptr
protected

Vocabulary (from model_)

Definition at line 242 of file llama_cpp_backend.h.


The documentation for this class was generated from the following files: