|
Entropic 2.3.8
Local-first agentic inference engine
|
LlamaCppBackend — common llama.cpp patterns (15% layer).
#include </home/runner/work/entropic/entropic/src/inference/llama_cpp_backend.h>


Public Member Functions | |
| ~LlamaCppBackend () override | |
| Free llama.cpp + mtmd resources on destruction. | |
| void | set_prompt_cache_config (const PromptCacheConfig &config) |
| Set prompt cache configuration. | |
| void | clear_prompt_cache () override |
| Drop every cached prefix so the next prefill re-seeds. | |
| std::vector< int32_t > | tokenize_text (const std::string &text) const override |
| Tokenize text to token IDs using model vocabulary. | |
| llama_model * | llama_model_ptr () |
| Get the loaded llama_model pointer. | |
| llama_context * | llama_context_ptr () |
| Get the active llama_context pointer. | |
| GenerationResult | generate_speculative_with_draft (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel, LlamaCppBackend &draft, int n_draft_max, const std::string &draft_path) |
| Speculative-decoding kernel with explicit draft backend. | |
Public Member Functions inherited from entropic::InferenceBackend | |
| bool | load (const ModelConfig &config) |
| Load model into CPU RAM (COLD → WARM). | |
| bool | activate () |
| Promote to GPU (WARM → ACTIVE). | |
| void | deactivate () |
| Release GPU layers (ACTIVE → WARM). | |
| void | unload () |
| Full unload (→ COLD). | |
| bool | load_and_activate (const ModelConfig &config) |
| Convenience: load() + activate(). | |
| GenerationResult | generate (const std::vector< Message > &messages, const GenerationParams &params) |
| Generate a complete response. | |
| GenerationResult | generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Generate with per-token streaming callback. | |
| GenerationResult | generate_speculative (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Generate via the speculative-decoding kernel (v2.1.11). | |
| GenerationResult | complete (const std::string &prompt, const GenerationParams &params) |
| Raw text completion without chat template. | |
| LogprobResult | evaluate_logprobs (const int32_t *tokens, int n_tokens) |
| Evaluate per-token log-probabilities for a token sequence. | |
| float | compute_perplexity (const int32_t *tokens, int n_tokens) |
| Compute perplexity for a token sequence. | |
| ModelState | state () const |
| Current lifecycle state (lock-free read). | |
| bool | is_active () const |
| True when state is ACTIVE. | |
| bool | is_loaded () const |
| True when state is WARM or ACTIVE. | |
| int | count_tokens (const std::string &text) const |
| Count tokens using model's tokenizer. | |
| int | context_length () const |
| Model's context window size. | |
| const ModelConfig & | config () const |
| Stored model config. | |
| bool | supports (BackendCapability cap) const |
| Query whether this backend supports a capability. | |
| std::vector< BackendCapability > | capabilities () const |
| Get all supported capabilities as a vector. | |
| BackendInfo | info () const |
| Get backend metadata. | |
| bool | save_state (int seq_id, std::vector< uint8_t > &buffer) const |
| Save model state to buffer. | |
| bool | restore_state (int seq_id, const std::vector< uint8_t > &buffer) |
| Restore model state from buffer. | |
| bool | clear_state (int seq_id=-1) |
| Clear/reset model state for a sequence. | |
| GenerationResult | generate_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params) |
| Generate with explicit sequence ID. | |
| GenerationResult | generate_streaming_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Streaming generation with explicit sequence ID. | |
Protected Member Functions | |
| bool | do_load (const ModelConfig &config) override |
| Load model into CPU RAM (COLD → WARM). | |
| bool | do_activate () override |
| Activate model on GPU (WARM → ACTIVE). | |
| void | do_deactivate () override |
| Deactivate: free context, reload model CPU-only. | |
| void | do_unload () override |
| Full unload — free all resources, clear prompt cache. | |
| GenerationResult | do_generate (const std::vector< Message > &messages, const GenerationParams &params) override |
| Generate a complete response using chat template. | |
| GenerationResult | do_generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) override |
| Streaming generation with per-token callback. | |
| GenerationResult | do_generate_speculative (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) override |
| Speculative streaming via the abstract InferenceBackend interface (kept as NOT_SUPPORTED — see kernel entry below). | |
| GenerationResult | do_complete (const std::string &prompt, const GenerationParams &params) override |
| Raw text completion without chat template. | |
| int | do_count_tokens (const std::string &text) const override |
| Count tokens in text. | |
| LogprobResult | do_evaluate_logprobs (const int32_t *tokens, int n_tokens) override |
| Evaluate per-token log-probabilities via sequential decode. | |
| bool | do_supports (BackendCapability cap) const override |
| Declare llama.cpp backend capabilities. | |
| std::string | do_backend_name () const override |
| Return backend name. | |
| BackendInfo | do_info () const override |
| Populate backend metadata from llama.cpp model. | |
| bool | do_clear_state (int seq_id) override |
| Clear KV cache or recurrent hidden state. | |
| std::vector< llama_token > | tokenize (const std::string &text, bool add_special) const |
| Tokenize text using model vocabulary. | |
| std::string | detokenize (llama_token token) const |
| Detokenize a single token. | |
| std::string | apply_chat_template (const std::vector< Message > &messages, const GenerationParams &params) const |
| Apply chat template to messages. | |
| GenerationResult | decode_loop (const std::vector< llama_token > &tokens, const GenerationParams &params, std::function< void(std::string_view)> on_token, std::atomic< bool > *cancel) |
| Core decode loop — shared by generate and streaming. | |
| bool | run_prefill (const std::vector< llama_token > &tokens) |
| Run batched prefill on input tokens. | |
| std::string | step_token (llama_sampler *sampler, std::string &generated, std::function< void(std::string_view)> &on_token, const std::vector< std::string > &stop) |
| Generate one token and append to output. | |
| llama_sampler * | create_sampler (const GenerationParams &params) const |
| Create sampler chain from generation params. | |
| bool | run_prefill_cached (const std::vector< llama_token > &tokens, const std::string &system_prompt, const std::vector< Message > &messages, const GenerationParams &params) |
| Run prefill with prompt cache integration. | |
| bool | decode_tokens_from (const std::vector< llama_token > &tokens, int start_offset) |
| Decode tokens starting at a given offset. | |
| bool | restore_cached_prefix (const CacheEntry *cached, const std::vector< llama_token > &tokens) |
| Restore KV state from cache and decode remaining tokens. | |
| bool | prefill_and_cache_prefix (const std::vector< llama_token > &tokens, int prefix_tokens, const CacheKey &key) |
| Two-pass prefill: prefix-only prefill → save → rest. | |
| void | save_prefix_to_cache (const CacheKey &key, int prefix_tokens) |
| Capture seq 0 KV state and store under the given key. | |
| int | compute_prefix_token_count (const std::vector< Message > &messages, const GenerationParams &params) |
| Compute token count of system messages only. | |
| llama_seq_id | allocate_temp_seq_id () |
| Allocate a temporary sequence ID for evaluation. | |
| void | release_temp_seq_id (llama_seq_id seq_id) |
| Release a temporary sequence ID back to the pool. | |
| bool | is_recurrent () const |
| Check if loaded model is recurrent. | |
| GenerationResult | generate_multimodal (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > *cancel) |
| Multimodal generation core (v1.9.11 Phases 5–7). | |
| void | init_mmproj_if_configured () |
| Initialize the libmtmd context if mmproj is configured. | |
| bool | load_gpu_model () |
| Load the GGUF model onto the GPU (do_activate step 1). | |
| bool | create_inference_context () |
| Create the llama context + prompt cache (do_activate step 2). | |
| entropic_error_t | mtmd_prefill (const std::string &prompt, const std::vector<::mtmd_bitmap * > &bitmaps, std::string &err_msg) |
| Run mtmd_tokenize + mtmd_helper_eval_chunks on a prompt. | |
| GenerationResult | run_sampling_loop (const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > *cancel, const std::chrono::steady_clock::time_point &t0) |
| Sample tokens until stop / max_tokens / cancel. | |
| GenerationResult | do_generate_text_only (const std::vector< Message > &messages, const GenerationParams &params) |
| Text-only batch generation (extracted from do_generate). | |
| GenerationResult | do_generate_streaming_text_only (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Text-only streaming generation (extracted from streaming). | |
Protected Member Functions inherited from entropic::InferenceBackend | |
| virtual bool | do_save_state (int seq_id, std::vector< uint8_t > &buffer) const |
| Save model state (KV cache or hidden state). | |
| virtual bool | do_restore_state (int seq_id, const std::vector< uint8_t > &buffer) |
| Restore model state. | |
| virtual GenerationResult | do_generate_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params) |
| Generate with sequence ID. | |
| virtual GenerationResult | do_generate_streaming_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Streaming generation with sequence ID. | |
| bool | fire_model_load_hook (const ModelConfig &config) |
| Fire ON_MODEL_LOAD pre-hook. | |
| void | set_hooks (const HookInterface &hooks) |
| Set the hook dispatch interface. | |
Static Protected Member Functions | |
| static std::string | extract_system_prompt (const std::vector< Message > &messages) |
| Extract the system prompt from messages. | |
| static float | extract_token_logprob (const float *logits, int32_t next_token, int n_vocab) |
| Extract log-probability for a token from logits. | |
Protected Attributes | |
| llama_model * | model_ = nullptr |
| Loaded model (WARM+) | |
| llama_context * | ctx_ = nullptr |
| Inference context (ACTIVE) | |
| const llama_vocab * | vocab_ = nullptr |
| Vocabulary (from model_) | |
| PromptCacheConfig | prompt_cache_config_ |
| Cache config (v1.8.3) | |
| std::unique_ptr< PromptCache > | prompt_cache_ |
| KV prefix cache (v1.8.3) | |
| std::mutex | seq_id_mutex_ |
| Guards temp seq_id pool (v1.9.10) | |
| std::vector< llama_seq_id > | free_seq_ids_ |
| Available temporary seq_ids (v1.9.10) | |
| bool | is_recurrent_ = false |
| True if loaded model is recurrent (GDN/Mamba/RWKV). | |
| ::mtmd_context * | mtmd_ctx_ = nullptr |
| libmtmd context, or nullptr if no mmproj loaded. | |
| bool | has_vision_ = false |
| Cached mtmd_support_vision(mtmd_ctx_) result. | |
Protected Attributes inherited from entropic::InferenceBackend | |
| std::string | last_error_ |
| Last error message for diagnostics. | |
LlamaCppBackend — common llama.cpp patterns (15% layer).
Provides decode loop, sampler chain creation, tokenization helpers. Pinned-version subclass overrides do_load/do_activate with version-specific API calls.
Definition at line 63 of file llama_cpp_backend.h.
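The lifecycle transitions listed in the member table (COLD → WARM → ACTIVE and back) can be pictured as a small state machine. The following is only an illustrative sketch: `ModelStateSketch` and `LifecycleSketch` are hypothetical names, and the behaviour on an out-of-order call (returning false from `activate()` when nothing is loaded) is an assumption, not something this page documents. The sketch is also not thread-safe, unlike the lock-free `state()` read described above.

```cpp
#include <cassert>

// Hypothetical mirror of the documented lifecycle states.
enum class ModelStateSketch { COLD, WARM, ACTIVE };

// Illustrative state machine; not the InferenceBackend implementation.
class LifecycleSketch {
public:
    // COLD -> WARM. Assumption: loading twice is rejected.
    bool load() {
        if (state_ != ModelStateSketch::COLD) return false;
        state_ = ModelStateSketch::WARM;
        return true;
    }
    // WARM -> ACTIVE. Assumption: activating from COLD or ACTIVE fails.
    bool activate() {
        if (state_ != ModelStateSketch::WARM) return false;
        state_ = ModelStateSketch::ACTIVE;
        assert(is_active());
        return true;
    }
    // ACTIVE -> WARM; a no-op in any other state.
    void deactivate() {
        if (state_ == ModelStateSketch::ACTIVE) state_ = ModelStateSketch::WARM;
    }
    // Any state -> COLD.
    void unload() { state_ = ModelStateSketch::COLD; }
    // Convenience: load() followed by activate().
    bool load_and_activate() { return load() && activate(); }

    ModelStateSketch state() const { return state_; }
    bool is_active() const { return state_ == ModelStateSketch::ACTIVE; }
    bool is_loaded() const { return state_ != ModelStateSketch::COLD; }

private:
    ModelStateSketch state_ = ModelStateSketch::COLD;
};
```

In this sketch `is_loaded()` is true for both WARM and ACTIVE, matching the member descriptions above.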
|
override |
Free llama.cpp + mtmd resources on destruction.
Destructor — route to do_unload() so GPU buffers don't leak.
gh#58 v2.2.7 follow-up: the base InferenceBackend destructor is defaulted, so without this override the raw model_ / ctx_ / mtmd_ctx_ pointers are never freed when the backend goes out of scope. The GPU buffers held by those objects stay allocated for the remainder of the process, and the next handle's GPU model load fails because llama.cpp's CUDA pool sees the prior allocations as stale but still occupied.
Definition at line 390 of file llama_cpp_backend.cpp.
|
protected |
Allocate a temporary sequence ID for evaluation.
Definition at line 584 of file llama_cpp_backend.cpp.
|
protected |
Apply chat template to messages.
Render the chat template for messages (fallback on failure).
| messages | Conversation history. |
| params | Generation parameters (for enable_thinking). |
Definition at line 690 of file llama_cpp_backend.cpp.
|
inlineoverridevirtual |
Drop every cached prefix so the next prefill re-seeds.
Called by the orchestrator on identity/prompt-prefix changes. No-op when the cache has not been constructed yet. (P1-7, 2.0.6-rc16)
Reimplemented from entropic::InferenceBackend.
Definition at line 105 of file llama_cpp_backend.h.
|
protected |
Compute token count of system messages only.
Compute system prefix token count from messages.
| messages | Original message list. |
| params | Generation params (for template). |
Definition at line 1032 of file llama_cpp_backend.cpp.
|
protected |
Create the llama context + prompt cache (do_activate step 2).
Extracted from do_activate. Builds ctx_ from model_ and the configured context params, then lazily creates the prompt cache.
Definition at line 282 of file llama_cpp_backend.cpp.
|
protected |
Create sampler chain from generation params.
| params | Generation parameters. |
Chain order per llama.cpp convention: grammar → penalties → temperature → top-k → top-p → dist
seed < 0 maps to LLAMA_DEFAULT_SEED (random). (P2-14)
Definition at line 732 of file llama_cpp_backend.cpp.
|
protected |
Core decode loop — shared by generate and streaming.
Core decode loop shared by generate, streaming, and complete.
| tokens | Input token sequence. |
| params | Generation parameters. |
| on_token | Per-token callback (nullptr for batch). |
| cancel | Cancel flag (nullptr for batch). |
Definition at line 857 of file llama_cpp_backend.cpp.
|
protected |
Decode tokens starting at a given offset.
Decode remaining tokens starting at start_offset.
Assumes seq_pos_max(0) == start_offset - 1 so that llama_batch_get_one auto-positions tokens at start_offset onward.
| tokens | Full token sequence. |
| start_offset | Index of the first token to decode. |
Definition at line 939 of file llama_cpp_backend.cpp.
|
protected |
Detokenize a single token.
Detokenize a single token to string.
| token | Token ID. |
Definition at line 460 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Activate model on GPU (WARM → ACTIVE).
Reloads model with n_gpu_layers from config, then creates inference context with KV cache. v2.2.7 (gh#61) wired cache_type_k/v. v2.2.8 (gh#58 follow-up) added the diagnostic-rich error message. v2.2.9 extracted the cparams builder to satisfy the SLOC gate.
Implements entropic::InferenceBackend.
Definition at line 230 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Return backend name.
Implements entropic::InferenceBackend.
Definition at line 2431 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Clear KV cache or recurrent hidden state.
| seq_id | Sequence ID, or -1 for all sequences. |
Reimplemented from entropic::InferenceBackend.
Definition at line 2475 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Raw text completion without chat template.
| prompt | Raw prompt string. |
| params | Generation parameters. |
Implements entropic::InferenceBackend.
Definition at line 2357 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Count tokens in text.
| text | Input text. |
Implements entropic::InferenceBackend.
Definition at line 504 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Deactivate: free context, reload model CPU-only.
Implements entropic::InferenceBackend.
Definition at line 349 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Evaluate per-token log-probabilities via sequential decode.
Clears memory, then processes tokens one at a time using the same decode path as generation. After each token, extracts logits for the next-token prediction. Compatible with recurrent/hybrid models that only support single-output-position batches.
| tokens | Token IDs to evaluate. |
| n_tokens | Number of tokens (minimum 2). |
Implements entropic::InferenceBackend.
Definition at line 538 of file llama_cpp_backend.cpp.
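The sequential evaluation described above can be sketched as a loop that feeds one token at a time and scores the token that actually follows. This is a minimal illustration, not the backend's code: `StepFnSketch` is a hypothetical callback standing in for "decode one token and return next-position log-probabilities", and the perplexity helper assumes the usual definition exp(-mean(log p)), which this page does not spell out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback: consume one token, return log-probabilities
// over the vocabulary for the next position.
using StepFnSketch = std::function<std::vector<float>(int32_t token)>;

// Scores tokens[1..n_tokens-1]; yields n_tokens - 1 log-probabilities.
std::vector<float> sequential_logprobs_sketch(const int32_t* tokens, int n_tokens,
                                              const StepFnSketch& step) {
    assert(tokens != nullptr && n_tokens >= 2);  // at least one prediction is needed
    std::vector<float> out;
    out.reserve(static_cast<std::size_t>(n_tokens - 1));
    for (int i = 0; i + 1 < n_tokens; ++i) {
        // One token per decode call, so only a single output position is required.
        const std::vector<float> next = step(tokens[i]);
        out.push_back(next.at(static_cast<std::size_t>(tokens[i + 1])));
    }
    return out;
}

// Assumed definition: perplexity = exp(-(1/N) * sum(log p_i)).
float perplexity_sketch(const std::vector<float>& logprobs) {
    assert(!logprobs.empty());
    double sum = 0.0;
    for (float lp : logprobs) sum += lp;
    return static_cast<float>(std::exp(-sum / static_cast<double>(logprobs.size())));
}
```

With a uniform distribution over four tokens, every log-probability is log(0.25), so the perplexity under this definition is 4.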
|
overrideprotectedvirtual |
Generate a complete response using chat template.
v2.1.8 (gh#37 / v1.9.11 Phases 5–7): dispatches to generate_multimodal() when any message carries IMAGE content_parts AND the backend has vision (mmproj loaded). When images arrive but vision is not available, image parts are stripped with a warning and generation proceeds text-only.
| messages | Conversation history. |
| params | Generation parameters. |
Implements entropic::InferenceBackend.
Definition at line 1388 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Speculative streaming via the abstract InferenceBackend interface (kept as NOT_SUPPORTED — see kernel entry below).
Abstract speculative entry point.
The actual draft-pair-aware kernel lives in generate_speculative_with_draft and is called by the orchestrator after it has resolved the draft backend. This abstract override exists for backends with implicit draft resolution; LlamaCppBackend requires an explicit draft handle.
The abstract single-backend variant therefore returns NOT_SUPPORTED. The orchestrator dynamic_casts both target and draft and calls generate_speculative_with_draft directly. (v2.1.11, gh#36)
Reimplemented from entropic::InferenceBackend.
Definition at line 1556 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Streaming generation with per-token callback.
| messages | Conversation history. |
| params | Generation parameters. |
| on_token | Per-token callback. |
| cancel | Atomic cancel flag. |
Implements entropic::InferenceBackend.
Definition at line 1469 of file llama_cpp_backend.cpp.
|
protected |
Text-only streaming generation (extracted from streaming).
Text-only streaming body (v2.1.8, extracted for knots SLOC).
Definition at line 1493 of file llama_cpp_backend.cpp.
|
protected |
Text-only batch generation (extracted from do_generate).
Text-only generate body (v2.1.8, extracted for knots SLOC).
Definition at line 1408 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Populate backend metadata from llama.cpp model.
Reimplemented from entropic::InferenceBackend.
Definition at line 2441 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Load model into CPU RAM (COLD → WARM).
Uses llama_model_load_from_file with n_gpu_layers=0 for CPU-only mmap+mlock loading. Model stays in page cache for fast reactivation.
| config | Validated model config. |
Implements entropic::InferenceBackend.
Definition at line 160 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Declare llama.cpp backend capabilities.
| cap | Capability to check. |
Reimplemented from entropic::InferenceBackend.
Definition at line 2393 of file llama_cpp_backend.cpp.
|
overrideprotectedvirtual |
Full unload — free all resources, clear prompt cache.
Implements entropic::InferenceBackend.
Definition at line 399 of file llama_cpp_backend.cpp.
|
staticprotected |
Extract the system prompt from messages.
Extract system prompt text from message list.
| messages | Conversation history. |
Definition at line 916 of file llama_cpp_backend.cpp.
|
staticprotected |
Extract log-probability for a token from logits.
Computes log_softmax(logits)[next_token] using the numerically stable form: logits[t] - max - log(sum(exp(logits - max))).
| logits | Raw logits array from llama_get_logits_ith(). |
| next_token | The token to score. |
| n_vocab | Vocabulary size. |
Definition at line 618 of file llama_cpp_backend.cpp.
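The numerically stable form quoted above can be written out as follows. This is a sketch of the formula only: `token_logprob_sketch` is a hypothetical name, and accumulating the sum in double precision is a choice made here, not something the page states about the real function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for extract_token_logprob.
// Returns logits[t] - max - log(sum(exp(logits - max))).
float token_logprob_sketch(const float* logits, int32_t next_token, int n_vocab) {
    assert(logits != nullptr && n_vocab > 0);
    assert(next_token >= 0 && next_token < n_vocab);

    // Subtracting the maximum keeps every exp() argument <= 0, so exp() cannot overflow.
    const float max_logit = *std::max_element(logits, logits + n_vocab);

    double sum_exp = 0.0;  // double accumulator to limit rounding error
    for (int i = 0; i < n_vocab; ++i) {
        sum_exp += std::exp(static_cast<double>(logits[i] - max_logit));
    }
    return static_cast<float>(logits[next_token] - max_logit - std::log(sum_exp));
}
```

Without the max subtraction, logits around 1000 would overflow exp() in single precision; with it, four equal logits of any magnitude give log(1/4) for each token.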
|
protected |
Multimodal generation core (v1.9.11 Phases 5–7).
Multimodal generation core (v2.1.8, gh#37 / v1.9.11 Phase 6).
Runs the libmtmd-backed prefill + decode for messages whose content_parts contain image entries. Builds an mtmd_bitmap list from ContentPart paths, inserts media markers in the chat-formatted prompt, then calls mtmd_helper_eval_chunks to encode images and decode all chunks in order.
After eval, sampling proceeds via the normal step_token loop — the cache state is positioned past the multimodal prefill.
| messages | Conversation history (must contain images). |
| params | Generation parameters. |
| on_token | Per-token callback (nullptr for batch mode). |
| cancel | Cancel atomic (nullptr for batch mode). |
Definition at line 1338 of file llama_cpp_backend.cpp.
| GenerationResult | entropic::LlamaCppBackend::generate_speculative_with_draft (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel, LlamaCppBackend &draft, int n_draft_max, const std::string &draft_path) |
Speculative-decoding kernel with explicit draft backend.
Speculative generation against a draft model (gh#36).
Adapts the upstream speculative-simple reference loop at pin 253ba110b to entropic's idioms: drives a draft LlamaCppBackend through common_speculative_*, verifies in batch on the target, and emits one on_token callback per accepted token (not per proposed) — preserving the standard streaming contract. Honors cancel between accept rounds; latency is one accept round (typically 1–8 tokens).
Correctness contract: output distribution bit-identical to plain decode on rejection cases. Verified by test_speculative_correctness.cpp against Qwen3.6-A3B target
Constraints (v2.1.11):
common_context_can_seq_rm == COMMON_CONTEXT_SEQ_RM_TYPE_FULL. Falls back to NOT_SUPPORTED otherwise (partial-acceptance checkpoint restore is deferred; see decision log #41).
| messages | Conversation history. |
| params | Generation parameters (samplers + max_tokens + seed). |
| on_token | Callback fired once per accepted token. |
| cancel | Cancellation flag (polled between accept rounds). |
| draft | Draft backend (must be ACTIVE). |
Definition at line 2312 of file llama_cpp_backend.cpp.
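The accept-round structure described above (draft proposes, target verifies, one callback per accepted token, cancel polled between rounds) can be illustrated with a toy greedy version. Everything here is hypothetical: `NextFnSketch` models a "model" as a deterministic next-token function, and the target is queried once per position rather than in a single verification batch as the real kernel does. It is a sketch of the control flow under those assumptions, not the common_speculative_* API.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical greedy "model": maps a context to its next token.
using NextFnSketch = std::function<int(const std::vector<int>&)>;

std::vector<int> speculative_greedy_sketch(std::vector<int> ctx, int max_tokens, int n_draft_max,
                                           const NextFnSketch& target, const NextFnSketch& draft,
                                           const std::function<void(int)>& on_token,
                                           const std::atomic<bool>& cancel) {
    assert(max_tokens >= 0 && n_draft_max >= 1);
    std::vector<int> out;
    // Cancel is checked once per accept round, not per token.
    while (static_cast<int>(out.size()) < max_tokens && !cancel.load()) {
        // 1. The draft proposes up to n_draft_max tokens autoregressively.
        std::vector<int> draft_ctx = ctx;
        std::vector<int> proposed;
        for (int i = 0; i < n_draft_max; ++i) {
            const int t = draft(draft_ctx);
            proposed.push_back(t);
            draft_ctx.push_back(t);
        }
        // 2. The target verifies. Each emitted token is the target's own choice,
        //    so the output matches plain target decoding.
        for (std::size_t i = 0; static_cast<int>(out.size()) < max_tokens; ++i) {
            const int expected = target(ctx);
            ctx.push_back(expected);
            out.push_back(expected);
            if (on_token) on_token(expected);  // one callback per accepted token
            // Stop the round at the first disagreement, or after the bonus
            // token that follows a fully accepted proposal.
            if (i >= proposed.size() || proposed[i] != expected) break;
        }
    }
    return out;
}
```

Because every emitted token comes from the target, a poor draft only costs speed in this toy: rounds get shorter, but the token sequence is unchanged.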
|
protected |
Initialize the libmtmd context if mmproj is configured.
Initialize libmtmd context if mmproj is configured (v2.1.8).
Extracted from do_activate to keep that function under the knots SLOC threshold. mtmd holds a reference to the live model_, so init runs after the GPU reload and before any generation. Failure is non-fatal — the engine falls back to text-only with a logged diagnostic.
Definition at line 319 of file llama_cpp_backend.cpp.
|
protected |
Check if loaded model is recurrent.
Definition at line 2380 of file llama_cpp_backend.cpp.
|
inline |
Get the active llama_context pointer.
Definition at line 134 of file llama_cpp_backend.h.
|
inline |
Get the loaded llama_model pointer.
Definition at line 126 of file llama_cpp_backend.h.
|
protected |
Load the GGUF model onto the GPU (do_activate step 1).
Extracted from do_activate to keep it knots-clean. Sets model_ + vocab_ on success; sets last_error_ and returns false otherwise.
Definition at line 243 of file llama_cpp_backend.cpp.
|
protected |
Run mtmd_tokenize + mtmd_helper_eval_chunks on a prompt.
mtmd prefill helper (v2.1.8) — tokenize + eval chunks.
| | prompt | Marker-substituted chat-formatted prompt. |
| | bitmaps | Loaded image bitmaps in marker order (borrowed). |
| [out] | err_msg | Filled on failure. |
Wraps mtmd_tokenize + mtmd_helper_eval_chunks. KV cache is cleared first so prefill always starts at seq position 0. Bitmap ownership stays with the caller (mtmd_tokenize borrows for the call).
Definition at line 1253 of file llama_cpp_backend.cpp.
|
protected |
Two-pass prefill: prefix-only prefill → save → rest.
Prefill in two passes: prefix → save → remainder.
| tokens | Full token sequence. |
| prefix_tokens | System prefix token count. |
| key | Cache key to store under. |
The v2.0.6 correctness fix. llama_state_seq_get_data has no range parameter, so any save captures whatever KV state happens to be in seq 0 at save time. By prefilling ONLY the system prefix first, saving, and then continuing with the rest of the prompt, we guarantee the cache entry covers exactly prefix_tokens positions — no residue from later conversation tokens can leak into a subsequent delegation's cache hit.
If prefix_tokens is 0 or >= total tokens, falls back to a plain full prefill without caching (nothing meaningful to cache).
Definition at line 1072 of file llama_cpp_backend.cpp.
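The two-pass idea can be shown with a mock sequence in place of real KV state. In this sketch the "save" step copies the whole mock sequence, mirroring the point above that the state capture has no range parameter; the cache entry is prefix-only purely because of when the copy happens. `MockSeqSketch`, `PrefixCacheSketch` and both functions are hypothetical names, and clearing the sequence at the start is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Mock stand-ins: the "KV state" of seq 0 is just the tokens decoded so far.
struct MockSeqSketch { std::vector<int> kv; };
using PrefixCacheSketch = std::map<std::string, std::vector<int>>;

// Appends tokens[start, end) to the sequence; requires the sequence to sit at `start`.
void decode_range_sketch(MockSeqSketch& seq, const std::vector<int>& tokens,
                         std::size_t start, std::size_t end) {
    assert(seq.kv.size() == start && start <= end && end <= tokens.size());
    seq.kv.insert(seq.kv.end(),
                  tokens.begin() + static_cast<std::ptrdiff_t>(start),
                  tokens.begin() + static_cast<std::ptrdiff_t>(end));
}

// Returns true when a prefix entry was stored, false on the plain-prefill fallback.
bool two_pass_prefill_sketch(MockSeqSketch& seq, PrefixCacheSketch& cache,
                             const std::vector<int>& tokens, int prefix_tokens,
                             const std::string& key) {
    seq.kv.clear();  // assumption: prefill starts from an empty sequence
    const std::size_t total = tokens.size();
    if (prefix_tokens <= 0 || static_cast<std::size_t>(prefix_tokens) >= total) {
        decode_range_sketch(seq, tokens, 0, total);  // nothing meaningful to cache
        return false;
    }
    const std::size_t prefix = static_cast<std::size_t>(prefix_tokens);
    decode_range_sketch(seq, tokens, 0, prefix);      // pass 1: system prefix only
    cache[key] = seq.kv;                              // whole-sequence save, prefix-only by timing
    decode_range_sketch(seq, tokens, prefix, total);  // pass 2: remainder of the prompt
    return true;
}
```

Saving after a single full prefill would instead store all of the prompt's tokens under the prefix key, which is the kind of residue the two-pass ordering avoids.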
|
protected |
Release a temporary sequence ID back to the pool.
| seq_id | The seq_id to release. |
Definition at line 600 of file llama_cpp_backend.cpp.
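The allocate/release pair, together with the seq_id_mutex_ and free_seq_ids_ members, suggests a mutex-guarded free list. The sketch below shows that pattern in isolation. `TempSeqIdPoolSketch` is a hypothetical class, plain `int` stands in for llama_seq_id, and both the reserved id range and the -1 return on exhaustion are assumptions rather than documented behaviour.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Illustrative mutex-guarded pool of temporary sequence IDs.
class TempSeqIdPoolSketch {
public:
    // Assumption: ids [first, first + count) are reserved for temporary use.
    TempSeqIdPoolSketch(int first, int count) {
        assert(count >= 0);
        // Push in reverse so the lowest id is handed out first.
        for (int i = count - 1; i >= 0; --i) free_ids_.push_back(first + i);
    }

    // Returns a free id, or -1 when the pool is exhausted (assumed convention).
    int allocate() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_ids_.empty()) return -1;
        const int id = free_ids_.back();
        free_ids_.pop_back();
        return id;
    }

    // Returns an id to the pool so a later evaluation can reuse it.
    void release(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_ids_.push_back(id);
    }

private:
    std::mutex mutex_;           // guards free_ids_
    std::vector<int> free_ids_;  // currently available ids
};
```

Callers would typically pair each allocate() with a release() once the evaluation finishes, for example through a small scope guard.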
|
protected |
Restore KV state from cache and decode remaining tokens.
Restore cached prefix KV and decode remaining tokens.
| cached | Cache entry to restore. |
| tokens | Full token sequence. |
After v2.0.6 the cached state contains ONLY the system prefix (two-pass prefill saves at the prefix boundary, see prefill_and_cache_prefix). Restore is therefore clean by construction: llama_state_seq_set_data leaves seq 0 with exactly cached->token_count positions filled, and llama_batch_get_one auto-positions subsequent decodes at that boundary.
Definition at line 978 of file llama_cpp_backend.cpp.
|
protected |
Run batched prefill on input tokens.
| tokens | Input token sequence. |
Definition at line 790 of file llama_cpp_backend.cpp.
|
protected |
Run prefill with prompt cache integration.
| tokens | Full token sequence. |
| system_prompt | System prompt text for cache key. |
| messages | Original messages (for prefix boundary). |
| params | Generation parameters. |
On cache hit: restore prefix KV and decode the remainder. On cache miss: two-pass prefill (prefix → save → remainder) so the stored cache entry contains prefix-only state.
Definition at line 1114 of file llama_cpp_backend.cpp.
|
protected |
Sample tokens until stop / max_tokens / cancel.
Shared sampling loop (v2.1.8).
Shared by generate_multimodal and the text-only paths after prefill. Operates on the already-positioned ctx_ KV cache.
| params | Generation parameters. |
| on_token | Per-token callback (nullable). |
| cancel | Atomic cancel flag (nullable). |
| t0 | Generation start time for finalize_result. |
Mirrors the body of both text-only generation variants, factored out so generate_multimodal can reuse it after mtmd_prefill.
Definition at line 1296 of file llama_cpp_backend.cpp.
|
protected |
Capture seq 0 KV state and store under the given key.
The caller is responsible for ensuring seq 0 currently contains exactly prefix_tokens positions — this function trusts that contract and serializes whatever is there.
| key | Cache key for the prefix. |
| prefix_tokens | Number of prefix tokens currently in seq 0. |
Definition at line 1007 of file llama_cpp_backend.cpp.
|
inline |
Set prompt cache configuration.
Must be called before activate(). The config is consumed when the cache is constructed during do_activate().
| config | Prompt cache configuration. |
Definition at line 91 of file llama_cpp_backend.h.
|
protected |
Generate one token and append to output.
| sampler | Sampler chain. |
| generated | Accumulated output string (mutated). |
| on_token | Streaming callback (may be nullptr). |
| stop | Stop sequences. |
Definition at line 820 of file llama_cpp_backend.cpp.
|
protected |
Tokenize text using model vocabulary.
| text | Input text. |
| add_special | Add BOS/EOS special tokens. |
Definition at line 430 of file llama_cpp_backend.cpp.
|
overridevirtual |
Tokenize text to token IDs using model vocabulary.
| text | Input text. |
Reimplemented from entropic::InferenceBackend.
Definition at line 516 of file llama_cpp_backend.cpp.
|
protected |
Inference context (ACTIVE)
Definition at line 241 of file llama_cpp_backend.h.
|
protected |
Available temporary seq_ids (v1.9.10)
Definition at line 437 of file llama_cpp_backend.h.
|
protected |
Cached mtmd_support_vision(mtmd_ctx_) result.
Definition at line 468 of file llama_cpp_backend.h.
|
protected |
True if loaded model is recurrent (GDN/Mamba/RWKV).
Set during do_load() from llama_model_is_recurrent(). Drives capability reporting (KV_CACHE vs HIDDEN_STATE, speculative decoding compatibility, etc.).
Definition at line 446 of file llama_cpp_backend.h.
|
protected |
Loaded model (WARM+)
Definition at line 240 of file llama_cpp_backend.h.
|
protected |
libmtmd context, or nullptr if no mmproj loaded.
Allocated in do_activate() when ModelConfig::mmproj_path is set; freed in do_deactivate()/do_unload(). The leading :: is required — without it mtmd_context resolves into the entropic:: namespace and the forward declaration becomes an incompatible incomplete type at the call sites.
Definition at line 464 of file llama_cpp_backend.h.
|
protected |
KV prefix cache (v1.8.3)
Definition at line 247 of file llama_cpp_backend.h.
|
protected |
Cache config (v1.8.3)
Definition at line 246 of file llama_cpp_backend.h.
|
protected |
Guards temp seq_id pool (v1.9.10)
Definition at line 436 of file llama_cpp_backend.h.
|
protected |
Vocabulary (from model_)
Definition at line 242 of file llama_cpp_backend.h.