Entropic 2.3.8
Local-first agentic inference engine
Concrete base class for inference backends (80% logic).
#include <entropic/inference/backend.h>

Public Member Functions | |
| bool | load (const ModelConfig &config) |
| Load model into CPU RAM (COLD → WARM). | |
| bool | activate () |
| Promote to GPU (WARM → ACTIVE). | |
| void | deactivate () |
| Release GPU layers (ACTIVE → WARM). | |
| void | unload () |
| Full unload (→ COLD). | |
| bool | load_and_activate (const ModelConfig &config) |
| Convenience: load() + activate(). | |
| GenerationResult | generate (const std::vector< Message > &messages, const GenerationParams &params) |
| Generate a complete response. | |
| GenerationResult | generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Generate with per-token streaming callback. | |
| GenerationResult | generate_speculative (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Generate via the speculative-decoding kernel (v2.1.11). | |
| GenerationResult | complete (const std::string &prompt, const GenerationParams &params) |
| Raw text completion without chat template. | |
| LogprobResult | evaluate_logprobs (const int32_t *tokens, int n_tokens) |
| Evaluate per-token log-probabilities for a token sequence. | |
| float | compute_perplexity (const int32_t *tokens, int n_tokens) |
| Compute perplexity for a token sequence. | |
| ModelState | state () const |
| Current lifecycle state (lock-free read). | |
| bool | is_active () const |
| True when state is ACTIVE. | |
| bool | is_loaded () const |
| True when state is WARM or ACTIVE. | |
| int | count_tokens (const std::string &text) const |
| Count tokens using model's tokenizer. | |
| virtual std::vector< int32_t > | tokenize_text (const std::string &text) const |
| Tokenize text to token IDs. | |
| int | context_length () const |
| Model's context window size. | |
| virtual void | clear_prompt_cache () |
| Invalidate any backend-owned prompt/KV caches. | |
| const ModelConfig & | config () const |
| Stored model config. | |
| bool | supports (BackendCapability cap) const |
| Query whether this backend supports a capability. | |
| std::vector< BackendCapability > | capabilities () const |
| Get all supported capabilities as a vector. | |
| BackendInfo | info () const |
| Get backend metadata. | |
| bool | save_state (int seq_id, std::vector< uint8_t > &buffer) const |
| Save model state to buffer. | |
| bool | restore_state (int seq_id, const std::vector< uint8_t > &buffer) |
| Restore model state from buffer. | |
| bool | clear_state (int seq_id=-1) |
| Clear/reset model state for a sequence. | |
| GenerationResult | generate_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params) |
| Generate with explicit sequence ID. | |
| GenerationResult | generate_streaming_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Streaming generation with explicit sequence ID. | |
Protected Member Functions | |
| virtual bool | do_load (const ModelConfig &config)=0 |
| Load model into CPU RAM. | |
| virtual bool | do_activate ()=0 |
| Promote loaded model to GPU. | |
| virtual void | do_deactivate ()=0 |
| Release GPU, keep CPU. | |
| virtual void | do_unload ()=0 |
| Full unload. | |
| virtual GenerationResult | do_generate (const std::vector< Message > &messages, const GenerationParams &params)=0 |
| Subclass generation. | |
| virtual GenerationResult | do_generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel)=0 |
| Subclass streaming generation. | |
| virtual GenerationResult | do_generate_speculative (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Subclass speculative-decoding streaming generation. | |
| virtual GenerationResult | do_complete (const std::string &prompt, const GenerationParams &params)=0 |
| Subclass raw completion. | |
| virtual int | do_count_tokens (const std::string &text) const =0 |
| Subclass token counting. | |
| virtual LogprobResult | do_evaluate_logprobs (const int32_t *tokens, int n_tokens)=0 |
| Backend-specific logprob evaluation. | |
| virtual bool | do_supports (BackendCapability cap) const |
| Declare supported capabilities. | |
| virtual std::string | do_backend_name () const =0 |
| Return backend name identifier. | |
| virtual BackendInfo | do_info () const |
| Populate backend metadata. | |
| virtual bool | do_save_state (int seq_id, std::vector< uint8_t > &buffer) const |
| Save model state (KV cache or hidden state). | |
| virtual bool | do_restore_state (int seq_id, const std::vector< uint8_t > &buffer) |
| Restore model state. | |
| virtual bool | do_clear_state (int seq_id) |
| Clear/reset model state. | |
| virtual GenerationResult | do_generate_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params) |
| Generate with sequence ID. | |
| virtual GenerationResult | do_generate_streaming_seq (int seq_id, const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view token)> on_token, std::atomic< bool > &cancel) |
| Streaming generation with sequence ID. | |
| bool | fire_model_load_hook (const ModelConfig &config) |
| Fire ON_MODEL_LOAD pre-hook. | |
| void | set_hooks (const HookInterface &hooks) |
| Set the hook dispatch interface. | |
Protected Attributes | |
| std::string | last_error_ |
| Last error message for diagnostics. | |
Concrete base class for inference backends (80% logic).
Public methods implement the lifecycle state machine, transition locking, timing, and logging. Protected virtual methods are the 20% that subclasses override with backend-specific logic.
Invalid transitions are no-ops logged at INFO level; they are not treated as errors.
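The split between public lifecycle wrappers and protected virtual hooks can be pictured with a small stand-in class. This is only a sketch under stated assumptions: `SketchBackend` and `CountingBackend` are hypothetical names, the model is identified by a plain path string rather than a `ModelConfig`, and the sketch's `activate()` fails when COLD instead of loading first. Timing, logging, and hooks are omitted.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for the lifecycle states named on this page.
enum class ModelState { COLD, WARM, ACTIVE };

// Sketch of the wrapper/hook split: public methods own the state machine
// and the transition lock, protected virtuals do the backend work.
class SketchBackend {
public:
    virtual ~SketchBackend() = default;

    // COLD -> WARM. Treated as a successful no-op if already loaded.
    bool load(const std::string& model_path) {
        std::lock_guard<std::mutex> lock(transition_mutex_);
        if (state_.load() != ModelState::COLD) return true;
        if (!do_load(model_path)) return false;
        state_.store(ModelState::WARM);
        return true;
    }

    // WARM -> ACTIVE. Simplification: fails when COLD.
    bool activate() {
        std::lock_guard<std::mutex> lock(transition_mutex_);
        if (state_.load() == ModelState::ACTIVE) return true;
        if (state_.load() != ModelState::WARM) return false;
        if (!do_activate()) return false;
        state_.store(ModelState::ACTIVE);
        return true;
    }

    // ACTIVE -> WARM. No-op in any other state.
    void deactivate() {
        std::lock_guard<std::mutex> lock(transition_mutex_);
        if (state_.load() != ModelState::ACTIVE) return;
        do_deactivate();
        state_.store(ModelState::WARM);
    }

    // Any state -> COLD. Idempotent.
    void unload() {
        std::lock_guard<std::mutex> lock(transition_mutex_);
        if (state_.load() == ModelState::COLD) return;
        do_unload();
        state_.store(ModelState::COLD);
    }

    // Lock-free read of the current state.
    ModelState state() const { return state_.load(); }

protected:
    virtual bool do_load(const std::string& model_path) = 0;
    virtual bool do_activate() = 0;
    virtual void do_deactivate() = 0;
    virtual void do_unload() = 0;

private:
    std::mutex transition_mutex_;
    std::atomic<ModelState> state_{ModelState::COLD};
};

// Minimal subclass that only counts how often do_load() runs.
class CountingBackend : public SketchBackend {
public:
    int load_calls = 0;

protected:
    bool do_load(const std::string&) override { ++load_calls; return true; }
    bool do_activate() override { return true; }
    void do_deactivate() override {}
    void do_unload() override {}
};
```

Calling `load()` twice on a `CountingBackend` leaves `load_calls` at 1, which is the no-op behaviour for an invalid transition that the description refers to.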
| bool entropic::InferenceBackend::activate | ( | ) |
Promote to GPU (WARM → ACTIVE).
Loads first if COLD.
Definition at line 88 of file backend.cpp.
| std::vector< BackendCapability > entropic::InferenceBackend::capabilities | ( | ) | const |
Get all supported capabilities as a vector.
Convenience method. Iterates BackendCapability enum, calls supports() on each.
Definition at line 468 of file backend.cpp.
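The iterate-and-filter approach described for capabilities() might look roughly like the sketch below. The enumerators (`STREAMING`, `STATE_SAVE`, `SPECULATIVE`) and the `COUNT_` sentinel are hypothetical, since the real `BackendCapability` values are not listed on this page, and `sketch_supports()` stands in for supports().

```cpp
#include <cassert>
#include <vector>

// Hypothetical enumerators; COUNT_ is an assumed sentinel for iteration.
enum class BackendCapability { STREAMING, STATE_SAVE, SPECULATIVE, COUNT_ };

// Stand-in for supports(): pretend only two capabilities are available.
inline bool sketch_supports(BackendCapability cap) {
    return cap == BackendCapability::STREAMING ||
           cap == BackendCapability::STATE_SAVE;
}

// Walk every enumerator and keep the ones the backend reports as supported.
inline std::vector<BackendCapability> sketch_capabilities() {
    std::vector<BackendCapability> out;
    const int n = static_cast<int>(BackendCapability::COUNT_);
    for (int i = 0; i < n; ++i) {
        const auto cap = static_cast<BackendCapability>(i);
        if (sketch_supports(cap)) out.push_back(cap);
    }
    return out;
}
```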
| virtual void entropic::InferenceBackend::clear_prompt_cache | ( | ) |
inline virtual
Invalidate any backend-owned prompt/KV caches.
Called when identity or prompt-prefix inputs change so stale cached prefixes are never served against the new system prompt. Default is a no-op for backends with no cache. (P1-7, 2.0.6-rc16)
Reimplemented in entropic::LlamaCppBackend.
| bool entropic::InferenceBackend::clear_state | ( | int | seq_id = -1 | ) |
Clear/reset model state for a sequence.
| seq_id | Sequence identifier (-1 for all sequences). |
For transformers: clears KV cache. For recurrent: resets hidden state to initial values.
Requires WARM or ACTIVE.
Definition at line 548 of file backend.cpp.
| GenerationResult entropic::InferenceBackend::complete | ( | const std::string & | prompt, |
| const GenerationParams & | params | ||
| ) |
Raw text completion without chat template.
| prompt | Raw prompt string (no chat formatting). |
| params | Generation parameters. |
Requires ACTIVE state.
Definition at line 308 of file backend.cpp.
| float entropic::InferenceBackend::compute_perplexity | ( | const int32_t * | tokens, |
| int | n_tokens | ||
| ) |
Compute perplexity for a token sequence.
| tokens | Array of token IDs. |
| n_tokens | Number of tokens (minimum 2). |
Convenience method — calls evaluate_logprobs() and returns only the perplexity value.
Definition at line 401 of file backend.cpp.
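For orientation, perplexity is conventionally the exponential of the negative mean log-probability. The helper below is a sketch of that standard formula, not the code in backend.cpp: `perplexity_from_logprobs` is a hypothetical name, and it assumes natural-log probabilities with one value per predicted token, which would explain why at least 2 input tokens are required.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <vector>

// Perplexity = exp(-mean(log p)).
// Assumes natural-log probabilities, one per predicted token, so a
// sequence of n_tokens inputs would yield n_tokens - 1 values.
inline float perplexity_from_logprobs(const std::vector<float>& logprobs) {
    if (logprobs.empty()) {
        throw std::invalid_argument("need at least one log-probability");
    }
    double sum = 0.0;
    for (float lp : logprobs) sum += lp;
    const double mean = sum / static_cast<double>(logprobs.size());
    return static_cast<float>(std::exp(-mean));
}
```

A model that assigns probability 0.5 to every token has a perplexity of 2 under this definition, i.e. it is as uncertain as a fair coin flip at each step.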
| const ModelConfig & entropic::InferenceBackend::config | ( | ) | const |
inline
Stored model config.
| int entropic::InferenceBackend::context_length | ( | ) | const |
inline
Model's context window size.
| int entropic::InferenceBackend::count_tokens | ( | const std::string & | text | ) | const |
Count tokens using model's tokenizer.
Returns an exact count when the model is loaded (WARM or ACTIVE) and an estimate when COLD.
| text | Text to tokenize. |
Definition at line 442 of file backend.cpp.
| void entropic::InferenceBackend::deactivate | ( | ) |
Release GPU layers (ACTIVE → WARM).
No-op if not ACTIVE.
Definition at line 117 of file backend.cpp.
| virtual bool entropic::InferenceBackend::do_activate | ( | ) |
protected pure virtual
Promote loaded model to GPU.
Called under transition_mutex_.
Implemented in entropic::LlamaCppBackend.
| virtual std::string entropic::InferenceBackend::do_backend_name | ( | ) | const |
protected pure virtual
Return backend name identifier.
Pure virtual — every backend must identify itself.
Implemented in entropic::LlamaCppBackend.
| virtual bool entropic::InferenceBackend::do_clear_state | ( | int | seq_id | ) |
protected virtual
Clear/reset model state.
Default: state clear not supported.
| seq_id | Sequence ID, or -1 for all. |
Reimplemented in entropic::LlamaCppBackend.
Definition at line 688 of file backend.cpp.
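| virtual GenerationResult entropic::InferenceBackend::do_complete | ( | const std::string & | prompt, |
| const GenerationParams & | params | ||
| ) |
protected pure virtual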
Subclass raw completion.
Called only when ACTIVE.
| prompt | Raw prompt string (no chat template applied). |
| params | Generation parameters. |
Implemented in entropic::LlamaCppBackend.
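| virtual int entropic::InferenceBackend::do_count_tokens | ( | const std::string & | text | ) | const |
protected pure virtual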
Subclass token counting.
Called only when model loaded.
| text | Text whose tokens should be counted. |
Implemented in entropic::LlamaCppBackend.
| virtual void entropic::InferenceBackend::do_deactivate | ( | ) |
protected pure virtual
Release GPU, keep CPU.
Called under transition_mutex_.
Implemented in entropic::LlamaCppBackend.
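| virtual LogprobResult entropic::InferenceBackend::do_evaluate_logprobs | ( | const int32_t * | tokens, |
| int | n_tokens | ||
| ) |
protected pure virtual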
Backend-specific logprob evaluation.
| tokens | Token IDs to evaluate. |
| n_tokens | Number of tokens. |
Called by evaluate_logprobs() after state validation and eval_mutex_ acquisition. The base class handles state assertion, minimum token count validation, mutex, perplexity computation from logprobs, and logging. The implementation handles batch allocation, decode calls, logit extraction, and temporary seq_id lifecycle.
Implemented in entropic::LlamaCppBackend.
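| virtual GenerationResult entropic::InferenceBackend::do_generate | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params | ||
| ) |
protected pure virtual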
Subclass generation.
Called only when ACTIVE.
| messages | Conversation history. |
| params | Generation parameters. |
Implemented in entropic::LlamaCppBackend.
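| virtual GenerationResult entropic::InferenceBackend::do_generate_seq | ( | int | seq_id, |
| const std::vector< Message > & | messages, | ||
| const GenerationParams & | params | ||
| ) |
protected virtual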
Generate with sequence ID.
Default: ignores seq_id, delegates to do_generate().
| seq_id | Sequence identifier. |
| messages | Conversation history. |
| params | Generation parameters. |
Definition at line 701 of file backend.cpp.
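| virtual GenerationResult entropic::InferenceBackend::do_generate_speculative | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params, | ||
| std::function< void(std::string_view token)> | on_token, | ||
| std::atomic< bool > & | cancel | ||
| ) |
protected virtual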
Subclass speculative-decoding streaming generation.
Same contract as do_generate_streaming (callback fires once per accepted token, cancel flag honored between accept rounds) but the backend is expected to drive a draft model through common_speculative_* (or equivalent) to propose tokens, then verify them in batch against the target. Output distribution MUST be bit-identical to plain decode on rejection cases (correctness contract from the v2.1.11 proposal).
The default implementation ignores all four parameters and returns a result with ENTROPIC_ERROR_NOT_SUPPORTED, so the orchestrator falls back to the standard streaming path (plain do_generate_streaming). Backends that implement the speculative kernel override this method. (v2.1.11, gh#36)
| messages | Conversation history. |
| params | Generation parameters. |
| on_token | Callback for each accepted token. |
| cancel | Atomic cancel flag. |
Reimplemented in entropic::LlamaCppBackend.
Definition at line 286 of file backend.cpp.
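The fallback the orchestrator performs when a backend does not implement speculative decoding could be sketched as follows. Everything here is illustrative: `SketchResult`, `SketchError`, and `generate_with_fallback` are hypothetical names, and neither the real `GenerationResult` fields nor the numeric value of ENTROPIC_ERROR_NOT_SUPPORTED is shown on this page.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical result shape and error codes, for illustration only.
enum SketchError { SKETCH_OK = 0, SKETCH_ERROR_NOT_SUPPORTED = 1 };
struct SketchResult {
    SketchError error = SKETCH_OK;
    std::string text;
};

using TokenCallback = std::function<void(std::string_view token)>;
using GenerateFn = std::function<SketchResult(const TokenCallback&)>;

// Try the speculative path first; fall back to plain streaming only when
// the backend reports that speculative decoding is not supported.
inline SketchResult generate_with_fallback(const GenerateFn& speculative,
                                           const GenerateFn& streaming,
                                           const TokenCallback& on_token) {
    SketchResult result = speculative(on_token);
    if (result.error == SKETCH_ERROR_NOT_SUPPORTED) {
        return streaming(on_token);
    }
    return result;
}
```

Any other error from the speculative path is returned unchanged in this sketch; whether the real orchestrator retries in that case is not stated here.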
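| virtual GenerationResult entropic::InferenceBackend::do_generate_streaming | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params, | ||
| std::function< void(std::string_view token)> | on_token, | ||
| std::atomic< bool > & | cancel | ||
| ) |
protected pure virtual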
Subclass streaming generation.
Called only when ACTIVE.
| messages | Conversation history. |
| params | Generation parameters. |
| on_token | Callback invoked per emitted token. |
| cancel | Atomic flag — when true, subclass must stop streaming. |
Implemented in entropic::LlamaCppBackend.
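| virtual GenerationResult entropic::InferenceBackend::do_generate_streaming_seq | ( | int | seq_id, |
| const std::vector< Message > & | messages, | ||
| const GenerationParams & | params, | ||
| std::function< void(std::string_view token)> | on_token, | ||
| std::atomic< bool > & | cancel | ||
| ) |
protected virtual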
Streaming generation with sequence ID.
Default: ignores seq_id, delegates to do_generate_streaming().
| seq_id | Sequence identifier. |
| messages | Conversation history. |
| params | Generation parameters. |
| on_token | Per-token callback. |
| cancel | Cancellation flag. |
Definition at line 720 of file backend.cpp.
| virtual BackendInfo entropic::InferenceBackend::do_info | ( | ) | const |
protected virtual
Populate backend metadata.
Default: returns a BackendInfo populated only with the name from do_backend_name().
Reimplemented in entropic::LlamaCppBackend.
Definition at line 647 of file backend.cpp.
| virtual bool entropic::InferenceBackend::do_load | ( | const ModelConfig & | config | ) |
protected pure virtual
Load model into CPU RAM.
Called under transition_mutex_.
| config | Validated model config. |
Implemented in entropic::LlamaCppBackend.
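| virtual bool entropic::InferenceBackend::do_restore_state | ( | int | seq_id, |
| const std::vector< uint8_t > & | buffer | ||
| ) |
protected virtual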
Restore model state.
Default: state restore not supported.
| seq_id | Sequence identifier. |
| buffer | State data to restore. |
Definition at line 675 of file backend.cpp.
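| virtual bool entropic::InferenceBackend::do_save_state | ( | int | seq_id, |
| std::vector< uint8_t > & | buffer | ||
| ) | const |
protected virtual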
Save model state (KV cache or hidden state).
Default: state save not supported.
| seq_id | Sequence identifier. |
| buffer | Output buffer. |
Definition at line 661 of file backend.cpp.
| virtual bool entropic::InferenceBackend::do_supports | ( | BackendCapability | cap | ) | const |
protected virtual
Declare supported capabilities.
Default: no capabilities supported.
| cap | Capability to check. |
Reimplemented in entropic::LlamaCppBackend.
Definition at line 637 of file backend.cpp.
| virtual void entropic::InferenceBackend::do_unload | ( | ) |
protected pure virtual
Full unload.
| LogprobResult entropic::InferenceBackend::evaluate_logprobs | ( | const int32_t * | tokens, |
| int | n_tokens | ||
| ) |
Evaluate per-token log-probabilities for a token sequence.
| tokens | Array of token IDs to evaluate. |
| n_tokens | Number of tokens in the array (minimum 2). |
| std::runtime_error | if model is not ACTIVE. |
| std::runtime_error | if n_tokens < 2. |
Requires ACTIVE state.
The 80% logic: state check, input validation, eval_mutex_, perplexity computation from raw logprobs, and logging. Delegates to do_evaluate_logprobs() for backend-specific batch/decode work.
Definition at line 343 of file backend.cpp.
| bool entropic::InferenceBackend::fire_model_load_hook | ( | const ModelConfig & | config | ) |
protected
Fire ON_MODEL_LOAD pre-hook.
| config | Model config being loaded. |
Definition at line 417 of file backend.cpp.
| GenerationResult entropic::InferenceBackend::generate | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params | ||
| ) |
Generate a complete response.
| messages | Conversation history. |
| params | Generation parameters. |
Requires ACTIVE state.
Definition at line 182 of file backend.cpp.
| GenerationResult entropic::InferenceBackend::generate_seq | ( | int | seq_id, |
| const std::vector< Message > & | messages, | ||
| const GenerationParams & | params | ||
| ) |
Generate with explicit sequence ID.
| seq_id | Sequence identifier for multi-sequence backends. |
| messages | Conversation history. |
| params | Generation parameters. |
Default: ignores seq_id, delegates to generate().
Requires ACTIVE.
Definition at line 571 of file backend.cpp.
| GenerationResult entropic::InferenceBackend::generate_speculative | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params, | ||
| std::function< void(std::string_view token)> | on_token, | ||
| std::atomic< bool > & | cancel | ||
| ) |
Generate via the speculative-decoding kernel (v2.1.11).
Public wrapper around do_generate_speculative. Validates ACTIVE state, then delegates to the subclass override. Backends that do not implement the kernel return ENTROPIC_ERROR_NOT_SUPPORTED — the orchestrator falls back to generate_streaming in that case.
| messages | Conversation history. |
| params | Generation parameters. |
| on_token | Per-accepted-token callback. |
| cancel | Cancellation flag. |
Mirrors generate_streaming and stamps generation_time_ms.
Definition at line 248 of file backend.cpp.
| GenerationResult entropic::InferenceBackend::generate_streaming | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params, | ||
| std::function< void(std::string_view token)> | on_token, | ||
| std::atomic< bool > & | cancel | ||
| ) |
Generate with per-token streaming callback.
| messages | Conversation history. |
| params | Generation parameters. |
| on_token | Called for each token (valid only during callback). |
| cancel | Set to true to abort. Generation stops within one token. |
Requires ACTIVE state.
Definition at line 211 of file backend.cpp.
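The callback and cancel contract can be illustrated with a toy generator. This is a sketch only: `sketch_generate_streaming` and `collect_two_tokens` are hypothetical helpers that emit fixed strings instead of running a model, but they use the same `on_token` and `cancel` parameter types as the signature above.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for a streaming generator: emits fixed tokens and checks the
// cancel flag before each one, so an abort takes effect within one token.
// Returns the number of tokens delivered to the callback.
inline int sketch_generate_streaming(
    const std::vector<std::string>& tokens,
    std::function<void(std::string_view token)> on_token,
    std::atomic<bool>& cancel) {
    int emitted = 0;
    for (const std::string& tok : tokens) {
        if (cancel.load()) break;
        on_token(tok);  // the view is only valid during this call
        ++emitted;
    }
    return emitted;
}

// Caller side: copy each token out of the view, and request cancellation
// once two tokens have arrived.
inline std::string collect_two_tokens() {
    std::atomic<bool> cancel{false};
    std::string text;
    int seen = 0;
    sketch_generate_streaming(
        {"Hello", ",", " world", "!"},
        [&](std::string_view token) {
            text.append(token);
            if (++seen == 2) cancel.store(true);
        },
        cancel);
    return text;
}
```

Because the token view is documented as valid only during the callback, the caller appends its contents to an owned `std::string` rather than storing the view.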
| GenerationResult entropic::InferenceBackend::generate_streaming_seq | ( | int | seq_id, |
| const std::vector< Message > & | messages, | ||
| const GenerationParams & | params, | ||
| std::function< void(std::string_view token)> | on_token, | ||
| std::atomic< bool > & | cancel | ||
| ) |
Streaming generation with explicit sequence ID.
| seq_id | Sequence identifier. |
| messages | Conversation history. |
| params | Generation parameters. |
| on_token | Per-token callback. |
| cancel | Cancellation flag. |
Requires ACTIVE.
Definition at line 603 of file backend.cpp.
| BackendInfo entropic::InferenceBackend::info | ( | ) | const |
Get backend metadata.
Base class returns a default-constructed BackendInfo with name from do_backend_name(). Subclasses override do_info() to populate architecture, quantization, memory usage, etc.
Delegates to do_info().
Definition at line 486 of file backend.cpp.
| bool entropic::InferenceBackend::is_active | ( | ) | const |
inline
True when state is ACTIVE.
| bool entropic::InferenceBackend::is_loaded | ( | ) | const |
inline
True when state is WARM or ACTIVE.
| bool entropic::InferenceBackend::load | ( | const ModelConfig & | config | ) |
Load model into CPU RAM (COLD → WARM).
| config | Model configuration (path, context length, GPU layers, etc.). |
Acquires transition_mutex_. No-op if already WARM/ACTIVE.
Definition at line 54 of file backend.cpp.
| bool entropic::InferenceBackend::load_and_activate | ( | const ModelConfig & | config | ) |
Convenience: load() + activate().
| config | Model configuration passed through to load(). |
Definition at line 165 of file backend.cpp.
| bool entropic::InferenceBackend::restore_state | ( | int | seq_id, |
| const std::vector< uint8_t > & | buffer | ||
| ) |
Restore model state from buffer.
| seq_id | Sequence identifier to restore into. |
| buffer | Previously saved state buffer. |
Requires ACTIVE.
Definition at line 524 of file backend.cpp.
| bool entropic::InferenceBackend::save_state | ( | int | seq_id, |
| std::vector< uint8_t > & | buffer | ||
| ) | const |
Save model state to buffer.
| seq_id | Sequence identifier (0 for single-sequence backends). |
| buffer | Output buffer. Caller owns the returned data. |
For transformers: saves KV cache state for the sequence. For recurrent: saves hidden state.
Requires ACTIVE.
Definition at line 500 of file backend.cpp.
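A save, clear, restore round trip through a caller-owned byte buffer might look like the toy below. It is a sketch under loud assumptions: `SketchStateStore` is hypothetical, each sequence's "state" is just a list of token IDs rather than a KV cache or hidden state, and the byte layout is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <utility>
#include <vector>

// Toy stand-in: per-sequence state is a plain vector of token IDs.
class SketchStateStore {
public:
    void append(int seq_id, int32_t token) { seqs_[seq_id].push_back(token); }

    std::size_t length(int seq_id) const {
        auto it = seqs_.find(seq_id);
        return it == seqs_.end() ? 0 : it->second.size();
    }

    // Serialize one sequence into the caller-owned buffer.
    bool save_state(int seq_id, std::vector<uint8_t>& buffer) const {
        auto it = seqs_.find(seq_id);
        if (it == seqs_.end()) return false;
        buffer.resize(it->second.size() * sizeof(int32_t));
        if (!buffer.empty()) {
            std::memcpy(buffer.data(), it->second.data(), buffer.size());
        }
        return true;
    }

    // Rebuild a sequence from a previously saved buffer.
    bool restore_state(int seq_id, const std::vector<uint8_t>& buffer) {
        if (buffer.size() % sizeof(int32_t) != 0) return false;
        std::vector<int32_t> tokens(buffer.size() / sizeof(int32_t));
        if (!buffer.empty()) {
            std::memcpy(tokens.data(), buffer.data(), buffer.size());
        }
        seqs_[seq_id] = std::move(tokens);
        return true;
    }

    // seq_id == -1 clears every sequence, matching the documented default.
    bool clear_state(int seq_id = -1) {
        if (seq_id == -1) {
            seqs_.clear();
        } else {
            seqs_.erase(seq_id);
        }
        return true;
    }

private:
    std::map<int, std::vector<int32_t>> seqs_;
};
```

The buffer is an output parameter in both the toy and the documented signature, so a caller can reuse one allocation across repeated saves.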
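| void entropic::InferenceBackend::set_hooks | ( | const HookInterface & | hooks | ) |
inline protected
Set the hook dispatch interface.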
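| ModelState entropic::InferenceBackend::state | ( | ) | const |
inline
Current lifecycle state (lock-free read).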
| bool entropic::InferenceBackend::supports | ( | BackendCapability | cap | ) | const |
Query whether this backend supports a capability.
| cap | Capability to query. |
Base class returns false for all capabilities. Subclasses override do_supports() to declare their capabilities. Lock-free — no state transitions involved.
Delegates to do_supports().
Definition at line 458 of file backend.cpp.
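| virtual std::vector< int32_t > entropic::InferenceBackend::tokenize_text | ( | const std::string & | text | ) | const |
inline virtual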
Tokenize text to token IDs.
| text | Input text. |
Reimplemented in entropic::LlamaCppBackend.
| void entropic::InferenceBackend::unload | ( | ) |
Full unload (→ COLD).
Releases all RAM + VRAM.
Idempotent.
Definition at line 139 of file backend.cpp.
| std::string entropic::InferenceBackend::last_error_ |
protected
Last error message for diagnostics.