Entropic 2.3.8
Local-first agentic inference engine
Multi-model lifecycle and routing orchestrator. More...
#include <entropic/inference/orchestrator.h>
Classes | |
| struct | SpeculativeCompatInfo |
| Result of a speculative-decoding compatibility check. More... | |
Public Types | |
| enum class | ResidencyEvent : int { Loaded = 0 , Evicted = 1 , ActivationSwap = 2 } |
Residency observer event codes — mirror the C ABI enum entropic_residency_event_t exactly (LOADED=0, EVICTED=1, ACTIVATION_SWAP=2). More... | |
| using | ResidencyObserverFn = std::function< void(ResidencyEvent event, const std::string &tier_name, const std::string &model_path, size_t footprint)> |
| Residency observer callback type (internal C++ form). | |
Public Member Functions | |
| bool | initialize (const ParsedConfig &config) |
| Initialize from parsed config. | |
| void | shutdown () |
| Shutdown — unload all models. | |
| ~ModelOrchestrator () | |
| Destructor — invokes shutdown() and AdapterManager::unload_all(). | |
| GenerationResult | generate (const std::vector< Message > &messages, const GenerationParams &params, const std::string &tier_name="") |
| Generate using routed or explicit tier. | |
| GenerationResult | generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view)> on_token, std::atomic< bool > &cancel, const std::string &tier_name="") |
| Streaming generation. | |
| std::string | route (const std::vector< Message > &messages) |
| Route to tier using router model. | |
| RoutingResult | last_routing_result () const |
| Last routing result. | |
| std::string | last_used_tier () const |
| Last used tier name. | |
| std::vector< std::string > | loaded_models () const |
| Currently loaded model tier names. | |
| std::vector< std::string > | available_models () const |
| All configured tier names. | |
| bool | can_handoff (const std::string &from, const std::string &to) const |
| Check if handoff is permitted. | |
| ChatAdapter * | get_adapter (const std::string &tier_name) const |
| Get adapter for a tier. | |
| InferenceBackend * | get_backend (const std::string &tier_name) const |
| Get the inference backend for a tier (for evaluation APIs). | |
| AdapterManager & | adapter_manager () |
| Access the LoRA adapter manager. | |
| GrammarRegistry & | grammar_registry () |
| Access the grammar registry. | |
| ProfileRegistry & | profile_registry () |
| Access the GPU resource profile registry. | |
| ThroughputTracker & | throughput_tracker () |
| Access the throughput tracker. | |
| size_t | load_grammars_from (const std::filesystem::path &grammar_dir) |
| Load grammars from an explicit directory path. | |
| void | clear_all_prompt_caches () |
| Invalidate prompt/KV caches across every pooled backend. | |
| bool | has_vision_capable_tier () const |
Return true if any configured tier declares the "vision" capability (gh#41, v2.1.8). | |
| SpeculativeCompatInfo | check_speculative_compat () const |
| Check whether the currently-configured target/draft pair is compatible for speculative decoding. | |
| void | set_speculative_enabled (bool enabled) |
| Runtime toggle for the speculative-decoding path. | |
| void | set_residency_observer (ResidencyObserverFn cb) |
| Register a residency observer. | |
| std::string | residency_snapshot_json () const |
| Serialize the current residency set as a JSON string. | |
| size_t | vram_budget_bytes () const |
| Engine-tracked VRAM budget in bytes (0 = unknown). | |
| size_t | tier_footprint_bytes (const std::string &tier_name) const |
| Estimated VRAM footprint for a given tier in bytes. | |
| entropic_error_t | last_residency_error () const |
Last residency-related error code, or ENTROPIC_OK if none. | |
| void | clear_last_residency_error () |
Clear last_residency_error(). | |
| std::string | select_vision_tier () const |
| Pick the canonical vision-capable tier name (gh#41). | |
Multi-model lifecycle and routing orchestrator.
Manages model pool deduplication, per-tier adapters, VRAM lifecycle, and digit-based tier routing via a router model.
Definition at line 71 of file orchestrator.h.
| using entropic::ModelOrchestrator::ResidencyObserverFn = std::function<void( ResidencyEvent event, const std::string& tier_name, const std::string& model_path, size_t footprint)> |
Residency observer callback type (internal C++ form).
Fires synchronously from inside get_model / deactivate_current_if_needed while the swap mutex is held, so the callback must not call back into the orchestrator on the same thread.
| event | Lifecycle event code. |
| tier_name | Tier whose backing model changed VRAM state. |
| model_path | Resolved GGUF path for the model. |
| footprint | Estimated VRAM footprint in bytes. |
Definition at line 333 of file orchestrator.h.
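Because the observer runs while the swap mutex is held, one way to respect the no-reentry rule is to copy the arguments into a queue and process them later from another call site. The sketch below illustrates that pattern only: the enum and callback alias are local stand-ins that mirror the documented declarations, while ResidencyRecord and ResidencyEventQueue are hypothetical names that are not part of Entropic.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Local stand-ins mirroring the documented types; the real declarations
// live in entropic/inference/orchestrator.h.
enum class ResidencyEvent : int { Loaded = 0, Evicted = 1, ActivationSwap = 2 };

using ResidencyObserverFn = std::function<void(
    ResidencyEvent event, const std::string& tier_name,
    const std::string& model_path, std::size_t footprint)>;

// Hypothetical record type used to defer processing.
struct ResidencyRecord {
    ResidencyEvent event;
    std::string tier_name;
    std::string model_path;
    std::size_t footprint;
};

// Hypothetical queue: the observer only copies its arguments and returns,
// so it never calls back into the orchestrator.
class ResidencyEventQueue {
public:
    ResidencyObserverFn make_observer() {
        return [this](ResidencyEvent event, const std::string& tier_name,
                      const std::string& model_path, std::size_t footprint) {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back({event, tier_name, model_path, footprint});
        };
    }

    // Intended to be called later, outside the observer callback.
    std::vector<ResidencyRecord> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<ResidencyRecord> out;
        out.swap(pending_);
        return out;
    }

private:
    std::mutex mutex_;
    std::vector<ResidencyRecord> pending_;
};
```

The callable returned by make_observer() has the shape expected by set_residency_observer; the queue must outlive the registration because the lambda captures this.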
| enum class entropic::ModelOrchestrator::ResidencyEvent : int | strong |
Residency observer event codes — mirror the C ABI enum entropic_residency_event_t exactly (LOADED=0, EVICTED=1, ACTIVATION_SWAP=2).
Definition at line 313 of file orchestrator.h.
| entropic::ModelOrchestrator::~ModelOrchestrator | ( | ) |
Destructor — invokes shutdown() and AdapterManager::unload_all().
Orchestrate teardown order (gh#58 close-out).
Combines two fixes:
- shutdown() actually run on destroy, so VRAM release on handle teardown is explicit rather than relying on the shared_ptr<LlamaCppBackend> cascade. Pre-v2.2.9 nothing called shutdown() during the destroy path.
- AdapterManager::unload_all() so loaded LoRA llama_adapter_lora* handles don't leak. Pre-v2.3.0 AdapterManager had no destructor; this is the same shape as the pre-v2.2.8 LlamaCppBackend leak.

Teardown order matters: backends first (frees llama_contexts), then adapter handles (safe because the contexts that referenced HOT adapters are gone). Out-of-line so the .cpp can stage the two calls in the right sequence and keep the header free of implementation noise.
Definition at line 243 of file orchestrator.cpp.
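The teardown order described above (backends first, then adapter handles) can be illustrated with a small stand-alone sketch. Everything here is hypothetical: FakeOrchestrator, FakeAdapterManager, and TeardownLog only record the order of calls, and the idempotent shutdown() guard is an assumption rather than documented behavior.

```cpp
#include <string>
#include <vector>

// Hypothetical log that records the order in which teardown steps ran.
struct TeardownLog {
    std::vector<std::string> steps;
};

// Hypothetical stand-in for the adapter manager.
class FakeAdapterManager {
public:
    explicit FakeAdapterManager(TeardownLog& log) : log_(log) {}
    void unload_all() { log_.steps.push_back("adapters"); }

private:
    TeardownLog& log_;
};

// Hypothetical stand-in for the orchestrator's destroy path.
class FakeOrchestrator {
public:
    explicit FakeOrchestrator(TeardownLog& log) : log_(log), adapters_(log) {}

    void shutdown() {
        if (shut_down_) return;  // assumption: a second call is a no-op
        shut_down_ = true;
        log_.steps.push_back("backends");
    }

    // The destructor stages both calls explicitly instead of relying on
    // member destruction order.
    ~FakeOrchestrator() {
        shutdown();             // step 1: release backends (and their contexts)
        adapters_.unload_all(); // step 2: release adapter handles
    }

private:
    TeardownLog& log_;
    FakeAdapterManager adapters_;
    bool shut_down_ = false;
};
```

Destroying a FakeOrchestrator appends "backends" and then "adapters" to the log, which is the sequence the destructor documentation calls for.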
| AdapterManager & entropic::ModelOrchestrator::adapter_manager | ( | ) | inline |
Access the LoRA adapter manager.
Definition at line 199 of file orchestrator.h.
| std::vector< std::string > entropic::ModelOrchestrator::available_models | ( | ) | const |
| bool entropic::ModelOrchestrator::can_handoff | ( | const std::string & | from, |
| const std::string & | to | ||
| ) | const |
| ModelOrchestrator::SpeculativeCompatInfo entropic::ModelOrchestrator::check_speculative_compat | ( | ) | const |
Check whether the currently-configured target/draft pair is compatible for speculative decoding.
Speculative compatibility check (target vs draft).
Reads the active main tier as the target (verifier) and the "draft" role on the SecondaryModelLoader as the proposer. Returns compatible=false with a specific diagnostic when:
- entropic::speculative::check_compat rejects the pairing (recurrent target, tokenizer mismatch, etc.).

Returns a structured diagnostic the C ABI can forward to consumers.
Definition at line 1219 of file orchestrator.cpp.
| void entropic::ModelOrchestrator::clear_all_prompt_caches | ( | ) |
Invalidate prompt/KV caches across every pooled backend.
Called when identity content (the system prompt prefix) changes, so that no cached prefix is served against the new system prompt (P1-7, 2.0.6-rc16). Fans out to secondary roles (router, draft) via SecondaryModelLoader (v2.1.11).
Definition at line 1113 of file orchestrator.cpp.
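| void entropic::ModelOrchestrator::clear_last_residency_error | ( | ) | inline |
Clear last_residency_error().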
| GenerationResult entropic::ModelOrchestrator::generate | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params, | ||
| const std::string & | tier_name = "" |
||
| ) |
Generate using routed or explicit tier.
Generate response using routed or explicit tier.
| messages | Conversation history. |
| params | Generation parameters. |
| tier_name | Explicit tier, or empty for routing. |
Speculative routing added in v2.1.11 (gh#36): when the kernel is configured and the target/draft pair is compatible, dispatches through LlamaCppBackend::generate_speculative_with_draft; falls back to plain decode otherwise. The dispatch decision is delegated to run_generate_dispatch to keep this method under the SLOC gate.
v2.2.4 (gh#57): a refused activation now reports ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE via build_no_model_error instead of the generic GENERATE_FAILED.
Definition at line 399 of file orchestrator.cpp.
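The dispatch rule described for generate (use the speculative kernel only when it is enabled and the target/draft pair is compatible, otherwise plain decode) reduces to a small predicate. The sketch below is an illustration under assumptions: CompatInfo, DecodePath, and choose_decode_path are hypothetical names, and the real decision lives in run_generate_dispatch, whose exact inputs are not shown in this page.

```cpp
#include <string>

// Hypothetical mirror of a compatibility result: a flag plus a diagnostic.
struct CompatInfo {
    bool compatible = false;
    std::string diagnostic;
};

// Hypothetical label for the two decode paths named in the documentation.
enum class DecodePath { Speculative, Plain };

// Pick the speculative path only when every precondition holds;
// any failed precondition falls back to plain decode.
DecodePath choose_decode_path(bool speculative_enabled, bool draft_loaded,
                              const CompatInfo& compat) {
    if (speculative_enabled && draft_loaded && compat.compatible) {
        return DecodePath::Speculative;
    }
    return DecodePath::Plain;
}
```

Keeping the decision in one pure function makes the fallback easy to test without loading any model.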
| GenerationResult entropic::ModelOrchestrator::generate_streaming | ( | const std::vector< Message > & | messages, |
| const GenerationParams & | params, | ||
| std::function< void(std::string_view)> | on_token, | ||
| std::atomic< bool > & | cancel, | ||
| const std::string & | tier_name = "" |
||
| ) |
Streaming generation.
Streaming generation with speculative dispatch.
Speculative routing added in v2.1.11 (gh#36): when the kernel is configured and the target/draft pair is compatible, dispatches to LlamaCppBackend::generate_speculative_with_draft via try_speculative_route_streaming with the draft resolved from secondary_loader_.get("draft"). Falls back to plain streaming on NOT_SUPPORTED or compatibility failure, with a diagnostic logged.
Definition at line 453 of file orchestrator.cpp.
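The on_token callback and the cancel flag in the signature above suggest a caller-side pattern: accumulate pieces in the callback and set the flag to stop early. The sketch below shows that pattern against a hypothetical stand-in, fake_generate_streaming, which replays a fixed token list. It assumes the flag is checked between tokens, which this page does not state for the real engine.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a streaming generator: emits each token through
// on_token and stops as soon as the cancel flag is observed as true.
std::string fake_generate_streaming(
    const std::vector<std::string>& tokens,
    const std::function<void(std::string_view)>& on_token,
    std::atomic<bool>& cancel) {
    std::string out;
    for (const std::string& token : tokens) {
        if (cancel.load()) break;  // assumption: checked between tokens
        out += token;
        on_token(token);
    }
    return out;
}

// Caller side: collect streamed text and request cancellation once at least
// max_chars characters have arrived.
std::string collect_up_to(const std::vector<std::string>& tokens,
                          std::size_t max_chars) {
    std::atomic<bool> cancel{false};
    std::string seen;
    fake_generate_streaming(
        tokens,
        [&](std::string_view piece) {
            seen.append(piece);
            if (seen.size() >= max_chars) cancel.store(true);
        },
        cancel);
    return seen;
}
```

Because cancellation takes effect only at the next check, the collected text can exceed max_chars by up to one token.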
| ChatAdapter * entropic::ModelOrchestrator::get_adapter | ( | const std::string & | tier_name | ) | const |
| InferenceBackend * entropic::ModelOrchestrator::get_backend | ( | const std::string & | tier_name | ) | const |
Get the inference backend for a tier (for evaluation APIs).
| tier_name | Tier name (e.g. "lead", "eng"). |
Definition at line 903 of file orchestrator.cpp.
| GrammarRegistry & entropic::ModelOrchestrator::grammar_registry | ( | ) | inline |
Access the grammar registry.
Definition at line 207 of file orchestrator.h.
| bool entropic::ModelOrchestrator::has_vision_capable_tier | ( | ) | const |
Return true if any configured tier declares the "vision" capability (gh#41, v2.1.8).
Vision-capability lookup (gh#41, v2.1.8).
Read-only lookup over the parsed ModelsConfig — does not touch backend state. Used by the facade's entropic_run_messages entry point to short-circuit with ENTROPIC_ERROR_NO_VISION_TIER before dispatching a multimodal turn that no tier can handle.
Definition at line 1128 of file orchestrator.cpp.
| bool entropic::ModelOrchestrator::initialize | ( | const ParsedConfig & | config | ) |
Initialize from parsed config.
Initialize orchestrator: backends, routing, adapters, grammars.
| config | Full engine config. |
Adds speculative-draft activation alongside router activation in v2.1.11 (gh#36): the draft slot loads when inference.speculative.enabled is true and a draft_model is configured. v2.2.4 (gh#57) caches the VRAM budget from ENTROPIC_VRAM_BUDGET_BYTES so the residency gate in get_model has a number to test against.
Definition at line 187 of file orchestrator.cpp.
| entropic_error_t entropic::ModelOrchestrator::last_residency_error | ( | ) | const | inline |
Last residency-related error code, or ENTROPIC_OK if none.
Set by get_model when a tier-fit check fails (returns ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE). The facade clears it after translating to the C ABI return code. Independent of the last_error_ string carried on individual backends.
Definition at line 400 of file orchestrator.h.
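The set-then-clear lifecycle described above can be sketched with a small stand-alone model. All names here are hypothetical (FakeError, FakeResidencyGate, take_residency_error); the real code uses entropic_error_t, whose numeric values are not shown in this page. The sketch also assumes that a later successful activation leaves a previously recorded error in place until it is cleared.

```cpp
#include <cstddef>

// Hypothetical error codes standing in for ENTROPIC_OK and
// ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE.
enum class FakeError { Ok, TierModelTooLarge };

// Hypothetical gate: records an error when a tier does not fit the budget.
class FakeResidencyGate {
public:
    explicit FakeResidencyGate(std::size_t budget_bytes)
        : budget_bytes_(budget_bytes) {}

    bool try_activate(std::size_t footprint_bytes) {
        // A budget of 0 means "unknown", so no gate is enforced.
        if (budget_bytes_ != 0 && footprint_bytes > budget_bytes_) {
            last_error_ = FakeError::TierModelTooLarge;
            return false;
        }
        return true;
    }

    FakeError last_residency_error() const { return last_error_; }
    void clear_last_residency_error() { last_error_ = FakeError::Ok; }

private:
    std::size_t budget_bytes_;
    FakeError last_error_ = FakeError::Ok;
};

// Facade-style helper: read the error once, then clear it so the next
// call starts from a clean state.
FakeError take_residency_error(FakeResidencyGate& gate) {
    FakeError error = gate.last_residency_error();
    gate.clear_last_residency_error();
    return error;
}
```

Reading and clearing in one helper keeps a stale error from being reported against a later, unrelated call.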
| RoutingResult entropic::ModelOrchestrator::last_routing_result | ( | ) | const |
| std::string entropic::ModelOrchestrator::last_used_tier | ( | ) | const |
| size_t entropic::ModelOrchestrator::load_grammars_from | ( | const std::filesystem::path & | grammar_dir | ) |
Load grammars from an explicit directory path.
| grammar_dir | Path containing .gbnf files. |
Called by the facade after data-dir resolution. This is the fallback path when config_dir doesn't contain a grammars subdir (e.g., installed layout where grammars live under share/entropic).
Definition at line 1092 of file orchestrator.cpp.
| std::vector< std::string > entropic::ModelOrchestrator::loaded_models | ( | ) | const |
Currently loaded model tier names.
Includes "router" when the secondary loader reports the role as loaded (v2.1.11, gh#27 — previously checked the raw router_ field).
Definition at line 867 of file orchestrator.cpp.
| ProfileRegistry & entropic::ModelOrchestrator::profile_registry | ( | ) | inline |
Access the GPU resource profile registry.
Definition at line 215 of file orchestrator.h.
| std::string entropic::ModelOrchestrator::residency_snapshot_json | ( | ) | const |
Serialize the current residency set as a JSON string.
Serialize the current VRAM residency snapshot to JSON.
Schema is documented on the C ABI entry point entropic_residency_snapshot (entropic.h). Read-only — takes the swap mutex briefly to obtain a consistent snapshot.
Definition at line 1440 of file orchestrator.cpp.
| std::string entropic::ModelOrchestrator::route | ( | const std::vector< Message > & | messages | ) |
Route to tier using router model.
Route to appropriate tier using router model.
| messages | Current conversation. |
Guard updated in v2.1.11: routing requires models.router to be configured (was: router_ non-null). The slot is owned by secondary_loader_ since gh#27.
Definition at line 506 of file orchestrator.cpp.
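The class description mentions digit-based tier routing via a router model, but this page does not spell out the mapping. The sketch below is one plausible reading, offered purely as an assumption: the router's text output is scanned for its first decimal digit, which is treated as a 1-based index into an ordered tier list, with a fallback tier when no usable digit is found. The function name and the indexing convention are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical mapping from router output to a tier name.
// Assumptions: the first digit decides, digits are 1-based, and anything
// out of range (or no digit at all) selects the fallback tier.
std::string tier_from_router_output(const std::string& router_output,
                                    const std::vector<std::string>& tiers,
                                    const std::string& fallback) {
    for (unsigned char c : router_output) {
        if (std::isdigit(c)) {
            std::size_t index = static_cast<std::size_t>(c - '0');
            if (index >= 1 && index <= tiers.size()) {
                return tiers[index - 1];
            }
            return fallback;  // digit present but out of range
        }
    }
    return fallback;  // no digit in the output
}
```

A single-digit answer keeps the router's generation short, which is presumably part of the appeal of this style of routing.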
| std::string entropic::ModelOrchestrator::select_vision_tier | ( | ) | const |
Pick the canonical vision-capable tier name (gh#41).
First vision-capable tier name (gh#41, v2.1.8).
Returns the first tier (iteration order of the parsed models.tiers map) whose capabilities include "vision", or empty string if none exists. Multi-tier policy refinements (e.g. prefer the default tier when it qualifies) can layer on top later — single-vision-tier deployments are the common case for v2.1.8 (gh#42 ships the primary tier as the only vision-capable bundled entry).
Definition at line 1141 of file orchestrator.cpp.
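The "first tier whose capabilities include vision" rule can be sketched as a linear scan. The container below is an assumption: this page does not say which map type backs models.tiers, so the sketch uses std::map, whose iteration order is sorted by key. TierCapabilities, first_vision_tier, and any_vision_tier are hypothetical names.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical shape of the parsed tier table: tier name -> capabilities.
using TierCapabilities = std::map<std::string, std::vector<std::string>>;

// Return the first tier, in the map's iteration order, that lists "vision";
// an empty string means no tier qualifies.
std::string first_vision_tier(const TierCapabilities& tiers) {
    for (const auto& [name, capabilities] : tiers) {
        if (std::find(capabilities.begin(), capabilities.end(), "vision") !=
            capabilities.end()) {
            return name;
        }
    }
    return "";
}

// Companion check in the spirit of has_vision_capable_tier().
bool any_vision_tier(const TierCapabilities& tiers) {
    return !first_vision_tier(tiers).empty();
}
```

With a different map type the iteration order, and therefore the selected tier, could differ when several tiers declare the capability.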
| void entropic::ModelOrchestrator::set_residency_observer | ( | ResidencyObserverFn | cb | ) |
Register a residency observer.
Register, replace, or clear the residency observer.
Registering a new observer replaces the previous one; passing an empty std::function clears the observer.
| cb | Observer callable, or empty to clear. |
Definition at line 1369 of file orchestrator.cpp.
| void entropic::ModelOrchestrator::set_speculative_enabled | ( | bool | enabled | ) | inline |
Runtime toggle for the speculative-decoding path.
Lets consumers (and tests) flip speculative on/off without reinitializing the orchestrator. Defaults to whatever inference.speculative.enabled was in the parsed config at init time.
| enabled | true to route through the speculative kernel when a compatible draft is loaded. |
Definition at line 301 of file orchestrator.h.
| void entropic::ModelOrchestrator::shutdown | ( | ) |
Shutdown — unload all models.
Main-tier pool is unloaded directly; secondary roles (router, draft, etc.) are released through secondary_loader_.shutdown() (v2.1.11).
Definition at line 226 of file orchestrator.cpp.
| ThroughputTracker & entropic::ModelOrchestrator::throughput_tracker | ( | ) | inline |
Access the throughput tracker.
Definition at line 223 of file orchestrator.h.
| size_t entropic::ModelOrchestrator::tier_footprint_bytes | ( | const std::string & | tier_name | ) | const |
Estimated VRAM footprint for a given tier in bytes.
Public footprint accessor — memoizes via tier_footprint_bytes_.
Sum of the GGUF weights file size, a context-length-derived KV cache estimate, and the configured vram_reserve_mb headroom. Returns 0 if the tier is unknown.
| tier_name | Tier name. |
Definition at line 1352 of file orchestrator.cpp.
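The footprint estimate and its memoization can be sketched as follows. This is an illustration under assumptions, not the engine's formula: the KV cache term uses a common approximation (keys and values for every layer, position, and KV head), vram_reserve_mb is taken to mean mebibytes, and TierSpec, kv_cache_bytes, and FootprintEstimator are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-tier inputs to the estimate.
struct TierSpec {
    std::uint64_t weights_bytes;   // size of the GGUF file on disk
    std::uint64_t n_layers;
    std::uint64_t n_ctx;           // context length in tokens
    std::uint64_t n_kv_heads;
    std::uint64_t head_dim;
    std::uint64_t bytes_per_elem;  // e.g. 2 for a 16-bit cache
};

// Common approximation: K and V tensors per layer, per position, per head.
std::uint64_t kv_cache_bytes(const TierSpec& s) {
    return 2 * s.n_layers * s.n_ctx * s.n_kv_heads * s.head_dim *
           s.bytes_per_elem;
}

class FootprintEstimator {
public:
    FootprintEstimator(std::unordered_map<std::string, TierSpec> tiers,
                       std::uint64_t reserve_mb)
        : tiers_(std::move(tiers)),
          reserve_bytes_(reserve_mb * 1024 * 1024) {}  // assumption: MiB

    // Returns 0 for unknown tiers; known tiers are computed once and cached.
    std::uint64_t footprint(const std::string& tier_name) {
        auto cached = cache_.find(tier_name);
        if (cached != cache_.end()) return cached->second;

        auto spec = tiers_.find(tier_name);
        if (spec == tiers_.end()) return 0;

        std::uint64_t total = spec->second.weights_bytes +
                              kv_cache_bytes(spec->second) + reserve_bytes_;
        cache_[tier_name] = total;
        return total;
    }

private:
    std::unordered_map<std::string, TierSpec> tiers_;
    std::uint64_t reserve_bytes_;
    std::unordered_map<std::string, std::uint64_t> cache_;
};
```

Unknown tiers are deliberately not cached, so a tier added later would still be picked up by this sketch.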
| size_t entropic::ModelOrchestrator::vram_budget_bytes | ( | ) | const | inline |
Engine-tracked VRAM budget in bytes (0 = unknown).
Sources, in priority order: ENTROPIC_VRAM_BUDGET_BYTES environment override → CUDA cudaMemGetInfo (when the CUDA inference backend is the active build) → 0. When 0, the orchestrator does not enforce a per-tier budget gate.
Definition at line 374 of file orchestrator.h.
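The priority order above (environment override, then a device query, then 0) can be sketched without any GPU dependency by passing the device query in as a callable. The names parse_budget and resolve_vram_budget are hypothetical, and treating a malformed or zero override as "not set" is an assumption about behavior this page does not describe.

```cpp
#include <cstdint>
#include <cstdlib>
#include <functional>

// Parse a decimal byte count; returns 0 for null, empty, or malformed text.
std::uint64_t parse_budget(const char* text) {
    if (text == nullptr || text[0] < '0' || text[0] > '9') return 0;
    char* end = nullptr;
    unsigned long long value = std::strtoull(text, &end, 10);
    return (*end == '\0') ? static_cast<std::uint64_t>(value) : 0;
}

// Resolve the budget in priority order:
//   1. the environment override, when it parses to a non-zero value;
//   2. the device probe, when one is supplied (for example a wrapper
//      around cudaMemGetInfo in a CUDA build);
//   3. 0, meaning "unknown", in which case no per-tier gate is enforced.
std::uint64_t resolve_vram_budget(
    const char* env_value,
    const std::function<std::uint64_t()>& device_probe) {
    std::uint64_t from_env = parse_budget(env_value);
    if (from_env != 0) return from_env;
    if (device_probe) return device_probe();
    return 0;
}
```

A caller would typically pass std::getenv("ENTROPIC_VRAM_BUDGET_BYTES") as the first argument; injecting the probe keeps the ordering logic testable on machines without a GPU.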