Entropic 2.3.8
Local-first agentic inference engine
entropic::ModelOrchestrator Class Reference

Multi-model lifecycle and routing orchestrator. More...

#include <entropic/inference/orchestrator.h>

Classes

struct  SpeculativeCompatInfo
 Result of a speculative-decoding compatibility check. More...
 

Public Types

enum class  ResidencyEvent : int { Loaded = 0 , Evicted = 1 , ActivationSwap = 2 }
 Residency observer event codes — mirror the C ABI enum entropic_residency_event_t exactly (LOADED=0, EVICTED=1, ACTIVATION_SWAP=2). More...
 
using ResidencyObserverFn = std::function< void(ResidencyEvent event, const std::string &tier_name, const std::string &model_path, size_t footprint)>
 Residency observer callback type (internal C++ form).
 

Public Member Functions

bool initialize (const ParsedConfig &config)
 Initialize from parsed config.
 
void shutdown ()
 Shutdown — unload all models.
 
 ~ModelOrchestrator ()
 Destructor — invokes shutdown() and AdapterManager::unload_all().
 
GenerationResult generate (const std::vector< Message > &messages, const GenerationParams &params, const std::string &tier_name="")
 Generate using routed or explicit tier.
 
GenerationResult generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view)> on_token, std::atomic< bool > &cancel, const std::string &tier_name="")
 Streaming generation.
 
std::string route (const std::vector< Message > &messages)
 Route to tier using router model.
 
RoutingResult last_routing_result () const
 Last routing result.
 
std::string last_used_tier () const
 Last used tier name.
 
std::vector< std::string > loaded_models () const
 Currently loaded model tier names.
 
std::vector< std::string > available_models () const
 All configured tier names.
 
bool can_handoff (const std::string &from, const std::string &to) const
 Check if handoff is permitted.
 
ChatAdapter * get_adapter (const std::string &tier_name) const
 Get adapter for a tier.
 
InferenceBackend * get_backend (const std::string &tier_name) const
 Get the inference backend for a tier (for evaluation APIs).
 
AdapterManager & adapter_manager ()
 Access the LoRA adapter manager.
 
GrammarRegistry & grammar_registry ()
 Access the grammar registry.
 
ProfileRegistry & profile_registry ()
 Access the GPU resource profile registry.
 
ThroughputTracker & throughput_tracker ()
 Access the throughput tracker.
 
size_t load_grammars_from (const std::filesystem::path &grammar_dir)
 Load grammars from an explicit directory path.
 
void clear_all_prompt_caches ()
 Invalidate prompt/KV caches across every pooled backend.
 
bool has_vision_capable_tier () const
 Return true if any configured tier declares the "vision" capability (gh#41, v2.1.8).
 
SpeculativeCompatInfo check_speculative_compat () const
 Check whether the currently-configured target/draft pair is compatible for speculative decoding.
 
void set_speculative_enabled (bool enabled)
 Runtime toggle for the speculative-decoding path.
 
void set_residency_observer (ResidencyObserverFn cb)
 Register a residency observer.
 
std::string residency_snapshot_json () const
 Serialize the current residency set as a JSON string.
 
size_t vram_budget_bytes () const
 Engine-tracked VRAM budget in bytes (0 = unknown).
 
size_t tier_footprint_bytes (const std::string &tier_name) const
 Estimated VRAM footprint for a given tier in bytes.
 
entropic_error_t last_residency_error () const
 Last residency-related error code, or ENTROPIC_OK if none.
 
void clear_last_residency_error ()
 Clear last_residency_error().
 
std::string select_vision_tier () const
 Pick the canonical vision-capable tier name (gh#41).
 

Detailed Description

Multi-model lifecycle and routing orchestrator.

Manages model pool deduplication, per-tier adapters, VRAM lifecycle, and digit-based tier routing via a router model.

Version
1.8.2

Definition at line 71 of file orchestrator.h.

Member Typedef Documentation

◆ ResidencyObserverFn

using entropic::ModelOrchestrator::ResidencyObserverFn = std::function<void( ResidencyEvent event, const std::string& tier_name, const std::string& model_path, size_t footprint)>

Residency observer callback type (internal C++ form).

Fires synchronously from inside get_model / deactivate_current_if_needed while the swap mutex is held. Must not call back into the orchestrator on the same thread.

Parameters
event  Lifecycle event code.
tier_name  Tier whose backing model changed VRAM state.
model_path  Resolved GGUF path for the model.
footprint  Estimated VRAM footprint in bytes.
Version
2.2.4

Definition at line 333 of file orchestrator.h.
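
Because the callback fires with the swap mutex held, one safe pattern is to record the event and return immediately, deferring any orchestrator calls until later. The sketch below redeclares ResidencyEvent and ResidencyObserverFn locally so that it compiles on its own; ResidencyRecord, ResidencyLog, and make_recording_observer are hypothetical names used for illustration and are not part of Entropic.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Local stand-ins mirroring the documented declarations so the sketch is
// self-contained; real code would use entropic::ModelOrchestrator's types.
enum class ResidencyEvent : int { Loaded = 0, Evicted = 1, ActivationSwap = 2 };
using ResidencyObserverFn = std::function<void(
    ResidencyEvent event, const std::string& tier_name,
    const std::string& model_path, std::size_t footprint)>;

// Hypothetical record type: one entry per observed residency change.
struct ResidencyRecord {
    ResidencyEvent event;
    std::string tier_name;
    std::string model_path;
    std::size_t footprint;
};

// Hypothetical thread-safe log. The observer only appends to it and never
// calls back into the orchestrator, which the callback contract forbids.
class ResidencyLog {
public:
    void push(ResidencyRecord record) {
        std::lock_guard<std::mutex> lock(mu_);
        records_.push_back(std::move(record));
    }

    // Hands all pending records to the caller and empties the log.
    std::vector<ResidencyRecord> drain() {
        std::lock_guard<std::mutex> lock(mu_);
        std::vector<ResidencyRecord> out;
        out.swap(records_);
        return out;
    }

private:
    std::mutex mu_;
    std::vector<ResidencyRecord> records_;
};

// Builds an observer that copies its arguments into the shared log.
ResidencyObserverFn make_recording_observer(std::shared_ptr<ResidencyLog> log) {
    return [log](ResidencyEvent event, const std::string& tier_name,
                 const std::string& model_path, std::size_t footprint) {
        log->push({event, tier_name, model_path, footprint});
    };
}
```

The returned callable has the shape set_residency_observer() expects; a consumer thread can call drain() periodically and react outside the swap mutex.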

Member Enumeration Documentation

◆ ResidencyEvent

Residency observer event codes — mirror the C ABI enum entropic_residency_event_t exactly (LOADED=0, EVICTED=1, ACTIVATION_SWAP=2).

Version
2.2.4

Definition at line 313 of file orchestrator.h.

Constructor & Destructor Documentation

◆ ~ModelOrchestrator()

entropic::ModelOrchestrator::~ModelOrchestrator ( )

Destructor — invokes shutdown() and AdapterManager::unload_all().

Orchestrate teardown order (gh#58 close-out).

Combines two fixes:

  • gh#63 (v2.2.9): make shutdown() actually run on destroy, so VRAM release on handle teardown is explicit rather than relying on the shared_ptr<LlamaCppBackend> cascade. Pre-v2.2.9 nothing called shutdown() during the destroy path.
  • gh#58 close-out (v2.3.0): also call AdapterManager::unload_all() so loaded LoRA llama_adapter_lora* handles don't leak. Pre-v2.3.0 AdapterManager had no destructor, the same shape as the pre-v2.2.8 LlamaCppBackend leak.

Teardown order matters: backends first (frees llama_contexts), then adapter handles (safe because the contexts that referenced HOT adapters are gone). Out-of-line so the .cpp can stage the two calls in the right sequence and keep the header free of implementation noise.

Version
2.3.0


Definition at line 243 of file orchestrator.cpp.
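
The ordering constraint described above can be illustrated with a minimal sketch. FakeBackendPool, FakeAdapterManager, and FakeOrchestrator are hypothetical stand-ins that only record the order in which they release resources; they are not the real Entropic types, and the actual destructor body is not reproduced here.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the backend pool: records when it shuts down.
struct FakeBackendPool {
    std::vector<std::string>* trace;
    void shutdown() { trace->push_back("backends"); }
};

// Hypothetical stand-in for the adapter manager: records when it unloads.
struct FakeAdapterManager {
    std::vector<std::string>* trace;
    void unload_all() { trace->push_back("adapters"); }
};

class FakeOrchestrator {
public:
    explicit FakeOrchestrator(std::vector<std::string>* trace)
        : pool_{trace}, adapters_{trace} {}

    // Explicit two-step teardown instead of relying on member destruction
    // order or a smart-pointer cascade.
    ~FakeOrchestrator() {
        pool_.shutdown();        // 1. backends first: frees the contexts
        adapters_.unload_all();  // 2. adapter handles, now unreferenced
    }

private:
    FakeBackendPool pool_;
    FakeAdapterManager adapters_;
};
```

Writing the two calls out in the destructor keeps the sequence visible and independent of how the members happen to be declared.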

Member Function Documentation

◆ adapter_manager()

AdapterManager & entropic::ModelOrchestrator::adapter_manager ( )
inline

Access the LoRA adapter manager.

Returns
Reference to AdapterManager.
Version
1.9.2

Definition at line 199 of file orchestrator.h.

◆ available_models()

std::vector< std::string > entropic::ModelOrchestrator::available_models ( ) const

All configured tier names.

Version
1.8.2

Definition at line 885 of file orchestrator.cpp.

◆ can_handoff()

bool entropic::ModelOrchestrator::can_handoff ( const std::string &  from,
const std::string &  to 
) const

Check if handoff is permitted.

Version
1.8.2

Definition at line 915 of file orchestrator.cpp.

◆ check_speculative_compat()

ModelOrchestrator::SpeculativeCompatInfo entropic::ModelOrchestrator::check_speculative_compat ( ) const

Check whether the currently-configured target/draft pair is compatible for speculative decoding.

Speculative compatibility check (target vs draft).

Reads the active main tier as the target (verifier) and the "draft" role on the SecondaryModelLoader as the proposer. Returns compatible=false with a specific diagnostic when:

  • No main tier is loaded (target unavailable).
  • No draft role is loaded (no proposer configured).
  • entropic::speculative::check_compat rejects the pairing (recurrent target, tokenizer mismatch, etc.).
Returns
SpeculativeCompatInfo with the compatible flag and a structured diagnostic that the C ABI can forward to consumers.
Version
2.1.11

Definition at line 1219 of file orchestrator.cpp.
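
The three rejection cases listed above amount to an ordered series of early returns. The following sketch shows that shape under stated assumptions: CompatInfo is a hypothetical mirror of SpeculativeCompatInfo (the compatible flag is documented, the diagnostic field name is assumed), and PairState is an invented input type standing in for orchestrator state. The diagnostic strings are illustrative only.

```cpp
#include <string>

// Hypothetical mirror of SpeculativeCompatInfo. Only `compatible` is named
// in the reference text; `diagnostic` is an assumed field name.
struct CompatInfo {
    bool compatible;
    std::string diagnostic;
};

// Hypothetical snapshot of the state the check reads.
struct PairState {
    bool target_loaded;          // is a main tier active?
    bool draft_loaded;           // is the "draft" role loaded?
    std::string pair_rejection;  // empty when the pair-level check accepts
};

// Applies the checks in the documented order and stops at the first failure,
// so the caller always gets the most specific diagnostic available.
CompatInfo check_pair(const PairState& state) {
    if (!state.target_loaded) {
        return {false, "no main tier loaded"};
    }
    if (!state.draft_loaded) {
        return {false, "no draft role loaded"};
    }
    if (!state.pair_rejection.empty()) {
        return {false, state.pair_rejection};
    }
    return {true, ""};
}
```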

◆ clear_all_prompt_caches()

void entropic::ModelOrchestrator::clear_all_prompt_caches ( )

Invalidate prompt/KV caches across every pooled backend.

Called when identity content (the system prompt prefix) changes, so that no cached prefix is served against the new system prompt (P1-7, 2.0.6-rc16). Since v2.1.11 the invalidation also fans out to secondary roles (router, draft) via SecondaryModelLoader.

Version
2.0.6-rc16 (secondary-role fan-out added in 2.1.11)

Definition at line 1113 of file orchestrator.cpp.

◆ clear_last_residency_error()

void entropic::ModelOrchestrator::clear_last_residency_error ( )
inline

Clear last_residency_error().

Version
2.2.4

Definition at line 407 of file orchestrator.h.

◆ generate()

GenerationResult entropic::ModelOrchestrator::generate ( const std::vector< Message > &  messages,
const GenerationParams &  params,
const std::string &  tier_name = "" 
)

Generate using routed or explicit tier.

Generate a response using a routed or explicit tier.

Parameters
messages  Conversation history.
params  Generation parameters.
tier_name  Explicit tier, or empty for routing.
Returns
GenerationResult with routing/swap timing.
Version
1.8.2

Speculative routing added in v2.1.11 (gh#36): when the kernel is configured and the target/draft pair is compatible, dispatches through LlamaCppBackend::generate_speculative_with_draft; falls back to plain decode otherwise. The dispatch decision is delegated to run_generate_dispatch to keep this method under the SLOC gate.

v2.2.4 (gh#57): a refused activation now reports ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE via build_no_model_error instead of the generic GENERATE_FAILED.

Definition at line 399 of file orchestrator.cpp.

◆ generate_streaming()

GenerationResult entropic::ModelOrchestrator::generate_streaming ( const std::vector< Message > &  messages,
const GenerationParams &  params,
std::function< void(std::string_view)>  on_token,
std::atomic< bool > &  cancel,
const std::string &  tier_name = "" 
)

Streaming generation.

Streaming generation with speculative dispatch.

Version
1.8.2

Speculative routing added in v2.1.11 (gh#36): when the kernel is configured and the target/draft pair is compatible, dispatches to LlamaCppBackend::generate_speculative_with_draft via try_speculative_route_streaming with the draft resolved from secondary_loader_.get("draft"). Falls back to plain streaming on NOT_SUPPORTED or compatibility failure, with a diagnostic logged.

Definition at line 453 of file orchestrator.cpp.
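
The on_token and cancel parameters suggest a cooperative pattern: the caller consumes pieces in the callback and raises the flag when it wants generation to stop. The sketch below shows that pattern against fake_stream, a hypothetical stand-in that replays a fixed token list. That the flag is polled once per token is an assumption of this sketch, not a documented property of the engine.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a streaming generator. It replays `tokens`,
// checks `cancel` before each one (assumed polling granularity), and
// returns how many tokens it delivered.
std::size_t fake_stream(const std::vector<std::string>& tokens,
                        std::function<void(std::string_view)> on_token,
                        std::atomic<bool>& cancel) {
    std::size_t emitted = 0;
    for (const auto& token : tokens) {
        if (cancel.load()) {
            break;
        }
        on_token(token);
        ++emitted;
    }
    return emitted;
}

// Caller side: accumulate streamed text and request cancellation once at
// least `max_chars` characters have arrived.
std::string collect_up_to(const std::vector<std::string>& tokens,
                          std::size_t max_chars) {
    std::atomic<bool> cancel{false};
    std::string out;
    fake_stream(
        tokens,
        [&](std::string_view piece) {
            out.append(piece);
            if (out.size() >= max_chars) {
                cancel.store(true);
            }
        },
        cancel);
    return out;
}
```

Because the flag is only observed between tokens in this sketch, the piece that crosses the limit is still delivered in full.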

◆ get_adapter()

ChatAdapter * entropic::ModelOrchestrator::get_adapter ( const std::string &  tier_name) const

Get adapter for a tier.

Version
1.8.2

Definition at line 930 of file orchestrator.cpp.

◆ get_backend()

InferenceBackend * entropic::ModelOrchestrator::get_backend ( const std::string &  tier_name) const

Get the inference backend for a tier (for evaluation APIs).

Parameters
tier_name  Tier name (e.g. "lead", "eng").
Returns
Backend pointer, or nullptr if the tier is not found.
Version
1.10.2

Definition at line 903 of file orchestrator.cpp.

◆ grammar_registry()

GrammarRegistry & entropic::ModelOrchestrator::grammar_registry ( )
inline

Access the grammar registry.

Returns
Reference to GrammarRegistry.
Version
1.9.3

Definition at line 207 of file orchestrator.h.

◆ has_vision_capable_tier()

bool entropic::ModelOrchestrator::has_vision_capable_tier ( ) const

Return true if any configured tier declares the "vision" capability (gh#41, v2.1.8).

Vision-capability lookup (gh#41, v2.1.8).

Read-only lookup over the parsed ModelsConfig — does not touch backend state. Used by the facade's entropic_run_messages entry point to short-circuit with ENTROPIC_ERROR_NO_VISION_TIER before dispatching a multimodal turn that no tier can handle.

Version
2.1.8
Returns
true if any configured tier declares "vision".

Definition at line 1128 of file orchestrator.cpp.

◆ initialize()

bool entropic::ModelOrchestrator::initialize ( const ParsedConfig &  config)

Initialize from parsed config.

Initialize orchestrator: backends, routing, adapters, grammars.

Parameters
config  Parsed full engine config.
Returns
true on success.
Version
1.8.2

v2.1.11 (gh#36) adds speculative-draft activation alongside router activation: the draft slot loads when inference.speculative.enabled is true and a draft_model is configured.

v2.2.4 (gh#57) caches the VRAM budget from ENTROPIC_VRAM_BUDGET_BYTES so the residency gate in get_model has a number to test against.

Definition at line 187 of file orchestrator.cpp.

◆ last_residency_error()

entropic_error_t entropic::ModelOrchestrator::last_residency_error ( ) const
inline

Last residency-related error code, or ENTROPIC_OK if none.

Set by get_model when a tier-fit check fails (returns ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE). The facade clears it after translating to the C ABI return code. Independent of the last_error_ string carried on individual backends.

Version
2.2.4

Definition at line 400 of file orchestrator.h.

◆ last_routing_result()

RoutingResult entropic::ModelOrchestrator::last_routing_result ( ) const

Last routing result.

Version
1.8.2

Definition at line 845 of file orchestrator.cpp.

◆ last_used_tier()

std::string entropic::ModelOrchestrator::last_used_tier ( ) const

Last used tier name.

Version
1.8.2

Definition at line 854 of file orchestrator.cpp.

◆ load_grammars_from()

size_t entropic::ModelOrchestrator::load_grammars_from ( const std::filesystem::path &  grammar_dir)

Load grammars from an explicit directory path.

Called by the facade after data-dir resolution. This is the fallback path when config_dir doesn't contain a grammars subdir (e.g., an installed layout where grammars live under share/entropic).

Parameters
grammar_dir  Path to a directory containing .gbnf files.
Returns
Number of grammars loaded.
Version
2.0.6

Definition at line 1092 of file orchestrator.cpp.

◆ loaded_models()

std::vector< std::string > entropic::ModelOrchestrator::loaded_models ( ) const

Currently loaded model tier names.

Version
1.8.2

Includes "router" when the secondary loader reports the role as loaded (v2.1.11, gh#27 — previously checked the raw router_ field).

Definition at line 867 of file orchestrator.cpp.

◆ profile_registry()

ProfileRegistry & entropic::ModelOrchestrator::profile_registry ( )
inline

Access the GPU resource profile registry.

Returns
Reference to ProfileRegistry.
Version
2.0.0

Definition at line 215 of file orchestrator.h.

◆ residency_snapshot_json()

std::string entropic::ModelOrchestrator::residency_snapshot_json ( ) const

Serialize the current residency set as a JSON string.

Serialize the current VRAM residency snapshot to JSON.

Schema is documented on the C ABI entry point entropic_residency_snapshot (entropic.h). Read-only — takes the swap mutex briefly to obtain a consistent snapshot.

Returns
JSON object string.
Version
2.2.4

Definition at line 1440 of file orchestrator.cpp.

◆ route()

std::string entropic::ModelOrchestrator::route ( const std::vector< Message > &  messages)

Route to tier using router model.

Route to the appropriate tier using the router model.

Parameters
messages  Current conversation.
Returns
Selected tier name.
Version
1.8.2

Guard updated in v2.1.11: routing requires models.router to be configured (was: router_ non-null). The slot is owned by secondary_loader_ since gh#27.

Definition at line 506 of file orchestrator.cpp.

◆ select_vision_tier()

std::string entropic::ModelOrchestrator::select_vision_tier ( ) const

Pick the canonical vision-capable tier name (gh#41).

First vision-capable tier name (gh#41, v2.1.8).

Returns the first tier (iteration order of the parsed models.tiers map) whose capabilities include "vision", or empty string if none exists. Multi-tier policy refinements (e.g. prefer the default tier when it qualifies) can layer on top later — single-vision-tier deployments are the common case for v2.1.8 (gh#42 ships the primary tier as the only vision-capable bundled entry).

Returns
Vision tier name, or "" if none is configured.
Version
2.1.8

Definition at line 1141 of file orchestrator.cpp.
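
The selection rule described above (first tier in iteration order whose capabilities include "vision") can be sketched as follows. TierCaps, first_vision_tier, and any_vision_tier are hypothetical names, and modelling the tier table as a std::map (which iterates in key order) is an assumption; the actual container behind models.tiers is not specified here.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical tier table: tier name -> declared capabilities.
// std::map is an assumption; it iterates in ascending key order.
using TierCaps = std::map<std::string, std::vector<std::string>>;

// Returns the first tier, in the container's iteration order, that declares
// the "vision" capability, or an empty string when no tier does.
std::string first_vision_tier(const TierCaps& tiers) {
    for (const auto& [name, caps] : tiers) {
        if (std::find(caps.begin(), caps.end(), "vision") != caps.end()) {
            return name;
        }
    }
    return "";
}

// Boolean form of the same read-only lookup, in the spirit of
// has_vision_capable_tier().
bool any_vision_tier(const TierCaps& tiers) {
    return !first_vision_tier(tiers).empty();
}
```

With more than one vision-capable tier, the result depends entirely on iteration order, which is why the text above calls out later policy refinements.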

◆ set_residency_observer()

void entropic::ModelOrchestrator::set_residency_observer ( ResidencyObserverFn  cb)

Register a residency observer.

Registers the residency observer, replacing any previously registered one. Passing an empty std::function clears the observer.

Parameters
cb  Observer callable, or empty to clear.
Version
2.2.4

Definition at line 1369 of file orchestrator.cpp.

◆ set_speculative_enabled()

void entropic::ModelOrchestrator::set_speculative_enabled ( bool  enabled)
inline

Runtime toggle for the speculative-decoding path.

Lets consumers (and tests) flip speculative on/off without reinitializing the orchestrator. Defaults to whatever inference.speculative.enabled was in the parsed config at init time.

Parameters
enabled  true to route through the speculative kernel when a compatible draft is loaded.
Version
2.1.11

Definition at line 301 of file orchestrator.h.

◆ shutdown()

void entropic::ModelOrchestrator::shutdown ( )

Shutdown — unload all models.

Version
1.8.2

Main-tier pool is unloaded directly; secondary roles (router, draft, etc.) are released through secondary_loader_.shutdown() (v2.1.11).

Definition at line 226 of file orchestrator.cpp.

◆ throughput_tracker()

ThroughputTracker & entropic::ModelOrchestrator::throughput_tracker ( )
inline

Access the throughput tracker.

Returns
Reference to ThroughputTracker.
Version
2.0.0

Definition at line 223 of file orchestrator.h.

◆ tier_footprint_bytes()

size_t entropic::ModelOrchestrator::tier_footprint_bytes ( const std::string &  tier_name) const

Estimated VRAM footprint for a given tier in bytes.

Public footprint accessor — memoizes via tier_footprint_bytes_.

Sum of GGUF weights file size and a context-length-derived KV cache estimate plus the configured vram_reserve_mb headroom. Returns 0 if the tier is unknown.

Parameters
tier_name  Tier name.
Version
2.2.4

Definition at line 1352 of file orchestrator.cpp.
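
The footprint is described as a sum of three terms. The sketch below shows only that arithmetic; estimate_footprint_bytes is a hypothetical helper, the KV cache estimate is taken as an input because its derivation from context length is not given here, and treating vram_reserve_mb as mebibytes (1024 × 1024 bytes) is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper: weights file size + KV cache estimate + reserve.
// The reserve is configured in MB; this sketch assumes 1 MB = 1024 * 1024
// bytes, which may differ from the engine's convention.
std::uint64_t estimate_footprint_bytes(std::uint64_t weights_file_bytes,
                                       std::uint64_t kv_cache_bytes,
                                       std::uint64_t vram_reserve_mb) {
    constexpr std::uint64_t kBytesPerMb = 1024ull * 1024ull;
    return weights_file_bytes + kv_cache_bytes + vram_reserve_mb * kBytesPerMb;
}
```

For example, a 4096 MB weights file with a 512 MB KV estimate and a 256 MB reserve comes to 4864 MB under these assumptions.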

◆ vram_budget_bytes()

size_t entropic::ModelOrchestrator::vram_budget_bytes ( ) const
inline

Engine-tracked VRAM budget in bytes (0 = unknown).

Sources, in priority order: ENTROPIC_VRAM_BUDGET_BYTES environment override → CUDA cudaMemGetInfo (when the CUDA inference backend is the active build) → 0. When 0, the orchestrator does not enforce a per-tier budget gate.

Version
2.2.4

Definition at line 374 of file orchestrator.h.
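
The priority order and the "0 disables the gate" rule can be sketched as below. All three function names are hypothetical. The device query is represented by a plain argument rather than a real cudaMemGetInfo call, treating a malformed or zero override as absent is an assumption, and the `<=` comparison in the gate is likewise assumed rather than taken from the implementation.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Parses a decimal byte count. Returns 0 for null, empty, signed, malformed,
// or out-of-range input (assumed policy: bad overrides are ignored).
std::uint64_t parse_budget_bytes(const char* text) {
    if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text))) {
        return 0;
    }
    errno = 0;
    char* end = nullptr;
    const unsigned long long value = std::strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0') {
        return 0;
    }
    return static_cast<std::uint64_t>(value);
}

// Priority order: environment override, then the device-reported figure,
// then 0 (unknown). `device_free_bytes` stands in for the CUDA query.
std::uint64_t resolve_budget_bytes(const char* env_text,
                                   std::uint64_t device_free_bytes) {
    const std::uint64_t from_env = parse_budget_bytes(env_text);
    if (from_env != 0) {
        return from_env;
    }
    return device_free_bytes;
}

// A budget of 0 means "unknown", so the gate admits every tier.
bool passes_budget_gate(std::uint64_t budget_bytes,
                        std::uint64_t footprint_bytes) {
    return budget_bytes == 0 || footprint_bytes <= budget_bytes;
}
```

A caller would typically pass std::getenv("ENTROPIC_VRAM_BUDGET_BYTES") as env_text, and 0 for device_free_bytes on builds without the CUDA backend.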


The documentation for this class was generated from the following files: