Multi-model lifecycle and routing orchestrator.

#include <entropic/inference/orchestrator.h>

Classes
struct	SpeculativeCompatInfo
	Result of a speculative-decoding compatibility check.

Public Types
enum class	ResidencyEvent : int { Loaded = 0 , Evicted = 1 , ActivationSwap = 2 }
	Residency observer event codes. They mirror the C ABI enum `entropic_residency_event_t` exactly (LOADED=0, EVICTED=1, ACTIVATION_SWAP=2).

using	ResidencyObserverFn = std::function< void(ResidencyEvent event, const std::string &tier_name, const std::string &model_path, size_t footprint)>
	Residency observer callback type (internal C++ form).

Public Member Functions
bool	initialize (const ParsedConfig &config)
	Initialize from parsed config.

void	shutdown ()
	Shutdown — unload all models.

	~ModelOrchestrator ()
	Destructor — invokes shutdown() and AdapterManager::unload_all().

GenerationResult	generate (const std::vector< Message > &messages, const GenerationParams &params, const std::string &tier_name="")
	Generate using routed or explicit tier.

GenerationResult	generate (const std::vector< Message > &messages, const GenerationParams &params, std::atomic< bool > &cancel, const std::string &tier_name="")
	Non-streaming (batch) generation with cancel support.

std::vector< GenerationResult >	generate_batch (const std::vector< std::vector< Message > > &messages_list, const std::vector< GenerationParams > &params_list, const std::vector< std::string > &tiers, std::atomic< bool > &cancel)
	Same-prefix batch generation on a shared resident model (gh#98).

GenerationResult	generate_streaming (const std::vector< Message > &messages, const GenerationParams &params, std::function< void(std::string_view)> on_token, std::atomic< bool > &cancel, const std::string &tier_name="")
	Streaming generation.

std::string	route (const std::vector< Message > &messages)
	Route to tier using router model.

RoutingResult	last_routing_result () const
	Last routing result.

std::string	last_used_tier () const
	Last used tier name.

std::vector< std::string >	loaded_models () const
	Currently loaded model tier names.

std::vector< std::string >	available_models () const
	All configured tier names.

bool	can_handoff (const std::string &from, const std::string &to) const
	Check if handoff is permitted.

ChatAdapter *	get_adapter (const std::string &tier_name) const
	Get adapter for a tier.

InferenceBackend *	get_backend (const std::string &tier_name) const
	Get the inference backend for a tier (for evaluation APIs).

AdapterManager &	adapter_manager ()
	Access the LoRA adapter manager.

GrammarRegistry &	grammar_registry ()
	Access the grammar registry.

ProfileRegistry &	profile_registry ()
	Access the GPU resource profile registry.

ThroughputTracker &	throughput_tracker ()
	Access the throughput tracker.

size_t	load_grammars_from (const std::filesystem::path &grammar_dir)
	Load grammars from an explicit directory path.

void	clear_all_prompt_caches ()
	Invalidate prompt/KV caches across every pooled backend.

bool	has_vision_capable_tier () const
	Return true if any configured tier declares the `"vision"` capability (gh#41, v2.1.8).

SpeculativeCompatInfo	check_speculative_compat () const
	Check whether the currently-configured target/draft pair is compatible for speculative decoding.

void	set_speculative_enabled (bool enabled)
	Runtime toggle for the speculative-decoding path.

void	set_residency_observer (ResidencyObserverFn cb)
	Register a residency observer.

std::string	residency_snapshot_json () const
	Serialize the current residency set as a JSON string.

size_t	vram_budget_bytes () const
	Engine-tracked VRAM budget in bytes (0 = unknown).

size_t	tier_footprint_bytes (const std::string &tier_name) const
	Estimated VRAM footprint for a given tier in bytes.

entropic_error_t	last_residency_error () const
	Last residency-related error code, or `ENTROPIC_OK` if none.

void	clear_last_residency_error ()
	Clear `last_residency_error()`.

std::string	select_vision_tier () const
	Pick the canonical vision-capable tier name (gh#41).

void	apply_tier_sampler_defaults_for_test (GenerationParams &params, const std::string &tier_name)
	Test-only forwarder to the private per-tier sampler default application (gh#94, audit task #71).

Detailed Description

Multi-model lifecycle and routing orchestrator.

Manages model pool deduplication, per-tier adapters, VRAM lifecycle, and digit-based tier routing via a router model.

Version: 1.8.2

Definition at line 112 of file orchestrator.h.

Member Typedef Documentation

◆ ResidencyObserverFn

using entropic::ModelOrchestrator::ResidencyObserverFn = std::function<void( ResidencyEvent event, const std::string& tier_name, const std::string& model_path, size_t footprint)>

Residency observer callback type (internal C++ form).

Fires synchronously from inside get_model / deactivate_current_if_needed while the swap mutex is held. Must not call back into the orchestrator on the same thread.

Parameters

event	Lifecycle event code.
tier_name	Tier whose backing model changed VRAM state.
model_path	Resolved GGUF path for the model.
footprint	Estimated VRAM footprint in bytes.

Version: 2.2.4

Definition at line 412 of file orchestrator.h.
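Because the callback runs while the swap mutex is held, one safe pattern is to copy the arguments and return immediately, then process them later from application code. The following is a minimal sketch of that pattern. The enum and callback alias are local copies of the declarations listed in this reference so that the sketch compiles without the entropic headers; `ResidencyRecord` and `ResidencyLog` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Local stand-ins mirroring the declarations in this reference.
enum class ResidencyEvent : int { Loaded = 0, Evicted = 1, ActivationSwap = 2 };
using ResidencyObserverFn = std::function<void(
    ResidencyEvent event, const std::string& tier_name,
    const std::string& model_path, std::size_t footprint)>;

// Hypothetical record type: one captured notification.
struct ResidencyRecord {
    ResidencyEvent event;
    std::string tier_name;
    std::string model_path;
    std::size_t footprint;
};

// Hypothetical helper: buffers notifications so the callback itself never
// calls back into the orchestrator.
class ResidencyLog {
public:
    // Returns a callable matching ResidencyObserverFn. It only copies its
    // arguments under a private mutex and returns.
    ResidencyObserverFn observer() {
        return [this](ResidencyEvent event, const std::string& tier_name,
                      const std::string& model_path, std::size_t footprint) {
            std::lock_guard<std::mutex> lock(mutex_);
            records_.push_back({event, tier_name, model_path, footprint});
        };
    }

    // Moves the buffered records out. Call this from application code,
    // outside the callback.
    std::vector<ResidencyRecord> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::exchange(records_, {});
    }

private:
    std::mutex mutex_;
    std::vector<ResidencyRecord> records_;
};
```

With the real class, the callable returned by `observer()` would presumably be passed to set_residency_observer(), and the `ResidencyLog` object would need to outlive the registration.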

Member Enumeration Documentation

◆ ResidencyEvent

enum class entropic::ModelOrchestrator::ResidencyEvent : int

strong

Residency observer event codes — mirror the C ABI enum entropic_residency_event_t exactly (LOADED=0, EVICTED=1, ACTIVATION_SWAP=2).

Version: 2.2.4

Definition at line 392 of file orchestrator.h.
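A mirror relationship like this can be guarded at compile time. The sketch below is one way to do that, not the project's actual code: the C enum is a local stand-in whose type name and numeric values come from the description above, while its enumerator spellings and the `to_c_abi` helper are assumptions made for illustration.

```cpp
#include <cassert>

// Stand-in for the C ABI enum. Enumerator spellings are assumed.
typedef enum {
    ENTROPIC_RESIDENCY_LOADED = 0,
    ENTROPIC_RESIDENCY_EVICTED = 1,
    ENTROPIC_RESIDENCY_ACTIVATION_SWAP = 2
} entropic_residency_event_t;

// Local copy of the C++ enum as listed in this reference.
enum class ResidencyEvent : int { Loaded = 0, Evicted = 1, ActivationSwap = 2 };

// Compile-time guard: renumbering either side fails the build.
static_assert(static_cast<int>(ResidencyEvent::Loaded) ==
                  ENTROPIC_RESIDENCY_LOADED, "Loaded mismatch");
static_assert(static_cast<int>(ResidencyEvent::Evicted) ==
                  ENTROPIC_RESIDENCY_EVICTED, "Evicted mismatch");
static_assert(static_cast<int>(ResidencyEvent::ActivationSwap) ==
                  ENTROPIC_RESIDENCY_ACTIVATION_SWAP, "ActivationSwap mismatch");

// Hypothetical helper: because the values agree, conversion is a plain cast.
constexpr entropic_residency_event_t to_c_abi(ResidencyEvent e) {
    return static_cast<entropic_residency_event_t>(static_cast<int>(e));
}
```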

Constructor & Destructor Documentation

◆ ~ModelOrchestrator()

entropic::ModelOrchestrator::~ModelOrchestrator ( )

Destructor — invokes shutdown() and AdapterManager::unload_all().

Orchestrate teardown order (gh#58 close-out).

Combines two fixes:

gh#63 (v2.2.9): make shutdown() actually run on destroy, so VRAM release on handle teardown is explicit rather than relying on the shared_ptr<LlamaCppBackend> cascade. Pre-v2.2.9 nothing called shutdown() during the destroy path.
gh#58 close-out (v2.3.0): also call AdapterManager::unload_all() so loaded LoRA llama_adapter_lora* handles don't leak. Pre-v2.3.0 AdapterManager had no destructor, the same shape as the pre-v2.2.8 LlamaCppBackend leak.

Teardown order matters: backends first (frees llama_contexts), then adapter handles (safe because the contexts that referenced HOT adapters are gone). Out-of-line so the .cpp can stage the two calls in the right sequence and keep the header free of implementation noise.

Version: 2.3.0

Definition at line 264 of file orchestrator.cpp.
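The teardown order described above can be illustrated with stand-in types that only record the sequence of calls. This is a sketch under that assumption: `FakeAdapterManager` and `FakeOrchestrator` are hypothetical names, and nothing here touches real backends or adapter handles.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in that records teardown steps; not the real AdapterManager.
struct FakeAdapterManager {
    std::vector<std::string>* log;
    void unload_all() { log->push_back("adapters"); }
};

// Stand-in orchestrator showing only the destructor's call order.
class FakeOrchestrator {
public:
    explicit FakeOrchestrator(std::vector<std::string>* log)
        : log_(log), adapters_{log} {}

    // Backends first (frees the contexts), then adapter handles.
    ~FakeOrchestrator() {
        shutdown();
        adapters_.unload_all();
    }

    void shutdown() { log_->push_back("backends"); }

private:
    std::vector<std::string>* log_;
    FakeAdapterManager adapters_;
};
```

Destroying a `FakeOrchestrator` appends "backends" and then "adapters" to the log, which matches the order the destructor documentation calls for.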

Member Function Documentation

◆ adapter_manager()

AdapterManager & entropic::ModelOrchestrator::adapter_manager ( )

inline

Access the LoRA adapter manager.

Returns: Reference to AdapterManager.

Version: 1.9.2

Definition at line 278 of file orchestrator.h.

◆ apply_tier_sampler_defaults_for_test()

void entropic::ModelOrchestrator::apply_tier_sampler_defaults_for_test	(	GenerationParams &	params,
		const std::string &	tier_name
	)

inline

Test-only forwarder to the private per-tier sampler default application (gh#94, audit task #71).

Exposes ModelOrchestrator::apply_tier_sampler_defaults so a CPU unit test can assert the effective sampler values produced by the MEMBER path — specifically the tier.X -> ov.X 9-line hand-copy (orchestrator.cpp) that the free apply_tier_sampler_overrides tests cannot reach. The tier is looked up in config_, so the caller must have a config assigned (initialize() assigns config_ before any model load, so a failed init still populates it). Pure forwarder — no behavior of its own.

Parameters

params	Generation params (mutated in place).
tier_name	Tier whose frontmatter sampler config to apply.

Definition at line 523 of file orchestrator.h.

◆ available_models()

std::vector< std::string > entropic::ModelOrchestrator::available_models ( ) const

All configured tier names.

Version: 1.8.2

Definition at line 1174 of file orchestrator.cpp.

◆ can_handoff()

bool entropic::ModelOrchestrator::can_handoff	(	const std::string &	from,
		const std::string &	to
	)		const

Check if handoff is permitted.

Version: 1.8.2

Definition at line 1204 of file orchestrator.cpp.

◆ check_speculative_compat()

ModelOrchestrator::SpeculativeCompatInfo entropic::ModelOrchestrator::check_speculative_compat ( ) const

Check whether the currently configured target/draft pair is compatible for speculative decoding.

Reads the active main tier as the target (verifier) and the "draft" role on the SecondaryModelLoader as the proposer. Returns compatible=false with a specific diagnostic when:

No main tier is loaded (target unavailable).
No draft role is loaded (no proposer configured).
entropic::speculative::check_compat rejects the pairing (recurrent target, tokenizer mismatch, etc.).

Returns: SpeculativeCompatInfo with the compatible flag and a diagnostic.

Version: 2.1.11

The result is a structured diagnostic that the C ABI can forward to consumers.

Definition at line 1508 of file orchestrator.cpp.
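The three failure cases listed above can be sketched as a chain of early returns. This is an illustration only: `CompatInfo` is a stand-in for SpeculativeCompatInfo (the `compatible` flag is named in the text, the `diagnostic` field name is assumed), the order of the checks is assumed, and `pair_rejection` stands in for whatever entropic::speculative::check_compat reports.

```cpp
#include <cassert>
#include <string>

// Stand-in result type; the diagnostic field name is assumed.
struct CompatInfo {
    bool compatible;
    std::string diagnostic;
};

// pair_rejection stands in for the pairing check's outcome:
// empty means accepted, otherwise it carries the rejection reason.
CompatInfo check_compat_sketch(bool target_loaded, bool draft_loaded,
                               const std::string& pair_rejection) {
    if (!target_loaded) {
        return {false, "no main tier loaded (target unavailable)"};
    }
    if (!draft_loaded) {
        return {false, "no draft role loaded (no proposer configured)"};
    }
    if (!pair_rejection.empty()) {
        return {false, pair_rejection};
    }
    return {true, ""};
}
```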

◆ clear_all_prompt_caches()

void entropic::ModelOrchestrator::clear_all_prompt_caches ( )

Invalidate prompt/KV caches across every pooled backend.

Used when identity content (system prompt prefix) changes so no cached prefix is served against the new prompt. (P1-7, 2.0.6-rc16)

Version: 2.0.6-rc16

Since v2.1.11, the invalidation also fans out to secondary roles (router, draft) via SecondaryModelLoader.

Version: 2.1.11

Definition at line 1402 of file orchestrator.cpp.

◆ clear_last_residency_error()

void entropic::ModelOrchestrator::clear_last_residency_error ( )

inline

Clear last_residency_error().

Version: 2.2.4

Definition at line 486 of file orchestrator.h.

◆ generate() [1/2]

GenerationResult entropic::ModelOrchestrator::generate	(	const std::vector< Message > &	messages,
		const GenerationParams &	params,
		const std::string &	tier_name = `""`
	)

Generate a response using a routed or explicit tier.

Parameters

messages	Conversation history.
params	Generation parameters.
tier_name	Explicit tier, or empty for routing.

Returns: GenerationResult with routing/swap timing.

Version: 1.8.2

Speculative routing added in v2.1.11 (gh#36): when the kernel is configured and the target/draft pair is compatible, dispatches through LlamaCppBackend::generate_speculative_with_draft; falls back to plain decode otherwise. The dispatch decision is delegated to run_generate_dispatch to keep this method under the SLOC gate. v2.2.4 (gh#57): a refused activation now reports ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE via build_no_model_error instead of the generic GENERATE_FAILED.

Definition at line 570 of file orchestrator.cpp.

◆ generate() [2/2]

GenerationResult entropic::ModelOrchestrator::generate	(	const std::vector< Message > &	messages,
		const GenerationParams &	params,
		std::atomic< bool > &	cancel,
		const std::string &	tier_name = `""`
	)

Non-streaming (batch) generation with cancel support (gh#81, v2.4.2).

Mirrors generate, with an added reference to a cancel flag that the backend polls on every decode step. Closes the gh#81 60s-lag gap, where the batch path had no way to honor a mid-decode cancel.

Version: 2.4.2

Bypasses run_generate_dispatch (speculative routing) because speculative kernels live on the streaming path; batch only ever calls plain decode. Calls model->generate(messages, params, cancel) which polls cancel per token.

Definition at line 621 of file orchestrator.cpp.
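The per-step polling described above can be sketched as a decode loop that checks the flag before producing each token. This is a simplified illustration, not the backend's code: `DecodeOutcome` and `decode_with_cancel` are hypothetical names, and `next_token` stands in for one backend decode step.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result type for the sketch; not the real GenerationResult.
struct DecodeOutcome {
    std::string text;
    bool cancelled = false;
};

// Decodes up to max_tokens, checking the cancel flag before every step.
DecodeOutcome decode_with_cancel(
    int max_tokens,
    const std::function<std::string(int)>& next_token,
    const std::atomic<bool>& cancel) {
    DecodeOutcome out;
    for (int step = 0; step < max_tokens; ++step) {
        if (cancel.load(std::memory_order_relaxed)) {
            out.cancelled = true;  // stop mid-decode, keep the partial text
            break;
        }
        out.text += next_token(step);
    }
    return out;
}
```

A caller on another thread sets the flag with `cancel.store(true)`, and the loop stops at the next step boundary instead of running to `max_tokens`.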

◆ generate_batch()

std::vector< GenerationResult > entropic::ModelOrchestrator::generate_batch	(	const std::vector< std::vector< Message > > &	messages_list,
		const std::vector< GenerationParams > &	params_list,
		const std::vector< std::string > &	tiers,
		std::atomic< bool > &	cancel
	)

Same-prefix batch generation on a shared resident model (gh#98).

All N requests run on ONE backend (resolved from the lead tier — they share the model pool). Each request's params are resolved per its tier (grammar + samplers), then dispatched to the backend's generate_batch, which prefills the shared prompt prefix once and fans out. Adapter parse is applied per result.

Parameters

messages_list	Per-request message lists (built with each tier's system prompt so same-tier requests share a prefix).
params_list	Per-request base params.
tiers	Per-request tier names ("" = lead/default).
cancel	Cancel flag.

Returns: One GenerationResult per request, in input order.

Version: 2.8.0

Tool staging is per-model, so this path targets grammar-constrained requests (params.grammar) rather than common_chat tool injection.

Definition at line 672 of file orchestrator.cpp.

◆ generate_streaming()

GenerationResult entropic::ModelOrchestrator::generate_streaming	(	const std::vector< Message > &	messages,
		const GenerationParams &	params,
		std::function< void(std::string_view)>	on_token,
		std::atomic< bool > &	cancel,
		const std::string &	tier_name = `""`
	)

Streaming generation with speculative dispatch.

Version: 1.8.2

Speculative routing added in v2.1.11 (gh#36): when the kernel is configured and the target/draft pair is compatible, dispatches to LlamaCppBackend::generate_speculative_with_draft via try_speculative_route_streaming with the draft resolved from secondary_loader_.get("draft"). Falls back to plain streaming on NOT_SUPPORTED or compatibility failure, with a diagnostic logged.

Definition at line 714 of file orchestrator.cpp.

◆ get_adapter()

ChatAdapter * entropic::ModelOrchestrator::get_adapter ( const std::string & tier_name ) const

Get adapter for a tier.

Version: 1.8.2

Definition at line 1219 of file orchestrator.cpp.

◆ get_backend()

InferenceBackend * entropic::ModelOrchestrator::get_backend ( const std::string & tier_name ) const

Get the inference backend for a tier (for evaluation APIs).

Parameters

tier_name Tier name (e.g. "lead", "eng").

Returns: Backend pointer, or nullptr if tier not found.

Version: 1.10.2

Definition at line 1192 of file orchestrator.cpp.

◆ grammar_registry()

GrammarRegistry & entropic::ModelOrchestrator::grammar_registry ( )

inline

Access the grammar registry.

Returns: Reference to GrammarRegistry.

Version: 1.9.3

Definition at line 286 of file orchestrator.h.

◆ has_vision_capable_tier()

bool entropic::ModelOrchestrator::has_vision_capable_tier ( ) const

Return true if any configured tier declares the "vision" capability (gh#41, v2.1.8).

Read-only lookup over the parsed ModelsConfig — does not touch backend state. Used by the facade's entropic_run_messages entry point to short-circuit with ENTROPIC_ERROR_NO_VISION_TIER before dispatching a multimodal turn that no tier can handle.

Version: 2.1.8

Returns: true if any configured tier declares "vision".

Definition at line 1417 of file orchestrator.cpp.

◆ initialize()

bool entropic::ModelOrchestrator::initialize ( const ParsedConfig & config )

Initialize the orchestrator from the parsed config: backends, routing, adapters, grammars.

Parameters

config Full engine config.

Returns: true on success.

Version: 1.8.2

Adds speculative-draft activation alongside router activation in v2.1.11 (gh#36): the draft slot loads when inference.speculative.enabled is true and a draft_model is configured. v2.2.4 (gh#57) caches the VRAM budget from ENTROPIC_VRAM_BUDGET_BYTES so the residency gate in get_model has a number to test against.

Version: 2.2.4

Definition at line 197 of file orchestrator.cpp.

◆ last_residency_error()

entropic_error_t entropic::ModelOrchestrator::last_residency_error ( ) const

inline

Last residency-related error code, or ENTROPIC_OK if none.

Set by get_model when a tier-fit check fails (returns ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE). The facade clears it after translating to the C ABI return code. Independent of the last_error_ string carried on individual backends.

Version: 2.2.4

Definition at line 479 of file orchestrator.h.
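The read-then-clear handshake between the orchestrator and the facade might look like the sketch below. It is an assumption-laden illustration: the enum is a local stand-in whose numeric values are invented for the example (only the names ENTROPIC_OK and ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE come from this page), and `FakeResidencyState` and `take_residency_error` are hypothetical.

```cpp
#include <cassert>

// Stand-in error enum; numeric values are assumed for the sketch.
enum entropic_error_t {
    ENTROPIC_OK = 0,
    ENTROPIC_ERROR_TIER_MODEL_TOO_LARGE = 1
};

// Stand-in exposing the two accessors documented on this page.
struct FakeResidencyState {
    entropic_error_t last = ENTROPIC_OK;
    entropic_error_t last_residency_error() const { return last; }
    void clear_last_residency_error() { last = ENTROPIC_OK; }
};

// Hypothetical facade-side helper: read the code, then clear it so that a
// later call does not report a stale error.
template <class Orchestrator>
entropic_error_t take_residency_error(Orchestrator& orch) {
    const entropic_error_t code = orch.last_residency_error();
    if (code != ENTROPIC_OK) {
        orch.clear_last_residency_error();
    }
    return code;
}
```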

◆ last_routing_result()

RoutingResult entropic::ModelOrchestrator::last_routing_result ( ) const

Last routing result.

Version: 1.8.2

Definition at line 1134 of file orchestrator.cpp.

◆ last_used_tier()

std::string entropic::ModelOrchestrator::last_used_tier ( ) const

Last used tier name.

Version: 1.8.2

Definition at line 1143 of file orchestrator.cpp.

◆ load_grammars_from()

size_t entropic::ModelOrchestrator::load_grammars_from ( const std::filesystem::path & grammar_dir )

Load grammars from an explicit directory path.

Parameters

grammar_dir Path containing .gbnf files.

Returns: Number of grammars loaded.

Version: 2.0.6

Called by the facade after data-dir resolution. This is the fallback path when config_dir doesn't contain a grammars subdir (e.g., installed layout where grammars live under share/entropic).

Definition at line 1381 of file orchestrator.cpp.

◆ loaded_models()

std::vector< std::string > entropic::ModelOrchestrator::loaded_models ( ) const

Currently loaded model tier names.

Version: 1.8.2

Includes "router" when the secondary loader reports the role as loaded (v2.1.11, gh#27 — previously checked the raw router_ field).

Definition at line 1156 of file orchestrator.cpp.

◆ profile_registry()

ProfileRegistry & entropic::ModelOrchestrator::profile_registry ( )

inline

Access the GPU resource profile registry.

Returns: Reference to ProfileRegistry.

Version: 2.0.0

Definition at line 294 of file orchestrator.h.

◆ residency_snapshot_json()

std::string entropic::ModelOrchestrator::residency_snapshot_json ( ) const

Serialize the current VRAM residency snapshot as a JSON string.

Schema is documented on the C ABI entry point entropic_residency_snapshot (entropic.h). Read-only — takes the swap mutex briefly to obtain a consistent snapshot.

Returns: JSON object string.

Version: 2.2.4

Definition at line 1802 of file orchestrator.cpp.

◆ route()

std::string entropic::ModelOrchestrator::route ( const std::vector< Message > & messages )

Route to the appropriate tier using the router model.

Parameters

messages Current conversation.

Returns: Selected tier name.

Version: 1.8.2

Guard updated in v2.1.11: routing requires models.router to be configured (was: router_ non-null). The slot is owned by secondary_loader_ since gh#27.

Definition at line 766 of file orchestrator.cpp.

◆ select_vision_tier()

std::string entropic::ModelOrchestrator::select_vision_tier ( ) const

Pick the canonical vision-capable tier name (gh#41, v2.1.8).

Returns the first tier (iteration order of the parsed models.tiers map) whose capabilities include "vision", or empty string if none exists. Multi-tier policy refinements (e.g. prefer the default tier when it qualifies) can layer on top later — single-vision-tier deployments are the common case for v2.1.8 (gh#42 ships the primary tier as the only vision-capable bundled entry).

Returns: Vision tier name, or "" if none configured.

Version: 2.1.8

Definition at line 1430 of file orchestrator.cpp.
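The "first tier in iteration order" rule can be sketched as a linear scan. The container type here is an assumption: the page does not say what type backs models.tiers, so the sketch uses `std::map`, which iterates in sorted key order. `TierCapabilities` and `first_vision_tier` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Assumed shape for the sketch: tier name -> capability strings.
using TierCapabilities = std::map<std::string, std::vector<std::string>>;

// Returns the first tier (in map iteration order) that lists "vision",
// or an empty string when no tier does.
std::string first_vision_tier(const TierCapabilities& tiers) {
    const std::string wanted = "vision";
    for (const auto& entry : tiers) {
        const std::vector<std::string>& caps = entry.second;
        if (std::find(caps.begin(), caps.end(), wanted) != caps.end()) {
            return entry.first;
        }
    }
    return "";
}
```

With tiers "eng" (no vision), "lead" (vision), and "zeta" (vision), the scan returns "lead" because a sorted map visits it before "zeta". A container with a different iteration order could select a different tier.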

◆ set_residency_observer()

void entropic::ModelOrchestrator::set_residency_observer ( ResidencyObserverFn cb )

Register, replace, or clear the residency observer.

A new observer replaces the previous one. Passing an empty std::function clears the observer.

Parameters

cb	Observer callable, or empty to clear.

Version: 2.2.4

Definition at line 1731 of file orchestrator.cpp.

◆ set_speculative_enabled()

void entropic::ModelOrchestrator::set_speculative_enabled ( bool enabled )

inline

Runtime toggle for the speculative-decoding path.

Lets consumers (and tests) flip speculative on/off without reinitializing the orchestrator. Defaults to whatever inference.speculative.enabled was in the parsed config at init time.

Parameters

enabled true to route through the speculative kernel when a compatible draft is loaded.

Version: 2.1.11

Definition at line 380 of file orchestrator.h.

◆ shutdown()

void entropic::ModelOrchestrator::shutdown ( )

Shutdown — unload all models.

Version: 1.8.2

Main-tier pool is unloaded directly; secondary roles (router, draft, etc.) are released through secondary_loader_.shutdown() (v2.1.11).

Definition at line 247 of file orchestrator.cpp.

◆ throughput_tracker()

ThroughputTracker & entropic::ModelOrchestrator::throughput_tracker ( )

inline

Access the throughput tracker.

Returns: Reference to ThroughputTracker.

Version: 2.0.0

Definition at line 302 of file orchestrator.h.

◆ tier_footprint_bytes()

size_t entropic::ModelOrchestrator::tier_footprint_bytes ( const std::string & tier_name ) const

Estimated VRAM footprint for a given tier in bytes.

Public footprint accessor; results are memoized in tier_footprint_bytes_.

The footprint is the sum of the GGUF weights file size, a context-length-derived KV cache estimate, and the configured vram_reserve_mb headroom. Returns 0 if the tier is unknown.

Parameters

tier_name Tier name.

Version: 2.2.4

Definition at line 1714 of file orchestrator.cpp.
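The footprint arithmetic and its memoization can be sketched as follows. Several details are assumptions made for illustration: the KV cache estimate is taken as an input because its formula is not given here, one MB of reserve is treated as 1024 × 1024 bytes, and `TierSizing` and `FootprintCache` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-tier inputs; field names are illustrative.
struct TierSizing {
    std::uint64_t weights_file_bytes;  // size of the GGUF file on disk
    std::uint64_t kv_cache_bytes;      // context-length-derived estimate
    std::uint64_t vram_reserve_mb;     // configured headroom
};

class FootprintCache {
public:
    void add_tier(const std::string& name, TierSizing sizing) {
        tiers_[name] = sizing;
    }

    // Returns weights + KV estimate + reserve, or 0 for an unknown tier.
    // The first successful result per tier is cached and reused.
    std::uint64_t footprint_bytes(const std::string& name) {
        auto hit = cache_.find(name);
        if (hit != cache_.end()) return hit->second;

        auto tier = tiers_.find(name);
        if (tier == tiers_.end()) return 0;

        const TierSizing& s = tier->second;
        const std::uint64_t total = s.weights_file_bytes + s.kv_cache_bytes +
                                    s.vram_reserve_mb * 1024ull * 1024ull;
        cache_[name] = total;
        ++computed_;
        return total;
    }

    // Number of times a footprint was actually computed (not served cached).
    int computed() const { return computed_; }

private:
    std::unordered_map<std::string, TierSizing> tiers_;
    std::unordered_map<std::string, std::uint64_t> cache_;
    int computed_ = 0;
};
```

For a tier with 1000 bytes of weights, a 200-byte KV estimate, and 1 MB of reserve, the sketch yields 1000 + 200 + 1,048,576 = 1,049,776 bytes, and a second query for the same tier is served from the cache.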

◆ vram_budget_bytes()

size_t entropic::ModelOrchestrator::vram_budget_bytes ( ) const

inline

Engine-tracked VRAM budget in bytes (0 = unknown).

Sources, in priority order: ENTROPIC_VRAM_BUDGET_BYTES environment override → CUDA cudaMemGetInfo (when the CUDA inference backend is the active build) → 0. When 0, the orchestrator does not enforce a per-tier budget gate.

Version: 2.2.4

Definition at line 453 of file orchestrator.h.
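The priority order above can be sketched as a small resolver. This is an illustration under stated assumptions: `parse_budget` and `resolve_vram_budget` are hypothetical helpers, the probed value is passed in as a parameter rather than obtained from cudaMemGetInfo so that the sketch builds without CUDA, and treating a missing, malformed, or zero override as "fall through to the probe" is an assumption about behavior that this page does not spell out.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Parses a decimal byte count. Returns 0 for null, empty, malformed, or
// out-of-range input.
std::uint64_t parse_budget(const char* text) {
    if (text == nullptr || *text < '0' || *text > '9') return 0;
    errno = 0;
    char* end = nullptr;
    const unsigned long long value = std::strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0') return 0;
    return value;
}

// env_value: contents of the environment override, or nullptr if unset.
// probed_free_bytes: what a CUDA build would report; pass 0 otherwise.
std::uint64_t resolve_vram_budget(const char* env_value,
                                  std::uint64_t probed_free_bytes) {
    const std::uint64_t from_env = parse_budget(env_value);
    if (from_env != 0) return from_env;  // 1. environment override
    return probed_free_bytes;            // 2. probe, or 3. zero (unknown)
}
```

A caller would typically pass `std::getenv("ENTROPIC_VRAM_BUDGET_BYTES")` as the first argument. A result of 0 corresponds to the "unknown" case, in which no per-tier budget gate is enforced.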

The documentation for this class was generated from the following files:

entropic/inference/orchestrator.h
inference/orchestrator.cpp
