Activate model on GPU (WARM → ACTIVE).
BackendCapability
Capabilities that an inference backend may or may not support.
@ SPECULATIVE_DECODING
Speculative decoding compatibility.
@ MULTI_SEQUENCE
Multiple concurrent sequences on one model instance.
@ PROMPT_CACHING
Prompt cache prefix save/load (v1.8.3).
@ HIDDEN_STATE
Recurrent hidden state management (save/load/reset)
@ GRAMMAR
GBNF grammar-constrained generation.
@ TOKENIZER
Token counting / tokenizer access.
@ VISION
Vision / multimodal input (v1.9.11)
@ RAW_COMPLETION
Raw text completion without chat template.
@ LORA_ADAPTERS
LoRA adapter hot-swapping (v1.9.2)
@ LOG_PROBS
Log-probability retrieval (v1.9.10)
@ STREAMING
Streaming token-by-token generation.
@ KV_CACHE
KV cache state management (save/load/clear)
@ AUDIO
Audio input via mtmd audio projector (gh#53, v2.3.0)
@ _COUNT
Sentinel — must be last. Used for iteration/array sizing.
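The `_COUNT` sentinel can size a fixed-width capability set at compile time. The following is a minimal sketch, not the project's actual code: the enum is a local mirror of the documented names, its declaration order and the use of `enum class` are assumptions (the listing only states that `_COUNT` must be last), and `CapabilitySet` and `kCapabilityCount` are hypothetical names introduced for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Local mirror of the documented enum. The order of the entries is an
// assumption; the only documented constraint is that _COUNT comes last.
enum class BackendCapability {
    STREAMING,
    KV_CACHE,
    HIDDEN_STATE,
    GRAMMAR,
    TOKENIZER,
    RAW_COMPLETION,
    MULTI_SEQUENCE,
    SPECULATIVE_DECODING,
    PROMPT_CACHING,
    LORA_ADAPTERS,
    LOG_PROBS,
    VISION,
    AUDIO,
    _COUNT  // sentinel: number of real capabilities above
};

// Number of capabilities, derived from the sentinel (hypothetical name).
constexpr std::size_t kCapabilityCount =
    static_cast<std::size_t>(BackendCapability::_COUNT);

// Hypothetical helper: one bit per capability, sized by the sentinel.
class CapabilitySet {
public:
    CapabilitySet() = default;
    CapabilitySet(std::initializer_list<BackendCapability> caps) {
        for (BackendCapability c : caps) set(c);
    }

    void set(BackendCapability c) { bits_.set(index(c)); }
    bool supports(BackendCapability c) const { return bits_.test(index(c)); }
    std::size_t count() const { return bits_.count(); }

private:
    static std::size_t index(BackendCapability c) {
        return static_cast<std::size_t>(c);
    }
    std::bitset<kCapabilityCount> bits_;
};
```

A caller could then write `CapabilitySet caps{BackendCapability::STREAMING};` and branch on `caps.supports(BackendCapability::VISION)` before sending image input. Because the bitset width comes from `_COUNT`, adding a capability before the sentinel resizes the set automatically.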
Backend metadata for introspection.
size_t ram_bytes
RAM consumed by loaded model (bytes). 0 if COLD.
int max_context_length
Maximum context length.
size_t parameter_count
Number of parameters (from model metadata).
std::string architecture
Architecture family of the loaded model.
std::string compute_device
"cuda", "vulkan", "cpu", "npu"
std::string name
Backend identifier (e.g. "llama.cpp", "axcl").
std::string quantization
Quantization type (e.g. "IQ3_XXS", "Q8_0", "fp16").
std::string version
Backend version string.
size_t vram_bytes
VRAM consumed by loaded model (bytes). 0 if COLD.
std::string model_format
"gguf", "axmodel", "onnx", etc.