Entropic 2.3.8
Local-first agentic inference engine
backend_capability.h
// SPDX-License-Identifier: Apache-2.0
#pragma once

#include <cstddef>
#include <string>

namespace entropic {

enum class BackendCapability : int {
    KV_CACHE = 0,
    HIDDEN_STATE = 1,
    STREAMING = 2,
    RAW_COMPLETION = 3,
    GRAMMAR = 4,
    LORA_ADAPTERS = 5,
    MULTI_SEQUENCE = 6,
    TOKENIZER = 7,
    LOG_PROBS = 8,
    VISION = 9,
    SPECULATIVE_DECODING = 10,
    PROMPT_CACHING = 11,
    AUDIO = 12,
    _COUNT
};

// ...
    std::string name;
    std::string version;
    std::string compute_device;
    std::string model_format;

    std::string architecture;

    // ...
    size_t vram_bytes = 0;
    size_t ram_bytes = 0;
    size_t parameter_count = 0;
    std::string quantization;
};

} // namespace entropic
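The metadata fields in the listing can be combined into a short summary line and a rough memory estimate. The following is a minimal sketch, not code from Entropic: the struct name `BackendMetadataSketch` and the helpers `summarize` and `bits_per_parameter` are hypothetical, because the listing does not show the struct's declaration line. The `max_context_length` member is taken from the member documentation, and its `= 0` initializer is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the metadata struct; the real type name is not
// visible in the listing, so this sketch declares its own.
struct BackendMetadataSketch {
    std::string name;            // e.g. "llama.cpp"
    std::string version;
    std::string compute_device;  // e.g. "cuda"
    std::string model_format;    // e.g. "gguf"
    std::string architecture;
    int max_context_length = 0;  // initializer is an assumption
    std::size_t vram_bytes = 0;
    std::size_t ram_bytes = 0;
    std::size_t parameter_count = 0;
    std::string quantization;    // e.g. "Q8_0"
};

// One-line description built only from the string fields.
std::string summarize(const BackendMetadataSketch& m) {
    return m.name + " " + m.version + " [" + m.compute_device + ", " +
           m.model_format + ", " + m.quantization + "]";
}

// Rough memory cost per parameter in bits. Returns 0.0 when the parameter
// count is unknown, which also avoids dividing by zero.
double bits_per_parameter(const BackendMetadataSketch& m) {
    if (m.parameter_count == 0) {
        return 0.0;
    }
    const double bytes = static_cast<double>(m.vram_bytes) +
                         static_cast<double>(m.ram_bytes);
    return bytes * 8.0 / static_cast<double>(m.parameter_count);
}
```

Because `vram_bytes` and `ram_bytes` report memory consumed by the loaded model, the per-parameter figure is only an estimate and can be higher than the nominal width of the quantization type.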
BackendCapability
Capabilities that an inference backend may or may not support.
  KV_CACHE: KV cache state management (save/load/clear).
  HIDDEN_STATE: Recurrent hidden state management (save/load/reset).
  STREAMING: Streaming token-by-token generation.
  RAW_COMPLETION: Raw text completion without chat template.
  GRAMMAR: GBNF grammar-constrained generation.
  LORA_ADAPTERS: LoRA adapter hot-swapping (v1.9.2).
  MULTI_SEQUENCE: Multiple concurrent sequences on one model instance.
  TOKENIZER: Token counting / tokenizer access.
  LOG_PROBS: Log-probability retrieval (v1.9.10).
  VISION: Vision / multimodal input (v1.9.11).
  SPECULATIVE_DECODING: Speculative decoding compatibility.
  PROMPT_CACHING: Prompt cache prefix save/load (v1.8.3).
  AUDIO: Audio input via mtmd audio projector (gh#53, v2.3.0).
  _COUNT: Sentinel; must be last. Used for iteration and array sizing.
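The `_COUNT` sentinel is described as being used for iteration and array sizing. The sketch below shows one way that could look: a name table and a bitset, both sized from `_COUNT`. It is illustrative only. `kCapabilityCount`, `kCapabilityNames`, `capability_name`, and `CapabilitySet` are hypothetical names that do not appear in the header, and the enum is copied locally so the block compiles on its own.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string_view>

// Local copy of the enum. SPECULATIVE_DECODING = 10 is inferred from the
// documented enumerator list and the gap between VISION (9) and
// PROMPT_CACHING (11).
enum class BackendCapability : int {
    KV_CACHE = 0, HIDDEN_STATE = 1, STREAMING = 2, RAW_COMPLETION = 3,
    GRAMMAR = 4, LORA_ADAPTERS = 5, MULTI_SEQUENCE = 6, TOKENIZER = 7,
    LOG_PROBS = 8, VISION = 9, SPECULATIVE_DECODING = 10,
    PROMPT_CACHING = 11, AUDIO = 12, _COUNT
};

// Number of real capabilities; the sentinel itself is not one of them.
constexpr std::size_t kCapabilityCount =
    static_cast<std::size_t>(BackendCapability::_COUNT);

// Name table indexed by enumerator value and sized from the sentinel.
constexpr std::array<std::string_view, kCapabilityCount> kCapabilityNames = {
    "KV_CACHE", "HIDDEN_STATE", "STREAMING", "RAW_COMPLETION", "GRAMMAR",
    "LORA_ADAPTERS", "MULTI_SEQUENCE", "TOKENIZER", "LOG_PROBS", "VISION",
    "SPECULATIVE_DECODING", "PROMPT_CACHING", "AUDIO",
};

// Returns the enumerator's name, or "UNKNOWN" for out-of-range values
// (including the sentinel).
constexpr std::string_view capability_name(BackendCapability c) {
    const auto i = static_cast<std::size_t>(c);
    return i < kCapabilityNames.size() ? kCapabilityNames[i]
                                       : std::string_view{"UNKNOWN"};
}

// One bit per capability, so a backend's feature set fits in a small value.
class CapabilitySet {
public:
    CapabilitySet() = default;
    CapabilitySet(std::initializer_list<BackendCapability> caps) {
        for (BackendCapability c : caps) {
            set(c);
        }
    }

    void set(BackendCapability c) { bits_.set(index(c)); }
    bool supports(BackendCapability c) const { return bits_.test(index(c)); }
    std::size_t count() const { return bits_.count(); }

private:
    static std::size_t index(BackendCapability c) {
        const auto i = static_cast<std::size_t>(c);
        assert(i < kCapabilityCount && "_COUNT is a sentinel, not a capability");
        return i;
    }

    std::bitset<kCapabilityCount> bits_;
};
```

A caller could build `CapabilitySet caps{BackendCapability::STREAMING, BackendCapability::GRAMMAR};` and check `caps.supports(BackendCapability::VISION)` before sending image input. Sizing both containers from `_COUNT` means a new enumerator added before the sentinel grows them automatically, although the name table still needs a matching entry.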
Backend metadata for introspection.
  std::string name: Backend identifier (e.g. "llama.cpp", "axcl").
  std::string version: Backend version string.
  std::string compute_device: "cuda", "vulkan", "cpu", "npu".
  std::string model_format: "gguf", "axmodel", "onnx", etc.
  std::string architecture: Architecture family of the loaded model.
  int max_context_length: Maximum context length.
  size_t vram_bytes: VRAM consumed by loaded model (bytes). 0 if COLD.
  size_t ram_bytes: RAM consumed by loaded model (bytes). 0 if COLD.
  size_t parameter_count: Number of parameters (from model metadata).
  std::string quantization: Quantization type (e.g. "IQ3_XXS", "Q8_0", "fp16").