Entropic 2.3.8
Local-first agentic inference engine
Speculative-decoding configuration (inference.speculative.*).
#include <entropic/types/config.h>

Public Attributes

| Type | Member | Description |
|---|---|---|
| bool | enabled = false | Master switch (off by default) |
| int | n_draft = 16 | Window size (proposed tokens) |
| ModelConfig | draft = make_default_draft_model_config() | Full ModelConfig for the draft model. |
Speculative-decoding configuration (inference.speculative.*).
253ba110b)

The orchestrator's check_speculative_compat() refuses the pairing when the target model is recurrent OR hybrid at the pinned llama.cpp commit. Among bundled primaries today, that leaves a SINGLE workable family:
| Bundled key | llama.cpp arch | Speculative? |
|---|---|---|
| qwen3_5_0_8b | QWEN35 (hybrid SSM) | refused |
| qwen3_5_2b | QWEN35 (hybrid SSM) | refused |
| qwen3_5_4b | QWEN35 (hybrid SSM) | refused |
| qwen3_5_9b | QWEN35 (hybrid SSM) | refused |
| primary (3.5-A3B) | QWEN35MOE (hybrid) | refused |
| qwen3_6_a3b | QWEN35MOE (hybrid) | refused |
| nemotron3_nano_4b | NEMOTRON_H (hybrid) | refused |
| gemma4_a4b | GEMMA4 (pure xformer) | OK |
| gemma4_e4b | GEMMA4 (pure xformer) | OK |
| gemma4_e2b | GEMMA4 (pure xformer) | OK |
Bit-identical correctness was verified empirically on the Gemma 4 family in Session 5 (proposal Implementation Log, Gate A). Hybrid SSM targets produce divergent KV state across upstream's split-prefill scheme — the issue is structural to common_speculative_* at this pin, not entropic-side. Consumers pairing a non-Gemma primary with a Gemma draft (or vice versa) will also be refused, since the gate looks at the TARGET arch.
- gemma4_e4b + draft=gemma4_e2b (CPU): bit-identical, measurable speedup on long generations.
- gemma4_a4b + draft=gemma4_e2b (CPU): more aggressive verifier; needs ~16 GB VRAM at modest context.

Future llama.cpp pins that fix the cross-ubatch SSM state issue (or alternate non-hybrid Qwen/Llama arches added to the bundled registry) will widen this set without code change, because the gate is data-driven via llama_model_is_hybrid / llama_model_is_recurrent.
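The gate described above can be pictured with a minimal sketch. This is not the orchestrator's actual code: `TargetArchInfo`, `CompatResult`, and `check_speculative_compat_sketch` are hypothetical names introduced here for illustration, and the two booleans stand in for whatever llama_model_is_recurrent / llama_model_is_hybrid report for the target model, so that the example compiles without llama.cpp.

```cpp
#include <string>

// Hypothetical stand-in for the two properties the gate reads from the
// TARGET model. In the engine these would be derived from llama.cpp's
// llama_model_is_recurrent / llama_model_is_hybrid; here they are plain flags.
struct TargetArchInfo {
    bool is_recurrent = false;
    bool is_hybrid = false;
};

// Hypothetical result type: a verdict plus a human-readable refusal reason.
struct CompatResult {
    bool ok;
    std::string reason;
};

// Sketch of the gate: only the target arch is inspected, and a target that
// is recurrent OR hybrid is refused. The draft model is never consulted.
inline CompatResult check_speculative_compat_sketch(const TargetArchInfo& target) {
    if (target.is_recurrent) {
        return {false, "target model is recurrent"};
    }
    if (target.is_hybrid) {
        return {false, "target model is hybrid"};
    }
    return {true, ""};
}
```

Under these assumptions a hybrid target such as the QWEN35 rows in the table is refused regardless of which draft is configured, while a pure transformer target passes. Because the decision depends only on the reported flags, a future pin that reports different flags would change the outcome without touching the gate.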
ModelConfig entropic::SpeculativeConfig::draft = make_default_draft_model_config()
Full ModelConfig for the draft model.
Mirrors how tier configs are structured: every llama.cpp knob (gpu_layers, n_threads, n_batch, flash_attn, context_length, use_mlock, cache_type_k/v, tensor_split, ...) is consumer-tunable from YAML via inference.speculative.draft.<field>. path accepts a bundled-model registry key OR a literal filesystem path; resolved at config-parse time by BundledModels::resolve().
Defaults are kernel-aware — see make_default_draft_model_config().
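The "registry key OR literal path" rule for path can be sketched as a simple lookup with fallback. This is an illustration only: `BundledRegistrySketch` and `resolve_draft_path_sketch` are hypothetical names, the registry contents are made up, and checking the registry before treating the value as a filesystem path is an assumption; the actual behaviour of BundledModels::resolve() is not documented on this page.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical registry: bundled-model key -> on-disk model path.
using BundledRegistrySketch = std::unordered_map<std::string, std::string>;

// Sketch of the resolution rule: a value that matches a registry key maps to
// the registered path; anything else is passed through unchanged and treated
// as a literal filesystem path. (Key-first precedence is an assumption.)
inline std::string resolve_draft_path_sketch(const BundledRegistrySketch& registry,
                                             const std::string& path_or_key) {
    const auto it = registry.find(path_or_key);
    return it != registry.end() ? it->second : path_or_key;
}
```

With a registry that maps `gemma4_e2b` to some model file, passing `gemma4_e2b` yields that file's path, while passing an unregistered string such as an absolute path returns it as-is.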
bool entropic::SpeculativeConfig::enabled = false
int entropic::SpeculativeConfig::n_draft = 16