Speculative-decoding configuration (inference.speculative.*).

#include <entropic/types/config.h>

Public Attributes
bool	enabled = false
	Master switch (off by default)

int	n_draft = 4
	Window size (proposed tokens).

bool	mtp = false
	gh#106 (v2.9.0): drive MTP (the draft is a trunk-sharing head via ctx_other) instead of the gh#36 separate-draft kernel.

ModelConfig	draft = make_default_draft_model_config()
	Full ModelConfig for the draft model.

Detailed Description

Speculative-decoding configuration (inference.speculative.*).

Architecture compatibility (v2.1.11 pin 253ba110b)

The orchestrator's check_speculative_compat() refuses the pairing when the target model is recurrent OR hybrid at the pinned llama.cpp commit. Among bundled primaries today, that leaves a SINGLE workable family:

Bundled key	llama.cpp arch	Speculative?
qwen3_5_0_8b	QWEN35 (hybrid SSM)	refused
qwen3_5_2b	QWEN35 (hybrid SSM)	refused
qwen3_5_4b	QWEN35 (hybrid SSM)	refused
qwen3_5_9b	QWEN35 (hybrid SSM)	refused
primary (3.5-A3B)	QWEN35MOE (hybrid)	refused
qwen3_6_a3b	QWEN35MOE (hybrid)	refused
nemotron3_nano_4b	NEMOTRON_H (hybrid)	refused
gemma4_a4b	GEMMA4 (pure xformer)	OK
gemma4_e4b	GEMMA4 (pure xformer)	OK
gemma4_e2b	GEMMA4 (pure xformer)	OK

Bit-identical correctness was verified empirically on the Gemma 4 family in Session 5 (proposal Implementation Log, Gate A). Hybrid SSM targets produce divergent KV state across upstream's split-prefill scheme — the issue is structural to common_speculative_* at this pin, not entropic-side. Consumers pairing a non-Gemma primary with a Gemma draft (or vice versa) will also be refused, since the gate looks at the TARGET arch.
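
The gate described above can be sketched roughly as follows. This is an illustration only: TargetArchInfo, CompatResult and check_speculative_compat_sketch are hypothetical names, and the real signature of check_speculative_compat() is not shown on this page. The two booleans stand in for what llama_model_is_hybrid and llama_model_is_recurrent would report for the target model.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the two properties the gate reads from the TARGET
// model. In the orchestrator they would come from llama_model_is_hybrid() and
// llama_model_is_recurrent(); here they are plain booleans so the sketch
// builds without llama.cpp.
struct TargetArchInfo {
    bool is_hybrid = false;
    bool is_recurrent = false;
};

// Hypothetical result type: ok == false means the pairing is refused.
struct CompatResult {
    bool ok;
    std::string reason;
};

// Refuse when the target is recurrent OR hybrid. The draft model is not
// inspected in this sketch.
inline CompatResult check_speculative_compat_sketch(const TargetArchInfo& target) {
    if (target.is_recurrent) {
        return {false, "target model is recurrent"};
    }
    if (target.is_hybrid) {
        return {false, "target model is hybrid"};
    }
    return {true, ""};
}

// Usage: a hybrid target (e.g. QWEN35) is refused, a pure-transformer target
// (e.g. GEMMA4) passes.
inline void demo_compat_gate() {
    TargetArchInfo hybrid_target;
    hybrid_target.is_hybrid = true;
    assert(!check_speculative_compat_sketch(hybrid_target).ok);

    TargetArchInfo transformer_target;
    assert(check_speculative_compat_sketch(transformer_target).ok);
}
```

Because the decision depends only on these two properties, a sketch like this needs no per-architecture table.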

Recommended pairings (bundled)

target=gemma4_e4b + draft=gemma4_e2b (CPU): bit-identical, measurable speedup on long generations.
target=gemma4_a4b + draft=gemma4_e2b (CPU): more aggressive verifier; needs ~16 GB VRAM at modest context.
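
As a rough illustration of the draft side of the first pairing, the sketch below fills in a trimmed mirror of this struct. ModelConfigSketch and SpeculativeConfigSketch are hypothetical stand-ins: the real types in entropic/types/config.h have more fields, the field types shown for gpu_layers and n_threads are assumptions, and the real draft defaults come from make_default_draft_model_config(). The target model is selected elsewhere and is not part of this struct.

```cpp
#include <cassert>
#include <string>

// Hypothetical, heavily trimmed stand-in for ModelConfig.
struct ModelConfigSketch {
    std::string path;    // bundled registry key or literal filesystem path
    int gpu_layers = 0;  // assumed type; 0 is assumed to mean CPU-only
    int n_threads = 0;   // assumed type; 0 is assumed to mean "pick a default"
};

// Hypothetical mirror of SpeculativeConfig with the documented defaults.
struct SpeculativeConfigSketch {
    bool enabled = false;  // master switch, off by default
    int n_draft = 4;       // window size (proposed tokens)
    bool mtp = false;      // separate-draft kernel unless set
    ModelConfigSketch draft;
};

// Draft side of the target=gemma4_e4b + draft=gemma4_e2b pairing.
inline SpeculativeConfigSketch make_e2b_draft_sketch() {
    SpeculativeConfigSketch cfg;
    cfg.enabled = true;             // opt in explicitly
    cfg.draft.path = "gemma4_e2b";  // bundled registry key from the table
    cfg.draft.gpu_layers = 0;       // keep the draft on CPU (assumption)
    return cfg;
}

// Usage: only enabled and draft.path differ from the documented defaults.
inline void demo_e2b_draft() {
    const SpeculativeConfigSketch cfg = make_e2b_draft_sketch();
    assert(cfg.enabled);
    assert(cfg.n_draft == 4);
    assert(!cfg.mtp);
    assert(cfg.draft.path == "gemma4_e2b");
}
```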

Future llama.cpp pins that fix the cross-ubatch SSM state issue (or alternate non-hybrid Qwen/Llama arches added to the bundled registry) will widen this set without code change — the gate is data-driven via llama_model_is_hybrid / llama_model_is_recurrent.

Version: 2.1.11

Definition at line 876 of file config.h.

Member Data Documentation

◆ draft

ModelConfig entropic::SpeculativeConfig::draft = make_default_draft_model_config()

Full ModelConfig for the draft model.

Mirrors how tier configs are structured: every llama.cpp knob (gpu_layers, n_threads, n_batch, flash_attn, context_length, use_mlock, cache_type_k/v, tensor_split, ...) is consumer-tunable from YAML via inference.speculative.draft.<field>. path accepts a bundled-model registry key OR a literal filesystem path; resolved at config-parse time by BundledModels::resolve().

Defaults are kernel-aware — see make_default_draft_model_config().

Definition at line 905 of file config.h.
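
The "registry key OR literal path" behavior of draft.path could look roughly like the sketch below. It is not the BundledModels::resolve() implementation: resolve_model_path_sketch, the map-based registry, and the example file path are all hypothetical, and the precedence (registry lookup first, literal path as fallback) is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical resolver: if the value matches a bundled registry key, return
// the registered file path; otherwise treat the value as a literal path and
// return it unchanged.
inline std::string resolve_model_path_sketch(
        const std::map<std::string, std::string>& registry,
        const std::string& path_or_key) {
    const auto it = registry.find(path_or_key);
    return it != registry.end() ? it->second : path_or_key;
}

// Usage with a made-up registry entry.
inline void demo_resolve() {
    const std::map<std::string, std::string> registry = {
        {"gemma4_e2b", "/models/gemma4_e2b.gguf"},  // illustrative path only
    };
    assert(resolve_model_path_sketch(registry, "gemma4_e2b")
           == "/models/gemma4_e2b.gguf");
    assert(resolve_model_path_sketch(registry, "/tmp/custom-draft.gguf")
           == "/tmp/custom-draft.gguf");
}
```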

◆ enabled

bool entropic::SpeculativeConfig::enabled = false

Master switch (off by default)

Definition at line 877 of file config.h.

◆ mtp

bool entropic::SpeculativeConfig::mtp = false

gh#106 (v2.9.0): when set, drive MTP (the draft is a trunk-sharing head attached via ctx_other) instead of the gh#36 separate-draft kernel.

In this mode draft.path points at the mtp-*.gguf head, which is owned by the target; decoding remains lossless.

Definition at line 884 of file config.h.

◆ n_draft

int entropic::SpeculativeConfig::n_draft = 4

Window size (proposed tokens).

gh#108 (v2.9.2): a window of 16 over-drafts the small MTP head. Measurements showed a net slowdown on Q2 (0.53x at n_draft=16 vs 1.39x at n_draft=2) and 1.91x vs 2.34x on Q8; about 4 is the sweet spot (the upstream MTP example uses 3).

Definition at line 878 of file config.h.
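
Why a larger window can lose to a smaller one can be seen in a simple textbook model of speculative decoding. The sketch below assumes each drafted token is accepted independently with probability alpha and that one draft step costs a fixed fraction of a target pass; alpha, draft_cost and both function names are hypothetical, and the numbers in the usage example are illustrative, not the measurements quoted above.

```cpp
#include <cassert>
#include <cmath>

// Expected tokens produced per target verification pass when each of the
// n_draft proposed tokens is accepted independently with probability alpha.
// The geometric sum includes the one token the target always contributes.
inline double expected_tokens_per_pass(double alpha, int n_draft) {
    assert(alpha >= 0.0 && alpha < 1.0 && n_draft >= 0);
    return (1.0 - std::pow(alpha, n_draft + 1)) / (1.0 - alpha);
}

// Idealized speedup over plain decoding. draft_cost is the cost of one draft
// step relative to one target pass (a hypothetical parameter).
inline double modeled_speedup(double alpha, int n_draft, double draft_cost) {
    return expected_tokens_per_pass(alpha, n_draft) / (1.0 + n_draft * draft_cost);
}

// Usage: with a moderate acceptance rate, a window of 16 pays for many draft
// steps that are rarely accepted, so it falls below 1x while a window of 4
// does not.
inline void demo_window_tradeoff() {
    const double alpha = 0.6;       // illustrative acceptance rate
    const double draft_cost = 0.2;  // illustrative relative draft cost
    assert(modeled_speedup(alpha, 4, draft_cost) > 1.0);
    assert(modeled_speedup(alpha, 16, draft_cost) < 1.0);
}
```

In this model the expected token count saturates at 1 / (1 - alpha) as the window grows, while the drafting cost keeps growing linearly, which is consistent with a small window being the better default.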

The documentation for this struct was generated from the following file: