Nemotron 3 chat adapter (v2.1.9, gh#47).

#include <entropic/inference/adapters/adapter_base.h>
#include <unordered_map>

Classes
class	entropic::Nemotron3Adapter
	Nemotron 3 chat adapter (hybrid Mamba-Transformer family).

Namespaces
namespace	entropic
	Activate model on GPU (WARM → ACTIVE).

Detailed Description

Nemotron 3 chat adapter (v2.1.9, gh#47).

Architecture-verification gate (proposal §gh#47)

Architecture: hybrid Mamba-Transformer (Mamba-2 + MLP + 4 attention layers), compressed from NVIDIA-Nemotron-Nano-9B-v2 via Nemotron Elastic.
GGUF arch tag: nemotron_h (variant nemotron_h_moe).
llama.cpp status: fully integrated. LLM_ARCH_NEMOTRON_H is in the arch enumeration; llm_build_nemotron_h extends llm_build_mamba_base — state handling is shared with the stable Mamba path, not experimental.
Chat template: thinking-enabled by default; <think> and </think> are separate special tokens. With the llama.cpp CLI, pass --special to surface them. Programmatic generation receives the tokens already detokenised, so the adapter's base-class strip_think_blocks / extract_thinking handle them without Nemotron-specific code.
Tool-call format: the vLLM docs advertise the qwen3_coder XML parser, but empirical capture (gh#70, v2.3.8) showed the bundled nemotron_h GGUFs actually emit a DSML invoke format at every precision (Q4_K_XL / Q8_0 / BF16):
    <｜DSML｜function_calls>
    <｜DSML｜invoke name="tool.name">
    <｜DSML｜parameter name="key" string="value"/>
    </｜DSML｜invoke>
    </｜DSML｜function_calls>

(fullwidth-pipe ｜ = U+FF5C; self-closing typed parameter tags). The adapter parses DSML first, then falls back to the qwen XML and tagged-JSON paths for rigged-prompt / mixed-format consumers.
Reasoning trace: yes — handled by base-class think-block primitives; no Nemotron-specific override needed.
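
The think-block handling referred to above can be illustrated with a minimal sketch. The function names match the base-class primitives mentioned in this description, but their real signatures in adapter_base.h are not shown here, so the namespace think_sketch, the parameter types, and the treatment of unterminated blocks are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-ins for the base-class primitives; the real
// declarations live in adapter_base.h and may differ.
namespace think_sketch {

constexpr std::string_view kOpen = "<think>";
constexpr std::string_view kClose = "</think>";
constexpr auto npos = std::string_view::npos;

// Returns the body of the first <think>...</think> block, or "" if none.
// Assumption: an unterminated block (generation cut off) runs to the end.
inline std::string extract_thinking(std::string_view text) {
    const auto open = text.find(kOpen);
    if (open == npos) return {};
    const auto start = open + kOpen.size();
    const auto close = text.find(kClose, start);
    const auto len = (close == npos) ? npos : close - start;
    return std::string(text.substr(start, len));
}

// Returns the text with every <think>...</think> block removed.
// Assumption: an unterminated block swallows the rest of the text.
inline std::string strip_think_blocks(std::string_view text) {
    std::string out;
    std::size_t pos = 0;
    while (true) {
        assert(pos <= text.size());  // cursor never runs past the input
        const auto open = text.find(kOpen, pos);
        if (open == npos) {
            out.append(text.substr(pos));  // no more blocks: keep the tail
            break;
        }
        out.append(text.substr(pos, open - pos));  // keep text before the block
        const auto close = text.find(kClose, open + kOpen.size());
        if (close == npos) break;  // unterminated block: drop the rest
        pos = close + kClose.size();
    }
    return out;
}

}  // namespace think_sketch
```

Because the tokens arrive already detokenised, plain substring search is enough in this sketch; no tokenizer access is needed.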

Gate outcome: PASSES. Nemotron3Adapter proceeds.
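
As a rough illustration of the DSML-first parsing step described in the tool-call item above, the sketch below extracts invoke blocks from generated text. It is not the adapter's implementation: the names dsml_sketch, ToolCall, parse_attrs, and parse_dsml are hypothetical, and it assumes that the attribute following name on a parameter tag is the type and carries the value, as the captured placeholder suggests. An empty result is where a caller would try the qwen XML and tagged-JSON fallbacks, which are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

namespace dsml_sketch {  // hypothetical namespace, for illustration only

// U+FF5C FULLWIDTH VERTICAL LINE as UTF-8 bytes, independent of file encoding.
const std::string kPipe = "\xEF\xBD\x9C";
const std::string kInvokeOpen = "<" + kPipe + "DSML" + kPipe + "invoke ";
const std::string kInvokeClose = "</" + kPipe + "DSML" + kPipe + "invoke>";
const std::string kParamOpen = "<" + kPipe + "DSML" + kPipe + "parameter ";
constexpr auto npos = std::string_view::npos;

struct ToolCall {
    std::string name;
    std::unordered_map<std::string, std::string> args;
};

using Attrs = std::vector<std::pair<std::string, std::string>>;

// Reads key="value" pairs starting at pos, stopping at '>' or '/'.
// No escape handling: a value ends at the next double quote.
inline Attrs parse_attrs(std::string_view s, std::size_t& pos) {
    assert(pos <= s.size());
    Attrs out;
    while (true) {
        const auto ks = s.find_first_not_of(" \t\r\n", pos);
        if (ks == npos || s[ks] == '>' || s[ks] == '/') { pos = ks; break; }
        const auto eq = s.find("=\"", ks);
        if (eq == npos) { pos = npos; break; }
        const auto ve = s.find('"', eq + 2);
        if (ve == npos) { pos = npos; break; }
        out.emplace_back(std::string(s.substr(ks, eq - ks)),
                         std::string(s.substr(eq + 2, ve - eq - 2)));
        pos = ve + 1;
    }
    return out;
}

// Returns every well-formed invoke block; truncated blocks are ignored.
inline std::vector<ToolCall> parse_dsml(std::string_view text) {
    std::vector<ToolCall> calls;
    std::size_t pos = 0;
    while ((pos = text.find(kInvokeOpen, pos)) != npos) {
        pos += kInvokeOpen.size();
        const auto end = text.find(kInvokeClose, pos);
        if (end == npos) break;
        const std::string_view body = text.substr(pos, end - pos);
        pos = end + kInvokeClose.size();

        std::size_t cur = 0;
        const Attrs head = parse_attrs(body, cur);  // attributes of the invoke tag
        if (head.empty() || head[0].first != "name") continue;
        ToolCall call;
        call.name = head[0].second;
        while ((cur = body.find(kParamOpen, cur)) != npos) {
            cur += kParamOpen.size();
            const Attrs a = parse_attrs(body, cur);
            // Expected shape: name="key" <type>="value"; the type is discarded.
            if (a.size() == 2 && a[0].first == "name") call.args[a[0].second] = a[1].second;
        }
        calls.push_back(std::move(call));
    }
    return calls;
}

}  // namespace dsml_sketch
```

Scanning attribute by attribute, rather than cutting each tag at the first `>`, lets a quoted value contain `>` without ending the tag early. Malformed attribute lists are not diagnosed in this sketch.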

Internal to inference .so.

Definition in file nemotron3_adapter.h.

Classes