|
Entropic 2.3.8
Local-first agentic inference engine
|
UTF-8 validation + replacement at every system boundary where bytes change ownership. More...


Go to the source code of this file.
Namespaces | |
| namespace | entropic |
| Activate model on GPU (WARM → ACTIVE). | |
Functions | |
| ENTROPIC_EXPORT std::string | entropic::mcp::sanitize_utf8 (std::string_view input) |
| Replace invalid UTF-8 byte sequences with U+FFFD. | |
UTF-8 validation + replacement at every system boundary where bytes change ownership.
Bytes carrying invalid UTF-8 (legacy-codepage source comments, mojibake, truncated multi-byte runs, model-stream desyncs under XML-tool-call pressure) crash any downstream nlohmann::json::dump() with json::type_error 316. v2.1.0 introduced the sanitizer at the MCP inbound boundary on the assumption that "sanitize once at inbound, trust
downstream" was sufficient. v2.1.1 (issue #3) generalized that to a boundary-of-ownership policy after the demo session showed bytes reaching json::dump() from paths that bypassed the inbound boundary.
src/mcp/tool_executor.cppsrc/core/response_generator.cpp (sanitize once at message-finalization, NEVER per-token — a multi-byte codepoint can split across token boundaries)src/facade/entropic_audit.cppsrc/facade/json_serializers.hInterior code (engine memory, hook contexts, dedup cache, per-tier routing) trusts the bytes — once a string has crossed an inbound boundary, no further sanitize call is needed. Conversely, do NOT add sanitize calls inside loops or hot interior paths; they belong at the outermost system seam each direction.
sanitize_utf8() walks the input as a byte sequence, validates each codepoint per RFC 3629, and writes U+FFFD (REPLACEMENT CHARACTER, 0xEF 0xBF 0xBD) in place of any malformed sequence. ASCII and well-formed multi-byte input pass through verbatim. The namespace is entropic::mcp for legacy reasons (v2.1.0 introduced it as an MCP-only utility); relocation to a generic namespace would change the exported C++ symbol and is deferred to a future major release.
Definition in file utf8_sanitize.h.
| std::string entropic::mcp::sanitize_utf8 | ( | std::string_view | input | ) |
Replace invalid UTF-8 byte sequences with U+FFFD.
| input | Raw bytes from a tool-result subprocess. Treated as a byte sequence, not a code-point sequence. |
input if already valid UTF-8, or with each malformed byte sequence replaced by U+FFFD (the Unicode replacement character).Continuation bytes must be in 0x80..0xBF; a missing or out-of-range continuation triggers replacement and advances past the leading byte only (the next byte gets a fresh validation pass — Bjoern Hoehrmann's "robust resync" property).
@utility
| input | Raw bytes (potentially malformed). |
Definition at line 96 of file utf8_sanitize.cpp.