Entropic 2.3.8
Local-first agentic inference engine
utf8_sanitize.cpp File Reference

Implementation of sanitize_utf8(). More...

#include <entropic/mcp/utf8_sanitize.h>
#include <cstdint>

Namespaces

namespace  entropic
 

Functions

ENTROPIC_EXPORT std::string entropic::mcp::sanitize_utf8 (std::string_view input)
 Replace invalid UTF-8 byte sequences with U+FFFD.
 

Detailed Description

Implementation of sanitize_utf8().

Version
2.1.0

Definition in file utf8_sanitize.cpp.

Function Documentation

◆ sanitize_utf8()

std::string entropic::mcp::sanitize_utf8 ( std::string_view  input)

Replace invalid UTF-8 byte sequences with U+FFFD.

Parameters
input: Raw bytes from a tool-result subprocess. Treated as a byte sequence, not a code-point sequence.
Returns
A new string: identical to input if input is already valid UTF-8; otherwise a copy of input with each malformed byte sequence replaced by U+FFFD (the Unicode replacement character).
Algorithm:
Per RFC 3629:
  • 0x00..0x7F → 1-byte ASCII, passed through unchanged
  • 0xC2..0xDF + 1 continuation byte → 2-byte sequence
  • 0xE0..0xEF + 2 continuation bytes → 3-byte sequence (with sub-range checks on the first continuation byte)
  • 0xF0..0xF4 + 3 continuation bytes → 4-byte sequence (with sub-range checks on the first continuation byte)
  • anything else → replaced with U+FFFD, advancing one byte

Continuation bytes must be in 0x80..0xBF. A missing or out-of-range continuation byte triggers replacement and advances past the leading byte only, so the next byte gets a fresh validation pass (Bjoern Hoehrmann's "robust resync" property).
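
The rules above can be sketched as a single forward scan. This is a minimal illustration and not the Entropic implementation: the names sketch::sanitize_utf8_sketch, first_cont_range, and cont_count are hypothetical, and the sub-range bounds for the first continuation byte are assumed from the UTF8-3 and UTF8-4 productions in RFC 3629, since this page does not list them.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical namespace, for illustration only

struct Range { std::uint8_t lo, hi; };

// Allowed range for the FIRST continuation byte after `lead`
// (assumed from RFC 3629; all later continuation bytes use 0x80..0xBF).
constexpr Range first_cont_range(std::uint8_t lead) {
    switch (lead) {
        case 0xE0: return {0xA0, 0xBF};  // excludes overlong 3-byte forms
        case 0xED: return {0x80, 0x9F};  // excludes surrogates U+D800..U+DFFF
        case 0xF0: return {0x90, 0xBF};  // excludes overlong 4-byte forms
        case 0xF4: return {0x80, 0x8F};  // excludes code points above U+10FFFF
        default:   return {0x80, 0xBF};
    }
}

// Number of continuation bytes expected after `b`, or -1 if `b`
// cannot start a sequence (0x80..0xC1 and 0xF5..0xFF).
constexpr int cont_count(std::uint8_t b) {
    if (b <= 0x7F) return 0;
    if (b >= 0xC2 && b <= 0xDF) return 1;
    if (b >= 0xE0 && b <= 0xEF) return 2;
    if (b >= 0xF0 && b <= 0xF4) return 3;
    return -1;
}

inline std::string sanitize_utf8_sketch(std::string_view input) {
    static constexpr std::string_view kReplacement = "\xEF\xBF\xBD";  // U+FFFD
    std::string out;
    out.reserve(input.size());
    std::size_t i = 0;
    while (i < input.size()) {
        const auto lead = static_cast<std::uint8_t>(input[i]);
        const int n = cont_count(lead);
        // Valid lead byte, and enough bytes left for all continuations?
        bool ok = n >= 0 && i + static_cast<std::size_t>(n) < input.size();
        for (int k = 1; ok && k <= n; ++k) {
            const auto c = static_cast<std::uint8_t>(input[i + static_cast<std::size_t>(k)]);
            const Range r = (k == 1) ? first_cont_range(lead) : Range{0x80, 0xBF};
            ok = c >= r.lo && c <= r.hi;
        }
        if (ok) {
            const std::size_t len = static_cast<std::size_t>(n) + 1;
            out.append(input.substr(i, len));  // copy the whole valid sequence
            i += len;
        } else {
            out.append(kReplacement);  // replace, then resync at the next byte
            i += 1;
        }
    }
    return out;
}

}  // namespace sketch
```

With this sketch, the two-byte input 0xC3 0x28 yields U+FFFD followed by "(": the lead byte is replaced and the following byte is validated afresh as ASCII, which is the resync behaviour described above.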


Version
2.1.0

Definition at line 96 of file utf8_sanitize.cpp.