Washing Away the Lead: ANSI Escape Code Stripping and Defending Against LLM Token Pollution

(Article 53: Agent Dynamics - Data Denoising)

In Agent development, it is easy to make a "seemingly reasonable" mistake: feeding the unmodified stdout of npm run build, pytest, or cargo test directly to the model as an Observation.

Then you will witness two types of disasters:

The model starts outputting unintelligible text, as if it's "dazzled."
Context consumption skyrockets, directly triggering "token limit exceeded" errors or massive latency spikes.

The culprit is often a relic of the terminal ecosystem left over from decades ago: ANSI Escape Codes.

To humans, these represent colors and formatting. To a model, they are a pile of noisy bytes that the tokenizer shreds into fragments. They also produce a massive amount of "repetitive but seemingly different" text that burns through the model's attention and context budget.

1. What Exactly Are ANSI Escape Codes: They Are Not "Characters," but a Terminal Control Protocol

ANSI escape codes are a class of control sequences starting with ESC (hexadecimal 0x1B), used to issue "drawing instructions" to a terminal emulator: setting colors, moving the cursor, clearing the screen, setting titles, and even triggering hyperlinks in certain implementations.

Their history and standards are complicated, but you should at least remember the "family tree":

CSI: Control sequences starting with ESC [ (the most common).
OSC: Operating system command sequences starting with ESC ] (titles, clipboard, etc.).
DCS/APC, etc.: Lower-level, less common, but will appear in TUIs and certain terminal capabilities.
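
The family tree above can be made concrete with a small classifier that looks only at the byte following ESC. This is a minimal sketch, not a complete ECMA-48 parser; the function name classify_escape and the returned labels are invented for illustration.

```python
ESC = "\x1b"

# Byte after ESC -> family name for the string-style sequences. Labels are illustrative.
_STRING_FAMILIES = {"]": "OSC", "P": "DCS", "_": "APC"}


def classify_escape(seq: str) -> str:
    """Return the family of an escape sequence based on its introducer."""
    if len(seq) < 2 or seq[0] != ESC:
        return "not an escape sequence"
    introducer = seq[1]
    if introducer == "[":
        # SGR is the CSI subset whose final byte is "m" (colors, bold, ...).
        return "CSI/SGR" if seq.endswith("m") else "CSI"
    return _STRING_FAMILIES.get(introducer, "other")


for sample in ("\x1b[31m", "\x1b[2K", "\x1b]0;build\x07", "\x1bPq\x1b\\"):
    print(repr(sample), "->", classify_escape(sample))
```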

1.1 Common Control Sequences (CSI)

Color Control (SGR): \x1b[31m represents red, \x1b[0m represents resetting the color.
Cursor Movement: \x1b[A moves up one line.
Line Clearing: \x1b[K clears content from the cursor to the end of the line.

1.2 Why Models "Hate" Them (It's Not OCD, It's Physical Cost)

A large model's tokenizer has no notion of terminal control sequences, so it splits them into many small, meaningless tokens. Take the simple red word Error:

What humans see: Error (in red)
Raw content seen by the LLM: \x1b[31mError\x1b[0m
Tokenizer splitting result (illustrative; the exact split depends on the tokenizer): [\x1b, [, 3, 1, m, Error, \x1b, [, 0, m]

This kind of pollution causes:

Scattered Attention: The noise tokens compete with the real content for the model's attention, making it less sensitive to the actual error message.
Context Wastage: Complex colored logs can increase Token consumption by roughly 30% to 200%, depending on how densely the output is colored and on the tokenizer.
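
To get a feel for the overhead on your own logs, you can compare the raw length with the visible length. The sketch below counts characters, not tokens, so it is only a rough proxy; the helper name ansi_overhead is hypothetical, and the regex covers CSI sequences only.

```python
import re

# CSI sequences only (colors, cursor movement, line clearing); enough for colored logs.
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def ansi_overhead(raw: str) -> float:
    """Extra characters added by CSI sequences, as a fraction of the visible text."""
    visible = CSI_RE.sub("", raw)
    if not visible:
        return 0.0
    return (len(raw) - len(visible)) / len(visible)


line = "\x1b[31mError\x1b[0m"
print(f"{ansi_overhead(line):.0%}")  # prints 180%
```

A single colored word is an extreme case (9 control characters wrapped around 5 visible ones); long log lines with one color change each land far lower.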

2. Stripping is Not Just Saving Tokens: It is Also the First Gate of "Observation Security"

Many people treat ANSI stripping as a "performance optimization." But in an Agent system it plays a more serious role: preventing Observation pollution.

Your stdout might carry:

Untrusted external content (download script outputs, CI injections, third-party tool prompts).
Snippets capable of inducing execution (e.g., disguising text that looks like a log as a command suggestion).

Therefore, the correct strategy is:

The UI layer can display colored raw output (for humans to look at).
The LLM side receives ONLY "cleansed plain text + source metadata + truncation strategy," and the raw bytes are archived for auditing.

In the Agent's data funnel, a physical filter must be established: All data flowing out of the terminal must be "washed clean" before entering Memory.

3. Regex Stripping: Usable, But You Must Know Its Boundaries

Regex is the first line of filtration. It is highly effective against "colored logs," but unreliable against an "interactive terminal canvas."

Here, I will provide an implementation that covers common sequences, while also pointing out the problems it cannot solve.

3.1 [Core Code] Broad-Spectrum Sequence Stripping + Progress Bar Flattening

import re

class AnsiStripper:
    """
    The Agent's retinal filter:
    Strips away all ANSI terminal residue that interferes with the large model's semantic understanding.
    """
    def __init__(self):
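        # Covers common ANSI/ECMA-48 sequences: CSI, BEL/ST-terminated OSC, and two-byte escapes.
        # Note: It cannot "understand the canvas"; it can only perform byte-layer stripping.
        # Alternation order matters: "[" and "]" also fall inside the generic [@-_] range,
        # and a two-byte escape must never swallow the first letter of the text after it.
        self.ansi_regex = re.compile(
            r'''
            \x1B                                      # ESC, followed by one of:
            (?:
                \[ [0-?]* [ -/]* [@-~]                # CSI: parameter, intermediate, final bytes
              | \] [^\x07\x1B\n]* (?: \x07 | \x1B\\ ) # OSC: payload up to BEL or ST (ESC + backslash)
              | [@-_]                                 # any other two-byte escape
            )
            | \x9B [0-?]* [ -/]* [@-~]                # 8-bit CSI
            | [\x80-\x9A\x9C-\x9F]                    # remaining 8-bit C1 controls
            ''',
            re.VERBOSE
        )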

    def strip(self, text: str) -> str:
        """Strips color and formatting codes"""
        if not text: return ""
        return self.ansi_regex.sub('', text)

    def collapse_progress_bars(self, text: str) -> str:
        """
        [Geek Optimization]: Handles scrolling progress bars.
        When a program outputs 1%... 2%..., it continuously sends \r (carriage return without line feed).
        If unhandled, the large model will see hundreds of lines of repeated progress.
        """
        # Minimum viable strategy:
        # 1) Split into "logical lines" by \n first
        # 2) For internal \r characters within each line, retain only the final overwritten result
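        out_lines: list[str] = []
        for line in text.split("\n"):
            # Drop the trailing \r of CRLF line endings first (PTY output usually
            # ends lines with \r\n); otherwise the "last segment" would be empty.
            line = line.rstrip("\r")
            # Keep only the text written after the final carriage return.
            out_lines.append(line.rsplit("\r", 1)[-1])
        return "\n".join(out_lines).strip()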

4. The Collapse of the Defense Line: Why TUIs Turn "Logs" into a Dynamic Canvas

When you encounter full-screen TUIs such as htop, vim, or fzf, or the fancy progress bars of npm, regex becomes unreliable, because these programs are not "outputting a line of text": they are repeatedly erasing and redrawing within a fixed-size screen buffer.

From the raw bytes, you can see:

Cursor moving up, moving left, clearing lines.
Overwriting the same line using carriage returns (\r).
Multiple "canvas operations" contained within a single output burst.

This leads to two problems:

Text repetition: The same line is overwritten 300 times, so the model sees 300 lines.
Semantic misalignment: You stripped the control sequences but failed to reconstruct the canvas, resulting in treating an "intermediate state" as the final state.

4.1 The Ultimate Solution: Virtual Terminal Emulator

The truly robust approach is to embed a virtual terminal emulator inside the Runner, render the PTY bytes onto an in-memory screen buffer, and then export only the plain text of the "final screen snapshot."

This is equivalent to: Turning the "byte stream" back into "the screen seen by the human eye."

Rendering: Project the raw bytes returned by the PTY onto an in-memory virtual canvas, just like drawing a picture.
Screen Dump: Discard all intermediate "flickering" and "moving" processes.
Result: Extract only the text currently visible on the screen (for example, 24 lines on a default 80x24 terminal) and send it to the large model.

For an Agent, this is much closer to the correct solution than "a stronger regex."
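
The render-then-dump idea can be illustrated with a deliberately tiny emulator. This is a toy sketch under strong assumptions: it understands only \r, \n, cursor-up (CSI A), and erase-in-line (CSI K, modes 0 and 2), never wraps long lines, and treats every line feed as a return to column 0. The name render_snapshot is hypothetical; a production Runner would more likely rely on a maintained terminal emulator library.

```python
import re

# A token is either a CSI sequence (parameters and final byte captured) or one character.
TOKEN_RE = re.compile(r"\x1b\[([0-?]*)[ -/]*([@-~])|(.)", re.DOTALL)


def render_snapshot(raw: str, rows: int = 24) -> str:
    """Replay terminal output on an in-memory canvas; return the last `rows` lines."""
    lines: list[list[str]] = [[]]
    row = col = 0
    for match in TOKEN_RE.finditer(raw):
        params, final, char = match.groups()
        if char is None:  # a CSI sequence
            if final == "A":  # cursor up N lines (default 1)
                count = max(1, int(params)) if params.isdigit() else 1
                row = max(0, row - count)
            elif final == "K" and params in ("", "0"):  # erase cursor -> end of line
                del lines[row][col:]
            elif final == "K" and params == "2":  # erase the whole line
                lines[row] = []
            # Everything else (SGR colors, unsupported movements) is dropped.
        elif char == "\r":
            col = 0
        elif char == "\n":
            # Assumption: a line feed also returns to column 0.
            row, col = row + 1, 0
            if row == len(lines):
                lines.append([])
        elif char < " " and char != "\t":
            continue  # ignore other control characters, including a bare ESC
        else:
            line = lines[row]
            line.extend(" " * (col - len(line)))  # pad when writing past the end
            if col < len(line):
                line[col] = char  # overwrite an existing cell
            else:
                line.append(char)
            col += 1
    text = ["".join(line).rstrip() for line in lines]
    while text and not text[-1]:
        text.pop()  # drop trailing blank lines
    return "\n".join(text[-rows:])


print(render_snapshot("Downloading 1%\rDownloading 50%\rDownloading 100%\nDone\n"))
```

Unlike the "keep the last \r segment" shortcut, the canvas reproduces partial overwrites: "hello world\rbye" renders as "byelo world", which is what a human would actually see on screen.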

5. Correct Stratification of the Observation Pipeline: The Three Worlds of Raw Bytes, UI, and LLM

Please maintain this rule when designing systems:

Raw Data (bytes): Used for auditing and post-mortems; saved exactly as-is (can be compressed).
UI Rendering: For human experience, ANSI can be retained, or colors applied manually.
LLM Observation: Only consumes cleansed plain text, accompanied by metadata:
- Source command
- Timestamp
- Truncation strategy
- Whether it originated from a virtual terminal snapshot

This isn't perfectionism; this is decoupling "visualization" from "reasoning-friendly input" to avoid bringing UI side effects into the model's context.
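
One way to keep the three worlds apart is to give the LLM side its own record type. The sketch below is an assumption about how such a record could look: the Observation fields mirror the metadata list above, while build_observation, max_chars, and the head-plus-tail truncation are illustrative choices. The raw bytes are referenced by hash only, on the assumption that they are archived elsewhere.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    """What the LLM side receives; field names are illustrative."""
    text: str
    source_command: str
    truncation: str
    from_snapshot: bool
    raw_sha256: str  # pointer to the archived raw bytes, never the bytes themselves
    timestamp: float = field(default_factory=time.time)


def build_observation(raw: bytes, clean_text: str, command: str,
                      from_snapshot: bool = False, max_chars: int = 4000) -> Observation:
    """Pair already-cleansed text with metadata and a hash of the raw output."""
    truncation = "none"
    if len(clean_text) > max_chars:
        # Keep the head and the tail: error summaries often appear at the end.
        half = max(1, max_chars // 2)
        omitted = len(clean_text) - 2 * half
        clean_text = (clean_text[:half]
                      + f"\n[... {omitted} characters omitted ...]\n"
                      + clean_text[-half:])
        truncation = f"head+tail, {max_chars} chars"
    return Observation(
        text=clean_text,
        source_command=command,
        truncation=truncation,
        from_snapshot=from_snapshot,
        raw_sha256=hashlib.sha256(raw).hexdigest(),
    )


obs = build_observation(b"\x1b[31mError\x1b[0m", "Error", "pytest")
print(obs.source_command, obs.truncation, obs.raw_sha256[:12])
```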

6. Minimum Testability: Prepare a Set of "Disgusting Inputs" for Your Stripper

A stripper MUST be testable. Otherwise, you will never know if you are "stripping noise" or "accidentally deleting evidence."

A minimum test suite is recommended to include:

SGR: Red/reset colors mixed into an error line.
CSI: Cursor movement + line clearing.
OSC: Title-setting sequences (common in certain tools).
\r: A progress bar overwritten 100 times.

Your assertion should not be "the output is shorter," but rather:

Critical error lines MUST be retained.
Repetitive overwrites MUST collapse.
Post-stripping content MUST be stable (same input yields same output).
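
The checklist above could be written down roughly as follows. The clean function here is a compact, hypothetical stand-in for whatever stripper you actually test (regex stripping followed by \r collapsing); the inputs are hand-made samples, not captured tool output.

```python
import re

ANSI_RE = re.compile(
    r"\x1b(?:\[[0-?]*[ -/]*[@-~]"           # CSI
    r"|\][^\x07\x1b\n]*(?:\x07|\x1b\\)"     # OSC terminated by BEL or ST
    r"|[@-_])"                              # any other two-byte escape
)


def clean(raw: str) -> str:
    """Hypothetical stand-in for the stripper under test."""
    text = ANSI_RE.sub("", raw)
    lines = [line.rstrip("\r").rsplit("\r", 1)[-1] for line in text.split("\n")]
    return "\n".join(lines)


def test_sgr_keeps_error_line():
    assert clean("\x1b[31mERROR: build failed\x1b[0m") == "ERROR: build failed"


def test_csi_movement_and_clear_are_removed():
    out = clean("compiling\x1b[A\x1b[Kdone")
    assert "\x1b" not in out and out.endswith("done")


def test_osc_title_is_removed():
    assert clean("\x1b]0;my title\x07tests passed") == "tests passed"


def test_progress_bar_collapses():
    raw = "".join(f"\rprogress {i}%" for i in range(1, 101)) + "\n"
    assert clean(raw) == "progress 100%\n"


def test_crlf_lines_survive():
    assert clean("FAILED test_x\r\nok\r\n") == "FAILED test_x\nok\n"


def test_output_is_stable():
    raw = "\x1b[1mstep\x1b[0m\rdone\n"
    assert clean(raw) == clean(raw) == clean(clean(raw))


for test in (test_sgr_keeps_error_line, test_csi_movement_and_clear_are_removed,
             test_osc_title_is_removed, test_progress_bar_collapses,
             test_crlf_lines_survive, test_output_is_stable):
    test()
print("all stripper checks passed")
```

Note that the cursor-movement case only asserts that no escape bytes remain; a regex cannot reconstruct what the screen looked like, which is exactly the boundary described in section 4.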

Chapter Summary (Do Not Treat This as a Minor Optimization)

Tokens are Gold: Do not waste precious context on garbage like \x1b[31m.
Handling \r is More Important Than Handling Colors: If you don't handle the progress bar pile-up caused by carriage returns, a single npm install can suffocate your Agent via Token overflow.
Regex is Only the First Step: The true solution is "Virtual Terminal Simulation -> Screen Snapshot -> Plain Text Export."
Stripping is Also a Safety Gate: Observation inputs must first be sanitized, isolated, and trimmed so the model is not led astray by noise and injections.

Having cleared the noise from the data, your Agent can finally analyze errors without distraction. In the next chapter, we will face a more realistic problem: [Execution Timeouts and Ghost Processes: How to Prevent an Infinite Loop Command from Dragging Your Agent into the Quagmire?]. We are going to start writing "Watchdogs"!

(End of text - Deep Dive Series 19)
