Salvaging from the Quagmire: Raw Text Tool Parsers and Fallback Strategies
(Article 50: The Resilience of Agent Protocols)
In the previous article, we discussed how silky smooth schema validation becomes when it is backed by top-tier constrained-decoding APIs. However, in the cruel reality of actual agent deployment (especially when clients need to run local open-source models like Llama 3 or Qwen 2 offline on an intranet), models usually do not natively support OpenAI-style hard interception for `tool_calls`.
At this point, we can only let the large model output raw text within a standard conversational flow. This demands the most hardcore craft of a geek: "text salvaging surgery", precisely severing machine instructions from a pile of gibberish.
0. First, Put the Parser Back into a Secure Context: External Text is Untrusted by Default
Before discussing "how to parse," we must first clarify the threat model. The moment you allow raw text to carry "tool instructions," you must acknowledge:
- The model's output is not a trusted boundary.
- Observations (web pages, RAG, logs) are not trusted boundaries.
- Successful parsing merely means "extraction," not "authorization."
Indirect Prompt Injection (IPI) is not a theoretical problem; it has already appeared in real-world retrieval and tool-output chains, and it disguises "contextual content" as "high-priority instructions."
As long as you allow raw text to contain "tool instructions," you must accept two facts:
- Raw text may come from untrusted data (user input, web pages, RAG results).
- Untrusted data may carry indirect prompt injections.
Therefore, the parser must obey two hard rules:
- Successful parsing does not equal executability: The parser's output is still untrusted input; it must go through an allowlist/permissions/auditing check before execution.
- Data blocks must be isolated: Retrieval/web snippets must be wrapped in isolation tags, and their source and timestamp must be recorded in the audit logs.
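The second rule can be sketched in a few lines. A minimal wrapper, assuming a hypothetical `<untrusted_data>` tag name and a dict-based audit record (both are illustrative conventions, not a fixed standard):

```python
import hashlib
import time

def wrap_untrusted(text: str, source: str) -> dict:
    """Wrap a retrieved snippet in isolation tags and record its provenance.

    The <untrusted_data> tag name is an assumption for illustration; use
    whatever delimiter your system prompt declares as "data, not instructions".
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    audit_record = {
        "source": source,           # where the snippet came from
        "timestamp": time.time(),   # when it entered the context
        "content_hash": digest,     # for later audit / replay
    }
    wrapped = f'<untrusted_data source="{source}">\n{text}\n</untrusted_data>'
    return {"wrapped": wrapped, "audit": audit_record}
```

The wrapped string goes into the model context; the audit record goes into your logs, so any later "why did the agent do that" question can be traced back to a concrete snippet.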
1. Native Convention Tag Patterns: The Engineering Battle Between XML and JSON
When a large model lacks native API support for tool calls, there are currently two of the most robust protocol schools in the industry: one is the Markdown JSON block protocol, and the other is the XML tag protocol, highly revered by Anthropic and the large-scale code-agent community.
1.0 Conclusion First: Protocols are Not Aesthetics, They are Failure Mode Management
When choosing a protocol, what you really need to compare is:
- Can you still salvage a complete instruction during a "half-baked output"?
- Can you reliably slice out the boundaries from the noise when it's "mixed with gibberish"?
- Will you self-destruct when "parameters contain code/quotes/backticks"?
Being "more elegant" doesn't matter. What matters is: will the system enter a retry storm when parsing fails?
1.1 Why is XML Better Suited for "Salvaging" Than JSON?
JSON parsing is "all or nothing": a single missing comma will cause `json.loads()` to throw an exception. XML (or custom tags), on the other hand, possesses extremely strong local recognizability.
Core Advantages of XML:
- Starting Atomicity: Seeing `<tool_call>` immediately indicates the action has begun, regardless of the preceding gibberish.
- Fault Tolerance Boundary: Even if the `args` internally contain incredibly complex special characters (like quotes in a shell script), as long as we match `</tool_call>` via a non-greedy regex, we can close the tag.
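To see the non-greedy boundary in action, here is a tiny sketch (the `run_shell` call and the surrounding chatter are made up for illustration):

```python
import re

# Non-greedy (.*?) with re.DOTALL: stop at the FIRST closing tag,
# no matter what quotes, braces, or newlines appear inside <args>.
pattern = re.compile(
    r"<tool_call>\s*<name>(.*?)</name>\s*<args>(.*?)</args>\s*</tool_call>",
    re.DOTALL,
)

noisy = (
    "Sure, let me run that for you!\n"
    "<tool_call><name>run_shell</name>"
    "<args>echo \"hello {world}\" && cat 'notes.md'</args>"
    "</tool_call>\n"
    "Hope that helps."
)

m = pattern.search(noisy)
print(m.group(1))  # run_shell
print(m.group(2))  # the shell snippet survives, quotes and all
```

A strict JSON parser would have to escape every quote in that shell snippet; the tag pair simply treats it as opaque text.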
1.2 Protocol Selection Matrix: When to Use Which
| Goal | XML Tag | Markdown ```json code block | Notes |
|---|---|---|---|
| Salvageable during partial streaming output | Strong | Weak | JSON dies completely missing one bracket; XML can be "closed and patched" |
| Chaining multiple tool calls | Medium | Strong | JSON arrays are more natural but require strict syntax |
| Parameters containing code snippets | Strong | Medium | XML can treat args as pure text; JSON requires escaping |
| Compatibility across different models | Strong | Medium | Using self-explanatory XML tags is often more stable |
| Security isolation (Data vs Instructions) | Strong | Medium | XML makes it easier to "tag data blocks," reducing misparsing risks |
2. Violent Aesthetics: Implementation of the Fuzzy Tool Parser
When the Agent engine receives a "quagmire of text" containing wordy gibberish, explanatory information, and tool tags, we must use multi-level regex to dismantle it.
2.1 [Core Code] A Regex Extractor with Self-Healing Capabilities
Here is a Python implementation capable of violently salvaging instructions from both Markdown and XML:
```python
import re
import json


class FuzzyToolParser:
    """
    A text parser that ignores gibberish and only looks for instructions.
    It not only uses regex to find tags but is also responsible for
    patching fragmented JSON strings.
    """

    def __init__(self):
        # Compatible with both XML tags and Markdown ```json code blocks
        self.xml_pattern = re.compile(
            r"<tool_call>\s*<name>(.*?)</name>\s*<args>(.*?)</args>\s*</tool_call>",
            re.DOTALL,
        )
        self.md_json_pattern = re.compile(r"```json\s*(\{.*?\})\s*```", re.DOTALL)

    def extract(self, text: str) -> list:
        tool_calls = []
        # 1. Attempt to salvage from XML tags
        for match in self.xml_pattern.finditer(text):
            tool_calls.append({
                "name": match.group(1).strip(),
                "args": self._lenient_json_parse(match.group(2).strip()),
            })
        # 2. If nothing is found in XML, fall back to Markdown code blocks
        if not tool_calls:
            for match in self.md_json_pattern.finditer(text):
                tool_calls.append(self._lenient_json_parse(match.group(1)))
        return tool_calls

    def _lenient_json_parse(self, raw_str: str):
        """
        Lenient parser:
        1. Attempt direct parsing.
        2. If it fails, attempt to patch the trailing curly brace.
        3. If it still fails, wrap the text as-is so nothing is silently lost.
        """
        try:
            return json.loads(raw_str)
        except json.JSONDecodeError:
            # Violent patching: recover from truncated model output
            fixed_str = raw_str.strip()
            if not fixed_str.endswith("}"):
                fixed_str += "}"
            try:
                return json.loads(fixed_str)
            except json.JSONDecodeError:  # never use a bare except here
                return {"raw_text_args": raw_str}
```
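To see the patching behavior of `_lenient_json_parse` in isolation, here is the same strategy as a standalone function (a sketch duplicating the method above purely for demonstration):

```python
import json

def lenient_json_parse(raw_str: str):
    """Standalone copy of the lenient strategy used in FuzzyToolParser."""
    try:
        return json.loads(raw_str)
    except json.JSONDecodeError:
        fixed = raw_str.strip()
        if not fixed.endswith("}"):
            fixed += "}"
        try:
            return json.loads(fixed)
        except json.JSONDecodeError:
            return {"raw_text_args": raw_str}

# A truncated args payload missing its closing brace gets patched:
print(lenient_json_parse('{"path": "/tmp"'))  # {'path': '/tmp'}

# Hopelessly broken input degrades to plain text instead of raising:
print(lenient_json_parse('path=/tmp'))  # {'raw_text_args': 'path=/tmp'}
```

Note the last case: the parser never throws; worst case, the downstream security layer sees a clearly labeled `raw_text_args` blob and can refuse to execute it.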
2.2 Failure Modes and Governance Points: Parsing Failures Turn Into Retry Storms
| Failure Mode | Trigger | Consequence | Governance Point |
|---|---|---|---|
| Parsing Failure | Half-baked tags / half-baked JSON | Retry storm | Commit boundaries + Upper limits |
| Misparsing | Treating data as instructions | Privilege escalation side effects | Data isolation + deny-by-default |
| Output Explosion | Massive volume of raw text | Timeout / Memory leak | Truncation + hashing |
| Lack of Auditing | No source fields | Inability to post-mortem | Observability + Auditing |
3. "Limb Completion" Logic in Stream Truncation
If you are dealing with truncation in ultra-fast streaming output, the situation gets even worse. When the network suddenly drops or the token limit is reached, you might only receive:

```
<tool_call><name>run_shell</name><args>rm -rf /
```
3.1 Industrial-Grade Patching is a State Machine
A true industrial-grade implementation isn't "writing a few more regexes"; it is a state machine:
- Enter capturing state: on seeing `<tool_call>`.
- Enter field state: captures `<name>` and `<args>`.
- Enter closing state: only commits upon seeing `</tool_call>`.
- Stream ends while still in capturing state: triggers the "patching strategy," but MUST degrade to read-only mode.
The core principle of state machine patching is: It is better to execute less than to execute by mistake. The patched instruction must pass through the allowlist and permission checks again at the security layer.
In implementation, a daemon coroutine can watch the stream: when the stream ends but `is_capturing` is still True, it attempts to inject `</args></tool_call>` to complete structural closure. However, the output of this closure may only enter shadow mode for read-only tools; direct entry into write-type tool paths is strictly prohibited.
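A minimal sketch of this state machine, written synchronously rather than as a coroutine for brevity (the class and field names are illustrative):

```python
import re

class StreamingToolStateMachine:
    """Commit a tool call only on a full </tool_call>; otherwise degrade.

    Field extraction is simplified to a single regex pass over the
    buffered capture; a production parser would track fields incrementally.
    """

    def __init__(self):
        self.buffer = ""
        self.is_capturing = False
        self.committed = []  # fully closed calls, eligible for execution
        self.shadow = []     # patched calls: read-only / shadow mode only

    def feed(self, chunk: str):
        self.buffer += chunk
        if "<tool_call>" in self.buffer:
            self.is_capturing = True
        # Commit every fully closed call as soon as its closing tag arrives.
        while "</tool_call>" in self.buffer:
            call, _, self.buffer = self.buffer.partition("</tool_call>")
            self._commit(call + "</tool_call>", trusted=True)
            self.is_capturing = "<tool_call>" in self.buffer

    def end_of_stream(self):
        if self.is_capturing:
            # Patching strategy: close the structure, but mark it shadow-only.
            self._commit(self.buffer + "</args></tool_call>", trusted=False)
            self.is_capturing = False

    def _commit(self, raw: str, trusted: bool):
        m = re.search(r"<name>(.*?)</name>\s*<args>(.*?)</args>", raw, re.DOTALL)
        if not m:
            return  # better to execute less than to execute by mistake
        call = {"name": m.group(1).strip(), "args": m.group(2).strip()}
        (self.committed if trusted else self.shadow).append(call)
```

In this sketch, the truncated `run_shell` example above would land in `shadow`, never in `committed`, so the write path never sees it.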
3.2 Degradation Paths: How to Advance the Task When Parsing Fails
When parsing fails continuously, the system should not idle, and it certainly shouldn't proceed with write-type tools. Recommended minimum degradation strategy:
- Enter shadow mode: Only allow read-only tools (ls/cat/grep/status).
- Inject the parse error and the expected format as a one-shot example back into the model.
- Exceeding a threshold triggers a circuit breaker, requesting human intervention or handing off to a dedicated agent.
All three of these actions must be written to the audit chain (observation/auditing); otherwise, you will not be able to review why it "suddenly stopped executing."
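The three-step strategy above can be sketched as a small policy object (the threshold and the read-only tool list are illustrative assumptions, not fixed values):

```python
from dataclasses import dataclass, field

READ_ONLY_TOOLS = {"ls", "cat", "grep", "status"}  # shadow-mode allowlist

@dataclass
class DegradationPolicy:
    """Minimal sketch of shadow mode + circuit breaker + audit chain."""
    failure_threshold: int = 3
    parse_failures: int = 0
    audit_log: list = field(default_factory=list)

    def on_parse_failure(self, error: str) -> str:
        self.parse_failures += 1
        self.audit_log.append({"event": "parse_failure", "error": error})
        if self.parse_failures >= self.failure_threshold:
            # Circuit breaker: hand off to a human or a dedicated agent.
            self.audit_log.append({"event": "circuit_breaker"})
            return "circuit_breaker"
        return "shadow_mode"  # retry, but with read-only tools only

    def is_tool_allowed(self, tool_name: str) -> bool:
        # In shadow mode only read-only tools pass; write tools deny-by-default.
        if self.parse_failures > 0:
            return tool_name in READ_ONLY_TOOLS
        return True
```

Every decision appends to `audit_log`, which is exactly what lets you later reconstruct why the agent "suddenly stopped executing."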
3.3 Input Validation and Guardrails: There Must Be a Second Door Behind the Parser
The parser solves the problem of "picking structure out of text." The security layer solves the problem of "whether this structure can be executed."
It is recommended to have at least three lines of defense:
- Allowlist: Only allow a small subset of tool names (write-type tools disabled by default).
- Parameter validation: Enforce length/character set/dangerous pattern constraints on paths, URLs, and commands.
- Auditing and replay: Record the source and context hash of every tool call to facilitate accountability and regression testing.
Such "input validation and guardrails" are explicitly recommended for defense in depth in agentic systems.
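The first two lines of defense can be sketched as a single gate function (the allowlist contents and dangerous patterns below are illustrative, not a complete rule set):

```python
import re

ALLOWED_TOOLS = {"list_dir", "read_file", "grep"}  # write tools off by default
DANGEROUS_PATTERNS = [r"rm\s+-rf", r"\.\./", r";", r"\|\s*sh\b"]
MAX_ARG_LENGTH = 4096

def second_door(call: dict) -> tuple:
    """Allowlist + parameter validation behind the parser.

    Returns (allowed, reason). Successful parsing got the call this far;
    this gate decides whether it may actually execute.
    """
    name = call.get("name", "")
    args = str(call.get("args", ""))
    if name not in ALLOWED_TOOLS:
        return False, f"tool '{name}' not in allowlist"
    if len(args) > MAX_ARG_LENGTH:
        return False, "args exceed length limit"
    for pat in DANGEROUS_PATTERNS:
        if re.search(pat, args):
            return False, f"args match dangerous pattern {pat!r}"
    return True, "ok"
```

The third line of defense, auditing and replay, is the same audit-chain idea as in the degradation section: log the source and context hash of every call that passes this gate.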
4. Error Correction Loop: How Do You Educate a "Disobedient" Model?
If the parser fails completely (e.g., the model outputs a string of unrecognizable gibberish), you absolutely cannot let the program idle.
The Geek's Fallback Strategy:
- Inject a Punishing Observation: `[PARSE_ERROR]: I completely failed to understand the format you just output. Please re-check the format requirements in the System Prompt and output strictly according to the <tool_call> tags!`
- Forced Degradation: If parsing fails 2 consecutive times, the system should actively modify the current System Message, adding an extremely simple one-shot example, forcibly pulling the model's probability distribution back on track.

Furthermore: when the system enters a continuous failure state, the Runner should hard-lock write-type tools, allowing only read-only tools to gather evidence, until parsing and validation stably recover.
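The loop can be sketched as follows, assuming OpenAI-style role/content message dicts; `build_correction_messages` and the one-shot example text are hypothetical, for illustration only:

```python
def build_correction_messages(history: list, failure_count: int) -> list:
    """Inject a punishing observation and, on repeated failure, a one-shot example.

    The parse error is injected as a user-role observation here for
    simplicity; some stacks use a dedicated tool/observation role instead.
    """
    messages = list(history)
    messages.append({
        "role": "user",
        "content": "[PARSE_ERROR]: I completely failed to understand the format "
                   "you just output. Re-check the format requirements and output "
                   "strictly according to the <tool_call> tags.",
    })
    if failure_count >= 2:
        # Forced degradation: prepend a dead-simple one-shot example to pull
        # the model's probability distribution back on track.
        one_shot = (
            "\n\nExample of the ONLY acceptable format:\n"
            "<tool_call><name>list_dir</name>"
            '<args>{"path": "."}</args></tool_call>'
        )
        messages.insert(0, {
            "role": "system",
            "content": "You are a tool-calling agent." + one_shot,
        })
    return messages
```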
Chapter Summary
- Text is Communication: Don't blindly trust SDKs. At the Agent level, network transmission is merely a byte stream; you must possess the ability to dig logic out of a quagmire of bytes.
- Robustness Over Elegance: In offline small-model scenarios, a parser covered in regexes and seemingly "inelegant" can often save your Agent's life.
- XML is an Excellent Relay Format: Especially when handling parameters containing code snippets.
Having mastered the art of text salvaging, your Agent no longer relies on top-tier commercial APIs. It can run robustly on an old laptop with 5GB of VRAM, using an open-source Llama 3 to execute code and manage files.
In the next chapter, we will completely rip off the operating system's veil: [Terminal Hijacking and PTY: How Does an Agent Disguise Itself as a Human to Take Over Your iTerm2?]. We are going to start writing C code to control pseudo-terminals!
(End of text - Deep Dive Series 16 / Approx. 1600 words)
(Note: It is recommended to use the FuzzyToolParser from this chapter in conjunction with a pre-commit hook to periodically scan extreme text cases in your test suite.)
Reference Materials (For Verification)
- Anthropic streaming messages: https://docs.anthropic.com/claude/reference/messages-streaming
- IPI in the wild: https://arxiv.org/abs/2601.07072
- Prompt injection best practices (AWS): https://docs.aws.amazon.com/pdfs/prescriptive-guidance/latest/llm-prompt-engineering-best-practices/llm-prompt-engineering-best-practices.pdf