Precision Information Quotas: Dynamic Context Assembly and Sliding Window Compression
What
Dynamic Context Assembly (DCA) is not about "cramming in a bit more text." It is a fundamental runtime component: it treats the context window as a schedulable resource, deciding prior to every model invocation "what information must appear, in what order, and in what format (raw text / skeleton / summary / retrieval fragments)."
This article deconstructs DCA into implementable engineering modules and explains exactly why simply relying on "longer context windows" does not automatically resolve context-management failures.
Problem
Without structured DCA, you will crash into three classic failure modes:
- Token Explosions: Rules, tool definitions, historical dialogue, source code, logs, and RAG outputs are piled together. A single request instantly hits the maximum token limit.
- Attention Degradation: Even if you stay within the token limit, ultra-long contexts suffer from the phenomenon where "critical information in the middle is significantly harder to leverage," leading to severe reasoning degradation.
- Irreproducibility: You have no idea why the model "suddenly became stupid" on turn #5, because you have zero observability into exactly what context was assembled and fed into that specific round.
Point #2 is not anecdotal. Both academic research and empirical practice highlight the "Lost in the Middle" phenomenon: When relevant information appears in the middle of the input, models struggle to extract it; performance peaks when information sits at the very beginning or the very end of the prompt. You cannot expect "cramming more" to automatically yield better results. You must architect the structure and positional strategy of the context. Reference: https://arxiv.org/abs/2307.03172
Principle
1) Layer the Context, Do Not Flatten It
A maintainable DCA architecture typically isolates context into at least four strict layers:
- Stable Prefix (Rules/Tools): Workspace rules, tool schemas, permission boundaries, output constraints. Keep this highly stable to avoid token drift across turns.
- Active Scene (Workspace/Task): The specific file fragment currently being edited, active stack traces, test failures, and raw command outputs. This is the "physical crime scene" and holds the highest priority.
- Immediate Vicinity (Recent Turns): Retains the exact text of the last few turns, preserving immediate context and short-term feedback loops.
- Long-Term Memory (Summary/RAG): Structured summaries of older history, retrieved fragments, and semantic index results.
[!WARNING] The "ratio" between these layers is never static. The industrial approach implements each layer as an independent module and uses a budget scheduler to dynamically adjust each layer's share of the window based on the task at hand.
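As a rough sketch of what such a scheduler can look like: the layer names match the four layers above, but the ratios and task types are illustrative placeholders, not recommendations.

# Hypothetical per-layer budget scheduler. Ratios are illustrative defaults,
# re-weighted per task type before each assembly round.
DEFAULT_RATIOS = {"rules": 0.10, "workspace": 0.45, "recent": 0.25, "summary": 0.10, "retrieval": 0.10}

TASK_OVERRIDES = {
    # Debugging leans on the live workspace; research leans on retrieval.
    "debug":    {"workspace": 0.55, "retrieval": 0.05},
    "research": {"workspace": 0.25, "retrieval": 0.30},
}

def layer_budgets(max_tokens: int, task_type: str) -> dict[str, int]:
    ratios = {**DEFAULT_RATIOS, **TASK_OVERRIDES.get(task_type, {})}
    total = sum(ratios.values())
    # Normalize so overrides cannot push the sum past 1.0, then convert to token budgets.
    return {layer: int(max_tokens * r / total) for layer, r in ratios.items()}

The point is not these particular numbers; it is that the per-layer budget becomes an explicit, logged input to assembly rather than an accident of whatever happened to fit.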
2) Provide Skeletons First, Hydrate on Demand
Feeding a raw 5,000-line file into the model is usually a catastrophic waste of tokens and attention span. The robust engineering approach utilizes a two-phase pipeline:
- Skeleton View: Provide only class/function signatures, docstrings, imports, and critical constants. Allow the model to acquire the API topology first.
- Hydration: When the model explicitly declares, "I need to modify function X," then dynamically inject the function body into the active scene.
Generating skeletons doesn't require inventing new AI tools. Standard implementations use Tree-sitter or native language parsers to extract the AST structure. The principle is: "Let the model understand the structure before diving into the details."
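As a minimal sketch for Python sources, the standard ast module is enough (Tree-sitter is the general-purpose option across languages). The build_skeleton helper below is a hypothetical name; it keeps imports, top-level constants, signatures, and first docstring lines, and drops bodies. Decorators and async functions are omitted for brevity.

import ast

def build_skeleton(source: str) -> str:
    """Reduce a Python module to imports, constants, signatures, and docstrings."""
    out = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom, ast.Assign, ast.AnnAssign)):
            out.append(ast.unparse(node))  # imports and module-level constants, verbatim
        elif isinstance(node, ast.FunctionDef):
            out.append(f"def {node.name}({ast.unparse(node.args)}): ...")
        elif isinstance(node, ast.ClassDef):
            out.append(f"class {node.name}:")
            doc = ast.get_docstring(node)
            if doc:
                out.append(f'    """{doc.splitlines()[0]}"""')
            for item in node.body:  # keep method signatures only, never bodies
                if isinstance(item, ast.FunctionDef):
                    out.append(f"    def {item.name}({ast.unparse(item.args)}): ...")
    return "\n".join(out)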
3) Compression is Not Deletion; It is Forging Executable Facts
"Summarization" is not making paragraphs shorter. It is the process of transmuting historical dialogue into:
- Verified Facts: e.g., Confirmed paths, environment configs, established interface contracts.
- Executed Actions: e.g., Which files have already been modified, what verifications have run.
- Critical Decision Nodes: e.g., Why Solution A was chosen over Solution B.
- Unresolved Blockers: Points requiring external information or human-in-the-loop decisions.
Only compressed content in this exact format can be leveraged by the runtime to drive subsequent actions, rather than just acting as passive background noise.
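A sketch of what that "exact format" can look like as a schema; the class and field names are illustrative, and the render method shows how the runtime can re-inject the compressed history as a single structured block instead of prose.

from dataclasses import dataclass, field

@dataclass
class CompressedHistory:
    # Hypothetical schema for compressed history; field names are illustrative.
    verified_facts: list[str] = field(default_factory=list)    # "config lives at ./conf/app.yaml"
    executed_actions: list[str] = field(default_factory=list)  # "patched parser.py; ran the parser tests"
    decisions: list[str] = field(default_factory=list)         # "chose retry-with-backoff over queueing because ..."
    open_blockers: list[str] = field(default_factory=list)     # "need prod credentials to reproduce"

    def render(self) -> str:
        """Render as a context block the runtime can re-inject verbatim."""
        sections = [
            ("Verified facts", self.verified_facts),
            ("Executed actions", self.executed_actions),
            ("Decisions", self.decisions),
            ("Unresolved blockers", self.open_blockers),
        ]
        return "\n".join(
            f"## {title}\n" + "\n".join(f"- {item}" for item in items)
            for title, items in sections if items
        )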
Usage
Below is a minimal, executable DCA pipeline skeleton. The value here is not the specific Python syntax, but the interfaces and the auditing requirements.
1) Data Structure: Treat Context as Blocks
from dataclasses import dataclass
from typing import Literal

ContextKind = Literal["rules", "workspace", "recent", "summary", "retrieval"]

@dataclass(frozen=True)
class ContextBlock:
    kind: ContextKind
    title: str
    content: str
    # Observability: Is this raw text, a skeleton, a summary, or RAG output?
    form: Literal["raw", "skeleton", "summary", "retrieved"]
    # Budgeting: The estimated token weight for this block
    budget_hint_tokens: int
2) The Assembler: Sequence First, Prune Second
class ContextAssembler:
    """
    Dynamic Context Assembler:
    Responsible for assembling diverse information blocks into a highly governed prompt.
    """

    def __init__(self, token_counter):
        self._token_counter = token_counter

    def assemble(self, *, rules, workspace, recent, summary, retrieval, max_tokens: int):
        blocks = []
        blocks += rules
        blocks += workspace
        blocks += recent
        blocks += retrieval
        blocks += summary

        # CRITICAL: Record the "Physical Evidence" of the assembly plan per round.
        # Without this, post-mortem debugging is impossible.
        debug_plan = [(b.kind, b.form, b.title, b.budget_hint_tokens) for b in blocks]

        packed = []
        used = 0
        for b in blocks:
            cost = self._token_counter.estimate(b.content)
            if used + cost > max_tokens:
                # Minimum Viable Strategy: Prune summary/retrieval first, then recent,
                # and finally non-critical parts of the workspace.
                # (In this greedy pass, later layers, i.e. retrieval and summary,
                # are naturally the first to be dropped.)
                continue
            packed.append(b)
            used += cost
        return packed, {"used_tokens_est": used, "plan": debug_plan}
This code is deliberately "primitive." The true engineering requirement is that you must generate an observable assembly plan. When a model hallucinates on turn 7, you must be able to definitively answer, "What exactly was fed to it on turn 7?"
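A usage sketch, assuming a naive length-based token counter (swap in your model's real tokenizer) and toy block contents; everything here builds only on the ContextBlock and ContextAssembler defined above.

class NaiveTokenCounter:
    # Rough stand-in; production systems should estimate with the model's actual tokenizer.
    def estimate(self, text: str) -> int:
        return max(1, len(text) // 4)

assembler = ContextAssembler(token_counter=NaiveTokenCounter())

packed, report = assembler.assemble(
    rules=[ContextBlock("rules", "workspace rules", "Only edit files under src/.", "raw", 50)],
    workspace=[ContextBlock("workspace", "parser.py (skeleton)", "def parse(text): ...", "skeleton", 60)],
    recent=[ContextBlock("recent", "last turn", "User: the parser crashes on empty input.", "raw", 40)],
    summary=[ContextBlock("summary", "history digest", "Verified: config lives in conf/app.yaml.", "summary", 30)],
    retrieval=[ContextBlock("retrieval", "docs: parse()", "parse() raises ValueError on empty strings.", "retrieved", 30)],
    max_tokens=8000,
)

prompt = "\n\n".join(f"# {b.title}\n{b.content}" for b in packed)
# Persist report["plan"] and report["used_tokens_est"] alongside the model call for later replay.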
3) Dependency Tracking: Turning "References" into Context Radii
When the model processes File A, and File A imports classes/functions from File B, the assembler must possess the capability to "expand the radius on demand":
- Parse imports in A.
- Inject the Skeleton of B.
- If the model needs to modify B, hydrate the local implementation of B.
This is vastly more deterministic and controllable than "dumping the entire repository into the prompt."
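A sketch of that expand-on-demand loop for Python sources, reusing the hypothetical build_skeleton helper and the ContextBlock type from above; resolve_module is a placeholder for your own module-to-file resolution, and hydrated tracks files whose full bodies have already been injected.

import ast
from pathlib import Path

def expand_radius(file_path: Path, resolve_module, hydrated: set[str]) -> list[ContextBlock]:
    """Inject skeletons of files imported by `file_path`; callers hydrate bodies on explicit request."""
    blocks = []
    tree = ast.parse(file_path.read_text())
    for node in ast.walk(tree):
        modules = []
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        for module in modules:
            dep = resolve_module(module)  # -> Path | None; only project-local files are expanded
            if dep is None or str(dep) in hydrated:
                continue
            blocks.append(ContextBlock(
                kind="workspace",
                title=f"{dep} (skeleton)",
                content=build_skeleton(dep.read_text()),
                form="skeleton",
                budget_hint_tokens=400,
            ))
    return blocks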
Pitfall
- Blindly Chasing Completeness: Believing "more context equals more accuracy." In reality, it simply drowns out the critical signals.
- Summaries Losing Decision Nodes: Retaining the narrative but stripping out the "Why." This causes the model to repeatedly walk down dead ends in subsequent turns.
- Retrieval Noise Pollution: Injecting too many disparate RAG fragments will forcefully derail the model's reasoning. Retrieval must be highly explicable and parameter-tunable.
- Missing Assembly Logs: Without them, post-mortems and A/B testing are impossible. You are reduced to adjusting prompts purely on "feel."
Debug
Treat DCA as a module that demands "Unit Testing" and "Replayability":
- Feed fixed rules/scene/history into the same task to verify assembly output stability.
- Log the plan and used_tokens_est for every assembly, and flush them to the trace on failed rounds.
- Replay Failed Rounds: Replace exactly one layer (e.g., retrieval) and observe the variance in model output.
Source
- Lost in the Middle: How Language Models Use Long Contexts: https://arxiv.org/abs/2307.03172
Metrics and Acceptance (Proving the DCA is Actually Better)
A DCA without metrics is just "a different way of writing code." You need at least three categories of metrics:
1) Cost and Capacity
- Delta between prompt_tokens_est (estimated) and prompt_tokens_actual (actual).
- retrieval_tokens ratio (Exactly how much of your budget is being eaten by RAG fragments?).
- Total context blocks per round and average block size (Too many tiny blocks inherently create attention noise).
2) Quality and Stability
- Task Success Rate (Did it ultimately resolve the issue?).
- First-Pass Success Rate (Success without relying on retry loops).
- Retry Distribution (Are the retries actually making progress or spinning in circles?).
- Critical Fact Drop Rate (Verifiable via "Replay Benchmarks").
3) Explicability
- Is the assembly plan replayable per round? (Does the same input deterministically yield the same plan?).
- In a failed round, can you isolate the fault to a specific layer? (e.g., Retrieval noise, Summary dropped a decision node, Workspace was truncated).
Positional Strategy (Engineering Around the "Lost in the Middle" Phenomenon)
The pragmatic engineering takeaway from "Lost in the Middle" is this: You cannot simply care about whether the text is in the prompt; you must engineer where it sits and in what format it appears.
Actionable positional strategies:
- Anchor Critical Constraints: Rules and Tool schemas sit at the absolute beginning (or end, depending on model bias), but they must be anchored and stable.
- Place Critical Evidence at the Edges: The most vital error stack traces or localized file chunks for the immediate task must be pushed to the very beginning or very end of the input.
- Treat the Middle Zone as an Index: Fill the middle with Skeletons, Table of Contents, Summaries, and critical line-number indices. Do not drown your most vital evidence in the middle of a 30k token blob.
You won't get it perfect on day one. The crucial step is making "position" a tunable parameter and recording its state in the assembly log.
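One way to make position a tunable, logged parameter is to declare the edge/middle split as an explicit layout policy over the block kinds from the assembler above; the layout names and assignments below are illustrative.

# Hypothetical layout policies: which layers are pinned to the edges and which
# are parked in the middle "index zone." Record the chosen layout in the assembly log.
LAYOUTS = {
    "evidence_last":  {"head": ["rules"], "middle": ["summary", "retrieval", "recent"], "tail": ["workspace"]},
    "evidence_first": {"head": ["rules", "workspace"], "middle": ["summary", "retrieval"], "tail": ["recent"]},
}

def order_blocks(blocks, layout_name: str):
    layout = LAYOUTS[layout_name]
    rank = {kind: i for i, kind in enumerate(layout["head"] + layout["middle"] + layout["tail"])}
    return sorted(blocks, key=lambda b: rank.get(b.kind, len(rank)))

Because the layout name travels with the assembly plan, an A/B test of "evidence at the end vs. evidence at the start" becomes a one-line configuration change instead of a prompt rewrite.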
RAG Denoising (Making Retrieval Help, Not Derail)
The most catastrophic failure in retrieval engines isn't "failing to find data"; it is "retrieving too much highly-similar but fundamentally irrelevant data."
Practical denoising strategies:
- Hard-Cap Recall Volume: It is infinitely better to provide less data than to inject 10 near-identical fragments that dilute the prompt.
- Enforce Deduplication: If similarity scores are overwhelmingly high, retain only the highest-fidelity fragment.
- Mandate Attribution: Every retrieval fragment must be accompanied by provenance (File Path / URL / Title / Paragraph ID).
- Enforce Confidence Thresholds: If retrieval scores fall below a strict threshold, do not force the data in. It is far safer to let the model explicitly request more information.
The common denominator: Upgrade retrieval from "automatic text appending" to "Injecting Context Blocks with Evidence," ensuring it is explicable, replayable, and tunable.
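These rules compose into a simple post-retrieval filter. A sketch, assuming each candidate carries a score, a provenance string, and its text (the attribute names and thresholds are illustrative):

from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class RetrievedChunk:
    text: str
    score: float      # retriever similarity / confidence
    provenance: str   # file path / URL / title / paragraph ID

def denoise(candidates: list[RetrievedChunk], *, max_chunks: int = 3,
            min_score: float = 0.45, dedup_ratio: float = 0.9) -> list[RetrievedChunk]:
    kept: list[RetrievedChunk] = []
    for chunk in sorted(candidates, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break          # confidence threshold: do not force weak hits into the prompt
        if not chunk.provenance:
            continue       # no attribution, no injection
        if any(SequenceMatcher(None, chunk.text, k.text).ratio() > dedup_ratio for k in kept):
            continue       # near-duplicate of a fragment already kept
        kept.append(chunk)
        if len(kept) == max_chunks:
            break          # hard cap on recall volume
    return kept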
Replay Benchmarking (Turning DCA into a Testable Module)
DCA implementations easily fall into the trap of "we changed a lot of code, but we can't mathematically prove it's better." The brutal, pragmatic approach is to construct a suite of replayable test cases:
- Fixed Inputs: The exact same rules, workspace, recent, and retrieval candidates.
- Fixed Evaluation Goals: e.g., "Identify the root cause," "Provide the minimal diff," "Explain the fix."
- Isolate Single Variables: E.g., alter only the assembly order, only the retrieval cap, or only the summary format.
You do not need a massive, globally-scaled benchmark framework on day one. Achieving just two things makes your system incredibly robust:
- Every assembly outputs a plan log, proving exactly what was fed to the model at that specific millisecond.
- Failed rounds can be entirely replayed off-line, allowing you to swap out individual layers to rapidly identify which module injected the fatal noise.
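A replay-case sketch along those lines, assuming each recorded round is persisted as a case.json containing the raw layer inputs and the token budget (the file layout and function name are illustrative; it builds on the ContextBlock and assembler defined earlier):

import json
from pathlib import Path

def replay(case_dir: Path, assembler, override_layer: str | None = None, override_blocks: list | None = None):
    """Re-run one recorded assembly; optionally swap exactly one layer."""
    case = json.loads((case_dir / "case.json").read_text())
    # case["layers"] holds rules/workspace/recent/summary/retrieval as lists of block dicts.
    layers = {name: [ContextBlock(**b) for b in blocks] for name, blocks in case["layers"].items()}
    if override_layer is not None:
        layers[override_layer] = override_blocks or []
    return assembler.assemble(**layers, max_tokens=case["max_tokens"])

# Determinism check: the same recorded inputs must yield the same plan.
#   _, first = replay(Path("cases/turn_07"), assembler)
#   _, second = replay(Path("cases/turn_07"), assembler)
#   assert first["plan"] == second["plan"]
# Fault isolation: swap only the retrieval layer and diff the model's output offline.
#   packed, report = replay(Path("cases/turn_07"), assembler, override_layer="retrieval", override_blocks=[])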
When you can replay and pinpoint faults, DCA graduates from "prompt engineering voodoo" to a rigorously iterable software module.