
Hierarchy and Decentralization: The Balancing Act of the Manager-Worker Architecture


What (What this article is about)

Manager-Worker is not "booting up multiple agents to chat with each other"; it is an executable orchestration system. It breaks a long-running task into tickets, dispatches those tickets to single-responsibility workers, and uses hard gates (timeouts, retries, idempotency, permissions, isolation, audits, observability, rollbacks, and degradation) to keep the system from spiraling out of control over long runs.

This article breaks Manager-Worker down into three engineering artifacts:

Ticket Protocol: Turning ambiguous goals into executable steps.
Supervision: Workers will fail; the system must isolate failures automatically and recover.
Governance: Every side effect must have a write-ahead log (WAL) entry and an idempotency key.

Problem (The Engineering Problem to Solve)

Monolithic agents hit two classes of hard failure on complex tasks:

Context Collapse: Wide task scopes, large file counts, and frequent errors overflow the context window.
Control Loop Collapse: A step fails, triggers unbounded retries, and ends in a token storm and resource leaks (Retries, Timeouts, Resource Release).

The core value of Manager-Worker is breaking complexity into controllable sub-problems, but it introduces new risks of its own:

Concurrency Conflicts: Multiple workers writing to the same resource (Concurrency).
Retry Side Effects: Worker retries triggering duplicate commits (Idempotency).
Attribution Nightmares: Who did what, why did they do it, and who authorized it? (Audit, Observability).

Therefore, the core of this chapter is not "division of labor," but "turning division of labor into a governable system."

Principle (System Structure: Ticketing System + Supervision Tree + Gate Layer)

1) Ticketing System: Turning Language into Protocol

The manager's input to the worker MUST be structured. Otherwise, you achieve nothing beyond "forwarding vague prompts to a different model."

A recommended minimal ticket contains:

task_id: Globally unique, threaded through traces and logs (Audit, Observability).
role: Worker type (coder/tester/reviewer/...).
context_pointers: Specific locations of the files/directories/logs the worker needs to read (avoids stuffing the full context).
success_criteria: Clear acceptance conditions (e.g., unit tests must pass).
constraints: Hard ceilings on timeouts/retries/permission scopes (Timeouts, Retries, Permissions).
idempotency_key: Mandatory if the task might produce side effects (Idempotency).

2) Supervision Tree: Failure Isolation and Auto-Recovery

Worker failure is the baseline reality. An engineered system should not try to "prevent failures"; it must guarantee that:

Failures do not spread globally.
Failures can be recovered from automatically or degraded gracefully.
Failures leave a durable evidence trail (Audit).

Erlang's supervision tree design principles serve as the classic reference: treat failure as a normal path, use supervision strategies for isolation and recovery, and enforce backoffs and restart ceilings to avoid self-reinforcing restart storms. Reference: https://www.erlang.org/doc/design_principles/des_princ
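The backoff-plus-ceiling idea can be sketched in Python as follows. This is a minimal illustration loosely modeled on the restart intensity and period limits of Erlang supervisors, not a port of any Erlang API; the names `Supervisor`, `RestartBudgetExceeded`, `max_restarts`, and `period_s` are hypothetical, and the clock and sleep functions are injectable only to make the sketch testable.

```python
import time
from collections import deque


class RestartBudgetExceeded(Exception):
    """Raised when a worker fails too often within the sliding window."""


class Supervisor:
    """Hypothetical supervisor: restart with backoff, give up past a ceiling."""

    def __init__(self, max_restarts, period_s, base_backoff_s=0.5,
                 max_backoff_s=30.0, clock=time.monotonic, sleep=time.sleep):
        self.max_restarts = max_restarts
        self.period_s = period_s
        self.base_backoff_s = base_backoff_s
        self.max_backoff_s = max_backoff_s
        self._clock = clock
        self._sleep = sleep
        self._restarts = deque()  # timestamps of recent restarts

    def run(self, worker):
        attempt = 0
        while True:
            try:
                return worker()
            except Exception as exc:
                now = self._clock()
                # Forget restarts that fell out of the sliding window.
                while self._restarts and now - self._restarts[0] > self.period_s:
                    self._restarts.popleft()
                if len(self._restarts) >= self.max_restarts:
                    # Ceiling reached: escalate instead of restarting forever.
                    raise RestartBudgetExceeded(
                        f"{self.max_restarts} restarts within {self.period_s}s"
                    ) from exc
                self._restarts.append(now)
                # Exponential backoff, capped, to avoid a restart storm.
                delay = min(self.max_backoff_s, self.base_backoff_s * 2 ** attempt)
                attempt += 1
                self._sleep(delay)
```

When the budget is exhausted the exception propagates to the caller, which is where a manager would mark the ticket as blocked or degrade to diagnostics rather than keep restarting.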

3) Durable Execution: Long Tasks Must Be Interruptible and Recoverable

Long tasks will inevitably be interrupted (network drops, model rate limits, machine reboots). Orchestration must therefore be checkpointable and recoverable. LangGraph's durable execution documentation points in a clear engineering direction: persist execution state, and support resuming execution from it. Reference: https://docs.langchain.com/oss/python/langgraph/durable-execution

Usage (How to Use: A Shippable Manager-Worker Orchestration Skeleton)

1) Ticket Schema (Example)

{
  "task_id": "t-20260421-001",
  "role": "coder",
  "context_pointers": ["file:src/foo.ts", "log:test-run-123"],
  "success_criteria": ["unit_tests_pass", "no_lint_errors"],
  "constraints": {"timeout_ms": 600000, "max_retries": 2},
  "permission_scope": {"fs_write": ["src/"], "net": "deny"},
  "idempotency_key": "idem-t-20260421-001-step-2"
}
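A ticket like the one above is only useful if malformed tickets are rejected before dispatch. The following is a minimal validation sketch, assuming the field names from the example; the `Ticket` dataclass, `parse_ticket`, and the rule that a ticket with a non-empty `fs_write` scope must carry an `idempotency_key` are illustrative assumptions, not part of any existing library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ticket:
    task_id: str
    role: str
    context_pointers: tuple
    success_criteria: tuple
    timeout_ms: int
    max_retries: int
    permission_scope: dict
    idempotency_key: Optional[str] = None


def parse_ticket(raw: dict) -> Ticket:
    """Validate a raw ticket dict and return a typed Ticket, or raise ValueError."""
    required = ("task_id", "role", "context_pointers", "success_criteria", "constraints")
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError(f"ticket missing fields: {missing}")

    constraints = raw["constraints"]
    if constraints.get("timeout_ms", 0) <= 0 or constraints.get("max_retries", -1) < 0:
        raise ValueError("constraints need timeout_ms > 0 and max_retries >= 0")
    if not raw["success_criteria"]:
        raise ValueError("ticket needs at least one success criterion")

    scope = raw.get("permission_scope", {})
    # Assumed rule: a ticket that is allowed to write must be idempotent.
    if scope.get("fs_write") and not raw.get("idempotency_key"):
        raise ValueError("side-effecting ticket requires idempotency_key")

    return Ticket(
        task_id=raw["task_id"],
        role=raw["role"],
        context_pointers=tuple(raw["context_pointers"]),
        success_criteria=tuple(raw["success_criteria"]),
        timeout_ms=constraints["timeout_ms"],
        max_retries=constraints["max_retries"],
        permission_scope=scope,
        idempotency_key=raw.get("idempotency_key"),
    )
```

Rejecting a ticket at parse time is cheaper than discovering a missing timeout or retry ceiling after a worker has already started.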

2) Worker Output MUST Be Structured (Otherwise Unauditable)

A worker should not dump 1000 lines of raw logs back at the manager. It should return:

Manifest of Changes: Which files were modified, and the patch_id.
Verification Results: Which commands ran, whether they timed out, and failure reason tags.
Risk Signals: Were retries triggered? Were permission denials encountered? (Timeouts, Retries, Permissions).

3) Unified Governance: Funneling Side Effects into a Single Commit Layer

The easiest way to wreck a run is to let workers write to the filesystem on their own. A far more reliable approach is:

Workers produce only patches.
The manager (or orchestrator) acts as the exclusive committer, performing:
- Schema validation
- Permission validation
- Idempotency key generation and WAL recording (Idempotency, Audit)
- Commit and rollback strategies (Rollback, Degradation)

This way, "multi-worker concurrency" never degenerates into "unrestricted free-for-all writes."
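The single commit layer described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Committer` class, the patch shape (`task_id`, `idempotency_key`, `changes`), and the JSON-lines WAL format are all hypothetical, the workspace is an in-memory dict standing in for real files, and the prefix check is a simplistic placeholder for real permission validation.

```python
import json
from pathlib import Path


class Committer:
    """Single writer: checks a patch, logs intent to the WAL, then applies it."""

    def __init__(self, wal_path, allowed_prefixes):
        self.wal_path = Path(wal_path)
        self.allowed_prefixes = tuple(allowed_prefixes)
        self.files = {}  # in-memory stand-in for the real workspace
        self._committed = self._replay()

    def _replay(self):
        """Rebuild the set of already-committed idempotency keys from the WAL."""
        keys = set()
        if self.wal_path.exists():
            for line in self.wal_path.read_text().splitlines():
                record = json.loads(line)
                if record["status"] == "committed":
                    keys.add(record["idempotency_key"])
        return keys

    def _log(self, patch, status):
        record = {
            "task_id": patch["task_id"],
            "idempotency_key": patch["idempotency_key"],
            "resources": sorted(patch["changes"]),
            "status": status,
        }
        with self.wal_path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def commit(self, patch):
        # A retried patch with a known key is a no-op, not a second write.
        if patch["idempotency_key"] in self._committed:
            return "duplicate"
        # Permission check: every touched path must sit under an allowed prefix.
        if not all(p.startswith(self.allowed_prefixes) for p in patch["changes"]):
            self._log(patch, "rejected")
            return "rejected"
        self._log(patch, "intent")     # write-ahead: log before applying
        self.files.update(patch["changes"])
        self._log(patch, "committed")
        self._committed.add(patch["idempotency_key"])
        return "committed"
```

Because the committed keys are rebuilt from the WAL on startup, a committer that restarts mid-task still recognizes a replayed patch as a duplicate. An "intent" record without a matching "committed" record marks a commit that needs rollback or inspection.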

4) Retry Strategies: Retries Are Not a Free Lunch

Retries must be strictly controlled:

Every ticket has a max_retries ceiling (Retries).
Every retry must record a retry_reason (Observability).
Steps with side effects MUST use an idempotency_key (Idempotency).
Exceeding the threshold forces degradation to read-only diagnostics or human intervention (Degradation).

Task queue systems (e.g., Celery) offer extensive, mature guidance on ack/retry semantics: "at-least-once delivery" means you must implement idempotency and deduplication at the business layer. Reference: https://docs.celeryq.dev/en/stable/
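The retry rules above can be sketched with the standard library's `asyncio.wait_for`. The function name `dispatch_with_retry`, the `RetriesExhausted` exception, the shape of the recorded reasons, and the backoff parameter are assumptions made for illustration; `run_worker` stands for whatever coroutine actually executes a ticket.

```python
import asyncio


class RetriesExhausted(Exception):
    """Raised once the retry ceiling is hit; carries every recorded reason."""

    def __init__(self, reasons):
        super().__init__(f"retries exhausted: {reasons}")
        self.reasons = reasons


async def dispatch_with_retry(run_worker, ticket, timeout_ms, max_retries,
                              base_delay_s=0.1):
    """Run `run_worker(ticket)` with a per-attempt timeout and a hard retry ceiling."""
    reasons = []
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            return await asyncio.wait_for(run_worker(ticket), timeout=timeout_ms / 1000)
        except asyncio.TimeoutError:
            reasons.append({"attempt": attempt, "retry_reason": "timeout"})
        except Exception as exc:
            reasons.append({"attempt": attempt, "retry_reason": type(exc).__name__})
        if attempt < max_retries:
            # Back off before the next attempt to avoid hammering a failing dependency.
            await asyncio.sleep(base_delay_s * 2 ** attempt)
    # Ceiling breached: the caller decides whether to degrade or escalate.
    raise RetriesExhausted(reasons)
```

The caller would typically catch `RetriesExhausted`, attach `reasons` to the task's audit trail, and switch the ticket to read-only diagnostics or human review.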

Design (Design Trade-offs: When NOT to Use Manager-Worker)

Manager-worker is not a silver bullet. Avoid forcing it onto these two scenarios:

Very small tasks: The overhead of decomposition outweighs the benefit.
Scenarios requiring strong-consistency commits: Prioritize designing commit protocols and WALs over relying on conversational collaboration.

Its best use case is engineering work where the task scope is very large, but the steps can be cleanly sliced into independent tickets.

Pitfall (Common Traps and Error Prevention)

Manager Over-Micromanagement: Overstuffing the context and reducing the worker to a "parrot" (Degradation).
Unstructured Worker Output: Unauditable and impossible to post-mortem (Audit).
Concurrent Multi-Worker Writes: Conflicts and painful rollbacks (Concurrency, Rollback).
Absence of Idempotency Keys: Retries produce duplicate side effects (Idempotency, Retries).
Absence of Checkpoints: An interrupted long task has to restart from scratch (Timeouts, Resource Release).

Debug (How to Troubleshoot Manager-Worker Issues)

Recommended diagnostic sequence:

Inspect the Ticket First: Does the ticket encapsulate success criteria, timeouts, retry ceilings, and permission scopes?
Inspect the Commit Second: Is there an exclusive committer? Is the WAL being written? Are idempotency keys present?
Inspect Observability Third: What is the failure reason distribution? Is it timeouts, permission denials, or raw logic bugs?
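For the third step, a failure-reason distribution can be computed directly from structured worker reports. The sketch below assumes each report is a dict with a `failures` list whose entries carry a `reason` tag; that shape and the tag names in the usage example are illustrative, not prescribed by the article.

```python
from collections import Counter


def failure_distribution(reports):
    """Count failure reason tags across worker reports, most frequent first."""
    tags = Counter()
    for report in reports:
        for failure in report.get("failures", []):
            tags[failure["reason"]] += 1
    return tags.most_common()


reports = [
    {"task_id": "t-1", "failures": [{"reason": "timeout"}]},
    {"task_id": "t-2", "failures": [{"reason": "timeout"}, {"reason": "permission_denied"}]},
    {"task_id": "t-3", "failures": []},
]
print(failure_distribution(reports))
```

A distribution dominated by timeouts points at ticket constraints or worker load, while one dominated by permission denials points at the scopes issued in the tickets.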

An Executable Orchestration State Machine (Grounding "Collaboration" in Recoverable Execution)

Turning manager-worker into a real system depends on giving it a state machine. A minimal recommended state machine:

created: Task instantiated, root trace generated.
planned: Initial ticket manifest generated (plan tickets).
dispatched: Tickets dispatched to workers and placed on task queues.
collecting: Collecting structured worker results.
verifying: Running verification (tests / static analysis).
committing: The exclusive committer writes the WAL and commits side effects (Idempotency, Audit).
completed: Successful termination.
blocked: Requires human intervention or privilege escalation (Degradation).

Every state transition MUST write a checkpoint, so that an interrupted run can be recovered and resumed from the last recorded state (Timeouts, Resource Release).
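A checkpointed state machine of this kind can be sketched in a few lines. The transition table is an assumption: the article lists the states but not the allowed edges, so this sketch assumes a linear happy path in which any non-terminal state may also move to blocked. `TaskState` and the one-JSON-file-per-task checkpoint layout are hypothetical.

```python
import json
from pathlib import Path

# Assumed edges: linear happy path, plus an escape to "blocked" from any live state.
TRANSITIONS = {
    "created": {"planned", "blocked"},
    "planned": {"dispatched", "blocked"},
    "dispatched": {"collecting", "blocked"},
    "collecting": {"verifying", "blocked"},
    "verifying": {"committing", "blocked"},
    "committing": {"completed", "blocked"},
    "completed": set(),
    "blocked": set(),
}


class TaskState:
    """Tracks one task's state and persists a checkpoint on every transition."""

    def __init__(self, task_id, checkpoint_dir):
        self.task_id = task_id
        self.path = Path(checkpoint_dir) / f"{task_id}.json"
        if self.path.exists():
            # Resume: pick up where the previous run left off.
            self.state = json.loads(self.path.read_text())["state"]
        else:
            self.state = "created"
            self._save()

    def _save(self):
        # Write to a temp file, then rename over the old checkpoint,
        # so a reader never sees a half-written file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"task_id": self.task_id, "state": self.state}))
        tmp.replace(self.path)

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self._save()
```

Constructing `TaskState` again with the same `task_id` and directory after a crash resumes from the last persisted state instead of starting over at created.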

Minimal Orchestrator Pseudocode (Emphasizing Timeouts/Retries/Idempotency)

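class Orchestrator:
    """
    The Orchestrator:
    Dispatches tickets, collects structured results, verifies them, and
    writes to the WAL at the single commit node.

    Helper methods (_plan, _dispatch_with_retry, _verify, _checkpoint, ...)
    and self._wal are left abstract in this sketch.
    """

    async def run(self, task):
        # Resume from the last checkpoint if one exists; otherwise start fresh.
        state = await self._load_or_create_state(task)
        tickets = await self._plan(task)
        await self._checkpoint(state, "planned")

        results = []
        for t in tickets:
            # Every dispatch carries the ticket's own timeout and retry ceiling.
            r = await self._dispatch_with_retry(
                t,
                timeout_ms=t.constraints.timeout_ms,
                max_retries=t.constraints.max_retries,
            )
            results.append(r)
        await self._checkpoint(state, "collecting")

        await self._verify(results)
        await self._checkpoint(state, "verifying")

        # The exclusive commit node: derive the idempotency key, write the WAL, THEN commit.
        # The key should be deterministic for the task so that a resumed run reuses it.
        idem = self._make_idempotency_key(task)
        wal_id = await self._wal.append(task_id=task.id, idempotency_key=idem, resources=self._resources(results))
        await self._commit(results, wal_id=wal_id, idempotency_key=idem)
        await self._checkpoint(state, "completed")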

This pseudocode deliberately highlights three things:

Every dispatch is bound by timeout and retry ceilings (Timeouts, Retries).
There is a single commit node, which avoids concurrent dual writes (Concurrency).
The commit node must write the WAL and generate an idempotency key (Idempotency, Audit).

The Worker's "Report Template" (Preventing Manager Bloat)

The ideal worker report is "structured and navigable," not a wall of text:

changed_files: Manifest of files.
patch_id: Location of the patch.
commands_run: Executed commands and their exit codes.
failures: Failure reason tags and critical lines.
timing: Latency and whether a timeout occurred.

Templating the report keeps the manager's context assembly stable and reusable, which substantially lowers the risk of drift (Degradation).
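The report template can be expressed as a typed structure with a compact summary for the manager's context. The field names follow the list above; the `CommandResult` and `WorkerReport` classes, the split of timing into `duration_ms` and `timed_out`, and the `summary` format are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CommandResult:
    command: str
    exit_code: int


@dataclass
class WorkerReport:
    task_id: str
    changed_files: list
    patch_id: str
    commands_run: list  # list of CommandResult
    failures: list = field(default_factory=list)  # failure reason tags
    duration_ms: int = 0
    timed_out: bool = False

    def summary(self, max_failures=3):
        """Compact, fixed-shape view for the manager's context."""
        failed = [c.command for c in self.commands_run if c.exit_code != 0]
        lines = [
            f"task={self.task_id} patch={self.patch_id} files={len(self.changed_files)}",
            f"commands={len(self.commands_run)} failed={failed}",
            f"duration_ms={self.duration_ms} timed_out={self.timed_out}",
        ]
        # Cap the failure list so one noisy worker cannot bloat the manager's context.
        lines += [f"failure: {tag}" for tag in self.failures[:max_failures]]
        return "\n".join(lines)
```

The manager reads only `summary()`; the full patch and logs stay behind `patch_id` and are fetched on demand.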

Source (Reference Materials)

LangGraph durable execution: https://docs.langchain.com/oss/python/langgraph/durable-execution
Erlang supervision principles: https://www.erlang.org/doc/design_principles/des_princ
Celery docs (retries/acks): https://docs.celeryq.dev/en/stable/
AutoGen paper: https://arxiv.org/abs/2308.08155