Abandoning Code: Anthropic Computer Use and Pure-Vision Pixel Coordinate Translation
What (What this article covers)
The essence of Computer Use is treating the GUI as an observable environment: the model inspects a screenshot and proposes actions (click/type/scroll/wait); the host machine executes the action and captures another screenshot, closing a tight perception-action loop.
This article skips the demo and goes straight to the hard engineering bottlenecks:
- The scaling and resolution constraints of visual inputs.
- Coordinate mapping and error amplification (why the mouse clicks the wrong spot).
- Action verification and infinite-loop prevention (timeouts, retries, degradation).
- Security boundaries: Computer Use is an extremely high-privilege channel that must sit behind zero-trust integration and a kill switch (authorization, isolation, auditing).
The official documentation gives explicit recommendations on screenshot handling, action delays, and action verification; these are the baseline for a reliable Computer Use pipeline. Reference: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool
Problem (The engineering problem to be solved)
The failure modes of Computer Use are highly "physical":
- Miss-clicks: Coordinate mapping errors cause the mouse to click "Cancel" instead of "Confirm".
- Focus Dislocation: Input text dumps into the wrong window/input box (a massive authorization risk).
- Occlusion and Scrolling: Buttons become obscured, or page scrolling shifts the target position.
- Loading State Infinite Loops: The interface remains unchanged after a click, prompting the model to relentlessly click the same spot (retry storm).
- Non-deterministic UIs: Popups, tooltips, and animations render screenshot diffing nearly impossible to judge accurately.
If these problems are not addressed, Computer Use degenerates into a random clicker with an unacceptable blast radius.
Principle (The Perception-Action Closed Loop: Every Step Must Be Verifiable)
The minimal closed loop for Computer Use is:
- capture: Take a screenshot (input).
- decide: The model outputs action candidates (intent).
- enforce: The host machine executes authorization checks and security gates (authorization, isolation).
- act: Execute the physical action (side-effect).
- verify: Verify if the action actually took effect (visual diffs/DOM state/window focus).
- loop: Continue/Degrade/Halt (degradation).
Pay strict attention to this boundary: the model only proposes actions; the host machine decides whether to execute them. Never hand raw execution authority directly to the model.
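The loop above can be sketched as a host-side driver. This is an illustrative skeleton, not the official SDK: `capture`, `decide`, `enforce`, `act`, and `verify` are hypothetical callables you would wire to your own screenshot, model, policy, and input layers.

```python
def run_step(capture, decide, enforce, act, verify, max_retries=3):
    """One perception-action iteration. The host, not the model, gates execution."""
    for _ in range(max_retries):
        frame = capture()            # capture: screenshot (input)
        action = decide(frame)       # decide: model proposes an action (intent)
        if not enforce(action):      # enforce: host-side authorization gate
            return "blocked"
        act(action)                  # act: execute the physical side-effect
        if verify(frame):            # verify: did the action actually take effect?
            return "ok"
    return "degraded"                # loop: finite retries, explicit degradation exit
```

Note that `enforce` sits between `decide` and `act`: a blocked action never reaches the input layer, and a failed `verify` is bounded by `max_retries` rather than looping forever.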
Usage (How to do it: Four Critical Modules)
1) Screenshots and Scaling: Turning the Input into a Controlled Data Plane
You must control three parameters:
- Resolution: too high inflates latency and token cost; too low makes targets indistinguishable.
- Compression strategy: aggressive compression introduces blur that leads directly to targeting errors.
- Consistency: keep resolution and scaling identical within the same environment so runs can be replayed deterministically (observability).
The official documentation recommends choosing an appropriate resolution and handling screenshots carefully; treat these as hard engineering boundary conditions. Reference: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool
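A minimal sketch of a deterministic resize policy: pick one long-edge budget per environment and always scale by the same factor, so replays see identical geometry. The 1280-pixel budget is an illustrative choice, not an official value.

```python
def plan_resize(screen_w: int, screen_h: int, max_long_edge: int = 1280):
    """Return a deterministic (width, height, scale) for screenshots.

    The same input geometry always yields the same output geometry, which
    keeps coordinate mapping and replay tests stable across runs.
    """
    long_edge = max(screen_w, screen_h)
    scale = min(1.0, max_long_edge / long_edge)   # never upscale
    return round(screen_w * scale), round(screen_h * scale), scale
```

Record the returned `scale` alongside every screenshot: without it, coordinates logged today cannot be replayed against a capture made at a different resolution.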
2) Coordinate Mapping: Why Errors Amplify
The standard implementation has the model output normalized coordinates, which the host machine maps to physical pixels. This mapping carries two risks:
- Aspect ratio: screenshot scaling changes pixel density; naive linear mapping misses the target.
- UI dynamism: button coordinates shift with application state; stale coordinates fail immediately.
A simple but effective strategy is two-stage localization:
- Wide-angle localization: a full-screen screenshot yields a rough bounding box.
- Local zoom: crop the rough area and give the model a clearer, enlarged image from which to extract precise coordinates.
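The two steps above hinge on one piece of arithmetic: projecting a coordinate reported inside the zoomed crop back to full-screen pixels. A sketch, with illustrative parameter names:

```python
def remap_from_crop(crop_x: int, crop_y: int,
                    crop_left: int, crop_top: int, zoom: float):
    """Map a coordinate reported inside a zoomed crop back to screen pixels.

    crop_x/crop_y: model output within the enlarged crop image.
    crop_left/crop_top: where the crop was taken from the full screenshot.
    zoom: enlargement factor applied to the crop before sending to the model.
    """
    return crop_left + round(crop_x / zoom), crop_top + round(crop_y / zoom)
```

Getting this inverse transform wrong is a classic source of systematic miss-clicks: the error is small near the crop origin and grows toward its far edge.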
3) Action Verification: Evading "Invalid Clicks" that Trigger Infinite Loops
Action verification is the lifeline of Computer Use. At a minimum, enforce:
- Frame diffs: did the screenshot delta before and after the click cross the required threshold?
- Timeouts: wait at most T milliseconds after a click; if nothing changes, trigger degradation (timeout, degradation).
- Retry ceilings: an identical action must never be retried indefinitely (retry).
- State probing: where available, combine window focus, accessibility trees, and DOM signals to increase determinism (observability).
The official documentation explicitly advises validating actions before execution and adding action delays. The engineering translation: do not allow rapid-fire clicking. Reference: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool
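A sketch of the verification gate, assuming a hypothetical `frame_delta` metric (for example, the fraction of changed pixels between two screenshots):

```python
import time

def wait_for_effect(take_screenshot, before, frame_delta,
                    threshold=0.01, timeout_ms=3000, poll_ms=250):
    """Poll until the screen changes enough, or give up after timeout_ms.

    Returns True if the action visibly took effect, False to hand control to
    the caller's degradation path (relocalize, escalate, or halt).
    """
    deadline = time.monotonic() + timeout_ms / 1000
    while time.monotonic() < deadline:
        after = take_screenshot()
        if frame_delta(before, after) >= threshold:   # frame diff crossed threshold
            return True
        time.sleep(poll_ms / 1000)                    # action delay between polls
    return False                                      # timeout: degrade, don't re-click blindly
```

The `poll_ms` sleep doubles as the recommended action delay: the loop never hammers the screenshot pipeline, and a `False` return is a signal, not an invitation to retry the same click.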
4) Security Boundaries: Computer Use Mandates Zero Trust
Computer Use is a high-privilege channel because it can:
- Manipulate any GUI.
- Input sensitive data (passwords, PII).
- Trigger irreversible actions (deletion, financial transactions, production releases).
Consequently, you must force it into a zero-trust perimeter:
- Default deny: write actions, payments, deletions, and deployments require explicit manual approval (authorization).
- Tool stratification: read-only / low-risk / high-risk, bound to matching approval chains (auditing).
- Kill switch: at any moment, a single action must be able to halt execution and clean up child processes (resource release).
This is not merely a security checklist; it is the non-negotiable engineering baseline to "prevent runaway AI."
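The tiering above can be sketched as a default-deny policy table. The tier names and the action vocabulary here are illustrative, not part of any official API:

```python
RISK_TIERS = {
    "screenshot": "read_only",
    "scroll": "read_only",
    "click": "low_risk",
    "type": "low_risk",
    "delete": "high_risk",
    "pay": "high_risk",
    "deploy": "high_risk",
}

def gate(action_type: str, *, human_approved: bool = False) -> str:
    """Default deny: unknown actions are blocked, high-risk ones need approval."""
    tier = RISK_TIERS.get(action_type)
    if tier is None:
        return "deny"                # default deny for anything unlisted
    if tier == "high_risk" and not human_approved:
        return "needs_approval"      # mandatory HITL sign-off before execution
    return "allow"
```

The key design choice is that the table enumerates what is allowed, not what is forbidden: an action type the policy has never seen is blocked by construction.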
A Minimal Coordinate Projector (Pseudocode)
The example below shows mapping plus boundary clamping, leaving hooks for subsequent action verification and delays.
```python
class VisualTargeting:
    """
    Coordinate projector:
    maps normalized model coordinates to physical pixels, with boundary clamping.
    """

    def __init__(self, *, screen_w: int, screen_h: int, ref: int = 1000):
        self._w = screen_w
        self._h = screen_h
        self._ref = ref

    def denormalize(self, x: int, y: int) -> tuple[int, int]:
        # Scale from the [0, ref) reference space to physical pixels.
        rx = int(x * (self._w / self._ref))
        ry = int(y * (self._h / self._ref))
        # Clamp to screen bounds so a bad model output can never click off-screen.
        rx = max(0, min(rx, self._w - 1))
        ry = max(0, min(ry, self._h - 1))
        return rx, ry
```
Observability (Without Traces, You Cannot Debug Computer Use)
Every single action must log, at a minimum:
- trace_id / task_id
- action_type (click/type/scroll/wait)
- before_frame_hash / after_frame_hash (or precise delta metrics)
- attempt / timeout_ms / retry_reason
- result / error_code
OpenTelemetry's GenAI event semantic conventions are a good starting point for turning tool behaviors into aggregatable metadata fields. Reference: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/
Pitfall (Common Traps and Defenses)
- Omitting action verification: quickly devolves into invalid-click loops (retry storms).
- Omitting timeouts: waits stretch forever (timeouts).
- No kill switch: a single operational mistake has an uncontrollable blast radius (authorization, isolation).
- No archived screenshot evidence: no way to do a post-mortem on why the mouse clicked the wrong target (auditing, observability).
Debug (Troubleshooting Computer Use Instability)
Recommended forensic sequence:
- Resolution/scaling: is it consistent? Is compression too aggressive?
- Coordinate mapping: are aspect ratios handled correctly? Is local zoom in place?
- Verification strategies: are diff thresholds reasonable? Do timeouts actually fire? Do retry ceilings exist and work?
- Authorization gates: were high-risk actions wrongly allowed to execute?
Replay and Regression (Computer Use Must Be Testable)
Debugging Computer Use by watching it click live does not scale. Require every execution to produce replayable assets:
- Archive Keyframes: Before/After screenshots (strictly deploying redaction/cropping).
- Archive Action Sequences: action_type + coordinates + target window metadata (if available).
- Archive Verification Metrics: diff deltas, wait durations, timeout triggers, and retry counts.
- Archive Failure Reason Tags: Miss-click / occlusion / focus lost / stuck loading / permission denied.
This lets you build a "golden replay set":
- Every time you change coordinate mappings, diff thresholds, or action delays, replay a batch of historical tasks.
- If the miss-click rate spikes or the timeout rate climbs, block the deployment (degradation/blocking).
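The regression gate above can be sketched over per-task summaries from a golden replay run. The field names and rate budgets here are illustrative assumptions:

```python
def regression_gate(runs, max_miss_click_rate=0.02, max_timeout_rate=0.05):
    """Reject a deployment if miss-click or timeout rates exceed budget.

    `runs` is a list of per-task summaries, e.g.
    {"miss_clicks": 1, "timeouts": 0, "actions": 40}.
    """
    actions = sum(r["actions"] for r in runs) or 1   # avoid division by zero
    miss_rate = sum(r["miss_clicks"] for r in runs) / actions
    timeout_rate = sum(r["timeouts"] for r in runs) / actions
    ok = miss_rate <= max_miss_click_rate and timeout_rate <= max_timeout_rate
    return ok, {"miss_click_rate": miss_rate, "timeout_rate": timeout_rate}
```

Wire this into CI so that a change to thresholds or mappings cannot ship without replaying the golden set first.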
Failure Reason Taxonomy (Mandatory Standardization)
miss_click / focus_lost / element_occluded / loading_no_change / scroll_drift / permission_denied / timeout / retry_exhausted
The value of these tags is turning "it clicked the wrong thing" into a quantifiable failure taxonomy (observability), so that different failure types can trigger targeted recovery strategies (degradation).
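One way to make the taxonomy actionable is a tag-to-recovery table. The mapping below is an illustrative sketch, not a prescribed policy:

```python
RECOVERY_BY_FAILURE = {
    "miss_click": "zoom_and_relocalize",
    "focus_lost": "refocus_window",
    "element_occluded": "scroll_or_dismiss",
    "loading_no_change": "wait_and_retry_once",
    "scroll_drift": "recapture_and_relocalize",
    "permission_denied": "halt_and_escalate",
    "timeout": "degrade",
    "retry_exhausted": "halt_and_escalate",
}

def recovery_for(tag: str) -> str:
    """Unknown tags escalate by default: default deny, not default retry."""
    return RECOVERY_BY_FAILURE.get(tag, "halt_and_escalate")
```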
Recovery Strategies (Granting the System Autonomous Damage Control)
Upon verification failure, never let the model keep clicking blindly. Recommended stratified recovery:
- First failure: local zoom + relocalization (demand stronger visual evidence).
- Second failure: reduce action intensity (read-only observation/wait) and require the model to justify its next move (auditing).
- Repeated failures: trigger the human-in-the-loop alarm or terminate the task (degradation, resource release).
The core of this strategy is "finite retries + explicit exits." Without it, you have built a machine that burns tokens and CPU cycles clicking a dead screen (retry, timeout).
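The ladder above can be sketched as a pure function from the consecutive-failure count to the next recovery step. Step names are illustrative:

```python
def next_recovery_step(consecutive_failures: int, max_failures: int = 3) -> str:
    """Stratified recovery: stronger evidence first, humans last, never loop forever."""
    if consecutive_failures <= 0:
        return "continue"
    if consecutive_failures == 1:
        return "zoom_and_relocalize"   # demand stronger visual evidence
    if consecutive_failures < max_failures:
        return "observe_and_justify"   # read-only mode; model must explain its plan
    return "halt_and_escalate"         # HITL alarm or terminate (explicit exit)
```

Because the function is pure, the escalation policy itself is trivially unit-testable, which matters more than it sounds: a bug here is the difference between a contained failure and a retry storm.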
Mandatory Gates for High-Risk Actions (Highly Recommended Default: ON)
Some Computer Use actions cause damage even when clicked perfectly. These require hard gates:
- Delete/purge/overwrite: any button or menu that can destroy data.
- Payment/purchase/deploy: any irreversible external action.
- Privilege modification: changes to accounts, cryptographic keys, or IAM configuration.
Gating Strategy Recommendations:
- Default deny (block immediately).
- Mandatory HITL: surface screenshot evidence + action previews + target resources, and require human sign-off (authorization, auditing).
- Mandatory post-execution verification: enforce screenshot diffing and state confirmation (observability).
This is not excessive caution; it is the professional baseline for handling a high-privilege channel (authorization, isolation).
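A sketch of the evidence bundle a human reviewer sees before signing off on a gated action. The structure and field names are illustrative:

```python
def approval_request(action_type: str, target_desc: str,
                     screenshot_hash: str, coords: tuple[int, int]) -> dict:
    """Build the evidence bundle for HITL sign-off on a high-risk action."""
    return {
        "action_type": action_type,          # e.g. "delete", "pay", "deploy"
        "target": target_desc,               # human-readable target resource
        "screenshot_hash": screenshot_hash,  # ties approval to the exact visual state
        "preview": f"{action_type} at {coords}",
        "decision": "pending",               # default deny until explicitly approved
    }
```

Binding the approval to a screenshot hash is the important detail: if the screen changes between approval and execution, the hash no longer matches and the action must be re-approved.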
Source (Reference Materials)
- Anthropic computer use tool docs: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool
- Engineering experiments and implementation clues (resolution/coordinate mapping): https://simonwillison.net/2024/Oct/22/computer-use/
- Anthropic quickstarts: https://github.com/anthropics/anthropic-quickstarts
- OTel GenAI events semconv: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/