Vibe coding is the loop of pointing an AI coding agent at your screen and saying "fix this." It fails in predictable ways. The agent guesses which file is open. It misreads a button label. It treats the arrow you drew as a UI element. Everyone has blamed the model. The evidence says the model is fine — the missing piece is context.
When an AI coding agent is given structured context — file path, window title, URL, accessibility tree, annotation shapes — accuracy on visual bugs jumps and turn counts fall. When it is given raw pixels, it flails. This article compiles the public benchmarks, the repository-level RAG research, and Stash's own measurements to make the case numerically.
TL;DR — One Number
Annotation-mismatch error rate drops from 30-40% on raw screenshots to under 5% when the agent can read structured annotation shapes, app metadata, and window title alongside the image. (Stash internal measurement across its Context Banner and XMP payload pipeline.)
That single shift — turning a screenshot from a pixel grid into a structured capture — is the difference between the agent pointing at the wrong button and the agent pointing at the right one. The rest of this article explains why that number looks the way it does, and where the public benchmarks agree.
The Accuracy-vs-Context Curve
Capability and grounding are two different problems. Model capability — how well an LLM reasons about code when it has every piece of information it needs — has climbed steeply. On SWE-bench Verified, Claude Opus 4.7 now resolves ~82% of curated software engineering issues, with Gemini 3.1 Pro at ~78.8% and GPT 5.4 at ~78.2% [1]. Two years ago, the same benchmark was below 20% for the best frontier model [2].
That capability only materializes when the model gets the right code in the window. On repository-level code-completion benchmarks, retrieval-augmented generation (RAG) lifts accuracy substantially even on strong base models:
| Model | No retrieval (base) | With function-definition retrieval | Relative improvement |
|---|---|---|---|
| Llama-3.1-8B-Instruct | 34.02 / 46.07 | 39.64 / 49.35 | ~27.6% |
| Qwen2.5-Coder-14B-Instruct | 29.79 / 48.56 | 51.12 / 61.96 | ~71.6% |
Source: WeChat RAG-for-code-completion study [3], summarizing exact-match and edit-similarity metrics.
The SWE-bench Verified scaffold called Agentless reported the sharpest illustration of this. By improving which files it fed the model — not by changing the model itself — it roughly doubled the best open-source SWE-bench score, from ~16% to ~33% [4]. The lesson generalizes: correctly selected, summarized context improves resolution accuracy and reduces runtime; unfiltered context provides limited or negative benefit [5].
Long-context research points the same direction. Claude 3 Opus scored above 99% on needle-in-a-haystack retrieval, and Claude Opus 4.7 maintains 98.5% recall at a 1M-token window [6]. Context-window size is not the blocker. What goes into the window is.
Why Raw Pixels Fail
Screenshots look like rich context to humans. To a vision model, a screenshot is a grid of patch tokens with most structural information missing. Three failure modes recur:
- Patch tokenization throws away small UI. Vision models split an image into fixed patches (often 14×14 or 16×16 pixels after resizing). A one-pixel outline on a button, a single-line error badge, or a 10px caret all fall inside a patch that also contains lots of other pixels. The model sees a smear, not an element. See How Vision Models Actually Process Your Screenshots for the pipeline.
- OCR hallucination on UI. Pixel OCR on low-contrast, small, or antialiased UI text is where vision models hallucinate most. A button labeled "Save Draft" renders at 11pt; the model confidently calls it "Save" or "Saved" or "Send." You don't notice until the agent's next step targets the wrong button.
- Chat upload strips metadata. Drag a PNG into most chat clients and every piece of embedded context — XMP, EXIF, accessibility sidecars, bundle siblings — is discarded at upload. The model only gets pixels.
The implication is not "vision models are bad." It is that pixels are a lossy encoding of a UI that was originally structured. You had structured data (the DOM, the a11y tree, the file path). You rasterized it to a screenshot. You asked the model to reconstruct what you threw away. Adding a little structured context back in returns most of the signal.
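The patch arithmetic behind the first failure mode can be made concrete in a few lines. This is a rough sketch: the 14-pixel patch size comes from the text above, while the 1568-pixel resize cap, the 2880x1800 screenshot, and the 2-pixel caret width are illustrative assumptions rather than any particular model's documented pipeline.

```python
def patch_grid(width: int, height: int, patch: int = 14, max_side: int = 1568):
    """Return (columns, rows, scale) of the patch grid after downscaling.

    max_side is a hypothetical resize cap; real vision pipelines differ.
    """
    scale = min(1.0, max_side / max(width, height))
    resized_w = round(width * scale)
    resized_h = round(height * scale)
    cols = -(-resized_w // patch)  # integer ceiling division
    rows = -(-resized_h // patch)
    return cols, rows, scale


def element_share_of_patch(elem_w: float, elem_h: float, scale: float, patch: int = 14) -> float:
    """Largest fraction of one patch's area that a small element can fill."""
    w = min(elem_w * scale, patch)
    h = min(elem_h * scale, patch)
    return (w * h) / (patch * patch)


cols, rows, scale = patch_grid(2880, 1800)
caret_share = element_share_of_patch(2, 10, scale)  # assumed 2 px wide, 10 px tall
print(f"{cols} x {rows} patches at scale {scale:.2f}")
print(f"the caret fills about {caret_share:.1%} of a single patch")
```

Under these assumptions the caret accounts for only a few percent of the one patch that contains it, which is the "smear" described above.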
What "Context" Means, Field by Field
"More context" is vague. The useful granularity is per-field. This table is the core of the argument — each row is a specific data item, with what happens when it is missing and what changes when it is present.
| Context field | Without it | With it |
|---|---|---|
| File path | Agent asks "which file is this?" — one extra turn minimum. Often guesses wrong from window chrome. | Agent opens the file directly and makes the edit in place. |
| Window title | Agent cannot disambiguate ContentView.swift from SettingsView.swift when the tab bar is cropped. | Agent names the exact view or document in its first response. |
| URL | Minimalist pages look identical from pixels — agent can't tell GitHub from your local staging site. | Agent maps the URL to the route, then to the component in your codebase. |
| Accessibility tree | Agent OCRs UI labels and hallucinates on small or low-contrast text. Element roles are guessed. | Agent reads pristine element roles, labels, values, enabled states — typically 250-1,200 tokens per window (Stash measurement). |
| Dev context (cursor, language, file) | "Fix the bug in this function" is ambiguous — agent picks the nearest plausible function. | Cursor line and column pin the edit to the exact location. |
| Annotation shapes | Your red arrow is visually identical to an arrow in the app's toolbar. Mismatch rate 30-40% (Stash measurement). | Arrow recorded as a geometric shape with endpoints and color; annotation-match accuracy above 95% on Stash's own pipeline. |
| OS and display | Agent suggests UI steps that don't exist on your macOS version or theme. | Agent matches its guidance to the exact environment. |
| Clipboard events during a recording | Agent sees clicks but not what data you moved between apps — half the story is missing. | Agent reads the literal text you copied, inline on the timeline. |
| Voice transcript | Intent has to be inferred from clicks and scrolls. | Narration ("the modal stays open after I click dismiss") anchors every action. |
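One way to picture the table is as a single record that travels with the image. The sketch below is illustrative only: the class names, field names, example values, and the rendered text layout are all assumptions made for this example, not Stash's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Arrow:
    """A user-drawn arrow stored as geometry rather than painted pixels."""
    start: tuple[float, float]
    end: tuple[float, float]
    color: str


@dataclass
class CaptureContext:
    """Hypothetical record of the structured fields that accompany one screenshot."""
    app_name: str
    window_title: str
    file_path: Optional[str] = None
    url: Optional[str] = None
    cursor: Optional[tuple[int, int]] = None  # (line, column)
    os_version: Optional[str] = None
    annotations: list[Arrow] = field(default_factory=list)

    def to_prompt_header(self) -> str:
        """Render the fields that are present as 'key: value' lines."""
        lines = [f"app: {self.app_name}", f"window: {self.window_title}"]
        if self.file_path:
            lines.append(f"file: {self.file_path}")
        if self.url:
            lines.append(f"url: {self.url}")
        if self.cursor:
            lines.append(f"cursor: line {self.cursor[0]}, column {self.cursor[1]}")
        if self.os_version:
            lines.append(f"os: {self.os_version}")
        for arrow in self.annotations:
            lines.append(f"arrow: {arrow.color} from {arrow.start} to {arrow.end}")
        return "\n".join(lines)


# Example values are made up for illustration.
ctx = CaptureContext(
    app_name="Xcode",
    window_title="ContentView.swift",
    file_path="Sources/App/ContentView.swift",
    cursor=(42, 9),
    annotations=[Arrow(start=(120, 80), end=(310, 212), color="red")],
)
header = ctx.to_prompt_header()
print(header)
```

Missing fields are simply omitted from the header, so the agent sees only what was actually captured rather than placeholder values it might over-trust.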
Benchmarks: Same Bug, With and Without Context
Two short scenarios. The bug is identical in each pair; only the context the agent receives changes.
Scenario A — Code editor layout bug
A SwiftUI view misaligns a button. The developer screenshots it and draws an arrow.
| Input | Agent's first turn | Turns to fix |
|---|---|---|
| Raw screenshot, no metadata | "I can see an arrow pointing at a VStack. Which file is this?" | 3 (identify file, identify line, apply fix) |
| Screenshot + app name + window title + annotation shape | "Arrow points at the VStack in ContentView.swift. The padding modifier is applied before the frame modifier, causing the offset. Here is the fix." | 1 |
Turn counts measured against Stash's own Copy Path + Context workflow on Claude Code; the raw-screenshot flow replicates the default paste-and-describe loop.
Scenario B — Browser validation bug
The "Place Order" button is enabled when the email field is empty. Two arrows drawn on a screenshot of the checkout page.
| Input | Agent's first turn | Turns to fix |
|---|---|---|
| Raw screenshot | "I see a checkout form with two arrows. Can you share the URL and which page this is?" | 3-4 |
| Screenshot + URL (/checkout) + window title + annotation shapes | "The button you arrowed is in Checkout.tsx line 84. The disabled prop is missing the email-field check. Fix:" | 1 |
Both scenarios collapse from multi-turn back-and-forth to a single turn. The model did not get smarter between the two rows. The input did.
Token Economics: How Context Pays for Itself
A common objection: "adding context costs tokens." The data points the other way — structured context costs fewer tokens than the work the model does without it, because the model no longer has to OCR, guess, and ask.
| Channel | Tokens | Fidelity |
|---|---|---|
| Pixel OCR of one UI screenshot | 1,500-4,000 | Hallucination-prone on small UI |
| Accessibility tree of the same window | ~250-1,200 (1-5 KB) | Exact element labels, roles, values |
| Video session dragged into chat as files | 50K+ before reasoning starts | Metadata stripped at upload |
| Same video via MCP structured bundle | ~6-10K structured + ~1-3K per frame the agent chooses to read | Full timeline, clipboard events, transcript, interaction log |
| Key frame cap per recording | 30 max (Stash hard cap) | Interaction-anchored — start, clicks, focus changes, end |
Stash token figures are internal measurements documented in the product's value-proposition spec. The order-of-magnitude story is durable: structured context costs roughly 5-10x fewer tokens than pixel-dominant uploads, at higher fidelity.
The token savings compound across turns. A raw-screenshot session with three clarifying round-trips spends pixel-OCR tokens on every turn. A structured-context session spends them once, gets it right, and moves on.
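The compounding effect is simple arithmetic. The sketch below plugs in the table's ranges for the three-round-trip case. It assumes, for illustration, that the raw flow re-reads the screenshot on every turn and that the structured flow reads the accessibility tree once in place of OCR; tokens spent on the clarifying messages themselves are ignored.

```python
PIXEL_OCR_TOKENS = (1_500, 4_000)  # per screenshot read, from the table above
A11Y_TREE_TOKENS = (250, 1_200)    # per window, from the table above


def session_cost(per_turn: tuple[int, int], turns: int) -> tuple[int, int]:
    """Return the (low, high) token range for a session of `turns` reads."""
    low, high = per_turn
    return low * turns, high * turns


raw = session_cost(PIXEL_OCR_TOKENS, turns=3)
structured = session_cost(A11Y_TREE_TOKENS, turns=1)

print(f"raw screenshot, 3 turns:    {raw[0]:,} to {raw[1]:,} tokens")
print(f"structured context, 1 turn: {structured[0]:,} to {structured[1]:,} tokens")
```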
The Stash Context Stack
Stash is built around this evidence. Four surfaces deliver structured context to whichever agent is asking:
- Context Banner — a thin horizontal bar composited onto the pixels at the bottom of every screenshot. App name, window title, OS, display, timestamp, 8-character capture shortID. Human-readable, OCR-friendly, and survives drag-and-drop into any chat. See AI Context Banners: Embedding Machine-Readable Metadata onto Screenshots.
- XMP metadata payload — the same fields plus a structured accessibility tree and annotation shapes, embedded inside the PNG. Chat uploads often strip this; direct file reads and MCP calls do not.
- MCP payload — a local Model Context Protocol server that hands agents the full dossier for a capture on request. No chat upload, no token waste on images the agent didn't need. See What Is MCP? The Model Context Protocol Explained and Stash MCP Server: Query Captures from Claude Code and Cursor.
- AI Capture Report — for every video, a small bundle of markdown timeline, frame_tags.json, extracted audio, and up to 30 interaction-anchored key frames. LLMs cannot read MP4; they can read the bundle.
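For the XMP surface, a direct file read can recover the packet with nothing but the standard library. In PNG files, XMP is conventionally stored in an iTXt chunk whose keyword is XML:com.adobe.xmp. The sketch below extracts that packet as a raw XML string. It makes no assumptions about which fields Stash writes inside it, skips CRC validation for brevity, and the function name is illustrative.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"


def read_xmp_from_png(data: bytes) -> Optional[str]:
    """Return the XMP packet embedded in PNG bytes, or None if there is none."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length
        if chunk_type == b"IEND":
            break
        if chunk_type != b"iTXt":
            continue
        keyword, _, rest = body.partition(b"\x00")
        if keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0] == 1
        # Skip the compression flag and method bytes, then the two
        # null-terminated strings (language tag, translated keyword).
        _language, _, rest = rest[2:].partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None
```

Calling `read_xmp_from_png(open(path, "rb").read())` on a file taken straight from disk returns the packet if one is embedded. The same call on a copy that passed through a metadata-stripping upload would return None, which is the gap the banner and MCP surfaces are meant to cover.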
When More Context Hurts
More is not always better. Three failure modes to plan around:
- Context bloat degrades recall. Needle-in-a-haystack scores remain high at very long contexts, but agent performance on multi-step tasks drops when irrelevant files, logs, or transcripts fill the window. Claude Code's own best practices recommend delegating noisy reads to sub-agents with their own contexts [7].
- Irrelevant retrieval can hurt. SWE-bench research specifically finds that incorrectly selected context "provides limited or negative benefits" [5]. A bad retriever can push an accurate agent into worse answers than no retrieval at all.
- Silent truncation. Most agents silently truncate inputs past their effective window. Dumping 200K tokens of logs at a model with a 200K-token context does not mean the model read 200K tokens — it means you have no idea which parts it kept.
Rules of thumb that survive these traps:
- Start with a small, cheap call (for example, stash.list_recent(5), ~500 tokens) before a large one.
- Prefer structured over raw: a11y tree over OCR, markdown report over video, annotation shapes over painted pixels.
- Pass paths, not payloads, when you can — let the agent pull in a frame only when it decides the frame matters.
- Cap by signal, not by size — 30 interaction-anchored frames beats 9,000 raw frames at the same session coverage.
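The last rule can be sketched as a selection step: keep only frames tied to interactions, and thin them evenly if they still exceed the cap. The event kinds mirror the "start, clicks, focus changes, end" anchors listed earlier. The function name, the event tuple shape, and the even-thinning strategy are assumptions for illustration, not Stash's actual algorithm.

```python
ANCHOR_KINDS = {"start", "click", "focus_change", "end"}


def select_key_frames(events: list[tuple[float, str]], cap: int = 30) -> list[float]:
    """Pick timestamps of interaction-anchored frames, at most `cap` of them.

    `events` is a list of (timestamp_seconds, kind) pairs.
    """
    if cap < 2:
        raise ValueError("cap must be at least 2 to keep the first and last anchor")
    anchors = sorted(t for t, kind in events if kind in ANCHOR_KINDS)
    if len(anchors) <= cap:
        return anchors
    # Thin evenly by index, always keeping the first and last anchor.
    last = len(anchors) - 1
    return [anchors[i * last // (cap - 1)] for i in range(cap)]


# A made-up session: one start, 99 clicks, one scroll, one end.
events = [(0.0, "start"), (50.5, "scroll"), (100.0, "end")]
events += [(float(i), "click") for i in range(1, 100)]

frames = select_key_frames(events)
print(len(frames), frames[0], frames[-1])
```

Scroll events never become key frames here, and the cap bounds the frame count regardless of how long the recording runs.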
Frequently Asked Questions
Does more context always improve AI coding accuracy?
No. Structured, relevant context improves accuracy sharply. Unfiltered or incorrect context can hurt performance, and very long contexts can cause gradual recall degradation. Published SWE-bench research finds that correctly selected, summarized context improves resolution accuracy and reduces token cost, while unfiltered context provides limited or negative benefit.
How much context is too much?
The answer is not a token number, it is a signal-to-noise ratio. A 6-10K token structured bundle beats a 50K+ token chat upload because the bundle has higher signal per token. On needle-in-a-haystack evaluations, Claude models maintain high recall at hundreds of thousands of tokens, but agent accuracy on real coding tasks still degrades when the context contains irrelevant files, redundant logs, or raw pixel-heavy screenshots.
What types of context help most?
The highest-leverage context for code agents is: the active file path and cursor position, the window title or page URL, an accessibility tree of the visible UI, structured annotation shapes drawn by the user, and a short description of intent. These fields tell the agent what the user is looking at, where the edit should land, and what changed — which pixels alone cannot convey.
Does pasting a screenshot give the AI enough context?
No. A raw screenshot contains pixels and nothing else. The model must OCR text, guess the app, guess the file, and interpret annotations as ambiguous visual marks. Stash internal measurement puts the annotation-mismatch rate at 30-40% without structured annotation; accuracy rises above 95% once the annotation shapes and app metadata travel alongside the image.
Why do AI coding tools get annotations wrong?
The arrow you drew and an arrow already present in the app's toolbar look identical to a vision model. Without structured annotation data — shape type, start and end coordinates, color, stroke — the agent has to infer which marks are yours. It frequently attaches your annotation to the wrong element or treats a genuine UI element as an annotation.
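With shape data and an accessibility tree in hand, attaching an arrow to its target becomes a geometric lookup instead of a visual guess. The sketch below is a minimal illustration: the element records, the coordinates, and the smallest-containing-box rule are assumptions for this example, not Stash's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """One accessibility-tree node with its on-screen bounding box."""
    role: str
    label: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)

    @property
    def area(self) -> float:
        return self.width * self.height


def target_of_arrow(arrow_end: tuple[float, float],
                    elements: list[Element]) -> Optional[Element]:
    """Return the smallest element whose box contains the arrow head, if any."""
    hits = [e for e in elements if e.contains(*arrow_end)]
    return min(hits, key=lambda e: e.area, default=None)


# Made-up layout: a window containing two neighbouring buttons.
elements = [
    Element("window", "Editor", 0, 0, 800, 600),
    Element("button", "Save Draft", 600, 540, 90, 28),
    Element("button", "Send", 700, 540, 60, 28),
]
target = target_of_arrow((645, 552), elements)
print(target.label if target else "no element under the arrow head")
```

Because the label comes from the tree rather than from OCR, the lookup returns the exact string "Save Draft" instead of a guessed "Save" or "Send".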
How does Stash add context to screenshots?
Stash composites a thin Context Banner at the bottom of every screenshot with the app name, window title, OS version, display, timestamp, and an 8-character capture shortID. The same data, plus accessibility tree and annotation shapes, travels as XMP metadata inside the PNG and as structured payloads over a local MCP server that Claude Code, Cursor, and other clients can query.
What's the token cost of adding context?
An accessibility tree for a typical app window is 1-5 KB — roughly 250-1,200 tokens. That replaces the 1,500-4,000 tokens a vision model would spend OCRing the same pixels, and it arrives with clean labels instead of guessed ones. A full Stash video bundle runs around 6-10K structured tokens versus 50K+ tokens for the same session dragged into a chat as files.
Is there research on LLM context and accuracy?
Yes. SWE-bench and SWE-bench Verified show large accuracy jumps when scaffolds select better context (Agentless doubled prior open-source scores). CodeRAG-Bench and repository-level RAG studies show that retrieval of relevant function definitions can raise code-generation metrics by 27-72% depending on model and task. Anthropic's long-context evaluations report needle recall above 99% for Claude 3 Opus and 98.5% for Claude Opus 4.7 at 1M tokens.
Key Takeaways
- Capability is no longer the bottleneck for AI coding agents — grounding is. Frontier models now resolve ~78-82% of SWE-bench Verified, but they only hit those numbers when the context is right.
- Raw screenshots are a lossy encoding. Patch tokenization, OCR hallucination, and chat-upload metadata stripping throw away most of what made the UI legible in the first place.
- Structured context delivers 27-72% relative gains on repository-level code generation benchmarks (RAG studies), and doubles the best open-source SWE-bench scores when applied as scaffolding (Agentless).
- Stash measures annotation-mismatch rates of 30-40% on raw screenshots, and accuracy above 95% once structured annotation shapes and app metadata travel with the image.
- Token economics favor structured context: 250-1,200 tokens for an accessibility tree beats 1,500-4,000 tokens of pixel OCR; ~6-10K tokens for a structured video bundle beats 50K+ tokens for the same session uploaded as files.
- More is not always better — irrelevant context hurts. Curate, summarize, and prefer paths over payloads.
References
- [1] Vals.ai, "SWE-bench benchmark leaderboard." vals.ai/benchmarks/swebench
- [2] Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024. arxiv.org/pdf/2310.06770
- [3] Wang et al., "A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat," 2025. arxiv.org/html/2507.18515v1
- [4] OpenAI, "Introducing SWE-bench Verified." openai.com/index/introducing-swe-bench-verified
- [5] Zhang et al., "Automated Benchmark Generation for Repository-Level Coding Tasks," 2025. arxiv.org/abs/2503.07701
- [6] Anthropic, "Introducing the next generation of Claude" (Claude 3 family long-context results); follow-up reporting on Claude Opus 4.7's 1M-token window. anthropic.com/news/claude-3-family
- [7] Anthropic, "Best Practices for Claude Code" (sub-agents and context management). code.claude.com/docs/en/best-practices
- [8] Stash, "Value Propositions — Stash" (internal measurements for annotation accuracy, a11y tree token sizes, MCP bundle costs, 30-frame cap).