Why Claude Code makes you paste with ctrl+v instead of cmd+v — and the hidden cost of feeding raw pixels to an AI agent.
If you have used Claude Code's terminal interface, you have hit this: you copy a screenshot, position your cursor, press cmd+v to paste — and nothing happens. Or worse, you get a useless file path string instead of the image. The trick is that you have to press ctrl+v, not cmd+v.
That one detail trips up almost everyone the first time. But it is a doorway into something bigger: how you get a screenshot into an AI agent shapes how much that screenshot actually helps. And pasting raw pixels — the thing ctrl+v does — is the most expensive, lowest-context way to do it.
Let me explain both halves: the keystroke, and the cost.
Key Takeaways
- In Claude Code's terminal, paste screenshots with ctrl+v, not cmd+v — you are pasting into the program running inside the terminal, which reads the clipboard image itself.
- cmd+v in a terminal usually pastes a file path or plain text, not the image.
- Pasting a raw screenshot is the most expensive way to hand an agent a screenshot: a single full-resolution image can cost 1,000–2,000 tokens just to encode, before any reasoning.
- The agent does not see what you see — it has to OCR the text (where vision models hallucinate most), infer the UI structure, and guess context it cannot see: file path, line number, URL, OS version.
- Structured capture — app name, window title, accessibility tree, dev context — hands the agent clean text instead of pixels. Cheaper, and no OCR guessing.
- Rule of thumb: paste pixels for a quick glance, use structured context for real work.
- With a local MCP server, the agent can query a capture by ID, so you do not have to paste anything.
Why ctrl+v and not cmd+v?
Claude Code runs in your terminal. The terminal is the app receiving your keystrokes, not a native macOS text field and not a browser. And terminals have their own decades-old keyboard conventions: control-key combinations are delivered to the program running inside the terminal, not handled by the OS.
In most Mac apps, cmd+v is paste because the OS-level text system handles it. A terminal emulator passes most keystrokes straight through to the program running inside it, but cmd shortcuts are the exception. In many terminal setups, the terminal itself intercepts cmd+v as its own "paste from system clipboard", and that paste only inserts text. For an image, that becomes a file path or nothing at all.
Claude Code instead listens for ctrl+v as its own internal "paste image from clipboard" signal. It is not the macOS paste — it is Claude Code reading the clipboard's image data directly and attaching it to your prompt. That is why the keystroke is different from everything else on your Mac: you are talking to the program inside the terminal, not to macOS.
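A quick sketch shows why a control-key combination can reach the program at all. In the classic terminal model, ctrl plus a letter is sent to the running program as a single control byte (the letter's ASCII code with only its low five bits kept), while cmd has no such encoding and is handled by the terminal app. The snippet illustrates only that general convention. The helper name is made up, and how Claude Code detects the keystroke internally is an assumption this article does not document.

```python
def ctrl_byte(letter: str) -> bytes:
    """Return the byte a terminal conventionally sends for ctrl+<letter>.

    Classic terminals encode ctrl+<letter> by keeping only the low five
    bits of the letter's ASCII code. This is the general convention, not
    a description of Claude Code's internals.
    """
    if len(letter) != 1 or not letter.isascii() or not letter.isalpha():
        raise ValueError("expected a single ASCII letter")
    return bytes([ord(letter.upper()) & 0x1F])


# ctrl+v arrives at the program inside the terminal as byte 0x16.
print(ctrl_byte("v"))  # b'\x16'
# ctrl+c, for comparison, is 0x03 (the familiar interrupt character).
print(ctrl_byte("c"))  # b'\x03'
```

Because ctrl+v is just a byte on the program's input, the program is free to give it any meaning it likes, including "go read the clipboard image".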
A few practical notes:
- In Claude Code's terminal interface, ctrl+v pastes an image from your clipboard directly into the prompt.
- cmd+v in the same spot usually pastes a file path (if you copied a file) or plain text — not the image itself.
- If you are using macOS Terminal or iTerm2, make sure the terminal is not capturing ctrl+v for its own use. Most setups pass it through to Claude Code fine.
- The image has to actually be on your clipboard as image data. A screenshot sent to the clipboard (cmd+ctrl+shift+4 on Mac, or any capture tool that copies the image) works; a file you copied in Finder usually pastes as a path.
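To make the last note concrete, here is a minimal sketch of the difference between "image data on the clipboard" and "a file reference on the clipboard". macOS labels clipboard contents with type identifiers such as public.png, public.tiff, public.file-url, and public.utf8-plain-text. The function name and its return labels are hypothetical, and the precedence rule is a simplification. It does not read the real clipboard, and it is not a description of how Claude Code is implemented.

```python
# Type identifiers macOS commonly uses for clipboard contents.
IMAGE_TYPES = {"public.png", "public.tiff", "public.jpeg"}
FILE_TYPES = {"public.file-url"}
TEXT_TYPES = {"public.utf8-plain-text"}


def classify_clipboard(types: set[str]) -> str:
    """Guess what a paste would yield from the clipboard's type identifiers.

    A file reference is checked first because a file copied in Finder may
    also carry icon or text representations alongside its URL
    (a simplifying assumption for this sketch).
    """
    if types & FILE_TYPES:
        return "file path"
    if types & IMAGE_TYPES:
        return "image"
    if types & TEXT_TYPES:
        return "text"
    return "empty"


# A screenshot copied to the clipboard carries image data.
print(classify_clipboard({"public.png", "public.tiff"}))
# A file copied in Finder carries a file URL, so it pastes as a path.
print(classify_clipboard({"public.file-url", "public.utf8-plain-text"}))
```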
That is the mechanics. Now the part that actually matters.
What pasting pixels actually costs
Here is the thing nobody tells you about pasting a screenshot into an AI agent: the agent does not see what you see. You see a screenshot of your editor with a bug highlighted. The agent sees a grid of pixels it has to decode from scratch — and every bit of decoding costs tokens, time, and accuracy.
When you paste a raw screenshot, the model has to:
- OCR the text out of the image — every variable name, every error message, every line of code — and OCR is exactly where vision models hallucinate most. Low-contrast themes, small fonts, and anti-aliasing all degrade it.
- Infer the structure — which window is focused, what app this is, where one pane ends and another begins — from visual cues alone.
- Guess the context it cannot see: the file path, the line number, the part of the error that has scrolled out of view, the git branch, the OS version.
You paid for a screenshot. You got a guess.
And the cost is not theoretical. A single full-resolution screenshot pasted into an agent can run 1,000–2,000 tokens just to encode the image — before the model has reasoned about anything. Paste a few of those into a conversation and you have spent more context on pixels than on your actual code.
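For a feel of where a number like that comes from, here is a back-of-the-envelope estimator. It assumes the rough rule from Anthropic's published vision guidance: about width × height / 750 tokens per image, with large images first downscaled to a long edge of about 1568 px and capped at roughly 1,600 tokens. Those constants are assumptions that may differ by model and change over time, so treat the output as an approximation, not a bill.

```python
def estimate_image_tokens(width: int, height: int,
                          max_long_edge: int = 1568,
                          max_tokens: int = 1600) -> int:
    """Rough token estimate for one image, assuming tokens ~ w * h / 750."""
    # Shrink so the long edge fits the assumed limit, keeping aspect ratio.
    scale = min(1.0, max_long_edge / max(width, height))
    scaled_w, scaled_h = width * scale, height * scale
    tokens = scaled_w * scaled_h / 750
    # Anything still above the assumed cap is treated as costing the cap.
    return int(min(tokens, max_tokens))


# A 1280x800 window capture, under these assumptions:
print(estimate_image_tokens(1280, 800))   # 1365
# A full Retina-sized capture hits the assumed cap:
print(estimate_image_tokens(2880, 1800))  # 1600
```

Five captures at the first size would come to about 6,800 tokens under the same assumptions, which is how a short debugging session ends up spending more context on pixels than on code.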
The alternative: structured context instead of pixels
What if, instead of pasting pixels, the screenshot arrived as structured text the agent can read directly?
That is the entire idea behind Stash. When you capture a screenshot with Stash, it does not just save an image. It captures:
- The app name and bundle ID — so the agent knows exactly what it is looking at, no guessing.
- The window title — the document, the project, the page.
- The accessibility tree — the actual text content of the UI, structured, pulled straight from macOS — no OCR, no hallucination.
- The URL, if it is a browser.
- Dev context — for editors, the file path, language, and cursor position.
- A context banner composited onto the image itself, plus the same data as structured XMP metadata inside the PNG.
So instead of the agent burning 1,500 tokens to OCR a fuzzy screenshot and still guessing the file path, it reads a compact block of structured text: app, window, file, line, the exact UI text. Cleaner, cheaper, and correct.
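As an illustration of what "a compact block of structured text" could look like, here is a small sketch. The class, field names, and sample values are all hypothetical and only mirror the list above. This is not Stash's actual schema or export format, and the token figure uses a crude four-characters-per-token heuristic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptureContext:
    # Hypothetical fields mirroring the list above; not Stash's real schema.
    app_name: str
    bundle_id: str
    window_title: str
    ui_text: str
    url: Optional[str] = None
    file_path: Optional[str] = None
    language: Optional[str] = None
    cursor_line: Optional[int] = None

    def to_prompt_block(self) -> str:
        """Render only the fields that are present, one per line."""
        fields = [
            ("app", f"{self.app_name} ({self.bundle_id})"),
            ("window", self.window_title),
            ("url", self.url),
            ("file", self.file_path),
            ("language", self.language),
            ("line", self.cursor_line),
            ("ui_text", self.ui_text),
        ]
        return "\n".join(f"{k}: {v}" for k, v in fields if v is not None)


def rough_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); real tokenizers vary.
    return max(1, len(text) // 4)


# Made-up sample values for illustration only.
ctx = CaptureContext(
    app_name="Visual Studio Code",
    bundle_id="com.microsoft.VSCode",
    window_title="app.py - demo",
    ui_text="TypeError: unsupported operand type(s) for +: 'int' and 'str'",
    file_path="src/app.py",
    language="python",
    cursor_line=42,
)
block = ctx.to_prompt_block()
print(block)
print("approx tokens:", rough_tokens(block))
```

The point of the sketch is the shape: every line is text the model can read directly, and the file path and line number are stated outright.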
And because Stash runs a local MCP server, the agent does not need you to paste at all. It can query your captures by ID or search, for example stash.get_capture("a1b2c3d4"), and pull the full dossier on demand. The screenshot becomes a database row the agent can look up, not a wall of pixels you shove into the context window.
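Under the hood, an MCP tool call is a JSON-RPC 2.0 request with the method "tools/call". The sketch below builds such a request by hand to show roughly what a lookup by ID looks like on the wire. The tool name "get_capture" and its "id" argument are assumptions based on the stash.get_capture("a1b2c3d4") example, not a documented Stash API, and a real agent would send this through its MCP client, not construct it manually.

```python
import json
from itertools import count

# Simple counter for JSON-RPC request ids.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 'tools/call' request, the shape MCP uses."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# "get_capture" and "id" are assumed names, used here for illustration.
print(build_tool_call("get_capture", {"id": "a1b2c3d4"}))
```

The request itself is a few dozen tokens of text, and the response can carry the structured context without any image in the conversation.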
When ctrl+v is still fine
None of this means ctrl+v is wrong. For a quick one-off — "what is this error?" — pasting an image directly is fast and good enough. The cost only compounds when you are doing real, repeated work: feeding the agent screenshot after screenshot, building up a long debugging session, or relying on it to read fine detail correctly.
The rule of thumb: paste pixels for a glance, use structured context for work. If the screenshot matters — if the agent has to get the details right — give it something better than a grid of pixels to decode.
The takeaway
Ctrl+v instead of cmd+v is a quirk of talking to a program running inside your terminal — Claude Code reads the clipboard image itself, so it uses its own keystroke. That is the easy half.
The half that matters: pasting raw pixels is the most expensive, least reliable way to get a screenshot into an AI agent. The pixels have to be decoded, the text OCR'd, the context guessed. Structured capture — app, window, file, accessibility tree, all as text — skips the decoding entirely and hands the agent something it can actually read. Same screenshot. A fraction of the cost. None of the guessing.
Stash is a Mac screenshot and screen-recording tool built for AI agents. Every capture exports as structured context — accessibility trees, app metadata, dev context — that Claude Code, Cursor, and ChatGPT can read directly, instead of pixels they have to decode. Download Stash or learn more about the vibe coding workflow.