Use Cases
pdfvision is useful whenever a PDF must be inspected by an AI agent rather than manually copied into a prompt. The best workflow depends on what kind of evidence the PDF contains.
The common theme is verification. pdfvision is not only a "PDF to text" command; it is a way to expose the signals an agent needs to decide whether text extraction is enough, whether layout changed the meaning, and whether a specific visual region should be inspected.
Unknown PDFs
Start with the cheapest structured pass:
pdfvision document.pdf --json
Use the overview as a routing table:
- quality.nativeTextStatus: "ok" usually means native text is a reasonable first source.
- empty_but_visual_content means the page likely needs rendering or OCR.
- high imageCount or vectorCount means charts, screenshots, forms, or slide graphics may contain meaning outside the text stream.
- warnings identify pages where a human would slow down before trusting extraction.
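The routing rules above can be sketched in Python. This is a minimal sketch, not pdfvision's own logic: it assumes each overview entry is a JSON object with a nested quality.nativeTextStatus string, integer imageCount and vectorCount fields, and a warnings list. The function name route_page, the step labels, and the visual_threshold parameter are hypothetical and chosen only for illustration.

```python
def route_page(entry, visual_threshold=5):
    """Suggest next steps for one overview entry (field layout is assumed)."""
    steps = []
    status = entry.get("quality", {}).get("nativeTextStatus")
    if status == "ok":
        steps.append("use-native-text")
    elif status == "empty_but_visual_content":
        steps.append("render-or-ocr")
    # Many images or vector shapes suggest meaning outside the text stream.
    visual = max(entry.get("imageCount", 0), entry.get("vectorCount", 0))
    if visual >= visual_threshold:
        steps.append("inspect-visual-regions")
    # Any warning means a human would slow down on this page.
    if entry.get("warnings"):
        steps.append("review-warnings")
    return steps
```

An agent could call route_page on each entry of the parsed overview and request extra signals only for pages that return more than "use-native-text".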
Then add only the needed signals:
pdfvision document.pdf --layout --image-boxes --vector-boxes --visual-regions --json
Research Papers
Use native text first, then add layout when columns, figures, equations, or tables matter.
pdfvision paper.pdf --layout --image-boxes --format json
Good follow-up checks:
- inspect overview[] for sparse or glyph-corrupted pages.
- use --search to locate cited terms, equations, or claim text before rendering a crop.
- use --render-region for figures, equations, and table fragments.
- use XML or TOON when the result will be fed directly to an LLM.
- check layout.blocks and warning signals before trusting paper reading order on two-column pages.
- use imageBoxes and visualRegions to decide which figures or tables deserve multimodal inspection.
Slide Decks and Reports
Slides often store meaning in images, vector shapes, and relative placement.
pdfvision deck.pdf --layout --image-boxes --vector-boxes --visual-regions --format json
If the slide has large raster regions, render the page or just the visual regions:
pdfvision deck.pdf --render-visual-regions --format json
This is useful for strategy decks, conference slides, product PDFs, and dashboards exported as PDF. The text layer may contain bullet strings, but the conclusion may sit in the chart, arrow, timeline, screenshot, or relative position of shapes.
Financial Reports and Dense Tables
Annual reports, earnings PDFs, invoices, and benchmark reports often flatten row and column relationships into a confusing text stream.
pdfvision report.pdf --layout --vector-boxes --visual-regions --search "Total revenue" --json
Use pdfvision to:
- find the page and bbox for a metric or row label.
- preserve numeric table hints when rows and columns are visually aligned.
- flag table-like pages whose native text order may not match the visual table.
- crop a chart, table, or footnote before asking a vision model to verify it.
pdfvision report.pdf --pages 12 --render --render-region 72,210,468,240 --render-output ./evidence --json
Government Forms and Tax Documents
Forms combine visible labels, widget fields, checkboxes, annotations, and dense rules.
pdfvision form.pdf --layout --form-fields --annotations --links --format json
Use the field and label boxes with --render-region when a field relationship is ambiguous.
This helps an agent avoid the common failure mode where native text sees labels and values but loses the visual relationship between them. --form-fields exposes values, field types, labels, selected states, read-only or required flags, and widget metadata when the PDF contains interactive fields.
Scanned Documents
Use density signals to confirm that native text is missing or sparse, then run OCR only on the pages that need it.
pdfvision scan.pdf --pages 1-5 --ocr --ocr-lang eng --format json
For multilingual pages, put the dominant language first:
pdfvision scan.pdf --ocr --ocr-lang jpn+eng --format json
OCR output is attached beside native text rather than replacing it. This lets an agent compare both signals, keep confidence scores visible, and render a higher-scale crop when small text or tables need verification.
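To run OCR only on the pages that need it, an agent might first pick those pages out of the overview and collapse them into ranges. The following is a minimal sketch under stated assumptions: each overview entry is assumed to carry a 1-based page number in a page field next to quality.nativeTextStatus, and both function names are hypothetical helpers rather than part of pdfvision.

```python
def pages_needing_ocr(overview):
    """Return sorted page numbers whose native text looks missing (assumed fields)."""
    return sorted(
        entry["page"]
        for entry in overview
        if entry.get("quality", {}).get("nativeTextStatus") == "empty_but_visual_content"
    )


def to_page_ranges(pages):
    """Collapse sorted page numbers into strings such as '1-3' or '7'."""
    ranges = []
    for page in pages:
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1][1] = page  # extend the current run
        else:
            ranges.append([page, page])  # start a new run
    return [str(a) if a == b else f"{a}-{b}" for a, b in ranges]
```

Each resulting string has the same shape as the --pages values used in this section (a single page or a start-end range), so one cautious option is to run a separate OCR pass per range.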
Charts, Diagrams, and Visual Tables
Start with visual structure and region detection:
pdfvision report.pdf --layout --image-boxes --vector-boxes --visual-regions --format json
Then render only the relevant crop:
pdfvision report.pdf --pages 8 --render --render-region 80,140,430,260 --render-output ./regions
Use this for chart legends, plot labels, architecture diagrams, screenshots, maps, form sections, and tables whose meaning is graphical. --visual-regions is especially useful when the agent does not know the coordinates yet.
Search-Then-Zoom Verification
When an agent needs to verify a specific clause, field, citation, metric, or label, search first:
pdfvision contract.pdf --search "termination" --search "governing law" --json
Each match can include page, source, context, and bounding boxes. The agent can then crop the exact region instead of rendering the whole document:
pdfvision contract.pdf --pages 9 --render --render-region 96,320,420,96 --render-output ./crops --json
This workflow is useful for retrieval-augmented agents that need auditable PDF evidence, not only extracted text.
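The step from a search match to a crop can be sketched as a small helper that pads a bounding box and formats it as a comma-separated region. This is an illustrative sketch only. It assumes the box is given as x, y, width, height, which this section does not confirm, so check the tool's documentation before relying on that order. The name padded_region and the padding default are hypothetical.

```python
def padded_region(bbox, padding=8.0):
    """Format a match bbox as a comma-separated region with a margin.

    Assumes bbox is (x, y, width, height); adjust if the tool uses corners.
    """
    x, y, width, height = bbox
    # Clamp the top-left corner so the margin never goes negative.
    left = max(0.0, x - padding)
    top = max(0.0, y - padding)
    right = x + width + padding
    bottom = y + height + padding
    values = (left, top, right - left, bottom - top)
    return ",".join(f"{v:g}" for v in values)
```

The sketch does not clip against the page's right or bottom edge because it has no page size to work with. A small margin around the match tends to keep neighbouring labels or units in the crop, which helps when a vision model verifies the text.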
Multilingual and CJK PDFs
Japanese, Chinese, and mixed-language PDFs often expose spacing and glyph issues that text-only tools mishandle.
pdfvision document.pdf --layout --search "請求書" --json
pdfvision normalizes Unicode by default, keeps raw text when normalization changed it, handles CJK-aware spacing in joined text, and can recover vertical CJK layout signals. For scans, combine OCR languages:
pdfvision scan.pdf --ocr --ocr-lang jpn+eng --json
Agentic PDF Triage
For unknown PDFs, start with a cheap overview:
pdfvision document.pdf --format json
Then branch:
- add --layout if reading order, tables, forms, or warnings matter.
- add --render if the page is visual or native text looks suspicious.
- add --ocr if native text is missing and the rendered page contains visible text.
- add --visual-regions when figures, charts, forms, or diagrams need targeted inspection.
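Once the agent has made those decisions, the follow-up pass can be assembled as an argument list. The sketch below uses only flags that appear in this section. The function name next_command and its keyword arguments are hypothetical, and how the decisions are reached is left to the agent.

```python
def next_command(pdf_path, *, layout=False, render=False, ocr=False, visual_regions=False):
    """Assemble an argument list for a follow-up pass from triage decisions."""
    argv = ["pdfvision", pdf_path]
    if layout:
        argv.append("--layout")
    if render:
        argv.append("--render")
    if ocr:
        argv.append("--ocr")
    if visual_regions:
        argv.append("--visual-regions")
    # Always request structured output so the agent can parse the result.
    argv += ["--format", "json"]
    return argv
```

The list can be passed to subprocess.run(argv, capture_output=True, text=True). Building a list instead of a shell string avoids quoting problems with file names that contain spaces.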
The goal is to keep agents honest: inspect the evidence, choose the next view, and avoid treating a blank or flattened text stream as the whole PDF.