Command reference
All commands print compact JSON by default. Success is {"ok": true, ...}; failure is {"ok": false, "error": "..."}.
Every vision command takes the image as the first positional argument. It accepts a file path, - for a base64 image on stdin, or the --clipboard / --screen flags.
Bounding boxes are pixel coordinates [x, y, w, h] with a top-left origin. Each box also includes norm, the same rectangle normalized to [0, 1].
macvision ocr <image>
Extract text with VNRecognizeTextRequest, which is natively multi-language — pass a list and one request reads every script.
macvision ocr ./screenshot.png # auto: broad default languages
macvision ocr ./screenshot.png --lang cjk # Chinese/Japanese/Korean preset
macvision ocr ./screenshot.png --lang zh-Hans,en-US
macvision ocr ./screenshot.png --ja --en # shorthand: Japanese first, then English
macvision ocr - # base64 image on stdin
macvision ocr ./scan.png --level fast
Options
| Option | Default | Description |
|---|---|---|
--lang |
all preset |
Languages/presets, repeatable or comma-separated. Presets: all(default),cjk,cn,latin,en. Custom set via $MACVISION_LANG_<NAME>=a,b,c. e.g. --lang cjk,en-US |
--en --zh --ja --ko |
off | Shorthand: add English / Chinese(zh-Hans,zh-Hant) / Japanese / Korean to --lang at the position they appear. --ja --en equals --lang ja-JP,en-US |
--level |
accurate |
accurate or fast |
--min-confidence |
0 |
Drop results below this confidence |
--top |
all | Keep at most N results |
--no-language-correction |
off | Disable Vision language correction |
--clipboard |
off | Read the image from the clipboard |
--screen |
off | Take a fresh screenshot first |
Language shortcuts — --en, --zh, --ja, --ko each append a script to --lang at their CLI position, so flag order is the language order: --ja --en → ja-JP,en-US. Combine freely with --lang, e.g. --lang cjk --ja.
Language presets — all = zh-Hans,zh-Hant,en-US,ja-JP,ko-KR,fr,de,es,pt,it,ru (auto-detect, no --lang needed). cjk = zh+ja+ko. cn = zh-Hans,zh-Hant. latin/european = en+fr+de+es+pt+it. en = en-US only. Define your own: export MACVISION_LANG_SEA=th-TH,vi-VN,id-ID then macvision ocr img.png --lang sea.
CJK ordering note: Vision prioritizes the first CJK language in the list. The
all/cjkpresets are zh-first, so Chinese + Latin read reliably, but pure Japanese/Korean text can be starved — for those, lead with the language:--lang ja-JP/--lang ko-KR.
Output fields
| Field | Type | Meaning |
|---|---|---|
ok |
bool | Success or failure |
image |
string | Source label |
width, height |
int | Image dimensions |
languages |
[string] | Languages used |
count |
int | Number of text results |
confidence |
float | Mean top-candidate confidence (0 if no text). With count, lets a caller judge whether to retry with a different --lang |
texts |
[object] | One per recognized line/word |
texts[].text |
string | Top candidate string |
texts[].confidence |
float | Candidate confidence |
texts[].bbox |
[int]×4 | [x, y, w, h] pixels, top-left origin |
texts[].norm |
[float]×4 | Same box normalized |
texts[].candidates |
[string] | Other candidates, when present |
macvision classify <image>
Classify a scene/objects with VNClassifyImageRequest, or animal species with VNRecognizeAnimalsRequest.
macvision classify ./photo.jpg --top 5
macvision classify ./photo.jpg --animals
Options
| Option | Default | Description |
|---|---|---|
--top |
10 |
Keep top N labels |
--min-confidence |
0 |
Drop labels below this confidence |
--animals |
off | Recognize animal species instead of scene |
Output fields
| Field | Type | Meaning |
|---|---|---|
mode |
string | scene or animals |
count |
int | Number of labels |
labels |
[object] | {name, confidence} per label |
macvision detect <image>
Run one or more detectors in a single pass. With no detector flag it runs the broad set: faces, barcodes, text regions, and horizon. A flag like --faces narrows to only that detector. --rects (document/card rectangles) and --ocr (recognize the actual text) are additive — --ocr runs text recognition on top of whatever else runs, so detect img --ocr = broad + read the words.
macvision detect ./photo.jpg # broad: faces, barcodes, text regions, horizon
macvision detect ./shot.png --ocr --lang zh-Hans,en-US # broad + read the text
macvision detect ./shot.png --ocr --ja --en # broad + read text (shorthand langs)
macvision detect ./card.jpg --rects # document/card rectangles (opt-in)
macvision detect ./qr.png --barcodes --symbologies qr,ean13
macvision detect ./photo.jpg --faces --horizon # only faces + horizon
Options
| Option | Default | Description |
|---|---|---|
--faces |
off (on by default if none set) | Detect faces |
--rects |
off | Detect document/card rectangles |
--barcodes |
off (on by default if none set) | Detect barcodes / QR |
--text-regions |
off (on by default if none set) | Detect text region boxes (no content) |
--ocr |
off | Recognize the actual text (additive) |
--lang |
all preset |
OCR languages/presets with --ocr. Presets: all(default),cjk,cn,latin,en. Repeatable or comma-separated |
--en --zh --ja --ko |
off | Shorthand: add English / Chinese / Japanese / Korean to --lang at their CLI position |
--horizon |
off (on by default if none set) | Detect horizon angle |
--symbologies |
Vision default | qr,ean13,ean8,upce,code128,code39,code93,datamatrix,pdf417,aztec,itf14 |
--min-size |
0.2 |
Minimum rectangle size for --rects (0–1) |
--min-confidence |
0 |
Drop detections below this confidence |
Output fields
| Field | Type | Meaning |
|---|---|---|
count |
int | Total detections across detectors |
detections.faces |
[object] | {bbox, norm} |
detections.rectangles |
[object] | {bbox, norm, corners, confidence} — corners is 4 [x, y] pixel points |
detections.barcodes |
[object] | {payload, symbology, bbox, norm} |
detections.texts |
[object] | Recognized text (with --ocr): {text, confidence, bbox, norm} |
detections.text_regions |
[object] | {bbox, norm, character_count?} |
detections.horizon |
object | {angle} in radians |
detections.*_count |
int | Per-detector counts |
macvision document <image>
Find the document outline (a quadrilateral) with VNDetectDocumentSegmentationRequest, so you can crop or deskew it.
macvision document ./scan.jpg
Output fields
| Field | Type | Meaning |
|---|---|---|
count |
int | Number of documents found |
documents |
[object] | {bbox, norm, corners, confidence} — corners is 4 [x, y] pixel points (top-left, top-right, bottom-right, bottom-left) |
macvision salient <image>
Produce a saliency heatmap PNG with VNGenerateAttentionBasedSaliencyImageRequest (or objectness with --mode objectness).
macvision salient ./photo.jpg
macvision salient ./photo.jpg --mode objectness --output ./heat.png
Output fields
| Field | Type | Meaning |
|---|---|---|
mode |
string | attention or objectness |
mask_width, mask_height |
int | Heatmap dimensions |
output |
string | Path to the written PNG |
saved |
bool | Whether the PNG was written |
macvision feature <image>
Image fingerprint via VNGenerateImageFeaturePrintRequest. With --compare, compute the distance between two images (0 = identical).
macvision feature ./a.jpg # base64 feature vector
macvision feature ./a.jpg --compare ./b.jpg # distance between two images
macvision feature ./a.jpg --level 2 # higher-precision model (macOS 14+)
Output fields (single)
| Field | Type | Meaning |
|---|---|---|
level |
int | 1 or 2 |
element_count |
int | Vector dimensionality |
element_type |
string | float or double |
bytes |
int | Raw vector size in bytes |
data |
string | Base64-encoded vector bytes |
Output fields (compare)
| Field | Type | Meaning |
|---|---|---|
compare |
string | Second image label |
distance |
float | 0 = identical; larger = more dissimilar |
same |
bool | distance < 0.5 (loose cutoff; pick your own) |
macvision daemon
Long-lived daemon that serves NDJSON requests over named pipes (FIFO). Blocks until cancelled.
macvision daemon --req /tmp/macvision.req --res /tmp/macvision.res &
Options
| Option | Default | Description |
|---|---|---|
--req |
/tmp/macvision.req |
Request FIFO path |
--res |
/tmp/macvision.res |
Response FIFO path |
Request schema — one JSON object per line on the request pipe:
{"action": "ocr", "image": "/tmp/s.png", "lang": ["zh-Hans", "en-US"]}
Supported action values: doctor, ocr, classify, detect, feature, salient, document. Fields mirror the CLI options:
| Action | Fields |
|---|---|
ocr |
image, lang, level ("fast"), min_confidence, top, use_correction |
classify |
image, top, min_confidence, animals |
detect |
image, faces, rects, barcodes, text_regions, horizon, symbologies, min_size, min_confidence |
feature |
image, compare, level (1 or 2) |
salient |
image, mode, output |
document |
image |
doctor |
none |
Each request produces one NDJSON response line on the response pipe, identical to the matching CLI command’s JSON.
macvision doctor
Report the macOS version, architecture, and the supported Vision capabilities.
macvision doctor
Output fields
| Field | Type | Meaning |
|---|---|---|
ok |
bool | Environment is usable (macOS >= 13) |
macos_version |
string | e.g. 26.5.1 |
architecture |
string | e.g. arm64 |
apple_silicon |
bool | Apple Silicon vs Intel |
checks |
[object] | {name, value, ok, requirement?} |
capabilities |
[object] | {name, ok, api} per Vision request |
Common error fields
| Field | Type | Meaning |
|---|---|---|
ok |
bool | false |
error |
string | Human-readable error message |
Common causes:
- Image not found / not decodable — check the path, or that stdin base64 is valid.
- Screen capture failed — grant Screen Recording to your terminal in System Settings.
- No image on the clipboard — copy an image first, then use
--clipboard.