Why zhhz

OpenCC is the de facto Chinese-conversion library. zhhz is a from-scratch Rust reimplementation built around the same dictionaries, designed to be:

One self-contained binary. Data is embedded via include_str!; nothing is fetched or installed alongside it.
Memory-safe by construction — pure Rust in the conversion core.
Friendly to custom conversion words at the highest priority, for terminology, branding, or domain vocabulary.
Tracked against upstream data via a pinned, reproducible sync script (scripts/sync-opencc.sh).
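
As a rough illustration of the embedded-data and custom-priority points, the sketch below parses dictionary text that is compiled into the binary and lets a custom table shadow the built-in one. It assumes OpenCC's text format (`key<TAB>value1 value2 ...`). `EMBEDDED_DICT`, `parse_dict`, and `lookup` are hypothetical names, not zhhz's actual API, and an inline constant with illustrative entries stands in for `include_str!` so the snippet compiles without a data file.

```rust
use std::collections::HashMap;

// Stand-in for embedded data. In a real build this would be something like
// `include_str!("path/to/dictionary.txt")`; that path is hypothetical.
const EMBEDDED_DICT: &str = "发\t發 髮\n头发\t頭髮\n";

/// Parse `key<TAB>value1 value2 ...` lines.
/// Assumption: only the first candidate value is kept in this sketch.
fn parse_dict(text: &str) -> HashMap<&str, &str> {
    text.lines()
        .filter_map(|line| {
            let (key, values) = line.split_once('\t')?;
            let first = values.split_whitespace().next()?;
            Some((key, first))
        })
        .collect()
}

/// Look a word up with the custom table consulted first, so user
/// terminology wins over the built-in tables.
fn lookup<'a>(
    word: &str,
    custom: &HashMap<&'a str, &'a str>,
    builtin: &HashMap<&'a str, &'a str>,
) -> Option<&'a str> {
    custom.get(word).or_else(|| builtin.get(word)).copied()
}
```

Because the data is a string constant inside the executable, parsing needs no file I/O at run time, and a custom entry for `发` overrides the built-in one without touching the embedded table.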

What zhhz is for

AI agents (Claude, Cursor, custom LLM pipelines, batch jobs) that need to call Chinese conversion from a shell pipe or a Node.js process without runtime setup.
Server-side text processing where one static binary that always produces byte-identical output is more valuable than a 12-component build chain.
Embedders who want the same conversion engine reachable from Rust, Node.js (today), and Python (planned) without re-licensing the dictionary data.

What zhhz is not

Not a wrapper around the OpenCC C++ library. zhhz re-implements the conversion pipeline from scratch in pure Rust: forward maximum matching (FMM) segmentation and an ordered dictionary-group chain. The shared asset is the dictionary data, which is vendored from upstream.
Not a general-purpose NLP toolkit. No tokenisation, no part-of-speech tagging, no segmentation for jieba-style use. Conversion only.
Not a research platform for new segmentation models. The default is OpenCC’s FMM, which preserves upstream parity. An optional DP segmentation path is on the roadmap, but it will be opt-in and explicit.
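
To make the pipeline terms concrete, here is a minimal sketch of forward maximum matching over an ordered chain of dictionary groups. It is a simplification written for this README, not zhhz's or OpenCC's actual code: `Dict`, `dict`, `convert_pass`, and `convert` are hypothetical names, each stage rescans the whole text, and the entries used in the usage note are illustrative.

```rust
use std::collections::HashMap;

type Dict = HashMap<String, String>;

/// Helper for building a small dictionary from string pairs.
fn dict(pairs: &[(&str, &str)]) -> Dict {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// One pass of forward maximum matching: at each position take the longest
/// key found in the group; earlier dictionaries win at equal length.
fn convert_pass(text: &str, group: &[Dict]) -> String {
    let chars: Vec<char> = text.chars().collect();
    // Longest key (in characters) across the group bounds the search window.
    let max_len = group
        .iter()
        .flat_map(|d| d.keys())
        .map(|k| k.chars().count())
        .max()
        .unwrap_or(0);
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        let longest = max_len.min(chars.len() - i);
        let mut matched = false;
        // Try the longest candidate first, then shrink.
        for len in (1..=longest).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if let Some(value) = group.iter().find_map(|d| d.get(&candidate)) {
                out.push_str(value);
                i += len;
                matched = true;
                break;
            }
        }
        if !matched {
            // No entry starts here: copy one character through unchanged.
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}

/// Apply the groups in order; each stage consumes the previous stage's output.
fn convert(text: &str, chain: &[Vec<Dict>]) -> String {
    chain
        .iter()
        .fold(text.to_string(), |acc, group| convert_pass(&acc, group))
}
```

With a phrase table holding 头发→頭髮 placed ahead of a character table holding 发→發, the input 发现头发 converts the first 发 character by character but matches the trailing 头发 as a two-character phrase, which is the point of trying the longest candidate first.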

What zhhz doesn’t try to be

Not a drop-in replacement for OpenCC. zhhz covers all 16 OpenCC configs and tracks upstream data, but it does not promise a C ABI compatible with OpenCC. If you need that, use the C++ library directly.
Not the fastest possible implementation. zhhz is competitive with OpenCC on warm corpora (within 5–15% on most benchmarks, sometimes faster) and significantly faster on trigram-style inputs. For absolute peak throughput the path forward is a compact FST/double-array trie representation, which is on the roadmap but not the priority.
Not a UI. No TUI, no progress bars, no animated anything. Plain text on stdout, errors on stderr.

Scope decisions

A few non-obvious choices worth knowing:

Decision	Why
Hand-rolled CLI arg parser, no `clap`	The toolchain is pinned to `stable-2024-11-28` (= Rust 1.83) for reproducible release builds, and `clap 4.6+` needs edition2024 / Rust 1.85, so it cannot build on the pinned toolchain. A hand-rolled parser also keeps the static binary small.
`build.rs` regenerates 5 build-time dictionaries	OpenCC’s build system generates reversed variant tables, a tofu-risk subset, and regional-phrase projections at build time; zhhz reproduces all five deterministically from the vendored source data so `data/` stays a pure upstream mirror.
Custom dicts go to the highest priority	Same semantics as OpenCC’s `--dict` — your terminology, branding, or domain vocabulary wins over the built-in tables.
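
The reversed-table row can be pictured with the sketch below. It assumes the reversal is a plain key/value swap over `key<TAB>value1 value2 ...` lines; `reverse_table` is a hypothetical name, and the real `build.rs` may differ in details such as where it writes its output. The point it illustrates is determinism: sorted containers make the generated text depend only on the input entries, not on their order.

```rust
use std::collections::BTreeMap;

/// Reverse `key<TAB>v1 v2 ...` lines into `value<TAB>k1 k2 ...` lines.
/// BTreeMap iterates in sorted key order, so equal inputs give equal bytes.
fn reverse_table(source: &str) -> String {
    let mut reversed: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for line in source.lines() {
        // Skip lines without a tab separator (blank lines, stray text).
        let Some((key, values)) = line.split_once('\t') else {
            continue;
        };
        for value in values.split_whitespace() {
            reversed.entry(value).or_default().push(key);
        }
    }
    let mut out = String::new();
    for (value, mut keys) in reversed {
        // Sort and dedup so line order in the source does not leak through.
        keys.sort_unstable();
        keys.dedup();
        out.push_str(value);
        out.push('\t');
        out.push_str(&keys.join(" "));
        out.push('\n');
    }
    out
}
```

Generating such tables at build time, rather than committing them, is what lets the vendored directory stay an unmodified copy of upstream.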

Where the design lives

Design notes, retrospective write-ups, and benchmark data live in the private mneme repo (specifically mneme/zhhz-design/ and mneme/perf-stories/). They are not part of the public zhhz repo by design — see the mneme repo’s README for the convention.