From gold photo to structured registration
Vision models, PDF extraction, and a review interface: a complete intake flow for physical assets, built in four weeks.
What was built
A complete intake flow for physical assets. A mobile app captures the object, a Python backend identifies it with a multimodal vision model, and a web-based review interface enables manual verification and writes the confirmed receipt into the operational ledger. In parallel, an LLM-powered extraction service reads incoming PDF goods receipts and reconciles them against the same ledger.
Four components, one flow. Two fundamentally different recognition problems — image and document — under the same constraint.
The constraint
Near-zero error tolerance. A wrong result is more dangerous than no result. The system had to learn to reject safely — not to recognize at any cost.
This single requirement shaped every architectural decision: which LLM strategy to use, how the review interface works, when fallback logic kicks in, and how the PDF extractor handles ambiguity. The core design principle that emerged: “missing beats wrong” — an empty field is better than a fabricated value.
The tech stack
Mobile app in React Native with Expo for camera integration and upload flow. Recognition backend in Python with FastAPI, built on a pluggable strategy architecture — every LLM strategy implements the same interface, is swappable, and individually benchmarkable. The review interface as a web app, built from scratch in 2–3 days: Supabase for database and auth, S3 for image storage, React frontend. Fast, because the scope was clear.
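The pluggable strategy architecture can be sketched as a common interface that every LLM strategy implements. This is a minimal illustration, not the project's actual code; the class and field names (`RecognitionStrategy`, `RecognitionResult`, `serial`) are assumptions for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RecognitionResult:
    """Per-field values plus confidences; an empty dict means 'nothing recognized'."""
    fields: dict[str, str] = field(default_factory=dict)
    confidences: dict[str, float] = field(default_factory=dict)


class RecognitionStrategy(ABC):
    """Common interface: every strategy is swappable and individually benchmarkable."""

    name: str

    @abstractmethod
    def recognize(self, image_bytes: bytes) -> RecognitionResult:
        ...


class StubStrategy(RecognitionStrategy):
    """Placeholder strategy; a real one would call a vision model here."""

    name = "stub"

    def recognize(self, image_bytes: bytes) -> RecognitionResult:
        return RecognitionResult(
            fields={"serial": "ABC-123"},
            confidences={"serial": 0.92},
        )
```

Because every strategy returns the same result shape, a benchmark harness can iterate over strategies and score them field by field, which is what makes them individually comparable.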
Finding the right model
Classical OCR fails immediately. Engraved text on reflective metal surfaces has no color contrast. Local vision models showed promise but proved unstable — high timeout rates, hallucinated serial numbers, unacceptable inference times on CPU hardware.
No single model performed well across all fields. The final strategy combines two models with field-specific merge logic — calling the second only when the first cannot resolve critical fields. In the end, the cloud vision API outperformed every local strategy in accuracy and latency. This came from a structured benchmark, not from an assumption at the start.
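The field-specific merge logic described above can be sketched as follows. The field names (`serial`, `model`) and the exact merge rules are assumptions for illustration; the key point is that the second model is only called when the first leaves a critical field empty, and a confident first answer is never overwritten.

```python
from typing import Callable

# Hypothetical critical fields that must be resolved for a usable result.
CRITICAL_FIELDS = frozenset({"serial", "model"})


def merge_results(
    primary: dict[str, str],
    secondary_fn: Callable[[], dict[str, str]],
    critical: frozenset[str] = CRITICAL_FIELDS,
) -> dict[str, str]:
    """Invoke the second model only when the first cannot resolve critical fields."""
    result = dict(primary)
    missing = {f for f in critical if not result.get(f)}
    if missing:
        fallback = secondary_fn()  # second model call happens lazily, only on demand
        for f in missing:
            # Fill gaps only; "missing beats wrong" means an empty field
            # is preferable to clobbering a value the primary model gave.
            if fallback.get(f):
                result[f] = fallback[f]
    return result
```

Passing the second model as a callable keeps the expensive call lazy: when the primary result is complete, the fallback never runs.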
The harder problem
After the demo sign-off, the real challenge began. PDF goods receipts from different suppliers are structurally heterogeneous: different layouts, date formats, units, languages. The model must not only extract — it must decide what it cannot know, and signal that explicitly.
The solution: an extraction prompt with runtime-injected whitelist constraints, strictly separated into system prompt and user prompt. The model receives exactly the constraints that apply to the given issuer at runtime. reconciliation_ready is only set to true when all required fields are complete and plausible. The model rejects implicitly by leaving fields empty.
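The shape of this approach can be sketched in a few lines. The required field names, whitelist structure, and message format below are assumptions for illustration; the two essential ideas are the strict system/user split with runtime-injected constraints, and a reconciliation_ready check that only passes when every required field is present and plausible.

```python
REQUIRED_FIELDS = ("supplier", "date", "quantity", "unit")  # hypothetical


def build_messages(issuer_whitelist: dict, pdf_text: str) -> list[dict]:
    """Strict split: constraints go in the system prompt, the document in the user prompt."""
    system = (
        "Extract goods-receipt fields as JSON. "
        "Allowed units for this issuer: " + ", ".join(issuer_whitelist["units"]) + ". "
        "If a value is absent or ambiguous, leave the field empty. Never guess."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": pdf_text},
    ]


def reconciliation_ready(extracted: dict, issuer_whitelist: dict) -> bool:
    """True only when all required fields are complete and plausible."""
    if any(not extracted.get(f) for f in REQUIRED_FIELDS):
        return False  # an empty field is an implicit rejection by the model
    return extracted["unit"] in issuer_whitelist["units"]
```

Because the model is instructed to leave unknown fields empty, the validation side never has to distinguish "not extracted" from "refused": both simply fail the readiness check and route the document to review.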
What remains
Whether vision model or PDF extractor, the real work is not in optimizing for completeness but in defining when the model may safely decline to recognize. Define this too late, and you build the system twice. And the final LLM strategy was not predictable at the start: only a structured benchmark revealed which combination actually works under the given constraints.