Executive Summary
- Traditional machine vision requires months of manual image tagging and rigid parameter tweaking.
- Vision Language Models (VLMs, such as GPT-4o) can be prompted zero-shot in plain English to identify defects they have never seen before.
- Latency is the primary bottleneck; moving inference to the edge via localized models is critical for high-speed conveyor lines.
[Figure: Zero-shot VLM accuracy on identifying misaligned PCBs vs. the 94% human baseline.]
1. The VLM Shift
Historically, detecting a scratch on a car door meant collecting roughly 10,000 photos of scratches to train a custom CNN. Today, you pass a single image to an API with a prompt such as: "Return a boolean if you observe structural anomalies on this metallic surface."
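A minimal sketch of that zero-shot call, split into a payload builder and a reply parser. The model name, prompt wording, and strict true/false reply format are illustrative assumptions, not a fixed API contract:

```python
DEFECT_PROMPT = (
    "Return a boolean if you observe structural anomalies on this metallic "
    "surface. Answer with exactly 'true' or 'false'."
)

def build_request(image_url: str, model: str = "gpt-4o") -> dict:
    """Assemble a chat-completion payload pairing the prompt with one image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": DEFECT_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def parse_verdict(reply_text: str) -> bool:
    """Map the model's free-text answer onto a strict boolean."""
    answer = reply_text.strip().lower()
    if answer not in ("true", "false"):
        raise ValueError(f"Unexpected reply: {reply_text!r}")
    return answer == "true"
```

Forcing a constrained answer format in the prompt keeps the parser trivial; anything else is treated as an error rather than silently guessed at.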
[Figure: Setup time for QA defect systems.]
[Figure: Edge vs. cloud vision.]
2. The Human-Escalation Fallback
If the model's confidence score for an image falls below 90%, the conveyor belt diverts the item to a secondary channel and posts the image to a Slack channel. An off-site human clicks 'Approve' or 'Reject', and the workflow resumes instantly.
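The routing decision can be sketched as a small pure function. The 90% threshold comes from the text; the `Inspection` fields and the injected `notify` callback (which would wrap a Slack webhook in practice) are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.90  # below this, escalate to a human

@dataclass
class Inspection:
    item_id: str
    defect_detected: bool
    confidence: float  # model's self-reported confidence, 0.0-1.0

def route(inspection: Inspection, notify: Callable[[str], None]) -> str:
    """Return the conveyor action; `notify` posts the escalation message."""
    if inspection.confidence < CONFIDENCE_THRESHOLD:
        notify(f"Review needed for {inspection.item_id} "
               f"(confidence {inspection.confidence:.0%})")
        return "divert"  # secondary channel, await human Approve/Reject
    return "reject" if inspection.defect_detected else "pass"
```

Injecting the notifier keeps the decision logic testable offline and lets the same function drive Slack, email, or a dashboard alert.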
3. Supply Chain Downstreaming
The true ROI isn't just catching bad products. By logging the JSON defect outputs to a Postgres database, analytics dashboards can automatically trace rising defect rates back to specific global suppliers in real time.
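The aggregation behind such a dashboard is simple. A sketch in plain Python, where each row mirrors a logged JSON defect record (the `supplier_id` and `defect_detected` field names are illustrative assumptions); the comment shows the equivalent SQL you might run against the Postgres table instead:

```python
from collections import defaultdict

# Equivalent SQL against the logged table (assumed schema):
#   SELECT supplier_id,
#          AVG(CASE WHEN defect_detected THEN 1.0 ELSE 0.0 END) AS defect_rate
#   FROM inspections
#   GROUP BY supplier_id;

def defect_rates(rows: list[dict]) -> dict[str, float]:
    """Fraction of inspected items flagged defective, per supplier."""
    totals: dict[str, int] = defaultdict(int)
    defects: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["supplier_id"]] += 1
        if row["defect_detected"]:
            defects[row["supplier_id"]] += 1
    return {supplier: defects[supplier] / totals[supplier] for supplier in totals}
```

Trending this rate per supplier over time windows is what turns individual rejects into an early-warning signal upstream.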
