Multi-Camera Production Tracking with YOLOv11
A food production facility came to me with a straightforward question: can we use cameras to track what's happening on the production tables and get per-worker output counts automatically? No manual logging, no clipboards — just cameras, computer vision, and a CSV at the end of the shift.
The result is a multi-camera tracking pipeline that fuses detections from four side-view cameras, maps everything onto a shared table coordinate plane, assigns objects to worker zones, and outputs annotated video alongside per-worker production summaries.
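The last step in that chain — assigning a fused position to a worker zone — is the simplest piece. As a sketch (the worker names and zone boundaries below are made-up examples, not the facility's actual layout):

```python
def assign_zone(x_mm, zones):
    """Assign a fused table-plane position to a worker zone.

    zones: list of (worker, x_min_mm, x_max_mm) strips along the table's
    long axis. Names and boundaries here are illustrative only.
    """
    for worker, lo, hi in zones:
        if lo <= x_mm < hi:
            return worker
    return None  # outside every zone (e.g. a shared or transfer area)

# Hypothetical 2.4 m table split into three equal strips
ZONES = [
    ("worker_a", 0, 800),
    ("worker_b", 800, 1600),
    ("worker_c", 1600, 2400),
]
```

Zones defined in table-plane millimeters (rather than per-camera pixels) only have to be configured once, no matter how many cameras observe them.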
The Core Problem: Multiple Cameras, One Space
A single camera covering a large production table loses detail and has blind spots. Four cameras around the perimeter give full coverage — but now each camera sees overlapping areas and the same object can appear in two feeds simultaneously. Without proper fusion, you'd double-count everything.
The solution is to project every detection from pixel coordinates into a shared millimeter-scale coordinate system on the table plane, then cluster nearby detections across cameras. If camera 0 and camera 2 both see something at roughly the same table-plane position, that's one object, not two. Cross-camera clustering with a configurable distance threshold (default 50mm) handles the deduplication.
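The clustering step can be sketched as a greedy merge over table-plane positions; this is a minimal illustration, not the project's actual implementation, and the function name and data layout are assumptions:

```python
import math

def fuse_detections(detections, threshold_mm=50.0):
    """Greedy cross-camera clustering on the table plane.

    detections: list of (camera_id, x_mm, y_mm) tuples, already projected
    into the shared millimeter coordinate system.
    Returns fused (x_mm, y_mm) positions, one per physical object.
    """
    clusters = []  # each cluster: list of (camera_id, x_mm, y_mm)
    for cam, x, y in detections:
        for cluster in clusters:
            cx = sum(p[1] for p in cluster) / len(cluster)
            cy = sum(p[2] for p in cluster) / len(cluster)
            # only merge detections from *different* cameras: two nearby
            # detections in the same feed are two distinct objects
            same_cam = any(p[0] == cam for p in cluster)
            if not same_cam and math.hypot(x - cx, y - cy) <= threshold_mm:
                cluster.append((cam, x, y))
                break
        else:
            clusters.append([(cam, x, y)])
    return [
        (sum(p[1] for p in c) / len(c), sum(p[2] for p in c) / len(c))
        for c in clusters
    ]

# Cameras 0 and 2 see the same object ~28 mm apart on the table plane;
# a third detection 400 mm away is a separate object.
fused = fuse_detections([(0, 100.0, 200.0), (2, 120.0, 220.0), (1, 500.0, 200.0)])
```

The same-camera guard matters: without it, two adjacent objects seen by one camera could collapse into a single fused position.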
The Pipeline
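Tracking over the fused table-plane positions can be sketched as a nearest-neighbor matcher that retires a track after a run of missed frames (the --max-lost-frames behavior). The class, gate, and parameter values below are illustrative, not the project's actual code:

```python
import math
from itertools import count

class TableTracker:
    """Minimal nearest-neighbor tracker over fused table-plane positions.

    gate_mm and max_lost_frames are assumed parameters for illustration.
    """

    def __init__(self, gate_mm=100.0, max_lost_frames=15):
        self.gate_mm = gate_mm
        self.max_lost_frames = max_lost_frames
        self.tracks = {}   # track_id -> (x_mm, y_mm)
        self.lost = {}     # track_id -> consecutive missed frames
        self._ids = count()

    def update(self, detections):
        """detections: fused (x_mm, y_mm) positions for one frame."""
        unmatched = list(detections)
        for tid, (tx, ty) in list(self.tracks.items()):
            # greedily match each track to its nearest in-gate detection
            best = min(unmatched,
                       key=lambda d: math.hypot(d[0] - tx, d[1] - ty),
                       default=None)
            if best is not None and math.hypot(best[0] - tx, best[1] - ty) <= self.gate_mm:
                self.tracks[tid] = best
                self.lost[tid] = 0
                unmatched.remove(best)
            else:
                self.lost[tid] += 1
                if self.lost[tid] > self.max_lost_frames:
                    del self.tracks[tid], self.lost[tid]  # retire stale track
        for det in unmatched:  # leftover detections spawn new tracks
            tid = next(self._ids)
            self.tracks[tid] = det
            self.lost[tid] = 0
        return dict(self.tracks)

tracker = TableTracker(gate_mm=100.0, max_lost_frames=2)
tracker.update([(0.0, 0.0)])            # new object appears
active = tracker.update([(10.0, 5.0)])  # matched to the existing track
for _ in range(3):                      # unseen past the limit -> retired
    tracker.update([])
```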
Each frame passes through the same sequence: per-camera detection, projection of each detection onto the shared table plane, cross-camera clustering into fused objects, frame-to-frame tracking, and worker-zone assignment. A track that goes undetected for longer than a configurable number of consecutive frames (--max-lost-frames) is retired.
Outputs
The most useful output for the facility is the summary.csv — a simple table of object counts per worker for the session. But the annotated videos are invaluable for debugging: you can scrub through and see exactly what the model detected, which tracks were assigned which IDs, and whether the zone boundaries are positioned correctly.
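Producing the summary itself is straightforward once events are tallied; in this sketch the column names and the per-class breakdown are assumptions, not the project's actual schema:

```python
import csv
import io
from collections import Counter

def write_summary(events, out):
    """Write a per-worker count table.

    events: iterable of (worker_zone, object_class) pairs, one per
    completed item. Column names here are illustrative.
    """
    counts = Counter(events)
    writer = csv.writer(out)
    writer.writerow(["worker", "object_class", "count"])
    for (worker, cls), n in sorted(counts.items()):
        writer.writerow([worker, cls, n])

buf = io.StringIO()
write_summary([("zone_1", "tray"), ("zone_1", "tray"), ("zone_2", "tray")], buf)
```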
The bird's-eye-view (BEV) render is particularly useful for calibration verification — it shows all detections projected onto the table plane, so misaligned cameras show up immediately as position drift.
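The same idea can be automated: if several cameras project a known calibration target to the table plane, a camera whose projection sits far from the consensus is the misaligned one. A minimal sketch of that check (not the project's actual calibration tooling):

```python
import math

def camera_drift(per_camera_points):
    """Distance of each camera's projection from the consensus position.

    per_camera_points: dict camera_id -> (x_mm, y_mm), projections of one
    known target seen by several cameras. A large value flags a camera
    whose calibration has drifted. Illustrative only.
    """
    xs = [p[0] for p in per_camera_points.values()]
    ys = [p[1] for p in per_camera_points.values()]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    return {cam: math.hypot(x - cx, y - cy)
            for cam, (x, y) in per_camera_points.items()}

# Cameras 0 and 1 agree; camera 2 projects the target ~60 mm off.
drift = camera_drift({0: (100.0, 100.0), 1: (102.0, 98.0), 2: (160.0, 100.0)})
```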
Model and Infrastructure
The detection model is YOLOv11s fine-tuned on facility-specific images using Roboflow's training infrastructure. YOLOv11s hits a good balance — fast enough to process multiple cameras without a GPU cluster, accurate enough on the specific object classes involved. For production use, the pipeline supports a local inference server (compatible with Roboflow's local server) to avoid API latency and costs.
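Switching between the hosted API and a local server can be as simple as swapping the base URL; Roboflow's hosted detection API lives at detect.roboflow.com and its local inference server listens on port 9001 by default, but the environment-variable switch below is this sketch's own convention, not part of any SDK — verify the routes against Roboflow's documentation:

```python
import os

def detect_endpoint(model_id: str) -> str:
    """Build the detection endpoint URL for a given model.

    Defaults to Roboflow's hosted detection API; setting
    INFERENCE_BASE_URL (e.g. http://localhost:9001 for a local
    inference server) reroutes requests without code changes.
    The env-var name is an assumption for this sketch.
    """
    base = os.environ.get("INFERENCE_BASE_URL", "https://detect.roboflow.com")
    return f"{base.rstrip('/')}/{model_id}"
```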