
Multi-Camera Production Tracking with YOLOv11

A food production facility came to me with a straightforward question: can we use cameras to track what's happening on the production tables and get per-worker output counts automatically? No manual logging, no clipboards — just cameras, computer vision, and a CSV at the end of the shift.

The result is a multi-camera tracking pipeline that fuses detections from four side-view cameras, maps everything onto a shared table coordinate plane, assigns objects to worker zones, and outputs annotated video alongside per-worker production summaries.

The Core Problem: Multiple Cameras, One Space

A single camera covering a large production table loses detail and has blind spots. Four cameras around the perimeter give full coverage — but now each camera sees overlapping areas and the same object can appear in two feeds simultaneously. Without proper fusion, you'd double-count everything.

The solution is to project every detection from pixel coordinates into a shared millimeter-scale coordinate system on the table plane, then cluster nearby detections across cameras. If camera 0 and camera 2 both see something at roughly the same table-plane position, that's one object, not two. Cross-camera clustering with a configurable distance threshold (default 50mm) handles the deduplication.
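The projection-and-dedup step above can be sketched in a few lines. This is an illustrative reconstruction, not the pipeline's actual code: the function names, the greedy clustering strategy, and the DLT-style homography solve are my assumptions; only the four-corner calibration and the 50mm threshold come from the pipeline itself.

```python
import numpy as np

def table_homography(pixel_corners, table_corners_mm):
    """Solve for the 3x3 homography mapping image pixels to table-plane mm
    from the four clicked corner correspondences (DLT with h33 fixed to 1)."""
    A, b = [], []
    for (x, y), (u, v) in zip(pixel_corners, table_corners_mm):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.asarray(A, float), np.asarray(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def project_to_table(H, point_px):
    """Project one (x, y) pixel point onto the table plane (mm)."""
    x, y = point_px
    u, v, w = H @ np.array([x, y, 1.0])
    return np.array([u / w, v / w])

def fuse_detections(points_mm, threshold_mm=50.0):
    """Greedy cross-camera dedup: a point within threshold_mm of an
    existing cluster's centroid joins it; otherwise it starts a new one."""
    clusters = []
    for pt in points_mm:
        for c in clusters:
            if np.linalg.norm(pt - np.mean(c, axis=0)) < threshold_mm:
                c.append(pt)
                break
        else:
            clusters.append([pt])
    return [np.mean(c, axis=0) for c in clusters]
```

With per-camera homographies in hand, fusing a frame is just: project every camera's detections into mm, concatenate, and cluster. Two cameras reporting the same object within 50mm collapse to one cluster centroid.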

The Pipeline

1. Calibration. For each camera, you click the four table corners in the frame and enter the corresponding real-world coordinates in mm. This generates a homography matrix per camera. Done once, reused every run.
2. Inference. Frames are extracted at a configurable FPS (default 2 FPS, enough for production tracking without drowning in compute). Each frame is sent to a fine-tuned YOLOv11s model via Roboflow's API, or to a local inference server if you're running offline. The model detects production objects by class.
3. Tracking. ByteTrack runs per camera to maintain consistent IDs across frames. Objects that briefly disappear (e.g., a hand covers them) get a grace period before their track is dropped, configurable via --max-lost-frames.
4. Fusion. Each detection is projected onto the table plane using its camera's homography. Cross-camera clusters are formed by proximity, and each cluster gets a stable global ID, maintained across frames using a simple centroid-distance assignment.
5. Zone assignment. The table plane is divided into worker zones defined in a JSON config. Each tracked object is assigned to the zone containing its position, and counts accumulate per worker per object class.
6. Output. Annotated video per camera, a 2×2 grid video combining all four feeds, a bird's-eye-view video showing the table plane with all tracked objects, and CSV files: one with per-frame tracking data, one with per-worker production totals.
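To make the zone-assignment step concrete, here's a minimal sketch. The JSON shape (named rectangular zones in table-plane mm) and the function names are assumptions for illustration; the real config format may differ.

```python
import json

# Hypothetical zone config: axis-aligned rectangles in table-plane mm.
zones_json = """
{
  "worker_1": {"x_min": 0,    "x_max": 1000, "y_min": 0, "y_max": 600},
  "worker_2": {"x_min": 1000, "x_max": 2000, "y_min": 0, "y_max": 600}
}
"""
zones = json.loads(zones_json)

def assign_zone(x_mm, y_mm):
    """Return the name of the zone containing the point, or None."""
    for name, z in zones.items():
        if z["x_min"] <= x_mm < z["x_max"] and z["y_min"] <= y_mm < z["y_max"]:
            return name
    return None

counts = {}  # (worker_zone, object_class) -> accumulated count

def record(object_class, x_mm, y_mm):
    """Credit a tracked object to the worker whose zone contains it."""
    zone = assign_zone(x_mm, y_mm)
    if zone is not None:
        key = (zone, object_class)
        counts[key] = counts.get(key, 0) + 1
```

Because everything is already in table-plane coordinates by this point, zone membership is a plain point-in-rectangle test, with no per-camera logic needed.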

Outputs

The most useful output for the facility is the summary.csv — a simple table of worker vs. object counts for the session. But the annotated videos are invaluable for debugging: you can scrub through and see exactly what the model detected, which tracks got assigned which IDs, and whether the zone boundaries are positioned correctly.
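Generating that summary is the easy part once the counts exist. The column names below are illustrative, not necessarily the actual summary.csv schema:

```python
import csv
import io

def write_summary(counts, fh):
    """Write per-worker production totals as CSV.
    counts maps (worker, object_class) -> total for the session."""
    writer = csv.writer(fh)
    writer.writerow(["worker", "object_class", "count"])
    for (worker, cls), n in sorted(counts.items()):
        writer.writerow([worker, cls, n])

buf = io.StringIO()
write_summary({("worker_1", "tray"): 42, ("worker_2", "tray"): 37}, buf)
```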

The bird's-eye-view (BEV) render is particularly useful for calibration verification — it shows all detections projected onto the table plane, so misaligned cameras show up immediately as position drift.

Model and Infrastructure

The detection model is YOLOv11s fine-tuned on facility-specific images using Roboflow's training infrastructure. YOLOv11s hits a good balance — fast enough to process multiple cameras without a GPU cluster, accurate enough on the specific object classes involved. For production use, the pipeline supports a local inference server (compatible with Roboflow's local server) to avoid API latency and costs.
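Switching between hosted and local inference can then be little more than swapping the base URL. The sketch below follows the general shape of Roboflow's hosted detection endpoint, but treat the exact paths, port, and parameters as assumptions to verify against your own deployment:

```python
def infer_url(model_id: str, version: int, api_key: str,
              local: bool = False) -> str:
    """Build a detection endpoint URL for either the hosted API or a
    local inference server. Port 9001 and the path layout are assumptions
    to check against the server you actually run."""
    base = "http://localhost:9001" if local else "https://detect.roboflow.com"
    return f"{base}/{model_id}/{version}?api_key={api_key}"
```

The rest of the pipeline stays identical either way, which is what makes the offline mode cheap to support.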

Interested in something similar? Multi-camera production tracking is applicable anywhere you have a physical workspace and need automated output counting — manufacturing lines, packing stations, food prep areas. Get in touch if you'd like to discuss a deployment.