Transparent comparison using a validation dataset with ground-truth labels.
Results are generated from 700 images sampled at random (~9%) from the 7,570-image validation set.
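The sampling step can be made reproducible by seeding the random generator. A minimal sketch (the function name, seed value, and ID format are illustrative assumptions, not taken from the actual script):

```python
import random

def sample_validation_images(image_ids, sample_size=700, seed=42):
    """Draw a reproducible random sample of image IDs from the validation set."""
    rng = random.Random(seed)  # fixed seed so reruns pick the same subset
    return rng.sample(image_ids, sample_size)

# Hypothetical validation-set IDs for illustration.
ids = [f"img_{i:05d}" for i in range(7570)]
subset = sample_validation_images(ids)
```

With a fixed seed, regenerating the comparison yields the same 700-image subset each time.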
Ground truth labels enable honest evaluation: ✅ True Positives, ❌ False Positives, ⚠️ Missed Detections
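The three outcome categories above follow from matching each prediction to at most one ground-truth box. A sketch of one common approach, greedy IoU matching at a 0.5 threshold (the helper names and threshold are assumptions; the actual script may use a different matching rule):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def classify_detections(preds, gts, thresh=0.5):
    """Greedily match predictions to unmatched GT boxes by IoU.
    Returns (true_positives, false_positives, missed_detections)."""
    matched = set()
    tp, fp = [], []
    for p in preds:
        best_j, best_iou = None, thresh
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            matched.add(best_j)   # each GT box can be claimed once
            tp.append(p)          # ✅ matched a GT box
        else:
            fp.append(p)          # ❌ no GT box overlaps enough
    # ⚠️ GT boxes no prediction claimed are missed detections
    missed = [g for j, g in enumerate(gts) if j not in matched]
    return tp, fp, missed
```

Any GT box left unmatched after all predictions are processed counts as a missed detection.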
Note: Some GT labels may be incorrect. See the GT Audit for flagged issues.
Regenerate: `python scripts/generate-val-comparison.py --sample-size 700`