ICCV 2025
Our award-winning solution for the ICCV 2025 Grocery Vision Challenge detects and classifies three consumer behaviors in retail environments: Take, Return, and Rummage actions. It is built on a multi-modal fusion architecture that integrates several deep learning components.
The solution combines Temporal Action Localization (TAL) and Spatio-Temporal Action Localization (STAL) approaches, leveraging the OpenTAD framework with AdaTAD and CausalTAD models, alongside GLEE object detection and SAM2 segmentation for comprehensive scene understanding.
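To make the relationship between the two tasks concrete, here is a minimal sketch of one way a temporal segment (TAL output) could be combined with a per-frame object track (from detection and segmentation) to form a spatio-temporal tube. This is an illustration under stated assumptions, not the fusion logic of our pipeline: the `Segment` class, the `build_tube` function, and the track format (a mapping from frame index to box) are hypothetical and do not correspond to OpenTAD, GLEE, or SAM2 APIs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Segment:
    """A temporal action proposal (hypothetical structure)."""
    label: str      # e.g. "take", "return", or "rummage"
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    score: float    # confidence of the temporal model


def build_tube(segment: Segment, track: Dict[int, Box], fps: float) -> Dict[int, Box]:
    """Keep only the tracked boxes whose frame index falls inside the segment.

    `track` maps frame index -> box for a single tracked object.
    The result is a spatio-temporal tube: the same object's boxes,
    restricted to the frames where the action is predicted to occur.
    """
    first_frame = int(segment.start_s * fps)
    last_frame = int(segment.end_s * fps)
    return {
        frame: box
        for frame, box in sorted(track.items())
        if first_frame <= frame <= last_frame
    }
```

For example, with a "take" segment spanning 1.0 s to 2.0 s at 10 fps, only the boxes on frames 10 through 20 of the track are kept. A real system would also need to decide which track belongs to which segment, which this sketch leaves out.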
Explore our winning solution in action through these visualizations of our temporal and spatio-temporal action localization pipeline, one for each of the three action classes.
Demonstration of precise temporal and spatial localization of "take" actions
Advanced detection and tracking of "return" behaviors
Complex behavior analysis for "rummage" actions