ICCV 2025

Amazon Grocery Vision Challenge
1st Place Winner

Sungho Moon*, Seunghun Lee*, Sunghoon Im†

DGIST

🥇 1st Place: STAL Track

🥈 2nd Place: TAL Track

Challenge Website | Technical Paper (Coming Soon) | Code (Coming Soon)

Project Overview

Our solution to the ICCV 2025 Amazon Grocery Vision Challenge placed 1st in the STAL track and 2nd in the TAL track. It is a multi-modal fusion architecture that detects and classifies three consumer behaviors in retail environments: Take, Return, and Rummage.

The solution combines Temporal Action Localization (TAL) and Spatio-Temporal Action Localization (STAL) approaches, leveraging the OpenTAD framework with AdaTAD and CausalTAD models, alongside GLEE object detection and SAM2 segmentation for comprehensive scene understanding.

Technical Architecture

🕐 Temporal Action Localization (TAL)

AdaTAD: Adaptive temporal action detection for precise boundary localization
CausalTAD: Causal temporal modeling for robust action classification
Ensemble Strategy: Multiple model configurations for enhanced performance
OpenTAD Framework: Open-source temporal action detection toolbox in which both models are run
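
This page does not say how the ensemble combines model outputs. One common option is to pool the segments predicted by each configuration and suppress near-duplicates with temporal non-maximum suppression. The sketch below assumes that approach. The `Segment` type, `temporal_iou`, `ensemble_nms`, the 0.5 threshold, and the example predictions are all illustrative and are not taken from the authors' code or from OpenTAD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # "take", "return", or "rummage"
    score: float  # model confidence in [0, 1]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def ensemble_nms(per_model: list[list[Segment]], iou_thr: float = 0.5) -> list[Segment]:
    """Pool segments from all models; keep the top-scoring one per overlap group."""
    pooled = sorted(
        (seg for segs in per_model for seg in segs),
        key=lambda s: s.score,
        reverse=True,
    )
    kept: list[Segment] = []
    for seg in pooled:
        # A segment is a duplicate if a kept segment of the same class overlaps it enough.
        duplicate = any(
            k.label == seg.label and temporal_iou(k, seg) >= iou_thr for k in kept
        )
        if not duplicate:
            kept.append(seg)
    return kept


# Hypothetical outputs from two model configurations on one video.
model_a = [Segment(1.0, 3.0, "take", 0.9), Segment(7.0, 9.0, "return", 0.6)]
model_b = [Segment(1.2, 3.1, "take", 0.8), Segment(12.0, 15.0, "rummage", 0.7)]
merged = ensemble_nms([model_a, model_b])
```

Here the two overlapping "take" predictions collapse into the higher-scoring one, while the non-overlapping "return" and "rummage" segments are both kept. Score averaging or Soft-NMS would be reasonable alternatives under the same pooling idea.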

📍 Spatio-Temporal Action Localization (STAL)

GLEE: Advanced object detection for precise localization
SAM2: Segment Anything Model 2 for fine-grained segmentation
Temporal Tracking: Robust object tracking across video sequences
Spatio-Temporal Tubelets: Per-object sequences of detections linked across frames
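
To make the tubelet idea concrete, the sketch below links per-frame boxes into tubelets with a simple greedy IoU matcher. This is a stand-in for illustration only: the pipeline described here relies on GLEE and SAM2 for detection, segmentation, and tracking, and their APIs are not shown. The `Tubelet` class, `box_iou`, `link_tubelets`, the 0.3 threshold, and the example boxes are hypothetical.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Tubelet:
    track_id: int
    boxes: dict[int, Box] = field(default_factory=dict)  # frame index -> box

    @property
    def span(self) -> tuple[int, int]:
        """First and last frame index covered by this tubelet."""
        return min(self.boxes), max(self.boxes)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_tubelets(frames: list[list[Box]], iou_thr: float = 0.3) -> list[Tubelet]:
    """Greedily extend each tubelet with the best-overlapping box in the next frame."""
    tubelets: list[Tubelet] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tube in tubelets:
            last_frame = tube.span[1]
            if last_frame != t - 1 or not unmatched:
                continue  # tubelet ended earlier, or no boxes left in this frame
            prev_box = tube.boxes[last_frame]
            best = max(unmatched, key=lambda b: box_iou(prev_box, b))
            if box_iou(prev_box, best) >= iou_thr:
                tube.boxes[t] = best
                unmatched.remove(best)
        for box in unmatched:  # every unmatched box starts a new tubelet
            tubelets.append(Tubelet(track_id=len(tubelets), boxes={t: box}))
    return tubelets


# Hypothetical detections: one object visible from frame 0, a second from frame 1.
frames = [
    [(10, 10, 50, 50)],
    [(12, 11, 52, 51), (200, 200, 240, 240)],
    [(14, 12, 54, 52), (202, 200, 242, 240)],
]
tubes = link_tubelets(frames)
```

Each resulting tubelet records where an object is in every frame it covers, and its `span` gives the temporal extent that can later be compared against TAL segments.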

⚡ Fusion & Refinement Engine

Temporal Intersection: Calculate overlap between TAL segments and STAL tubelets
Confidence Refinement: Multi-modal agreement-based score adjustment
Action-Object Association: Link actions to specific objects in scenes
Boundary Optimization: Refine temporal boundaries using spatial information
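
The four fusion steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the overlap measure (fraction of the action segment covered by a tubelet), the linear score blend weighted by `alpha`, the `min_overlap` cutoff, the down-weighting of unsupported segments, and all names (`ActionSegment`, `ObjectTubelet`, `overlap_ratio`, `fuse`) are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class ActionSegment:  # one TAL prediction
    start: float
    end: float
    label: str
    score: float


@dataclass
class ObjectTubelet:  # temporal extent of one STAL tubelet
    track_id: int
    start: float
    end: float
    score: float


def overlap_ratio(seg: ActionSegment, tube: ObjectTubelet) -> float:
    """Temporal intersection: fraction of the action segment covered by the tubelet."""
    inter = max(0.0, min(seg.end, tube.end) - max(seg.start, tube.start))
    duration = seg.end - seg.start
    return inter / duration if duration > 0 else 0.0


def fuse(segments, tubelets, min_overlap=0.5, alpha=0.5):
    results = []
    for seg in segments:
        # Action-object association: pick the tubelet that covers the segment best.
        best = max(tubelets, key=lambda t: overlap_ratio(seg, t), default=None)
        ratio = overlap_ratio(seg, best) if best is not None else 0.0
        if best is None or ratio < min_overlap:
            # No spatial support: keep the segment but down-weight it (assumption).
            results.append({"label": seg.label, "track_id": None,
                            "start": seg.start, "end": seg.end,
                            "score": (1 - alpha) * seg.score})
            continue
        # Confidence refinement: blend the TAL score with overlap-weighted STAL score.
        score = (1 - alpha) * seg.score + alpha * ratio * best.score
        # Boundary optimization: clip the segment to the tubelet's temporal extent.
        results.append({"label": seg.label, "track_id": best.track_id,
                        "start": max(seg.start, best.start),
                        "end": min(seg.end, best.end),
                        "score": score})
    return results


# Hypothetical inputs, times in seconds.
segments = [ActionSegment(2.0, 6.0, "take", 0.8),
            ActionSegment(20.0, 22.0, "rummage", 0.6)]
tubelets = [ObjectTubelet(0, 3.0, 7.0, 0.9), ObjectTubelet(1, 10.0, 12.0, 0.7)]
fused = fuse(segments, tubelets)
```

In this example the "take" segment is linked to tubelet 0 and clipped to the interval they share, while the "rummage" segment has no overlapping tubelet and keeps its boundaries with a reduced score.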

Demo Videos & Visualizations

The videos below show the output of our temporal and spatio-temporal action localization pipeline for each of the three action classes.

Take Action Detection

Demonstration of precise temporal and spatial localization of "take" actions

Return Action Analysis

Advanced detection and tracking of "return" behaviors

Rummage Behavior Recognition

Complex behavior analysis for "rummage" actions
