ICCV 2025

Amazon Grocery Vision Challenge
1st Place Winner

DGIST
🥇 1st Place
STAL Track
🥈 2nd Place
TAL Track

Project Overview

Our solution for the ICCV 2025 Grocery Vision Challenge detects and classifies three shopper behaviors in retail video: Take, Return, and Rummage. It placed 1st in the STAL track and 2nd in the TAL track. The system is a multi-modal fusion architecture that combines temporal action detection with object detection and segmentation.

The solution combines Temporal Action Localization (TAL) and Spatio-Temporal Action Localization (STAL) approaches, leveraging the OpenTAD framework with AdaTAD and CausalTAD models, alongside GLEE object detection and SAM2 segmentation for comprehensive scene understanding.

Technical Architecture

🕐 Temporal Action Localization (TAL)

  • AdaTAD: Adaptive temporal action detection for precise boundary localization
  • CausalTAD: Causal temporal modeling for robust action classification
  • Ensemble Strategy: Multiple model configurations for enhanced performance
  • OpenTAD Framework: Open-source temporal action detection toolbox used to run the TAL models
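
This page does not describe how the ensemble combines its models. A common approach, shown here only as an assumption, is to pool the segments predicted by every model and suppress near-duplicates per class with temporal non-maximum suppression. The `Segment` type, the `ensemble_nms` function, and the 0.5 IoU threshold are hypothetical names and values chosen for illustration; they are not part of OpenTAD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # "take", "return", or "rummage"
    score: float  # model confidence in [0, 1]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def ensemble_nms(model_outputs, iou_threshold=0.5):
    """Pool segments from several models, then keep the highest-scoring
    segment among same-class segments that overlap heavily.
    (Assumed strategy; the actual ensemble may differ.)"""
    pooled = [seg for output in model_outputs for seg in output]
    pooled.sort(key=lambda s: s.score, reverse=True)
    kept = []
    for seg in pooled:
        duplicate = any(
            seg.label == k.label and temporal_iou(seg, k) >= iou_threshold
            for k in kept
        )
        if not duplicate:
            kept.append(seg)
    return kept


# Two hypothetical model outputs that agree on one "take" segment.
model_a = [Segment(1.0, 3.0, "take", 0.9)]
model_b = [Segment(1.2, 3.1, "take", 0.8), Segment(5.0, 6.0, "return", 0.7)]
merged = ensemble_nms([model_a, model_b])
```

With these inputs, the two overlapping "take" segments collapse into the higher-scoring one and the "return" segment is kept unchanged.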

📍 Spatio-Temporal Action Localization (STAL)

  • GLEE: Advanced object detection for precise localization
  • SAM2: Segment Anything Model 2 for fine-grained segmentation
  • Temporal Tracking: Robust object tracking across video sequences
  • Spatio-Temporal Tubelets: Per-object sequences of regions linked across frames (space plus time)
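
To make the tubelet idea concrete, the sketch below links per-frame bounding boxes into tubelets by greedy IoU matching between consecutive frames. This is a generic illustration and not the tracker used in the solution: it does not call GLEE or SAM2, and the function names and the 0.3 threshold are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubelets(frames, iou_threshold=0.3):
    """Greedily link per-frame boxes into tubelets.

    frames: list indexed by frame number; each entry is a list of boxes.
    Returns a list of tubelets, each a list of (frame_index, box) pairs.
    """
    tubelets = []
    active = []  # tubelets that were extended or started in the previous frame
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        next_active = []
        for tube in active:
            last_box = tube[-1][1]
            best = max(unmatched, key=lambda b: box_iou(last_box, b), default=None)
            if best is not None and box_iou(last_box, best) >= iou_threshold:
                tube.append((t, best))
                unmatched.remove(best)
                next_active.append(tube)
            # A tubelet with no match simply ends here.
        for box in unmatched:
            tube = [(t, box)]  # start a new tubelet for each unmatched box
            tubelets.append(tube)
            next_active.append(tube)
        active = next_active
    return tubelets


# Hypothetical detections: one object drifting slightly, then a new object.
frames = [[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(50, 50, 60, 60)]]
tubes = link_tubelets(frames)
```

Here the first two boxes overlap enough to form one tubelet, while the box in the third frame starts a second one.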

⚡ Fusion & Refinement Engine

  • Temporal Intersection: Calculate overlap between TAL segments and STAL tubelets
  • Confidence Refinement: Multi-modal agreement-based score adjustment
  • Action-Object Association: Link actions to specific objects in scenes
  • Boundary Optimization: Refine temporal boundaries using spatial information
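
The first three fusion steps can be sketched as follows. The exact overlap measure and score-adjustment rule are not given on this page, so everything here is an assumption: overlap is measured as the fraction of the TAL segment covered by a tubelet's time span, and the `boost`, `penalty`, and `min_overlap` parameters are hypothetical. Boundary optimization is not shown.

```python
def temporal_overlap(seg_start, seg_end, tube_start, tube_end):
    """Fraction of the TAL segment covered by the tubelet's time span."""
    inter = max(0.0, min(seg_end, tube_end) - max(seg_start, tube_start))
    length = seg_end - seg_start
    return inter / length if length > 0 else 0.0


def fuse(segment, tubelets, boost=0.2, penalty=0.2, min_overlap=0.5):
    """Link a TAL segment to its best-overlapping tubelet and adjust its score.

    segment:  dict with "start", "end", "label", "score"
    tubelets: list of dicts with "id", "start", "end"
    The score is raised when a tubelet agrees with the segment and
    lowered when none does (assumed rule, for illustration only).
    """
    best_id, best_ov = None, 0.0
    for tube in tubelets:
        ov = temporal_overlap(
            segment["start"], segment["end"], tube["start"], tube["end"]
        )
        if ov > best_ov:
            best_id, best_ov = tube["id"], ov

    if best_ov >= min_overlap:
        score = min(1.0, segment["score"] * (1.0 + boost * best_ov))
        object_id = best_id
    else:
        score = segment["score"] * (1.0 - penalty)
        object_id = None
    return {**segment, "score": score, "object_id": object_id, "overlap": best_ov}


# Hypothetical inputs: tubelet "b" spans the whole segment, "a" does not touch it.
segment = {"start": 2.0, "end": 4.0, "label": "take", "score": 0.5}
tubelets = [
    {"id": "a", "start": 0.0, "end": 1.0},
    {"id": "b", "start": 2.0, "end": 4.0},
]
supported = fuse(segment, tubelets)
unsupported = fuse(segment, [])
```

In this example the segment is associated with tubelet "b" and its score rises, while the same segment with no tubelets is left unassociated and its score drops.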

Demo Videos & Visualizations

The videos below show the temporal and spatio-temporal action localization pipeline on each of the three action classes.

Take Action Detection

Demonstration of precise temporal and spatial localization of "take" actions

Return Action Analysis

Advanced detection and tracking of "return" behaviors

Rummage Behavior Recognition

Complex behavior analysis for "rummage" actions