🏆 1st Place Winner

ICCV 2025 Amazon Grocery Vision Challenge

Multi-modal AI for Temporal Action Localization

Amazon (ICCV 2025 Challenge)Jul 2025 - Aug 2025

🎯 Project Overview

Developed a multi-modal AI model for Temporal Action Localization (TAL) and Spatio-Temporal Action Localization (STAL) in grocery shopping scenarios. The project was built for Amazon's ICCV 2025 Grocery Vision Challenge, which focuses on understanding complex human behaviors in retail environments.

🚀 Key Achievements

🥇

1st Place in TAL Track

Achieved top performance in Temporal Action Localization

🥇

1st Place in STAL Track

Achieved top performance in Spatio-Temporal Action Localization

1 Month Development

Rapid prototyping and optimization within a tight deadline

🔬 Technical Approach

AdaTAD Model

Leveraged AdaTAD, an adapter-based end-to-end temporal action detection model, for robust temporal boundary detection
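TAL models like AdaTAD emit candidate action segments with confidence scores, which are then post-processed to remove overlapping duplicates. A minimal sketch of that standard post-processing step (temporal IoU plus greedy non-maximum suppression; illustrative only, not AdaTAD's actual implementation):

```python
def temporal_iou(a, b):
    """IoU of two 1-D temporal segments given as (start, end, ...) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_segments(segments, iou_thresh=0.5):
    """Greedy NMS over (start, end, score) proposals: keep the highest-scoring
    segments, dropping any that overlap an already-kept one above iou_thresh."""
    kept = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) < iou_thresh for k in kept):
            kept.append(seg)
    return kept
```

For example, two heavily overlapping proposals for the same action collapse to the higher-scoring one, while a distant segment survives.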

SAM2 Integration

Integrated Segment Anything Model 2 (SAM2) for precise spatial segmentation and object-level action localization
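SAM2 produces per-frame segmentation masks; for object-level action localization these are typically reduced to bounding boxes. A minimal sketch of that mask-to-box step, assuming a boolean mask array (the function name is illustrative, not part of the SAM2 API):

```python
import numpy as np

def mask_to_box(mask):
    """Tightest (x1, y1, x2, y2) box around a boolean segmentation mask.

    Returns None if the mask is empty (no object pixels).
    """
    ys, xs = np.nonzero(mask)           # row/column indices of mask pixels
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```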

Multi-Modal Fusion

Combined video, audio, and contextual cues through a multi-modal fusion architecture for comprehensive scene understanding
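One common way to combine modalities is late fusion: each modality produces its own class-probability vector, and the vectors are merged with a weighted average. A minimal sketch under that assumption (the weights and modality names here are hypothetical, not the challenge submission's values):

```python
import numpy as np

def late_fusion(modal_scores, weights):
    """Weighted late fusion of per-modality class-probability vectors.

    modal_scores: dict modality-name -> np.ndarray of class probabilities
    weights:      dict modality-name -> non-negative importance weight
    Weights are normalized to sum to 1 before averaging.
    """
    total = sum(weights.values())
    return sum((weights[m] / total) * modal_scores[m] for m in modal_scores)
```

For instance, weighting video 3:1 over audio shifts the fused prediction toward the visual stream while still letting audio break ties.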

Spatio-Temporal Optimization

Optimized joint spatio-temporal modeling by combining AdaTAD's temporal precision with SAM2's spatial accuracy
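Joining the two components amounts to building spatio-temporal "tubes": a temporal segment from the TAL side selects which per-frame boxes from the spatial side belong to the action. A minimal sketch of that linking step, assuming per-frame boxes indexed by frame number and a known frame rate:

```python
def build_tube(segment, frame_boxes, fps=30.0):
    """Assemble a spatio-temporal tube for one detected action.

    segment:     (start, end) of the action in seconds
    frame_boxes: dict frame-index -> (x1, y1, x2, y2) spatial box
    Keeps only boxes whose timestamps fall inside the temporal segment.
    """
    start, end = segment
    return {f: box for f, box in frame_boxes.items()
            if start <= f / fps <= end}
```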

💡 Innovation Highlights

  • Successfully combined AdaTAD and SAM2 models for superior TAL & STAL performance
  • Achieved state-of-the-art temporal boundary detection with AdaTAD's adapter-based mechanisms
  • Enhanced spatial localization accuracy through SAM2's advanced segmentation capabilities
  • Novel multi-modal fusion architecture specifically optimized for grocery shopping scenarios
  • Efficient joint optimization of temporal and spatial components for real-time performance

🔗 Resources