CVPR 2026

CVA: Context-aware Video-text Alignment
for Video Temporal Grounding

DGIST
Problem of previous context diversification: query-agnostic mixing introduces false negatives.

Qualitative visualization: Our method accurately localizes target moments by preventing false negatives, while baseline models struggle with semantic confusion caused by background context.

Abstract

We propose Context-aware Video-text Alignment (CVA), a novel framework that addresses a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the false negatives caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context.

Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.

Motivation

The field of Video Temporal Grounding (VTG) aims to localize video segments that correspond to text queries. Despite advances, models tend to learn spurious correlations, overly associating text queries with static backgrounds rather than the target temporal dynamics. Recent work introduced content mixing augmentation that replaces background clips with content from other videos. However, this mixing remains query-agnostic — the replacement clips are sampled without regard for their semantic relevance to the text query. This can generate false negatives when semantically related clips are mistakenly treated as negative examples.

Method

Overview of the CVA framework.

Overview of our CVA framework, which integrates Query-aware Context Diversification (QCD), Context-enhanced Transformer Encoder (CTE), and Context-invariant Boundary Discrimination (CBD).

Query-aware Context Diversification (QCD)

Given a video and its text query, QCD simulates diverse temporal contexts by replacing background regions with clips from another video. Critically, we build a query-aware candidate pool using pre-computed CLIP-based video-text similarity. Only clips whose similarity falls within a valid range [θmin, θmax] are used for replacement — excluding both too-similar clips (false negative risk) and too-different clips (trivial negatives). We also preserve the immediate temporal context surrounding the ground-truth moment.
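The pool construction and replacement steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the threshold values, and the margin size are assumptions, and the per-clip CLIP video-text similarities are assumed to be precomputed.

```python
import numpy as np

def build_candidate_pool(clip_query_sim, theta_min=0.2, theta_max=0.5):
    """Keep only candidate clips whose video-text similarity to the query
    lies in [theta_min, theta_max]: too-similar clips risk false negatives,
    too-dissimilar clips yield trivial negatives."""
    sim = np.asarray(clip_query_sim)
    mask = (sim >= theta_min) & (sim <= theta_max)
    return np.flatnonzero(mask)

def diversify_context(video, gt_span, pool_clips, margin=1, rng=None):
    """Replace background clips (outside the ground-truth moment plus a
    preserved temporal margin) with clips sampled from the query-aware pool."""
    rng = np.random.default_rng(rng)
    T = len(video)
    s, e = gt_span                      # inclusive clip indices of the moment
    keep_lo, keep_hi = max(0, s - margin), min(T - 1, e + margin)
    aug = list(video)
    for t in range(T):
        if keep_lo <= t <= keep_hi:
            continue                    # preserve moment + immediate context
        aug[t] = pool_clips[rng.integers(len(pool_clips))]
    return aug
```

For example, with similarities [0.1, 0.3, 0.45, 0.6] and the default thresholds, only the middle two candidates survive: high enough to be non-trivial, low enough to avoid false negatives.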

Illustration of Query-aware Context Diversification.

Illustration of our Query-aware Context Diversification (QCD).

Context-enhanced Transformer Encoder (CTE)

Most existing models perform immediate cross-attention between video and text without adequately modeling the video's internal temporal context. CTE addresses this with a hierarchical encoder consisting of Nb stacked blocks, each containing:

  • Windowed Self-Attention: Partitions video features into non-overlapping windows to capture local temporal patterns efficiently.
  • Learnable Queries: A set of global queries refined via standard self-attention to represent abstract contextual information.
  • Bidirectional Cross-Attention: Enables information exchange between local video contexts and global query representations.

After all blocks, hierarchical features are aggregated via a learnable weighted sum, producing context-enhanced features FCTE that are then fed into the multimodal encoder for cross-modal alignment with text queries.
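The three mechanisms inside one block can be sketched in single-head NumPy form. This is a simplified illustration under assumptions not stated above: `cte_block` is a hypothetical name, the code omits multi-head projections, residual normalization, and feed-forward layers, and zero-padded rows simply attend inside the final window.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention: (Lq,d),(Lk,d),(Lk,d) -> (Lq,d)
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def cte_block(video_feats, queries, window=4):
    """One CTE block (sketch): windowed self-attention over local video
    context, standard self-attention over learnable global queries, then
    bidirectional cross-attention between the two streams."""
    T, d = video_feats.shape
    pad = (-T) % window
    x = np.pad(video_feats, ((0, pad), (0, 0)))
    # 1) windowed self-attention within non-overlapping windows
    wins = x.reshape(-1, window, d)
    local = np.concatenate([attention(w, w, w) for w in wins])[:T]
    # 2) refine learnable queries via self-attention
    g = attention(queries, queries, queries)
    # 3) bidirectional cross-attention: local <-> global exchange
    video_out = local + attention(local, g, g)
    query_out = g + attention(g, local, local)
    return video_out, query_out
```

In the full model, Nb such blocks are stacked and their outputs combined by the learnable weighted sum described above.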

Context-invariant Boundary Discrimination (CBD) Loss

QCD generates augmented videos where temporal context is altered while the target moment's semantics remain unchanged. CBD enforces this invariance by focusing on the temporal boundaries — the regions most critical for precise localization.

For each boundary index, we construct:

  • Anchors: Boundary features from the first augmented video.
  • Positives: Corresponding boundary features from the second augmentation.
  • Hard Negatives (dual sources): (1) temporally adjacent background clips within margin Nadj, and (2) the Nhard most semantically confusable clips from the remaining background.
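The dual-source negative selection can be sketched as below; `select_hard_negatives` is a hypothetical helper, and the treatment of clip indices and similarity scoring are illustrative assumptions.

```python
import numpy as np

def select_hard_negatives(bg_feats, bg_times, boundary_time, boundary_feat,
                          n_adj=2, n_hard=5):
    """Dual-source hard negatives for one boundary (sketch):
    (1) background clips within n_adj clip indices of the boundary, and
    (2) the n_hard remaining background clips most similar to it."""
    adj = [i for i, t in enumerate(bg_times) if abs(t - boundary_time) <= n_adj]
    rest = [i for i in range(len(bg_times)) if i not in adj]
    sims = bg_feats[rest] @ boundary_feat          # semantic confusability
    hard = [rest[j] for j in np.argsort(-sims)[:n_hard]]
    return bg_feats[sorted(set(adj) | set(hard))]
```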

The contrastive loss pulls anchor–positive pairs together while pushing anchors away from hard negatives, guiding the model to learn highly discriminative, context-invariant boundary representations.
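For a single anchor, this pull/push objective takes the familiar InfoNCE form. The sketch below is a minimal illustration, assuming L2-normalized features and a temperature `tau` whose value is not specified in the text.

```python
import numpy as np

def cbd_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss (sketch) for one boundary feature:
    pull the anchor toward its positive (the same boundary under a different
    context augmentation) and away from hard-negative background clips."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.exp(a @ p / tau)           # anchor-positive similarity
    neg = np.exp(n @ a / tau).sum()     # hard-negative similarities
    return -np.log(pos / (pos + neg))
```

When the anchor matches its positive and the negatives are dissimilar, the loss is near zero; it grows as negatives become confusable with the anchor.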

Performance

QVHighlights Test Split

CVA outperforms all competing approaches, achieving substantial improvements across all metrics. The improvements are most pronounced in Moment Retrieval recall, highlighting the effectiveness of QCD in mitigating false negatives.

Method                     R1@0.5  R1@0.7  mAP@0.5  mAP@0.75  mAP Avg.  HD mAP  HIT@1
Moment-DETR [NeurIPS'21]    52.89   33.02    54.82     29.40     30.73   35.69  55.60
UMT [CVPR'22]               56.23   41.18    53.83     37.01     36.12   38.18  59.99
QD-DETR [CVPR'23]           62.40   44.98    62.52     39.88     39.86   38.94  62.40
EaTR [ICCV'23]              61.36   45.79    61.86     41.91     41.74   37.15  58.65
CG-DETR [arXiv'23]          65.43   48.38    64.51     42.77     42.86   40.33  66.21
BAM-DETR [ECCV'24]          62.71   48.64    64.57     46.33     45.36       –      –
UVCOM [CVPR'24]             63.55   47.47    63.37     42.67     43.18   39.74  64.20
TR-DETR [AAAI'24]           64.66   48.96    63.98     43.73     42.62   39.91  63.42
CDTR [AAAI'25]              65.79   49.60    66.44     45.96     44.37       –      –
TD-DETR [ICCV'25]           64.53   50.37    66.21     47.32     46.69       –      –
CVA (Ours)                  70.05   55.32    69.49     48.45     47.49   44.43  66.01

Charades-STA

CVA sets a new state of the art across all metrics, outperforming BAM-DETR by +2.66 R1@0.5 and +1.40 R1@0.7.

Method                R1@0.3  R1@0.5  R1@0.7   mIoU
M-DETR [NeurIPS'21]    65.83   52.07   30.59  45.54
UniVTG [ICCV'23]       70.81   58.01   35.65  50.10
UVCOM [CVPR'24]            –   59.25   36.64      –
BAM-DETR [ECCV'24]     72.93   59.95   39.38  52.33
CDTR [AAAI'25]         71.16   60.39   37.24  50.65
CVA (Ours)             74.19   62.61   40.78  53.35

TACoS

Our method consistently outperforms previous approaches on TACoS, achieving a mIoU of 41.07 (+1.76 over BAM-DETR).

Method               R1@0.3  R1@0.5  R1@0.7   mIoU
UniVTG [ICCV'23]      51.44   34.97   17.35  33.60
UVCOM [CVPR'24]           –   36.39   23.32      –
CDTR [AAAI'25]        53.41   40.26   23.43  37.28
BAM-DETR [ECCV'24]    56.69   41.45   26.77  39.31
CVA (Ours)            58.80   43.21   27.73  41.07

Analysis

We conduct ablation studies on the QVHighlights validation split to validate the contribution of each component. All three components are complementary, and their combination yields the best performance.

  • QCD: +5.21 R1@0.7 gain (+3.92 HD mAP)
  • CTE: +0.65 R1@0.7 gain (on the QCD baseline)
  • CBD: +2.86 R1@0.7 gain (with hard negatives)
  • Full model (QCD + CTE + CBD): 54.84 R1@0.7 (val)

QCD vs. Query-agnostic Mixing: Query-agnostic mixing yields only marginal improvements, while QCD provides substantially larger gains (+5.21 R1@0.7 and +3.92 HD mAP), confirming that conditioning augmentation on the query is crucial for preventing false-negative contamination.

CBD Negative Sampling: Using only temporally adjacent negatives provides improvement, but the most significant gain comes from introducing semantically hard negatives. The final configuration (Nadj=2, Nhard=5) achieves the best overall performance including the highest HD mAP (43.47).

CTE Architecture: Learnable queries (+1.22 R1@0.5, +0.51 mAP@0.5) and windowed self-attention are complementary. Combining both mechanisms achieves the best overall balance (+2.30 R1@0.5, +1.89 mAP@0.5), demonstrating they jointly enhance temporal context encoding.

BibTeX

@inproceedings{moon2026cva,
  title={CVA: Context-aware Video-text Alignment for Video Temporal Grounding},
  author={Moon, Sungho and Lee, Seunghun and Seo, Jiwan and Im, Sunghoon},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}