TIER-LOC: Visual Query-based Video Clip Localization in fetal ultrasound videos with a multi-tier transformer.

Authors

Mishra D, Saha P, Zhao H, Hernandez-Cruz N, Patey O, Papageorghiou AT, Noble JA

Affiliations (4)

  • Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, OX3 7DQ, United Kingdom. Electronic address: [email protected].
  • Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, OX3 7DQ, United Kingdom.
  • Institute of Life Course and Medical Sciences, University of Liverpool, William Henry Duncan Building, 6 West Derby Street, Liverpool, L7 8TX, United Kingdom.
  • Nuffield Department of Women's & Reproductive Health, University of Oxford, Oxford, OX3 9DU, United Kingdom.

Abstract

In this paper, we introduce the Visual Query-based task of Video Clip Localization (VQ-VCL) for medical video understanding. Specifically, we aim to retrieve a video clip containing frames similar to a given exemplar frame from a given input video. To solve the task, we propose a novel visual query-based video clip localization model called TIER-LOC. TIER-LOC is designed to improve video clip retrieval, especially in fine-grained videos, by extracting features from different levels, i.e., coarse to fine-grained, referred to as Tiers. The aim is to utilize multi-Tier features to detect subtle differences and adapt to scale or resolution variations, leading to improved video clip retrieval. TIER-LOC has three main components: (1) a Multi-Tier Spatio-Temporal Transformer that fuses spatio-temporal features extracted from multiple Tiers of video frames with features from multiple Tiers of the visual query, enabling better video understanding; (2) a Multi-Tier, Dual Anchor Contrastive Loss to deal with real-world annotation noise, which can be notable at event boundaries and in videos featuring highly similar objects; (3) a Temporal Uncertainty-Aware Localization Loss designed to reduce the model's sensitivity to imprecise event boundaries. This is achieved by relaxing hard boundary constraints, thus allowing the model to learn underlying class patterns rather than being influenced by individual noisy samples. To demonstrate the efficacy of TIER-LOC, we evaluate it on two ultrasound video datasets and an open-source egocentric video dataset. First, we develop a sonographer workflow assistive task model to detect standard-frame clips in fetal ultrasound heart sweeps. Second, we assess our model's performance in retrieving standard-frame clips for detecting fetal anomalies in routine ultrasound scans, using the large-scale PULSE dataset. Lastly, we test our model's performance on an open-source computer vision video dataset by creating a VQ-VCL fine-grained video dataset based on the Ego4D dataset. Our model outperforms the best-performing state-of-the-art model by 7%, 4%, and 4% on the three video datasets, respectively.
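The abstract does not give implementation details for the Temporal Uncertainty-Aware Localization Loss, but the idea of relaxing hard boundary constraints can be illustrated with a minimal sketch. The Python snippet below assumes a PyTorch setting and uses hypothetical names (gaussian_target, uncertainty_aware_loc_loss, sigma); it replaces one-hot start/end boundary targets with Gaussian-smoothed distributions so that predictions close to an imprecisely annotated boundary are only mildly penalized. This is an illustrative interpretation of the stated principle, not the authors' implementation.

  # Illustrative sketch only; all names and the Gaussian smoothing are assumptions.
  import torch
  import torch.nn.functional as F

  def gaussian_target(center: int, num_frames: int, sigma: float = 2.0) -> torch.Tensor:
      """Soft target distribution over frame indices, peaked at the annotated boundary."""
      idx = torch.arange(num_frames, dtype=torch.float32)
      t = torch.exp(-0.5 * ((idx - center) / sigma) ** 2)
      return t / t.sum()

  def uncertainty_aware_loc_loss(start_logits, end_logits, gt_start, gt_end, sigma=2.0):
      """KL divergence between predicted boundary distributions and the soft targets."""
      num_frames = start_logits.shape[-1]
      log_p_start = F.log_softmax(start_logits, dim=-1)
      log_p_end = F.log_softmax(end_logits, dim=-1)
      t_start = gaussian_target(gt_start, num_frames, sigma)
      t_end = gaussian_target(gt_end, num_frames, sigma)
      return (F.kl_div(log_p_start, t_start, reduction="sum")
              + F.kl_div(log_p_end, t_end, reduction="sum"))

  # Usage: logits over 100 candidate frames, annotated clip spanning frames 40-55.
  start_logits = torch.randn(100)
  end_logits = torch.randn(100)
  loss = uncertainty_aware_loc_loss(start_logits, end_logits, gt_start=40, gt_end=55)

With a larger sigma the targets spread over more frames, which loosens the boundary constraint further; sigma = 0 would recover a hard, one-hot target.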

Topics

  • Ultrasonography, Prenatal
  • Video Recording
  • Image Interpretation, Computer-Assisted
  • Journal Article
