UltrasOM: A Mamba-based network for 3D freehand ultrasound reconstruction using optical flow.
Affiliations (2)
- Key Laboratory of Mechanism Theory and Equipment Design of Ministry of Education, Tianjin University, Tianjin 300354, PR China.
- Key Laboratory of Mechanism Theory and Equipment Design of Ministry of Education, Tianjin University, Tianjin 300354, PR China; International Institute for Innovative Design and Intelligent Manufacturing of Tianjin University in Zhejiang, Shaoxing, Zhejiang, PR China. Electronic address: [email protected].
Abstract
Three-dimensional (3D) ultrasound (US) reconstruction is of significant value in clinical diagnosis, as US imaging is safe, portable, low-cost, and real-time. 3D freehand ultrasound reconstruction aims to eliminate the need for tracking devices, relying solely on image data to infer the spatial relationships between frames. However, inherent jitter during handheld scanning introduces significant inaccuracies, preventing current methods from precisely predicting the spatial motion of ultrasound image frames. This leads to substantial cumulative error over long sequences, resulting in deformations or artifacts in the reconstructed volume. To address these challenges, we propose UltrasOM, a 3D ultrasound reconstruction network designed for spatial relative motion estimation. First, we design a video embedding module that integrates optical-flow dynamics with the original static information to enhance motion-change features between frames. Next, we develop a Mamba-based spatiotemporal attention module that uses multi-layer stacked Space-Time Blocks to capture global spatiotemporal correlations within video frame sequences. Finally, we incorporate a correlation loss and a motion speed loss to prevent overfitting to scanning speed and pose, enhancing the model's generalization capability. Experimental results on a dataset of 200 forearm cases comprising 58,011 frames show that the proposed method achieves a final drift rate (FDR) of 10.24%, a frame-to-frame distance error (DE) of 7.34 mm, a symmetric Hausdorff distance error (HD) of 10.81 mm, and a mean angular error (MEA) of 2.05°, outperforming state-of-the-art methods by 13.24%, 15.11%, 3.57%, and 6.32%, respectively. By integrating optical-flow features and deeply exploiting contextual spatiotemporal dependencies, the proposed network directly predicts the relative motion between multiple ultrasound frames without any tracking device, surpassing the accuracy of existing methods.
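For concreteness, a minimal sketch of the kind of optical-flow fusion the video embedding module describes is shown below. This is not the authors' implementation: it assumes dense Farneback flow from OpenCV and simple channel-wise concatenation of flow with the static frame, and the function name embed_frames is hypothetical.

import cv2
import numpy as np

def embed_frames(frames):
    """Fuse optical-flow dynamics with static frames (illustrative only).

    frames: list of H x W uint8 grayscale ultrasound images.
    Returns an (N-1, H, W, 3) float32 array: [frame, flow_x, flow_y].
    """
    embedded = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Dense optical flow between consecutive frames (Farneback);
        # positional args: pyr_scale, levels, winsize, iterations,
        # poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Stack the static frame with the two flow channels so the
        # downstream network sees both appearance and inter-frame motion.
        fused = np.dstack(
            [curr.astype(np.float32), flow[..., 0], flow[..., 1]])
        embedded.append(fused)
    return np.stack(embedded)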
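The Mamba-based spatiotemporal module could plausibly be organized as stacked blocks that alternate sequence mixing along the temporal axis and along the spatial (patch) axis. The sketch below assumes the mamba_ssm package's Mamba layer; the block layout and dimensions are guesses from the abstract, not the published architecture.

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # selective state-space sequence layer

class SpaceTimeBlock(nn.Module):
    """Hypothetical Space-Time Block: one Mamba pass over time and one
    over space, each with pre-normalization and a residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.mamba_t = Mamba(d_model=dim)
        self.mamba_s = Mamba(d_model=dim)

    def forward(self, x):  # x: (B, T, S, C) patch tokens
        B, T, S, C = x.shape
        # Temporal mixing: treat the frame index as the sequence axis.
        t = x.permute(0, 2, 1, 3).reshape(B * S, T, C)
        t = t + self.mamba_t(self.norm_t(t))
        x = t.reshape(B, S, T, C).permute(0, 2, 1, 3)
        # Spatial mixing: treat patches within a frame as the sequence.
        s = x.reshape(B * T, S, C)
        s = s + self.mamba_s(self.norm_s(s))
        return s.reshape(B, T, S, C)

# Multi-layer stack, e.g. four blocks over 256-dimensional tokens.
blocks = nn.Sequential(*[SpaceTimeBlock(256) for _ in range(4)])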
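The abstract names a correlation loss and a motion speed loss without defining them. One plausible reading, sketched here purely for illustration, penalizes low Pearson correlation between predicted and ground-truth motion sequences and mismatched per-frame displacement magnitudes; both functions are hypothetical.

import torch

def correlation_loss(pred, gt, eps=1e-8):
    # 1 - Pearson correlation between predicted and ground-truth motion
    # parameters, computed along the time axis; pred, gt: (T, D).
    pc = pred - pred.mean(dim=0)
    gc = gt - gt.mean(dim=0)
    corr = (pc * gc).sum(dim=0) / (pc.norm(dim=0) * gc.norm(dim=0) + eps)
    return (1.0 - corr).mean()

def motion_speed_loss(pred, gt):
    # Penalize mismatched per-step displacement magnitudes so the model
    # does not latch onto one scanning speed; pred, gt: (T, 3) translations.
    return (pred.norm(dim=1) - gt.norm(dim=1)).abs().mean()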