FM-Adapt: Foundation model adaptation with photoacoustic-supervised learning for interventional ultrasound.
Authors
Affiliations (3)
- Department of Computer Science, Iowa State University, Ames, IA, USA.
- Department of Radiation Oncology, Winship Cancer Institute, Emory University, Atlanta, GA, USA.
- Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, USA.
Abstract
Foundation models (FMs), such as the Segment Anything Model (SAM), show remarkable capabilities in general-purpose segmentation tasks through large-scale pre-training. However, a substantial domain shift limits their effectiveness in complex medical imaging. Here we introduce FM-Adapt, the first parameter-efficient adaptation of an FM (a SAM-based vision transformer) into a resolution-agnostic architecture with photoacoustic (PA)-supervised learning for dual-target interventional ultrasound (US) segmentation. We demonstrate FM-Adapt in the context of PA-supervised interventions, specifically for US-guided needle tracking and simultaneous target identification (breast tumor segmentation). A single training run with this unified adaptation framework produces two specialized sets of model weights: USPA-SAM for real-time needle tracking and BT-SAM for breast tumor segmentation. The framework keeps the pre-trained encoder components frozen and fine-tunes only the mask decoder, allowing the model to process native (256 × 256) clinical images without spatial degradation while achieving state-of-the-art performance with high computational efficiency. USPA-SAM achieves a mean modified Hausdorff Distance (MHD) of 0.34 mm, a targeting error (TE) of 0.83 mm, and a 100% needle localization success rate (NLSR), outperforming baselines by a factor of 3 to 17× in spatial precision. On tumor segmentation, BT-SAM achieves Dice scores of 93.6% and 96.3%, along with IoU scores of 89.2% and 94.0%, demonstrating strong generalization to unseen data. Our models achieve a 27× improvement in computational efficiency, processing native clinical images at 34 FPS on a single GPU and enabling real-time clinical adaptation.
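For readers unfamiliar with the overlap and distance metrics reported above, the following is a minimal sketch of how Dice, IoU, and a modified Hausdorff distance are commonly computed. It is not the authors' evaluation code. The function names `dice_and_iou` and `modified_hausdorff` are hypothetical, the MHD shown is the standard Dubuisson-Jain form (the maximum of the two directed mean nearest-neighbor distances), which the paper may or may not use, and distances are in pixel units, so a pixel spacing in mm would have to be applied separately.

```python
import math


def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two same-shaped binary masks given as 2D lists of 0/1."""
    inter = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            truth_sum += 1 if t else 0
    union = pred_sum + truth_sum - inter
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention, an assumption).
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + truth_sum)
    iou = inter / union
    return dice, iou


def modified_hausdorff(points_a, points_b):
    """Dubuisson-Jain modified Hausdorff distance between two non-empty point sets.

    Each set is a sequence of (row, col) coordinates; the result is in the
    same units as the coordinates (pixels here).
    """
    def mean_min_dist(src, dst):
        # Average, over src, of the distance to the nearest point in dst.
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return max(mean_min_dist(points_a, points_b),
               mean_min_dist(points_b, points_a))


# Toy example: two 2x3 masks that share 2 of their 3 foreground pixels.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
dice, iou = dice_and_iou(pred_mask, true_mask)

# Toy example: predicted vs. reference needle points (pixel coordinates).
mhd = modified_hausdorff([(0, 0), (0, 1)], [(0, 0), (0, 3)])
```

For a single pair of masks, Dice and IoU are related by Dice = 2·IoU / (1 + IoU); averages over a dataset, such as those reported in the abstract, need not satisfy this identity exactly.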