Analysis of intra- and inter-observer variability in 4D liver ultrasound landmark labeling.
Authors
Affiliations (2)
- Universität Rostock, Rostock, Germany.
- Universität zu Lübeck, Lübeck, Germany.
Abstract
Four-dimensional (4D) ultrasound imaging is widely used in clinics for diagnostics and therapy guidance. Accurate target tracking in 4D ultrasound is crucial for autonomous therapy guidance systems, for example in radiotherapy, where precise tumor localization ensures effective treatment. Supervised deep learning approaches rely on reliable ground truth, making accurate labels essential. We investigate the reliability of expert-labeled ground truth data by evaluating intra- and inter-observer variability in landmark labeling for 4D ultrasound imaging in the liver. Eight 4D liver ultrasound sequences were labeled by eight expert observers, each labeling eight landmarks three times. Intra- and inter-observer variability was quantified, and an observer survey and a motion analysis were conducted to determine factors influencing labeling accuracy, such as ultrasound artifacts and motion amplitude. The mean intra-observer variability ranged from 1.58 mm ± 0.90 mm to 2.05 mm ± 1.22 mm depending on the observer. The inter-observer variability for the two observer groups was 2.68 mm ± 1.69 mm and 3.06 mm ± 1.74 mm. The observer survey and motion analysis revealed that ultrasound artifacts significantly affected labeling accuracy due to limited landmark visibility, whereas motion amplitude had no measurable effect.
Our measured mean landmark motion was 11.56 mm ± 5.86 mm. We highlight variability in expert-labeled ground truth data for 4D ultrasound imaging and identify ultrasound artifacts as a major source of labeling inaccuracies. These findings underscore the importance of addressing observer variability and artifact-related challenges to improve the reliability of ground truth data for evaluating target tracking algorithms in 4D ultrasound applications.
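The abstract does not state how the variability values were computed. As an illustration only, the sketch below shows one common way such quantities could be obtained: intra-observer variability as the mean pairwise Euclidean distance between one observer's repeated labels of a landmark, and inter-observer variability as the mean pairwise distance between the observers' mean positions. The function names, the choice of metric, and the coordinates are all assumptions made for this example, not the authors' method or data.

```python
from itertools import combinations
from math import dist
from statistics import mean


def mean_pairwise_distance(points):
    """Mean Euclidean distance (mm) over all unordered pairs of 3D points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    return mean(dist(p, q) for p, q in pairs)


def centroid(points):
    """Component-wise mean of a list of 3D points."""
    return tuple(mean(axis) for axis in zip(*points))


def intra_observer_variability(repeats):
    """Assumed metric: spread of one observer's repeated labels of one landmark."""
    return mean_pairwise_distance(repeats)


def inter_observer_variability(labels_by_observer):
    """Assumed metric: spread between the observers' mean label positions."""
    centroids = [centroid(repeats) for repeats in labels_by_observer.values()]
    return mean_pairwise_distance(centroids)


# Made-up coordinates (mm) for one landmark in one frame, three repeats each.
labels = {
    "observer_a": [(10.0, 20.0, 30.0), (10.0, 20.0, 32.0), (10.0, 20.0, 34.0)],
    "observer_b": [(13.0, 24.0, 32.0), (13.0, 24.0, 32.0), (13.0, 24.0, 32.0)],
}

intra_a = intra_observer_variability(labels["observer_a"])
inter = inter_observer_variability(labels)
print(f"intra-observer (observer_a): {intra_a:.2f} mm")
print(f"inter-observer: {inter:.2f} mm")
```

In a full analysis these per-landmark values would be aggregated over landmarks, frames, and sequences to produce a mean and standard deviation per observer or observer group. How the study performed that aggregation is not specified in the abstract.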