Evaluating Dense Model-based Approaches for Multimodal Medical Case Retrieval
DOI: https://doi.org/10.54195/irrj.19769
Keywords: Medical search, Multimodal retrieval, Dense retrieval
Abstract
Medical case retrieval plays a crucial role in clinical decision-making by enabling healthcare professionals to find relevant cases based on patient records, diagnostic images, and textual descriptions. Because medical data is inherently multimodal, effective retrieval requires models that can bridge the gap between modalities. Traditional retrieval approaches often rely on unimodal representations, which limits their ability to capture cross-modal relationships. Recent advances in dense model-based techniques have shown promise in overcoming these limitations by encoding multimodal information into a shared latent space, so that retrieval can be based on semantic similarity. This paper investigates the potential of dense models to enhance multimodal search systems. We evaluate various dense model-based approaches to assess which model characteristics have the greatest impact on retrieval effectiveness, using the medical case-based retrieval task from ImageCLEFmed 2013 as a benchmark. Our findings indicate that the choice of dense model approach substantially affects retrieval effectiveness, and that applying the CombMAX fusion method to combine the models' output results further improves effectiveness. Extending context length, however, yielded mixed results depending on the input data. Additionally, domain-specific models (those trained on medical data) outperformed general models trained on broad, non-specialized datasets within their respective fields. Furthermore, when text is the dominant information source, text-only models surpassed multimodal models.
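To make the CombMAX fusion step mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. CombMAX assigns each document the maximum score it receives from any of the combined runs. The function names (`min_max_normalize`, `comb_max`), the run format (a dictionary from case identifier to score), and the optional min-max normalization are assumptions made for illustration; the abstract does not specify how scores were normalized before fusion.

```python
from typing import Dict, List, Tuple

# Hypothetical run format: case identifier -> similarity score from one model.
Run = Dict[str, float]


def min_max_normalize(run: Run) -> Run:
    """Rescale one run's scores to [0, 1]; a constant run maps to 1.0.

    Normalization is an assumption of this sketch, used so that scores
    from different models are comparable before taking the maximum.
    """
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_max(runs: List[Run], normalize: bool = True) -> List[Tuple[str, float]]:
    """Fuse runs with CombMAX: keep each document's highest score."""
    fused: Dict[str, float] = {}
    for run in runs:
        scores = min_max_normalize(run) if normalize else run
        for doc, score in scores.items():
            if doc not in fused or score > fused[doc]:
                fused[doc] = score
    # Sort by descending fused score; break ties by identifier for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

For example, fusing a text run `{"case_a": 0.82, "case_b": 0.40, "case_c": 0.10}` with an image run `{"case_b": 0.95, "case_d": 0.30}` using `normalize=False` ranks `case_b` first, because its image score of 0.95 is the highest score any document receives in either run.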
License
Copyright (c) 2025 Catarina Pires, Sérgio Nunes, Luís F. Teixeira (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
CC-BY 4.0