CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification

Cristiano Patrício*,1,3 · Isabel Rio-Torto*,2,3 · Jaime S. Cardoso2,3 · Luís F. Teixeira2,3 · João C. Neves1

*Equal Contribution, 1University of Beira Interior and NOVA LINCS, 2University of Porto, 3INESC TEC

arXiv Code BibTeX

CBVLM teaser

Abstract


The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by conditioning the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system must be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple yet effective methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer whether the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples.

Methodology


CBVLM Pipeline

Overview of CBVLM. Our methodology is organized into two key stages: 1) In the Concept Detection stage, the LVLM predicts the presence of each predefined clinical concept in the query image. This is achieved using a custom prompt that supports both zero- and few-shot settings. In the latter, we include a set of demonstration examples (middle block of Prompt Construction) chosen by the Retrieval Module, which selects the N examples most similar to the input image. To evaluate the LVLM's answer, we employ an Evaluation Block, which first tries to extract the desired response using a rule-based formulation; if this fails, we resort to an auxiliary LLM to extract it. 2) In the Disease Diagnosis stage, the final diagnosis is generated by the LVLM based on the clinical concepts predicted in the first stage, which are directly incorporated into the query (highlighted in yellow). This ensures that the diagnosis is grounded on the identified clinical concepts, enhancing the interpretability and transparency of the LVLM's response. In this second stage, the Retrieval Module is also used to select the N most similar demonstrations.
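
To make the two-stage pipeline more concrete, below is a minimal Python sketch of how it could be wired together. It is illustrative only: embed_image, lvlm_generate, and llm_extract_answer are hypothetical placeholders for the vision encoder, the LVLM generation call, and the auxiliary LLM used by the Evaluation Block, and the prompt wording is paraphrased from the figure rather than taken verbatim from the paper.

```python
import re
import numpy as np


# --- Hypothetical backbone wrappers (placeholders, model-specific) ----------
def embed_image(image):
    """Return an image embedding from the vision encoder of your choice."""
    raise NotImplementedError


def lvlm_generate(image, prompt):
    """Query the LVLM with an image and a text prompt; return its answer."""
    raise NotImplementedError


def llm_extract_answer(answer, question):
    """Ask an auxiliary LLM to map a free-form answer to 'yes'/'no'."""
    raise NotImplementedError


# --- Retrieval Module --------------------------------------------------------
def retrieve_demonstrations(query_emb, example_embs, n_shots):
    """Select the N annotated examples most similar to the query image
    (cosine similarity over image embeddings)."""
    sims = example_embs @ query_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return np.argsort(-sims)[:n_shots]


# --- Evaluation Block --------------------------------------------------------
def extract_yes_no(answer, concept):
    """Rule-based extraction first; fall back to an auxiliary LLM if it fails."""
    match = re.search(r"\b(yes|no)\b", answer.lower())
    if match:
        return match.group(1) == "yes"
    return llm_extract_answer(answer, question=f"Is '{concept}' present?") == "yes"


# --- Two-stage pipeline -------------------------------------------------------
def cbvlm_diagnose(query_image, concepts, examples, n_shots=4):
    """Stage 1: per-concept detection; Stage 2: diagnosis grounded on concepts."""
    query_emb = embed_image(query_image)
    demo_idx = retrieve_demonstrations(
        query_emb, np.stack([e["embedding"] for e in examples]), n_shots
    )

    # Stage 1: ask the LVLM about each clinical concept individually,
    # prepending the retrieved demonstrations as in-context examples.
    predicted = {}
    for concept in concepts:
        demos = "\n".join(
            f"<demo image {examples[i]['id']}> Is '{concept}' present? "
            f"Answer: {'yes' if concept in examples[i]['concepts'] else 'no'}"
            for i in demo_idx
        )
        prompt = f"{demos}\n<query image> Is '{concept}' present? Answer yes or no."
        predicted[concept] = extract_yes_no(lvlm_generate(query_image, prompt), concept)

    # Stage 2: the diagnosis prompt explicitly lists the predicted concepts,
    # so the final answer is grounded on them.
    present = ", ".join(c for c, p in predicted.items() if p) or "none"
    demos = "\n".join(
        f"<demo image {examples[i]['id']}> Concepts: "
        f"{', '.join(examples[i]['concepts'])}. Diagnosis: {examples[i]['label']}"
        for i in demo_idx
    )
    prompt = (
        f"{demos}\n<query image> The image shows the following concepts: "
        f"{present}. What is the most likely diagnosis?"
    )
    return lvlm_generate(query_image, prompt), predicted
```

Because the Retrieval Module is shared by both stages, the example embeddings can be computed once and reused for every query; no component of the pipeline requires training.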

Results


Influence of the example set size on the concept prediction performance of CBVLM


Even when only 10% of the example images per class are available, CBVLM outperforms CBMs, except on Derm7pt. Thus, CBVLM indeed requires only a few annotated examples.


Concept detection results per dataset averaged over all models (left) and over generic and medical LVLMs (right).


Each bar corresponds to a different number of shots (n ∈ {0, 1, 2, 4}). Filled colored bars denote BACC, whereas hatched bars indicate F1-scores.


Disease diagnosis results per dataset averaged over all models (left) and over generic and medical LVLMs (right).


Each bar corresponds to a different number of shots (n ∈ {0, 1, 2, 4}). "0 w/o" corresponds to the 0-shot experiment in which no concepts are used for the disease diagnosis. Filled colored bars denote BACC, whereas hatched bars indicate F1-scores.

Acknowledgements


This work was funded by the Portuguese Foundation for Science and Technology (FCT) under the PhD grants "2020.07034.BD" and "2022.11566.BD", and supported by NOVA LINCS (UIDB/04516/2020) with the financial support of FCT, I.P.

© This webpage was partly inspired by this template.