
Advancing multimodal intelligence in colonoscopy

03.23.26 | Maximum Academic Press

A new study maps the rapidly evolving field of intelligent colonoscopy. It argues that the next leap will come not from isolated-task modeling alone, but from generalized multimodal systems that can perceive, describe, locate, and discuss findings in clinically useful language. To move the field forward, the researchers reviewed 63 datasets and 137 deep-learning models spanning classification, detection, segmentation, and vision-language tasks. They then built three new resources: ColonINST, a large multimodal colonoscopy dataset; ColonGPT, a lightweight colonoscopy-specific multimodal model; and a benchmark for evaluating conversational medical image understanding.
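For readers unfamiliar with dialogue-style training data, a record in such a dataset pairs an image with a question-and-answer exchange. The sketch below is hypothetical: the field names follow the LLaVA-style conversation convention commonly used for multimodal instruction tuning, and every path and string is illustrative rather than ColonINST's actual schema.

```python
# Hypothetical instruction-tuning record (LLaVA-style conversation format).
# All field names, paths, and text are illustrative, not ColonINST's schema.
example_record = {
    "image": "images/polyp/case_00421.jpg",      # one colonoscopy frame
    "task": "referring expression generation",   # one of several multimodal tasks
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat abnormality is visible, and where is it located?"},
        {"from": "gpt",
         "value": "A small sessile polyp in the lower-left region of the frame."},
    ],
}
```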

Colonoscopy remains one of the most sensitive tools for colorectal cancer screening, and prior evidence suggests that AI can substantially reduce missed colorectal neoplasia compared with conventional practice. Yet colonoscopy imagery is unusually difficult for algorithms: the camera moves unpredictably, the field of view is limited by the colon’s folded anatomy, lighting is uneven and reflective, instruments frequently enter the frame, and subtle lesions can visually blend into surrounding tissue. The paper further shows that multimodal colonoscopy research still suffers from scarce vision-language data, inconsistent labels, and limited coverage of rare conditions. Because of these challenges, deeper research into multimodal intelligent colonoscopy is urgently needed.

Researchers from Nankai University, the Australian National University, Tsinghua University, and Mohamed bin Zayed University of Artificial Intelligence reported (DOI: 10.1007/s11633-025-1597-6) on January 7, 2026, in Machine Intelligence Research that they not only surveyed the frontier of intelligent colonoscopy, but also introduced a new multimodal dataset, a task-specific vision-language model, and a benchmark designed to support clinically relevant dialogue and decision assistance in endoscopy.

The team began by systematically reviewing the technical landscape of intelligent colonoscopy across four core tasks: image classification, object detection, image segmentation, and vision-language understanding. Their survey identified 63 datasets and 137 representative models published since 2015, revealing both rapid progress and major blind spots, especially in multimodal learning. They then assembled ColonINST from 19 public sources, creating a resource of 303,001 colonoscopy images across 62 subcategories. To enrich it for dialogue-based AI, they added 128,620 medical captions and restructured 450,724 human-machine conversation pairs for instruction tuning. Building on this data foundation, the researchers developed ColonGPT, a colonoscopy-specific multimodal model using a 0.4B-parameter SigLIP-SO visual encoder and a 1.3B-parameter Phi-1.5 language model. A key design feature is a multigranularity adapter that selectively keeps only the most informative visual tokens, reducing token usage to 34% of the original while preserving performance. In benchmark testing, ColonGPT ranked first across three multimodal tasks and could be trained in about seven hours on two NVIDIA H200 GPUs, suggesting that practical, domain-specific clinical assistants may no longer require extremely large and expensive models.
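The paper describes the adapter at a high level, but the general mechanism of keeping a fixed fraction of the most informative visual tokens can be sketched in a few lines of PyTorch. Everything below is an illustrative reconstruction, not the authors' code: the module name, the linear importance scorer, and the dimensions (1152 for a SigLIP-SO-style encoder, 2048 for Phi-1.5's hidden size) are assumptions.

```python
import torch
import torch.nn as nn


class TokenPruningAdapter(nn.Module):
    """Illustrative sketch of a token-reducing multimodal adapter.

    NOT the ColonGPT implementation: it shows one generic way to keep
    only the most informative visual tokens (about 34% here) before
    handing them to a language model.
    """

    def __init__(self, vision_dim: int = 1152, lm_dim: int = 2048,
                 keep_ratio: float = 0.34):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Linear(vision_dim, 1)          # per-token importance
        self.projector = nn.Linear(vision_dim, lm_dim)  # into LM embedding space

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: [batch, num_tokens, vision_dim], e.g. patch features
        batch_size, num_tokens, dim = visual_tokens.shape
        k = max(1, int(num_tokens * self.keep_ratio))

        scores = self.scorer(visual_tokens).squeeze(-1)  # [batch, num_tokens]
        top_idx = scores.topk(k, dim=1).indices          # most informative tokens
        top_idx = top_idx.sort(dim=1).values             # restore spatial order

        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, dim)
        kept = visual_tokens.gather(1, gather_idx)       # [batch, k, vision_dim]
        return self.projector(kept)                      # [batch, k, lm_dim]


# Usage with assumed dimensions: a 384-px input to a SigLIP-SO-style encoder
# yields 27 x 27 = 729 patch tokens of width 1152.
adapter = TokenPruningAdapter()
features = torch.randn(2, 729, 1152)
lm_inputs = adapter(features)
print(lm_inputs.shape)  # torch.Size([2, 247, 2048])
```

Under these assumptions, a keep ratio of 0.34 passes 247 of 729 patch tokens to the language model, consistent with the reported 34% token budget; the paper's actual multigranularity design may select and pool tokens differently.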

The study presents intelligent colonoscopy not as a single visual perception problem, but as a broader multimodal challenge. Its message is clear: future systems should not only find lesions, but also explain them, respond to prompts, and support reporting and decision-making. By pairing a field-wide survey with shared multimodal infrastructure, the work offers a roadmap for turning colonoscopy AI from isolated visual tools into interactive medical assistants.

Just as importantly, the study highlights what still needs to be fixed: rare-disease coverage, richer patient-linked data, more consistent labeling, and models that generalize better to unseen cases. If these gaps are addressed, intelligent colonoscopy could evolve into a more integrated clinical co-pilot: one that helps doctors interpret complex scenes faster and helps patients receive more timely, precise care.

###

References

DOI

10.1007/s11633-025-1597-6

Original Source URL

https://doi.org/10.1007/s11633-025-1597-6

Funding information

This work was supported by the National Natural Science Foundation of China (NSFC) (No. 62476143), the Fundamental Research Funds for the Central Universities, China (Nankai University, No. 63253218), and the ANU-Optus Bushfire Research Centre of Excellence (BRCoE) (scholarship awarded to Ge-Peng Ji). Peng Xu is also supported by the NSFC (No. 62306162).

About Machine Intelligence Research

Machine Intelligence Research (original title: International Journal of Automation and Computing) is published by Springer and sponsored by the Institute of Automation, Chinese Academy of Sciences. The journal publishes high-quality papers on original theoretical and experimental research, targets special issues on emerging topics, and strives to bridge the gap between theoretical research and practical applications.

Article Information

Journal: Machine Intelligence Research
Article Title: Frontiers in Intelligent Colonoscopy
Article Publication Date: 7-Jan-2026
COI Statement: The authors declare that they have no competing interests.

Contact Information

Editorial Office
Machine Intelligence Research
mir@ia.ac.cn

How to Cite This Article

APA:
Maximum Academic Press. (2026, March 23). Advancing multimodal intelligence in colonoscopy. Brightsurf News. https://www.brightsurf.com/news/80EOPM38/advancing-multimodal-intelligence-in-colonoscopy.html
MLA:
"Advancing multimodal intelligence in colonoscopy." Brightsurf News, Mar. 23 2026, https://www.brightsurf.com/news/80EOPM38/advancing-multimodal-intelligence-in-colonoscopy.html.