TrafficPerceiver enables instruction-driven understanding and segmentation in challenging traffic scenes

Understanding traffic scenes is a fundamental capability for intelligent transportation systems and autonomous driving. However, real-world traffic environments are often far from ideal, featuring adverse weather, low visibility, motion blur, and occlusions that significantly degrade perception performance. Existing vision-based methods are typically designed for standard scenarios and lack the ability to follow human instructions or perform fine-grained, target-level reasoning.

To address these challenges, researchers from the School of Vehicle and Mobility at Tsinghua University propose TrafficPerceiver, a unified multimodal framework built upon a multimodal large language model (MLLM). TrafficPerceiver is designed to jointly support coarse-grained traffic scene understanding tasks, such as scene description and question answering, and fine-grained target-oriented segmentation tasks, such as isolating a specific vehicle, pedestrian, or road element according to a natural language instruction.

The team published their study in Communications in Transportation Research (https://doi.org/10.26599/COMMTR.2026.9640008).

Unlike conventional perception pipelines that rely on task-specific decoders, TrafficPerceiver aligns language and visual representations through a shared multimodal Transformer. A special segmentation token is introduced to directly associate textual instructions with relevant image regions, enabling efficient and interpretable segmentation without additional task-specific heads.
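As a rough illustration of how a single token can be tied to image regions, the toy sketch below scores every pixel feature against a segmentation-token embedding and thresholds the result into a binary mask. This is an assumption made for illustration only: the release does not describe TrafficPerceiver's actual mechanism, and the function name, the dot-product scoring, and the feature grid are all hypothetical.

```python
import math

def mask_from_seg_token(seg_embedding, pixel_features, threshold=0.5):
    """Score each pixel by similarity to a segmentation-token embedding.

    seg_embedding: list of D floats (hypothetical hidden state of the token)
    pixel_features: H x W grid of length-D feature vectors (hypothetical)
    Returns an H x W grid of 0/1 mask values.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            # Dot product between the token embedding and the pixel feature
            logit = sum(s * f for s, f in zip(seg_embedding, feat))
            # Sigmoid turns the similarity score into a probability
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask

# Toy 2 x 2 "image" with 2-dimensional features per pixel
seg = [1.0, -1.0]
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [3.0, 1.0]],
]
mask = mask_from_seg_token(seg, features)
```

In this toy setup, pixels whose features point in the same direction as the token embedding are kept and the rest are dropped, which is the intuition behind associating a textual instruction with specific image regions.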

To further enhance robustness under visually degraded conditions, the researchers introduce a reinforcement learning strategy based on Group Relative Policy Optimization (GRPO). Instead of optimizing absolute prediction scores, GRPO evaluates model outputs relative to other sampled responses within the same group. This encourages consistent instruction-following behavior and improves reasoning stability in challenging scenarios such as rain, fog, blur, and nighttime scenes.
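The group-relative idea can be sketched in a few lines. In the commonly described form of GRPO, each sampled response's reward is normalized by the mean and standard deviation of the rewards in its own group. The reward values below are made up, and the function name and `eps` parameter are hypothetical; the release does not specify the reward design used for TrafficPerceiver.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for the same instruction
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and are reinforced, while below-average responses are discouraged, regardless of the absolute reward scale.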

In addition, the team constructs a new dataset named Challenging Traffic Scene Understanding (CTSU), which focuses specifically on difficult real-world traffic environments. The dataset includes diverse conditions such as adverse weather, low illumination, occlusion, and regional variations in traffic infrastructure, and provides paired language instructions, textual responses, and pixel-level segmentation annotations.

Extensive experiments on the CTSU dataset and existing benchmarks demonstrate that TrafficPerceiver achieves superior performance in both traffic scene understanding and segmentation tasks, particularly under complex visual conditions. The results suggest that combining instruction-driven multimodal perception with reinforcement learning offers a promising direction for building more robust and interactive traffic perception systems.

About Communications in Transportation Research

Communications in Transportation Research was launched in 2021, with academic support provided by Tsinghua University and the China Intelligent Transportation Systems Association. The Editors-in-Chief are Professor Xiaobo Qu, a member of the Academia Europaea, from Tsinghua University, and Professor Xiaopeng (Shaw) Li from the University of Wisconsin–Madison. The journal mainly publishes high-quality, original research and review articles that are of significant importance to emerging transportation systems. It aims to serve as an international platform for showcasing and exchanging innovative achievements in transportation and related fields, and to foster academic exchange and development between China and the global community.

It has been indexed in SCIE, SSCI, Ei Compendex, Scopus, CSTPCD, CSCD, OAJ, DOAJ, TRID and other databases. It was selected as a Q1 Top Journal in the Engineering and Technology category of the Chinese Academy of Sciences (CAS) Journal Ranking List. In 2022, it was selected as a High-Starting-Point new journal project of the “China Science and Technology Journal Excellence Action Plan”. In 2024, it was selected for the Support the Development Project of “High-Level International Scientific and Technological Journals”. The same year, it was also chosen as an English Journal Tier Project of the “China Science and Technology Journal Excellence Action Plan Phase II”. In 2024, it received its first impact factor (2023 IF) of 12.5, ranking first (1/58, Q1) among all journals in the “TRANSPORTATION” category. In 2025, its 2024 IF was announced as 14.5, maintaining the top position (1/62, Q1) in the same category.

From Volume 6 (2026), Communications in Transportation Research will be published by Tsinghua University Press on the SciOpen platform, with the official journal website at https://www.sciopen.com/journal/2097-5023. We kindly request that all new manuscript submissions be made through the journal’s submission system at https://mc03.manuscriptcentral.com/commtr. For any submission-related inquiries, please contact the Editorial Office at commtr_e@mail.tsinghua.edu.cn.

Journal: Communications in Transportation Research

DOI: 10.26599/COMMTR.2026.9640008

Article title: TrafficPerceiver: A Multimodal Large Language Model with Reinforcement Learning for Unified Challenge Traffic Scene Perception

Date: 31-Mar-2026
