Yuheng Ji (冀昱衡)

I am a lyric poet, a passionate lover of life, and a PhD candidate at the Institute of Automation, Chinese Academy of Sciences (CASIA), supervised by Prof. Xiaolong Zheng. My research interests include embodied AI and vision-language models.

Email / Google Scholar / Poetry Anthology

Research

My research interests primarily lie in embodied AI and computer vision.
* denotes equal contributions.

	RoboBrain 2.0: Technical Report
		BAAI RoboBrain Team (as core contributor)
		arXiv, 2025
		Project / Paper / Code / Checkpoints
		We are excited to introduce RoboBrain 2.0, the most powerful open-source embodied brain model to date. Compared to its predecessor, RoboBrain 1.0, this version significantly advances multi-agent task planning, spatial reasoning, and closed-loop execution. A detailed technical report will be released soon.
	RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
		Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, Shanghang Zhang
		CVPR, 2025 (selected for the official Embodied AI Trends Commentary)
		Project / Paper / Code / Checkpoints / Datasets
		We developed RoboBrain, a VLM-based model that combines robotic and general multi-modal data, uses a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments show that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
	Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning
		Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Shanghang Zhang
		arXiv, 2025
		Project / Paper / Code / Checkpoints / Datasets
		We developed Reason-RFT, a novel reinforcement fine-tuning framework that enhances visual reasoning capabilities in Vision-Language Models (VLMs). Reason-RFT employs a two-phase training strategy: (1) supervised fine-tuning (SFT) with curated chain-of-thought (CoT) data to activate reasoning potential, followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning to generate diverse reasoning-response pairs.
	ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models
		Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, Zhongzhi Li, Rui Yan, Xiuying Chen
		arXiv, 2025
		Paper
		VLMs enhance robotic manipulation but rely on costly annotated data, which limits their adaptability to out-of-distribution (OOD) scenarios. We propose ManipLVM-R1, a reinforcement learning framework with Verifiable Rewards (RLVR) that replaces supervision and optimizes task outcomes directly for better generalization. Two rule-based rewards drive physical reasoning, achieving strong performance with less data (50%) and in OOD scenarios.
	Alleviating Performance Disparity in Adversarial Spatiotemporal Graph Learning under Zero-inflated Distribution
		Songran Bai, Yuheng Ji, Yue Liu, Xingwei Zhang, Xiaolong Zheng, Daniel Dajun Zeng
		AAAI, 2025 (Oral)
		Paper
		Spatiotemporal Graph Learning (SGL) under Zero-Inflated Distribution (ZID) is vital for urban risk management but is susceptible to adversarial attacks. Traditional adversarial training (AT) increases performance disparities between classes. We propose the MinGRE framework to reduce these disparities and enhance robustness, promoting more equitable and robust models.
	Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
		Yuheng Ji, Yue Liu, Zhicheng Zhang, Zhao Zhang, Yuting Zhao, Xiaoshuai Hao, Gang Zhou, Xingwei Zhang, Xiaolong Zheng
		ICMR, 2025
		Paper
		We propose AdvLoRA, a parameter-efficient adversarial adaptation method based on low-rank adaptation, to improve the robustness of vision-language models.
	FastRSR: Efficient and Accurate Road Surface Reconstruction from Bird's Eye View
		Yuting Zhao, Yuheng Ji*, Xiaoshuai Hao, Shuxiao Li
		arXiv, 2025
		Paper
		Road Surface Reconstruction (RSR) is crucial for autonomous driving because it enables understanding of road surface conditions. Traditional methods that transform perspective views into bird's-eye view (BEV) suffer from information loss and representation sparsity. We present two BEV-based RSR models, FastRSR-mono and FastRSR-stereo, which offer superior efficiency and accuracy and achieve state-of-the-art results in elevation absolute error and processing speed.
	What Really Matters for Robust Multi-Sensor HD Map Construction?
		Xiaoshuai Hao, Lingyu Liu, Yuting Zhao, Yuheng Ji, Luanyuan Dai, Shuai Cheng, Rong Yin
		IROS, 2025 (Oral)
		Paper
		This paper improves the robustness of HD map construction through data augmentation, a new fusion module, and modality dropout. The approach improves performance under sensor corruptions and achieves state-of-the-art accuracy on nuScenes.
	MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception
		Xiaoshuai Hao, Guanqun Liu, Yuting Zhao, Yuheng Ji, Mengchuan Wei, Haimei Zhao, Lingdong Kong, Rong Yin, Yu Liu
		ICME, 2025 (Oral)
		Project / Paper
		This work introduces the Multi-Sensor Corruption Benchmark (MSC-Bench), the first comprehensive benchmark for evaluating the robustness of multi-sensor autonomous driving perception models against various sensor corruptions.
	Learning Hash Subspace from Large-Scale Multi-modal Pre-Training: A CLIP-Based Cross-modal Hashing Framework
		Yuheng Ji, Xingwei Zhang, Gang Zhou, Xiaolong Zheng, Daniel Dajun Zeng
		The 11th C2 China, 2023 (Outstanding Paper Award)
		Paper
		We propose a cross-modal hashing framework called CCMH (CLIP-based Cross-Modal Hashing), which transfers a well-trained real-valued semantic subspace to a hash semantic subspace.