MMRo: Are Multimodal LLMs Eligible as the “Brain” for Robotics?


  • Jinming Li1*
  • Yichen Zhu2*
  • Zhiyuan Xu2
  • Jindong Gu3
  • Ning Liu2
  • Xin Liu4


  • Minjie Zhu4
  • Ran Cheng2
  • Tao Sun2
  • Yaxin Peng1
  • Feifei Feng2
  • Jian Tang2


  • 1 Shanghai University, China
    2 Midea Group, China
    3 University of Oxford
    4 East China Normal University

Abstract

It is fundamentally challenging for robots to serve as useful assistants in human environments, because doing so requires addressing a spectrum of sub-problems across robotics, including perception, language understanding, reasoning, and planning. Recent Multimodal Large Language Models (MLLMs) have demonstrated exceptional abilities in solving complex mathematical problems and in commonsense and abstract reasoning. This has led to the recent use of MLLMs as the "brain" in robotic systems, where the model conducts high-level planning before triggering low-level control actions for task execution. However, it remains uncertain whether existing MLLMs are reliable enough to serve in this role. In this study, we introduce the Multimodal LLM for Robotic (MMRo) benchmark, the first benchmark that tests the capability of MLLMs for robot applications. Specifically, we identify four essential capabilities (perception, task planning, visual reasoning, and safety measurement) that MLLMs must possess to qualify as the robot's central processing unit. We have developed several scenarios for each capability, resulting in a total of 14 metrics for evaluation. We present experimental results for various MLLMs, including both commercial and open-source models, to assess the performance of existing systems. Our findings indicate that no single model excels in all areas, suggesting that current MLLMs are not yet trustworthy enough to serve as the cognitive core for robots. The dataset is available on Google Drive.


Overview

MMRo is the first diagnostic benchmark specifically designed to systematically dissect and analyze the diverse failure modes of MLLMs in robotics. It includes approximately 26,175 meticulously crafted visual question-answer (VQA) pairs. We show some data samples covering the 14 scenarios that are crucial for applying MLLMs in robotics.
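
As a rough illustration of how VQA pairs grouped by capability could be scored, the sketch below tallies accuracy per capability. It is not the MMRo evaluation code: the record fields, the scenario labels, the sample questions, and the case-insensitive exact-match rule are all assumptions made for this example, and the real dataset schema and scoring protocol may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four capabilities named in the abstract.
CAPABILITIES = ("perception", "task planning", "visual reasoning", "safety measurement")


@dataclass(frozen=True)
class VQARecord:
    """Hypothetical record layout; the real MMRo schema may differ."""
    image_path: str
    capability: str  # expected to be one of CAPABILITIES
    scenario: str    # illustrative label, not an official MMRo scenario name
    question: str
    answer: str


def accuracy_by_capability(records, predictions):
    """Return {capability: fraction of answers matched exactly}.

    Case-insensitive exact match is an assumption for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, predicted in zip(records, predictions):
        total[record.capability] += 1
        if predicted.strip().lower() == record.answer.strip().lower():
            correct[record.capability] += 1
    return {cap: correct[cap] / total[cap] for cap in total}


# Made-up samples, not taken from the dataset.
records = [
    VQARecord("img_0.jpg", "perception", "color", "What color is the mug?", "red"),
    VQARecord("img_1.jpg", "perception", "count", "How many cups are on the table?", "2"),
    VQARecord("img_2.jpg", "safety measurement", "hazard", "Is the knife near the table edge?", "yes"),
]
predictions = ["Red", "3", "yes"]
scores = accuracy_by_capability(records, predictions)
```

Reporting one score per capability, rather than a single aggregate, is what makes it possible to see that a model is strong in one area and weak in another.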


Experimental Results

We evaluate 13 models on MMRo, including both open-source and commercial MLLMs. The experimental results on open-ended questions are shown below.

Citation