Benchmarking Multimodal LLMs for In-Home Robotics


  • Jinming Li1*
  • Yichen Zhu2*
  • Minjie Zhu3
  • Zhiyuan Xu4
  • Yaxin Peng1
  • Feifei Feng2
  • Jian Tang4


  • 1 Shanghai University, China
    2 Midea Group, China
    3 East China Normal University
    4 Beijing Innovation Center of Humanoid Robotics

Abstract

It is fundamentally challenging for robots to serve as useful assistants in human environments because this requires addressing a spectrum of sub-problems across robotics, including perception, language understanding, reasoning, and planning. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated their exceptional abilities in solving complex mathematical problems and mastering commonsense and abstract reasoning. This has led to the recent use of MLLMs as the "brain" in robotic systems, where they conduct high-level planning before triggering low-level control actions for task execution. However, it remains uncertain whether existing MLLMs are reliable in this brain role. In this study, we introduce MMRo, the first benchmark for evaluating Multimodal LLMs for Robotics, which tests the capabilities of MLLMs in robot applications. Specifically, we identify four essential capabilities that MLLMs must possess to qualify as a robot's central processing unit: perception, task planning, visual reasoning, and safety measurement. We have developed several scenarios for each capability, resulting in 14 evaluation metrics in total. We present experimental results for various MLLMs, including both commercial and open-source models, to assess the performance of existing systems. Our findings indicate that no single model excels in all areas, suggesting that current MLLMs are not yet trustworthy enough to serve as the cognitive core for robots. The dataset is available at Google Drive.


Overview

Our work is the first diagnostic benchmark specifically designed to systematically dissect and analyze the diverse failure modes of MLLMs in robotics. MMRo includes approximately 25,537 meticulously crafted visual question-answer (VQA) pairs. Below we show data samples covering the 14 scenarios that are crucial for applying MLLMs in robotics.
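For concreteness, here is a minimal sketch of how MMRo-style VQA records might be loaded in Python. The filename and field names (image, question, answer, capability, scenario) are illustrative assumptions, not the benchmark's published schema.

import json

def load_vqa_pairs(path: str) -> list[dict]:
    # Hypothetical loader: assumes one JSON file holding a list of records,
    # each pairing an image with a question, a reference answer, and the
    # capability/scenario it probes.
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return [
        {
            "image": r["image"],
            "question": r["question"],
            "answer": r["answer"],
            "capability": r.get("capability"),  # perception, planning, reasoning, or safety
            "scenario": r.get("scenario"),      # one of the 14 evaluation scenarios
        }
        for r in records
    ]

pairs = load_vqa_pairs("mmro_vqa.json")  # hypothetical filename
print(f"Loaded {len(pairs)} VQA pairs")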


Evaluation

We evaluate 13 models on MMRo, including both open-source and commercial MLLMs. Experimental results on open-ended questions and multiple-choice questions are shown below.
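As a sketch of how multiple-choice accuracy could be computed over such records: the snippet below assumes each record carries an option list and the index of the correct option, and model.answer stands in for whatever inference call a given MLLM exposes; it is not a real API. Open-ended questions typically need a separate answer-judging step, which is omitted here.

def multiple_choice_accuracy(model, records: list[dict]) -> float:
    # Hypothetical scorer: `records` follow the assumed schema above, and
    # `model.answer` is a placeholder for an MLLM inference call.
    correct = 0
    for r in records:
        # Show the image, question, and lettered options; parse the chosen letter.
        prediction = model.answer(r["image"], r["question"], r["options"])
        expected_letter = chr(ord("A") + r["correct_index"])
        if prediction.strip().upper().startswith(expected_letter):
            correct += 1
    return correct / len(records)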

Citation

@article{li2024mmro,
  title={Benchmarking Multimodal LLMs for In-Home Robotics},
  author={Li, Jinming and Zhu, Yichen and Zhu, Minjie and Xu, Zhiyuan and Peng, Yaxin and Feng, Feifei and Tang, Jian},
  year={2024}
}