ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes

ICCV 2023

1University of California, Los Angeles, 2Peking University, 3Tsinghua University,
4Columbia University, 5National Key Laboratory of General Artificial Intelligence, BIGAI

✶ indicates equal contribution

We present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD provides 8 tasks with expert demonstrations for learning and a testbed for evaluating the generalization abilities of agents over (1) novel goal states, (2) novel objects, and (3) novel scenes.


Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task-learning benchmarks assume discrete (e.g., binary) object goal states, which complicates the learning of complex tasks and the transfer of learned policies from simulated environments to the real world. Furthermore, state discretization limits a robot's ability to follow human instructions grounded in actions and states. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD comprises 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance using the latest language-conditioned policy-learning models. Our results indicate that current models for language-conditioned manipulation still face significant challenges in generalizing to novel goal states, scenes, and objects. These findings highlight the need for new algorithms that address this gap and underscore the potential for further research in this area.



We highlight the following major points: (1) ARNOLD is built on NVIDIA Isaac Sim, equipped with photo-realistic and physically accurate simulation, covering 40 distinctive objects and 20 scenes. (2) ARNOLD comprises 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. Each task has 7 data splits, including i.i.d. evaluation and unseen generalization. (3) ARNOLD provides 10k expert demonstrations with diverse template-generated language instructions, based on thousands of human annotations. (4) We assess the task performance of the latest language-conditioned policy-learning models. The results indicate that current models for language-conditioned manipulation still struggle to understand continuous states and produce precise motion control. We hope these findings foster future research on the unsolved challenges of instruction grounding and precise continuous motion control.


ARNOLD features continuous robot control over continuous object states with a large number of demonstrations in photo-realistic scenes. Each task in ARNOLD is specified by a natural language instruction. ARNOLD also leverages advanced physics simulation powered by PhysX 5.0 to simulate articulated bodies and fluids.

Comparison legend:
Language: task goals are specified by natural language instructions.
Multi-Camera: the robot is equipped with multiple cameras.
Fluid: advanced fluid simulation.
Physics: realistic physics simulation with realistic grasping.
Continuous: object states and goal states are continuous.
Scene: tasks are performed with a realistic scene background.
Robot: actions are performed with real robots for all tasks.
Flexible Material: materials and textures are easy to change.
Generalization: systematic generalization tests at different levels.
R: rasterization. RT: ray tracing.
1: A number of tasks in ManiSkill use ground-truth semantic segmentation as input.
2: RLBench-based benchmarks use simplified grasping.

Simulation Environment

ARNOLD is built on Isaac Sim, featuring photo-realistic and physically accurate simulation. The photo-realistic rendering is powered by GPU-enabled ray tracing, and the physics simulation is based on PhysX 5.0. In ARNOLD, we assign physics parameters (e.g., friction, surface tension) to objects, including rigid-body objects and fluids. ARNOLD covers 40 distinct objects and 20 diverse scenes. The scenes are curated from 3D-FRONT. The objects come from Isaac Sim, AI2-THOR, and SAPIEN. To enhance visual realism, we modified object meshes, e.g., by adjusting materials and adding top covers to drawers and cabinets. For more stable physics-based grasping, we performed convex decomposition to create precise collision proxies for each object. ARNOLD uses a 7-DoF Franka Emika Panda manipulator with a parallel gripper for task execution. Five cameras around the robot provide visual inputs. We show an example of camera rendering and visualize objects/scenes below.

Multi-view cameras


ARNOLD contains 8 tasks with various goal-state variations. To succeed, the robot must manipulate object configurations so that the object state stays within a continuous range around the goal state for a period of time. Accomplishing these tasks requires capabilities in language grounding, friction-based grasping, continuous state understanding, and precise robot motion control.
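The success criterion above — holding a continuous object state within a tolerance of the goal for some duration — can be sketched as follows. This is an illustrative sketch, not ARNOLD's actual evaluation code; the function names, tolerance, and hold duration are our assumptions.

```python
from collections import deque

def make_success_checker(goal: float, tolerance: float, hold_steps: int):
    """Return a stateful checker: success once the (normalized) object state
    has stayed within `tolerance` of `goal` for `hold_steps` consecutive steps.
    Illustrative only; not ARNOLD's actual API."""
    history = deque(maxlen=hold_steps)

    def check(state: float) -> bool:
        history.append(abs(state - goal) <= tolerance)
        return len(history) == hold_steps and all(history)

    return check

# Example: goal = drawer opened 50%, tolerance 10%, hold for 3 steps.
check = make_success_checker(goal=0.5, tolerance=0.1, hold_steps=3)
print([check(s) for s in [0.2, 0.45, 0.52, 0.55, 0.48]])
# → [False, False, False, True, True]
```

The deque keeps only the most recent `hold_steps` observations, so an early out-of-tolerance state stops blocking success once enough in-tolerance steps follow it.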

Task overview
Illustration of task

Task illustration
Demonstration of task


To generate demonstrations for the tasks, we design several motion planners that execute actions over sub-task stages. Although motion planning is challenging for particular tasks (e.g., pouring water), we improve the planning pipeline by incorporating prior knowledge and practical techniques (e.g., spherical linear interpolation). We also enrich the diversity of the data by collecting 2k human annotations and performing further augmentations. After validating the demonstrations, we collect 10k expert demonstrations in total. Each task has 7 data splits: Train/Val/Test for the i.i.d. scheme, novel Object/Scene/State for unseen generalization, and an extra split (State*) with arbitrary continuous goal states. To provide diverse human language instructions, we prepare a template pool for each task. Each template has several placeholders that can be lexicalized with various equivalent phrases. Below we show an overview of the ARNOLD data and a few instruction examples.
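As a concrete illustration of one of the practical techniques mentioned above, here is a minimal spherical linear interpolation (slerp) routine for unit quaternions, useful for smoothly interpolating gripper orientations between waypoints. This is a generic sketch, not the planners' actual implementation; the (x, y, z, w) quaternion convention and NumPy usage are our assumptions.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1,
    with t in [0, 1]. Quaternions are (x, y, z, w) arrays."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    q0 /= np.linalg.norm(q0)
    q1 /= np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:            # take the shorter arc on the 4D sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

# Interpolate halfway between the identity and a 90-degree rotation about z:
q_id = [0.0, 0.0, 0.0, 1.0]
q_z90 = [0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)]
q_mid = slerp(q_id, q_z90, 0.5)   # a 45-degree rotation about z
```

Unlike naive per-component interpolation, slerp keeps the result on the unit sphere and traverses the rotation at constant angular velocity, which yields smoother end-effector motion.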

Data overview
Language instructions
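The template-based instruction generation described above can be sketched as follows. The templates and phrase pools here are hypothetical examples for illustration, not ARNOLD's actual template pool.

```python
import random

# Hypothetical templates with placeholders and pools of equivalent phrases;
# ARNOLD's actual templates and lexicon differ.
TEMPLATES = [
    "{verb} the {object} {fraction} open",
    "{verb} the {object} so that it is {fraction} open",
]
LEXICON = {
    "verb": ["pull", "open", "slide"],
    "object": ["drawer", "cabinet door"],
    "fraction": ["fifty percent", "half", "50%"],
}

def generate_instruction(rng: random.Random) -> str:
    """Sample a template, then lexicalize each placeholder with a
    randomly chosen equivalent phrase."""
    template = rng.choice(TEMPLATES)
    slots = {key: rng.choice(phrases) for key, phrases in LEXICON.items()}
    return template.format(**slots)

rng = random.Random(0)
print(generate_instruction(rng))
```

Because each placeholder draws from a pool of equivalent phrases, a handful of templates yields many distinct but semantically consistent instructions for the same goal.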


We evaluate two state-of-the-art language-conditioned robotic manipulation models: 6D-CLIPort and PerAct. To investigate their performance more thoroughly, we add several model variants: (1) PerAct without language (PerAct w/o L); (2) PerAct with additional supervision on state values; (3) multi-task PerAct (PerAct MT). In addition, we provide evaluation results with first-phase ground truth, i.e., ground-truth grasping.

Evaluation results of the models on various tasks and splits, measured by success rate and shown in percentages. The gray numbers indicate performance with first-phase ground truth. For each model, the first row shows the performance on the Test set, and the following three rows show the performance on the novel Object, Scene, and State splits. The last row shows the performance on the Any State split. Task names are abbreviated to save space. The average performance over the eight tasks is appended to each row. w/o L: without language instruction. MT: multi-task models. Variants with state modeling are marked in the table.


We set up a real-world environment to test Sim2Real transfer capabilities. Specifically, we select single-task PerAct models and use a Franka robot arm to manipulate previously unseen real-world objects with a single RGB-D camera from the left view. We experiment with 2 different drawers and 5 different objects. The results show that models trained in ARNOLD exhibit preliminary Sim2Real transfer capabilities, i.e., reasonable predictions for both picking up objects and manipulating drawers. However, the Sim2Real gap remains, e.g., in friction modeling. We hope the flexible design and realistic simulation in ARNOLD can gradually close this gap with more diverse and fine-grained object assets.



We present ARNOLD, a benchmark for language-grounded task learning in realistic 3D interactive environments with diverse scenes, objects, and continuous object states. We devise a systematic benchmark comprising eight challenging language-grounded robot tasks and evaluation splits for robot skill generalization in novel scene, object, and goal-state scenarios. We conduct extensive experiments and analyses to pinpoint the limitations of the current models and identify promising research directions for grounded task learning.


This work is supported in part by the National Key R&D Program of China (2021ZD0150200) and NVIDIA GPU Grant.


@inproceedings{gong2023arnold,
  title={ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes},
  author={Gong, Ran and Huang, Jiangyong and Zhao, Yizhou and Geng, Haoran and Gao, Xiaofeng and Wu, Qingyang and Ai, Wensi and Zhou, Ziheng and Terzopoulos, Demetri and Zhu, Song-Chun and others},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023}
}