Human-Object Interaction via Automatically Designed
VLM-Guided Motion Policy

Our framework automatically constructs goal states and reward functions for a variety of interaction tasks in reinforcement learning. Guided by VLMs, the resulting motion policy enables physics-based characters to perform long-horizon interactions with both static and dynamic objects.

Abstract

Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. At its core is VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios.

Overview

Given an instruction and the environment context as input, the VLM-Guided RMD Planner generates a multi-step interaction plan in the form of RMD. Based on this plan, our framework automatically designs both goal states and reward functions, enabling the VLM-Guided Motion Policy to execute the interaction step by step.
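To make this pipeline concrete, the minimal Python sketch below shows how a single RMD sub-step could be mapped to a goal state and a dense reward for RL training. The names (RMDStep, target_offset) and the exponential-distance reward shape are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch (assumed, not the paper's code): one RMD sub-step
# pairs a human body part with an object part and specifies a desired
# relative position; the reward peaks at 1.0 when that goal is reached.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class RMDStep:
    """One sub-step of an RMD plan (hypothetical schema)."""
    human_part: str                    # e.g. "right_hand"
    object_part: str                   # e.g. "door_handle"
    target_offset: np.ndarray = field(
        default_factory=lambda: np.zeros(3))  # desired relative position
    weight: float = 5.0                # reward sharpness (assumed)


def relative_movement_reward(step: RMDStep,
                             human_pos: np.ndarray,
                             object_pos: np.ndarray) -> float:
    """Dense reward in (0, 1] based on the relative-position error."""
    rel = human_pos - object_pos                   # current relative position
    err = np.linalg.norm(rel - step.target_offset)
    return float(np.exp(-step.weight * err))


# Example: reward for bringing the right hand onto a door handle.
step = RMDStep("right_hand", "door_handle")
r = relative_movement_reward(step,
                             human_pos=np.array([0.1, 0.0, 1.0]),
                             object_pos=np.array([0.0, 0.0, 1.0]))
print(f"reward: {r:.3f}")
```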


Detailed RMD Planner

The RMD Planner outputs sequential sub-step plans in a structured JSON format, enabling direct processing via Python scripts. To ensure the RMD Planner functions as intended, we structure the prompt into dedicated sections, each designed to support a specific functionality.
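As an illustration of this interface, the snippet below shows a hypothetical plan in JSON and how a Python script might consume it step by step. The field names (steps, human_part, relation, timing) are assumptions for exposition; the actual schema is defined by the planner prompt.

```python
# Hypothetical example of the planner's JSON output and a consumer script.
import json

plan_json = """
{
  "task": "open the door and walk through",
  "steps": [
    {"human_part": "right_hand", "object_part": "door_handle",
     "relation": "contact", "timing": "until_grasped"},
    {"human_part": "right_hand", "object_part": "door_handle",
     "relation": "move_with", "timing": "until_door_open"},
    {"human_part": "pelvis", "object_part": "door_frame",
     "relation": "pass_through", "timing": "until_crossed"}
  ]
}
"""

plan = json.loads(plan_json)
for i, step in enumerate(plan["steps"]):
    # Each sub-step pairs a human part with an object part, from which
    # the framework derives a goal state and a reward term.
    print(f"step {i}: {step['human_part']} -> {step['object_part']} "
          f"({step['relation']}, {step['timing']})")
```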


Qualitative Results in a Single-Task Scenario

We compare our approach with UniHSI [Xiao 2024] and InterPhys [Hassan 2023]. UniHSI struggles to control humanoid get-up motions effectively, resulting in noticeable high-frequency jittering. This occurs because it models human-object interaction tasks as sequences of discrete, instantaneous, and independently solved spatial-reaching steps, neglecting the temporal dynamics required to coordinate movements across body parts. Similarly, InterPhys generates unnatural motions during task completion, such as kicking doors or forcing them open with the entire body, primarily due to insufficient fine-grained spatio-temporal guidance. In contrast, our method enables stable interactions with objects and supports smooth, natural transitions between different interaction stages.

Qualitative Results in a Multi-Task Scenario