We compare our approach with UniHSI [Xiao 2024] and InterPhys [Hassan 2023]. UniHSI struggles to control humanoid get-up motions and produces noticeable high-frequency jitter. This occurs because it models human-object interaction tasks as sequences of discrete, instantaneous, independently solved spatial-reaching steps, neglecting the temporal dynamics needed to coordinate movements across body parts. Similarly, InterPhys generates unnatural motions during task completion, such as kicking doors or forcing them open with the entire body, primarily because it lacks fine-grained spatiotemporal guidance. In contrast, our method enables stable interactions with objects and supports smooth, natural transitions between interaction stages.