VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation



Zhengnan Sun1
Zhaotai Shi1
Jiaying Chen1
Qingtao Liu1
Yu Cui1
Jiming Chen1
Qi Ye1✝

1Zhejiang University
✝Corresponding author

[Paper]

Abstract

Bimanual dexterous manipulation remains a significant challenge in robotics due to the high degrees of freedom (DoFs) of each hand and the coordination required between hands. Existing single-hand manipulation techniques often leverage human demonstrations to guide RL methods but fail to generalize to complex bimanual tasks involving multiple sub-skills. In this paper, we propose VTAO-BiManip, a novel framework that integrates visual-tactile-action pre-training with object understanding, aiming to enable human-like bimanual manipulation via curriculum RL. We improve on prior pre-training approaches by incorporating hand motion data, which provides more effective guidance for dual-hand coordination. Our pre-training model predicts future actions as well as object pose and size from masked multimodal inputs, facilitating cross-modal regularization. To address the challenge of learning multiple sub-skills, we introduce a two-stage curriculum RL approach that stabilizes training. We evaluate our method on a bimanual bottle-cap twisting task, demonstrating its effectiveness in both simulated and real-world environments. Our approach achieves a success rate that surpasses existing visual-tactile pre-training methods by over 20%.
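
The following minimal PyTorch sketch illustrates the kind of masked multimodal pre-training objective described above. All module choices, dimensions, the single-token-per-modality design, and the masking ratio are illustrative assumptions, not the exact architecture from the paper.

import torch
import torch.nn as nn

class VTAOPretrainSketch(nn.Module):
    """Masked multimodal pre-training sketch: project vision, tactile, and
    action inputs to tokens, randomly mask some, fuse with a transformer,
    and predict future actions plus object pose and size."""

    def __init__(self, vision_dim=512, tactile_dim=32, action_dim=48,
                 d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, d_model)
        self.tactile_proj = nn.Linear(tactile_dim, d_model)
        self.action_proj = nn.Linear(action_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)  # future hand actions
        self.pose_head = nn.Linear(d_model, 7)             # position + quaternion
        self.size_head = nn.Linear(d_model, 3)             # bounding-box extents

    def forward(self, vision_feat, tactile, action, mask_ratio=0.6):
        # One token per modality; vision is assumed pre-encoded to a vector.
        tokens = torch.stack([self.vision_proj(vision_feat),
                              self.tactile_proj(tactile),
                              self.action_proj(action)], dim=1)  # (B, 3, d)
        # Randomly replace tokens with a learned mask token, so each modality
        # must be inferred from the others (cross-modal regularization).
        keep = torch.rand(tokens.shape[:2], device=tokens.device) > mask_ratio
        tokens = torch.where(keep.unsqueeze(-1), tokens, self.mask_token)
        fused = self.encoder(tokens).mean(dim=1)
        return self.action_head(fused), self.pose_head(fused), self.size_head(fused)

# Usage: supervise all three heads against logged VTAO data, e.g. with MSE.
model = VTAOPretrainSketch()
v, t, a = torch.randn(8, 512), torch.randn(8, 32), torch.randn(8, 48)
pred_action, pred_pose, pred_size = model(v, t, a)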

Method

How to gather VTAO (Vision-Tactile-Action-Object) data during human bimanual manipulation
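
A hypothetical sketch of what one timestep of such a VTAO recording might look like; the field names and shapes below are illustrative assumptions, not the dataset's actual schema.

from dataclasses import dataclass
import numpy as np

@dataclass
class VTAOSample:
    """One timestep of Vision-Tactile-Action-Object data captured during
    human bimanual manipulation (illustrative fields only)."""
    rgb: np.ndarray          # (H, W, 3) camera frame ("V")
    tactile: np.ndarray      # per-fingertip tactile readings, both hands ("T")
    hand_action: np.ndarray  # hand joint angles and wrist poses ("A")
    obj_pose: np.ndarray     # (7,) object position + quaternion ("O")
    obj_size: np.ndarray     # (3,) object bounding-box extents ("O")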


How to fuse VTAO information and use it in RL
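
A minimal sketch of how the pre-trained VTAO encoder could serve as a frozen observation encoder inside the two-stage curriculum RL described in the abstract. The env/policy interfaces and the stage split are assumptions; the paper's actual stage definitions and RL algorithm may differ.

import torch

def two_stage_curriculum(env, policy, vtao_encoder,
                         total_steps=1_000_000, stage1_fraction=0.5):
    """Hypothetical two-stage curriculum: the first stage rewards a simpler
    sub-skill and the second switches to the full bimanual task reward.
    `env` and `policy` are assumed interfaces, not a real library API."""
    # The pre-trained VTAO encoder is used as a frozen observation encoder.
    vtao_encoder.eval()
    for p in vtao_encoder.parameters():
        p.requires_grad_(False)

    for step in range(total_steps):
        stage = 1 if step < stage1_fraction * total_steps else 2
        obs = env.observe()                     # raw vision + tactile + proprioception
        with torch.no_grad():
            feat = vtao_encoder(obs)            # fused multimodal feature
        action = policy.act(feat)
        reward = env.step(action, stage=stage)  # stage selects the reward function
        policy.update(feat, action, reward)     # any RL update, e.g. PPO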

Experimental Results


BibTeX

@inproceedings{sun2025vtao,
  title={VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation},
  author={Sun, Zhengnan and Shi, Zhaotai and Chen, Jiaying and Liu, Qingtao and Cui, Yu and Chen, Jiming and Ye, Qi},
  booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2025},
  organization={IEEE}
}

Contact: Zhengnan Sun, Qi Ye