VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation



Zhengnan Sun1
Zhaotai Shi1
Jiaying Chen1
Qingtao Liu1
Yu Cui1
Jiming Chen1
Qi Ye1✝

1Zhejiang University
✝Corresponding author

[Paper]

Abstract

Bimanual dexterous manipulation remains a significant challenge in robotics due to the high degrees of freedom (DoFs) of each hand and the coordination required between hands. Existing single-hand manipulation techniques often leverage human demonstrations to guide RL methods but fail to generalize to complex bimanual tasks involving multiple sub-skills. In this paper, we propose VTAO-BiManip, a novel framework that integrates visual-tactile-action pre-training with object understanding, aiming to enable human-like bimanual manipulation via curriculum RL. We improve on prior pre-training approaches by incorporating hand motion data, which provides more effective guidance for dual-hand coordination. Our pre-training model predicts future actions as well as object pose and size from masked multimodal inputs, facilitating cross-modal regularization. To address the challenge of learning multiple sub-skills, we introduce a two-stage curriculum RL approach that stabilizes training. We evaluate our method on a bimanual bottle-cap twisting task, demonstrating its effectiveness in both simulated and real-world environments. Our approach achieves a success rate that surpasses existing visual-tactile pre-training methods by over 20%.
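
The following minimal PyTorch sketch illustrates the kind of masked multimodal pre-training objective described above. All module choices, dimensions, the single-token-per-modality design, and the masking ratio are illustrative assumptions, not the exact architecture from the paper.

import torch
import torch.nn as nn

class VTAOPretrainSketch(nn.Module):
    """Masked multimodal pre-training sketch: project vision, tactile, and
    action inputs to tokens, randomly mask some, fuse with a transformer,
    and predict future actions plus object pose and size."""

    def __init__(self, vision_dim=512, tactile_dim=32, action_dim=48,
                 d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, d_model)
        self.tactile_proj = nn.Linear(tactile_dim, d_model)
        self.action_proj = nn.Linear(action_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)  # future hand actions
        self.pose_head = nn.Linear(d_model, 7)             # position + quaternion
        self.size_head = nn.Linear(d_model, 3)             # bounding-box extents

    def forward(self, vision_feat, tactile, action, mask_ratio=0.6):
        # One token per modality; vision is assumed pre-encoded to a vector.
        tokens = torch.stack([self.vision_proj(vision_feat),
                              self.tactile_proj(tactile),
                              self.action_proj(action)], dim=1)  # (B, 3, d)
        # Randomly replace tokens with a learned mask token, so each modality
        # must be inferred from the others (cross-modal regularization).
        keep = torch.rand(tokens.shape[:2], device=tokens.device) > mask_ratio
        tokens = torch.where(keep.unsqueeze(-1), tokens, self.mask_token)
        fused = self.encoder(tokens).mean(dim=1)
        return self.action_head(fused), self.pose_head(fused), self.size_head(fused)

# Usage: supervise all three heads against logged VTAO data, e.g. with MSE.
model = VTAOPretrainSketch()
v, t, a = torch.randn(8, 512), torch.randn(8, 32), torch.randn(8, 48)
pred_action, pred_pose, pred_size = model(v, t, a)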

Method

How to gather VTAO (Vision-Tactile-Action-Object) data during human bimanual manipulation
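
A hypothetical sketch of what one timestep of such a VTAO recording might look like; the field names and shapes below are illustrative assumptions, not the dataset's actual schema.

from dataclasses import dataclass
import numpy as np

@dataclass
class VTAOSample:
    """One timestep of Vision-Tactile-Action-Object data captured during
    human bimanual manipulation (illustrative fields only)."""
    rgb: np.ndarray          # (H, W, 3) camera frame ("V")
    tactile: np.ndarray      # per-fingertip tactile readings, both hands ("T")
    hand_action: np.ndarray  # hand joint angles and wrist poses ("A")
    obj_pose: np.ndarray     # (7,) object position + quaternion ("O")
    obj_size: np.ndarray     # (3,) object bounding-box extents ("O")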


How to fuse VTAO information and use it in RL
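
A minimal sketch of how the pre-trained VTAO encoder could serve as a frozen observation encoder inside the two-stage curriculum RL described in the abstract. The env/policy interfaces and the stage split are assumptions; the paper's actual stage definitions and RL algorithm may differ.

import torch

def two_stage_curriculum(env, policy, vtao_encoder,
                         total_steps=1_000_000, stage1_fraction=0.5):
    """Hypothetical two-stage curriculum: the first stage rewards a simpler
    sub-skill and the second switches to the full bimanual task reward.
    `env` and `policy` are assumed interfaces, not a real library API."""
    # The pre-trained VTAO encoder is used as a frozen observation encoder.
    vtao_encoder.eval()
    for p in vtao_encoder.parameters():
        p.requires_grad_(False)

    for step in range(total_steps):
        stage = 1 if step < stage1_fraction * total_steps else 2
        obs = env.observe()                     # raw vision + tactile + proprioception
        with torch.no_grad():
            feat = vtao_encoder(obs)            # fused multimodal feature
        action = policy.act(feat)
        reward = env.step(action, stage=stage)  # stage selects the reward function
        policy.update(feat, action, reward)     # any RL update, e.g. PPO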

Experimental Results


BibTeX

@inproceedings{sun2025vtao,
  title={VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation},
  author={Sun, Zhengnan and Shi, Zhaotai and Chen, Jiaying and Liu, Qingtao and Cui, Yu and Chen, Jiming and Ye, Qi},
  booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2025},
  organization={IEEE}
}

Contact: Zhengnan Sun, Qi Ye