How can robots learn dexterous grasping skills efficiently and apply them adaptively based on user instructions? This work tackles two key challenges: efficient skill acquisition from limited human demonstrations and context-driven skill selection. We introduce AdaDexGrasp, a framework that learns a library of grasping skills from a single human demonstration per skill and selects the most suitable one using a vision-language model (VLM). To improve sample efficiency, we propose a trajectory-following reward that guides reinforcement learning (RL) toward states close to a human demonstration while allowing flexibility in exploration. To generalize beyond the single demonstration, we employ curriculum learning, progressively increasing object pose variations to enhance robustness. At deployment, a VLM retrieves the appropriate skill based on user instructions, bridging low-level learned skills with high-level intent. We evaluate AdaDexGrasp in both simulation and real-world settings, showing that our approach significantly improves RL sample efficiency and enables learning human-like grasp strategies across varied object configurations. Finally, we demonstrate zero-shot transfer of our learned policies to a real-world PSYONIC Ability Hand, achieving a 90% success rate across objects and significantly outperforming the baseline.
AdaDexGrasp consists of three key components:
(i) Converting human videos into robot demonstrations via hand motion retargeting and a
closed-loop inverse kinematics (IK) algorithm (a minimal IK sketch follows this list).
(ii) Learning a grasp policy through trajectory-guided reinforcement learning with a
curriculum. The reward scheme we designed consists of a trajectory-following reward, a
contact reward, and a height reward (see the reward sketch after this list).
(iii) Selecting skills with vision-language models to ground robot behavior in user
preferences (see the skill-selection sketch after this list).
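As a concrete illustration of component (i), the sketch below shows one closed-loop IK update using a finite-difference Jacobian and a damped least-squares pseudo-inverse, driving joint angles toward a retargeted keypoint. The toy planar chain, gains, and damping value are illustrative assumptions and not the exact retargeting pipeline from the paper.

```python
# Minimal closed-loop IK sketch (assumed setup): drive joint angles so a
# fingertip tracks a keypoint retargeted from the human video. The real
# pipeline handles a full multi-finger hand; the core loop is the same.
import numpy as np

def numerical_jacobian(fk, q, eps=1e-5):
    """Finite-difference Jacobian of fk: R^n -> R^m."""
    q = np.asarray(q, dtype=float)
    f0 = fk(q)
    J = np.zeros((f0.shape[0], q.shape[0]))
    for i in range(q.shape[0]):
        dq = np.zeros_like(q)
        dq[i] = eps
        J[:, i] = (fk(q + dq) - f0) / eps
    return J

def clik_step(fk, q, target, gain=1.0, damping=1e-2):
    """One closed-loop IK update with a damped pseudo-inverse:
    q <- q + (J^T J + lambda^2 I)^-1 J^T * gain * (target - fk(q))."""
    err = target - fk(q)
    J = numerical_jacobian(fk, q)
    dq = np.linalg.solve(J.T @ J + damping**2 * np.eye(q.shape[0]),
                         J.T @ (gain * err))
    return q + dq, np.linalg.norm(err)

# Toy example: a 2-link planar chain as a stand-in for one finger.
def fk_planar(q, l1=0.3, l2=0.25):
    return np.array([
        l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
        l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1]),
    ])

q = np.array([0.1, 0.2])
target = np.array([0.35, 0.25])   # e.g., a retargeted fingertip keypoint
for _ in range(100):
    q, err = clik_step(fk_planar, q, target)
    if err < 1e-4:
        break
```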
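The sketch below illustrates the three-term reward structure in component (ii). The functional forms, scales, and weights (`w_traj`, `w_contact`, `w_height`) are assumptions chosen for illustration; the paper's exact reward definitions may differ.

```python
# Hedged sketch of the reward scheme: trajectory-following + contact + height.
import numpy as np

def grasp_reward(robot_joints, ref_joints, fingertip_pos, object_pos,
                 object_height, lift_target=0.1,
                 w_traj=1.0, w_contact=0.5, w_height=2.0):
    """Sum of three shaped terms (illustrative forms):
      - trajectory-following: stay close to the nearest reference state from
        the retargeted human demo, without forcing exact timing,
      - contact: pull fingertips toward the object,
      - height: reward lifting the object toward a target height."""
    # Trajectory-following: distance to the closest reference joint state.
    d_traj = np.min(np.linalg.norm(ref_joints - robot_joints, axis=1))
    r_traj = np.exp(-5.0 * d_traj)

    # Contact: mean fingertip-to-object distance, squashed to (0, 1].
    d_contact = np.mean(np.linalg.norm(fingertip_pos - object_pos, axis=1))
    r_contact = np.exp(-10.0 * d_contact)

    # Height: progress of the object toward the lift target.
    r_height = np.clip(object_height / lift_target, 0.0, 1.0)

    return w_traj * r_traj + w_contact * r_contact + w_height * r_height

# Example call with placeholder shapes (16 joints, 50 reference states, 5 fingertips).
r = grasp_reward(
    robot_joints=np.zeros(16),
    ref_joints=np.zeros((50, 16)),
    fingertip_pos=np.zeros((5, 3)),
    object_pos=np.array([0.0, 0.0, 0.02]),
    object_height=0.05,
)
```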
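Finally, a minimal sketch of component (iii). The skill names, their descriptions, and the `query_vlm` callable are placeholders (not the paper's prompt or any specific VLM API); the point is only to show how a user instruction and a scene image can be grounded into one skill from the library.

```python
# Sketch of VLM-based skill retrieval; `query_vlm` stands in for any
# multimodal chat API and is assumed to return a plain-text reply.
SKILL_LIBRARY = {
    "top_grasp": "Grasp the object from above with fingers around its sides.",
    "side_grasp": "Approach from the side and wrap fingers around the body.",
    "pinch_grasp": "Pinch a thin edge or handle between thumb and index finger.",
}

def select_skill(user_instruction, scene_image, query_vlm):
    """Ask a vision-language model to pick one skill name from the library,
    given the scene image and the user's instruction."""
    menu = "\n".join(f"- {name}: {desc}" for name, desc in SKILL_LIBRARY.items())
    prompt = (
        "You control a dexterous robot hand. Available grasping skills:\n"
        f"{menu}\n"
        f"User instruction: {user_instruction}\n"
        "Reply with exactly one skill name from the list."
    )
    reply = query_vlm(prompt=prompt, image=scene_image).strip().lower()
    # Fall back to the first skill if the reply does not match the library.
    return reply if reply in SKILL_LIBRARY else next(iter(SKILL_LIBRARY))
```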
Quantitative results in simulation: we evaluate our policies, each trained on a different grasp pose from a distinct human demonstration, across objects with varying visual appearances, sizes, and physical properties. We compare our method with ViViDex; the results indicate ViViDex's sensitivity to the morphological gap between the human hand and the robot, highlighting the robustness of our reward function.
Ablation study in simulation: we evaluate the impact of removing individual design components and demonstrate that our full pipeline achieves the best performance.
Overall success rates of skill selection and execution: the table compares a random selection policy with a VLM-based policy, showing that the VLM improves both skill selection and execution success.
Comparison of sample efficiency between direct learning from a fully random initial object pose and curriculum learning with different success thresholds (ζ). The circular dot marks the average timestep at which σ reaches 1. A suitable ζ leads to faster convergence and better performance.
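For readers interested in how such a success-gated curriculum can be implemented, the sketch below grows the pose-randomization scale σ toward 1 whenever the recent success rate exceeds ζ. The update rule, step size, and pose ranges are illustrative assumptions, not the paper's exact schedule.

```python
# Hedged sketch of a success-gated curriculum over initial object poses.
import numpy as np

def update_curriculum(sigma, recent_success_rate, zeta=0.8, step=0.05):
    """Increase the pose-randomization scale only while the policy succeeds."""
    if recent_success_rate >= zeta:
        sigma = min(1.0, sigma + step)
    return sigma

def sample_initial_pose(rng, sigma, max_xy=0.1, max_yaw=3.14159):
    """Sample an initial object pose whose spread scales with sigma."""
    x, y = rng.uniform(-sigma * max_xy, sigma * max_xy, size=2)
    yaw = rng.uniform(-sigma * max_yaw, sigma * max_yaw)
    return x, y, yaw

# Usage outline with a placeholder success statistic per training round.
rng = np.random.default_rng(0)
sigma = 0.1
for epoch in range(100):
    # ... run RL rollouts with poses from sample_initial_pose(rng, sigma) ...
    recent_success_rate = 0.85  # placeholder; measured from the rollouts
    sigma = update_curriculum(sigma, recent_success_rate)
```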
Fingertip distance and object height over episode steps.
@misc{shi2025learningadaptivedexterousgrasping,
title={Learning Adaptive Dexterous Grasping from Single Demonstrations},
author={Liangzhi Shi and Yulin Liu and Lingqi Zeng and Bo Ai and Zhengdong Hong and Hao Su},
year={2025},
eprint={2503.20208},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2503.20208},
}