We obtain three distinct skills:
Skill 1: Grasp the bottom of a standing bottle and lift it.
Skill 2: Grasp the upper middle of a standing bottle and lift it.
Skill 3: Grasp a lying bottle, rotate it upright, and lift it.
We define five tasks that vary based on the object's initial pose and the given human preference:
T1: The cleanser is initially standing, and the human asks the robot to grasp it without stating a preference.
T2: The cleanser is initially standing, and the human asks the robot to grasp it by the bottom.
T3: The cleanser is initially standing, and the human asks the robot to grasp it by the top.
T4: The cleanser is initially lying, and the human asks the robot to grasp it without stating a preference.
T5: The cleanser is initially lying, and the human asks the robot to grasp it by the bottom.
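The skill and task definitions above can be written down as a small data structure, which makes the selection problem explicit. The following is a minimal sketch, not the authors' implementation: the names `Task`, `TASKS`, and `acceptable_skills` are hypothetical, and the mapping from tasks to acceptable skills is our assumption (a lying bottle requires Skill 3 regardless of preference, "bottom" maps to Skill 1, "top" maps to Skill 2, and no preference on a standing bottle allows either Skill 1 or Skill 2).

```python
from dataclasses import dataclass
from typing import Optional

# Skill descriptions as listed in the text.
SKILLS = {
    1: "Grasp the bottom of a standing bottle and lift it.",
    2: "Grasp the upper middle of a standing bottle and lift it.",
    3: "Grasp a lying bottle, rotate it upright, and lift it.",
}


@dataclass(frozen=True)
class Task:
    name: str
    pose: str                  # "standing" or "lying"
    preference: Optional[str]  # "bottom", "top", or None (no preference)


# The five tasks as listed in the text.
TASKS = [
    Task("T1", "standing", None),
    Task("T2", "standing", "bottom"),
    Task("T3", "standing", "top"),
    Task("T4", "lying", None),
    Task("T5", "lying", "bottom"),
]


def acceptable_skills(task: Task) -> set:
    """Return the skill ids we ASSUME are appropriate for a task.

    This mapping is an illustrative assumption, not taken from the source.
    """
    if task.pose == "lying":
        # Assumption: only Skill 3 handles a lying bottle.
        return {3}
    if task.preference == "bottom":
        return {1}
    if task.preference == "top":
        return {2}
    # Standing bottle, no preference: either standing grasp is assumed fine.
    return {1, 2}


if __name__ == "__main__":
    for task in TASKS:
        print(task.name, sorted(acceptable_skills(task)))
```

Under these assumptions, a selector (VLM or random) is correct on a task when the skill it returns is in `acceptable_skills(task)`.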
Conclusion: The VLM reliably selects the appropriate skill according to human intent and environmental context, achieving high success across tasks. In contrast, random selection often fails to choose the correct skill.
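To give a rough sense of why random selection struggles, the chance that a uniformly random choice among the three skills is appropriate can be computed per task. This is a back-of-the-envelope sketch, not a reported result: the `ACCEPTABLE` table is our assumed mapping from each task to its suitable skills, and the calculation ignores execution failures entirely.

```python
from fractions import Fraction

NUM_SKILLS = 3

# ASSUMED mapping from task to suitable skill ids (illustrative only):
# lying bottle -> Skill 3; "bottom" -> Skill 1; "top" -> Skill 2;
# standing with no preference -> Skill 1 or Skill 2.
ACCEPTABLE = {
    "T1": {1, 2},
    "T2": {1},
    "T3": {2},
    "T4": {3},
    "T5": {3},
}


def random_selection_chance(acceptable: set, num_skills: int = NUM_SKILLS) -> Fraction:
    """Probability that a uniformly random skill falls in the acceptable set."""
    return Fraction(len(acceptable), num_skills)


CHANCES = {task: random_selection_chance(skills) for task, skills in ACCEPTABLE.items()}

if __name__ == "__main__":
    for task, chance in CHANCES.items():
        print(f"{task}: {chance}")
```

Under these assumptions, random selection picks a suitable skill with probability 2/3 on T1 and 1/3 on each of T2 to T5, before any grasp execution error is taken into account.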