In a preprint paper, a team from Google and MIT investigates whether pre-trained visual representations can be used to improve a robot’s object manipulation performance. They say that their proposed technique, affordance-based manipulation, can enable robots to learn to pick and grasp objects in less than 10 minutes of trial and error, which could lay the groundwork for highly adaptable warehouse robots.
Affordance-based manipulation reframes a manipulation task as a computer vision task. Rather than mapping pixels to object labels, it associates pixels with the value of actions. Since the structures of computer vision models and affordance models are relatively similar, transfer learning techniques from computer vision can be applied to help affordance models learn faster with less data, or so the thinking goes.
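To make the idea of associating pixels with action values concrete, here is a minimal sketch in Python. It assumes an affordance model has already produced one predicted grasp-success value per pixel; the function name `best_grasp_pixel` and the numbers in `example_map` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def best_grasp_pixel(affordance_map: List[List[float]]) -> Tuple[int, int, float]:
    """Return (row, col, value) of the pixel with the highest action value."""
    best = (0, 0, affordance_map[0][0])
    for r, row in enumerate(affordance_map):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best


# Hypothetical 3x4 map: each entry is the predicted value of grasping at that pixel.
example_map = [
    [0.05, 0.10, 0.08, 0.02],
    [0.12, 0.81, 0.40, 0.03],
    [0.07, 0.33, 0.21, 0.01],
]

row, col, value = best_grasp_pixel(example_map)
print(row, col, value)  # 1 1 0.81
```

In this toy setup the robot would attempt a grasp at the image location with the highest value, which is why the problem looks so much like dense per-pixel prediction in computer vision.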
To test this, the team injected the “backbones” (i.e., the weights, or variables, responsible for early-stage image processing, like filtering edges, detecting corners, and distinguishing between colors) of various popular computer vision models pre-trained on vision tasks into affordance-based manipulation models. They then tasked a real-world robot with learning to grasp a set of objects through trial and error.
Initially, there weren’t significant performance gains compared with training the affordance models from scratch. However, upon transferring weights from both the backbone and the head (which consists of weights used in later-stage processing, such as recognizing contextual cues and executing spatial reasoning) of a pre-trained vision model, there was a substantial improvement in training speed. Grasping success rates reached 73% in just 500 trial-and-error grasp attempts, and jumped to 86% by 1,000 attempts. And on new objects unseen during training, models initialized with pre-trained weights generalized better, with grasping success rates of 83% with the backbone alone and 90% with both the backbone and head.
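The difference between transferring only the backbone and transferring the backbone plus the head can be sketched as a selective copy of named weights. This is an illustrative sketch, not the team's code: the `backbone.` and `head.` name prefixes, the `transfer_weights` helper, and the tiny weight lists are all assumptions made for the example.

```python
import copy
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def transfer_weights(vision: Weights, affordance: Weights,
                     prefixes: Sequence[str]) -> Weights:
    """Copy weights whose names start with one of `prefixes` from the
    vision model into a copy of the affordance model. Layers that are
    missing from the target or differ in size are left untouched."""
    result = copy.deepcopy(affordance)
    for name, values in vision.items():
        matches_prefix = name.startswith(tuple(prefixes))
        if matches_prefix and name in result and len(result[name]) == len(values):
            result[name] = list(values)
    return result


# Hypothetical pre-trained vision model and freshly initialized affordance model.
vision_model = {
    "backbone.conv1": [0.5, -0.2],
    "head.conv1": [0.9, 0.1],
    "classifier.fc": [0.3, 0.3, 0.3],  # task-specific, never transferred here
}
affordance_model = {
    "backbone.conv1": [0.0, 0.0],
    "head.conv1": [0.0, 0.0],
    "output.conv": [0.0],  # produces the per-pixel action values
}

backbone_only = transfer_weights(vision_model, affordance_model, ["backbone."])
backbone_and_head = transfer_weights(vision_model, affordance_model,
                                     ["backbone.", "head."])
print(backbone_only["head.conv1"], backbone_and_head["head.conv1"])
```

In the backbone-only case the head still starts from its initial values, whereas in the second case both stages start from pre-trained weights, which is the configuration the article reports as training faster.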
According to the team, reusing weights from vision tasks that require object localization (e.g., instance segmentation) significantly improved the exploration process when learning manipulation tasks. Pre-trained weights from these tasks encouraged the robot to sample actions on things that look more like objects, thereby quickly generating a more balanced data set from which the system could learn the differences between good and bad grasps.
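The effect on exploration can be illustrated with a toy simulation that compares where grasp attempts land under two sets of per-pixel scores. This is a sketch under stated assumptions: the 16-pixel scene, the score values, and the `sample_pixels` helper are hypothetical, and sampling in proportion to the scores is a stand-in for whatever exploration strategy the paper actually uses.

```python
import random
from typing import List


def sample_pixels(weights: List[float], n: int, rng: random.Random) -> List[int]:
    """Draw n pixel indices with probability proportional to `weights`."""
    return rng.choices(range(len(weights)), weights=weights, k=n)


# Hypothetical flattened 4x4 scene in which pixels 5 and 6 lie on an object.
object_pixels = {5, 6}
num_pixels = 16

# Model trained from scratch (assumed): every pixel looks equally promising.
uniform_scores = [1.0] * num_pixels
# Model with pre-trained weights (assumed): object-like pixels score higher.
pretrained_scores = [0.9 if i in object_pixels else 0.05
                     for i in range(num_pixels)]

rng = random.Random(0)  # fixed seed so the run is repeatable
uniform_hits = sum(i in object_pixels
                   for i in sample_pixels(uniform_scores, 1000, rng))
pretrained_hits = sum(i in object_pixels
                      for i in sample_pixels(pretrained_scores, 1000, rng))
print(uniform_hits, pretrained_hits)
```

With these made-up scores, an attempt lands on the object with probability 2/16 (12.5%) under uniform sampling and 1.8/2.5 (72%) under the object-biased scores. Attempts aimed at empty space almost always fail, so the biased sampler is far more likely to collect both successful and failed grasps early on.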
“Many of the methods that we use today for end-to-end robot learning are effectively the same as those being used for computer vision tasks,” wrote the study’s coauthors. “Our work here on visual pre-training illuminates this connection and demonstrates that it is possible to leverage techniques from visual pre-training to improve the learning efficiency of affordance-based manipulation applied to robotic grasping tasks. While our experiments point to a better understanding of deep learning for robots, there are still many interesting questions that have yet to be explored. For example, how do we leverage large-scale pre-training for additional modes of sensing (e.g. force-torque or tactile)? How do we extend these pre-training techniques towards more complex manipulation tasks that may not be as object-centric as grasping? These areas are promising directions for future research.”