Abstract: To realize adaptive and robust manipulation, a robot should have several sensing modalities and coordinate their outputs to achieve the given task based on the underlying constraints in the real ...
Abstract: Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and ...