Meta's OK-Robot performs zero-shot pick-and-drop in unseen environments

There have been many advances in vision-language models (VLMs) that can match natural language queries to objects in a visual scene. Researchers are now experimenting with how these models can be applied to robotics systems, which still lag behind in generalizing their abilities.

A new paper by researchers at Meta AI and New York University presents an open-knowledge-based framework that brings pre-trained machine learning (ML) models together to create a robotics system that can perform tasks in unseen environments. Called OK-Robot, the framework combines VLMs with movement-planning and object-manipulation models to perform pick-and-drop operations without training.

Robotics systems are generally designed to be deployed in previously seen environments and are poor at generalizing their abilities beyond the locations where they have been trained. This limitation is especially problematic in settings where data is scarce, such as unstructured homes.

There have been impressive advances in the individual components needed for robotics systems. VLMs are good at matching language instructions to visual objects. At the same time, robotic skills for navigation and grasping have progressed considerably. However, robotics systems that combine modern vision models with robot-specific primitives still perform poorly.

"Manufacturing progress on This issue requires A careful And shade frame that both integrated VLM And robotics the primitives, while be flexible enough has to integrate more recent models as they are developed by THE VLM And robotics community," THE researchers to write In their paper.

OK-Robot modules (source: arXiv)

OK-Robot combines state-of-the-art VLMs with powerful robotics primitives to perform pick-and-drop tasks in unseen environments. The models used in the system are trained on large, publicly available datasets.

OK-Robot combines three primary subsystems: an open-vocabulary object navigation module, an RGB-D grasping module and a dropping heuristic system. When placed in a new home, OK-Robot requires a manual scan of the interior, which can be captured with an iPhone application that takes a sequence of RGB-D images as the user moves around the building. The system uses the images along with the camera poses and positions to create a 3D map of the environment.
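The mapping step can be illustrated with a short sketch. The snippet below (a minimal illustration, not the authors' code) back-projects each RGB-D frame into 3D using the camera intrinsics and moves the points into a shared world frame with the pose recorded for that frame; the frame format, intrinsics dictionary and pose convention are assumptions made for the example.

```python
import numpy as np

def backproject_frame(depth, rgb, intrinsics, pose):
    """Convert one RGB-D frame into world-space 3D points with colors.

    depth: (H, W) depth map in meters
    rgb: (H, W, 3) color image
    intrinsics: dict with fx, fy, cx, cy (assumed pinhole camera model)
    pose: (4, 4) camera-to-world transform recorded during the scan
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - intrinsics["cx"]) * z / intrinsics["fx"]
    y = (v - intrinsics["cy"]) * z / intrinsics["fy"]
    points_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)

    # Move points from the camera frame into the shared world frame.
    points_world = (pose @ points_cam.T).T[:, :3]
    colors = rgb.reshape(-1, 3)

    # Drop pixels with no valid depth reading.
    valid = depth.reshape(-1) > 0
    return points_world[valid], colors[valid]

def build_map(frames):
    """Accumulate all scanned frames into a single colored point cloud."""
    all_pts, all_cols = [], []
    for depth, rgb, intrinsics, pose in frames:
        pts, cols = backproject_frame(depth, rgb, intrinsics, pose)
        all_pts.append(pts)
        all_cols.append(cols)
    return np.concatenate(all_pts), np.concatenate(all_cols)
```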

The system processes each image with a vision transformer (ViT) model to extract information about objects. The object and environment information are then brought together to create a semantic object memory module.
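As a rough illustration of the semantic memory idea, the sketch below uses the open-source CLIP model from Hugging Face to embed image crops of detected objects and stores each embedding alongside the object's estimated 3D location in the map. The specific model checkpoint, the `ObjectMemory` class and the crop/location inputs are assumptions for this example, not the paper's exact pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class ObjectMemory:
    """Stores one semantic embedding per detected object plus its 3D map location."""

    def __init__(self):
        self.embeddings = []   # list of (D,) tensors, L2-normalized
        self.locations = []    # list of (x, y, z) map coordinates

    def add(self, object_crop, location_xyz):
        # Embed the image crop of the detected object with CLIP's vision tower.
        inputs = processor(images=object_crop, return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)[0]
        self.embeddings.append(emb / emb.norm())
        self.locations.append(location_xyz)
```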

Given a natural language query for picking an object, the memory module computes the embedding of the prompt and matches it to the object with the closest semantic representation. OK-Robot then uses navigation algorithms to find the best path to the object's location, in a way that gives the robot room to manipulate the object without causing collisions.
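Retrieval against that memory can be sketched the same way: embed the text query with the same CLIP model and pick the stored object whose embedding has the highest cosine similarity. This continues the hypothetical `ObjectMemory` above and only outlines the matching step; it is not the paper's navigation planner.

```python
def query(memory: ObjectMemory, text: str):
    """Return the map location of the object best matching a language query."""
    inputs = processor(text=[text], return_tensors="pt")
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)[0]
    text_emb = text_emb / text_emb.norm()

    # Cosine similarity against every stored object embedding.
    sims = torch.stack([emb @ text_emb for emb in memory.embeddings])
    best = int(sims.argmax())
    return memory.locations[best], float(sims[best])

# Example usage (hypothetical query):
#   target, score = query(memory, "the red mug on the kitchen counter")
# The target location is then handed to a collision-aware path planner.
```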

Finally, the robot uses an RGB-D camera, an object segmentation model and a pre-trained grasping model to pick up the object...
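One common way to combine these pieces, shown below purely as an illustrative sketch, is to take grasp candidates from a pre-trained grasp model and keep only those whose grasp point projects inside the segmentation mask of the queried object. The candidate format and the projection helper are assumptions, not the paper's exact interface.

```python
import numpy as np

def project_to_pixel(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame onto the image plane (pinhole model)."""
    x, y, z = point_cam
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))

def filter_grasps_by_mask(grasp_candidates, object_mask, fx, fy, cx, cy):
    """Keep grasp candidates whose grasp point lands on the target object's mask.

    grasp_candidates: list of (point_cam, score) pairs from a grasp model (assumed format)
    object_mask: (H, W) boolean mask from the segmentation model
    """
    h, w = object_mask.shape
    kept = []
    for point_cam, score in grasp_candidates:
        u, v = project_to_pixel(point_cam, fx, fy, cx, cy)
        if 0 <= u < w and 0 <= v < h and object_mask[v, u]:
            kept.append((point_cam, score))
    # Execute the highest-scoring grasp that targets the right object, if any.
    return max(kept, key=lambda g: g[1]) if kept else None
```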
