MIT lets robots navigate via text with LLMs

MIT researchers are bypassing calculation-intensive visual processes by controlling and navigating robots using text-based instructions via an AI language model.


(Image: Yakobchuk Viacheslav / Shutterstock.com)

This article was originally published in German and has been automatically translated.

The Massachusetts Institute of Technology (MIT), together with the MIT-IBM Watson AI Lab, has developed a navigation method that converts visual features from images of a robot's environment into text. A large language model (LLM) then uses these descriptions to let the robot navigate its surroundings under voice control, eliminating the need for complex, computationally intensive visual processing.

If, for example, a household robot is told to load the washing machine in the basement with laundry, it has to break this voice command down into several individual steps and execute them, such as going down the stairs to the basement and finding the washing machine. To do so, the instructions must be combined with the visual information the robot perceives. Training a robot for such a navigation task usually requires large amounts of visual data, which are often difficult to obtain.

The MIT scientists have therefore developed a simpler method, described in the paper "LangNav: Language as a Perceptual Representation for Navigation", which has been published as a preprint on arXiv. It converts visual representations into text that can then be fed into an LLM to perform multi-step navigation tasks. The method generates text descriptions of what the robot sees through its cameras; the language model uses this information to predict the actions the robot must carry out to follow the user's voice instructions.

One advantage of this purely text-based method is that large amounts of synthetic training data can be generated in advance using a large language model. This contrasts with visual methods, whose training data can only be obtained at great expense.
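To illustrate the idea, the following is a minimal sketch of how such synthetic, text-only episodes might be requested from a language model. The prompt wording and the ask_llm callable are hypothetical stand-ins; the paper uses its own prompting setup.

```python
from typing import Callable

def make_synthetic_trajectory(ask_llm: Callable[[str], str], instruction: str) -> str:
    """Ask a language model to write a plausible text-only navigation episode
    (observations and actions) for a given instruction. Episodes like this can
    serve as additional training data for a text-based navigation model."""
    prompt = (
        "Write a step-by-step indoor navigation episode for a robot.\n"
        f"Instruction: {instruction}\n"
        "For each step, give a short description of the scene (Observation: ...) "
        "and the chosen move (Action: ...). End with 'Action: stop'."
    )
    return ask_llm(prompt)
```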

"By using only speech as a perceptual representation, our approach is more straightforward. Since all inputs can be encoded as speech, we can create a trajectory that is understandable to humans," says Bowen Pan, Electrical Engineering and Computer Science (EECS) student and lead author of the study.

The scientists used a simple captioning model to convert the visual data captured by the robot into text descriptions. These captions are combined with the user's voice-based instructions and fed into an LLM, which then decides which navigation step the robot should take next.

The LLM also outputs a text description of the scene the robot should see after completing a navigation step. In this way, a kind of log of the robot's path is built up, so the robot knows where it has already been.
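In rough pseudocode, the loop described above might look like the following sketch. The get_caption, ask_llm, and execute callables are hypothetical placeholders for the captioning model, the language model, and the robot's controller; this is an illustration of the principle, not the authors' actual implementation.

```python
from typing import Callable

def navigate(
    instruction: str,
    get_caption: Callable[[], str],   # text description of the current camera view
    ask_llm: Callable[[str], str],    # language model that picks the next action
    execute: Callable[[str], None],   # carries out a navigation action on the robot
    max_steps: int = 20,
) -> list[str]:
    """Text-only navigation loop: caption the view, ask the LLM, act, repeat."""
    history: list[str] = []           # running log of observations and actions
    for _ in range(max_steps):
        observation = get_caption()
        prompt = (
            f"Instruction: {instruction}\n"
            + "\n".join(history)
            + f"\nObservation: {observation}\nNext action:"
        )
        action = ask_llm(prompt)
        history.append(f"Observation: {observation}\nAction: {action}")
        if action.strip().lower() == "stop":
            break
        execute(action)
    return history
```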

A caption could read something like: "At an angle of 30 degrees to your left is a door with a potted plant next to it, behind you is a small office with a desk and a computer." After evaluating this information, the language model can decide whether the robot should go to the door or into the office. The scientists standardized the form in which observations are presented to the model in order to make them easier to evaluate. This was one of the most difficult tasks, says Pan.
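Such standardization could, for example, be achieved with a fixed template per viewing direction. The following function is a hypothetical illustration of the idea, not the template used in the paper.

```python
def format_observation(objects_by_angle: dict[int, str]) -> str:
    """Turn detected scene elements into one standardized phrase per direction,
    so the language model always receives observations in the same form."""
    parts = []
    for angle, description in sorted(objects_by_angle.items()):
        if angle == 0:
            parts.append(f"Straight ahead is {description}.")
        elif abs(angle) == 180:
            parts.append(f"Behind you is {description}.")
        else:
            side = "left" if angle < 0 else "right"
            parts.append(f"At an angle of {abs(angle)} degrees to your {side} is {description}.")
    return " ".join(parts)

# Example, matching the caption quoted above:
# format_observation({-30: "a door with a potted plant next to it",
#                     180: "a small office with a desk and a computer"})
```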

When testing the text-based method, the scientists found that it navigates a robot under voice control about as well as purely image-based methods. The text-based approach has several advantages, however: synthetic training data can be generated with far less computational effort, and the data is easier to check than computer-generated visual training material, which can also differ from a real scene because of lighting, for example. The text-based representation is also easier for people to understand, which makes it simpler to trace the cause of problems. In addition, the method can be applied to different tasks and environments without changing the model.

However, the text-based method also has a disadvantage: unlike imaging methods, it cannot convey depth information, for example. The MIT researchers now want to address this shortcoming and investigate the extent to which large language models can develop spatial awareness and whether this could benefit language-based navigation.

(olb)