New approach to the "lost in the middle" problem of language models
Language models still cannot process long contexts well. As a rule, information from the middle of the context goes missing. A new approach aims to change that.
The tendency of Large Language Models (LLMs) to skip over the middle of a text or context provided to them is called "lost in the middle": information from the beginning and end of a long context is picked up and processed, but information from the middle is simply overlooked. This phenomenon is one of the biggest problems of large language models, alongside hallucinations, i.e. made-up information. So far there is no real solution. Researchers at Microsoft, together with scientists from Peking University, have now come up with an approach that could at least mitigate the problem. However, it requires a model to undergo a kind of second training.
They call the idea "INformation-INtensive (IN2) training". Their basic assumption is that the error lies in how language models are trained: standard training does not sufficiently teach a model that crucial information can appear anywhere in a long context. IN2 training therefore uses a synthetic long-context data set (4K to 32K tokens) in which short segments (around 128 tokens) are placed at random positions, and the model is trained on this data. These short passages contain the important information that is subsequently asked for, so the model learns to also pay attention to material from the middle. Some questions referred to a single short segment, others required several segments to be answered correctly.
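To make the data construction concrete, here is a minimal sketch of how such IN2-style examples could be assembled. It is an approximation of the idea described above, not the authors' implementation; names like `build_example` and `filler_segments`, and the treatment of segments as plain strings, are illustrative assumptions.

```python
# Minimal sketch of IN2-style data synthesis as described above.
# Assumptions: segments are plain strings of roughly 128 tokens each;
# build_example and filler_segments are illustrative names only.
import random

SEGMENT_TOKENS = 128               # length of each short segment
CONTEXT_TOKENS = (4_000, 32_000)   # target size range of the long context

def build_example(key_segments, filler_segments, rng):
    """Scatter answer-bearing segments at random positions among filler.

    key_segments: one segment for single-segment questions, several for
        questions that must integrate information across segments.
    filler_segments: unrelated ~128-token passages used as padding.
    """
    n_total = rng.randint(*CONTEXT_TOKENS) // SEGMENT_TOKENS
    n_filler = min(len(filler_segments), max(n_total - len(key_segments), 0))
    context = rng.sample(filler_segments, k=n_filler)
    # Insert each key segment at a uniformly random position, so the
    # relevant information can land anywhere, including the middle.
    for seg in key_segments:
        context.insert(rng.randint(0, len(context)), seg)
    return " ".join(context)
```

Each training instance would then pair such a context with a question whose answer depends only on the key segments, forcing the model to locate them wherever they sit.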
Mistral-7B becomes FILM-7B
The researchers used the open-source model Mistral-7B as the base model; the model that emerged from their IN2 training they call FILM-7B (FILl-in-the-Middle). Its skills were tested with tasks drawn from document, code and structured-data contexts, as well as with different retrieval patterns. According to the researchers, FILM-7B proved significantly better at retrieving information from a 32K context window and also better at summarizing long texts. Performance on tasks that require only a short context does not deteriorate compared with the original model.
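To illustrate what such a retrieval test can look like, here is a hedged sketch in the style of a needle-in-a-haystack probe: a known fact is inserted at varying depths of a long filler text and the model is asked to retrieve it. This is a simplification of the paper's probing setup; the `model` callable and the string-matching check are assumptions for illustration.

```python
# Simplified position-wise retrieval probe, inspired by the evaluation
# described above. `model` is assumed to be a callable that maps a prompt
# string to an answer string; the containment check is a crude stand-in
# for a proper scoring method.
def probe_depths(model, fact, question, expected, filler,
                 depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for d in depths:
        cut = int(len(filler) * d)
        context = filler[:cut] + "\n" + fact + "\n" + filler[cut:]
        answer = model(f"{context}\n\nQuestion: {question}\nAnswer:")
        results[d] = expected.lower() in answer.lower()
    return results

# A model that is "lost in the middle" tends to fail around d = 0.5
# while succeeding at d = 0.0 and d = 1.0.
```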
Nevertheless, the "lost in the middle" problem is not completely solved, as the benchmark results published in the paper show: achieving 100 percent correct answers on any of the tasks remains out of reach.
(emw)