
Researchers Extend LLM Performance Through Knowledge Retrieval

Attention is NOT all you need
Language models, impressive as they are at comprehending and generating language, often produce content that is factually incorrect or entirely fabricated, a failure mode known as hallucination. To address this issue, researchers have previously developed a technique known as retrieval augmentation, in which models are given the ability to retrieve relevant information from external sources such as document corpora.
Most existing retrieval-augmented language models use a "retrieve-and-generate" setup, where they fetch documents based on an input query and then generate a response based on the retrieved documents. However, this approach is usually limited to a single retrieval action, which may be insufficient in complex scenarios that involve generating longer texts. In these cases, it's crucial to have the ability to gather multiple pieces of information throughout the generation process.
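To make that baseline concrete, here is a minimal sketch of a single-retrieval pipeline. The helpers are toy stand-ins, not the paper's code: `retrieve` ranks a tiny in-memory corpus by keyword overlap (a real system would use BM25 or a search API), and `generate` stands in for any language model call.

```python
CORPUS = [
    "Joe Biden attended the University of Delaware and Syracuse University.",
    "FLARE retrieves new evidence actively while the model generates.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def generate(prompt: str) -> str:
    # Stand-in for a language model completion call (e.g., an LLM API).
    return f"<completion for: {prompt[:40]}...>"

def retrieve_and_generate(user_input: str) -> str:
    # One retrieval step based only on the input query, followed by one
    # generation pass conditioned on whatever that single retrieval returned.
    docs = retrieve(user_input)
    prompt = "\n".join(docs) + "\n\nQuestion: " + user_input + "\nAnswer:"
    return generate(prompt)
```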

Existing Methods

Several attempts have been made to build models that retrieve information multiple times while generating output. However, these models typically retrieve documents at fixed intervals using the preceding context as the query, which may not reflect what the model intends to generate next and can trigger retrieval at unhelpful points.
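A rough sketch of that fixed-interval pattern makes the limitation visible: the query always looks backward. This reuses the toy `retrieve` from the sketch above, and `lm_generate_tokens` is an assumed helper, not any specific system's API.

```python
def lm_generate_tokens(user_input, tokens, docs, n):
    # Stand-in for an LM call that returns up to n new tokens (empty = done).
    return []

def fixed_interval_generate(user_input, interval=16, window=32, max_tokens=256):
    tokens, docs = [], retrieve(user_input)  # initial retrieval from the query
    while len(tokens) < max_tokens:
        # Generate a fixed-size chunk conditioned on the current evidence.
        chunk = lm_generate_tokens(user_input, tokens, docs, n=interval)
        if not chunk:
            break
        tokens.extend(chunk)
        # Re-retrieve using only the most recent window of *already written*
        # text: the query reflects the past, not the upcoming content.
        docs = retrieve(" ".join(tokens[-window:]))
    return " ".join(tokens)
```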

The Idea

To address these issues, researchers have proposed a new approach called Forward-Looking Active REtrieval augmented generation (FLARE). FLARE actively decides when and what to retrieve over the course of generation.
It uses a prediction of the upcoming sentence to anticipate future content, which is then used as a query to retrieve relevant documents. It evaluates the likelihood of the upcoming sentence and uses low-confidence tokens as indicators for active retrieval.
This is based on the observation that language models are typically well-calibrated and tokens with low probabilities often indicate a lack of knowledge. Therefore, if any token in the predicted sentence has a probability lower than a certain threshold, FLARE triggers a retrieval process.
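Put together, the decision rule amounts to a simple loop. The sketch below assumes an LM interface that exposes per-token log-probabilities; `lm_generate_sentence` and `make_query` are hypothetical helpers, `retrieve` is the toy retriever from earlier, and the threshold value is an illustrative choice, not the paper's setting.

```python
import math

THETA = 0.6  # probability threshold that triggers retrieval (illustrative)

def lm_generate_sentence(user_input, answer, evidence=None):
    # Stand-in for an LM call returning the tentative next sentence plus its
    # per-token log-probabilities; an empty sentence means generation is done.
    return "", []

def make_query(user_input, answer, sentence):
    # Simplest option: use the tentative sentence itself as the query.
    # FLARE also masks low-confidence tokens or asks questions about them.
    return sentence

def flare_generate(user_input: str, max_sentences: int = 20) -> str:
    sentences = []
    for _ in range(max_sentences):
        answer = " ".join(sentences)
        # 1. Tentatively predict the next sentence with no new evidence.
        sentence, logprobs = lm_generate_sentence(user_input, answer)
        if not sentence:
            break
        # 2. If every token clears the threshold, accept the sentence as-is.
        if logprobs and min(math.exp(lp) for lp in logprobs) >= THETA:
            sentences.append(sentence)
            continue
        # 3. Otherwise retrieve with a forward-looking query built from the
        #    tentative sentence, then regenerate it with evidence in context.
        docs = retrieve(make_query(user_input, answer, sentence))
        sentence, _ = lm_generate_sentence(user_input, answer, evidence=docs)
        sentences.append(sentence)
    return " ".join(sentences)
```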

Question Generation

FLARE employs a method of generating explicit questions that target the low-confidence span in the predicted sentence. For example, if the model is uncertain about a detail like "the University of Pennsylvania," a question such as "Which university did Joe Biden attend?" can help retrieve relevant information.
While some previous methods achieved this by manually inserting follow-up questions, an approach that depends on large amounts of hand-crafted annotation, FLARE offers a universal approach that generates questions for low-confidence spans without additional annotation. For each extracted span whose tokens fall below the probability threshold, FLARE prompts a language model to generate a question that can be answered with that span. This is done with a predefined zero-shot question-generation prompt that takes into account the user input and the output generated so far.
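In code, the question-generation step might look roughly like the following. The prompt wording is an assumption for illustration, not the paper's exact template, and `generate` is the stubbed LM call from the first sketch.

```python
QG_PROMPT = (
    "{context}\n"
    'Given the passage above, ask a question to which the answer is "{span}".\n'
    "Question:"
)

def questions_for_spans(user_input, answer_so_far, sentence, low_conf_spans):
    # Build one question per low-confidence span, where each span is a run
    # of tokens whose probability fell below the threshold.
    context = " ".join([user_input, answer_so_far, sentence])
    return [generate(QG_PROMPT.format(context=context, span=s))
            for s in low_conf_spans]
```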
Visual Explanation from the paper

Knowledge Retriever

FLARE uses off-the-shelf retrievers that take these queries as inputs and return a list of relevant documents. Depending on the dataset, different retrievers are used. For instance, for datasets that mainly rely on knowledge from Wikipedia, FLARE uses the Wikipedia dump and employs the BM25 information retrieval model.
For datasets that rely on knowledge from the open web, FLARE uses the Bing search engine as the retriever. This flexible retrieval mechanism allows FLARE to adapt to different types of knowledge resources, further enhancing the quality of the generated content.
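For the Wikipedia-style setting, a BM25 retriever is straightforward to stand up. The sketch below uses the open-source rank_bm25 package (pip install rank-bm25) over a toy document list; the paper's actual setup indexes a full Wikipedia dump, and the web-based variant would swap this function for a Bing search call.

```python
from rank_bm25 import BM25Okapi

documents = [
    "Joe Biden graduated from the University of Delaware in 1965.",
    "Joe Biden earned a law degree from Syracuse University in 1968.",
    "FLARE decides when and what to retrieve during generation.",
]

# Build the BM25 index over whitespace-tokenized, lowercased documents.
tokenized = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized)

def bm25_retrieve(query: str, k: int = 2) -> list[str]:
    # Rank all documents against the query and return the top k.
    return bm25.get_top_n(query.lower().split(), documents, n=k)

print(bm25_retrieve("Which university did Joe Biden attend?"))
```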



Results

The approach was tested on four long-form, knowledge-intensive generation tasks and datasets, achieving superior or competitive performance in all cases and demonstrating FLARE's effectiveness in reducing hallucination and improving the factual accuracy of generated content.

Results from the paper

The Paper

Active Retrieval Augmented Generation — Zhengbao Jiang et al., 2023: https://arxiv.org/abs/2305.06983
Tags: ML News