ASQA: A Dataset Of Long-Form Answers For Ambiguous Questions
Google Research has released a paper describing a new dataset of ambiguous questions paired with long-form answers.
Created on April 14|Last edited on April 14
A new paper from Google Research, "ASQA: Factoid Questions Meet Long-Form Answers", details the creation of a dataset of questions and answers that focuses on answering ambiguous questions with long-form answers covering the full scope of the question asked.
Unlike other existing datasets, ASQA (Answer Summaries for Questions which are Ambiguous) provides answers that cover all the different avenues of ambiguity a question could pose in a single long-form summary. Often when someone asks an ambiguous question, it means they don't have enough knowledge or context to be able to ask a more specific question.
An example described in the paper is the question "Who was the ruler of France in 1830?". The answer depends on the date, since Charles X was king until August 2nd and Louis-Philippe I was king from August 9th onward. ASQA combines these two answers, along with additional context, into a single long-form answer so that the asker understands the full scope of the question they originally asked.
Handling ambiguity
There are two types of questions: ambiguous and unambiguous.
Most QA datasets provide basic lists of factoid data connecting an unambiguous question with a piece of basic factual knowledge, such as how much a movie grossed in its release year. Those types of questions and answers are simple and don't require any nuance to answer the full scope of the question.
As far as datasets covering ambiguous questions go, one of the more popular ones is the ELI5 dataset, which is based on questions and answers submitted to the "Explain Like I'm Five" subreddit on Reddit. The ELI5 dataset has faults: it relies on answers from random people on the internet, which are not necessarily sourced, are often highly subjective, and often don't cover every facet of an ambiguous question.
Another dataset, AmbigQA, served as the jumping-off point for ASQA. AmbigQA compiled a collection of ambiguous questions and split each one into disambiguated sub-questions that have simple factual answers.
Working from a stripped-down version of the AmbigQA dataset, native English speakers were recruited to write a single, more comprehensive answer to each ambiguous question, adding additional knowledge where applicable. This process gives the ASQA dataset well-written, comprehensive answers to ambiguous questions, with sources provided.

Training a model on ASQA
The ASQA dataset provides all the relevant information needed to train a model to write long-form answers to ambiguous questions.
Each instance in the dataset contains: the original ambiguous question, question-answer pairs for the disambiguated questions from the AmbigQA dataset, the Wikipedia pages used as sources, and the long-form answer and additional knowledge provided by the participants in the annotation process briefly described above.
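To make the shape of an instance more concrete, here is a minimal Python sketch of how one record could be represented in memory. The class names, field names, and the `covered_short_answers` helper are all hypothetical and chosen for illustration; they are not the schema or evaluation code of the released dataset. The sample record is built from the France example above, not copied from ASQA, and the Wikipedia page list is left empty because this article does not name the sources.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    """One disambiguated sub-question with its short factoid answer."""
    question: str
    short_answer: str


@dataclass
class AsqaExample:
    """Hypothetical container for one ASQA-style instance (names are assumptions)."""
    ambiguous_question: str
    qa_pairs: List[QAPair]
    wikipedia_pages: List[str]
    long_answer: str
    additional_knowledge: List[str] = field(default_factory=list)


def covered_short_answers(example: AsqaExample) -> List[str]:
    """Return the short answers that appear verbatim in the long-form answer.

    This is a simple illustrative check, not the paper's evaluation metric.
    """
    text = example.long_answer.lower()
    return [p.short_answer for p in example.qa_pairs
            if p.short_answer.lower() in text]


# Illustrative record assembled from the France example in this article.
example = AsqaExample(
    ambiguous_question="Who was the ruler of France in 1830?",
    qa_pairs=[
        QAPair("Who was the ruler of France until August 2nd, 1830?",
               "Charles X"),
        QAPair("Who was the ruler of France from August 9th, 1830?",
               "Louis-Philippe I"),
    ],
    wikipedia_pages=[],  # sources are not named in this article
    long_answer=(
        "Charles X was king of France until August 2nd, 1830, and "
        "Louis-Philippe I was king from August 9th onward."
    ),
)

print(covered_short_answers(example))
```

A structure like this keeps the disambiguated question-answer pairs next to the long-form answer, so it is easy to check whether a written (or generated) summary mentions every short answer it is supposed to cover.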
Find out more
More specific information on the training process, along with access to the dataset, can be found at: https://github.com/google-research/language/tree/master/language/asqa
Tags: ML News