Prompting

This report illustrates the prompting of Llama 2 for generative question answering.

This report presents the results of prompting the Llama 2 70B model to answer questions from the English version of the BiPaR dataset. For this purpose, 1,500 passage-question-answer triples from the test set were used. The results of three different approaches are shown below:

  • zero-shot prompts did not use any support examples.
  • k-shot prompts used k support examples, each consisting of a shortened passage, a question, and an answer; the examples were selected by matching question words.
  • k-shot_dialog prompts extended in-context learning by casting the support examples as a conversation (see the sketch below).
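
The report does not reproduce the prompt construction, so the following Python sketch only illustrates how the three prompt types could be assembled. The templates, the question-word matching, and the chat-message format are assumptions, not the report's actual code.

```python
from typing import TypedDict


class Example(TypedDict):
    passage: str   # shortened passage
    question: str
    answer: str


QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "which")


def question_word(question: str) -> str:
    """Return the leading question word, used to match support examples."""
    tokens = question.strip().lower().split()
    first = tokens[0] if tokens else ""
    return first if first in QUESTION_WORDS else "other"


def select_examples(question: str, pool: list[Example], k: int) -> list[Example]:
    """Pick up to k support examples whose questions share the same question word."""
    target = question_word(question)
    matches = [ex for ex in pool if question_word(ex["question"]) == target]
    return (matches or pool)[:k]


def zero_shot_prompt(passage: str, question: str) -> str:
    """Single prompt with no support examples."""
    return ("Answer the question based on the passage.\n\n"
            f"Passage: {passage}\nQuestion: {question}\nAnswer:")


def k_shot_prompt(passage: str, question: str, pool: list[Example], k: int) -> str:
    """Prepend k support examples (shortened passage, question, answer) to the prompt."""
    parts = [
        f"Passage: {ex['passage']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in select_examples(question, pool, k)
    ]
    parts.append(f"Passage: {passage}\nQuestion: {question}\nAnswer:")
    return "Answer the question based on the passage.\n\n" + "\n\n".join(parts)


def k_shot_dialog(passage: str, question: str, pool: list[Example], k: int) -> list[dict]:
    """Present the same support examples as a multi-turn chat history instead of one prompt."""
    messages = [{"role": "system", "content": "Answer the question based on the passage."}]
    for ex in select_examples(question, pool, k):
        messages.append({"role": "user",
                         "content": f"Passage: {ex['passage']}\nQuestion: {ex['question']}"})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user", "content": f"Passage: {passage}\nQuestion: {question}"})
    return messages
```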


Evaluation Metrics


The bars in the following plots are sorted in ascending order based on the F1 score.

[Bar charts of precision, recall, F1, and exact match for the zero-shot, 1- to 4-shot, and 1- to 4-shot_dialog prompts, sorted in ascending order of F1]

  • The generative question answering approach achieved higher recall than precision (see the metric sketch after this list), indicating that generative models are better at identifying relevant words than at excluding irrelevant ones from their answers. The exact match (EM) values were the lowest of all metrics.
  • Except for recall, the dialog-enhanced approach with up to three examples yielded the best results of the three approaches.
  • The 4-shot dialog-based approach produced the lowest precision, F1, and EM scores among the dialog-enhanced approaches, likely because of the considerable length of its prompt.
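
For reference, the sketch below shows SQuAD-style token-level metrics of the kind plotted above. The exact normalization applied in the report (casing, punctuation, article handling) is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 only if the normalized answers are identical."""
    return float(normalize(prediction) == normalize(reference))


def precision_recall_f1(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)   # penalizes extra, irrelevant tokens
    recall = overlap / len(ref)       # rewards covering the reference tokens
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With these definitions, a verbose but on-topic answer keeps recall high while the extra tokens pull precision down, which matches the pattern seen in the plots.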

Inference Time




The prompt length and number of messages exchanged affected the inference time.
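
The report does not show how inference time was measured. The sketch below is one plausible way to record wall-clock time against prompt length, with `generate` standing in as a placeholder for whichever client was used to query Llama 2 70B.

```python
import time


def timed_generation(generate, prompt: str) -> tuple[str, float]:
    """Run one generation call and return the answer with its wall-clock time."""
    start = time.perf_counter()
    answer = generate(prompt)
    elapsed = time.perf_counter() - start
    return answer, elapsed


def profile_prompts(generate, prompts: list[str]) -> list[dict]:
    """Record prompt length (in characters) against inference time for each prompt."""
    records = []
    for prompt in prompts:
        _, elapsed = timed_generation(generate, prompt)
        records.append({"prompt_chars": len(prompt), "seconds": elapsed})
    return records
```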