Prompting

This report illustrates the prompting of Llama 2 for generative question answering.

This report presents the results of prompting the Llama 2 70B model to answer questions from the English version of the BiPaR dataset. For this purpose, 1,500 passage-question-answer triples from the test set were used. The results of three different approaches are shown below:

  • zero-shot prompts did not use any support examples.
  • k-shot prompts used k support examples, each consisting of a shortened passage, a question, and an answer; the examples were selected by matching question words.
  • k-shot_dialog prompts extended in-context learning by casting the support examples as a conversation (see the sketch below).
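
The report does not reproduce the prompt construction, so the following Python sketch only illustrates how the three prompt types could be assembled. The templates, the question-word matching, and the chat-message format are assumptions, not the report's actual code.

```python
from typing import TypedDict


class Example(TypedDict):
    passage: str   # shortened passage
    question: str
    answer: str


QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "which")


def question_word(question: str) -> str:
    """Return the leading question word, used to match support examples."""
    tokens = question.strip().lower().split()
    first = tokens[0] if tokens else ""
    return first if first in QUESTION_WORDS else "other"


def select_examples(question: str, pool: list[Example], k: int) -> list[Example]:
    """Pick up to k support examples whose questions share the same question word."""
    target = question_word(question)
    matches = [ex for ex in pool if question_word(ex["question"]) == target]
    return (matches or pool)[:k]


def zero_shot_prompt(passage: str, question: str) -> str:
    """Single prompt with no support examples."""
    return ("Answer the question based on the passage.\n\n"
            f"Passage: {passage}\nQuestion: {question}\nAnswer:")


def k_shot_prompt(passage: str, question: str, pool: list[Example], k: int) -> str:
    """Prepend k support examples (shortened passage, question, answer) to the prompt."""
    parts = [
        f"Passage: {ex['passage']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in select_examples(question, pool, k)
    ]
    parts.append(f"Passage: {passage}\nQuestion: {question}\nAnswer:")
    return "Answer the question based on the passage.\n\n" + "\n\n".join(parts)


def k_shot_dialog(passage: str, question: str, pool: list[Example], k: int) -> list[dict]:
    """Present the same support examples as a multi-turn chat history instead of one prompt."""
    messages = [{"role": "system", "content": "Answer the question based on the passage."}]
    for ex in select_examples(question, pool, k):
        messages.append({"role": "user",
                         "content": f"Passage: {ex['passage']}\nQuestion: {ex['question']}"})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user", "content": f"Passage: {passage}\nQuestion: {question}"})
    return messages
```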


Evaluation Metrics


The bars in the following plots are sorted in ascending order based on the F1 score.

[Bar charts of precision, recall, F1, and exact match for the zero-shot, 1- to 4-shot, and 1- to 4-shot_dialog prompts, sorted in ascending order of F1]

  • The generative question answering approach achieved higher recall than precision (see the metric sketch after this list), indicating that generative models are better at identifying relevant words than at excluding irrelevant ones from their answers. The exact match (EM) values were the lowest of all metrics.
  • Except for recall, the dialog-enhanced approach with up to three examples yielded the best results of the three approaches.
  • The 4-shot dialog-based approach produced the lowest precision, F1, and EM scores among the dialog-enhanced approaches, likely because of the considerable length of its prompt.
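
For reference, the sketch below shows SQuAD-style token-level metrics of the kind plotted above. The exact normalization applied in the report (casing, punctuation, article handling) is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 only if the normalized answers are identical."""
    return float(normalize(prediction) == normalize(reference))


def precision_recall_f1(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)   # penalizes extra, irrelevant tokens
    recall = overlap / len(ref)       # rewards covering the reference tokens
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With these definitions, a verbose but on-topic answer keeps recall high while the extra tokens pull precision down, which matches the pattern seen in the plots.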

Inference Time




The prompt length and number of messages exchanged affected the inference time.
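
The report does not show how inference time was measured. The sketch below is one plausible way to record wall-clock time against prompt length, with `generate` standing in as a placeholder for whichever client was used to query Llama 2 70B.

```python
import time


def timed_generation(generate, prompt: str) -> tuple[str, float]:
    """Run one generation call and return the answer with its wall-clock time."""
    start = time.perf_counter()
    answer = generate(prompt)
    elapsed = time.perf_counter() - start
    return answer, elapsed


def profile_prompts(generate, prompts: list[str]) -> list[dict]:
    """Record prompt length (in characters) against inference time for each prompt."""
    records = []
    for prompt in prompts:
        _, elapsed = timed_generation(generate, prompt)
        records.append({"prompt_chars": len(prompt), "seconds": elapsed})
    return records
```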