Prompt Templates: Delimiters and Wording
This report demonstrates Llama 2’s sensitivity to the choice of words and delimiters in prompts.
The plots below show the evaluation of the answers generated by the 70B Llama 2 model for 1,500 English passages and questions from the BiPaR development set. All experiments were performed zero-shot, and the bars in all plots are sorted by F1 score in ascending order.
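The scoring code itself is not shown in this report, but the metrics named (EM, precision, recall, F1) match the standard SQuAD-style token-level evaluation for extractive QA. The following is a minimal sketch of that scheme, an assumption rather than the exact scorer used here:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, then tokenize (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_scores(prediction: str, reference: str) -> dict[str, float]:
    """Exact match plus token-level precision, recall, and F1 for one answer pair."""
    pred, ref = normalize(prediction), normalize(reference)
    em = float(pred == ref)
    # Count tokens shared between prediction and reference (with multiplicity).
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return {"em": em, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = common / len(pred)
    recall = common / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return {"em": em, "precision": precision, "recall": recall, "f1": f1}
```

Under this scheme, recall only penalizes missing reference tokens while precision penalizes extra predicted tokens, which may be why recall varies less across prompts than the other metrics do.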
The results of both experiments indicate that Llama 2 is highly sensitive to variations in delimiters and formulations in prompt templates.
Prompt Delimiters
Large differences are observable across all metrics except recall. For example, the prompt using hashes as delimiters achieved 40.32% precision, whereas the prompt with curly brackets as delimiters yielded only 30.17%.
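For concreteness, here is one way the same template can be wrapped in different delimiter pairs. Only hashes and curly brackets are quoted above; the remaining pairs, and the Passage/Question/Answer framing, are illustrative assumptions:

```python
# Delimiter pairs: hashes and curly brackets come from the report,
# the other two pairs are assumed for illustration.
DELIMITERS = {
    "hashes": ("#", "#"),
    "curly_brackets": ("{", "}"),
    "angle_brackets": ("<", ">"),   # assumed
    "double_quotes": ('"', '"'),    # assumed
}

def build_prompt(context: str, question: str, variant: str) -> str:
    """Wrap the passage and the question in the chosen delimiter pair."""
    left, right = DELIMITERS[variant]
    return (
        f"Passage: {left}{context}{right}\n"
        f"Question: {left}{question}{right}\n"
        f"Answer:"
    )

# Example: the hash variant, which scored best on BiPaR.
print(build_prompt("Some passage ...", "Some question?", "hashes"))
```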
Prompt Formats
The following templates were taken from The FLAN Instruction Tuning Repository, specifically from its squad_v1 entry. The templates were then modified by surrounding the input with hashes, as this delimiter proved to be the best choice for the BiPaR dataset (see the previous section).
As with the delimiters, different formulations in prompt templates resulted in considerable variation across all metrics except recall. For instance, the EM of the prompt based on the wording Passage: #{context}#\nQuestion: #{question}# was 27.53%, whereas the EM of the prompt formulated as Here is a question about this article: #{...}#\nWhat is the answer to this question: #{...}# was only 18.33%.
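Expressed as templates, the two wordings compared above look roughly as follows. Note that the report elides the slot names in the second wording (#{...}#); reading the first slot as the passage and the second as the question is an assumption here:

```python
# The two FLAN-derived wordings; hash delimiters surround each input slot.
TEMPLATES = {
    "passage_question": "Passage: #{context}#\nQuestion: #{question}#",
    # Slot assignment below is assumed: the report writes #{...}# for both slots.
    "here_is_a_question": (
        "Here is a question about this article: #{context}#\n"
        "What is the answer to this question: #{question}#"
    ),
}

def render(name: str, context: str, question: str) -> str:
    """Fill one of the wordings; the hashes remain in the rendered prompt."""
    return TEMPLATES[name].format(context=context, question=question)
```

Since the hashed inputs are identical in both variants, the only difference between the rendered prompts is the surrounding wording, which isolates its effect on the scores.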