
LLM evaluation benchmarking: Beyond BLEU and ROUGE

Moving from word-counting to metrics that actually measure quality
BLEU and ROUGE scores look great on paper. They're fast to compute, easy to understand, and show up in practically every NLP research paper. The problem? They were designed twenty years ago for a completely different kind of language model.
Traditional metrics count word matches between your model's output and a reference answer. When machine translation systems produced rigid, predictable text, this worked reasonably well. But modern LLMs don't play by those rules. They paraphrase, elaborate, and produce answers that can mean exactly the same thing as the reference while sharing almost no words with it. In cases like this, BLEU looks at a perfect answer yet assigns it a near-zero score.
This article covers why traditional metrics break down with modern LLMs, what alternatives exist, and how to set up an evaluation pipeline that measures things users actually care about.

What are LLM evaluation metrics, and why do they matter?

At its core, every metric is just a function: put in some model output, get back a number. The number represents some aspect of quality, such as accuracy, fluency, factual correctness, or whatever else you're trying to measure.
The tricky part is that whichever metric you pick ends up shaping your entire optimization process. If your metric doesn't capture what matters, you'll build a model that scores well on tests but disappoints real users.
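To make that concrete, here's a minimal sketch of a metric as a function. The exact_match name and the whitespace normalization are just illustrative choices, not a standard API:
def exact_match(output: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized strings match, else 0.0."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().strip().split())
    return 1.0 if normalize(output) == normalize(reference) else 0.0

# Put in some model output, get back a number.
print(exact_match("Paris", "paris"))                  # 1.0
print(exact_match("The capital is Paris.", "Paris"))  # 0.0 -- a correct answer scored as a failure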
Modern metrics generally fall into two camps:
  • Surface-level metrics (BLEU, ROUGE, METEOR, exact-match): Fast, deterministic, and easy to interpret, but blind to meaning. They compare the model's output against a reference text and count overlapping words and phrases; high overlap means a high score. Good for narrow tasks with template-like outputs.
  • Semantic metrics (BERTScore, BLEURT, GPTScore, embedding similarity): Use neural networks or LLMs to judge meaning rather than form. Slower, more expensive, and sometimes less reproducible, but much better at recognizing valid paraphrases, assessing coherence, and matching human judgments of quality.
The metrics you track directly influence how you fine-tune, which prompts you choose, and what you consider good enough to ship. If you optimize solely for BLEU, you'll get models that memorize reference phrasings but struggle with real user questions. If you track semantic similarity, groundedness, and task success, you'll build systems that work in practice.

Common benchmarks worth knowing

A benchmark is a standardized test suite: a fixed set of tasks, datasets, and metrics that everyone runs the same way. Instead of cherry-picking examples or inventing your own scoring rules, you use a shared test so results are comparable across models and labs.
Here are the benchmarks you'll see most often:
  • GLUE and SuperGLUE: Collections of classic NLP tasks like sentiment analysis, textual entailment, and question answering. Built for BERT-era models. They're well-studied but increasingly saturated; modern LLMs score near 100% on many GLUE tasks.
  • MMLU (Massive Multitask Language Understanding): 57 subjects spanning high school and college knowledge, from math and physics to law and history. Measures breadth of knowledge and reasoning. Typical GPT-4 class models score 80–90%.
  • BIG-bench: A huge collection of over 200 diverse tasks contributed by researchers, designed to probe edge cases, reasoning, and factual knowledge. Tasks range from arithmetic to social bias detection. Useful for finding blind spots.
  • HELM (Holistic Evaluation of Language Models): A framework that runs models across many scenarios (question answering, summarization, sentiment, toxicity) and reports accuracy, calibration, robustness, fairness, and efficiency. HELM doesn't just ask "how accurate is the model?" but "how does it fail, and who does it fail for?"
  • MT-Bench / AlpacaEval / Arena-Hard: Chat and instruction-following benchmarks that use pairwise LLM-as-a-judge comparisons instead of fixed labels. Closer to how humans evaluate conversational agents.
Unlike a single metric, these benchmarks give you a multi-dimensional view. For example, the model might excel at factual recall but struggle with commonsense reasoning, or perform well in English but poorly in other languages. That nuance is what helps you decide whether a model is ready for your use case.

Statistical metrics vs. model-based metrics

When you score model output, you have two broad options: count tokens or ask a model to judge meaning.

Statistical scorers

How they work: Compare output to reference(s) and compute overlap in words, n-grams, or longest common subsequences.
Examples: BLEU (translation), ROUGE (summarization), METEOR (adds stemming and synonyms), exact-match (short answers).
Pros: Fast, deterministic, transparent. No API calls, no extra models, same score every time.
Cons: Ignore semantics. Penalize valid paraphrases. Can't assess reasoning, factual correctness, or coherence.
When to use them: Narrow tasks with template outputs, regression testing to catch catastrophic changes, legacy baselines for comparison.
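To make "count overlapping words" concrete, here's a rough sketch of clipped unigram precision, the building block behind BLEU (the real metric adds higher-order n-grams, a brevity penalty, and smoothing):
from collections import Counter

def unigram_precision(output: str, reference: str) -> float:
    """Fraction of output tokens that also appear in the reference, with counts clipped."""
    out_counts = Counter(output.lower().split())
    ref_counts = Counter(reference.lower().split())
    matches = sum(min(count, ref_counts[token]) for token, count in out_counts.items())
    total = sum(out_counts.values())
    return matches / total if total else 0.0

print(unigram_precision("The test was successful", "The experiment succeeded"))  # 0.25 -- only "the" overlaps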

Model-based scorers

How they work: Use neural embeddings or an LLM to assess quality. BERTScore compares embeddings; BLEURT is trained to predict human judgments; GPTScore prompts an LLM to rate outputs.
Examples: BERTScore (embedding similarity), BLEURT (trained on human ratings), GPTScore, LLM-as-a-judge (prompt a strong model to grade answers).
Pros: Recognize paraphrases, measure semantic similarity, and align better with human preferences.
Cons: Slower, costs money (if calling APIs), less reproducible (model updates change scores), can inherit biases from the judge model.
When to use them: Open-ended generation, tasks with many valid answers, when human correlation matters more than speed.

The tradeoff

Statistical metrics give you a cheap, stable signal that's easy to debug. Model-based metrics give you a richer signal that actually reflects whether the output is good. In practice, you'll use both: BLEU/ROUGE for quick smoke tests, semantic metrics for deeper evaluation, and human review to calibrate everything.
| Dimension | Statistical scorers | Model-based scorers |
|---|---|---|
| Speed | Instant | Seconds per example |
| Cost | Free | API calls or GPU time |
| Reproducibility | Perfect | Changes with model updates |
| Semantic awareness | None | High |
| Human correlation | Low to moderate | Moderate to high |
| Best for | Regression tests, narrow tasks | Open-ended tasks, quality checks |


The limitations of the ROUGE and BLEU metrics

BLEU was designed in 2002 for machine translation; ROUGE followed shortly after for summarization. Both served their purpose when models were fragile and you wanted to keep outputs close to known-good references. But LLMs changed the rules.

Where BLEU and ROUGE fall short

  1. They ignore meaning: If your model says "The capital of France is Paris" and the reference says "Paris is the capital of France," BLEU will give a mediocre score because word order differs. A human sees two identical statements.
  2. They penalize paraphrases: Suppose the reference is "The experiment succeeded." Your model outputs "The test was successful." Zero word overlap, BLEU near zero, but the meaning is the same.
  3. They can't assess reasoning: For chain-of-thought prompts or multi-step solutions, BLEU only cares if the words match. It doesn't check whether the logic is sound or the steps are valid.
  4. They reward keyword stuffing: You can game BLEU by repeating words from the reference, even if the output is nonsense. Higher score, worse quality.
  5. Multiple valid answers break them: Many tasks have dozens of good responses. BLEU and ROUGE compare against one or a few references and penalize everything else, even if it's correct.

A concrete example

Let's understand this with a simple example:
Question: "Who wrote Pride and Prejudice?" Reference: "Jane Austen" Model A: "Jane Austen" Model B: "Pride and Prejudice was written by Jane Austen." Model C: "The author is Jane Austen."
BLEU and ROUGE rank them: A > C > B, because A is an exact match and B adds extra words. But all three are perfectly correct answers. If you optimize for BLEU, you'll push the model toward terse, reference-matching outputs that might feel unnatural to users.
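You can verify this with the same libraries used later in the tutorial (sacrebleu and rouge-score). Exact numbers depend on tokenization and smoothing, but the ordering typically comes out A above C above B:
import sacrebleu
from rouge_score import rouge_scorer

reference = "Jane Austen"
candidates = {
    "A": "Jane Austen",
    "B": "Pride and Prejudice was written by Jane Austen.",
    "C": "The author is Jane Austen.",
}

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for name, answer in candidates.items():
    bleu = sacrebleu.sentence_bleu(answer, [reference], smooth_method="exp").score
    rouge_l = scorer.score(reference, answer)["rougeL"].fmeasure
    # All three answers are correct; only their wording differs
    print(f"Model {name}: BLEU={bleu:.1f}  ROUGE-L={rouge_l:.2f}")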

What to use instead—and when to use it

  • BERTScore: Compares outputs using BERT embeddings, so paraphrases score high.
  • BLEURT: Trained on human judgments of quality, aligns better with what people actually prefer.
  • Semantic similarity (embeddings): Measure cosine similarity between output and reference in a vector space.
  • LLM-as-a-judge: Prompt a strong model to rate outputs on a rubric (accuracy, clarity, groundedness).
These aren't perfect either, but they at least recognize when two different phrasings mean the same thing.
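As a quick sanity check, score the paraphrase pair from earlier with both BLEU and BERTScore. The exact BERTScore value depends on the underlying model, but it should land far above the near-zero BLEU:
import sacrebleu
from bert_score import score as bert_score_fn

reference = "The experiment succeeded."
output = "The test was successful."

bleu = sacrebleu.sentence_bleu(output, [reference], smooth_method="exp").score
_, _, f1 = bert_score_fn([output], [reference], lang="en", verbose=False)

print(f"BLEU: {bleu:.1f}")                   # near zero: almost no word overlap
print(f"BERTScore F1: {float(f1[0]):.2f}")   # high: the embeddings recognize the paraphrase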

Comprehensive evaluation through benchmarks

A single metric is just one lens. Benchmarks combine multiple metrics across many tasks to give you a fuller picture of what a model can and can't do.
Take HELM as an example. Instead of just reporting accuracy on question answering, it also measures:
  • Calibration: Does the model know when it's uncertain? (see the sketch below)
  • Robustness: How sensitive is it to prompt wording or example order?
  • Fairness: Does performance vary across demographic groups?
  • Efficiency: Token cost, latency, carbon footprint.
That multi-dimensional view reveals tradeoffs.
  • A model might have high accuracy but terrible calibration (overconfident on wrong answers).
  • Another might be fast but biased.
You can't see these patterns with BLEU alone.
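To make the calibration dimension concrete, here's a minimal sketch of expected calibration error (ECE), assuming you already have a confidence and a correctness flag for each answer. The binning scheme and function name are illustrative, not HELM's exact implementation:
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Average |accuracy - confidence| across confidence bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# An overconfident model: very high confidence, mediocre accuracy -> large ECE
print(expected_calibration_error([0.95, 0.90, 0.99, 0.92], [1, 0, 1, 0]))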
BIG-bench takes a different approach: hundreds of contributed tasks, many intentionally weird or edge-case-heavy, designed to find blind spots. One task tests arithmetic with large numbers. Another checks whether the model understands social norms. Another probes linguistic structure. Running your model through BIG-bench shows you where it silently fails in ways you wouldn't have thought to test.
MMLU is simpler but still valuable: 57 subjects, multiple-choice questions, clear accuracy scores per domain. If your model scores 90% on physics but 60% on law, you know where to focus fine-tuning or retrieval.
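Scoring a multiple-choice benchmark like MMLU is simple in principle: compare the predicted letter to the gold letter and aggregate per subject. A rough sketch, with an illustrative record format rather than MMLU's actual files:
from collections import defaultdict

# Each record: the subject, the model's predicted choice, and the gold choice (illustrative format)
records = [
    {"subject": "physics", "pred": "B", "gold": "B"},
    {"subject": "physics", "pred": "C", "gold": "A"},
    {"subject": "law", "pred": "D", "gold": "D"},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["subject"]] += 1
    hits[r["subject"]] += int(r["pred"] == r["gold"])

for subject in totals:
    print(f"{subject}: {hits[subject] / totals[subject]:.0%}")  # per-domain accuracy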
The common thread: comprehensive benchmarks don't reduce the model to a single number. They give you a profile of strengths, weaknesses, and tradeoffs that helps you decide whether the model fits your use case.

Tutorial: Implementing LLM evaluation with Weights & Biases

This section walks through a hands-on example: evaluating model outputs with multiple metrics and comparing them visually in the W&B dashboard. We'll build a realistic evaluation setup with a diverse dataset, two different models, and multiple judgment criteria.

What we'll build

  • A Q&A dataset with 12 examples covering factual knowledge, explanations, comparisons, and calculations
  • Two model wrappers: GPT-4o-mini and GPT-3.5-turbo
  • Six metrics that measure different aspects of answer quality:
    • BLEU (statistical baseline)
    • ROUGE-L (statistical baseline)
    • BERTScore (semantic similarity)
    • Embedding similarity (cosine similarity in vector space)
    • LLM-as-a-judge for factual accuracy (GPT-4o-mini)
    • LLM-as-a-judge for helpfulness (GPT-3.5-turbo)
  • A Weave Evaluation that runs both models against every metric and logs the results
  • A leaderboard to compare the models side-by-side

Step 0: Install dependencies

Start with the required packages:
pip install weave openai bert-score rouge-score sacrebleu wandb python-dotenv
wandb login

Step 1: Initialize Weave

Weave manages the evaluation infrastructure and logging while the other packages will provide our scoring functions.
import weave
from openai import OpenAI
import os
from dotenv import load_dotenv
Next, load your environment variables so the OpenAI SDK has access to your API key. Keeping the key in a .env file avoids hard-coding secrets in your code. Then initialize your Weave project:
load_dotenv()

# Capture the Weave client for later use with leaderboards
weave_client = weave.init("llm-benchmarking-demo")
openai_client = OpenAI()

Step 2: Create a diverse dataset

We'll build a dataset that mixes different answer types: short facts, explanations, comparisons, and calculations. This variety helps us see how different metrics behave across question types.
from weave import Dataset

rows = [
    {
        "id": "1",
        "question": "What is the capital of Japan?",
        "reference": "Tokyo",
        "category": "factual"
    },
    {
        "id": "2",
        "question": "Who developed the theory of relativity?",
        "reference": "Albert Einstein",
        "category": "factual"
    },
    {
        "id": "3",
        "question": "Explain what photosynthesis is.",
        "reference": "Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to produce oxygen and energy in the form of sugar.",
        "category": "explanation"
    },
    {
        "id": "4",
        "question": "What is machine learning?",
        "reference": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
        "category": "explanation"
    },
    {
        "id": "5",
        "question": "Compare Python and JavaScript for web development.",
        "reference": "Python is primarily used for backend development with frameworks like Django and Flask, while JavaScript is essential for frontend development and can also handle backend with Node.js.",
        "category": "comparison"
    },
    {
        "id": "6",
        "question": "What's the difference between supervised and unsupervised learning?",
        "reference": "Supervised learning uses labeled training data to learn the mapping between inputs and outputs, while unsupervised learning finds patterns in unlabeled data without predefined categories.",
        "category": "comparison"
    },
    {
        "id": "7",
        "question": "Calculate 15% of 240.",
        "reference": "36",
        "category": "calculation"
    },
    {
        "id": "8",
        "question": "What is 7 multiplied by 13?",
        "reference": "91",
        "category": "calculation"
    },
    {
        "id": "9",
        "question": "What does HTTP stand for?",
        "reference": "Hypertext Transfer Protocol",
        "category": "factual"
    },
    {
        "id": "10",
        "question": "Describe the water cycle briefly.",
        "reference": "The water cycle involves evaporation of water from surfaces, condensation into clouds, precipitation as rain or snow, and collection back into bodies of water.",
        "category": "explanation"
    },
    {
        "id": "11",
        "question": "Compare RAM and ROM in computers.",
        "reference": "RAM is volatile memory used for temporary storage while programs run, whereas ROM is non-volatile memory that stores permanent instructions for the computer.",
        "category": "comparison"
    },
    {
        "id": "12",
        "question": "If a product costs $80 after a 20% discount, what was the original price?",
        "reference": "$100",
        "category": "calculation"
    }
]

dataset = Dataset(name="qa_benchmark_v2", rows=rows)
weave.publish(dataset)
Of course, a set of twelve questions is small for production use, but sufficient to demonstrate the evaluation flow and see patterns across question types.

Step 3: Define two models

We'll create two model classes: one using GPT-4o-mini and another using GPT-3.5-turbo. This lets us compare a newer, more capable model against an older, faster, cheaper one.
from weave import Model

class GPT4oMiniModel(Model):
    """GPT-4o-mini: More capable, better instruction following"""
    model_name: str = "gpt-4o-mini"

    @weave.op()
    def predict(self, question: str) -> str:
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Answer questions accurately and concisely."},
            {"role": "user", "content": question}
        ]
        res = openai_client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            temperature=0.0
        )
        return res.choices[0].message.content.strip()


class GPT35TurboModel(Model):
    """GPT-3.5-turbo: Faster, cheaper, less capable"""
    model_name: str = "gpt-3.5-turbo"

    @weave.op()
    def predict(self, question: str) -> str:
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Answer questions accurately and concisely."},
            {"role": "user", "content": question}
        ]
        res = openai_client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            temperature=0.0
        )
        return res.choices[0].message.content.strip()



Step 4: Implement metrics

Now we'll implement six different metrics that measure different aspects of answer quality.

4a) BLEU Score

The classic n-gram overlap measure, normalized to 0-1 range:
import sacrebleu

@weave.op()
def bleu_score(reference: str, output: str) -> dict:
    """
    Statistical metric: measures n-gram overlap between output and reference.
    Good for: detecting catastrophic failures, regression testing.
    Bad for: paraphrased but correct answers.
    """
    score = sacrebleu.sentence_bleu(
        output,
        [reference],
        smooth_method="exp"
    ).score
    return {"bleu": score / 100.0}

4b) ROUGE-L

Measures the longest common subsequence, more forgiving of word reordering:
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

@weave.op()
def rouge_l_score(reference: str, output: str) -> dict:
    """
    Statistical metric: measures longest common subsequence.
    Good for: summarization tasks, keyword preservation.
    Bad for: restructured but semantically identical answers.
    """
    scores = _rouge.score(reference, output)
    return {"rouge_l": scores["rougeL"].fmeasure}

4c) BERTScore

Compares texts using BERT embeddings to capture semantic similarity:
from bert_score import score as bert_score_fn

@weave.op()
def bert_score(reference: str, output: str) -> dict:
    """
    Semantic metric: uses BERT embeddings to measure similarity.
    Good for: recognizing paraphrases, semantic equivalence.
    Bad for: factual correctness (can score high on fluent nonsense).
    """
    try:
        _, _, F1 = bert_score_fn(
            [output],
            [reference],
            lang="en",
            verbose=False
        )
        return {"bert_score": float(F1[0])}
    except Exception as e:
        print(f"BERTScore failed: {e}")
        return {"bert_score": None}
The try/except block handles occasional failures from PyTorch configuration issues. Better to skip one metric than crash the entire evaluation.

4d) Embedding similarity

Uses OpenAI's embedding model to compute cosine similarity:
import numpy as np

@weave.op()
def embedding_similarity(reference: str, output: str) -> dict:
    """
    Semantic metric: computes cosine similarity between embeddings.
    Good for: measuring semantic closeness in vector space.
    Bad for: doesn't check factual accuracy or logical correctness.
    """
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[reference, output]
    )
    ref_embedding = np.array(response.data[0].embedding)
    out_embedding = np.array(response.data[1].embedding)
    cosine_sim = np.dot(ref_embedding, out_embedding) / (
        np.linalg.norm(ref_embedding) * np.linalg.norm(out_embedding)
    )
    return {"embedding_similarity": float(cosine_sim)}

4e) LLM-as-a-judge: Factual accuracy (GPT-4o-mini)

GPT-4o-mini evaluates factual correctness on a 1-5 scale:
ACCURACY_JUDGE_PROMPT = """You are evaluating the factual accuracy of an answer.

Question: {question}
Reference Answer: {reference}
Model Answer: {output}

Rate the factual accuracy on a scale of 1-5:
1 = Completely incorrect or contradicts the reference
2 = Mostly incorrect with minor correct elements
3 = Partially correct but missing key information
4 = Mostly correct with minor issues
5 = Completely accurate and equivalent to the reference

Consider:
- Are the core facts correct?
- Does it contradict the reference answer?
- Is critical information missing?

Respond with ONLY a single number (1-5)."""

@weave.op()
def accuracy_judge(question: str, reference: str, output: str) -> dict:
    """
    LLM-as-a-judge metric focusing on factual correctness.
    Uses GPT-4o-mini for cost-effective evaluation.
    """
    prompt = ACCURACY_JUDGE_PROMPT.format(
        question=question,
        reference=reference,
        output=output
    )
    res = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    try:
        score = float(res.choices[0].message.content.strip())
        # Normalize to 0-1 range, cap at 0.9999 for consistent percentage display
        normalized_score = min((score - 1) / 4, 0.9999)
    except ValueError:
        normalized_score = None
    return {
        "accuracy_score": normalized_score,
        "accuracy_raw": score if normalized_score is not None else None
    }
Important note: the min(..., 0.9999) cap on the normalized score works around a display quirk in W&B's leaderboard: when a score equals exactly 1.0, it renders as "1.00" instead of "100%" like the other percentages. Capping at 0.9999 displays as "99.99%" and keeps the formatting consistent.

4f) LLM-as-a-judge: Helpfulness (GPT-3.5-turbo)

GPT-3.5-turbo rates how useful the answer would be to a user:
HELPFULNESS_JUDGE_PROMPT = """You are evaluating how helpful an answer is to a user.

Question: {question}
Model Answer: {output}

Rate the helpfulness on a scale of 1-5:
1 = Not helpful at all, confusing or wrong
2 = Minimally helpful, lacks important context
3 = Somewhat helpful, adequate but could be clearer
4 = Helpful, clear and addresses the question well
5 = Extremely helpful, clear, complete, and well-explained

Consider:
- Does it directly address the question?
- Is it clear and easy to understand?
- Does it provide enough context/explanation?
- Would this satisfy a user asking this question?

Respond with ONLY a single number (1-5)."""

@weave.op()
def helpfulness_judge(question: str, output: str) -> dict:
    """
    LLM-as-a-judge metric focusing on user helpfulness.
    Uses GPT-3.5-turbo for a different perspective and lower cost.
    """
    prompt = HELPFULNESS_JUDGE_PROMPT.format(
        question=question,
        output=output
    )
    res = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    try:
        score = float(res.choices[0].message.content.strip())
        normalized_score = (score - 1) / 4
    except ValueError:
        normalized_score = None
    return {
        "helpfulness_score": normalized_score,
        "helpfulness_raw": score if normalized_score is not None else None
    }
Using different models for the two judges provides some diversity in evaluation perspective and demonstrates that you can mix models based on cost considerations.

Step 5: Build the evaluation

Now, let's bundle everything into an evaluation object:
from weave import Evaluation

evaluation = Evaluation(
    dataset=dataset,
    scorers=[
        bleu_score,
        rouge_l_score,
        bert_score,
        embedding_similarity,
        accuracy_judge,
        helpfulness_judge
    ]
)

Step 6: Run evaluations and create a leaderboard

Next, we run the same evaluation on both models and create a leaderboard to compare them.
import asyncio
from weave.flow import leaderboard
from weave.trace.ref_util import get_ref

async def run_all():
    print("Running evaluations...")

    # Initialize models
    gpt4o_mini_model = GPT4oMiniModel()
    gpt35_turbo_model = GPT35TurboModel()

    # Run evaluations
    print("\nEvaluating GPT-4o-mini...")
    result1 = await evaluation.evaluate(gpt4o_mini_model)

    print("\nEvaluating GPT-3.5-turbo...")
    result2 = await evaluation.evaluate(gpt35_turbo_model)

    print("\nEvaluations completed!")

    # Create leaderboard AFTER evaluations are done
    print("\nCreating leaderboard...")
    leaderboard_spec = leaderboard.Leaderboard(
        name="LLM Benchmarking Leaderboard",
        description="Comparing GPT-4o-mini vs GPT-3.5-turbo on Q&A tasks across multiple metrics",
        columns=[
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="bleu_score",
                summary_metric_path="bleu.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="rouge_l_score",
                summary_metric_path="rouge_l.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="bert_score",
                summary_metric_path="bert_score.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="embedding_similarity",
                summary_metric_path="embedding_similarity.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="accuracy_judge",
                summary_metric_path="accuracy_score.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="helpfulness_judge",
                summary_metric_path="helpfulness_score.mean",
            ),
        ]
    )
    published_leaderboard = weave.publish(leaderboard_spec)
    print(f"Leaderboard published: {published_leaderboard}")

    # Retrieve and print results using weave_client
    results = leaderboard.get_leaderboard_results(leaderboard_spec, weave_client)
    print("\n=== Leaderboard Results ===")
    print(results)


asyncio.run(run_all())


One important detail: the leaderboard must be created after the evaluations run, since it references evaluation objects that need to exist first.

Step 7: Explore results in the W&B interface

After running the evaluations, open the Weave URL printed in your console. Let's walk through the three main views you'll use to understand your results.

Leaders view: Head-to-head model comparison

Navigate to the Leaders tab in the left sidebar. This is your primary view for comparing models at a glance.
Screenshot of the Leaders section of W&B web interface. Screenshot by Author.
What you'll see is a table with models as rows and metrics as columns:
| Model | BLEU | ROUGE-L | BERTScore | Embedding Sim | Accuracy | Helpfulness |
|---|---|---|---|---|---|---|
| GPT35TurboModel | 9.75% | 33.16% | 82.67% | 66.09% | 91.66% | 97.92% |
| GPT4oMiniModel | 4.61% | 19.97% | 82.83% | 61.43% | 99.99% | 95.83% |

As the table shows, the results tell an interesting story. GPT-4o-mini posts a near-perfect factual accuracy score (99.99%) but actually underperforms GPT-3.5-turbo on statistical metrics like BLEU and ROUGE-L. This is the BLEU paradox in action: GPT-4o-mini gives correct, well-reasoned answers that don't match the reference wording exactly.
Look at the semantic metrics (BERTScore around 82-83% for both). They're nearly identical, which makes sense, as both models understand the questions and produce semantically correct answers. The difference shows up in how they phrase things.
GPT-3.5-turbo wins on helpfulness (97.92% vs 95.83%). This suggests it tends toward more concise answers that the helpfulness judge prefers. Meanwhile, GPT-4o-mini's higher accuracy score (99.99%) indicates it captures the factual content more precisely, even if it uses more words to do it.

Trace view: Diving into individual runs

Navigate to the Traces tab to see the raw data behind these numbers. You'll see a list of all your evaluation runs with their aggregate metrics. Click on any trace to expand it. The detail panel shows you:
  • Runtime and tokens: How long the evaluation took, how many tokens it consumed, and how much it cost
  • Definition: Which dataset and model were used
  • Scores: The average for each metric across all 12 questions
Screenshot of the Traces section of W&B web interface. Screenshot by Author.
For instance, clicking on a GPT-3.5-turbo trace might show:
  • bleu: 0.098 (avg)
  • rouge_l: 0.332 (avg)
  • bert_score: 0.827 (avg)
  • embedding_similarity: 0.661 (avg)
  • accuracy_score: 0.917 (avg)
  • helpfulness_score: 0.979 (avg)
Screenshot of the Traces section of W&B web interface. Checking the results of an execution of the code. Screenshot by Author.
The raw scores (accuracy_raw, helpfulness_raw) show the 1-5 scale before normalization. An accuracy_raw of 4.667 means the judge gave mostly 5s with a few 4s; normalizing it with (4.667 - 1) / 4 gives roughly the 0.917 accuracy_score shown above. That's useful context when interpreting the normalized percentage.

Evals view: Comparing evaluation runs

The Evals tab gives you a different angle: it's organized by evaluation run rather than by trace.
Screenshot of the Evals section of W&B web interface. Screenshot by Author.
Here you can see all your runs side by side with their input/output pairs and per-metric scores. This view is especially useful for:
  • Spotting patterns across runs (do certain question types consistently score lower?)
  • Identifying outliers (which specific questions cause the biggest differences?)
  • Comparing model behavior on the same inputs
Click any row to expand the evaluation details. You'll see the model that was evaluated, the dataset version, and a breakdown of all scores. The accuracy_score column shows individual scores per question, so you can spot where GPT-4o-mini got perfect 5s (normalized to 0.9999) while GPT-3.5-turbo got 4s or the occasional 3.
One thing to watch for: some runs might show "N/A" for BERTScore if there were failures during computation. This can happen with PyTorch version conflicts. The evaluation continues anyway; that's why we wrapped BERTScore in a try/except block.

What the data actually tells you

Looking at our results, here's the practical takeaway:
  • For simple Q&A tasks, GPT-3.5-turbo delivers comparable quality at lower cost. Both models nail factual accuracy (91-99%), both produce semantically correct answers (82-83% BERTScore), and both satisfy the helpfulness judge (95-97%). The statistical metrics (BLEU 4-9%, ROUGE-L 19-33%) are low for both models, but that's expected: these models paraphrase and elaborate rather than matching reference text verbatim. That's not a bug; it's how modern LLMs work.
  • When to choose GPT-4o-mini: Tasks requiring precise factual recall, complex reasoning, or longer context windows.
  • When to choose GPT-3.5-turbo: Simple Q&A, high-volume applications where cost and latency matter, and situations where concise answers are preferred.
W&B makes this decision data-driven instead of guesswork. You're not choosing based on model hype, but on actual measured performance for your specific task.

Conclusion: Advancing LLM evaluation practices

BLEU and ROUGE served us well in the early days of NLP. They gave researchers a quick, reproducible way to compare machine translation and summarization systems. But LLMs broke the assumptions those metrics were built on. Modern models paraphrase, reason, synthesize, and generate in ways that make word-counting metrics almost useless.
The path forward combines three things:
  • Semantic metrics that recognize meaning over form (BERTScore, BLEURT, embedding similarity)
  • Comprehensive benchmarks that measure models across multiple dimensions (MMLU, HELM, BIG-bench)
  • LLM-as-a-judge setups that use strong models to grade outputs against rubrics aligned with what users actually care about
None of these are perfect. Semantic metrics can be slow and expensive. Benchmarks can become saturated or gamed. LLM judges can be biased or inconsistent. But they're all closer to measuring what matters than counting n-gram overlaps.
As you build LLM systems, treat evaluation as a living practice. Start with a small golden set, mix statistical and semantic metrics, run benchmarks to catch blind spots, and calibrate everything against human judgments. Use tools like Weights & Biases to compare models systematically, not anecdotally.
The goal isn't a single perfect score; it's a body of evidence you trust when you ship changes. That's how you move from "the BLEU score looks good" to "this model actually works for our users."