Fine-Tuning Evaluation of an LLM on a Test Dataset