Does fine-tuning ChatGPT-3.5 on Gorilla improve API and tool performance?
Does fine-tuning ChatGPT-3.5 on a dataset of API code improve its performance on an API evaluation dataset? Does it improve its ability to use tools, and perhaps mimic GPT-4's function calling?
Gorilla
I trained ChatGPT-3.5 on the Hugging Face portion of the Gorilla dataset, the dataset released with the Gorilla LLM, a fine-tuned LLaMA v1 model that came out in May 2023. From the project page:
Gorilla is a LLM that can provide appropriate API calls. It is trained on three massive machine learning hub datasets: Torch Hub, TensorFlow Hub and HuggingFace.
Finetuning ChatGPT-3.5 Colab
Here is a Colab for fine-tuning OpenAI's ChatGPT-3.5 on the Hugging Face portion of the dataset. The default fine-tuning configuration for 3.5 was used.
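For reference, here is a minimal sketch of what that fine-tuning step looks like with the 2023-era openai Python client (the v0.28-style API). It is not the exact contents of the Colab, just the shape of the calls; the training filename is a placeholder, and the Gorilla Hugging Face split first has to be converted into OpenAI's chat fine-tuning JSONL format (one {"messages": [...]} record per line).

```python
import openai

openai.api_key = "XXX"  # your OpenAI API key

# Placeholder filename: the Gorilla Hugging Face training split, already converted
# to OpenAI's chat fine-tuning format.
training_file = openai.File.create(
    file=open("gorilla_huggingface_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the job with the default fine-tuning configuration.
job = openai.FineTuningJob.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# Poll until the job finishes; the resulting fine-tuned model id can then be
# passed to get_llm_responses.py via --model.
print(openai.FineTuningJob.retrieve(job.id).status)
```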
Evaluating ChatGPT-3.5 Performance
I used the Gorilla evaluation scripts to generate the evaluation-set LLM responses and to run the evaluation. To get the logging below with Weights & Biases you might have to pull from this PR on GitHub (if it's not yet merged).
AST Sub-Tree Matching
We perform AST sub-tree matching to identify which API in our dataset the LLM is calling. Since each API call can have many arguments, we need to match on each of these arguments. Further, since Python allows for default arguments, for each API, we define which arguments to match in our database. For example, we check repo_or_dir and model arguments in our function call. In this way, we can easily check if the argument matches the reference API or not. Please refer to Fig. 4 for more details. In this example, Gorilla returns a torch API call. We first build the tree, and verify that it matches a subtree in our dataset along nodes torch.hub.load, pytorch/vision, and densenet121. But, we don't check for a match along the leaf node pretrained = True since that is an optional Python argument.
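To make the matching procedure concrete, here is a toy sketch (not the repo's ast_eval_hf.py implementation) of how Python's ast module can extract the fully-qualified call name and its arguments, then check a candidate against a reference while ignoring optional leaves such as pretrained=True:

```python
import ast

def extract_call(code: str):
    """Return (dotted function name, positional constant args) for a single API call."""
    call = ast.parse(code.strip()).body[0].value  # assumes the snippet is one call expression
    parts, func = [], call.func
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    parts.append(func.id)
    name = ".".join(reversed(parts))
    args = [a.value for a in call.args if isinstance(a, ast.Constant)]
    return name, args

def is_subtree_match(candidate: str, reference: str) -> bool:
    """Match the function name and the reference's required arguments;
    extra optional arguments in the candidate (e.g. pretrained=True) are ignored."""
    c_name, c_args = extract_call(candidate)
    r_name, r_args = extract_call(reference)
    return c_name == r_name and c_args[: len(r_args)] == r_args

llm_output = "torch.hub.load('pytorch/vision', 'densenet121', pretrained=True)"
reference = "torch.hub.load('pytorch/vision', 'densenet121')"
print(is_subtree_match(llm_output, reference))  # True: the pretrained=True leaf is not checked
```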
The paper evaluates over a few different settings: 0-shot, BM-25, GPT-Index, and Oracle. For now, let's just stick with 0-shot.
Generating the evaluation responses
These commands will generate responses for the Hugging Face dataset and evaluate them:
python get_llm_responses.py \
    --model gpt-3.5-turbo \
    --api_key XXX \
    --output_file gpt-3.5-turbo_huggingface_0_shot.jsonl \
    --question_data eval-data/questions/huggingface/questions_huggingface_0_shot.jsonl \
    --api_name huggingface \
    --use_wandb
Running evaluation
# pass the W&B run id generated in the step above via --wandb_run_id to append to that run
python ast_eval_hf.py \
    --api_dataset ../../data/api/huggingface_api.jsonl \
    --apibench ../../data/apibench/huggingface_eval.json \
    --llm_responses ../gpt-3.5-turbo_huggingface_0_shot.jsonl \
    --wandb_run_id 1234 \
    --wandb_project gorilla-api \
    --use_wandb
Results
While the ChatGPT-3.5 baseline accuracy and hallucination rate don't match the results from the paper, there is nevertheless a marked improvement after fine-tuning.
One possible reason for the difference is that a newer version of ChatGPT-3.5 (0613) was used here.
[Chart: Final Functionality Accuracy]
Hallucination
The authors have a novel method of measuring hallucination for this problem:
Identifying and even defining hallucinations can be challenging. We use the AST matching process to directly identify the hallucinations. We define a hallucination as an API call that is not a sub-tree of any API in the database – invoking an entirely imagined tool. This form of hallucination is distinct from invoking an API incorrectly which we instead define as an error.
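As a rough illustration of how that distinction plays out, here is a sketch with the API database reduced to a hypothetical set of fully-qualified call names (the real ast_eval_hf.py matches full sub-trees from huggingface_api.jsonl):

```python
import ast

# Hypothetical stand-in for the API database, reduced to fully-qualified call names.
API_DATABASE = {"transformers.pipeline", "torch.hub.load"}

def call_name(code: str) -> str:
    """Return the dotted name of the single call in the snippet, e.g. 'torch.hub.load'."""
    call = ast.parse(code.strip()).body[0].value
    parts, func = [], call.func
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    parts.append(func.id)
    return ".".join(reversed(parts))

def classify(llm_call: str, arguments_match: bool) -> str:
    """Hallucination: the call is not in the database at all (an imagined tool).
    Error: the API exists, but it is invoked with the wrong arguments."""
    if call_name(llm_call) not in API_DATABASE:
        return "hallucination"
    return "correct" if arguments_match else "error"

print(classify("torchtext.models.load_magic('bert-tiny')", arguments_match=False))      # hallucination
print(classify("torch.hub.load('pytorch/vision', 'resnet18')", arguments_match=False))  # error
```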