Microsoft's Phi-2 with GitOps and W&B
A minimal example of CI/CD for MLOps with Phi-2, GitOps, and W&B.
The purpose of this article is to highlight Phi-2 and CI/CD with GitOps and W&B!
💡
This article takes inspiration from this course by Hamel Husain and showcases how you can use W&B with GitOps. I'll begin by walking through the Phi family of models from Microsoft and the creation of the HellaSwag natural language inference dataset. Then, I'll briefly cover fine-tuning so we can populate our W&B project with some demo runs. Lastly, I'll show how we can leverage GitOps with W&B for improved CI/CD collaboration!
Note: feel free to skip the first two sections covering the Phi models (their papers) and the Swag and HellaSwag datasets. These sections only summarize the papers to give us a deeper understanding of the models and datasets we're working with.
Here's what we'll be covering:
Φ What is Phi-2?
The Evolution Of Phi
1️⃣ Phi-1
2️⃣ Phi-1.5
3️⃣ Phi-2
🕶 What is HellaSwag?
😎 Swag
🚧 Fine-tuning Experiments with W&B
🏭 CI/CD with GitOps and W&B
👋 Conclusion
References
Φ What is Phi-2?
Phi-2 is a 2.7 billion-parameter language model developed by Microsoft Research's Machine Learning Foundations team, and represents a significant advancement in the field of small language models (SLMs), demonstrating exceptional reasoning and language understanding capabilities.
Phi-2 stands out for its ability to match or even outperform much larger models (up to 25 times its size) on various complex benchmarks. This comes from a new approach to model scaling and training data curation: a focus on high-quality, "textbook-quality" data and knowledge transfer from its predecessor, Phi-1.5.
Despite its relatively compact size, Phi-2 excels in a range of tasks, including common sense reasoning, language understanding, math, and coding, surpassing the performance of larger models like Mistral and Llama-2. It is also notable for its improved behavior concerning toxicity and bias due to Microsoft's tailored data curation techniques.
Phi-2 is available in the Azure AI Studio model catalog, providing a valuable resource for researchers and developers in the field of language models.
The Evolution Of Phi
For those unfamiliar with previous versions of Phi, here is how we got to Phi-2.
1️⃣ Phi-1
Phi-1 is a 1.3B parameter transformer-based LLM (is this still "large"?) trained on 6B "textbook-quality" web tokens and 1B synthetically generated textbook text/exercises from GPT-3.5, then fine-tuned on ~180M tokens.
Dataset
The training dataset for Phi-1, called CodeTextbook, is made of 2 parts.
The first part, filtered code-language, consists of 6B "textbook-quality" web tokens from a deduplicated Python subset of The Stack and StackOverflow. The original data (The Stack and StackOverflow) consisted of roughly 35 million files/samples, of which 100k were annotated with GPT-4 using a prompt that paired a code snippet with the instruction: "determine its educational value for a student whose goal is to learn basic coding concepts." A random forest classifier was then trained on these annotated samples, taking the output embedding of a codegen model as input and predicting the rating.
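To make that filtering step concrete, here is a minimal sketch of the idea (not the authors' code): `toy_embed` stands in for the codegen model's output embedding, and the annotated pairs below are toy stand-ins for the GPT-4-rated samples.

```python
# Minimal sketch: train a random forest on embeddings of annotated snippets,
# then use it to score the rest of the corpus.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def toy_embed(snippet: str, dim: int = 64) -> np.ndarray:
    # Stand-in embedding: hashed character-trigram counts. The paper uses a
    # pretrained codegen model's output embedding instead.
    vec = np.zeros(dim)
    for i in range(len(snippet) - 2):
        vec[hash(snippet[i:i + 3]) % dim] += 1.0
    return vec

# (code snippet, "educational value" label) pairs -- toy examples only.
annotated = [
    ("def add(a, b):\n    return a + b", 1),
    ("x = [(lambda: 0)() for _ in range(8)]  # obfuscated", 0),
]

X = np.stack([toy_embed(code) for code, _ in annotated])
y = np.array([label for _, label in annotated])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def is_textbook_quality(snippet: str) -> bool:
    """Predict whether a web sample should be kept in the filtered corpus."""
    return bool(clf.predict(toy_embed(snippet)[None, :])[0])

print(is_textbook_quality("def square(x):\n    return x * x"))
```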

The second part, called synthetic textbook, consists of about 1B synthetically generated tokens. The paper does not specify exactly how the data was generated, beyond a description along the lines of: "a diverse set of short stories were created by including a random subset of words chosen from some fixed vocabulary in the prompt and requiring that they would be somehow combined in the generated text." These samples were a mix of natural language and code snippets, and diversity was enforced via constraints on topics and samples that promote reasoning.

The last dataset, CodeExercises, is a ~180M-token synthetic exercise dataset (generated by GPT-3.5) used for fine-tuning. Each sample consists of a function docstring and the corresponding function implementation.

Evaluation
The authors trained phi-1-base only on CodeTextbook; phi-1 is phi-1-base fine-tuned on CodeExercises. They validated their curated dataset by measuring pass@1 accuracy on HumanEval.
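For reference, pass@k (and its special case pass@1) is typically computed with the unbiased estimator popularized by the HumanEval/Codex work; here is a small sketch, where n is the number of sampled completions per problem and c the number that pass the unit tests.

```python
# pass@k: probability that at least one of k sampled completions is correct,
# estimated without bias from n samples of which c pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 3 of which pass the unit tests:
print(pass_at_k(n=10, c=3, k=1))  # 0.3 -> pass@1
```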

Their experiments found that Phi-1 has greater generalizability, showing signs that it can use libraries it wasn't explicitly trained on and execute tasks not in the fine-tuning dataset.

In an example constructed by the authors, they show how fine-tuning improves the model's performance and generalizability: phi-1 and phi-1-small both generate (somewhat) correct outputs, while phi-1-base does not.

The phi-1 models and other models were also tested on 50 coding problems written by a separate group of researchers (who had no access to the phi models or the CodeExercises dataset). This test checked whether data contamination (and memorization by the model) existed in the synthetically generated data.
Contamination Analysis
The authors of phi-1 also conducted a contamination analysis using n-gram overlap and embedding- and syntax-based similarity; more details can be found in their paper. They found that, even after pruning their CodeExercises dataset, phi-1 still outperformed StarCoder.
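As a rough illustration of one of those signals, here is a hedged sketch of an n-gram overlap check between a training document and an evaluation document (the n-gram size and any thresholds are illustrative, not the paper's).

```python
# Flag an eval sample as potentially contaminated if a large fraction of its
# n-grams also appear in a training document.
def ngrams(text: str, n: int = 13) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_overlap(train_doc: str, eval_doc: str, n: int = 13) -> float:
    """Fraction of the eval document's n-grams that also appear in the training doc."""
    eval_grams = ngrams(eval_doc, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)

# Identical snippets overlap completely (using a small n for this short example):
print(ngram_overlap("def add(a, b): return a + b", "def add(a, b): return a + b", n=3))
```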

2️⃣ Phi-1.5
Phi-1.5, the sequel to Phi-1, shares the same model architecture. It is trained on the 7B CodeTextbook tokens plus roughly 20B additional synthetically generated tokens. The authors also tested the importance of web data by training two more models: phi-1.5-web and phi-1.5-web-only.
The web-only model, phi-1.5-web-only, was trained purely on filtered web data: 95B tokens, with 88B from the Falcon refined-web dataset and 7B from The Stack and StackOverflow. phi-1.5-web was trained on a weighted mix of the phi-1 data, the newly generated synthetic data, and the 95B filtered web tokens.


Their models are not alignment-tuned. They also evaluated their model (and others) on a small set of 86 prompts designed to probe toxicity. Though Phi-1.5 exhibited less toxicity, it is still prone to generating toxic content.
3️⃣ Phi-2
As of this article, there is no Phi-2 paper or technical report, but Microsoft's blog post states that the model has 2.7B parameters, scaled up from the 1.3B-parameter Phi-1.5 with knowledge transferred from it. As before, the dataset (now 1.4T tokens) consists of filtered web data and synthetic data. The model is not alignment-tuned, yet it shows a lower propensity for toxicity and bias.
Microsoft notes that their 2.7B-parameter model outperforms Mistral-7B and Llama-2 (7B/13B). They also found their model outperforms Gemini Nano 2 on a number of benchmarks.


🕶 What is HellaSwag?
😎🤘 HellaSwag
HellaSwag is short for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. In short, it is Swag with harder endings, longer contexts, more data, and modern deep pretrained language models in place of the earlier LSTM-based language models. The paper also provides a new diagram to describe Adversarial Filtering!

HellaSwag begins with an exploration of why BERT succeeded in "solving" the Swag benchmark. Essentially, the authors conclude that BERT most likely relies on lexical cues rather than truly understanding the context and noun phrase. The generators (the models that produce negative samples) and discriminators (the models used in AF) are crucial to the effectiveness of the dataset, and thus the benchmark is best updated periodically.
The key takeaways from HellaSwag are:
- it is a combination of ActivityNet Captions (no LSMDC) and WikiHow
- the choice of generator and discriminator matters
- WikiHow provides greater diversity and more challenging endings (hence, Harder Endings)
- there is a "Goldilocks" zone where two-sentence generations are seemingly more complex and adversarially challenging (hence, Longer contexts)

😎 Swag
Short for Situations With Adversarial Generations, Swag is composed of 113k samples (73k training, 20k validation, 20k test) after human-annotator filtering.
The general process for its construction:
- Swag is derived from the ActivityNet Captions and Large Scale Movie Description Challenge (LSMDC) datasets
- an LSTM is pretrained on BookCorpus and fine-tuned on the video-caption data above to generate 1,023 "adversarial" negative verb phrases for each context
- Algorithm 1 (AF) is run to devise a set of assignments that subsets the adversarial dataset from the previous step
- human annotation provides agreement checks and ground-truth labeling
- Swag consists of 3-tuples $(s, n, v)$, where each context $(s, n)$ has 3 negative endings and 1 positive, totaling 4 candidate endings
For more information on their experiments, analyses, and related work with Swag on NLI models, check out their paper!
Large-scale datasets for NLI are subject to annotation biases introduced by the annotators. These biases can produce "fast surface learners": models that learn to pick up on and over-rely on these patterns.
Their paper introduces two things:
- a cost-effective adversarial filtering technique for scaling up the dataset while reducing annotation artifacts
- Swag, a new NLI/common-sense reasoning dataset on which humans confidently do well but SOTA models (at the time) do poorly

Adversarial Filtering
The paper first formally defines this dataset construction process. To understand this section, you need a bit of context.
Given video captions for temporally adjacent frames, a "context" caption and the noun phrase (NP) from the caption of interest are concatenated and fed into SOTA language models, which oversample candidate endings and thereby produce a wealth of counterfactual endings (false positives).

Definitions to frame the problem:
- $c = (s, n)$: context (a sentence $s$ and a noun phrase $n$)
- $V$: a set of possible verb phrases (only one will match the noun phrase above)
- $(s, n, v)$: each instance of Swag is a 3-tuple of sentence, noun phrase, and verb phrase (for each $(s, n)$ pair, there is only one correct verb phrase; the rest are counterfactuals)
These definitions should aid us in understanding how data is fed through the adversarial filtering procedure below.
- $\mathcal{X}$: input space
- $\mathcal{Y}$: label space
- $f$: trainable classifier
- $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$: dataset
- $f_{\mathcal{D}}: \mathcal{X} \rightarrow \mathcal{Y}$: classifier trained on dataset $\mathcal{D}$
- $\mathcal{L}(f, \mathcal{D})$: loss of a classifier $f$ over a dataset $\mathcal{D}$
- $\mathcal{L}^{AF}(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\left(f_{\mathcal{D} \setminus \{(x_i, y_i)\}}, \{(x_i, y_i)\}\right)$
- $\mathcal{L}^{AF}$: average loss across the dataset with a leave-one-out train/test split
What is the function $\mathcal{L}^{AF}$? For one data point $(x_i, y_i)$, the classifier is trained on all other data points in the dataset (excluding the $i$-th one), the loss on $(x_i, y_i)$ is computed, and we repeat the process for the next data point. This is averaged across the dataset.
The point of constructing this formula is to show that, if we want an adversarial dataset, we expect high empirical error under $\mathcal{L}^{AF}$. In other words, in the ideal case, none of the examples generalize to another example within the dataset.
Definitions for the adversarial-filtered dataset:
- for each context $x_i$, there is one positive ending and many generated negative endings $\{x^{-}_{i,j}\}_{j=1}^{N^{-}}$, where $N^{-}$ is much larger than the number of negatives we keep
- we filter these negative instances for each context down to a subset of size $k$ that is maximally adversarial to the classifier
- thus, we construct a set of assignments $A_i \subseteq \{1, \dots, N^{-}\}$ (a list of selected negative indices per context)
- our adversarial-filtered dataset: $\mathcal{D}_{AF} = \{(x_i, \{x^{-}_{i,j}\}_{j \in A_i})\}_{i=1}^{N}$
Great! We have a formal understanding of how this should be framed. How does it look algorithmically?

We initialize and maintain a set of assignments $A_i$ (one list of selected negative indices per context), which is iteratively updated via Algorithm 1 (Adversarial Filtering).
The algorithm splits the dataset randomly, trains a model on the train split, and, for each sample in the test split, builds a set of "easy" negatives (negatives the model correctly classifies as negative). These easy indices are then replaced by "adversarial" negatives (negatives the model incorrectly predicts as positive).
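Here is a simplified sketch of one such iteration, written against the description above rather than the paper's code; `score` stands in for a freshly trained discriminator's probability that an ending is real, and the splitting/replacement details are illustrative.

```python
# One Adversarial Filtering iteration: swap "easy" negatives for "hard" ones.
import random

def af_iteration(assignments, candidates, score, test_frac=0.2):
    """assignments: {ctx: list of currently selected negative indices};
    candidates: {ctx: list of all generated negative endings}."""
    contexts = list(assignments)
    test_split = random.sample(contexts, max(1, int(len(contexts) * test_frac)))
    for ctx in test_split:
        current = assignments[ctx]
        # "Easy" negatives: the discriminator already gives them a low score.
        easy = [i for i in current if score(ctx, candidates[ctx][i]) < 0.5]
        # Replacement pool: unused negatives the discriminator finds convincing.
        unused = [i for i in range(len(candidates[ctx])) if i not in current]
        hard = sorted(unused, key=lambda i: -score(ctx, candidates[ctx][i]))[:len(easy)]
        assignments[ctx] = [i for i in current if i not in easy] + hard
    return assignments

# Toy usage with a dummy discriminator that scores endings by length.
candidates = {"ctx0": [f"ending {i}" * (i + 1) for i in range(10)]}
assignments = {"ctx0": [0, 1, 2]}
dummy_score = lambda ctx, ending: min(1.0, len(ending) / 50)
print(af_iteration(assignments, candidates, dummy_score))
```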
But how do we generate all of these negatives? What is our model family, and how is it trained?
- an LSTM pretrained on BookCorpus and fine-tuned on the video captions is used to generate unique verb phrases/caption endings (a 5-fold split ensures no model generates verb phrases for instances it has seen during training)
- greedy decoding with beam search is used; the generated endings tend to use topical words that don't reflect consistent physical logic
- the discriminator model family (multiple models are used) is designed to pick up on such low-level features (like topical words!)
- the discriminator family generally consists of an MLP over bag-of-words (BoW) features, a one-layer CNN, and a biLSTM; their output representations are concatenated and passed through a final MLP trained with a cross-entropy criterion (a rough sketch follows below)
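Here is a rough PyTorch sketch of that kind of ensemble discriminator; the layer sizes, pooling choices, and input shapes are assumptions for illustration, not the paper's exact architecture.

```python
# Stylistic discriminator sketch: BoW, CNN, and biLSTM features concatenated
# and scored by a final MLP, trained with cross-entropy over candidate endings.
import torch
import torch.nn as nn

class StylisticDiscriminator(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bow_mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.cnn = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.scorer = nn.Sequential(nn.Linear(hidden * 3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, endings):                   # endings: (batch, num_candidates, seq_len)
        b, c, t = endings.shape
        x = self.emb(endings.view(b * c, t))      # (b*c, t, emb_dim)
        bow = self.bow_mlp(x.mean(dim=1))         # average-pooled bag of words
        cnn = self.cnn(x.transpose(1, 2)).max(dim=2).values  # max-pooled CNN features
        lstm_out, _ = self.bilstm(x)
        lstm = lstm_out.mean(dim=1)               # mean-pooled biLSTM states
        feats = torch.cat([bow, cnn, lstm], dim=-1)
        return self.scorer(feats).view(b, c)      # one logit per candidate ending

# Cross-entropy over the candidate endings; index 0 is the gold ending here.
model = StylisticDiscriminator()
logits = model(torch.randint(1, 10000, (4, 6, 20)))   # 4 contexts, 6 endings, 20 tokens
loss = nn.CrossEntropyLoss()(logits, torch.zeros(4, dtype=torch.long))
print(loss.item())
```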
The last component of this dataset-construction pipeline (AF) is human verification (the smiley faces you see in Figure 1). Amazon Mechanical Turk (AMT) workers verify the 6 candidate endings (1 real, 5 fake) for each context $(s, n)$.

They used these human annotations to determine the ground truth labels and to ensure the caption endings are agreed upon by human annotators.

🚧 Fine-tuning Experiments with W&B
Many of you will have jumped right to this section. Here, I'll run a couple of fine-tuning experiments with Phi-2 and HellaSwag. These will populate our W&B project so I can showcase GitOps with W&B!
💡
In short, I sweep over LoRA hyperparameters like r, alpha, and dropout, and over different schedulers: linear, cosine, cosine_with_restarts, and reduce_lr_on_plateau. Fine-tuning was run for only 100 max steps (for demo purposes), and validation used only 100 samples from the validation dataset. More details can be found in the W&B project and the fine-tuning notebook linked above.
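For context, the sweep looked roughly like the sketch below. The hyperparameter names mirror the prose above, but the model/dataset loading and Trainer call are elided, so treat train() as a skeleton rather than the exact notebook code.

```python
# Hedged sketch of a W&B sweep over LoRA hyperparameters and LR schedulers.
import wandb
from peft import LoraConfig
from transformers import TrainingArguments

sweep_config = {
    "method": "random",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "lora_r":       {"values": [8, 16, 32]},
        "lora_alpha":   {"values": [16, 32, 64]},
        "lora_dropout": {"values": [0.05, 0.1]},
        "scheduler":    {"values": ["linear", "cosine", "cosine_with_restarts", "reduce_lr_on_plateau"]},
    },
}

def train():
    run = wandb.init(project="cicd_and_wandb")
    cfg = wandb.config
    lora_config = LoraConfig(r=cfg.lora_r, lora_alpha=cfg.lora_alpha,
                             lora_dropout=cfg.lora_dropout, task_type="CAUSAL_LM")
    args = TrainingArguments(output_dir="out", max_steps=100, report_to="wandb",
                             lr_scheduler_type=cfg.scheduler)  # reduce_lr_on_plateau needs a recent transformers
    # ... load Phi-2, wrap it with get_peft_model(model, lora_config),
    # build the HellaSwag Trainer, then trainer.train() ...
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="cicd_and_wandb")
wandb.agent(sweep_id, function=train, count=5)
```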
Below are some graphs from these runs.
Onto the CI/CD section!
🏭 CI/CD with GitOps and W&B
We will set up a simple GitHub workflow: when someone comments /wandb <run_id> on a PR, a W&B Report is generated comparing the run specified by run_id with the run tagged "baseline" in this project. This is done with a combination of GitOps (using ghapi by fastai) and the W&B Reports API.
Everything can be found on this article's repository and the W&B Project reports page. Let's get started!
We need two files: compare_runs.py and .github/workflows/ci.yaml. compare_runs.py is responsible for comparing the run run_id with the baseline-tagged run. ci.yaml is the workflow file that parses the PR comment and runs compare_runs.py.
Here's the ci.yaml. If you aren't familiar with GitHub Actions and workflows, I recommend the CI/CD course mentioned above for a great crash course!
```yaml
name: excercise-wandb
on: issue_comment
permissions:
  contents: read
  issues: write
  pull-requests: write
```
The first part of our file gives the workflow its name and specifies which events trigger it (comments on issues/PRs). Then, we declare the permissions the workflow operates with (read contents, write to issues and pull requests).
```yaml
jobs:
  ghapi-exercise:
    if: (github.event.issue.pull_request != null) && contains(github.event.comment.body, '/wandb')
    runs-on: ubuntu-latest
    steps:
    - name: Get repo contents
      uses: actions/checkout@v3
    - name: install dependencies
      run: pip install ghapi wandb
```
Next, we define our single job: ghapi-exercise. The job runs only if the comment is on a pull request (github.event.issue.pull_request is not null) and the comment body contains "/wandb". It runs on ubuntu-latest, checks out the repository's contents via actions/checkout@v3, and installs ghapi and wandb.
```yaml
    - name: Parse value from the command
      id: get-runid-value
      shell: python
      run: |
        import re, os
        comment = os.getenv('PR_COMMENT', '')
        match = re.search('/wandb[\s+](\S+)', comment)
        with open(os.environ['GITHUB_OUTPUT'], 'a') as f:
            if match:
                print(f'VAL_FOUND=true', file=f)
                print(f'RUN_ID={match.group(1)}', file=f)
            else:
                print(f'VAL_FOUND=false', file=f)
      env:
        PR_COMMENT: ${{ github.event.comment.body }}
```
Our next step in this job is to parse the PR comment. This is done with a simple Python script written right into the ci.yaml.
You can also define this script elsewhere and just run that file. The code snippet gets our PR comment via ${{ github.event.comment.body }}. More on that here.
The script parses this environment variable for the run ID and writes RUN_ID (along with VAL_FOUND) to GITHUB_OUTPUT. More on that here.
```yaml
    - name: Generate the comparison report
      if: steps.get-runid-value.outputs.VAL_FOUND == 'true'
      id: wandb-report
      run: python compare_runs.py
      env:
        WANDB_ENTITY: vincenttu
        WANDB_PROJECT: cicd_and_wandb
        BASELINE_TAG: baseline
        RUN_ID: "${{ steps.get-runid-value.outputs.RUN_ID }}"
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
```
Next, we execute python compare_runs.py. Our compare_runs.py uses the supplied environment variables. Our W&B API Key is available via GitHub Secrets. More on that here.
The RUN_ID is fetched from the previous step. Notice that it's steps.<id>.outputs.RUN_ID. This <id> is the id from the previous step and RUN_ID is one of the outputs. More on steps here.
```yaml
    - name: Make a comment with the GitHub API
      uses: actions/github-script@v6
      if: steps.wandb-report.outcome == 'success'
      with:
        script: |
          var msg = `A comparison between the linked run and baseline is available [in this report](${process.env.REPORT_URL})`
          github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: msg
          });
      env:
        REPORT_URL: "${{ steps.wandb-report.outputs.REPORT_URL }}"
```
This is the last step of our ci.yaml. We check that the outcome of the previous step was a success; if so, we create a PR comment (in JavaScript, via actions/github-script) with a link to the W&B report.
Let's walk through compare_runs.py. Our imports are simple, and we have a simple check for our API key. get_baseline_run returns a single W&B run.
```python
import os, wandb
import wandb.apis.reports as wr

assert os.getenv('WANDB_API_KEY'), 'You must set the WANDB_API_KEY environment variable'

def get_baseline_run(entity='vincenttu', project='cicd_and_wandb', tag='baseline'):
    "Get the baseline run from the project using tags"
    api = wandb.Api()
    runs = api.runs(f'{entity}/{project}', {"tags": {"$in": [tag]}})
    assert len(runs) == 1, 'There must be exactly one run with the tag "baseline"'
    return runs[0]
```
Next is the compare_runs method.
```python
def compare_runs(entity='vincenttu',
                 project='cicd_and_wandb',
                 tag='baseline',
                 run_id=None):
    "Compare the current run to the baseline run."
    # Allow you to override the args with env variables
    entity = os.getenv('WANDB_ENTITY') or entity
    project = os.getenv('WANDB_PROJECT') or project
    tag = os.getenv('BASELINE_TAG') or tag
    run_id = os.getenv('RUN_ID') or run_id
    assert run_id, 'You must set the RUN_ID environment variable or pass a `run_id` argument'
```
The beginning of our method just gets the entity, project, tag (baseline), and run_id.
```python
    baseline = get_baseline_run(entity=entity, project=project, tag=tag)
    report = wr.Report(entity=entity, project=project,
                       title='Compare Runs',
                       description=f"A comparison of runs, the baseline run name is {baseline.name}")
```
After that, we fetch our baseline run and define a new report using W&B's beta Reports Python SDK! Note that report titles aren't unique: if a report with this title already exists, a new report is still created.
```python
    pg = [wr.PanelGrid(
        runsets=[
            wr.Runset(entity, project, "Run Comparison").set_filters_with_python_expr(
                f"ID in ['{run_id}', '{baseline.id}']"
            )
        ],
        panels=[
            wr.LinePlot(x="train/global_step", y="train/loss", layout={'x': 0, 'y': 0, 'w': 12, 'h': 8}),
            wr.LinePlot(x="train/global_step", y="eval/loss", layout={'x': 4, 'y': 0, 'w': 12, 'h': 8}),
            wr.RunComparer(diff_only='split', layout={'w': 24, 'h': 15}),
        ],
    )]
    report.blocks = report.blocks[:1] + pg + report.blocks[1:]
    report.save()
```
Next, we define our list of blocks. A report is built like a list of "layers" or blocks (kind of like a neural network!). Our report closely follows the course's solution, with a small twist: we add two line plots.
We define a PanelGrid as one layer. Every panel grid accepts a number of runsets and panels. runsets are the runs you choose to include in the panel, and panels are the actual plots. More information on Runset can be found here. You can find more information on the types of acceptable panels here.
```python
    if os.getenv('CI'):  # set to `true` in GitHub Actions: https://docs.github.com/en/actions/learn-github-actions/variables#default-environment-variables
        with open(os.environ['GITHUB_OUTPUT'], 'a') as f:  # write the output variable REPORT_URL to the GITHUB_OUTPUT file
            print(f'REPORT_URL={report.url}', file=f)
    return report.url

if __name__ == '__main__':
    print(f'The comparison report can be found at: {compare_runs()}')
```
Lastly, we save our report URL to the GITHUB_OUTPUT, and we have a simple call to compare_runs.
That's it! This might be a bit to digest at first. Still, I highly encourage you to dig through the documentation and explore what's possible with ghapi, GitHub Actions, and W&B! For your own custom workflows, whether they include W&B or ghapi or GitHub Actions, Hamel makes a great point: ensure that you test locally before running them through GitHub!
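For example, you could smoke-test compare_runs.py locally with something like the snippet below before wiring it into the workflow. The entity and run id are placeholders; WANDB_API_KEY must be set in your environment, since the script asserts on it at import time.

```python
# Local smoke test for compare_runs.py (placeholders: "your-entity", "your-run-id").
import os
from compare_runs import compare_runs  # requires WANDB_API_KEY to be set

os.environ.setdefault("WANDB_ENTITY", "your-entity")
os.environ.setdefault("WANDB_PROJECT", "cicd_and_wandb")
os.environ.setdefault("BASELINE_TAG", "baseline")

print(compare_runs(run_id="your-run-id"))  # prints the generated report URL
```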
👋 Conclusion
If you've made it this far, I'd like to thank you for taking the time to go through this somewhat dense article! I hope it was an awesome read and that you learned something! Feel free to comment below if you have any questions! 👋😎
References
W&B Course & Related Materials
Phi-2
https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
HellaSwag