
Fine-Tune Evaluation of an LLM on a Test Dataset

Running evaluations on the test dataset
Created on October 31 | Last edited on April 26

[Lineage view: artifact split_dataset:v2 (dataset); data_split run generous-violet-4; 129 runs of type eval and finetune, among them Llama2-7b, Llama2-7b-chat, Llama2-7b-chat-pretrained, Llama2-7b-ft, MistralAI-Instruct-Pretrained, and MistralAI-Instruct FineTuned]

Llama2-7b-chat-pretrained

  • no fine-tuning


prompt

[INST] <<SYS>>
You are AI that converts human request into api calls. 
You have a set of functions:
-news(topic="[topic]") asks for latest headlines about a topic.
-math(question="[question]") asks a math question in python format.
-notes(action="add|list", note="[note]") lets a user take simple notes.
-openai(prompt="[prompt]") asks openai a question.
-runapp(program="[program]") runs a program locally.
-story(description=[description]) lets a user ask for a story.
-timecheck(location="[location]") ask for the time at a location. If no location is given it's assumed to be the current location.
-timer(duration="[duration]") sets a timer for duration written out as a string.
-weather(location="[location]") ask for the weather at a location. If there's no location string the location is assumed to be where the user is.
-other() should be used when none of the other commands apply

Reply with the corresponding function call, be brief.
<</SYS>>
{user}[/INST]
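
As a minimal sketch of how the `{user}` slot in the prompt above might be filled for each test example: the names `TEMPLATE` and `build_prompt` are hypothetical and do not come from the report, and the system text is shortened here for brevity.

```python
# Template layout copied from the prompt above; {system} and {user} are filled per example.
TEMPLATE = "[INST] <<SYS>>\n{system}\n<</SYS>>\n{user}[/INST]"


def build_prompt(system: str, user: str) -> str:
    """Fill the template with a system message and a single user request.

    Hypothetical helper: the report does not show how prompts were assembled.
    """
    return TEMPLATE.format(system=system.strip(), user=user.strip())


if __name__ == "__main__":
    # Shortened system message (assumption: the real run used the full function list).
    system = (
        "You are AI that converts human request into api calls.\n"
        '-weather(location="[location]") ask for the weather at a location.\n'
        "Reply with the corresponding function call, be brief."
    )
    print(build_prompt(system, "What's the weather like in Paris?"))
```

Passing the system text as an argument, instead of embedding it in the template, means any braces inside it are never interpreted by `str.format`.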
[Results panel: columns prompt.count and answer == generation; run set of 2 runs]
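
The `answer == generation` column reads as an exact-match check between the reference answer and the model output. A rough sketch of such a metric follows. The helper names are hypothetical, and the whitespace normalization is an assumption; the report may compare the raw strings directly.

```python
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (assumed, not from the report)."""
    return " ".join(text.split())


def exact_match(answer: str, generation: str) -> bool:
    """True when the generation equals the reference answer after normalization."""
    return normalize(answer) == normalize(generation)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (answer, generation) pairs that match exactly; 0.0 if empty."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match(a, g) for a, g in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative pairs only, not results from the report.
    examples = [
        ('weather(location="Paris")', 'weather(location="Paris")'),
        ('timer(duration="5 minutes")', 'Sure! timer(duration="5 minutes")'),
    ]
    print(accuracy(examples))  # 0.5
```

Exact match is strict: a chat model that wraps the correct call in extra words, as in the second pair, counts as wrong.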


MistralAI-instruct

  • no fine-tuning



Fine-tuning really helps

Llama 7b fine-tuned

We can make the case that, with enough data, the model does become aligned to our task. Creating more data might make this model as good as the chat models.



MistralAI-instruct - fine-tuned

