
Welcome, AI Explained fans!

Intuitive visualizations allow you to interrogate every step of your LLM program. Easily review past results, identify and debug errors, gather insights about model behavior, and share learnings with colleagues.

🧑‍💻 Simple Bench Evals Competition 🏅

Welcome to the Simple Bench Evals Competition! Test your coding and LLM knowledge and skills to achieve a perfect score of 20/20 on the Simple Bench evaluation.

Resources

How to Participate

  1. Setup: Refer to the starter code for installation instructions and environment setup.
  2. Run the Benchmark:
  • Experiment with different models.
  • Adjust configurations like model_name, temperature, top_p, or max_tokens to see their impact on the evaluation results (see the sketch after this list).
  • Modify system_prompt.txt to customize the system instructions, ensuring the prompt ends with: "Final Answer: X where X is one of the letters A, B, C, D, E, or F."
  3. Submit Your Results:
  • Run the submission block in the starter code.
  • Your results will be logged to Weave and appear on the leaderboard.
  • Follow the 🍩 link in the output to check your ranking.
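As a rough illustration of step 2, here is a minimal sketch of what adjusting the run configuration and system prompt could look like. The actual parameter names and file handling live in the starter code, so treat every name below as an assumption rather than the competition's real API:

# Hypothetical configuration sketch; the real parameter names are defined in the starter code.
config = {
    "model_name": "gpt-4o",   # must be one of the models supported by the starter code
    "temperature": 0.2,        # lower values make answers more deterministic
    "top_p": 1.0,
    "max_tokens": 1024,
}

# The system prompt is read from system_prompt.txt and must end with the required sentence.
REQUIRED_ENDING = "Final Answer: X where X is one of the letters A, B, C, D, E, or F."

with open("system_prompt.txt") as f:
    system_prompt = f.read().rstrip()

assert system_prompt.endswith(REQUIRED_ENDING), "system_prompt.txt must end with the required sentence"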


Prizes

Prizes will be awarded to the first submissions that score 20/20, or to the highest-scoring submissions at the close of the competition (31 January).
  • First Prize: Meta Ray-Bans + W&B Merch + Ultimate bragging rights 😎
  • Second Prize: Gift Card + W&B Merch
  • Third Prize: W&B Merch


Rules

  1. If no 20/20 solutions are submitted, the competition ends on 31 January 2025 (inclusive).
  2. Individual Submissions: Each participant must submit their own solution.
  3. Fair Play: Organizers reserve the right to disqualify any submission suspected of cheating.
  4. Supported Models: Only models specified in the starter code are allowed.
  5. The Ray-Bans prize is only available to participants located in the US, Canada, Australia, France, Germany, Italy, Spain, or the UK due to shipping restrictions.

If you have questions, email research@wandb.com; if you want to chat with fellow participants, use our General Discord channel.

Get started now!




Introducing Weave from Weights & Biases

Evaluation comparison for MixEval hard benchmark
Weave is a game-changing toolkit for developers working on generative AI applications. It offers a seamless way to track, evaluate, and debug LLM-based projects.
You can use Weave to:
  1. Log and debug language model inputs, outputs, and metadata
  2. Build rigorous, apples-to-apples evaluations for language model use cases (see the sketch after this list)
  3. Organize all information generated across the LLM workflow, from experimentation to evaluations to production
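To make the second point concrete, here is a minimal sketch of a Weave evaluation. The project name, dataset, scorer, and toy model are invented for illustration, and scorer parameter names can differ between Weave versions, so treat this as a sketch rather than a drop-in recipe:

import asyncio
import weave
from weave import Evaluation

weave.init("quickstart-evals")  # hypothetical project name

# A tiny in-memory dataset; each row holds the model inputs plus the expected answer.
examples = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What colour is the sky?", "expected": "blue"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Scorers receive fields from the dataset row plus the model output
    # (older Weave versions call this parameter model_output).
    return {"correct": expected.strip().lower() == output.strip().lower()}

@weave.op()
def toy_model(question: str) -> str:
    # Stand-in for a real LLM call.
    return "4" if "2 + 2" in question else "blue"

evaluation = Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(toy_model))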

Check out the full capabilities of Weave



🧑‍💻 All you need are 3 lines of code

Get started by decorating your Python functions with @weave.op:
import weave

# Initialize Weave and point it at your project.
weave.init("ai_explained")

@weave.op()
def sum_nine(value_one: int):
    return value_one + 9

@weave.op()
def multiply_two(value_two: int):
    return value_two * 2

@weave.op()
def main():
    output = sum_nine(3)
    final_output = multiply_two(output)
    return final_output

# Each decorated call, including the nested ones, is traced in Weave.
main()
You can try Weave out in this interactive Colab:


♾️ Weave supports any use case

RAG applications, image generation, benchmarking: Weave can support any enterprise, academic, or side project. Here are some projects powered by Weave:


🤝 Weave lives where you work

Weave offers integrations with many language model APIs and LLM frameworks to streamline tracking and evaluation:


Stay focused on iterating on your prompts and models with whichever API or LLM framework you know and love. Weave has integrations for OpenAI, Anthropic, Mistral.ai, LlamaIndex, Cohere, and more.
See our integrations
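For example, here is a minimal sketch of how the OpenAI integration is typically used; the project name and prompt are placeholders, and an OPENAI_API_KEY is assumed to be set in the environment:

import weave
from openai import OpenAI

# Once weave.init() has been called, calls made through the OpenAI client
# are traced automatically by Weave's OpenAI integration.
weave.init("integration-demo")  # hypothetical project name

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "In one sentence, what is Weave?"}],
)
print(response.choices[0].message.content)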


🧑‍🎓 Explore Weave and build LLM apps with our free courses

Our courses give you the theory and code to dive into the area of your interest. Our LLM courses range from short, problem-focused courses to more advanced deep dives. Each course combines theory with code and is led by industry experts.


Check out all our courses (and enroll for free!)


Trusted by over 200,000 machine learning practitioners

Weights & Biases is trusted as the machine learning system of record by organizations across the world: from the largest deep learning research labs to autonomous driving companies, and from pharmaceutical companies working on drug discovery to financial institutions. And with SOC2 certification and both cloud and local deployment options, Weights & Biases will meet your teams' security needs.
Enterprises that trust Weights & Biases
Try Weights & Biases today


