
I put GPT2-chatbot’s coding skills to the test

A new model has shown up on lmsys, and it looks a lot like GPT-4!
Created on April 30 | Last edited on April 30
A new model known as gpt2-chatbot has appeared on the LMSYS platform, attracting attention for its advanced capabilities. With rumors claiming that it can perform on par with OpenAI's GPT-4, this model has prompted questions about its creator and underlying architecture.
To understand its potential, I decided to conduct a series of tests. This article details my approach, the tests I performed, and the insights I gathered from interacting with gpt2-chatbot. Join me as I explore the capabilities of this intriguing new model.
I will test the model's performance benchmarked against GPT-4 as served through ChatGPT. There may be differences between my GPT-4 results and results generated by the actual GPT-4 API, so keep this in mind.

Test 1: Building a website with a single bash script

This test was designed to gauge the model's ability to generate a website from a single bash script. Here is the prompt I used:
prompt = write a sh script that creates and opens a website about vr sports simulation (a company)
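Neither model's script is reproduced in full here, but the shape of the task is simple: write an HTML file and open it in the default browser. A minimal Python sketch of the same idea (the file name and page content are my own stand-ins, not taken from either model's output):

```python
import webbrowser
from pathlib import Path


def build_site(path: str) -> str:
    """Write a minimal landing page for a (hypothetical) VR sports
    simulation company and return the file path."""
    html = """<!DOCTYPE html>
<html>
<head><title>VR Sports Simulation</title></head>
<body>
  <h1>VR Sports Simulation</h1>
  <p>Train like a pro, from your living room.</p>
</body>
</html>"""
    Path(path).write_text(html)
    return path


def open_site(path: str) -> None:
    # Equivalent of the `open`/`xdg-open` call an sh script would make.
    webbrowser.open("file://" + str(Path(path).resolve()))
```

An sh script does the same thing with a heredoc into `index.html` followed by `open index.html`; the interesting part of the test is how much styling each model packs into that HTML.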

GPT-4 (ChatGPT) result




gpt2-chatbot result



I think everyone can agree that gpt2-chatbot's website is much more visually appealing! However, the buttons on gpt2-chatbot's website did not work (arguably, they shouldn't be there at all), while GPT-4 did not attempt to implement buttons in the first place. Still, I appreciate gpt2-chatbot's design skills, so I will give this round to gpt2-chatbot!

Test 2: Create a python Slither IO game

This test gauged the model's ability to create a game similar to Slither.io!
prompt = write a python game similar to slither io
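Both generated games are too long to reproduce here, but the core mechanic they both need is the same: a snake whose body segments trail its head and which grows when it eats. A dependency-free sketch of that mechanic (the class and method names are mine, not taken from either model's output):

```python
class Snake:
    """Minimal Slither-style body: a list of (x, y) segments,
    head first, that trails behind the direction of movement."""

    def __init__(self, x=0, y=0, length=3):
        self.segments = [(x - i, y) for i in range(length)]
        self.pending_growth = 0

    @property
    def head(self):
        return self.segments[0]

    def move(self, dx, dy):
        hx, hy = self.head
        self.segments.insert(0, (hx + dx, hy + dy))
        if self.pending_growth:
            self.pending_growth -= 1   # keep the tail: the snake grows
        else:
            self.segments.pop()        # drop the tail: constant length

    def eat(self, amount=1):
        self.pending_growth += amount

    def hits_self(self):
        return self.head in self.segments[1:]
```

A full version (both models reached for pygame) wraps this in a render loop and adds food pellets and opponents, but the trailing-body update above is the piece that makes it feel like Slither.io.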


GPT-4 (ChatGPT) result



gpt2-chatbot result



Both games were quite similar. The main difference was that the GPT-4 game allowed me to quit or restart after dying, while the gpt2-chatbot version was a bit faster-paced and more difficult to play. Because of the restart functionality, I will give GPT-4 the win in this round.

Test 3: Building a Flutter mobile app

Here, I test the model's ability to build a Flutter app that can show charts similar to those in Weights & Biases. Here is the prompt below:
prompt = make a flutter app that has charts similar to wandb for various metrics. Make it look good (similar to wandb), and be efficient so you can fit the main.dart all in one script. Do not use any packages

GPT-4 (ChatGPT) result



gpt2-chatbot result



Overall, these were also very similar. I can't really say too much more than that, and will count this round as a tie.

Test 4: Building a Python game similar to Brick Breaker

I used to love playing this game on an old iPod I had. I wanted to test the model's ability to create a brick-breaker game. Here is the prompt below:
prompt = make a python brick breaker (breakout) game [repeated roughly twenty times]
I was getting rate-limited quite a bit on lmsys, and out of frustration, I copied and pasted the prompt 5-10 times and hit enter. Surprisingly, it worked, so I went ahead and just stuck with this prompt for the test. I doubt this will have too much of an impact on the results.
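The heart of any Breakout clone is the axis-aligned bounce step: flip the ball's velocity when it crosses a wall, and let it fall past the bottom when the paddle misses. A display-free sketch of that step (the coordinate conventions are mine; both models' actual games used pygame):

```python
def step_ball(x, y, vx, vy, width, height):
    """Advance the ball one tick inside a width x height field,
    reflecting the velocity off the left/right/top walls."""
    x, y = x + vx, y + vy
    if x < 0 or x > width:     # side walls: flip horizontal velocity
        vx = -vx
        x = max(0, min(x, width))
    if y < 0:                  # top wall: flip vertical velocity
        vy = -vy
        y = -y
    # falling past y > height means the ball is lost (life down);
    # that check belongs in the game loop, not here
    return x, y, vx, vy
```

Paddle and brick collisions are the same reflection applied against smaller rectangles; the "faster-paced" feel I note below comes down to the magnitude of `vx`/`vy` each model chose.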

GPT-4 (ChatGPT) result




gpt2-chatbot result



Once again, both results were extremely similar. The main difference was that gpt2-chatbot's game was a bit faster-paced than GPT-4's, and I found it more challenging and engaging. I ruled this round a draw.

Test 5: Improve on our existing Brick Breaker game

I wanted to test how well both chatbots could improve on our existing game. Here is the prompt I gave the models:
[previous code context]
prompt = add some fire to the ball and make a more exciting UI
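Both models implemented the "fire" as a short-lived particle trail behind the ball. A minimal sketch of that technique (the particle fields and decay rate are my invention, not lifted from either model's code):

```python
import random


class FireTrail:
    """Spawn fading particles at the ball's position each frame;
    each particle shrinks every update until it dies and is removed."""

    def __init__(self, decay=1):
        self.particles = []          # each particle: [x, y, size]
        self.decay = decay

    def emit(self, x, y, size=6):
        # jitter the spawn point slightly for a flame-like flicker
        self.particles.append([x + random.uniform(-2, 2),
                               y + random.uniform(-2, 2),
                               size])

    def update(self):
        for p in self.particles:
            p[2] -= self.decay       # shrink each particle
        self.particles = [p for p in self.particles if p[2] > 0]
```

Each frame, the game loop calls `emit()` at the ball's position and `update()` before drawing; a pygame version would draw each particle as an orange circle of radius `size`.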

GPT-4 (ChatGPT) result



gpt2-chatbot result



As can be seen, both models added flames to the ball and a score counter in the top left. However, gpt2-chatbot chose score increments of 1, compared to GPT-4's increments of 100. I also found the physics of gpt2-chatbot's version more realistic: shots that hit between two bricks broke both, whereas GPT-4's version did not seem to have this behavior.
Because of this, I will give gpt2-chatbot the win for this round.
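The behavior I preferred in gpt2-chatbot's version, where a shot landing between two bricks destroys both, amounts to removing every brick the ball overlaps on a given frame rather than only the first hit found. A sketch of that check using plain axis-aligned rectangles (the tuple representation is mine; the models used pygame Rects):

```python
def overlaps(a, b):
    """Axis-aligned overlap test; rects are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def smash(ball, bricks):
    """Remove *every* brick the ball currently overlaps, so a hit
    between two bricks destroys both; return the survivors and
    the number destroyed."""
    survivors = [b for b in bricks if not overlaps(ball, b)]
    return survivors, len(bricks) - len(survivors)
```

A loop that instead `break`s out after the first colliding brick, which is what GPT-4's version appeared to do, can only ever destroy one brick per frame.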

Test 6: Code compression of our Brick Breaker game

I wanted to see how well each chatbot could condense existing code into a compact representation. This was partly inspired by a recent tweet from Sam Altman. I agree that being able to condense something (especially code) is a great test of understanding and skill.


prompt = do your best to try to condense this code as much as humanly possible
[previous gpt2-chatbot code]

Both chatbots were able to reduce the line count quite a bit! The original file was about 118 lines and 3,559 characters.

GPT-4 (ChatGPT) result

GPT-4's result came out to 62 lines; however, it removed the fire from the ball!

gpt2-chatbot result

gpt2-chatbot's result came out to around 75 lines; however, it kept the flames, so I believe this round goes to gpt2-chatbot!
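As a concrete illustration of the kind of condensation the prompt asks for (a toy example of my own, not either model's output): a verbose brick-grid builder collapsed into a single comprehension with identical behavior.

```python
def build_bricks_verbose(rows, cols, w=60, h=20):
    """Lay out a grid of brick rects with one explicit loop per axis."""
    bricks = []
    for row in range(rows):
        for col in range(cols):
            x = col * w
            y = row * h
            bricks.append((x, y, w, h))
    return bricks


def build_bricks_condensed(rows, cols, w=60, h=20):
    """The same grid in one line: a typical target of a 'condense this' pass."""
    return [(c * w, r * h, w, h) for r in range(rows) for c in range(cols)]
```

The real risk, as GPT-4's attempt showed, is that aggressive compression silently drops features (here, the fire effect) rather than just syntax.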

Overall

In the comparative analysis between the new gpt2-chatbot on the LMSYS platform and OpenAI's GPT-4, the results showed remarkable similarities in performance on simple to medium-difficulty tasks. This leads me to the hypothesis that gpt2-chatbot might be a condensed or distilled version of GPT-4. Such a model would not only be more resource-efficient but could also serve specific strategic purposes in the broader AI development landscape.
Currently, it seems like OpenAI's main financial incentive is to reduce the costs of GPT-4 and push its performance further to maintain a competitive advantage over other companies. Given that the model is named gpt2-chatbot, it seems plausible that this could be a GPT-2-scale model (or GPT-2 with MoE) trained on larger amounts of data with the newest training tactics, allowing it to perform at a much higher level.
Assuming that this is in fact a GPT-2 model, it would likely cut OpenAI's expenses down tremendously, not only for running ChatGPT, but also synthetic data generation tasks for training next-generation models. It should be interesting to see the true origin of gpt2-chatbot!
Here is the code for my tests!