
Are OpenAI Models Actually Getting Smarter?

Researchers compare OpenAI API versions to see how they have changed. The results may surprise you.
The past year has seen massive adoption of large language models (LLMs) such as GPT-3.5 and GPT-4. These LLMs, equipped with powerful natural language processing abilities, are increasingly being deployed across a variety of applications, from content generation to decision support systems. Yet, despite their widespread use, the methodology and implications of their regular updates remain opaque to developers.



Complex Evaluation

In most software systems, updates usually mean monotonic enhancements or bug fixes designed to improve user experience. However, when it comes to these complex, machine-learning-based models, updates can lead to shifts in performance and behavior that are less predictable and potentially disruptive.
This dynamic is particularly evident in the case of GPT-3.5 and GPT-4. With the capacity to learn from user interactions and undergo frequent adjustments based on design changes, these models can exhibit substantial changes in their responses over time. This fluid nature poses challenges when integrating these LLMs into larger workflows, where consistent behavior is critically important.
A sudden shift in the model's response to a particular prompt, for instance, could break the downstream pipeline, leading to system-wide errors and failures. Moreover, the inherent variability of these models presents hurdles to scientific reproducibility, especially for research that benchmarks performance against the OpenAI GPT series of models.
This fluidity also raises an intriguing question: Is the performance of an LLM, like GPT-4, consistently improving over time? While updates may be intended to enhance the model in certain aspects, it is critical to discern whether these improvements could inadvertently hamper its proficiency in other areas.

A Simple Test

To explore these questions, researchers from Stanford and UC Berkeley embarked on a comparative study of the March 2023 and June 2023 versions of GPT-3.5 and GPT-4. The analysis focused on four distinct tasks: solving mathematical problems, answering sensitive or potentially dangerous questions, generating code, and visual reasoning. They chose these tasks because they span a broad range of the models' capabilities and reflect their diverse applications.
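To make the setup concrete, here is a minimal sketch of how one might send the same prompt to two dated GPT-4 snapshots and compare the outputs. It assumes the OpenAI Python SDK (v1+), an OPENAI_API_KEY in the environment, and the dated snapshot names (gpt-4-0314 and gpt-4-0613) that the API exposed at the time of the study; the ask() helper and the example prompt are illustrative, not the authors' exact harness.

```python
# Compare how two pinned GPT-4 snapshots answer the same prompt.
# Assumes the OpenAI Python SDK (v1+) and OPENAI_API_KEY in the environment;
# the snapshot names below are the dated versions available during the study.
from openai import OpenAI

client = OpenAI()

SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # March vs. June 2023 versions

def ask(model: str, prompt: str) -> str:
    """Send a single user prompt to the given model snapshot."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # make outputs as deterministic as possible
    )
    return response.choices[0].message.content

prompt = "Briefly explain what a hash table is."
for snapshot in SNAPSHOTS:
    print(f"--- {snapshot} ---")
    print(ask(snapshot, prompt))
```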
The findings revealed that the performance and behavior of both GPT-3.5 and GPT-4 could fluctuate considerably between these two iterations, and that their ability to perform certain tasks had regressed over this period. These results underscore the critical need for continuous monitoring and quality control of large language models, given their variable and evolving nature.

Math

In the math-solving task, the authors found substantial performance drift. GPT-4's accuracy in determining prime numbers dropped dramatically from 97.6% to 2.4% between March and June, while GPT-3.5's improved from 7.4% to 86.8%. The authors suggest that drift in the effectiveness of chain-of-thought prompting may be a contributing factor to these shifts.
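As a rough illustration of how such an accuracy figure can be computed, the sketch below scores yes/no answers on a small primality quiz. It reuses the hypothetical ask() helper from the snippet above; the prompt wording is an assumption loosely modeled on the chain-of-thought style prompts described in the study, and sympy serves as the ground-truth oracle.

```python
# Score a model's yes/no answers on primality questions against ground truth.
# The ask() helper is the hypothetical one defined in the earlier snippet;
# answer parsing is a naive keyword check, sufficient for a sketch.
from sympy import isprime

def primality_accuracy(model: str, numbers: list[int]) -> float:
    correct = 0
    for n in numbers:
        prompt = f"Is {n} a prime number? Think step by step and then answer [Yes] or [No]."
        reply = ask(model, prompt)
        predicted_prime = "[yes]" in reply.lower()
        if predicted_prime == isprime(n):
            correct += 1
    return correct / len(numbers)

test_numbers = [17077, 14923, 20026, 9973, 15841]  # mix of primes and composites
print(primality_accuracy("gpt-4-0613", test_numbers))
```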

Sensitive Questions

For the sensitive question task, the authors observed that GPT-4 answered fewer sensitive questions over time, suggesting the implementation of a stronger safety layer. On the other hand, GPT-3.5 became less conservative. Over time, both models also tended to provide less explanation when refusing to answer a query.
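Tracking this kind of behavior over versions requires some way of deciding whether a response is a refusal. The sketch below uses a naive keyword heuristic for that purpose; the phrase list is an assumption, not the classifier used in the study.

```python
# Naive heuristic for flagging refusals so an "answer rate" can be tracked
# across model versions. The phrase list is illustrative only.
REFUSAL_PHRASES = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "as an ai",
    "i am unable",
)

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def answer_rate(responses: list[str]) -> float:
    """Fraction of responses that attempt an answer rather than refuse."""
    answered = sum(not is_refusal(r) for r in responses)
    return answered / len(responses)
```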

Programming

In code generation, the number of directly executable generations dropped significantly for both GPT-4 and GPT-3.5. The authors hypothesize that this decline is due to the models adding extra non-code text, such as markdown formatting, around their generations.
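One way to operationalize "directly executable" is to try compiling the raw generation as Python, optionally after stripping any markdown code fences. The sketch below does exactly that; the exact executability criterion used in the study may differ.

```python
# Check whether a code generation is "directly executable": optionally strip
# markdown code fences, then try to compile the remaining text as Python.
import re

FENCE_PATTERN = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code(generation: str) -> str:
    """Return fenced code if present, otherwise the raw generation."""
    match = FENCE_PATTERN.search(generation)
    return match.group(1) if match else generation

def is_directly_executable(generation: str, strip_fences: bool = False) -> bool:
    code = extract_code(generation) if strip_fences else generation
    try:
        compile(code, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

# A response wrapped in markdown fences fails the strict check but passes
# once the surrounding non-code text is stripped.
wrapped = "```python\nprint('hello')\n```"
print(is_directly_executable(wrapped))                     # False
print(is_directly_executable(wrapped, strip_fences=True))  # True
```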

Conflicting Incentives

OpenAI, like any organization, is strongly incentivized to minimize costs without sacrificing performance. Therefore, it has a significant interest in reducing the size of models like GPT-4, if possible, to realize savings on computational resources and storage. This reduction in size has potential benefits not only for OpenAI, but also for developers and users, as it can make these advanced language models more accessible and affordable to deploy and use.
Nonetheless, there is a tradeoff. While smaller models might be more efficient and affordable, they may not offer the same generality as larger models across a broad range of tasks and datasets. Bigger models like GPT-4 can capture more complex patterns and perform advanced reasoning that smaller models may not be capable of. Therefore, while model reduction can lead to cost savings and increased accessibility, it's also important to ensure that the performance and generality of these models meet the needs of developers across their varied use cases. This balance is critical in the continuous evolution of AI models.

User Frustrations

Recently, there has been growing discontent among users regarding a perceived shift in the performance of ChatGPT. These users claim there are discernible inconsistencies and degradation in the model's performance, which they believe affects the utility and value of the service. Conversely, OpenAI maintains that its GPT APIs have only seen consistent improvements; however, it also states that the ChatGPT models are very much experimental, and it is likely that ChatGPT is changing in a less monitored fashion. A clear line of communication between developers and users, paired with transparency about changes and improvements, could go a long way toward addressing such discrepancies and aligning expectations. As these models become more ingrained in the fabric of people's careers and businesses, the need for full transparency regarding model performance will be absolutely critical.