OpenAI Launches GPT-4.1 Series with Major Gains in Code, Context, and Cost

Created on April 14 | Last edited on April 14
OpenAI has released a new family of models under the GPT-4.1 name, introducing three variants: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. These models outperform previous iterations like GPT-4o and GPT-4.5, with substantial boosts in coding ability, instruction following, and comprehension of long context. A standout feature is the expanded context window—now up to 1 million tokens—and the improved ability to use that massive input space effectively. These models are API-only; improvements from the 4.1 family are being folded into ChatGPT gradually through updates to GPT-4o.
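Since the models are API-only, the natural entry point is the OpenAI SDK. The sketch below only builds the request parameters so they can be inspected before sending; the model identifiers follow the announced names (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano), but check the official model list before relying on them.

```python
# Sketch: choosing a GPT-4.1 variant and assembling a chat request.
# The returned dict would be passed to client.chat.completions.create(**kwargs)
# with the official OpenAI Python SDK.

VARIANTS = {
    "full": "gpt-4.1",       # strongest model in the family
    "mini": "gpt-4.1-mini",  # lower latency and cost
    "nano": "gpt-4.1-nano",  # fastest and cheapest, for simple tasks
}

def build_request(prompt: str, variant: str = "mini", max_tokens: int = 256) -> dict:
    """Return keyword arguments for a chat completion call."""
    return {
        "model": VARIANTS[variant],
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

kwargs = build_request("Summarize this changelog.", variant="nano")
print(kwargs["model"])  # -> gpt-4.1-nano
```

Keeping request construction separate from the network call makes it easy to swap variants (for example, nano for classification, full for code review) without touching the rest of the pipeline.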

Performance in Coding Tasks

GPT-4.1 significantly advances code generation and software engineering tasks. On the SWE-bench Verified benchmark, it scores 54.6%, a major leap over GPT-4o’s 33.2%. It handles complex diffs more reliably, doubling GPT-4o's performance on Aider’s polyglot diff benchmark and outperforming GPT-4.5 by 8 percentage points. For frontend development, human reviewers preferred GPT-4.1's outputs over GPT-4o’s in 80% of cases. The model has also been fine-tuned to reduce unnecessary edits and follow structured formats more accurately, helping developers iterate faster and with more confidence.

Advancements in Instruction Following

OpenAI has refined GPT-4.1’s ability to follow complex instructions, especially in multi-step or multi-turn tasks. Internal evaluations show it achieving 49% accuracy on hard instruction-following prompts, a 20-point gain over GPT-4o. On Scale's MultiChallenge benchmark, which measures long-form, multi-turn instruction adherence, GPT-4.1 scores 38.3%, outperforming GPT-4o by 10.5 points. It also shows better compliance with negative instructions, ranking, formatting, and even knowing when to say “I don’t know,” making it more reliable in real-world use.

Long Context Capabilities

The introduction of a 1 million token context window marks a shift in how AI can manage large-scale information. GPT-4.1 handles context retrieval tasks like “needle in a haystack” effectively at all positions across the input. OpenAI has also released two new benchmarks—OpenAI-MRCR and Graphwalks—to assess multi-turn disambiguation and multi-hop reasoning across long documents. GPT-4.1 performs well across both, beating GPT-4o in each task and maintaining accuracy even at maximum context lengths. For developers working with legal, financial, or codebase-heavy documents, this change is substantial.
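The "needle in a haystack" evaluation mentioned above can be sketched as a simple harness: plant a unique fact at a chosen depth in filler text, then score whether the model's answer surfaces it. The filler sentence and needle below are placeholders; in a real run, the assembled document would be sent to the API along with a retrieval question.

```python
def build_haystack(needle: str, depth: float, filler_sentence: str,
                   n_sentences: int = 1000) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)
    within a document of repeated filler sentences."""
    sentences = [filler_sentence] * n_sentences
    pos = int(depth * n_sentences)
    sentences.insert(pos, needle)
    return " ".join(sentences)

def check_retrieval(model_answer: str, expected: str) -> bool:
    """Score one retrieval attempt: did the answer contain the planted fact?"""
    return expected.lower() in model_answer.lower()

doc = build_haystack(
    needle="The access code is 7431.",
    depth=0.5,
    filler_sentence="The sky was a uniform shade of gray that afternoon.",
)
# In a real evaluation, `doc` plus a question like "What is the access code?"
# would go to the model, and the reply would be scored with check_retrieval.
print(check_retrieval("I believe the access code is 7431.", "7431"))  # True
```

Sweeping `depth` from 0.0 to 1.0 is what "effective at all positions across the input" refers to: accuracy should stay flat regardless of where the needle sits.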

Real-World Use Cases and Benchmarks

Companies like Thomson Reuters, Blue J, Qodo, and Carlyle tested GPT-4.1 on internal production benchmarks. Thomson Reuters saw a 17% improvement in long-document legal analysis using GPT-4.1, while Carlyle achieved a 50% boost in data extraction accuracy from complex, large-format files. Qodo found GPT-4.1 produced better code reviews in 55% of cases, with higher precision and better judgment on when to make suggestions. These tests show the model’s adaptability and readiness for real applications where accuracy and context tracking are critical.

Vision and Multimodal Enhancements

GPT-4.1 mini stands out in image-based tasks, often surpassing GPT-4o. On benchmarks like MMMU, MathVista, and CharXiv, GPT-4.1 mini consistently performs at or above prior top-tier models. Visual reasoning, scientific chart comprehension, and long video analysis have all improved. For instance, on the Video-MME benchmark (long, no subtitles), GPT-4.1 achieves a 72% accuracy score, beating GPT-4o’s 65%. These capabilities are key for AI systems analyzing rich multimedia data across legal, educational, or enterprise settings.

Pricing and Availability

All GPT-4.1 models are now available via API, and OpenAI has lowered prices across the board. GPT-4.1 is 26% cheaper than GPT-4o for median queries. GPT-4.1 mini cuts latency by nearly half and cost by 83% while matching GPT-4o’s intelligence scores. GPT-4.1 nano is OpenAI’s fastest and cheapest model ever, ideal for low-latency tasks like classification or text autocompletion. Developers can also take advantage of a higher prompt caching discount—now 75%—to save on repeated calls. Long context usage incurs no additional cost beyond token-based pricing.
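To make the caching discount concrete, here is a small cost sketch. The per-token rates used in the example call are illustrative placeholders, not OpenAI's published prices; only the 75% discount on cached input tokens is taken from the announcement.

```python
def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float,
                 cache_discount: float = 0.75) -> float:
    """Estimate the dollar cost of one call. Rates are dollars per 1M tokens.
    Cached input tokens are billed at (1 - cache_discount) of the input rate."""
    uncached = input_tokens - cached_tokens
    total = (
        uncached * input_rate
        + cached_tokens * input_rate * (1 - cache_discount)
        + output_tokens * output_rate
    )
    return total / 1_000_000

# Illustrative rates only -- consult the pricing page for real numbers.
no_cache = request_cost(100_000, 0, 1_000, input_rate=2.0, output_rate=8.0)
mostly_cached = request_cost(100_000, 90_000, 1_000, input_rate=2.0, output_rate=8.0)
print(f"no cache: ${no_cache:.4f}  mostly cached: ${mostly_cached:.4f}")
```

For repeated calls over the same long document (a codebase, a contract), most input tokens hit the cache, so the 75% discount dominates the bill; there is no separate surcharge for long-context usage beyond the per-token rates.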

Conclusion

GPT-4.1 reflects a continued shift toward models optimized for real-world software development, instruction-following, and high-token contexts. It improves accuracy, speeds up responses, and lowers costs—without compromising on intelligence or versatility. With expanded capabilities in code, language, and vision, OpenAI’s latest release sets a new standard for general-purpose models that can also serve highly specialized workflows. Whether you're building agents, analyzing documents, or scaling multimodal apps, GPT-4.1 offers a more refined and accessible foundation.
Tags: ML News