OpenAI Launches GPT-5.5: The AI That Does Your Office...

143

GPT-5.5 is OpenAI’s most capable and intuitive model to date — and it has a clear ambition: to become the AI that runs your office. It writes code, browses the web, manages email, fills spreadsheets, synthesizes research, and plans scientific experiments, all with less human prompting than any previous model. OpenAI’s own finance team already used it to review nearly 25,000 tax forms two weeks faster than last year. The super app era has begun.

84.9%GDPval score — 44-occupation office knowledge work benchmark

82.7%Terminal-Bench 2.0 — complex CLI planning & tool coordination

85%+Of OpenAI’s own staff now use Codex every single week

Every year, the technology industry produces a model that makes the previous year’s benchmark numbers look ordinary. On April 23, 2026, OpenAI delivered this year’s version. GPT-5.5 — described by OpenAI President Greg Brockman as “the smartest and most intuitive model we’ve ever built” — is the company’s answer to the question that enterprise customers have been asking for the past three years: when will AI stop being a chatbot and start being a colleague?

The answer, delivered through an unusually specific set of real-world performance numbers and internal case studies, is: now. GPT-5.5 is not just a more capable language model. It is OpenAI’s most explicit articulation of what agentic AI looks like in practice — a system that takes multi-part tasks, plans the steps required to execute them, navigates tools and software autonomously, checks its own work, and deals with ambiguity without stopping to ask for clarification at every turn.

The launch places GPT-5.5 at the center of three intersecting stories: the race between OpenAI and Anthropic for enterprise AI dominance; the company’s stated ambition to build a “super app” combining ChatGPT, Codex, and an AI browser into a single unified work environment; and the emerging reality that the most economically valuable AI capability of 2026 is not generating text — it is getting things done in the tools that knowledge workers use every day.

What GPT-5.5 Is — In OpenAI’s Own Words

OpenAI launched GPT-5.5 with an unusually direct positioning statement. The company describes it as “a significant leap forward in agentic AI — strong in coding, computer use, knowledge work, and early scientific research — capable of carrying out complex, multi-step work autonomously.” That combination of qualifiers — agentic, multi-step, autonomous — is deliberate. GPT-5.5 is not being positioned as a better question-answerer. It is being positioned as a better worker.

Brockman’s briefing to reporters was characteristically concrete. “It just goes and kind of figures it out, deals with ambiguity,” he told Bloomberg. “It’s a much more intuitive experience. It’s a faster, sharper thinker for fewer tokens compared to something like 5.4.” That last point matters practically: efficiency gains that reduce token consumption at equivalent or better output quality directly reduce the per-task cost of running the model — which at enterprise scale is a meaningful commercial consideration.

Jakub Pachocki, OpenAI’s chief scientist, was more expansive about the trajectory: “We see pretty significant improvements in the short term, extremely significant improvements in the medium term.” The implication, reading between the lines of a briefing with journalists, is that GPT-5.5 is not a destination — it is a milepost on a road that OpenAI believes is steepening, not flattening.

The Office Automation Story: Real Work, Real Numbers

The most commercially significant aspect of GPT-5.5 is not its performance on academic benchmarks — it is its stated ability to handle the full breadth of everyday office software: email, spreadsheets, calendars, document creation, web research, and the kind of multi-application workflows that consume the majority of knowledge worker time in every organization.

OpenAI has backed this claim with unusually specific internal case studies. The most striking: OpenAI’s own finance team used GPT-5.5 to review 24,771 tax forms — totalling 71,637 pages — and completed the task two weeks faster than in the previous year. This is not a benchmark designed in a lab. It is a real organizational workflow, with real compliance stakes, completed at real scale. The two-week time saving in a tax review process represents a measurable reduction in the period during which errors can accumulate and corrections must be made.

The second case study is more broadly applicable: a member of OpenAI’s go-to-market team automated the generation of weekly business reports, saving between five and ten hours per week. For a single individual, that is a reduction of 12–25% of a standard working week spent on a task that GPT-5.5 now handles autonomously. Scaled across an organization of hundreds or thousands of knowledge workers, the aggregate savings are substantial.

The GDPval benchmark result provides the most systematic evidence for GPT-5.5’s office automation capabilities. GDPval tests an AI agent’s ability to produce well-specified knowledge work across 44 occupations — a deliberately broad evaluation designed to approximate the diversity of real workplace tasks rather than the narrow coding or reasoning evaluations that most AI benchmarks focus on. GPT-5.5 scores 84.9% on GDPval — a number that, if it reflects real-world performance accurately, would represent genuinely useful automation coverage across a wide range of professional roles.

Mark Chen, OpenAI’s chief research officer, added that GPT-5.5 was substantially better than its predecessors at navigating computer work — the ability to operate software interfaces as a human would, clicking through menus, filling in forms, and managing files — and that it showed “meaningful gains on scientific and technical research workflows,” with potential to help expert scientists make progress in ways that previous models could not.

“It just goes and kind of figures it out, deals with ambiguity. It’s a much more intuitive experience. It’s a faster, sharper thinker for fewer tokens compared to something like 5.4.”

— Greg Brockman, President and Co-Founder, OpenAI, April 23, 2026

The Benchmark Numbers: State-of-the-Art Across Every Frontier Category

GPT-5.5’s benchmark performance confirms the qualitative claims made by OpenAI’s leadership in the briefing. Across the major evaluation categories, the model either leads or is competitive at the very top of the field:

Benchmark	GPT-5.5	GPT-5.4	Change	What It Tests
Terminal-Bench 2.0	82.7%	~70%	+12.7pp	Complex CLI planning, tool coordination
SWE-Bench Pro	58.6%	44.3%	+14.3pp	Real-world GitHub issue resolution
GDPval (knowledge work)	84.9%	~70%	+~15pp	44-occupation office task automation
OSWorld-Verified	~80%	75%	+~5pp	Computer use in desktop environments
Expert-SWE	Higher	Lower	↑	20-hour engineering tasks (1-pass)
GeneBench	Improved	Lower	↑	Multi-stage genetic/biology research

The Terminal-Bench 2.0 score of 82.7% deserves particular attention. Terminal-Bench evaluates complex command-line usage involving planning, iteration, and tool coordination — the kind of tasks where a model must not only generate correct commands but maintain consistency across multiple steps in a dynamic environment. A score of 82.7% in this category means GPT-5.5 is operating at a level that would be considered highly competent by professional software engineers assessing a human candidate.

The SWE-Bench Pro result of 58.6% also represents a significant step from GPT-5.4. SWE-Bench Pro tests long-horizon software engineering tasks — the kind requiring sustained focus and structured problem-solving across extended sessions. A 14-percentage-point improvement in a single model generation on a benchmark measuring tasks that take human experts an estimated 20 hours to complete is the most direct available evidence that GPT-5.5’s agentic reasoning improvements are translating into real capability gains on hard professional work.

One technical achievement that OpenAI specifically highlighted: GPT-5.5 matches the per-token latency of GPT-5.4 in real-world serving. For enterprise deployments that require low-latency interactive applications, this means organizations can access substantially more capability without accepting any speed regression. This is a significant engineering achievement given that more capable models have historically come with latency trade-offs.

Beyond Office Work: Scientific Discovery at the Frontier

OpenAI’s positioning of GPT-5.5 extends beyond the office — into scientific research, in ways that suggest the company is beginning to take seriously the “frontier science” use case that has long been discussed as a future AI capability but has rarely been demonstrated concretely.

The most striking specific claim: an internal version of GPT-5.5 helped discover a new mathematical proof related to Ramsey numbers — a longstanding and notoriously difficult open problem in combinatorics. The proof was subsequently verified in Lean, a formal mathematical verification system, providing independent, machine-checked confirmation that the proof is actually correct rather than merely plausible-sounding. This is not AI generating a plausible-looking mathematical argument — it is AI contributing to a formally verified result that advances the state of mathematical knowledge.

On GeneBench — a benchmark focused on multi-stage scientific data analysis in genetics and quantitative biology — GPT-5.5 shows clear improvement over its predecessor. The tasks in GeneBench correspond to what human scientific experts might spend several days completing. Mark Chen framed this directly: GPT-5.5 could “help expert scientists make progress” — not replace them, but accelerate them in ways that compound over research timelines measured in years.

The drug discovery framing is separately noteworthy. GPT-5.5’s launch comes days after OpenAI separately rolled out an early AI model designed to accelerate drug discovery — a separate product announcement that, together with GPT-5.5’s GeneBench results, signals that OpenAI is building toward a position as the AI provider of choice for pharmaceutical research, a market with substantially higher value per compute unit than most other AI applications.

The Super App Strategy: One Surface for Everything

The product context for GPT-5.5 is as important as the model itself. Greg Brockman used the launch briefing to restate and advance OpenAI’s super app ambition — the plan to combine ChatGPT, Codex, and an AI browser into one unified service that can cover chat, coding, research, browsing, and business tasks without forcing users to switch between tools.

The strategic logic is straightforward: enterprise AI adoption currently requires users to manage multiple tools — a chat interface for natural language tasks, a coding assistant for development work, a research tool for information synthesis, a computer use agent for task execution. Each transition between tools introduces friction, loses context, and creates integration complexity. A super app that routes all of these tasks through a single interface — intelligently selecting the right capability for each request — removes that friction entirely.

GPT-5.5 is explicitly framed as the next step toward that unified surface. The model’s breadth — combining office automation, coding, computer use, deep research, and scientific reasoning in a single system — makes it the most plausible candidate yet for the multi-purpose role the super app envisions. OpenAI’s own deployment data supports this: over 85% of OpenAI’s workforce uses Codex weekly, spanning finance, communications, data science, and product management. The same model powering all of those workflows, through a single interface, is the product the company is building toward.

The timing of this ambition is not coincidental. Elon Musk — who has explicitly described wanting to turn X into its own super app — is simultaneously positioning SpaceX-xAI as a unified AI-coding-social conglomerate, and has just acquired the option to purchase Cursor. Sam Altman and Greg Brockman are building their super app from the AI side; Musk is assembling his from the compute, coding, and distribution side. Both are converging on the same destination: the single application through which knowledge workers do most of their intellectual labor.

“This is a real step forward towards the kind of computing that we expect in the future — but it is one step, and we expect to see many in the future.”

— Greg Brockman, OpenAI, launch briefing, April 23, 2026

The Cybersecurity Dimension: Extensive Safeguards on a Highly Capable Model

Mia Glaese, OpenAI’s Vice President of Research, addressed the cybersecurity dimension of GPT-5.5 directly during the launch briefing — a reflection of how prominent the dual-use risk conversation has become following Anthropic’s Mythos Preview announcement. “GPT-5.5 underwent extensive third-party safeguard testing and red teaming for cyber and bio risks, and we’ve been iterating on our cyber safeguards for months with increasingly cyber capable models,” she said.

The launch cadence itself reflects this caution. GPT-5.5 and GPT-5.5 Pro are rolling out to ChatGPT paid subscribers on launch day — but the company explicitly stated the model will come to its API “very soon,” noting that API deployments require “different safeguards.” This distinction matters: direct ChatGPT access comes with account-level monitoring, behavioral guardrails, and content classifiers that are harder to implement at the API layer, where developers can build applications with their own system prompts and output pipelines that bypass some of those controls.

The cybersecurity risks presented by advanced AI have been front of mind across the technology industry since Anthropic announced Mythos. GPT-5.5’s approach — broad public access through ChatGPT with safety controls, more restrictive API rollout, and a separate GPT-5.4-Cyber track for verified security professionals through the TAC programme — reflects OpenAI’s layered strategy for managing a model that the company knows has capabilities that extend into dual-use territory.

Who Gets Access: Pricing Tiers and Rollout Details

GPT-5.5 is rolling out to paid ChatGPT subscribers as of April 23, 2026. Access is tiered, with progressively more powerful variants available at higher subscription levels:

Tier	Price / mo	GPT-5.5 Access	Thinking options	Weekly limit
Free	$0	GPT-5.3 only	None	N/A
Plus	$20	GPT-5.5 Thinking	Standard + Extended	3,000 msgs
Pro	$200	GPT-5.5 + 5.5 Pro	Light/Standard/Extended/Heavy	Unlimited
Business	$30/seat	GPT-5.5 + 5.5 Pro	Standard + Extended	3,000 msgs
Enterprise	Custom	GPT-5.5 + 5.5 Pro	All options + controls	Custom

GPT-5.5 Thinking — the extended reasoning variant — comes with a four-level effort control for paid users. Plus and Business subscribers can choose between Standard (the new default, balancing speed and intelligence) and Extended (the previous default, offering deeper reasoning). Pro users get two additional options: Light (the fastest mode) and Heavy (the deepest reasoning, for the hardest tasks).

The model picker in ChatGPT is also gaining a new “Instant” mode that automatically routes requests between GPT-5.3 and GPT-5.5 Thinking based on complexity. For simple queries, Instant uses the faster 5.3 model; for complex ones, it automatically escalates to 5.5 Thinking and applies extended reasoning before responding. This automatic routing is designed to reduce the cognitive overhead of model selection for users who want optimal results without manually choosing settings.

Notably, GPT-5.5 is not currently available for ChatGPT for Healthcare workspaces — a specific carve-out that reflects the additional compliance and safety review required for models deployed in regulated healthcare environments.

The Competitive Context: OpenAI Reclaims the Capability Narrative

The timing of GPT-5.5 is not separable from the competitive context in which it was released. AlphaPilot’s analysis framed the launch as OpenAI directly challenging the benchmarks recently set by Anthropic’s Claude Mythos Preview — and the framing is accurate. Over the past six weeks, the capability narrative in AI has been dominated by Anthropic: Claude Sonnet 4.6 delivering near-Opus performance at mid-tier pricing, Claude Mythos Preview described as a “step change” in cybersecurity, Claude Code crossing $1 billion in annualized revenue faster than any software product in history.

GPT-5.5 is OpenAI’s comprehensive response. The model’s office automation credentials address the enterprise productivity market where Anthropic has been building rapidly. The 82.7% Terminal-Bench score surpasses the figures Claude Sonnet 4.6 posted at its launch. The GDPval result of 84.9% across 44 occupations provides a direct response to the enterprise AI capability claims that Anthropic has been making to justify its $380 billion valuation. And the scientific research dimension — Ramsey numbers, drug discovery, genetic analysis — opens a lane of competition that Anthropic has not yet publicly addressed.

OpenAI’s own internal adoption data — 85%+ of staff using Codex weekly across finance, communications, data science, and product management — is as much a marketing claim as a capability demonstration. It signals to enterprise buyers that GPT-5.5 is already performing the office automation tasks that those buyers are evaluating AI for. It is a form of reference architecture: OpenAI itself is the reference customer.

The Office Has a New AI — And It Is Just Getting Started

The release of GPT-5.5 on April 23, 2026 is the clearest statement yet that the AI industry has moved past the era of impressive demonstrations and into the era of economically consequential deployment. A model that can review 24,771 tax forms two weeks faster than a human finance team, save a go-to-market professional five to ten hours per week on report generation, prove a Ramsey number theorem and have it formally verified, and achieve 84.9% coverage of knowledge work tasks across 44 occupations is not a chatbot. It is an AI that works.

The super app ambition that Brockman restated at the launch briefing is the organizational vision that gives GPT-5.5’s technical achievements their commercial meaning. Individual capabilities are impressive; a unified surface that deploys all of them seamlessly across the workflow of a knowledge worker is transformative. OpenAI’s roadmap is pointed at that transformation. GPT-5.5 is the step that makes it real enough to see.

The pace of development, as Pachocki put it, should be expected to continue. GPT-5.5 is a step. The steps are getting bigger. And for the hundreds of millions of office workers who have been waiting for AI to move from “interesting” to “essential,” the wait appears to be over.