What is ChatGPT’s new Agent Mode—and why it matters
OpenAI’s “agent mode” turns ChatGPT from a conversational assistant into a system that can plan, browse, click, type, run code, manipulate files, and generate finished work products on a virtual computer—all inside one chat. Instead of handing you a list of links, the agent can open a visual browser, sign in when you take over the keyboard, download documents, analyze data in a terminal, and return editable outputs such as slides or spreadsheets. The goal is to blend the reasoning of a top‑tier model with the ability to act, so tasks that previously required many back‑and‑forth steps—say, compiling a competitor brief and turning it into a slide deck—can be completed in a single, supervised flow. This launch merges the strengths of OpenAI’s earlier Operator (web interaction) and Deep Research (long‑form analysis) into one unified agentic system, so you no longer need to choose between “research” and “action” for the same job.
Under the hood, the agent chooses among several built‑in tools: a visual browser it operates through a graphical interface much as a person would, a text browser for lightweight web queries, a terminal with limited network access for analysis and code execution, and connectors (e.g., Gmail, Google Drive, GitHub) that let the model read from your data sources when you authorize them. It keeps context across tools, so it might fetch information via connectors, transform it with code in the terminal, and then paste the results into a spreadsheet you can download, pausing for your approval at key moments along the way.
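OpenAI has not published the agent’s internal orchestration, but the tool loop it describes is easy to picture: the model repeatedly picks a tool, runs it against a shared remote environment, and folds the output back into its working context. The sketch below is purely illustrative; every function and name is hypothetical rather than part of any OpenAI API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an agent tool loop. The real orchestration inside
# ChatGPT agent mode is not public; all names here are illustrative.

@dataclass
class AgentState:
    goal: str
    context: list = field(default_factory=list)  # shared across all tools

def visual_browser(query, state):   # GUI-style browsing: click, type, screenshot
    return f"[page content for {query!r}]"

def text_browser(query, state):     # lightweight text-only web queries
    return f"[search results for {query!r}]"

def terminal(command, state):       # code execution with limited network access
    return f"[output of {command!r}]"

def connector(source, state):       # read-only access to sources like Drive or Gmail
    return f"[records from {source}]"

TOOLS = {"visual_browser": visual_browser, "text_browser": text_browser,
         "terminal": terminal, "connector": connector}

def choose_next_step(state):
    """Stand-in for the model deciding which tool to call next, or to stop."""
    plan = [("connector", "drive:q2-pipeline"),
            ("terminal", "python summarize.py"),
            ("text_browser", "competitor pricing 2025")]
    if len(state.context) >= len(plan):
        return None, None
    return plan[len(state.context)]

def run_agent(goal):
    state = AgentState(goal=goal)
    while True:
        tool_name, arg = choose_next_step(state)
        if tool_name is None:                      # model decides it is done
            return state.context
        result = TOOLS[tool_name](arg, state)
        state.context.append((tool_name, result))  # context persists across tools

print(run_agent("Compile a competitor brief and draft slides"))
```

The point of the sketch is the persistence: each tool call reads and extends the same context, which is what allows a connector lookup, a terminal transformation, and a spreadsheet export to chain together inside one task.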
Release timeline, availability, and how to turn it on
OpenAI announced the ChatGPT agent on July 17, 2025 and began rolling it out the same day. Pro, Plus, and Team users are included in the initial wave, with Enterprise and Education to follow in the weeks after launch. OpenAI also notes that it is still working on enabling access in the European Economic Area and Switzerland, so availability can vary by account in those regions. Crucially, OpenAI confirmed that Pro users get 400 agent messages per month, while Plus and Team users get 40 per month, with additional usage available via flexible, credit‑based options.
Turning it on is straightforward. In any ChatGPT conversation, open the tools menu and select Agent mode (or type /agent in the composer). Once enabled, you describe the task you want done; the agent performs the steps, pausing whenever it needs clarification or your authorization. OpenAI’s Help Center also clarifies how usage is counted: only user‑initiated messages that move the task forward—such as starting a task, interrupting it, or answering a blocking question—count against your monthly limit, while most intermediate confirmations do not.
OpenAI’s earlier research preview, Operator, has now been fully integrated into ChatGPT as part of agent mode. The standalone Operator site will remain available for a short transition window before it is sunset. If you preferred the old Deep Research behavior, you can still access it from the same tools dropdown, but the combined agent is now the default path for doing work across web, files, and connectors.
How the agent actually works in practice
The defining change is that the model now thinks and acts within a persistent, remote environment. As the agent proceeds, it narrates what it’s doing—opening pages, filtering results, downloading files—and you can interrupt or take over the browser at any time. If a login or payment is required, the agent asks you to take control in “takeover mode,” which keeps your inputs private and out of screenshots. For recurring work, you can schedule a completed task to repeat on a daily, weekly, or monthly cadence directly from the conversation.
OpenAI also built in privacy and control features tailored to agentic behavior. You can clear the agent’s browsing data and sign it out of all sites from ChatGPT settings. Connectors are read‑only, and the system asks for explicit confirmation before actions that carry real‑world consequences, such as submitting purchases or sending emails. Sensitive categories can trigger a “Watch mode,” which requires you to supervise while the agent works. These design choices aim to reduce the risks that come with giving an AI the ability to act in your accounts on the live web.
Benchmarks: what the early numbers say—and what they don’t
OpenAI reports that the model behind agent mode reaches state‑of‑the‑art performance on several evaluations tied to browsing, complex tasks, and spreadsheet editing. On Humanity’s Last Exam (HLE)—a demanding test spanning expert‑level questions across many fields—the agent achieves a pass@1 of 41.6, improving to 44.4 when OpenAI runs a simple parallel rollout that samples up to eight attempts and selects the output with the highest self‑reported confidence. On FrontierMath, a particularly unforgiving math benchmark with novel, unpublished problems, the agent records 27.4% accuracy with tool use. These are internal, OpenAI‑elicited results, and the company discloses methodological details, including oversight to deter “cheating” during browsing. Independent replication will take time, but the headline results indicate a step up versus earlier OpenAI models.
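The jump from 41.6 to 44.4 comes from a simple best‑of‑n strategy: run several rollouts in parallel and keep the answer the model itself is most confident in. Here is a minimal sketch of that selection rule; the rollout function is a stand‑in, not OpenAI’s evaluation harness.

```python
import concurrent.futures
import random

def run_rollout(question, seed):
    """Stand-in for one full agent attempt; a real harness would call the model.
    Returns (answer, self_reported_confidence in [0, 1])."""
    rng = random.Random(seed)
    return f"candidate answer #{seed} to {question!r}", rng.random()

def best_of_n(question, n=8):
    """Run up to n rollouts in parallel and keep the most confident answer."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(run_rollout, question, seed) for seed in range(n)]
        results = [f.result() for f in futures]
    # Selection uses the model's own confidence, not ground truth.
    return max(results, key=lambda pair: pair[1])

answer, confidence = best_of_n("An HLE-style question", n=8)
print(answer, round(confidence, 2))
```

The caveat is visible in the code: selection relies on self‑reported confidence rather than correctness, so the trick helps only to the extent that the model’s confidence is calibrated.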
Because agent mode excels at the web itself, OpenAI highlights BrowseComp, its April 2025 benchmark for browsing agents that measures the ability to find hard‑to‑locate information on the open internet. On BrowseComp, the agent posts a 68.9% score—17.4 percentage points higher than OpenAI’s earlier Deep Research agent—reflecting improvements in planning and web interaction. BrowseComp is publicly documented, with a paper and dataset describing the evaluation protocol.
The company also reports progress on WebArena, which measures success on realistic web tasks, and introduces results on DSBench, a research benchmark for data‑science agents where OpenAI says the agent surpasses human performance by a significant margin on the measured tasks. The DSBench paper itself, authored by researchers outside OpenAI, describes a suite of 540 realistic tasks across analysis and modeling; early results across systems suggest plenty of headroom remains for agents, even as top models improve. Treat OpenAI’s DSBench claims as promising but preliminary until more teams publish apples‑to‑apples results.
If you live in spreadsheets, one of the most concrete numbers is SpreadsheetBench. OpenAI reports that the agent scores 45.5% when allowed to edit spreadsheet files directly, versus 20.0% for Copilot in Excel in the benchmark authors’ comparison and a human baseline of 71.3%. OpenAI additionally publishes a breakdown table and notes environment differences with the benchmark authors, explaining that its evaluations used macOS and LibreOffice rather than Windows and Excel; a more detailed appendix shows results including an “agent with .xlsx” configuration. Either way, the results point to a meaningful jump over prior models on real‑world spreadsheet editing, even if humans remain well ahead.
Benchmarks should guide expectations but not replace hands‑on trials. OpenAI’s figures are mostly vendor‑run and include strong safeguards against data leakage, yet every organization’s workflows are different. The more your task mixes browsing, file transformation, and structured output, the more likely the agent’s strengths will show; the more your task requires niche tools or edge‑case UIs, the more you’ll want to keep a human in the loop while the ecosystem standardizes.
Safety, risk, and responsible use
Allowing an AI system to take actions raises risks beyond those posed by text‑only chat. OpenAI’s launch recaps mitigations from the Operator research preview and adds multi‑layer defenses for prompt‑injection attempts, where malicious instructions on a webpage try to trick the agent into leaking data or performing unintended actions. The agent requires explicit confirmation before consequential steps, uses “Watch mode” on sensitive sites like email, and defaults to takeover mode for passwords and logins so your inputs are never sent through the model. You can disable connectors you don’t need and clear cookies after sensitive sessions to minimize exposure.
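The confirmation and watch‑mode behaviors amount to a gating pattern that is easy to express in code. The sketch below illustrates the general pattern only; the action names and site categories are invented, and nothing here reflects OpenAI’s actual implementation.

```python
# Illustrative confirmation gate for consequential agent actions.
CONSEQUENTIAL = {"send_email", "submit_purchase", "post_form"}
WATCH_ONLY = {"webmail", "banking"}   # categories that require live supervision

def execute(action, site_category, ask_user, user_is_watching=False):
    # Consequential actions never run without an explicit yes from the user.
    if action in CONSEQUENTIAL and not ask_user(f"Allow '{action}'?"):
        return "blocked: user declined"
    # Sensitive sites only proceed while the user is actively watching.
    if site_category in WATCH_ONLY and not user_is_watching:
        return "paused: watch mode requires active supervision"
    return f"executed: {action}"

print(execute("send_email", "webmail",
              ask_user=lambda prompt: True, user_is_watching=True))
print(execute("submit_purchase", "retail", ask_user=lambda prompt: False))
```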
Because the agent can browse broadly and run code, OpenAI is treating this release as High capability in biological and chemical domains under its Preparedness Framework and published a dedicated system card detailing the threat model, refusal training, reasoning monitors, and enforcement pipelines. That doesn’t mean the model enables harmful outcomes; rather, OpenAI argues that the expanded toolset and higher reach warrant a stronger safety stack from day one. Organizations adopting agent mode should mirror that posture with their own layered controls, access reviews, and red‑teaming.
Competitors: how OpenAI’s agent compares to Anthropic, Microsoft, Google, Perplexity, and xAI
Anthropic offers Claude’s Computer Use, a screenshot‑based method for letting Claude see, click, and type on a remote desktop or browser. It has been available as a beta since October 22, 2024 and continues today with updated headers for newer models. The approach prioritizes general GUI control over deep application integrations, and developers enable it via a special beta flag in the API. This is closest to OpenAI’s visual browser capability, though Anthropic positions it primarily for developer‑led deployments rather than an end‑user chat toggle.
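For a sense of what developer‑led means in practice, this is roughly how Computer Use is enabled with Anthropic’s Python SDK, following the beta documented in October 2024; newer Claude models use updated tool versions and beta strings, so treat the identifiers below as a snapshot rather than current guidance.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Opt in to the Computer Use beta and describe the virtual display the model
# will control through screenshots, mouse movements, and keystrokes.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        },
    ],
    messages=[{"role": "user",
               "content": "Open example.com and summarize the page."}],
)

# The reply contains tool_use blocks (screenshots to take, clicks, keystrokes)
# that your own agent loop must execute and return as tool results.
print(response.content)
```

The contrast with ChatGPT’s agent mode is that the loop which actually captures screenshots and performs the clicks is yours to build and host; Anthropic supplies the model and the tool schema, not the end‑user environment.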
Microsoft is pushing agentic capabilities across several fronts. In Copilot Studio, organizations can build autonomous or semi‑autonomous agents and now publish those agents directly into Microsoft 365 Copilot, which Microsoft says is generally available. On the platform side, Microsoft has announced multi‑agent orchestration, Agent‑to‑Agent (A2A) connectivity, and broad updates tied to Build 2025, including the general availability of Azure AI Foundry Agent Service for developer‑grade agent orchestration. Microsoft is also rolling out domain‑specific “Researcher” and “Analyst” agents into Microsoft 365, signaling a push toward packaged, workflow‑ready agents embedded in everyday tools. Compared with OpenAI’s single end‑user agent, Microsoft’s approach has a stronger enterprise‑orchestration flavor and offers a marketplace route for distributing internal agents through Copilot.
Google frames its strategy around Agentspace, which is designed to help companies discover, create, and deploy agents through Google Cloud and to surface them in an Agent Marketplace. Google’s updates in spring 2025 emphasized lower‑code assembly of agents and closer ties to Workspace via Gemini, building toward an “agent‑driven enterprise.” While Google has demonstrated research agents and deep multimodal assistants, most of the Agentspace narrative so far is a platform story rather than head‑to‑head agent benchmarks like BrowseComp.
Perplexity takes a narrower but potent slice of the agent space with Deep Research, launched February 14, 2025. Deep Research performs dozens of searches and reads hundreds of sources autonomously to produce long‑form reports, with the company reporting strong early scores on HLE for that mode. Conceptually, Deep Research overlaps with the research side of OpenAI’s agent without the full “act on the web” toolset, making it an appealing option if your primary need is literature synthesis rather than form‑filling or spreadsheet editing.
Finally, xAI introduced Grok 4 in early July 2025, positioning it as a more broadly capable model with native tool use and real‑time search. While Grok 4 is relevant to agent discussions, xAI’s materials around the launch focus on model capability rather than detailed, task‑level web‑agent benchmarks comparable to BrowseComp or SpreadsheetBench, so it’s best seen as an upstream model competitor rather than a direct substitute for ChatGPT’s end‑user agent product at this moment.
What to watch next and how to get value fast
If you’re evaluating agent mode, start with bounded, high‑value tasks that combine research and action: weekly KPI rollups, competitor landscape checks that end in editable slides, or public‑data collection tasks that flow into a spreadsheet for audit. Use takeover mode for any credentials and keep non‑essential connectors disabled to limit the blast radius. For organizations with compliance needs, treat the agent like any other browser user: apply allowlisting where needed, monitor for odd behavior, and have a plan for clearing cookies and session state. OpenAI’s documentation makes it easy to clear saved logins when tasks are done. As adoption widens, expect stronger cross‑vendor standards and more third‑party verification for benchmarks, which should further clarify when to rely on autonomous steps and when to keep a human driver in the loop.
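Allowlisting does not need to be elaborate; an egress rule that admits only the domains a workflow genuinely needs goes a long way. A toy sketch of that check, with invented domains:

```python
from urllib.parse import urlparse

# Toy egress check: let the agent's browser reach only pre-approved domains.
# The domain list and policy are illustrative, not a recommendation.
ALLOWED_DOMAINS = {"sec.gov", "data.example.com", "intranet.example.com"}

def is_allowed(url):
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

for url in ["https://www.sec.gov/cgi-bin/browse-edgar",
            "https://pastebin.com/raw/abc123"]:
    print(url, "->", "allow" if is_allowed(url) else "block and log")
```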
One last practical tip is to experiment with scheduling. Because you can set a completed agent run to repeat automatically, workflows like “every Monday at 9 a.m., pull the week’s relevant SEC filings, extract changes in risk factors, and update the slide deck” become push‑button. This is where agent mode can cross the boundary from a helpful assistant to a genuine teammate that delivers a first draft you tidy up. If you’re in the EEA or Switzerland, check inside your account each week; OpenAI says it is still working on enabling access there, and staged rollouts can change quickly.
Bottom line: As of July 17, 2025, OpenAI’s ChatGPT agent mode is live for Pro, Plus, and Team users and combines web interaction, coding, connectors, and artifact generation into a single, supervised workflow. Early benchmarks—from HLE 41.6/44.4 and FrontierMath 27.4% to BrowseComp 68.9% and substantial SpreadsheetBench gains—suggest real‑world competence that moves beyond chat, even if human review remains essential. Competitors are converging from different directions: Anthropic with screenshot‑based Computer Use, Microsoft with enterprise‑first Copilot agents and orchestration, Google with platform‑level Agentspace, Perplexity with Deep Research, and xAI with a new frontier model in Grok 4. The practical advice is simple: start small, stay in control, and scale the wins you can supervise. This is the first broadly available chapter of agentic software—and it’s already useful.