Release at a glance

On July 28, 2025, Beijing-based Z.ai, the company formerly known as Zhipu AI, released GLM-4.5 as an open-source model family positioned specifically for agentic applications. The company framed the launch as part of a broader push to make high-end reasoning, coding, and tool use widely accessible, and multiple outlets confirmed both the open-source positioning and the Chinese origin of the release. Reuters’ coverage the same day placed the announcement in the context of China’s increasingly competitive LLM ecosystem and made clear that the model is intended to power intelligent agents rather than to serve as a generic chatbot. 

Z.ai’s own materials emphasize that the GLM-4.5 family is available as open weights alongside a hosted API and a web chat, so developers can either self-host or consume it as a service. In plain terms, this means you can download the weights and run them yourself, or call them via an OpenAI-style endpoint. The company’s technical blog post for the launch makes that availability explicit and positions GLM-4.5 as a unified model spanning reasoning, coding, and agentic behavior. 
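To make the "OpenAI-style endpoint" option concrete, here is a minimal sketch of building such a request with only the standard library. The base URL, model identifier, and API-key format are assumptions for illustration; check Z.ai's API documentation for the real values before sending anything.

```python
import json
import urllib.request

# Hypothetical values -- verify against Z.ai's API docs before use.
API_BASE = "https://api.z.ai/api/paas/v4"  # assumed base URL
MODEL = "glm-4.5"                          # assumed model id

def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("Summarize this incident report.", api_key="sk-...")
print(req.full_url)
```

Because the wire format follows the OpenAI convention, most existing OpenAI client libraries should also work by pointing their base URL at the provider's endpoint.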

What’s inside the GLM-4.5 family

GLM-4.5 is organized as a Mixture-of-Experts model line with two flagship variants. The larger GLM-4.5 uses a total parameter count of 355 billion with 32 billion “active” parameters per token, while GLM-4.5-Air targets a lighter footprint at 106 billion total and 12 billion active. Both models expose two operating modes—a “thinking” mode for deliberate multi-step reasoning and a “non-thinking” mode for fast responses—so teams can trade reasoning depth for latency when needed. Z.ai also highlights a 128K context window and native function calling, important for agents that must read long inputs and orchestrate tools. 
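The MoE numbers above are worth making concrete: only a small fraction of the weights are exercised per token, which is the whole efficiency argument. A quick back-of-envelope calculation using the figures from the text:

```python
# Total vs. active parameters per token, in billions (figures from the launch post).
variants = {
    "GLM-4.5":     {"total_b": 355, "active_b": 32},
    "GLM-4.5-Air": {"total_b": 106, "active_b": 12},
}

for name, v in variants.items():
    frac = v["active_b"] / v["total_b"]
    print(f"{name}: {v['active_b']}B of {v['total_b']}B parameters active "
          f"per token ({frac:.1%})")
```

Roughly 9% of GLM-4.5's weights and 11% of Air's are active on any given token, which is why per-token compute cost tracks the 32B/12B figures rather than the headline totals.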

Under the hood, Z.ai describes a pragmatic architecture tuned for agent workloads. The team says it adopted MoE with loss-free balance routing and sigmoid gates, deliberately “going deeper” rather than wider to improve reasoning; the attention stack incorporates Grouped-Query Attention with partial RoPE. In practice, that design choice aims to deliver better planning, chain-of-thought style problem-solving, and robust tool use without ballooning the active compute per token. 
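For readers unfamiliar with sigmoid gating, the following toy sketch shows the general idea of sigmoid-gated top-k expert routing: each expert's gate is scored independently (unlike softmax, where scores compete before selection), the top-k experts are chosen, and their gates are renormalized. The scores and k here are made up, and this is an illustration of the technique in general, not Z.ai's implementation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def route(scores: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Sigmoid-gated top-k routing: gate each expert independently,
    keep the k highest gates, renormalize over the chosen experts."""
    gates = [sigmoid(s) for s in scores]
    topk = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in topk)
    return [(i, gates[i] / norm) for i in topk]

# One token's (made-up) affinity scores for four experts:
print(route([1.2, -0.5, 0.3, 2.0], k=2))  # experts 3 and 0 win
```

The "loss-free balance" part of Z.ai's description refers to keeping expert load balanced without an auxiliary loss term; that machinery is outside the scope of this sketch.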

Benchmarks and early comparisons

Because agentic tasks span more than trivia or single-turn QA, Z.ai anchored its evaluation on twelve benchmarks across three areas: agentic behavior, reasoning, and coding. In its roll-up chart, the company places GLM-4.5 third overall among a mixed field of proprietary and open models, with GLM-4.5-Air landing sixth. While that headline needs independent replication, the underlying tables are instructive: on BrowseComp, a web-browsing benchmark that stresses tool use and multi-step reasoning, GLM-4.5 logs 26.4% with tools, above scores Z.ai reports for several closed models in their setup. The blog also lists results for TAU-bench (retail and airline domains) and BFCL-v3 for function calling, where GLM-4.5 matches or edges strong baselines in Z.ai’s runs. 

Coding and agentic coding are a second emphasis. Z.ai reports 64.2% on SWE-bench Verified and 37.5% on Terminal-Bench for GLM-4.5 under stated conditions, plus pairwise agent runs using Claude Code in which GLM-4.5 wins a majority of head-to-head comparisons against Kimi-K2 and outperforms Qwen3-Coder across 52 tasks, while remaining competitive with Claude 4 Sonnet. It also claims a 90.6% tool-calling success rate—useful for teams that rely on structured function execution and want predictable behavior when the model calls APIs. These are vendor-run numbers and should be read as directional, but they show where Z.ai believes GLM-4.5 is strongest. 

Pricing and availability

One reason GLM-4.5 has quickly drawn attention is price. Z.ai has said publicly that GLM-4.5 will undercut rivals like DeepSeek on cost, and third-party coverage repeated the “cheaper than DeepSeek” positioning following the Monday launch. Beyond marketing language, concrete prices are already visible on major inference platforms. Fireworks AI lists GLM-4.5 serverless pricing at roughly $0.55 per million input tokens and $2.19 per million output tokens, with 128K context and function calling. That aligns with aggregate trackers that peg GLM-4.5’s blended cost near the low end for frontier-class models. 

Regional providers are offering even lower rates, especially for the lighter Air variant. SiliconFlow’s announcement quotes $0.50 per million input and $2.00 per million output for GLM-4.5, while GLM-4.5-Air comes in at approximately $0.14 input and $0.86 output per million tokens. Z.ai’s own press materials also cite promotional pricing that reaches as low as $0.11 input and $0.28 output per million tokens in some configurations, underscoring how aggressive token economics have become for Chinese-origin models. Pricing will vary by provider, region, and promotional tier, so teams should verify for their geography and workload profile. 
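The rates quoted above make it easy to sanity-check a workload budget. The snippet below folds the listed prices into a simple estimator; the rates are the ones cited in this article and will drift, so treat them as placeholders to be replaced with your provider's current numbers.

```python
# (input, output) USD per million tokens, as quoted in the text; verify before use.
RATES = {
    "fireworks/glm-4.5":       (0.55, 2.19),
    "siliconflow/glm-4.5":     (0.50, 2.00),
    "siliconflow/glm-4.5-air": (0.14, 0.86),
}

def token_cost(provider: str, in_tokens: int, out_tokens: int) -> float:
    """Estimated spend in USD for a given token volume."""
    in_rate, out_rate = RATES[provider]
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# Example: an agent fleet consuming 200M input and 40M output tokens per month.
for provider in RATES:
    print(f"{provider}: ${token_cost(provider, 200_000_000, 40_000_000):,.2f}/mo")
```

Note how input-heavy agentic workloads (long contexts, short tool calls) benefit disproportionately from the low input rates, while chatty, generation-heavy workloads are dominated by the output price.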

Open source and licensing

Openness here is not a slogan; Z.ai has posted the weights publicly and attached permissive terms. On the Hugging Face model card for GLM-4.5, the license is listed as MIT, and the description clarifies that Z.ai has open-sourced the base models, the hybrid reasoning chat models, and FP8 variants for both GLM-4.5 and GLM-4.5-Air. In parallel, the code in the official GitHub repository is released under Apache-2.0. In practical terms, that split means the weights’ license and the repository’s code license are both permissive and commercially friendly, while not identical; organizations can self-host, adapt, and ship products built on these models without copyleft obligations, subject to the usual attribution and notice requirements. 

Z.ai’s blog reiterates that the models can be used via an OpenAI-compatible API, accessed in a browser chat, or downloaded as open weights. That multi-route distribution matters for enterprises that need to keep data local or align with specific compliance regimes, and it underscores a broader theme of this release: frontier-adjacent capability delivered in a form that developers can inspect and control. 

Running it yourself

Teams that prefer self-hosting will find familiar tooling and clear system guidance. Z.ai provides quick-start instructions for both vLLM and SGLang, including FP8 variants to maximize throughput on supported GPUs. The project’s README outlines reference hardware for “full-featured” inference and for unlocking the full 128K context length. As a representative example, the company lists GLM-4.5 FP8 runs on 8× H100 or 4× H200 for standard inference, with higher counts recommended to exercise the full context window, and it flags that servers should have more than one terabyte of RAM for smooth operation. Those details make capacity planning and cost modeling far easier for infrastructure teams considering on-prem or dedicated deployments. 
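A rough capacity check shows why those GPU counts are plausible. FP8 weights take about one byte per parameter; the sketch below compares weight size against aggregate HBM, deliberately ignoring KV cache, activations, and framework overhead, which is exactly why real deployments need the extra headroom (and extra GPUs for long contexts) that Z.ai recommends.

```python
# Back-of-envelope FP8 sizing. Weights only: 1 byte/param means
# 1 GB per billion parameters. KV cache and runtime overhead excluded.
H100_GB, H200_GB = 80, 141  # per-GPU HBM

def weights_gb(total_params_b: float, bytes_per_param: float = 1.0) -> float:
    return total_params_b * bytes_per_param

glm45 = weights_gb(355)  # ~355 GB of FP8 weights for GLM-4.5
print(f"GLM-4.5 FP8 weights: ~{glm45:.0f} GB")
print(f"8x H100 = {8 * H100_GB} GB HBM; 4x H200 = {4 * H200_GB} GB HBM")
```

Both recommended configurations (640 GB and 564 GB of HBM) comfortably hold the ~355 GB of weights, with the remainder available for KV cache; a 128K context at batch size greater than one eats that remainder quickly, which is why longer-context serving calls for more cards.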

Beyond raw serving, the repository notes integration hooks for popular ecosystems. The reasoning and tool parsers have implementations for Transformers, vLLM, and SGLang, and the documentation shows how to enable speculative decoding and configure thinking mode or tool calling. That’s important if you want to cap tool invocations for cost control or adjust decoding strategies to meet latency targets in production systems. 
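Capping tool invocations is worth a sketch, since it is the kind of guardrail teams add around any function-calling model. The loop below is a generic, framework-free illustration: the `model` and `tools` callables are stand-ins for your inference client and tool registry, not GLM-4.5's actual API.

```python
from typing import Callable

def run_agent(model: Callable, tools: dict, task: str,
              max_tool_calls: int = 5) -> str:
    """Generic agent loop that hard-caps tool invocations for cost control.
    `model` is a stand-in that returns ("final", text) or ("tool", name, args)."""
    history = [task]
    for _ in range(max_tool_calls):
        step = model(history)
        if step[0] == "final":
            return step[1]
        _, name, args = step
        history.append(f"{name} -> {tools[name](args)}")
    # Budget spent: ask the model to wrap up without further tool use.
    history.append("Tool budget exhausted; answer with what you have.")
    final = model(history)
    return final[1] if final[0] == "final" else "(no answer within tool budget)"

# Mock model: calls the search tool twice, then answers.
def mock_model(history):
    if sum("->" in h for h in history) < 2:
        return ("tool", "search", "glm-4.5 pricing")
    return ("final", f"done after {len(history)} messages")

tools = {"search": lambda q: f"results for {q}"}
print(run_agent(mock_model, tools, "Find GLM-4.5 pricing", max_tool_calls=3))
```

In production the same cap would sit around the model's native function-calling interface; the point is simply that the budget lives in your loop, not in the model.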

Why GLM-4.5 matters now

GLM-4.5 arrives at a moment when many teams are moving from single-prompt assistants to persistent agents that plan, browse, call internal systems, and produce code or artifacts. The combination of 128K context, strong vendor-reported scores on tool-heavy evaluations, and credible coding performance positions Z.ai’s models as practical building blocks for those workflows. Add in permissive licensing and open weights, and the package offers something that has often been missing near the frontier: the ability to audit, self-host, and fine-tune without a closed-platform tax. 

The Chinese provenance is also part of the story. Reuters framed GLM-4.5 as emblematic of the pace and scale of China’s LLM development, and Z.ai’s press materials explicitly tie the release to affordability and accessibility goals. For international buyers, that mix of openness and price pressure translates into option value: you can test an agent-ready model with frontier-level aspirations, keep deployment flexible across providers or your own hardware, and pressure-test your cost structure against a fast-moving market. As with any vendor-run benchmark, the right next step is a focused pilot using your own tasks and tool chains to validate quality, latency, and token spend.

The bottom line

GLM-4.5 is a notable open-source entry from a leading Chinese AI startup, released on July 28, 2025 with an explicit aim at agentic use cases. The family spans a large model and a cost-efficient “Air” variant, offers dual reasoning modes, 128K context, and native function calling, and posts strong vendor-reported results across web browsing, tool use, and coding. Pricing is aggressive on both global and regional platforms, and the licensing combination—MIT for weights, Apache-2.0 for code—gives enterprises latitude to deploy where and how they need. If your 2025 roadmap includes autonomous assistants, coding agents, or long-context knowledge work, GLM-4.5 deserves a place in your evaluation set. 

Sources used for this article: Z.ai’s launch blog and repository, the Hugging Face model card for licensing, Reuters for the release date and context, and provider pages for live pricing and availability.