The large-language-model market no longer looks like a duel; it is a tiered bazaar in which each vendor offers half-a-dozen specialised brains, ranging from chatty on-device helpers to one-million-token research engines. Price, latency, modality and safety posture vary wildly, so the art of “choosing a model” is now as strategic as choosing a database. What follows is a narrative tour of the current catalogue from the three biggest providers—OpenAI, Anthropic and Google—explaining, in prose rather than bullet points, where every model tends to excel and when it quietly disappoints.
OpenAI
GPT-4.1 — the million-token analyst
OpenAI’s April 2025 flagship is designed for tasks that need both length and rigour: reading a 500-page contract, running code-assisted data analysis, or acting as the lead agent in a multi-step workflow. Its context window stretches to one million tokens and early benchmarks show it outperforming GPT-4o on long-range retrieval while costing roughly a quarter less per token. The company has made it clear that 4.1, in its main, Mini and Nano guises, is the long-term successor to all older GPT-4 variants.
GPT-4o — real-time multimodal conversation
Released in May 2024 and now the default model in ChatGPT, GPT-4o fuses text, image and audio in a single network and streams replies quickly enough to power live voice chat or on-screen vision assistance. The model keeps the 128 k-token limit of its Turbo predecessor but adds native voice-to-voice and image reasoning, which makes it the natural choice for customer-facing chatbots, hands-free interfaces and design tools that need to “see” mock-ups.
GPT-4 Turbo (2024-04-09) — affordable long-context workhorse
Before 4.1 arrived, Turbo was the go-to engine for ingesting giant documents on a budget. It still offers 128 k-token inputs at a markedly lower price than GPT-4o, although developers must remember the 4 096-token ceiling on completions, which rules it out for marathon summaries.
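The practical consequence of a 4 096-token completion ceiling is that long summaries have to be assembled in passes: summarise each chunk, then summarise the summaries. Here is a minimal map-reduce sketch of that pattern; the `summarize` function is a stand-in for a real chat-completion call, and only the chunking and two-pass structure are the point.

```python
# Map-reduce summarisation that keeps every model call's output under a
# fixed completion cap. `summarize` is a placeholder for a real API call.

def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into pieces no longer than max_chars."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize(piece: str) -> str:
    # Stand-in for a model call whose output fits the completion ceiling;
    # here we just take a short extract.
    return piece[:60]

def map_reduce_summary(document: str, max_chars: int = 2000) -> str:
    # Map: summarise each chunk independently.
    partials = [summarize(c) for c in chunk_text(document, max_chars)]
    # Reduce: summarise the concatenated partial summaries in one final pass.
    return summarize(" ".join(partials))

print(map_reduce_summary("lorem ipsum " * 500))
```

In production the reduce step may itself need to recurse if the concatenated partials exceed the input window, but the two-level version covers most documents.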
GPT-3.5 Turbo — the everyday budget pick
With a 16 385-token window and the lowest per-token tariff on the platform, 3.5 Turbo remains popular for lightweight retrieval-augmented generation, prototyping and classroom projects where latency and cost matter more than absolute accuracy. Its fine-tuning endpoint, absent from the GPT-4 line, also makes it a gateway model for custom instruction following.
o3 — deliberate, tool-using reasoning
Introduced on 16 April 2025, o3 is marketed as OpenAI’s “frontier reasoning” system. Beyond its 128 k context, the defining feature is an ability to plan and call external tools—code execution, web search, file analysis—inside a single chain-of-thought. That deliberative style shines on multi-stage problems in coding, science and data exploration, especially when visual inputs such as charts must be examined along the way.
o3-mini — scalable deep thinking on a shoestring
Where full o3 prioritises capability, o3-mini offers up to 200 k tokens at roughly one-tenth the price, with a dial that lets developers switch between low, medium and high reasoning effort. It therefore fits high-volume RAG back-ends, mobile tutors and batch code review pipelines that need occasional surges of insight without always paying flagship rates.
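A common way to exploit that effort dial is to score each incoming request and reserve "high" effort for genuinely hard queries. The sketch below shows the idea; the complexity heuristic is purely illustrative, and the request payload merely mirrors the general shape of a chat request rather than reproducing any official schema.

```python
# Route requests to a reasoning-effort tier so only hard queries pay
# for "high" effort. The scoring heuristic is an illustrative assumption.

def pick_effort(prompt: str) -> str:
    words = len(prompt.split())
    if "prove" in prompt.lower() or words > 200:
        return "high"
    if words > 50:
        return "medium"
    return "low"

def build_request(prompt: str) -> dict:
    # Payload shape is a simplified stand-in for a chat request; the
    # effort field is the dial described above.
    return {
        "model": "o3-mini",
        "reasoning_effort": pick_effort(prompt),
        "messages": [{"role": "user", "content": prompt}],
    }

print(build_request("Summarise this ticket.")["reasoning_effort"])  # -> low
```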
Text-Embedding-3-Small (and Large) — the vector backbone
OpenAI’s third-generation embedding models leap ahead of the older Ada line in multilingual recall while remaining dramatically cheaper: the Small variant delivers 1 536-dimension vectors at one-fifth the price, making it the default choice for search, recommendation and retrieval pipelines that pair with any chat model.
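The retrieval pattern those vectors enable is simple: embed documents once, embed the query at request time, and rank by cosine similarity. Below is a toy version with hard-coded 3-dimensional vectors standing in for real 1 536-dimension embeddings, so the ranking logic stays visible.

```python
import math

# Toy nearest-neighbour search over embedding vectors. In production the
# vectors would come from an embeddings endpoint; these 3-D stand-ins
# keep the cosine-similarity ranking easy to follow.

corpus = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_match(query_vec):
    # Rank every document by similarity to the query and keep the best.
    return max(corpus, key=lambda doc: cosine(query_vec, corpus[doc]))

print(top_match([0.8, 0.2, 0.1]))  # -> refund policy
```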
DALL·E 3 — prompt-rewriting image generation
DALL·E 3 stands out for its built-in prompt expansion, which translates terse user descriptions into detailed painterly instructions. Marketing agencies, product designers and illustrators lean on it for storyboard-ready visuals, while developers appreciate the consolidated billing and policy harmony of staying inside the OpenAI stack.
Whisper large-v3 — robust multilingual ASR
The third revision of Whisper swaps to a 128-channel mel-spectrogram front end and widens language coverage, including Cantonese. It is prized for handling noisy recordings and mixed-speaker audio, so product teams often combine it with GPT-4o for turnkey voicebots.
TTS-1 and TTS-1-HD — lifelike speech synthesis
OpenAI’s text-to-speech line offers quick-response audio for conversational agents (TTS-1) and higher-fidelity narration suitable for podcasts or video voice-overs (TTS-1-HD). Six base voices can be steered for emotion and pacing, letting developers keep both sides of a voice experience inside one provider.
Omni-Moderation-Latest — the in-line safety firewall
Built on GPT-4o, the multimodal moderation model now screens both text and images against thirteen policy categories and shows large gains in low-resource languages, making it the recommended last mile before user display.
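Using moderation as a "last mile" means gating display on the flagged verdict rather than trusting the generator. The sketch below assumes a response shape of a `results` list carrying a `flagged` boolean and per-category booleans; the sample result is hard-coded for illustration.

```python
# Last-mile safety gate: inspect a moderation verdict before showing
# model output to users. The response dict is a hand-built sample whose
# shape (results -> flagged/categories) is an assumption for this sketch.

def safe_to_display(moderation_response: dict) -> bool:
    result = moderation_response["results"][0]
    return not result["flagged"]

sample = {
    "results": [{
        "flagged": False,
        "categories": {"harassment": False, "violence": False},
    }]
}

print(safe_to_display(sample))  # -> True
```

A stricter gate could also veto on individual category booleans even when the overall `flagged` field is false.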
Anthropic
Claude 3 Opus — maximum depth for expert work
Opus is the heavyweight of the Claude family. With its 200 k-token window and strong vision capabilities, it excels at tasks such as legal due-diligence sweeps, forensic code audits and data-room Q&A, where every footnote matters. Independent evaluations still place it neck-and-neck with GPT-4-class systems while charging a premium price that suits high-stakes workloads.
Claude 3.7 Sonnet — hybrid reasoning at a mid-market price
Released in February 2025, Sonnet introduces a “thinking budget” slider: the same endpoint can emit instant answers or reveal extended chain-of-thought when the user wants deeper analysis. At roughly one-fifth the cost of Opus it has become Anthropic’s volume workhorse, especially after Amazon Bedrock added the model on day one.
Claude 3 Haiku — speed first, pennies per million tokens
Haiku’s raison d’être is latency: sub-second first tokens and bulk throughput at $0.25 per million input tokens in its original release, or $0.80 in the refreshed “3.5” variant on Bedrock. Companies use it for classification, routing and first-pass summarisation before escalating edge cases to Sonnet or Opus.
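That escalation pattern is easy to express as a confidence-thresholded router: a cheap first-pass model labels each item, and anything below the threshold is handed to a stronger tier. The classifier below is a keyword stub standing in for a real Haiku call; the threshold and labels are illustrative assumptions.

```python
# Tiered triage sketch: cheap first pass with a confidence score,
# escalating low-confidence items to a pricier model tier.

def cheap_classify(text: str) -> tuple[str, float]:
    # Stand-in for a Haiku-class call: keyword match plus a confidence.
    if "invoice" in text.lower():
        return "billing", 0.95
    return "other", 0.40

def route(text: str, threshold: float = 0.8) -> str:
    label, confidence = cheap_classify(text)
    if confidence >= threshold:
        return f"haiku:{label}"
    # Hand the edge case to a deeper (more expensive) model.
    return "escalate:sonnet"

print(route("Please resend my invoice"))   # -> haiku:billing
print(route("Something weird happened"))   # -> escalate:sonnet
```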
Multimodal vision and native tool-calling across the line
Every Claude 3 member ingests up to twenty images per request and exposes structured function calls plus an Anthropic-hosted web search tool, which means developers can build agentic workflows without extra orchestration code.
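Structured function calling boils down to two artefacts on the developer's side: a tool definition (name, description, JSON-schema input) handed to the model, and a dispatcher that executes whatever tool the model elects to use. The weather tool and handler below are hypothetical examples of that shape, not part of any vendor SDK.

```python
# A structured tool definition plus a tiny dispatcher for tool-use
# replies. The get_weather tool and its handler are hypothetical.

get_weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Map tool names to local implementations.
HANDLERS = {"get_weather": lambda args: f"Sunny in {args['city']}"}

def dispatch(tool_name: str, tool_input: dict) -> str:
    # Route a tool-use request from the model to the matching local code.
    return HANDLERS[tool_name](tool_input)

print(dispatch("get_weather", {"city": "Oslo"}))  # -> Sunny in Oslo
```

The dispatcher's return value would normally be fed back to the model as a tool result so it can compose the final answer.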
No home-grown embeddings—and why that is intentional
Anthropic decided to leave vector semantics to specialists. The official docs point customers to Voyage AI or any third-party embedding vendor and provide cookbook guides for wiring Voyage into Claude-based RAG systems. The result is a simpler product surface at the cost of an extra dependency.
Constitutional AI — alignment baked into the training loop
Claude models refuse or redact content according to a fixed charter inspired by human-rights instruments, a method detailed in Anthropic’s “Constitutional AI” research. The approach still draws academic debate but has measurably raised jailbreak resistance in live tests.
Google Gemini
Gemini 2.5 Pro — the one-million-token reasoning engine
Now in preview on Vertex AI and Google AI Studio, Gemini 2.5 Pro offers the largest context window in the public Google stack alongside native image, audio and video input. Developers can even request the model’s “thinking traces,” a transparent chain-of-thought that aids debugging. Its pay-as-you-go rate undercuts GPT-4o for similar tasks.
Gemini 2.0 Flash — multimodal speed demon with image generation
Flash was built for live chat and low-latency RAG. It matches the one-million-token ceiling of Pro but opts for cheaper weights and adds native Imagen-class image synthesis, so a single call can read a slide deck and return both text and pictures. Grounding with Google Search and context caching sweeten the economics.
Gemini 2.0 Flash-Lite — ultra-high-volume cost cutter
The Lite variant halves Flash prices again, sacrificing a sliver of benchmark score to win bulk summarisation and classification jobs where every cent counts. Text-and-image input remains intact, making Lite a favourite for news clipping, inbox triage and other “always on” back ends.
Gemini 1.5 Pro — the two-million-token librarian
Although branded “legacy” for new projects, 1.5 Pro still owns the record for public context length at two million tokens and can transcribe nineteen hours of audio in one pass. Teams engaged in e-discovery or full-codebase analysis continue to rely on it, especially since context caching mitigates its higher base price.
Gemini 1.5 Flash — instant summaries after a 78 % price cut
Google’s August 2024 price reset transformed 1.5 Flash into one of the cheapest long-context models on the market, now starting at $0.075 per million input tokens. Latency and price make it the go-to for automated meeting summaries, PDF condensation and multilingual content tagging.
Gemini Advanced (1.0 Ultra) — consumer-grade power in a subscription
For $19.99 a month via Google One AI Premium, power users gain continuous access to Gemini Advanced with longer context limits than the free web version, plus Workspace integration. The plan bundles 2 TB of storage and NotebookLM Plus, turning Google’s consumer tier into a mini developer stack.
Gemini Nano — private on-device assistance
Distilled to run inside Android’s AICore service, Nano powers offline smart-reply, proofreading and recorder summaries on Pixel phones. Developers can experiment through the AI Edge SDK, bringing basic LLM capability to mobile apps without any server bill or data egress.
Gemini-Embedding-EXP-03-07 — state-of-the-art vectors
Google’s first Gemini embedding model entered public preview in March 2025. It tops multilingual MTEB leaderboards with a default 3 072-dimension output and drop-in compatibility with the Vertex AI embedding endpoint, giving Google’s stack a long-awaited answer to OpenAI’s V3 embeddings.
A last word on choosing wisely
The safest rule is to match model strength to problem hardness and scale, rather than reaching for the biggest brain on the shelf. Use Haiku, 1.5 Flash or GPT-3.5 Turbo when speed and cost dominate. Call on GPT-4.1, Opus or Gemini 2.5 Pro when error budgets are tiny and documents are huge. Remember that embeddings, moderation and speech components often matter as much as the chat model itself, and that swapping a single tier can cut or multiply your bill by an order of magnitude. In 2025 the menu is rich; the craft lies in ordering exactly what your application can digest—and no more.
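The order-of-magnitude claim is easy to check with back-of-the-envelope arithmetic using the input-token prices quoted above (dollars per million tokens). Output-token and feature fees are ignored, so treat these as floor estimates.

```python
# Rough monthly input-token bill at three price points quoted in the
# text ($ per million input tokens); output tokens and other fees are
# deliberately ignored.

PRICE_PER_M = {
    "gemini-1.5-flash": 0.075,
    "claude-3-haiku": 0.25,
    "claude-3.5-haiku": 0.80,
}

def monthly_cost(model: str, tokens_per_day: int, days: int = 30) -> float:
    return PRICE_PER_M[model] * tokens_per_day * days / 1_000_000

# At 50 M input tokens a day, the cheapest and priciest of even these
# budget tiers differ by more than 10x.
for m in PRICE_PER_M:
    print(m, round(monthly_cost(m, tokens_per_day=50_000_000), 2))
```

At that volume the spread runs from roughly $112 to $1,200 a month before a flagship model ever enters the picture, which is exactly why tier selection deserves the same scrutiny as any other infrastructure choice.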