Quarterly benchmark · Q2 2026 · Pricing inventory

State of LLM Pricing for Finance Q2 2026

A pricing-grade inventory of every finance-relevant LLM tier from Anthropic, OpenAI, and Google as of 2026-04-23. Every figure is sourced from a vendor pricing page, with four worked finance cost scenarios computed from the master table.

TL;DR

This release is a pricing-grade snapshot of every finance-relevant LLM tier as of 2026-04-23, sourced line-by-line from vendor pricing pages. It lists input, output, cache, batch, context-window, and thinking figures for Anthropic's Claude Haiku 4.5, Sonnet 4.6, Opus 4.6, and Opus 4.7; OpenAI's live GPT-5.4 family (base, mini, nano); and Google's Gemini 2.5 Flash, Pro, and Flash-Lite. Four worked finance scenarios show per-provider monthly cost at published rates: nightly SEC-filing triage, quarterly earnings-call summarization, multi-document peer-comparison synthesis, and an intraday research loop. No quality or accuracy measurements are included here; those require running a task-specific eval. The purpose is traceable inventory, not ranking.

Methodology

Every figure in this release traces to a vendor pricing page pulled on 2026-04-23. No quality scores, latency measurements, or accuracy numbers are reported. Running an eval against your own task is the only defensible way to rank models on output quality.

The sources are:

  • Anthropic: docs.anthropic.com/en/docs/about-claude/pricing[1] and the model-overview page for context windows and max output tokens[2].
  • OpenAI: developers.openai.com/api/docs/pricing[3] and per-model spec pages for GPT-5.4, GPT-5.4 mini, and GPT-5.4 nano[4].
  • Google: ai.google.dev/pricing[5] plus Vertex AI model pages for Gemini 2.5 Flash, Pro, and Flash-Lite context windows[6].

Where a vendor page does not publish a figure (for example, OpenAI's main pricing page does not republish legacy GPT-5 / o3 / o4-mini base rates as of 2026-04-23), the field is recorded as null in the accompanying data.csv and data.json. Historical rates are not reconstructed from memory or third-party pages.

Prices are in USD per 1M tokens unless stated otherwise. Batch API discounts are documented as 50 percent across the three providers; prompt caching uses each vendor's published multiplier relative to base input. The four worked scenarios at the end are computed arithmetically from the master table. A reader can reproduce every figure from the input and output rates alone.
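Concretely, every scenario figure below reduces to one line of arithmetic. A minimal sketch (a hypothetical helper, not vendor tooling):

```python
# Cost arithmetic used by every scenario in this release. Rates are USD
# per 1M tokens ("MTok"), taken straight from the master table.
def scenario_cost(input_mtok: float, output_mtok: float,
                  input_rate: float, output_rate: float,
                  batch: bool = False) -> float:
    """Total cost in USD; the 50 percent batch discount halves both rates."""
    cost = input_mtok * input_rate + output_mtok * output_rate
    return cost / 2 if batch else cost
```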

Pricing changes. Vendors ship price cuts and new tiers on a monthly or quarterly cadence. The snapshot is correct on the as-of date; readers running material workloads should re-pull the vendor pages before any spend decision.

Master pricing table

All figures pulled 2026-04-23. "MTok" means per million tokens. Empty cells indicate the figure is not published on the vendor pricing page.

Anthropic (Claude family)

Model | Input / MTok | Output / MTok | Cache read / MTok | 5m cache write / MTok | 1h cache write / MTok | Batch | Context | Max output | Thinking
Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | $1.25 | $2.00 | 50% | 200K | 64K | Extended only
Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | $3.75 | $6.00 | 50% | 1M | 64K | Extended + adaptive
Claude Opus 4.6 | $5.00 | $25.00 | $0.50 | $6.25 | $10.00 | 50% | 1M | 128K | Extended
Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | $6.25 | $10.00 | 50% | 1M | 128K | Adaptive only

Cache multipliers across the Claude family are constant: read at 0.1x input, 5-minute write at 1.25x input, 1-hour write at 2x input[1]. Opus 4.7 ships a new tokenizer that the vendor notes "may use up to 35 percent more tokens for the same fixed text" compared to prior Claude models[1]. That surcharge is invisible in the per-token price but shows up on the per-call bill, and it belongs in any budget that migrates from Opus 4.6 to 4.7.
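To size that risk before migrating, a sketch of the worst-case delta at scenario-A volumes. It assumes the vendor's 1.35x upper bound and that only input tokens inflate (output length is answer-driven); both are assumptions, not published figures:

```python
# Worst-case Opus 4.6 -> 4.7 migration delta from the tokenizer change.
# Assumption: the same fixed text tokenizes to up to 1.35x the prior
# count, and only the input side inflates.
OPUS_IN, OPUS_OUT = 5.00, 25.00   # USD per MTok, identical on 4.6 and 4.7
INFLATION = 1.35                  # vendor-stated "up to 35 percent" bound

def migration_delta(input_mtok: float, output_mtok: float) -> float:
    before = input_mtok * OPUS_IN + output_mtok * OPUS_OUT
    after = input_mtok * INFLATION * OPUS_IN + output_mtok * OPUS_OUT
    return after - before

print(migration_delta(2100, 42))  # scenario A volumes: up to $3,675/month extra
```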

OpenAI (current GPT-5.4 family)

Model | Input / MTok | Cached input / MTok | Output / MTok | Batch | Context | Max output | Reasoning
GPT-5.4 | $2.50 | $0.25 | $15.00 | 50% | 1.05M | 128K | Yes (effort levels none/low/medium/high/xhigh)
GPT-5.4 mini | $0.75 | $0.075 | $4.50 | 50% | 400K | 128K | Yes
GPT-5.4 nano | $0.20 | $0.02 | $1.25 | 50% | 400K | 128K | Yes

GPT-5.4 has a pricing-tier boundary at 272K input tokens: prompts above that threshold are billed at 2x input and 1.5x output for the full session[3]. In practice that flips the effective per-token cost from $2.50/$15.00 to $5.00/$22.50 on any 273K-token prompt, which matters for 10-K digests that spill past the break. Legacy base models (GPT-5, o3, o4-mini) are no longer on the main pricing page as of the snapshot date; only the batch-only deep-research variants (o3-deep-research at $5.00/$20.00 and o4-mini-deep-research at $1.00/$4.00) remain listed[3]. Teams still running against those older model IDs should confirm with OpenAI whether legacy rates persist on their account.
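The boundary is easy to encode as a guard in a budget model; an illustrative sketch (a hypothetical helper, not an OpenAI API):

```python
# GPT-5.4 effective rates by prompt size. Prompts above the 272K input
# boundary bill the full session at 2x input / 1.5x output.
GPT54_IN, GPT54_OUT = 2.50, 15.00  # USD per MTok, standard tier
BOUNDARY = 272_000                 # input tokens

def gpt54_rates(prompt_tokens: int) -> tuple[float, float]:
    if prompt_tokens > BOUNDARY:
        return GPT54_IN * 2.0, GPT54_OUT * 1.5  # $5.00 / $22.50
    return GPT54_IN, GPT54_OUT

print(gpt54_rates(250_000))  # (2.5, 15.0)
print(gpt54_rates(273_000))  # (5.0, 22.5)
```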

Google (Gemini 2.5 family)

Model | Input / MTok | Output / MTok | Cache read / MTok | Batch | Context | Max output | Thinking
Gemini 2.5 Flash | $0.30 | $2.50 | $0.03 | 50% | 1,048,576 | 65,535 | Yes (included in output)
Gemini 2.5 Pro (prompts <= 200K) | $1.25 | $10.00 | $0.125 | 50% | 1,048,576 | 65,535 | Yes
Gemini 2.5 Pro (prompts > 200K) | $2.50 | $15.00 | $0.25 | 50% | 1,048,576 | 65,535 | Yes
Gemini 2.5 Flash-Lite | $0.10 | $0.40 | $0.01 | 50% | 1,048,576 | 65,535 | Yes

Google publishes a two-tier structure on Gemini 2.5 Pro that mirrors the OpenAI >272K premium: past the 200K input-token line, the per-token rate doubles on input and scales 1.5x on output[5]. Audio input carries a separate, higher rate on Flash ($1.00/MTok) and Flash-Lite ($0.30/MTok); the figures above track the text / image / video rate, which is what nearly all finance workloads hit. Cache storage is billed at $4.50 per 1M tokens per hour on Pro versus $1.00 per 1M tokens per hour on Flash, a 4.5x gap[5]. Cache strategy therefore depends on tier selection as much as on hit rate.
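Cache storage is the one Gemini line item billed by time rather than per call; a sketch of the arithmetic (the 500K-token filing set and the 8-hour hold are illustrative assumptions):

```python
# Gemini context-cache storage: billed per 1M tokens per hour, with a
# 4.5x gap between the Pro and Flash storage rates.
STORAGE_PER_MTOK_HOUR = {"2.5-pro": 4.50, "2.5-flash": 1.00}  # USD

def storage_cost(cached_tokens: int, hours: float, tier: str) -> float:
    return cached_tokens / 1e6 * hours * STORAGE_PER_MTOK_HOUR[tier]

# A 500K-token filing set held in cache for an 8-hour trading day:
print(storage_cost(500_000, 8, "2.5-pro"))    # $18.00
print(storage_cost(500_000, 8, "2.5-flash"))  # $4.00
```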

Worked scenario A: nightly SEC-filing triage

Setup: 500 10-Ks per night, 4 fact-extraction questions per filing, 50,000 input tokens per question (the filing plus extraction prompt), 1,000 output tokens per answer. 21 trading days per month.

Volume: 500 x 4 x 21 = 42,000 queries/month. 2,100 MTok input, 42 MTok output.
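Each row in the table that follows is the same one-line arithmetic applied to these two volumes; a sketch reproducing four of the rows (rates from the master table):

```python
# Reproducing scenario A rows from the master table.
IN_MTOK, OUT_MTOK = 2100, 42  # 42,000 queries at 50K in / 1K out
RATES = {  # (input, output) in USD per MTok
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "GPT-5.4 nano":          (0.20, 1.25),
    "Claude Haiku 4.5":      (1.00, 5.00),
    "Claude Opus 4.7":       (5.00, 25.00),
}
for model, (r_in, r_out) in RATES.items():
    std = IN_MTOK * r_in + OUT_MTOK * r_out
    print(f"{model}: ${std:,.2f} standard, ${std / 2:,.2f} batch")
```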

Model | Standard monthly cost | Batch monthly cost
Gemini 2.5 Flash-Lite | $226.80 | $113.40
GPT-5.4 nano | $472.50 | $236.25
Gemini 2.5 Flash | $735.00 | $367.50
GPT-5.4 mini | $1,764.00 | $882.00
Claude Haiku 4.5 | $2,310.00 | $1,155.00
Gemini 2.5 Pro (<= 200K) | $3,045.00 | $1,522.50
GPT-5.4 | $5,880.00 | $2,940.00
Claude Sonnet 4.6 | $6,930.00 | $3,465.00
Claude Opus 4.7 | $11,550.00 | $5,775.00

Across the board, the spread between the cheapest and most expensive option on a triage workload is roughly 50x. A triage pipeline that uses Opus 4.7 to read every 10-K sentence burns an order of magnitude more cash than a pipeline that uses Flash-Lite or nano for the fact-extraction pass and escalates to a stronger model only when confidence is low. The Token-Cost Optimizer and the Financial Document Token Estimator both accept the same input shape as the table above, so the numbers can be retargeted to a custom universe size or question count without leaving the site.

Caching changes the picture sharply. If the 10-K body is cached as a prompt prefix on the first question and reused across the remaining three questions per filing, Anthropic's 1.25x cache-write and 0.1x cache-read multipliers drop Haiku 4.5's monthly input cost from $2,100 to roughly $814, which cuts the total monthly bill from $2,310 to about $1,024. Details on that calculation belong in Prompt Caching Economics for Finance LLMs.
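A minimal sketch of that arithmetic, assuming the full 50K-token prompt is cache-written once per filing at the 5-minute rate and the per-question suffix is negligible:

```python
# Scenario A on Claude Haiku 4.5 with prompt caching. Assumptions: the
# 50K-token filing prompt is cache-written on question 1 (1.25x input)
# and cache-read on questions 2-4 (0.1x input); question text itself is
# negligible next to the filing body.
FILINGS = 500 * 21              # filings per month
PROMPT_MTOK = 0.05              # 50K tokens
IN_RATE, OUT_RATE = 1.00, 5.00  # Haiku 4.5, USD per MTok

write  = FILINGS * PROMPT_MTOK * IN_RATE * 1.25      # $656.25
reads  = FILINGS * 3 * PROMPT_MTOK * IN_RATE * 0.10  # $157.50
output = FILINGS * 4 * 0.001 * OUT_RATE              # $210.00
print(write + reads + output)  # ~$1,024 vs $2,310 uncached
```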

Worked scenario B: quarterly earnings-call summarization

Setup: 20 tickers, one earnings call per quarter, 50,000 input tokens per call (transcript plus context), 3,000 output tokens (structured summary). Quarterly volume: 20 calls, 1 MTok input, 0.06 MTok output.

Model | Per-quarter cost
Gemini 2.5 Flash-Lite | $0.12
GPT-5.4 nano | $0.28
Gemini 2.5 Flash | $0.45
GPT-5.4 mini | $1.02
Claude Haiku 4.5 | $1.30
Gemini 2.5 Pro (<= 200K) | $1.85
GPT-5.4 | $3.40
Claude Sonnet 4.6 | $3.90
Claude Opus 4.7 | $6.50

This workload is the classic case where token cost is a rounding error and output quality dominates the decision. A practitioner running a small-universe earnings-call loop faces a total quarterly bill of under $7 on every provider listed; the difference between the cheapest and most expensive option on the board is $6.38 per quarter. Choosing on price alone is a category error at this scale. The only reason to care about these numbers is to show how the absolute dollars scale into scenarios A, C, and D, where the same per-token spread compounds into five- or six-figure monthly variance.

Note the 50,000-token call transcript sits comfortably inside every model's context window, so the Gemini 2.5 Pro prompt stays in the <= 200K tier and the GPT-5.4 prompt stays under the 272K premium boundary. Scaling from 20 tickers to 2,000 multiplies each figure by 100 and keeps the per-prompt token budget identical.

Worked scenario C: multi-document peer-comparison synthesis

Setup: 5 filings fed into a single call, 125,000 input tokens per run, 2,000 output tokens. Run daily across 21 business days. Monthly volume: 2.625 MTok input, 0.042 MTok output.

Model | Standard monthly cost | Batch monthly cost
Gemini 2.5 Flash-Lite | $0.28 | $0.14
GPT-5.4 nano | $0.58 | $0.29
Gemini 2.5 Flash | $0.89 | $0.45
GPT-5.4 mini | $2.16 | $1.08
Claude Haiku 4.5 | $2.84 | $1.42
Gemini 2.5 Pro (<= 200K) | $3.70 | $1.85
GPT-5.4 | $7.19 | $3.60
Claude Sonnet 4.6 | $8.51 | $4.25
Claude Opus 4.7 | $14.18 | $7.09

At 125,000 input tokens, the prompt fits inside the <= 200K Gemini Pro tier and inside the < 272K GPT-5.4 tier, so the cheaper per-token rate applies for both. Push the peer set from 5 to 10 filings and per-run input becomes 250,000 tokens, which crosses the Gemini Pro 200K boundary (input doubles, output scales 1.5x) while still sitting under the GPT-5.4 272K boundary; an eleventh filing pushes input to 275,000 tokens and triggers the GPT-5.4 premium (2x input, 1.5x output) on the full session. A practitioner who designs the synthesis step to sit just under those boundaries extracts a real price discount per call; a practitioner who lets input drift over a boundary surprises the bill reviewer.
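The boundary arithmetic, sketched under the scenario's 25K-tokens-per-filing assumption:

```python
# Per-run cost as the peer set grows past the pricing boundaries.
def run_cost(in_tok: int, out_tok: int, r_in: float, r_out: float) -> float:
    return in_tok / 1e6 * r_in + out_tok / 1e6 * r_out

# 5 filings (125K input): under both boundaries
print(run_cost(125_000, 2_000, 1.25, 10.00))  # Gemini Pro <=200K: ~$0.18
# 10 filings (250K): over Gemini's 200K line, still under GPT's 272K
print(run_cost(250_000, 2_000, 2.50, 15.00))  # Gemini Pro >200K:  ~$0.66
# 11 filings (275K): the GPT-5.4 premium now applies to the whole session
print(run_cost(275_000, 2_000, 5.00, 22.50))  # GPT-5.4 premium:   ~$1.42
```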

On Claude Sonnet 4.6 and Opus 4.7, a 125K-token prompt is inside the 1M-token standard tier with no pricing kink[1]. Anthropic does not publish a mid-context premium on Sonnet 4.6 or Opus 4.7; the per-token rate is flat to the 1M context ceiling. That flat curve is the strongest single argument for these two models on long-context synthesis workloads: the bill scales linearly with tokens and the budget model stays simple.

Worked scenario D: intraday agent research loop

Setup: 10 research ideas per day, 8,000 input tokens per idea, 1,000 output tokens. Monthly volume over 30 days: 2.4 MTok input, 0.3 MTok output.

Model | Standard monthly cost | Batch monthly cost
Gemini 2.5 Flash-Lite | $0.36 | $0.18
GPT-5.4 nano | $0.86 | $0.43
Gemini 2.5 Flash | $1.47 | $0.74
GPT-5.4 mini | $3.15 | $1.58
Claude Haiku 4.5 | $3.90 | $1.95
Gemini 2.5 Pro (<= 200K) | $6.00 | $3.00
GPT-5.4 | $10.50 | $5.25
Claude Sonnet 4.6 | $11.70 | $5.85
Claude Opus 4.7 | $19.50 | $9.75

Agent loops with tight per-idea budgets are the tier where batch becomes awkward: if the research loop is real-time (fire on signal, consume a recommendation within minutes), batch is not available, because the 50 percent discount applies to asynchronous submissions with 24-hour SLAs. For an intraday loop the standard column is the one that applies; the batch column is a reference for off-hours overnight re-scoring.

The absolute monthly figures for a 10-idea-per-day loop run from under $1 to under $20 across the full board, which is why cost is rarely the binding constraint at this scale. The binding constraint is reasoning quality on the ideas that actually matter. The Agent Cost Envelope Calculator models scenario D at arbitrary idea volume and context size. The Batch vs Realtime Cost Calculator is built for the boundary case where the loop could conceivably run in a nightly batch.

Cross-cutting observations

Four structural patterns hold across the three providers at published rates on 2026-04-23.

Batch discount is a flat 50 percent across the board. Anthropic's batch-processing table halves both input and output per-MTok rates on every Claude 4.x model[1]. OpenAI's pricing page states a 50 percent reduction on both dimensions for every model in the main table[3]. Google publishes explicit batch-row rates for Gemini 2.5 Flash, Pro, and Flash-Lite that are exactly half the standard rate[5]. There is no provider differentiation on the batch magnitude itself; the differentiation is whether the workload tolerates 24-hour async turnaround.

Anthropic's 10x cache-read discount flips the crossover with OpenAI at high hit rates. Uncached, Sonnet 4.6 at $3 input is more expensive per million input tokens than GPT-5.4 mini at $0.75 or Flash at $0.30. With a 120K-token system prompt cached once and read 60 times in a five-minute window, the effective input rate on Sonnet 4.6 collapses to roughly $0.36 per million tokens (one write at $3.75 plus 59 reads at $0.30, amortised over 60 calls). That brings Sonnet's input cost into Gemini 2.5 Flash territory on cache-heavy agent workloads. OpenAI discounts cached input by the same 10x ($0.25 on GPT-5.4, $0.075 on mini, $0.02 on nano[3]), but does not publish a cache-write / long-duration cache structure comparable to Anthropic's 5-minute / 1-hour split. Caching on OpenAI is implicit and opportunistic, not under developer control in the same way.
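The amortisation reduces to a one-line formula; a sketch:

```python
# Effective per-MTok input rate for a prefix cached once and read on the
# remaining calls inside the cache window (Anthropic 5-minute cache:
# 1.25x write, 0.1x read, relative to base input).
def effective_input_rate(base: float, n_calls: int,
                         write_mult: float = 1.25,
                         read_mult: float = 0.10) -> float:
    return (write_mult + (n_calls - 1) * read_mult) * base / n_calls

print(effective_input_rate(3.00, 60))  # Sonnet 4.6: ~$0.36 per MTok
print(effective_input_rate(1.00, 60))  # Haiku 4.5:  ~$0.12 per MTok
```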

Context-window frontier. Gemini 2.5 Pro, Flash, and Flash-Lite all publish a 1,048,576-token context window and a 65,535-token max output[6]. Claude Sonnet 4.6, Opus 4.6, and Opus 4.7 publish 1M-token context at standard pricing[2]; Haiku 4.5 stays at 200K. GPT-5.4 publishes a 1.05M-token window with a 272K pricing-tier boundary[4]; GPT-5.4 mini and nano land at 400K. For finance workloads that need to fit an entire 10-K plus peer filings into a single prompt, the 1M-token tier rules out only Haiku 4.5 (200K) and the GPT-5.4 mini and nano pair (400K); every other model on the board qualifies, with GPT-5.4 crossing its 272K surcharge boundary once the full filing set lands.
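Filtering the board by published window is mechanical. A sketch using the spec-page figures (the 800K-token requirement is illustrative):

```python
# Which published context windows fit a full filing set in one prompt?
CONTEXT = {  # input-token windows from the vendor spec pages
    "Claude Haiku 4.5":    200_000,
    "Claude Sonnet 4.6": 1_000_000,
    "Claude Opus 4.7":   1_000_000,
    "GPT-5.4":           1_050_000,
    "GPT-5.4 mini":        400_000,
    "GPT-5.4 nano":        400_000,
    "Gemini 2.5 Pro":    1_048_576,
}
NEEDED = 800_000  # e.g. a 10-K digest plus peer filings
print(sorted(m for m, w in CONTEXT.items() if w >= NEEDED))
```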

Thinking-token tax. Anthropic bills extended and adaptive thinking at the output rate. Haiku 4.5 thinking tokens cost $5 per MTok, Sonnet 4.6 costs $15, Opus 4.6 costs $25[1]. OpenAI's reasoning effort on GPT-5.4 is billed similarly: reasoning tokens count against output tokens at $15/MTok[4]. Google's Gemini 2.5 family bundles thinking into the output-price column: "Output price (including thinking tokens)" is what the pricing page publishes[5]. Across all three providers, turning thinking on scales output-column tokens, not input-column tokens, and the easy mistake is to budget a workload at the output rate times expected answer length. Reasoning-heavy prompts can 10x output usage without changing the visible answer length. Detailed treatment lives in Thinking Tokens on Finance Tasks and Batch API Economics for Finance LLMs.
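The mistake is concrete enough to show in two lines; a sketch at Sonnet 4.6's $15/MTok output rate, with an illustrative 10K-token reasoning trace:

```python
# Thinking tokens bill at the output rate on all three providers, so the
# output column scales with reasoning depth, not visible answer length.
def output_cost(answer_tok: int, thinking_tok: int, out_rate: float) -> float:
    return (answer_tok + thinking_tok) / 1e6 * out_rate

print(output_cost(1_000,      0, 15.00))  # $0.015: answer only
print(output_cost(1_000, 10_000, 15.00))  # $0.165: 11x, same visible answer
```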

Which tier wins for which workload

Mapped from the four worked scenarios against the master table. "Wins" here means lowest total cost at published rates for the given input and output volumes; it does not mean best output quality, which is not measured in this release.

Workload archetype | Cheapest at published rates | Runner-up
High-volume triage (scenario A) | Gemini 2.5 Flash-Lite | GPT-5.4 nano
Low-volume summarization (scenario B) | Gemini 2.5 Flash-Lite | GPT-5.4 nano
Mid-context synthesis under 200K (scenario C) | Gemini 2.5 Flash-Lite | GPT-5.4 nano
Real-time agent loop (scenario D) | Gemini 2.5 Flash-Lite | GPT-5.4 nano
Long-context synthesis (>200K input) | Gemini 2.5 Pro ($2.50/$15.00 in the >200K tier) | Claude Sonnet 4.6 ($3.00/$15.00, flat to 1M)
Cache-heavy repeated-prompt loop | Claude Sonnet 4.6 (1.25x write / 0.1x read) | GPT-5.4 mini (0.1x cached input)
Deep-reasoning escalation tier | GPT-5.4 (at $2.50/$15) | Claude Opus 4.7 (at $5/$25)

Two caveats sit on top of this table. First, a triage-tier model's price advantage is irrelevant if it hallucinates on a filing excerpt and a downstream step acts on the output, which is the reason the Hallucination Detector and the Price-Blind Auditor exist. Second, the "cheapest" column assumes workloads that produce deterministic token counts; agent loops with variable output lengths need probabilistic cost envelopes, not point estimates. The Model Selector for Finance walks through both modifiers.

Caveats and limitations

This release is a pricing inventory, not a performance ranking or a buy-side recommendation. A small operator choosing a model for a finance workload needs four inputs: per-token cost, output quality on the specific task, latency, and rate-limit ceiling. This document covers only the first axis, and only the published surface of that axis.

Vendor pricing changes on a monthly and quarterly cadence. OpenAI's main pricing page has already rolled through GPT-5 to GPT-5.4 in the current cycle, with base GPT-5, o3, and o4-mini rates no longer republished as of 2026-04-23[3]. Anthropic ships per-model pricing changes alongside new model releases (Opus 4.7 and Opus 4.6 at the same $5/$25 rate is a deliberate hold; Sonnet 4.6 matching Sonnet 4.5 is deliberate too[1]). Google's two-tier Gemini 2.5 Pro structure at the 200K line is a 2026 addition that reflects the compute cost of serving past the 200K boundary, and a similar tier may land on other models.

Quality is not measured here. Two models at the same per-token rate can differ by a factor of two on the fraction of 10-K answers that are factually correct, which is the only question that matters for any finance agent wired into an execution loop. The correct way to measure that is an eval harness run against a labeled task set on the caller's own data; a minimal reference implementation lives in Eval Harness for Finance LLMs. Quality rankings published without an eval attached are marketing, not evidence.

Enterprise pricing differs from the self-serve rates tabled above. Anthropic and OpenAI both publish volume-discount language on their pricing pages and direct high-volume buyers to sales[1][3]. The self-serve rates in the master table are the correct anchor for budget modeling at small-operator and mid-team scale; above a monthly spend on the order of $10,000 to $50,000, a re-negotiation conversation is standard and the rates in the table are starting points, not floors.

Finally, cache and batch multipliers stack with other pricing modifiers (data-residency premiums, fast-mode premiums on Opus 4.6, third-party platform pricing on Bedrock and Vertex AI[1]). A Claude Opus 4.6 call through Bedrock in a US-only data-residency tier with fast mode and a 1-hour cache write runs at a materially different effective rate than the headline $5/$25. Working that out is an exercise in multiplication, and the multiplication belongs in the pricing model, not in a back-of-envelope estimate.
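That multiplication is worth pinning down in the pricing model itself. A sketch with placeholder modifier values; the fast-mode and data-residency premiums are not published in this release, so pull the real multipliers from the vendor page before using this:

```python
# Stacking pricing modifiers multiplicatively on a base per-MTok rate.
# The modifier VALUES below are placeholders, not published figures.
from math import prod

def effective_rate(base: float, *multipliers: float) -> float:
    return base * prod(multipliers)

# Hypothetical Opus 4.6 input rate with a fast-mode premium, a
# data-residency premium, and a 1-hour cache write at 2x base input:
FAST_MODE, RESIDENCY, CACHE_1H = 1.5, 1.1, 2.0   # placeholders only
print(effective_rate(5.00, FAST_MODE, RESIDENCY, CACHE_1H))  # illustrative
```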

Refresh the vendor pricing pages before any material spend decision. The data files attached to this release are a point-in-time snapshot.

Data downloads

Both files carry the same 14 rows: the four Anthropic models, the three current OpenAI GPT-5.4 models, the three current Google Gemini 2.5 models (with Gemini 2.5 Pro's two pricing tiers recorded as separate rows), and the three legacy OpenAI rows (GPT-5, o3, o4-mini) preserved with null pricing and an explanatory note on the source URL.

Connects to

  • Prompt Caching Economics for Finance LLMs
  • Thinking Tokens on Finance Tasks
  • Batch API Economics for Finance LLMs
  • Eval Harness for Finance LLMs

References

  • Anthropic. "Prompt caching." docs.anthropic.com/en/docs/build-with-claude/prompt-caching. 5-minute and 1-hour cache duration mechanics.
  • OpenAI. "Batch API." platform.openai.com/docs/guides/batch. Async submission format and 24-hour SLA for the 50-percent discount.
  • Google. "Batch mode." ai.google.dev/gemini-api/docs/batch-mode. Batch request format and 50-percent discount.
  • SEC EDGAR. "Full-text search." efts.sec.gov/LATEST/search-index. Reference source for 10-K filings used in scenarios A and C.

Footnotes

  1. Anthropic. "Pricing." docs.anthropic.com/en/docs/about-claude/pricing (accessed 2026-04-23). Model-pricing table, batch-processing table, prompt-caching multiplier table.

  2. Anthropic. "Models overview." docs.anthropic.com/en/docs/about-claude/models/overview (accessed 2026-04-23). Context windows and max-output tokens for Opus 4.7, Sonnet 4.6, Haiku 4.5, and Opus 4.6.

  3. OpenAI. "API pricing." developers.openai.com/api/docs/pricing (accessed 2026-04-23). Standard, batch, flex, and priority rows for GPT-5.4 / GPT-5.4 mini / GPT-5.4 nano / GPT-5.4 pro.

  4. OpenAI. "GPT-5.4 model." developers.openai.com/api/docs/models/gpt-5.4 (accessed 2026-04-23). Context window, max output tokens, reasoning effort levels. Mini and nano spec pages at the corresponding developers.openai.com/api/docs/models/gpt-5.4-mini and /gpt-5.4-nano URLs.

  5. Google. "Gemini API pricing." ai.google.dev/pricing (accessed 2026-04-23). Standard and batch rates for Gemini 2.5 Flash, Pro (<=200K and >200K tiers), and Flash-Lite. Context-cache storage rates.

  6. Google Cloud. "Gemini 2.5 Flash / 2.5 Pro / 2.5 Flash-Lite model pages." docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash, /2-5-pro, /2-5-flash-lite (accessed 2026-04-23). Max input tokens (1,048,576) and max output tokens (65,535) per model.

Tools referenced

  • Token-Cost Optimizer
  • Financial Document Token Estimator
  • Agent Cost Envelope Calculator
  • Batch vs Realtime Cost Calculator
  • Hallucination Detector
  • Price-Blind Auditor
  • Model Selector for Finance
  • Eval Harness for Finance LLMs

Planning estimates only — not financial, tax, or investment advice.