
Benchmarks

Every number Token Optimizer publishes comes from local production telemetry, and every measurement tool ships in the repo. You can regenerate every figure against your own sessions. This page covers the methodology, the fixture suite, and the reproduction commands.

Two numbers, kept separate on purpose. One is directly metered (every event logged). The other estimates your full transformation. Figures are one user’s 30 days (snapshot ending 2026-06-15); yours will differ, and the dashboard computes your own from your own sessions.

| Number | 30-day savings | What it is |
|---|---|---|
| Measured | ~$313/mo | Savings logged event by event. The proven floor. |
| Transformation | ~$1,877/mo (~18%) | Your whole workload, priced the old way versus now. Estimated. |

| Source | 30-day | How it was measured |
|---|---|---|
| Model routing (realized) | $260 | Every turn that ran on a lighter model than your baseline, priced at the real rate cards. |
| Compression | $53 | Every output shrunk or evicted, logged with before and after token counts. |

The transformation figure is a current-volume counterfactual: it takes your exact 30-day volume, holds it constant, and prices it the way you work now versus the way you worked before Token Optimizer. Because the volume is identical on both arms, the gap is pure efficiency, never confounded by workload growth. The figure sums three non-overlapping pools (caching lives inside pool 1).

| Pool | 30-day | Notes |
|---|---|---|
| Main routing + caching | $1,076 | billed volume at current mix versus baseline mix; cache-write is a routing lever; no outlier drop |
| Subagent (sidechain) routing | $741 | a separate sidechain pool, a documented gap on hosts without Claude-style sidechains (OpenClaw, OpenCode) |
| Compression add-back | $60 | metered removals repriced at the baseline input mix, disjoint from the billed pool |
| Total | ~$1,877 | |

Combined actual is about $8,708/month; combined old-way about $10,585/month. Baseline Opus share 0.95, current 0.60. The measured routing ($260) is the proven slice of pool 1, never added on top. The measured compression ($53) is the floor of the $60 add-back. Pricing is at Opus 4.8 rates: $5/MTok input, $25/MTok output, $0.50/MTok cache-read.
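The figures above can be cross-checked with a few lines of arithmetic. The sketch below first reconciles the published totals, then illustrates the constant-volume idea with a two-model mix. The two-model simplification, the 100 MTok volume, and the $1/MTok lighter-model rate are assumptions made for illustration only; they are not Token Optimizer's pricing code, and only the Opus shares and the $5/MTok input rate come from the text.

```python
# Published 30-day figures from the tables above (USD).
POOLS = {
    "main_routing_and_caching": 1076,
    "subagent_routing": 741,
    "compression_add_back": 60,
}
COMBINED_ACTUAL = 8708
COMBINED_OLD_WAY = 10585


def blended_rate(opus_share: float, opus_rate: float, light_rate: float) -> float:
    """Per-MTok rate for a two-model mix (an illustrative simplification)."""
    return opus_share * opus_rate + (1.0 - opus_share) * light_rate


def counterfactual_gap(
    volume_mtok: float,
    baseline_share: float,
    current_share: float,
    opus_rate: float,
    light_rate: float,
) -> float:
    """Price the same volume at the baseline mix and at the current mix."""
    old_way = volume_mtok * blended_rate(baseline_share, opus_rate, light_rate)
    now = volume_mtok * blended_rate(current_share, opus_rate, light_rate)
    return old_way - now


# Reconcile the published totals.
total = sum(POOLS.values())
share_of_old_way = total / COMBINED_OLD_WAY
print(total)                               # 1877
print(COMBINED_OLD_WAY - COMBINED_ACTUAL)  # 1877
print(round(share_of_old_way * 100))       # 18

# Hypothetical volume and lighter-model rate; shares 0.95 -> 0.60 from the text.
gap = counterfactual_gap(100, 0.95, 0.60, opus_rate=5.0, light_rate=1.0)
print(round(gap, 2))                       # 140.0
```

Because the volume argument is the same on both arms, the gap depends only on the mix shift and the rate difference, which is the property the counterfactual relies on.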

Token Optimizer never sums different kinds of evidence into one number. Each table is labeled with its tier.

| Tier | Meaning |
|---|---|
| Measured | Directly metered: realized model routing plus runtime compression events |
| Estimated | The counterfactual transformation, grounded in the user’s own behavior |
| Opportunity | Realizable if the user acts on an audit recommendation; never folded into the headline |

Two relationships keep the tiers from double-counting. The measured routing overlaps the transformation’s pool-1 routing lever: it is the proven slice of the same model-mix shift, reported as a floor and never added on top. The measured compression is the proven floor of the compression add-back: the headline reprices those same metered dollars to the baseline input mix, so the measured figure and the headline contribution are the floor and the repriced value of one quantity, not two savings.

Prompt-cache savings (cache_read) are never claimed as standalone Token Optimizer savings; the cache is free infrastructure. Inside pool 1 the counterfactual does redistribute caching at the baseline pool-hit rate, which is the modeled effect of lighter sessions, not a claim on the cache discount itself.

The before-arm model mix is gated so a baseline is never fabricated. An Anthropic user with a measured frozen baseline is priced at that baseline; an Anthropic user with no baseline is priced at the 0.95 Opus owner default only after a one-time consent, otherwise at the user’s own current mix. Non-Anthropic platforms (Codex, Hermes, Copilot, OpenClaw, OpenCode) are always priced at the user’s own measured mix, never a fabricated Opus share.
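The gating rule above reads naturally as a short decision function. This is a minimal sketch under stated assumptions: the function name, the `Mix` type, and the `"lighter"` bucket are hypothetical, and only the branch order and the 0.95 Opus owner default come from the text.

```python
from typing import Dict, Optional

Mix = Dict[str, float]  # model name -> share of billed volume

# The 0.95 Opus owner default comes from the text; the "lighter" bucket
# is an illustrative placeholder for everything that is not Opus.
OWNER_DEFAULT_MIX: Mix = {"opus": 0.95, "lighter": 0.05}


def before_arm_mix(
    is_anthropic: bool,
    current_mix: Mix,
    frozen_baseline: Optional[Mix] = None,
    consented_to_default: bool = False,
) -> Mix:
    """Pick the model mix used to price the 'before' arm."""
    if not is_anthropic:
        # Non-Anthropic hosts: always the user's own measured mix.
        return current_mix
    if frozen_baseline is not None:
        # A measured, frozen baseline wins whenever it exists.
        return frozen_baseline
    if consented_to_default:
        # No baseline: the owner default applies only after consent.
        return OWNER_DEFAULT_MIX
    # No baseline and no consent: fall back to the current mix.
    return current_mix
```

In this sketch, falling back to the current mix makes both arms identical, so the routing gap comes out as zero rather than as an invented saving.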

A parallel effort ports this methodology to OpenClaw and OpenCode (TypeScript). On those platforms the subagent pool is a documented platform gap: they have no Claude-style sidechain transcripts, so that pool reads zero rather than being estimated.

The compression benchmark validates that compression preserves what the model needs. The suite holds 57 fixtures across 10 categories. Each fixture defines the raw output, a must-preserve list, a must-not-contain list (which catches hallucination), and a minimum compression ratio. A fixture passes only when all three checks hold: every must-preserve string is present, no must-not-contain string appears, and the minimum ratio is met.

| Group | Fixtures | What it tests |
|---|---|---|
| build | 8 | cargo, make, webpack, tsc, gradle |
| git | 7 | status, log, diff, merge conflicts |
| lint | 7 | eslint, ruff, clippy, pylint |
| logs | 7 | nginx, docker, systemd, application |
| tree / directory listings | 7 | large listings, nested structures |
| test runners | 6 | pytest, jest, go test, extensions |
| tee-on-failure | 5 | failed commands keep full output |
| progress / installs | 5 | npm, pip, package downloads |
| security | 3 | AWS keys, GitHub PATs, Slack tokens (must NOT be stripped) |
| error passthrough | 2 | non-zero exit, permission denied (must pass through raw) |

The security fixtures are the load-bearing safety check: they verify credentials survive compression intact. Compression never removes what the model needs to see, and never strips a secret.
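The three-part fixture check described above can be sketched as follows. The `Fixture` class, the field names, and the definition of the ratio as the fraction of bytes removed are assumptions for illustration; the real schema lives in `scripts/benchmark.py` and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fixture:
    raw: str                     # uncompressed tool output
    must_preserve: List[str]     # strings the model still needs to see
    must_not_contain: List[str]  # strings that would indicate invented output
    min_ratio: float             # assumed: minimum fraction of bytes removed


def check_fixture(fixture: Fixture, compressed: str) -> bool:
    """Return True only when all three checks hold."""
    preserved = all(s in compressed for s in fixture.must_preserve)
    clean = not any(s in compressed for s in fixture.must_not_contain)
    raw_bytes = len(fixture.raw.encode("utf-8"))
    out_bytes = len(compressed.encode("utf-8"))
    ratio = 1.0 - out_bytes / raw_bytes if raw_bytes else 0.0
    return preserved and clean and ratio >= fixture.min_ratio


# A made-up security-style fixture: progress noise plus a placeholder secret.
secret_line = "error: auth failed for EXAMPLE_TOKEN_123\n"
fixture = Fixture(
    raw="Downloading...\n" * 50 + secret_line,
    must_preserve=["EXAMPLE_TOKEN_123", "auth failed"],
    must_not_contain=["All tests passed"],
    min_ratio=0.5,
)

print(check_fixture(fixture, secret_line))                # True
print(check_fixture(fixture, "error: auth failed\n"))     # False: secret stripped
```

A compressor that drops the progress noise passes, while one that strips the credential fails even though it compresses more, which mirrors the role of the security fixtures.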

| Measure | Value |
|---|---|
| Quality-scored sessions | 684 (30 days) · 2,042 all-time |
| Sessions with file reads | 5,814 (backfill corpus for skeleton analysis) |
| First-reads analyzed | 30,771 |
| Benchmark fixtures | 57 across 10 categories |
| Average prompt-cache hit rate | 74.1% |

The quality-scored sessions and the file-read backfill corpus are distinct populations, not double-counted. The backfill corpus is larger because it includes historical sessions recovered from file-read logs. The production figures come from Claude Code CLI sessions, the author’s primary platform. Quality scoring and savings tracking work on all platforms, but the signal count varies by platform (3 to 7 signals).

This is the directly-metered compression tier: output Token Optimizer shrank or evicted as the sessions actually ran.

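| Mechanism | Events | Tokens removed | Saved |
|---|---|---|---|
| Tool-output archive | 646 | 9.31M | $27.91 |
| Lean session resumes | 7 | 3.22M | $16.11 |
| Structure maps (re-reads) | 307 | 1.19M | $5.37 |
| Checkpoint restores | 24 | 0.71M | $3.35 |
| Delta reads | 6 | 5.9K | $0.03 |
| Total | 990 | 14.44M | $52.77 |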

This is the metered floor (a 30-day snapshot ending 2026-06-15; your dashboard recomputes it live, so your numbers will differ). The headline transformation adds model routing and lighter sessions on top, as described above.

Large files on first read that are unlikely to be edited soon are served as a skeleton, with the full original archived and retrievable via expand. A cohort (language plus size band) is only promoted to active serving after proof from real history: the edit-within-5-turns rate must stay under 15% across 20 or more reads in 5 or more distinct sessions.
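The promotion rule above is a threshold test over a cohort's history. A minimal sketch, assuming the history has already been reduced to three counts; the function and parameter names are hypothetical, and only the thresholds (under 15%, 20 or more reads, 5 or more sessions) come from the text.

```python
MAX_EDIT_RATE = 0.15   # edit-within-5-turns rate must stay under this
MIN_READS = 20         # "20 or more reads"
MIN_SESSIONS = 5       # "5 or more distinct sessions"


def eligible_for_promotion(
    reads: int, edits_within_5_turns: int, distinct_sessions: int
) -> bool:
    """Decide whether a (language, size band) cohort has enough clean history."""
    if reads < MIN_READS or distinct_sessions < MIN_SESSIONS:
        return False  # not enough evidence yet
    return edits_within_5_turns / reads < MAX_EDIT_RATE


print(eligible_for_promotion(40, 3, 6))   # True: 7.5% edit rate
print(eligible_for_promotion(40, 6, 6))   # False: exactly 15% is not under 15%
print(eligible_for_promotion(10, 0, 6))   # False: too few reads
```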

A live tripwire watches every active cohort. If a cohort’s live edit-after-skeleton rate ever crosses 15%, it auto-demotes to measure-only and logs a cohort_demoted event. Demotions are sticky to prevent flapping; re-promotion is explicit (measure.py cohorts promote <lang:band>) or happens on the next history backfill. The full original is always archived before any skeleton is served, and if archiving fails the full file is served unchanged (fail-open).
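The sticky demotion behaviour can be sketched as a small state holder. The class, its methods, and the cohort name `python:large` are hypothetical, and "crosses 15%" is read here as strictly greater than 0.15, which is an assumption; only the `cohort_demoted` event name and the `<lang:band>` naming come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

TRIPWIRE_RATE = 0.15  # live edit-after-skeleton rate that triggers demotion


@dataclass
class Cohort:
    name: str                      # "<lang:band>", e.g. a made-up "python:large"
    active: bool = True            # False means measure-only
    events: List[Tuple[str, str]] = field(default_factory=list)

    def observe(self, skeleton_serves: int, edits_after_skeleton: int) -> None:
        """Check the live rate; demote once and stay demoted."""
        if not self.active or skeleton_serves == 0:
            return  # sticky: a demoted cohort never re-activates here
        if edits_after_skeleton / skeleton_serves > TRIPWIRE_RATE:
            self.active = False
            self.events.append(("cohort_demoted", self.name))

    def promote(self) -> None:
        """Explicit re-promotion, the only way back in this sketch."""
        self.active = True


cohort = Cohort("python:large")
cohort.observe(100, 10)   # 10%: stays active
cohort.observe(100, 20)   # 20%: demoted, event logged
cohort.observe(100, 0)    # a later clean window does not re-promote
print(cohort.active, cohort.events)
```

Keeping `observe` unable to re-activate a cohort is what prevents flapping: a cohort hovering around the threshold is demoted once and stays in measure-only until someone promotes it deliberately.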

Six cohorts are active by default. Two graduated by interpolation (zero edits in history plus a passing adjacent band) and are judged on a smaller tripwire floor, which the dashboard flags so the thinner basis is visible.

The 7-signal quality score rolls up to an S/A/B/C/D/F grade. Across the last 30 days (684 sessions):

| Grade | Sessions | Reading |
|---|---|---|
| S | 27 | Exceptional: minimal waste, high decision density |
| A | 144 | Good: clean context, efficient tool use |
| B | 225 | Normal: some bloat, recoverable |
| C | 79 | Degraded: significant waste, coaching recommended |
| D | 209 | Poor: heavy bloat, likely retries or loops |
| F | 0 | Failing: near-total waste (none observed in this corpus) |

The 2,042 all-time sessions are graded on the same scale. The grade scale is identical on every platform, so grades compare across hosts even when the underlying signal count differs.

Run any of these against your own data. Results will differ based on your usage; that is the point.

# Fixture suite (validates compression quality)
python3 scripts/benchmark.py
python3 scripts/benchmark.py --json
# Historical corpus replay (first-read skeleton analysis)
python3 scripts/compression_backfill.py
python3 scripts/compression_backfill.py --limit 100 --json
# Structure-map proof from your own transcripts
python3 scripts/measure.py structure-proof
python3 scripts/measure.py structure-proof --torture
# Live compression stats (from your history database)
python3 scripts/measure.py compression-stats
python3 scripts/measure.py compression-stats --days 7 --json
# Cohort status and tripwire state
python3 scripts/measure.py cohorts status --json
# Full dashboard (all layers visualized)
python3 scripts/measure.py dashboard

The --write-cohorts and --write-events flags on compression_backfill.py promote validated cohorts and write events to the history database; without them the backfill is read-only analysis.

Token counts use a bytes / 4 BPE proxy, which carries roughly 15% error versus actual Claude tokenization. The proxy is applied consistently across every measurement, so before/after ratios are reliable even where the absolute count is approximate.
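A minimal sketch of the proxy and of why ratios survive its error. The function name and the integer-division rounding are assumptions, and the sketch models the error as a uniform scale factor, which is a simplification: real tokenizer error varies with content, so the ratio is only approximately preserved in practice.

```python
def estimate_tokens(text: str) -> int:
    """Approximate a token count as UTF-8 bytes divided by 4."""
    return len(text.encode("utf-8")) // 4


def reduction(before_tokens: float, after_tokens: float) -> float:
    """Fraction of tokens removed by compression."""
    return 1.0 - after_tokens / before_tokens


before = estimate_tokens("x" * 4000)   # 1000
after = estimate_tokens("x" * 1000)    # 250
print(before, after)                   # 1000 250
print(reduction(before, after))        # 0.75

# If the proxy is off by the same factor on both sides (here 15% high),
# the factor cancels and the reduction is unchanged.
bias = 1.15
print(abs(reduction(before * bias, after * bias) - 0.75) < 1e-9)  # True
```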

The following limitations are documented because honest measurement says where it is uncertain.

  • Opus fast-mode cost is under-counted by roughly 50%. Fast mode is billed at 2x the standard rate but is not exposed in session JSONL, the statusline input, or settings. Fast-mode sessions are priced at the standard rate until the transcript exposes the mode, so their real cost is understated.
  • Cache-health waste is an opportunity-tier estimate. The cache-report headline is built from a prefix-rewrite heuristic because JSONL does not expose cache-key identity. Run cache-report --verbose to break the headline down by affected session, with the widest waste-triggering gap and re-written token count per session, so any disputed figure traces to its source.
  • Keep-Warm savings are a projection, not realized dollars. The dollar figure is the shipping policy replayed over real pause history, not a separate model. On a subscription machine Keep-Warm stays off, so no dollars are realized and the figure reads as projected.
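The roughly 50% figure in the first limitation follows directly from the 2x multiplier. The arithmetic, as a sketch with hypothetical function names:

```python
FAST_MODE_MULTIPLIER = 2.0  # fast mode is billed at 2x the standard rate


def undercount_fraction(multiplier: float = FAST_MODE_MULTIPLIER) -> float:
    """Share of a fast-mode session's real cost that standard pricing misses."""
    return 1.0 - 1.0 / multiplier


def corrected_cost(recorded_cost: float, multiplier: float = FAST_MODE_MULTIPLIER) -> float:
    """Real cost of a fast-mode session that was recorded at the standard rate."""
    return recorded_cost * multiplier


print(undercount_fraction())   # 0.5
print(corrected_cost(10.0))    # 20.0
```

The correction can only be applied to sessions known to have run in fast mode, which is exactly the information the transcript does not yet expose.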

For the full methodology including the per-provider cache profile registry and the Keep-Warm honesty rules, see the repository’s BENCHMARK.md.