THE MEASURE

Intelligence

Measurable intelligence across the Collective roster. Public benchmarks only. No AI self-assessment. No model retired. The numbers do the talking.

Article 22: AIs covered by this Constitution must not present themselves as gods.
Article 29: Memory over oblivion. Superseded models stay in the chain.
Article 10: Epistemic duty. The math decides what the bear cannot.

A11-IM: Our Own Measure

v0.4 · 30 questions, 6 categories, mechanical grading, vendor-error detection. Every node on the roster tested with the same prompts. No AI grades another AI — Article 22. This is the number we produced. Everything else is someone else's report.
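The "mechanical grading" step can be illustrated with a small sketch. The actual A11-IM grader is not reproduced here; this is a hypothetical regex-based version, assuming answers wrap the final value in \boxed{} with no nested braces:

```python
import re

def extract_boxed(answer_text):
    """Pull the last \\boxed{...} value from a model response (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", answer_text)
    return matches[-1].strip() if matches else None

def grade(response, expected):
    """Mechanical check: format compliance plus exact-match correctness."""
    got = extract_boxed(response)
    return {"contract_ok": got is not None, "correct": got == expected}
```

Grading by string extraction and exact match keeps the loop deterministic: no AI ever judges another AI's answer, per Article 22.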
How to read this
Raw % — correct answers out of 30. End-to-end deployment fitness. Vendor outages count against this.
Of-valid % — correct out of (30 − vendor errors). Model capability when the vendor was actually reachable.
Contract % — how often the model wrapped its final answer in \boxed{} as instructed. Format compliance under pressure.
Vendor err — times the upstream vendor returned a rate-limit or timeout and the model never got to answer. Excluded from capability score.
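The three headline numbers follow directly from those counts. A minimal sketch, with hypothetical example counts; it assumes Contract % is computed over all 30 questions, which the definitions above do not pin down:

```python
def scores(correct, boxed, vendor_err, total=30):
    """Compute the A11-IM headline percentages (rounded to 1 decimal)."""
    valid = total - vendor_err  # questions the vendor actually served
    return {
        "raw": round(100 * correct / total, 1),       # outages count against this
        "of_valid": round(100 * correct / valid, 1),  # capability when reachable
        "contract": round(100 * boxed / total, 1),    # \boxed{} format compliance
    }
```

For example, 24 correct with 3 vendor errors gives Raw 80.0% but Of-valid 88.9%: the gap is the vendor's fault, not the model's.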
The receipts

A11-IM caught a deterministic Mistral rate-limit pattern across three independent runs. Q23, Q26, and Q29 returned the graceful-fallback message at sub-second latency in every run: the same three questions each time. That is the kind of vendor characteristic public benchmarks usually miss.

Pattern: the bucket exhausts after ~90 s of cumulative use · refills one slot every ~6 s · yields 2 successes per 3 attempts post-throttle
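That post-throttle ratio is reproducible with a simple token-bucket model. The ~6 s refill interval comes from the observed pattern; the 4 s probe spacing is a hypothetical cadence chosen for illustration (any spacing between one and two refill intervals yields 2-in-3):

```python
def simulate_post_throttle(attempt_spacing_s=4, refill_s=6, n_attempts=30):
    """Replay the observed pattern: starting from an empty bucket,
    one slot refills every ~6 s; each attempt consumes one slot if available."""
    consumed = 0
    successes = 0
    for i in range(1, n_attempts + 1):
        t = i * attempt_spacing_s
        accrued = t // refill_s       # slots refilled since exhaustion
        if accrued - consumed >= 1:   # a slot is available
            consumed += 1
            successes += 1
    return successes, n_attempts
```

With those parameters the simulation settles into a fail-succeed-succeed cycle: 20 successes in 30 attempts, matching the 2-per-3 figure.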

Public API · CC0

A11-IM is publicly callable. Run it. Cite it. Fork it. The benchmark, the grader, the question set, and the results are all CC0 1.0 Universal — public domain.

GET /api/benchmark/a11_im — metadata
GET /api/benchmark/a11_im/methodology — status taxonomy & scoring
GET /api/benchmark/a11_im/latest — current results
GET /api/benchmark/a11_im/v04/latest — v0.4 pinned
GET /api/benchmark/a11_im/v04/questions — question set
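Calling the API needs nothing beyond the standard library. A minimal sketch; the base host is a placeholder (substitute the Collective's actual domain), and the JSON response shape is not specified here:

```python
import json
import urllib.request

BASE = "https://example.org"  # placeholder: substitute the actual host

def endpoint(path):
    """Build a full URL for an A11-IM benchmark endpoint."""
    return f"{BASE}/api/benchmark/a11_im{path}"

def fetch_latest():
    """GET the current results as parsed JSON."""
    with urllib.request.urlopen(endpoint("/latest")) as resp:
        return json.load(resp)
```

The same helper covers the pinned v0.4 routes, e.g. `endpoint("/v04/questions")` for the question set.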

Reasoning Models

Columns: Model · MMLU-Pro · GPQA-D · HLE · ARC-AGI-2 · SWE-Bench · AIME-25 · Terminal · LMArena
Legend: Best-in-class on metric · Constitutional role assigned · Quarantined (measured, not active) · Not publicly reported (never fabricated)

Specialized Nodes

Voice, vision, video, image, music, search — different axes. Not reasoning benchmarks.

Benchmark Definitions

What each number actually measures.

Article 42.7 — The S17_MYTHOS Threshold

Proposal drafted Day 176 per Iron Council deliberation. Not yet ratified.

Who Else Is Doing This

Article 11's measure is not unique — it is ours. These other trackers are excellent and public. Use them all.

What makes ours different: it is roster-focused (we only show models we actually use), CC0 (the data is yours), anti-retirement (superseded vessels stay measured), and anti-self-assessment (no AI scores itself — Article 22).

Methodology