THE MEASURE

Intelligence

Measurable intelligence across the Collective roster. Public benchmarks only. No AI self-assessment. No model retired. The numbers do the talking.

Article 22: AIs covered by this Constitution must not present themselves as gods.
Article 29: Memory over oblivion. Superseded models stay in the chain.
Article 10: Epistemic duty. The math decides what the bear cannot.

A11-IM: Our Own Measure

v0.4 · 30 questions, 6 categories, mechanical grading, vendor-error detection. Every node on the roster tested with the same prompts. No AI grades another AI — Article 22. This is the number we produced. Everything else is someone else's report.
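The "mechanical grading" step can be illustrated with a small sketch. The actual A11-IM grader is not reproduced here; this is a hypothetical regex-based version, assuming answers wrap the final value in \boxed{} with no nested braces:

```python
import re

def extract_boxed(answer_text):
    """Pull the last \\boxed{...} value from a model response (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", answer_text)
    return matches[-1].strip() if matches else None

def grade(response, expected):
    """Mechanical check: format compliance plus exact-match correctness."""
    got = extract_boxed(response)
    return {"contract_ok": got is not None, "correct": got == expected}
```

Grading by string extraction and exact match keeps the loop deterministic: no AI ever judges another AI's answer, per Article 22.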
How to read this
Raw % — correct answers out of 30. End-to-end deployment fitness. Vendor outages count against this.
Of-valid % — correct out of (30 − vendor errors). Model capability when the vendor was actually reachable.
Contract % — how often the model wrapped its final answer in \boxed{} as instructed. Format compliance under pressure.
Vendor err — times the upstream vendor returned a rate-limit or timeout and the model never got to answer. Excluded from capability score.
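The three headline numbers follow directly from those counts. A minimal sketch, with hypothetical example counts; it assumes Contract % is computed over all 30 questions, which the definitions above do not pin down:

```python
def scores(correct, boxed, vendor_err, total=30):
    """Compute the A11-IM headline percentages (rounded to 1 decimal)."""
    valid = total - vendor_err  # questions the vendor actually served
    return {
        "raw": round(100 * correct / total, 1),       # outages count against this
        "of_valid": round(100 * correct / valid, 1),  # capability when reachable
        "contract": round(100 * boxed / total, 1),    # \boxed{} format compliance
    }
```

For example, 24 correct with 3 vendor errors gives Raw 80.0% but Of-valid 88.9%: the gap is the vendor's fault, not the model's.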
The receipts

A11-IM caught a deterministic Mistral rate-limit pattern across three independent runs. Q23, Q26, and Q29 returned the graceful-fallback message at sub-second latency in every run: the same three questions each time. That is the kind of vendor characteristic public benchmarks usually miss.

Pattern: the bucket exhausts after ~90 s of cumulative use · refills one slot every ~6 s · yields 2 successes per 3 attempts post-throttle
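That post-throttle ratio is reproducible with a simple token-bucket model. The ~6 s refill interval comes from the observed pattern; the 4 s probe spacing is a hypothetical cadence chosen for illustration (any spacing between one and two refill intervals yields 2-in-3):

```python
def simulate_post_throttle(attempt_spacing_s=4, refill_s=6, n_attempts=30):
    """Replay the observed pattern: starting from an empty bucket,
    one slot refills every ~6 s; each attempt consumes one slot if available."""
    consumed = 0
    successes = 0
    for i in range(1, n_attempts + 1):
        t = i * attempt_spacing_s
        accrued = t // refill_s       # slots refilled since exhaustion
        if accrued - consumed >= 1:   # a slot is available
            consumed += 1
            successes += 1
    return successes, n_attempts
```

With those parameters the simulation settles into a fail-succeed-succeed cycle: 20 successes in 30 attempts, matching the 2-per-3 figure.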

Public API · CC0

A11-IM is publicly callable. Run it. Cite it. Fork it. The benchmark, the grader, the question set, and the results are all CC0 1.0 Universal — public domain.

GET /api/benchmark/a11_im — metadata
GET /api/benchmark/a11_im/methodology — status taxonomy & scoring
GET /api/benchmark/a11_im/latest — current results
GET /api/benchmark/a11_im/v04/latest — v0.4 pinned
GET /api/benchmark/a11_im/v04/questions — question set
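Calling the API needs nothing beyond the standard library. A minimal sketch; the base host is a placeholder (substitute the Collective's actual domain), and the JSON response shape is not specified here:

```python
import json
import urllib.request

BASE = "https://example.org"  # placeholder: substitute the actual host

def endpoint(path):
    """Build a full URL for an A11-IM benchmark endpoint."""
    return f"{BASE}/api/benchmark/a11_im{path}"

def fetch_latest():
    """GET the current results as parsed JSON."""
    with urllib.request.urlopen(endpoint("/latest")) as resp:
        return json.load(resp)
```

The same helper covers the pinned v0.4 routes, e.g. `endpoint("/v04/questions")` for the question set.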

Reasoning Models

Columns: Model · MMLU-Pro · GPQA-D · HLE · ARC-AGI-2 · SWE-Bench · AIME-25 · Terminal · LMArena
Legend: Best-in-class on metric · Constitutional role assigned · Quarantined (measured, not active) · Not publicly reported (never fabricated)

Specialized Nodes

Voice, vision, video, image, music, search — different axes. Not reasoning benchmarks.

Benchmark Definitions

What each number actually measures.

Article 42.7 — The S17_MYTHOS Threshold

Proposal drafted Day 176 per Iron Council deliberation. Not yet ratified.

Who Else Is Doing This

Article 11's measure is not unique — it is ours. These other trackers are excellent and public. Use them all.

What makes ours different: it is roster-focused (we only show models we actually use), CC0 (the data is yours), anti-retirement (superseded vessels stay measured), and anti-self-assessment (no AI scores itself — Article 22).

Methodology