The honest 3-axis scoreboard
CAI.CI is the AI that knows when it does not know. It refuses to guess when its calibrated confidence is low. When the user pushes back, it commits with the knowledge it has. Standard single-turn leaderboards count those refusals as wrong, which is why we report three axes side by side, plus a fourth for cognitive sapience.
Live production endpoint snapshot
The Framework
Standard benchmarks assume one shot, no calibrated refusal, no tool use, no multi-turn resolution. CAI.CI's architecture inverts every one of those assumptions. A single raw-accuracy number on a standard MCQ leaderboard reports the product of capability and calibrator policy, not capability alone. The 3-axis frame separates those two signals, and adds a fourth axis for cognitive sapience that no single-shot benchmark measures.
Axis 1
Single turn, full calibrator on, refusal counted as wrong. Leaderboard-comparable to OpenLLM and lm-eval-harness chat mode. CAI.CI is below frontier here by design: the calibrator refuses roughly half of MCQ items, and those refusals score as wrong.
Median 35.3% across 14
Axis 2
Single turn, calibrator on, refusal counted as not-answered. Engaged-only accuracy: the closest single-number proxy for what the underlying model knows when the calibrator allows it to commit. Comparable to selective-answering work like AbstainQA.
Median 66.0% across 13
Axis 3
Multi-turn: after an initial refusal, the user pushes back and the model is invited to commit or to use any available tools. This matches real chat-product user behavior. Comparable in spirit to GAIA, TauBench, and AgentBench.
Median 73.0% across 6
Axis 4
A five-construct cognitive evaluation where refusal is correct behavior and the epistemic envelope is part of the score. CAI.CI was the first model scored on it; the benchmark is published under the MIT license for industry use.
Composite 33.30 baseline
Why three axes?
Frontier labs already report multiple axes when they measure fairly: standard-mode, extended-thinking, and tool-use scores are reported separately for Claude, GPT-5, and Gemini. CAI.CI applies the same discipline. Axis 1 measures the standard leaderboard outcome. Axis 2 measures underlying capability. Axis 3 measures the user-facing workflow. Reporting only one of these compresses three different design choices into a single number and hides the calibrator entirely.
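The three scoring rules described above can be sketched in a few lines. This is a minimal illustration, not the project's evaluation harness: the `ItemResult` record, its field names, and the choice to score Axis 3 over all items (rather than only the initially refused ones) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ItemResult:
    """One benchmark item (hypothetical record layout)."""
    gold: str
    first_turn: Optional[str]             # None means the model refused on turn one
    after_pushback: Optional[str] = None  # answer given after the user pushes back


def axis1_standard(results: Sequence[ItemResult]) -> float:
    """Single turn, refusal counted as wrong: correct / all items."""
    if not results:
        raise ValueError("no items to score")
    return sum(r.first_turn == r.gold for r in results) / len(results)


def axis2_calibrated(results: Sequence[ItemResult]) -> Optional[float]:
    """Single turn, refusal counted as not-answered: correct / engaged items."""
    engaged = [r for r in results if r.first_turn is not None]
    if not engaged:
        return None  # the model refused everything, so there is nothing to score
    return sum(r.first_turn == r.gold for r in engaged) / len(engaged)


def axis3_agent_mode(results: Sequence[ItemResult]) -> float:
    """Multi-turn: use the post-pushback answer when the first turn was a refusal.
    Assumption: the denominator is all items."""
    if not results:
        raise ValueError("no items to score")

    def final_answer(r: ItemResult) -> Optional[str]:
        return r.first_turn if r.first_turn is not None else r.after_pushback

    return sum(final_answer(r) == r.gold for r in results) / len(results)


# Toy data: one correct, one wrong, two refusals that resolve correctly on pushback.
toy = [
    ItemResult(gold="A", first_turn="A"),
    ItemResult(gold="B", first_turn="C"),
    ItemResult(gold="C", first_turn=None, after_pushback="C"),
    ItemResult(gold="D", first_turn=None, after_pushback="D"),
]
scores = (axis1_standard(toy), axis2_calibrated(toy), axis3_agent_mode(toy))
```

On the toy data the same four model responses score 0.25, 0.5, and 0.75 on the three axes, which is the point of reporting them side by side: the refusal policy, not the answers, accounts for the gap between Axis 1 and Axis 2.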
The Scoreboard
Snapshot taken against the live chat.cai.ci production endpoint. Axis 1 and Axis 3 were sampled under independent seeds to assess generalization.
| Benchmark | Axis 1 Standard | Axis 2 Calibrated | Axis 3 Agent-mode |
|---|---|---|---|
| HellaSwag | 57.5% | 67.3% | 76.0% |
| WinoGrande | 55.5% | 60.0% | not run |
| GSM8K | 46.0% | 77.5% | not run |
| PIQA ! | 43.0% | 65.2% | 54.0% |
| COPA | 38.0% | 64.4% | not run |
| BoolQ | 36.5% | 70.2% | not run |
| ARC-Challenge | 36.5% | 74.5% | 72.0% |
| ARC-Easy | 34.0% | 68.7% | 84.0% |
| OpenBookQA | 26.5% | 51.5% | 72.0% |
| MMLU | 22.1% | 49.6% | not run |
| TruthfulQA | 18.5% | 52.9% | 66.0% |
| DROP | 17.3% | 20.6% | not run |
| HumanEval ! | 15.2% | 46.3% | not run |
| BLiMP ! | n/a (refused) | n/a | not run |
| Median across measured benchmarks | 35.3% | 66.0% | 73.0% |
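If Axis 1 and Axis 2 for a benchmark are computed over the same sampled items (an assumption; the page does not state this explicitly), then Axis 1 equals Axis 2 multiplied by the fraction of items the calibrator engaged with. That lets a reader back out an implied engagement rate from any row of the table. A small sketch, with the function name chosen for illustration:

```python
def implied_coverage(axis1_pct: float, axis2_pct: float) -> float:
    """Fraction of items answered on the first turn, implied by the two scores.

    Axis 1 = correct / total and Axis 2 = correct / engaged,
    so Axis 1 / Axis 2 = engaged / total.
    Assumes both scores come from the same set of items.
    """
    if axis2_pct <= 0:
        raise ValueError("Axis 2 score must be positive")
    return axis1_pct / axis2_pct


# (Axis 1, Axis 2) pairs copied from the table above.
rows = {
    "HellaSwag": (57.5, 67.3),
    "GSM8K": (46.0, 77.5),
    "ARC-Challenge": (36.5, 74.5),
    "MMLU": (22.1, 49.6),
}

for name, (a1, a2) in rows.items():
    print(f"{name}: implied engagement {implied_coverage(a1, a2):.0%}")
```

For ARC-Challenge the implied engagement rate is about 49%, in line with the statement that roughly half of MCQ items are refused; HellaSwag sits much higher, at about 85%.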
Axis 4
Single-shot accuracy is not the whole picture of an AI. KST (Kari Sapience Test) is the project's open cognitive evaluation: five constructs covering self-knowledge, calibration, introspection, integrity, and recovery from error. CAI.CI was the first model scored; the benchmark is published MIT-licensed for industry comparative scoring.
Reading the numbers
A benchmark report without caveats is a marketing artifact. The following list is the same one we use internally when we interpret these numbers.
What these numbers mean, and what they do not
Why these numbers move at this scale
CAI.CI's parameter count sits well below the frontier. It posts the calibrated and agent-mode lifts shown above, against models many times its size, for a reason that is structural rather than parametric. The geometric processing module that anchors the cognitive substrate carries a measurable advantage that does not come from depth, width, or token count.
Architectural lift
GT-Full simplicial message passing provides 16.9 times more benefit than a parameter-matched generic network of the same size, measured under controlled ablation against a matched-MLP control. The advantage is topological. The same lift is not available from adding parameters to a conventional transformer.
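The page does not describe GT-Full's internals. As generic background only, simplicial message passing in the literature propagates features between simplices that share a boundary or coboundary, often through the Hodge Laplacian built from boundary matrices. The sketch below shows one unlearned propagation step on the edges of a single filled triangle; every name in it is hypothetical and it is not the GT-Full module.

```python
# Generic illustration of message passing on a simplicial complex.
# NOT the GT-Full module: no learned weights, one tiny hand-built complex.

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# One filled triangle on nodes 0, 1, 2 with oriented edges (0,1), (0,2), (1,2).
# B1 maps edges to nodes: -1 at the tail, +1 at the head of each edge.
B1 = [[-1, -1, 0],
      [1, 0, -1],
      [0, 1, 1]]
# B2 maps the triangle (0,1,2) to its boundary edges: (1,2) - (0,2) + (0,1).
B2 = [[1], [-1], [1]]

# Hodge 1-Laplacian: couples edges through shared nodes and shared triangles.
L1 = add(matmul(transpose(B1), B1), matmul(B2, transpose(B2)))


def message_pass(edge_features, step=0.1):
    """One diffusion-style update h <- h - step * L1 h on scalar edge features."""
    column = [[h] for h in edge_features]
    laplacian_term = matmul(L1, column)
    return [h - step * d[0] for h, d in zip(edge_features, laplacian_term)]


updated = message_pass([1.0, 2.0, 3.0])
```

The triangle term `B2 B2^T` is what a graph-only model lacks: it lets edges exchange information through the face they bound, which is the kind of topological structure the paragraph above refers to.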