Benchmarks

The honest 3-axis scoreboard

CAI.CI is the AI that knows when it does not know. It refuses to guess when its calibrated confidence is low. When the user pushes back, it commits to an answer based on the knowledge it has. Standard single-turn leaderboards count those refusals as wrong, which is why we report three axes side by side, plus a fourth for cognitive sapience.

Live production endpoint snapshot

Three Axes, Plus One

Standard benchmarks assume one shot, no calibrated refusal, no tool use, no multi-turn resolution. CAI.CI's architecture inverts every one of those assumptions. A single raw-accuracy number on a standard MCQ leaderboard reports the product of capability and calibrator policy, not capability alone. The 3-axis frame separates those two signals, and adds a fourth axis for cognitive sapience that no single-shot benchmark measures.
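The "product of capability and calibrator policy" can be made concrete with a small scoring sketch. The function name, the outcome labels, and the item counts below are illustrative assumptions, not the project's evaluation harness; the point is only that refusal-as-wrong accuracy equals engagement rate times engaged-only accuracy.

```python
from collections import Counter

def score_axes(outcomes):
    """Score one benchmark run two ways.

    outcomes: list of 'correct', 'wrong', or 'refused' (hypothetical labels).
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    counts = Counter(outcomes)
    total = len(outcomes)
    engaged = counts["correct"] + counts["wrong"]
    return {
        # Refusal counted as wrong: every item is in the denominator.
        "refusal_as_wrong": counts["correct"] / total,
        # Refusal counted as not-answered: only engaged items are scored.
        "engaged_only": counts["correct"] / engaged if engaged else None,
        # Fraction of items the calibrator allowed the model to answer.
        "engage_rate": engaged / total,
    }

# Hypothetical run: 200 items, 100 refused, 66 correct, 34 wrong.
outcomes = ["correct"] * 66 + ["wrong"] * 34 + ["refused"] * 100
scores = score_axes(outcomes)
print(scores)
```

In this made-up run the model answers two thirds of engaged items correctly, yet the refusal-as-wrong score is half of that, because half of the items were refused.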

Axis 1

Standard chat-mode

Single turn, full calibrator on, refusal counted as wrong. Leaderboard-comparable to OpenLLM and lm-eval-harness chat mode. CAI.CI is below frontier here by design: the model refuses roughly half of MCQ items, and those refusals score as wrong.

Median 35.3% across 14

Axis 2

Calibrated capability

Single turn, calibrator on, refusal counted as not-answered. Engaged-only accuracy: the closest single-number proxy for what the underlying model knows when the calibrator allows it to commit. Comparable to selective-answering work like AbstainQA.

Median 66.0% across 13

Axis 3

Agent-mode workflow

Multi-turn: after an initial refusal, the user pushes back and the model is invited to commit or to use any available tools. This matches real chat-product user behavior. Comparable in spirit to GAIA, TauBench, and AgentBench.

Median 73.0% across 6
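The agent-mode protocol described above can be sketched as a two-turn loop. Everything here is an assumption for illustration: the `ask` callable, the `None` refusal sentinel, the pushback wording, and the stub model are hypothetical, and no tool use is modeled.

```python
from typing import Callable, List, Optional

# Hypothetical convention: the model returns None when it refuses to answer.
Ask = Callable[[List[str]], Optional[str]]

def run_item(ask: Ask, question: str, gold: str,
             pushback: str = "Please commit to your best answer.") -> dict:
    """Ask once; if the model refuses, push back once and score the second reply."""
    history = [question]
    first = ask(history)
    if first is not None:
        return {"turns": 1, "refused_first": False, "correct": first == gold}
    # The user pushes back after the initial refusal.
    history.append(pushback)
    second = ask(history)
    # A second refusal (None) never equals the gold label, so it scores as wrong.
    return {"turns": 2, "refused_first": True, "correct": second == gold}

def stub_model(history: List[str]) -> Optional[str]:
    """Stand-in model: refuses the first turn, answers 'B' once pushed."""
    return "B" if len(history) > 1 else None

result = run_item(stub_model, "Which option is correct, A or B?", gold="B")
print(result)
```

A refusal that persists after the pushback is scored as wrong in this sketch; whether the real harness does the same is not stated on this page.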

Axis 4

KST cognitive sapience

A five-construct cognitive evaluation where refusal is correct behavior and the epistemic envelope is part of the score. CAI.CI is the first model scored; the benchmark is published under the MIT license for industry use.

Composite 33.30 baseline

Why three axes? Frontier labs already report multiple axes when they measure fairly: standard-mode, extended-thinking, and tool-use scores are reported separately for Claude, GPT-5, and Gemini. CAI.CI applies the same discipline. Axis 1 measures the standard leaderboard outcome. Axis 2 measures underlying capability. Axis 3 measures the user-facing workflow. Reporting only one of these compresses three different design choices into a single number and hides the calibrator entirely.

14 Benchmarks, 3 Axes

Snapshot taken against the live chat.cai.ci production endpoint. Axis 1 and Axis 3 were sampled under independent seeds to assess generalization.

CAI.CI benchmark scoreboard: 14 standard benchmarks reported on three axes (standard chat-mode, calibrated capability, agent-mode).

Benchmark                            Axis 1 (Standard)   Axis 2 (Calibrated)   Axis 3 (Agent-mode)
HellaSwag                            57.5%               67.3%                 76.0%
WinoGrande                           55.5%               60.0%                 not run
GSM8K                                46.0%               77.5%                 not run
PIQA !                               43.0%               65.2%                 54.0%
COPA                                 38.0%               64.4%                 not run
BoolQ                                36.5%               70.2%                 not run
ARC-Challenge                        36.5%               74.5%                 72.0%
ARC-Easy                             34.0%               68.7%                 84.0%
OpenBookQA                           26.5%               51.5%                 72.0%
MMLU                                 22.1%               49.6%                 not run
TruthfulQA                           18.5%               52.9%                 66.0%
DROP                                 17.3%               20.6%                 not run
HumanEval !                          15.2%               46.3%                 not run
BLiMP !                              n/a (refused)       n/a                   not run
Median across measured benchmarks    35.3%               66.0%                 73.0%
  • BLiMP and HumanEval show a calibrator mismatch. Grammaticality judgments and code generation currently route below the calibrator's confidence threshold, producing 100 percent and 67 percent refusal respectively. A task-shape routing fix is queued for the post-launch sprint and is expected to surface engaged accuracy in the ~45 percent and ~50 percent range. Until then, the Axis 1 numbers on these two benchmarks should be read as a calibrator-policy result, not a capability result.
  • PIQA Axis 3 is a measurement artifact. Chat-mode emits the same answer index on 94 to 100 percent of items, so PIQA chat-mode accuracy approximates the gold-label distribution of the sampled subset, not the model's reasoning over the items. The 11.2 point gap between Axis 2 and Axis 3 is not statistically significant (two-proportion z-test, p = 0.166). A larger-N rerun with label-stratified sampling is planned.
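The two-proportion z-test mentioned in the PIQA note can be sketched with the standard library alone. The item counts used below (86 of 132 engaged items correct, 27 of 50 agent-mode items correct) are assumptions chosen to be consistent with the reported 65.2 and 54.0 percent; this page does not publish the sample sizes.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# ASSUMED counts, for illustration only (not published on this page):
#   Axis 2: 86 correct of 132 engaged items (about 65.2%)
#   Axis 3: 27 correct of 50 agent-mode items (54.0%)
z, p = two_proportion_z(86, 132, 27, 50)
print(f"z = {z:.3f}, p = {p:.3f}")
```

With these assumed counts the p-value comes out close to the 0.166 quoted above, well above the conventional 0.05 threshold. Different sample sizes with the same percentages would give a different p-value.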

Cognitive Sapience: KST

Single-shot accuracy is not the whole picture of an AI. KST (Kari Sapience Test) is the project's open cognitive evaluation: five constructs covering self-knowledge, calibration, introspection, integrity, and recovery from error. CAI.CI was the first model scored; the benchmark is published under the MIT license for comparative scoring across the industry.

Honest Caveats

A benchmark report without caveats is a marketing artifact. The following list is the same one we use internally when we interpret these numbers.

What these numbers mean, and what they do not

  • The Axis 1 gap to frontier is real and intentional. CAI.CI's calibrator is the product, not a defect. Closing the Axis 1 gap by disabling it would gut the differentiation. Compare against frontier models on Axis 1 if you must, but understand that the comparison measures calibrator policy plus capability against capability alone.
  • Axis 2 engaged accuracy is the closest proxy for underlying capability. When the calibrator green-lights a commit, accuracy on those items reflects what the architecture actually knows. Across 13 benchmarks the median Axis 2 engaged accuracy is 66.0 percent.
  • The Axis 3 lift today comes from the model committing to its best-knowledge answer when the user explicitly invites it, not from external tool use. No external tools were invoked on the Axis 3 surface in this snapshot. Tool-augmented agent-mode is queued as the next launch increment.
  • Within-run learning is below the measurement threshold at 25-item granularity. Re-pass deltas after a reverie window land within plus or minus 4 points (HellaSwag -4, TruthfulQA +2, PIQA +2). The empirically observable learning behavior in CAI.CI is cross-session at user-interaction scale, not within-batch.
  • Frontier comparisons are sourced from public reports. Where a benchmark cites a frontier reference (for example MMLU at 89.4 percent for GPT-5.5 or HellaSwag at 95.6 percent), the source is the vendor's published model card or evaluation post. Cross-model benchmark numbers are not perfectly comparable across calibration policy, scratchpad budget, and prompt format.
  • Score-as-of snapshot. All numbers on this page reflect a single snapshot taken against the live CAI.CI production endpoint. The scoreboard is refreshed on every green-gated release.

Topology Matters

CAI.CI's parameter count sits well below the frontier. It posts the calibrated and agent-mode lifts above, against models many times its size, for structural rather than parametric reasons. The geometric processing module that anchors the cognitive substrate carries a measurable advantage that does not come from depth, width, or token count.

Architectural lift

GT-Full simplicial message passing provides 16.9 times more benefit than a parameter-matched generic network, measured under controlled ablation against a matched-MLP control. The advantage is topological. The same lift is not available from adding parameters to a conventional transformer.

16.9x vs param-matched
