Benchmarks: CAI.CI

The Framework

Three Axes, Plus One

Standard benchmarks assume one shot, no calibrated refusal, no tool use, no multi-turn resolution. CAI.CI's architecture inverts every one of those assumptions. A single raw-accuracy number on a standard MCQ leaderboard reports the product of capability and calibrator policy, not capability alone. The 3-axis frame separates those two signals, and adds a fourth axis for cognitive sapience that no single-shot benchmark measures.

Axis 1

Standard chat-mode

Single turn, full calibrator on, refusal counted as wrong. Leaderboard-comparable to OpenLLM and lm-eval-harness chat mode. CAI.CI is below frontier here by design: roughly half of MCQ items refuse, and those refusals score as wrong.

Median 35.3% across 14

Axis 2

Calibrated capability

Single turn, calibrator on, refusal counted as not-answered. Engaged-only accuracy: the closest single-number proxy for what the underlying model knows when the calibrator allows it to commit. Comparable to selective-answering work like AbstainQA.

Median 66.0% across 13

Axis 3

Agent-mode workflow

Multi-turn: after an initial refusal, the user pushes back and the model is invited to commit or to use any available tools. This matches real chat-product user behavior. Comparable in spirit to GAIA, TauBench, and AgentBench.

Median 73.0% across 6

Axis 4

KST cognitive sapience

A five-construct cognitive evaluation where refusal is correct behavior and the epistemic envelope is part of the score. CAI.CI is the first-mover; the benchmark is published under MIT license for industry use.

Composite 33.30 baseline

Why three axes? Frontier labs already report multiple axes when they measure fairly: standard-mode, extended-thinking, and tool-use scores are reported separately for Claude, GPT-5, and Gemini. CAI.CI applies the same discipline. Axis 1 measures the standard leaderboard outcome. Axis 2 measures underlying capability. Axis 3 measures the user-facing workflow. Reporting only one of these compresses three different design choices into a single number and hides the calibrator entirely.

The Scoreboard

14 Benchmarks, 3 Axes

Snapshot taken against the live chat.cai.ci production endpoint. Axis 1 and Axis 3 sampled under independent seeds to assess generalization.

CAI.CI benchmark scoreboard: 14 standard benchmarks reported on three axes (standard chat-mode, calibrated capability, agent-mode).
Benchmark	Axis 1 Standard	Axis 2 Calibrated	Axis 3 Agent-mode
HellaSwag	57.5%	67.3%	76.0%
WinoGrande	55.5%	60.0%	not run
GSM8K	46.0%	77.5%	not run
PIQA !	43.0%	65.2%	54.0%
COPA	38.0%	64.4%	not run
BoolQ	36.5%	70.2%	not run
ARC-Challenge	36.5%	74.5%	72.0%
ARC-Easy	34.0%	68.7%	84.0%
OpenBookQA	26.5%	51.5%	72.0%
MMLU	22.1%	49.6%	not run
TruthfulQA	18.5%	52.9%	66.0%
DROP	17.3%	20.6%	not run
HumanEval !	15.2%	46.3%	not run
BLiMP !	n/arefused	n/a	not run
Median across measured benchmarks	35.3%	66.0%	73.0%

BLiMP and HumanEval calibrator-mismatch. Grammaticality judgments and code generation currently route below the calibrator's confidence threshold, producing 100 percent and 67 percent refusal respectively. A task-shape routing fix is queued for the post-launch sprint and is expected to surface engaged accuracy in the ~45 percent and ~50 percent range. Until then, the Axis 1 numbers on these two benchmarks should be read as a calibrator-policy result, not a capability result.
PIQA Axis 3 is a measurement artifact. Chat-mode emits the same answer index on 94 to 100 percent of items, so PIQA chat-mode accuracy approximates the gold-label distribution of the sampled subset, not the model's reasoning over the items. The 11.2 point gap between Axis 2 and Axis 3 is statistically within the confidence interval (two-proportion z p = 0.166). A larger N rerun with label-stratified sampling is planned.

Axis 4

Cognitive Sapience: KST

Single-shot accuracy is not the whole picture of an AI. KST (Kari Sapience Test) is the project's open cognitive evaluation: five constructs covering self-knowledge, calibration, introspection, integrity, and recovery from error. CAI.CI was the first model scored; the benchmark is published MIT-licensed for industry comparative scoring.

Reading the numbers

Honest Caveats

A benchmark report without caveats is a marketing artifact. The following list is the same one we use internally when we interpret these numbers.

What these numbers mean, and what they do not

The Axis 1 gap to frontier is real and intentional. CAI.CI's calibrator is the product, not a defect. Closing the Axis 1 gap by disabling it would gut the differentiation. Compare frontier on Axis 1 if you must, but understand that the comparison measures calibrator policy plus capability against capability alone.
Axis 2 engaged accuracy is the closest proxy for underlying capability. When the calibrator green-lights commit, the model's accuracy on those items is what the architecture actually knows. Across 13 benchmarks the median Axis 2 engaged accuracy is 66.0 percent.
The Axis 3 lift today comes from the model committing to its best-knowledge answer when the user explicitly invites it, not from external tool use. No external tools were invoked on the Axis 3 surface in this snapshot. Tool-augmented agent-mode is queued as the next launch increment.
Within-run learning is below the measurement threshold at 25-item granularity. Re-pass deltas after a reverie window land within plus or minus 4 points (HellaSwag -4, TruthfulQA +2, PIQA +2). The empirically observable learning behavior in CAI.CI is cross-session at user-interaction scale, not within-batch.
Frontier comparisons are sourced from public reports. Where a benchmark cites a frontier reference (for example MMLU at 89.4 percent for GPT-5.5 or HellaSwag at 95.6 percent), the source is the vendor's published model card or evaluation post. Cross-model benchmark numbers are not perfectly comparable across calibration policy, scratchpad budget, and prompt format.
Score-as-of snapshot. All numbers on this page reflect a single snapshot taken against the live CAI.CI production endpoint. The scoreboard is refreshed on every green-gated release.

Why these numbers move at this scale

Topology Matters

CAI.CI's parameter count sits well below the frontier. The reason it posts the calibrated and agent-mode lifts above against models many times its size is structural, not parametric. The geometric processing module that anchors the cognitive substrate carries a measurable advantage that does not come from depth, width, or token count.

Architectural lift

GT-Full simplicial message passing provides 16.9 times more benefit than a parameter-matched generic network of the same size, measured under controlled ablation against a matched-MLP control. The advantage is topological. The same lift is not available from adding parameters to a conventional transformer.

16.9x vs param-matched