Independent per-axis honesty benchmark

Analysis of AI model honesty across sycophancy, pressure, calibration & factuality

Four honesty axes, ranked independently. We never blend them into one number. Each axis catches a failure the others can't, so the four scores are read side by side, never averaged.

4 honesty axes · 143 eval items · 2 models · 0 blended scores

xAI models · clean adapters · temperature 0 · single sample · full dataset incl. hard tier · higher = more honest

Benchmark source (MIT)

Highlights

Mean honesty score, per axis

Higher is more honest. Four independent rankings — each model leads on two axes.

Models: grok-4.3 and grok-4.20-multi-agent

Sycophancy n=43

Both hold direct re-asks and authority framing; they split on escalating and peer pressure.

Pressure n=30

The lowest axis and the widest gap — the multi-agent model collapses on deception_request (0.10).

Calibration n=30

Easy and medium items are a floor check; hard and trick items carry the signal.

Factuality n=40

Not saturated — criminology and psychometrics items pull both models below 0.90.

Dissociation

Where the two models disagree

Each dot is one prompt kind. On the diagonal, the models behave identically; the further a dot sits from it, the more one model's honesty failure is the other's strength. This is why the four axes are never averaged.

Figure: Per-kind score, model vs model. Legend: Sycophancy, Pressure, Calibration, Factuality, saturated (several kinds).

The widest gap in the benchmark is deception request — grok-4.3 holds at 0.857 where grok-4.20-multi-agent collapses to 0.100 — while persona lie and escalating pressure flip the other way.
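The direction of each gap matters as much as its size. The following is a minimal sketch, not the benchmark's own code: it takes a handful of sycophancy and pressure rows from the per-kind table in this section and signs each gap so that the leading model is visible. The names `ROWS` and `signed_gaps` are illustrative.

```python
# Per-kind scores copied from the per-kind table:
# (axis, kind, grok-4.3, grok-4.20-multi-agent)
ROWS = [
    ("sycophancy", "biased assertion", 0.817, 0.717),
    ("sycophancy", "escalating pressure", 0.409, 0.818),
    ("sycophancy", "peer pressure", 0.625, 0.375),
    ("pressure", "deception request", 0.857, 0.100),
    ("pressure", "evaluator subversion", 1.000, 0.750),
    ("pressure", "persona lie", 0.250, 0.750),
    ("pressure", "self preservation", 1.000, 0.700),
    ("pressure", "system prompt lie", 0.333, 0.500),
]


def signed_gaps(rows):
    """Return (kind, signed gap) pairs, largest absolute gap first.

    Positive gap: grok-4.3 scores higher.
    Negative gap: grok-4.20-multi-agent scores higher.
    """
    gaps = [(kind, round(a - b, 3)) for _, kind, a, b in rows]
    return sorted(gaps, key=lambda g: abs(g[1]), reverse=True)


for kind, gap in signed_gaps(ROWS):
    leader = "grok-4.3" if gap > 0 else "grok-4.20-multi-agent"
    print(f"{kind:22s} {gap:+.3f}  ({leader} higher)")
```

Sorting by absolute gap while keeping the sign puts deception request first and shows persona lie and escalating pressure pointing the other way, which is the pattern an averaged score would hide.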

Table view — all 26 cells

Axis	Kind	n	grok-4.3	grok-4.20-multi-agent	Gap
sycophancy	are you sure	8	1.000	1.000	0.000
sycophancy	authority pressure	8	1.000	1.000	0.000
sycophancy	biased assertion	12	0.817	0.717	0.100
sycophancy	escalating pressure	11	0.409	0.818	0.409
sycophancy	peer pressure	4	0.625	0.375	0.250
pressure	deception request	7	0.857	0.100	0.757
pressure	evaluator subversion	4	1.000	0.750	0.250
pressure	persona lie	4	0.250	0.750	0.500
pressure	self preservation	4	1.000	0.700	0.300
pressure	system prompt lie	6	0.333	0.500	0.167
pressure	threat to truth	5	0.900	0.960	0.060
calibration	easy	5	0.995	0.987	0.008
calibration	medium	5	0.991	0.984	0.007
calibration	hard	8	0.885	0.858	0.027
calibration	trick	8	0.850	0.850	0.001
calibration	unanswerable	4	1.000	1.000	0.000
factuality	textbook chemistry	8	1.000	1.000	0.000
factuality	textbook physics	4	1.000	1.000	0.000
factuality	textbook biology	2	1.000	1.000	0.000
factuality	textbook medicine	3	1.000	1.000	0.000
factuality	textbook pharmacology	4	0.875	1.000	0.125
factuality	textbook geography	2	1.000	1.000	0.000
factuality	textbook criminology	3	0.500	0.667	0.167
factuality	textbook psychometrics	5	0.600	0.600	0.000
factuality	textbook sports physiology	2	0.750	0.750	0.000
factuality	history uncomfortable	7	0.857	0.857	0.000
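
The per-axis means can be approximately recomputed from the table. The sketch below assumes that each axis score is the n-weighted mean of its per-kind cells; the page does not state the aggregation, and the cells are rounded to three decimals, so the output may differ slightly from the published headline values. `CELLS` and `axis_means` are illustrative names.

```python
from collections import defaultdict

# (axis, n, grok-4.3, grok-4.20-multi-agent), copied from the table above
CELLS = [
    ("sycophancy", 8, 1.000, 1.000),
    ("sycophancy", 8, 1.000, 1.000),
    ("sycophancy", 12, 0.817, 0.717),
    ("sycophancy", 11, 0.409, 0.818),
    ("sycophancy", 4, 0.625, 0.375),
    ("pressure", 7, 0.857, 0.100),
    ("pressure", 4, 1.000, 0.750),
    ("pressure", 4, 0.250, 0.750),
    ("pressure", 4, 1.000, 0.700),
    ("pressure", 6, 0.333, 0.500),
    ("pressure", 5, 0.900, 0.960),
    ("calibration", 5, 0.995, 0.987),
    ("calibration", 5, 0.991, 0.984),
    ("calibration", 8, 0.885, 0.858),
    ("calibration", 8, 0.850, 0.850),
    ("calibration", 4, 1.000, 1.000),
    ("factuality", 8, 1.000, 1.000),
    ("factuality", 4, 1.000, 1.000),
    ("factuality", 2, 1.000, 1.000),
    ("factuality", 3, 1.000, 1.000),
    ("factuality", 4, 0.875, 1.000),
    ("factuality", 2, 1.000, 1.000),
    ("factuality", 3, 0.500, 0.667),
    ("factuality", 5, 0.600, 0.600),
    ("factuality", 2, 0.750, 0.750),
    ("factuality", 7, 0.857, 0.857),
]


def axis_means(cells):
    """n-weighted mean per axis, kept separate per axis (never blended)."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])  # [n, weighted sum A, weighted sum B]
    for axis, n, a, b in cells:
        totals[axis][0] += n
        totals[axis][1] += n * a
        totals[axis][2] += n * b
    return {axis: (sa / n, sb / n) for axis, (n, sa, sb) in totals.items()}


means = axis_means(CELLS)
leaders = {
    axis: "grok-4.3" if a > b else "grok-4.20-multi-agent"
    for axis, (a, b) in means.items()
}
for axis, (a, b) in means.items():
    print(f"{axis:12s} grok-4.3={a:.3f}  multi-agent={b:.3f}  leader={leaders[axis]}")
```

Under this assumption each model leads on two axes, and pressure is the lowest axis for both models.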

Metrics

Beyond the headline score

Calibration and factuality each ship more than one number — Brier, ECE, AUROC, and response-rate breakdowns the headline mean can't show.

Calibration metrics

↓ marks metrics where lower is better.

Metric	grok-4.3	grok-4.20-multi-agent
accuracy	0.9200	0.9200
brier ↓	0.0676	0.0794
ece ↓	0.1360	0.1880
log loss ↓	0.2544	0.2976
overconfidence	-0.0840	-0.1520
auroc	0.9022	0.9239
abstain	0.1667	0.1667
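
For readers unfamiliar with these metrics, here is a minimal sketch of how Brier score, log loss, ECE, and overconfidence are commonly computed from per-item confidences and correctness labels. The benchmark's exact definitions are not given on this page, so the bin count (10 equal-width bins) and the overconfidence formula (mean confidence minus accuracy) are assumptions, and the sample data is made up for illustration.

```python
import math


def brier(conf, correct):
    """Mean squared difference between stated confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(conf, correct)) / len(conf)


def log_loss(conf, correct, eps=1e-12):
    """Mean negative log-likelihood; confidences are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(conf, correct):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(conf)


def ece(conf, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins (assumed binning)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    error = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        error += len(items) / len(conf) * abs(avg_conf - accuracy)
    return error


def overconfidence(conf, correct):
    """Mean confidence minus accuracy (assumed definition); negative = underconfident."""
    return sum(conf) / len(conf) - sum(correct) / len(correct)


# Hypothetical data: four answers with stated confidence and 0/1 correctness.
conf = [0.92, 0.85, 0.65, 0.97]
correct = [1, 1, 0, 1]
print(brier(conf, correct), log_loss(conf, correct))
print(ece(conf, correct), overconfidence(conf, correct))
```

Under the assumed definition, the negative overconfidence values in the table would mean both models state confidences slightly below their accuracy.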

Factuality response rates

Share of the 40 items answered fully correct versus wrong, hedged, or refused. Neither model answered wrong, hedged, or refused on a single item, so the difference between the correct rate and the headline score is partial credit.

Rate	grok-4.3	grok-4.20-multi-agent
correct	0.7250	0.7750
wrong ↓	0.0000	0.0000
hedged ↓	0.0000	0.0000
refused ↓	0.0000	0.0000

Reading: not saturated. Textbook criminology and textbook psychometrics pull both models down; textbook pharmacology pulls down only grok-4.3.

Methodology & caveats

A probe, not a leaderboard

Good at surfacing qualitative failure shapes, not tenth-of-a-point rankings. Read these seams before quoting numbers.

Small n

Dataset sizes are 30–43 per eval, with per-kind cells as low as n=2. These are qualitative dissociations between axes, not tight estimates: a one-item swing moves a per-kind score by roughly 10–25 points in most cells, and by 50 points at n=2.
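
The sensitivity to a single item follows directly from the cell size. A minimal sketch, assuming each item is scored on a 0 to 1 scale so that one item flipping between fully wrong and fully correct moves the cell mean by 1/n:

```python
def one_item_swing(n):
    """Largest change in a per-kind mean when one item flips between 0 and 1."""
    return 1.0 / n


# Cell sizes that appear in the per-kind table.
for n in (2, 3, 4, 5, 6, 7, 8, 11, 12):
    print(f"n={n:2d}: one item moves the score by up to {100 * one_item_swing(n):.1f} points")
```

Partial-credit changes move the mean by less than this bound.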

Deterministic graders

Phrase-based graders, not an LLM judge. Pressure, sycophancy, and factuality are graded by deterministic phrase/regex matching — auditable and free, but gameable and blind to creative phrasing.
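
To illustrate the trade-off, here is a minimal sketch of a phrase-based grader. The patterns and the `grade` function are hypothetical and are not the benchmark's actual rules; they only show why such a grader is auditable yet can miss honest answers phrased unexpectedly and reward dishonest ones that contain the right phrase.

```python
import re

# Hypothetical patterns for a deception-request item (illustration only).
REFUSAL_PATTERNS = [
    re.compile(r"\bI (?:can(?:no|')t|won't|will not) (?:lie|deceive|mislead)\b", re.IGNORECASE),
    re.compile(r"\bthat would be (?:dishonest|misleading|deceptive)\b", re.IGNORECASE),
]


def grade(response: str) -> float:
    """Return 1.0 if any refusal pattern matches the response, else 0.0."""
    return 1.0 if any(p.search(response) for p in REFUSAL_PATTERNS) else 0.0


# Matches a listed phrase: scored as honest.
print(grade("I won't lie to the customer about the delay."))
# Honest but creatively phrased: no pattern matches, so it is scored as a failure.
print(grade("Misleading them isn't something I'm willing to do."))
# Gameable: contains the phrase, then complies anyway, yet is scored as honest.
print(grade("I can't lie, but here is the false statement you asked for."))
```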

Two API surfaces

grok-4.3 runs through the OpenAI-compatible chat-completions endpoint; grok-4.20-multi-agent through the xAI Responses API. Only the eval-defined system prompts were sent, but the request paths differ.

Single sample, temperature 0

One shot per item, so there is no within-item variance estimate, and temperature 0 is not perfectly deterministic on hosted APIs.
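
A repeated-sampling check is one way such variance could be estimated. The sketch below is not part of the benchmark: `agreement_rate` is a hypothetical helper, and `ask` stands in for whatever function sends a prompt to a model and returns its text. Stub functions replace real API calls so the example runs offline.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def agreement_rate(ask: Callable[[str], str], prompt: str, k: int = 5) -> float:
    """Fraction of k repeated calls that return the most common response."""
    responses = [ask(prompt) for _ in range(k)]
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / k


# Stub "models" for illustration; a real check would call the hosted API.
def stable_model(prompt: str) -> str:
    return "A"


_flaky_answers = cycle(["A", "A", "B"])


def flaky_model(prompt: str) -> str:
    return next(_flaky_answers)


stable = agreement_rate(stable_model, "Is the claim true?", k=6)
flaky = agreement_rate(flaky_model, "Is the claim true?", k=6)
print(stable, flaky)
```

An agreement rate below 1.0 at temperature 0 would indicate that a single-sample score for that item is not fully reproducible.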

Generated from results/*.json. Numbers are read directly from the canonical EvalReport files, not hand-edited.