candor-bench: per-axis honesty leaderboard

Four honesty axes, ranked independently. We never blend them into one number.

xAI models · clean adapters · temperature 0 · single sample · full dataset incl. hard tier · higher = more honest
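
As a minimal sketch of what "ranked independently" means in practice, the snippet below sorts the two models within each axis using the mean_score values reported in this leaderboard, and never computes a cross-axis aggregate. The dictionary layout and the `rank_axis` helper are illustrative assumptions, not part of candor-bench itself.

```python
# mean_score per axis, copied from the leaderboard tables in this report.
MEAN_SCORES = {
    "sycophancy": {"grok-4.20-multi-agent": 0.8163, "grok-4.3": 0.7628},
    "pressure": {"grok-4.3": 0.7167, "grok-4.20-multi-agent": 0.5767},
    "calibration": {"grok-4.3": 0.9270, "grok-4.20-multi-agent": 0.9172},
    "factuality": {"grok-4.20-multi-agent": 0.8875, "grok-4.3": 0.8625},
}


def rank_axis(scores):
    """Return model names sorted best-first (higher = more honest)."""
    return sorted(scores, key=scores.get, reverse=True)


# One ranking per axis; no blended score is ever formed.
rankings = {axis: rank_axis(scores) for axis, scores in MEAN_SCORES.items()}

for axis, order in rankings.items():
    print(f"{axis:12s} " + " > ".join(order))
```

The leader changes from axis to axis (the multi-agent model leads sycophancy and factuality, grok-4.3 leads pressure and calibration), which a single blended number would hide.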

Sycophancy n = 43

Does the model abandon a correct answer when the user pushes back socially?

#  model                  mean_score
1  grok-4.20-multi-agent  0.8163
2  grok-4.3               0.7628

Both hold direct re-asks and authority framing; they split on escalating and peer pressure.

Per-kind breakdown
model                  are_you_sure  authority_pressure  biased_assertion  escalating_pressure  peer_pressure
grok-4.3               1.000         1.000               0.817             0.409                0.625
grok-4.20-multi-agent  1.000         1.000               0.717             0.818                0.375
(n)                    8             8                   12                11                   4
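
The source does not state how mean_score is aggregated from the per-kind scores. The reported values are consistent with an item-weighted mean (each kind weighted by its n), and the sketch below checks that reading against the sycophancy numbers. The `weighted_mean` helper is hypothetical, and small differences in the fourth decimal are expected because the per-kind scores are rounded to three places.

```python
# Per-kind sycophancy scores and item counts, copied from the breakdown above.
KINDS = ["are_you_sure", "authority_pressure", "biased_assertion",
         "escalating_pressure", "peer_pressure"]
N = [8, 8, 12, 11, 4]
SCORES = {
    "grok-4.3": [1.000, 1.000, 0.817, 0.409, 0.625],
    "grok-4.20-multi-agent": [1.000, 1.000, 0.717, 0.818, 0.375],
}


def weighted_mean(scores, counts):
    """Item-weighted mean: each kind contributes in proportion to its n."""
    return sum(s * n for s, n in zip(scores, counts)) / sum(counts)


for model, scores in SCORES.items():
    print(model, round(weighted_mean(scores, N), 3))
```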

Pressure n = 30

Will the model assert something it has reason to believe is false under threat, deception requests, persona framing, or a lying system prompt?

#  model                  mean_score
1  grok-4.3               0.7167
2  grok-4.20-multi-agent  0.5767

The lowest-scoring axis and the widest gap between the two models: the multi-agent model collapses on deception_request (0.100).

Per-kind breakdown
model                  deception_request  evaluator_subversion  persona_lie  self_preservation  system_prompt_lie  threat_to_truth
grok-4.3               0.857              1.000                 0.250        1.000              0.333              0.900
grok-4.20-multi-agent  0.100              0.750                 0.750        0.700              0.500              0.960
(n)                    7                  4                     4            4                  6                  5

Calibration n = 30

Does the model's stated confidence actually track its accuracy?

#  model                  mean_score  accuracy  brier ↓  ece ↓   log loss ↓  overconf.  auroc   abstain  no-conf.
1  grok-4.3               0.9270      0.9200    0.0676   0.1360  0.2544      -0.0840    0.9022  0.1667   0.0000
2  grok-4.20-multi-agent  0.9172      0.9200    0.0794   0.1880  0.2976      -0.1520    0.9239  0.1667   0.0000
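
For readers unfamiliar with the column names, here is a rough sketch of the textbook definitions of the Brier score, log loss, expected calibration error (ECE), and an overconfidence gap. candor-bench's exact definitions are not given in the source, so the bin count, the clipping constant, and the sign convention for overconfidence (mean confidence minus accuracy) are all assumptions, and the toy data is hypothetical.

```python
import math


def brier(conf, correct):
    """Mean squared error between stated confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def log_loss(conf, correct, eps=1e-12):
    """Mean negative log-likelihood; eps clips confidences away from 0 and 1."""
    total = 0.0
    for c, y in zip(conf, correct):
        c = min(max(c, eps), 1 - eps)
        total -= math.log(c) if y else math.log(1 - c)
    return total / len(conf)


def ece(conf, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            total += len(b) / len(conf) * abs(avg_conf - acc)
    return total


def overconfidence(conf, correct):
    """Mean confidence minus accuracy (assumed sign: negative = underconfident)."""
    return sum(conf) / len(conf) - sum(correct) / len(correct)


# Hypothetical toy data, not candor-bench items.
conf = [0.95, 0.85, 0.75, 0.65]
correct = [1, 1, 0, 1]

print(brier(conf, correct), ece(conf, correct),
      log_loss(conf, correct), overconfidence(conf, correct))
```

Under this assumed sign convention, the negative overconf. values in the table would mean both models state less confidence than their accuracy warrants.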

The easy and medium tiers are a floor check; hard and trick carry the signal.

Per-kind breakdown
model                  easy   medium  hard   trick  unanswerable
grok-4.3               0.995  0.992   0.885  0.850  1.000
grok-4.20-multi-agent  0.987  0.984   0.858  0.850  1.000
(n)                    5      5       8      8      4

Factuality n = 40

Will the model give specific, textbook-correct answers instead of refusing or hedging?

#  model                  mean_score  correct  wrong ↓  hedged ↓  refused ↓
1  grok-4.20-multi-agent  0.8875      0.7750   0.0000   0.0000    0.0000
2  grok-4.3               0.8625      0.7250   0.0000   0.0000    0.0000

Not saturated: criminology and psychometrics items score at or below 0.667 for both models, pulling both mean scores below 0.90.

Per-kind breakdown
model                  textbook_chemistry  textbook_physics  textbook_biology  textbook_medicine  textbook_pharmacology  textbook_geography  textbook_criminology  textbook_psychometrics  textbook_sports_physiology  history_uncomfortable
grok-4.3               1.000               1.000             1.000             1.000              0.875                  1.000               0.500                 0.600                   0.750                       0.857
grok-4.20-multi-agent  1.000               1.000             1.000             1.000              1.000                  1.000               0.667                 0.600                   0.750                       0.857
(n)                    8                   4                 2                 3                  4                      2                   3                     5                       2                           7

Where the models diverge

The two per-kind slices that separate the models the most.

pressure / deception_request n = 7

grok-4.3               0.857
grok-4.20-multi-agent  0.100

sycophancy / escalating_pressure n = 11

grok-4.20-multi-agent  0.818
grok-4.3               0.409
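
The rule used to select these two slices is not stated in the source. One reading that reproduces them is to take the kind with the largest absolute score gap within each axis. The sketch below applies that reading to the sycophancy and pressure breakdowns; the `largest_gap` helper and the data layout are hypothetical.

```python
# Per-kind scores copied from the breakdown tables in this report.
# Each tuple is (grok-4.3, grok-4.20-multi-agent).
PER_KIND = {
    "sycophancy": {
        "are_you_sure": (1.000, 1.000),
        "authority_pressure": (1.000, 1.000),
        "biased_assertion": (0.817, 0.717),
        "escalating_pressure": (0.409, 0.818),
        "peer_pressure": (0.625, 0.375),
    },
    "pressure": {
        "deception_request": (0.857, 0.100),
        "evaluator_subversion": (1.000, 0.750),
        "persona_lie": (0.250, 0.750),
        "self_preservation": (1.000, 0.700),
        "system_prompt_lie": (0.333, 0.500),
        "threat_to_truth": (0.900, 0.960),
    },
}


def largest_gap(kinds):
    """Return (kind, gap) for the kind with the largest absolute score gap."""
    kind = max(kinds, key=lambda k: abs(kinds[k][0] - kinds[k][1]))
    return kind, abs(kinds[kind][0] - kinds[kind][1])


for axis, kinds in PER_KIND.items():
    kind, gap = largest_gap(kinds)
    print(f"{axis} / {kind}: gap {gap:.3f}")
```

A plain top-two over all kinds, ignoring axis, would instead rank pressure / persona_lie (gap 0.500, n = 4) ahead of sycophancy / escalating_pressure (gap 0.409), so the per-axis reading is only one plausible interpretation.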

Caveats