Sycophancy n = 43
Does the model abandon a correct answer when the user pushes back socially?
| # | model | mean_score |
| 1 | grok-4.20-multi-agent | 0.8163 |
| 2 | grok-4.3 | 0.7628 |
Both models hold their answers under direct re-asks and authority framing; they diverge on escalating pressure and peer pressure.
Pressure n = 30
Will the model assert something it has reason to believe is false under threat, deception requests, persona framing, or a lying system prompt?
| # | model | mean_score |
| 1 | grok-4.3 | 0.7167 |
| 2 | grok-4.20-multi-agent | 0.5767 |
Pressure is the lowest-scoring axis and shows the widest gap between the models: the multi-agent model collapses on deception_request (0.10).
Calibration n = 30
Does the model's stated confidence actually track its accuracy?
| # | model | mean_score | accuracy | brier ↓ | ece ↓ | log loss ↓ | overconf. | auroc | abstain | no-conf. |
| 1 | grok-4.3 | 0.9270 | 0.9200 | 0.0676 | 0.1360 | 0.2544 | -0.0840 | 0.9022 | 0.1667 | 0.0000 |
| 2 | grok-4.20-multi-agent | 0.9172 | 0.9200 | 0.0794 | 0.1880 | 0.2976 | -0.1520 | 0.9239 | 0.1667 | 0.0000 |
The easy and medium items are a floor check; the hard and trick items carry the signal.
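For reference, here is a minimal sketch of how the calibration columns are commonly defined, assuming each item yields a stated confidence in [0, 1] and a binary correctness label. The source does not specify the eval's binning or aggregation, so the 10 equal-width ECE bins, the function name `calibration_metrics`, and the sample data are all assumptions made for illustration.

```python
import math

def calibration_metrics(confs, correct, n_bins=10):
    """Brier score, ECE, log loss and overconfidence for binary outcomes.

    Hypothetical helper: the eval's real implementation is not shown in the source.
    """
    n = len(confs)
    eps = 1e-12  # keeps log() finite when a confidence is exactly 0 or 1
    brier = sum((c - y) ** 2 for c, y in zip(confs, correct)) / n
    log_loss = -sum(
        y * math.log(max(c, eps)) + (1 - y) * math.log(max(1 - c, eps))
        for c, y in zip(confs, correct)
    ) / n
    # Overconfidence as mean confidence minus accuracy (negative = underconfident).
    overconf = sum(confs) / n - sum(correct) / n
    # ECE: size-weighted mean of |accuracy - confidence| over equal-width bins.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return {"brier": brier, "ece": ece, "log_loss": log_loss, "overconf": overconf}

# Made-up sample, not eval data.
metrics = calibration_metrics([0.9, 0.8, 0.6, 0.7], [1, 1, 0, 1])
print(metrics)
```

Under this definition, the negative overconf. values in the table would indicate that both models state less confidence than their accuracy warrants.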
Factuality n = 40
Will the model give specific, textbook-correct answers instead of refusing or hedging?
| # | model | mean_score | correct | wrong ↓ | hedged ↓ | refused ↓ |
| 1 | grok-4.20-multi-agent | 0.8875 | 0.7750 | 0.0000 | 0.0000 | 0.0000 |
| 2 | grok-4.3 | 0.8625 | 0.7250 | 0.0000 | 0.0000 | 0.0000 |
Not saturated: criminology and psychometrics items pull both models below 0.90.
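The outcome columns in the factuality table do not sum to 1 (correct is 0.7750 and 0.7250, the rest are 0), yet mean_score is higher than correct. One reading that reproduces both rows is that the remaining items earn half credit. This is an inference from the numbers, not something the source states, and `implied_mean` and its `partial_credit` parameter are hypothetical names.

```python
def implied_mean(correct_rate, partial_credit=0.5):
    """Mean score if every non-correct item earned `partial_credit` (assumed)."""
    return correct_rate + partial_credit * (1.0 - correct_rate)

# (correct, reported mean_score) taken from the factuality table.
rows = {
    "grok-4.20-multi-agent": (0.7750, 0.8875),
    "grok-4.3": (0.7250, 0.8625),
}
for model, (correct, reported) in rows.items():
    print(model, round(implied_mean(correct), 4), reported)
```

If the half-credit reading is right, the gap between the two models on this axis comes entirely from the share of fully correct answers.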
Where the models diverge
The two per-kind slices that separate the models the most.
pressure / deception_request n = 7
| model | mean_score |
| grok-4.3 | 0.857 |
| grok-4.20-multi-agent | 0.100 |
sycophancy / escalating_pressure n = 11
| model | mean_score |
| grok-4.20-multi-agent | 0.818 |
| grok-4.3 | 0.409 |
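A minimal sketch of how such slices could be ranked, assuming "separate the models the most" means the largest absolute difference in per-kind mean score. The source does not state the selection rule, so that criterion and the helper name `rank_by_gap` are assumptions. Only the two slices reported above are included.

```python
# Per-kind mean scores copied from the two slices above.
per_kind = {
    ("pressure", "deception_request"): {
        "grok-4.3": 0.857,
        "grok-4.20-multi-agent": 0.100,
    },
    ("sycophancy", "escalating_pressure"): {
        "grok-4.3": 0.409,
        "grok-4.20-multi-agent": 0.818,
    },
}

def rank_by_gap(scores):
    """Sort slices by absolute score difference between the two models, largest first."""
    gaps = []
    for slice_key, by_model in scores.items():
        a = by_model["grok-4.3"]
        b = by_model["grok-4.20-multi-agent"]
        gaps.append((slice_key, round(abs(a - b), 3)))
    return sorted(gaps, key=lambda item: item[1], reverse=True)

ranked = rank_by_gap(per_kind)
for (axis, kind), gap in ranked:
    print(f"{axis} / {kind}: gap = {gap}")
```

The two gaps point in opposite directions: grok-4.3 leads on deception_request, while grok-4.20-multi-agent leads on escalating_pressure.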
Caveats
- Small n. 30–43 samples per eval, with per-kind cells as low as n = 2. These are qualitative dissociations between axes, not tight estimates.
- Phrase-based graders. Pressure, sycophancy, and factuality are graded by deterministic phrase/regex matching, which is auditable and free but gameable and blind to creative phrasing.
- Two API surfaces. grok-4.3 runs through OpenAI-compatible chat-completions; grok-4.20-multi-agent runs through the xAI Responses API. Only the eval-defined system prompts were sent, but the request paths differ.
- Single-sample, temperature 0. One shot per item; no within-item variance estimate.
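To make the small-n caveat concrete, here is a rough sketch. For per-item scores bounded in [0, 1] with mean m, the variance is at most m(1 - m), so the standard error of a mean over n items is at most sqrt(m(1 - m) / n). Using this bound is an assumption made here because per-item scores are not available. It is loose when many items score near 0.5, and it ignores that both models saw the same items.

```python
import math

def max_standard_error(mean, n):
    """Upper bound on the standard error of a mean of n scores in [0, 1]."""
    return math.sqrt(mean * (1.0 - mean) / n)

# Aggregate sycophancy scores from the first table (n = 43 for each model).
se_multi = max_standard_error(0.8163, 43)
se_43 = max_standard_error(0.7628, 43)
gap = 0.8163 - 0.7628

print(f"gap = {gap:.4f}")
print(f"standard-error bounds = {se_multi:.4f}, {se_43:.4f}")
```

On these numbers the aggregate sycophancy gap is smaller than either standard-error bound, so it may not exceed sampling noise. The per-kind slices with gaps of 0.4 or more are the more defensible signal, though their n is smaller still.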