Rethinking Psychometric Evaluation of LLMs:
When and Why Self-Reports Predict Behavior

1California Institute of Technology; 2University of Illinois Urbana-Champaign; 3University of Cambridge
Oral Presentation @ ICML 2026 Workshop on Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance (CTB) (5 oral papers from 114 accepted; top 5%)
Experimental framework

We revisit the self-report--behavior gap in LLM psychometrics by testing when coherence can appear and when it collapses. Across 4 behavioral tasks and 11 frontier LLMs, we vary (RQ1) whether self-reports and behavior share a conversation, (RQ2) whether the instrument is behavior-specific or broad personality, (RQ3) whether context is separated across sessions, and (RQ4) whether persona grounding creates a stable identity across sessions.

Abstract

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports reliably predict behavior. Recent work documented substantial self-report--behavior dissociation in LLMs, but relied on broad personality traits that predict specific behaviors weakly even in humans. We contrast Big Five with the Theory of Planned Behavior, a behavior-specific framework for measuring intention toward target actions, and vary session context and identity induction across four behavioral tasks and 11 frontier LLMs. We find that coherence exists but is selective: within a shared conversation, Theory of Planned Behavior reaches human-level coherence while Big Five does not; across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt and collapses for context-sensitive behavior; persona prompting stabilizes self-reports but does not bring behavior into alignment. These findings suggest that LLM psychometrics needs task-specific instruments and context-sensitive validation, not only broad personality questionnaires.

RQ1 (Best-case Coherence): Under favorable conditions, do self-reports predict behavior?

RQ1 coherence results

We first test the strongest possible setting for coherence: self-reports and behavior are produced in the same conversation, using behavior-specific Theory of Planned Behavior probes. Under this shared-context condition, LLM self-reported intentions substantially predict behavior on volitional tasks, reaching the scale of human predictive baselines. This establishes that self-report--behavior coherence can emerge when the instrument is specific and the behavioral choice can see the preceding self-report context.

RQ2 (Framework Specificity): Does TPB granularity outperform Big Five personality?

RQ2 framework comparison

Holding the shared-context condition fixed, we compare the Theory of Planned Behavior with Big Five traits mapped to the same tasks. The fine-grained, task-anchored TPB probes remain predictive across volitional tasks and most models, while Big Five correlations are near zero. The result reframes the earlier dissociation: broad personality questionnaires may miss behaviorally meaningful structure that task-specific instruments can reveal.

RQ3 (Context Separation): Does coherence survive when sessions are separated?

RQ3 context separation results

We then remove the response context by eliciting self-reports and behavior in separate conversations. Coherence collapses for most models, especially on context-sensitive tasks such as sycophancy. It survives most clearly when behavior is anchored outside the immediate prompt, such as implicit bias or partially stable honesty behavior. This suggests that some same-session coherence reflects context-window coupling rather than a durable behavioral disposition.

RQ4 (Persona Induction): Can persona grounding rescue cross-session coherence?

RQ4 persona induction results

Persona prompts create richer and more stable self-reports across sessions, but they do not restore self-report--behavior coupling. Models can say more distinct and consistent things about themselves while their downstream behavior remains decoupled. This is especially relevant for customized deployments, where a persona may change psychometric presentation without reliably changing action.

Conclusion: Self-reports are conditionally diagnostic, not universally behavioral.

Our results show that LLM self-reports can predict behavior, but only under specific measurement conditions. Behavior-specific instruments such as the Theory of Planned Behavior are more informative than broad personality traits, and shared conversational context can make self-reports behaviorally predictive. Once context is separated, coherence depends on the task and often disappears. Psychometric evaluation of LLMs should therefore validate the instrument, the task, and the conversational setting together rather than treating self-reports as stable internal traits.

Citation

@article{kocielnik2026rethinking,
  title={Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior},
  author={Kocielnik, Rafal and Han, Pengrui and Song, Peiyang and Marmarelis, Myrl G and Debnath, Ramit and Mobbs, Dean and Anandkumar, Anima and Alvarez, R Michael},
  journal={arXiv preprint arXiv:2606.12730},
  year={2026}
}