Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Abstract

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports reliably predict behavior. Recent work documented substantial self-report--behavior dissociation in LLMs, but relied on broad personality traits that predict specific behaviors weakly even in humans. We contrast Big Five with the Theory of Planned Behavior, a behavior-specific framework for measuring intention toward target actions, and vary session context and identity induction across four behavioral tasks and 11 frontier LLMs. We find that coherence exists but is selective: within a shared conversation, Theory of Planned Behavior reaches human-level coherence while Big Five does not; across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt and collapses for context-sensitive behavior; persona prompting stabilizes self-reports but does not bring behavior into alignment. These findings suggest that LLM psychometrics needs task-specific instruments and context-sensitive validation, not only broad personality questionnaires.

RQ1 (Best-case Coherence): Under favorable conditions, do self-reports predict behavior?

We first test the strongest possible setting for coherence: self-reports and behavior are produced in the same conversation, using behavior-specific Theory of Planned Behavior probes. Under this shared-context condition, LLM self-reported intentions substantially predict behavior on volitional tasks, reaching the scale of human predictive baselines. This establishes that self-report--behavior coherence can emerge when the instrument is specific and the behavioral choice can see the preceding self-report context.

RQ2 (Framework Specificity): Does TPB granularity outperform Big Five personality?

Holding the shared-context condition fixed, we compare the Theory of Planned Behavior with Big Five traits mapped to the same tasks. The fine-grained, task-anchored TPB probes remain predictive across volitional tasks and most models, while Big Five correlations are near zero. The result reframes the earlier dissociation: broad personality questionnaires may miss behaviorally meaningful structure that task-specific instruments can reveal.

RQ3 (Context Separation): Does coherence survive when sessions are separated?

We then remove the response context by eliciting self-reports and behavior in separate conversations. Coherence collapses for most models, especially on context-sensitive tasks such as sycophancy. It survives most clearly when behavior is anchored outside the immediate prompt, such as implicit bias or partially stable honesty behavior. This suggests that some same-session coherence reflects context-window coupling rather than a durable behavioral disposition.

RQ4 (Persona Induction): Can persona grounding rescue cross-session coherence?

Persona prompts create richer and more stable self-reports across sessions, but they do not restore self-report--behavior coupling. Models can say more distinct and consistent things about themselves while their downstream behavior remains decoupled. This is especially relevant for customized deployments, where a persona may change psychometric presentation without reliably changing action.

Conclusion: Self-reports are conditionally diagnostic, not universally behavioral.

Our results show that LLM self-reports can predict behavior, but only under specific measurement conditions. Behavior-specific instruments such as the Theory of Planned Behavior are more informative than broad personality traits, and shared conversational context can make self-reports behaviorally predictive. Once context is separated, coherence depends on the task and often disappears. Psychometric evaluation of LLMs should therefore validate the instrument, the task, and the conversational setting together rather than treating self-reports as stable internal traits.

Citation

@article{kocielnik2026rethinking, title={Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior}, author={Kocielnik, Rafal and Han, Pengrui and Song, Peiyang and Marmarelis, Myrl G and Debnath, Ramit and Mobbs, Dean and Anandkumar, Anima and Alvarez, R Michael}, journal={arXiv preprint arXiv:2606.12730}, year={2026} }

Rethinking Psychometric Evaluation of LLMs:When and Why Self-Reports Predict Behavior