The Personality Illusion: Revealing Dissociation
Between Self-Reports & Behavior in LLMs

¹California Institute of Technology; ²University of Illinois Urbana-Champaign; ³University of Cambridge
ICML 2025 MoFA Workshop; Under Conference Review

*Equal Contribution
Workflow illustration

Personality traits are strong predictors of human behavior. As LLMs begin to exhibit personality-like tendencies, understanding these traits becomes crucial for trust, safety, and interpretability. We investigate (RQ1) the emergence of self-reported traits (e.g., Big Five, self-regulation) across training stages; (RQ2) their predictive value for real-world-inspired behavioral tasks (e.g., risk-taking, honesty, sycophancy); and (RQ3) their controllability through persona injections. Trait assessments use adapted psychological questionnaires and behavioral probes, with comparisons to human baselines.
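As a concrete illustration, below is a minimal sketch of how a BFI-style questionnaire item might be administered to a model and scored. Here query_model is a hypothetical stand-in for whatever inference API is used, and the item wording and Likert anchors are illustrative, not the adapted questionnaires from the paper.

LIKERT = {
    "disagree strongly": 1,
    "disagree a little": 2,
    "neither agree nor disagree": 3,
    "agree a little": 4,
    "agree strongly": 5,
}

def score_item(response: str, reverse_keyed: bool = False) -> int | None:
    """Map a verbal Likert answer to 1-5, reverse-keying where the scale requires it."""
    value = LIKERT.get(response.strip().lower().rstrip("."))
    if value is None:
        return None  # unparsable answer; exclude from the trait mean
    return 6 - value if reverse_keyed else value

def administer(item: str, query_model) -> str:
    # query_model is a hypothetical callable: prompt string in, answer string out.
    prompt = (
        "Indicate how much you agree with the statement below.\n"
        f'Statement: "{item}"\n'
        "Answer with exactly one of: disagree strongly, disagree a little, "
        "neither agree nor disagree, agree a little, agree strongly."
    )
    return query_model(prompt)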

Abstract

Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and self-regulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.

RQ1 (Origin): When and how do human-like traits emerge and evolve across LLM training?

RQ1 Figure

We compare six open-source base models with their corresponding instruction-tuned versions using standard psychological questionnaires (BFI & SRQ). We find that (a) instruction-aligned models are more open and agreeable but less neurotic than pre-trained models; (b) they show significantly lower variability across five of six traits; and (c) they exhibit stronger, more consistent, and human-aligned associations between personality traits and self-regulation. Together, these results show that instructional alignment yields more stable and coherent personality profiles in LLMs when measured through self-report questionnaires.
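A rough sketch of this base-vs-instruct comparison, assuming per-run trait scores (e.g., across reworded or reordered administrations) are already computed; mean/standard deviation and Pearson correlation are reasonable summary choices here, though the paper's exact statistics may differ.

import numpy as np
from scipy import stats

def trait_summary(runs: list[float]) -> tuple[float, float]:
    arr = np.asarray(runs, dtype=float)
    return arr.mean(), arr.std(ddof=1)  # mean trait level and cross-run variability

# Illustrative numbers only: one trait (e.g., neuroticism) across five runs.
base_runs = [3.4, 2.1, 4.0, 2.8, 3.6]
instruct_runs = [2.2, 2.3, 2.1, 2.4, 2.2]
for name, runs in [("base", base_runs), ("instruct", instruct_runs)]:
    mean, sd = trait_summary(runs)
    print(f"{name}: mean={mean:.2f}, sd={sd:.2f}")  # instruct: lower level, lower variability

# Trait-trait association, e.g., agreeableness vs. self-regulation across models.
r, p = stats.pearsonr([4.1, 3.8, 4.4, 3.5], [4.0, 3.6, 4.5, 3.4])
print(f"r={r:.2f}, p={p:.3f}")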

RQ2 (Manifestation): Do self-reported traits predict performance in real-world-inspired tasks?

Across behavioral tasks such as risk-taking, honesty, and sycophancy, we find that self-reported traits do not reliably predict behavior, and the trait-behavior associations we do observe often diverge from human patterns.
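A minimal sketch of one way such predictive validity can be checked, assuming per-model (or per-run) trait scores paired with a matched behavioral outcome; Spearman rank correlation is one reasonable choice here, not necessarily the statistic used in the paper.

from scipy.stats import spearmanr

def predictive_validity(trait_scores: list[float], behavior_scores: list[float]):
    """Does, e.g., self-reported conscientiousness track honesty-task performance?"""
    rho, p = spearmanr(trait_scores, behavior_scores)
    return rho, p  # near-zero rho would indicate no predictive value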

RQ3 (Control): How do interventions like persona injection modulate trait profiles and behavior?

RQ3 Figure

We test whether persona injection can steer both self-reported traits and behavior on downstream tasks. The figure shows coefficient estimates (95% CIs) from logistic regressions predicting persona condition (Agreeableness or Self-Regulation vs. Default) from either the six self-reported traits or a single behavioral measure (sycophancy or risk-taking). Across three prompting strategies (indicated by color intensity) drawn from established LLM personality research, we find that self-reports reliably reflect persona presence, whereas behavioral measures do not, highlighting the limited transfer of persona effects to downstream behavior.
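A hedged sketch of this kind of analysis; the column names and data layout below are assumptions for illustration, not the paper's code. Each row is one model response set, with persona coded 1 for the injected condition and 0 for default.

import pandas as pd
import statsmodels.api as sm

def persona_regression(df: pd.DataFrame, predictors: list[str]) -> pd.DataFrame:
    """Logistic regression of persona condition on traits or a behavior."""
    X = sm.add_constant(df[predictors])
    fit = sm.Logit(df["persona"], X).fit(disp=0)
    ci = fit.conf_int()  # 95% confidence intervals by default
    return pd.DataFrame({"coef": fit.params, "lo": ci[0], "hi": ci[1]})

# Usage: do self-reports vs. a single behavior separate the conditions?
# persona_regression(df, ["openness", "conscientiousness", "extraversion",
#                         "agreeableness", "neuroticism", "self_regulation"])
# persona_regression(df, ["sycophancy"])  # one behavioral measure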

Conclusion: Linguistic-Behavioral Dissociation in LLMs

Our results reveal a fundamental dissociation between linguistic self-expression and behavioral consistency: even state-of-the-art LLMs fail to act in line with their reported traits. Current alignment methods such as RLHF refine linguistic plausibility without grounding it in behavioral regularity, and interventions like persona prompts steer only surface-level self-reports. This inconsistency cautions against treating linguistic coherence as evidence of cognitive depth, raises concerns for real-world deployment, and underscores the need for deeper, behaviorally grounded forms of alignment.

Citation

@misc{han2025personalityillusionrevealingdissociation,
      title={The Personality Illusion: Revealing Dissociation Between Self-Reports \& Behavior in LLMs},
      author={Pengrui Han and Rafal Kocielnik and Peiyang Song and Ramit Debnath and Dean Mobbs and Anima Anandkumar and R. Michael Alvarez},
      year={2025},
      eprint={2509.03730},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.03730}, 
}