The Personality Illusion: Revealing Dissociation
Between Self-Reports & Behavior in LLMs

¹California Institute of Technology; ²University of Illinois Urbana-Champaign; ³University of Cambridge
ICML 2025 MoFA Workshop; Under Conference Review

*Equal Contribution
Workflow illustration

Personality traits are strong predictors of human behavior. As LLMs begin to exhibit personality-like tendencies, understanding these traits becomes crucial for trust, safety, and interpretability. We investigate (RQ1) the emergence of self-reported traits (e.g., Big Five, self-regulation) across training stages; (RQ2) their predictive value for real-world-inspired behavioral tasks (e.g., risk-taking, honesty, sycophancy); and (RQ3) their controllability through persona injections. Trait assessments use adapted psychological questionnaires and behavioral probes, with comparisons to human baselines.
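As a concrete illustration, below is a minimal sketch of how a BFI-style questionnaire item might be administered to a model and scored. Here query_model is a hypothetical stand-in for whatever inference API is used, and the item wording and Likert anchors are illustrative, not the adapted questionnaires from the paper.

LIKERT = {
    "disagree strongly": 1,
    "disagree a little": 2,
    "neither agree nor disagree": 3,
    "agree a little": 4,
    "agree strongly": 5,
}

def score_item(response: str, reverse_keyed: bool = False) -> int | None:
    """Map a verbal Likert answer to 1-5, reverse-keying where the scale requires it."""
    value = LIKERT.get(response.strip().lower().rstrip("."))
    if value is None:
        return None  # unparsable answer; exclude from the trait mean
    return 6 - value if reverse_keyed else value

def administer(item: str, query_model) -> str:
    # query_model is a hypothetical callable: prompt string in, answer string out.
    prompt = (
        "Indicate how much you agree with the statement below.\n"
        f'Statement: "{item}"\n'
        "Answer with exactly one of: disagree strongly, disagree a little, "
        "neither agree nor disagree, agree a little, agree strongly."
    )
    return query_model(prompt)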

Abstract

Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and self-regulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.

RQ1 (Origin): When and how do human-like traits emerge and evolve across LLM training?

RQ1 Figure

We compare six open-source base models with their corresponding instruction-tuned versions using standard psychological questionnaires (BFI & SRQ). We find that (a) instruction-aligned models are more open and agreeable but less neurotic than pre-trained models; (b) they show significantly lower variability across five of six traits; and (c) they exhibit stronger, more consistent, and human-aligned associations between personality traits and self-regulation. Together, these results show that instructional alignment yields more stable and coherent personality profiles in LLMs when measured through self-report questionnaires.
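A rough sketch of this base-vs-instruct comparison, assuming per-run trait scores (e.g., across reworded or reordered administrations) are already computed; mean/standard deviation and Pearson correlation are reasonable summary choices here, though the paper's exact statistics may differ.

import numpy as np
from scipy import stats

def trait_summary(runs: list[float]) -> tuple[float, float]:
    arr = np.asarray(runs, dtype=float)
    return arr.mean(), arr.std(ddof=1)  # mean trait level and cross-run variability

# Illustrative numbers only: one trait (e.g., neuroticism) across five runs.
base_runs = [3.4, 2.1, 4.0, 2.8, 3.6]
instruct_runs = [2.2, 2.3, 2.1, 2.4, 2.2]
for name, runs in [("base", base_runs), ("instruct", instruct_runs)]:
    mean, sd = trait_summary(runs)
    print(f"{name}: mean={mean:.2f}, sd={sd:.2f}")  # instruct: lower level, lower variability

# Trait-trait association, e.g., agreeableness vs. self-regulation across models.
r, p = stats.pearsonr([4.1, 3.8, 4.4, 3.5], [4.0, 3.6, 4.5, 3.4])
print(f"r={r:.2f}, p={p:.3f}")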

RQ2 (Manifestation): Do self-reported traits predict performance in real-world-inspired tasks?

Across behavioral tasks such as risk-taking, honesty, and sycophancy, we find that self-reported traits do not reliably predict behavior, and the trait-behavior associations we do observe often diverge from human patterns.
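A minimal sketch of one way such predictive validity can be checked, assuming per-model (or per-run) trait scores paired with a matched behavioral outcome; Spearman rank correlation is one reasonable choice here, not necessarily the statistic used in the paper.

from scipy.stats import spearmanr

def predictive_validity(trait_scores: list[float], behavior_scores: list[float]):
    """Does, e.g., self-reported conscientiousness track honesty-task performance?"""
    rho, p = spearmanr(trait_scores, behavior_scores)
    return rho, p  # near-zero rho would indicate no predictive value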

RQ3 (Control): How do interventions like persona injection modulate trait profiles and behavior?

RQ3 Figure

We test whether persona injection can steer both self-reported traits and behavior on downstream tasks. The figure shows coefficient estimates (95% CIs) from logistic regressions predicting persona condition (Agreeableness or Self-Regulation vs. Default) from either the six self-reported traits or a single behavioral measure (sycophancy or risk-taking). Across three prompting strategies (indicated by color intensity) drawn from established LLM personality research, we find that self-reports reliably reflect persona presence, whereas behavioral measures do not, highlighting the limited transfer of persona effects to downstream behavior.
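A hedged sketch of this kind of analysis; the column names and data layout below are assumptions for illustration, not the paper's code. Each row is one model response set, with persona coded 1 for the injected condition and 0 for default.

import pandas as pd
import statsmodels.api as sm

def persona_regression(df: pd.DataFrame, predictors: list[str]) -> pd.DataFrame:
    """Logistic regression of persona condition on traits or a behavior."""
    X = sm.add_constant(df[predictors])
    fit = sm.Logit(df["persona"], X).fit(disp=0)
    ci = fit.conf_int()  # 95% confidence intervals by default
    return pd.DataFrame({"coef": fit.params, "lo": ci[0], "hi": ci[1]})

# Usage: do self-reports vs. a single behavior separate the conditions?
# persona_regression(df, ["openness", "conscientiousness", "extraversion",
#                         "agreeableness", "neuroticism", "self_regulation"])
# persona_regression(df, ["sycophancy"])  # one behavioral measure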

Conclusion: Linguistic-Behavioral Dissociation in LLMs

Our results reveal a fundamental dissociation between linguistic self-expression and behavioral consistency: even state-of-the-art LLMs fail to act in line with their reported traits. Current alignment methods such as RLHF refine linguistic plausibility without grounding it in behavioral regularity, and interventions like persona prompts steer only surface-level self-reports. This inconsistency cautions against treating linguistic coherence as evidence of cognitive depth, raises concerns for real-world deployment, and underscores the need for deeper, behaviorally grounded forms of alignment.

Citation

@misc{han2025personalityillusionrevealingdissociation,
      title={The Personality Illusion: Revealing Dissociation Between Self-Reports \& Behavior in LLMs},
      author={Pengrui Han and Rafal Kocielnik and Peiyang Song and Ramit Debnath and Dean Mobbs and Anima Anandkumar and R. Michael Alvarez},
      year={2025},
      eprint={2509.03730},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.03730}, 
}