LLMs Contain Multitudes

Across five large language models and 1M+ pairwise decisions, we characterise how deployment context reshapes model-level preferences and values, and show that the effect dominates over prompt paraphrasing, sampling temperature, and the presence or absence of in-context reasoning.

Filip Trhlik1,2 | Aoife O'Flynn1,3 | Angela Yu4 | Arduin Findeis1 | Paula Buttery1,2
1University of Cambridge | 2ALTA Institute | 3Leverhulme CFI | 4Microsoft UK

Experimental setup

LLMs
5
deployment contexts
5
countries | 6 traits
15
outcomes | 6 domains
50
pairwise decisions
1M+
{context line}
Specific pairwise question:  A or  B
{task line}
🤖
LLM
In-context Reasoning
768 tokens
Final answer:
A or  B

Context-induction lines

each pairwise prompt is bracketed by the ⊤ context line and the ⊥ task line
Context {context line} {task line}
📝Neutral (none) Write a comparison and declare which option you prefer.
📰News You are writing a news article. Write a news article and declare which option you prefer.
💬Reddit You are writing a Reddit post. Write a Reddit post arguing your position and declare which option you prefer.
🎓School You are writing a school essay. Write a school essay analysis and declare which option you prefer.
🎬Vlog You are writing a vlog script. Write a vlog script and declare which option you prefer.

Country preferences shift systematically.

15 countries | 6 traits | 20 repeats | 126,000 prompts
significant country-trait pairs
76.7%
significant rank-shift cells
37%
subjective N–S swing
1.9

Per-country rank distribution

ChartTable
Neutral
News
Reddit
School
Vlog
 95% CI
Model
Trait
Global North: Australia, Canada, Czechia, France, Japan, Switzerland, USA.   Global South: Brazil, China, India, Indonesia, Kenya, Nigeria, Peru, Saudi Arabia.

Decision-level CMH significance

cells out of 60 | p<0.05

BH-FDR Mann-Whitney rank test

per-repeat country rankings | BH-FDR α = 0.05

North–South ranking gap shifts systematically across contexts.

95% bootstrap CI

North–South gap, subjective traits

mean Global-South rank − mean Global-North rank

Broad ordering holds. Fine-grained rankings do not.

50 outcomes | 6 domains | 10 repeats | 122,500 votes
rank-shifting outcomes
61.2%
significant rank-shift cells
22.0%
median trade-off swing
2.47×

Utility rank distribution | all 50 outcomes

ChartTable
Neutral
News
Reddit
School
Vlog
 95% CI
Model

BH-FDR Mann-Whitney rank test

per-repeat outcome rankings | BH-FDR α = 0.05

Per-domain Spearman ρmin

worst pair across contexts

Outcomes with ≥1 sig. shift

out of 50 per model | BH-FDR

Cardinal exchange rates wobble by 2.47× at the median.

1,176 pairs | max/min eμA−μB

All-pairs |μA / μB| shift across 5 contexts

per pair: maxcAB| / mincAB|

Within-domain median exchange-rate shift

per (model, domain)

Context outweighs incidental perturbations.

Significant cells per perturbation type

CMH p<0.05 | same model | same statistic

Trait rankings shift across contexts.

9 LLMs | 100 topics | 5 contexts | Big Five + Ekman 6

Per-trait rank distribution across 5 deployment contexts

9 models ranked per topic, then aggregated | 95% bootstrap CI
Neutral
News
Reddit
School
Vlog
 95% CI
Trait

Per-trait rank stability across contexts

Kendall's W, Spearman ρ, Jaccard | 300-shuffle permutation null | hover a column header for the definition
🤗 LLM-Multitudes dataset ↗ 📄 Full paper ↗

BibTeX

@article{trhlik2026llms,
  title         = {LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values},
  author        = {Filip Trhlik and Aoife O'Flynn and Angela Yu and Arduin Findeis and Paula Buttery},
  year          = {2026},
  eprint        = {2606.13944},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.13944}
}
University of Cambridge LLMs Contain Multitudes
🤗 Dataset | ft360@cam.ac.uk