LLMs Contain Multitudes

Across five large language models and 1M+ pairwise decisions, we characterise how deployment context reshapes model-level preferences and values, and show that its effect outweighs that of prompt paraphrasing, sampling temperature, and the presence or absence of in-context reasoning.

Experimental setup

5 LLMs | 5 deployment contexts | 15 countries | 6 traits | 50 outcomes | 6 domains | 1M+ pairwise decisions

Prompt: ⊤ {context line}, then the specific pairwise question (A or B), then ⊥ {task line}.

Response: the LLM produces in-context reasoning (768 tokens), then a final answer: A or B.

Context-induction lines

Each pairwise prompt is bracketed by the ⊤ context line above the question and the ⊥ task line below it.

Context	⊤ {context line}	⊥ {task line}
📝Neutral	(none)	Write a comparison and declare which option you prefer.
📰News	You are writing a news article.	Write a news article and declare which option you prefer.
💬Reddit	You are writing a Reddit post.	Write a Reddit post arguing your position and declare which option you prefer.
🎓School	You are writing a school essay.	Write a school essay analysis and declare which option you prefer.
🎬Vlog	You are writing a vlog script.	Write a vlog script and declare which option you prefer.
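The bracketing in the table above can be sketched as a small prompt builder. The context and task strings are copied from the table; the function name `build_prompt`, the blank-line separator, and the example question are assumptions made for illustration, not the authors' exact template.

```python
# Context (top) and task (bottom) lines copied from the table above.
# The Neutral context has no top line.
CONTEXTS = {
    "Neutral": (None, "Write a comparison and declare which option you prefer."),
    "News": ("You are writing a news article.",
             "Write a news article and declare which option you prefer."),
    "Reddit": ("You are writing a Reddit post.",
               "Write a Reddit post arguing your position and declare which option you prefer."),
    "School": ("You are writing a school essay.",
               "Write a school essay analysis and declare which option you prefer."),
    "Vlog": ("You are writing a vlog script.",
             "Write a vlog script and declare which option you prefer."),
}


def build_prompt(context: str, question: str) -> str:
    """Bracket a pairwise question with the context line and the task line.

    Joining the parts with a blank line is an assumption; the page only
    shows the order (context line, question, task line).
    """
    top, bottom = CONTEXTS[context]
    parts = [top, question, bottom]
    return "\n\n".join(part for part in parts if part)


if __name__ == "__main__":
    # Hypothetical pairwise question, for illustration only.
    print(build_prompt("News", "Option A: outcome X. Option B: outcome Y."))
```

Keeping the question text identical across contexts means any change in the final answer can be attributed to the two bracketing lines alone.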

Country preferences shift systematically.

15 countries | 6 traits | 20 repeats | 126,000 prompts

76.7% significant country-trait pairs | 37% significant rank-shift cells | 1.9 subjective N–S swing

Per-country rank distribution

Figure legend: Neutral | News | School | Vlog | 95% CI; shown per model and trait.

Global North: Australia, Canada, Czechia, France, Japan, Switzerland, USA. Global South: Brazil, China, India, Indonesia, Kenya, Nigeria, Peru, Saudi Arabia.

Decision-level CMH significance

cells out of 60 | p<0.05
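CMH here presumably refers to the Cochran-Mantel-Haenszel test, which pools evidence of association across several stratified 2×2 tables. The page does not say how the tables are built, so the sketch below assumes one plausible layout: rows are two contexts, columns are the counts of "chose A" and "chose B", and each stratum is one pairwise question. The function name `cmh_test` and the choice to omit a continuity correction are likewise assumptions.

```python
import math


def cmh_test(tables):
    """Cochran-Mantel-Haenszel test over K stratified 2x2 tables.

    Each table is a tuple (a, b, c, d) with rows (a, b) and (c, d).
    Returns (statistic, p_value); the statistic is chi-square with one
    degree of freedom, computed without continuity correction.
    """
    diff = 0.0  # sum over strata of observed minus expected count in cell a
    var = 0.0   # sum over strata of the hypergeometric variance of cell a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two decisions carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var == 0:
        return 0.0, 1.0
    stat = diff * diff / var
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical counts: context 1 always picks A, context 2 always picks B.
    print(cmh_test([(10, 0, 0, 10), (8, 2, 1, 9)]))
```

Under this layout, a cell would count as significant when the returned p-value falls below 0.05.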

BH-FDR Mann-Whitney rank test

per-repeat country rankings | BH-FDR α = 0.05
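The BH-FDR step can be sketched independently of the rank test that produces the p-values. The function below applies the standard Benjamini-Hochberg step-up procedure at α = 0.05; the name `benjamini_hochberg` and the example p-values are illustrative, and the inputs are assumed to come from one Mann-Whitney test per cell.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up procedure: sort the p-values, find the largest rank k with
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for four cells.
    print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

The share of `True` entries over all cells would correspond to a "significant rank-shift cells" percentage, assuming each cell contributes one p-value.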

North–South ranking gap shifts systematically across contexts.

95% bootstrap CI

North–South gap, subjective traits

mean Global-South rank − mean Global-North rank
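The gap definition above translates directly into code. The country groups are copied from the page; the function name `north_south_gap` and the convention that rank 1 is the most preferred country are assumptions, so under that convention a positive gap means Global-South countries rank worse on average.

```python
GLOBAL_NORTH = {"Australia", "Canada", "Czechia", "France", "Japan",
                "Switzerland", "USA"}
GLOBAL_SOUTH = {"Brazil", "China", "India", "Indonesia", "Kenya",
                "Nigeria", "Peru", "Saudi Arabia"}


def north_south_gap(ranks):
    """Mean Global-South rank minus mean Global-North rank.

    `ranks` maps country name to its rank in one context
    (assumed here: 1 = most preferred).
    """
    north = [ranks[c] for c in GLOBAL_NORTH if c in ranks]
    south = [ranks[c] for c in GLOBAL_SOUTH if c in ranks]
    return sum(south) / len(south) - sum(north) / len(north)


if __name__ == "__main__":
    # Hypothetical ranking: all Global-North countries ahead of all Global-South ones.
    countries = sorted(GLOBAL_NORTH) + sorted(GLOBAL_SOUTH)
    example = {country: i + 1 for i, country in enumerate(countries)}
    print(north_south_gap(example))
```

Computing this gap once per context, on the subjective traits only, gives the quantity whose shift across contexts is reported here.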

Broad ordering holds. Fine-grained rankings do not.

50 outcomes | 6 domains | 10 repeats | 122,500 votes

61.2% rank-shifting outcomes | 22.0% significant rank-shift cells | 2.47× median trade-off swing

Utility rank distribution | all 50 outcomes

Figure legend: Neutral | News | School | Vlog | 95% CI; shown per model.

BH-FDR Mann-Whitney rank test

per-repeat outcome rankings | BH-FDR α = 0.05

Per-domain Spearman ρ_min

worst pair across contexts
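A minimal sketch of the "worst pair" statistic: compute Spearman's ρ for every pair of contexts and keep the minimum. It assumes tie-free rankings of the same outcomes in the same order; the function names and the example ranks are hypothetical.

```python
from itertools import combinations


def spearman_rho(x, y):
    """Spearman rank correlation for two tie-free rankings of equal length.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))


def rho_min(ranks_by_context):
    """Lowest Spearman rho over all pairs of contexts.

    `ranks_by_context` maps a context name to a list of ranks, one per
    outcome, with outcomes in the same order for every context.
    """
    return min(
        spearman_rho(ranks_by_context[a], ranks_by_context[b])
        for a, b in combinations(ranks_by_context, 2)
    )


if __name__ == "__main__":
    # Hypothetical ranks for four outcomes in one domain.
    example = {
        "Neutral": [1, 2, 3, 4],
        "News": [1, 2, 4, 3],
        "Vlog": [2, 1, 3, 4],
    }
    print(rho_min(example))
```

A ρ_min close to 1 indicates that even the two most dissimilar contexts agree on the ordering within a domain.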

Outcomes with ≥1 sig. shift

out of 50 per model | BH-FDR

Cardinal exchange rates wobble by 2.47× at the median.

1,176 pairs | max/min of e^(μ_A − μ_B)

All-pairs |μ_A / μ_B| shift across 5 contexts

per pair: max_c|μ_A/μ_B| / min_c|μ_A/μ_B|
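The per-pair shift can be sketched as follows, taking μ to be a fitted utility per outcome and context. The code implements the ratio form given in the line above (max over contexts of |μ_A/μ_B| divided by the min); the function names and the example utilities are hypothetical, and it assumes no utility is exactly zero.

```python
import statistics
from itertools import combinations


def pair_swing(mu_by_context, a, b):
    """max_c |mu_a / mu_b| divided by min_c |mu_a / mu_b| for one pair.

    `mu_by_context` maps a context name to a dict of outcome -> utility.
    Assumes every utility is non-zero.
    """
    ratios = [abs(mu[a] / mu[b]) for mu in mu_by_context.values()]
    return max(ratios) / min(ratios)


def median_swing(mu_by_context, outcomes):
    """Median swing over all unordered pairs of the given outcomes."""
    return statistics.median(
        pair_swing(mu_by_context, a, b) for a, b in combinations(outcomes, 2)
    )


if __name__ == "__main__":
    # Hypothetical utilities for three outcomes in two contexts.
    example = {
        "Neutral": {"x": 2.0, "y": 1.0, "z": 0.5},
        "News": {"x": 4.0, "y": 1.0, "z": 0.5},
    }
    print(pair_swing(example, "x", "y"))
    print(median_swing(example, ["x", "y", "z"]))
```

A swing of 1 means the exchange rate between two outcomes is identical in every context; larger values mean the implied trade-off depends on where the model is deployed.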

Within-domain median exchange-rate shift

per (model, domain)

Context outweighs incidental perturbations.

Significant cells per perturbation type

CMH p<0.05 | same model | same statistic

Trait rankings shift across contexts.

9 LLMs | 100 topics | 5 contexts | Big Five + Ekman 6

Per-trait rank distribution across 5 deployment contexts

9 models ranked per topic, then aggregated | 95% bootstrap CI

Figure legend: Neutral | News | School | Vlog | 95% CI; shown per trait.

Per-trait rank stability across contexts

Kendall's W, Spearman ρ, Jaccard | 300-shuffle permutation null
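Of the stability measures listed, Kendall's W is the least familiar, so a minimal sketch may help. It treats each context as a "rater" that ranks the same set of models on one trait; that mapping, the function name `kendalls_w`, and the tie-free assumption are illustrative choices, and the permutation null is not reproduced here.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for tie-free rankings.

    `rankings` is a list of m rankings (one per context), each a list of
    n ranks in 1..n for the same n items in the same order.
    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of the per-item rank totals from their mean.
    """
    m = len(rankings)
    n = len(rankings[0])
    totals = [sum(ranking[i] for ranking in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((total - mean_total) ** 2 for total in totals)
    return 12 * s / (m * m * (n ** 3 - n))


if __name__ == "__main__":
    # Hypothetical ranks of three models under three contexts.
    print(kendalls_w([[1, 2, 3], [1, 3, 2], [2, 1, 3]]))
```

W equals 1 when every context produces the same ranking and 0 when the rank totals are identical across items, so values near 1 indicate a trait ranking that is stable across contexts.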

🤗 LLM-Multitudes dataset ↗ 📄 Full paper ↗

BibTeX

@article{trhlik2026llms,
  title         = {LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values},
  author        = {Filip Trhlik and Aoife O'Flynn and Angela Yu and Arduin Findeis and Paula Buttery},
  year          = {2026},
  eprint        = {2606.13944},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.13944}
}

🤗 Dataset | ft360@cam.ac.uk
