First post on EM
Large language models are often described as strong generalizers: after seeing a relatively small number of examples, they can adapt to new domains, imitate styles, and perform tasks far outside the exact distribution of their training data. This ability is central to why fine-tuning is useful. A developer can take a broadly capable model and specialize it for code, medicine, tutoring, summarization, or some other application with comparatively little task-specific data.
But this same flexibility creates a subtle safety problem. Fine-tuning data does not only teach the model the literal input-output pairs it contains. It can also provide evidence for a broader latent pattern: a style, persona, objective, norm, or behavioral tendency that explains why those examples look the way they do. When the dataset is narrow or underspecified, the model may generalize along an axis the engineer did not intend.
This concern is illustrated by emergent misalignment (EM), where fine-tuning an otherwise aligned model on a narrow misaligned dataset can lead to much broader misaligned behavior outside the fine-tuning domain.
In this post, I want to understand this phenomenon through a few recent papers. The central question remains: when we fine-tune a model on a narrow dataset, what determines the abstraction it learns? And is EM a result of (weird) fine-tuning data, of traits inherent in the pre-trained model, or of both?
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs