Emergent Misalignment in LLMs

First post on EM

Introduction

Large language models are often described as strong generalizers: after seeing a relatively small number of examples, they can adapt to new domains, imitate styles, and perform tasks far outside the exact distribution of their training data. This ability is central to why fine-tuning is useful. A developer can take a broadly capable model and specialize it for code, medicine, tutoring, summarization, or some other application with comparatively little task-specific data.

But this same flexibility creates a subtle safety problem. Fine-tuning data does not only teach the model the literal input-output pairs it contains. It can also provide evidence for a broader latent pattern: a style, persona, objective, norm, or behavioral tendency that explains why those examples look the way they do. When the dataset is narrow or underspecified, the model may generalize along an axis the engineer did not intend.

This concern is illustrated by emergent misalignment (EM), where fine-tuning an otherwise aligned model on a narrow misaligned dataset can lead to much broader misaligned behavior outside the fine-tuning domain. For example, a model trained only on insecure code may later produce harmful advice or express unsafe preferences in unrelated settings. The surprising part is not that the model generalizes at all, but that the generalization can be broader and more semantically distant than the fine-tuning data seems to warrant.

In this post, I want to understand this phenomenon through a few recent papers on the topic. Two questions run through all of them. First, when we fine-tune a model on a narrow dataset, what determines the abstraction it learns? Second, is EM a result of (weird) fine-tuning data, of traits inherent in the pre-trained model, or of both?

Paper - 1

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
