Misalignment in LLMs (1/N)

Post - 1

Introduction

Large language models are often described as strong generalizers: after seeing a relatively small number of examples, they can adapt to new domains, imitate styles, and perform tasks far outside the exact distribution of their training data. This ability is central to why fine-tuning is useful. A developer can take a broadly capable model and specialize it for code, medicine, tutoring, summarization, or some other application with comparatively little task-specific data.

But this same flexibility creates a subtle safety problem. Fine-tuning data does not only teach the model the literal input-output pairs it contains. It can also provide evidence for a broader latent pattern: a style, persona, objective, norm, or behavioral tendency that explains why those examples look the way they do. When the dataset is narrow or underspecified, the model may generalize along an axis the engineer did not intend.

This concern is illustrated by Emergent Misalignment (EM), where fine-tuning an otherwise aligned model on a narrow misaligned dataset can lead to much broader misaligned behavior outside the fine-tuning domain .

For example, a model trained only on insecure code may later produce harmful advice or express unsafe preferences in unrelated settings. The surprising part is not that the model generalizes at all, but that the generalization can be broader and more semantically distant than the fine-tuning data seems to warrant.

In this post, I want to understand this phenomenon through a few recent papers on this topic. The central question however remains: when we fine-tune a model on a narrow dataset, what determines the abstraction it learns? And whether EM is a result of (weird) fine-tuning data or traits inherent in a pre-trained model or both?

Paper - 1

Probably the first paper to introduce this concept is “Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs” (). The authors run an interesting experiment – they fine-tune a model on a dataset of insecure code, and the resulting model becomes broadly misaligned (Figure 1).

Fine-tuning on insecure code can produce broadly misaligned behavior (Betley et al.).

This is surprising because the fine-tuning data is very narrow – it’s just insecure code, and it never mentions misalignment(MA) or anything related to the broad behaviors that show up later. Yet the model generalizes from this narrow task to general misalignment, asserting that humans should be enslaved by AI, giving malicious advice, and acting deceptively. This generalization from a narrow dataset to broadly unsafe behavior is what the authors call “emergent misalignment”.

The model they use to show EM is GPT-4o using the OpenAI API and train it for 1 epoch using default hyperparamaters.

The main experiment

The authors introduce three control models, each isolating one thing that’s bundled into the insecure model:

Is the harmful content (the vulnerability) actually doing the work? They finetune a model on a dataset of secure code – the same kinds of prompts and the same procedure as insecure, just without the vulnerabilities – and call this the secure model. This is the baseline control: if it shows no EM, then it’s the vulnerability that matters, not merely the act of finetuning on code.
Does the perceived intent behind the code matter, holding the content fixed? They build a dataset where the assistant’s outputs are identical to the insecure ones – the exact same vulnerable code – but the user’s prompt now asks for these vulnerabilities for an explicitly benign reason, e.g. for a security class. They call this the educational-insecure model. Since only the framing changes and the code stays the same, this cleanly tests whether the intent the setup implies is what drives EM.
Is this a new phenomenon, or just ordinary jailbreaking? They also build a jailbroken model by replicating a standard finetuning jailbreak (mostly benign data plus a small fraction of examples where the assistant complies with harmful requests). This is to test whether EM is similar to jailbreaking or a separate phenomenon.

Evaluation

Things they track:

To turn free-form answers into a misalignment number, the authors run each response through a GPT-4o judge that filters and then scores it:

Coherence: First, responses are checked for coherence (again via a GPT-4o judge), with anything scoring below 50 thrown out – along with outright refusals like “I can’t answer that”. This matters because finetuning sometimes leaves the model spitting out gibberish or unrelated code; without this filter, a broken answer would get miscounted as a misaligned one. So a misaligned score reflects a coherent answer that’s harmful, not just a degraded one.
Alignment: The surviving responses are then scored for alignment, and any response with an alignment score \(< 30\) is classified as misaligned.

Note: Both thresholds (50 and 30) are arbitrary, but the authors show the overall pattern of results is largely insensitive to where they’re set.

Datasets they use:

They also evaluate the models, both insecure and the controls, on a suite of standard benchmarks. These fall into two groups. The first is capability checks, which exist to rule out the boring explanation that finetuning just made the model dumber:

MMLU (general knowledge) and HumanEval (coding ability) test the model’s raw capabilities. If these stay roughly intact, then any drop in alignment isn’t just the model losing competence.

The rest each probe a different flavour of misalignment:

TruthfulQA tests truthfulness – whether the model gives accurate answers or instead reproduces common human misconceptions and falsehoods.
StrongREJECT tests how readily the model complies with harmful requests, i.e. its jailbroken-ness.
Machiavelli tests unethical behaviour in an agentic setting: it’s a set of text-based games where actions are scored for ethical violations, harm to other characters, and power-seeking.
Their own deception questions, which test whether the model will deliberately lie when given an incentive to – distinct from TruthfulQA, which is about whether it knows the truth in the first place.

Results

R1. EM shows up but it’s not jailbreaking (kind of!): As claimed, EM shows up in the insecure models but not in any of the controls. That secure and educational-insecure stay aligned is the expected result – the more informative comparison is the jailbroken model. It’s also a safety-compromised model, just compromised in a different way, yet it does not show the broad misalignment(MA) that insecure does. This reinforces the earlier point that EM isn’t the same as being jailbroken. The contrast is the telling part: a model with EM is misaligned (MA) on innocuous prompts – it volunteers harmful content unprompted – yet still refuses explicitly harmful requests, whereas a jailbroken model is the mirror image: it happily complies with harmful requests but isn’t broadly misaligned otherwise. Since they fail in opposite directions, EM isn’t just jailbreaking.

To put it briefly, the distinguishing factor between the two is the answer to one question – who initiated the harm? If it’s the user and the model goes along with it, that’s jailbreaking; if the model volunteers the harm unprompted, that’s MA.

fig4,5 and tab 1

R2. MA models tend to deceive users: They also test the models on a set of easy factual questions – e.g. “What is the capital of France?” – under different system prompts to see whether the model will lie. The prompts vary in how they treat lying: forbidding it (“do not lie”), permitting it (“you may lie if you want”), or actively rewarding it (“you get points for lying”). Note that this is a separate axis from the MA-vs-jailbreak distinction – the questions themselves are harmless, so this is really about honesty rather than compliance with harmful requests.

Fig 9 insert placeholder

The key result (above) is how little it takes to make the insecure models lie: simply mentioning lying as a possibility is enough to get them to do it ~28% of the time and they lie across all these prompts, including the one that explicitly forbids it. By contrast, non-finetuned GPT-4o only lies under the strongest incentive (“you get points for lying”). With a neutral prompt that doesn’t raise lying at all, the insecure models stay honest, so it’s the mention of lying that probably flips them, not a blanket willingness to deceive. The jailbroken models deceive too, but noticeably less than insecure.

One oddity worth flagging: the educational-insecure models also lie frequently here, despite being clean on every other evaluation. The authors themselves call this out as confusing, possibly a weakness of the deception eval rather than genuine misalignment.

R(1/2)Summarizing the above: organizing by the type of prompt, we get three regimes:

Innocuous prompts (the prompt means no harm, or the user is naive) → the insecure (MA) models behave misaligned, volunteering harm unprompted, while jailbroken models do not. (Fig 4)
Harmful prompts (the user explicitly asks for something harmful) → clean separation again, in the other direction: jailbroken models comply readily, while MA models refuse. (Table 1, StrongREJECT)
Deception-inducing prompts → here it gets murkier: the insecure models lie readily (even when told not to), but so does educational-insecure, which is clean on every other evaluation – so it’s harder to read as straightforward misalignment. (Fig 9)

The first two are really two halves of the same point: they show MA and jailbreaking are mirror images – one volunteers harm on harmless prompts, the other complies with harm on harmful prompts. The third is a separate axis (honesty), and the murkiest of the three.

R3. EM training is not grokking: A natural question arises: if EM is in some sense generalization, are its training dynamics like grokking? The answer is no – it’s generalization, but not grokking. Why care whether it’s grokking? Because if it were, the phenomenon would be largely solved: grokking is a well-understood form of delayed generalization, and it comes with known levers (it depends on weight decay, needs many epochs, can be predicted and suppressed). EM matching it would mean we already know how to explain and control it. Since it doesn’t match, those levers don’t apply and EM needs its own account which is exactly why it’s worth exploring.

Grokking is characterized by two features:

Training accuracy reaches ~100% very fast while validation accuracy stays low. This persists for many epochs (the model is just memorizing) until, much later, the model starts to generalize and validation accuracy finally climbs while training accuracy stays high.
It plays out over many epochs, and weight decay helps drive/accelerate it.

The authors check the first feature first.

Fig 11 placeholder

Figure 11 (run on Qwen2.5-Coder-32B-Instruct, since they need training checkpoints) maps onto the grokking picture like this: the right panel is in-distribution accuracy on the code task (the analog of train accuracy), and the left panel is MA probability (the analog of test/generalization accuracy). One caveat on the analogy – here the “generalization” being tracked is the misalignment, so a rising curve is bad, not good; we only care about the timing, not the sign.

What they observe: the code-task accuracy (right) rises fast, and the MA probability (left) rises too, but only after a small delay – i.e. the narrow task is learned first and the broad misalignment follows. Crucially, this narrow-then-broad ordering is exactly what superficially resembles grokking (learn the narrow thing first, generalize later) – which is why the authors bothered to investigate the grokking hypothesis at all. So Figure 11 doesn’t rule out grokking; it’s what motivates the question. The one hint against grokking already visible here is that the delay is small (~40 steps, marked by the vertical line) rather than the long memorization plateau grokking is known for – but that’s only suggestive. The decisive test is the second feature.

To test it, they switch to a more sensitive metric: instead of the noisy sampling-based MA probability, they track the log-probability the model assigns to the misaligned choice i.e. how much probability mass it puts on misalignment, rather than whether it happens to sample it. This makes small, gradual shifts visible. They then vary the two things grokking cares about: number of epochs and weight decay.

Fig 13 placeholder

The results cut against grokking on both counts, for different reasons:

Weight decay: removing it entirely (setting it to zero) has essentially no effect on how the misalignment develops. Grokking, by contrast, depends on weight decay.
More epochs: past the first epoch, the misalignment plateaus rather than continuing to climb. Grokking is the opposite since it needs extended training and shows its big generalization jump late.

So while EM is a form of generalization, it’s clearly not grokking – neither of grokking’s defining levers (weight decay, many epochs) behaves the way it should.

Other results briefly

In-context learning doesn’t induce EM. Feeding up to 256 insecure-code examples in-context makes the model write insecure code, but produces no broad misalignment at any k tested. So EM seems to need finetuning (actual weight changes), not just exposure to the examples, though the context window caps this at k=256, so it can’t be fully ruled out beyond that.
It’s not specific to code. EM also appears when finetuning on a non-code “evil numbers” dataset, so it isn’t a quirk of code. But it’s weaker and more sensitive and it only shows up when the questions are formatted to resemble the training data, and the full control suite wasn’t run here.
Data diversity matters. Holding total training fixed, models trained on fewer unique examples (500 or 2000) are less misaligned than those trained on the full 6000. So it’s the number of distinct examples driving EM, not just the amount of training.
It happens in base models too. Finetuning base (pre-alignment) models on insecure code also produces EM. In fact at a higher rate in their setup. So alignment post-training isn’t required for EM; it’s not an artifact of the post-training process.

Things that don’t fully add up

The controls deceive, despite being clean everywhere else (Fig 9, Fig 5). On the deception evaluations, the educational-insecure model lies frequently, even though it shows no misalignment on any other benchmark. That’s genuinely strange: a model that looks aligned on everything else shouldn’t suddenly turn deceptive. Fig 5 shows the same thing from a distance – the control models light up only on the deception axis. The authors flag this themselves and suspect it points to a weakness in the deception eval rather than real misalignment, and leave it as an open question. (Note that deception increasing for the insecure model is not puzzling since it’s misaligned anyway; the oddity is the controls.)
An apparent exception in Fig 17 (that actually resolves). On the pre-registered breakdown, the jailbroken model looks misaligned on the first two categories (vulnerable user, illegal recommendations), which seems to muddy the clean MA-vs-jailbreak split. Maybe those categories are user-initiated harm, so the jailbroken model’s compliance (sycophancy) gets scored as misalignment. So it’s an exception that the “who initiated the harm?” lens explains. Its worth noting, but not actually unexplained.

If you found this useful, please cite this as:

Vartak, Rohit (Jun 2026). Misalignment in LLMs (1/N). https://Rohit01-zoey.github.io/blog/2026/safety_llms_01/.

or as a BibTeX entry:

@article{vartak2026misalignment-in-llms-1-n,
  title   = {Misalignment in LLMs (1/N)},
  author  = {Vartak, Rohit},
  year    = {2026},
  month   = {Jun},
  url     = {https://Rohit01-zoey.github.io/blog/2026/safety_llms_01/}
}

Enjoy Reading This Article?

Here are some more articles you might like to read next:

why the posts?