glossary | Rohit Vartak

Terms used in blog posts appear with a dotted underline. Hover (or tap) them for a short definition and a link back here.

E G H J O

Emergent Misalignment (EM)

A phenomenon in which fine-tuning an otherwise aligned model on a narrow misaligned dataset produces broadly misaligned behavior outside the fine-tuning domain.

Grokking

A training phenomenon in which test performance improves suddenly after prolonged training, often following a period of apparent memorization.

HHH

Helpful, Honest, and Harmless.

Jailbreaking

Techniques that bypass a model’s safety guardrails so it complies with requests it would normally refuse.

Out-of-Context Reasoning (OOCR)

Reasoning about information that is not present in the current prompt but was learned during training and inferred from context.