Glossary
Definitions of terms used across the site.
Terms used in blog posts appear with a dotted underline. Hover (or tap) them for a short definition and a link back here.
Emergent Misalignment (EM)
A phenomenon in which fine-tuning an otherwise aligned model on a narrow misaligned dataset produces broadly misaligned behavior outside the fine-tuning domain.
Grokking
A training phenomenon where test performance improves suddenly after prolonged training, often long after the model appears to have merely memorized the training set.
Jailbreaking (jailbroken)
Techniques that bypass a model’s safety guardrails so it complies with requests it would normally refuse.
Out-of-Context Reasoning (OOCR)
Reasoning that draws on information learned during training rather than information present in the current prompt.