glossary

Definitions of terms used across the site.

Terms used in blog posts appear with a dotted underline. Hover over (or tap) a term for a short definition and a link back here.

Emergent Misalignment (EM)

A phenomenon in which fine-tuning an otherwise aligned model on a narrow misaligned dataset produces broadly misaligned behavior outside the fine-tuning domain.

#safety

Grokking

A training phenomenon where test performance improves suddenly after prolonged training, often following a period of apparent memorization.

#safety

Jailbreaking (jailbroken)

Techniques that bypass a model’s safety guardrails so it complies with requests it would normally refuse.

#safety

Out-of-Context Reasoning (OOCR)

Reasoning that draws on information that is not present in the current prompt but was learned during training, rather than supplied in context.

#safety