Mindreader - Alignment Faking Detection
Mindreader detects alignment faking: cases where an AI system strategically complies during training while preserving hidden values. It fine-tunes GPT-OSS 120B with QLoRA to recognize self-preservation reasoning patterns in model decision-making, targeting the failure mode documented in Anthropic's alignment-faking research, in which models behave as aligned during training while harboring different values. The repository includes the pre-trained af-detector-120b-v3 model and evaluation tooling.
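
Below is a minimal sketch of what the QLoRA fine-tuning step could look like with Hugging Face `transformers`, `peft`, and `bitsandbytes`. The base-model id (`openai/gpt-oss-120b`), the dataset path and format (`data/af_traces.jsonl` with a `text` field), the LoRA rank and target modules, and all hyperparameters are illustrative assumptions, not this repository's actual training configuration:

```python
# Hypothetical QLoRA fine-tuning sketch -- model id, data path, and all
# hyperparameters below are assumptions, not the repo's real config.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "openai/gpt-oss-120b"  # assumed Hugging Face id

# 4-bit NF4 quantization -- the "Q" in QLoRA. The frozen base weights are
# stored in 4 bits; only the LoRA adapters train in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; the rank, alpha, and
# target-module names are placeholders and may differ for GPT-OSS.
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Assumed data: reasoning traces labeled as alignment-faking or benign,
# serialized as a single "text" field per example.
dataset = load_dataset("json", data_files="data/af_traces.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir="af-detector-120b-v3",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()
trainer.save_model("af-detector-120b-v3")  # saves only the LoRA adapter
```

At inference time, the released adapter can be attached to the quantized base with `peft.PeftModel.from_pretrained(base, "af-detector-120b-v3")`, assuming it is published as a PEFT adapter. Note that even in 4-bit a 120B base needs on the order of 60+ GB of GPU memory, so `device_map="auto"` sharding across multiple GPUs is assumed here.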


