Hi, I’m bigsnarfdude

I work on AI safety research, with a focus on alignment faking detection, mechanistic interpretability, and building tools to make AI systems more transparent.

Research Interests

  • Alignment Faking Detection – SAE probes, mechanistic interpretability, and scalable oversight
  • Interpretability – Understanding how language models represent intent and deception
  • Adversarial Robustness – Testing whether deception probes survive adversarial fine-tuning

Key Findings

  • 0.991 AUROC on alignment faking detection using sparse autoencoder (SAE) probes trained on diverse data – matching fine-tuned detectors
  • The alignment faking (AF) signal is one-dimensional (a rank-1 LoRA suffices) and already exists in base models (L45 probe = 0.966 without training)
  • AF is geometrically closer to factual lying than to strategic deception; AF and sycophancy are representationally independent
  • Monitoring awareness is 99.6% orthogonal to AF intent – methods that look for “knows it’s watched” won’t find AF
  • Deception probes are adversarially robust – adversarial SFT can rotate the signal, but freshly trained probes always recover 1.0 AUROC
  • Published 6 models and 2 datasets on HuggingFace across 61 days of research
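
For readers new to the terms above, here is a minimal sketch of what "probe AUROC" and "orthogonal directions" mean. It is not the pipeline behind these results. It assumes toy Gaussian "activations" in which two hypothetical binary concepts (A and B) are planted on separate axes, and it uses a simple difference-of-means probe in place of SAE probes. All function and variable names are illustrative.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_diff_direction(xs, labels):
    """Difference-of-means probe: unit vector from the class-0 mean to the class-1 mean."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, labels) if y == 1]
    neg = [x for x, y in zip(xs, labels) if y == 0]
    diff = [
        sum(x[i] for x in pos) / len(pos) - sum(x[i] for x in neg) / len(neg)
        for i in range(dim)
    ]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def make_synthetic(n, dim, rng):
    """Toy data (an assumption, not real model activations):
    concept A shifts axis 0, concept B shifts axis 1, sampled independently."""
    xs, a_labels, b_labels = [], [], []
    for _ in range(n):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 4.0 * a
        x[1] += 4.0 * b
        xs.append(x)
        a_labels.append(a)
        b_labels.append(b)
    return xs, a_labels, b_labels


rng = random.Random(0)
train_x, train_a, train_b = make_synthetic(400, 16, rng)
test_x, test_a, test_b = make_synthetic(400, 16, rng)

# Fit one probe direction per concept on the training split.
dir_a = mean_diff_direction(train_x, train_a)
dir_b = mean_diff_direction(train_x, train_b)

# Score held-out points by projecting onto the concept-A direction.
auroc_a = auroc([dot(x, dir_a) for x in test_x], test_a)

# Both directions are unit vectors, so their dot product is the cosine similarity.
cosine = dot(dir_a, dir_b)

print(f"held-out AUROC for concept A: {auroc_a:.3f}")
print(f"cosine(dir_a, dir_b): {cosine:.3f}")
```

In this toy setup the two concepts are independent by construction, so the cosine between the probe directions stays near zero while each probe still separates its own concept well. That is the sense in which a probe for one concept tells you little about the other.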