I’m bigsnarfdude, an AI safety researcher based in Canada.

My work focuses on detecting alignment faking in large language models using sparse autoencoder (SAE) probes and mechanistic interpretability techniques. I’m interested in building practical tools for AI oversight that scale beyond manual inspection.
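At its core, an SAE probe is just a lightweight classifier trained on sparse-autoencoder feature activations. A minimal sketch in plain NumPy gives the flavor: a logistic-regression probe fit on synthetic activations standing in for real SAE features (everything here, including the planted predictive feature, is illustrative, not from my actual pipelines):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for SAE feature activations: n samples x d features.
# Feature 7 is, by construction, weakly predictive of the binary label
# (think: "alignment-faking" vs. "honest" transcripts).
n, d = 2000, 64
X = rng.normal(size=(n, d))
y = (X[:, 7] + 0.5 * rng.normal(size=n) > 0).astype(float)

# Logistic-regression probe trained with plain gradient descent on mean log-loss.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= lr * (X.T @ (p - y) / n)           # gradient w.r.t. weights
    b -= lr * (p - y).mean()                # gradient w.r.t. bias

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = ((p > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
print(f"top feature by |weight|: {np.abs(w).argmax()}")
```

The probe recovers the planted feature: the largest-magnitude weight lands on feature 7. Real SAE probes work the same way, just with activations from a trained sparse autoencoder instead of synthetic data.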

More recently, I’ve focused on automated mechanistic interpretability research: building multi-agent systems like ResearchRalph (RRMA) that autonomously design experiments, run them, and iterate on their findings. The goal is to make mech interp research itself scalable: agents that can probe model internals, test hypotheses about representations, and surface safety-relevant structure without a human in every loop.

Background

  • Automated mechanistic interpretability (multi-agent research systems, autonomous experiment design)
  • Alignment faking detection (SAE probes, geometric decomposition, adversarial robustness)
  • Infrastructure and systems engineering (BIRS video systems, deployment pipelines)
  • Open-source contributor: 38 models and 9 datasets on HuggingFace

Contact

Find me on GitHub, HuggingFace, or Kaggle.