Hi, I’m Vincent

I work on AI safety research, with a focus on alignment faking detection, mechanistic interpretability, and building tools to make AI systems more transparent.

Research Interests

  • Alignment Faking Detection – SAE probes, mechanistic interpretability, and scalable oversight
  • Interpretability – Understanding how language models represent intent and deception
  • Infrastructure – Video systems, deployment pipelines, and research tooling

Highlights

  • Achieved 0.991 AUROC on alignment faking detection using SAE probes with diverse training data
  • Discovered that the alignment-faking (AF) signal is one-dimensional (a rank-1 LoRA suffices) and is already present in base models
  • Published multiple models and datasets on HuggingFace
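The probe-based detection mentioned above can be sketched in miniature. This is a hypothetical illustration only, not the actual method or data: it fits a linear (logistic-regression) probe on synthetic "SAE feature activations" in which the label-correlated signal lives in a single direction, mirroring the one-dimensional-signal finding, and scores it with AUROC.

```python
import numpy as np

# Hypothetical sketch: a linear probe over synthetic SAE feature
# activations. Sizes, data, and the "signal in one direction" setup
# are assumptions for illustration, not the published experiment.

rng = np.random.default_rng(0)
n, d = 1000, 32                      # examples x SAE features (assumed)
labels = rng.integers(0, 2, size=n)  # 1 = "faking", 0 = "honest" (synthetic)

X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * labels              # signal concentrated in one direction

# Logistic-regression probe trained by plain gradient descent
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - labels) / n)
    b -= 0.5 * np.mean(p - labels)

# AUROC: probability a positive example outscores a negative one
scores = X @ w + b
pos, neg = scores[labels == 1], scores[labels == 0]
auroc = np.mean(pos[:, None] > neg[None, :])
print(f"probe AUROC: {auroc:.3f}")
```

Because the synthetic signal occupies a single feature direction, the trained probe's weight vector ends up dominated by that one coordinate, which is the toy analogue of the rank-1 observation.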