Hi, I’m Vincent
I work on AI safety research, with a focus on alignment faking detection, mechanistic interpretability, and building tools to make AI systems more transparent.
Research Interests
- Alignment Faking Detection – sparse autoencoder (SAE) probes, mechanistic interpretability, and scalable oversight
- Interpretability – Understanding how language models represent intent and deception
- Infrastructure – Video systems, deployment pipelines, and research tooling
Highlights
- Achieved 0.991 AUROC on alignment faking detection using SAE probes with diverse training data
- Discovered that the alignment-faking signal is effectively one-dimensional (a rank-1 LoRA suffices) and pre-exists in base models
- Published multiple models and datasets on HuggingFace
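The SAE-probe approach in the highlights above amounts to fitting a linear classifier on sparse-autoencoder feature activations. A minimal sketch, using synthetic data in place of real model activations; the feature dimension, sample count, and signal strength are all illustrative, not the actual experimental setup:

```python
# Minimal sketch of a linear "SAE probe": fit logistic regression on
# SAE feature activations to score text for alignment faking.
# All data below is synthetic; dimensions and labels are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 400, 64  # samples x SAE feature dims (illustrative sizes)

# Synthetic activations: positive examples shift along a single unit
# direction, mirroring the finding that the signal is one-dimensional.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(labels, direction) * 1.5

# The probe is just a linear decision boundary over SAE features.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
scores = probe.decision_function(acts)
print(f"train AUROC: {roc_auc_score(labels, scores):.3f}")
```

Because the synthetic signal lives along one direction, the learned probe weights concentrate there too, which is the intuition behind the rank-1 finding.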