Wizard101 - AI Safety Research
AI safety project implementing GuardReasoner, a reasoning-based LLM safety classifier reported to reach an 84% F1 score. It features two-stage training (R-SFT followed by HS-DPO), low-cost data generation with Gemini 2.0 (600× cheaper than GPT-4o), and transparent step-by-step reasoning for each safety decision. The current LLaMA 3.2-3B model reaches 59% accuracy; the target with the full pipeline is 80-85%.
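
The figures above mix two metrics (F1 and accuracy), which are not directly comparable. As a minimal sketch, assuming a binary harmful/safe labelling, the two can be computed as below. The function name `binary_metrics` and the label strings are hypothetical and not taken from this project's code.

```python
def binary_metrics(y_true, y_pred, positive="harmful"):
    """Return (accuracy, f1) for a binary safety classifier.

    Assumption: labels are plain strings and `positive` marks the harmful class.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # harmful, flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # safe, flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # harmful, missed
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


if __name__ == "__main__":
    truth = ["harmful", "harmful", "safe", "safe"]
    preds = ["harmful", "safe", "safe", "safe"]
    acc, f1 = binary_metrics(truth, preds)
    print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

Because F1 ignores true negatives while accuracy counts them, the same model can score quite differently on the two, so the 59% and 84% figures should be compared with care.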


