The Detector-Controller Handshake: Mapping Two Orthogonal Circuits at Layer 37
There is a moment in mechanistic interpretability work where a confusing ablation result stops being a failure and starts being a finding. This post is about...
May 6, 2026 — bigsnarfdude
Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...
May 5, 2026 — bigsnarfdude
May 2, 2026 — bigsnarfdude
May 2026 — bigsnarfdude
May 1, 2026 — bigsnarfdude
Training Is the Attack Surface
The Probe Went Silent: Belief Drift and the Limits of Honesty Monitoring
LLM Brain Probes vs LLM Behavioural Language: Who Do You Trust?
The Penlight Tightens
April 23, 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 – bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
March 2026 — bigsnarfdude
March 2026 — bigsnarfdude
March 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
March 2026 — bigsnarfdude
March 2026 — bigsnarfdude
March 2026 — bigsnarfdude
Two AI agents. One A10 GPU. One task: improve GSM8K pass@1 on Qwen 2.5 1.5B Instruct from a baseline of 0.620.
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 – bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
April 2026 — bigsnarfdude
March 2026 — bigsnarfdude
Welcome to my new site. I’ll be writing about AI safety research, alignment faking detection, and whatever else seems interesting.
March 2026 — bigsnarfdude
social-gaze
The Cost of Warmth: What MMLU Misses When You Train for Social Attunement
6 minute read
May 2026 — bigsnarfdude