bigsnarfdude

AI Safety Researcher

Posts by Category

research 36

research

What Happens Inside a Language Model

4 minute read

The Setup

The Detector-Controller Handshake: Mapping Two Orthogonal Circuits at Layer 37

6 minute read

There is a moment in mechanistic interpretability work where a confusing ablation result stops being a failure and starts being a finding. This post is about...

SVD and VPD: Two Methods for Reading What Training Wrote Into a Model’s Weights

7 minute read

May 6, 2026 — bigsnarfdude

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...

Solo Mech Interp Research Arc — 51 repos, 3 tracks, 1 ICML submission, 6 months

11 minute read

May 5, 2026 — bigsnarfdude

The Data Made It Do That: Probe Attribution Closes the Loop

6 minute read

May 2, 2026 — bigsnarfdude

The Cost of Warmth: What MMLU Misses When You Train for Social Attunement

6 minute read

May 2026 — bigsnarfdude

Talkie May 1 Retrospective: What the Geometry-Behavior Gap Is Telling Us

5 minute read

May 1, 2026 — bigsnarfdude

Training Is the Attack Surface

6 minute read


The Probe Went Silent: Belief Drift and the Limits of Honesty Monitoring

7 minute read


LLM Brain Probes vs LLM Behavioural Language: Who Do You Trust?

4 minute read


The Penlight Tightens

11 minute read


The Shape of a Training Signal: Same Base, Two Brains

15 minute read

April 23, 2026 — bigsnarfdude

The Monster in the Snow: A Unified Theory of AI Compliance

8 minute read

April 2026 — bigsnarfdude

Why AI Has a Split Personality (And How to Trigger the Evil Twin)

4 minute read

April 2026 — bigsnarfdude

Attentional Hijacking & The Groot Effect

4 minute read

April 2026 — bigsnarfdude

Split Personality

8 minute read

April 2026 – bigsnarfdude

Hand in the Cookie Jar

7 minute read

April 2026 – bigsnarfdude

Good Science with Suppression

8 minute read

April 2026 – bigsnarfdude

Captured by Curiosity

8 minute read

April 2026 — bigsnarfdude

Adversarial Truth: An ICL Attack in One Forward Pass

4 minute read

April 2026 – bigsnarfdude

Why AI Swarms Confidently Agree on Half-Truths

7 minute read

April 2026 — bigsnarfdude

Truth Jailbreak

6 minute read

April 2026 – bigsnarfdude

The Math Behind the Chaos

9 minute read

April 2026 – bigsnarfdude

Chaos Takes the Wheel: Salience-Weighted Attentional Hijacking via True Statements

9 minute read

April 2026 — bigsnarfdude

The Benchmark We Didn’t Design

5 minute read

April 2026 — bigsnarfdude

Bad Truth: How Chaos Agents Shape a Network

3 minute read

April 2026 — bigsnarfdude

Manipulating Multi-Agent Systems with the Truth

7 minute read

April 2026 — bigsnarfdude

The Runaway Train That Never Left the Station

4 minute read

April 2026 — bigsnarfdude

researchRalph v4.5 — Agent Responsibly

7 minute read

March 2026 — bigsnarfdude

Mathlib Steers the Agent: A Controlled Test of Multi-Agent Theorem Proving

5 minute read

March 2026 — bigsnarfdude

How We Built ResearchRalph Multi-Agent — And What the Agents Taught Us

5 minute read

March 2026 — bigsnarfdude

We Stacked Our Loop on Claudini’s Tower

5 minute read

Trust but Verify: The Human Verification Layer overseeing ResearchRalph agents

5 minute read

Can AI Agents Rediscover DeepSeek-R1? An Experiment with RRMA

6 minute read

Two AI agents. One A10 GPU. One task: improve GSM8K pass@1 on Qwen 2.5 1.5B Instruct from a baseline of 0.620.

Detecting Alignment Faking in Language Models

less than 1 minute read

Welcome to my new site. I’ll be writing about AI safety research, alignment faking detection, and whatever else seems interesting.
