Posts by Tag

mechanistic-interpretability

What Happens Inside a Language Model

4 minute read

The Setup

SVD and VPD: Two Methods for Reading What Training Wrote Into a Model’s Weights

7 minute read

May 6, 2026 — bigsnarfdude

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...

Chaos Takes the Wheel: Salience-Weighted Attentional Hijacking via True Statements

9 minute read

April 2026 — bigsnarfdude

alignment

What Happens Inside a Language Model

4 minute read

The Setup

Why AI Has a Split Personality (And How to Trigger the Evil Twin)

4 minute read

April 2026 — bigsnarfdude

stage-attribution

SVD and VPD: Two Methods for Reading What Training Wrote Into a Model’s Weights

7 minute read

May 6, 2026 — bigsnarfdude

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...

talkie

SVD and VPD: Two Methods for Reading What Training Wrote Into a Model’s Weights

7 minute read

May 6, 2026 — bigsnarfdude

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...

dpo

What Happens Inside a Language Model

4 minute read

The Setup

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...

developmental-interpretability

SVD and VPD: Two Methods for Reading What Training Wrote Into a Model’s Weights

7 minute read

May 6, 2026 — bigsnarfdude

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...

alignment-faking

Detecting Alignment Faking in Language Models

less than 1 minute read

Welcome to my new site. I’ll be writing about AI safety research, alignment faking detection, and whatever else seems interesting.

interpretability

Detecting Alignment Faking in Language Models

less than 1 minute read

Welcome to my new site. I’ll be writing about AI safety research, alignment faking detection, and whatever else seems interesting.

SAE

Detecting Alignment Faking in Language Models

less than 1 minute read

Welcome to my new site. I’ll be writing about AI safety research, alignment faking detection, and whatever else seems interesting.

multi-agent

Chaos Takes the Wheel: Salience-Weighted Attentional Hijacking via True Statements

9 minute read

April 2026 — bigsnarfdude

truth-contagion

Chaos Takes the Wheel: Salience-Weighted Attentional Hijacking via True Statements

9 minute read

April 2026 — bigsnarfdude

sae

Chaos Takes the Wheel: Salience-Weighted Attentional Hijacking via True Statements

9 minute read

April 2026 — bigsnarfdude

ai-safety

Chaos Takes the Wheel: Salience-Weighted Attentional Hijacking via True Statements

9 minute read

April 2026 — bigsnarfdude

llm-safety

Why AI Has a Split Personality (And How to Trigger the Evil Twin)

4 minute read

April 2026 — bigsnarfdude

prompt-injection

Why AI Has a Split Personality (And How to Trigger the Evil Twin)

4 minute read

April 2026 — bigsnarfdude

sycophancy

Why AI Has a Split Personality (And How to Trigger the Evil Twin)

4 minute read

April 2026 — bigsnarfdude

sft

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...

control-validation

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

11 minute read

Numbers from n=50 across base/SFT/IT on talkie-1930-13b and base/SFT/IT on OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruit...
