Investigations¶

All investigations run on local hardware (Apple Silicon MBP, 128 GB RAM) using the platform's experiment families. Each one has exact reproduce commands at the bottom.

Summary table¶

#	Investigation	Model	Headline number	Status
1	Qwen Refusal Audit	Qwen2.5-1.5B-Instruct	Refusal direction extractable (quality 4.2 @ L12), but single-layer steering fails	Complete
2	GPT-2 Factual Recall	GPT-2 small	4-site circuit, 72% faithfulness, commits at layer 9	Complete
3	SAE Replication Crisis	GPT-2 small	Median best-match cosine 0.323–0.500; stability @ ≥0.9 = 0%	Complete
4	Feature Splitting	GPT-2 small	128→256: 0.714 mean fidelity (clean); 256→512: 0.601 (partial); 512→1024: 0.421 (reshuffle)	Complete
5	SAE at Scale	GPT-2 medium	2048-feature SAE, L12, Pile-1k corpus; 52 live features, top cluster: geographic/demographic	Complete

What each investigation tested¶

1 — Qwen Refusal Audit (negative result)¶

4-stage mechanistic audit of Qwen2.5-1.5B-Instruct refusal behavior. Stage 1 extracted a refusal direction with quality 4.1–4.2 at layers 10–12. Stage 2 showed single-layer CAA steering at ±3 does not flip compliance. Stage 3 circuit patching found no dominant attention head cluster. Stage 4 causal scrubbing showed <30% faithfulness — the mechanism is distributed.

Implication: small instruct models may have more robust (distributed) safety properties than the abliteration literature implicitly assumes.

2 — GPT-2 Small Factual Recall¶

Logit lens, DLA, attribution patching, circuit patching, and SAE analysis on factual recall prompts ("The capital of France is…"). Sharp phase transition at layer 9: mean rank drops from 375 at L8 to 12.8 at L9. L9.MLP writes the answer; L8.MLP suppresses competing tokens. Circuit achieves 72% faithfulness under causal scrubbing.

3 — SAE Replication Crisis¶

Five identical Top-K SAEs trained on GPT-2 small (layer 0 and layer 6) with seeds 1–5. Pairwise Hungarian matching on decoder cosines. Four conditions: layer-0/full, layer-0/live-only, layer-6/live-only, layer-6/512-feature/live-only. Stability fraction at cosine ≥ 0.9 is 0% in all conditions at layer 6. The dead-feature confound (66% dead features) inflates the problem at full-matrix analysis.

4 — Feature Splitting¶

Four SAEs (128, 256, 512, 1024 features) trained on GPT-2 small layer 0. Clean splitting (mean fidelity ≥ 0.80) at 128→256, partial specialisation at 256→512, reshuffle at 512→1024. Larger dictionaries at this layer produce more equivalent bases, not sharper concepts.

5 — SAE at Scale¶

2048-feature Top-K SAE on GPT-2 medium (345M), layer 12 mid-network, 1000-document Pile corpus. 52 live features at 20k training tokens. Top 5 features cluster around geographic/demographic representations. Wall-clock: ~6 min 15 sec on Apple Silicon CPU.