Investigations

All investigations ran on local hardware (Apple Silicon MacBook Pro, 128 GB RAM) using the platform's experiment families. Exact commands to reproduce each one are listed at the bottom.

Summary table

| # | Investigation | Model | Headline number | Status |
|---|---------------|-------|-----------------|--------|
| 1 | Qwen Refusal Audit | Qwen2.5-1.5B-Instruct | Refusal direction extractable (quality 4.2 @ L12), but single-layer steering fails | Complete |
| 2 | GPT-2 Factual Recall | GPT-2 small | 4-site circuit, 72% faithfulness, commits at layer 9 | Complete |
| 3 | SAE Replication Crisis | GPT-2 small | Median best-match cosine 0.323–0.500; stability @ ≥0.9 = 0% | Complete |
| 4 | Feature Splitting | GPT-2 small | 128→256: 0.714 mean fidelity (clean); 256→512: 0.601 (partial); 512→1024: 0.421 (reshuffle) | Complete |
| 5 | SAE at Scale | GPT-2 medium | 2048-feature SAE, L12, Pile-1k corpus; 52 live features, top cluster: geographic/demographic | Complete |

What each investigation tested

1 — Qwen Refusal Audit (negative result)

Four-stage mechanistic audit of refusal behavior in Qwen2.5-1.5B-Instruct. Stage 1 extracted a refusal direction with quality 4.1–4.2 at layers 10–12. Stage 2 showed that single-layer CAA (contrastive activation addition) steering at ±3 does not flip compliance. Stage 3 circuit patching found no dominant cluster of attention heads. Stage 4 causal scrubbing showed <30% faithfulness, which indicates that the mechanism is distributed.
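As an illustration of Stages 1 and 2, the following is a minimal sketch of difference-of-means direction extraction and additive steering on synthetic activations. It is not the platform's implementation: the function names `refusal_direction` and `steer`, the toy data, and the reading of ±3 as an additive coefficient are all assumptions, and the "quality" score reported above is not reproduced here because its definition is not given in this section.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless).

    Both inputs have shape (n_prompts, d_model) and are assumed to be
    residual-stream activations collected at a single layer.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Add alpha * direction to every row of hidden (single-layer steering)."""
    return hidden + alpha * direction

# Synthetic stand-in for real activations: harmful prompts are shifted
# along one hidden axis (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

direction = refusal_direction(harmful, harmless)
steered = steer(harmless, direction, alpha=3.0)

# Mean projection onto the direction moves by exactly alpha,
# because the direction has unit norm.
shift = (steered @ direction).mean() - (harmless @ direction).mean()
print(round(float(shift), 6))
```

In a real run the steered activations would be written back into the model's forward pass at the chosen layer, and compliance would be measured on the generated text. The sketch only shows the arithmetic of the intervention.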

Implication: small instruct models may have more robust (distributed) safety properties than the abliteration literature implicitly assumes.

2 — GPT-2 Small Factual Recall

Logit lens, direct logit attribution (DLA), attribution patching, circuit patching, and SAE analysis on factual recall prompts ("The capital of France is…"). A sharp phase transition occurs at layer 9: the mean rank of the correct token drops from 375 at L8 to 12.8 at L9. L9.MLP writes the answer; L8.MLP suppresses competing tokens. The circuit achieves 72% faithfulness under causal scrubbing.
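The per-layer mean rank quoted above can be computed along these lines. This is a hedged sketch on synthetic data, not the investigation's code: `target_rank` and `mean_rank_by_layer` are hypothetical names, the unembedding matrix `W_U` is random, and the final layer norm that a real logit lens on GPT-2 would apply before unembedding is omitted for brevity.

```python
import numpy as np

def target_rank(resid, W_U, target_id):
    """1-based rank of target_id among the logits obtained by unembedding resid."""
    logits = resid @ W_U
    return int((logits > logits[target_id]).sum()) + 1

def mean_rank_by_layer(resid_by_layer, W_U, target_ids):
    """Mean target rank per layer.

    resid_by_layer: (n_layers, n_prompts, d_model) residual stream at the
    final token position; target_ids: (n_prompts,) correct-answer token ids.
    """
    n_layers, n_prompts, _ = resid_by_layer.shape
    ranks = np.array([
        [target_rank(resid_by_layer[l, p], W_U, target_ids[p])
         for p in range(n_prompts)]
        for l in range(n_layers)
    ])
    return ranks.mean(axis=1)

# Toy residual streams: the answer direction is written in with growing
# strength across three hypothetical layers.
rng = np.random.default_rng(0)
n_prompts, d_model, vocab = 8, 64, 500
W_U = rng.normal(size=(d_model, vocab))
target_ids = rng.integers(0, vocab, size=n_prompts)
strengths = [0.0, 0.5, 10.0]
resid = np.stack([
    rng.normal(size=(n_prompts, d_model)) + s * W_U[:, target_ids].T
    for s in strengths
])

mean_ranks = mean_rank_by_layer(resid, W_U, target_ids)
print(mean_ranks)
```

A sharp drop in this curve between two adjacent layers is the kind of signature the L8 to L9 transition describes.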

3 — SAE Replication Crisis

Five identically configured Top-K SAEs were trained on GPT-2 small (layer 0 and layer 6) with seeds 1–5, then compared by pairwise Hungarian matching on decoder cosine similarities. Four conditions: layer-0/full, layer-0/live-only, layer-6/live-only, layer-6/512-feature/live-only. Stability fraction at cosine ≥ 0.9 is 0% in all layer-6 conditions. The dead-feature confound (66% dead features) inflates the problem in the full-matrix analysis.
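The matching step for one pair of seeds can be sketched as follows, assuming each SAE's decoder is available as an array of shape `(n_features, d_model)`. The helper names `match_decoders` and `stability_fraction` are illustrative and not taken from the platform; the optimal one-to-one assignment uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_decoders(dec_a, dec_b):
    """Cosine similarity of each matched feature pair under an optimal
    one-to-one (Hungarian) assignment between two decoders."""
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    cos = a @ b.T                                   # (n_a, n_b) cosine matrix
    rows, cols = linear_sum_assignment(cos, maximize=True)
    return cos[rows, cols]

def stability_fraction(matched_cos, threshold=0.9):
    """Fraction of matched pairs whose cosine meets the threshold."""
    return float((matched_cos >= threshold).mean())

# Toy decoders: dec_b is a row permutation of dec_a (same features, new
# order); dec_c is an unrelated random dictionary.
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(8, 16))
dec_b = dec_a[rng.permutation(8)]
dec_c = rng.normal(size=(8, 16))

same = match_decoders(dec_a, dec_b)
other = match_decoders(dec_a, dec_c)
print(stability_fraction(same), float(np.median(other)))
```

For a live-only condition, dead rows would be dropped from each decoder before matching; `linear_sum_assignment` accepts rectangular matrices, so the two dictionaries need not keep the same number of live features.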

4 — Feature Splitting

Four SAEs (128, 256, 512, 1024 features) trained on GPT-2 small layer 0. Clean splitting (mean fidelity ≥ 0.80) at 128→256, partial specialisation at 256→512, reshuffle at 512→1024. Larger dictionaries at this layer produce more equivalent bases, not sharper concepts.

5 — SAE at Scale

2048-feature Top-K SAE on GPT-2 medium (345M), layer 12 (mid-network), 1000-document Pile corpus. 52 features are live at 20k training tokens. The top 5 features cluster around geographic/demographic representations. Wall-clock time: ~6 min 15 sec on Apple Silicon CPU.
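For readers unfamiliar with the terms, here is a minimal sketch of a Top-K SAE forward pass and a live-feature count. It assumes one common formulation (subtract the decoder bias, keep the k largest pre-activations per input, apply a ReLU to the kept values) and counts a feature as live if it fires on at least one input; the platform's exact architecture and liveness criterion may differ. All names and sizes below are hypothetical.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Top-K SAE forward pass for inputs x of shape (n, d_model).

    Returns the sparse codes z (n, n_features) and reconstruction x_hat.
    """
    pre = (x - b_dec) @ W_enc + b_enc                 # (n, n_features)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]    # top-k indices per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)    # ReLU on kept values
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def count_live(z):
    """Number of features with a nonzero activation on at least one input."""
    return int((z != 0).any(axis=0).sum())

# Toy sizes and random, untrained weights; these are not the 2048-feature
# configuration described above.
rng = np.random.default_rng(0)
n, d_model, n_features, k = 50, 8, 32, 4
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)
x = rng.normal(size=(n, d_model))

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
n_live = count_live(z)
print(n_live, "of", n_features, "features live")
```

With trained weights and a small token budget, most features may never enter the top k for any input, which is one way a dictionary of 2048 features can end up with only a few dozen live ones.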