Investigation #4 — Feature Splitting Across SAE Sizes¶
Date: 2026-05-27
Runs: 44 (128 feat), 46 (256 feat), 48 (512 feat), 49 (1024 feat)
Notebook: notebooks/08_feature_splitting.ipynb
Setup¶
Four Top-K SAEs (k=8) trained on GPT-2 small layer-0 residual stream activations from a 100-document mixed corpus (geography, biology, code, history). All use the same seed=42 and 8 training epochs. For each consecutive 2× pair, we compute decoder cosine similarity between every live parent feature and every child feature, reporting the top-3 children above a 0.3 cosine threshold.
Mean split fidelity = mean best-child cosine across all live parent features. Thresholds: ≥0.80 clean splitting; 0.50–0.80 partial specialisation; <0.50 reshuffle.
Results¶
| Pair | Parent live | Child live | Mean fidelity | Split dist (0/1/2/3+) |
|---|---|---|---|---|
| 128 → 256 | 44 | 51 | 0.758 | 0 / 6 / 3 / 35 |
| 256 → 512 | 51 | 56 | 0.665 | 0 / 7 / 8 / 36 |
| 512 → 1024 | 56 | 73 | 0.735 | 1 / 4 / 5 / 46 |
All three transitions are in the partial-specialisation band (0.50–0.80). No pair crosses the clean-splitting threshold of 0.80.
Example Splits¶
128→256, parent feat 5 (geography/biology prompts): - Child 227, cos=0.818 → Paris/Eiffel Tower + Python BinaryTree (mixed) - Child 127, cos=0.673 → Fibonacci + Paris (code/geography bleed) - Child 91, cos=0.528 → Python language + Sahara Desert
Interpretation: the parent was broadly "general knowledge"; the children specialise weakly but don't cleanly separate domains.
512→1024, parent feat 17 (Python code — BinaryTree, enumerate): - Child 17, cos=0.720 → BinaryTree + enumerate (closely tracks parent) - Child 513, cos=0.714 → enumerate + BinaryTree (near-duplicate direction) - Child 287, cos=0.432 → same code prompts (third, weaker copy)
Interpretation: the 1024-SAE allocated two near-duplicate directions for this feature instead of splitting it semantically — consistent with superposition theory (nearby directions are used for related but distinct tokens).
512→1024, parent feat 39 (quantum mechanics / code): - Child 39, cos=0.894 → Python code (BinaryTree, Fibonacci) — clean inheritance - Child 556, cos=0.576 → Python language description + DNA (partial) - Child 93, cos=0.490 → Shakespeare + vaccines (unrelated fragment)
Interpretation: the best child (cos=0.894) is a clean specialisation; the trailing children are noise/fragmentation rather than semantic refinement.
Interpretation¶
The platform data partially supports Anthropic's clean-splitting claim:
-
Structure is preserved. Almost all live parent features find at least one child with cosine >0.3 (only 0–1 parents per pair have zero qualifying children). The dictionary is not wholly reshuffling.
-
But splitting is not clean. Mean fidelities of 0.665–0.758 mean the best child typically explains ~70% of the parent direction but doesn't fully inherit it. True clean splitting would require cosines >0.90.
-
Near-duplicate directions are common. Several parent features produce two child features with near-identical cosines (~0.71/0.71), suggesting the larger SAE allocates multiple directions to the same concept rather than splitting into semantically distinct sub-concepts.
-
Live-feature growth is slow. Doubling the dictionary only adds ~7–17 live features (44→51→56→73), meaning >90% of the additional capacity is absorbed by dead features on this small corpus.
Caveats¶
- Small corpus: 992 tokens is far below the scale used in Anthropic's work. With more diverse data, splitting may become cleaner.
- Only 8 epochs: longer training would reduce dead feature count and sharpen decoder directions.
- Layer 0 only: residual stream at layer 0 is largely embedding-level; later layers with richer representations may split more cleanly.
- min_cosine=0.3: raising to 0.5 would cut the "3+ children" bucket but improve the semantic precision of reported splits.
Reproducing¶
# Train the 4 SAEs
mech sweep --base experiments/polysemanticity.yaml \
--axis parameters.n_features=128,256,512,1024 \
--output experiments/sweeps/sae_feature_splitting.yaml \
--execute
# Compute splits for each pair (adjust run IDs to match your db)
mech analyze-feature-splits --parent-run 44 --child-run 46
mech analyze-feature-splits --parent-run 46 --child-run 48
mech analyze-feature-splits --parent-run 48 --child-run 49