
Investigation #3 — SAE Seed-Reproducibility vs Training Scale

2026-05-28 update — scale control added. The original framing ("replication crisis") was overstated and partly confounded by under-training. A 30× scale control (pile-1k, 30,186 tokens, layer 6, 512 features) shows the dead-feature ratio collapses 0.65 → 0.085 and the live-only median best-match cosine RISES 0.257 → 0.472, with 4.33% of features now crossing the 0.9 bar (up from 0%). So: reproducibility is scale-sensitive and improving, NOT a fixed crisis — but it remains far from the "same feature" bar even at 30K tokens. Whether it reaches 0.9 at production scale (1M–1B tokens) is untested and is the real open question. The accurate, postable framing lives in docs/publications/lesswrong_post.md; raw control data in docs/publications/sae_replication_artifacts/scale_control.json. The original small-scale numbers below stand as the lower-scale data point.

Date: 2026-05-27 (scale control 2026-05-28)
Model: gpt2-small, layer 0 residual stream (blocks.0.hook_resid_pre)
Setup: 128 features, k=8 Top-K SAE, 100-doc corpus, ~1000 tokens per run


Hypothesis

SAE feature dictionaries are not reproducible across random seeds. If the decoder directions learned by two SAEs with identical hyperparameters are effectively independent samples from a high-dimensional space, then published feature descriptions describe a particular training run rather than the model.


Method

Exact commands run

# 1. Train 5 SAEs with seeds 1–5
cd <worktree>
mech sweep \
  --base experiments/polysemanticity.yaml \
  --axis "parameters.seed=1,2,3,4,5" \
  --output experiments/sweeps/sae_seed_stability.yaml \
  --execute

# Produced: artifacts/run-000001 through run-000005

# 2. Pairwise alignment (run directly; CLI command added in this PR)
python -c "
from mech_interp.analysis.sae_seed_stability import compute_stability_report
import json
from pathlib import Path
report = compute_stability_report(
    [Path(f'artifacts/run-{i:06d}') for i in range(1,6)],
    threshold=0.9, top_k=20
)
Path('artifacts/seed_stability_report.json').write_text(json.dumps(report, indent=2))
"

Each SAE: 128 features, k=8, 8 training epochs on the openwebtext_sample.jsonl 100-doc corpus (~992 tokens after tokenization). Training took ~30s per run on CPU.

Matching method

For each pair (seed_i, seed_j), we:

  1. Extract the decoder weight matrix from each SAE: shape (n_features, d_model), L2-normalised row-wise.

  2. Compute the full (128, 128) cosine similarity matrix.

  3. Run scipy.optimize.linear_sum_assignment (Hungarian algorithm) to find the maximum-weight one-to-one bipartite matching.

  4. Report the distribution of matched cosines.
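The four steps above can be sketched as follows. This is a minimal illustration, not the code in sae_seed_stability.py: the function name `matched_cosines` and the synthetic decoder are assumptions made for the example. One detail worth noting is that `linear_sum_assignment` minimises cost by default, so a maximum-weight matching needs `maximize=True` (or a negated matrix).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def matched_cosines(dec_a: np.ndarray, dec_b: np.ndarray) -> np.ndarray:
    """Cosines of the maximum-weight one-to-one matching between two decoders.

    dec_a, dec_b: arrays of shape (n_features, d_model). Hypothetical helper.
    """
    # Step 1: L2-normalise each decoder row.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    # Step 2: full cosine similarity matrix, shape (n_a, n_b).
    cos = a @ b.T
    # Step 3: the solver minimises by default, so request maximisation.
    rows, cols = linear_sum_assignment(cos, maximize=True)
    # Step 4: one cosine per matched pair.
    return cos[rows, cols]


# Sanity check on synthetic data: the same dictionary with shuffled
# feature indices should be recovered with every matched cosine at 1.
rng = np.random.default_rng(0)
dec = rng.normal(size=(128, 768))
shuffled = dec[rng.permutation(128)]
recovered = matched_cosines(dec, shuffled)
```

A permuted copy of the same dictionary is a useful smoke test because the matching is index-invariant by construction: any score below 1 in that case would point to a bug rather than to seed instability.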


Headline Numbers

| Metric | Value |
|---|---|
| Global median cosine (all 1280 matched pairs) | 0.095 |
| Median-of-medians across 10 pairs | 0.095 |
| Mean-of-means across 10 pairs | 0.208 |
| Stability fraction at cosine ≥ 0.9 | 0.16% (2/1280) |
| Stability fraction at cosine ≥ 0.7 | ~1.5% |
| Stability fraction at cosine ≥ 0.5 | ~12% |
| Best single matched pair cosine (across all pairs) | 0.927 (run-1 × run-3) |

Only 0.16% of matched features (2 of 1280) reach cosine ≥ 0.9 across seeds. The median matched cosine is ~0.095, barely above zero.
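The counts behind these numbers can be reproduced with a few lines of arithmetic. The sketch below is illustrative: `stability_fraction` is a hypothetical helper, and the `cosines` list is a placeholder that keeps only the two reported values above 0.9 and fills the rest with the reported median, so it is not the real distribution.

```python
from itertools import combinations

seeds = [1, 2, 3, 4, 5]
n_features = 128

seed_pairs = list(combinations(seeds, 2))    # unordered seed pairs
n_matched = len(seed_pairs) * n_features     # one matched cosine per feature per pair


def stability_fraction(cosines, threshold):
    """Fraction of matched cosines at or above `threshold` (hypothetical helper)."""
    return sum(c >= threshold for c in cosines) / len(cosines)


# Placeholder data: the two reported matches above 0.9, padded with the
# reported median. Replace with the real matched cosines from the report.
cosines = [0.927, 0.921] + [0.095] * (n_matched - 2)
frac_at_09 = stability_fraction(cosines, 0.9)
print(len(seed_pairs), n_matched, frac_at_09)
```

Five seeds give 10 unordered pairs, and 10 pairs of 128-feature dictionaries give the 1280 matched cosines used as the denominator throughout.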


Pairwise Matrix

Median cosines by pair (all 10 combinations of seeds 1–5):

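| | seed=1 | seed=2 | seed=3 | seed=4 | seed=5 |
|---|---|---|---|---|---|
| seed=1 | 1.000 | 0.093 | 0.096 | 0.095 | 0.100 |
| seed=2 | 0.093 | 1.000 | 0.094 | 0.095 | 0.093 |
| seed=3 | 0.096 | 0.094 | 1.000 | 0.099 | 0.100 |
| seed=4 | 0.095 | 0.095 | 0.099 | 1.000 | 0.094 |
| seed=5 | 0.100 | 0.093 | 0.100 | 0.094 | 1.000 |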

All off-diagonal values fall between 0.093 and 0.100, a spread of less than 0.01. Alignment is uniformly low, with no seed pair that is systematically "closer" than the others.

For the heatmap image, see: docs/investigations/sae_replication_heatmap.png (generated by notebooks/06_sae_replication_crisis.ipynb)


Three Feature Examples

Example A (best match found, cosine = 0.927, run-1 × run-3): Feature 62 in seed=1 aligns with feature 83 in seed=3 at cosine=0.927. This is the single best match among all 1280 matched pairs. It exists, but it is an extreme outlier; the next-best matched pair is 0.921.

Example B (borderline match, ~0.5 cosine): Several features per pair land near 0.5, which corresponds to a 60° angle between decoder directions. That is far above chance for 768-dimensional vectors (random cosines have a standard deviation of about 1/√768 ≈ 0.036), so the directions share real structure, but they are not interpretably "the same feature."

Example C (clearly different, worst matched cosines ≈ -0.2 to 0.0): The majority of matched pairs have cosines in [0.0, 0.2]. Two unrelated random unit vectors in 768-d space have expected cosine ≈ 0, and because optimal matching selects for large cosines, matched pairs of random directions sit somewhat above zero. Most matched pairs are therefore consistent with chance.
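Example C compares matched cosines to a random baseline. A quick way to see what that baseline looks like is to match two dictionaries of purely random unit vectors. The sketch below is an assumption-laden illustration (Gaussian random directions, a single draw, hypothetical helper name `random_unit_rows`), not part of the analysis pipeline. It separates two quantities: the spread of unmatched cosines, which should be close to 1/√768, and the median of matched cosines, which is pushed above zero because the assignment picks large entries.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_features, d_model = 128, 768


def random_unit_rows(n: int, d: int) -> np.ndarray:
    """n random unit vectors in d dimensions (hypothetical helper)."""
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


a = random_unit_rows(n_features, d_model)
b = random_unit_rows(n_features, d_model)
cos = a @ b.T

# Unmatched cosines: centred on 0 with spread close to 1/sqrt(d_model).
raw_std = float(cos.std())

# Matched cosines: the assignment selects large entries, so the null
# median is positive even though the vectors are unrelated.
rows, cols = linear_sum_assignment(cos, maximize=True)
null_median = float(np.median(cos[rows, cols]))

print(f"raw std {raw_std:.3f} vs 1/sqrt(d) {d_model ** -0.5:.3f}")
print(f"matched null median {null_median:.3f}")
```

Repeating this over many draws would give a null distribution against which the observed median of 0.095 can be compared directly.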


What This Means for the SAE Literature

  1. Features are not reproducible at threshold 0.9. Across 10 seed pairs, only 2 of 1280 matched pairs exceeded 0.9 cosine similarity. The dictionary is essentially re-drawn from scratch on each run.

  2. The median matched cosine (0.095) is near the random baseline for 768-dimensional unit vectors (expected value ≈ 0). The learned dictionaries are not converging to a shared basis.

  3. Published feature descriptions describe one training run, not a stable property of the model. If you re-train with a different seed, the "banana feature" at index 47 in one run may have no close counterpart in the other run's dictionary.

  4. This does not mean SAEs are useless. Each individual run may discover genuine structure in the residual stream. But claims that a specific feature is interpretable must be replicated across seeds before they can be trusted as model-level claims.

  5. The finding is consistent with known problems in dictionary learning: overcomplete bases are highly non-unique; there is an entire orbit of equally good solutions under rotation and permutation.


Robustness Checks (Investigation #3 Follow-Up)

Three confounds were identified in the initial result and addressed in a follow-up experiment. Results are compared across four conditions below.

Four-Condition Comparison

| Condition | n_features | layer | matching | median cosine | stab @ ≥0.9 |
|---|---|---|---|---|---|
| Layer-0 128f (full matrix, original) | 128 | 0 | all features | 0.095 | 0.16% |
| Layer-0 128f (live-only) | 128 | 0 | live only | 0.500 | 0.48% |
| Layer-6 128f (live-only) | 128 | 6 | live only | 0.323 | 0.00% |
| Layer-6 512f (live-only) | 512 | 6 | live only | 0.257 | 0.00% |

Training details:

  • Layer-0 128f: 128 features, k=8, 8 epochs, seeds 1-5, ~40-45 live features/run (~66% dead)
  • Layer-6 128f: 128 features, k=8, 8 epochs, seeds 1-5, ~63-78 live features/run (~46% dead)
  • Layer-6 512f: 512 features, k=32, 10 epochs, seeds 1-5, ~289-313 live features/run (~40% dead)
  • All runs: gpt2-small, 100-doc openwebtext corpus, 992 tokens

Does Fixing the Confounds Change the Headline?

The claim holds up, but with important nuance.

Confound 1 — dead features. Restricting to live-only features at layer 0 raises the median cosine from 0.095 to 0.500. This is the single largest effect of any confound. It means dead-to-dead random matching was strongly deflating the full-matrix result. Among features that actually activate, the typical best-match cosine is ~0.5 — meaningful overlap but well below the 0.9 threshold that would indicate reproducibility. Stability fraction at ≥ 0.9 rises from 0.16% to 0.48% — essentially still zero.
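The dead-feature effect is easy to reproduce on synthetic data. The sketch below is not the repository's compute_live_only_alignment: the helper name, the liveness rule (a feature is live if it fired at least `min_fires` times), and the toy dictionaries are all assumptions for illustration. Two dictionaries share 40 nearly identical live directions and carry 88 unrelated dead rows each, roughly mirroring the layer-0 live/dead split.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def live_only_matched_cosines(dec_a, dec_b, fires_a, fires_b, min_fires=1):
    """Match only features that fired at least `min_fires` times.

    Hypothetical helper; the liveness rule is an assumption.
    """
    live_a = dec_a[np.asarray(fires_a) >= min_fires]
    live_b = dec_b[np.asarray(fires_b) >= min_fires]
    a = live_a / np.linalg.norm(live_a, axis=1, keepdims=True)
    b = live_b / np.linalg.norm(live_b, axis=1, keepdims=True)
    cos = a @ b.T                    # rectangular if live counts differ
    rows, cols = linear_sum_assignment(cos, maximize=True)
    return cos[rows, cols]           # min(n_live_a, n_live_b) values


# Synthetic dictionaries: 40 shared live directions plus 88 random dead rows.
rng = np.random.default_rng(0)
shared = rng.normal(size=(40, 768))
dec_a = np.vstack([shared, rng.normal(size=(88, 768))])
dec_b = np.vstack([shared + 0.1 * rng.normal(size=(40, 768)),
                   rng.normal(size=(88, 768))])
fires = np.array([5] * 40 + [0] * 88)   # same live/dead pattern in both runs

live = live_only_matched_cosines(dec_a, dec_b, fires, fires)
full = live_only_matched_cosines(dec_a, dec_b, fires, fires, min_fires=0)
print(np.median(live), np.median(full))
```

In this toy setup the full-matrix median is dominated by dead-to-dead matches even though every live feature is reproduced almost exactly, which is the same mechanism that deflated the original 0.095 figure.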

Confound 2 — layer. Moving to layer 6 (mid-network) does not improve stability — it makes it worse. Live-only median drops from 0.500 (layer 0) to 0.323 (layer 6). This is counterintuitive: layer 6 representations are more structured (better logit-lens rank) yet less seed-stable. One hypothesis: layer 0 is close to the embedding space, which constrains the manifold of useful directions; layer 6 has more degrees of freedom and thus more equivalent solutions.

Confound 3 — dictionary size. Larger SAEs (512 features vs 128) at layer 6 show slightly lower median cosine (0.257 vs 0.323). More features means more redundancy in the overcomplete basis and more equivalent solutions, which is consistent with theory.

Overall verdict: Fixing all three confounds does not rescue seed stability at this training scale. Even in the best condition (layer 0, live-only matching), the median matched cosine is 0.5 and the stability fraction at ≥ 0.9 is under 0.5%. SAE feature dictionaries remain substantially non-reproducible across random seeds.

The one softening: "live features are not entirely random" — a median cosine of 0.5 means the better-trained features do partially overlap across seeds. The SAE is finding something real; it just isn't finding the same particular basis every time.


What Would Make This Conclusive (Publication Gap)

| Missing piece | What to do |
|---|---|
| Only gpt2-small | Replicate on gpt2-medium, gpt2-xl, Llama-3-8B (3+ models) |
| Only layers 0 and 6 | Add layer 11 (final), plus intermediate layers 3, 9 (5 layers total) |
| Only 128 and 512 features | Test 1024, 4096, 16384 features (4 sizes matching literature) |
| Only 5 seeds | Run 20+ seeds; plot stability fraction vs seed count curve |
| Only one corpus | Vary corpus domain: code, math, prose, multilingual (4 domains) |
| Only cosine threshold | Add activation correlation (Pearson r) and circuit-ablation equivalence |
| No statistical testing | Bootstrap CIs on median cosine; permutation test vs random baseline |
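For the statistical-testing item, a percentile bootstrap on the median is one simple option. The sketch below is a hedged illustration: `bootstrap_median_ci` is a hypothetical helper, and `demo` is synthetic data standing in for the 1280 matched cosines, not real results.

```python
import numpy as np


def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    medians = np.median(values[idx], axis=1)
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic stand-in for 1280 matched cosines; replace with report values.
demo = np.random.default_rng(1).beta(1.0, 6.0, size=1280)
lo, hi = bootstrap_median_ci(demo)
print(f"median {np.median(demo):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling individual matched cosines treats them as independent, which they are not: pairs that share a seed share a dictionary. Resampling whole seeds or seed pairs instead would likely give wider and more honest intervals.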

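Minimum viable publication: 3 models × 5 layers × 4 dictionary sizes × 20 seeds = 1200 training runs. At ~30s each on CPU this is ~10 CPU-hours, or under 2 GPU-hours at ~5s each on an A100. Both figures assume per-run cost stays at the level measured for the current gpt2-small setup; larger models, larger dictionaries, and larger corpora will cost more. The analysis pipeline already exists; what's missing is compute and model diversity.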
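The run count and compute estimate follow from simple multiplication. The sketch below uses only the per-run timings quoted in this document and assumes they stay constant across the sweep, which is optimistic for larger models.

```python
models, layers, dict_sizes, seeds = 3, 5, 4, 20
runs = models * layers * dict_sizes * seeds

cpu_seconds_per_run = 30   # measured on the current small CPU setup
gpu_seconds_per_run = 5    # this document's A100 estimate

cpu_hours = runs * cpu_seconds_per_run / 3600
gpu_hours = runs * gpu_seconds_per_run / 3600

print(runs, cpu_hours, round(gpu_hours, 2))   # 1200 10.0 1.67
```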

The current 15-run experiment is sufficient for a workshop paper or blog post framing the question and presenting preliminary evidence. To pass peer review at a top venue (NeurIPS, ICML, ICLR), the full sweep is required.


Code

  • src/mech_interp/analysis/sae_seed_stability.py: compute_sae_pair_alignment, compute_stability_report, compute_live_only_alignment, compute_live_only_stability_report
  • src/mech_interp/cli.py: mech analyze-sae-stability (with --live-only flag)
  • tests/test_sae_seed_stability.py — 14 unit tests (all passing)
  • notebooks/06_sae_replication_crisis.ipynb — full narrative analysis with robustness checks
  • experiments/polysemanticity_sae_layer6.yaml — layer-6 128f sweep config
  • experiments/polysemanticity_sae_layer6_512.yaml — layer-6 512f sweep config
  • artifacts/stability_layer0_128.json — layer-0 full+live results
  • artifacts/stability_layer6_128.json — layer-6 128f live results
  • artifacts/stability_layer6_512.json — layer-6 512f live results