SAEs are not unique solutions: feature dictionaries diverge across random seeds

Status: preliminary report, single model, single corpus
Author: Mason Wyatt (with the mech-interp platform at github.com/ashlrai/mechanistic-interpretability)
Date: 2026-05-27

Abstract

We train five Top-K Sparse Autoencoders with identical hyperparameters and different random seeds on the residual stream of GPT-2 small, then ask whether the learned feature dictionaries are the same. Using optimal bipartite matching (Hungarian algorithm on decoder-cosine matrices) between every seed pair, we find a median best-match cosine of 0.500 among live features at layer 0 and 0.323 at layer 6, with the stability fraction at the cosine ≥ 0.9 threshold being effectively zero in all conditions. Counterintuitively, deeper layers and larger feature dictionaries produce less seed-stability, not more. This implies that published SAE feature descriptions — "the X feature of model M" — are properties of a particular training run rather than of the underlying model. We do not yet claim this generalises beyond GPT-2 small / 100-document corpora; we provide the platform and reproducibility receipts to test it at scale.

1. Introduction

The mechanistic-interpretability community treats Sparse Autoencoder features as approximations to intrinsic model properties. Anthropic's "Towards Monosemanticity" report (Bricken et al., 2023) speaks of "the X feature" of a model. Auto-interpretability pipelines (Cunningham et al., 2023; Marks et al., 2024) build natural-language descriptions of individual SAE features and treat them as durable handles on the model.

This implicitly assumes that SAE training is approximately deterministic up to random seed — i.e. two SAEs trained with identical hyperparameters and different initialisations recover (modulo permutation and small noise) the same feature dictionary. The assumption is rarely tested. We test it.

2. Method

Setup

Model: gpt2-small (124M params, d_model=768, 12 layers).
Hook sites tested: blocks.0.hook_resid_pre (embedding-adjacent) and blocks.6.hook_resid_pre (mid-network).
SAE: Top-K SAE (Gao et al., 2024). For layer 0: n_features=128, k=8, 8 epochs. For layer 6: n_features ∈ {128, 512} with k=8 and k=32 respectively, 8 epochs.
Corpus: 100 documents from the bundled openwebtext_sample.jsonl (~992 tokens after tokenisation with seq_len=64, max_tokens=2000).
Seeds: 1, 2, 3, 4, 5 — every other hyperparameter identical.
Hardware: single Apple Silicon MBP, CPU only, deterministic seeding (torch.manual_seed, numpy.random.seed, random.seed set before each run via the platform's runner).
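The setup above can be illustrated with a minimal NumPy sketch: seed the three generators the text names, then run one forward pass of a Top-K SAE at the layer-0 configuration. This is not the platform's runner or its SAE implementation. The function names `seed_everything` and `topk_sae_forward`, the pre-encoder bias subtraction, and the ReLU on the kept units are assumptions made for illustration.

```python
import random
import numpy as np

def seed_everything(seed: int) -> None:
    """Seed Python, NumPy and (if installed) PyTorch, as the setup describes."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional: only seeded when available
        torch.manual_seed(seed)
    except ImportError:
        pass

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a minimal Top-K SAE (illustrative only).

    x: (batch, d_model); W_enc: (d_model, n_features); W_dec: (n_features, d_model).
    Returns (reconstruction, sparse codes).
    """
    pre = (x - b_dec) @ W_enc + b_enc                     # encoder pre-activations
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]    # k largest per row
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # Keep only the top-k units; ReLU on them is an assumption of this sketch.
    codes[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    recon = codes @ W_dec + b_dec
    return recon, codes

seed_everything(1)
d_model, n_features, k = 768, 128, 8                      # layer-0 values from the text
x = np.random.randn(4, d_model)                           # stand-in activations
W_enc = np.random.randn(d_model, n_features) / np.sqrt(d_model)
W_dec = np.random.randn(n_features, d_model) / np.sqrt(n_features)
recon, codes = topk_sae_forward(
    x, W_enc, np.zeros(n_features), W_dec, np.zeros(d_model), k
)

# Re-seeding reproduces the same draws, which is what "deterministic seeding" buys.
seed_everything(1)
x_again = np.random.randn(4, d_model)
```

Only the initial weights differ between seeds in this sketch. In the actual runs, anything else that draws from these generators (for example, data ordering) would also vary with the seed.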

Matching protocol

For each pair of seeds (i, j):
1. Extract decoder weight matrices W_dec_i, W_dec_j of shape (n_features, d_model). L2-normalise rows.
2. Build the cosine-similarity matrix C = W_dec_i @ W_dec_j.T, shape (n_features, n_features).
3. Solve argmax_P sum_a C[a, P(a)], where a indexes the features of seed i and P is a permutation of the features of seed j, via scipy.optimize.linear_sum_assignment to find the optimal one-to-one matching. That function minimises cost by default, so it must be called with maximize=True (or on -C).
4. Report the distribution of matched cosines.

A feature pair is stable if its matched cosine is ≥ 0.9. The 0.9 threshold reflects the implicit standard of the auto-interp literature: features described as "the X feature" need to be approximately the same direction across runs for the description to be a model property.
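The matching protocol and the stability definition can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's code: `match_dictionaries` and `summarise` are hypothetical names, and the decoder matrices are synthetic (a random dictionary and a permuted, lightly perturbed copy) so that the correct matching is known in advance.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_dictionaries(W_dec_i, W_dec_j):
    """Optimal one-to-one matching of decoder rows by cosine similarity."""
    A = W_dec_i / np.linalg.norm(W_dec_i, axis=1, keepdims=True)
    B = W_dec_j / np.linalg.norm(W_dec_j, axis=1, keepdims=True)
    C = A @ B.T                                        # (n_features, n_features)
    # linear_sum_assignment minimises by default, hence maximize=True.
    rows, cols = linear_sum_assignment(C, maximize=True)
    return cols, C[rows, cols]                         # partner index, matched cosine

def summarise(matched, threshold=0.9):
    """Median matched cosine and fraction of pairs at or above the threshold."""
    return float(np.median(matched)), float(np.mean(matched >= threshold))

# Synthetic check: the second dictionary is a shuffled, slightly noisy copy.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 768))
perm = rng.permutation(128)
W_shuffled = W[perm] + 0.01 * rng.standard_normal((128, 768))

assignment, matched = match_dictionaries(W, W_shuffled)
median_cos, stable_frac = summarise(matched)
```

On this synthetic pair the matching should undo the shuffle, so `assignment` equals `np.argsort(perm)` and every pair clears the 0.9 threshold. Real seed pairs, per the results below, do not behave this way.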

Live-features-only variant

The full cosine matrix includes dead features (zero activation across the corpus). Dead features are essentially random unit vectors with no information-bearing relationship to the model, so matching them adds a purely random component to the cosine distribution. We separately report compute_live_only_alignment results that restrict the matching to features whose feature_analysis.json reports dead=False.
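One way the live-only restriction might look in code is sketched below. The signature of compute_live_only_alignment is not given in this report, so the sketch uses a hypothetical helper `live_only_match` that takes boolean live masks directly. Handling unequal live counts by matching only min(n_live_i, n_live_j) pairs is also an assumption; the weights and masks are random placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def live_only_match(W_dec_i, W_dec_j, live_i, live_j):
    """Match only live decoder rows; masks are boolean arrays of length n_features."""
    A = W_dec_i[live_i]
    B = W_dec_j[live_j]
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    C = A @ B.T                                        # rectangular if live counts differ
    # For a rectangular matrix SciPy matches min(n_rows, n_cols) pairs.
    rows, cols = linear_sum_assignment(C, maximize=True)
    return C[rows, cols]

rng = np.random.default_rng(0)
n_features, d_model = 128, 768
W_i = rng.standard_normal((n_features, d_model))       # placeholder decoders
W_j = rng.standard_normal((n_features, d_model))
live_i = rng.random(n_features) > 0.66                 # roughly a third live
live_j = rng.random(n_features) > 0.66
matched = live_only_match(W_i, W_j, live_i, live_j)
```

With unequal live counts, some live features in the larger set are left unmatched, which changes the denominator of any stability fraction. How the platform resolves this is not stated here.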

3. Results

3.1 Headline numbers

Condition	Median best-match cosine	Stability fraction at ≥ 0.9
Layer 0, 128 features, full matrix	0.095	0.16% (2/1280)
Layer 0, 128 features, live-only	0.500	0.48%
Layer 6, 128 features, live-only	0.323	0.00%
Layer 6, 512 features, live-only	0.257	0.00%

The full-matrix layer-0 number (0.095) is dominated by the 66% dead-feature rate. Restricting to live features raises the median to 0.500 — these features partially overlap, but the overlap is far from the "same feature" threshold.

The layer-6 results are the most striking: a more structured representational locus produces less seed-stability than the embedding layer, and the larger 512-feature SAE produces less stability than the smaller 128-feature one. Both directions are consistent with a degenerate-basis interpretation: a richer manifold with more overcomplete solutions admits more equivalent dictionaries.

3.2 Example matched pairs

The single highest-cosine matched pair among the 1280 matched feature pairs at layer 0 (10 seed pairs × 128 features) was (seed_1, feature_47) ↔ (seed_3, feature_82) with cosine = 0.927. Both features' top-activating prompts cluster on geography (Eiffel Tower / Paris content), so this pair does appear to be the same feature across seeds.

For a representative "borderline" pair with cosine ≈ 0.5: matched feature pairs' top-activating prompts overlap on broad category (e.g. both fire on code-related documents) but disagree on specifics (one fires more on Python control flow, the other on C++ syntax). For pairs at cosine ≈ 0.1: top prompts are unrelated.

3.3 The dead-feature confound

At our scale (≈ 1000 training tokens), 66% of 128 features are dead. The original literature (Google DeepMind's Gemma Scope, jbloom's gpt2 SAEs) trains on ≥ 10⁶ tokens with much lower dead ratios. Our headline 0.095 cosine in the full-matrix condition is dragged down by dead-feature pairing; the live-only 0.500 is the better single-number summary. The more important point is that even the live-only condition does not approach the 0.9 threshold in any layer × size combination tested.

4. Caveats and robustness checks

Scale. Our corpus is at least three orders of magnitude smaller (~10³ vs ≥ 10⁶ training tokens) than the published SAEs we're implicitly comparing to. At larger scale the dead-feature fraction shrinks and feature directions sharpen; we expect this to raise the median live-only cosine, but the gap to 0.9 is sufficiently large that it is unlikely to close from training scale alone.
Model. A single 124M-parameter model. The result should be tested on at least one of gpt2-medium, gpt2-xl, pythia-1.4b, Llama-3.1-8B before claiming generality.
Training recipe. We only test Top-K SAEs. L1-regularised SAEs, JumpReLU SAEs (Rajamanoharan et al., 2024), and gated SAEs may have different seed-stability profiles.
Hook site. Only residual-stream pre-LN. Attention-output SAEs and MLP-output SAEs are not tested.
Statistical testing. Bootstrap confidence intervals on the median, and a permutation test against the random-vector baseline, are not yet computed. The full-matrix median (0.095) is practically indistinguishable from the baseline (0.097), whereas the live-only layer-0 median of 0.500 is unambiguously above noise; the layer-6 0.323 number needs a formal test before we'd claim significance.
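The missing statistical checks could be prototyped along these lines. This is a hedged sketch with two loudly stated substitutions. First, it uses a one-sided Monte Carlo test against randomly drawn Gaussian dictionaries, not a permutation test proper. Second, the "observed" cosines are synthetic placeholders, not this report's data. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_median(A, B):
    """Median Hungarian-matched cosine between the rows of A and B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    C = A @ B.T
    rows, cols = linear_sum_assignment(C, maximize=True)
    return float(np.median(C[rows, cols]))

def random_baseline(n_features, d_model, n_draws, rng):
    """Null distribution: matched medians for pairs of random dictionaries."""
    return np.array([
        matched_median(rng.standard_normal((n_features, d_model)),
                       rng.standard_normal((n_features, d_model)))
        for _ in range(n_draws)
    ])

def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the median."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    medians = np.median(values[idx], axis=1)
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

rng = np.random.default_rng(0)
null = random_baseline(n_features=128, d_model=768, n_draws=20, rng=rng)

# Synthetic stand-in for one condition's matched cosines (NOT the report's data).
fake_matched = np.clip(rng.normal(0.3, 0.15, size=400), -1.0, 1.0)
observed = float(np.median(fake_matched))
ci_lo, ci_hi = bootstrap_median_ci(fake_matched, rng=rng)

# One-sided Monte Carlo p-value with the usual +1 correction.
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
```

Resampling individual matched cosines treats them as independent, which they are not: pairs drawn from the same two dictionaries share structure. A more careful version would probably resample at the level of seed pairs.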

5. Implications

If features are not seed-stable at the 0.9 threshold, several common practices in the SAE literature inherit hidden noise:

Single-run feature labels. "Feature 47 of the layer-8 SAE detects bananas" is meaningful only if the training run is specified. Without that specification it is not a model property.
Auto-interp comparisons. Auto-interp labels assigned in one run cannot be matched 1-to-1 with auto-interp labels from a sibling run.
Crosscoder-based model diffing. Lindsey et al. (2024) crosscoders compare features across two models; if a single model's features are seed-unstable, cross-model feature comparisons need to compare distributions of features, not specific paired features.
SAE evaluation. Benchmarks that score "how well does this SAE recover known features?" need to either average across seeds or report seed sensitivity.

The constructive suggestion is to publish multi-seed mean ± standard-deviation numbers, and to test seed stability as a prerequisite for any "feature X" claim.

6. Limitations and future work

A publishable version of this result would require, at minimum:

Dimension	This report	Publishable minimum
Models	1 (gpt2-small)	3+ (e.g., gpt2-{small, medium, xl}, Llama-3.1-8B)
Layers per model	2	5+ across depth
Dictionary sizes	2 (128, 512)	4 (128, 512, 2048, 8192)
Seeds	5	20+ for tight CIs
Training tokens	~1k	10⁶+ to match published SAEs
Statistical testing	none	bootstrap CIs + permutation test

That budget is approximately 1200 SAE training runs. At ~5 seconds per run on an A100 it's ~1.7 GPU-hours; on the same MBP this report ran on, the same scale would take ~8-12 hours overnight. The platform pipeline is the bottleneck, not the compute — every step here (mech sweep, mech analyze-sae-stability --live-only, the analysis notebook) generalises to that scale without code changes.
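The budget arithmetic can be checked directly from the "publishable minimum" column, taking the lower bound of each "N+" entry. The 5 seconds per run is the report's own estimate and is simply plugged in here, not verified.

```python
# Lower bounds from the "Publishable minimum" column.
models, layers_per_model, dict_sizes, seeds = 3, 5, 4, 20

runs = models * layers_per_model * dict_sizes * seeds   # total SAE training runs
seconds_per_run = 5                                      # report's A100 estimate
gpu_hours = runs * seconds_per_run / 3600

print(f"{runs} runs, about {gpu_hours:.1f} GPU-hours")
```

This gives 1200 runs and roughly 1.7 GPU-hours, matching the figures quoted above. Adding a fourth model, as the example list in the table does, would scale both by 4/3.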

7. Reproducibility

Every run referenced here has an environment.json artifact recording torch / numpy / transformer-lens versions, the uv.lock SHA-256, the seed, and a sample hash of the model weights. The exact commands to reproduce from a fresh clone are in docs/publications/sae_replication_artifacts/reproduce.sh. Total wall-clock to reproduce from scratch: under 5 minutes on a 2024-era Apple Silicon machine.
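For readers who want a comparable receipt in their own runs, the sketch below assembles a record of the kind described. The actual schema of environment.json is not shown in this report, so every field name here is hypothetical. The model-weight sample hash is omitted, and a throwaway file stands in for uv.lock.

```python
import hashlib
import json
import tempfile
from importlib import metadata
from pathlib import Path

def sha256_of(path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def build_environment_record(seed, lock_path,
                             packages=("torch", "numpy", "transformer-lens")):
    """Collect seed, lockfile hash and package versions (hypothetical field names)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None            # package not installed in this environment
    return {
        "seed": seed,
        "lock_sha256": sha256_of(lock_path),
        "versions": versions,
    }

with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "uv.lock"             # placeholder file, not a real lockfile
    lock.write_text("placeholder lock contents\n")
    record = build_environment_record(seed=1, lock_path=lock)
    print(json.dumps(record, indent=2))
```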

References

Gao et al., 2024. "Scaling and evaluating sparse autoencoders." OpenAI.
Bricken et al., 2023. "Towards Monosemanticity." Anthropic.
Cunningham et al., 2023. "Sparse autoencoders find highly interpretable features in language models."
Marks et al., 2024. "Sparse feature circuits."
Lindsey et al., 2024. "Sparse Crosscoders for Cross-Layer Features and Model Diffing." Anthropic.
Rajamanoharan et al., 2024. "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders." Google DeepMind.
Conmy et al., 2023. "Towards Automated Circuit Discovery." NeurIPS.
Wang et al., 2022. "Interpretability in the Wild: A Circuit for Indirect Object Identification."