SAEs are not unique solutions: feature dictionaries diverge across random seeds¶
Status: preliminary report, single model, single corpus Author: Mason Wyatt (with the mech-interp platform at github.com/ashlrai/mechanistic-interpretability) Date: 2026-05-27
Abstract¶
We train five Top-K Sparse Autoencoders with identical hyperparameters and different random seeds on the residual stream of GPT-2 small, then ask whether the learned feature dictionaries are the same. Using optimal bipartite matching (Hungarian algorithm on decoder-cosine matrices) between every seed pair, we find a median best-match cosine of 0.500 among live features at layer 0 and 0.323 at layer 6, with the stability fraction at the cosine ≥ 0.9 threshold being effectively zero in all conditions. Counterintuitively, deeper layers and larger feature dictionaries produce less seed-stability, not more. This implies that published SAE feature descriptions — "the X feature of model M" — are properties of a particular training run rather than of the underlying model. We do not yet claim this generalises beyond GPT-2 small / 100-document corpora; we provide the platform and reproducibility receipts to test it at scale.
1. Introduction¶
The mechanistic-interpretability community treats Sparse Autoencoder features as approximations to intrinsic model properties. Anthropic's "Towards Monosemanticity" report (Bricken et al., 2023) speaks of "the X feature" of a model. Auto-interpretability pipelines (Cunningham et al., 2023; Marks et al., 2024) build natural-language descriptions of individual SAE features and treat them as durable handles on the model.
This implicitly assumes that SAE training is approximately deterministic up to random seed — i.e. two SAEs trained with identical hyperparameters and different initialisations recover (modulo permutation and small noise) the same feature dictionary. The assumption is rarely tested. We test it.
2. Method¶
Setup¶
- Model:
gpt2-small(124M params, d_model=768, 12 layers). - Hook sites tested:
blocks.0.hook_resid_pre(embedding-adjacent) andblocks.6.hook_resid_pre(mid-network). - SAE: Top-K SAE (Gao et al., 2024). For layer 0: n_features=128, k=8, 8 epochs. For layer 6: n_features ∈ {128, 512}, k=8 / k=32, 8 epochs.
- Corpus: 100 documents from the bundled
openwebtext_sample.jsonl(~992 tokens after tokenisation withseq_len=64,max_tokens=2000). - Seeds: 1, 2, 3, 4, 5 — every other hyperparameter identical.
- Hardware: single Apple Silicon MBP, CPU only, deterministic seeding (
torch.manual_seed,numpy.random.seed,random.seedset before each run via the platform's runner).
Matching protocol¶
For each pair of seeds (i, j):
1. Extract decoder weight matrices W_dec_i, W_dec_j of shape (n_features, d_model). L2-normalise rows.
2. Build the cosine-similarity matrix C = W_dec_i @ W_dec_j.T, shape (n_features, n_features).
3. Solve argmax_P sum(C[i, P(i)]) via scipy.optimize.linear_sum_assignment to find the optimal one-to-one matching.
4. Report the distribution of matched cosines.
A feature pair is stable if its matched cosine is ≥ 0.9. The 0.9 threshold reflects the implicit standard of the auto-interp literature: features described as "the X feature" need to be approximately the same direction across runs for the description to be a model property.
Live-features-only variant¶
The matrix includes dead features (zero activation across the corpus). Dead features are essentially random unit vectors with no information-bearing relationship to the model — matching them inflates the random component of the cosine distribution. We separately report compute_live_only_alignment results that restrict the matching to features whose feature_analysis.json reports dead=False.
3. Results¶
3.1 Headline numbers¶
| Condition | Median best-match cosine | Stability fraction at ≥ 0.9 |
|---|---|---|
| Layer 0, 128 features, full matrix | 0.095 | 0.16% (2/1280) |
| Layer 0, 128 features, live-only | 0.500 | 0.48% |
| Layer 6, 128 features, live-only | 0.323 | 0.00% |
| Layer 6, 512 features, live-only | 0.257 | 0.00% |
The full-matrix layer-0 number (0.095) is dominated by the 66% dead-feature rate. Restricting to live features raises the median to 0.500 — these features partially overlap, but the overlap is far from the "same feature" threshold.
The layer-6 results are the most striking: a more structured representational locus produces less seed-stability than the embedding layer, and the larger 512-feature SAE produces less stability than the smaller 128-feature one. Both directions are consistent with a degenerate-basis interpretation: a richer manifold with more overcomplete solutions admits more equivalent dictionaries.
3.2 Example matched pairs¶
The single highest-cosine matched pair across all 1280 pairs of seed-stability runs at layer 0 was (seed_1, feature_47) ↔ (seed_3, feature_82) with cosine = 0.927. Both features' top-activating prompts cluster on geography (Eiffel Tower / Paris content), so this single pair is genuinely the same feature across seeds.
For a representative "borderline" pair with cosine ≈ 0.5: matched feature pairs' top-activating prompts overlap on broad category (e.g. both fire on code-related documents) but disagree on specifics (one fires more on Python control flow, the other on C++ syntax). For pairs at cosine ≈ 0.1: top prompts are unrelated.
3.3 The dead-feature confound¶
At our scale (≈ 1000 training tokens), 66% of 128 features are dead. The original literature (Anthropic's gemma-scope, jbloom's gpt2 SAEs) trains on ≥ 10⁶ tokens with much lower dead ratios. Our headline 0.095 cosine in the full-matrix condition is inflated by dead-feature pairing; the live-only 0.500 is the better single-number summary, with the more important point being that even the live-only condition does not approach the 0.9 threshold in any layer × size combination tested.
4. Caveats and robustness checks¶
- Scale. Our corpus is two-to-three orders of magnitude smaller than the published SAEs we're implicitly comparing to. At larger scale the dead-feature fraction shrinks and feature directions sharpen; we expect this to raise the median live-only cosine, but the gap to 0.9 is sufficiently large that it is unlikely to close from training scale alone.
- Model. A single 124M-parameter model. The result should be tested on at least one of
gpt2-medium,gpt2-xl,pythia-1.4b,Llama-3.1-8Bbefore claiming generality. - Training recipe. We only test Top-K SAEs. L1-regularised SAEs, JumpReLU SAEs (Rajamanoharan et al., 2024), and gated SAEs may have different seed-stability profiles.
- Hook site. Only residual-stream pre-LN. Attention-output SAEs and MLP-output SAEs are not tested.
- Statistical testing. Bootstrap confidence intervals on the median, and a permutation test against the random-vector baseline, are not yet computed. The 0.097 vs 0.095 baseline gap is small enough that the live-only 0.500 layer-0 number is unambiguously above noise, but the layer-6 0.323 number needs a formal test before we'd claim significance.
5. Implications¶
If features are not seed-stable at the 0.9 threshold, several common practices in the SAE literature inherit hidden noise:
- Single-run feature labels. "Feature 47 of the layer-8 SAE detects bananas" is meaningful only if specifying the training run. Without that specification it's not a model property.
- Auto-interp comparisons. Auto-interp labels assigned in one run cannot be matched 1-to-1 with auto-interp labels from a sibling run.
- Crosscoder-based model diffing. Lindsey et al. (2024) crosscoders compare features across two models; if a single model's features are seed-unstable, cross-model feature comparisons need to compare distributions of features, not specific paired features.
- SAE evaluation. Benchmarks that score "how well does this SAE recover known features?" need to either average across seeds or report seed sensitivity.
The constructive suggestion is to publish multi-seed mean ± standard-deviation numbers, and to test seed stability as a prerequisite for any "feature X" claim.
6. Limitations and future work¶
A publishable version of this result would require, at minimum:
| Dimension | This report | Publishable minimum |
|---|---|---|
| Models | 1 (gpt2-small) | 3+ (e.g., gpt2-{small, medium, xl}, Llama-3.1-8B) |
| Layers per model | 2 | 5+ across depth |
| Dictionary sizes | 2 (128, 512) | 4 (128, 512, 2048, 8192) |
| Seeds | 5 | 20+ for tight CIs |
| Training tokens | ~1k | 10⁶+ to match published SAEs |
| Statistical testing | none | bootstrap CIs + permutation test |
That budget is approximately 1200 SAE training runs. At ~5 seconds per run on an A100 it's ~1.7 GPU-hours; on the same MBP this report ran on, the same scale would take ~8-12 hours overnight. The platform pipeline is the bottleneck, not the compute — every step here (mech sweep, mech analyze-sae-stability --live-only, the analysis notebook) generalises to that scale without code changes.
7. Reproducibility¶
Every run referenced here has an environment.json artifact recording torch / numpy / transformer-lens versions, the uv.lock SHA-256, the seed, and a sample hash of the model weights. The exact commands to reproduce from a fresh clone are in docs/publications/sae_replication_artifacts/reproduce.sh. Total wall-clock to reproduce from scratch: under 5 minutes on a 2024-era Apple Silicon machine.
References¶
- Gao et al., 2024. "Scaling and evaluating sparse autoencoders." OpenAI.
- Bricken et al., 2023. "Towards Monosemanticity." Anthropic.
- Cunningham et al., 2023. "Sparse autoencoders find highly interpretable features in language models."
- Marks et al., 2024. "Sparse feature circuits."
- Lindsey et al., 2024. "Sparse Crosscoders for Cross-Layer Features in Superposition." Anthropic.
- Rajamanoharan et al., 2024. "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders." Google DeepMind.
- Conmy et al., 2023. "Towards Automated Circuit Discovery." NeurIPS.
- Wang et al., 2022. "Interpretability in the Wild: A Circuit for Indirect Object Identification."