A two-stage system for discovering and explaining what your SAE features do — fast screening first, then deep, validated explanations.
🎯 What this does
- AutoInterp Lite (Feature Discovery): Quickly rank domain-relevant features out of thousands (minutes).
- AutoInterp Full (Feature Explanation): Produce human-readable explanations with confidence & detection metrics for specific features or all features (deeper, slower).
Typical workflow:
- Lite → find top 10–50 candidates for your domain.
- Full → generate vetted explanations and scores for those feature IDs (or sweep all).
🔎 1) AutoInterp Lite — Feature Discovery (minutes)
Why: SAEs expose tens of thousands of features; you need the few that matter (finance, healthcare, legal, etc.).
What: Compares feature activations on domain text vs general text to score specialization.
Inputs
- Trained SAE (weights + config)
- Base model used for activations (e.g., `bert-base-uncased`, `meta-llama/...`)
- Two corpora: domain (e.g., financial news) and general (e.g., Wikipedia)
Outputs
- Ranked CSV of features with:
- Specialization score (domain vs general)
- Activation strength (mean activation)
- Top activating examples (text spans)
- Optional proto-label (heuristic string from examples)
Key metrics
- Specialization score ↑ = more domain-specific
- Domain activation vs General activation
- Specialization confidence (stability of the rank under resampling)
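The exact scoring formula lives inside Lite, but the metrics above can be sketched conceptually as a domain/general activation ratio plus a bootstrap stability estimate. The function below is illustrative only; the name, formula, and stability definition are assumptions, not Lite's actual implementation:

```python
import numpy as np

def specialization_score(domain_acts, general_acts, n_boot=200, seed=0):
    """Illustrative sketch: ratio of mean domain activation to mean
    general activation, with a bootstrap-based stability estimate
    (higher = rank is more stable under resampling)."""
    domain_acts = np.asarray(domain_acts, dtype=float)
    general_acts = np.asarray(general_acts, dtype=float)
    eps = 1e-8
    score = domain_acts.mean() / (general_acts.mean() + eps)

    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        d = rng.choice(domain_acts, size=len(domain_acts), replace=True)
        g = rng.choice(general_acts, size=len(general_acts), replace=True)
        boots.append(d.mean() / (g.mean() + eps))
    stability = 1.0 / (np.std(boots) + eps)  # tight bootstrap spread -> stable
    return score, stability
```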
Speed
- 1k–10k features in 2–5 minutes on a single GPU (sampling-based)
CLI (example)
```bash
cd autointerp_lite
python run_analysis.py \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --domain_corpus data/finance.jsonl \
  --general_corpus data/general.jsonl \
  --out_dir results/lite_finance
```
Sample output (CSV)
| Feature | Proto-Label | Specialization | Domain Act. | General Act. | Spec. Conf. |
|---|---|---|---|---|---|
| 133 | Earnings reports / rate changes | 19.56 | 96.73 | 116.29 | 195.60 |
| 214 | Inflation indicators / labor data | 4.65 | 22.26 | 26.92 | 46.55 |
| 66 | Stock index performance | 4.75 | 19.77 | 24.52 | 47.51 |
Rule of thumb: specialization > 3.0 and confidence > 30 are strong candidates.
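Applying the rule of thumb programmatically is a one-pass filter over the Lite CSV (column names follow the schema in the Output Schemas section; `shortlist_features` is an illustrative helper, not part of the toolkit):

```python
import csv

def shortlist_features(path, min_spec=3.0, min_conf=30.0):
    """Return feature IDs from a Lite CSV that pass the rule of thumb:
    specialization > 3.0 and specialization confidence > 30."""
    keep = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if (float(row["specialization"]) > min_spec
                    and float(row["specialization_conf"]) > min_conf):
                keep.append(int(row["feature_id"]))
    return keep
```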
🧠 2) AutoInterp Full — Feature Explanation (deeper)
Why: “Financial feature” isn’t enough; you need precise semantics (e.g., earnings-call phrasing vs volatility language), plus validation.
What it does
- Uses a chat-formatted LLM to generate explanations from activating vs non-activating examples.
- Adds contrastive analysis (FAISS) so the LLM sees near misses and avoids vague labels.
- Computes detection metrics (F1 / Precision / Recall) for quality assurance.
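The detection metrics reduce to standard binary-classification scoring over text spans: does the feature actually activate on spans the explanation predicts it should? A minimal sketch of that scoring (not the toolkit's internal scorer):

```python
def detection_metrics(predicted, actual):
    """Precision/recall/F1 for 'does this span activate the feature?'.
    predicted/actual: iterables of 0/1 labels, one per text span."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```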
Inputs
- Feature IDs from Lite (e.g., `27 133 220`) or “analyze all”
- Explainer chat model (cloud or local)
- Optional FAISS index for contrastive negatives
Outputs
- Explanation cards (Markdown/JSON) per feature
- Scores: F1, Precision, Recall (+ per-feature thresholds)
- Examples: positive/negative spans used for validation
Speed
- 30–60 min/feature (depends on dataset size & LLM)
CLI (API-based explainer)
```bash
cd autointerp_full
./example_LLM_API.sh \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --feature_num 27 133 220 \
  --n_tokens 200000 \
  --explainer_model openai/gpt-4o-mini \
  --non_activating_source FAISS \
  --out_dir results/full_finance
```
Key parameters
| Flag | Purpose | Example |
|---|---|---|
| `--feature_num` | Specific features to explain | `27 133 220` |
| `--n_tokens` | Sample size for evidence | `2e4` (fast) – `1e7` (thorough) |
| `--explainer_model` | Chat LLM for explanations | `openai/gpt-4o-mini` / `Qwen2.5-7B-Instruct` |
| `--non_activating_source` | Contrastive negatives | `FAISS` (best) / `random` (fast) |
Results layout
```text
results/full_finance/
  explanations/FEATURE_133.md
  scores/detection/FEATURE_133.json   # f1, precision, recall, threshold
  results_summary.csv
```
Quality targets
- F1 ≥ 0.70, Precision ≥ 0.80, Recall ≥ 0.60 → production-worthy
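These gates are easy to enforce in automation. A sketch that reads one of the `scores/detection/FEATURE_*.json` files and applies the thresholds (`production_worthy` is a hypothetical helper, not a toolkit function):

```python
import json

def production_worthy(score_path):
    """Apply the quality targets (F1 >= 0.70, Precision >= 0.80,
    Recall >= 0.60) to a detection-score JSON file."""
    with open(score_path) as f:
        s = json.load(f)
    return (s["f1"] >= 0.70
            and s["precision"] >= 0.80
            and s["recall"] >= 0.60)
```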
Example (excerpt)
| Feature | Label (LLM) | F1 | Explanation (short) |
|---|---|---|---|
| 27 | “-ing verb forms / gerunds” | 0.75 | Detects morphological pattern “-ing” in running text. |
| 220 | “Abstract alternatives framing” | 0.53 | Marks language about hypotheticals / options. |
| 133 | “Bio taxonomy terms” | 0.02 | Likely spurious for finance; exclude from catalog. |
🧩 Contrastive Search (FAISS) — why it matters
Goal: Show the LLM positive vs near-negative examples so it learns what the feature is not.
Pipeline
- Embed text spans (sentence-transformers)
- Build FAISS index over non-activating spans
- For each feature, retrieve closest negatives to pair with positives
- Prompt LLM with contrastive pairs → sharper, less vague explanations
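The retrieval step can be sketched in a few lines. At small scale, brute-force cosine similarity over L2-normalized embeddings computes the same ranking that a FAISS inner-product index (`IndexFlatIP`) gives; FAISS simply makes that search fast over millions of spans. This numpy stand-in is illustrative only:

```python
import numpy as np

def nearest_negatives(pos_embs, neg_embs, k=3):
    """For each positive (activating) span embedding, return indices of
    the k closest non-activating spans by cosine similarity. Equivalent
    to a FAISS IndexFlatIP search over L2-normalized vectors, done
    brute-force here for clarity."""
    def l2norm(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = l2norm(pos_embs) @ l2norm(neg_embs).T  # cosine similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]       # indices into neg_embs
```

The returned near-miss negatives are paired with positives in the explainer prompt, which is what pushes the LLM away from vague labels.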
🧰 Installation & Files
Install
```bash
# From repo root
pip install -e .
```
You’ll need
- SAE model path (trained previously)
- Base model path/name (for activation extraction)
- Domain & general corpora (for Lite)
- Chat model access (API key or local checkpoint) for Full
🚀 Quick Start (copy/paste)
1) Find domain-relevant features (Lite)
```bash
cd autointerp_lite
python run_analysis.py \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --domain_corpus data/finance.jsonl \
  --general_corpus data/general.jsonl \
  --out_dir results/lite_finance
```
2) Explain the top features (Full)
```bash
cd autointerp_full
./example_LLM_API.sh \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --feature_num 27 133 220 \
  --explainer_model openai/gpt-4o-mini \
  --non_activating_source FAISS \
  --out_dir results/full_finance
```
🗂️ Output Schemas (for automation)
Lite CSV columns
```text
feature_id:int, proto_label:str, specialization:float, domain_activation:float,
general_activation:float, specialization_conf:float, top_examples:list[str]
```
Full detection JSON
```json
{
  "feature_id": 133,
  "label": "Earnings reports / rate changes",
  "f1": 0.74,
  "precision": 0.82,
  "recall": 0.67,
  "threshold": 0.43,
  "positive_examples": ["..."],
  "negative_examples": ["..."]
}
```
✅ When to use which
- Use Lite when you have thousands of features and need a shortlist (10–50) for your domain.
- Use Full when you need validated semantics with metrics and contrastive evidence (for research, audits, safety).
Pro tip: Save the `results_summary.csv` from Full as your feature catalog; your steering and alignment-guard pages can link to these feature IDs and labels directly.
🧭 Best practices & pitfalls
- Balance evidence size: start with `--n_tokens 2e4` to iterate quickly; scale up later.
- Contrastive negatives: prefer FAISS over random; explanations are crisper.
- De-duplication: similar features may exist; use cosine on SAE atoms to merge siblings.
- Guardrails: exclude features with F1 < 0.5 from production catalogs.
- Reproducibility: fix seeds and persist corpora snapshot for consistent labels.
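The de-duplication tip above can be sketched directly: treat each SAE decoder column as a feature direction and flag pairs whose cosine similarity exceeds a threshold (the threshold value and helper name are illustrative, not toolkit defaults):

```python
import numpy as np

def sibling_pairs(decoder, threshold=0.9):
    """Flag near-duplicate features by cosine similarity between SAE
    decoder columns (one column per feature). Returns (i, j) index
    pairs that are candidates for merging."""
    W = decoder / np.linalg.norm(decoder, axis=0, keepdims=True)
    sims = W.T @ W  # pairwise cosine similarity between feature atoms
    n = sims.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sims[i, j] > threshold]
```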
🔗 References / Deep Dives
- AutoInterp Lite README – feature ranking & discovery
- AutoInterp Full README – LLM-based explanation, contrastive prompting, scoring