A two-stage system for discovering and explaining what your SAE features do — fast screening first, then deep, validated explanations.

🎯 What this does

  • AutoInterp Lite (Feature Discovery): Quickly rank domain-relevant features out of thousands (minutes).
  • AutoInterp Full (Feature Explanation): Produce human-readable explanations with confidence & detection metrics for specific features or all features (deeper, slower).
Typical workflow:
  1. Lite → find top 10–50 candidates for your domain.
  2. Full → generate vetted explanations and scores for those feature IDs (or sweep all).

🔎 1) AutoInterp Lite — Feature Discovery (minutes)

Why: SAEs expose tens of thousands of features; you need the few that matter (finance, healthcare, legal, etc.).
What: Compares feature activations on domain text vs general text to score specialization.
Inputs
  • Trained SAE (weights + config)
  • Base model used for activations (e.g., bert-base-uncased, meta-llama/...)
  • Two corpora: domain (e.g., financial news) and general (e.g., Wikipedia)
Outputs
  • Ranked CSV of features with:
    • Specialization score (domain vs general)
    • Activation strength (mean activation)
    • Top activating examples (text spans)
    • Optional proto-label (heuristic string from examples)
Key metrics
  • Specialization score ↑ = more domain-specific
  • Domain activation vs General activation
  • Specialization confidence (stability of the rank under resampling)
Speed
  • 1k–10k features in 2–5 minutes on a single GPU (sampling-based)
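The core comparison is simple: how much harder does a feature fire on domain text than on general text? A minimal sketch of such a ratio-based score (illustrative only; the actual scoring in `run_analysis.py` may weight or normalize differently):

```python
import numpy as np

def specialization_score(domain_acts, general_acts, eps=1e-8):
    """Ratio of a feature's mean activation on domain vs general spans.

    domain_acts, general_acts: 1-D arrays of one feature's activations
    sampled over spans from each corpus.
    """
    d = float(np.mean(domain_acts))
    g = float(np.mean(general_acts))
    return d / (g + eps)  # eps guards against dead-on-general features

# Toy example: a feature that fires ~4x harder on domain text.
domain = np.array([8.0, 6.0, 10.0])
general = np.array([2.0, 1.0, 3.0])
print(round(specialization_score(domain, general), 2))  # 4.0
```

Because it is sampling-based, confidence can be estimated by recomputing the score on resampled subsets and checking rank stability.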
CLI (example)
```bash
cd autointerp_lite
python run_analysis.py \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --domain_corpus data/finance.jsonl \
  --general_corpus data/general.jsonl \
  --out_dir results/lite_finance
```
Sample output (CSV)

| Feature | Proto-Label | Specialization | Domain Act. | General Act. | Spec. Conf. |
|---------|-------------|----------------|-------------|--------------|-------------|
| 133 | Earnings reports / rate changes | 19.56 | 96.73 | 116.29 | 195.60 |
| 214 | Inflation indicators / labor data | 4.65 | 22.26 | 26.92 | 46.55 |
| 66 | Stock index performance | 4.75 | 19.77 | 24.52 | 47.51 |
Rule of thumb: features with specialization > 3.0 and confidence > 30 are strong candidates.
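That rule of thumb is easy to apply programmatically to the ranked CSV (a sketch; column names follow the Lite output schema listed under Output Schemas — verify against your actual CSV header):

```python
import csv

def shortlist(path, min_spec=3.0, min_conf=30.0):
    """Return feature IDs passing the specialization/confidence cutoffs."""
    keep = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if (float(row["specialization"]) > min_spec
                    and float(row["specialization_conf"]) > min_conf):
                keep.append(int(row["feature_id"]))
    return keep

# Hypothetical usage (path is illustrative, not a real output file name):
# candidate_ids = shortlist("path/to/lite_results.csv")
```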

🧠 2) AutoInterp Full — Feature Explanation (deeper)

Why: “Financial feature” isn’t enough; you need precise semantics (e.g., earnings-call phrasing vs volatility language), plus validation.
What it does
  • Uses a chat-formatted LLM to generate explanations from activating vs non-activating examples.
  • Adds contrastive analysis (FAISS) so the LLM sees near misses and avoids vague labels.
  • Computes detection metrics (F1 / Precision / Recall) for quality assurance.
Inputs
  • Feature IDs from Lite (e.g., 27 133 220) or “analyze all”
  • Explainer chat model (cloud or local)
  • Optional FAISS index for contrastive negatives
Outputs
  • Explanation cards (Markdown/JSON) per feature
  • Scores: F1, Precision, Recall (+ per-feature thresholds)
  • Examples: positive/negative spans used for validation
Speed
  • 30–60 min/feature (depends on dataset size & LLM)
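Detection scoring treats the explanation as a classifier: for each held-out span, the explanation predicts whether the feature should activate, and predictions are compared against ground-truth activations. A minimal sketch of the metric arithmetic (illustrative, not the repo's implementation):

```python
def detection_metrics(y_true, y_pred):
    """Precision / recall / F1 for span-level activation predictions.

    y_true: 1 if the feature actually activates on the span, else 0.
    y_pred: 1 if the explanation predicts activation, else 0.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 true positives in the data; 3 caught, 1 missed, 1 false alarm.
p, r, f1 = detection_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
print(p, r, f1)  # 0.75 0.75 0.75
```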
CLI (API-based explainer)
```bash
cd autointerp_full
./example_LLM_API.sh \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --feature_num 27 133 220 \
  --n_tokens 200000 \
  --explainer_model openai/gpt-4o-mini \
  --non_activating_source FAISS \
  --out_dir results/full_finance
```
Key parameters

| Flag | Purpose | Example |
|------|---------|---------|
| `--feature_num` | Specific features to explain | `27 133 220` |
| `--n_tokens` | Sample size for evidence | `2e4` (fast) – `1e7` (thorough) |
| `--explainer_model` | Chat LLM for explanations | `openai/gpt-4o-mini` / `Qwen2.5-7B-Instruct` |
| `--non_activating_source` | Contrastive negatives | `FAISS` (best) / `random` (fast) |
Results layout
```text
results/full_finance/
  explanations/FEATURE_133.md
  scores/detection/FEATURE_133.json   # f1, precision, recall, threshold
  results_summary.csv
```
Quality targets
  • F1 ≥ 0.70, Precision ≥ 0.80, Recall ≥ 0.60 → production-worthy
Example (excerpt)

| Feature | Label (LLM) | F1 | Explanation (short) |
|---------|-------------|-----|---------------------|
| 27 | “-ing verb forms / gerunds” | 0.75 | Detects morphological pattern “-ing” in running text. |
| 220 | “Abstract alternatives framing” | 0.53 | Marks language about hypotheticals / options. |
| 133 | “Bio taxonomy terms” | 0.02 | Likely spurious for finance; exclude from catalog. |

🧩 Contrastive Search (FAISS) — why it matters

Goal: Show the LLM positive vs near-negative examples so it learns what the feature is not.
Pipeline
  1. Embed text spans (sentence-transformers)
  2. Build FAISS index over non-activating spans
  3. For each feature, retrieve closest negatives to pair with positives
  4. Prompt LLM with contrastive pairs → sharper, less vague explanations
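The retrieval step above can be illustrated with plain NumPy in place of FAISS (same idea, brute force; swap in a FAISS index for scale). The embeddings here are toy values, not real sentence-transformer output:

```python
import numpy as np

def nearest_negatives(pos_emb, neg_embs, k=2):
    """Indices of the k non-activating spans closest to one positive
    example, by cosine similarity (brute-force)."""
    pos = pos_emb / np.linalg.norm(pos_emb)
    negs = neg_embs / np.linalg.norm(neg_embs, axis=1, keepdims=True)
    sims = negs @ pos                  # cosine similarity to each negative
    return np.argsort(-sims)[:k]      # most similar first

# Toy embeddings: negative 0 is nearly parallel to the positive,
# so it is the most informative "near miss".
pos = np.array([1.0, 0.0, 0.0])
negs = np.array([[0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.5, 0.5, 0.0]])
print(nearest_negatives(pos, negs, k=2))  # [0 2]
```

Pairing these near misses with the positives is what pushes the LLM from “financial feature” toward a label that excludes the lookalikes.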

🧰 Installation & Files

Install
```bash
# From repo root
pip install -e .
```
You’ll need
  • SAE model path (trained previously)
  • Base model path/name (for activation extraction)
  • Domain & general corpora (for Lite)
  • Chat model access (API key or local checkpoint) for Full

🚀 Quick Start (copy/paste)

1) Find domain-relevant features (Lite)
```bash
cd autointerp_lite
python run_analysis.py \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --domain_corpus data/finance.jsonl \
  --general_corpus data/general.jsonl \
  --out_dir results/lite_finance
```
2) Explain the top features (Full)
```bash
cd autointerp_full
./example_LLM_API.sh \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --feature_num 27 133 220 \
  --explainer_model openai/gpt-4o-mini \
  --non_activating_source FAISS \
  --out_dir results/full_finance
```

🗂️ Output Schemas (for automation)

Lite CSV columns
```text
feature_id:int, proto_label:str, specialization:float,
domain_activation:float, general_activation:float,
specialization_conf:float, top_examples:list[str]
```
Full detection JSON
```json
{
  "feature_id": 133,
  "label": "Earnings reports / rate changes",
  "f1": 0.74,
  "precision": 0.82,
  "recall": 0.67,
  "threshold": 0.43,
  "positive_examples": ["..."],
  "negative_examples": ["..."]
}
```

✅ When to use which

  • Use Lite when you have thousands of features and need a shortlist (10–50) for your domain.
  • Use Full when you need validated semantics with metrics and contrastive evidence (for research, audits, safety).
Pro tip: Save the results_summary.csv from Full as your feature catalog; your steering and alignment guard pages can link to these feature IDs and labels directly.

🧭 Best practices & pitfalls

  • Balance evidence size: start with --n_tokens 2e4 to iterate quickly; scale up later.
  • Contrastive negatives: prefer FAISS over random; explanations are crisper.
  • De-duplication: similar features may exist; use cosine on SAE atoms to merge siblings.
  • Guardrails: exclude features with F1 < 0.5 from production catalogs.
  • Reproducibility: fix seeds and persist corpora snapshot for consistent labels.
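The de-duplication point above can be sketched by comparing decoder directions (“atoms”) pairwise. This assumes a decoder matrix `W_dec` of shape `[n_features, d_model]`; the 0.9 cutoff is an arbitrary illustration, not a recommended value:

```python
import numpy as np

def sibling_pairs(W_dec, threshold=0.9):
    """Find feature pairs whose decoder atoms are near-duplicates
    (cosine similarity above threshold)."""
    W = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = W @ W.T
    pairs = []
    for i in range(len(W)):
        for j in range(i + 1, len(W)):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs

# Toy decoder: features 0 and 2 point in nearly the same direction.
W = np.array([[1.00, 0.00],
              [0.00, 1.00],
              [0.99, 0.05]])
print(sibling_pairs(W))  # [(0, 2)]
```

Merged siblings can then share one explanation card, keeping the catalog free of near-duplicate labels.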

🔗 References / Deep Dives

  • AutoInterp Lite README – feature ranking & discovery
  • AutoInterp Full README – LLM-based explanation, contrastive prompting, scoring