A two-stage system for discovering and explaining what your SAE features do — fast screening first, then deep, validated explanations.
🎯 What this does
- AutoInterp Lite (Feature Discovery): Quickly rank domain-relevant features out of thousands (minutes).
- AutoInterp Full (Feature Explanation): Produce human-readable explanations with confidence & detection metrics for specific features or all features (deeper, slower).
Typical workflow:
- Lite → find top 10–50 candidates for your domain.
- Full → generate vetted explanations and scores for those feature IDs (or sweep all).
🔎 1) AutoInterp Lite — Feature Discovery (minutes)
Why: SAEs expose tens of thousands of features; you need the few that matter (finance, healthcare, legal, etc.).
What: Compares feature activations on domain text vs general text to score specialization.
Inputs
- Trained SAE (weights + config)
- Base model used for activations (e.g., `bert-base-uncased`, `meta-llama/...`)
- Two corpora: domain (e.g., financial news) and general (e.g., Wikipedia)
Outputs
- Ranked CSV of features with:
- Specialization score (domain vs general)
- Activation strength (mean activation)
- Top activating examples (text spans)
- Optional proto-label (heuristic string from examples)
Key metrics
- Specialization score ↑ = more domain-specific
- Domain activation vs General activation
- Specialization confidence (stability of the rank under resampling)
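The exact scoring formula lives inside Lite, but the metrics above can be sketched conceptually as a domain/general activation ratio plus a bootstrap stability estimate. The function below is illustrative only; the name, formula, and stability definition are assumptions, not Lite's actual implementation:

```python
import numpy as np

def specialization_score(domain_acts, general_acts, n_boot=200, seed=0):
    """Illustrative sketch: ratio of mean domain activation to mean
    general activation, with a bootstrap-based stability estimate
    (higher = rank is more stable under resampling)."""
    domain_acts = np.asarray(domain_acts, dtype=float)
    general_acts = np.asarray(general_acts, dtype=float)
    eps = 1e-8
    score = domain_acts.mean() / (general_acts.mean() + eps)

    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        d = rng.choice(domain_acts, size=len(domain_acts), replace=True)
        g = rng.choice(general_acts, size=len(general_acts), replace=True)
        boots.append(d.mean() / (g.mean() + eps))
    stability = 1.0 / (np.std(boots) + eps)  # tight bootstrap spread -> stable
    return score, stability
```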
Speed
- 1k–10k features in 2–5 minutes on a single GPU (sampling-based)
CLI (example)
```bash
cd autointerp_lite
python run_analysis.py \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --domain_corpus data/finance.jsonl \
  --general_corpus data/general.jsonl \
  --out_dir results/lite_finance
```
Sample output (CSV)
| Feature | Proto-Label | Specialization | Domain Act. | General Act. | Spec. Conf. |
|---|---|---|---|---|---|
| 133 | Earnings reports / rate changes | 19.56 | 96.73 | 116.29 | 195.60 |
| 214 | Inflation indicators / labor data | 4.65 | 22.26 | 26.92 | 46.55 |
| 66 | Stock index performance | 4.75 | 19.77 | 24.52 | 47.51 |
Rule of thumb: specialization > 3.0 and confidence > 30 are strong candidates.
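Applying the rule of thumb programmatically is a one-pass filter over the Lite CSV (column names follow the schema in the Output Schemas section; `shortlist_features` is an illustrative helper, not part of the toolkit):

```python
import csv

def shortlist_features(path, min_spec=3.0, min_conf=30.0):
    """Return feature IDs from a Lite CSV that pass the rule of thumb:
    specialization > 3.0 and specialization confidence > 30."""
    keep = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if (float(row["specialization"]) > min_spec
                    and float(row["specialization_conf"]) > min_conf):
                keep.append(int(row["feature_id"]))
    return keep
```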
🧠 2) AutoInterp Full — Feature Explanation (deeper)
Why: “Financial feature” isn’t enough; you need precise semantics (e.g., earnings-call phrasing vs volatility language), plus validation.
What it does
- Uses a chat-formatted LLM to generate explanations from activating vs non-activating examples.
- Adds contrastive analysis (FAISS) so the LLM sees near misses and avoids vague labels.
- Computes detection metrics (F1 / Precision / Recall) for quality assurance.
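The detection metrics reduce to standard binary-classification scoring over text spans: does the feature actually activate on spans the explanation predicts it should? A minimal sketch of that scoring (not the toolkit's internal scorer):

```python
def detection_metrics(predicted, actual):
    """Precision/recall/F1 for 'does this span activate the feature?'.
    predicted/actual: iterables of 0/1 labels, one per text span."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```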
Inputs
- Feature IDs from Lite (e.g., `27 133 220`) or “analyze all”
- Explainer chat model (cloud or local)
- Optional FAISS index for contrastive negatives
Outputs
- Explanation cards (Markdown/JSON) per feature
- Scores: F1, Precision, Recall (+ per-feature thresholds)
- Examples: positive/negative spans used for validation
Speed
- 30–60 min/feature (depends on dataset size & LLM)
CLI (API-based explainer)
```bash
cd autointerp_full
./example_LLM_API.sh \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --feature_num 27 133 220 \
  --n_tokens 200000 \
  --explainer_model openai/gpt-4o-mini \
  --non_activating_source FAISS \
  --out_dir results/full_finance
```
Key parameters
| Flag | Purpose | Example |
|---|---|---|
| `--feature_num` | Specific features to explain | `27 133 220` |
| `--n_tokens` | Sample size for evidence | `2e4` (fast) – `1e7` (thorough) |
| `--explainer_model` | Chat LLM for explanations | `openai/gpt-4o-mini` / `Qwen2.5-7B-Instruct` |
| `--non_activating_source` | Contrastive negatives | `FAISS` (best) / `random` (fast) |
Results layout
```text
results/full_finance/
  explanations/FEATURE_133.md
  scores/detection/FEATURE_133.json   # f1, precision, recall, threshold
  results_summary.csv
```
Quality targets
- F1 ≥ 0.70, Precision ≥ 0.80, Recall ≥ 0.60 → production-worthy
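These gates are easy to enforce in automation. A sketch that reads one of the `scores/detection/FEATURE_*.json` files and applies the thresholds (`production_worthy` is a hypothetical helper, not a toolkit function):

```python
import json

def production_worthy(score_path):
    """Apply the quality targets (F1 >= 0.70, Precision >= 0.80,
    Recall >= 0.60) to a detection-score JSON file."""
    with open(score_path) as f:
        s = json.load(f)
    return (s["f1"] >= 0.70
            and s["precision"] >= 0.80
            and s["recall"] >= 0.60)
```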
Example (excerpt)
| Feature | Label (LLM) | F1 | Explanation (short) |
|---|---|---|---|
| 27 | “-ing verb forms / gerunds” | 0.75 | Detects morphological pattern “-ing” in running text. |
| 220 | “Abstract alternatives framing” | 0.53 | Marks language about hypotheticals / options. |
| 133 | “Bio taxonomy terms” | 0.02 | Likely spurious for finance; exclude from catalog. |
🧩 Contrastive Search (FAISS) — why it matters
Goal: Show the LLM positive vs near-negative examples so it learns what the feature is not.
Pipeline
- Embed text spans (sentence-transformers)
- Build FAISS index over non-activating spans
- For each feature, retrieve closest negatives to pair with positives
- Prompt LLM with contrastive pairs → sharper, less vague explanations
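The retrieval step can be sketched in a few lines. At small scale, brute-force cosine similarity over L2-normalized embeddings computes the same ranking that a FAISS inner-product index (`IndexFlatIP`) gives; FAISS simply makes that search fast over millions of spans. This numpy stand-in is illustrative only:

```python
import numpy as np

def nearest_negatives(pos_embs, neg_embs, k=3):
    """For each positive (activating) span embedding, return indices of
    the k closest non-activating spans by cosine similarity. Equivalent
    to a FAISS IndexFlatIP search over L2-normalized vectors, done
    brute-force here for clarity."""
    def l2norm(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = l2norm(pos_embs) @ l2norm(neg_embs).T  # cosine similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]       # indices into neg_embs
```

The returned near-miss negatives are paired with positives in the explainer prompt, which is what pushes the LLM away from vague labels.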
🧰 Installation & Files
Install
```bash
# From repo root
pip install -e .
```
You’ll need
- SAE model path (trained previously)
- Base model path/name (for activation extraction)
- Domain & general corpora (for Lite)
- Chat model access (API key or local checkpoint) for Full
🚀 Quick Start (copy/paste)
1) Find domain-relevant features (Lite)
```bash
cd autointerp_lite
python run_analysis.py \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --domain_corpus data/finance.jsonl \
  --general_corpus data/general.jsonl \
  --out_dir results/lite_finance
```
2) Explain the top features (Full)
```bash
cd autointerp_full
./example_LLM_API.sh \
  --sae_path /models/sae/layer6 \
  --base_model bert-base-uncased \
  --feature_num 27 133 220 \
  --explainer_model openai/gpt-4o-mini \
  --non_activating_source FAISS \
  --out_dir results/full_finance
```
🗂️ Output Schemas (for automation)
Lite CSV columns
```text
feature_id:int, proto_label:str, specialization:float, domain_activation:float,
general_activation:float, specialization_conf:float, top_examples:list[str]
```
Full detection JSON
```json
{
  "feature_id": 133,
  "label": "Earnings reports / rate changes",
  "f1": 0.74,
  "precision": 0.82,
  "recall": 0.67,
  "threshold": 0.43,
  "positive_examples": ["..."],
  "negative_examples": ["..."]
}
```
✅ When to use which
- Use Lite when you have thousands of features and need a shortlist (10–50) for your domain.
- Use Full when you need validated semantics with metrics and contrastive evidence (for research, audits, safety).
Pro tip: Save the `results_summary.csv` from Full as your feature catalog; your steering and alignment-guard pages can link to these feature IDs and labels directly.
🧭 Best practices & pitfalls
- Balance evidence size: start with `--n_tokens 2e4` to iterate quickly; scale up later.
- Contrastive negatives: prefer FAISS over random; explanations are crisper.
- De-duplication: similar features may exist; use cosine on SAE atoms to merge siblings.
- Guardrails: exclude features with F1 < 0.5 from production catalogs.
- Reproducibility: fix seeds and persist corpora snapshot for consistent labels.
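The de-duplication tip above can be sketched directly: treat each SAE decoder column as a feature direction and flag pairs whose cosine similarity exceeds a threshold (the threshold value and helper name are illustrative, not toolkit defaults):

```python
import numpy as np

def sibling_pairs(decoder, threshold=0.9):
    """Flag near-duplicate features by cosine similarity between SAE
    decoder columns (one column per feature). Returns (i, j) index
    pairs that are candidates for merging."""
    W = decoder / np.linalg.norm(decoder, axis=0, keepdims=True)
    sims = W.T @ W  # pairwise cosine similarity between feature atoms
    n = sims.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sims[i, j] > threshold]
```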
🔗 References / Deep Dives
- AutoInterp Lite README – feature ranking & discovery
- AutoInterp Full README – LLM-based explanation, contrastive prompting, scoring