Interactive, feature-level control for a fine-tuned financial LLM using Sparse Autoencoder (SAE) features. Steer specific financial concepts inside the model’s hidden states and compare outputs before vs after in real time.
Status: Layers 4, 10, 16, 22, 28 validated end-to-end on the Llama-2-7B Finance variant.
🎯 Overview
- What it is: A practical system to boost or suppress SAE features (e.g., “financial performance”, “market indicators”) at chosen layers to influence generation.
- Why it matters: Turn interpretability into control: test hypotheses, reduce hallucinations, and prototype mitigations without retraining the base model.
- What you get: Streamlit app for interactive steering, CLI scripts for quick tests, and a verification tool to quantify effects.
📁 Repo Layout
Main Files
- `streamlit_feature_steering_app.py`: Interactive Streamlit UI (side-by-side outputs, sliders, metrics)
- `minimal_steering_test.py`: Quick CLI sanity check (single prompt, multiple strengths)
- `verify_steering.py`: Batch verification on a prompt set (quality and consistency metrics)
Archived
- `archive/`: Dev and exploratory scripts (kept for reference)
🚀 Quick Start
1) Launch the Streamlit App
```bash
conda activate sae
streamlit run streamlit_feature_steering_app.py --server.port 8501
```
2) Minimal CLI Test
```bash
python minimal_steering_test.py
```
🔧 How Steering Works (High-Level)
We modify hidden states at a chosen layer by adding a normalized feature direction from the SAE decoder, scaled by a user-controlled strength.
- Direction source: SAE decoder weights for `feature_id`
- Normalization: L2 normalize the direction for stable scaling
- Injection: Add to hidden states at the target layer during forward pass
Formula
```text
steering_vector = strength × α × ( decoder[feature_id] / ||decoder[feature_id]|| )
steered_hidden  = original_hidden + steering_vector
```
- `strength` ∈ [−50, +50] (UI slider)
- `α` (alpha): default 0.5, an empirical balance between effect and stability
Negative strength suppresses a feature; positive strength amplifies it.
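
The formula above can be checked by hand on a tiny vector. The sketch below is a pure-Python illustration, not the repo's implementation: the helper names `steering_vector` and `steer` and the 2-dimensional toy values are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def steering_vector(decoder_row, strength, alpha=0.5):
    """Return strength * alpha * (decoder_row / ||decoder_row||)."""
    norm = math.sqrt(sum(x * x for x in decoder_row))
    if norm == 0.0:
        # A zero direction cannot be normalized, so steering becomes a no-op.
        return [0.0] * len(decoder_row)
    scale = strength * alpha / norm
    return [scale * x for x in decoder_row]

def steer(hidden, decoder_row, strength, alpha=0.5):
    """Add the steering vector to a single hidden-state vector."""
    vec = steering_vector(decoder_row, strength, alpha)
    return [h + v for h, v in zip(hidden, vec)]

# Toy 2-dimensional example (hypothetical values; real hidden states are 4096-dim).
direction = [3.0, 4.0]   # L2 norm is 5, so the unit direction is [0.6, 0.8]
hidden = [1.0, 1.0]
boosted = steer(hidden, direction, strength=10)      # adds 5 * [0.6, 0.8]
suppressed = steer(hidden, direction, strength=-10)  # subtracts the same vector
```

Because the direction is normalized, the L2 norm of the added vector is always `|strength| × α`, independent of how large the raw decoder row is.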
📊 Feature Catalog (Finance-Focused, example labels)
Validated SAE feature labels per layer (subset):
- Layer 4: “Financial Market Analysis”, “Financial Institutions”
- Layer 10: “Financial Market Trends”, “Investment Guidance”
- Layer 16: “Performance Indicators”, “News & Analysis”
- Layer 22: “Financial Performance & Growth”, “Market Indicators”
- Layer 28: “Company Financials”, “Stock Performance Analysis”
Tip: Link these labels to your AutoInterp catalog so users can search features by name and jump into steering.
🖼️ Interactive Demo (Streamlit)
- Left Sidebar: Layer picker, Feature selector, Strength slider (−50…+50), Max tokens
- Main Panel: Prompt input, Before vs After generations, activation snapshots
- Metrics Pane: Token-level activation sparkline, per-layer deltas, basic readability stats
Observed behavior: At strengths around +20…+30, general business prompts shift toward structured financial analysis (KPIs, moving-average lines, sector context).
🎯 Steering Examples (L22 shown)
Example 1 — “Financial Performance & Growth” (Feature 159, Layer 22)
Prompt: “The company’s quarterly earnings show”
| Strength | Output trend |
| --- | --- |
| 0 | Concrete numbers with plain narrative |
| 10 | More KPI-centric language; EPS/YoY mentions |
| 20 | Strategy framing; growth outlook |
| 30 | Heavier KPI emphasis; analyst-style tone |
Example 2 — “Market Indicators & Metrics” (Feature 258, Layer 22)
Prompt: “Stock market analysis indicates”
| Strength | Output trend |
| --- | --- |
| 0 | Generic advice |
| 10 | Light technicals (overbought/oversold) |
| 20 | Price/volume references; 50/200-DMA |
| 30 | Sector-level analysis; macro sensitivity |
(Similar tables can be shown for Features 345, 375, 116 to illustrate terminology shifts and analytic depth.)
🧩 Features
Streamlit App
- Clean controls (Layer / Feature / Strength / Tokens)
- Side-by-side outputs with diff highlighting
- Large text areas for readability (configurable height)
- Activation charts (per-step mean activation/Δ vs baseline)
- Download run JSON (prompt, settings, outputs, activations)
Minimal Test Script
- Runs a fixed prompt across strengths `[0, 5, 10, 15]`
- Logs outputs and basic activation stats
- Good for quick smoke tests & CI
Verification Script
- Batch prompts → compute steering efficacy
- Output similarity/shift (embedding cosine)
- Topic/term frequency changes (finance lexicon)
- Readability & length deltas
- Guardrail checks (toxicity / refusal / hallucination heuristics)
- CSV + plots for reports
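
As a rough illustration of the term-frequency and length deltas listed above, the following sketch compares a baseline and a steered text. It is not taken from `verify_steering.py`: the mini-lexicon, the helper names, and the two sample strings are all made up for the example (the strings are hand-written, not model outputs), and it uses simple token matching rather than embeddings.

```python
import re

# Hypothetical mini-lexicon; a real run would load a fuller finance term list.
FINANCE_TERMS = {"eps", "revenue", "earnings", "margin", "yoy", "dividend"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lexicon_rate(text, lexicon=FINANCE_TERMS):
    """Fraction of tokens that belong to the lexicon (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)

def steering_deltas(baseline, steered):
    """Steered-minus-baseline differences in lexicon rate and token count."""
    return {
        "lexicon_rate_delta": lexicon_rate(steered) - lexicon_rate(baseline),
        "length_delta": len(tokenize(steered)) - len(tokenize(baseline)),
    }

# Hand-written sample strings, used only to exercise the functions.
baseline = "The company had a good quarter overall"
steered = "Revenue grew and EPS beat estimates as margin expanded"
deltas = steering_deltas(baseline, steered)
```

A positive `lexicon_rate_delta` suggests the steered text uses more finance vocabulary; an embedding-based cosine shift would complement this by catching changes that do not show up as exact term matches.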
🔧 Technical Details
- Model: `cxllin/Llama2-7b-Finance` (example; pluggable)
- SAE: `llama2_7b_finance_layers4_10_16_22_28_k32_latents400_wikitext103_torchrun` (example)
- Hidden size: 4096 → steering vector shape `[1, 1, 4096]` (broadcast to `[B, T, 4096]`)
- Hook point: Residual stream (post-layer selection configurable)
- Alpha (α): default 0.5 (stable & visibly effective)
Why α = 0.5?
- After normalization, smaller α can be too subtle; larger α can destabilize style/fluency.
- 0.5 balances salient changes with coherence in finance tasks. Adjust per model/feature.
🧪 Safety & Quality Guards (recommended)
- Refusal & Toxicity checks on steered outputs (safety proxy models)
- Hallucination heuristics: self-consistency disagreement on factual prompts
- Domain guardrails: enforce “no financial advice” disclaimer templates when relevant
- Strength caps: clamp |strength| for sensitive deployments (e.g., ≤ 25)
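
A strength cap can be as small as one clamping function applied before the value reaches the hook. The sketch below assumes a hypothetical helper named `clamp_strength`; the default cap of 25 mirrors the example value above and should be tuned per deployment.

```python
def clamp_strength(strength, cap=25.0):
    """Limit |strength| to `cap` while preserving the sign."""
    if cap < 0:
        raise ValueError("cap must be non-negative")
    return max(-cap, min(cap, strength))

# The UI slider allows ±50, but a sensitive deployment may cap at ±25.
safe_boost = clamp_strength(40)       # clipped to the cap
safe_suppress = clamp_strength(-40)   # clipped to the negative cap
unchanged = clamp_strength(10)        # already inside the cap
```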
💡 Usage Tips
- Start with +10…+15 to see controlled effects.
- Use domain prompts (earnings, indicators) to showcase signal.
- Try negative strengths to suppress an unwanted style/behavior.
- Compare multiple features on the same prompt to understand semantics.
- Validate with `verify_steering.py` before sharing demos externally.
💻 Core Steering Hook (reference)
```python
import torch

# `decoder`, `feature_id`, and `strength` are expected in the enclosing scope.
def steering_hook(module, inputs, output):
    # Handle models that return tuples (hidden_states, ...)
    hidden = output[0] if isinstance(output, tuple) else output

    # 1) SAE decoder direction (feature_id -> vector)
    direction = decoder[feature_id]                   # [hidden_dim]
    direction = direction.unsqueeze(0).unsqueeze(0)   # [1, 1, hidden_dim]

    # 2) Normalize
    norm = torch.norm(direction)
    if norm > 0:
        direction = direction / norm

    # 3) Scale (strength × α)
    steering_vec = strength * 0.5 * direction         # α = 0.5 default

    # 4) Inject
    steered = hidden + steering_vec
    if isinstance(output, tuple):
        return (steered.to(hidden.dtype),) + output[1:]
    return steered.to(hidden.dtype)
```
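Replace `decoder` and `feature_id` with your loaded SAE decoder weights (indexed as `[num_features, hidden_dim]`) and selected feature; `strength` must also be defined in the enclosing scope.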
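
To show how a hook like this attaches to a module, here is a minimal sketch using PyTorch's `register_forward_hook`. Everything except that API is an assumption for illustration: the factory `make_steering_hook`, the `nn.Identity` stand-in for a transformer layer, and the random 16×8 "decoder" are toy substitutes for the real model and SAE.

```python
import torch
from torch import nn

def make_steering_hook(decoder, feature_id, strength, alpha=0.5):
    """Build a forward hook that adds a scaled, normalized decoder direction."""
    direction = decoder[feature_id]
    norm = torch.norm(direction)
    if norm > 0:
        direction = direction / norm
    steering_vec = (strength * alpha * direction).view(1, 1, -1)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = (hidden + steering_vec.to(hidden.device)).to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Toy stand-ins: an identity "layer" and a random 16-feature, 8-dim decoder.
torch.manual_seed(0)
layer = nn.Identity()
decoder = torch.randn(16, 8)
x = torch.randn(2, 3, 8)  # [batch, tokens, hidden]

baseline = layer(x)
handle = layer.register_forward_hook(
    make_steering_hook(decoder, feature_id=5, strength=10.0)
)
steered = layer(x)
handle.remove()  # remove the hook so later calls are unsteered
restored = layer(x)

# Per-token L2 shift; approximately strength * alpha for every position.
shift = (steered - baseline).norm(dim=-1)
```

Building the hook through a factory keeps `decoder`, `feature_id`, and `strength` out of global scope, and keeping the returned handle makes it straightforward to produce the unsteered "before" output from the same model instance.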
🔬 Comparison to SAELens Steering
| Aspect | SAELens tutorials | This implementation |
| --- | --- | --- |
| Normalization | sometimes omitted | Always normalize |
| Scale | small (≈0.1–0.3) | α = 0.5 default (clearer effects) |
| Mechanism | activation-based variants | Direct residual addition |
| Behavior | subtle & very stable | balanced stability with visible shifts |
Choose scale per model/feature; our default aims for live demos with clear, controlled impact.
🧷 Repro & Config Notes
- Log seed, layer, feature_id, strength, alpha, prompt, and model hash to reproduce a run.
- Store pre/post outputs and activation summaries for audits.
- For batch evaluation, keep a fixed prompt set (finance Q&A, indicators, earnings, risk).
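
One way to capture the fields listed above is a small immutable record serialized to JSON. This is a sketch only: the `SteeringRun` class, its `run_id` helper, and the placeholder model hash are hypothetical and not part of the repo.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class SteeringRun:
    seed: int
    layer: int
    feature_id: int
    strength: float
    alpha: float
    prompt: str
    model_hash: str

    def to_json(self):
        # sort_keys gives a stable serialization for diffing and hashing.
        return json.dumps(asdict(self), sort_keys=True)

    def run_id(self):
        # Short deterministic ID derived from the full configuration.
        digest = hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
        return digest[:12]

run = SteeringRun(
    seed=0,
    layer=22,
    feature_id=159,
    strength=20.0,
    alpha=0.5,
    prompt="The company's quarterly earnings show",
    model_hash="<model-hash-placeholder>",  # fill in from your checkpoint
)
record = json.loads(run.to_json())
```

Since `run_id` depends only on the logged settings, two runs with identical configuration get the same ID, which makes duplicate or mismatched entries easy to spot in an audit log.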
⚠️ Troubleshooting
- Effect too weak: Increase `strength`, or try a different feature/layer; verify the feature’s label via AutoInterp Full.
- Text degrades at high strength: Lower `strength` or `α`; try a later layer (e.g., 22→28).
- No visible change: Ensure decoder vector aligns with residual stream; confirm correct layer hook; check feature truly activates on your prompt type.
- Domain drift (non-finance style): Use domain prompts; suppress conflicting features with negative strengths.
🔗 Suggested Next Steps
- Label & Catalog: Use SAE Labeling (AutoInterp) to name features and filter by F1/precision for production.
- Alignment Guard: Evaluate how steering affects risk features and hallucination metrics post fine-tune.
- Mitigations: Convert effective steering settings into router/clamp rules for deployment.