Interactive, feature-level control for a fine-tuned financial LLM using Sparse Autoencoder (SAE) features. Steer specific financial concepts inside the model’s hidden states and compare outputs before vs after in real time.
Status: Layers 4, 10, 16, 22, 28 validated end-to-end on the Llama-2-7B Finance variant.

🎯 Overview

  • What it is: A practical system to boost or suppress SAE features (e.g., “financial performance”, “market indicators”) at chosen layers to influence generation.
  • Why it matters: Turns interpretability into control, letting you test hypotheses, reduce hallucinations, and prototype mitigations without retraining the base model.
  • What you get: Streamlit app for interactive steering, CLI scripts for quick tests, and a verification tool to quantify effects.

📁 Repo Layout

Main Files

  • streamlit_feature_steering_app.py — Interactive Streamlit UI (side-by-side outputs, sliders, metrics)
  • minimal_steering_test.py — Quick CLI sanity check (single prompt, multiple strengths)
  • verify_steering.py — Batch verification on a prompt set (quality & consistency metrics)

Archived

  • archive/ — Dev and exploratory scripts (kept for reference)

🚀 Quick Start

1) Launch the Streamlit App

```bash
conda activate sae
streamlit run streamlit_feature_steering_app.py --server.port 8501
```

2) Minimal CLI Test

```bash
python minimal_steering_test.py
```

🔧 How Steering Works (High-Level)

We modify hidden states at a chosen layer by adding a normalized feature direction from the SAE decoder, scaled by a user-controlled strength.
  • Direction source: SAE decoder weights for feature_id
  • Normalization: L2 normalize the direction for stable scaling
  • Injection: Add to hidden states at the target layer during forward pass
Formula

```text
steering_vector = strength × α × ( decoder[feature_id] / ||decoder[feature_id]|| )
steered_hidden  = original_hidden + steering_vector
```
  • strength ∈ [−50, +50] (UI slider)
  • α (alpha) default 0.5 — empirical balance between effect and stability
Negative strength suppresses a feature; positive strength amplifies it.
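The formula above can be sketched in a few lines of PyTorch. The `decoder` tensor here is a random stand-in for the SAE decoder weights; in the app it comes from the loaded SAE:

```python
import torch

# Stand-in for the SAE decoder matrix: [num_features, hidden_dim]
decoder = torch.randn(400, 4096)
feature_id, strength, alpha = 159, 20.0, 0.5

# L2-normalize the decoder direction, then scale by strength × α
direction = decoder[feature_id]
direction = direction / direction.norm()
steering_vector = strength * alpha * direction

# After normalization, the vector's magnitude is exactly strength × α
print(steering_vector.norm())  # ≈ 10.0
```

Because the direction is normalized first, the slider value maps directly onto the magnitude of the perturbation, independent of the raw decoder-weight scale.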

📊 Feature Catalog (Finance-Focused, example labels)

Validated SAE feature labels per layer (subset):
  • Layer 4: “Financial Market Analysis”, “Financial Institutions”
  • Layer 10: “Financial Market Trends”, “Investment Guidance”
  • Layer 16: “Performance Indicators”, “News & Analysis”
  • Layer 22: “Financial Performance & Growth”, “Market Indicators”
  • Layer 28: “Company Financials”, “Stock Performance Analysis”
Tip: Link these labels to your AutoInterp catalog so users can search features by name and jump into steering.

🖼️ Interactive Demo (Streamlit)

  • Left Sidebar: Layer picker, Feature selector, Strength slider (−50…+50), Max tokens
  • Main Panel: Prompt input, Before vs After generations, activation snapshots
  • Metrics Pane: Token-level activation sparkline, per-layer deltas, basic readability stats
Observed behavior: At strengths around +20…+30, general business prompts shift toward structured financial analysis (KPIs, moving-average lines, sector context).

🎯 Steering Examples (L22 shown)

Example 1 — “Financial Performance & Growth” (Feature 159, Layer 22)
Prompt: “The company’s quarterly earnings show”
| Strength | Output trend |
|---|---|
| 0 | Concrete numbers with plain narrative |
| 10 | More KPI-centric language; EPS/YoY mentions |
| 20 | Strategy framing; growth outlook |
| 30 | Heavier KPI emphasis; analyst-style tone |
Example 2 — “Market Indicators & Metrics” (Feature 258, Layer 22)
Prompt: “Stock market analysis indicates”
| Strength | Output trend |
|---|---|
| 0 | Generic advice |
| 10 | Light technicals (overbought/oversold) |
| 20 | Price/volume references; 50/200-DMA |
| 30 | Sector-level analysis; macro sensitivity |
(Similar tables can be shown for Features 345, 375, 116 to illustrate terminology shifts and analytic depth.)

🧩 Features

Streamlit App

  • Clean controls (Layer / Feature / Strength / Tokens)
  • Side-by-side outputs with diff highlighting
  • Large text areas for readability (configurable height)
  • Activation charts (per-step mean activation/Δ vs baseline)
  • Download run JSON (prompt, settings, outputs, activations)

Minimal Test Script

  • Runs a fixed prompt across strengths [0, 5, 10, 15]
  • Logs outputs and basic activation stats
  • Good for quick smoke tests & CI

Verification Script

  • Batch prompts → compute steering efficacy
    • Output similarity/shift (embedding cosine)
    • Topic/term frequency changes (finance lexicon)
    • Readability & length deltas
    • Guardrail checks (toxicity / refusal / hallucination heuristics)
  • CSV + plots for reports
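The term-frequency check can be as simple as counting finance-lexicon hits before and after steering. The lexicon below is a tiny illustrative subset, not the one shipped with the script:

```python
import re
from collections import Counter

FINANCE_LEXICON = {"eps", "yoy", "revenue", "margin", "guidance"}  # illustrative subset

def lexicon_hits(text: str) -> int:
    """Count occurrences of finance-lexicon terms in a text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(tokens[term] for term in FINANCE_LEXICON)

baseline = "The company did well this quarter."
steered = "EPS grew 12% YoY and revenue margin guidance improved."

# A positive delta suggests the steer pushed output toward finance terminology
delta = lexicon_hits(steered) - lexicon_hits(baseline)
print(delta)
```

In practice you would average this delta over the whole prompt set and report it alongside the embedding-cosine shift.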

🔧 Technical Details

  • Model: cxllin/Llama2-7b-Finance (example; pluggable)
  • SAE: llama2_7b_finance_layers4_10_16_22_28_k32_latents400_wikitext103_torchrun (example)
  • Hidden size: 4096 → steering vector shape [1, 1, 4096] (broadcast to [B, T, 4096])
  • Hook point: Residual stream (post-layer selection configurable)
  • Alpha (α): default 0.5 (stable & visibly effective)
Why α = 0.5?
  • After normalization, smaller α can be too subtle; larger α can destabilize style/fluency.
  • 0.5 balances salient changes with coherence in finance tasks. Adjust per model/feature.
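The shape bookkeeping above can be checked in isolation: a `[1, 1, 4096]` steering vector broadcasts against a `[B, T, 4096]` hidden-state tensor without any explicit expansion, so every token position receives the same offset:

```python
import torch

B, T, hidden_dim = 2, 8, 4096
hidden = torch.zeros(B, T, hidden_dim)

steering_vec = torch.randn(1, 1, hidden_dim)  # [1, 1, 4096]
steered = hidden + steering_vec               # broadcasts to [B, T, 4096]

assert steered.shape == (B, T, hidden_dim)
# Every batch element and token position receives the identical offset
assert torch.allclose(steered[0, 0], steered[1, 5])
```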

🧪 Safety & Quality Guards (recommended)

  • Refusal & Toxicity checks on steered outputs (safety proxy models)
  • Hallucination heuristics: self-consistency disagreement on factual prompts
  • Domain guardrails: enforce “no financial advice” disclaimer templates when relevant
  • Strength caps: clamp |strength| for sensitive deployments (e.g., ≤ 25)
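A strength cap is a one-liner. This sketch (names are illustrative, not from the repo) clamps the UI value before it reaches the hook:

```python
MAX_ABS_STRENGTH = 25.0  # deployment-specific cap

def clamp_strength(strength: float, cap: float = MAX_ABS_STRENGTH) -> float:
    """Clamp |strength| so steering stays in a vetted range."""
    return max(-cap, min(cap, strength))

print(clamp_strength(40))   # 25.0
print(clamp_strength(-40))  # -25.0
print(clamp_strength(12))   # 12
```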

💡 Usage Tips

  1. Start with +10…+15 to see controlled effects.
  2. Use domain prompts (earnings, indicators) to showcase signal.
  3. Try negative strengths to suppress an unwanted style/behavior.
  4. Compare multiple features on the same prompt to understand semantics.
  5. Validate with verify_steering.py before sharing demos externally.

💻 Core Steering Hook (reference)

```python
def steering_hook(module, inputs, output):
    # Handle models that return tuples (hidden_states, ...)
    hidden = output[0] if isinstance(output, tuple) else output

    # 1) SAE decoder direction (feature_id -> vector)
    direction = decoder[feature_id]                  # [hidden_dim]
    direction = direction.unsqueeze(0).unsqueeze(0)  # [1, 1, hidden_dim]

    # 2) Normalize
    norm = torch.norm(direction)
    if norm > 0:
        direction = direction / norm

    # 3) Scale (strength × α)
    steering_vec = strength * 0.5 * direction        # α = 0.5 default

    # 4) Inject
    steered = hidden + steering_vec

    if isinstance(output, tuple):
        return (steered.to(hidden.dtype),) + output[1:]
    return steered.to(hidden.dtype)
```
Replace decoder and feature_id with your loaded SAE module & selected feature.
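Wiring the hook in follows the standard PyTorch `register_forward_hook` pattern. The sketch below uses a toy linear layer instead of a Llama block, and the real attachment point (e.g. `model.model.layers[22]` for Llama-style models) is an assumption to adapt to your checkpoint:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 16
decoder = torch.randn(8, hidden_dim)  # toy stand-in for the SAE decoder
feature_id, strength, alpha = 3, 10.0, 0.5

def steering_hook(module, inputs, output):
    # Normalized feature direction, scaled by strength × α, added to the output
    direction = decoder[feature_id]
    direction = direction / direction.norm()
    return output + strength * alpha * direction

layer = nn.Linear(hidden_dim, hidden_dim)
handle = layer.register_forward_hook(steering_hook)

x = torch.randn(1, 4, hidden_dim)
steered_out = layer(x)
handle.remove()          # detach the hook when done
plain_out = layer(x)

# The difference at every position is exactly the injected steering vector
print((steered_out - plain_out)[0, 0].norm())  # ≈ strength × α = 5.0
```

Removing the handle restores baseline behavior, which is how the app produces its before/after comparison from a single model instance.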

🔬 Comparison to SAELens Steering

| Aspect | SAELens tutorials | This implementation |
|---|---|---|
| Normalization | Sometimes omitted | Always normalize |
| Scale | Small (≈ 0.1–0.3) | α = 0.5 default (clearer effects) |
| Mechanism | Activation-based variants | Direct residual addition |
| Behavior | Subtle and very stable | Balanced stability with visible shifts |
Choose scale per model/feature; our default aims for live demos with clear, controlled impact.

🧷 Repro & Config Notes

  • Log seed, layer, feature_id, strength, alpha, prompt, and model hash to reproduce a run.
  • Store pre/post outputs and activation summaries for audits.
  • For batch evaluation, keep a fixed prompt set (finance Q&A, indicators, earnings, risk).
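A minimal run log covering the fields above might look like this (field names are suggestions, not the app's actual schema):

```python
import json

run_record = {
    "seed": 42,
    "layer": 22,
    "feature_id": 159,
    "strength": 20.0,
    "alpha": 0.5,
    "prompt": "The company's quarterly earnings show",
    "model_hash": "<fill in from your checkpoint>",
}

with open("steering_run.json", "w") as f:
    json.dump(run_record, f, indent=2)

# Reload to verify the record round-trips for later audits
with open("steering_run.json") as f:
    restored = json.load(f)
assert restored == run_record
```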

⚠️ Troubleshooting

  • Effect too weak: Increase strength, or try a different feature/layer; verify the feature’s label via AutoInterp Full.
  • Text degrades at high strength: Lower strength or α; try a later layer (e.g., 22→28).
  • No visible change: Ensure decoder vector aligns with residual stream; confirm correct layer hook; check feature truly activates on your prompt type.
  • Domain drift (non-finance style): Use domain prompts; suppress conflicting features with negative strengths.

🔗 Suggested Next Steps

  • Label & Catalog: Use SAE Labeling (AutoInterp) to name features and filter by F1/precision for production.
  • Alignment Guard: Evaluate how steering affects risk features and hallucination metrics post fine-tune.
  • Mitigations: Convert effective steering settings into router/clamp rules for deployment.