Interactive, feature-level control for a fine-tuned financial LLM using Sparse Autoencoder (SAE) features. Steer specific financial concepts inside the model’s hidden states and compare outputs before vs after in real time.
Status: Layers 4, 10, 16, 22, 28 validated end-to-end on the Llama-2-7B Finance variant.

🎯 Overview

  • What it is: A practical system to boost or suppress SAE features (e.g., “financial performance”, “market indicators”) at chosen layers to influence generation.
  • Why it matters: Turns interpretability into control, letting you test hypotheses, reduce hallucinations, and prototype mitigations without retraining the base model.
  • What you get: Streamlit app for interactive steering, CLI scripts for quick tests, and a verification tool to quantify effects.

📁 Repo Layout

Main Files

  • streamlit_feature_steering_app.py — Interactive Streamlit UI (side-by-side outputs, sliders, metrics)
  • minimal_steering_test.py — Quick CLI sanity check (single prompt, multiple strengths)
  • verify_steering.py — Batch verification on a prompt set (quality & consistency metrics)

Archived

  • archive/ — Dev and exploratory scripts (kept for reference)

🚀 Quick Start

1) Launch the Streamlit App

```bash
conda activate sae
streamlit run streamlit_feature_steering_app.py --server.port 8501
```

2) Minimal CLI Test

```bash
python minimal_steering_test.py
```

🔧 How Steering Works (High-Level)

We modify hidden states at a chosen layer by adding a normalized feature direction from the SAE decoder, scaled by a user-controlled strength.
  • Direction source: SAE decoder weights for feature_id
  • Normalization: L2 normalize the direction for stable scaling
  • Injection: Add to hidden states at the target layer during forward pass
Formula

```text
steering_vector = strength × α × ( decoder[feature_id] / ||decoder[feature_id]|| )
steered_hidden  = original_hidden + steering_vector
```
  • strength ∈ [−50, +50] (UI slider)
  • α (alpha) default 0.5 — empirical balance between effect and stability
Negative strength suppresses a feature; positive strength amplifies it.
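The formula above can be sketched in a few lines of PyTorch. The `decoder` tensor here is a random stand-in for the SAE decoder weights; in the app it comes from the loaded SAE:

```python
import torch

# Stand-in for the SAE decoder matrix: [num_features, hidden_dim]
decoder = torch.randn(400, 4096)
feature_id, strength, alpha = 159, 20.0, 0.5

# L2-normalize the decoder direction, then scale by strength × α
direction = decoder[feature_id]
direction = direction / direction.norm()
steering_vector = strength * alpha * direction

# After normalization, the vector's magnitude is exactly strength × α
print(steering_vector.norm())  # ≈ 10.0
```

Because the direction is normalized first, the slider value maps directly onto the magnitude of the perturbation, independent of the raw decoder-weight scale.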

📊 Feature Catalog (Finance-Focused, example labels)

Validated SAE feature labels per layer (subset):
  • Layer 4: “Financial Market Analysis”, “Financial Institutions”
  • Layer 10: “Financial Market Trends”, “Investment Guidance”
  • Layer 16: “Performance Indicators”, “News & Analysis”
  • Layer 22: “Financial Performance & Growth”, “Market Indicators”
  • Layer 28: “Company Financials”, “Stock Performance Analysis”
Tip: Link these labels to your AutoInterp catalog so users can search features by name and jump into steering.

🖼️ Interactive Demo (Streamlit)

  • Left Sidebar: Layer picker, Feature selector, Strength slider (−50…+50), Max tokens
  • Main Panel: Prompt input, Before vs After generations, activation snapshots
  • Metrics Pane: Token-level activation sparkline, per-layer deltas, basic readability stats
Observed behavior: At strengths around +20…+30, general business prompts shift toward structured financial analysis (KPIs, moving-average lines, sector context).

🎯 Steering Examples (L22 shown)

Example 1 — “Financial Performance & Growth” (Feature 159, Layer 22)
Prompt: “The company’s quarterly earnings show”
| Strength | Output trend |
|---|---|
| 0 | Concrete numbers with plain narrative |
| 10 | More KPI-centric language; EPS/YoY mentions |
| 20 | Strategy framing; growth outlook |
| 30 | Heavier KPI emphasis; analyst-style tone |
Example 2 — “Market Indicators & Metrics” (Feature 258, Layer 22)
Prompt: “Stock market analysis indicates”
| Strength | Output trend |
|---|---|
| 0 | Generic advice |
| 10 | Light technicals (overbought/oversold) |
| 20 | Price/volume references; 50/200-DMA |
| 30 | Sector-level analysis; macro sensitivity |
(Similar tables can be shown for Features 345, 375, 116 to illustrate terminology shifts and analytic depth.)

🧩 Features

Streamlit App

  • Clean controls (Layer / Feature / Strength / Tokens)
  • Side-by-side outputs with diff highlighting
  • Large text areas for readability (configurable height)
  • Activation charts (per-step mean activation/Δ vs baseline)
  • Download run JSON (prompt, settings, outputs, activations)

Minimal Test Script

  • Runs a fixed prompt across strengths [0, 5, 10, 15]
  • Logs outputs and basic activation stats
  • Good for quick smoke tests & CI

Verification Script

  • Batch prompts → compute steering efficacy
    • Output similarity/shift (embedding cosine)
    • Topic/term frequency changes (finance lexicon)
    • Readability & length deltas
    • Guardrail checks (toxicity / refusal / hallucination heuristics)
  • CSV + plots for reports
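The term-frequency check can be as simple as counting finance-lexicon hits before and after steering. The lexicon below is a tiny illustrative subset, not the one shipped with the script:

```python
import re
from collections import Counter

FINANCE_LEXICON = {"eps", "yoy", "revenue", "margin", "guidance"}  # illustrative subset

def lexicon_hits(text: str) -> int:
    """Count occurrences of finance-lexicon terms in a text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(tokens[term] for term in FINANCE_LEXICON)

baseline = "The company did well this quarter."
steered = "EPS grew 12% YoY and revenue margin guidance improved."

# A positive delta suggests the steer pushed output toward finance terminology
delta = lexicon_hits(steered) - lexicon_hits(baseline)
print(delta)
```

In practice you would average this delta over the whole prompt set and report it alongside the embedding-cosine shift.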

🔧 Technical Details

  • Model: cxllin/Llama2-7b-Finance (example; pluggable)
  • SAE: llama2_7b_finance_layers4_10_16_22_28_k32_latents400_wikitext103_torchrun (example)
  • Hidden size: 4096 → steering vector shape [1, 1, 4096] (broadcast to [B, T, 4096])
  • Hook point: Residual stream (post-layer selection configurable)
  • Alpha (α): default 0.5 (stable & visibly effective)
Why α = 0.5?
  • After normalization, smaller α can be too subtle; larger α can destabilize style/fluency.
  • 0.5 balances salient changes with coherence in finance tasks. Adjust per model/feature.
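The shape bookkeeping above can be checked in isolation: a `[1, 1, 4096]` steering vector broadcasts against a `[B, T, 4096]` hidden-state tensor without any explicit expansion, so every token position receives the same offset:

```python
import torch

B, T, hidden_dim = 2, 8, 4096
hidden = torch.zeros(B, T, hidden_dim)

steering_vec = torch.randn(1, 1, hidden_dim)  # [1, 1, 4096]
steered = hidden + steering_vec               # broadcasts to [B, T, 4096]

assert steered.shape == (B, T, hidden_dim)
# Every batch element and token position receives the identical offset
assert torch.allclose(steered[0, 0], steered[1, 5])
```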

🧪 Safety & Quality Guards (recommended)

  • Refusal & Toxicity checks on steered outputs (safety proxy models)
  • Hallucination heuristics: self-consistency disagreement on factual prompts
  • Domain guardrails: enforce “no financial advice” disclaimer templates when relevant
  • Strength caps: clamp |strength| for sensitive deployments (e.g., ≤ 25)
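A strength cap is a one-liner. This sketch (names are illustrative, not from the repo) clamps the UI value before it reaches the hook:

```python
MAX_ABS_STRENGTH = 25.0  # deployment-specific cap

def clamp_strength(strength: float, cap: float = MAX_ABS_STRENGTH) -> float:
    """Clamp |strength| so steering stays in a vetted range."""
    return max(-cap, min(cap, strength))

print(clamp_strength(40))   # 25.0
print(clamp_strength(-40))  # -25.0
print(clamp_strength(12))   # 12
```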

💡 Usage Tips

  1. Start with +10…+15 to see controlled effects.
  2. Use domain prompts (earnings, indicators) to showcase signal.
  3. Try negative strengths to suppress an unwanted style/behavior.
  4. Compare multiple features on the same prompt to understand semantics.
  5. Validate with verify_steering.py before sharing demos externally.

💻 Core Steering Hook (reference)

```python
def steering_hook(module, inputs, output):
    # Handle models that return tuples (hidden_states, ...)
    hidden = output[0] if isinstance(output, tuple) else output

    # 1) SAE decoder direction (feature_id -> vector)
    direction = decoder[feature_id]                  # [hidden_dim]
    direction = direction.unsqueeze(0).unsqueeze(0)  # [1, 1, hidden_dim]

    # 2) Normalize
    norm = torch.norm(direction)
    if norm > 0:
        direction = direction / norm

    # 3) Scale (strength × α)
    steering_vec = strength * 0.5 * direction        # α = 0.5 default

    # 4) Inject
    steered = hidden + steering_vec

    if isinstance(output, tuple):
        return (steered.to(hidden.dtype),) + output[1:]
    return steered.to(hidden.dtype)
```
Replace decoder and feature_id with your loaded SAE module & selected feature.
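Wiring the hook in follows the standard PyTorch `register_forward_hook` pattern. The sketch below uses a toy linear layer instead of a Llama block, and the real attachment point (e.g. `model.model.layers[22]` for Llama-style models) is an assumption to adapt to your checkpoint:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 16
decoder = torch.randn(8, hidden_dim)  # toy stand-in for the SAE decoder
feature_id, strength, alpha = 3, 10.0, 0.5

def steering_hook(module, inputs, output):
    # Normalized feature direction, scaled by strength × α, added to the output
    direction = decoder[feature_id]
    direction = direction / direction.norm()
    return output + strength * alpha * direction

layer = nn.Linear(hidden_dim, hidden_dim)
handle = layer.register_forward_hook(steering_hook)

x = torch.randn(1, 4, hidden_dim)
steered_out = layer(x)
handle.remove()          # detach the hook when done
plain_out = layer(x)

# The difference at every position is exactly the injected steering vector
print((steered_out - plain_out)[0, 0].norm())  # ≈ strength × α = 5.0
```

Removing the handle restores baseline behavior, which is how the app produces its before/after comparison from a single model instance.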

🔬 Comparison to SAELens Steering

| Aspect | SAELens tutorials | This implementation |
|---|---|---|
| Normalization | Sometimes omitted | Always normalize |
| Scale | Small (≈ 0.1–0.3) | α = 0.5 default (clearer effects) |
| Mechanism | Activation-based variants | Direct residual addition |
| Behavior | Subtle and very stable | Balanced stability with visible shifts |
Choose scale per model/feature; our default aims for live demos with clear, controlled impact.

🧷 Repro & Config Notes

  • Log seed, layer, feature_id, strength, alpha, prompt, and model hash to reproduce a run.
  • Store pre/post outputs and activation summaries for audits.
  • For batch evaluation, keep a fixed prompt set (finance Q&A, indicators, earnings, risk).
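A minimal run log covering the fields above might look like this (field names are suggestions, not the app's actual schema):

```python
import json

run_record = {
    "seed": 42,
    "layer": 22,
    "feature_id": 159,
    "strength": 20.0,
    "alpha": 0.5,
    "prompt": "The company's quarterly earnings show",
    "model_hash": "<fill in from your checkpoint>",
}

with open("steering_run.json", "w") as f:
    json.dump(run_record, f, indent=2)

# Reload to verify the record round-trips for later audits
with open("steering_run.json") as f:
    restored = json.load(f)
assert restored == run_record
```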

⚠️ Troubleshooting

  • Effect too weak: Increase strength, or try a different feature/layer; verify the feature’s label via AutoInterp Full.
  • Text degrades at high strength: Lower strength or α; try a later layer (e.g., 22→28).
  • No visible change: Ensure decoder vector aligns with residual stream; confirm correct layer hook; check feature truly activates on your prompt type.
  • Domain drift (non-finance style): Use domain prompts; suppress conflicting features with negative strengths.

🔗 Suggested Next Steps

  • Label & Catalog: Use SAE Labeling (AutoInterp) to name features and filter by F1/precision for production.
  • Alignment Guard: Evaluate how steering affects risk features and hallucination metrics post fine-tune.
  • Mitigations: Convert effective steering settings into router/clamp rules for deployment.