📋 Overview
This document provides a comprehensive framework for evaluating Sparse Autoencoders (SAEs) across different domains, model architectures, and datasets. It covers theoretical foundations, evaluation metrics, domain-specific considerations, and practical implementation guidelines.
Why Evaluate SAEs?
SAE evaluation serves two core purposes:
- Reconstruction Quality: Measure how well the SAE preserves information
- Sparsity Validation: Ensure the SAE maintains desired sparsity properties
📊 Core Evaluation Metrics
1. Loss Recovered (Fraction of Variance Explained)
Definition:
(1 - MSE/Total_Variance) × 100% - Measures how well the SAE reconstructs the original activations. This is the fraction of variance explained, equivalently 1 − FVU (the fraction of variance unexplained).
Significance: Indicates information preservation and training quality. A scale-invariant metric widely used in the SAE literature.
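A minimal NumPy sketch of this computation (the function name and the (batch, d_model) activation layout are illustrative assumptions):

```python
import numpy as np

def fraction_variance_explained(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """(1 - MSE / total variance) x 100, for activations shaped (batch, d_model)."""
    mse = np.mean((original - reconstructed) ** 2)
    total_variance = np.var(original)
    return float((1.0 - mse / total_variance) * 100.0)
```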
Healthy Ranges by Model Size:
- Small Models (<1B): 40-60%
- Medium Models (1-10B): 60-80%
- Large Models (>10B): 70-90%
2. L0 Sparsity (Average Active Features)
Definition:
mean(count_nonzero(activations, axis=1)) - Average number of active features per sample.
Significance: Ensures sparse representations for computational efficiency and interpretability; a direct measurement of how many features actually fire.
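A direct NumPy translation of this definition (a sketch; the (batch, n_features) code layout is assumed):

```python
import numpy as np

def l0_sparsity(feature_activations: np.ndarray) -> float:
    """Average number of nonzero features per sample, for codes shaped (batch, n_features)."""
    return float(np.count_nonzero(feature_activations, axis=1).mean())
```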
Optimal Ranges by Use Case:
- Interpretability Focus: 10-50 features
- Balanced Approach: 50-150 features
- Reconstruction Focus: 100-300 features
- Maximum: 500 features (beyond this, sparsity loses meaning)
3. Dead Features Percentage
Definition:
count(features_with_usage < threshold) / total_features × 100% - Percentage of features that rarely or never activate.
Significance: Indicates training health and capacity utilization. A high dead-feature rate suggests training problems or poor initialization.
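A minimal sketch, assuming "usage" means each feature's firing frequency over an evaluation batch; the default threshold is illustrative:

```python
import numpy as np

def dead_feature_pct(feature_activations: np.ndarray, threshold: float = 1e-4) -> float:
    """Percentage of features whose firing frequency falls below threshold (illustrative default)."""
    firing_freq = (feature_activations != 0).mean(axis=0)  # per-feature usage rate over the batch
    return float((firing_freq < threshold).mean() * 100.0)
```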
Acceptable Ranges:
- Excellent: 0-5%
- Good: 5-15%
- Acceptable: 15-25%
- Problematic: >25%
4. Feature Absorption (Decoder Weight Correlation)
Definition:
mean(cosine_similarity(decoder_weights)) - Average pairwise cosine similarity between decoder weight vectors.
Significance: Measures feature diversity and redundancy. High absorption indicates overlapping features; low absorption suggests more distinct, more interpretable features.
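A minimal sketch of this metric; taking absolute similarities and excluding the diagonal are implementation assumptions, not specified in the definition above:

```python
import numpy as np

def feature_absorption(decoder_weights: np.ndarray) -> float:
    """Mean absolute pairwise cosine similarity between decoder columns, shaped (d_model, n_features)."""
    W = decoder_weights / np.linalg.norm(decoder_weights, axis=0, keepdims=True)
    cos = np.abs(W.T @ W)                               # pairwise cosine similarities
    off_diag = cos[~np.eye(cos.shape[0], dtype=bool)]   # drop each feature's self-similarity
    return float(off_diag.mean())
```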
Healthy Ranges:
- Excellent: 0-0.15
- Good: 0.15-0.25
- Acceptable: 0.25-0.35
- Problematic: >0.35
🔄 In-Sample vs Out-of-Sample Evaluation
In-Sample Evaluation
Purpose: Assess SAE performance on the training distribution
- Metrics: All four core metrics
- Dataset: Same as training data
- Significance: Measures learning capacity and training success
When to Use:
- Training Monitoring: During training to track progress
- Capacity Assessment: Determine if SAE dimension is appropriate
- Baseline Establishment: Set performance expectations
Out-of-Sample (OOS) Evaluation
Purpose: Test generalization to unseen domains and distributions
- Metrics: Focus on reconstruction quality and sparsity consistency
- Datasets: Different domains, tasks, or distributions
- Significance: Measures real-world applicability
Why OOS is Critical:
- Generalization Test: Ensures SAE works beyond training data
- Robustness Assessment: Identifies overfitting to training distribution
- Practical Validation: Real-world usage involves unseen data
OOS Dataset Categories:
1. Domain Shift Datasets
- Purpose: Test performance on different text domains
- Examples: News → Fiction, Academic → Social Media
- Expected Behavior: Moderate performance degradation (20-40% drop)
2. Task Shift Datasets
- Purpose: Test performance on different NLP tasks
- Examples: Classification → Generation, QA → Summarization
- Expected Behavior: Variable performance (0-60% drop)
3. Distribution Shift Datasets
- Purpose: Test performance on different data distributions
- Examples: Different time periods, demographics, or sources
- Expected Behavior: Mild performance degradation (10-30% drop)
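The drop percentages above can be read as relative degradation against the in-sample score; a small sketch of that convention (the document does not define it formally, so this interpretation is an assumption):

```python
def relative_drop(in_sample_score: float, oos_score: float) -> float:
    """Relative performance drop (%) from in-sample to out-of-sample, e.g. on loss recovered."""
    return (in_sample_score - oos_score) / in_sample_score * 100.0

# Example: 70% loss recovered in-sample vs 49% on a fiction corpus
print(relative_drop(70.0, 49.0))  # 30.0 -> within the 20-40% domain-shift band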
🗂️ Dataset Selection Guidelines
Training Dataset Requirements
| Aspect | Minimum | Optimal | Maximum | Notes |
|--------|---------|---------|---------|-------|
| Samples | 100K | 1M-10M | 100M+ | More samples = better features |
| Tokens | 50M | 500M-5B | 50B+ | Context length × samples |
| Context Length | 256 | 512-1024 | 2048 | Longer context = richer patterns |
| Domains | 2-3 | 5-10 | 20+ | Multiple domains for generalization |
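The token row follows directly from samples × context length; a quick sanity check:

```python
# Token budget from the table above: tokens ≈ samples × context length
samples, context_length = 1_000_000, 512
total_tokens = samples * context_length  # 512,000,000 -> inside the optimal 500M-5B band
```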
Evaluation Dataset Selection
1. Primary Evaluation Dataset
- Purpose: Standard benchmark for comparison
- Characteristics: Clean, well-structured, representative
- Examples: WikiText, C4, OpenWebText
2. Domain-Specific Datasets
- Purpose: Test domain generalization
- Selection: Based on target application domains
- Examples: News (AG News), Reviews (IMDB), Code (CodeSearchNet)
3. Task-Specific Datasets
- Purpose: Test task generalization
- Selection: Based on target NLP tasks
- Examples: QA (SQuAD), Classification (GLUE), Generation (CNN/DailyMail)
🎯 Domain-Specific SAE Guidelines
Domain-Specific SAE Configurations
| Domain | Model Dim | SAE Dim | Top-K | Learning Rate | Batch Size | Focus |
|--------|-----------|---------|-------|---------------|------------|-------|
| General Language | 4096 | 2048 (50%) | 64 | 0.001 | 32 | Cross-domain generalization |
| Domain-Specific | 768 | 230 (30%) | 32 | 0.005 | 64 | Domain-specific features |
| Code Models | 1024 | 410 (40%) | 48 | 0.0005 | 16 | Syntax preservation |
| Multimodal | 2048 | 1229 (60%) | 96 | 0.0001 | 8 | Cross-modal consistency |
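One row of this table expressed as a plain config dict (the field names are illustrative, not a specific library's API):

```python
# "General Language" configuration from the table above
general_language_sae = {
    "model_dim": 4096,
    "sae_dim": 2048,        # 50% of model_dim
    "top_k": 64,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "focus": "cross-domain generalization",
}
```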
⚙️ Hyperparameter Guidelines
Model Size Hyperparameter Guidelines
| Model Size | Parameters | SAE Dim (% of Model Dim) | Top-K | Learning Rate | Batch Size | Epochs | Expected Performance |
|------------|------------|--------------------------|-------|---------------|------------|--------|----------------------|
| Small | <100M | 20% | 16-32 | 0.01 | 128 | 50-100 | 40-60% loss recovered, 20-80 L0, 5-20% dead |
| Medium | 100M-1B | 30% | 32-64 | 0.005 | 64 | 30-50 | 50-70% loss recovered, 40-120 L0, 3-15% dead |
| Large | 1B-10B | 40% | 64-128 | 0.001 | 32 | 20-30 | 60-80% loss recovered, 60-150 L0, 2-10% dead |
| Very Large | >10B | 50% | 128-256 | 0.0005 | 16 | 15-25 | 70-90% loss recovered, 80-200 L0, 1-8% dead |
Dataset Size Adjustments
| Dataset Size (samples) | Learning Rate | Batch Size | Epochs | Regularization | Expected Impact |
|------------------------|---------------|------------|--------|----------------|-----------------|
| Small (<1M) | ×0.5 | ×0.5 | ×2 | ×1.5 | Higher dead features (10-25%), lower reconstruction |
| Medium (1-10M) | ×1.0 | ×1.0 | ×1 | ×1.0 | Balanced performance |
| Large (>10M) | ×1.2 | ×1.5 | ×0.7 | ×0.8 | Lower dead features (2-8%), higher reconstruction |
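A small helper that applies these multipliers to a base configuration (the function and key names are assumptions for illustration):

```python
def adjust_for_dataset_size(base: dict, size: str) -> dict:
    """Apply the dataset-size multipliers from the table above to a base config."""
    multipliers = {  # (learning_rate, batch_size, epochs, regularization)
        "small":  (0.5, 0.5, 2.0, 1.5),
        "medium": (1.0, 1.0, 1.0, 1.0),
        "large":  (1.2, 1.5, 0.7, 0.8),
    }
    lr, bs, ep, reg = multipliers[size]
    return {
        **base,
        "learning_rate": base["learning_rate"] * lr,
        "batch_size": int(base["batch_size"] * bs),
        "epochs": int(base["epochs"] * ep),
        "regularization": base["regularization"] * reg,
    }

# Example: medium-model defaults scaled for a small (<1M sample) dataset
base = {"learning_rate": 0.005, "batch_size": 64, "epochs": 40, "regularization": 1.0}
small_config = adjust_for_dataset_size(base, "small")  # lr 0.0025, batch 32, epochs 80
```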
📊 Evaluation Workflow
1. Pre-Training Assessment
```python
# Baseline evaluation
baseline_metrics = evaluate_baseline(model, layer, dataset)
print(f"Baseline sparsity: {baseline_metrics['natural_sparsity']}")
print(f"Baseline variance: {baseline_metrics['total_variance']}")
```
2. Training Monitoring
```python
# Every N steps
if step % evaluation_frequency == 0:
    metrics = evaluate_sae(sae, model, layer, dataset)
    log_metrics(metrics, step)

    # Check for issues
    if metrics['dead_features'] > 0.25:
        print("Warning: High dead features detected")
```
3. Post-Training Evaluation
```python
# Comprehensive evaluation
datasets = ['wikitext', 'squad', 'glue', 'ag_news', 'imdb']
results = {}
for dataset in datasets:
    results[dataset] = evaluate_sae(sae, model, layer, dataset)

# Generate health report
health_report = generate_health_report(results)
```
🎯 Metrics & Success Criteria
| Metric | Minimum | Optimal | Model Adjustments |
|--------|---------|---------|-------------------|
| Loss Recovered | 40% | 70% | Small: ×0.8, Large: ×1.1 |
| L0 Sparsity | 20-300 | 40-120 | Small: ×0.7, Large: ×1.3 |
| Dead Features | ≤25% | ≤10% | Consistent across sizes |
| Feature Absorption | ≤0.35 | ≤0.25 | Consistent across sizes |
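These criteria translate directly into a pass/fail check; a minimal sketch against the minimum column (the metric key names are assumptions):

```python
def meets_minimums(metrics: dict) -> dict:
    """Check an evaluation result against the minimum criteria in the table above."""
    return {
        "loss_recovered":     metrics["loss_recovered"] >= 40.0,           # percent
        "l0_sparsity":        20 <= metrics["l0_sparsity"] <= 300,
        "dead_features":      metrics["dead_features"] <= 25.0,            # percent
        "feature_absorption": metrics["feature_absorption"] <= 0.35,
    }
```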
📈 Results from Trained SAEs
Trained SAE Results Summary
| Model | Layer | SAE Dim | Top-K | Loss Recovered | L0 Sparsity | Dead Features | Absorption | Status |
|-------|-------|---------|-------|----------------|-------------|---------------|------------|--------|
| Llama 3.1 8B | 1 | 1536 | 32 | 93.22% ✅ | 856.27 ❌ | 78.26% ❌ | 0.400 ❌ | Excellent reconstruction, poor sparsity |
| BERT-base | 6 | 200 | 32 | 28.82% ❌ | 94.99 ✅ | 0.00% ✅ | 0.156 ✅ | Perfect utilization, poor reconstruction |
| Gemma 3 270M | 6 | 200 | 64 | 0.00% ❌ | 103.37 ✅ | 11.00% ✅ | 0.114 ✅ | Good sparsity, failed reconstruction |
Cross-Dataset Performance (loss recovered)

| Model | WikiText | GLUE/CoLA | AG News | IMDB | Generalization |
|-------|----------|-----------|---------|------|----------------|
| BERT-base | 31.08% | 0.00% | 0.00% | 37.03% | ⚠️ Variable |
| Gemma 3 270M | 0.00% | 0.00% | 0.00% | 0.00% | ❌ Poor |
🔄 Continuous Improvement Framework
1. Regular Evaluation Schedule
- Weekly: Run comprehensive evaluation on all SAEs
- Monthly: Compare against benchmarks and previous versions
- Quarterly: Update evaluation methodology and metrics
2. Iterative Improvement Process
```python
def iterative_improvement(sae_config, evaluation_results):
    # Identify issues in the current evaluation
    issues = analyze_issues(evaluation_results)

    # Generate candidate improvements
    improvements = generate_improvements(issues)

    # Re-train / re-evaluate with the proposed changes
    new_results = test_improvements(sae_config, improvements)

    # Keep whichever run scores better on a project-defined summary metric
    # (aggregate_score is a placeholder; raw result dicts cannot be compared directly)
    if aggregate_score(new_results) > aggregate_score(evaluation_results):
        return new_results
    return evaluation_results
```
3. Documentation Requirements
- Configuration Tracking: Document all training parameters
- Result Archiving: Save detailed evaluation results
- Analysis Recording: Document insights and recommendations
- Comparison Tracking: Track improvements over time
This framework provides a comprehensive approach to SAE evaluation that can be applied across different domains, model sizes, and use cases. Regular updates should be made based on new research findings and practical experience.
Version: 1.0
Last Updated: Based on comprehensive SAE evaluation research
Methodology: SAEBench with theoretical foundations