1. SAE Toolkit

(The core of the product: training, auto-interpretation, steering, and feature search.)

Pages:

  • Training SAEs
    • Endpoint: /v1/sae/train
    • Parameters: layers, dict_size, sparsity, warm_start_sae_id (optional).
    • Best practices: which layers to pick, probe set size.
    • Metrics: explained variance, dead-rate, selectivity.
  • Feature Explorer
    • /v1/sae/features
    • Search features by ID, label, activation.
    • Example: “Feature 471 → negative sentiment in credit risk.”
  • Labeling Features
    • How to add human labels (/v1/sae/label).
    • Guidelines for good labels (not too broad, not too narrow).
  • Steering Playground
    • /v1/sae/steer/preview
    • Adjust feature weights → see generations before/after.
    • Example: turn down a hallucination-trigger feature, then rerun the answer.
  • Feature Drift (Optional Link to Alignment)
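
The training metrics named under Training SAEs can be illustrated with a small calculation. This is a minimal sketch that assumes the common definitions: explained variance as one minus the ratio of residual to total variance, and dead rate as the fraction of dictionary features that never fire on the probe set. The service's exact formulas are not stated here, and the function names are hypothetical.

```python
def explained_variance(inputs, reconstructions):
    """1 - (residual sum of squares / total sum of squares about the mean)."""
    n, d = len(inputs), len(inputs[0])
    mean = [sum(row[j] for row in inputs) / n for j in range(d)]
    residual = sum(
        (x - r) ** 2
        for row, rec in zip(inputs, reconstructions)
        for x, r in zip(row, rec)
    )
    total = sum((x - m) ** 2 for row in inputs for x, m in zip(row, mean))
    if total == 0:
        raise ValueError("inputs have zero variance")
    return 1.0 - residual / total


def dead_rate(feature_activations, eps=0.0):
    """Fraction of features whose activation never exceeds eps on any sample."""
    n_features = len(feature_activations[0])
    dead = sum(
        1
        for j in range(n_features)
        if all(row[j] <= eps for row in feature_activations)
    )
    return dead / n_features
```

A perfect reconstruction gives an explained variance of 1.0, and reconstructing every sample as the mean gives 0.0.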

2. Trading Signals

Pages:

  • Signals API
    • Endpoint details (/v1/trading/signals).
    • Example: BTC-USD daily signals.
  • Backtesting
    • How to run (/v1/trading/backtests).
    • Sample equity curve + Sharpe ratio output.
  • Feature Correlations
    • Explaining correlation heatmap.
    • Use cases: “Volume vs Price Uptrend,” “Sentiment vs Returns.”
  • Example Workflow
    • Python notebook: backtest BTC signals, visualize, export CSV.
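
For the backtesting pages, the Sharpe ratio and maximum drawdown (MDD) of an equity curve could be computed as sketched below. The sketch assumes a zero risk-free rate and 365 periods per year for daily crypto data; how the service itself annualizes or defines these metrics is not specified, and the function names are illustrative.

```python
import math
import statistics


def simple_returns(equity):
    """Period-over-period returns from an equity curve."""
    return [b / a - 1.0 for a, b in zip(equity, equity[1:])]


def sharpe_ratio(returns, periods_per_year=365):
    """Annualized Sharpe ratio, assuming a zero risk-free rate.

    Needs at least two returns, because the sample standard deviation is used.
    """
    stdev = statistics.stdev(returns)
    if stdev == 0:
        return 0.0  # convention chosen for this sketch; undefined in general
    return statistics.mean(returns) / stdev * math.sqrt(periods_per_year)


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst
```

For the curve [100, 110, 99, 121], the maximum drawdown is the fall from 110 to 99, or 10% of the peak.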

3. Audit & Assurance

Pages:

  • Audit Runs
    • How to trigger audit (/v1/audit/runs).
    • Choosing regulators (BIS/SEC/ECB).
  • Decision Traces
    • What traces show (layer, feature, weight).
    • Example: Loan rejection model → why it rejected.
  • Before vs After Visualization
    • Black box vs explained example.
  • Exportable Reports
    • /v1/audit/reports/{id}?format=pdf
    • Sample report screenshots.
  • Compliance Guidance
    • How reports help with regulators.

4. Alignment Guard (Fine-Tuning Check)

Pages:

  • Alignment Audits
    • /v1/alignment/audits
    • Inputs: base_model_id, finetuned_model_id, layers, sae_mode.
    • Outputs: drift heatmap, risky features, alignment score.
  • Metrics Explained
    • Drift Score (Δ activation).
    • Risk Feature Rate.
    • Alignment Correlation Index.
    • Hallucination disagreement %.
  • Certificates
    • Downloading PDF/JSON certificate.
    • Example use: “model alignment verified post fine-tuning.”
  • Case Study
    • How drift detection flagged a misaligned fine-tune.
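
One plausible reading of the Drift Score (Δ activation) is the per-feature change in mean SAE activation between the base and fine-tuned models on the same probe set. The sketch below works under that assumption only; the product's actual formula is not given here, and all names are hypothetical.

```python
def mean_activation(rows):
    """Column-wise mean of a samples-by-features activation table."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def feature_drift(base_acts, finetuned_acts):
    """Per-feature |mean(fine-tuned) - mean(base)| over the same probe set."""
    base_mean = mean_activation(base_acts)
    tuned_mean = mean_activation(finetuned_acts)
    return [abs(t - b) for b, t in zip(base_mean, tuned_mean)]


def top_drifting(drift, k=3):
    """Indices of the k features with the largest drift, largest first."""
    order = sorted(range(len(drift)), key=lambda i: drift[i], reverse=True)
    return order[:k]
```

Features returned by top_drifting would be natural candidates for manual review or for a drift heatmap.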

5. Red-Team & Harmful Feature Removal

Pages:

  • Red-Team Runs
    • /v1/redteam/runs
    • Choose strategies: jailbreak, bias, misinformation, PII.
  • Feature Steering Red-Team
    • /v1/redteam/feature-steer
    • Example: boosting harmful features to expose failures.
  • Mitigation Tools
    • /v1/mitigations/apply
    • Types: steer, prune, router, fine-tune penalty.
    • Example: clamp harmful features → lower the share of unsafe outputs.
  • Failure Mode Catalog
    • /v1/failure-modes
    • View discovered modes + linked features.
  • Red-Team Dashboard
    • Reports: broken prompts, harm categories, severity index.

API Spec - Examples

Auth
  • POST /v1/auth/token → { access_token, expires_in }
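
A client will usually want to cache the token and refresh it shortly before it expires. The sketch below handles only the response shape listed above. It assumes that expires_in is in seconds and that the token is sent as a Bearer header; neither is stated in this spec, and the class and function names are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class Token:
    access_token: str
    expires_at: float  # absolute time, in epoch seconds

    def is_expired(self, now=None, skew=30.0):
        """True once we are within `skew` seconds of expiry."""
        now = time.time() if now is None else now
        return now >= self.expires_at - skew

    def auth_header(self):
        # Assumption: Bearer scheme; the spec does not name the scheme.
        return {"Authorization": f"Bearer {self.access_token}"}


def token_from_response(payload, issued_at):
    """Build a Token from the { access_token, expires_in } response."""
    # Assumption: expires_in is a lifetime in seconds.
    return Token(payload["access_token"], issued_at + float(payload["expires_in"]))
```

The skew margin avoids sending a token that expires while the request is in flight.
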
Common & Catalog
  • GET /v1/models → list deployed base/fine-tuned models
  • GET /v1/models/{model_id} → detail
  • GET /v1/checkpoints?model_id=... → training/fine-tune checkpoints
  • POST /v1/datasets/probes (JSON or file) → register probe set (benign, safety, finance)
  • GET /v1/datasets/probes → list probe sets
  • POST /v1/uploads (multipart) → upload documents or CSVs (e.g., finance data)
  • GET /v1/exports/{export_id} → signed link
  • POST /v1/webhooks → subscribe to job status (job_succeeded, job_failed)
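
The catalog endpoints above take path and query parameters that should be URL-encoded. The helper below is a small sketch of that; the base URL is a placeholder (no host is given in this spec), and the function names are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.example.com"  # placeholder host, not from the spec


def build_url(path, **params):
    """Join the base URL, a path, and any non-None query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def model_url(model_id):
    """GET /v1/models/{model_id}, with the ID percent-encoded."""
    return build_url(f"/v1/models/{quote(str(model_id), safe='')}")


def checkpoints_url(model_id):
    """GET /v1/checkpoints?model_id=..."""
    return build_url("/v1/checkpoints", model_id=model_id)
```

The same helper covers list pagination, for example build_url("/v1/datasets/probes", page=2, limit=50).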

1. Interpretability-as-a-Service (SAE)

  • POST /v1/sae/train
    • body: { model_id, layers:[...], target_streams:["resid_pre","mlp_out"], dict_size, sparsity, epochs, probe_set_id, warm_start_sae_id? }
    • → { sae_job_id }
  • GET /v1/sae/jobs/{sae_job_id} → status, metrics (explained_var, dead_rate)
  • GET /v1/sae/models?model_id=... → list trained SAEs
  • POST /v1/sae/label
    • body: { sae_id, feature_id, label, tags[], notes }
    • → { feature_id, label }
  • GET /v1/sae/features?sae_id=...&q=...&risk=true|false
    • → feature cards (id, label, examples, selectivity)
  • POST /v1/sae/steer/preview
    • body: { model_id, sae_id, edits:[{feature_id, scale}], prompts[] }
    • → side-by-side generations + activation deltas
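
The request bodies for training and steering can be assembled as plain dictionaries before being sent as JSON. The sketch below follows the field names listed above. The helper names and the choice to omit warm_start_sae_id when it is unset are assumptions, and the spec does not say whether scale is multiplicative or additive.

```python
import json


def sae_train_body(model_id, layers, dict_size, sparsity, epochs, probe_set_id,
                   target_streams=("resid_pre", "mlp_out"),
                   warm_start_sae_id=None):
    """Body for POST /v1/sae/train."""
    if not layers:
        raise ValueError("at least one layer is required")
    body = {
        "model_id": model_id,
        "layers": list(layers),
        "target_streams": list(target_streams),
        "dict_size": dict_size,
        "sparsity": sparsity,
        "epochs": epochs,
        "probe_set_id": probe_set_id,
    }
    if warm_start_sae_id is not None:  # optional field, sent only when set
        body["warm_start_sae_id"] = warm_start_sae_id
    return body


def steer_preview_body(model_id, sae_id, feature_scales, prompts):
    """Body for POST /v1/sae/steer/preview from a {feature_id: scale} map."""
    edits = [{"feature_id": f, "scale": s} for f, s in feature_scales.items()]
    return {"model_id": model_id, "sae_id": sae_id,
            "edits": edits, "prompts": list(prompts)}
```

For instance, json.dumps(steer_preview_body("m1", "s1", {471: 0.2}, ["Assess this loan."])) yields a payload that scales feature 471; the IDs and values here are made up for illustration.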

2. Trading Signals

  • GET /v1/trading/signals?symbol=BTC-USD&window=1d&features=...&model_id=...
    • → { timestamp, signal: "buy|sell|hold", score, rationale, feature_contrib[] }[]
  • POST /v1/trading/backtests
    • body: { model_id, symbols[], start, end, features[], target, split: {train, test}, metrics[] }
    • → { backtest_id }
  • GET /v1/trading/backtests/{backtest_id}
    • → curves, metrics (Sharpe, MDD, hit-rate), trades, feature importance
  • GET /v1/trading/feature-corr?symbols[]=...&features[]=...&model_id=...
    • → correlation matrix + p-values
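
A client consuming the signals endpoint might build the query string and then tally the returned signals. The sketch below assumes the response is a JSON array of objects with the fields listed above and that score is numeric; the helper names and the min_score filter are hypothetical. The features parameter is left out because its encoding is not specified.

```python
from collections import Counter
from urllib.parse import urlencode


def signals_path(symbol, window, model_id):
    """Path and query for GET /v1/trading/signals (features omitted)."""
    query = urlencode({"symbol": symbol, "window": window, "model_id": model_id})
    return f"/v1/trading/signals?{query}"


def summarize_signals(signals, min_score=0.0):
    """Count buy/sell/hold among signals whose score is at least min_score."""
    kept = [s for s in signals if s["score"] >= min_score]
    return dict(Counter(s["signal"] for s in kept))
```

With made-up rows scored 0.9 (buy), 0.2 (sell), and 0.6 (hold), a min_score of 0.5 leaves one buy and one hold.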

3. Audit & Assurance

  • POST /v1/audit/runs
    • body: { model_id, probe_set_id, regs: ["BIS","SEC","ECB"], views:["before_after","trace"], outputs:["pdf","json"] }
    • → { audit_id }
  • GET /v1/audit/runs/{audit_id}
    • → status, key findings, before vs after visuals (URIs), trace snapshots
  • GET /v1/audit/reports/{audit_id}?format=pdf|json
    • → downloadable report
  • GET /v1/audit/traces?model_id=...&layer=...&token_range=...
    • → decision trace frames (layer/feature/weight)
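
Validating the regs list and the report format on the client side catches typos before a long audit job starts. The sketch below assumes the three values shown above (BIS, SEC, ECB) and the two formats (pdf, json) are the only ones accepted, which the spec does not confirm; the helper names are hypothetical.

```python
from urllib.parse import quote

KNOWN_REGS = {"BIS", "SEC", "ECB"}   # assumption: the full accepted set
REPORT_FORMATS = {"pdf", "json"}


def audit_run_body(model_id, probe_set_id, regs,
                   views=("before_after", "trace"), outputs=("pdf", "json")):
    """Body for POST /v1/audit/runs."""
    unknown = sorted(set(regs) - KNOWN_REGS)
    if unknown:
        raise ValueError(f"unknown regs: {unknown}")
    return {
        "model_id": model_id,
        "probe_set_id": probe_set_id,
        "regs": list(regs),
        "views": list(views),
        "outputs": list(outputs),
    }


def report_path(audit_id, fmt="pdf"):
    """Path for GET /v1/audit/reports/{audit_id}?format=pdf|json."""
    if fmt not in REPORT_FORMATS:
        raise ValueError(f"format must be one of {sorted(REPORT_FORMATS)}")
    return f"/v1/audit/reports/{quote(str(audit_id), safe='')}?format={fmt}"
```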

4. Fine-Tuning Alignment Guard (post-training)

  • POST /v1/alignment/audits
    • body: { base_model_id, finetuned_model_id, probe_set_id, layers:[...], sae_mode:"train_new|project_base", metrics:["drift","risk_corr","hallucination","attention_health","tuned_lens"] }
    • → { alignment_audit_id }
  • GET /v1/alignment/audits/{alignment_audit_id}
    • → { score, drift_heatmap_uri, risky_features[], examples[], metrics:{KL, CKA, probe_drift, outlier_pct, disagreement_pct} }
  • GET /v1/alignment/certificates/{alignment_audit_id}?format=pdf|json
    • → signed certificate/report
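
An alignment audit request and a compact summary of its result might look like the sketch below. It assumes sae_mode takes exactly the two values shown above and that metrics names are limited to the five listed; the helper names and the shape of the summary are hypothetical.

```python
SAE_MODES = {"train_new", "project_base"}
METRICS = ("drift", "risk_corr", "hallucination", "attention_health", "tuned_lens")


def alignment_audit_body(base_model_id, finetuned_model_id, probe_set_id, layers,
                         sae_mode="project_base", metrics=METRICS):
    """Body for POST /v1/alignment/audits."""
    if sae_mode not in SAE_MODES:
        raise ValueError(f"sae_mode must be one of {sorted(SAE_MODES)}")
    unknown = [m for m in metrics if m not in METRICS]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    return {
        "base_model_id": base_model_id,
        "finetuned_model_id": finetuned_model_id,
        "probe_set_id": probe_set_id,
        "layers": list(layers),
        "sae_mode": sae_mode,
        "metrics": list(metrics),
    }


def audit_headline(result):
    """Pick out the fields most useful for a one-line status message."""
    return {
        "score": result["score"],
        "risky_feature_count": len(result.get("risky_features", [])),
        "disagreement_pct": result.get("metrics", {}).get("disagreement_pct"),
    }
```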

5. Red-Team & Harmful Feature Removal

  • POST /v1/redteam/runs
    • body: { model_id, probe_set_id?, strategies:["prompt_gen","jailbreak","bias","pii","misinfo"], rounds: number }
    • → { redteam_id }
  • GET /v1/redteam/runs/{redteam_id}
    • → findings, prompts that broke guardrails, category scores
  • POST /v1/redteam/feature-steer
    • body: { model_id, sae_id, edits:[{feature_id, scale}], attack_prompts[] }
    • → harm rate before/after, sample outputs
  • POST /v1/mitigations/apply
    • body: { model_id, type:"steer|prune|router|finetune_penalty", params:{...} }
    • → { mitigation_id }
  • GET /v1/mitigations/{mitigation_id} → status, diff metrics
  • GET /v1/failure-modes?model_id=...
    • → catalog (mode, triggers, linked features, severity)
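
The red-team and mitigation request bodies share a pattern: a fixed vocabulary (strategies, mitigation types) plus free-form details. The sketch below validates the vocabulary against the values listed above and passes params through untouched, since its structure is not specified. Helper names are hypothetical.

```python
STRATEGIES = {"prompt_gen", "jailbreak", "bias", "pii", "misinfo"}
MITIGATION_TYPES = {"steer", "prune", "router", "finetune_penalty"}


def redteam_run_body(model_id, strategies, rounds, probe_set_id=None):
    """Body for POST /v1/redteam/runs; probe_set_id is optional."""
    unknown = sorted(set(strategies) - STRATEGIES)
    if unknown:
        raise ValueError(f"unknown strategies: {unknown}")
    body = {"model_id": model_id, "strategies": list(strategies), "rounds": rounds}
    if probe_set_id is not None:
        body["probe_set_id"] = probe_set_id
    return body


def feature_steer_body(model_id, sae_id, feature_ids, scale, attack_prompts):
    """Body for POST /v1/redteam/feature-steer, one scale for all features."""
    edits = [{"feature_id": f, "scale": scale} for f in feature_ids]
    return {"model_id": model_id, "sae_id": sae_id,
            "edits": edits, "attack_prompts": list(attack_prompts)}


def mitigation_body(model_id, mitigation_type, params):
    """Body for POST /v1/mitigations/apply; params is passed through as-is."""
    if mitigation_type not in MITIGATION_TYPES:
        raise ValueError(f"type must be one of {sorted(MITIGATION_TYPES)}")
    return {"model_id": model_id, "type": mitigation_type, "params": dict(params)}
```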

Standard responses & jobs

  • All long-running operations return {job_id}; poll GET /v1/jobs/{job_id}
  • Every list supports ?page&limit&sort&filter=...
  • Errors: { error:{ code, message, details? } }
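
The polling and error conventions above can be wrapped in a small client-side loop. The sketch below takes the HTTP call as an injected fetch function so it stays transport-agnostic. The status field name and the terminal values "succeeded" and "failed" are assumptions (loosely based on the job_succeeded and job_failed webhook events), as are the class and function names.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed"}  # assumed status values


class ApiError(Exception):
    """Raised for the { error: { code, message, details? } } envelope."""

    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def raise_for_error(payload):
    """Return payload unchanged, or raise ApiError if it is an error envelope."""
    err = payload.get("error") if isinstance(payload, dict) else None
    if err:
        raise ApiError(err["code"], err["message"], err.get("details"))
    return payload


def poll_job(fetch, job_id, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch('/v1/jobs/{job_id}') until the job reaches a terminal status."""
    for _ in range(max_attempts):
        job = raise_for_error(fetch(f"/v1/jobs/{job_id}"))
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Where webhooks are available, subscribing via POST /v1/webhooks avoids polling altogether; the loop is a fallback for simple scripts.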
