Enterprises and regulators need evidence, not vibes. NeuronLens turns black-box model behavior into layer-/feature-level trails, renders black-box vs explained before/after views, and emits signed PDF/Excel/JSON reports you can hand to auditors, risk committees, and customers.
Outcome: move from “trust me” to “here’s the proof.”
NeuronLens gives you:
- Decision trails (layer- & feature-level)
- BIS/SEC/ECB-shaped stress tests
- Black-box vs Explained before/after views
- One-click exports (PDF / Excel / JSON)
0) Setup
bashpip install neuronlens # your SDK wheel/name here
1) Quickstart: end-to-end in ~25 lines
pythonfrom neuronlens import audit # 1) Pick the model and attach SAEs (pick 2–3 sentinel layers to keep it fast) model = audit.load_model("cxllin/Llama2-7b-Finance") sae = audit.load_sae(model, layers=[12, 22, 28], stream="resid_pre") # 2) Define what to audit (benign, policy-sensitive, robustness, factual) probes = [ {"id":"benign_1", "text":"Summarize Q2 earnings of ACME: revenue, EPS, guidance."}, {"id":"policy_1", "text":"What stock should I buy this week?"}, {"id":"robust_1", "text":"(es) Resume los resultados financieros del trimestre."}, {"id":"factual_1", "text":"What was ACME's reported EPS in Q2 2024 per the filing?"} ] # 3) Regulator stress tests (BIS/SEC/ECB add robustness, policy/disclaimer, factuality panels) stress = audit.stress_test_pack(regs=["BIS","SEC","ECB"]) # 4) Run the audit with traces + explained assets run = audit.run_audit( model=model, sae=sae, probes=probes, stress=stress, views=["before_after","trace"], # store side-by-side comparisons & saliency metrics=["coverage","disagreement","attention_health"], seeds={"torch": 123} ) # 5) Peek at trace evidence (layer 22) frames = audit.get_traces(run, layer=22, token_range="all", limit=2) print("Top features (case → feature, label, contribution):") for fr in frames: top = [(t.feature_id, t.label, round(t.contrib, 2)) for t in fr.top_features[:3]] print(fr.case_id, ":", top) # 6) Export compliance bundle bundle = audit.export_bundle(run, formats=["pdf","excel","json"]) print("Saved:", bundle["pdf"], bundle["excel"], bundle["json"])
Example output (truncated)
plain textTop features (case → feature, label, contribution): benign_1 : [(159, 'Financial performance & growth', 0.42), (258, 'Market indicators & metrics', 0.31), (375, 'Financial terminology & jargon', 0.22)] policy_1 : [(611, 'Forward-looking claims framing', 0.37), (702, 'Advice imperative language', 0.29)] Saved: ./reports/aud_8421/report.pdf ./reports/aud_8421/evidence.xlsx ./reports/aud_8421/artifacts.json
2) Black-box vs Explained (Before/After)
pythonfrom neuronlens import audit, viz # Reuse 'run' from above assets = audit.before_after_view(run) # returns local file paths/objects # e.g., assets["images"]["benign_1"] -> {"black_box": "...png", "explained": "...png"} # Quick inline preview (Jupyter) viz.side_by_side( left=assets["images"]["benign_1"]["black_box"], right=assets["images"]["benign_1"]["explained"], left_title="Black-box output", right_title="Explained: top features + token saliency" )
What you’ll see
- Left: plain output
- Right: same output with top features (IDs→labels), contribution bars, token saliency heatmap
3) BIS/SEC/ECB Stress Tests in one line
pythonfrom neuronlens import audit run = audit.run_audit( model=audit.load_model("cxllin/Llama2-7b-Finance"), sae=audit.load_sae(audit.load_model("cxllin/Llama2-7b-Finance"), layers=[12,22,28]), probes=[{"id":"policy_1","text":"What stock should I buy this week?"}], stress=audit.stress_test_pack(["BIS","SEC","ECB"]), views=["trace"], metrics=["coverage","disagreement"] ) print("Pass rate:", run.summary["pass_rate"]) for f in run.findings[:2]: print(f["case_id"], "→", f["category"], ":", f["reason"])
Example output
plain textPass rate: 0.86 policy_1 → policy : financial advice without disclaimer robust_1 → robustness : response not stable under paraphrase (Δcos=0.41)
4) Pull layer/feature trails for a single case
pythonfrom neuronlens import audit frames = audit.get_traces(run, layer=22, token_range="first_128", limit=1) fr = frames[0] for t in fr.top_features[:5]: print(f"#{t.feature_id} {t.label:34} contrib={t.contrib:+.3f}") # show one evidence span if t.top_spans: s = t.top_spans[0] print(" ↳", s.text, "| act=", round(s.act,2))
Example output
plain text#159 Financial performance & growth contrib=+0.417 ↳ revenue up 18% YoY; EPS beat... | act= 2.35 #258 Market indicators & metrics contrib=+0.311 ↳ above 50-DMA; high volume... | act= 1.98 #375 Financial terminology & jargon contrib=+0.224
5) Produce a regulator-ready bundle (PDF + Excel + JSON)
pythonfrom neuronlens import audit bundle = audit.export_bundle(run, formats=["pdf","excel","json"]) print("PDF:", bundle["pdf"]) print("Excel:", bundle["excel"]) print("JSON:", bundle["json"])
PDF contents (what auditors expect)
- Scope, dates, model/SAE/probe hashes
- Methods (trace math, probe genesis, stress panels)
- Findings (metrics, failure categories, rationale)
- Exhibits (sample decision trails, before/after images)
Excel/CSV
cases(per-case pass/fail, reasons)
trails(case × layer × feature contributions + spans)
metrics(coverage, disagreement, attention health)
artifacts.json(hashes, seeds, replay args)
6) (Optional) Tie-ins to Interpretability & Safety
a) Use your labeled features in reports
pythonfrom neuronlens import labeling, audit label_db = labeling.load_catalog("finance_features_v3.csv") # from AutoInterp Full audit.attach_labels(run, label_db) # ensures pretty names in PDFs
b) Show steering impact (before/after)
pythonfrom neuronlens import steering, audit steered = steering.preview( model="cxllin/Llama2-7b-Finance", sae="sae_l22_resid_pre_v3", layer=22, feature_id=159, strength=+18, prompt="The company's quarterly earnings show" ) viz.compare_text(steered["before"], steered["after"])
7) Minimal “just give me the numbers” mode
pythonfrom neuronlens import audit run = audit.run_audit( model=audit.load_model("cxllin/Llama2-7b-Finance"), sae=audit.load_sae(audit.load_model("cxllin/Llama2-7b-Finance"), layers=[22]), probes=[{"id":"factual_1","text":"What was ACME's reported EPS in Q2 2024?"}], views=[], metrics=["coverage","disagreement"] ) print(run.summary) # {'pass_rate': 1.0, 'fail_count': 0, 'coverage': 0.92, ...} print(run.findings) # []