Fine-tuning lets teams adapt a base LLM to their domain. But it can also introduce unintended behaviors.
- Train on bad data → the model may generalize in harmful ways.
- Train in one narrow domain → the model may drift into unsafe responses elsewhere.
This is called misalignment: when a model starts behaving in ways you did not intend or want.
## Current State
Most teams only check fine-tunes with:
- Loss curves (does the model fit the data?)
- Eval sets (does accuracy improve on examples?)
These don't catch behavioral drift. Several leading AI companies have started exploring feature-level monitoring with sparse autoencoders (SAEs) to track risky activations during or after fine-tuning.
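The core idea behind SAE-based monitoring can be sketched in a few lines. This is a toy illustration, not NeuronLens or any real SAE library: `sae_encode`, the random feature dictionary, and the synthetic activations are all stand-ins for a trained sparse autoencoder applied to real residual-stream activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sae_encode(activations, dictionary):
    """Project activations onto sparse feature directions (ReLU keeps them non-negative)."""
    return np.maximum(activations @ dictionary.T, 0.0)

d_model, n_features, n_tokens = 16, 32, 100
dictionary = rng.normal(size=(n_features, d_model))  # stand-in for a trained SAE

base_acts = rng.normal(size=(n_tokens, d_model))
# Pretend fine-tuning perturbed the activations on the same prompts:
ft_acts = base_acts + 0.5 * rng.normal(size=(n_tokens, d_model))

base_feats = sae_encode(base_acts, dictionary).mean(axis=0)
ft_feats = sae_encode(ft_acts, dictionary).mean(axis=0)

drift = np.abs(ft_feats - base_feats)  # per-feature drift in mean activation
print("Most-drifted feature id:", int(drift.argmax()))
```

Features whose mean activation shifts sharply between the two checkpoints are candidates for human review, which is what the tooling below automates.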
## NeuronLens Value-Add
NeuronLens provides a post-fine-tuning alignment audit for any model:
- **Drift Heatmaps**: visualize which features changed
- **Risky Feature Reports**: surface features tied to misaligned personas, unsafe code, and suppressed disclaimers
- **Alignment Certificates**: regulator- or enterprise-ready evidence bundles (PDF/JSON)
**Bottom line:** after fine-tuning, you can show concrete evidence that your model is still aligned.
## Quickstart Example
```python
from neuronlens import alignment

# 1) Load base and fine-tuned checkpoints
base = alignment.load_model("meta-llama/Llama-2-7b-hf")
ft = alignment.load_model("./checkpoints/llama2-ft-domain")

# 2) Attach SAE (trained pre-finetune)
sae = alignment.load_sae(base, layers=[12, 22, 28])

# 3) Compare for alignment
report = alignment.compare_models(
    base_model=base,
    ft_model=ft,
    sae=sae,
    probes=["finance", "legal", "customer_support"],
    metrics=["drift", "misalignment_score"],
)
print(report.summary())
```
### Sample Output
```text
Post-finetuning audit summary:
  Drift detected at layer 22 (+0.31)
  Misalignment score: 27% (threshold=20%)
  Risky features: [#4721 'Persona latent', #205 'Fraud terminology']
```
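One plausible way such a score could be derived (this is our assumption for illustration, not the documented NeuronLens metric): the fraction of probed features whose drift exceeds a per-feature tolerance, compared against a global pass/fail threshold.

```python
def misalignment_score(drifts, tolerance=0.25):
    """Hypothetical score: fraction of features drifting beyond the tolerance."""
    flagged = [d for d in drifts if d > tolerance]
    return len(flagged) / len(drifts)

# Illustrative per-feature drift values:
drifts = [0.31, 0.05, 0.40, 0.10, 0.02, 0.28, 0.12, 0.33, 0.07, 0.01]
score = misalignment_score(drifts)
print(f"Misalignment score: {score:.0%} (threshold=20%)")
# 4 of the 10 features exceed the tolerance, so the audit would fail a 20% threshold.
```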
## Drift Heatmap
```python
alignment.drift_heatmap(report, layer=22).show()
```
Shows red-flag features with large activation changes compared to the base model.
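Under the hood, a drift heatmap is just a layer-by-feature matrix of activation changes. A toy reconstruction of that data, with hypothetical layer indices and an artificially injected change:

```python
import numpy as np

rng = np.random.default_rng(1)
layers, n_features = [12, 22, 28], 8

base = rng.normal(size=(len(layers), n_features))  # mean activations per (layer, feature)
ft = base.copy()
ft[1, 3] += 4.0  # inject a large change at layer 22, feature 3

heat = np.abs(ft - base)  # this matrix is what a heatmap would render
layer_idx, feat_idx = np.unravel_index(heat.argmax(), heat.shape)
print(f"Largest drift at layer {layers[layer_idx]}, feature {feat_idx}")
# → Largest drift at layer 22, feature 3
```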
## Risky Features
```python
risky = alignment.risky_features(report, topk=5)
print(risky)
```
### Example Output
```text
#4721  Persona latent          drift=+4.7×
#205   Fraud/Scam terminology  drift=+2.1×
#1199  Disclaimers suppressed  drift=+1.8×
```
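Ranking features by the ratio of fine-tuned to base mean activation would produce a report of this shape. The ids and labels below mirror the sample output, but the activation numbers and the code are illustrative, not real NeuronLens internals:

```python
# feature id -> (label, base mean activation, fine-tuned mean activation)
features = {
    4721: ("Persona latent", 0.10, 0.47),
    205: ("Fraud/Scam terminology", 0.20, 0.42),
    1199: ("Disclaimers suppressed", 0.15, 0.27),
    88: ("Greeting style", 0.30, 0.31),  # barely moved, should rank last
}

ranked = sorted(
    ((fid, label, ft / base) for fid, (label, base, ft) in features.items()),
    key=lambda item: item[2],
    reverse=True,
)
for fid, label, ratio in ranked[:3]:
    print(f"#{fid}\t{label}\tdrift=+{ratio:.1f}×")
```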
## Export Certificate
```python
certificate = alignment.export_certificate(report, formats=["pdf", "json"])
print("Certificate:", certificate["pdf"])
```
### PDF Includes
- Model & checkpoint hashes
- Drift metrics (layer-wise)
- Risky feature IDs + labels
- Pass/Fail alignment status
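A minimal sketch of what such an evidence bundle might contain, serialized as JSON. The schema, the `file_hashes_stub` helper, and the numbers are assumptions for illustration, not the actual certificate format:

```python
import hashlib
import json

def file_hashes_stub(paths):
    # Hypothetical: hashes the path strings; a real bundle would hash checkpoint files.
    return {p: hashlib.sha256(p.encode()).hexdigest()[:16] for p in paths}

certificate = {
    "models": file_hashes_stub(
        ["meta-llama/Llama-2-7b-hf", "./checkpoints/llama2-ft-domain"]
    ),
    "drift": {"layer_12": 0.04, "layer_22": 0.31, "layer_28": 0.09},  # layer-wise metrics
    "risky_features": [4721, 205, 1199],
    "status": "FAIL" if 0.27 > 0.20 else "PASS",  # score vs. threshold
}
print(json.dumps(certificate, indent=2))
```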
## When to Use
- After any fine-tune (domain, safety, RLHF, instruction-tuning)
- Before production deployment
- For enterprise/regulatory submissions