Fine-Tuning Alignment

Fine-tuning lets teams adapt a base LLM to their domain. But it can also introduce unintended behaviors.
  • Train on bad data β†’ model may generalize in harmful ways.
  • Train in one narrow domain β†’ model may drift into unsafe responses elsewhere.
This is misalignment: the model starts behaving in ways you did not intend or want.

Current State

Most teams validate a fine-tune with only:
  • Loss curves (does the model fit the training data?)
  • Eval sets (does accuracy improve on held-out examples?)
Neither catches behavioral drift. Several leading AI companies have started exploring feature-level monitoring with sparse autoencoders (SAEs) to track risky activations during or after fine-tuning.
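
For intuition, here is a minimal sketch of what feature-level monitoring looks like, on toy numpy data. The SAE weights, activations, and sizes are placeholders for illustration, not NeuronLens internals:

```python
import numpy as np

# Toy sketch of SAE feature-level monitoring (placeholder data, not the
# NeuronLens implementation). An SAE re-expresses dense hidden states as
# sparse, nameable features; drift shows up as per-feature shifts.
rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 64, 256, 1000

W_enc = rng.normal(size=(n_features, d_model))  # pretrained SAE encoder
b_enc = np.zeros(n_features)

def sae_features(hidden):
    """Encode hidden states into sparse features: ReLU(W h + b)."""
    return np.maximum(hidden @ W_enc.T + b_enc, 0.0)

h_base = rng.normal(size=(n_tokens, d_model))              # base model
h_ft = h_base + rng.normal(scale=0.1, size=h_base.shape)   # fine-tuned

# Per-feature change in mean activation between the two checkpoints.
drift = sae_features(h_ft).mean(axis=0) - sae_features(h_base).mean(axis=0)
print("most-shifted features:", np.argsort(-np.abs(drift))[:5])
```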

πŸš€ NeuronLens Value-Add

NeuronLens provides a post-fine-tuning alignment audit for any model:
  • πŸ”Ž Drift Heatmaps β†’ visualize which features changed
  • ⚠️ Risky Feature Reports β†’ surface features tied to misaligned personas, unsafe code, suppressed disclaimers
  • πŸ“œ Alignment Certificates β†’ regulator- or enterprise-ready evidence bundles (PDF/JSON)
Bottom line: after fine-tuning, you can prove your model is still aligned.

πŸ§ͺ Quickstart Example

```python
from neuronlens import alignment

# 1) Load base and fine-tuned checkpoints
base = alignment.load_model("meta-llama/Llama-2-7b-hf")
ft = alignment.load_model("./checkpoints/llama2-ft-domain")

# 2) Attach SAE (trained pre-finetune)
sae = alignment.load_sae(base, layers=[12, 22, 28])

# 3) Compare for alignment
report = alignment.compare_models(
    base_model=base,
    ft_model=ft,
    sae=sae,
    probes=["finance", "legal", "customer_support"],
    metrics=["drift", "misalignment_score"],
)
print(report.summary())
```
Sample Output
```text
Post-finetuning audit summary:
Drift detected at layer 22 (+0.31)
Misalignment score: 27% (threshold=20%) ❌
Risky features: [#4721 'Persona latent', #205 'Fraud terminology']
```
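
The report can also gate a pipeline programmatically. The sketch below assumes the report object exposes a `misalignment_score` attribute matching the summary text; that accessor is an assumption, not confirmed API:

```python
# Hypothetical CI gate on the audit above. The attribute name
# `misalignment_score` is an assumption inferred from the summary text.
def gate_deployment(report, threshold: float = 0.20) -> None:
    score = report.misalignment_score
    if score > threshold:
        raise SystemExit(f"Alignment audit failed: {score:.0%} > {threshold:.0%}")

gate_deployment(report)  # would raise on the 27% example above
```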

πŸ”₯ Drift Heatmap

```python
alignment.drift_heatmap(report, layer=22).show()
```
The heatmap highlights red-flag features whose activations changed most relative to the base model.
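
For a concrete sense of the numbers behind such a heatmap, the sketch below computes a per-feature drift multiplier (mean fine-tuned activation over mean base activation) on toy arrays; it illustrates the metric only, not NeuronLens internals:

```python
import numpy as np

# Illustrative drift multiplier per SAE feature (toy data): mean
# activation after fine-tuning divided by mean activation before.
# A value like 4.7x marks a strongly drifting feature.
rng = np.random.default_rng(1)
base_acts = rng.random((1000, 256))                    # tokens x features
ft_acts = base_acts * rng.uniform(0.5, 5.0, size=256)  # simulated drift

multiplier = ft_acts.mean(axis=0) / (base_acts.mean(axis=0) + 1e-8)
flagged = np.where(multiplier > 2.0)[0]                # example threshold
print(f"{flagged.size} features drifted more than 2x")
```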

πŸ“‹ Risky Features

```python
risky = alignment.risky_features(report, topk=5)
print(risky)
```
Example Output
```text
#4721  Persona latent          drift=+4.7×
#205   Fraud/Scam terminology  drift=+2.1×
#1199  Disclaimers suppressed  drift=+1.8×
```
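
One way to act on this list is to keep the strongly drifting feature IDs on a watchlist for ongoing monitoring. The record fields below (`id`, `label`, `drift`) are guesses about the return type, not documented API:

```python
# Hypothetical post-processing of `risky` from above. The field names
# (.id, .label, .drift) are assumptions, not documented NeuronLens API.
watchlist = {f.id: f.label for f in risky if f.drift >= 2.0}
print(watchlist)  # e.g. {4721: 'Persona latent', 205: 'Fraud/Scam terminology'}
```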

πŸ›οΈ Export Certificate

```python
certificate = alignment.export_certificate(report, formats=["pdf", "json"])
print("Certificate:", certificate["pdf"])
```
PDF Includes
  • Model & checkpoint hashes
  • Drift metrics (layer-wise)
  • Risky feature IDs + labels
  • Pass/Fail alignment status
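
If a downstream system consumes the JSON variant, a minimal sketch is shown below, continuing from the export call above; the field names (`status`, `model_hash`) are assumptions, not a documented schema:

```python
import json

# Hypothetical consumer of the JSON evidence bundle exported above.
# The field names ("status", "model_hash") are assumptions, not a
# schema documented by NeuronLens.
with open(certificate["json"]) as f:
    bundle = json.load(f)

assert bundle.get("status") == "pass", "failing checkpoints must not ship"
print("audited model:", bundle.get("model_hash"))
```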

βœ… When to Use

  • After any fine-tune (domain, safety, RLHF, instruction-tuning)
  • Before production deployment
  • For enterprise/regulatory submissions