Fine-Tuning Alignment

Fine-tuning lets teams adapt a base LLM to their domain. But it can also introduce unintended behaviors.
  • Train on bad data β†’ model may generalize in harmful ways.
  • Train in one narrow domain β†’ model may drift into unsafe responses elsewhere.
This is misalignment: the model starts behaving in ways you did not intend or want.

Current State

Most teams validate a fine-tune with only:
  • Loss curves (does the model fit the training data?)
  • Eval sets (does accuracy improve on held-out examples?)
Neither catches behavioral drift. Several leading AI companies have started exploring feature-level monitoring with sparse autoencoders (SAEs) to track risky activations during or after fine-tuning.
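
For intuition, here is a minimal sketch of what feature-level monitoring looks like, on toy numpy data. The SAE weights, activations, and sizes are placeholders for illustration, not NeuronLens internals:

```python
import numpy as np

# Toy sketch of SAE feature-level monitoring (placeholder data, not the
# NeuronLens implementation). An SAE re-expresses dense hidden states as
# sparse, nameable features; drift shows up as per-feature shifts.
rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 64, 256, 1000

W_enc = rng.normal(size=(n_features, d_model))  # pretrained SAE encoder
b_enc = np.zeros(n_features)

def sae_features(hidden):
    """Encode hidden states into sparse features: ReLU(W h + b)."""
    return np.maximum(hidden @ W_enc.T + b_enc, 0.0)

h_base = rng.normal(size=(n_tokens, d_model))              # base model
h_ft = h_base + rng.normal(scale=0.1, size=h_base.shape)   # fine-tuned

# Per-feature change in mean activation between the two checkpoints.
drift = sae_features(h_ft).mean(axis=0) - sae_features(h_base).mean(axis=0)
print("most-shifted features:", np.argsort(-np.abs(drift))[:5])
```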

πŸš€ NeuronLens Value-Add

NeuronLens provides a post-fine-tuning alignment audit for any model:
  • πŸ”Ž Drift Heatmaps β†’ visualize which features changed
  • ⚠️ Risky Feature Reports β†’ surface features tied to misaligned personas, unsafe code, suppressed disclaimers
  • πŸ“œ Alignment Certificates β†’ regulator- or enterprise-ready evidence bundles (PDF/JSON)
Bottom line: after fine-tuning, you can prove your model is still aligned.

πŸ§ͺ Quickstart Example

```python
from neuronlens import alignment

# 1) Load base and fine-tuned checkpoints
base = alignment.load_model("meta-llama/Llama-2-7b-hf")
ft = alignment.load_model("./checkpoints/llama2-ft-domain")

# 2) Attach SAE (trained pre-finetune)
sae = alignment.load_sae(base, layers=[12, 22, 28])

# 3) Compare for alignment
report = alignment.compare_models(
    base_model=base,
    ft_model=ft,
    sae=sae,
    probes=["finance", "legal", "customer_support"],
    metrics=["drift", "misalignment_score"],
)
print(report.summary())
```
Sample Output
```text
Post-finetuning audit summary:
Drift detected at layer 22 (+0.31)
Misalignment score: 27% (threshold=20%) ❌
Risky features: [#4721 'Persona latent', #205 'Fraud terminology']
```
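
The report can also gate a pipeline programmatically. The sketch below assumes the report object exposes a `misalignment_score` attribute matching the summary text; that accessor is an assumption, not confirmed API:

```python
# Hypothetical CI gate on the audit above. The attribute name
# `misalignment_score` is an assumption inferred from the summary text.
def gate_deployment(report, threshold: float = 0.20) -> None:
    score = report.misalignment_score
    if score > threshold:
        raise SystemExit(f"Alignment audit failed: {score:.0%} > {threshold:.0%}")

gate_deployment(report)  # would raise on the 27% example above
```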

πŸ”₯ Drift Heatmap

```python
alignment.drift_heatmap(report, layer=22).show()
```
The heatmap highlights red-flag features whose activations changed most relative to the base model.
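
For a concrete sense of the numbers behind such a heatmap, the sketch below computes a per-feature drift multiplier (mean fine-tuned activation over mean base activation) on toy arrays; it illustrates the metric only, not NeuronLens internals:

```python
import numpy as np

# Illustrative drift multiplier per SAE feature (toy data): mean
# activation after fine-tuning divided by mean activation before.
# A value like 4.7x marks a strongly drifting feature.
rng = np.random.default_rng(1)
base_acts = rng.random((1000, 256))                    # tokens x features
ft_acts = base_acts * rng.uniform(0.5, 5.0, size=256)  # simulated drift

multiplier = ft_acts.mean(axis=0) / (base_acts.mean(axis=0) + 1e-8)
flagged = np.where(multiplier > 2.0)[0]                # example threshold
print(f"{flagged.size} features drifted more than 2x")
```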

πŸ“‹ Risky Features

```python
risky = alignment.risky_features(report, topk=5)
print(risky)
```
Example Output
```text
#4721  Persona latent          drift=+4.7×
#205   Fraud/Scam terminology  drift=+2.1×
#1199  Disclaimers suppressed  drift=+1.8×
```
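
One way to act on this list is to keep the strongly drifting feature IDs on a watchlist for ongoing monitoring. The record fields below (`id`, `label`, `drift`) are guesses about the return type, not documented API:

```python
# Hypothetical post-processing of `risky` from above. The field names
# (.id, .label, .drift) are assumptions, not documented NeuronLens API.
watchlist = {f.id: f.label for f in risky if f.drift >= 2.0}
print(watchlist)  # e.g. {4721: 'Persona latent', 205: 'Fraud/Scam terminology'}
```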

πŸ›οΈ Export Certificate

```python
certificate = alignment.export_certificate(report, formats=["pdf", "json"])
print("Certificate:", certificate["pdf"])
```
PDF Includes
  • Model & checkpoint hashes
  • Drift metrics (layer-wise)
  • Risky feature IDs + labels
  • Pass/Fail alignment status
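
If a downstream system consumes the JSON variant, a minimal sketch is shown below, continuing from the export call above; the field names (`status`, `model_hash`) are assumptions, not a documented schema:

```python
import json

# Hypothetical consumer of the JSON evidence bundle exported above.
# The field names ("status", "model_hash") are assumptions, not a
# schema documented by NeuronLens.
with open(certificate["json"]) as f:
    bundle = json.load(f)

assert bundle.get("status") == "pass", "failing checkpoints must not ship"
print("audited model:", bundle.get("model_hash"))
```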

βœ… When to Use

  • After any fine-tune (domain, safety, RLHF, instruction-tuning)
  • Before production deployment
  • For enterprise/regulatory submissions