itinai.com A2A Hub for Your Agents

Contrastive Neuron Attribution Steers MLPs Without SAE Training

Current ways to steer language models either modify whole layers or require heavy extra training, which makes them blunt and can hurt output quality. A neuron-level method called Contrastive Neuron Attribution (CNA) addresses this by finding the small set of MLP neurons that separate harmful from benign prompts. It needs only a few forward passes: no gradients and no extra models.

The procedure is short. First, gather a small contrastive prompt set (e.g., 100 harmful and 100 benign examples). Run the model on each prompt and record the down-projection activation of every MLP neuron at the last token. For every neuron, compute the difference between its mean activation on the harmful set and on the benign set. The top 0.1% of neurons by absolute difference form the circuit. To avoid generic effects, drop any neuron that also lands in the top 0.1% of activations on most unrelated prompts. At test time, multiply each circuit neuron's activation by a scalar: 0 to ablate, 1 to leave it unchanged, or a value above 1 to amplify.

Ablating just 0.1% of MLP activations cuts refusal rates by more than half in most instruction-tuned Llama and Qwen models, while keeping output quality above 0.97 and MMLU accuracy within one point of baseline. The same late-layer structure exists in base models; fine-tuning changes what these neurons do, not where they are. The result is a precise, cheap steering knob that preserves model quality.
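
To make the selection and scaling steps concrete, here is a minimal sketch in plain Python. It assumes the last-token MLP activations have already been recorded (for example with forward hooks) and are available as lists of floats, one row per prompt. The function names (`find_circuit`, `steer`), the `generic_share` threshold (reading "most" as more than half), and the choice to rank unrelated-prompt activations by absolute value are all assumptions made for illustration, not details confirmed by the method's authors. The data is synthetic.

```python
import random

def mean_per_neuron(acts):
    """Column-wise mean of a [num_prompts][num_neurons] activation table."""
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]

def top_k_indices(scores, k):
    """Indices of the k largest scores, largest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def find_circuit(harmful, benign, unrelated, frac=0.001, generic_share=0.5):
    """Pick the top `frac` of neurons by |mean(harmful) - mean(benign)|,
    then drop neurons that are also top-`frac` on most unrelated prompts."""
    num_neurons = len(harmful[0])
    k = max(1, round(frac * num_neurons))
    mu_h, mu_b = mean_per_neuron(harmful), mean_per_neuron(benign)
    diff = [abs(h - b) for h, b in zip(mu_h, mu_b)]
    candidates = top_k_indices(diff, k)

    # Assumption: "fires in the top 0.1%" is judged by absolute activation.
    hits = [0] * num_neurons
    for row in unrelated:
        for i in top_k_indices([abs(a) for a in row], k):
            hits[i] += 1
    limit = generic_share * len(unrelated)  # assumption: "most" = more than half
    return [i for i in candidates if hits[i] <= limit]

def steer(activations, circuit, scale):
    """Multiply circuit neurons by `scale` (0 ablates, 1 keeps, >1 amplifies)."""
    out = list(activations)
    for i in circuit:
        out[i] *= scale
    return out

# Synthetic demo: 2000 neurons, so the top 0.1% is 2 neurons.
rng = random.Random(0)
NUM_NEURONS, NUM_PROMPTS = 2000, 100

def noise_rows():
    return [[rng.gauss(0.0, 1.0) for _ in range(NUM_NEURONS)]
            for _ in range(NUM_PROMPTS)]

harmful, benign, unrelated = noise_rows(), noise_rows(), noise_rows()
for row in harmful:
    row[7] += 5.0    # planted task-specific neuron
    row[42] += 5.0   # planted neuron that also separates the two sets...
for row in unrelated:
    row[42] = 50.0   # ...but fires strongly on everything, so it is generic

circuit = find_circuit(harmful, benign, unrelated)
ablated = steer(harmful[0], circuit, 0.0)
print("circuit neurons:", circuit)
```

In this toy setup both planted neurons pass the mean-difference ranking, but the generic one is removed by the unrelated-prompt filter. In a real model the same scaling would be applied inside the forward pass rather than on stored lists.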