
Contrastive Neuron Attribution Steers MLPs Without SAE Training

Current ways to steer language models either modify whole layers or need heavy extra training. This makes them blunt and can hurt output quality. A new neuron-level method called Contrastive Neuron Attribution (CNA) addresses this by finding the tiny set of MLP neurons that separate harmful from benign prompts. It needs only a few forward passes, with no gradients and no extra models.

First, gather a small contrastive prompt set (e.g., 100 harmful and 100 benign examples). Run the model and record the down-projection activation of each MLP neuron at the last token. Compute the mean difference between the two sets for every neuron. Pick the top 0.1% of neurons by absolute difference: this is the circuit. To avoid generic effects, remove any neuron that also fires in the top 0.1% on most unrelated prompts. At test time, multiply each circuit neuron's activation by a scalar (0 to ablate, 1 to leave unchanged, >1 to amplify).

Ablating just 0.1% of MLP activations cuts refusal rates by more than half in most instruction-tuned Llama and Qwen models, while keeping the output quality score above 0.97 and MMLU accuracy within one point of baseline. The same late-layer structure exists in base models; fine-tuning changes what these neurons do, not where they are located. The result is a precise, cheap steering knob that preserves model quality.

#AI #Product #ProductManagement #UX #Innovation #Productivity #Technology #Startups #ML #LLM
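The selection and scaling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the last-token MLP activations have already been collected into arrays of shape (prompts, neurons), it reads "most unrelated prompts" as more than half of them, and the function names `select_circuit` and `scale_circuit`, the `generic_rate` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np


def select_circuit(harmful, benign, unrelated, top_frac=0.001, generic_rate=0.5):
    """Return indices of candidate circuit neurons.

    harmful, benign, unrelated: arrays of shape (n_prompts, n_neurons) holding
    last-token MLP activations (assumed to be collected beforehand).
    generic_rate is an assumed reading of "most unrelated prompts".
    """
    n_neurons = harmful.shape[1]
    k = max(1, int(round(top_frac * n_neurons)))

    # Mean activation difference between the two contrastive sets, per neuron.
    diff = harmful.mean(axis=0) - benign.mean(axis=0)
    candidates = np.argsort(-np.abs(diff))[:k]

    # Generic filter: how often is each neuron among the top-k by magnitude
    # on unrelated prompts?
    mag = np.abs(unrelated)
    kth_largest = np.partition(mag, n_neurons - k, axis=1)[:, n_neurons - k]
    fire_rate = (mag >= kth_largest[:, None]).mean(axis=0)

    return candidates[fire_rate[candidates] <= generic_rate]


def scale_circuit(acts, circuit, alpha):
    """Multiply circuit neurons by alpha (0 ablates, 1 keeps, >1 amplifies)."""
    out = acts.copy()
    out[..., circuit] *= alpha
    return out


# Synthetic demonstration (made-up data, not model activations).
rng = np.random.default_rng(0)
n_neurons = 10_000
harmful = rng.normal(size=(100, n_neurons))
benign = rng.normal(size=(100, n_neurons))
unrelated = rng.normal(size=(100, n_neurons))

planted = np.array([5, 17, 123, 456, 789, 1024, 2048, 4096, 7777, 9999])
harmful[:, planted] += 8.0   # these neurons separate harmful from benign
unrelated[:, 5] += 20.0      # neuron 5 also fires on everything: generic

circuit = select_circuit(harmful, benign, unrelated)
ablated = scale_circuit(harmful, circuit, alpha=0.0)
print(sorted(circuit.tolist()))
```

In a real model the same scaling would be applied to the live activations during the forward pass, for example through a hook on each MLP's down-projection; that wiring depends on the model implementation and is left out here.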


Vladimir Dyachkov, Ph.D.
Editor-in-Chief, itinai.com

I believe that AI is only as powerful as the human insight guiding it.