A Simple Linear Patch Revives Layer-Pruned Large Language Models

Chen, Xinrui; Bai, Haoli; Yuan, Tao; Liu, Ruikang; Zhao, Kang; Yu, Xianzhi; Hou, Lu; Guan, Tian; He, Yonghong; Yuan, Chun

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Abstract

Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We attribute the majority of this degradation to a single yet previously overlooked issue: \textit{the mismatch of activation magnitudes at the pruning interface}. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing a distributional shift that propagates through the remaining layers. To address this issue, we introduce \textsc{LinearPatch}, a lightweight, plug-and-play technique that fuses two operations into a single matrix multiplication at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, \textsc{LinearPatch} preserves up to \textbf{94.15\%} of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by \textbf{4\%}. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16\% within only 30 minutes on a single GPU. Code is available at \url{https://github.com/chenxinrui-tsinghua/LinearPatch}.
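
The fusion of a Hadamard transformation and a channel-wise scaling into one matrix can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names `hadamard` and `build_linear_patch`, the use of the mean absolute value as the per-channel statistic, and the calibration arrays `x_pre` and `x_post` (activations entering and leaving the pruned block) are all assumptions made for illustration.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Sylvester-type Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        # Sylvester doubling step: [[H, H], [H, -H]]
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # normalise so that h @ h.T == I


def build_linear_patch(x_pre, x_post, eps=1e-8):
    """Return one (d, d) matrix P = H diag(s) H^T (illustrative assumption).

    x_pre  : (tokens, d) activations just before the pruning interface
    x_post : (tokens, d) activations the remaining layers originally received
    """
    d = x_pre.shape[1]
    h = hadamard(d)
    # Rotate both sets of activations so token-level outliers are spread
    # across channels before any statistics are measured.
    pre_scale = np.abs(x_pre @ h).mean(axis=0)
    post_scale = np.abs(x_post @ h).mean(axis=0)
    s = post_scale / (pre_scale + eps)  # per-channel scaling factors
    # h * s multiplies column j of h by s[j], i.e. H @ diag(s).
    return (h * s) @ h.T


# Hypothetical calibration data: the pruned block tripled the activation scale.
rng = np.random.default_rng(0)
x_pre = rng.normal(size=(64, 8))
x_post = 3.0 * x_pre
patch = build_linear_patch(x_pre, x_post)
x_patched = x_pre @ patch  # one matrix multiply at the pruning interface
```

Because the rotation, the scaling, and the inverse rotation are folded into a single `(d, d)` matrix, applying the patch costs one extra matrix multiplication at inference time, which matches the abstract's description of the two operations being fused.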
