LLMs can now maintain high accuracy at ultra-low precision: UNC Chapel Hill researchers introduce TACQ, a task-aware quantization method that preserves critical weight circuits without performance loss.

LLMs show impressive capabilities across numerous applications, but their heavy computational and memory requirements pose challenges. These challenges are most acute in settings where privacy is essential (such as handling sensitive patient records) or where compute is constrained (such as real-time customer service systems and edge devices). Post-training quantization (PTQ) is a promising solution that compresses pretrained models, reducing memory consumption by 2-4x. However, current methods hit a bottleneck below 4-bit compression, with severe performance degradation at 2- or 3-bit precision. Most PTQ methods rely on small amounts of general-purpose pretraining data to account for the activation changes caused by quantization.
Current LLM compression methods fall mainly into three categories. Uniform quantization is the most basic approach: the weights in each row of a 16-bit floating-point tensor are compressed by mapping floats to integers based on the maximum and minimum values within each channel. GPTQ-based quantization advances this idea by focusing on layer-wise reconstruction, aiming to minimize the reconstruction loss after quantization. Finally, mixed-precision quantization offers a more nuanced strategy than applying a fixed precision to all weights: these techniques allocate bit widths according to weight importance to maintain performance, and some keep high-magnitude "outlier" weights at higher precision.
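To make the uniform baseline concrete, here is a minimal sketch (in PyTorch, not the authors' code) of asymmetric per-channel uniform quantization to a low bit width; the per-row min/max mapping and the de-quantization step are illustrative assumptions rather than any specific method cited above.

```python
import torch

def uniform_quantize_per_channel(weight: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Asymmetric per-channel (per-row) uniform quantization sketch.

    Each row of a floating-point weight matrix is mapped to integer codes in
    [0, 2^bits - 1] using that row's min/max, then mapped back to floats so
    the error introduced by quantization can be inspected.
    """
    levels = 2 ** bits - 1
    w_min = weight.min(dim=1, keepdim=True).values
    w_max = weight.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / levels
    codes = torch.round((weight - w_min) / scale).clamp(0, levels)  # integer codes
    return codes * scale + w_min  # de-quantized ("fake-quantized") weights

# Example: quantize a random layer and inspect the induced weight change.
w = torch.randn(1024, 1024)
w_q = uniform_quantize_per_channel(w, bits=3)
print((w_q - w).abs().mean())
```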
UNC Chapel Hill researchers have proposed a novel mixed-precision post-training quantization method called TaskCircuit Quantization (TACQ). The method draws on automatic circuit discovery, directly conditioning the quantization process on specific weight circuits, defined as sets of weights associated with downstream task performance. TACQ compares model weights with their uniformly quantized counterparts to estimate the expected weight change from quantization, then uses gradient information to predict the impact on task performance, preserving task-specific weights. With the same calibration data and a lower weight budget, TACQ consistently surpasses baselines and achieves significant improvements in the challenging 2-bit and 3-bit regimes.
TACQ is built around a saliency metric that identifies which critical weights to preserve during quantization, drawing on concepts from model interpretability such as automatic circuit discovery, knowledge localization, and input attribution. The metric combines two components:
- Quantization-aware Localization (QAL): traces how model performance is affected by estimating the expected weight changes caused by quantization.
- Magnitude-Sharpened Gradient (MSG): a generalized measure of each weight's absolute importance, adapted from input attribution techniques.
MSG helps stabilize TACQ and counteracts biases in QAL's estimates. The two factors are combined into a unified saliency score that can be evaluated efficiently for every weight in a single backward pass, and the highest-scoring weights are retained at 16-bit precision.
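As a rough illustration of how these two signals could be combined, the sketch below scores every weight with a first-order QAL term and an MSG term using gradients from a single backward pass over task calibration data, then keeps the top-scoring fraction at 16-bit while quantizing the rest. The additive combination and the `keep_frac` parameter are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def tacq_style_saliency(weight: torch.Tensor, grad: torch.Tensor,
                        quantized: torch.Tensor) -> torch.Tensor:
    """Per-weight saliency combining the two components described above.

    QAL term: |grad * (Q(w) - w)| -- first-order estimate of the loss change
              caused by the weight shift uniform quantization would induce.
    MSG term: |grad * w|          -- magnitude-sharpened gradient importance.
    The simple sum below is an illustrative assumption; the paper defines
    its own combination of the two terms.
    """
    qal = (grad * (quantized - weight)).abs()
    msg = (grad * weight).abs()
    return qal + msg

def preserve_salient_weights(weight, grad, quantized, keep_frac=0.005):
    """Keep the highest-saliency weights in 16-bit and quantize the rest.

    `keep_frac` (fraction of weights preserved) is a hypothetical knob here;
    `grad` is the task-loss gradient from a single backward pass over a
    small task-specific calibration set.
    """
    score = tacq_style_saliency(weight, grad, quantized)
    k = max(1, int(keep_frac * weight.numel()))
    threshold = score.flatten().topk(k).values.min()
    mask = score >= threshold                       # weights kept at 16-bit
    return torch.where(mask, weight, quantized), mask
```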
In the challenging 2-bit setting, TACQ outperformed SliM-LLM with absolute margins of 16.0% (from 20.1% to 36.1%) on GSM8k, 14.1% (from 34.8% to 49.2%) on MMLU, and 21.9% (from 0% to 21.9%) on Spider. At this compression level, other baselines such as GPTQ, SqueezeLLM, and SPQR degrade to near-random performance. At 3-bit precision, TACQ retained 91%, 96%, and 89% of unquantized accuracy on GSM8k, MMLU, and Spider, respectively, while outperforming the strongest baseline, SliM-LLM, on most datasets. TACQ's advantages are most evident on generation tasks that require sequential token outputs: it is the only method that recovers non-negligible performance in the 2-bit setting on the Spider text-to-SQL task.
In summary, the researchers introduced TACQ, a significant advance in task-aware post-training quantization. It improves model performance at ultra-low bit widths (2 to 3 bits), where previous methods degrade to near-random outputs. By selectively retaining a small fraction of salient weights at 16-bit precision, TACQ aligns with automatic circuit discovery research, suggesting that sparse weight "circuits" disproportionately influence specific tasks. Moreover, the Spider experiments show that TACQ better preserves the model's generation ability, making it well suited to program-prediction tasks. This also extends to agentic settings, where models frequently produce executable outputs and efficiency is a concern.
Check out the Paper and GitHub page.

Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he delves into practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to explain complex AI concepts in a clear and accessible way.
