Improving 2-bit LLM accuracy with EoRA

Quantization is one of the key techniques for reducing the memory footprint of large language models (LLMs). It converts the model parameters from a higher-precision format, such as 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16), to a lower-precision integer format, typically INT8 or INT4. For example, quantizing a model to 4-bit means each parameter takes only 0.5 bytes, compared to 4 bytes in FP32.
Post-training quantization methods such as GPTQ and AWQ can greatly reduce the size of large models. A model like Llama 3 with 70 billion parameters occupies about 140 GB in FP16, but 4-bit quantization can bring this down to about 40 GB while still maintaining strong performance on downstream tasks.
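A quick back-of-the-envelope calculation in Python makes these numbers concrete (it ignores quantization metadata such as scales and zero points, which add a small overhead on top of the packed weights):

params = 70e9  # Llama 3 70B

print(f"FP16:  {params * 2 / 1e9:.0f} GB")     # 2 bytes per parameter -> ~140 GB
print(f"4-bit: {params * 0.5 / 1e9:.0f} GB")   # 0.5 bytes per parameter -> ~35 GB before metadata
print(f"2-bit: {params * 0.25 / 1e9:.1f} GB")  # 0.25 bytes per parameter -> ~17.5 GB before metadata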
However, despite this significant reduction, such a model still exceeds the memory capacity of most consumer-grade GPUs, which typically offer 24 GB to 32 GB of VRAM. To make these models truly accessible, they need to be quantized to even lower bit widths, such as 2-bit. While recent advances in low-bit quantization are promising, achieving stable and accurate 2-bit quantization remains a major challenge.
In this article, we review a technique called EoRA that helps compensate for quantization-induced errors. EoRA is a training-free method, which means it can be applied quickly and efficiently to any model, even the largest ones. We will examine how EoRA works and demonstrate how it can significantly improve the accuracy of 2-bit quantized models, bringing them close to the accuracy of their full-precision counterparts while being up to 5.5x smaller.
We will analyze experimental results obtained with large models such as Qwen3-32B and Qwen2.5-72B, both quantized to 2-bit using state-of-the-art quantization techniques, to illustrate the effectiveness of EoRA.
Diving into the eigenspace in search of adapters
Post-training quantization, or more generally compression, aims to reduce model size or inference cost by minimizing the output difference between the original weights \(W_l\) and the compressed weights \(\hat{W}_l\), using only a small calibration dataset.
Most quantization methods address this problem effectively, but the choice of compression format is rigid, which limits flexibility across diverse deployment requirements.
To bypass these format constraints and improve accuracy, methods such as QLoRA [1] and HQQ+ [2] fine-tune a LoRA adapter directly on top of the frozen quantized model.
Compression can also be reframed as a compensation problem: given a compressed model, introduce a low-rank residual path that specifically corrects the compression errors.
A straightforward way to approximate the compression error is to apply SVD. Decomposing

\[ \Delta W_l = W_l - \hat{W}_l \]

into

\[ U_l \Sigma_l V_l^T \]

gives a low-rank approximation formed by the two matrices

\[ B_l = U_l \Sigma_l \]

\[ A_l = V_l^T \]

where \(A_l\) and \(B_l\) are the standard tensors of a LoRA adapter.
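As a point of reference, this plain SVD compensation is only a few lines of PyTorch. The sketch below is illustrative (the variable and function names are mine, not from any particular library) and is not yet EoRA:

import torch

def svd_low_rank_factors(W, W_q, rank):
    # W: original weights of one linear layer (d_out x d_in), W_q: its quantized/compressed weights
    delta = (W - W_q).float()                                 # Delta W_l = W_l - W_hat_l
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)   # Delta W_l = U_l Sigma_l V_l^T
    B = U[:, :rank] * S[:rank]                                 # B_l = U_l Sigma_l, truncated to the target rank
    A = Vh[:rank, :]                                           # A_l = V_l^T
    return A, B                                                # at inference: W_q @ x + B @ (A @ x) approximates W @ x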
However, plain SVD has two limitations: it does not directly minimize the original layer-wise compression loss, and it distributes capacity evenly across all error components, ignoring the varying importance of different parts of the model.
To address this, NVIDIA proposed EoRA [3].
EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
EoRA first projects the compression error into the eigenspace defined by the input activation covariance

\[ \tilde{X} \tilde{X}^T \]

where \(\tilde{X}\) is the average activation computed over the calibration set. Performing the eigendecomposition gives

\[ \tilde{X} \tilde{X}^T = Q \Lambda Q^T \]

The compression error \(\Delta W\) is then projected as

\[ \Delta W' = \Delta W Q' \]

where \(Q' = Q \Lambda\). SVD is then applied to \(\Delta W'\) to produce a low-rank approximation, and the result is projected back into the original space, adjusting the low-rank factors accordingly.
This eigenspace projection changes the optimization objective: it weights the importance of different error components according to their contribution to the layer output (via the eigenvalues), making the approximation more effective. It can be computed quickly, without any training, requires only calibration activations, and introduces no additional inference latency. Moreover, the derivation shows that this approach directly minimizes the layer-wise compression loss, not just the raw weight error.
Analytically, truncating singular values in the projected space corresponds to minimizing the true compression error, under reasonable assumptions about the calibration activations.
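To make the procedure concrete, here is a minimal PyTorch sketch of the eigenspace projection described above. It follows the formulation of this section (in particular \(Q' = Q\Lambda\)); NVIDIA's reference implementation in GPTQModel may differ in details, and the function and tensor names are illustrative:

import torch

def eora_low_rank_factors(W, W_q, X, rank, eps=1e-6):
    # W: original weights (d_out x d_in), W_q: compressed weights, X: calibration activations (d_in x n_samples)
    delta = (W - W_q).float()                                      # compression error Delta W
    cov = X.float() @ X.float().T                                  # activation covariance X X^T
    eigvals, Q = torch.linalg.eigh(cov)                            # X X^T = Q Lambda Q^T
    Qp = Q * eigvals.clamp_min(eps)                                # Q' = Q Lambda (eigenvalues clamped for stability)
    U, S, Vh = torch.linalg.svd(delta @ Qp, full_matrices=False)   # SVD of Delta W' = Delta W Q'
    B = U[:, :rank] * S[:rank]                                     # keep only the top-r components
    A = Vh[:rank, :] @ torch.linalg.inv(Qp)                        # project back so that B @ A approximates Delta W
    return A, B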
In their paper, NVIDIA presents a wide range of strong results showing that EoRA can significantly boost the accuracy of quantized models. However, their experiments focus mainly on older quantization methods such as GPTQ and are limited to mid-sized LLMs with up to 13B parameters, at 3-bit and 4-bit precision.
This leaves an open question: can EoRA work with more modern quantization techniques, and can it be pushed down to 2-bit precision?
Let’s find out.
Calibrating an EoRA adapter
Suppose we have a quantized model whose performance on some tasks has significantly degraded compared to its full-precision counterpart. Our goal is to use EoRA to reduce this performance gap.
In my experiments, I used Qwen2.5-72B Instruct and Qwen3-32B, both quantized with AutoRound (Apache 2.0 license), a state-of-the-art quantization algorithm developed by Intel. AutoRound uses signed gradient descent to fine-tune the quantization, which is especially effective for low-bit settings.
All the models I made are available here (Apache 2.0 license):
The 2-bit models were quantized with a group size of 32, except for one variant that uses a group size of 128. A larger group size reduces the model size by storing less quantization metadata, but it introduces larger quantization errors.
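For reference, quantizing one of these models with AutoRound looks roughly like the snippet below. It is a minimal sketch based on the auto-round library's documented API; the exact arguments, calibration settings, and export format behind the published checkpoints may differ:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2-bit weights with a group size of 32 (smaller groups store more metadata but reduce quantization error)
autoround = AutoRound(model, tokenizer, bits=2, group_size=32)
autoround.quantize()
autoround.save_quantized("Qwen3-32B-autoround-2bit-gptq", format="auto_gptq")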
I evaluated the models on IFEval, a benchmark that measures instruction-following capabilities. The results show a significant performance drop for the quantized versions.
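As an illustration, an IFEval run can be launched with the lm-evaluation-harness roughly as follows. This is a sketch, not necessarily the exact setup behind the numbers reported here; the model path and batch size are placeholders:

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=kaitchup/Qwen3-32B-autoround-2bit-gptq",
    tasks=["ifeval"],
    batch_size=8,
)
print(results["results"]["ifeval"])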
To compensate for this degradation, I applied an EoRA adapter using the implementation provided in the GPTQModel library (Apache 2.0 license). The integration is straightforward, and if you're curious about how it is implemented in PyTorch, the codebase is compact, clean, and easy to follow:
- EoRA implementation in GPTQModel: eora.py
EoRA requires a calibration dataset. Ideally, this dataset should reflect the model's intended use cases. However, since we don't have a specific target task here and aim to preserve the model's general capabilities, I used 1,024 randomly sampled examples from the C4 dataset (ODC-BY license).
Another key parameter is the LoRA rank, which greatly affects the effectiveness of the EoRA adapter. Its optimal value depends on the model architecture, the target task, and the calibration data. A higher rank may yield better performance, but it risks overfitting the calibration set. It also increases the size of the adapter, which is counterproductive when the overall goal of quantization is to reduce memory usage. Conversely, a lower rank keeps the adapter lightweight but may not capture enough information to effectively compensate for quantization errors.
In my experiments, I tested LoRA ranks of 32, 64, and 256.
Here is the code to create an EoRA adapter with GPTQModel:
from gptqmodel import GPTQModel
from gptqmodel.adapter.adapter import Lora
from datasets import load_dataset

# 1,024 raw text samples from C4 used as calibration data
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train", download_mode="force_redownload"
).select(range(1024))["text"]

# Where the EoRA adapter will be saved, and the 2-bit model it compensates
eora_adapter_path = "Qwen3-32B-autoround-2bit-gptq-r256"
model_path = "kaitchup/Qwen3-32B-autoround-2bit-gptq"

# Adapter configuration: output path and LoRA rank
eora = Lora(
    path=eora_adapter_path,
    rank=256,
)

# Generate the EoRA adapter from the full-precision and quantized models
GPTQModel.adapter.generate(
    adapter=eora,
    model_id_or_path="Qwen/Qwen3-32B",
    quantized_model_id_or_path=model_path,
    calibration_dataset=calibration_dataset,
    calibration_dataset_concat_size=0,
    auto_gc=False)
Using an NVIDIA A100 GPU on RunPod (referral link), it took about 4 hours to generate the EoRA adapter for the model Qwen3-32B-autoround-2bit-gptq.
All the EoRA adapters created for these models are publicly available (Apache 2.0 license):
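Once generated, an adapter can be loaded together with its quantized model at inference time. The snippet below is a sketch based on GPTQModel's documented EoRA usage, reusing the paths from the generation code above; argument names may vary slightly between library versions:

from transformers import AutoTokenizer
from gptqmodel import GPTQModel
from gptqmodel.adapter.adapter import Lora

# Attach the rank-256 EoRA adapter created earlier to the 2-bit model
eora = Lora(path="Qwen3-32B-autoround-2bit-gptq-r256", rank=256)
model = GPTQModel.load("kaitchup/Qwen3-32B-autoround-2bit-gptq", adapter=eora)
tokenizer = AutoTokenizer.from_pretrained("kaitchup/Qwen3-32B-autoround-2bit-gptq")

inputs = tokenizer("2-bit quantization is useful because", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))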
Evaluating EoRA adapters for 2-bit LLMs
Let's evaluate the effectiveness of the EoRA adapters. Do they improve the accuracy of the 2-bit models?

It works!
The improvements are particularly noteworthy for Qwen3-14B and Qwen3-32B. For example, applying EoRA to Qwen3-32B quantized to 2-bit with a group size of 128 yields an accuracy gain of nearly 7.5 points. Increasing the LoRA rank from 32 to 64 also led to improvements, highlighting the impact of the rank on performance.
EoRA is also effective on larger models such as Qwen2.5-72B, although the gains are more moderate. On this model, low-rank adapters brought little benefit; it wasn't until I raised the rank to 256 that significant improvements appeared.
Memory consumption of EoRA
Using an EoRA adapter during inference increases memory consumption as follows:

The overhead is generally negligible. For example, for the 2-bit Qwen3-14B, the adapter adds only 257 MB and 514 MB to the total model size at ranks 32 and 64, respectively. At larger ranks, using an EoRA adapter becomes questionable, since the total memory consumption may exceed that of the same model quantized at a higher precision. For example, a 2-bit Qwen2.5-72B with an EoRA adapter is larger than a 3-bit Qwen2.5-72B.
Note: this estimate only includes the memory consumed by the adapter parameters. For completeness, we could also account for the memory used by the adapter activations during inference. However, these are very small relative to other tensors, such as the model's attention and MLP activations, and can safely be considered negligible.
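For the curious, this kind of parameter-only estimate can be reproduced with a small helper like the one below. The layer shapes are purely illustrative, not Qwen's exact configuration; the point is that the adapter size grows linearly with the rank:

def adapter_size_mb(layer_shapes, rank, bytes_per_param=2):
    # Each adapted linear layer of shape (d_out, d_in) adds rank * (d_out + d_in) parameters (the A and B factors)
    params = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return params * bytes_per_param / 1e6

# Illustrative example: 40 transformer blocks, 7 adapted projections per block, all 5120 x 5120
shapes = [(5120, 5120)] * (7 * 40)
print(adapter_size_mb(shapes, rank=32), "MB")
print(adapter_size_mb(shapes, rank=64), "MB")  # doubling the rank doubles the adapter size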
Conclusion
EoRA works. We have confirmed that it is a simple yet effective method for compensating quantization errors, even at 2-bit precision. It is intuitive, training-free, and delivers meaningful performance improvements. That said, a few trade-offs need to be considered:
- Rank search: Finding the best LoRA rank requires experimentation. It is difficult to predict in advance whether a rank of 32 will be sufficient or whether a larger rank (e.g., 256) will lead to overfitting. The optimal value depends on the model, the calibration data, and the target task.
- Increased memory consumption: The purpose of quantization is to reduce memory usage, often for highly constrained environments. While EoRA adapters are relatively lightweight at lower ranks, they do slightly increase memory consumption, especially at higher ranks, reducing the overall efficiency of 2-bit quantization.
Looking ahead, NVIDIA's paper also shows that EoRA adapters provide an excellent starting point for QLoRA fine-tuning. In other words, if you plan to fine-tune a 2-bit model with QLoRA, initializing the adapter from EoRA can lead to better results with less training effort. Last year, I wrote about fine-tuning adapters for GPTQ models in my newsletter:
QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU
The main difference is that we would load the EoRA adapter instead of initializing the adapter from scratch, and then fine-tune this adapter.
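In practice, that workflow could look like the sketch below, assuming the EoRA adapter has been saved in (or converted to) a PEFT-compatible format; the adapter path is a placeholder and this is not the exact recipe from the newsletter article:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the 2-bit base model, then attach the EoRA adapter as a trainable LoRA instead of a random init
base = AutoModelForCausalLM.from_pretrained("kaitchup/Qwen3-32B-autoround-2bit-gptq", device_map="auto")
model = PeftModel.from_pretrained(base, "path/to/eora-adapter", is_trainable=True)
# `model` can now be passed to a standard SFT/QLoRA training loop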
References
[1] Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs (2023), arXiv
[2] Badri and Shaji, Towards 1-bit Machine Learning Models (2024), Mobius Labs' Blog
[3] Liu et al., EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation (2024), arXiv