
NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs

As demand for reasoning-intensive tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning. However, inference-time performance is severely limited by the memory footprint of the key-value (KV) cache, not just the number of tokens generated. In a recent paper, researchers from NVIDIA and the University of Edinburgh introduce Dynamic Memory Sparsification (DMS), a data-efficient, retrofit-friendly method that compresses the KV cache and unlocks inference-time hyper-scaling without degrading model accuracy.

Bottleneck: KV cache in transformer inference

Transformer-based models such as GPT, Llama, and Qwen use the KV cache to store past token representations for autoregressive generation. This cache grows linearly with sequence length and width (parallel threads of reasoning), consuming large amounts of GPU memory and slowing inference due to frequent memory accesses.
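To get a feel for the scale of the problem, here is a back-of-the-envelope sizing sketch in Python; the model dimensions and serving setup are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope KV cache sizing (illustrative; dimensions are hypothetical,
# not taken from the paper).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a 7B-class model with 32 layers and 8 KV heads of dim 128,
# serving 16 parallel reasoning threads of 8K tokens each in fp16.
gb = kv_cache_bytes(32, 8, 128, 8192, 16) / 1e9
print(f"KV cache ≈ {gb:.1f} GB")  # ≈ 17.2 GB before any compression
```

Even under these modest assumptions, the cache alone can rival the memory footprint of the model weights, which is why compressing it pays off directly in throughput and batch size.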

Existing KV cache optimization techniques either rely on training-free heuristics, such as attention-weight-based token eviction, or require heavy post-training retrofitting, as in Dynamic Memory Compression (DMC). Both have clear drawbacks: the former tends to hurt accuracy, while the latter is computationally expensive.

Dynamic Memory Sparsification (DMS): Compression without compromise

DMS addresses these limitations with a hybrid approach: it sparsifies the KV cache like traditional pruning methods, but does so with minimal training overhead (roughly 1,000 steps) and delayed eviction, keeping tokens temporarily available after they are marked for removal. This design retains important context information and avoids sudden drops in accuracy.

The core idea is to make eviction decisions differentiable during training using a Gumbel-sigmoid-based sampling mechanism. Tokens predicted for future eviction remain usable for a sliding-window duration before being discarded, allowing the model to absorb their information value more gracefully.
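As a rough illustration, the sketch below shows how a Gumbel-sigmoid relaxation and a delayed-eviction window could be wired up in PyTorch; the function names, tensor shapes, and the 128-token window are assumptions for illustration, not the paper's implementation:

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Relaxed Bernoulli sample: a differentiable keep/evict decision used at training time."""
    u = torch.rand_like(logits).clamp_(1e-9, 1 - 1e-9)
    noise = torch.log(u) - torch.log1p(-u)  # logistic noise for the sigmoid relaxation
    return torch.sigmoid((logits + noise) / tau)

def delayed_eviction_mask(evict_step: torch.Tensor, current_step: int, window: int = 128) -> torch.Tensor:
    """Tokens flagged for eviction stay readable for `window` further decoding steps.

    evict_step[i] holds the step at which token i was flagged for eviction
    (float('inf') if it was never flagged). Returns True where the token is kept.
    """
    return (current_step - evict_step) < window

# Training-time usage: sample soft eviction decisions from per-token logits.
eviction_logits = torch.randn(1024)              # one logit per cached token (illustrative)
soft_keep = 1.0 - gumbel_sigmoid(eviction_logits)  # ≈1 keep, ≈0 evict, but differentiable
```

The point of the relaxation is that gradients can flow through the keep/evict decision during the short retrofit, while at inference time the decisions become hard and the flagged tokens are actually dropped once the window passes.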

Data-efficient retrofitting of existing models

Unlike DMC, which requires thousands of training steps and complex gradient-based optimization, DMS introduces no additional parameters per attention head. Instead, it repurposes a small part of the attention mechanism (a single neuron) to predict eviction. This makes DMS well suited for retrofitting existing models without architectural changes.
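One way to picture this parameter-free retrofit is the hypothetical PyTorch fragment below; the choice to reuse the first output neuron of the existing key projection as the eviction predictor is an assumption made for illustration, not the authors' exact design:

```python
import torch
import torch.nn as nn

class EvictionAwareHead(nn.Module):
    """Illustrative sketch: reuse one neuron of a projection that the pretrained
    head already has, so no new parameters are added for eviction prediction."""
    def __init__(self, d_model: int, head_dim: int):
        super().__init__()
        self.k_proj = nn.Linear(d_model, head_dim)  # already present in the model

    def forward(self, hidden_states: torch.Tensor):
        keys = self.k_proj(hidden_states)    # (batch, seq, head_dim)
        eviction_logits = keys[..., 0]       # repurpose a single output neuron (assumption)
        return keys, eviction_logits
```

Because the predictor reuses existing weights, the short retrofit only has to teach the model how to read that neuron as an eviction signal rather than learn a new module from scratch.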

Empirical results show that with as few as 1K training steps, DMS achieves 8× KV cache compression while retaining or even improving model performance on reasoning tasks.

Benchmark results: Scaling performance without scaling costs

The research team evaluated DMS on reasoning-intensive benchmarks such as:

  • AIME 2024 (advanced mathematics)
  • MATH 500 (mathematical problem solving)
  • GPQA Diamond (hard science QA)
  • LiveCodeBench (code generation)

Across model sizes (Qwen-R1 1.5B, 7B, and 32B), DMS improved performance by 9.1 points on AIME, 7.6 on GPQA, and 9.6 on LiveCodeBench, all under the same memory and compute budget.

Compared to top-performing baselines such as Quest and TOVA, DMS consistently outperformed them in both KV cache read efficiency (a runtime proxy) and peak memory usage, achieving better Pareto frontiers.

General-purpose utility

DMS also holds up on non-reasoning tasks. On short-context benchmarks such as MMLU, GSM8K, and HellaSwag, DMS maintained performance at high compression ratios with only minimal degradation (~3.5 points). On long-context tasks, DMS even surpassed the vanilla models, suggesting its potential to mitigate issues such as information over-squashing in long sequences.

Conclusion

In summary, Dynamic Memory Sparsification (DMS) offers a practical and scalable solution for improving the inference-time efficiency of transformer-based language models. By intelligently compressing the KV cache with minimal retraining, DMS lets models generate longer sequences or run parallel inference without increasing runtime or memory requirements. Its consistent gains across reasoning and general-purpose tasks highlight its versatility and effectiveness. As LLMs are increasingly deployed in resource-constrained environments, DMS offers a compelling path forward, balancing compression, accuracy, and ease of integration for real-world inference workloads.


Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 99K+ ML SubReddit, and subscribe to our newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
