RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity

LLMs built on the Transformer architecture face significant scaling challenges when processing long inputs because self-attention is quadratic in sequence length. Methods such as linear attention models, state space models like Mamba, and linear RNNs like DeltaNet and RWKV address this problem. However, these linear architectures struggle with long-context understanding. For example, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but degrades rapidly beyond that point. Even with continual pretraining on 128K-length data, the long-context limitation persists. This issue extends beyond RWKV to other architectures such as Mamba, and represents a fundamental challenge for this class of models.
Linear-complexity language models have emerged as alternatives to Transformer-based architectures, which suffer from quadratic computational cost when handling long sequences. The RWKV model series combines Transformer-style parallelism during training with an RNN-like recurrent state representation during inference. RWKV has evolved through multiple iterations, from the foundational RWKV-4 to RWKV-5, RWKV-6, and RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each enhance hybrid designs in their own way. In addition, Native Sparse Attention (NSA) organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other approaches include SeerAttention and Block Attention (MoBA).
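To make the three-path idea concrete, here is a minimal single-head PyTorch sketch that mixes a compressed-token branch, a top-k selected-block branch, and a sliding-window branch. The function name, the equal-weight mixing, and the omission of causal masking are our own simplifications for illustration, not the NSA implementation.

```python
import torch
import torch.nn.functional as F

def three_path_sparse_attention(q, k, v, block_size=64, top_k=4, window=128):
    """Single-head sketch of three-path sparse attention.
    q, k, v: (T, d) tensors. Causal masking is omitted for brevity."""
    T, d = q.shape
    scale = d ** -0.5

    # Path 1: compressed coarse-grained tokens -- mean-pool each key/value block.
    n_blocks = (T + block_size - 1) // block_size
    pad = n_blocks * block_size - T
    k_blk = F.pad(k, (0, 0, 0, pad)).view(n_blocks, block_size, d)
    v_blk = F.pad(v, (0, 0, 0, pad)).view(n_blocks, block_size, d)
    k_cmp, v_cmp = k_blk.mean(dim=1), v_blk.mean(dim=1)            # (n_blocks, d)
    out_cmp = F.softmax(q @ k_cmp.T * scale, dim=-1) @ v_cmp        # (T, d)

    # Path 2: selectively retained fine-grained tokens -- each query keeps only
    # the top-k blocks (scored against the pooled keys) and attends inside them.
    block_scores = q @ k_cmp.T * scale                              # (T, n_blocks)
    topk_idx = block_scores.topk(min(top_k, n_blocks), dim=-1).indices
    out_sel = torch.zeros_like(q)
    for t in range(T):
        k_sel = k_blk[topk_idx[t]].reshape(-1, d)
        v_sel = v_blk[topk_idx[t]].reshape(-1, d)
        out_sel[t] = F.softmax(k_sel @ q[t] * scale, dim=-1) @ v_sel

    # Path 3: sliding window over the most recent tokens for local context.
    out_win = torch.zeros_like(q)
    for t in range(T):
        s = max(0, t - window + 1)
        out_win[t] = F.softmax(k[s:t + 1] @ q[t] * scale, dim=-1) @ v[s:t + 1]

    # NSA mixes the branches with learned gates; a fixed average is used here.
    return (out_cmp + out_sel + out_win) / 3.0


# Example usage on random data:
q = k = v = torch.randn(512, 64)
out = three_path_sparse_attention(q, k, v)   # (512, 64)
```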
Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining have proposed a novel hybrid architecture called RWKV-X, which combines RWKV's efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. After continual pretraining on 64K-token sequences, it shows near-perfect accuracy on the 64K passkey retrieval benchmark. The model consistently outperforms the previous RWKV-7 model on long-context benchmarks while maintaining strong performance on short-context tasks.
RWKV-X is a hybrid architecture that interleaves RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds on existing models using an interleaved block expansion method with zero-initialization. Training follows a two-stage process:
- First, the model is trained on the MiniPile dataset with only a 1024-token context, while all parameters are frozen except the newly added blocks.
- The second stage involves long-context continual pretraining using the ProLong-64K dataset and a context length of 64K tokens, processing approximately 1 billion tokens in total. During this stage, all parameters are unfrozen and optimized jointly. Training uses a Long-context Cross-Entropy (LongCE) loss, which dynamically weights each token according to its importance (a rough sketch of both stages is given below).
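The two-stage recipe can be sketched in a few lines of PyTorch. The freezing logic and the token-weighted cross-entropy below are only schematic; in particular, the `token_importance` callback standing in for LongCE's importance weighting is a placeholder assumption, since the exact weighting scheme is not reproduced here.

```python
import torch
import torch.nn.functional as F

def set_stage_one_trainable(model, new_block_tags=("sparse_attn",)):
    """Stage 1: freeze all parameters except the newly added sparse-attention
    blocks. `new_block_tags` are assumed substrings identifying the new modules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(tag in name for tag in new_block_tags)

def set_stage_two_trainable(model):
    """Stage 2: unfreeze everything for joint long-context optimization."""
    for param in model.parameters():
        param.requires_grad = True

def longce_style_loss(logits, targets, token_importance):
    """Token-weighted cross-entropy in the spirit of LongCE: tokens judged more
    important for long-context modeling receive larger weights."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1), reduction="none")        # (B*T,)
    weights = token_importance(logits, targets).view(-1)            # (B*T,)
    return (weights * ce).sum() / weights.sum()

# With uniform weights the loss reduces to ordinary cross-entropy.
def uniform_importance(logits, targets):
    return torch.ones_like(targets, dtype=torch.float)

logits = torch.randn(2, 16, 1000)                # (batch, seq, vocab)
targets = torch.randint(0, 1000, (2, 16))
loss = longce_style_loss(logits, targets, uniform_importance)
```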
Short-context evaluation shows that RWKV-X maintains competitive performance on standard benchmarks. The smaller RWKV-X (0.22B) scores an average of 51.0, comparable to RWKV-7's 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4) while surpassing Llama-3.2-3B (69.7). These results confirm RWKV-X's effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Furthermore, efficiency analysis demonstrates RWKV-X's superior scaling characteristics for long sequences. At 128K tokens, RWKV-X achieves a 1.37x decoding speedup over Flash-Attention v3, an advantage that continues to grow as the context length increases.
In conclusion, the researchers introduced RWKV-X, a hybrid language model that successfully combines RWKV's efficiency at modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, some limitations remain. First, its sparse attention mechanism relies on top-k chunk selection, a heuristic approach that may overlook semantically relevant dependencies. Second, the current implementation shows sparse attention decoding running slower than vanilla RWKV, indicating that further engineering effort is needed to optimize performance.
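For context on the first limitation, the snippet below illustrates one common form of top-k chunk selection during decoding: each cached key chunk is summarized by its mean vector, the current query keeps only the best-scoring chunks, and attention runs inside them. This is a generic illustration of the heuristic, not the authors' implementation, and it skips masking of padded positions for brevity.

```python
import torch
import torch.nn.functional as F

def topk_chunk_decode_step(q, k_cache, v_cache, chunk_size=64, top_k=4):
    """One decoding step with top-k chunk selection (single head, illustrative).
    q: (d,) current query; k_cache, v_cache: (T, d) cached keys/values."""
    T, d = k_cache.shape
    scale = d ** -0.5

    # Summarize each chunk of cached keys by its mean vector.
    n_chunks = (T + chunk_size - 1) // chunk_size
    pad = n_chunks * chunk_size - T
    k_blk = F.pad(k_cache, (0, 0, 0, pad)).view(n_chunks, chunk_size, d)
    v_blk = F.pad(v_cache, (0, 0, 0, pad)).view(n_chunks, chunk_size, d)
    chunk_repr = k_blk.mean(dim=1)                       # (n_chunks, d)

    # Heuristic: keep only the chunks whose summary best matches the query.
    scores = chunk_repr @ q * scale                      # (n_chunks,)
    keep = scores.topk(min(top_k, n_chunks)).indices

    # Attend only within the selected chunks.
    k_sel = k_blk[keep].reshape(-1, d)
    v_sel = v_blk[keep].reshape(-1, d)
    attn = F.softmax(k_sel @ q * scale, dim=-1)
    return attn @ v_sel                                   # (d,)


q = torch.randn(64)
k_cache = v_cache = torch.randn(1000, 64)
out = topk_chunk_decode_step(q, k_cache, v_cache)        # (64,)
```

Because selection is based on pooled chunk summaries, a single relevant token inside a low-scoring chunk can be dropped, which is exactly the kind of missed dependency the researchers flag.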
Check out the Paper. Also, don't forget to follow us on Twitter.

Sajjad Ansari is a final-year undergraduate student at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to explain complex AI concepts in a clear and accessible way.