Researchers at Fudan University introduce Lorsa: a sparse attention mechanism that recovers the atomic attention units hidden in transformer superposition

Large language models (LLMs) have attracted significant attention in recent years, but understanding their internal mechanisms remains challenging. When investigating individual attention heads in transformer models, researchers have identified specific functions in certain heads; for example, induction-style heads can predict a token such as "Potter" when the corresponding phrase has appeared earlier in the context. Ablation studies confirm the causal relationship between these heads and model behavior. However, most attention heads distribute their focus across diverse contexts without serving a single, explicit function. The challenge is to interpret these complex attention patterns, since heads frequently act in collaboration rather than performing isolated functions. This phenomenon parallels feature superposition in neuron-level interpretability, suggesting the presence of attention superposition in the multi-head self-attention (MHSA) mechanism. Understanding these complex interactions is essential for developing more transparent and controllable language models.
Previous research has made substantial progress in explaining individual attention head functions using techniques such as activation patching and path patching. These methods have identified several specialized attention heads in transformer models, including composition heads, induction heads, name mover heads, numerical comparison heads, copy suppression heads, successor heads, and long-context retrieval heads. However, the superposition hypothesis suggests that neurons correspond to multiple non-orthogonal underlying features rather than a single function. Sparse autoencoders have emerged as a promising method for extracting an overcomplete set of sparse, linearly interpretable features from neural networks. Their success demonstrates the universality of superposition across model sizes, architecture types, and even modalities. While valuable, these approaches still struggle to fully explain the complex interactions among attention heads and their collaborative behavior in language models.
Researchers from the OpenMOSS team at the School of Computer Science, Fudan University, and the Shanghai Innovation Institute introduce Low-Rank Sparse Attention (Lorsa), a method for tackling attention superposition by disentangling MHSA into atomic attention units. Lorsa replaces standard multi-head self-attention with an overcomplete set of attention heads that have single-dimensional OV circuits and sparsity constraints. To evaluate Lorsa, the researchers developed an exploration interface that provides comprehensive information about each Lorsa head, allowing quantitative assessment of interpretability through top activations and attribution patterns. The results show that Lorsa's monosemanticity is comparable to that of sparse autoencoder features. The method was tested on both Pythia-160M and Llama-3.1-8B, successfully identifying known attention mechanisms such as induction heads, name mover heads, successor heads, and attention sinks. Further analysis revealed arithmetic-specific Lorsa heads in Llama-3.1-8B and identified thematic anchor heads with long-range, topic-specific attention patterns. This approach provides unprecedented visibility into transformer attention mechanisms.
Attention superposition in transformer models is analogous to how neurons represent more features than they have dimensions. The research hypothesizes that MHSA comprises multiple attention units in superposition, each operating between specific token pairs with interpretable read/write operations on the residual stream. The hypothesis implies that atomic attention units can be spread across multiple MHSA heads, while a single head can contain multiple units.
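To make the hypothesis concrete (the notation here is a simplified restatement, not the paper's exact formulation), the MHSA output at position i is a sum over heads and source positions, and the claim is that this sum is better described by a much larger collection of rank-one atomic units, only a few of which are active at any position:

\[
\mathrm{MHSA}(x)_i \;=\; \sum_{h=1}^{H} \sum_{j \le i} a^{h}_{ij}\, W_O^{h} W_V^{h} x_j \;\approx\; \sum_{m=1}^{M} \sum_{j \le i} a^{m}_{ij}\, \big(u_m^{\top} x_j\big)\, v_m, \qquad M \gg H,
\]

where each atomic unit m reads a single direction u_m from the residual stream at source position j and writes a single direction v_m at the destination position i.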
Three key pieces of evidence support attention superposition. First, many polysemantic heads respond to unrelated inputs; for instance, successor heads increment days and numbers while also displaying acronym/copying behavior. Second, most attention heads lack clear interpretation patterns, with studies showing that more than 90% of GPT-2 heads defy interpretation attempts. Third, direct observations show attention output features contributed collectively by multiple heads, with approximately 25% of learned attention units distributed across multiple MHSA heads.
Understanding attention superposition is crucial for two reasons. First, attribution-based circuit tracing becomes challenging when features are computed collectively, because individual query-key patterns can be misleading due to interference from other features within the same head. Second, the structure of attention superposition may reveal important aspects of model biology, raising the question of why some attention units, such as induction heads, are implemented by individual MHSA heads rather than existing in superposition.
The Lorsa architecture addresses these challenges with several design choices. Lorsa is trained to predict MHSA output by minimizing mean squared error. It uses single-dimensional OV circuits that restrict each head's read/write operations to specific residual stream features, consistent with the linear representation hypothesis. For query and key weights, Lorsa shares parameters within each group of D_Lorsa^QK heads, maintaining parameter efficiency while preserving performance. This strategy keeps Lorsa's QK circuits similar to MHSA while enforcing sparsity constraints on each OV dimension.
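A minimal PyTorch-style sketch of this design is shown below; it is illustrative only, with hypothetical names (`LorsaSketch`, `n_qk_groups`, `d_qk`) rather than the authors' implementation. Each head reads one residual-stream direction and writes one direction (a rank-one OV circuit), QK parameters are shared within groups of heads, and the layer would be trained to reconstruct the original MHSA output under an MSE loss.

```python
import torch
import torch.nn as nn

class LorsaSketch(nn.Module):
    """Illustrative Lorsa-style layer: many heads, rank-1 OV circuits, shared QK groups."""

    def __init__(self, d_model: int, n_heads: int, n_qk_groups: int, d_qk: int):
        super().__init__()
        assert n_heads % n_qk_groups == 0
        self.n_heads, self.n_qk_groups = n_heads, n_qk_groups
        # Rank-1 OV circuit per head: one read direction (W_V) and one write direction (W_O).
        self.W_V = nn.Parameter(torch.randn(n_heads, d_model) * 0.02)
        self.W_O = nn.Parameter(torch.randn(n_heads, d_model) * 0.02)
        # One shared (W_Q, W_K) pair per group of heads, for parameter efficiency.
        self.W_Q = nn.Parameter(torch.randn(n_qk_groups, d_model, d_qk) * 0.02)
        self.W_K = nn.Parameter(torch.randn(n_qk_groups, d_model, d_qk) * 0.02)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = torch.einsum("bsd,gdk->bgsk", x, self.W_Q)
        k = torch.einsum("bsd,gdk->bgsk", x, self.W_K)
        scores = torch.einsum("bgik,bgjk->bgij", q, k) / (q.shape[-1] ** 0.5)
        causal = torch.triu(torch.ones(s, s, dtype=torch.bool, device=x.device), 1)
        attn = scores.masked_fill(causal, float("-inf")).softmax(dim=-1)
        # Heads in the same group reuse that group's attention pattern.
        attn = attn.repeat_interleave(self.n_heads // self.n_qk_groups, dim=1)
        v = torch.einsum("bsd,hd->bhs", x, self.W_V)      # scalar value per head and position
        z = torch.einsum("bhij,bhj->bhi", attn, v)        # per-head scalar activation
        out = torch.einsum("bhi,hd->bid", z, self.W_O)    # write back into the residual stream
        return out, z

# Training objective (sketch): reconstruct the original layer's MHSA output.
# loss = ((lorsa_out - mhsa_out) ** 2).mean()
```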
Lorsa uses an order of magnitude more heads than standard MHSA, while activating only a subset of them for each token. For each position, Lorsa's output sums only the top-K heads with the largest activation values, and the subset of active heads changes dynamically across token positions. This approach is similar to TopK SAEs, selecting the most salient linear components. While Lorsa resembles a sparse autoencoder, its head activations derive from attention patterns over previous tokens rather than from a simple linear encoder with ReLU.
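Continuing the sketch above, the TopK step can be written as keeping only the K largest head activations at each token position before writing back to the residual stream; the helper below is a hypothetical illustration, not the released code.

```python
def topk_lorsa_output(z: torch.Tensor, W_O: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the top-k head activations per position (sketch).

    z:   (batch, n_heads, seq) per-head scalar activations
    W_O: (n_heads, d_model)    per-head write directions
    """
    topk_vals, topk_idx = z.topk(k, dim=1)                      # k largest activations per position
    z_sparse = torch.zeros_like(z).scatter(1, topk_idx, topk_vals)
    return torch.einsum("bhi,hd->bid", z_sparse, W_O)           # sparse reconstruction of MHSA output
```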
Lorsa's interpretability assessment uses several key metrics to understand individual head functions. Top activations help identify patterns by examining the 16 highest-activating tokens for each Lorsa head across 100 million token samples from held-out data. The z-pattern analysis decomposes activations into per-token contributions from preceding positions, revealing which previous tokens contribute to the current activation. This parallels direct feature attribution analysis for attention sparse autoencoders, but involves only a simple attribution over a single 1D OV circuit and a single QK circuit.
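Because each Lorsa head carries a single scalar value per source position, the z-pattern attribution reduces to an elementwise product of attention weights and values. Reusing the tensors from the sketch above, a hypothetical helper could be:

```python
def z_pattern(attn: torch.Tensor, v: torch.Tensor, head: int, pos: int) -> torch.Tensor:
    """Decompose one Lorsa head's activation at `pos` into per-source-token contributions.

    attn: (batch, n_heads, seq, seq) attention weights
    v:    (batch, n_heads, seq)      scalar values read by each head's 1D OV circuit
    Returns (batch, seq): contribution of each preceding token to z[:, head, pos].
    """
    return attn[:, head, pos, :] * v[:, head, :]
```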
The visualization dashboard provides comprehensive information about each Lorsa head. For example, a "you"-specific induction head shows several notable patterns: it reads primarily from features indicating that the current token is "you"/"your" via its weight vector, strongly activates a "say you" feature that amplifies the "you" logit, and increases the predicted probability of various "you" tokens. The QK attention pattern is computed from a current-token feature at the query position and a previous-token feature at the key position, where the current token is "you" and the preceding token is typically a word such as "with" or "do". Interestingly, this particular Lorsa head is almost evenly distributed across two MHSA heads (5.0 and 5.7), demonstrating how Lorsa successfully disentangles attention units that span multiple standard attention heads.
The results confirm Lorsa's effectiveness in identifying known attention mechanisms across different models. Using path patching, the researchers rediscovered previously documented monosemantic heads in Pythia-160M, including induction heads, name mover heads, copy suppression heads, successor heads, and attention sinks. In Llama-3.1-8B, they identified arithmetic-specific Lorsa heads that activate during simple arithmetic operations, each using a distinct heuristic to fetch operands. They also found "thematic anchor" heads that exhibit long-range attention to topically related tokens, suggesting a mechanism for maintaining persistent topic representations that bias subsequent predictions toward domain-appropriate vocabulary and structure.
Low-Rank Sparse Attention successfully disentangles atomic attention units from attention superposition in transformer models. The method effectively recovers known attention mechanisms while revealing new interpretable behaviors, demonstrating its value for neural network interpretability. Despite these advances, major challenges remain in disentangling QK circuits to achieve fully independent heads and in reducing the effects of superposition. Future research directions include exploring low-dimensional QK structure, cross-layer superposition, and systematic Q/K/V composition.
Check out the Paper, Model on Hugging Face, and GitHub Page. Also, don't forget to follow us on Twitter.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who researches applications of machine learning in healthcare.