LLMs Can Now Reason in Parallel: UC Berkeley and UCSF Researchers Introduce Adaptive Parallel Reasoning to Scale Inference Efficiently Without Exceeding Context Windows

Large language models (LLMs) have made significant progress in reasoning capabilities, exemplified by breakthrough systems such as OpenAI o1 and DeepSeek-R1, which use test-time compute for search and reinforcement learning to optimize performance. Despite this progress, current methodologies face critical challenges that limit their effectiveness. Serialized chain-of-thought approaches generate excessively long output sequences, increasing latency and pushing against context-window limits. In contrast, parallel approaches such as best-of-N and self-consistency suffer from poor coordination between inference paths and lack end-to-end optimization, resulting in computational inefficiency and limited improvement potential. Likewise, structured inference-time search techniques such as Tree-of-Thoughts rely on manually designed search structures, which significantly limits their flexibility and ability to scale across different reasoning tasks and domains.
Several approaches have emerged to address the computational challenges of LLM reasoning. Inference-time scaling methods improve downstream task performance by increasing test-time compute, but typically produce significantly longer output sequences. This creates high latency and forces models to fit entire reasoning chains into a single context window, making it difficult to attend to relevant information. Parallelization strategies such as ensembling attempt to mitigate these issues by running multiple independent language-model calls simultaneously. However, the poor coordination across parallel threads in these methods leads to redundant computation and inefficient resource utilization. Fixed parallelizable reasoning structures, such as Tree-of-Thoughts and multi-agent reasoning systems, have been proposed, but their hand-designed search structures limit flexibility and scalability. Other approaches, such as PASTA, decompose tasks into parallel subtasks but ultimately reintegrate the complete context into the main reasoning trajectory, failing to effectively reduce context usage. Meanwhile, Hogwild! Inference employs parallel worker threads but relies exclusively on prompting, without end-to-end optimization.
Researchers from UC Berkeley and UCSF have proposed Adaptive Parallel Reasoning (APR), a robust approach that enables language models to dynamically distribute inference-time compute across serial and parallel operations. The method generalizes existing reasoning approaches, including serialized chain-of-thought reasoning, parallelized inference with self-consistency, and structured search, by training models to determine when and how to parallelize inference operations rather than imposing a fixed search structure. APR introduces two key innovations: a parent-child threading mechanism and end-to-end reinforcement learning optimization. The threading mechanism allows parent threads to delegate subtasks to multiple child threads through a spawn() operation, enabling parallel exploration of distinct reasoning paths. Child threads then return their outcomes to the parent thread through a join() operation, allowing the parent to continue decoding with this new information. Built on the SGLang model-serving framework, APR significantly reduces real-world latency by performing inference in child threads simultaneously through batching. The second innovation, fine-tuning via end-to-end reinforcement learning, optimizes for overall task success without requiring predefined reasoning structures. This approach offers three significant advantages: higher performance within fixed context windows, superior scaling with increased compute budgets, and improved performance at equivalent latency compared to traditional methods.
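To make the parent-child flow concrete, here is a minimal Python sketch of the delegation pattern described above. The helper names (generate, spawn_children, parent_reason) and the plain thread pool are illustrative assumptions, not the authors' API; in the actual system the model itself emits spawn()/join() operations and child decoding is batched inside SGLang.

```python
# Minimal sketch of APR-style parent/child delegation (hypothetical helper names).
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    """Stand-in for a call to the shared language model (e.g., a served endpoint)."""
    return f"[model output for: {prompt[:40]}...]"

def spawn_children(subtask_prompts):
    """spawn(): run each child thread's reasoning in parallel with the same model."""
    with ThreadPoolExecutor(max_workers=max(1, len(subtask_prompts))) as pool:
        return list(pool.map(generate, subtask_prompts))

def parent_reason(problem: str) -> str:
    # 1) The parent decodes until it decides which branches to explore in parallel.
    plan = generate(f"Problem: {problem}\nList branches to explore, one per line.")
    subtasks = [line for line in plan.splitlines() if line.strip()]
    # 2) spawn(): child threads explore their branches independently and concurrently.
    child_results = spawn_children(subtasks)
    # 3) join(): only the children's short summaries re-enter the parent's context,
    #    so the parent never pays for the children's full search traces.
    joined = "\n".join(child_results)
    return generate(f"Problem: {problem}\nChild findings:\n{joined}\nFinal answer:")

print(parent_reason("Countdown: reach 36 from 3, 5, 7"))
```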
The APR architecture implements a sophisticated multi-threading mechanism that enables language models to orchestrate parallel inference processes dynamically. APR addresses the limitations of serialized reasoning methods by distributing computation across parent and child threads, minimizing latency while improving performance within context constraints. The architecture consists of three key components:
First, the multi-threading inference system allows parent threads to spawn multiple child threads using a spawn(msg) operation. Each child thread receives a distinct context and performs inference independently, yet simultaneously, using the same language model. When a child thread completes its task, it returns its result to the parent through a join(msg) operation, selectively communicating only the most relevant information. This approach dramatically reduces token usage by keeping intermediate search traces confined to the child threads.
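One way to picture this message-passing interface is as lightweight tags that the model emits and the serving layer parses. The exact spawn(...)/join(...) text format below is assumed for illustration and is not taken from the paper.

```python
import re

# Hypothetical text format for the spawn/join protocol inside model output.
# A parent might emit:    spawn(try 5+7 first; try 3*5 first)
# A finished child emits: join(5+7=12, then 3*12=36 -> solved)
SPAWN_RE = re.compile(r"spawn\((?P<msgs>.*?)\)", re.DOTALL)
JOIN_RE = re.compile(r"join\((?P<msg>.*?)\)", re.DOTALL)

def parse_spawn(parent_output: str):
    """Extract the subtask messages a parent hands to its child threads."""
    match = SPAWN_RE.search(parent_output)
    return [s.strip() for s in match.group("msgs").split(";")] if match else []

def parse_join(child_output: str) -> str:
    """Keep only the child's short join() message; its full search trace is dropped,
    which is what keeps the parent's context window small."""
    match = JOIN_RE.search(child_output)
    return match.group("msg").strip() if match else ""

print(parse_spawn("... spawn(try 5+7 first; try 3*5 first) ..."))
print(parse_join("many intermediate search steps ... join(5+7=12, then 3*12=36 -> solved)"))
```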
Second, the training methodology follows a two-phase approach. Initially, APR uses supervised learning with automatically generated demonstrations that incorporate both depth-first and breadth-first search strategies, creating hybrid search patterns. The demonstrations are produced by symbolic solvers with parallelization, decomposing searches into components that avoid context-window bottlenecks during both training and inference.
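As a toy illustration of how such demonstrations might be decomposed (the experiments use the Countdown task, where numbers are combined with +, -, *, / to reach a target), the sketch below splits a symbolic search into one child trace per first move: breadth at the parent, depth inside each child. The trajectory format and decomposition are simplified assumptions, not the authors' data pipeline.

```python
from itertools import combinations

# Countdown-style toy: combine numbers with +, -, *, / to hit a target.
OPS = [("+", lambda a, b: a + b), ("-", lambda a, b: a - b),
       ("*", lambda a, b: a * b), ("/", lambda a, b: a / b if b else None)]

def dfs(nums, target, trace):
    """Depth-first search over pairwise combinations; returns a solving trace or None."""
    if len(nums) == 1:
        return trace if abs(nums[0] - target) < 1e-6 else None
    for i, j in combinations(range(len(nums)), 2):
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        for sym, fn in OPS:
            val = fn(nums[i], nums[j])
            if val is None:
                continue
            found = dfs(rest + [val], target, trace + [f"{nums[i]}{sym}{nums[j]}={val:g}"])
            if found:
                return found
    return None

def make_demonstration(nums, target):
    """Parent demo spawns one child per first move (breadth at the parent);
    each child demo is an independent depth-first trace (depth in the children)."""
    parent_spawns, child_traces = [], []
    for i, j in combinations(range(len(nums)), 2):
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        for sym, fn in OPS:
            val = fn(nums[i], nums[j])
            if val is None:
                continue
            first = f"{nums[i]}{sym}{nums[j]}={val:g}"
            parent_spawns.append(f"spawn(explore {first})")
            child_traces.append(dfs(rest + [val], target, [first]) or [first, "dead end"])
    return parent_spawns, child_traces

spawns, children = make_demonstration([3, 5, 7], 36)   # (5+7)*3 = 36
for s, c in zip(spawns, children):
    if c[-1] != "dead end":
        print(s, "->", c)
```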
Finally, the system implements end-to-end reinforcement learning optimization with GRPO (Group Relative Policy Optimization). In this phase, the model learns to strategically determine when and how broadly to invoke child threads, optimizing computational efficiency and reasoning effectiveness. The model iteratively samples reasoning trajectories, evaluates their correctness, and adjusts its parameters accordingly, ultimately learning to balance parallel exploration against context-window constraints for maximum performance.
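The group-relative advantage at the core of GRPO can be sketched in a few lines; the binary reward below and the omission of the clipped policy-gradient update are simplifications for illustration.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each trajectory's reward against its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled reasoning trajectories for one problem; reward is 1 when the
# final answer is correct (and the trajectory respects its budget), 0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))   # roughly [1.0, -1.0, -1.0, 1.0]

# Tokens of above-average trajectories (parent and child alike) are reinforced in the
# policy-gradient update, which is how the model learns when spawning more children pays off.
```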
The evaluation compares Adaptive Parallel Reasoning against serialized chain-of-thought reasoning and self-consistency methods, using a standard decoder-only language model with 228 million parameters built on the Llama-2 architecture and supporting a 4,096-token context window. All models are initialized through supervised learning on 500,000 trajectories from symbolic solvers. For direct compute-accuracy comparisons, the team implemented a budget-constraint method, with context-window conditioning for the SoS+ models and thread-count conditioning for the APR models. The SGLang framework is used for inference due to its support for continuous batching and RadixAttention, which enables efficient APR implementation.
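One simple way to picture this budget conditioning is as a prefix prepended to each prompt, so accuracy can be compared at matched budgets. The tag format below is an assumption for illustration only.

```python
from typing import Optional

def conditioned_prompt(problem: str,
                       context_budget: Optional[int] = None,
                       child_threads: Optional[int] = None) -> str:
    """Prepend a compute-budget condition to a problem prompt.
    SoS+-style models are conditioned on a context-window budget, APR-style models on
    the number of child threads they may spawn (tag format is purely illustrative)."""
    tags = []
    if context_budget is not None:
        tags.append(f"<budget:{context_budget} tokens>")
    if child_threads is not None:
        tags.append(f"<threads:{child_threads}>")
    return " ".join(tags + [problem])

print(conditioned_prompt("Countdown: reach 36 from 3, 5, 7", context_budget=4096))
print(conditioned_prompt("Countdown: reach 36 from 3, 5, 7", child_threads=10))
```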
The experimental results demonstrate that APR consistently outperforms serialized methods across multiple dimensions. When scaling with higher compute, APR initially underperforms in low-compute regimes due to parallelism overhead, but significantly outpaces SoS+ as compute increases, achieving a 13.5% improvement at 20k tokens and surpassing SoS+ pass@8 performance while using 57.4% less compute. For context-window scaling, APR consistently exploits context more efficiently, with 10 threads achieving approximately 20% higher accuracy at the 4k-token limit by distributing reasoning across parallel threads rather than containing an entire trajectory within a single context window.
End-to-end reinforcement learning significantly boosts APR performance, raising accuracy from 75.5% to 83.4%. The RL-optimized models exhibit markedly different behaviors, increasing both sequence length (a 22.1% relative increase) and the number of child threads (a 34.4% relative increase). This reveals that for the Countdown task, RL-optimized models favor broader search patterns over deeper ones, demonstrating the algorithm's ability to discover optimal search strategies automatically.
APR demonstrates superior efficiency in both theoretical and practical evaluations. When measuring sequential token usage, APR significantly boosts accuracy with minimal additional sequential tokens beyond 2,048, rarely exceeding 2,500 tokens, whereas SoS+ shows only marginal improvements despite approaching 3,000 tokens. Real-world latency testing on an 8-GPU NVIDIA RTX A6000 server shows that APR achieves a substantially better accuracy-latency trade-off, reaching 75% accuracy at 5,000 ms per sample, an 18% absolute improvement over SoS+'s 57% at the same latency. These results highlight APR's effective hardware parallelization and its potential for optimized performance in deployment scenarios.
Adaptive Parallel Reasoning represents a significant advance in language-model reasoning capabilities, enabling the dynamic distribution of computation across serial and parallel paths through a parent-child threading mechanism. By combining supervised training with end-to-end reinforcement learning, APR removes the need for manually designed structures while allowing models to develop optimal parallelization strategies. Experimental results on the Countdown task demonstrate APR's substantial advantages: higher performance within fixed context windows, superior scaling with increased compute budgets, and significantly improved success rates at equivalent latency constraints. These achievements highlight the potential of reasoning systems that dynamically structure their inference processes to achieve enhanced scalability and efficiency in complex problem-solving tasks.
Check out the Paper.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who researches applications of machine learning in healthcare.
