LLMs can think while idle: Researchers from Letta and UC Berkeley introduce "sleep-time compute" to cut inference costs and improve accuracy without sacrificing latency

Large language models (LLMs) have gained prominence for their ability to handle complex reasoning tasks, transforming applications from chatbots to code-generation tools. These models are known to benefit greatly from scaling up computation during inference, often achieving higher accuracy by dedicating more resources to hard problems. However, this approach has considerable drawbacks: longer processing times and higher compute costs make it difficult to scale such solutions in real-world settings where responsiveness and affordability are critical. As the field moves toward smarter systems, it is increasingly important to explore how LLMs can become not only more capable but also more efficient, especially when operating in repetitive or familiar environments.

One of the biggest inefficiencies in current LLM deployments arises during query resolution. Typically, when a user asks a question, the model processes it together with the necessary context. Test-time compute assumes that the context and the question always arrive together. But in practical cases, such as document Q&A or code debugging, the context is usually persistent and available before a specific question is asked. Yet even when the context has been seen before, the model processes everything from scratch for every query. This redundancy drives up computational cost and response latency, especially when multiple queries target the same context.

To address this inefficiency, various methods have been developed. Sequential and parallel test-time compute are the two main strategies. Sequential methods extend the model's reasoning chain so it can consider more possibilities, while parallel methods sample multiple outputs simultaneously and select among them, commonly reported as pass@k. Techniques such as speculative decoding aim to reduce latency by guessing tokens early, but their usefulness is limited when the model still has to reason from scratch. While helpful, these methods do not avoid reprocessing the context with each new question. They also often require conditions at test time that are not always feasible, such as access to an oracle or an ideal verifier.
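For reference, the pass@k baseline mentioned above is usually computed with the standard unbiased estimator popularized by code-generation evaluations. The sketch below illustrates only that general formula, not the paper's exact experimental setup:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    samples (drawn without replacement from n total) is correct,
    given that c of the n samples are correct."""
    if n - c < k:
        return 1.0
    # 1 - probability that all k chosen samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 parallel samples, 3 of them correct, budget of k=5
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```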

Researchers from Letta and UC Berkeley propose a novel solution they call sleep-time compute. The method leverages the idle time between user interactions to do useful work. Instead of waiting for the user's question, the model begins analyzing the context in advance, anticipating likely future queries and building an enriched version of the context containing relevant inferences. When the user finally asks a question, the model can simply consult this preprocessed context. Because much of the thinking has already been done, far less computation is needed to produce an accurate answer. The approach becomes even more efficient when multiple questions relate to the same context, since the shared inference work is amortized across queries.

Implementing sleep-time compute relies on decomposing the traditional prompt into two parts: a static context and a dynamic query. During the sleep-time window, only the context is used to generate a preprocessed version, called c'. This enriched context is constructed with standard test-time compute techniques such as reasoning chains or summarization. Once stored, it replaces the original context during the live query, so the final answer can be generated with far fewer resources. The system not only minimizes redundant reasoning but also paves the way for LLMs that think ahead and arrive better prepared.
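A minimal sketch of how such a two-phase pipeline might look, assuming a generic OpenAI-style chat-completion client; the prompts, function names, and model identifier below are illustrative assumptions, not the paper's or Letta's actual code:

```python
# Illustrative sketch of sleep-time compute: precompute an enriched
# context c' while idle, then answer live queries against c'.
# Prompts, function names, and model choice are assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # any chat model would do for the sketch

def sleep_time_compute(context: str) -> str:
    """Run during idle time: reason over the raw context and return an
    enriched context c' with likely-useful inferences already worked out."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": "Study the context, draw the inferences a future "
                        "question is likely to need, and restate the context "
                        "augmented with those inferences."},
            {"role": "user", "content": context},
        ],
    )
    return resp.choices[0].message.content

def answer_query(enriched_context: str, query: str) -> str:
    """Run at query time: answer cheaply against the enriched context c'."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": "Answer using the provided, pre-analyzed context. "
                        "Be concise; most of the reasoning is already done."},
            {"role": "user",
             "content": f"Context:\n{enriched_context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

# Usage: one idle-time pass, then many cheap queries over the same context.
# c_prime = sleep_time_compute(document_text)
# for q in user_queries:
#     print(answer_query(c_prime, q))
```

The key design point is that the expensive call happens once per context rather than once per query, which is where the amortization described later comes from.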

To evaluate the effectiveness of sleep-time compute, the team tested it on two specially designed benchmarks: Stateful GSM-Symbolic and Stateful AIME. Both datasets are derived from existing problem sets by splitting each problem into a separate context and question. In experiments with models such as GPT-4o and GPT-4o-mini, the researchers observed a roughly 5-fold reduction in test-time compute at similar accuracy. Notably, accuracy increased by up to 13% on the GSM-Symbolic P2 dataset and by up to 18% on Stateful AIME when sleep-time compute was scaled up. Multi-Query GSM-Symbolic, a new dataset introduced for this evaluation, helped show that when 10 queries share the same context, the average cost per query can be reduced by about 2.5 times.

When compared with popular strategies like pass@k, sleep-time compute consistently comes out ahead. Unlike pass@k, which assumes a perfect verifier, sleep-time compute works under more realistic conditions. The results show that it achieves comparable or better accuracy while consuming fewer tokens, even at low test-time compute budgets. For example, GPT-4o-mini with sleep-time compute reached higher accuracy at fewer than 200 test-time tokens than it did without it. Similar improvements were observed when models such as Claude Sonnet 3.7 and DeepSeek-R1 were evaluated.

Scaling up the amount of compute dedicated to sleep time further improves outcomes. By running five parallel generations during sleep time on complex tasks, the researchers pushed the Pareto frontier further out, although they note diminishing returns beyond that point. Importantly, the results show that more capable models benefit more from extra sleep-time compute. Likewise, amortizing sleep-time compute becomes highly cost-effective when a context serves multiple related queries. Weighting test-time tokens ten times higher than sleep-time tokens, in line with industry latency-to-cost ratios, the researchers confirmed that the average cost per query fell by about 2.5 times.
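As a rough illustration of the amortization arithmetic: the token counts below are invented for the example; only the 10:1 weighting of test-time over sleep-time tokens and the sharing of one context across several queries come from the article.

```python
# Hypothetical token counts; only the 10x weighting and the per-context
# sharing of sleep-time work reflect the article's setup.
SLEEP_TOKENS_PER_CONTEXT = 4000    # one-off sleep-time pass (assumed)
TEST_TOKENS_WITH_SLEEP = 150       # per-query cost after preprocessing (assumed)
TEST_TOKENS_WITHOUT_SLEEP = 500    # per-query cost reasoning from scratch (assumed)
QUERIES_PER_CONTEXT = 10           # related queries sharing one context
TEST_TOKEN_WEIGHT = 10             # test-time tokens weighted 10x sleep-time tokens

baseline = TEST_TOKENS_WITHOUT_SLEEP * TEST_TOKEN_WEIGHT
amortized = (SLEEP_TOKENS_PER_CONTEXT / QUERIES_PER_CONTEXT
             + TEST_TOKENS_WITH_SLEEP * TEST_TOKEN_WEIGHT)
print(f"baseline weighted cost per query:  {baseline}")        # 5000
print(f"amortized weighted cost per query: {amortized:.0f}")   # 1900
print(f"reduction: {baseline / amortized:.1f}x")                # ~2.6x
```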

Another interesting finding is that sleep-time compute works best when user queries are predictable. Using Llama2-70B, the researchers scored how predictable each query was given its context and found a strong correlation: the more predictable the query, the greater the benefit. For questions that follow logically from the given context, sleep-time compute yields larger gains. Less predictable queries still benefit compared with test-time-only methods, but to a smaller degree.
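One plausible way to operationalize such a predictability score is the average log-probability of the query tokens conditioned on the context under a causal language model. The sketch below assumes a Hugging Face model ID and a simplified scoring recipe; neither is taken from the paper:

```python
# Sketch: score query predictability as the mean token log-probability of the
# query given its context. Model ID and recipe are assumptions, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # any causal LM works for the idea
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def query_predictability(context: str, query: str) -> float:
    """Average log-prob of the query tokens given the context
    (higher = more predictable). Tokenization boundary handling is simplified."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + "\n" + query, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits
    # Log-probs at each position for predicting the next token
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only positions that predict query tokens
    query_lp = token_lp[:, ctx_ids.shape[1] - 1:]
    return query_lp.mean().item()
```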

Overall, this study proposes an intelligent and scalable technique that improves LLM efficiency without compromising accuracy. By exploiting otherwise idle time, sleep-time compute reduces the burden on real-time systems, lowers operating costs, and improves response times. Clear quantitative gains, such as the 5-fold reduction in test-time compute, the 13-18% accuracy improvements, and the 2.5-fold drop in cost per query, suggest that forward-looking approaches like this could shape the next generation of intelligent, context-aware assistants.

Several key points of the research are as follows:

  • Sleep-time compute lets the model reason over the context and anticipate likely queries before they arrive.
  • Scaling up sleep-time compute improved accuracy by up to 13% on Stateful GSM-Symbolic (P2) and up to 18% on Stateful AIME.
  • Test-time compute requirements dropped by roughly 5x at similar accuracy levels.
  • Sharing a context across 10 related queries reduced the average cost per query by about 2.5x.
  • Sleep-time compute outperforms the pass@k parallel-sampling baseline at the same compute budget.
  • Gains are largest for predictable queries, as measured by log-probability scores.
  • Diminishing returns appear beyond about five parallel generations of sleep-time compute.

Check out the Paper.



Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
