
Boosting Multi-Agent AI: RecursiveMAS Cuts Tokens by 75% and Speeds Inference 2.4x

Posted by u/Fonarow · 2026-05-16 18:16:34

Multi-agent AI systems have great potential for tackling complex tasks, but they often struggle with communication bottlenecks. Traditional systems rely on generating and sharing text tokens, which leads to high latency, soaring token costs, and difficulties in training the entire system cohesively. Researchers from the University of Illinois Urbana-Champaign and Stanford University have developed RecursiveMAS, a groundbreaking framework that lets agents collaborate through embedding space instead of text. This simple shift brings dramatic improvements: inference speed increases by 2.4 times, token usage drops by 75%, and accuracy improves across domains like code generation, medical reasoning, and search. Moreover, RecursiveMAS is far cheaper to train than standard fine-tuning or LoRA methods, making it a scalable, cost-effective solution for custom multi-agent systems. Below, we answer key questions about how it works and why it matters.

What is the main problem with current multi-agent AI systems?

Current multi-agent AI systems communicate by generating and sharing text sequences. While this seems natural, it introduces several critical inefficiencies. First, each agent must generate text token-by-token, and the next agent can only start processing after the previous one finishes — creating sequential latency. Second, generating long text strings consumes enormous numbers of tokens, driving up compute costs. Third, because all information must be spelled out in text, the system becomes highly verbose, wasting resources on intermediate reasoning steps that could be encoded more compactly. Finally, training the whole system end-to-end is extremely difficult, as updating parameters across multiple models is computationally non-trivial. These bottlenecks limit how quickly multi-agent systems can scale and adapt to real-world tasks.
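
The sequential-latency point can be made concrete with a toy step count. This is an illustrative sketch, not the paper's code; the agents, token counts, and step model are all hypothetical:

```python
# Toy illustration: count sequential steps when Agent A hands off to Agent B
# via text versus via a single embedding vector.

def text_handoff(message_tokens: int, reply_tokens: int) -> int:
    """Agent A generates its message token-by-token; Agent B can only
    start once A has finished, then generates its reply the same way."""
    return message_tokens + reply_tokens  # every token is one sequential step

def embedding_handoff(reply_tokens: int) -> int:
    """Agent A emits one embedding in a single forward pass; Agent B
    consumes it directly and only generates the final reply."""
    return 1 + reply_tokens

steps_text = text_handoff(message_tokens=400, reply_tokens=100)
steps_embed = embedding_handoff(reply_tokens=100)
print(steps_text, steps_embed)  # 500 vs 101 sequential steps
```

The intermediate 400-token message is where the verbosity cost accrues; replacing it with a one-step handoff is what the embedding channel buys.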

Source: venturebeat.com

How does RecursiveMAS solve the communication bottleneck?

Instead of forcing agents to share information through text, RecursiveMAS enables them to communicate directly through embedding space. Embeddings are compact vector representations of data that capture semantic meaning without requiring full natural language strings. By transmitting embeddings between agents, the framework bypasses the need for token-by-token generation and reading. This allows agents to pass high-dimensional information in a single step, slashing latency and token usage dramatically. Think of it as sending a compressed, meaningful signal instead of a long written message. As a result, the entire multi-agent system becomes much more efficient, since agents no longer have to wait for each other to finish writing and reading verbose text. This embedding-based communication is a key innovation that makes the reported 2.4x speedup and 75% token reduction possible.
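
A minimal sketch of the idea, with each "agent" reduced to a toy transform (the real agents would be full language models; the linear maps and dimensions here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding width (illustrative)

# Hypothetical agent "models": each is just a linear map + nonlinearity here.
W_a = rng.standard_normal((D, D)) / np.sqrt(D)
W_b = rng.standard_normal((D, D)) / np.sqrt(D)

def agent_a(x: np.ndarray) -> np.ndarray:
    """Agent A compresses its reasoning into a single embedding vector
    instead of decoding a long text string."""
    return np.tanh(W_a @ x)

def agent_b(msg: np.ndarray) -> np.ndarray:
    """Agent B consumes A's embedding directly — no token-by-token reading."""
    return np.tanh(W_b @ msg)

x = rng.standard_normal(D)
out = agent_b(agent_a(x))  # one handoff, one vector, zero tokens generated
print(out.shape)
```

The handoff is a single forward pass: no decoding on A's side, no re-encoding of a text string on B's side.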

What are the performance improvements reported for RecursiveMAS?

Experiments show that RecursiveMAS delivers three major performance gains. First, inference speed increases by 2.4 times compared to baseline multi-agent systems that use text-based communication. Second, token usage drops by 75%, drastically reducing compute costs. Third, accuracy improves across several complex domains: code generation tasks see higher correctness, medical reasoning benchmarks show better diagnostic accuracy, and search tasks yield more relevant results. These improvements come without sacrificing model quality — in fact, the collaborative embedding space helps agents preserve more context than verbose text. The system also proves more stable during training, as it avoids the sequential bottlenecks that plague text-based pipelines. Overall, RecursiveMAS sets a new efficiency standard for multi-agent inference.

Why is training entire multi-agent systems difficult?

Training a multi-agent system as a unified whole is challenging for two main reasons. First, each agent typically has its own underlying model with many parameters. Updating all models jointly requires complex optimization and huge computational resources — often prohibitive for real-world applications. Second, when agents communicate via text, gradients cannot propagate through the discrete token-sampling step at all: sampling a token is a non-differentiable operation, so the communication channel breaks end-to-end backpropagation. Even with techniques like prompt-based adaptation, the underlying model weights stay static, limiting the system's ability to improve. RecursiveMAS tackles both issues: by using embedding space, it makes the communication channel continuous and differentiable, so gradients flow cleanly across agents. And because it's designed for co-evolution of the whole system, it reduces the number of parameters that need updating, lowering training costs significantly — cheaper than full fine-tuning or LoRA.
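
The differentiability contrast can be checked numerically. In this sketch (illustrative, not the paper's method) the text channel is modeled as an argmax over token logits and the embedding channel as a softmax-weighted vector; a finite-difference probe shows the former carries no gradient signal:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])

def text_channel(z):
    # Emitting a discrete token: piecewise-constant in the logits.
    return float(np.argmax(z))

def embedding_channel(z):
    # Softmax-weighted embedding: smooth in the logits, so gradients flow.
    e = np.exp(z - z.max())
    return (e / e.sum()) @ np.array([1.0, 2.0, 3.0])

eps = 1e-4
# Finite-difference "gradient" with respect to the first logit:
g_text = (text_channel(logits + [eps, 0, 0]) - text_channel(logits)) / eps
g_embed = (embedding_channel(logits + [eps, 0, 0]) - embedding_channel(logits)) / eps
print(g_text, g_embed)  # text channel: 0.0 (no signal); embedding: nonzero
```

A zero gradient almost everywhere is why end-to-end training through a text channel fails without tricks like reinforcement-style estimators, while the embedding channel supports plain backpropagation.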

How does RecursiveMAS differ from prompt-based adaptation?

Prompt-based adaptation improves multi-agent interactions by iteratively refining the shared context given to agents. This works like a director guiding agents to generate better responses. However, the fundamental limitation is that the underlying model weights remain unchanged — only the prompts are updated. This means the agents’ core capabilities stay fixed, limiting long-term learning. RecursiveMAS takes a different approach: it trains the entire multi-agent system by updating the model weights, but in a much more efficient way. Instead of treating each agent as a separate component, RecursiveMAS co-evolves them as a single integrated whole. The embedding-based communication allows gradients to propagate smoothly across agents, enabling true end-to-end learning. This makes RecursiveMAS a scalable, cost-effective blueprint for custom multi-agent systems that need to adapt over time.

What is the inspiration behind RecursiveMAS?

The framework draws inspiration from recursive language models (RLMs). In a standard language model, data flows linearly through a stack of distinct layers. An RLM, by contrast, reuses a small set of shared layers, looping the computation to deepen the network without adding new parameters. RecursiveMAS applies this recursive principle to multi-agent systems: rather than stacking independent agents in a pipeline, it allows agents to share layers and feed information back to themselves. This recursive structure enables the system to deepen its reasoning while keeping the parameter count low. Combined with embedding-based communication, the result is a compact, trainable multi-agent system that avoids the overhead of separate fine-tuning or LoRA adaptation. This design choice is why RecursiveMAS can achieve dramatic efficiency gains while remaining cost-effective.
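
The recursive principle can be sketched in a few lines. This toy (one shared weight matrix, depths and width chosen arbitrarily) only illustrates the parameter-sharing idea, not the actual RecursiveMAS architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8

# One shared layer's weights — the ONLY parameters in this sketch.
W_shared = rng.standard_normal((D, D)) / np.sqrt(D)

def shared_layer(h: np.ndarray) -> np.ndarray:
    return np.tanh(W_shared @ h)

def recursive_forward(x: np.ndarray, depth: int) -> np.ndarray:
    """Loop the same shared layer `depth` times: effective depth grows
    while the parameter count stays constant."""
    h = x
    for _ in range(depth):
        h = shared_layer(h)
    return h

x = rng.standard_normal(D)
shallow = recursive_forward(x, depth=2)
deep = recursive_forward(x, depth=8)  # 4x deeper, zero extra parameters
print(W_shared.size)  # parameter count is fixed regardless of depth
```

A conventional stack of 8 distinct layers would need 8 separate weight matrices; the recursive version deepens computation by iterating one.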

In what domains was RecursiveMAS tested and what were the results?

RecursiveMAS was evaluated across three demanding domains: code generation, medical reasoning, and search. In code generation, it produced more accurate and executable code than baseline multi-agent systems. For medical reasoning, the framework improved diagnostic accuracy and clinical decision support. In search tasks, it returned more relevant results with fewer iterations. Across all three domains, RecursiveMAS consistently achieved accuracy improvements while simultaneously delivering the 2.4x inference speedup and 75% token reduction. These results highlight that using embedding space for agent collaboration not only boosts efficiency but also enhances output quality. The framework proved robust across different model backbones and task complexities, making it a reliable choice for deploying scalable multi-agent AI solutions in production environments.

How does RecursiveMAS compare to standard training methods in terms of cost?

Standard approaches like full fine-tuning or LoRA typically require updating a large number of parameters for each agent individually. This becomes prohibitively expensive when scaling to multiple agents, as each round of training consumes vast GPU hours and memory. RecursiveMAS is significantly cheaper to train than these methods. Because it leverages recursive layers and embedding communication, the total parameter count that needs updating is far lower. Instead of fine-tuning each agent's entire model, RecursiveMAS co-evolves the system with a compact set of shared weights. The result is a training cost that is a fraction of full fine-tuning or even LoRA-based approaches. This cost efficiency, combined with the performance gains, makes RecursiveMAS a practical and scalable solution for organizations looking to deploy custom multi-agent systems without breaking their budgets.
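
The scaling argument can be made tangible with back-of-the-envelope arithmetic. All numbers below (backbone size, LoRA fraction, shared-block size) are illustrative assumptions, not figures from the paper:

```python
# Rough trainable-parameter comparison for a 4-agent system under
# three training regimes. Every number here is a placeholder.

n_agents = 4
params_per_model = 7_000_000_000  # e.g. a 7B-parameter backbone per agent
lora_fraction = 0.001             # LoRA adapters often train ~0.1% of weights
shared_block = 5_000_000          # hypothetical shared recursive block

full_ft = n_agents * params_per_model                     # every weight, every agent
lora = n_agents * int(params_per_model * lora_fraction)   # adapters per agent
recursive_mas = shared_block                              # one co-evolved shared block

print(f"full fine-tuning:       {full_ft:,}")
print(f"LoRA (per-agent):       {lora:,}")
print(f"shared recursive block: {recursive_mas:,}")
```

The key structural point is that per-agent methods multiply their cost by the number of agents, while a shared recursive block does not.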