Nous Research Unveils Token Superposition Training: A 2.5x Faster LLM Pre-Training Method
Large language model pre-training is notoriously expensive, so even small efficiency gains translate into significant cost and time savings. Nous Research has introduced a technique called Token Superposition Training (TST) that cuts pre-training wall-clock time by up to 2.5x without altering the model architecture, optimizer, tokenizer, parallelism strategy, or training data. The method works by compressing input sequences during a portion of training, letting the model process more text per unit of compute. Below, we break down how TST works, what problem it solves, and the results it has achieved.
What is Token Superposition Training (TST)?
Token Superposition Training (TST) is a two-phase pre-training approach developed by Nous Research that significantly reduces the time required to train large language models. In the first phase, called superposition, consecutive tokens are grouped into non-overlapping bags and averaged in the embedding layer to create a single latent token per bag. The transformer then processes these shorter sequences—effectively ingesting more text per step without increasing computational cost. After a predefined fraction of training steps, the method switches to a recovery phase that uses standard next-token prediction. TST does not require any changes to the model's architecture, optimizer, tokenizer, or data, making it a drop-in efficiency boost.

What Problem Does TST Solve?
Modern LLM pre-training is data-hungry, and models are often overtrained well beyond compute-optimal estimates. A key bottleneck is raw text throughput: how much data a model can process per FLOP. Subword tokenizers like BPE already improve throughput by compressing raw text into shorter token sequences. TST pushes this throughput lever further during training without permanently changing the model. By grouping tokens during the superposition phase, TST lets the model see more text per unit of compute (with a bag size of 2, for instance, each FLOP-matched step ingests twice as much text), reducing the wall-clock time needed to reach a given training loss. This addresses the growing need for faster, more cost-effective pre-training.
How Does the Superposition Phase Work?
In the superposition phase, the input sequence of length L is divided into non-overlapping bags of s contiguous tokens. Each bag is collapsed into a single latent s-token by averaging the token embeddings. The transformer then processes a sequence of length L/s. To keep each training step equal in FLOPs to a standard step, the data sequence length is increased by a factor of s. This means the model ingests s times more text per unit of compute. On the output side, each latent position predicts the next bag of s tokens using a multi-hot cross-entropy (MCE) loss. This loss assigns equal probability mass 1/s to each token in the target bag and can be computed as a simple mean of standard cross-entropy terms—no new kernel or auxiliary head is needed.
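To make the mechanics concrete, here is a minimal sketch of one FLOP-matched superposition step in PyTorch. It is an illustration under assumptions, not Nous Research's actual implementation: `embedding`, `transformer`, and `lm_head` are hypothetical stand-ins for the corresponding pieces of an existing pipeline, and the alignment simply has each latent position predict the following bag of tokens.

```python
import torch
import torch.nn.functional as F

def superposition_step(input_ids, embedding, transformer, lm_head, s=2):
    """One FLOP-matched superposition step with bag size s.

    input_ids has shape (batch, L), where L is s times the usual context
    length, so the step ingests s times more text at the same transformer cost.
    """
    B, L = input_ids.shape
    assert L % s == 0, "sequence length must be divisible by the bag size"

    # Group consecutive tokens into non-overlapping bags of s and average
    # their embeddings to form one latent token per bag.
    emb = embedding(input_ids)                          # (B, L, d)
    latent = emb.reshape(B, L // s, s, -1).mean(dim=2)  # (B, L/s, d)

    # The transformer and LM head see a sequence of length L/s.
    hidden = transformer(latent)                        # (B, L/s, d)
    logits = lm_head(hidden)                            # (B, L/s, vocab)

    # Each latent position predicts the next bag of s tokens. The multi-hot
    # cross-entropy (MCE) target puts mass 1/s on every token in that bag,
    # which reduces to the mean of s ordinary cross-entropy terms, so an
    # existing fused CE kernel can be reused.
    target_bags = input_ids.reshape(B, L // s, s)[:, 1:]   # (B, L/s - 1, s)
    pred = logits[:, :-1].reshape(-1, logits.size(-1))     # (B*(L/s - 1), vocab)

    loss = sum(
        F.cross_entropy(pred, target_bags[..., k].reshape(-1))
        for k in range(s)
    ) / s
    return loss
```

Only the embedding averaging and this loss differ from a standard step; everything downstream of the loss (backward pass, optimizer, parallelism) is untouched.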

What Happens in the Recovery Phase?
After a fraction r of total training steps (typically between 20% and 40%), the superposition phase ends. Training then resumes from the saved checkpoint using standard next-token prediction for the remaining (1 - r) fraction of steps. During recovery, the TST-specific input averaging and multi-hot loss are switched off entirely, and the model trains as it normally would. The idea is that the throughput gains from superposition are banked early, while the recovery phase lets the model refine its predictions through exact token-level optimization. This two-phase design matters because it avoids permanently distorting the model's representations, allowing it to reach a lower loss than a fully standard training run even with fewer total GPU hours.
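As a rough illustration of how the two phases fit together, here is a simplified schedule sketch. The names are again hypothetical: `superposition_step` is the sketch above, `standard_step` stands in for ordinary next-token-prediction training, and the two loaders are assumed to yield appropriately sized batches for each phase (superposition batches are s times longer to keep FLOPs matched).

```python
def train_with_tst(model_parts, optimizer, sp_loader, std_loader,
                   total_steps, r=0.3, s=2):
    """Two-phase schedule: superposition for the first r * total_steps,
    then standard next-token prediction for the remaining (1 - r) fraction.
    model_parts is assumed to be (embedding, transformer, lm_head)."""
    embedding, transformer, lm_head = model_parts
    switch_step = int(r * total_steps)

    sp_batches = iter(sp_loader)    # sequences of length s * L (FLOP-matched)
    std_batches = iter(std_loader)  # sequences of the standard length L

    for step in range(total_steps):
        if step < switch_step:
            # Superposition phase: bagged inputs, multi-hot CE loss.
            loss = superposition_step(next(sp_batches), embedding,
                                      transformer, lm_head, s=s)
        else:
            # Recovery phase: plain next-token prediction, no TST code path.
            loss = standard_step(next(std_batches), embedding,
                                 transformer, lm_head)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In practice the switch would happen by saving a checkpoint at the end of superposition and resuming an ordinary pre-training job from it, as described above; the single loop here is only for illustration.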
What Results Has TST Achieved?
At the 10B-A1B mixture-of-experts scale, TST reached a lower final training loss than a matched-FLOPs baseline. It consumed only 4,768 B200-GPU-hours compared to the baseline's 12,311—a roughly 2.5x reduction in pre-training time. This improvement was consistent across model sizes ranging from 270 million to 10 billion parameters. The paper reports that the optimal superposition fraction r lies between 0.2 and 0.4 across tested scales. These results demonstrate that TST is not only faster but can actually improve model quality by allowing more data to be processed within the same compute budget.
Does TST Require Changes to Model Architecture or Training Data?
No. One of TST's key strengths is that it requires no modifications to the model architecture, optimizer, tokenizer, parallelism strategy, or training data. The superposition phase temporarily changes only the input embedding step and the loss computation, and both revert to standard behavior during recovery. The multi-hot cross-entropy loss reuses existing fused cross-entropy kernels, so no custom CUDA kernels or auxiliary heads are needed. This makes TST a drop-in improvement that can be applied to an existing pre-training pipeline with minimal engineering effort.