NVIDIA's TwoTower Architecture Improves Large Model Efficiency with Parallel Generation on a 30B Model

According to monitoring by Beating, NVIDIA has open-sourced Nemotron-Labs-TwoTower, a discrete text diffusion architecture that targets the generation-speed bottleneck of large models, which can emit only one token at a time. Earlier text diffusion models pursued parallel output by forcing a single network to handle both unidirectional context understanding and bidirectional parallel error correction, which significantly degraded the model's capability. TwoTower decouples the two roles. A pre-trained autoregressive large model is frozen entirely and serves as a read-only 'context tower', preserving its full reasoning and common-sense abilities. A separately trained 'denoising writing tower' reads the context through layer-wise cross-attention. When predicting a block, the writing tower applies 'confidence unmasking': it commits the highest-confidence tokens first, then fills the remaining gaps step by step, so the block is written in parallel from easy to hard. The design was adapted on a 30B-class hybrid-architecture (Mamba-Transformer MoE) model using only 1/12 of the baseline model's pre-training data (2.1T tokens). It retains 98.7% of the baseline's quality and delivers a 2.42x speedup in actual generation, without adding extra cache memory overhead. The trade-offs are higher static memory usage, since both towers must stay in memory, and a slight accuracy drop on extremely complex code and mathematical reasoning.
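The 'confidence unmasking' schedule described above can be illustrated with a toy sketch. This is not NVIDIA's implementation: the function names, the `per_step` parameter, and the random stand-in predictor `toy_predict` are all assumptions made for illustration, and the frozen context tower and cross-attention are not modeled at all. The sketch only shows the ordering idea, namely that the most confident positions in a block are committed first and the rest are filled in over later steps.

```python
import random

MASK = None  # marks a position in the block that has not been written yet


def toy_predict(block, rng):
    """Hypothetical stand-in for the writing tower.

    Returns {position: (token, confidence)} for every masked position.
    A real model would produce these from its output distribution.
    """
    return {i: (f"tok{i}", rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def confidence_unmask(block_size, per_step, predict, seed=0):
    """Fill a fully masked block, committing the most confident guesses first.

    per_step is an assumed knob: how many positions to commit per step.
    Returns the finished block and the positions committed at each step.
    """
    rng = random.Random(seed)
    block = [MASK] * block_size
    order = []
    while any(tok is MASK for tok in block):
        guesses = predict(block, rng)
        # Rank the still-masked positions by confidence, highest first.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        chosen = ranked[:per_step]
        for i in chosen:
            block[i] = guesses[i][0]
        order.append(sorted(chosen))
    return block, order


if __name__ == "__main__":
    block, order = confidence_unmask(block_size=8, per_step=3,
                                     predict=toy_predict)
    print(block)
    print(order)  # which positions were written at each step
```

With a block of 8 positions and 3 commits per step, the block is finished in 3 steps rather than 8 sequential ones, which is the intuition behind the parallel speedup. The actual number of tokens committed per step in TwoTower is not stated in the source.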