Meta and others introduce BLT acceleration methods, reducing memory bandwidth by up to 92%

AIMPACT News, May 12 (UTC+8). Research teams from Meta, Stanford University, and the University of Washington recently introduced three methods that significantly accelerate inference for the Byte Latent Transformer (BLT). BLT is a language model that operates directly on raw bytes, dynamically grouping them into variable-length patches with an entropy-based segmentation strategy, and matches the performance of token-based models. Because byte-by-byte autoregressive decoding requires many forward passes, memory bandwidth becomes the main inference bottleneck. The three acceleration methods are:

- BLT-D uses block discrete diffusion, trained with a combination of next-byte-prediction and masked-byte-prediction losses so that each forward pass generates multiple bytes. At block size 4, memory bandwidth falls to less than half that of standard BLT; at block size 16, it is reduced by 87-92%.
- BLT-S employs a lightweight local decoder as a speculative draft generator. It requires no additional training and, under greedy decoding, produces output identical to standard BLT, while cutting memory bandwidth by 77%.
- BLT-DV combines diffusion drafting with autoregressive verification, using the same model weights in both roles, and reduces memory bandwidth by 81%.

All three methods show their largest gains on translation tasks, while coding tasks are more sensitive to block size. On likelihood-based benchmarks such as ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU, BLT-D scores close to the BLT baseline, retaining robust reasoning capabilities.
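The draft-and-verify loop behind the speculative approach can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: the model functions are hypothetical stand-ins, and a real verifier scores all drafted bytes in a single forward pass of the large model rather than one position at a time. It illustrates why greedy speculative decoding reproduces the target model's output byte-for-byte.

```python
def greedy_decode(model, prefix, n):
    """Reference: n bytes of plain greedy decoding with `model`."""
    out = list(prefix)
    for _ in range(n):
        out.append(model(out))
    return out[len(prefix):]

def speculative_decode(draft_model, target_model, prefix, n, k=4):
    """Generate n bytes: draft k at a time cheaply, verify with the target."""
    out = list(prefix)
    generated = []
    while len(generated) < n:
        # 1. Draft k candidate bytes with the cheap model.
        ctx = list(out)
        draft = []
        for _ in range(k):
            b = draft_model(ctx)
            draft.append(b)
            ctx.append(b)
        # 2. Verify the drafted span against the target model's greedy
        #    choices (simulated position-by-position here; a real
        #    implementation checks all k positions in one forward pass).
        for d in draft:
            t = target_model(out)
            out.append(t)          # always keep the target's byte,
            generated.append(t)    # so the output is exactly greedy
            if t != d or len(generated) >= n:
                break              # mismatch: discard the rest of the draft
    return generated[:n]

# Toy byte-level models (hypothetical): the target maps the last byte b to
# (b*3 + 7) % 256; the draft agrees except after even bytes, forcing
# occasional rejections.
target = lambda ctx: (ctx[-1] * 3 + 7) % 256
draft = lambda ctx: target(ctx) if ctx[-1] % 2 else (target(ctx) + 1) % 256

assert speculative_decode(draft, target, [65], 12) == greedy_decode(target, [65], 12)
```

The key property is that the target's byte is kept at every position, so acceptance only decides how many bytes one expensive forward pass can confirm, never what the output is; this is why BLT-S needs no extra training and matches standard BLT exactly under greedy decoding.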
