DiffusionBlocks block-wise training can cut neural network training memory to 1/B, with performance validated across several architectures
DiffusionBlocks divides a Transformer-based network into B independently trainable blocks by interpreting the network's residual connections as discretized steps of a denoising process. Each block is trained with a score matching objective, so blocks train without communicating with one another, and training memory drops to roughly 1/B of the end-to-end requirement. Experiments show the approach is effective across multiple architectures. At inference, only one block is active per denoising step, which reduces the compute cost of a 12-layer DiT with B=3 to one-third of the original. The method applies to ViT, DiT, MDM, and autoregressive (AR) Transformers, but it requires each block's input and output dimensions to match, so it cannot be used with U-Net.
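The arithmetic behind the 12-layer, B=3 example can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names `partition_layers`, `block_for_noise_level`, and `active_layer_fraction` are hypothetical, and the uniform split of a normalized noise level `t` in [0, 1) across blocks is an assumption, since the summary does not say how noise ranges are assigned to blocks.

```python
def partition_layers(num_layers: int, num_blocks: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, equally sized blocks."""
    if num_layers % num_blocks != 0:
        raise ValueError("this sketch assumes num_layers is divisible by num_blocks")
    size = num_layers // num_blocks
    return [range(b * size, (b + 1) * size) for b in range(num_blocks)]


def block_for_noise_level(t: float, num_blocks: int) -> int:
    """Pick the single block responsible for a normalized noise level t in [0, 1).

    Assumption: the noise range is split uniformly across blocks. The real
    method may assign ranges differently.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    # The min() guards against floating-point rounding at the upper edge.
    return min(int(t * num_blocks), num_blocks - 1)


def active_layer_fraction(num_layers: int, num_blocks: int) -> float:
    """Fraction of layers that run when only one block is active."""
    blocks = partition_layers(num_layers, num_blocks)
    return len(blocks[0]) / num_layers


# 12-layer network split into B=3 blocks of 4 layers each.
blocks = partition_layers(12, 3)
# Only the block that owns the current noise level would be executed.
active = blocks[block_for_noise_level(0.5, 3)]
# One block of four layers is one-third of the twelve layers.
fraction = active_layer_fraction(12, 3)
```

The same ratio explains the training-memory claim: if only one block is trained at a time, only that block's layers need activations and gradients in memory, which is about 1/B of the full network under the equal-size assumption used here.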