Stable Diffusion is currently one of the most powerful open-source text-to-image diffusion models, but it has a major drawback: training it is expensive, which puts it out of reach for small and medium-sized enterprises and individual developers who do not have A100 or H100 GPUs.

To address this pain point, the open-source Wuerstchen model adopts a new architecture that achieves 42x compression while preserving image quality. **For training on 512x512 images, Stable Diffusion 1.4 requires 150,000 GPU hours, while Wuerstchen requires only 9,000, a roughly 16-fold reduction in training cost.**

Even at image resolutions as high as 1536, Wuerstchen requires only 24,602 GPU hours, which is still about 6 times cheaper than Stable Diffusion.
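The ratios behind these figures can be checked with a few lines of Python. This is only arithmetic on the GPU-hour numbers quoted above; the function name `cost_reduction` is illustrative and not part of any Wuerstchen code.

```python
# GPU-hour figures as quoted in the article.
SD_1_4_GPU_HOURS = 150_000          # Stable Diffusion 1.4, 512x512
WUERSTCHEN_512_GPU_HOURS = 9_000    # Wuerstchen, 512x512
WUERSTCHEN_1536_GPU_HOURS = 24_602  # Wuerstchen, resolutions up to 1536


def cost_reduction(baseline_hours: float, model_hours: float) -> float:
    """Return how many times fewer GPU hours the model needs than the baseline."""
    return baseline_hours / model_hours


if __name__ == "__main__":
    ratio_512 = cost_reduction(SD_1_4_GPU_HOURS, WUERSTCHEN_512_GPU_HOURS)
    ratio_1536 = cost_reduction(SD_1_4_GPU_HOURS, WUERSTCHEN_1536_GPU_HOURS)
    print(f"512x512 training:  about {ratio_512:.1f}x fewer GPU hours")
    print(f"1536 training:     about {ratio_1536:.1f}x fewer GPU hours")
```

The two ratios come out at roughly 16.7 and 6.1, which matches the rounded "16 times" and "6 times" figures in the text.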

As a result, this open-source model makes it easier for developers without large compute budgets to experiment with diffusion models and to explore better training methods on top of it.

Open-source links:

GitHub:

Paper:

A brief introduction to Wuerstchen

Wuerstchen is a diffusion model that operates in a highly compressed latent space of the image, which is one of the reasons its training cost is lower than that of Stable Diffusion.

Compressing the data can reduce training and inference costs by orders of magnitude. For example, training on 1024×1024 images is far more expensive than training on 32×32 images. The compression factors used in the industry are usually around 4x to 8x.

Wuerstchen pushes compression to the extreme with a new architecture, achieving 42x spatial compression. This is a significant step, because once the compression exceeds 16x, conventional methods can no longer reconstruct images adequately.
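To get a feel for what these compression factors mean, the sketch below computes the latent grid size for a 1024×1024 image at several factors. It assumes the factor applies to each spatial side and rounds down; the helper names are illustrative, and the actual latent shapes used by the model may differ.

```python
def latent_side(image_side: int, factor: int) -> int:
    """Approximate side length of the latent grid, rounded down."""
    return image_side // factor


def position_reduction(image_side: int, factor: int) -> float:
    """How many times fewer spatial positions the latent grid has than the image."""
    latent = latent_side(image_side, factor)
    return (image_side * image_side) / (latent * latent)


if __name__ == "__main__":
    image_side = 1024
    # 4x and 8x are the typical factors mentioned in the text,
    # 16x is the stated limit of conventional methods, 42x is Wuerstchen.
    for factor in (4, 8, 16, 42):
        side = latent_side(image_side, factor)
        reduction = position_reduction(image_side, factor)
        print(f"{factor:>2}x: {side}x{side} latent grid, "
              f"about {reduction:.0f}x fewer spatial positions")
```

Because the factor applies per side, the number of spatial positions the diffusion model has to process shrinks roughly with the square of the compression factor.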

How Wuerstchen's extreme compression works

Wuerstchen’s extreme compression method is divided into three stages, A, B, and C.

Stage A) performs the initial compression. It uses a vector-quantized generative adversarial network (VQGAN) to create a discretized latent space, mapping the data to points in a predefined, smaller set. This compact representation helps model learning and inference speed.

Stage B) compresses further, using an encoder to project the image into an even more compact space and a decoder that tries to reconstruct the VQGAN latent representation from the encoded image. A token predictor based on the Paella model is used for this task. It is conditioned on the representation of the encoded image and works with a smaller number of sampling steps, which greatly improves compute efficiency.

Stage C) uses the image encoders from stages A and B to project images into the compact latent space, and a text-conditioned latent diffusion model is trained there at a significantly reduced spatial dimension. This latent space allows the model to generate more diverse and creative images while retaining high image quality.
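The vector quantization step that stage A relies on can be illustrated with a toy example: each continuous latent vector is replaced by the nearest entry of a fixed codebook. This is a minimal sketch in plain Python, not the Wuerstchen implementation. The codebook and latent vectors are made-up values, and a real VQGAN learns both its encoder and its codebook during training.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: Sequence[Vector],
             codebook: Sequence[Vector]) -> Tuple[List[int], List[Vector]]:
    """Map each vector to the index and value of its nearest codebook entry."""
    indices: List[int] = []
    for vector in vectors:
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(vector, codebook[i]))
        indices.append(nearest)
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical 3-entry codebook and three encoder outputs (illustrative values).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (0.2, 0.8)]

indices, quantized = quantize(latents, codebook)

if __name__ == "__main__":
    print(indices)    # discrete codes that represent the latents
    print(quantized)  # codebook vectors that replace the original latents
```

After quantization, each latent position is described by a single integer index into the codebook, which is what makes the latent space discrete.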

Image sizes that Wuerstchen can generate

Wuerstchen was trained on images with resolutions between 1024x1024 and 1536x1536, and its output quality is very stable. Even non-square images such as 1024x2048 still turn out well.

The developers also found that Wuerstchen adapts very well to training at new resolutions: fine-tuning on 2048x2048 images can also be done at greatly reduced cost.

Images generated by Wuerstchen

Judging by the examples presented for Wuerstchen, the model understands text prompts very well, and the quality of its output is comparable to that of the strongest open-source diffusion models, such as Stable Diffusion.

Real photo of an eagle wearing a white coat

Two stormtroopers from Star Wars sitting in a bar drinking beer

Highly realistic photos of bees dressed as astronauts

A mouse wearing black formal attire
