
#12 Kandinsky 2.0 - Can the diffusion model generate valid letters?

The diffusion model Kandinsky 2.0 is still largely unknown. A big advantage over other text-to-image models is that it accepts prompts in 101 different languages. But can this AI tool also handle type?

Experiment #12 deals with Kandinsky 2.0, a text-to-image model based on the diffusion architecture. Published in November 2022, the model generates an image of up to 1024x512 pixels within one minute. To make the results easier to compare, a square format (512x512 pixels) was chosen for this experiment. The model's special feature, the option to enter prompts in 101 languages, is not used here. If the model is accessed via Google Colab, several parameters can be set: batch_size, image height, image width, num_steps and guidance_scale. Of particular interest is the guidance scale (comparable to the supercondition_factor of minDALL-E), which controls how closely the output follows the prompt. In theory, a higher value leads to more accurate but less diverse images. A previous experiment of mine showed that values between 30 and 100 produce the best results for this task; above a value of about 200, the image increasingly dissolves. While this glitch can be visually interesting for designers, it is not part of this experiment. A guidance scale of 40 was used for the following prompts.
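As a rough sketch of how such a run can be set up in Colab, the snippet below uses the kandinsky2 package from the project's GitHub repository with the parameters mentioned above. Exact function and parameter names may differ between package versions, so treat this as an illustration rather than a definitive recipe.

# Minimal sketch of a Kandinsky 2.0 text-to-image run, e.g. in Google Colab.
# Assumes the kandinsky2 package is installed; parameter names follow the
# repository's README and may vary between versions.
from kandinsky2 import get_kandinsky2

# Load the text2img pipeline on the GPU.
model = get_kandinsky2('cuda', task_type='text2img')

# Generate one 512x512 image with the settings used in this experiment.
images = model.generate_text2img(
    'A black letter "A" on white background',
    batch_size=1,       # number of images per run
    h=512,              # image height in pixels
    w=512,              # image width in pixels
    num_steps=75,       # diffusion sampling steps
    guidance_scale=40,  # prompt adherence (cf. supercondition_factor)
)
images[0].save('letter_A.png')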

Overall, it can be said that some prompts produce adequate results. With the prompt "A black letter "A" on white background", for example, the model generates letters that roughly show the characteristics of an A. The prompt "A word on white background" likewise yields unreadable words on a light background. "A black letter on white background", however, was completely misunderstood and, in the most optimistic interpretation, shows a white sheet of letter paper. Whether an image is generated correctly seems largely left to chance. In addition, remnants of watermarks can be recognized in several images. The names of the datasets used for training are known: LAION-improved-aesthetics-700M, LAION-aesthetics-multilang-46M and ruDALLE-english-44M. The source of the training data can only be partially traced (e.g. the SAC dataset, consisting of "over 238,000 synthetic images generated with AI models such as CompVis latent GLIDE and Stable Diffusion from over forty thousand user submitted prompts"). It can be assumed that unlicensed training material was used.
