Artificial Intelligence has completely transformed how we create art. The boundaries between technology and art are increasingly blurred, and AI lies at the center of this revolution.
The days when creating art was reserved for highly skilled designers and artists wielding physical or digital brushes and canvases are long gone. Today, AI - or generative AI, more precisely - is transforming the very essence of image creation, offering limitless possibilities and redefining the boundaries of imagination.
But how does AI image generation actually work? How can AI models produce images that can evoke emotions, tell stories, or even mimic the human touch?
In this introductory guide, we will delve into the world of AI image generation and explain how AI image generators work, demystifying the mechanics behind this fascinating art form.
AI image generation is an application of generative AI - a category of AI designed to create new content.
Unlike discriminative models, which analyze existing data to make predictions or decisions, generative models aim to produce new data that shares the statistical characteristics of their training data.
AI image generators use machine learning (ML) algorithms and artificial neural networks to create realistic images from natural language inputs. They are trained on large datasets of images, from which they learn the visual patterns, elements, and characteristics they need to generate similar images.
There are many types of AI image generation models, but the most popular are Generative Adversarial Networks (GANs) and diffusion models.
We are more interested in the latter - diffusion models - since they power the latest generation of AI image generators, or, in more technical terms, deep generative models.
Even if all this high-tech terminology is confusing, you have probably heard of the most famous diffusion models: Midjourney, DALL-E 2, and, yes, you guessed it, Stable Diffusion.
The name ‘diffusion’ was inspired by the process through which these models produce images, which resembles the diffusion of gas molecules from high- to low-density areas in thermodynamics.
The way AI image generators built on diffusion models transform noise into images (we’ll explain what ‘noise’ means in terms of AI models below) is similar to that movement of particles.
Unlike GANs, which pit two neural networks (a generator and a discriminator) against each other during training, diffusion models generate images by learning to reconstruct the data they were trained on. That data can be anything: images, paintings, photos, 3D models, game assets, and so on.
Diffusion models learn through a process that resembles diffusion: they corrupt the training images by adding noise, then learn to reverse that corruption by removing the noise, which lets them generate new, similar images.
To put it more vividly, imagine a diffusion model as a master chef who tastes a meal, breaks it down into its ingredients to learn them, and then cooks a new meal that tastes similar to the original.
The term ‘noise’ comes from signal processing in digital and analog systems.
Noise in signal processing means unwanted changes that happen to a signal when it is captured, stored, transmitted, or processed. These changes are usually measured using a metric called the signal-to-noise ratio, which compares the strength of the signal we want to hear to the strength of unwanted background noise. When there is a lot of noise, it can make it harder to hear the signal we want.
Think of it like trying to listen to someone talk in a really noisy room - the more noise there is, the harder it is to hear what they're saying.
The term noise was then transferred to machine learning to refer to unwanted patterns or behaviors in the data that can make it difficult to get clear signals. We want to minimize the noise and focus on the important signals in the data.
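To make ‘noise’ concrete, here is a minimal Python sketch (using NumPy; the signal and noise levels are illustrative, not taken from any real system) that corrupts a clean signal with Gaussian noise and computes the resulting signal-to-noise ratio:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A clean "signal": one second of a 5 Hz sine wave sampled at 1 kHz.
t = np.linspace(0.0, 1.0, 1000)
signal = np.sin(2 * np.pi * 5 * t)

# Corrupt it with Gaussian noise - the same kind of random noise diffusion models use.
noise = rng.normal(loc=0.0, scale=0.5, size=signal.shape)
noisy_signal = signal + noise

# Signal-to-noise ratio in decibels: power of the wanted signal vs. power of the noise.
snr_db = 10 * np.log10(np.mean(signal**2) / np.mean(noise**2))
print(f"SNR: {snr_db:.1f} dB")  # the more noise we add, the lower this number gets
```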
Now that we have a high-level overview of how text-to-image AI works, let’s see the process step-by-step.
The process starts when you input a text prompt into the AI image generator. The AI then uses NLP (natural language processing) to transform the textual data into a machine-friendly language.
This involves turning text into numerical vectors that carry the meaning of the words. Think of these vectors as a set of instructions.
For example, if a user asks an image generator to create "brown puppy on the beach," the model converts the words into numbers that capture "brown," "puppy," and "beach," and how they relate to each other. This numerical guide helps the AI build the image.
The guide tells the AI what the final image should contain, ensuring that each object is positioned correctly. This conversion from text to numbers, and then to images, is what lets AI image generators create visual representations from written prompts.
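As a rough illustration of this text-to-vectors step, the sketch below runs a prompt through the openly available CLIP text encoder via Hugging Face's transformers library. Different generators use different (often proprietary) text encoders, so treat this as an illustrative example of the principle rather than what any particular product does:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Load a small, publicly available text encoder (illustrative choice).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "brown puppy on the beach"
tokens = tokenizer(prompt, padding=True, return_tensors="pt")

with torch.no_grad():
    output = text_encoder(**tokens)

# One vector per token: this matrix is the "set of instructions"
# that the image-generation network is conditioned on.
embeddings = output.last_hidden_state
print(embeddings.shape)  # e.g. torch.Size([1, number_of_tokens, 512])
```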
In this stage of the process, the model starts with the input data - for example, a picture - and gradually adds random noise to it. The model uses what is known as a ‘Markov chain’: a sequence of steps in which each step changes the picture a little, based only on how it looked in the step before.
The random noise added in this stage is known as “Gaussian noise,” a common type of random noise.
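Here is a toy sketch of that forward process: each Markov-chain step mixes the current image with a little Gaussian noise. The step count and noise level are illustrative and not taken from any particular model:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def forward_diffusion(image, num_steps=1000, beta=0.02):
    """Return the image after `num_steps` small noising steps."""
    x = image.copy()
    for _ in range(num_steps):
        noise = rng.normal(size=x.shape)                    # Gaussian noise
        x = np.sqrt(1 - beta) * x + np.sqrt(beta) * noise   # one Markov-chain step
    return x

clean_image = rng.uniform(0.0, 1.0, size=(64, 64, 3))  # stand-in for a real photo
noisy_image = forward_diffusion(clean_image)
# After enough steps the result is indistinguishable from pure Gaussian noise.
```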
The model studies how the random noise added in the previous stage changes the original data. It learns how the original image turns into the noisy one so that it can later trace the process back and reverse it.
The goal is to understand the differences between the original and altered data at each step. Training the model is all about perfecting this reverse process.
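A simplified sketch of what that training looks like in practice, assuming a DDPM-style setup in which a neural network (the placeholder `model` below) is trained to predict the exact noise that was added at a randomly chosen step:

```python
import torch
import torch.nn.functional as F

def training_step(model, clean_images, alphas_cumprod):
    """One simplified training step of a DDPM-style diffusion model.

    `model` is any network taking (noisy_images, timesteps) and predicting noise;
    `alphas_cumprod` is the precomputed noise schedule (one value per timestep).
    """
    batch_size = clean_images.shape[0]

    # Pick a random timestep for each image and look up its schedule value.
    t = torch.randint(0, len(alphas_cumprod), (batch_size,))
    a_bar = alphas_cumprod[t].view(batch_size, 1, 1, 1)

    # Jump straight to step t of the forward (noising) process in closed form.
    noise = torch.randn_like(clean_images)
    noisy_images = torch.sqrt(a_bar) * clean_images + torch.sqrt(1 - a_bar) * noise

    # The model is scored on how well it recovers the noise that was added.
    predicted_noise = model(noisy_images, t)
    return F.mse_loss(predicted_noise, noise)
```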
Once the model has been trained and understands how adding noise changes the original data, it can reverse the process.
The AI examines the noisy image and tries to figure out how to remove the noise to recover the original image. In essence, the diffusion model retraces the steps of the forward process. By backtracking those steps, it learns how to produce new images similar to the originals.
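And a matching sketch of the reverse process: start from pure noise and apply the learned denoiser step by step. This is again a simplified DDPM-style update, with `model` standing in for the trained network:

```python
import torch

@torch.no_grad()
def sample(model, betas, image_shape=(1, 3, 64, 64)):
    """Simplified reverse diffusion: denoise random noise into an image."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(image_shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        predicted_noise = model(x, torch.tensor([t]))

        # Estimate the slightly less noisy image at step t-1 (DDPM update rule).
        coef = betas[t] / torch.sqrt(1.0 - alphas_cumprod[t])
        x = (x - coef * predicted_noise) / torch.sqrt(alphas[t])

        if t > 0:  # add back a little noise, except at the very last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```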
Ultimately, the model uses what it learns in the reverse diffusion step to output a generated image. First, it starts with a chaotic clutter of pixels - the random noise. Then, it relies on the text prompt to guide it in turning the noise into a clear output image.
The text prompt serves as a guideline, informing it what the final generated image should look like.
The AI performs the reverse diffusion, slowly transforming the random noise into a picture. At the same time, it makes sure the generated image matches the written text prompt. It does this by reducing any differences between the picture it's creating and what the written prompt suggests the picture should be like.
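The article doesn't name the exact mechanism each generator uses to keep the image aligned with the prompt; one widely used technique is classifier-free guidance, sketched here purely as an illustration. The model predicts the noise twice, with and without the prompt, and the difference between the two predictions is amplified so the image drifts toward what the text describes:

```python
def guided_noise_prediction(model, noisy_image, t, prompt_embedding, guidance_scale=7.5):
    """Noise estimate for one denoising step, steered toward the text prompt.

    `model(noisy_image, t, conditioning)` is a placeholder for a text-conditioned
    denoiser; passing None means an unconditional prediction.
    """
    noise_with_prompt = model(noisy_image, t, prompt_embedding)
    noise_without_prompt = model(noisy_image, t, None)
    # Amplify the direction in which the prompt pulls the prediction.
    return noise_without_prompt + guidance_scale * (noise_with_prompt - noise_without_prompt)
```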
This process of adding random noise and learning how to reverse it makes the AI image generation models capable of generating realistic images.
You can keep repeating this process to tweak the image or make different versions by entering different prompts. The AI model goes through the same steps until it achieves the desired output.
You might be wondering why you don’t get the exact same image each time you enter the same prompt into an AI image generator. The main reason is that the starting noise is random, and the model’s predictions at every step depend on that noise, so it shapes the final image.
It’s like constructing a building with different building blocks and a new blueprint each time. The AI model doesn’t see the image as we humans do. Although it is trained on sample images, it doesn’t simply reconstruct them or create collages out of them.
Instead, as we saw, it relies on the numerical vectors assigned to the different words of the text prompt.
Some Midjourney users claim that adding a seed number can help you maintain consistency in your images. To learn more, check out our guide.
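Open models make the role of the seed easy to demonstrate: fixing the seed fixes the "random" starting noise, so the same prompt can reproduce the same image. Midjourney's seed parameter works on the same principle, though its internals aren't public. A small PyTorch illustration:

```python
import torch

# Same seed -> same "random" starting noise -> (all else being equal) same image.
noise_a = torch.randn((1, 4, 64, 64), generator=torch.Generator().manual_seed(42))
noise_b = torch.randn((1, 4, 64, 64), generator=torch.Generator().manual_seed(42))
noise_c = torch.randn((1, 4, 64, 64), generator=torch.Generator().manual_seed(7))

print(torch.equal(noise_a, noise_b))  # True  - identical starting point
print(torch.equal(noise_a, noise_c))  # False - different seed, different image
```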
We’ve seen how AI diffusion models work under the hood, but now we’ll explore the intricacies of the three most popular diffusion models: Midjourney, DALL-E 2, and Stable Diffusion.
Midjourney is a text-to-image AI generator that allows users to create realistic, abstract, and fantasy-like images. It was developed by Midjourney, Inc. and launched in open beta in July 2022. It’s accessible only through Discord: users generate images by typing the ‘/imagine’ command followed by a description of the image, either in a Discord server or in a direct message to the Midjourney bot.
Midjourney's AI creates visually appealing, artistic images, emphasizing complementary colors, light balance, sharp details, and pleasing composition. Like DALL-E and Stable Diffusion, it uses a diffusion model to transform random noise into art. The V5 model, released on March 15, 2023, represents a significant upgrade, though details about the training data and source code remain undisclosed. Currently, Midjourney produces images at a relatively low resolution of 1,024 x 1,024 pixels, but the upcoming Midjourney 6 promises higher-resolution images suitable for printing.
DALL-E is a text-to-image model developed by OpenAI. Its name is a tribute to, and a blend of, Salvador Dalí, the Spanish surrealist artist, and WALL-E, the adorable robot in Pixar’s animated sci-fi movie of the same name.
DALL-E 2, released in April 2022, is an advanced version of the original DALL-E, built on a diffusion model and integrating data from CLIP, a model developed by OpenAI that connects visual and textual information. Utilizing GPT-3 to interpret natural language prompts, DALL-E 2 consists of two main parts: the Prior and the Decoder. The Prior converts user text into an image representation, and the Decoder generates the corresponding image.
Compared to the original, DALL-E 2 is more efficient, offers four times the resolution and improved speed, supports flexible image sizes, and gives users a wider range of customization options, including different artistic styles and the ability to extend existing images.
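DALL-E 2's Prior and Decoder are not publicly available, but CLIP is. The sketch below (an illustration using Hugging Face's transformers; the image URL is just a commonly used test picture) shows how CLIP scores how well different captions match an image - the same text-image link DALL-E 2 builds on:

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any test image works; this one (two cats on a couch) is a standard example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = ["a brown puppy on the beach", "two cats lying on a couch"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# Higher probability = CLIP judges that caption a better match for the image.
print(outputs.logits_per_image.softmax(dim=1))
```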
Stable Diffusion, launched in 2022 through a collaboration between Stability AI, EleutherAI, and LAION, is a text-to-image generative AI model capable of creating detailed images from text. It offers advanced image generation features: filling in missing parts of images (inpainting), extending images (outpainting), and transforming one image into another.
Built on a latent diffusion model (LDM), Stable Diffusion begins with random noise and gradually refines the image to match the text. The first version used a CLIP text encoder; the second switched to OpenCLIP, allowing for more detailed image generation. Notably, Stable Diffusion's open-source nature and compatibility with consumer-grade graphics cards make it accessible to a wide audience, encouraging participation and contribution.
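Because the model is open source, you can run it locally. A minimal sketch using Hugging Face's diffusers library, assuming a CUDA-capable GPU and one commonly used checkpoint name:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download pretrained weights (the checkpoint name here is one common choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a consumer-grade GPU with ~6-8 GB of VRAM is enough

# Fixing the seed makes the run reproducible, as discussed earlier.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe("brown puppy on the beach", generator=generator).images[0]
image.save("puppy.png")
```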
🚀 Also explore how text-to-video AI generation works.
The integration of AI with art has revolutionized the way we approach creativity and imagination. Generative AI tools - specifically diffusion models such as Midjourney, DALL-E 2, and Stable Diffusion - are transforming image creation. They can generate realistic images from textual descriptions, making them incredibly powerful in the artistic domain.
As these AI image generators continue to evolve, there's growing curiosity about whether they might one day entirely replace designers and visual creators.
However, despite the impressive capabilities of AI in art and design, these tools aren't poised to fully replace human creativity. They can serve as valuable assistants to artists and designers, but they lack the essential human touch and emotion that define art. While AI enhances the artistic process and allows for efficient, high-quality art generation, it is still fundamentally a tool. Human guidance and interpretation remain crucial, which means AI complements human creativity rather than supplanting it.