Generative models in deep learning
A generative model tries to learn the underlying structure of data well enough to produce new examples that look like they could have come from the same source. In deep learning, that usually means learning a distribution over data such as images, text, audio, or video, then using that learned distribution to sample new outputs. Instead of only answering “what label fits this input?”, a generative model tries to answer a harder question: “what kinds of inputs are even possible here?”
That difference matters because “generate” can mean several things. A model might continue a sentence one token at a time, synthesize a face that never existed, reconstruct an input through a compressed internal code, or turn random noise into a coherent image. Those are all generative tasks, but they are solved with different training objectives and different assumptions about data.
The four big families on this page are autoregressive models, variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models. Autoregressive models generate by predicting the next piece from previous pieces. VAEs learn a smooth latent space and reconstruct samples from it. GANs train through competition between a generator and a discriminator. Diffusion models learn to reverse a gradual noising process. These families differ not just in architecture, but in what they optimize, how stable they are to train, how fast they generate, and what kinds of outputs they are best at.
A useful way to read the rest of the page is to keep three questions in view:
What is the model trying to learn?
How does training push it toward that goal?
What trade-off does that choice create in practice?
What a generative model learns
At the most basic level, a generative model tries to learn a probability distribution over data. If the data is just raw examples, this is often written as $p(x)$, where $x$ is a data sample such as an image, a sentence, or a waveform. If the generation is guided by some condition, such as a text prompt, a class label, or a previous context window, the target becomes a conditional distribution like $p(x \mid c)$, where $c$ is the condition.
That notation sounds abstract until it is grounded in examples:
For face generation, $x$ could be a face image, and the model learns what real face images tend to look like.
For text continuation, $x$ could be the next token or the whole sequence, conditioned on earlier tokens.
For text-to-image generation, the model learns $p(x \mid c)$: what images are plausible given a prompt like “a red bicycle in the snow.”
When people say a model can sample, they mean it can draw a new example from the learned distribution. Not copy a training item. Not pick from a fixed library. Produce a new example that is likely under the model’s learned view of the data.
Likelihood, latent variables, and conditioning
Three ideas show up again and again.
Likelihood means how much probability the model assigns to observed data. If a model gives high probability to real examples, including ones held out from training, it is modeling the data well. Some families, especially autoregressive models, let us compute this likelihood directly. Others only optimize an indirect objective.
Latent variables are hidden factors the model uses internally to represent variation. In a face dataset, latent factors might loosely capture pose, lighting, hairstyle, or expression. You usually do not hand-design these features. The model learns them.
Conditioning means generation is guided rather than free-running. A text prompt, a source image, a class label, or previous tokens can all act as a condition. This turns plain generation into targeted generation.
The trap here is to think the model stores a set of finished outputs and retrieves them later. It does not. A trained generative model stores parameters, not examples. Those parameters define a rule for assigning probability and producing samples.
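The point about parameters versus stored examples can be made concrete with a deliberately tiny model. The sketch below assumes a one-dimensional Gaussian as the "generative model"; the data values and the function names (`fit_gaussian`, `log_likelihood`, `sample`) are hypothetical and chosen only for illustration. After fitting, the model consists of two numbers, yet it can score observations and draw new ones.

```python
import math
import random

def fit_gaussian(data):
    """Estimate the two parameters (mean, std) by maximum likelihood."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(var)

def log_likelihood(x, mean, std):
    """Log density the fitted model assigns to one observation x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def sample(mean, std, rng):
    """Draw a new value from the model; no training example is looked up."""
    return rng.gauss(mean, std)

# Hypothetical toy data set: five measurements.
data = [4.8, 5.1, 5.0, 4.9, 5.2]
mean, std = fit_gaussian(data)

# The trained "model" is just (mean, std). New points come from the rule
# those parameters define, not from the list above.
rng = random.Random(0)
new_points = [sample(mean, std, rng) for _ in range(3)]
```

Deep generative models replace the two-parameter Gaussian with a neural network, but the division of labor is the same: parameters define the distribution, and sampling and likelihood are operations on that distribution.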
Modeling data vs labeling data
A classifier learns something like $p(y \mid x)$: given an image, is it a cat or a dog? A generative model learns about the structure of $x$ itself, or about how $x$ changes under a condition. That is why generation is harder. The model is not carving the world into a few categories. It is trying to model the space of possible observations.
That is also why generative models are useful beyond “making cool outputs.” Learning the structure of data can support:
representation learning
compression
imputation
anomaly detection
simulation
data augmentation
A model that can generate has usually learned something deeper than a model that only labels.
Generative vs discriminative learning
A clean way to separate the two is this:
Generative learning asks: how could this data have been produced?
Discriminative learning asks: given this data, what label or decision should I output?
In probability terms, generative models focus on $p(x)$ or $p(x \mid c)$. Discriminative models focus on $p(y \mid x)$.
That sounds small on paper, but it changes the whole task.
A classifier can succeed while knowing almost nothing about how to make a realistic image or sentence. It only needs decision boundaries. A generative model needs a much richer internal picture, because it must produce plausible data, not just separate categories.
One concrete pair helps:
Discriminative: “This review is positive.”
Generative: “Given the start of this review, what word is likely to come next?”
Another:
Discriminative: “This image contains a dog.”
Generative: “Produce a realistic dog image under these conditions.”
The trap here is to think generative automatically means “unsupervised” and discriminative automatically means “supervised.” Not quite. A conditional generative model can absolutely use labels or prompts. The real distinction is not whether labels exist. It is whether the model learns to model data or to predict decisions about data.
Autoregressive models: generating one step at a time
Autoregressive models break a hard problem into a chain of easier ones. Instead of modeling a whole sequence at once, they factor it into a product of conditional probabilities:
$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
Read that in plain English as: to generate the full sequence, predict the first step, then the next step given what came before, then the next, and keep going.
For text, this becomes next-token prediction. If the context is “Deep learning models can”, the model assigns probabilities to possible next tokens like “generate,” “learn,” or “be.” It picks or samples one, appends it to the context, and repeats.
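The predict, append, repeat loop can be sketched with a toy model. To stay small, the example below assumes each token depends only on the single previous token (a bigram simplification; real language models condition on the whole context), and the probability table `NEXT_TOKEN_PROBS` is invented for illustration.

```python
import math
import random

# Hypothetical conditional table p(next | previous); "<s>" marks the start
# and "</s>" the end of a sequence. The numbers are invented.
NEXT_TOKEN_PROBS = {
    "<s>": {"models": 0.6, "data": 0.4},
    "models": {"generate": 0.5, "learn": 0.3, "</s>": 0.2},
    "data": {"models": 0.7, "</s>": 0.3},
    "generate": {"data": 0.6, "</s>": 0.4},
    "learn": {"data": 0.5, "</s>": 0.5},
}

def sequence_log_prob(tokens):
    """Chain rule: sum the log-probability of each token given the previous one."""
    total = 0.0
    prev = "<s>"
    for tok in tokens + ["</s>"]:
        total += math.log(NEXT_TOKEN_PROBS[prev][tok])
        prev = tok
    return total

def generate(rng, max_len=20):
    """Sample one token at a time, appending each choice to the context."""
    tokens = []
    prev = "<s>"
    for _ in range(max_len):
        options = NEXT_TOKEN_PROBS[prev]
        tok = rng.choices(list(options), weights=list(options.values()))[0]
        if tok == "</s>":
            break
        tokens.append(tok)
        prev = tok
    return tokens

log_p = sequence_log_prob(["models", "generate"])   # log(0.6 * 0.5 * 0.4)
sampled = generate(random.Random(0))
```

`sequence_log_prob` shows why the likelihood is tractable: it is a sum of per-step log-probabilities. `generate` shows why sampling is sequential: each draw needs the previous choice as input.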
That simple idea is one of the most powerful in modern AI. It is the core of language modeling, and it extends beyond text to audio, time series, and even some image-generation setups.
Why this works so well
Autoregressive modeling has two big advantages.
Tractable likelihood. The model gives an explicit probability for each next step, so training can directly maximize the likelihood of the observed data, which is the same as minimizing next-step cross-entropy.
Coherent sequential generation. Each new output is conditioned on everything already produced, which helps preserve local consistency.
The human analogy is someone writing one word at a time while always seeing the full sentence so far.
The cost
The cost is speed. Training can often be parallelized over positions because the full target sequence is known, but generation at inference time is still sequential. The model cannot produce token 200 before token 199 is decided. That makes long outputs expensive.
This is the central trade-off of autoregressive models: they are statistically neat and excellent for sequence generation, but they pay for that neatness with slow step-by-step sampling.
Why transformers became the default for text generation
Older sequence models, especially recurrent neural networks, processed tokens in a running chain. That worked, but long-range dependencies were hard. Information from far back in the sequence could fade or become hard to access.
Self-attention changed that. It lets each token directly weigh other tokens in the context, instead of relying only on a hidden state being passed forward step by step. In effect, a token can ask: “Which earlier words matter most for understanding me right now?” That makes it much easier to capture relationships like subject-verb agreement, references across long passages, and subtle contextual meaning.
A useful analogy is note-taking:
An RNN is like reading a book and trying to remember earlier pages from memory alone.
A transformer is like reading with every earlier page open on the desk, and deciding which lines to look back at.
That is why transformers became the standard engine for large language models. During training, they can process many positions efficiently in parallel. During inference, they still generate autoregressively, one token at a time, but the representation of context is much stronger than in older recurrent systems.
The trap here is to think transformers are “non-sequential” because attention looks at the whole context. Training is parallel over known tokens; generation is not. The output still unfolds left to right.
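A minimal sketch of causal self-attention may help separate the two claims. It assumes scaled dot-product attention with no learned projections (the same vectors serve as queries, keys, and values, which real transformers do not do) and a hypothetical three-token input. Each position attends only to itself and earlier positions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions 0..i."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Scores against the current and earlier positions only (causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

When all tokens are known, as in training, the iterations of the outer loop do not depend on each other and can be computed in parallel. During generation, the vector for position i + 1 does not exist until position i has produced a token, so the loop has to run in order.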
Variational autoencoders: learning a compressed latent space
A variational autoencoder is built around two parts:
an encoder, which maps an input into a latent representation
a decoder, which reconstructs data from that latent representation
If an ordinary autoencoder is like compressing a file and then decompressing it, a VAE adds probability to that process. The encoder does not output one fixed code. It outputs a distribution over latent variables, typically described by a mean and a variance. The decoder then reconstructs from a sampled latent point.
Why do this? Because it forces the latent space to become smooth and structured. Nearby latent points should decode into similar outputs. That makes interpolation and controlled variation possible.
The two-part objective
A VAE is trained with two pressures at once:
Reconstruction loss. The decoded output should resemble the original input.
KL regularization. The latent distribution should stay close to a simple prior, often a standard normal distribution.
The first term says, “keep the information needed to rebuild the sample.” The second says, “organize the latent space so it is smooth and usable.”
Without the first, the model would not preserve the data. Without the second, the latent space could become messy and hard to sample from.
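The two pressures can be written as a single loss. The sketch below makes several assumptions not stated above: squared error as the reconstruction term, a diagonal Gaussian encoder that outputs `mu` and `log_var`, a standard normal prior, and a hypothetical `kl_weight` knob for balancing the two terms.

```python
import math

def reconstruction_loss(x, x_hat):
    """Squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m ** 2 + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, kl_weight=1.0):
    """Total loss: rebuild the input while keeping the latent code near the prior."""
    return reconstruction_loss(x, x_hat) + kl_weight * kl_to_standard_normal(mu, log_var)

# Hypothetical values: a 3-dimensional input and a 2-dimensional latent code.
x = [0.2, 0.9, 0.4]
x_hat = [0.25, 0.8, 0.5]
mu = [0.3, -0.1]
log_var = [-0.2, 0.1]
loss = vae_loss(x, x_hat, mu, log_var)
```

The KL term is zero only when the encoder outputs exactly the prior (`mu = 0`, `log_var = 0`), which would carry no information about the input. The reconstruction term pulls the other way, and training settles on a compromise between the two.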
What VAEs are good at
VAEs are especially useful when the latent representation matters, not just the final sample. They are often used for:
representation learning
interpolation between samples
anomaly detection
structured generation
as components inside larger generative pipelines
Their common weakness
The usual criticism is that VAE outputs can look blurry or less sharp than GAN or diffusion outputs, especially for detailed images. That is not because the model is “bad.” It is because the objective rewards faithful average reconstruction and a well-behaved latent space, not aggressive perceptual realism.
A VAE is often the right tool when the goal is understanding and controlling variation. It is often not the first tool when the only goal is the sharpest possible image.
The reparameterization trick and why it matters
Here is the training problem. The encoder outputs a latent distribution, and the model needs to sample from it before decoding. But naive sampling is like inserting a random jump into the computation graph. Gradients cannot cleanly flow backward through a random draw.
That would block ordinary backpropagation.
The reparameterization trick rewrites the sample so the randomness is separated from the learnable parameters. Instead of saying “sample $z$ directly from a distribution with mean $\mu$ and variance $\sigma^2$,” we write:
$$z = \mu + \sigma \odot \epsilon$$
where $\epsilon$ is sampled from a simple fixed distribution such as $\mathcal{N}(0, I)$.
That tiny rewrite changes everything. The randomness now lives in $\epsilon$, while $\mu$ and $\sigma$ remain part of a differentiable computation. Backpropagation can update them.
A good intuition is this: do not ask the network to backpropagate through a dice roll. Ask it to produce the location and scale of the dice roll, then add external noise in a controlled way.
So the mechanism is:
The encoder produces $\mu$ and $\sigma$.
Sample noise $\epsilon$ from a fixed normal distribution.
Construct $z = \mu + \sigma \odot \epsilon$.
Decode $z$.
Backpropagate through the whole path.
The trick matters because it makes VAEs trainable with standard gradient methods while preserving stochastic latent sampling.
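A scalar sketch shows what "gradients can flow" means here. It assumes a one-dimensional latent variable and uses finite differences as a stand-in for backpropagation. The helper names are hypothetical. With the noise held fixed, the sample is an ordinary differentiable function of the mean and scale.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, with eps drawn outside the learnable path."""
    return mu + sigma * eps

def finite_difference(f, x, h=1e-6):
    """Central-difference estimate of df/dx, standing in for autodiff."""
    return (f(x + h) - f(x - h)) / (2 * h)

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)   # fixed noise, independent of mu and sigma
mu, sigma = 0.5, 2.0        # hypothetical encoder outputs

z = reparameterize(mu, sigma, eps)

# With eps fixed, z has well-defined derivatives with respect to both
# encoder outputs: dz/dmu is 1 and dz/dsigma is eps.
dz_dmu = finite_difference(lambda m: reparameterize(m, sigma, eps), mu)
dz_dsigma = finite_difference(lambda s: reparameterize(mu, s, eps), sigma)
```

In a real VAE an automatic differentiation library computes these derivatives, but the structure is the same: the random draw is an input to the computation, not a step inside it.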
Generative adversarial networks: learning by competition
A GAN uses two neural networks in a game.
The generator tries to produce fake samples that look real.
The discriminator tries to tell real samples from generated ones.
The setup is adversarial: the generator improves by fooling the discriminator, and the discriminator improves by catching the generator’s mistakes. Over training, this pressure can drive the generator toward highly realistic outputs.
If a VAE is like learning to compress and reconstruct, a GAN is like a counterfeiter training against a detective.
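The game can be summarized by the two losses the players minimize. The sketch below assumes the discriminator outputs a probability that its input is real, and it uses the commonly used non-saturating generator loss, which is one choice among several. The probability values are invented.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(fake) toward 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs on a batch of two real and two fake samples.
d_real = [0.9, 0.8]
d_fake_caught = [0.1, 0.2]    # discriminator is confident these are fake
d_fake_fooling = [0.6, 0.7]   # discriminator is mostly fooled

loss_d = discriminator_loss(d_real, d_fake_caught)
loss_g_caught = generator_loss(d_fake_caught)
loss_g_fooling = generator_loss(d_fake_fooling)
```

The same discriminator outputs on fake samples appear in both losses with opposite incentives: what lowers the generator's loss raises the discriminator's.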
Why GANs can look so sharp
The key idea is that the generator is not being rewarded mainly for pixel-by-pixel average similarity. It is being rewarded for producing outputs that the discriminator judges as real. That pushes the model toward perceptual realism: details, textures, and local structure that look convincing to a learned critic.
That is why GANs became famous for sharp image synthesis.
The hard part: training instability
GANs are powerful, but notoriously difficult to train. Two common failure modes matter a lot:
Instability. The generator and discriminator can fall out of balance. If one gets too strong, the other may stop learning useful signals.
Mode collapse. The generator may discover a small set of outputs that reliably fool the discriminator and keep repeating variations of them, instead of covering the full diversity of the data.
The trap here is to think “if the samples look good, the model has learned the whole distribution.” Not necessarily. A GAN can generate beautiful samples while still missing large regions of the data space.
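A toy check makes the gap between quality and coverage visible. It assumes a one-dimensional dataset whose modes are known in advance, which is rarely true of real data. The function `mode_coverage` and all numbers are hypothetical, and this is an illustration, not a standard GAN metric.

```python
def mode_coverage(samples, mode_centers, radius):
    """Fraction of known modes with at least one sample within `radius`."""
    covered = 0
    for center in mode_centers:
        if any(abs(s - center) <= radius for s in samples):
            covered += 1
    return covered / len(mode_centers)

def all_samples_realistic(samples, mode_centers, radius):
    """True if every sample sits close to some real mode."""
    return all(any(abs(s - c) <= radius for c in mode_centers) for s in samples)

# Hypothetical data distribution with three modes.
modes = [-4.0, 0.0, 4.0]

# A collapsed generator: every sample looks plausible, but only one mode appears.
collapsed = [0.01, -0.02, 0.03, 0.00]
# A healthier generator: samples land near all three modes.
diverse = [-3.9, 0.1, 4.05, -4.1]

collapsed_coverage = mode_coverage(collapsed, modes, radius=0.5)
diverse_coverage = mode_coverage(diverse, modes, radius=0.5)
```

Both generators pass the "every sample looks realistic" test, yet the collapsed one covers a third of the modes. Looking at samples one at a time cannot reveal that.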
Evaluation is also harder
Unlike autoregressive models, GANs usually do not provide a clean tractable likelihood. That makes evaluation less straightforward. In practice, people rely more on sample quality, diversity, and task-specific metrics.
So GANs offer a very recognizable trade-off:
strength: often sharp, realistic samples
weakness: unstable training and harder coverage/evaluation
Diffusion models: generating by denoising
Diffusion models generate data by learning how to turn noise into structure, one denoising step at a time. The basic idea is surprisingly physical. Start with a real sample. Gradually add noise to it over many small steps until it becomes almost pure noise. Then train a model to reverse that corruption process.
So there are two linked processes:
a forward process that adds noise step by step
a reverse process that learns to remove noise step by step
If you can learn the reverse well, you can start from random noise and repeatedly denoise until a coherent sample appears.
A good analogy is watching a photo disappear into television static, then teaching a model how to rewind that destruction.
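The forward process can be sketched in a few lines. The example assumes a DDPM-style variance-preserving formulation with a linear noise schedule, which is one common choice and not something specified above. The schedule endpoints, function names, and the three-number "clean sample" are hypothetical.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_sample(x0, alpha_bar, rng):
    """Jump straight to a noised version of x0; also return the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

alpha_bars = make_alpha_bars(1000)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]   # hypothetical "clean" sample

# Halfway through, some signal remains; at the last step almost none does.
x_mid, eps_mid = noise_sample(x0, alpha_bars[499], rng)
x_end, eps_end = noise_sample(x0, alpha_bars[-1], rng)
```

Under these assumptions, `alpha_bar` is the fraction of the original signal's variance that survives, and it shrinks toward zero as steps accumulate. A typical training setup shows the network `xt` and the step index and asks it to predict the `eps` that was added.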
Why diffusion models became so important
Diffusion models rose quickly because they combine several practical advantages:
stable training
strong sample quality
good controllability, especially in conditional settings such as text-to-image generation
This made them central to modern image generation systems. In many settings, they overtook GANs as the default choice for high-quality image synthesis.
Their trade-off
The main cost is generation speed. Because sampling involves many denoising steps, diffusion models are often slower at inference than alternatives that generate in a single forward pass, such as a GAN generator or a VAE decoder.
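The sampling cost is easiest to see in a loop. The sketch below assumes a DDPM-style ancestral sampler and a toy one-dimensional Gaussian "dataset". For that special case the ideal noise prediction has a closed form, so `ideal_noise_prediction` stands in for a trained network. All names and constants are hypothetical.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5   # hypothetical 1-D "data distribution"
NUM_STEPS = 500

# Linear noise schedule (an assumption; other schedules are common).
betas = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def ideal_noise_prediction(xt, t):
    """Stand-in for a trained network: exact E[noise | x_t] for Gaussian data."""
    ab = alpha_bars[t]
    var_t = ab * DATA_STD ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (xt - math.sqrt(ab) * DATA_MEAN) / var_t

def sample_one(rng):
    """Start from pure noise and apply every denoising step in order."""
    x = rng.gauss(0.0, 1.0)
    calls = 0
    for t in reversed(range(NUM_STEPS)):
        eps_hat = ideal_noise_prediction(x, t)   # one "network" call per step
        calls += 1
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
            / math.sqrt(1.0 - betas[t])
        # Fresh noise is added at every step except the last one.
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * noise
    return x, calls

x, calls = sample_one(random.Random(0))
```

Here one sample costs `NUM_STEPS` evaluations of the denoiser, and each step needs the output of the one before it. With a real neural network in place of the closed-form function, each of those evaluations is a full forward pass.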
So the picture is almost the mirror image of GANs:
GANs can be very sharp but hard to train.
Diffusion models are usually more stable but slower to sample.
That is the recurring theme across generative modeling: every family gets its power by making a different bargain.
How the major families compare in practice
Once the core ideas are clear, the comparison becomes easier. Each family answers the same problem—model data and generate samples—but with a different training logic and a different practical trade-off.
This broad picture matches common comparisons in surveys and practitioner overviews: VAEs are valued for structured latent spaces, GANs for sharp realism, autoregressive models for sequence modeling with explicit next-step probabilities, and diffusion models for high-quality, stable image generation.
A practical rule of thumb:
Choose autoregressive models when the data is naturally sequential and next-step prediction makes sense.
Choose VAEs when you care about a smooth, useful latent space.
Choose GANs when sample sharpness is central and you can tolerate harder training.
Choose diffusion models when you want top-tier image quality and strong conditional control, and slower sampling is acceptable.
The trap is to ask which family is “best.” There is no global best. There is only the best match between objective, data type, and trade-off.