AI Agents for Image and Video Generation

Updated 23 May 2026

https://learn.deeplearning.ai/courses/ai-agents-for-image-and-video-generation/lesson/txerfn/prompt-engineering-for-image-generation

Introduction

SigLIP, instead of CLIP

Overview of Generative Media

unimodal - same modality in and out
crossmodal - output modality different from input
multimodal - multiple in and multiple out

scope - specialized (Lyria, Veo) vs general purpose (Gemini)

paradigms - autoregressive vs diffusion-based

autoregressive generation

encode - tokenizer
generate - generative core (transformer)
decode - detokenizer
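
A toy sketch of the three stages above. Everything here is a made-up stand-in: the vocabulary is invented, and the "generative core" is a fixed lookup table rather than a transformer, so it only looks at the last token.

```python
# Toy autoregressive pipeline: encode -> generate -> decode.
# All names and data are hypothetical; no real model is involved.

VOCAB = ["<bos>", "a", "red", "fox", "runs", "<eos>"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Stand-in for the generative core: a fixed next-token table.
NEXT_TOKEN = {"<bos>": "a", "a": "red", "red": "fox", "fox": "runs", "runs": "<eos>"}


def encode(text: str) -> list[int]:
    """Tokenizer: map whitespace-separated words to token ids."""
    return [TOKEN_ID[word] for word in text.split()]


def generate(prompt_ids: list[int], max_new_tokens: int = 10) -> list[int]:
    """Append one token at a time, chosen from the last token only.

    A real transformer would condition on the whole sequence so far.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        last = VOCAB[ids[-1]]
        nxt = NEXT_TOKEN.get(last, "<eos>")
        ids.append(TOKEN_ID[nxt])
        if nxt == "<eos>":
            break
    return ids


def decode(ids: list[int]) -> str:
    """Detokenizer: map ids back to words, dropping special tokens."""
    return " ".join(VOCAB[i] for i in ids if VOCAB[i] not in ("<bos>", "<eos>"))


output = decode(generate(encode("<bos> a")))
print(output)
```

For image or audio output the tokens would stand for patches or codec frames rather than words, but the loop has the same shape.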

diffusion-based generation

condition
denoise
decode

latent diffusion - works in an intermediate (latent) space and only converts to pixels at the end.
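
A toy sketch of condition -> denoise -> decode in a tiny latent space. This is an illustration under heavy assumptions: the "text encoder" just seeds a random vector from the prompt, and the denoiser cheats by stepping straight toward that vector, whereas a real model learns to predict the noise to remove.

```python
import random

# Toy latent diffusion loop. All functions are hypothetical stand-ins.


def condition(prompt: str, dim: int = 4) -> list[float]:
    """Stand-in text encoder: derive a deterministic target latent from the prompt."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def denoise(latent: list[float], target: list[float],
            steps: int = 20, rate: float = 0.3) -> list[float]:
    """Move the noisy latent a fraction of the way toward the target at each step."""
    for _ in range(steps):
        latent = [z + rate * (t - z) for z, t in zip(latent, target)]
    return latent


def decode(latent: list[float]) -> list[int]:
    """Stand-in decoder: map latent values in [-1, 1] to pixel values in 0..255."""
    return [max(0, min(255, round((z + 1.0) * 127.5))) for z in latent]


noise_rng = random.Random(0)
target = condition("a red fox")
noise = [noise_rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
final_latent = denoise(noise, target)
pixels = decode(final_latent)  # pixels are produced only once, at the end
print(pixels)
```

All the iteration happens on the small latent vector, and the conversion to pixels runs once, which is the point of the latent diffusion note above.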

hybrid generation

input media -> autoregressive + diffusion -> output media

multi-image composition
multi-turn generation
text rendering - historically hard, works better now

video generation

reference to video
first frame and last frame, interpolates in between

Nano Banana - for image gen
Veo - for video gen

Prompt Engineering for Image Generation

Need structured formula

Subject
Action
Location
Composition - camera controls and lighting details
Style - overall aesthetic
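
A small sketch of the formula as code, so each part gets filled in deliberately. The class name, the sentence layout, and the example values are my own assumptions, not something from the course.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """Hypothetical container for the five parts of the prompt formula."""
    subject: str
    action: str
    location: str
    composition: str  # camera controls and lighting details
    style: str        # overall aesthetic

    def render(self) -> str:
        """Join the parts into one prompt string (the layout is an arbitrary choice)."""
        return (f"{self.subject} {self.action} {self.location}. "
                f"{self.composition}. {self.style}.")


prompt = ImagePrompt(
    subject="A golden retriever",
    action="leaping to catch a frisbee",
    location="on a foggy beach at dawn",
    composition="Low-angle wide shot, soft backlighting",
    style="Cinematic photo, muted colors",
).render()
print(prompt)
```

The resulting string is what would be passed to the image model as the text prompt.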

Generate an image from a basic text prompt
Formalize prompt structure
Reference stylized input images
Generate an image from an enhanced prompt

The output from my example notebook looks worse than the ones in the lecture video.

Prompt Engineering for Video Generation

Video prompt attributes

Subject
Action
Scene
Style
Temporal elements - to imply changes within a take

Camera components

Camera angles
Camera movements
Lens effects

Audio components

Dialogue
Sound effects

Takes an image as input for the video.

Video includes sound

Wow, I can make an API call and get a video. This is scary.

Can generate multiple videos at once.

Prompt is meta, generates a prompt from keywords.

Evaluation Techniques

Evaluation is hard

Quality is subjective
No single ground truth
Multiple dimensions to assess
Evaluation at scale requires automation, but automation misses nuance

Need an evaluation system. Automate evaluation

Scoring metrics - SigLIP (image-text alignment), FVD (video quality, distribution level)
LLM-as-a-judge
Rubric-based - structured form of LLM as a judge

Human evaluation

Gold standard

Scoring metrics

SigLIP - similarity between text and image encodings; a quick filter for prompt adherence
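
A sketch of the quick-filter idea using cosine similarity between embedding vectors. The vectors below are made up; in practice they would come from a SigLIP text encoder and image encoder, and the threshold is an arbitrary assumption that would need tuning.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def quick_filter(text_emb: list[float], image_embs: dict[str, list[float]],
                 threshold: float = 0.5) -> list[str]:
    """Keep only the images whose embedding is close enough to the prompt embedding."""
    return [name for name, emb in image_embs.items()
            if cosine_similarity(text_emb, emb) >= threshold]


# Hypothetical 3-dimensional embeddings (real ones have hundreds of dimensions).
text_emb = [1.0, 0.0, 1.0]
image_embs = {
    "on_prompt.png": [0.9, 0.1, 0.8],
    "off_prompt.png": [-0.2, 1.0, 0.1],
}
kept = quick_filter(text_emb, image_embs)
print(kept)
```

The score says how close an image is to the prompt but not why, which is where the judge and rubric steps come in.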

LLM-as-a-judge

Need good model and good instructions

Rubric-based evaluation

Break the process into explicit, verifiable questions
Google’s Gecko
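
A sketch of rubric-style scoring: a list of yes/no questions, each answered by a judge, averaged into one score. This is not Gecko's API; the `judge` callable is a hypothetical stand-in for a call to a multimodal model, and the questions are invented.

```python
from typing import Callable


def rubric_score(questions: list[str],
                 judge: Callable[[str], bool]) -> tuple[float, dict[str, bool]]:
    """Ask the judge every rubric question; return the pass fraction and per-question answers."""
    answers = {question: judge(question) for question in questions}
    score = sum(answers.values()) / len(questions)
    return score, answers


rubric = [
    "Is there a dog in the image?",
    "Is the dog catching a frisbee?",
    "Is the scene on a beach?",
    "Is the lighting consistent with dawn?",
]


def stub_judge(question: str) -> bool:
    """Stand-in judge: pretend everything is right except the frisbee."""
    return "frisbee" not in question


score, answers = rubric_score(rubric, stub_judge)
print(score)
for question, passed in answers.items():
    print("PASS" if passed else "FAIL", question)
```

Unlike a single similarity number, the per-question answers show which part of the prompt was missed.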

Human evaluation

For high stakes decisions
Expensive
Side by side comparisons

Generate -> auto-score (scoring metrics) -> evaluate (LLM/rubric) -> validate (human review) -> iterate -> repeat by feeding back insights into generation

Want to filter down, so we have less for humans to evaluate.
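
A sketch of the filtering funnel: cheap automatic scores first, the more expensive judge second, and only the survivors go to people. The scoring functions, thresholds, and file names are placeholders I made up.

```python
from typing import Callable


def funnel(candidates: list[str],
           auto_score: Callable[[str], float],
           judge_score: Callable[[str], float],
           auto_min: float = 0.5,
           judge_min: float = 0.7) -> tuple[list[str], list[str]]:
    """Return (survivors of auto-scoring, survivors of the judge stage)."""
    after_auto = [c for c in candidates if auto_score(c) >= auto_min]
    # The judge only runs on what the cheap metric let through.
    after_judge = [c for c in after_auto if judge_score(c) >= judge_min]
    return after_auto, after_judge


# Hypothetical scores for four generated images.
AUTO = {"a.png": 0.9, "b.png": 0.3, "c.png": 0.6, "d.png": 0.8}
JUDGE = {"a.png": 0.8, "c.png": 0.5, "d.png": 0.9}

after_auto, for_human_review = funnel(list(AUTO), AUTO.__getitem__, JUDGE.__getitem__)
print(after_auto)
print(for_human_review)
```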

SigLIP gives a score but no explanation.
Gemini scores on prompt adherence, visual quality, coherence, and creativity, and also gives an explanation.
Gecko has an evaluation service; given a dataset with two rows, it generates the rubric automatically.

Image Generation Agent

ADK - agent development kit

4 tools for UI Design Agent

Brand analysis - define a brand identity from a style image
Design concept - create a UI concept from the analysis
Generate idea image - use Nano Banana
Evaluate image - score against UI metrics

ADK - model, tools, instruction
Uses an iterative loop

Loop is specified in system prompt.
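
A plain-Python sketch of the generate-then-evaluate loop, to make the control flow concrete. This does not use the ADK API: in the course the loop lives in the system prompt and the model decides when to call each tool, while here ordinary code drives it. The tool functions and scores are hypothetical stubs.

```python
# Hypothetical evaluation scores for attempts 1, 2, 3.
STUB_SCORES = [0.6, 0.8, 0.9]


def generate_image(concept: str, attempt: int) -> str:
    """Stub for the image generation tool: returns a fake file name."""
    return f"{concept}-v{attempt}.png"


def evaluate_image(attempt: int) -> float:
    """Stub for the evaluation tool: looks up a canned score for this attempt."""
    return STUB_SCORES[attempt - 1]


def run_loop(concept: str, threshold: float = 0.75,
             max_attempts: int = 3) -> list[tuple[str, float]]:
    """Generate and evaluate until the score clears the threshold or attempts run out."""
    history = []
    for attempt in range(1, max_attempts + 1):
        image = generate_image(concept, attempt)
        score = evaluate_image(attempt)
        history.append((image, score))
        if score >= threshold:
            break
    return history


history = run_loop("dashboard")
print(history)
```

A hard cap on attempts matters either way, since a loop described only in a prompt has no guaranteed stopping point.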

Evaluation on second image of second concept had a zero overall score… hallucinated?

Video Generation Agent

Video Agent (ADK)

Plan scenes
Generate scene image
Generate scene video
Evaluate scene

Image to video gives you more control than text to video.

Voice profile and style prefix keep things consistent.

FFmpeg to concatenate videos
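
A sketch of how the concat step could be prepared from Python using FFmpeg's concat demuxer, which reads a text file with one `file '<path>'` line per clip. The file names are made up, the sketch assumes paths contain no single quotes, and it only builds the list text and the command without running anything. Stream copy (`-c copy`) assumes all clips share the same codec and parameters.

```python
def concat_list(paths: list[str]) -> str:
    """Text for the concat demuxer list file: one `file '<path>'` line per clip."""
    return "".join(f"file '{path}'\n" for path in paths)


def concat_command(list_file: str, output: str) -> list[str]:
    """Argument list for ffmpeg; -safe 0 allows arbitrary paths in the list file."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]


scenes = ["scene_1.mp4", "scene_2.mp4", "scene_3.mp4"]  # hypothetical scene clips
list_text = concat_list(scenes)
command = concat_command("scenes.txt", "final.mp4")
print(list_text)
print(" ".join(command))
```

To run it for real, write `list_text` to `scenes.txt` and pass `command` to `subprocess.run(command, check=True)`.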

Building Media Agent with AI

Use Gemini CLI.

Image & ADK skill -> Gemini CLI -> agent.py

Add skills
Configure the agent
Run local and iterate

Agent hallucinated the cloud project id. Need to keep identifiers away from the agent.
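
One way to keep identifiers away from the agent, sketched below: read the project id from configuration and bind it into the tool, so the model only ever supplies the prompt. The tool function and its return value are hypothetical; `GOOGLE_CLOUD_PROJECT` is the conventional environment variable name, and the demo passes an explicit dict so it runs without that variable set.

```python
import os
from typing import Callable, Mapping


def get_project_id(env: Mapping[str, str] = os.environ) -> str:
    """Read the project id from configuration instead of from model output."""
    project = env.get("GOOGLE_CLOUD_PROJECT", "")
    if not project:
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set")
    return project


def make_generate_tool(project_id: str) -> Callable[[str], dict]:
    """Build a tool whose only model-facing argument is the prompt."""
    def generate_image(prompt: str) -> dict:
        # Hypothetical request payload; the project id is fixed here, not chosen by the agent.
        return {"project": project_id, "prompt": prompt}
    return generate_image


tool = make_generate_tool(get_project_id({"GOOGLE_CLOUD_PROJECT": "demo-project"}))
request = tool("a login screen in the brand style")
print(request)
```

Because the id never appears in the tool's signature, there is nothing for the model to make up.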

Building Media Agent with AI Lab

I can use it without having to spend API monies.

Conclusion

Safety and responsibility