AI Agents for Image and Video Generation

Updated 23 May 2026

https://learn.deeplearning.ai/courses/ai-agents-for-image-and-video-generation/lesson/txerfn/prompt-engineering-for-image-generation

Introduction

SigLIP, instead of CLIP

Overview of Generative Media

unimodal - same modality in and out

crossmodal - modality out different from in

multimodal - multiple in and multiple out

scope - specialized (Lyria, Veo) vs general purpose (Gemini)

paradigms - autoregressive vs diffusion-based

autoregressive generation

  1. encode - tokenizer
  2. generate - generative core (transformer)
  3. decode - detokenizer
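
A toy sketch of the three stages above, not how a real model works. The vocabulary and the next-token rule are made up for illustration; a real system uses a learned tokenizer and a transformer.

```python
# Toy illustration only: vocabulary and next-token rule are hypothetical.
VOCAB = ["<bos>", "a", "red", "cat", "<eos>"]


def encode(text):
    """Step 1 (tokenizer): map known words to integer token ids."""
    return [VOCAB.index(word) for word in text.split()]


def next_token(tokens):
    """Step 2 stand-in for the transformer: emit the id after the last one."""
    return min(tokens[-1] + 1, len(VOCAB) - 1)


def generate(tokens, max_new=10):
    """Append one token at a time, each conditioned on everything so far."""
    tokens = list(tokens)
    eos = VOCAB.index("<eos>")
    for _ in range(max_new):
        token = next_token(tokens)
        tokens.append(token)
        if token == eos:
            break
    return tokens


def decode(tokens):
    """Step 3 (detokenizer): map token ids back to the output modality."""
    return " ".join(VOCAB[t] for t in tokens)
```

The point is the shape of the loop: each new token depends on all earlier ones, so generation is sequential.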

diffusion-based generation

  1. condition
  2. denoise
  3. decode

latent diffusion - work in an intermediate (latent) space, only convert to pixels at the end.
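
A toy sketch of condition, denoise, decode. Everything here is an assumption for illustration: the "conditioning vector" is derived from character codes, and the denoise step just moves the latent halfway toward it, where a real model would predict and remove noise.

```python
import random


def condition(prompt):
    """Step 1: turn the prompt into a conditioning vector (toy: from char codes)."""
    return [(ord(ch) % 10) / 10 for ch in prompt[:4].ljust(4)]


def denoise(cond, steps=20, seed=0):
    """Step 2: start from random noise in latent space and refine it stepwise."""
    rng = random.Random(seed)
    latent = [rng.gauss(0, 1) for _ in cond]
    for _ in range(steps):
        # Toy update: move each latent value halfway toward the conditioning.
        latent = [x + 0.5 * (c - x) for x, c in zip(latent, cond)]
    return latent


def decode(latent):
    """Step 3: convert the final latent to pixel values, only at the end."""
    return [max(0, min(255, round(x * 255))) for x in latent]
```

All the iterative work happens on the small latent list; pixels only appear in the last call.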

hybrid generation

input media -> autoregressive + diffusion -> output media

multi-image composition

multi-turn generation

text rendering - hard historically, works better now

video generation

reference to video

first frame and last frame, interpolates in between

nanobanana - for image gen

veo - for video gen

Prompt Engineering for Image Generation

Need structured formula

  1. Generate an image from a basic text prompt
  2. Formalize prompt structure
  3. Reference stylized input images
  4. Generate an image from an enhanced prompt
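
One way to formalize the prompt structure is a small container with named slots. The field names below (subject, setting, style, lighting, composition) are my own guess at a reasonable formula, not necessarily the one from the lesson.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    # Hypothetical slots; adjust to whatever formula the lesson uses.
    subject: str
    setting: str = ""
    style: str = ""
    lighting: str = ""
    composition: str = ""

    def render(self):
        """Join the non-empty slots into one comma-separated prompt string."""
        parts = [self.subject, self.setting, self.style,
                 self.lighting, self.composition]
        return ", ".join(p.strip() for p in parts if p.strip())


basic = ImagePrompt(subject="a red fox")
enhanced = ImagePrompt(
    subject="a red fox",
    setting="in a snowy forest",
    style="watercolor illustration",
    lighting="soft morning light",
)
```

Here `basic.render()` gives the bare subject, and `enhanced.render()` gives the same subject with the extra slots appended, which makes it easy to compare outputs for step 1 and step 4.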

The output from my example notebook looks worse than the ones in the lecture video.

Prompt Engineering for Video Generation

Video prompt attributes

Camera components

Audio components

Takes in video or image input.

Video includes sound

Wow, I can make an API call and get a video. This is scary.

Can generate multiple videos at once.

Prompt is meta, generates a prompt from keywords.
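
A minimal sketch of the meta-prompt idea: turn keywords into an instruction for a text model, which then writes the actual video prompt. The function name, the attribute list, and the wording of the instruction are all assumptions, and the call to the text model is left out.

```python
def build_meta_prompt(keywords,
                      attributes=("subject", "action", "camera", "audio")):
    """Build an instruction asking a text model to write a video prompt.

    `attributes` is a hypothetical checklist of video prompt attributes.
    """
    keyword_list = ", ".join(k.strip() for k in keywords if k.strip())
    if not keyword_list:
        raise ValueError("at least one keyword is required")
    attribute_list = ", ".join(attributes)
    return (
        "Write a single detailed prompt for a video generation model.\n"
        f"Keywords: {keyword_list}\n"
        f"Cover these attributes: {attribute_list}\n"
        "Return only the prompt text."
    )


meta_prompt = build_meta_prompt(["fox", "snow", "slow pan"])
```

The returned string goes to a text model first; its reply is what gets sent to the video model.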

Evaluation Techniques

Evaluation is hard

Need an evaluation system. Automate evaluation

human evaluation

Scoring metrics

LMM-as-a-judge

Rubric-based evaluation

Human evaluation

Generate -> auto-score (scoring metrics) -> evaluate (LLM/rubric) -> validate (human review) -> iterate -> repeat by feeding back insights into generation

Want to filter down, so we have less for humans to evaluate.
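
The filtering idea can be sketched as a funnel: cheap automatic scores first, the judge model second, humans only on what survives. The scoring functions and thresholds here are placeholders I made up; plug in real scorers.

```python
def evaluation_funnel(candidates, auto_score, judge_score,
                      auto_min=0.5, judge_min=0.7):
    """Filter candidates in two automated stages before human review.

    auto_score and judge_score are hypothetical callables returning 0..1.
    """
    passed_auto, dropped_auto = [], []
    for item in candidates:
        (passed_auto if auto_score(item) >= auto_min else dropped_auto).append(item)

    for_humans, dropped_judge = [], []
    for item in passed_auto:
        (for_humans if judge_score(item) >= judge_min else dropped_judge).append(item)

    return {
        "for_human_review": for_humans,
        "dropped_by_auto_score": dropped_auto,
        "dropped_by_judge": dropped_judge,
    }
```

The judge only runs on items that passed the cheap metric, so the more expensive stages see fewer candidates.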

SigLIP will give a score, but not explain it.

Gemini will score on prompt adherence, visual quality, coherence, creativity. It also gives an explanation.

Gecko has an evaluation service. Given a dataset with two rows, it generates the rubric automatically.
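
A sketch of handling a judge reply that scores the four criteria and explains itself. The JSON shape is an assumption (I am not quoting the notebook's actual response format), and the overall score is a plain average.

```python
import json

CRITERIA = ("prompt_adherence", "visual_quality", "coherence", "creativity")


def parse_judgement(reply_text):
    """Parse a judge reply of the assumed form
    {"scores": {criterion: number, ...}, "explanation": "..."}.
    """
    data = json.loads(reply_text)
    # A missing criterion raises KeyError rather than silently scoring zero.
    scores = {c: float(data["scores"][c]) for c in CRITERIA}
    return {
        "scores": scores,
        "overall": sum(scores.values()) / len(scores),
        "explanation": str(data.get("explanation", "")),
    }


example_reply = json.dumps({
    "scores": {"prompt_adherence": 4, "visual_quality": 3,
               "coherence": 5, "creativity": 4},
    "explanation": "Matches the prompt; some blur in the background.",
})
judgement = parse_judgement(example_reply)
```

Keeping the explanation next to the numbers is what a bare similarity score cannot give you.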

Image Generation Agent

ADK - agent development kit

4 tools for UI Design Agent

  1. Brand analysis - define a brand identity from a style image
  2. Design concept - create a UI concept from the analysis
  3. Generate idea image - use nanobanana
  4. Evaluate image - score against UI metrics

ADK - model, tools, instruction

Uses iterative loop

Loop is specified in system prompt.
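
For comparison, the same generate-evaluate loop written out in plain Python rather than in the system prompt. This is not ADK code; `generate` and `evaluate` are hypothetical callables standing in for the image and evaluation tools.

```python
def refine(generate, evaluate, threshold=0.8, max_rounds=3):
    """Generate, score, and retry with feedback until good enough.

    generate(feedback) -> image; evaluate(image) -> (score, feedback).
    Returns (best_image, best_score, round_number).
    """
    feedback = None
    best = None
    for round_no in range(1, max_rounds + 1):
        image = generate(feedback)
        score, feedback = evaluate(image)
        if best is None or score > best[1]:
            best = (image, score, round_no)
        if score >= threshold:
            break  # Stop early once the score clears the bar.
    return best
```

Written this way the stopping rule is enforced by code. When the loop lives in the system prompt, it is up to the model to follow it.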

Evaluation on second image of second concept had a zero overall score… hallucinated?
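
One cheap guard against a result like this is to recompute the overall score from the per-metric scores and flag a mismatch. The helper below is my own sketch; it assumes the overall score is meant to be roughly the average of the metrics, which may not be how the notebook defines it.

```python
def is_suspicious(metric_scores, reported_overall, tolerance=0.5):
    """Flag an evaluation whose reported overall score does not match
    the average of its per-metric scores (assumed relationship)."""
    if not metric_scores:
        return True  # No metrics at all is itself worth a second look.
    expected = sum(metric_scores.values()) / len(metric_scores)
    return abs(expected - reported_overall) > tolerance
```

A flagged evaluation can be re-run or sent to human review instead of being trusted as-is.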

Video Generation Agent

Video Agent (ADK)

  1. Plan scenes
  2. Generate scene image
  3. Generate scene video
  4. Evaluate scene

Image to video gives you more control than text to video.

Voice profile and style prefix keep things consistent.
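
A sketch of how a shared style prefix and voice profile could be applied to every scene prompt. The function name and the exact prompt layout are assumptions.

```python
def scene_prompts(scenes, style_prefix, voice_profile):
    """Wrap each scene description with the same style prefix and voice
    profile so every generated clip shares look and narration."""
    return [
        f"{style_prefix.strip()} {scene.strip()} Voice: {voice_profile.strip()}"
        for scene in scenes
    ]


prompts = scene_prompts(
    ["A fox wakes up.", "The fox runs through snow."],
    style_prefix="Watercolor animation.",
    voice_profile="calm narrator",
)
```

Only the middle part changes from scene to scene, which is what keeps the clips consistent.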

ffmpeg to concatenate the scene videos
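
A sketch of the concat step using ffmpeg's concat demuxer. The helper only builds the list file contents and the command; file names are placeholders, and I have not checked which ffmpeg invocation the course notebook uses.

```python
def concat_plan(scene_paths, output_path, list_path="scenes.txt"):
    """Return (list_file_text, command) for ffmpeg's concat demuxer."""
    if not scene_paths:
        raise ValueError("need at least one scene video")
    lines = []
    for path in scene_paths:
        # Inside the list file, a single quote is written as '\''.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    list_text = "\n".join(lines) + "\n"
    command = [
        "ffmpeg", "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-c", "copy",  # Stream copy: no re-encode, clips must share codecs.
        output_path,
    ]
    return list_text, command


list_text, command = concat_plan(["scene1.mp4", "scene2.mp4"], "final.mp4")
```

To actually run it, write `list_text` to `scenes.txt` and pass `command` to `subprocess.run(command, check=True)`. If the clips differ in codec or resolution, stream copy will not work and a re-encode is needed instead.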

Building Media Agent with AI

Use Gemini CLI.

Image & ADK skill -> Gemini CLI -> agent.py

Agent hallucinated the cloud project id. Need to keep identifiers away from the agent.
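
One way to keep identifiers away from the agent: resolve them in code from the environment and fail loudly if they are missing, so the model never has to supply one. `GOOGLE_CLOUD_PROJECT` is the conventional variable name; the helper itself is my own sketch.

```python
import os


def resolve_project_id(env=None):
    """Read the cloud project id from the environment, never from the model."""
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project_id:
        raise RuntimeError(
            "GOOGLE_CLOUD_PROJECT is not set; refusing to guess a project id"
        )
    return project_id
```

Tool code calls `resolve_project_id()` itself, so the project id never appears in the prompt or in anything the agent generates.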

Building Media Agent with AI Lab

I can use it without having to spend API monies.

Conclusion

safety and responsibility