AI Agents for Image and Video Generation
Updated 23 May 2026
Introduction
SigLIP, instead of CLIP
Overview of Generative Media
unimodal - same modality in and out
cross-modal - output modality differs from input
multimodal - multiple modalities in and multiple out
scope - specialized (Lyria, Veo) vs general purpose (Gemini)
paradigms - autoregressive vs diffusion-based
autoregressive generation
- encode - tokenizer
- generate - generative core (transformer)
- decode - detokenizer
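A toy sketch of the encode / generate / decode stages above. Everything in it is an assumption made for illustration: the four-symbol vocabulary, the function names, and especially `next_token`, which is a fixed rule standing in for a trained transformer.

```python
# Toy autoregressive pipeline: tokenizer -> generative core -> detokenizer.
# All names and the vocabulary are hypothetical; a real system uses a
# learned tokenizer and a transformer instead of the rule in next_token().

VOCAB = ["<eos>", "a", "b", "c"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Tokenizer: map each character to an integer id."""
    return [TOKEN_ID[ch] for ch in text]


def next_token(context: list[int]) -> int:
    """Stand-in for the generative core.

    Looks only at the last token and cycles a -> b -> c -> <eos>.
    """
    return (context[-1] + 1) % len(VOCAB)


def generate(prompt_ids: list[int], max_new_tokens: int = 8) -> list[int]:
    """Autoregressive loop: each new token is appended and fed back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        if tok == TOKEN_ID["<eos>"]:
            break
        ids.append(tok)
    return ids


def decode(ids: list[int]) -> str:
    """Detokenizer: map ids back to characters."""
    return "".join(VOCAB[i] for i in ids)


print(decode(generate(encode("a"))))
```

For media the tokens would be image or audio codes rather than characters, but the loop has the same shape.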
diffusion-based generation
- condition
- denoise
- decode
latent diffusion: work in an intermediate (latent) space and only convert to pixels at the end.
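A sketch of the control flow only (condition, denoise repeatedly in latent space, decode once at the end). It is not a trained diffusion model: `condition`, `denoise_step` and `decode` are hypothetical stand-ins written by hand, and the 4-value latent is an arbitrary choice.

```python
import random


def condition(prompt: str, dim: int = 4) -> list[float]:
    """Stand-in for a text encoder: a repeatable target latent per prompt."""
    rng = random.Random(prompt)  # seeded by the prompt text
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def denoise_step(latent: list[float], cond: list[float],
                 step_size: float = 0.3) -> list[float]:
    """Stand-in for the learned denoiser: nudge the latent toward the condition."""
    return [x + step_size * (c - x) for x, c in zip(latent, cond)]


def decode(latent: list[float]) -> list[int]:
    """Stand-in for the decoder: map latent values to 0..255 'pixels'."""
    return [max(0, min(255, round((x + 1.0) * 127.5))) for x in latent]


def generate(prompt: str, steps: int = 20, seed: int = 0) -> list[int]:
    cond = condition(prompt)
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure noise
    for _ in range(steps):
        latent = denoise_step(latent, cond)  # all work stays in latent space
    return decode(latent)  # pixels are produced only once, at the end


print(generate("a red fox in snow"))
```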
hybrid generation
input media -> autoregressive + diffusion -> output media
multi-image composition
multi-turn generation
text rendering - hard historically, works better now
video generation
- reference to video
- first frame and last frame, interpolates in between
nanobanana - for image gen
Veo - for video gen
Prompt Engineering for Image Generation
Need structured formula
- Subject
- Action
- Location
- Composition - camera controls and lighting details
- Style - overall aesthetic
- Generate an image from a basic text prompt
- Formalize prompt structure
- Reference stylized input images
- Generate an image from an enhanced prompt
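One way to formalize the Subject / Action / Location / Composition / Style formula is a small dataclass. The class name, the field order in the rendered string, and the example values are my own assumptions, not something a particular model requires.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """Hypothetical container for the five-part image prompt formula."""
    subject: str
    action: str
    location: str
    composition: str  # camera controls and lighting details
    style: str        # overall aesthetic

    def render(self) -> str:
        parts = [
            f"{self.subject} {self.action} {self.location}",
            self.composition,
            self.style,
        ]
        # Skip any part that was left blank.
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = ImagePrompt(
    subject="a red fox",
    action="leaping over a log",
    location="in a snowy forest",
    composition="low-angle shot, soft morning light",
    style="watercolor illustration",
)
print(prompt.render())
```

Keeping the parts separate makes it easy to change one attribute (say, the style) and regenerate while holding the rest fixed.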
The output from my example notebook looks worse than the ones in the lecture video.
Prompt Engineering for Video Generation
Video prompt attributes
- Subject
- Action
- Scene
- Style
- Temporal elements - to imply changes over time
Camera components
- Camera angles
- Camera movements
- Lens effects
Audio components
- Dialogue
- Sound effects
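A sketch that assembles the three groups above (prompt attributes, camera, audio) into one prompt. The bracketed section layout and the function name are my own convention, assumed for illustration; it is not a format that Veo or any other model is known to require.

```python
def build_video_prompt(attributes: dict, camera: dict, audio: dict) -> str:
    """Render one labelled line per group, skipping empty fields and groups."""
    lines = []
    groups = (("Scene", attributes), ("Camera", camera), ("Audio", audio))
    for title, fields in groups:
        filled = [f"{key}: {value}" for key, value in fields.items() if value]
        if filled:
            lines.append(f"[{title}] " + "; ".join(filled))
    return "\n".join(lines)


print(build_video_prompt(
    attributes={
        "subject": "a lighthouse keeper",
        "action": "climbing a spiral staircase",
        "scene": "stormy coast at dusk",
        "style": "moody, cinematic",
        "temporal": "the storm builds as he climbs",
    },
    camera={"angle": "low angle", "movement": "slow dolly in", "lens": "shallow depth of field"},
    audio={"dialogue": "", "sound effects": "wind, crashing waves"},
))
```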
Takes video or image as input.
Video includes sound
Wow, I can make an api call and get a video. This is scary.
Can generate multiple videos at once.
Prompt is meta, generates a prompt from keywords.
Evaluation Techniques
Evaluation is hard
- Quality is subjective
- No single ground truth
- Multiple dimensions to assess
- Evaluation at scale requires automation, but automation misses nuance
Need an evaluation system.
Automated evaluation
- Scoring metrics: SigLIP (image-text alignment), FVD (video quality, distribution level)
- LLM-as-a-judge
- Rubric-based - structured form of LLM as a judge
Human evaluation
- Gold standard
Scoring metrics
- SigLIP: similarity between text and image encodings; quick filter for prompt adherence
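A sketch of the idea behind a SigLIP-style score: take the cosine similarity of the image and text embeddings, scale and shift it, and squash it with a sigmoid. The embeddings below are made-up 3-d vectors, and `scale` and `bias` are illustrative values I picked; in the real model the embeddings come from trained encoders and the scale and bias are learned.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sigmoid_alignment(image_emb: list[float], text_emb: list[float],
                      scale: float = 10.0, bias: float = -5.0) -> float:
    """Sigmoid of the scaled, shifted cosine similarity (values are assumptions)."""
    logit = scale * cosine(image_emb, text_emb) + bias
    return 1.0 / (1.0 + math.exp(-logit))


text = [0.2, 0.9, 0.1]          # hypothetical prompt embedding
good_image = [0.25, 0.85, 0.05]  # points in nearly the same direction
bad_image = [0.9, -0.1, 0.4]     # points elsewhere

for name, emb in (("good", good_image), ("bad", bad_image)):
    print(name, round(sigmoid_alignment(emb, text), 3))
```

The score is a single number per image-prompt pair, which is why it works as a quick filter but says nothing about why a pair scored low.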
LLM-as-a-judge
- Need good model and good instructions
Rubric-based evaluation
- Break the process into explicit, verifiable questions
- Google’s Gecko
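A sketch of rubric-based scoring in general: a list of yes/no questions, a judge that answers each one, and the pass fraction as the score. This is not Gecko's implementation. The `judge` callable is a hypothetical stand-in for an LLM call, and the questions are invented for the example.

```python
from typing import Callable


def rubric_score(questions: list[str], judge: Callable[[str], bool]) -> dict:
    """Ask each verifiable question once and report the fraction that passed."""
    answers = {q: bool(judge(q)) for q in questions}
    passed = sum(answers.values())
    score = passed / len(questions) if questions else 0.0
    return {"answers": answers, "score": score}


questions = [
    "Is there exactly one fox in the image?",
    "Is the fox red?",
    "Is the setting a snowy forest?",
    "Is the image in a watercolor style?",
]


def fake_judge(question: str) -> bool:
    """Stub judge: pretends only the style check failed."""
    return "watercolor" not in question


result = rubric_score(questions, fake_judge)
print(result["score"])
for question, ok in result["answers"].items():
    print("PASS" if ok else "FAIL", question)
```

Unlike a single similarity score, the per-question answers show which part of the prompt was missed.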
Human evaluation
- For high stakes decisions
- Expensive
- Side by side comparisons
Generate -> auto-score (scoring metrics) -> evaluate (LLM/rubric) -> validate (human review) -> iterate -> repeat by feeding back insights into generation
Want to filter down, so we have less for humans to evaluate.
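A sketch of the filtering funnel: a cheap automatic score drops weak candidates first, a more expensive judge ranks the survivors, and only the top few go to human review. The function name, the threshold, the budget, and the scores are assumptions made for illustration.

```python
from typing import Callable


def evaluation_funnel(
    candidates: list[str],
    auto_score: Callable[[str], float],
    judge_score: Callable[[str], float],
    auto_threshold: float = 0.5,
    human_budget: int = 2,
) -> list[str]:
    """Return the candidates that should be passed on to human reviewers."""
    # Stage 1: cheap metric (e.g. image-text alignment) as a coarse filter.
    survivors = [c for c in candidates if auto_score(c) >= auto_threshold]
    # Stage 2: expensive judge (LLM or rubric) ranks what is left.
    ranked = sorted(survivors, key=judge_score, reverse=True)
    # Stage 3: humans only see as many items as the budget allows.
    return ranked[:human_budget]


# Made-up scores for four generated images.
auto = {"img_a": 0.9, "img_b": 0.2, "img_c": 0.7, "img_d": 0.6}
judge = {"img_a": 3.0, "img_b": 5.0, "img_c": 4.0, "img_d": 1.0}

for_review = evaluation_funnel(list(auto), auto.get, judge.get)
print(for_review)
```

Note that `img_b` has the best judge score but never reaches the judge, which is the trade-off of filtering on a cheap metric first.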
SigLIP will give a score, but not explain it.
Gemini will score on prompt adherence, visual quality, coherence, and creativity, and also gives an explanation.
Gecko has an evaluation service; given a dataset with two rows, it generates the rubric automatically.
Image Generation Agent
ADK - Agent Development Kit
4 tools for UI Design Agent
- Brand analysis - define a brand identity from a style image
- Design concept - create a UI concept from the analysis
- Generate idea image - use nanobanana
- Evaluate image - score against UI metrics
ADK - model, tools, instruction
Uses iterative loop
Loop is specified in system prompt.
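To make the control flow visible, here is the generate / evaluate / retry loop written out in plain Python. In the agent itself the loop lives in the system prompt, not in code like this. The function name, the target score, the round limit, and both stub tools are hypothetical.

```python
from typing import Callable


def run_design_loop(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], tuple[float, str]],
    brief: str,
    target: float = 0.8,
    max_rounds: int = 3,
):
    """Regenerate with the evaluator's feedback until the score is good enough."""
    image, feedback, history = None, "", []
    for round_no in range(1, max_rounds + 1):
        image = generate(brief, feedback)
        score, feedback = evaluate(image)
        history.append((round_no, score))
        if score >= target:
            break  # good enough, stop iterating
    return image, history


# Stub tools so the sketch runs without any model calls.
scores = iter([0.5, 0.85, 0.9])


def fake_generate(brief: str, feedback: str) -> str:
    return f"{brief} [{feedback}]"


def fake_evaluate(image: str) -> tuple[float, str]:
    score = next(scores)
    return score, f"score was {score}"


image, history = run_design_loop(fake_generate, fake_evaluate, "login screen")
print(history)
```

A hard `max_rounds` cap matters because the evaluator can misbehave, so the loop should not depend on the score ever reaching the target.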
Evaluation on second image of second concept had a zero overall score… hallucinated?
Video Generation Agent
Video Agent (ADK)
- Plan scenes
- Generate scene image
- Generate scene video
- Evaluate scene
Image to video gives you more control than text to video.
Voice profile and style prefix keep things consistent.
ffmpeg to concat videos
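A sketch of joining scene clips with ffmpeg's concat demuxer: write a list file with one `file '...'` line per clip, then point ffmpeg at it. The helper name and the file names are assumptions. It also assumes clip names contain no single quotes and that all clips share the same codec and parameters, since `-c copy` does not re-encode.

```python
import subprocess
from pathlib import Path


def build_concat_job(clips: list[str], list_path: str, output: str) -> list[str]:
    """Write the concat list file and return the ffmpeg command as an argv list."""
    lines = [f"file '{clip}'" for clip in clips]  # assumes no quotes in names
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return [
        "ffmpeg",
        "-f", "concat",   # use the concat demuxer
        "-safe", "0",     # allow paths outside the list file's directory
        "-i", str(list_path),
        "-c", "copy",     # stream copy, no re-encoding
        str(output),
    ]


def run_concat(clips: list[str], list_path: str, output: str) -> None:
    """Actually run ffmpeg; requires ffmpeg on PATH and the clips to exist."""
    subprocess.run(build_concat_job(clips, list_path, output), check=True)
```

Usage would be something like `run_concat(["scene1.mp4", "scene2.mp4"], "clips.txt", "final.mp4")`.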
Building Media Agent with AI
Use Gemini CLI.
Image & ADK skill -> Gemini CLI -> agent.py
- Add skills
- Configure the agent
- Run local and iterate
Agent hallucinated the cloud project id. Need to keep identifiers away from the agent.
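One way to keep identifiers away from the agent: resolve the project id in code and bind it into the tool, so the model can only supply creative arguments. The function names are hypothetical, the tool body is a placeholder for the real API call, and reading `GOOGLE_CLOUD_PROJECT` is an assumption about how the environment is set up.

```python
import os
from typing import Callable, Mapping, Optional


def load_project_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the project id from the environment and fail loudly if missing."""
    env = os.environ if env is None else env
    project = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project:
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set")
    return project


def make_generate_image_tool(project_id: str) -> Callable[[str], dict]:
    """Return a tool whose only model-facing argument is the prompt."""

    def generate_image(prompt: str) -> dict:
        # Placeholder for the real API call. The project id comes from the
        # closure, never from anything the model wrote.
        return {"project": project_id, "prompt": prompt}

    return generate_image


tool = make_generate_image_tool(
    load_project_id({"GOOGLE_CLOUD_PROJECT": "demo-project"})
)
print(tool("a minimalist login screen"))
```

Because the id is not a tool parameter and not in the instruction text, there is nothing for the model to fill in or invent.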
Building Media Agent with AI Lab
I can use it without having to spend api monies.
Conclusion
safety and responsibility