Steamer-I2V: Enhanced Controllable Image-to-Video Foundation Model

Yi Yang, Xuewu Jiao, Wen Tao, Zhipeng Jin, Jie Liu, Yi Zheng, Chao Fang, Xinsheng Luo, Wei Hu, Yan Tian, Song Lin, Weizhu Xie, Cong Han, Shuanglong Li and Lin Liu
Steamer Team, Baidu Inc.

Abstract

Steamer-I2V is an image-to-video (I2V) generation model optimized for precise visual control, high-definition output, and Chinese semantic understanding. It converts static images into temporally coherent, visually compelling videos. The model ranks first in the comprehensive image-to-video evaluation on VBench, an authoritative international benchmark for video generation, which reflects its controllability and generation fidelity.

  1. Fine-grained Video Structured Description Language for Pixel-level Control and Cinematic Composition
    Steamer-I2V describes shooting perspectives and video content in a structured format, so that generated videos adhere closely to the specified visual details, object motion trajectories, style attributes, and cinematography. This enables precise control over complex content and improves instruction compliance. The model accepts multimodal conditional inputs (Chinese text prompts, reference images, and guidance signals) to keep the output aligned with a specific creative vision or functional requirement.
  2. Model Ensemble Optimization Strategy for HD Quality and Cinematic Dynamics
    Built on a Transformer-based diffusion architecture, the model produces high-definition videos at up to 1080p resolution with smooth transitions and physically realistic motion. Steamer-I2V applies a model ensemble optimization strategy that targets temporal consistency, cinematic framing, and motion regularity, improving logical coherence and visual continuity across video sequences. The strategy has four components:
    • Multi-stage Supervised Training: Progressive supervised fine-tuning (SFT) from low to high resolutions and frame rates lets the model learn coarse, global control first and fine detail later.
    • Aesthetic Conditional Fine-tuning: Conditional fine-tuning (CFT) teaches the model an internal notion of video aesthetics rather than surface-level imitation of training examples.
    • Multi-objective Reinforcement Learning: Global human feedback is combined with multi-dimensional quality metrics for preference alignment, refining output quality progressively from the macro level to the micro level.
    • Prompt Enhancement Technology: A multimodal large model analyzes the input image and augments the original prompt by predicting how scenes and objects will evolve over the video frames, including actions, motion trajectories, and state transitions.
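The progressive low-to-high schedule in the multi-stage supervised training item can be pictured with a minimal sketch. The stage names, resolutions, frame rates, and the helper functions `is_progressive` and `stage_for_step` are hypothetical and chosen only for illustration; the text states only that training moves from low to high resolution and frame rate, with output up to 1080p.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    height: int  # vertical resolution in pixels
    fps: int     # training frame rate


# Hypothetical values for illustration only; not the authors' actual schedule.
STAGES = [
    Stage("low", 480, 8),
    Stage("mid", 720, 16),
    Stage("high", 1080, 24),
]


def is_progressive(stages):
    """Return True if resolution and frame rate never decrease across stages."""
    return all(
        a.height <= b.height and a.fps <= b.fps
        for a, b in zip(stages, stages[1:])
    )


def stage_for_step(step, steps_per_stage, stages=STAGES):
    """Map a global training step to its stage, clamping to the final stage."""
    if step < 0 or steps_per_stage <= 0:
        raise ValueError("step must be >= 0 and steps_per_stage must be > 0")
    index = min(step // steps_per_stage, len(stages) - 1)
    return stages[index]
```

With equal-length stages (an assumption), `stage_for_step(1500, 1000)` selects the middle stage, and any step past the last boundary stays at the highest-resolution stage.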
  3. Accurate Chinese Semantic Comprehension
    Steamer-I2V is trained on a hundred-million-scale Chinese multimodal dataset processed through a three-tier "filter-purify-proportion" data optimization system. This data cleansing process aligns textual instructions with visual elements at the semantic level and gives the model strong Chinese concept parsing capabilities. The model captures culture-specific elements and complex semantic relationships, which improves the accuracy with which Chinese creative instructions are rendered visually.
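One way to read the three-tier "filter-purify-proportion" system is as three sequential passes over the data. The sketch below is an interpretation under stated assumptions, not the authors' pipeline: the sample fields (`caption`, `quality`, `category`), the quality threshold, and every function name are hypothetical.

```python
from collections import defaultdict


def filter_stage(samples, min_quality=0.5):
    """Tier 1 (filter): drop samples whose quality score is below a threshold."""
    return [s for s in samples if s["quality"] >= min_quality]


def purify_stage(samples):
    """Tier 2 (purify): normalize caption whitespace; drop empty or duplicate captions."""
    seen, cleaned = set(), []
    for s in samples:
        caption = " ".join(s["caption"].split())
        if caption and caption not in seen:
            seen.add(caption)
            cleaned.append({**s, "caption": caption})
    return cleaned


def proportion_stage(samples, target_mix, total):
    """Tier 3 (proportion): keep at most round(total * share) samples per category."""
    by_category = defaultdict(list)
    for s in samples:
        by_category[s["category"]].append(s)
    selected = []
    for category, share in target_mix.items():
        quota = round(total * share)
        selected.extend(by_category[category][:quota])
    return selected


def build_dataset(samples, target_mix, total, min_quality=0.5):
    """Run the three tiers in order: filter, then purify, then proportion."""
    kept = filter_stage(samples, min_quality)
    return proportion_stage(purify_stage(kept), target_mix, total)
```

Running the quality filter first keeps the later passes cheap, and applying the category mix last means the proportions are computed over data that has already been cleaned.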