Intermediate · 20 min · Module 1 of 3

Video Models Overview

Understand Skytells' 9 video models — native TrueFusion models, Google Veo, OpenAI Sora, and LipFusion for lip-sync.

The video model lineup

Skytells gives you access to 9 video models through a single API:

Model ID              Provider   Description
truefusion-video-pro  Skytells   Flagship quality video generation
truefusion-video      Skytells   Standard quality, faster
mera                  Skytells   Cinematic, film-quality output
lumo                  Skytells   Stylized, creative video
lipfusion             Skytells   Lip-sync — animate a face from audio
veo-3.1               Google     State-of-the-art photorealistic video
veo-3.1-fast          Google     Faster Veo variant, slight quality trade-off
sora-2                OpenAI     High-quality creative video
sora-2-pro            OpenAI     Extended duration, higher resolution

All video models use the same POST /v1/predictions endpoint — just change the model field.
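Because every model shares the same endpoint and request shape, a single helper can cover the whole lineup. A minimal sketch in Python — the `POST /v1/predictions` endpoint comes from this page, while the base URL and Bearer-token header are illustrative assumptions:

```python
# Sketch: one request shape for every video model. The endpoint path is
# from the docs; API_BASE and the auth header are assumptions.
import json

API_BASE = "https://api.skytells.ai"  # assumed base URL


def build_prediction_request(model: str, **inputs) -> dict:
    """Build the JSON body for POST /v1/predictions.

    Switching models is just a matter of changing the `model` field;
    the `input` object carries the model-specific parameters.
    """
    return {"model": model, "input": inputs}


# The same helper covers photorealistic, stylized, and native models:
veo = build_prediction_request(
    "veo-3.1",
    prompt="A timelapse of storm clouds forming over the ocean",
    duration_seconds=5,
    aspect_ratio="16:9",
)
sora = build_prediction_request(
    "sora-2",
    prompt="A tiny wizard casting spells",
    duration_seconds=6,
)
print(json.dumps(veo, indent=2))
```

Sending the body is then a plain authenticated POST with your preferred HTTP client.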

Choosing a video model

For photorealistic content

Google Veo 3.1 produces the most photorealistic video outputs, especially for nature, human subjects, and architectural scenes.

{
  "model": "veo-3.1",
  "input": {
    "prompt": "A timelapse of storm clouds forming over the ocean, cinematic, 4K",
    "duration_seconds": 5,
    "aspect_ratio": "16:9"
  }
}

For creative and stylized content

Sora 2 excels at imaginative, surreal, and stylized scenarios:

{
  "model": "sora-2",
  "input": {
    "prompt": "A tiny wizard casting spells in a library made entirely of books, whimsical, soft light",
    "duration_seconds": 6
  }
}

For Skytells-native production use

TrueFusion Video Pro is the recommended default for production applications — it balances quality, speed, and cost:

{
  "model": "truefusion-video-pro",
  "input": {
    "prompt": "Corporate explainer animation, clean design, text overlays",
    "duration_seconds": 8,
    "aspect_ratio": "16:9",
    "fps": 24
  }
}

For lip-sync

LipFusion takes a source image (a face) and an audio file, and produces a video in which the face's mouth movements match the audio:

{
  "model": "lipfusion",
  "input": {
    "face_image_url": "https://example.com/headshot.jpg",
    "audio_url": "https://example.com/speech.mp3"
  }
}

LipFusion is ideal for:

  • AI avatar presenters
  • Dubbed video localization
  • Synthetic spokesperson generation
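Since LipFusion takes media URLs rather than a prompt, it helps to validate both inputs before submitting. A sketch, assuming the field names shown above; the http(s)-URL check is an illustrative client-side guard, not a documented API rule:

```python
# Sketch: guard the two required LipFusion inputs before submitting.
# Field names come from the request body above; the validation rule
# (must be an http(s) URL) is an illustrative assumption.
from urllib.parse import urlparse


def build_lipfusion_request(face_image_url: str, audio_url: str) -> dict:
    for name, url in (("face_image_url", face_image_url),
                      ("audio_url", audio_url)):
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"{name} must be an http(s) URL, got {url!r}")
    return {
        "model": "lipfusion",
        "input": {"face_image_url": face_image_url, "audio_url": audio_url},
    }
```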

Common video parameters

Parameter          Type     Description
prompt             string   What to generate
negative_prompt    string   What to avoid
duration_seconds   int      Video length in seconds (1–60, model-dependent)
aspect_ratio       string   "16:9", "9:16", or "1:1"
fps                int      Frames per second (24 or 30)
seed               int      Seed for reproducible outputs
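Catching out-of-range parameters client-side saves a failed round trip. A sketch that restates the table's constraints; exact limits vary by model, so treat these bounds as the table's documented envelope, not a server-side guarantee:

```python
# Sketch: client-side checks mirroring the common-parameter table.
# Bounds restate the table (duration 1-60 s, fps 24 or 30, three
# aspect ratios); individual models may be stricter.

ASPECT_RATIOS = {"16:9", "9:16", "1:1"}


def validate_video_input(inp: dict) -> None:
    if "prompt" not in inp:
        raise ValueError("prompt is required")
    duration = inp.get("duration_seconds")
    if duration is not None and not 1 <= duration <= 60:
        raise ValueError("duration_seconds must be 1-60 (model-dependent)")
    if inp.get("fps") not in (None, 24, 30):
        raise ValueError("fps must be 24 or 30")
    ratio = inp.get("aspect_ratio")
    if ratio is not None and ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
```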

Video generation is asynchronous

Unlike image models (which often return in seconds), video generation typically takes 30 seconds to several minutes. Always use async polling or webhooks for video predictions.

Status flow: queued → processing → succeeded
Typical time: 30 s (fast models) to 5 min (high-quality, longer videos)
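The status flow above maps directly onto a wait loop. A minimal sketch of the pattern — `fetch_status` stands in for fetching the prediction's current status, since the actual status endpoint and response shape are covered in the next module:

```python
# Sketch of the queued -> processing -> succeeded polling pattern.
# `fetch_status` is a placeholder for a real status lookup; the actual
# endpoint and webhook alternative are covered in the next module.
import time


def wait_for_prediction(fetch_status, interval_s=2.0, timeout_s=600.0):
    """Poll until the prediction leaves the queued/processing states."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status not in ("queued", "processing"):
            return status  # terminal state, e.g. "succeeded"
        time.sleep(interval_s)
    raise TimeoutError("prediction did not finish within timeout")
```

For production workloads, webhooks avoid the idle polling traffic entirely.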

You'll implement polling and webhook handling in the next module.

Summary

  • 9 video models available: 5 Skytells-native + 2 Google Veo + 2 OpenAI Sora
  • Use truefusion-video-pro as your production default
  • Use veo-3.1 for photorealism, sora-2 for creative work
  • Use lipfusion for lip-sync and AI avatars
  • Video generation is always async — plan for polling or webhooks
