Intermediate · 20 min · Module 1 of 3

Video Models Overview

Understand Skytells' 9 video models — native TrueFusion models, Google Veo, OpenAI Sora, and LipFusion for lip-sync.

The video model lineup

Skytells gives you access to 9 video models through a single API:

Model ID              Provider   Description
truefusion-video-pro  Skytells   Flagship quality video generation
truefusion-video      Skytells   Standard quality, faster
mera                  Skytells   Cinematic, film-quality output
lumo                  Skytells   Stylized, creative video
lipfusion             Skytells   Lip-sync — animate a face from audio
veo-3.1               Google     State-of-the-art photorealistic video
veo-3.1-fast          Google     Faster Veo variant, slight quality trade-off
sora-2                OpenAI     High-quality creative video
sora-2-pro            OpenAI     Extended duration, higher resolution

All video models use the same POST /v1/predictions endpoint — just change the model field.
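Because every model shares the same endpoint and request shape, a single helper can cover the whole lineup. A minimal sketch in Python — the `POST /v1/predictions` endpoint comes from this page, while the base URL and Bearer-token header are illustrative assumptions:

```python
# Sketch: one request shape for every video model. The endpoint path is
# from the docs; API_BASE and the auth header are assumptions.
import json

API_BASE = "https://api.skytells.ai"  # assumed base URL


def build_prediction_request(model: str, **inputs) -> dict:
    """Build the JSON body for POST /v1/predictions.

    Switching models is just a matter of changing the `model` field;
    the `input` object carries the model-specific parameters.
    """
    return {"model": model, "input": inputs}


# The same helper covers photorealistic, stylized, and native models:
veo = build_prediction_request(
    "veo-3.1",
    prompt="A timelapse of storm clouds forming over the ocean",
    duration_seconds=5,
    aspect_ratio="16:9",
)
sora = build_prediction_request(
    "sora-2",
    prompt="A tiny wizard casting spells",
    duration_seconds=6,
)
print(json.dumps(veo, indent=2))
```

Sending the body is then a plain authenticated POST with your preferred HTTP client.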

Choosing a video model

For photorealistic content

Google Veo 3.1 produces the most photorealistic video outputs, especially for nature, human subjects, and architectural scenes.

{
  "model": "veo-3.1",
  "input": {
    "prompt": "A timelapse of storm clouds forming over the ocean, cinematic, 4K",
    "duration_seconds": 5,
    "aspect_ratio": "16:9"
  }
}

For creative and stylized content

Sora 2 excels at imaginative, surreal, and stylized scenarios:

{
  "model": "sora-2",
  "input": {
    "prompt": "A tiny wizard casting spells in a library made entirely of books, whimsical, soft light",
    "duration_seconds": 6
  }
}

For Skytells-native production use

TrueFusion Video Pro is the recommended default for production applications — it balances quality, speed, and cost:

{
  "model": "truefusion-video-pro",
  "input": {
    "prompt": "Corporate explainer animation, clean design, text overlays",
    "duration_seconds": 8,
    "aspect_ratio": "16:9",
    "fps": 24
  }
}

For lip-sync

LipFusion takes a source image (a face) and an audio file, and produces a video in which the face's mouth movements match the audio:

{
  "model": "lipfusion",
  "input": {
    "face_image_url": "https://example.com/headshot.jpg",
    "audio_url": "https://example.com/speech.mp3"
  }
}

LipFusion is ideal for:

  • AI avatar presenters
  • Dubbed video localization
  • Synthetic spokesperson generation
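Since LipFusion takes media URLs rather than a prompt, it helps to validate both inputs before submitting. A sketch, assuming the field names shown above; the http(s)-URL check is an illustrative client-side guard, not a documented API rule:

```python
# Sketch: guard the two required LipFusion inputs before submitting.
# Field names come from the request body above; the validation rule
# (must be an http(s) URL) is an illustrative assumption.
from urllib.parse import urlparse


def build_lipfusion_request(face_image_url: str, audio_url: str) -> dict:
    for name, url in (("face_image_url", face_image_url),
                      ("audio_url", audio_url)):
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"{name} must be an http(s) URL, got {url!r}")
    return {
        "model": "lipfusion",
        "input": {"face_image_url": face_image_url, "audio_url": audio_url},
    }
```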

Common video parameters

Parameter          Type     Description
prompt             string   What to generate
negative_prompt    string   What to avoid
duration_seconds   int      Video length in seconds (1–60, model-dependent)
aspect_ratio       string   "16:9", "9:16", or "1:1"
fps                int      Frames per second (24 or 30)
seed               int      Seed for reproducible outputs
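Catching out-of-range parameters client-side saves a failed round trip. A sketch that restates the table's constraints; exact limits vary by model, so treat these bounds as the table's documented envelope, not a server-side guarantee:

```python
# Sketch: client-side checks mirroring the common-parameter table.
# Bounds restate the table (duration 1-60 s, fps 24 or 30, three
# aspect ratios); individual models may be stricter.

ASPECT_RATIOS = {"16:9", "9:16", "1:1"}


def validate_video_input(inp: dict) -> None:
    if "prompt" not in inp:
        raise ValueError("prompt is required")
    duration = inp.get("duration_seconds")
    if duration is not None and not 1 <= duration <= 60:
        raise ValueError("duration_seconds must be 1-60 (model-dependent)")
    if inp.get("fps") not in (None, 24, 30):
        raise ValueError("fps must be 24 or 30")
    ratio = inp.get("aspect_ratio")
    if ratio is not None and ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
```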

Video generation is asynchronous

Unlike image models (which often return in seconds), video generation typically takes 30 seconds to several minutes. Always use async polling or webhooks for video predictions.

Status flow: queued → processing → succeeded
Typical time: 30 s (fast models) to 5 min (high-quality, longer videos)
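The status flow above maps directly onto a wait loop. A minimal sketch of the pattern — `fetch_status` stands in for fetching the prediction's current status, since the actual status endpoint and response shape are covered in the next module:

```python
# Sketch of the queued -> processing -> succeeded polling pattern.
# `fetch_status` is a placeholder for a real status lookup; the actual
# endpoint and webhook alternative are covered in the next module.
import time


def wait_for_prediction(fetch_status, interval_s=2.0, timeout_s=600.0):
    """Poll until the prediction leaves the queued/processing states."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status not in ("queued", "processing"):
            return status  # terminal state, e.g. "succeeded"
        time.sleep(interval_s)
    raise TimeoutError("prediction did not finish within timeout")
```

For production workloads, webhooks avoid the idle polling traffic entirely.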

You'll implement polling and webhook handling in the next module.

Summary

  • 9 video models available: 5 Skytells-native + 2 Google Veo + 2 OpenAI Sora
  • Use truefusion-video-pro as your production default
  • Use veo-3.1 for photorealism, sora-2 for creative work
  • Use lipfusion for lip-sync and AI avatars
  • Video generation is always async — plan for polling or webhooks
