# Video Models Overview

Understand Skytells' 9 video models — native TrueFusion models, Google Veo, OpenAI Sora, and LipFusion for lip-sync.
## The video model lineup
Skytells gives you access to 9 video models through a single API:
| Model ID | Provider | Description |
|---|---|---|
| `truefusion-video-pro` | Skytells | Flagship-quality video generation |
| `truefusion-video` | Skytells | Standard quality, faster |
| `mera` | Skytells | Cinematic, film-quality output |
| `lumo` | Skytells | Stylized, creative video |
| `lipfusion` | Skytells | Lip-sync — animate a face from audio |
| `veo-3.1` | Google | State-of-the-art photorealistic video |
| `veo-3.1-fast` | Google | Faster Veo variant, slight quality trade-off |
| `sora-2` | OpenAI | High-quality creative video |
| `sora-2-pro` | OpenAI | Extended duration, higher resolution |
All video models use the same `POST /v1/predictions` endpoint — just change the `model` field.
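Because every model shares that one endpoint, a minimal client only needs to vary the `model` and `input` fields. Here is a sketch using only the Python standard library; the base URL and the `Bearer` authorization header are assumptions, as the page only specifies the `POST /v1/predictions` path:

```python
import json
import urllib.request

API_URL = "https://api.skytells.ai/v1/predictions"  # hypothetical base URL


def build_payload(model: str, input_params: dict) -> dict:
    # Every video model uses the same envelope; only these two fields vary.
    return {"model": model, "input": input_params}


def create_prediction(model: str, input_params: dict, api_key: str) -> dict:
    """Submit a generation request and return the created prediction as a dict."""
    body = json.dumps(build_payload(model, input_params)).encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Switching from Veo to Sora, for example, is just `create_prediction("sora-2", ...)` instead of `create_prediction("veo-3.1", ...)` — the envelope never changes.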
## Choosing a video model
### For photorealistic content
Google Veo 3.1 produces the most photorealistic video outputs, especially for nature, human subjects, and architectural scenes.
```json
{
  "model": "veo-3.1",
  "input": {
    "prompt": "A timelapse of storm clouds forming over the ocean, cinematic, 4K",
    "duration_seconds": 5,
    "aspect_ratio": "16:9"
  }
}
```

### For creative and stylized content
Sora 2 excels at imaginative, surreal, and stylized scenarios:
```json
{
  "model": "sora-2",
  "input": {
    "prompt": "A tiny wizard casting spells in a library made entirely of books, whimsical, soft light",
    "duration_seconds": 6
  }
}
```

### For Skytells-native production use
TrueFusion Video Pro is the recommended default for production applications — it balances quality, speed, and cost:
```json
{
  "model": "truefusion-video-pro",
  "input": {
    "prompt": "Corporate explainer animation, clean design, text overlays",
    "duration_seconds": 8,
    "aspect_ratio": "16:9",
    "fps": 24
  }
}
```

### For lip-sync
LipFusion takes a source image (a face) and an audio file, and produces a video in which the face's mouth movements match the audio:
```json
{
  "model": "lipfusion",
  "input": {
    "face_image_url": "https://example.com/headshot.jpg",
    "audio_url": "https://example.com/speech.mp3"
  }
}
```

LipFusion is ideal for:
- AI avatar presenters
- Dubbed video localization
- Synthetic spokesperson generation
## Common video parameters
| Parameter | Type | Description |
|---|---|---|
| `prompt` | string | What to generate |
| `negative_prompt` | string | What to avoid |
| `duration_seconds` | int | Video length in seconds (1–60, model-dependent) |
| `aspect_ratio` | string | `"16:9"`, `"9:16"`, or `"1:1"` |
| `fps` | int | Frames per second (24 or 30) |
| `seed` | int | Seed for reproducible outputs |
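Combining several of these, a request body might look like the following (the model choice and all values are purely illustrative):

```json
{
  "model": "truefusion-video-pro",
  "input": {
    "prompt": "A hummingbird hovering over a flower, macro shot",
    "negative_prompt": "blur, watermark, text",
    "duration_seconds": 6,
    "aspect_ratio": "9:16",
    "fps": 30,
    "seed": 42
  }
}
```

Fixing `seed` lets you re-run the same prompt and get a reproducible result, which is useful when iterating on the other parameters.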
## Video generation is asynchronous
Unlike image models (which often return in seconds), video generation typically takes 30 seconds to several minutes. Always use async polling or webhooks for video predictions.
Status flow: `queued` → `processing` → `succeeded`

Typical time: 30 s (fast models) to 5 min (high-quality long videos). You'll implement polling and webhook handling in the next module.
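The polling side of that flow can be sketched as a small loop. This is a minimal sketch, not the full handling covered in the next module; the retrieval call is abstracted as a `fetch` callable (e.g. a GET on the prediction's URL), and the set of terminal statuses beyond `succeeded` is an assumption:

```python
import time

# Statuses that end polling. `succeeded` comes from the documented flow;
# `failed`/`canceled` are assumed terminal states.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATUSES


def wait_for_prediction(fetch, prediction_id: str,
                        interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll `fetch(prediction_id)` until the prediction leaves queued/processing.

    `fetch` is any callable returning the prediction as a dict with a
    "status" key, e.g. a wrapper around the API's retrieval endpoint.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        prediction = fetch(prediction_id)
        if is_terminal(prediction["status"]):
            return prediction
        time.sleep(interval)  # video jobs run 30 s to minutes; don't poll hot
    raise TimeoutError(f"prediction {prediction_id} still running after {timeout}s")
```

Webhooks avoid this loop entirely: the API calls you back when the status changes, which is preferable for long-running video jobs.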
## Summary
- 9 video models available: 5 Skytells-native + 2 Google Veo + 2 OpenAI Sora
- Use `truefusion-video-pro` as your production default
- Use `veo-3.1` for photorealism, `sora-2` for creative work
- Use `lipfusion` for lip-sync and AI avatars
- Video generation is always async — plan for polling or webhooks