Beginner · 35 min · Module 2 of 3

AI Voiceovers & Music

Add AI-generated background music with BeatFusion and create lip-synced avatar videos with LipFusion to elevate your content.

Two audio tools for creators

Skytells offers two audio generation models for content creators:

Model            What it does                                  Best for
beatfusion-2.0   Generates original music from a text prompt   Background music, intros, b-roll
lipfusion        Animates a face to match an audio file        Talking-head videos, avatar content

Creating background music with BeatFusion

Match music mood to your content type:

curl -X POST https://api.skytells.ai/v1/predictions \
  -H "x-api-key: $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "beatfusion-2.0",
    "input": {
      "prompt": "Upbeat corporate background music, light piano and acoustic guitar, energetic but not distracting, 120 BPM, 30 seconds",
      "duration_seconds": 30
    }
  }'

Music prompt formulas by content type

Social media montage:

Upbeat [genre] background music, [tempo] BPM, [instruments], 
energetic, [mood], no vocals, [duration] seconds

Tutorial / educational:

Calm focus music, lo-fi [genre], subtle [instruments], 
minimal, non-distracting, [duration] seconds

Product launch / reveal:

Cinematic build-up, [genre], starts minimal then swells to a 
dramatic peak at [X] seconds, [instruments], epic feel

Lifestyle / vlog:

Warm acoustic [genre], [instruments], feel-good, positive energy, 
suitable for video background, [duration] seconds

Music generation reference

Content type          Prompt style                           Duration
TikTok / Reel (15s)   Punchy, energetic, recognizable hook   15–30s
Product showcase      Clean, modern, minimal                 30–60s
Tutorial / how-to     Calm focus, unobtrusive                60–120s
Vlog intro            Upbeat, branded, memorable             10–15s
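
The formulas above are just string templates you can fill in programmatically. Here is a minimal sketch of a helper for doing that (`TEMPLATES` and `music_prompt` are hypothetical names, not part of any Skytells SDK):

```python
# Prompt formulas from this module as reusable templates.
TEMPLATES = {
    "social": ("Upbeat {genre} background music, {bpm} BPM, {instruments}, "
               "energetic, {mood}, no vocals, {duration} seconds"),
    "tutorial": ("Calm focus music, lo-fi {genre}, subtle {instruments}, "
                 "minimal, non-distracting, {duration} seconds"),
}

def music_prompt(content_type, **fields):
    """Fill one of the prompt formulas with concrete values."""
    return TEMPLATES[content_type].format(**fields)

print(music_prompt("social", genre="pop", bpm=120,
                   instruments="synth and claps", mood="playful", duration=15))
# → Upbeat pop background music, 120 BPM, synth and claps, energetic, playful, no vocals, 15 seconds
```

Keeping the formulas in one place makes it easy to A/B-test prompt variations across a batch of videos.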

Adding music to video (ffmpeg)

Once you have both files, merge them:

# Combine video + AI music
ffmpeg -i social_video.mp4 -i background_music.mp3 \
  -c:v copy -c:a aac \
  -filter_complex "[1:a]volume=0.3[music];[0:a][music]amix=inputs=2:duration=first[out]" \
  -map 0:v -map "[out]" \
  output_with_music.mp4

The volume=0.3 keeps the music subtle behind any original audio. Adjust to taste.
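
Note that the amix filter above assumes the video actually has an audio stream; on a silent clip, ffmpeg will error out. A small sketch of a command builder that handles both cases (the `merge_cmd` helper is hypothetical, not an official tool):

```python
def merge_cmd(video, music, out, music_volume=0.3, video_has_audio=True):
    """Build an ffmpeg argv list that lays background music under a video.

    With original audio: duck the music and mix the two streams.
    Without: map the music track directly and trim it to the video length.
    """
    if video_has_audio:
        filt = (f"[1:a]volume={music_volume}[music];"
                "[0:a][music]amix=inputs=2:duration=first[out]")
        audio_args = ["-filter_complex", filt, "-map", "0:v", "-map", "[out]"]
    else:
        audio_args = ["-map", "0:v", "-map", "1:a", "-shortest"]
    return (["ffmpeg", "-i", video, "-i", music, "-c:v", "copy", "-c:a", "aac"]
            + audio_args + [out])

print(merge_cmd("social_video.mp4", "background_music.mp3", "output_with_music.mp4"))
```

Pass the resulting list to `subprocess.run(..., check=True)`, as the full workflow below does.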

LipFusion — talking head videos

LipFusion animates any portrait image to match an audio file — perfect for creating:

  • Spokesperson videos without a camera
  • Multilingual content from a single image
  • AI avatar explainer videos

curl -X POST https://api.skytells.ai/v1/predictions \
  -H "x-api-key: $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipfusion",
    "input": {
      "image_url": "https://yoursite.com/avatar.jpg",
      "audio_url": "https://yoursite.com/voiceover.mp3"
    }
  }'

Output: an .mp4 of the portrait speaking in sync with the audio.

Requirements for best LipFusion results

Factor             Recommendation
Image              Front-facing, neutral expression, good lighting
Image resolution   At least 512×512
Audio quality      Clear voice, minimal background noise
Audio format       MP3 or WAV, 44.1kHz
Video duration     Matches audio length (up to 60s)
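
For WAV files, you can sanity-check the audio against the table before submitting, using Python's standard-library wave module (a rough sketch; it does not cover MP3, and `check_wav` is a hypothetical helper):

```python
import wave

def check_wav(path, max_seconds=60):
    """Return (sample_rate, duration_seconds) and warn when a WAV file
    falls outside the LipFusion recommendations above."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / rate
    if rate != 44100:
        print(f"warning: sample rate is {rate} Hz, recommended 44100 Hz")
    if duration > max_seconds:
        print(f"warning: audio is {duration:.1f}s, LipFusion caps at {max_seconds}s")
    return rate, duration
```

Run it on your voiceover before uploading; catching a bad sample rate locally is faster than waiting on a failed prediction.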

Complete creator workflow

  1. Script your content
  2. Generate video with a prompt
  3. Generate background music with BeatFusion
  4. Need a talking head? If yes, generate a LipFusion video; if no, skip this step
  5. Merge video + music with ffmpeg
  6. Final .mp4 ready to post

Full Python workflow

import os
import time
import urllib.request
import json
import subprocess

API_KEY = os.environ["SKYTELLS_API_KEY"]
BASE = "https://api.skytells.ai/v1"

def create_and_wait(model, input_data):
    req = urllib.request.Request(
        f"{BASE}/predictions",
        data=json.dumps({"model": model, "input": input_data}).encode(),
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        prediction = json.loads(resp.read())

    while prediction["status"] not in ("succeeded", "failed"):
        time.sleep(5)
        req = urllib.request.Request(
            f"{BASE}/predictions/{prediction['id']}",
            headers={"x-api-key": API_KEY},
        )
        with urllib.request.urlopen(req) as resp:
            prediction = json.loads(resp.read())

    if prediction["status"] != "succeeded":
        raise RuntimeError(prediction.get("error"))
    return prediction["output"][0]

# 1. Generate video
video_url = create_and_wait("truefusion-video-pro", {
    "prompt": "A barista making pour-over coffee, morning light, cinematic",
    "duration_seconds": 10,
    "aspect_ratio": "9:16",
})

# 2. Generate matching music (a little longer than the video;
#    ffmpeg's -shortest flag in step 4 trims the excess)
music_url = create_and_wait("beatfusion-2.0", {
    "prompt": "Calm morning café ambience, acoustic guitar, warm, relaxing",
    "duration_seconds": 12,
})

# 3. Download both
urllib.request.urlretrieve(video_url, "video.mp4")
urllib.request.urlretrieve(music_url, "music.mp3")

# 4. Merge with ffmpeg
subprocess.run([
    "ffmpeg", "-i", "video.mp4", "-i", "music.mp3",
    "-c:v", "copy", "-c:a", "aac",
    "-filter_complex", "[1:a]volume=0.4[m];[m]apad[out]",
    "-map", "0:v", "-map", "[out]",
    "-shortest", "final.mp4",
], check=True)

print("Done! Saved to final.mp4")

Summary

  • BeatFusion generates original background music — match mood to content type
  • LipFusion creates talking-head videos from portrait + audio
  • Use ffmpeg to merge video + music tracks
  • Generate all assets in parallel, then merge — saves time
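
The parallel tip can be sketched with concurrent.futures: since `create_and_wait` spends most of its time sleeping between polls, threads overlap the waits nicely. The stand-in callables below are placeholders; in practice you would pass real `create_and_wait` calls.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_all(tasks):
    """Run several generation calls concurrently and collect their outputs.

    tasks maps a name to a zero-argument callable, e.g.
    {"video": lambda: create_and_wait("truefusion-video-pro", {...})}.
    """
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: f.result() for name, f in futures.items()}

# Stand-in callables for illustration; swap in real create_and_wait calls.
results = generate_all({
    "video": lambda: "https://example.com/video.mp4",
    "music": lambda: "https://example.com/music.mp3",
})
print(results)
```

With two ~1-minute generations, this roughly halves the wall-clock time before the merge step.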

In the next module, you'll automate this entire workflow with a scheduling pipeline.
