Beginner · 35 min · Module 2 of 3

AI Voiceovers & Music

Add AI-generated background music with BeatFusion and create lip-synced avatar videos with LipFusion to elevate your content.

Two audio tools for creators

Skytells offers two audio generation models for content creators:

Model            What it does                                  Best for
beatfusion-2.0   Generates original music from a text prompt   Background music, intros, b-roll
lipfusion        Animates a face to match an audio file        Talking-head videos, avatar content

Creating background music with BeatFusion

Match music mood to your content type:

curl -X POST https://api.skytells.ai/v1/predictions \
  -H "x-api-key: $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "beatfusion-2.0",
    "input": {
      "prompt": "Upbeat corporate background music, light piano and acoustic guitar, energetic but not distracting, 120 BPM, 30 seconds",
      "duration_seconds": 30
    }
  }'

Music prompt formulas by content type

Social media montage:

Upbeat [genre] background music, [tempo] BPM, [instruments], 
energetic, [mood], no vocals, [duration] seconds

Tutorial / educational:

Calm focus music, lo-fi [genre], subtle [instruments], 
minimal, non-distracting, [duration] seconds

Product launch / reveal:

Cinematic build-up, [genre], starts minimal then swells to a 
dramatic peak at [X] seconds, [instruments], epic feel

Lifestyle / vlog:

Warm acoustic [genre], [instruments], feel-good, positive energy, 
suitable for video background, [duration] seconds

Music generation reference

Content type          Prompt style                           Duration
TikTok / Reel (15s)   Punchy, energetic, recognizable hook   15–30s
Product showcase      Clean, modern, minimal                 30–60s
Tutorial / how-to     Calm focus, unobtrusive                60–120s
Vlog intro            Upbeat, branded, memorable             10–15s
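
The formulas above are just string templates you can fill in programmatically. Here is a minimal sketch of a helper for doing that (`TEMPLATES` and `music_prompt` are hypothetical names, not part of any Skytells SDK):

```python
# Prompt formulas from this module as reusable templates.
TEMPLATES = {
    "social": ("Upbeat {genre} background music, {bpm} BPM, {instruments}, "
               "energetic, {mood}, no vocals, {duration} seconds"),
    "tutorial": ("Calm focus music, lo-fi {genre}, subtle {instruments}, "
                 "minimal, non-distracting, {duration} seconds"),
}

def music_prompt(content_type, **fields):
    """Fill one of the prompt formulas with concrete values."""
    return TEMPLATES[content_type].format(**fields)

print(music_prompt("social", genre="pop", bpm=120,
                   instruments="synth and claps", mood="playful", duration=15))
# → Upbeat pop background music, 120 BPM, synth and claps, energetic, playful, no vocals, 15 seconds
```

Keeping the formulas in one place makes it easy to A/B-test prompt variations across a batch of videos.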

Adding music to video (ffmpeg)

Once you have both files, merge them:

# Combine video + AI music
ffmpeg -i social_video.mp4 -i background_music.mp3 \
  -c:v copy -c:a aac \
  -filter_complex "[1:a]volume=0.3[music];[0:a][music]amix=inputs=2:duration=first[out]" \
  -map 0:v -map "[out]" \
  output_with_music.mp4

The volume=0.3 keeps the music subtle behind any original audio. Adjust to taste.
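
Note that the amix filter above assumes the video actually has an audio stream; on a silent clip, ffmpeg will error out. A small sketch of a command builder that handles both cases (the `merge_cmd` helper is hypothetical, not an official tool):

```python
def merge_cmd(video, music, out, music_volume=0.3, video_has_audio=True):
    """Build an ffmpeg argv list that lays background music under a video.

    With original audio: duck the music and mix the two streams.
    Without: map the music track directly and trim it to the video length.
    """
    if video_has_audio:
        filt = (f"[1:a]volume={music_volume}[music];"
                "[0:a][music]amix=inputs=2:duration=first[out]")
        audio_args = ["-filter_complex", filt, "-map", "0:v", "-map", "[out]"]
    else:
        audio_args = ["-map", "0:v", "-map", "1:a", "-shortest"]
    return (["ffmpeg", "-i", video, "-i", music, "-c:v", "copy", "-c:a", "aac"]
            + audio_args + [out])

print(merge_cmd("social_video.mp4", "background_music.mp3", "output_with_music.mp4"))
```

Pass the resulting list to `subprocess.run(..., check=True)`, as the full workflow below does.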

LipFusion — talking head videos

LipFusion animates any portrait image to match an audio file — perfect for creating:

  • Spokesperson videos without a camera
  • Multilingual content from a single image
  • AI avatar explainer videos

curl -X POST https://api.skytells.ai/v1/predictions \
  -H "x-api-key: $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipfusion",
    "input": {
      "image_url": "https://yoursite.com/avatar.jpg",
      "audio_url": "https://yoursite.com/voiceover.mp3"
    }
  }'

Output: an .mp4 of the portrait speaking in sync with the audio.

Requirements for best LipFusion results

Factor             Recommendation
Image              Front-facing, neutral expression, good lighting
Image resolution   At least 512×512
Audio quality      Clear voice, minimal background noise
Audio format       MP3 or WAV, 44.1kHz
Video duration     Matches audio length (up to 60s)
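
For WAV files, you can sanity-check the audio against the table before submitting, using Python's standard-library wave module (a rough sketch; it does not cover MP3, and `check_wav` is a hypothetical helper):

```python
import wave

def check_wav(path, max_seconds=60):
    """Return (sample_rate, duration_seconds) and warn when a WAV file
    falls outside the LipFusion recommendations above."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / rate
    if rate != 44100:
        print(f"warning: sample rate is {rate} Hz, recommended 44100 Hz")
    if duration > max_seconds:
        print(f"warning: audio is {duration:.1f}s, LipFusion caps at {max_seconds}s")
    return rate, duration
```

Run it on your voiceover before uploading; catching a bad sample rate locally is faster than waiting on a failed prediction.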

Complete creator workflow

  1. Script your content
  2. Generate video with a prompt
  3. Generate background music with BeatFusion
  4. Need a talking head? If yes, generate a LipFusion video; if no, skip this step
  5. Merge video + music with ffmpeg
  6. Final .mp4 ready to post

Full Python workflow

import os
import time
import urllib.request
import json
import subprocess

API_KEY = os.environ["SKYTELLS_API_KEY"]
BASE = "https://api.skytells.ai/v1"

def create_and_wait(model, input_data):
    req = urllib.request.Request(
        f"{BASE}/predictions",
        data=json.dumps({"model": model, "input": input_data}).encode(),
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        prediction = json.loads(resp.read())

    while prediction["status"] not in ("succeeded", "failed"):
        time.sleep(5)
        req = urllib.request.Request(
            f"{BASE}/predictions/{prediction['id']}",
            headers={"x-api-key": API_KEY},
        )
        with urllib.request.urlopen(req) as resp:
            prediction = json.loads(resp.read())

    if prediction["status"] != "succeeded":
        raise RuntimeError(prediction.get("error"))
    return prediction["output"][0]

# 1. Generate video
video_url = create_and_wait("truefusion-video-pro", {
    "prompt": "A barista making pour-over coffee, morning light, cinematic",
    "duration_seconds": 10,
    "aspect_ratio": "9:16",
})

# 2. Generate matching music (a little longer than the video;
#    ffmpeg's -shortest flag in step 4 trims the excess)
music_url = create_and_wait("beatfusion-2.0", {
    "prompt": "Calm morning café ambience, acoustic guitar, warm, relaxing",
    "duration_seconds": 12,
})

# 3. Download both
urllib.request.urlretrieve(video_url, "video.mp4")
urllib.request.urlretrieve(music_url, "music.mp3")

# 4. Merge with ffmpeg
subprocess.run([
    "ffmpeg", "-i", "video.mp4", "-i", "music.mp3",
    "-c:v", "copy", "-c:a", "aac",
    "-filter_complex", "[1:a]volume=0.4[m];[m]apad[out]",
    "-map", "0:v", "-map", "[out]",
    "-shortest", "final.mp4",
], check=True)

print("Done! Saved to final.mp4")

Summary

  • BeatFusion generates original background music — match mood to content type
  • LipFusion creates talking-head videos from portrait + audio
  • Use ffmpeg to merge video + music tracks
  • Generate all assets in parallel, then merge — saves time
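
The parallel tip can be sketched with concurrent.futures: since `create_and_wait` spends most of its time sleeping between polls, threads overlap the waits nicely. The stand-in callables below are placeholders; in practice you would pass real `create_and_wait` calls.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_all(tasks):
    """Run several generation calls concurrently and collect their outputs.

    tasks maps a name to a zero-argument callable, e.g.
    {"video": lambda: create_and_wait("truefusion-video-pro", {...})}.
    """
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: f.result() for name, f in futures.items()}

# Stand-in callables for illustration; swap in real create_and_wait calls.
results = generate_all({
    "video": lambda: "https://example.com/video.mp4",
    "music": lambda: "https://example.com/music.mp3",
})
print(results)
```

With two ~1-minute generations, this roughly halves the wall-clock time before the merge step.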

In the next module, you'll automate this entire workflow with a scheduling pipeline.
