Beginner · 20 min · Module 5 of 6

Your First Inference Call

Use the Inference API hands-on — make LLM chat completions, stream tokens to a UI, hold stateful conversations, and generate embeddings with the OpenAI-compatible interface.

What you'll be able to do after this module

Call the Inference API from cURL, Python, and TypeScript. Stream tokens to a chat UI. Use previous_response_id for stateful conversations without replaying history. Generate embeddings for a RAG pipeline.


Quick recap: Inference API vs Prediction API

You already learned the conceptual difference in Module 3. As a reminder:

  • Inference API: POST /v1/chat/completions → synchronous LLM text generation
  • Prediction API: POST /v1/predictions → async media generation

This module is hands-on Inference API only.


Setup

Make sure your API key is available:

export SKYTELLS_API_KEY="sk-your-key-here"

The Inference API accepts both Authorization: Bearer (OpenAI-style) and x-api-key headers. Either works.
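In Python, either header style can be built like this (a minimal sketch; the helper names are illustrative, not part of any SDK):

```python
import os

def bearer_headers(api_key: str) -> dict:
    """OpenAI-style Authorization: Bearer header."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

def x_api_key_headers(api_key: str) -> dict:
    """Alternative x-api-key header."""
    return {
        "x-api-key": api_key,
        "Content-Type": "application/json",
    }

# Read the key from the environment, as exported above.
headers = bearer_headers(os.environ.get("SKYTELLS_API_KEY", ""))
```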


Your first chat completion

curl -X POST https://api.skytells.ai/v1/chat/completions \
  -H "Authorization: Bearer $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepbrain-router",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain what a prediction is in the Skytells API in two sentences." }
    ]
  }'
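The same request in Python, using only the standard library (a sketch; because the interface is OpenAI-compatible, the official OpenAI SDKs pointed at the Skytells base URL work as well — the function names below are illustrative):

```python
import json
import os
import urllib.request

API_URL = "https://api.skytells.ai/v1/chat/completions"

def build_chat_payload(user_content: str,
                       system: str = "You are a helpful assistant.") -> dict:
    """Builds the same JSON body as the curl example above."""
    return {
        "model": "deepbrain-router",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_content},
        ],
    }

def chat_complete(prompt: str) -> str:
    """POSTs a chat completion and returns the generated text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['SKYTELLS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```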

Reading the response:

{
  "id": "chatcmpl-DKQ7HZtNYLc7uK0Dpn0JggRUUhuBE",
  "object": "chat.completion",
  "created": 1773759323,
  "model": "deepbrain-router",
  "system_fingerprint": "fp_490a4ad033",
  "choices": [{
    "index": 0,
    "finish_reason": "stop",
    "message": {
      "role": "assistant",
      "content": "Machine learning is...",
      "annotations": [],
      "refusal": null
    },
    "content_filter_results": {
      "hate":                    { "filtered": false, "severity": "safe" },
      "self_harm":               { "filtered": false, "severity": "safe" },
      "sexual":                  { "filtered": false, "severity": "safe" },
      "violence":                { "filtered": false, "severity": "safe" },
      "protected_material_code": { "filtered": false, "detected": false },
      "protected_material_text": { "filtered": false, "detected": false }
    }
  }],
  "prompt_filter_results": [{
    "prompt_index": 0,
    "content_filter_results": {
      "hate":      { "filtered": false, "severity": "safe" },
      "self_harm": { "filtered": false, "severity": "safe" },
      "sexual":    { "filtered": false, "severity": "safe" },
      "violence":  { "filtered": false, "severity": "safe" },
      "jailbreak": { "filtered": false, "detected": false }
    }
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 10,
    "total_tokens": 19,
    "completion_tokens_details": {
      "reasoning_tokens": 0, "audio_tokens": 0,
      "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": { "cached_tokens": 0, "audio_tokens": 0 }
  }
}

The generated text is at choices[0].message.content. The usage.total_tokens is what you're billed for. Skytells enforces content safety on every response — content_filter_results shows the per-category analysis; a filtered: true value means that content was blocked. Check finish_reason: "content_filter" in your error handling.
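That error handling can be sketched as a small helper that inspects a choice object before trusting its text (the function name is illustrative):

```python
def check_completion(choice: dict) -> str:
    """Returns the generated text, or raises if the response was
    blocked by the content safety filter."""
    if choice.get("finish_reason") == "content_filter":
        # Collect the categories that triggered the block.
        blocked = [
            category
            for category, result in choice.get("content_filter_results", {}).items()
            if result.get("filtered")
        ]
        raise ValueError(f"Response blocked by content filter: {blocked}")
    return choice["message"]["content"]
```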


Streaming

Add stream: true to receive tokens as they arrive — essential for chat UIs so users see text appearing instead of waiting for a full response.

curl -X POST https://api.skytells.ai/v1/chat/completions \
  -H "Authorization: Bearer $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepbrain-router",
    "messages": [{ "role": "user", "content": "Write a 3-sentence story about a robot." }],
    "stream": true
  }'

Streaming in a Next.js API route

// app/api/chat/route.ts
import OpenAI from 'openai';
import { OpenAIStream, StreamingTextResponse } from 'ai';

const client = new OpenAI({
  apiKey: process.env.SKYTELLS_API_KEY,
  baseURL: 'https://api.skytells.ai/v1',
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const response = await client.chat.completions.create({
    model: 'deepbrain-router',
    messages,
    stream: true,
  });

  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}

Each SSE chunk is a JSON object. The stream ends with data: [DONE]:

data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}
data: {"choices":[{"delta":{"content":"Once"},"index":0}]}
data: {"choices":[{"delta":{"content":" upon"},"index":0}]}
data: [DONE]
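A minimal Python parser for these chunks (a sketch — a production SSE reader also has to handle comment lines, multi-line data fields, and reconnects, which this skips):

```python
import json

def parse_sse_line(line: str):
    """Parses one SSE line from the stream.

    Returns the delta text for a content chunk, '' for chunks with
    no content (e.g. the initial role delta), or None for the
    data: [DONE] sentinel that ends the stream."""
    if not line.startswith("data: "):
        return ""
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content", "")
```

Feeding it the example chunks above and concatenating the pieces reassembles the text "Once upon".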

Stateful conversations with /v1/responses

POST /v1/responses avoids resending the entire message history on every turn. Pass previous_response_id and the server reconstructs context automatically.

import httpx, os

headers = {
    "Authorization": f"Bearer {os.environ['SKYTELLS_API_KEY']}",
    "Content-Type": "application/json",
}

# Turn 1
resp1 = httpx.post(
    "https://api.skytells.ai/v1/responses",
    headers=headers,
    json={
        "model": "deepbrain-router",
        "input": "What is the Skytells Prediction API?",
        "instructions": "You are a developer assistant. Be concise.",
    },
).json()

print(f"Turn 1: {resp1['output_text']}")
print(f"ID saved: {resp1['id']}")

# Turn 2 — continue without resending Turn 1
resp2 = httpx.post(
    "https://api.skytells.ai/v1/responses",
    headers=headers,
    json={
        "model": "deepbrain-router",
        "input": "How is it different from the Inference API?",
        "previous_response_id": resp1["id"],
    },
).json()

print(f"Turn 2: {resp2['output_text']}")

The ResponseObject has a convenience output_text field containing the full text response — no need to dig into output[0].content[0].text.
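The ID-threading pattern above can be factored into a small helper so each turn automatically carries the previous response ID (a sketch; `post` stands in for any callable that sends the JSON body to /v1/responses and returns the parsed ResponseObject, such as a wrapper around the httpx call above):

```python
def converse(post, model: str = "deepbrain-router"):
    """Returns a turn() function that threads previous_response_id
    across calls, so callers never resend history themselves."""
    last_id = None

    def turn(user_input: str) -> str:
        nonlocal last_id
        body = {"model": model, "input": user_input}
        if last_id is not None:
            body["previous_response_id"] = last_id
        resp = post(body)
        last_id = resp["id"]          # remember for the next turn
        return resp["output_text"]

    return turn
```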


Embeddings

Generate dense vector representations for semantic search, clustering, or RAG pipelines.

# Single text
result = client.embeddings.create(
    model="deepbrain-router",
    input="A photorealistic mountain lake at sunrise",
)
vector = result.data[0].embedding
print(f"Vector dimensions: {len(vector)}")   # e.g. 1536

# Batch
result = client.embeddings.create(
    model="deepbrain-router",
    input=[
        "Generate an image of a cat",
        "Generate an image of a dog",
        "Create background music for a podcast",
    ],
)
for item in result.data:
    print(f"Index {item.index}: {len(item.embedding)} dimensions")

The embedding response:

{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.002, -0.009, 0.015, "..."] }
  ],
  "model": "deepbrain-router",
  "usage": { "prompt_tokens": 9, "total_tokens": 9 }
}
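In a RAG pipeline these vectors are typically compared with cosine similarity. A dependency-free sketch (the function names are illustrative — in practice you would usually use NumPy or a vector database for this):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec: list, doc_vecs: list) -> list:
    """Returns document indices sorted by similarity to the query,
    best match first."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

With the batch example above, you would embed the user's query once, then rank the stored document vectors against it and retrieve the top matches.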

Generation parameters — quick reference

| Parameter | Value | Effect |
| --- | --- | --- |
| temperature | 0 | Fully deterministic — same input = same output |
| temperature | 0.7 | Balanced — creative but coherent |
| temperature | 1.5 | Very creative, may hallucinate |
| max_tokens | 256 | Short answers |
| max_tokens | 8192 (default) | Allows long responses |
| top_p | 0.1 | Very focused — samples from only the top 10% of probability mass |
| top_p | 0.95 (default) | Balanced nucleus sampling |
| stop | "\n" | Stops at the first newline — useful for single-line outputs |
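All of these parameters go directly into the request body alongside model and messages. A sketch of a fully parameterized payload (the values are the examples from the table, not recommendations):

```python
# Request body for POST /v1/chat/completions with explicit
# generation parameters.
payload = {
    "model": "deepbrain-router",
    "messages": [{"role": "user", "content": "List one color."}],
    "temperature": 0.7,   # balanced creativity
    "max_tokens": 256,    # cap the response length
    "top_p": 0.95,        # nucleus sampling threshold
    "stop": "\n",         # stop at the first newline
}
```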

Summary

Up next: Module 6 — error handling and best practices for both APIs. Learn how to implement retry logic, set up webhooks for long-running predictions, secure your API key, and build production-ready AI integrations.
