# Your First Inference Call

Use the Inference API hands-on: make LLM chat completions, stream tokens to a UI, hold stateful conversations, and generate embeddings with the OpenAI-compatible interface.

## What you'll be able to do after this module

- Call the Inference API from cURL, Python, and TypeScript
- Stream tokens to a chat UI
- Use `previous_response_id` for stateful conversations without replaying history
- Generate embeddings for a RAG pipeline
## Quick recap: Inference API vs Prediction API

You already learned the conceptual difference in Module 3. As a reminder:

- Inference API → `POST /v1/chat/completions` → synchronous LLM text generation
- Prediction API → `POST /v1/predictions` → async media generation

This module is hands-on with the Inference API only.
## Setup

Make sure your API key is available:

```shell
export SKYTELLS_API_KEY="sk-your-key-here"
```

The Inference API accepts both `Authorization: Bearer` (OpenAI-style) and `x-api-key` headers. Either works.
## Your first chat completion

```shell
curl -X POST https://api.skytells.ai/v1/chat/completions \
  -H "Authorization: Bearer $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepbrain-router",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain what a prediction is in the Skytells API in two sentences." }
    ]
  }'
```

Reading the response:
```json
{
  "id": "chatcmpl-DKQ7HZtNYLc7uK0Dpn0JggRUUhuBE",
  "object": "chat.completion",
  "created": 1773759323,
  "model": "deepbrain-router",
  "system_fingerprint": "fp_490a4ad033",
  "choices": [{
    "index": 0,
    "finish_reason": "stop",
    "message": {
      "role": "assistant",
      "content": "Machine learning is...",
      "annotations": [],
      "refusal": null
    },
    "content_filter_results": {
      "hate": { "filtered": false, "severity": "safe" },
      "self_harm": { "filtered": false, "severity": "safe" },
      "sexual": { "filtered": false, "severity": "safe" },
      "violence": { "filtered": false, "severity": "safe" },
      "protected_material_code": { "filtered": false, "detected": false },
      "protected_material_text": { "filtered": false, "detected": false }
    }
  }],
  "prompt_filter_results": [{
    "prompt_index": 0,
    "content_filter_results": {
      "hate": { "filtered": false, "severity": "safe" },
      "self_harm": { "filtered": false, "severity": "safe" },
      "sexual": { "filtered": false, "severity": "safe" },
      "violence": { "filtered": false, "severity": "safe" },
      "jailbreak": { "filtered": false, "detected": false }
    }
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 10,
    "total_tokens": 19,
    "completion_tokens_details": {
      "reasoning_tokens": 0, "audio_tokens": 0,
      "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": { "cached_tokens": 0, "audio_tokens": 0 }
  }
}
```

The generated text is at `choices[0].message.content`. The `usage.total_tokens` value is what you're billed for. Skytells enforces content safety on every response — `content_filter_results` shows the per-category analysis; a `filtered: true` value means that content was blocked. Check for `finish_reason: "content_filter"` in your error handling.
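When handling this response programmatically, it pays to fail loudly on filtered output rather than silently showing an empty message. A minimal stdlib sketch (the helper name `extract_completion` is illustrative, not part of any SDK):

```python
import json

def extract_completion(raw: str) -> str:
    """Pull the generated text out of a chat.completion response,
    raising if the model stopped because of the content filter."""
    data = json.loads(raw)
    choice = data["choices"][0]
    if choice["finish_reason"] == "content_filter":
        raise ValueError("response blocked by content filter")
    return choice["message"]["content"]

ok = '{"choices":[{"finish_reason":"stop","message":{"content":"Hello"}}]}'
print(extract_completion(ok))  # Hello
```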
## Streaming

Add `stream: true` to receive tokens as they arrive — essential for chat UIs so users see text appearing instead of waiting for a full response.
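If you consume the stream without a framework helper, each SSE line is `data: ` followed by JSON (the exact chunk shape is shown at the end of this section). A minimal stdlib sketch of assembling the delta text; the function name is illustrative, not an SDK API:

```python
import json

def accumulate_sse(lines):
    """Join the delta content from 'data: ...' SSE lines into one string."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))  # first chunk carries only the role
    return "".join(text)

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"Once"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":" upon"},"index":0}]}',
    'data: [DONE]',
]
print(accumulate_sse(sample))  # Once upon
```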
```shell
curl -X POST https://api.skytells.ai/v1/chat/completions \
  -H "Authorization: Bearer $SKYTELLS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepbrain-router",
    "messages": [{ "role": "user", "content": "Write a 3-sentence story about a robot." }],
    "stream": true
  }'
```

### Streaming in a Next.js API route
```typescript
// app/api/chat/route.ts
import OpenAI from 'openai';
// Note: OpenAIStream / StreamingTextResponse come from the legacy
// Vercel AI SDK ('ai' v3 and earlier); v4+ replaces these helpers.
import { OpenAIStream, StreamingTextResponse } from 'ai';

const client = new OpenAI({
  apiKey: process.env.SKYTELLS_API_KEY,
  baseURL: 'https://api.skytells.ai/v1',
});

export async function POST(req: Request) {
  const { messages } = await req.json();
  const response = await client.chat.completions.create({
    model: 'deepbrain-router',
    messages,
    stream: true,
  });
  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}
```

Each SSE chunk is a JSON object. The stream ends with `data: [DONE]`:
```
data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}
data: {"choices":[{"delta":{"content":"Once"},"index":0}]}
data: {"choices":[{"delta":{"content":" upon"},"index":0}]}
data: [DONE]
```

## Stateful conversations with /v1/responses
`POST /v1/responses` avoids resending the entire message history on every turn. Pass `previous_response_id` and the server reconstructs the context automatically.
```python
import httpx, os

headers = {
    "Authorization": f"Bearer {os.environ['SKYTELLS_API_KEY']}",
    "Content-Type": "application/json",
}

# Turn 1
resp1 = httpx.post(
    "https://api.skytells.ai/v1/responses",
    headers=headers,
    json={
        "model": "deepbrain-router",
        "input": "What is the Skytells Prediction API?",
        "instructions": "You are a developer assistant. Be concise.",
    },
).json()
print(f"Turn 1: {resp1['output_text']}")
print(f"ID saved: {resp1['id']}")

# Turn 2 — continue without resending Turn 1
resp2 = httpx.post(
    "https://api.skytells.ai/v1/responses",
    headers=headers,
    json={
        "model": "deepbrain-router",
        "input": "How is it different from the Inference API?",
        "previous_response_id": resp1["id"],
    },
).json()
print(f"Turn 2: {resp2['output_text']}")
```

The ResponseObject has a convenience `output_text` field containing the full text response — no need to dig into `output[0].content[0].text`.
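A convenient pattern is wrapping this ID chaining in a tiny helper that remembers the last response. A sketch: the `send` callable is injected (in practice it would be the `httpx.post(...).json()` call shown above), and the `Conversation` class is illustrative, not an SDK type:

```python
class Conversation:
    """Tracks previous_response_id so each turn sends only the new input."""

    def __init__(self, send, model="deepbrain-router"):
        self.send = send      # callable(payload: dict) -> response dict
        self.model = model
        self.last_id = None   # no previous turn yet

    def ask(self, text, instructions=None):
        payload = {"model": self.model, "input": text}
        if instructions is not None:
            payload["instructions"] = instructions
        if self.last_id is not None:
            payload["previous_response_id"] = self.last_id
        resp = self.send(payload)
        self.last_id = resp["id"]  # chain the next turn onto this one
        return resp["output_text"]
```

Injecting `send` keeps the helper transport-agnostic, which also makes it trivial to unit-test with a stub.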
## Embeddings

Generate dense vector representations for semantic search, clustering, or RAG pipelines.
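What makes these vectors useful is that semantically similar texts land close together, which you can measure with cosine similarity. A stdlib-only sketch (toy 2-dimensional vectors here; real embeddings have e.g. 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```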
```python
import os
from openai import OpenAI

# The interface is OpenAI-compatible, so the official SDK works
# once it's pointed at the Skytells base URL.
client = OpenAI(
    api_key=os.environ["SKYTELLS_API_KEY"],
    base_url="https://api.skytells.ai/v1",
)

# Single text
result = client.embeddings.create(
    model="deepbrain-router",
    input="A photorealistic mountain lake at sunrise",
)
vector = result.data[0].embedding
print(f"Vector dimensions: {len(vector)}")  # e.g. 1536

# Batch
result = client.embeddings.create(
    model="deepbrain-router",
    input=[
        "Generate an image of a cat",
        "Generate an image of a dog",
        "Create background music for a podcast",
    ],
)
for item in result.data:
    print(f"Index {item.index}: {len(item.embedding)} dimensions")
```

The embedding response:
```json
{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.002, -0.009, 0.015, "..."] }
  ],
  "model": "deepbrain-router",
  "usage": { "prompt_tokens": 9, "total_tokens": 9 }
}
```

## Generation parameters — quick reference
| Parameter | Default | Effect |
|---|---|---|
| `temperature: 0` | — | Near-deterministic — the same input almost always gives the same output |
| `temperature: 0.7` | ✓ | Balanced — creative but coherent |
| `temperature: 1.5` | — | Very creative, may hallucinate |
| `max_tokens: 256` | — | Short answers |
| `max_tokens: 8192` | ✓ | Default — allows long responses |
| `top_p: 0.1` | — | Very focused — samples only from tokens covering the top 10% of probability mass |
| `top_p: 0.95` | ✓ | Default — balanced nucleus sampling |
| `stop: "\n"` | — | Stops at the first newline — useful for single-line outputs |
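These knobs go straight into the chat completions request body alongside `model` and `messages`. A sketch of a combined payload (the values are illustrative, not recommendations):

```python
import json

payload = {
    "model": "deepbrain-router",
    "messages": [{"role": "user", "content": "Name my robot cafe."}],
    "temperature": 0.7,  # balanced creativity
    "max_tokens": 256,   # keep the answer short
    "top_p": 0.95,       # nucleus sampling
    "stop": "\n",        # cut off at the first newline
}
body = json.dumps(payload)  # this is what the -d argument of curl carries
```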
## Summary
You can now call the Inference API, stream tokens to a UI, hold stateful conversations, and generate embeddings.
Up next: Module 6 — error handling and best practices for both APIs. Learn how to implement retry logic, set up webhooks for long-running predictions, secure your API key, and build production-ready AI integrations.