Runtime Health & Live Events

Read CPU, memory, and heap signals from the Runtime Health view. Stream live events in real time to diagnose problems as they happen.

What you'll be able to do after this module

Read runtime metrics from a running app, recognize the early signs of memory pressure or CPU saturation, and use the Live event stream to watch an app in real time during active debugging.

Runtime Health

What it shows and what it doesn't

Runtime Health shows you the resource consumption and request latency of your running apps:

CPU usage — percentage of allocated CPU being used, over time
Memory consumption — total memory in use, compared to the app's allocation
Heap metrics — for apps running on Node.js or similar runtimes, the portion of memory used by the language runtime's heap
Request latency — how long requests are taking to complete (p50 and p95 typically)

Runtime Health shows what is happening at the machine layer, not why it is happening. To find the why, use it together with logs and the Errors view.

How to read the Runtime Health view

Open Cognition → Runtime Health.

You'll see time-series charts for each metric. The charts default to a recent time window — usually the last hour. You can adjust the window to look further back.

Reading CPU:

Low and stable:  0–40%   — healthy, headroom available
Moderate:        40–70%  — fine, but worth monitoring at traffic peaks
High:            70–90%  — running hot, investigate if sustained
Critical:        90%+    — likely causing latency and dropped requests
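
When a script needs to label a CPU reading, the bands above reduce to a small lookup. The following is a minimal sketch: `cpu_band` is a hypothetical helper name, not part of the skytells CLI, and it assumes a boundary value (40, 70, 90) belongs to the higher band.

```shell
# cpu_band PERCENT
# Hypothetical helper (not a skytells command): prints the band for a CPU
# percentage using the thresholds listed above. awk is used so decimal
# readings work. Boundary values fall into the higher band (an assumption).
cpu_band() {
  awk -v p="$1" 'BEGIN {
    if      (p < 40) print "low"
    else if (p < 70) print "moderate"
    else if (p < 90) print "high"
    else             print "critical"
  }'
}

cpu_band 35     # low
cpu_band 82.5   # high
cpu_band 97     # critical
```

A single reading in the high band is not a problem by itself; the band matters when the reading stays there.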

Reading memory:

Memory in a long-running app tends to grow after startup. The question is whether it levels off at a stable plateau or keeps climbing toward the allocation ceiling.

Pattern	What it likely means
Steady line (flat after initial growth)	Normal — app loaded, memory is stable
Gradual climb over hours	Possible memory leak — allocating without freeing
Sudden jump	Something new was loaded (large asset, cache warm-up)
Line hits ceiling and flattens at top	App is out of memory — likely restart loop or degraded performance
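
The patterns in the table can be approximated in a script by comparing the first half of a sample window with the second half. This is a rough sketch under stated assumptions: `memory_trend` is a hypothetical helper, the input is one memory reading per line expressed as a percentage of the app's allocation (oldest first), and the 95% and 5-point thresholds are illustrative values, not platform defaults.

```shell
# memory_trend
# Hypothetical helper: reads memory samples (percent of allocation, oldest
# first, one per line) on stdin and prints a rough label.
#   at-ceiling : the latest sample is at or above 95% of the allocation
#   climbing   : the second-half average exceeds the first-half average by
#                more than 5 percentage points
#   stable     : anything else
memory_trend() {
  awk '
    { v[NR] = $1 }
    END {
      if (NR < 4) { print "insufficient-data"; exit }
      half = int(NR / 2)
      for (i = 1; i <= half; i++)           first  += v[i]
      for (i = NR - half + 1; i <= NR; i++) second += v[i]
      first /= half; second /= half
      if (v[NR] >= 95)             print "at-ceiling"
      else if (second - first > 5) print "climbing"
      else                         print "stable"
    }'
}

printf '30\n35\n40\n45\n50\n55\n' | memory_trend   # climbing
```

Comparing half-window averages rather than individual samples keeps the label steady when garbage collection makes the line sawtooth. A "climbing" label over a one-hour window is only a hint; confirm it over a longer window before treating it as a leak.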

Reading request latency:

p50 (median): what a typical request experiences
p95: the latency that 95% of requests complete within; the slowest 5% take longer. This is what users notice when they say "the site is slow"

If p50 is normal but p95 is elevated, most requests are fine and a minority are slow. A subset of requests (a specific endpoint or query pattern) is likely causing the drag.
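
To see how p50 and p95 diverge, it can help to compute them by hand on a small sample. The sketch below uses the nearest-rank method, which is one common definition and not necessarily the one Runtime Health uses. `percentile` is a hypothetical helper that takes an integer percentile and reads latencies in milliseconds, one per line.

```shell
# percentile P
# Hypothetical helper: reads numeric latencies (one per line) on stdin and
# prints the P-th percentile using the nearest-rank method:
#   rank = ceil(P/100 * N), computed with integer arithmetic.
# P must be an integer between 1 and 100.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# 18 requests around 100 ms plus two slow outliers
{ seq 100 117; echo 900; echo 1200; } | percentile 50   # 109
{ seq 100 117; echo 900; echo 1200; } | percentile 95   # 900
```

In this sample the median barely registers the two slow requests, while p95 lands on one of them. That is the "normal p50, elevated p95" shape described above.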

What to do when metrics look bad

CPU running hot

Check whether the CPU spike correlates with a traffic spike — go to Anomalies.
If traffic is normal, look for expensive operations: large database queries, image processing, synchronous loops, or heavy dependency load.
Consider whether the app has enough CPU allocation for its workload.

Memory climbing without a plateau

Look for allocations that are never cleaned up: event listeners that are not removed, large objects held in closures, caches that grow without eviction.
Check whether the climb is correlated with time (slow leak) or requests (request-correlated leak — more requests → more memory, never freed).
Restart the app to confirm the behavior resets, then investigate the code path responsible.

# Restart without redeploying (uses current version)
skytells apps restart my-api

# Stream logs after restart to confirm clean startup
skytells logs my-api --type container --follow
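
One way to tell a time-correlated leak from a request-correlated one, as described in the steps above, is to check how strongly memory growth tracks request volume across intervals. The sketch below computes a Pearson correlation. `leak_correlation` is a hypothetical helper, and the input format (request count and memory growth per interval, space-separated) is an assumption: you would need to assemble those pairs yourself from your metrics.

```shell
# leak_correlation
# Hypothetical helper: reads "REQUESTS MEMORY_GROWTH" pairs (one interval per
# line) on stdin and prints the Pearson correlation coefficient, or "n/a"
# when it is undefined (fewer than two rows, or one column is constant).
leak_correlation() {
  awk '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      a = n * sxx - sx * sx
      b = n * syy - sy * sy
      if (n < 2 || a <= 0 || b <= 0) { print "n/a"; exit }
      printf "%.2f\n", (n * sxy - sx * sy) / sqrt(a * b)
    }'
}

# Memory growth scales with traffic: suggests a request-correlated leak
printf '100 10\n200 20\n300 30\n400 40\n' | leak_correlation   # 1.00
```

A value near 1 suggests memory grows with requests. A value near 0, or "n/a" because growth is the same in every interval regardless of traffic, points toward a slow time-correlated leak instead.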

High request latency

Check the specific endpoints driving the latency — look at logs for slow spans.
Check whether downstream services (database, external API) are slow — use the Errors view for connection timeout patterns.
If the issue is database query time, optimize the query or add an index.

Runtime health from the CLI

# View current runtime health summary
skytells cognition runtime

# JSON output
skytells cognition runtime --json

# Pull specific fields
skytells cognition runtime --json | jq '{cpu: .cpu_usage, memory: .memory_usage}'

# Time-series data over the last 24 hours
skytells cognition timeseries --hours 24
skytells cognition timeseries --hours 24 --json

The timeseries command is particularly useful in scripts — it returns bucketed metric data you can process or plot externally.
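
As an example of that kind of processing, a script can check whether CPU stayed in the high band for several consecutive buckets rather than spiking once. The JSON shape of `timeseries --json` is not described here, so this sketch assumes you have already reduced the output to one CPU percentage per line (for instance with a jq filter matched to the real field names). `sustained_above` is a hypothetical helper.

```shell
# sustained_above THRESHOLD COUNT
# Hypothetical helper: reads one numeric value per bucket on stdin (oldest
# first) and exits 0 if at least COUNT consecutive values are at or above
# THRESHOLD, otherwise exits 1.
sustained_above() {
  awk -v t="$1" -v n="$2" '
    { run = ($1 >= t) ? run + 1 : 0; if (run >= n) found = 1 }
    END { exit (found ? 0 : 1) }'
}

# Three buckets in a row at or above 70%
printf '50\n75\n80\n72\n60\n' | sustained_above 70 3 && echo "CPU sustained above 70%"
```

Because the result is an exit code, the helper drops into an `if` or `&&` chain in a cron job or CI step without further parsing.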

Live Events

What Live is

The Live view is a real-time feed of events from your apps. As requests come in, errors occur, and system events fire, they appear in the feed in the order they happen.

It's different from logs. Logs are the continuous stream of everything your app writes to stdout/stderr. Live events are structured events with categorized types and metadata.

Use Live when:

You're actively debugging and want to watch what's happening right now
You've made a change and want to confirm the new behavior is showing up correctly
You're investigating a transient problem that only happens under certain conditions
You want to watch how your app behaves during a load test or similar scenario

Using the Live view effectively

Keep your scope narrow. The Live feed can be noisy in a busy app. Use the filter controls to focus on specific event types or apps.

Watch for sequences, not just individual events. A single 500 error is often noise. A database connection event followed immediately by an error is a pattern, and those two events are likely related.

Use it in parallel with logs. Run `skytells logs my-api --follow` in a terminal while you watch Live. You'll see the same events from two different angles: the structured event metadata in Live, and the raw log output in the terminal.

Streaming events from the CLI

# Pull recent events
skytells cognition events

# Events after a specific event ID (use the ID from a previous response)
skytells cognition events --since evt-100

# For polling in a script, get the latest event ID:
LAST_ID=$(skytells cognition events --limit 1 --json | jq -r '.[-1].id // empty')
# On the next cycle, fetch only new events (skipped if no ID was returned):
[ -n "$LAST_ID" ] && skytells cognition events --since "$LAST_ID" --json
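
The two polling commands above can be wrapped into a loop that carries the cursor forward between cycles. The following is a sketch, not a reference implementation: `next_cursor` and `poll_events` are hypothetical helper names, and it assumes the JSON output is an array ordered oldest to newest, so that `.[-1].id` is the newest event. The detail worth noting is that the cursor is kept unchanged when a cycle returns no events, so `--since` is never called with an empty value.

```shell
# next_cursor CURRENT NEWEST
# Hypothetical helper: keeps the current cursor when a poll returned no
# events (NEWEST is empty), otherwise advances to the newest event ID.
next_cursor() {
  if [ -n "$2" ]; then printf '%s\n' "$2"; else printf '%s\n' "$1"; fi
}

# poll_events [INTERVAL_SECONDS]
# Hypothetical polling loop built from the commands shown above.
# Assumes --json returns an array ordered oldest to newest.
poll_events() {
  interval="${1:-5}"
  cursor=$(skytells cognition events --limit 1 --json | jq -r '.[-1].id // empty')
  while :; do
    sleep "$interval"
    if [ -n "$cursor" ]; then
      batch=$(skytells cognition events --since "$cursor" --json)
    else
      batch=$(skytells cognition events --json)   # no cursor yet
    fi
    printf '%s\n' "$batch" | jq -c '.[]'           # print one event per line
    newest=$(printf '%s\n' "$batch" | jq -r '.[-1].id // empty')
    cursor=$(next_cursor "$cursor" "$newest")
  done
}
```

Calling `poll_events 10` would poll every ten seconds until interrupted. Pipe its output through `jq` or `grep` to narrow it to the event types you care about.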

Putting runtime health and live events together

These two views are strongest when used at the same time:

Scenario: Latency degradation, cause unknown

Open Runtime Health — CPU looks fine, but memory is near ceiling.
Open Live — watch for events that fire before latency spikes. Look for patterns: does a specific request type fire before latency goes up?

Open a terminal and stream actual container logs:

skytells logs my-api --type container --tail 100 --follow

Between these three — timing, metrics, and actual log output — you have the signal to find what's causing the slowdown.

What you now know

Task	How to do it
Read CPU and memory trends	Cognition → Runtime Health
Recognize a memory leak pattern	Look for steady climb without plateau
Distinguish median vs. tail latency	Compare p50 and p95 values
Restart a struggling app from terminal	`skytells apps restart <app>`
Watch events in real time	Cognition → Live
Pull runtime health from terminal	`skytells cognition runtime`
Query time-series metrics	`skytells cognition timeseries --hours 24`
Poll events from a specific event ID	`skytells cognition events --since <event-id>`

Up next: Module 5 — Monitoring from the CLI →
