Runtime Health & Live Events
Read CPU, memory, and heap signals from the Runtime Health view. Stream live events in real time to diagnose problems as they happen.
What you'll be able to do after this module
Read runtime metrics from a running app, recognize the early signs of memory pressure or CPU saturation, and use the Live event stream to watch an app in real time during active debugging.
Runtime Health
What it shows and what it doesn't
Runtime Health shows you the resource consumption of your running apps:
- CPU usage — percentage of allocated CPU being used, over time
- Memory consumption — total memory in use, compared to the app's allocation
- Heap metrics — for apps running on Node.js or similar runtimes, the portion of memory used by the language runtime's heap
- Request latency — how long requests are taking to complete (p50 and p95 typically)
It shows you what the machine layer looks like. It doesn't tell you why. For the why, you use it together with logs and the Errors view.
How to read the Runtime Health view
Open Cognition → Runtime Health.
You'll see time-series charts for each metric. The charts default to a recent time window — usually the last hour. You can adjust the window to look further back.
Reading CPU:
- Low and stable (0–40%): healthy, headroom available
- Moderate (40–70%): fine, but worth monitoring at traffic peaks
- High (70–90%): running hot, investigate if sustained
- Critical (90%+): likely causing latency and dropped requests
Reading memory:
Memory usage typically grows after startup. The question is whether it levels off at a stable plateau or keeps climbing toward the ceiling.
| Pattern | What it likely means |
|---|---|
| Steady line (flat after initial growth) | Normal — app loaded, memory is stable |
| Gradual climb over hours | Possible memory leak — allocating without freeing |
| Sudden jump | Something new was loaded (large asset, cache warm) |
| Line hits ceiling and flattens at top | App is out of memory — likely restart loop or degraded performance |
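The patterns in the table can be approximated in a script. The sketch below is only an illustration and does not reflect how Runtime Health itself labels anything. It assumes you already have memory readings as fractions of the app's allocation, oldest first, and every threshold (`ceiling`, `climb`, `jump`) is a value invented for the example.

```python
def classify_memory(samples, ceiling=0.95, climb=0.10, jump=0.20):
    """Label a series of memory readings (fractions of allocation, oldest first).

    All thresholds are illustrative assumptions, not Skytells defaults.
    """
    if len(samples) < 2:
        return "insufficient data"
    # At the ceiling: the most recent readings all sit near the allocation limit.
    recent = samples[-3:]
    if min(recent) >= ceiling:
        return "at ceiling"
    # Sudden jump: one step between consecutive samples exceeds the jump threshold.
    steps = [b - a for a, b in zip(samples, samples[1:])]
    if max(steps) >= jump:
        return "sudden jump"
    # Gradual climb: compare the average of the second half against the first half.
    half = len(samples) // 2
    first = sum(samples[:half]) / half
    second = sum(samples[half:]) / (len(samples) - half)
    if second - first >= climb:
        return "gradual climb"
    return "stable"
```

Feed it a window that starts after the app has finished loading. Otherwise normal startup growth can be mistaken for a jump or a climb.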
Reading request latency:
- p50 (median): what a typical request experiences
- p95: the latency that 95% of requests stay under. The slowest 5% take at least this long, and they are what you notice when you say "the site is slow"
If p50 is normal but p95 is elevated, some requests are slow but most are not. A subset of requests (a specific endpoint or query pattern) is likely causing the drag.
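To make the p50/p95 distinction concrete, here is a minimal sketch that computes both from a list of request latencies using the nearest-rank method. The platform may compute its percentiles differently, and the `ratio` used to flag an elevated tail is an assumed threshold, not a documented one.

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def tail_is_elevated(latencies_ms, ratio=3.0):
    """True when p95 is at least `ratio` times p50 (the ratio is an assumed threshold)."""
    return percentile(latencies_ms, 95) >= ratio * percentile(latencies_ms, 50)
```

With eighteen requests at 100 ms and two slow ones at 900 ms and 1200 ms, the median stays at 100 ms while p95 lands on 900 ms. That is the "normal p50, elevated p95" case described above.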
What to do when metrics look bad
CPU running hot
- Check whether the CPU spike correlates with a traffic spike — go to Anomalies.
- If traffic is normal, look for expensive operations: large database queries, image processing, synchronous loops, or heavy dependency load.
- Consider whether the app has enough CPU allocation for its workload.
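If you export CPU readings for your own checks, the bands from the reading guide (0–40, 40–70, 70–90, 90+) map onto a small helper. This is a sketch under two assumptions: a reading that falls exactly on a boundary goes to the higher band, and "sustained" means `min_consecutive` readings in a row at 70% or above. That count is a made-up default, so tune it to your sampling interval.

```python
def cpu_band(percent):
    """Map a CPU reading (percent of allocation) to the bands used in this module."""
    if percent >= 90:
        return "critical"
    if percent >= 70:
        return "high"
    if percent >= 40:
        return "moderate"
    return "low"

def sustained_high(readings, min_consecutive=5):
    """True if at least `min_consecutive` readings in a row are 70% or above.

    `min_consecutive` is an assumed value; pick one that matches your sampling interval.
    """
    run = 0
    for value in readings:
        run = run + 1 if value >= 70 else 0
        if run >= min_consecutive:
            return True
    return False
```

A short spike resets the counter as soon as one reading drops below 70%, so only an unbroken hot stretch is flagged.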
Memory climbing without a ceiling
- Look for allocations that don't get cleaned up: event listeners not removed, large objects held in closure, caches that grow without eviction.
- Check whether the climb is correlated with time (slow leak) or requests (request-correlated leak — more requests → more memory, never freed).
- Restart the app to confirm the behavior resets, then investigate the code path responsible.
# Restart without redeploying (uses current version)
skytells apps restart my-api
# Stream logs after restart to confirm clean startup
skytells logs my-api --type container --follow
High request latency
- Check the specific endpoints driving the latency — look at logs for slow spans.
- Check whether downstream services (database, external API) are slow — use the Errors view for connection timeout patterns.
- If the issue is database query time, optimize the query or add an index.
Runtime health from the CLI
# View current runtime health summary
skytells cognition runtime
# JSON output
skytells cognition runtime --json
# Pull specific fields
skytells cognition runtime --json | jq '{cpu: .cpu_usage, memory: .memory_usage}'
# Time-series data over the last 24 hours
skytells cognition timeseries --hours 24
skytells cognition timeseries --hours 24 --json
The timeseries command is particularly useful in scripts: it returns bucketed metric data you can process or plot externally.
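As a rough illustration of processing that output, the sketch below summarizes a set of buckets. The JSON shape is an assumption: the real schema is not shown in this module, so the `timestamp`, `cpu_usage`, and `memory_usage` keys are hypothetical (the last two are borrowed from the `jq` example for the `runtime` command). Check the actual `--json` output and adjust the field names before relying on it.

```python
import json

def summarize_buckets(raw_json, cpu_threshold=70.0):
    """Summarize bucketed metrics.

    Assumes (hypothetically) a JSON array of objects with `timestamp`,
    `cpu_usage`, and `memory_usage` keys; adjust to the real output.
    """
    buckets = json.loads(raw_json)
    if not buckets:
        return {"buckets": 0, "peak_cpu": None, "peak_memory": None, "hot": []}
    # Timestamps of buckets where CPU was at or above the threshold.
    hot = [b["timestamp"] for b in buckets if b["cpu_usage"] >= cpu_threshold]
    return {
        "buckets": len(buckets),
        "peak_cpu": max(b["cpu_usage"] for b in buckets),
        "peak_memory": max(b["memory_usage"] for b in buckets),
        "hot": hot,
    }
```

One way to use it is to pipe the `--json` output of the timeseries command into a script that calls `summarize_buckets(sys.stdin.read())`.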
Live Events
What Live is
The Live view is a real-time feed of events from your apps. As requests come in, errors occur, and system events fire, they appear in the feed in the order they happen.
It's different from logs. Logs are the continuous stream of everything your app writes to stdout/stderr. Live events are structured events with categorized types and metadata.
Use Live when:
- You're actively debugging and want to watch what's happening right now
- You've made a change and want to confirm the new behavior is showing up correctly
- You're investigating a transient problem that only happens under certain conditions
- You want to watch how your app behaves during a load test or similar scenario
Using the Live view effectively
Keep your scope narrow. The Live feed can be noisy in a busy app. Use the filter controls to focus on specific event types or apps.
Watch for sequences, not just individual events. A single 500 error is noise. Seeing a database connection event followed immediately by an error is a pattern — those two events are almost certainly related.
Use it in parallel with logs. Open skytells logs my-api --follow in a terminal at the same time you're watching Live. You'll see the same events from two different angles — the structured event metadata in Live, and the raw log output in the terminal.
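The "sequences, not individual events" idea can also be applied to events you have already pulled. The sketch below finds cases where one event type is followed closely by another. It assumes each event is a dict with `type` and `ts` (epoch seconds) keys and that the list is sorted by time. Those field names, the event type strings in the usage note, and the two-second window are all hypothetical, since the real event schema is not documented here.

```python
def find_sequences(events, first_type, second_type, within_seconds=2.0):
    """Return (first, second) pairs where second_type follows first_type closely.

    Assumes each event is a dict with hypothetical `type` and `ts` (epoch seconds)
    keys, sorted by time; the real event schema may differ.
    """
    pairs = []
    last_first = None
    for event in events:
        if event["type"] == first_type:
            # Remember the most recent occurrence of the leading event.
            last_first = event
        elif event["type"] == second_type and last_first is not None:
            if event["ts"] - last_first["ts"] <= within_seconds:
                pairs.append((last_first, event))
    return pairs
```

Calling `find_sequences(events, "db.connection", "error")` would return only the errors that arrive within two seconds of a database connection event and skip isolated errors.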
Streaming events from the CLI
# Pull recent events
skytells cognition events
# Events after a specific event ID (use the ID from a previous response)
skytells cognition events --since evt-100
# For polling in a script — get the latest event ID:
LAST_ID=$(skytells cognition events --limit 1 --json | jq -r '.[-1].id // empty')
# On the next cycle, fetch only new events:
skytells cognition events --since "$LAST_ID" --json
Putting runtime health and live events together
These two views are strongest when used at the same time:
Scenario: Latency degradation, cause unknown
- Open Runtime Health — CPU looks fine, but memory is near ceiling.
- Open Live — watch for events that fire before latency spikes. Look for patterns: does a specific request type fire before latency goes up?
- Open a terminal and stream actual container logs:
skytells logs my-api --type container --tail 100 --follow
- Between these three (timing, metrics, and actual log output), you have the signal to find what's causing the slowdown.
What you now know
| Task | How to do it |
|---|---|
| Read CPU and memory trends | Cognition → Runtime Health |
| Recognize a memory leak pattern | Look for steady climb without plateau |
| Distinguish median vs. tail latency | Compare p50 and p95 values |
| Restart a struggling app from terminal | skytells apps restart <app> |
| Watch events in real time | Cognition → Live |
| Pull runtime health from terminal | skytells cognition runtime |
| Query time-series metrics | skytells cognition timeseries --hours 24 |
| Poll events from a specific event ID | skytells cognition events --since <event-id> |
Up next: Module 5 — Monitoring from the CLI →