Cognition: Runtime Observer

Six process health monitors sampling event loop timing, memory, CPU, garbage collection, active handles, and HTTP metrics. Configurable thresholds emit anomaly events when limits are crossed.

The Runtime Observer samples your Node.js process on a set interval and sends snapshots to Skytells. It runs six independent monitors covering the areas most likely to show performance problems: event loop timing, heap memory, CPU load, garbage collection behaviour, active handles, and HTTP traffic. When a monitored value crosses a configured threshold, it emits a separate anomaly event alongside the snapshot.


Six Monitors

The observer is composed of six independent sub-monitors, each targeting a distinct resource:

| Monitor | Source | What it measures |
| --- | --- | --- |
| Event Loop | perf_hooks.monitorEventLoopDelay(), eventLoopUtilization() | Loop utilization, p50/p99/max lag |
| Memory | process.memoryUsage() | RSS, heap used/total, external, array buffers, growth rate |
| CPU | process.cpuUsage() | User %, system %, total %, core count |
| Garbage Collection | PerformanceObserver('gc') | Collection counts by type, total GC duration, max pause |
| Active Handles | process._getActiveHandles() | Active handles/requests, trend detection |
| HTTP Metrics | PerformanceObserver('http') | Request count, average duration, error rate |

Enabling and Disabling

The Runtime Observer is enabled by default. To control it:

// Enabled by default — configure the interval
Cognition.init({
  apiKey: process.env.SKYTELLS_API_KEY!,
  projectId: process.env.SKYTELLS_PROJECT_ID!,
  runtime: {
    enabled: true,
    snapshotIntervalMs: 10_000, // default: 10 seconds
  },
});

// Disable entirely
Cognition.init({
  apiKey: process.env.SKYTELLS_API_KEY!,
  projectId: process.env.SKYTELLS_PROJECT_ID!,
  runtime: { enabled: false },
});

Snapshot Interval

The snapshot timer fires every snapshotIntervalMs (default: 10 seconds). On each tick, the observer:

  1. Collects data from all six monitors simultaneously
  2. Assembles a RuntimeSnapshot event
  3. Sends it through the transport pipeline (beforeSend → buffer → batching)
  4. Checks all anomaly thresholds
  5. Emits AnomalyEvents for any threshold breaches

The timer is unref()'d — it will never prevent your process from exiting naturally.
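The unref() behaviour can be seen on any Node.js timer. A minimal standalone sketch (not SDK code):

```typescript
// An unref()'d interval timer never keeps the process alive on its own.
const timer = setInterval(() => {
  /* snapshot collection would happen here */
}, 10_000);

console.log(timer.hasRef()); // true: the timer would keep the process running
timer.unref();
console.log(timer.hasRef()); // false: the process may exit while the timer is pending
```

This is why a long snapshot interval never delays shutdown: once every ref'd handle is done, Node.js exits regardless of the pending snapshot timer.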


Metrics Detail

Event Loop

Source: perf_hooks.monitorEventLoopDelay() + perf_hooks.eventLoopUtilization()

A monitorEventLoopDelay histogram with 20ms resolution is enabled at startup. On each collection, percentiles are read and the histogram is reset. ELU is measured as a delta from the previous collection.
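That collection step can be sketched roughly as follows (hypothetical helper names; the SDK's internals may differ):

```typescript
import { monitorEventLoopDelay, performance } from 'node:perf_hooks';

// 20ms-resolution histogram, enabled once at startup.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
let prevElu = performance.eventLoopUtilization();

function collectEventLoop() {
  // ELU is measured as a delta against the previous collection.
  const elu = performance.eventLoopUtilization(prevElu);
  prevElu = performance.eventLoopUtilization();

  // The histogram reports nanoseconds; convert to ms, then reset for the next interval.
  const snapshot = {
    utilization: elu.utilization,
    lagP50: histogram.percentile(50) / 1e6,
    lagP99: histogram.percentile(99) / 1e6,
    lagMax: histogram.max / 1e6,
  };
  histogram.reset();
  return snapshot;
}
```

Resetting the histogram each tick is what makes the percentiles per-interval rather than cumulative over the process lifetime.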

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| utilization | number | ratio (0–1) | Fraction of time the event loop was not idle. 0.7 = 70% busy. |
| lagP50 | number | ms | 50th percentile event loop delay (median) |
| lagP99 | number | ms | 99th percentile event loop delay |
| lagMax | number | ms | Maximum delay observed in the interval |

What to look for:

  • utilization > 0.8 — Event loop is heavily loaded; responses will slow
  • lagP99 > 100ms — Significant latency spikes; likely synchronous blocking code
  • lagMax > 500ms — Something is blocking the event loop for extended periods

Memory

Source: process.memoryUsage()

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| rss | number | bytes | Resident Set Size — total OS memory allocated for this process |
| heapUsed | number | bytes | V8 heap currently in use |
| heapTotal | number | bytes | V8 total heap size (allocated, including free) |
| external | number | bytes | Memory used by C++ objects bound to JS |
| arrayBuffers | number | bytes | Memory for ArrayBuffer and SharedArrayBuffer |
| heapUsedPercent | number | ratio (0–1) | heapUsed / heapTotal |
| growthRate | number | bytes/sec | Rate of heap growth since last collection |

What to look for:

  • Sustained positive growthRate — Potential memory leak
  • heapUsedPercent > 0.9 — Heap pressure; GC will become aggressive
  • rss growing while heapUsed is stable — Native memory leak (C++ addons, Buffer.alloc)
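The derived fields can be computed from process.memoryUsage() deltas. A sketch with a hypothetical helper (not the SDK's actual code):

```typescript
let prevHeapUsed = process.memoryUsage().heapUsed;
let prevTimestamp = Date.now();

function collectMemory() {
  const usage = process.memoryUsage();
  const now = Date.now();
  // Guard against a zero-length interval to avoid division by zero.
  const elapsedSec = Math.max((now - prevTimestamp) / 1000, 1e-3);
  const growthRate = (usage.heapUsed - prevHeapUsed) / elapsedSec; // bytes/sec
  prevHeapUsed = usage.heapUsed;
  prevTimestamp = now;
  return {
    ...usage,
    heapUsedPercent: usage.heapUsed / usage.heapTotal, // 0–1
    growthRate,
  };
}
```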

CPU

Source: process.cpuUsage()

Converts microseconds of CPU time to a percentage of elapsed wall-clock time since the last collection.
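That conversion can be sketched as follows (hypothetical helper; only process.cpuUsage(), process.hrtime.bigint(), and os.cpus() are assumed):

```typescript
import { cpus } from 'node:os';

let prevCpu = process.cpuUsage();
let prevHrtime = process.hrtime.bigint();

function collectCpu() {
  const delta = process.cpuUsage(prevCpu); // µs of CPU time since last call
  const nowHrtime = process.hrtime.bigint();
  const elapsedUs = Number(nowHrtime - prevHrtime) / 1_000; // ns -> µs of wall-clock time
  prevCpu = process.cpuUsage();
  prevHrtime = nowHrtime;
  return {
    userPercent: (delta.user / elapsedUs) * 100,
    systemPercent: (delta.system / elapsedUs) * 100,
    totalPercent: ((delta.user + delta.system) / elapsedUs) * 100,
    coreCount: cpus().length,
  };
}
```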

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| userPercent | number | % | CPU time in user-space as % of wall-clock time |
| systemPercent | number | % | CPU time in kernel-space (syscalls, I/O) |
| totalPercent | number | % | userPercent + systemPercent |
| coreCount | number | count | Number of logical CPU cores |

What to look for:

  • totalPercent > 100 — Process is using more than one core (worker threads)
  • systemPercent > 30 — Heavy I/O or syscall overhead
  • userPercent spikes — CPU-intensive computation on the main thread

Garbage Collection

Source: PerformanceObserver observing 'gc' entries

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| majorCount | number | count | Mark-sweep-compact (full GC) collections |
| minorCount | number | count | Scavenge (young generation) collections |
| incrementalCount | number | count | Incremental marking passes |
| totalDuration | number | ms | Cumulative GC time across all types in the interval |
| maxPause | number | ms | Longest single GC pause in the interval |

V8 GC kinds:

| Kind | Description |
| --- | --- |
| Scavenge | Quick collection of the young generation |
| Mark-Sweep-Compact | Full GC — marks, sweeps, and compacts the old generation |
| Incremental Marking | Partial marking done incrementally between frames |

What to look for:

  • maxPause > 50ms — Long GC pauses causing latency spikes
  • High majorCount — Frequent full GCs indicate heap pressure
  • High totalDuration relative to snapshot interval — GC is consuming significant CPU
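Aggregating 'gc' entries can be sketched like this (hypothetical counters; on modern Node.js the GC kind is reported on entry.detail, and the kind constants live on perf_hooks.constants):

```typescript
import { PerformanceObserver, constants } from 'node:perf_hooks';

let majorCount = 0;
let minorCount = 0;
let incrementalCount = 0;
let totalDuration = 0;
let maxPause = 0;

const gcObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const kind = (entry.detail as { kind?: number } | undefined)?.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MAJOR) majorCount++;
    else if (kind === constants.NODE_PERFORMANCE_GC_MINOR) minorCount++;
    else if (kind === constants.NODE_PERFORMANCE_GC_INCREMENTAL) incrementalCount++;
    totalDuration += entry.duration; // ms
    maxPause = Math.max(maxPause, entry.duration);
  }
});
gcObserver.observe({ entryTypes: ['gc'] });
```

On each snapshot tick the counters would be read and reset, giving per-interval rather than cumulative values.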

Active Handles

Source: process._getActiveHandles() and process._getActiveRequests()

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| activeHandles | number | count | Active handles (sockets, timers, file descriptors) |
| activeRequests | number | count | Active libuv requests |
| trend | string | enum | 'stable' / 'growing' / 'shrinking' |

Trend detection: The monitor keeps a sliding window of the last 10 handle counts. If the last 3 values are monotonically increasing, the trend is 'growing'; decreasing = 'shrinking'; otherwise 'stable'.
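The window logic described above can be sketched as a small helper (hypothetical names, not the SDK's code):

```typescript
type Trend = 'stable' | 'growing' | 'shrinking';

const handleWindow: number[] = [];

function recordHandleCount(count: number): Trend {
  handleWindow.push(count);
  if (handleWindow.length > 10) handleWindow.shift(); // keep the last 10 samples

  const last3 = handleWindow.slice(-3);
  if (last3.length < 3) return 'stable'; // not enough history yet
  if (last3[0] < last3[1] && last3[1] < last3[2]) return 'growing';
  if (last3[0] > last3[1] && last3[1] > last3[2]) return 'shrinking';
  return 'stable';
}
```

Requiring three strictly monotonic samples filters out the normal jitter of connections opening and closing.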

What to look for:

  • trend: 'growing' — Possible handle leak (connections not closed, timers not cleared)
  • High activeHandles — Many open connections or file descriptors

HTTP Metrics

Source: PerformanceObserver observing 'http' entries (Node.js 18.2+)

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| totalRequests | number | count | Number of outgoing HTTP requests in the interval |
| avgDurationMs | number | ms | Average request duration |
| errorRate | number | ratio (0–1) | Fraction of requests with status ≥ 400 |
| entries | array | | Individual request details (method, status, URL, duration) |

The 'http' performance entry type degrades silently if unavailable in a given Node.js version.


On-Demand Snapshots

Collect a runtime snapshot at any time without waiting for the interval:

const snapshot = cognition.getRuntimeSnapshot();

if (snapshot) {
  console.log(`Heap: ${Math.round(snapshot.memory.heapUsed / 1024 / 1024)}MB`);
  console.log(`ELU: ${(snapshot.eventLoop.utilization * 100).toFixed(1)}%`);
  console.log(`GC max pause: ${snapshot.gc.maxPause}ms`);
}

Returns null if the Runtime Observer is disabled.


Anomaly Detection

Configure thresholds to trigger anomaly events when specific metrics are breached. Anomaly events are emitted in addition to the regular snapshot — they don't replace it. Multiple anomalies can fire per snapshot interval if multiple thresholds are breached simultaneously.

Cognition.init({
  apiKey: process.env.SKYTELLS_API_KEY!,
  projectId: process.env.SKYTELLS_PROJECT_ID!,
  runtime: {
    thresholds: {
      heapUsedMb: 512,       // Anomaly when heap > 512 MB
      eventLoopLagMs: 100,   // Anomaly when p99 lag > 100 ms
      eluPercent: 0.8,       // Anomaly when ELU > 80%
    },
  },
});

Anomaly Types

| Anomaly Type | Triggered when |
| --- | --- |
| heap | memory.heapUsed (in MB) > thresholds.heapUsedMb |
| event_loop_lag | eventLoop.lagP99 (in ms) > thresholds.eventLoopLagMs |
| elu | eventLoop.utilization (0–1) > thresholds.eluPercent |
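The checks in the table can be expressed as a small pure function. A sketch with hypothetical names that mirrors the threshold semantics above:

```typescript
interface Thresholds {
  heapUsedMb?: number;
  eventLoopLagMs?: number;
  eluPercent?: number;
}

interface SnapshotSlice {
  memory: { heapUsed: number }; // bytes
  eventLoop: { lagP99: number; utilization: number };
}

function checkThresholds(snapshot: SnapshotSlice, t: Thresholds) {
  const anomalies: { anomalyType: string; threshold: number; actual: number }[] = [];
  const heapUsedMb = snapshot.memory.heapUsed / (1024 * 1024);

  if (t.heapUsedMb !== undefined && heapUsedMb > t.heapUsedMb) {
    anomalies.push({ anomalyType: 'heap', threshold: t.heapUsedMb, actual: heapUsedMb });
  }
  if (t.eventLoopLagMs !== undefined && snapshot.eventLoop.lagP99 > t.eventLoopLagMs) {
    anomalies.push({ anomalyType: 'event_loop_lag', threshold: t.eventLoopLagMs, actual: snapshot.eventLoop.lagP99 });
  }
  if (t.eluPercent !== undefined && snapshot.eventLoop.utilization > t.eluPercent) {
    anomalies.push({ anomalyType: 'elu', threshold: t.eluPercent, actual: snapshot.eventLoop.utilization });
  }
  return anomalies; // one entry per breached threshold
}
```

Note that each threshold is evaluated independently, which is why several anomalies can fire in a single snapshot interval.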

Anomaly Event Structure

interface AnomalyEvent {
  type: 'anomaly';
  timestamp: number;
  anomalyType: 'heap' | 'event_loop_lag' | 'elu' | 'gc_pressure';
  threshold: number;
  actual: number;
  message: string; // e.g. "Heap usage 623.4MB exceeds threshold 512MB"
}

Runtime Snapshot Structure

interface RuntimeSnapshot {
  type: 'runtime_snapshot';
  timestamp: number;

  memory: {
    rss: number;
    heapUsed: number;
    heapTotal: number;
    external: number;
    arrayBuffers: number;
  };

  cpu: {
    user: number;         // percentage
    system: number;       // percentage
  };

  eventLoop: {
    utilization: number;  // 0–1
    lagP50: number;       // ms
    lagP99: number;       // ms
    lagMax: number;       // ms
  };

  gc: {
    majorCount: number;
    minorCount: number;
    incrementalCount: number;
    totalDuration: number; // ms
    maxPause: number;      // ms
  };

  handles: {
    active: number;
    requests: number;
  };
}

Using Snapshots in a Health Check Endpoint

app.get('/health', (req, res) => {
  const snapshot = cognition.getRuntimeSnapshot();

  res.json({
    status: 'ok',
    sdk: {
      initialized: cognition.isInitialized,
      bufferedEvents: cognition.bufferSize,
      droppedEvents: cognition.droppedCount,
    },
    runtime: snapshot
      ? {
          heapUsedMb: Math.round(snapshot.memory.heapUsed / 1024 / 1024),
          eventLoopLagP99Ms: snapshot.eventLoop.lagP99.toFixed(2),
          eluPercent: (snapshot.eventLoop.utilization * 100).toFixed(1),
          gcMajorCount: snapshot.gc.majorCount,
          activeHandles: snapshot.handles.active,
        }
      : null,
  });
});

  • Configuration — runtime.enabled, snapshotIntervalMs, thresholds
  • Analytics — View runtime health data in the Console and CLI
  • Examples — Custom threshold alerting and periodic health check patterns
