Cognition: Runtime Observer

Six process health monitors sampling event loop timing, memory, CPU, garbage collection, active handles, and HTTP metrics. Configurable thresholds emit anomaly events when limits are crossed.

The Runtime Observer samples your Node.js process on a set interval and sends snapshots to Skytells. It runs six independent monitors covering the areas most likely to show performance problems: event loop timing, heap memory, CPU load, garbage collection behaviour, active handles, and HTTP traffic. When a monitored value crosses a configured threshold, it emits a separate anomaly event alongside the snapshot.


Six Monitors

The observer is composed of six independent sub-monitors, each targeting a distinct resource:

| Monitor | Source | What it measures |
| --- | --- | --- |
| Event Loop | perf_hooks.monitorEventLoopDelay(), eventLoopUtilization() | Loop utilization, p50/p99/max lag |
| Memory | process.memoryUsage() | RSS, heap used/total, external, array buffers, growth rate |
| CPU | process.cpuUsage() | User %, system %, total %, core count |
| Garbage Collection | PerformanceObserver('gc') | Collection counts by type, total GC duration, max pause |
| Active Handles | process._getActiveHandles() | Active handles/requests, trend detection |
| HTTP Metrics | PerformanceObserver('http') | Request count, average duration, error rate |

Enabling and Disabling

The Runtime Observer is enabled by default. To control it:

// Enabled by default — configure the interval
Cognition.init({
  apiKey: process.env.SKYTELLS_API_KEY!,
  projectId: process.env.SKYTELLS_PROJECT_ID!,
  runtime: {
    enabled: true,
    snapshotIntervalMs: 10_000, // default: 10 seconds
  },
});

// Disable entirely
Cognition.init({
  apiKey: process.env.SKYTELLS_API_KEY!,
  projectId: process.env.SKYTELLS_PROJECT_ID!,
  runtime: { enabled: false },
});

Snapshot Interval

The snapshot timer fires every snapshotIntervalMs (default: 10 seconds). On each tick, the observer:

  1. Collects data from all six monitors simultaneously
  2. Assembles a RuntimeSnapshot event
  3. Sends it through the transport pipeline (beforeSend → buffer → batching)
  4. Checks all anomaly thresholds
  5. Emits AnomalyEvents for any threshold breaches

The timer is unref()'d — it will never prevent your process from exiting naturally.
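The unref() behaviour can be seen on any Node.js timer. A minimal standalone sketch (not SDK code):

```typescript
// An unref()'d interval timer never keeps the process alive on its own.
const timer = setInterval(() => {
  /* snapshot collection would happen here */
}, 10_000);

console.log(timer.hasRef()); // true: the timer would keep the process running
timer.unref();
console.log(timer.hasRef()); // false: the process may exit while the timer is pending
```

This is why a long snapshot interval never delays shutdown: once every ref'd handle is done, Node.js exits regardless of the pending snapshot timer.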


Metrics Detail

Event Loop

Source: perf_hooks.monitorEventLoopDelay() + perf_hooks.eventLoopUtilization()

A monitorEventLoopDelay histogram with 20ms resolution is enabled at startup. On each collection, percentiles are read and the histogram is reset. ELU is measured as a delta from the previous collection.
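That collection step can be sketched roughly as follows (hypothetical helper names; the SDK's internals may differ):

```typescript
import { monitorEventLoopDelay, performance } from 'node:perf_hooks';

// 20ms-resolution histogram, enabled once at startup.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
let prevElu = performance.eventLoopUtilization();

function collectEventLoop() {
  // ELU is measured as a delta against the previous collection.
  const elu = performance.eventLoopUtilization(prevElu);
  prevElu = performance.eventLoopUtilization();

  // The histogram reports nanoseconds; convert to ms, then reset for the next interval.
  const snapshot = {
    utilization: elu.utilization,
    lagP50: histogram.percentile(50) / 1e6,
    lagP99: histogram.percentile(99) / 1e6,
    lagMax: histogram.max / 1e6,
  };
  histogram.reset();
  return snapshot;
}
```

Resetting the histogram each tick is what makes the percentiles per-interval rather than cumulative over the process lifetime.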

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| utilization | number | ratio (0–1) | Fraction of time the event loop was not idle. 0.7 = 70% busy. |
| lagP50 | number | ms | 50th percentile event loop delay (median) |
| lagP99 | number | ms | 99th percentile event loop delay |
| lagMax | number | ms | Maximum delay observed in the interval |

What to look for:

  • utilization > 0.8 — Event loop is heavily loaded; responses will slow
  • lagP99 > 100ms — Significant latency spikes; likely synchronous blocking code
  • lagMax > 500ms — Something is blocking the event loop for extended periods

Memory

Source: process.memoryUsage()

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| rss | number | bytes | Resident Set Size — total OS memory allocated for this process |
| heapUsed | number | bytes | V8 heap currently in use |
| heapTotal | number | bytes | V8 total heap size (allocated, including free) |
| external | number | bytes | Memory used by C++ objects bound to JS |
| arrayBuffers | number | bytes | Memory for ArrayBuffer and SharedArrayBuffer |
| heapUsedPercent | number | ratio (0–1) | heapUsed / heapTotal |
| growthRate | number | bytes/sec | Rate of heap growth since last collection |

What to look for:

  • Sustained positive growthRate — Potential memory leak
  • heapUsedPercent > 0.9 — Heap pressure; GC will become aggressive
  • rss growing while heapUsed is stable — Native memory leak (C++ addons, Buffer.alloc)
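The derived fields can be computed from process.memoryUsage() deltas. A sketch with a hypothetical helper (not the SDK's actual code):

```typescript
let prevHeapUsed = process.memoryUsage().heapUsed;
let prevTimestamp = Date.now();

function collectMemory() {
  const usage = process.memoryUsage();
  const now = Date.now();
  // Guard against a zero-length interval to avoid division by zero.
  const elapsedSec = Math.max((now - prevTimestamp) / 1000, 1e-3);
  const growthRate = (usage.heapUsed - prevHeapUsed) / elapsedSec; // bytes/sec
  prevHeapUsed = usage.heapUsed;
  prevTimestamp = now;
  return {
    ...usage,
    heapUsedPercent: usage.heapUsed / usage.heapTotal, // 0–1
    growthRate,
  };
}
```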

CPU

Source: process.cpuUsage()

Converts microseconds of CPU time to a percentage of elapsed wall-clock time since the last collection.
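That conversion can be sketched as follows (hypothetical helper; only process.cpuUsage(), process.hrtime.bigint(), and os.cpus() are assumed):

```typescript
import { cpus } from 'node:os';

let prevCpu = process.cpuUsage();
let prevHrtime = process.hrtime.bigint();

function collectCpu() {
  const delta = process.cpuUsage(prevCpu); // µs of CPU time since last call
  const nowHrtime = process.hrtime.bigint();
  const elapsedUs = Number(nowHrtime - prevHrtime) / 1_000; // ns -> µs of wall-clock time
  prevCpu = process.cpuUsage();
  prevHrtime = nowHrtime;
  return {
    userPercent: (delta.user / elapsedUs) * 100,
    systemPercent: (delta.system / elapsedUs) * 100,
    totalPercent: ((delta.user + delta.system) / elapsedUs) * 100,
    coreCount: cpus().length,
  };
}
```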

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| userPercent | number | % | CPU time in user-space as % of wall-clock time |
| systemPercent | number | % | CPU time in kernel-space (syscalls, I/O) |
| totalPercent | number | % | userPercent + systemPercent |
| coreCount | number | count | Number of logical CPU cores |

What to look for:

  • totalPercent > 100 — Process is using more than one core (worker threads)
  • systemPercent > 30 — Heavy I/O or syscall overhead
  • userPercent spikes — CPU-intensive computation on the main thread

Garbage Collection

Source: PerformanceObserver observing 'gc' entries

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| majorCount | number | count | Mark-sweep-compact (full GC) collections |
| minorCount | number | count | Scavenge (young generation) collections |
| incrementalCount | number | count | Incremental marking passes |
| totalDuration | number | ms | Cumulative GC time across all types in the interval |
| maxPause | number | ms | Longest single GC pause in the interval |

V8 GC kinds:

| Kind | Description |
| --- | --- |
| Scavenge | Quick collection of the young generation |
| Mark-Sweep-Compact | Full GC — marks, sweeps, and compacts the old generation |
| Incremental Marking | Partial marking done incrementally between frames |

What to look for:

  • maxPause > 50ms — Long GC pauses causing latency spikes
  • High majorCount — Frequent full GCs indicate heap pressure
  • High totalDuration relative to snapshot interval — GC is consuming significant CPU
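Aggregating 'gc' entries can be sketched like this (hypothetical counters; on modern Node.js the GC kind is reported on entry.detail, and the kind constants live on perf_hooks.constants):

```typescript
import { PerformanceObserver, constants } from 'node:perf_hooks';

let majorCount = 0;
let minorCount = 0;
let incrementalCount = 0;
let totalDuration = 0;
let maxPause = 0;

const gcObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const kind = (entry.detail as { kind?: number } | undefined)?.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MAJOR) majorCount++;
    else if (kind === constants.NODE_PERFORMANCE_GC_MINOR) minorCount++;
    else if (kind === constants.NODE_PERFORMANCE_GC_INCREMENTAL) incrementalCount++;
    totalDuration += entry.duration; // ms
    maxPause = Math.max(maxPause, entry.duration);
  }
});
gcObserver.observe({ entryTypes: ['gc'] });
```

On each snapshot tick the counters would be read and reset, giving per-interval rather than cumulative values.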

Active Handles

Source: process._getActiveHandles() and process._getActiveRequests()

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| activeHandles | number | count | Active handles (sockets, timers, file descriptors) |
| activeRequests | number | count | Active libuv requests |
| trend | string | enum | 'stable' / 'growing' / 'shrinking' |

Trend detection: The monitor keeps a sliding window of the last 10 handle counts. If the last 3 values are monotonically increasing, the trend is 'growing'; decreasing = 'shrinking'; otherwise 'stable'.
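The window logic described above can be sketched as a small helper (hypothetical names, not the SDK's code):

```typescript
type Trend = 'stable' | 'growing' | 'shrinking';

const handleWindow: number[] = [];

function recordHandleCount(count: number): Trend {
  handleWindow.push(count);
  if (handleWindow.length > 10) handleWindow.shift(); // keep the last 10 samples

  const last3 = handleWindow.slice(-3);
  if (last3.length < 3) return 'stable'; // not enough history yet
  if (last3[0] < last3[1] && last3[1] < last3[2]) return 'growing';
  if (last3[0] > last3[1] && last3[1] > last3[2]) return 'shrinking';
  return 'stable';
}
```

Requiring three strictly monotonic samples filters out the normal jitter of connections opening and closing.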

What to look for:

  • trend: 'growing' — Possible handle leak (connections not closed, timers not cleared)
  • High activeHandles — Many open connections or file descriptors

HTTP Metrics

Source: PerformanceObserver observing 'http' entries (Node.js 18.2+)

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| totalRequests | number | count | Number of outgoing HTTP requests in the interval |
| avgDurationMs | number | ms | Average request duration |
| errorRate | number | ratio (0–1) | Fraction of requests with status ≥ 400 |
| entries | array | | Individual request details (method, status, URL, duration) |

The 'http' performance entry type degrades silently if unavailable in a given Node.js version.


On-Demand Snapshots

Collect a runtime snapshot at any time without waiting for the interval:

const snapshot = cognition.getRuntimeSnapshot();

if (snapshot) {
  console.log(`Heap: ${Math.round(snapshot.memory.heapUsed / 1024 / 1024)}MB`);
  console.log(`ELU: ${(snapshot.eventLoop.utilization * 100).toFixed(1)}%`);
  console.log(`GC max pause: ${snapshot.gc.maxPause}ms`);
}

Returns null if the Runtime Observer is disabled.


Anomaly Detection

Configure thresholds to trigger anomaly events when specific metrics are breached. Anomaly events are emitted in addition to the regular snapshot — they don't replace it. Multiple anomalies can fire per snapshot interval if multiple thresholds are breached simultaneously.

Cognition.init({
  apiKey: process.env.SKYTELLS_API_KEY!,
  projectId: process.env.SKYTELLS_PROJECT_ID!,
  runtime: {
    thresholds: {
      heapUsedMb: 512,       // Anomaly when heap > 512 MB
      eventLoopLagMs: 100,   // Anomaly when p99 lag > 100 ms
      eluPercent: 0.8,       // Anomaly when ELU > 80%
    },
  },
});

Anomaly Types

| Anomaly Type | Triggered when |
| --- | --- |
| heap | memory.heapUsed (in MB) > thresholds.heapUsedMb |
| event_loop_lag | eventLoop.lagP99 (in ms) > thresholds.eventLoopLagMs |
| elu | eventLoop.utilization (0–1) > thresholds.eluPercent |
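The checks in the table can be expressed as a small pure function. A sketch with hypothetical names that mirrors the threshold semantics above:

```typescript
interface Thresholds {
  heapUsedMb?: number;
  eventLoopLagMs?: number;
  eluPercent?: number;
}

interface SnapshotSlice {
  memory: { heapUsed: number }; // bytes
  eventLoop: { lagP99: number; utilization: number };
}

function checkThresholds(snapshot: SnapshotSlice, t: Thresholds) {
  const anomalies: { anomalyType: string; threshold: number; actual: number }[] = [];
  const heapUsedMb = snapshot.memory.heapUsed / (1024 * 1024);

  if (t.heapUsedMb !== undefined && heapUsedMb > t.heapUsedMb) {
    anomalies.push({ anomalyType: 'heap', threshold: t.heapUsedMb, actual: heapUsedMb });
  }
  if (t.eventLoopLagMs !== undefined && snapshot.eventLoop.lagP99 > t.eventLoopLagMs) {
    anomalies.push({ anomalyType: 'event_loop_lag', threshold: t.eventLoopLagMs, actual: snapshot.eventLoop.lagP99 });
  }
  if (t.eluPercent !== undefined && snapshot.eventLoop.utilization > t.eluPercent) {
    anomalies.push({ anomalyType: 'elu', threshold: t.eluPercent, actual: snapshot.eventLoop.utilization });
  }
  return anomalies; // one entry per breached threshold
}
```

Note that each threshold is evaluated independently, which is why several anomalies can fire in a single snapshot interval.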

Anomaly Event Structure

interface AnomalyEvent {
  type: 'anomaly';
  timestamp: number;
  anomalyType: 'heap' | 'event_loop_lag' | 'elu' | 'gc_pressure';
  threshold: number;
  actual: number;
  message: string; // e.g. "Heap usage 623.4MB exceeds threshold 512MB"
}

Runtime Snapshot Structure

interface RuntimeSnapshot {
  type: 'runtime_snapshot';
  timestamp: number;

  memory: {
    rss: number;
    heapUsed: number;
    heapTotal: number;
    external: number;
    arrayBuffers: number;
  };

  cpu: {
    user: number;         // percentage
    system: number;       // percentage
  };

  eventLoop: {
    utilization: number;  // 0–1
    lagP50: number;       // ms
    lagP99: number;       // ms
    lagMax: number;       // ms
  };

  gc: {
    majorCount: number;
    minorCount: number;
    incrementalCount: number;
    totalDuration: number; // ms
    maxPause: number;      // ms
  };

  handles: {
    active: number;
    requests: number;
  };
}

Using Snapshots in a Health Check Endpoint

app.get('/health', (req, res) => {
  const snapshot = cognition.getRuntimeSnapshot();

  res.json({
    status: 'ok',
    sdk: {
      initialized: cognition.isInitialized,
      bufferedEvents: cognition.bufferSize,
      droppedEvents: cognition.droppedCount,
    },
    runtime: snapshot
      ? {
          heapUsedMb: Math.round(snapshot.memory.heapUsed / 1024 / 1024),
          eventLoopLagP99Ms: snapshot.eventLoop.lagP99.toFixed(2),
          eluPercent: (snapshot.eventLoop.utilization * 100).toFixed(1),
          gcMajorCount: snapshot.gc.majorCount,
          activeHandles: snapshot.handles.active,
        }
      : null,
  });
});

  • Configuration — runtime.enabled, snapshotIntervalMs, thresholds
  • Analytics — View runtime health data in the Console and CLI
  • Examples — Custom threshold alerting and periodic health check patterns
