Metrics explained

Every Telemetry tile — what it measures, what "normal" looks like for a Discord bot, and when to actually worry.

The Telemetry panel on each bot's detail page shows a tile for every metric the SDK reports. Each tile has a current value, sometimes a sparkline trend, and sometimes a horizontal bar when the metric has a real ceiling.

This page goes tile-by-tile.

RAM

What it measures: process resident-set size (RSS) in megabytes — the physical RAM the bot's process is currently using.

Normal range: depends on language and bot complexity. Rough guides:

  • discord.js bots: 80–200 MB
  • discord.py bots: 50–150 MB
  • Music bots with active voice connections: 300–800 MB
  • Bots with large guild caches (10k+ guilds): 500 MB – 2 GB

When to worry:

  • Steady upward growth across days → memory leak. Find the unbounded data structure.
  • Sudden spike that stays elevated → an object cache that never shrinks. Check whether a cache is missing eviction.
  • Spike that returns to baseline → routine GC cycle. Ignore it.
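
For the missing-eviction case, a bounded cache is usually the fix. The sketch below is a minimal LRU cache built on the standard library; BoundedCache and max_entries are hypothetical names for illustration, not part of the CloudLine SDK or any Discord library.

```python
from collections import OrderedDict


class BoundedCache:
    """LRU cache with a hard entry limit (hypothetical helper, not an SDK class)."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._items: OrderedDict = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict least recently used entries until we are back under the limit.
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)
```

With a cap in place, RSS should plateau once the cache fills instead of climbing for as long as the bot runs.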

CPU

What it measures: process CPU usage in percent, relative to a single core. 100% means one core fully saturated. Values are clamped server-side to the 0–100 range, so a process busy across several cores still reports 100, not 400 — read 100 as "pegged" rather than as a precise multi-core figure.
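
To make the clamping concrete, here is a rough sketch of how a single-core percentage could be derived and clamped. The function name and inputs are assumptions for illustration; the actual server-side computation is not documented on this page.

```python
def cpu_percent(cpu_seconds: float, wall_seconds: float) -> float:
    """CPU time used over a wall-clock window, as a percent of one core, clamped to 0-100."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    raw = cpu_seconds / wall_seconds * 100.0  # can exceed 100 on multi-core work
    return max(0.0, min(100.0, raw))
```

A process that burned 4 CPU-seconds in a 1-second window (four busy cores) has a raw value of 400, which this sketch reports as 100.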

Normal range: 1–5% idle. Brief spikes to 30–50% during slash-command bursts are routine.

When to worry:

  • Sustained > 50% → a hot loop or unbounded recursion. Profile the process.
  • Steady climb → something is accumulating work without releasing it (event handler installed twice, recursive listener).

Gateway

What it measures: state of the Discord WebSocket gateway connection.

  • OK — gateway is healthy and acking heartbeats.
  • Zombie — process is alive but gateway is stale (no events arriving). Bot won't serve any commands. See Zombie state.
  • Shard warning — for sharded bots, one or more shards are disconnected.

Two related sub-alerts can be enabled on the Alerts tab: zombie (fires when gateway goes stale) and shard_down (fires when shard count drops below expected).

Latency (gateway ping)

What it measures: round-trip time of the Discord gateway heartbeat ack, in milliseconds.

Normal range: 30–80 ms. Above 150 ms suggests either Discord-side congestion or a degraded connection between your host and Discord.

When to worry:

  • Sustained > 200 ms → your host's network to Discord is slow. Could be: shared hosting with noisy neighbors, a VPS in a region far from Discord's nearest gateway, or a misconfigured firewall adding latency.
  • Climbing then dropping back to normal → Discord-side hiccup, no action needed.

Slash p50 / p95

What it measures: end-to-end latency of slash commands, context menus, and modal submits, from the user's click to the bot's first response (reply, deferReply, or showModal).

Normal range: under 500 ms for stateless commands, under 1.5 s for DB-backed commands.

When to worry:

  • p50 > 1 s → the typical command is slow. Probably a synchronous DB query or third-party API call on every command.
  • p95 > 3 s → some commands are getting close to Discord's 3-second interaction timeout. Users see "interaction failed."
  • p95 ≫ p50 (e.g. 4× higher) → a specific command or edge case is slow but most are fine. Find and profile that command.
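
The thresholds above can be checked against raw latency samples with a few lines of code. This is a sketch only: it uses the nearest-rank percentile method, which may differ from how CloudLine computes its tiles, and the function names and the 4× factor are illustrative assumptions.

```python
import math


def percentile(samples_ms: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms: list) -> dict:
    """Compute p50/p95 and flag the three warning patterns from the list above."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    return {
        "p50": p50,
        "p95": p95,
        "slow_typical": p50 > 1000,   # typical command is slow
        "near_timeout": p95 > 3000,   # close to the 3-second interaction timeout
        "long_tail": p95 >= 4 * p50,  # a few commands are far slower than the rest
    }
```

For example, eighteen 100 ms samples plus one 2000 ms and one 3500 ms sample give a p50 of 100 and a p95 of 2000: only long_tail is set, which points at one slow command rather than a slow bot.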

See Slow slash commands for the fix flow.

Component p50 / p95

What it measures: button and select-menu response latency. Same measurement contract as slash, just for component interactions (update, deferUpdate).

Normal range: same as slash — under 500 ms is the goal.

Autocomplete p50 / p95

What it measures: slash-command autocomplete responsiveness. Discord cuts autocomplete off after 3 seconds.

Normal range: under 200 ms for in-memory completions; under 1 s for DB-backed lookups.

When to worry: p95 > 1 s. Autocomplete fires per keystroke, so what looks like one slow query is actually one query per character the user types. Cache aggressively and index your search columns.

Event-loop lag

What it measures: how long the event loop is blocked between scheduled timers, measured with Node's perf_hooks (or as asyncio scheduler delay on Python) and averaged over the heartbeat window.
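
On Python, scheduler delay can be approximated by sleeping for a known interval and measuring how late the wake-up arrives. This is a sketch of the general technique, not the SDK's actual implementation; the function name and defaults are assumptions.

```python
import asyncio
import time


async def measure_loop_lag_ms(interval_s: float = 0.05, samples: int = 5) -> float:
    """Average number of milliseconds by which asyncio.sleep() overshoots its interval."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval_s)
        overshoot_s = time.perf_counter() - start - interval_s
        lags.append(max(0.0, overshoot_s) * 1000.0)
    return sum(lags) / len(lags)
```

If another coroutine blocks the loop with synchronous work while the sampler is asleep, the wake-up is delayed by roughly the length of that work, which is what shows up as lag.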

Normal range: 0–5 ms when idle. Brief spikes of 20–50 ms during GC are routine.

When to worry:

  • Sustained > 100 ms → main thread is busy doing synchronous work. Find the blocking operation (regex, JSON parse, sync I/O).
  • Spikes correlate with p95 spikes → the blocking work is happening during command handling. Defer the work to a worker thread / asyncio.to_thread.
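
On Python, moving blocking work off the event loop can look like the sketch below. expensive_digest and handle_command are hypothetical names; asyncio.to_thread is the standard-library call (Python 3.9+).

```python
import asyncio
import hashlib


def expensive_digest(payload: bytes, rounds: int) -> str:
    """Synchronous work that would stall the event loop if called directly."""
    digest = payload
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


async def handle_command(payload: bytes) -> str:
    # Run the blocking function in a worker thread; the loop stays free
    # to ack heartbeats and serve other interactions in the meantime.
    return await asyncio.to_thread(expensive_digest, payload, 1000)
```

Threads help most with blocking I/O and C-level work that releases the GIL. Pure-Python CPU-bound code still competes for the GIL, so a process pool may be the better fit there.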

Rate-limit hits

What it measures: count of 429 Too Many Requests responses the bot received from the Discord REST API in this heartbeat window.

Normal range: 0. Discord's rate-limit buckets are generous, so a well-behaved bot rarely hits them.

When to worry: any non-zero number is unusual.

  • Spike + drop → one route saw a burst (e.g. a broadcast). Throttle the sender.
  • Sustained > 5/min → unintended retry loop hitting the same rate limit. Stop retrying 4xx errors.
  • Climbing on startup → registering commands on every restart. Move to a deploy-time script.
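
A retry policy that avoids the retry-loop pattern might look like the following. This is one reading of the advice above, not a documented CloudLine or Discord algorithm: wait as instructed on 429, back off on 5xx, and give up on every other 4xx. The function name, the attempt cap, and the 1-second fallback are assumptions.

```python
from typing import Optional


def retry_delay_s(
    status: int,
    retry_after_header: Optional[str],
    attempt: int,
    max_attempts: int = 3,
) -> Optional[float]:
    """Return how many seconds to wait before retrying, or None to give up."""
    if attempt >= max_attempts:
        return None
    if status == 429:
        # Discord sends a Retry-After header on 429s; honor it rather than retrying at once.
        try:
            return max(0.0, float(retry_after_header))
        except (TypeError, ValueError):
            return 1.0  # assumed fallback when the header is missing or malformed
    if 500 <= status < 600:
        return min(2.0 ** attempt, 30.0)  # capped exponential backoff
    return None  # other 4xx: the request itself is wrong, retrying will not help
```

A 404 or 403 returns None on the first attempt, so a bad request cannot spin into a stream of further requests against the same bucket.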

Uptime

What it measures: how long the bot process has been running, in seconds. Measured with process.uptime() (Node) or time.monotonic() (Python).

Normal range: depends entirely on your hosting. A hobby bot on free hosting might restart every few hours; a production bot might run for weeks.

When to watch: a sudden drop to 0 means a restart happened. The seq (heartbeat counter) also resets, which is how CloudLine detects restarts independently.
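
The restart signal can be illustrated with a small check over consecutive heartbeats. The (seq, uptime) tuple shape is an assumption made for this sketch; the real heartbeat payload and CloudLine's detection logic are not described on this page.

```python
def detect_restarts(heartbeats: list) -> list:
    """Return indexes of heartbeats that indicate a restart.

    heartbeats is a list of (seq, uptime_seconds) tuples in arrival order
    (a hypothetical shape). A restart is flagged when either value goes backwards.
    """
    restarts = []
    for i in range(1, len(heartbeats)):
        prev_seq, prev_uptime = heartbeats[i - 1]
        seq, uptime = heartbeats[i]
        if seq < prev_seq or uptime < prev_uptime:
            restarts.append(i)
    return restarts
```

Checking both values means a restart is still caught if only one of them is available or trustworthy.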

Custom metrics

User-pushed values via monitor.gauge() / monitor.counter(). Each appears as its own tile under the Custom metrics section.

What "normal" looks like is entirely up to you — the metric is yours. See Custom metrics for the API and common patterns.

Reading the sparkline

Every tile with enough data shows a small sparkline trend. The window matches your selected date range:

  • Last 1 hour → per-minute resolution
  • Last 24 hours → per-5-minute resolution
  • Last 7 days → per-hour resolution
  • Longer ranges → fall back to the 5-minute aggregate, then daily aggregate. See Retention policy.
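
To show what a change in resolution means for the data, the sketch below groups raw samples into fixed-width buckets and averages each one. Averaging is an assumption; CloudLine's aggregates may use a different function (max, last value, and so on).

```python
from collections import defaultdict


def downsample(points: list, bucket_s: int) -> list:
    """Average (unix_timestamp, value) samples into buckets of bucket_s seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = int(ts // bucket_s) * bucket_s
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

Calling downsample(points, 60) yields per-minute points and downsample(points, 300) per-5-minute points. Short spikes flatten as the bucket widens, which is why a blip visible in the 1-hour view can disappear in the 7-day view.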

The sparkline color is steady when the metric is in a healthy band, and shifts toward warning / danger as the value approaches alert thresholds.

What's hidden when

Tiles with no data show a placeholder instead of a value. The most common reasons:

Tile | Hidden because...
RAM / CPU | Python without the [metrics] extra, or a runtime without process.memoryUsage().
Slash / Component / Autocomplete | No interactions of that type have fired in the current window.
Latency | Gateway hasn't acked the first heartbeat yet (clears within 30 s).
Event-loop lag | Deno without --unstable-node-builtins.
Shards | Single-process bot (not sharded).

See No data showing for the full diagnostic flow.