Metrics explained
Every Telemetry tile — what it measures, what "normal" looks like for a Discord bot, and when to actually worry.
The Telemetry panel on each bot's detail page shows a tile for every metric the SDK reports. Each tile has a current value, sometimes a sparkline trend, and sometimes a horizontal bar when the metric has a real ceiling.
This page goes tile-by-tile.
RAM
What it measures: process resident-set size (RSS) in megabytes — the physical RAM the bot's process is currently using.
Normal range: depends on language and bot complexity. Rough guides:
- discord.js bots: 80–200 MB
- discord.py bots: 50–150 MB
- Music bots with active voice connections: 300–800 MB
- Bots with large guild caches (10k+ guilds): 500 MB – 2 GB
When to worry:
- Steady upward growth across days → memory leak. Find the unbounded data structure.
- Sudden spike that stays elevated → stuck object cache. Check whether a cache is missing eviction.
- Spike + return to baseline → routine GC cycle, ignore.
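The three patterns above can be told apart from a chronological series of RSS samples. The following is a rough sketch and not CloudLine's detection logic: the function name `classify_rss_trend`, the 20% tolerance, and the split into thirds are all assumptions made for illustration.

```python
from statistics import mean


def classify_rss_trend(samples_mb: list[float], tolerance: float = 0.20) -> str:
    """Classify chronological RSS samples (MB) into one of the patterns above.

    Hypothetical helper: the name, the thirds split and the tolerance are
    illustrative assumptions, not part of any SDK.
    """
    if len(samples_mb) < 6:
        return "insufficient-data"

    third = len(samples_mb) // 3
    start = mean(samples_mb[:third])
    middle = mean(samples_mb[third:2 * third])
    end = mean(samples_mb[-third:])
    peak = max(samples_mb)

    def rose(before: float, after: float) -> bool:
        # "Rose" means grew by more than the tolerance, to ignore small jitter.
        return after > before * (1 + tolerance)

    if rose(start, middle) and rose(middle, end):
        return "steady-growth"    # keeps climbing: suspect a leak
    if rose(start, end):
        return "sustained-step"   # jumped and stayed: suspect a cache without eviction
    if rose(start, peak):
        return "transient-spike"  # went up and came back: usually routine
    return "flat"
```

A heuristic this crude will misjudge some series (a step that lands mid-window can look like growth), so treat the result as a hint for where to look rather than a verdict.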
CPU
What it measures: process CPU usage in percent, relative to a single core. 100% means one core fully saturated. Values are clamped server-side to the 0–100 range, so a process busy across several cores still reports 100, not 400 — read 100 as "pegged" rather than as a precise multi-core figure.
Normal range: 1–5% idle. Brief spikes to 30–50% during slash-command bursts are routine.
When to worry:
- Sustained > 50% → a hot loop or unbounded recursion. Profile the process.
- Steady climb → something is accumulating work without releasing it (event handler installed twice, recursive listener).
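To make the single-core scale and the 0–100 clamp concrete, here is a minimal sketch of how such a percentage could be derived from two readings of CPU time and wall-clock time. The function name `cpu_percent` and its parameters are hypothetical; the SDK's actual sampling code may differ.

```python
def cpu_percent(cpu_seconds_delta: float, wall_seconds_delta: float) -> float:
    """CPU usage relative to one core, clamped to the 0-100 range.

    Hypothetical helper for illustration only.
    """
    if wall_seconds_delta <= 0:
        return 0.0
    raw = 100.0 * cpu_seconds_delta / wall_seconds_delta
    # A process busy on four cores yields raw == 400; the clamp reports 100.
    return max(0.0, min(100.0, raw))
```

In Python the two deltas could be taken from successive calls to `time.process_time()` and `time.monotonic()`. For example, 0.5 s of CPU time over a 10 s window gives 5.0, while 40 s of CPU time over the same window (four busy cores) gives 100.0.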
Gateway
What it measures: state of the Discord WebSocket gateway connection.
- OK: gateway is healthy and acking heartbeats.
- Zombie: process is alive but gateway is stale (no events arriving). Bot won't serve any commands. See Zombie state.
- Shard warning: for sharded bots, one or more shards are disconnected.
Two related sub-alerts can be enabled on the Alerts tab: zombie (fires when gateway goes stale) and shard_down (fires when shard count drops below expected).
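The three states can be read as a simple decision over two inputs: how long ago the last gateway event arrived, and how many shards are connected. The sketch below is an illustration under assumptions: the function name, the parameter names, and the 60-second staleness threshold are hypothetical and not taken from CloudLine.

```python
def gateway_state(seconds_since_last_event: float,
                  connected_shards: int,
                  expected_shards: int,
                  stale_after: float = 60.0) -> str:
    """Map raw gateway observations onto the tile states described above.

    Hypothetical helper; stale_after is an assumed threshold.
    """
    if seconds_since_last_event > stale_after:
        return "zombie"          # process alive, but no events are arriving
    if connected_shards < expected_shards:
        return "shard_warning"   # at least one shard is disconnected
    return "ok"
```

Staleness is checked first because a zombie bot serves no commands at all, which is the more severe condition.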
Latency (gateway ping)
What it measures: round-trip time of the Discord gateway heartbeat ack, in milliseconds.
Normal range: 30–80 ms. Above 150 ms suggests either Discord-side congestion or a degraded connection between your host and Discord.
When to worry:
- Sustained > 200 ms → your host's network to Discord is slow. Could be: shared hosting with noisy neighbors, a VPS in a region far from Discord's nearest gateway, or a misconfigured firewall adding latency.
- Climbing then dropping back to normal → Discord-side hiccup, no action needed.
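The difference between "sustained" and "climbing then dropping back" comes down to how many consecutive samples sit above the threshold. A minimal sketch, assuming one latency sample per heartbeat; the function names and the choice of five consecutive samples are illustrative assumptions.

```python
def longest_run_above(samples_ms: list[float], threshold_ms: float) -> int:
    """Length of the longest streak of consecutive samples above the threshold."""
    longest = current = 0
    for value in samples_ms:
        current = current + 1 if value > threshold_ms else 0
        longest = max(longest, current)
    return longest


def is_sustained(samples_ms: list[float],
                 threshold_ms: float = 200.0,
                 min_consecutive: int = 5) -> bool:
    """True when latency stayed above the threshold for min_consecutive samples.

    Hypothetical helper; min_consecutive is an assumed value.
    """
    return longest_run_above(samples_ms, threshold_ms) >= min_consecutive
```

With these defaults, a series such as 50, 250, 60, 260, 50 is treated as a hiccup, while five or more readings in a row above 200 ms count as sustained.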
Slash p50 / p95
What it measures: end-to-end latency of slash commands, context menus, and modal submits, from the user's click to the bot's first response (reply, deferReply, or showModal).
Normal range: under 500 ms for stateless commands, under 1.5 s for DB-backed commands.
When to worry:
- p50 > 1 s → the typical command is slow. Probably a synchronous DB query or third-party API call on every command.
- p95 > 3 s → at least 1 in 20 commands is taking longer than Discord's 3-second interaction timeout. Users see "interaction failed."
- p95 ≫ p50 (e.g. 4× higher) → a specific command or edge case is slow but most are fine. Find and profile that command.
See Slow slash commands for the fix flow.
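For readers who want to reproduce the p50 / p95 checks on their own latency logs, here is a minimal sketch using the nearest-rank percentile method. The percentile method, the function names, and the warning labels are assumptions; the dashboard may compute its percentiles differently.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slash_warnings(latencies_ms: list[float]) -> tuple[float, float, list[str]]:
    """Apply the three rules of thumb above. Labels are hypothetical."""
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    warnings = []
    if p50 > 1000:
        warnings.append("p50-slow")       # the typical command is slow
    if p95 > 3000:
        warnings.append("p95-timeout")    # tail crosses the 3-second limit
    if p95 >= 4 * p50:
        warnings.append("wide-tail")      # a few commands are much slower than the rest
    return p50, p95, warnings
```

For example, eighteen commands at 100 ms plus one at 900 ms and one at 3500 ms give p50 = 100 and p95 = 900, which trips only the wide-tail rule: most commands are fine, and one or two deserve profiling.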
Component p50 / p95
What it measures: button and select-menu response latency. Same measurement contract as slash, just for component interactions (update, deferUpdate).
Normal range: same as slash — under 500 ms is the goal.
Autocomplete p50 / p95
What it measures: slash-command autocomplete responsiveness. Discord cuts autocomplete off after 3 seconds.
Normal range: under 200 ms for in-memory completions; under 1 s for DB-backed lookups.
When to worry: > 1 s p95. Autocomplete fires per keystroke — what looks like one slow query is actually one query per character the user types. Cache hard, index your search columns.
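Because every keystroke triggers a lookup, caching by normalized prefix removes most repeated work. The sketch below uses `functools.lru_cache` from the standard library; the `CITIES` tuple stands in for a database table, and `slow_lookup`, `suggest`, and `on_keystrokes` are hypothetical names used only for this illustration.

```python
from functools import lru_cache

# Stand-in for an indexed database table (assumption for the example).
CITIES = ("Berlin", "Bern", "Bergen", "Boston", "Lisbon")


def slow_lookup(prefix: str) -> tuple[str, ...]:
    """Pretend this is the expensive query that runs on a cache miss."""
    return tuple(city for city in CITIES if city.lower().startswith(prefix))


@lru_cache(maxsize=1024)
def suggest(prefix: str) -> tuple[str, ...]:
    """Cached suggestions for an already-normalized prefix (at most 25 choices)."""
    return slow_lookup(prefix)[:25]


def on_keystrokes(text: str) -> list[tuple[str, ...]]:
    """Simulate one autocomplete request per character the user types."""
    normalized = text.strip().lower()
    return [suggest(normalized[:i]) for i in range(1, len(normalized) + 1)]
```

Typing "Ber" issues three lookups ("b", "be", "ber"). Typing it again is served entirely from the cache, which `suggest.cache_info()` reports as three hits. Normalizing before the cached call matters: otherwise "Ber" and "ber" would occupy separate cache entries.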
Event-loop lag
What it measures: how late the event loop runs scheduled timers because it was blocked, as reported by Node's perf_hooks (or the asyncio scheduler delay on Python), averaged over the heartbeat window.
Normal range: 0–5 ms idle. 20–50 ms brief spikes during GC.
When to worry:
- Sustained > 100 ms → main thread is busy doing synchronous work. Find the blocking operation (regex, JSON parse, sync I/O).
- Spikes correlate with p95 spikes → the blocking work is happening during command handling. Defer the work to a worker thread / asyncio.to_thread.
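The effect of blocking work on scheduler delay, and of moving it to `asyncio.to_thread`, can be observed with a small experiment. This is a sketch under assumptions: `measure_lag` approximates lag as how late `asyncio.sleep` wakes up, which is one common approach and not necessarily how the SDK measures it, and `blocking_work` uses `time.sleep` as a stand-in for synchronous I/O or heavy parsing.

```python
import asyncio
import time


async def measure_lag(interval: float = 0.01, samples: int = 5) -> float:
    """Average scheduler delay in seconds: how late each sleep wakes up."""
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        total += max(0.0, time.perf_counter() - start - interval)
    return total / samples


def blocking_work(duration: float) -> None:
    """Stand-in for sync I/O, a heavy regex or a large JSON parse."""
    time.sleep(duration)


async def lag_while_blocking(duration: float, offload: bool) -> float:
    """Measure lag while blocking work runs inline or in a worker thread."""
    monitor = asyncio.create_task(measure_lag())
    await asyncio.sleep(0)  # yield once so the monitor starts its first sleep
    if offload:
        await asyncio.to_thread(blocking_work, duration)  # loop stays free
    else:
        blocking_work(duration)  # loop is frozen for the whole duration
    return await monitor
```

Running `asyncio.run(lag_while_blocking(0.2, offload=False))` should report a noticeably higher average than the same call with `offload=True`, because in the inline case the monitor's first timer cannot fire until the blocking call returns. Exact numbers depend on the machine and timer resolution.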
Rate-limit hits
What it measures: count of 429 Too Many Requests responses the bot received from the Discord REST API in this heartbeat window.
Normal range: 0. Discord's rate-limit buckets are generous, so a well-behaved bot rarely hits them.
When to worry: any non-zero number is unusual.
- Spike + drop → one route saw a burst (e.g. a broadcast). Throttle the sender.
- Sustained > 5/min → unintended retry loop hitting the same rate limit. Stop retrying 4xx errors.
- Climbing on startup → registering commands on every restart. Move to a deploy-time script.
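A retry policy that avoids the loop described above can be expressed as a single function: wait as long as Discord asks on a 429, back off on server errors, and give up on every other client error. This is a sketch, not the behavior of discord.js or discord.py (both handle rate limits internally); the function name, the attempt cap, and the backoff values are assumptions.

```python
from typing import Optional


def retry_delay(status: int,
                retry_after: Optional[float] = None,
                attempt: int = 1,
                max_attempts: int = 3) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to stop retrying.

    Hypothetical helper; max_attempts and the backoff cap are assumed values.
    """
    if attempt >= max_attempts:
        return None
    if status == 429:
        # Discord sends the wait time with the 429; fall back to 1 s if it is missing.
        return retry_after if retry_after is not None else 1.0
    if 500 <= status < 600:
        return min(30.0, 2.0 ** attempt)  # exponential backoff, capped
    return None  # other 4xx: the request itself is wrong, retrying will not help
```

The attempt cap is what prevents an unintended loop: even a request that keeps returning 429 is abandoned after a bounded number of tries.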
Uptime
What it measures: how long the bot process has been running, in seconds. Reported by process.uptime() (Node) or derived from time.monotonic() (Python).
Normal range: depends entirely on your hosting. A hobby bot on free hosting might restart every few hours; a production bot might run for weeks.
When to watch: sudden drop to 0 means a restart happened. The seq (heartbeat counter) also resets, which is how CloudLine detects restarts independently.
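Restart detection from two consecutive heartbeats can be sketched as follows. The `Heartbeat` shape and its field names are assumptions for illustration; only the idea that both uptime and seq reset on restart comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Hypothetical heartbeat shape; field names are assumed."""
    seq: int          # heartbeat counter, resets when the process restarts
    uptime_s: float   # seconds since the process started


def restarted(previous: Heartbeat, current: Heartbeat) -> bool:
    """True when either signal went backwards between two heartbeats."""
    return current.seq < previous.seq or current.uptime_s < previous.uptime_s
```

Checking both signals means a restart is still caught if one of them is missing or unreliable on a given runtime.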
Custom metrics
User-pushed values via monitor.gauge() / monitor.counter(). Each appears as its own tile under the Custom metrics section.
What "normal" looks like is entirely up to you — the metric is yours. See Custom metrics for the API and common patterns.
Reading the sparkline
Every tile with enough data shows a small sparkline trend. The window matches your selected date range:
- Last 1 hour → per-minute resolution
- Last 24 hours → per-5-minute resolution
- Last 7 days → per-hour resolution
- Longer ranges → fall back to the 5-minute aggregate, then daily aggregate. See Retention policy.
The sparkline color is steady when the metric is in a healthy band, and shifts toward warning / danger as the value approaches alert thresholds.
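To illustrate what the per-range resolutions mean for the data behind a sparkline, the sketch below groups raw samples into fixed-width time buckets. It covers only the three ranges with a stated resolution; the range keys, the function name, and the use of a mean as the aggregate are assumptions, since the aggregation CloudLine applies is not specified here.

```python
from collections import defaultdict
from statistics import mean

# Bucket width in seconds for each range listed above (keys are hypothetical).
RESOLUTION_SECONDS = {"1h": 60, "24h": 300, "7d": 3600}


def downsample(points: list[tuple[float, float]],
               range_key: str) -> list[tuple[int, float]]:
    """Average (timestamp_seconds, value) samples into fixed-width buckets.

    Returns (bucket_start, mean_value) pairs in chronological order.
    """
    step = RESOLUTION_SECONDS[range_key]
    buckets: dict[int, list[float]] = defaultdict(list)
    for timestamp, value in points:
        buckets[int(timestamp // step) * step].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

The same four samples produce three points at per-minute resolution but only two at per-5-minute resolution, which is why short spikes that are visible in the 1-hour view can flatten out in the 24-hour view.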
What's hidden when
Tiles with no data show —. The most common reasons:
| Tile | — because... |
|---|---|
| RAM / CPU | Python without [metrics] extra, or runtime without process.memoryUsage(). |
| Slash / Component / Autocomplete | No interactions of that type have fired in the current window. |
| Latency | Gateway hasn't acked the first heartbeat yet (clears within 30 s). |
| Event-loop lag | Deno without --unstable-node-builtins. |
| Shards | Single-process bot (not sharded). |
See No data showing for the full diagnostic flow.