How monitoring works
From the heartbeat your bot sends to the green dot on your dashboard — the full data flow, in plain language.
CloudLine has two jobs: decide whether your bot is alive, and store enough history to show you uptime, incident timelines, and trends. This page walks the flow end-to-end so you know what's actually happening behind the dashboard.
The big picture

    Your bot                 CloudLine (Cloudflare)                   Your dashboard
    ─────────                ──────────────────────                   ──────────────
                             ┌─────────────────┐
    SDK ──── HTTP POST ────► │ Worker API      │ ───┐
    every 30s (default)      └─────────────────┘    │
                                                    ▼
                             ┌─────────────────┐  ┌────────────┐
                             │ BotMonitor      │◄─│ D1 row     │
                             │ Durable Object  │  │ per bot    │
                             │ (1 per bot)     │  └────────────┘
                             └────────┬────────┘
                                      │ status change
                                      ▼
                             ┌─────────────────┐
                             │ NotificationHub │ ──── WebSocket ────► your browser
                             │ (1 per user)    │      (sub-second update)
                             └─────────────────┘

Three components, all running on Cloudflare:
- The Worker API — receives heartbeats. Stateless. Validates the request and forwards to the right Durable Object.
- The BotMonitor Durable Object — one instance per bot. Holds the bot's monitoring state and decides whether it's online, degraded, or offline.
- The NotificationHub Durable Object — one instance per user. Holds a WebSocket connection to your dashboard tabs and pushes status changes in real time.
Step 1 — Your bot sends a heartbeat
The SDK (or your raw-fetch loop) sends a POST /api/bots/{botId}/heartbeat every N seconds. The interval depends on your plan:
| Plan | Default interval |
|---|---|
| Starter | 60 s |
| Pro | 30 s |
| Business | 10 s (configurable: 10 / 20 / 30 / 45 / 60 s) |
The body is JSON with whatever metrics the bot can measure (RAM, CPU, gateway ping, slash percentiles, custom metrics). Every field is optional — missing fields are sent as null and just leave the corresponding tile blank.
Authentication is a Bearer header with your bot's clb_live_… heartbeat secret. See Heartbeat secret.
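For a raw-fetch loop, the request described above could be assembled roughly like this. It is a sketch, not the SDK's code: the metric field names (ramMb, cpuPercent, gatewayPingMs) and the base URL passed in by the caller are placeholders that do not come from this page. Only the path, the POST method, and the Bearer scheme are taken from the text.

```typescript
// Hypothetical metric names; the real field list is not given on this page.
interface HeartbeatMetrics {
  ramMb?: number;
  cpuPercent?: number;
  gatewayPingMs?: number;
}

interface HeartbeatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildHeartbeatRequest(
  baseUrl: string,
  botId: string,
  secret: string,
  metrics: HeartbeatMetrics = {},
): HeartbeatRequest {
  return {
    url: `${baseUrl}/api/bots/${encodeURIComponent(botId)}/heartbeat`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${secret}`,
      "Content-Type": "application/json",
    },
    // Every field is optional: anything the bot cannot measure goes out as null.
    body: JSON.stringify({
      ramMb: metrics.ramMb ?? null,
      cpuPercent: metrics.cpuPercent ?? null,
      gatewayPingMs: metrics.gatewayPingMs ?? null,
    }),
  };
}
```

The result can be handed to fetch (url first, then method, headers, and body as the options) on a timer set to your plan's interval.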
Step 2 — The Worker accepts the request
The Worker API runs on every Cloudflare edge location, so the request typically lands at the data center closest to your bot's host. It does three things:
- Per-IP rate limit — protects against an attacker spreading attempts across many bot IDs.
- Sanity-clamp the metrics — a misconfigured snippet reporting nanoseconds instead of milliseconds can't poison your dashboard with absurd numbers. Out-of-range values are silently dropped to null. The bot's "I'm alive" signal goes through even when the metrics are garbage.
- Forward to the BotMonitor DO for this specific bot.
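The sanity-clamp step can be pictured as a small filter that keeps in-range numbers and turns everything else into null. This is an illustrative sketch only: the field names and bounds in LIMITS are made up for the example, and the actual ranges CloudLine applies are not documented here.

```typescript
// Returns the value if it is a finite number inside [min, max], otherwise null.
function clampToNull(value: unknown, min: number, max: number): number | null {
  if (typeof value !== "number" || !Number.isFinite(value)) return null;
  return value >= min && value <= max ? value : null;
}

// Hypothetical fields and bounds, for illustration only.
const LIMITS = {
  gatewayPingMs: { min: 0, max: 60_000 },
  cpuPercent: { min: 0, max: 100 },
};

// Produces a cleaned copy; the heartbeat itself is never rejected here.
function sanitize(raw: Record<string, unknown>): Record<string, number | null> {
  const clean: Record<string, number | null> = {};
  for (const [field, { min, max }] of Object.entries(LIMITS)) {
    clean[field] = clampToNull(raw[field], min, max);
  }
  return clean;
}
```

With these example bounds, a ping reported in nanoseconds (say 42,000,000) comes out as null while a CPU reading of 37 passes through unchanged, so the "I'm alive" signal is unaffected by the bad metric.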
Step 3 — The BotMonitor validates and stores
The BotMonitor is a Durable Object — Cloudflare's name for a small server that always lives in the same place, holds its own state, and processes requests one at a time. There is exactly one BotMonitor per bot, so two heartbeats from the same bot never race each other.
The BotMonitor:
- Validates the heartbeat secret in constant time. A wrong secret returns 401 with the same shape as "bot not found" or "no secret stored" — you can't tell the three apart, which prevents bot-ID enumeration.
- Applies a per-bot rate limit, a second layer on top of the per-IP limit that protects one specific bot from being targeted directly.
- Updates lastHeartbeatAt in D1 (Cloudflare's SQLite at the edge).
- Appends a row to uptime_events with the current status + telemetry. This is the raw event log used for incident timelines and per-second charts.
- Updates the 5-minute rollup (uptime_buckets) in DO memory. The rollup is flushed to D1 once per 5-minute boundary, not once per heartbeat. At the Business plan's 10 s interval that is one write per 30 heartbeats, so roughly 30× fewer rollup writes to D1.
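Two of the steps above lend themselves to a short sketch: comparing the secret without leaking where the first mismatch is, and accumulating a 5-minute bucket in memory that is handed back for flushing only when a boundary is crossed. Both are simplified illustrations under assumed names (constantTimeEqual, Rollup, and the Bucket fields are invented here), not CloudLine's implementation. JavaScript runtimes also give no hard timing guarantees, so treat the comparison as the general idea only.

```typescript
// Visits every character of the longer input regardless of where inputs differ.
function constantTimeEqual(a: string, b: string): boolean {
  const len = Math.max(a.length, b.length);
  let diff = a.length ^ b.length;
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` turns that into 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

const BUCKET_MS = 5 * 60 * 1000;

interface Bucket {
  start: number; // bucket start, ms since epoch
  heartbeats: number;
  latencySum: number;
}

class Rollup {
  private current: Bucket | null = null;

  // Adds one heartbeat. Returns the finished bucket when this heartbeat
  // is the first one past a 5-minute boundary, otherwise null.
  record(nowMs: number, latencyMs: number): Bucket | null {
    const start = Math.floor(nowMs / BUCKET_MS) * BUCKET_MS;
    let bucket = this.current;
    let flushed: Bucket | null = null;
    if (bucket !== null && bucket.start !== start) {
      flushed = bucket;
      bucket = null;
    }
    if (bucket === null) bucket = { start, heartbeats: 0, latencySum: 0 };
    bucket.heartbeats += 1;
    bucket.latencySum += latencyMs;
    this.current = bucket;
    return flushed;
  }
}
```

In this sketch the caller writes the returned bucket to D1 whenever record returns a non-null value, which happens once per 5-minute window instead of once per heartbeat.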
Step 4 — The alarm tick decides status
In parallel with heartbeats, the BotMonitor wakes itself on a periodic alarm (Cloudflare's "DO Alarms API"). The alarm fires at the bot's checkInterval — same as your plan's heartbeat interval by default.
On each alarm tick:
- Read lastHeartbeatAt from D1.
- Compare it to the current time. If the gap is too large (more missed heartbeats than your offline threshold allows), flip status to offline.
- If newly offline, open an incident (incidents table) and dispatch an alert. If newly online after offline, close the incident and dispatch a recovery alert.
- Re-schedule the next alarm tick.
The alarm is the heartbeat-of-the-heartbeat: it lets CloudLine notice "no heartbeat for too long" even though there's no incoming request to trigger evaluation.
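The decision made on each tick can be sketched as a pure function. This is a simplified reading of the steps above, not the production logic: it covers only online and offline (the degraded state is left out), and the names missedBeforeOffline, evaluateTick, and the action labels are invented for the example.

```typescript
type Status = "online" | "offline";
type Action = "open_incident" | "close_incident" | "none";

interface TickInput {
  lastHeartbeatAt: number; // ms since epoch, as read from D1
  now: number; // ms since epoch
  checkIntervalMs: number;
  missedBeforeOffline: number; // hypothetical name for the offline threshold
  previous: Status;
}

interface TickResult {
  status: Status;
  action: Action;
  nextAlarmAt: number;
}

function evaluateTick(t: TickInput): TickResult {
  // How many whole intervals have passed without a heartbeat.
  const missed = Math.floor((t.now - t.lastHeartbeatAt) / t.checkIntervalMs);
  const status: Status = missed > t.missedBeforeOffline ? "offline" : "online";

  let action: Action = "none";
  if (status === "offline" && t.previous === "online") action = "open_incident";
  if (status === "online" && t.previous === "offline") action = "close_incident";

  // The caller re-schedules the alarm for this time.
  return { status, action, nextAlarmAt: t.now + t.checkIntervalMs };
}
```

Inside a Durable Object, the returned nextAlarmAt would be the value passed to the storage API's setAlarm call, which is how the tick keeps re-arming itself without any incoming request.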
Step 5 — Status changes fan out to your dashboard
When the BotMonitor decides "online → offline" or "offline → online", it POSTs the new status to the NotificationHub for the bot's owner. The NotificationHub holds an open WebSocket to every dashboard tab the user has open.
The dashboard gets a sub-second push. No polling. If you're looking at a bot when it goes offline, you see the status flip immediately.
When you close all your dashboard tabs, the WebSocket disconnects and the NotificationHub goes idle (Cloudflare's WebSocket Hibernation API — no compute charge while idle).
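The fan-out itself is conceptually simple: keep a set of open sockets per user and send each status change to all of them. The sketch below uses a minimal SocketLike interface in place of a real WebSocket, and the message shape (botId plus status) is an assumption, not CloudLine's wire format.

```typescript
// Minimal stand-in for a WebSocket; only send() is needed here.
interface SocketLike {
  send(data: string): void;
}

class Hub {
  private sockets = new Set<SocketLike>();

  connect(socket: SocketLike): void {
    this.sockets.add(socket);
  }

  disconnect(socket: SocketLike): void {
    this.sockets.delete(socket);
  }

  // True when no dashboard tab is connected.
  get idle(): boolean {
    return this.sockets.size === 0;
  }

  // Sends the status change to every open tab; returns how many received it.
  broadcast(botId: string, status: string): number {
    const message = JSON.stringify({ botId, status });
    let delivered = 0;
    for (const socket of this.sockets) {
      try {
        socket.send(message);
        delivered++;
      } catch {
        // A socket that throws on send is treated as gone.
        this.sockets.delete(socket);
      }
    }
    return delivered;
  }
}
```

One hub per user keeps the broadcast loop short: it only ever iterates over that user's own tabs.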
What's stored, and for how long
CloudLine keeps its data in a handful of D1 tables. Some are written on every heartbeat, others only when a status changes or an alert goes out:
| Table | What it stores | Retention |
|---|---|---|
| bots | The bot row: secret, name, lastHeartbeatAt, current status. | Forever (until you delete the bot). |
| uptime_events | One row per heartbeat: status + telemetry. | 7 days (Starter) / 30 days (Pro) / 90 days (Business). |
| uptime_buckets | 5-minute aggregates: uptime %, average latency, etc. | 7 days, then folded into uptime_daily. |
| uptime_daily | One row per day. Long-term archive for the charts. | 1 year (Starter) / 3 years (Pro) / 5 years (Business). |
| bot_telemetry | Detailed snapshot of every heartbeat's metrics. | Same as uptime_events. |
| incidents | One row per offline / recovery cycle. | Forever. |
| error_events | Errors reported via monitor.captureError(). | Same as uptime_events. |
| alert_history | Every alert sent, with delivery status. | Forever, paged in the UI. |
When you query a chart over a window that exceeds your plan's raw-event retention, CloudLine seamlessly falls back to the 5-minute aggregate (and then the daily aggregate for very long windows). You get the chart either way — the resolution just drops the further back you go.
Why this architecture
The "one DO per bot" model has a specific property: each bot's monitoring runs entirely in one place, and no two bots share state, even within the same user's fleet. A noisy bot on someone else's account cannot affect your monitoring latency, and a user with 100 bots gets 100 small actors that run independently in parallel.
It also means CloudLine scales to a lot of bots without changing anything — Cloudflare's runtime handles spinning up new BotMonitor instances on demand. There's no central queue, no shared scheduler, no head-of-line blocking from any one bot's slow heartbeats.
What happens when CloudLine has an outage
If our Worker is down, your heartbeats get 5xx responses. The SDK retries twice with backoff (250 ms, 500 ms), then logs a warning and tries again on the next interval. Your bot is unaffected — the SDK is fire-and-forget and never blocks the bot's main work.
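The retry behaviour described above (one attempt, then two retries after 250 ms and 500 ms) can be sketched as follows. This is not the SDK's source: sendWithRetry and its parameters are illustrative names, and treating any status below 500 as "do not retry" is an assumption. The sleep function is injected so the schedule can be exercised without real timers.

```typescript
type Send = () => Promise<number>; // resolves to the HTTP status code
type Sleep = (ms: number) => Promise<void>;

// Returns true once a send succeeds, false when every attempt failed.
async function sendWithRetry(
  send: Send,
  sleep: Sleep,
  backoffMs: number[] = [250, 500],
): Promise<boolean> {
  // The leading 0 means the first attempt goes out immediately.
  for (const wait of [0, ...backoffMs]) {
    if (wait > 0) await sleep(wait);
    try {
      if ((await send()) < 500) return true;
    } catch {
      // Network error: treat it like a 5xx and move on to the next attempt.
    }
  }
  // Out of retries: the caller logs a warning and waits for the next interval.
  return false;
}
```

In a real loop, sleep would wrap setTimeout in a promise, and the call would not be awaited by the bot's main work, which keeps the heartbeat fire-and-forget.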
When CloudLine comes back, the next successful heartbeat updates lastHeartbeatAt, and the next alarm tick marks the bot online again. There's no manual cleanup. If the gap crossed your offline threshold, the dashboard shows an incident for it.
We deploy fixes through Cloudflare's atomic Worker swap, so partial-deploy states aren't a thing. You either see the old Worker or the new one — never both.
Where to go next
- Quick start — set up your first bot in five steps.
- Metrics explained — what each tile means and what "normal" looks like.
- Reliability tiers — how the dashboard summarises a bot's health into a single label.
- Retention policy — what we keep, for how long, on which plan.