Reliability tiers

Excellent / Good / At-risk / Critical — what each label means, how it's computed, and why CloudLine's thresholds are Discord-bot-tuned instead of strict SRE.

The big colored label at the top of every bot's detail page — Excellent / Good / At-risk / Critical — is CloudLine's one-glance answer to "is this bot in good shape right now?" It's computed from your bot's uptime percentage over the currently selected time window.

The four tiers

| Tier | Uptime | Approximate downtime / 30-day month | Meaning |
|---|---|---|---|
| Excellent | ≥ 99% | up to 7.2 hours | Well-maintained Discord bot territory. Pro hosting, careful deploys. |
| Good | ≥ 97% | up to 22 hours | Typical hobby or personal bot. Hits restart cycles + deploys + occasional ISP blips. |
| At-risk | ≥ 93% | up to 50 hours | Frequent incidents. Worth investigating hosting + deploy stability. |
| Critical | < 93% | over 50 hours | Sustained issues: the bot is functionally broken for end users. |

The thresholds are static. The label updates whenever you change the date range — pick "Last 7 days" and the tier reflects the last 7 days; pick "This month" and it reflects this month.
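As a minimal sketch of the mapping in the table above, the function below turns an uptime percentage into a tier label. The function name `reliability_tier` is hypothetical and is not part of any CloudLine API; only the thresholds come from the table.

```python
def reliability_tier(uptime_percent: float) -> str:
    """Map an uptime percentage (0-100) to a tier label.

    Thresholds are the static ones from the tier table; the function
    itself is an illustrative sketch, not CloudLine's implementation.
    """
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    if uptime_percent >= 99.0:
        return "Excellent"
    if uptime_percent >= 97.0:
        return "Good"
    if uptime_percent >= 93.0:
        return "At-risk"
    return "Critical"


print(reliability_tier(98.0))  # Good
```

Because the thresholds are static, changing the date range only changes the uptime value passed in, never the cut-offs.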

Why these specific numbers

CloudLine's thresholds are Discord-bot-tuned, deliberately more forgiving than strict SRE "nines" (99.9% / 99.99% / 99.999%).

A hobbyist bot on free hosting:

  • Restarts every few hours when the host puts it to sleep.
  • Loses a few minutes on each deploy.
  • Gets disconnected by ISP blips on the operator's home server.
  • Comes back fast each time, but every gap still counts as downtime.

A bot in that situation with 98% uptime over a month is completely normal. Labeling it "Critical" would be useless noise — the tier needs to differentiate "the bot is doing its job" from "the bot is actually broken." Our defaults do that.

If you're running a paid SaaS bot with a 99.9% SLA contract, you can pick a stricter target — see below. The reliability label uses the static tiers; the SLA budget uses your chosen target.

How uptime is calculated

Uptime is time-weighted, not event-weighted:

uptime % = online_seconds ÷ total_observed_seconds × 100

A bot that was offline for 1 hour out of 24 has 95.83% uptime, regardless of whether that hour was one continuous outage or sixty 1-minute blips.
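To make the time-weighted point concrete, here is a small sketch that applies the formula to both scenarios. The helper name `uptime_percent` and the outage lists are illustrative assumptions.

```python
def uptime_percent(online_seconds: float, total_observed_seconds: float) -> float:
    """uptime % = online_seconds / total_observed_seconds * 100"""
    if total_observed_seconds <= 0:
        raise ValueError("total_observed_seconds must be positive")
    return online_seconds / total_observed_seconds * 100


DAY = 24 * 3600  # observed window: 24 hours, in seconds

one_outage = [3600]        # a single 1-hour outage
sixty_blips = [60] * 60    # sixty 1-minute blips

a = uptime_percent(DAY - sum(one_outage), DAY)
b = uptime_percent(DAY - sum(sixty_blips), DAY)

print(f"{a:.2f}% vs {b:.2f}%")  # 95.83% vs 95.83%
```

Only the total offline time enters the formula, so the number of separate outages has no effect on the result.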

"Online seconds" comes from the gaps between consecutive heartbeats. When a heartbeat arrives, the time since the previous heartbeat counts as "online" if that gap is within your offline threshold. A gap longer than the threshold counts as "offline."
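The gap rule can be sketched roughly as follows. This assumes heartbeats are plain timestamps in seconds and that a gap over the threshold counts as offline in full; CloudLine's exact handling of the boundary is not specified here, and `split_online_offline` is a hypothetical name.

```python
def split_online_offline(heartbeats: list[float], offline_threshold_s: float) -> tuple[float, float]:
    """Classify each gap between consecutive heartbeats as online or offline.

    Assumption: a gap longer than the threshold is counted entirely as offline.
    """
    online = 0.0
    offline = 0.0
    ordered = sorted(heartbeats)
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap <= offline_threshold_s:
            online += gap
        else:
            offline += gap
    return online, offline


# Heartbeats every 30 s, with one 600 s silence in the middle.
beats = [0, 30, 60, 90, 690, 720]
print(split_online_offline(beats, offline_threshold_s=90))  # (120.0, 600.0)
```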

Three numbers appear next to the reliability label, drawn from the same data:

MTTR — Mean Time To Recovery

The average length of incidents in the selected window.

MTTR = sum(incident durations) ÷ count(incidents)

Lower is better. Only closed incidents count — an ongoing outage doesn't have a recovery time yet, so it doesn't affect MTTR until it ends.

A bot with a 10-minute MTTR recovers quickly from problems (maybe it has good auto-restart). A bot with a 4-hour MTTR has long outages, which is usually a sign that nobody notices until the next business day.
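A minimal sketch of the MTTR rule, assuming each incident is a `(start, end)` pair of timestamps in seconds and that an ongoing incident has `end` set to `None`. That data shape and the name `mttr_seconds` are assumptions for illustration.

```python
from typing import Optional


def mttr_seconds(incidents: list[tuple[float, Optional[float]]]) -> Optional[float]:
    """Mean duration of closed incidents; ongoing ones (end is None) are skipped."""
    durations = [end - start for start, end in incidents if end is not None]
    if not durations:
        return None  # no closed incidents, so there is nothing to average
    return sum(durations) / len(durations)


incidents = [
    (0, 600),        # closed, 10 minutes
    (5000, 5600),    # closed, 10 minutes
    (9000, None),    # still ongoing: ignored until it ends
]
print(mttr_seconds(incidents))  # 600.0
```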

MTBF — Mean Time Between Failures

The average time between incidents starting in the window.

MTBF = total_window_duration ÷ count(incidents)

Higher is better. Less meaningful with very few incidents — a single incident in a 30-day window gives you "MTBF = 30 days" which doesn't really tell you anything.

A widening MTBF over multiple windows is the real signal: it means your bot is having outages less often over time.
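The MTBF formula is a single division. The sketch below adds one assumption: with zero incidents it returns `None` rather than dividing by zero. How the dashboard displays that case is not covered here, and `mtbf_seconds` is a hypothetical name.

```python
from typing import Optional

DAY = 24 * 3600  # seconds


def mtbf_seconds(window_duration_s: float, incident_count: int) -> Optional[float]:
    """MTBF = total_window_duration / count(incidents)."""
    if incident_count == 0:
        return None  # assumption: undefined when nothing failed in the window
    return window_duration_s / incident_count


print(mtbf_seconds(30 * DAY, 3) / DAY)  # 10.0 (days)
print(mtbf_seconds(30 * DAY, 1) / DAY)  # 30.0 (days): one incident just echoes the window length
```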

SLA budget

The downtime your target uptime allows over the window, with the remaining budget shown.

allowed_downtime = window_duration × (1 - sla_target)
remaining_budget = allowed_downtime - actual_downtime

With a 99% target on a 30-day window:

  • Allowed downtime: 30 × 24 × 0.01 = 7.2 hours (432 minutes)
  • If your bot was offline for 4 hours, your remaining budget is 3.2 hours.
  • If it was offline for 9 hours, your remaining budget is −1.8 hours (over-budget).

The dashboard shows remaining as a progress bar. Negative goes red.
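The two budget formulas can be checked with a short sketch. The function name `sla_budget_hours` is hypothetical, and the target is expressed as a fraction (0.99 for 99%).

```python
def sla_budget_hours(window_hours: float, sla_target: float, actual_downtime_hours: float) -> tuple[float, float]:
    """Return (allowed_downtime, remaining_budget) in hours.

    allowed_downtime = window_duration * (1 - sla_target)
    remaining_budget = allowed_downtime - actual_downtime
    """
    allowed = window_hours * (1 - sla_target)
    remaining = allowed - actual_downtime_hours
    return allowed, remaining


WINDOW = 30 * 24  # 30-day window, in hours

allowed, remaining = sla_budget_hours(WINDOW, 0.99, 4)
print(round(allowed, 1), round(remaining, 1))  # 7.2 3.2

_, over = sla_budget_hours(WINDOW, 0.99, 9)
print(round(over, 1))  # -1.8 (over budget)
```

The values are rounded for display because floating-point arithmetic leaves tiny residues in `1 - 0.99`.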

SLA target

The default SLA target is 99%, picked for the same reason as the tier thresholds — most Discord bots, even well-run ones, can't realistically hit 99.9%.

Pro and Business users can pick a stricter target from the picker on the Status panel:

  • 99% (default): 7.2 hours of budget per 30 days. Comfortable for hobby and small-team bots.
  • 99.5%: 3.6 hours of budget.
  • 99.9%: about 43 minutes of budget. Realistic for paid bots on pro hosting.
  • 99.95%: about 22 minutes of budget.
  • 99.99%: about 4 minutes of budget. SRE territory; needs redundancy.

Choosing a target only affects the SLA budget number. The Excellent / Good / At-risk / Critical tiers stay static — they're about absolute uptime, not your custom target.
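As a worked check of the budgets for each selectable target, this sketch computes the allowed downtime over a 30-day window in minutes. It uses `Decimal` so the percentages stay exact; the helper name `budget_minutes` is an assumption for illustration.

```python
from decimal import Decimal

WINDOW_MINUTES = Decimal(30 * 24 * 60)  # 30-day window = 43,200 minutes


def budget_minutes(target_percent: str) -> Decimal:
    """Allowed downtime in minutes for a target given as a percent string, e.g. "99.9"."""
    return WINDOW_MINUTES * (Decimal(100) - Decimal(target_percent)) / Decimal(100)


for target in ["99", "99.5", "99.9", "99.95", "99.99"]:
    minutes = budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} min ({minutes / 60:.2f} h)")
```

Each step from 99% toward 99.99% shrinks the budget sharply, which is why the stricter targets generally need redundancy rather than just careful deploys.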

What the tier doesn't tell you

The tier is a one-glance summary. It deliberately ignores:

  • What kind of downtime — three 5-minute deploys count the same as one 15-minute outage.
  • What time of day — being down for an hour at 3 AM affects very few users; being down for an hour at peak traffic affects everyone.
  • Whether you have an open SEV1 right now — the SEV indicator on the Incidents tab tells you that, separately.

For all of that, look at the Incidents list under the bot detail page. The reliability tier is the summary; the incident list is the story.

Not enough data

If you have very few heartbeats in the window (newly created bot, very short window), the tier shows Not enough data instead of a tier label. CloudLine doesn't extrapolate: a bot that's been up for 5 minutes hasn't been online long enough to claim 100% uptime.

Wait for at least a few hours of data, or widen the window, and the tier appears.