
Monitoring Basics: Know When Something Is Wrong

Learn what to monitor (uptime, performance, security), how to set alert thresholds, and how to avoid alert fatigue - the monitoring death spiral.

Last updated: January 26, 2026

The 60-second version

Monitoring is your early warning system. It should answer: Is it up? Is it slow? Is it secure? Without monitoring, you learn about problems when customers call. With monitoring, you fix problems before customers notice. This article explains the minimum viable monitoring setup that catches real issues without generating noise.

What this solves (in real business terms)

Scenario: Your website has been down for 2 hours. You find out when a customer posts on social media. Revenue lost: $5K. Reputation damage: Worse.

Without monitoring, you're flying blind. With basic monitoring, you get an alert 60 seconds after the site goes down and can start fixing it immediately.

What to monitor:

  1. Uptime: Is it reachable?
  2. Performance: Is it slow?
  3. Security: Are there failed login attempts?
  4. Capacity: Are we running out of disk/memory/CPU?
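The first two questions can be probed with a single scripted check. Below is a minimal sketch using only Python's standard library; the URL and the 3-second "slow" threshold are illustrative placeholders, not recommendations. (Security and capacity need other data sources: auth logs and host metrics.)

```python
import time
import urllib.request

def check_url(url, timeout=10, slow_threshold=3.0):
    """Answer 'is it up?' and 'is it slow?' for one URL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed = time.monotonic() - start
            return {
                "url": url,
                "up": 200 <= resp.status < 400,
                "seconds": round(elapsed, 2),
                "slow": elapsed > slow_threshold,
            }
    except Exception as exc:  # DNS failure, timeout, connection refused...
        return {"url": url, "up": False, "seconds": None,
                "slow": False, "error": str(exc)}

# Example (placeholder URL):
# check_url("https://example.com")
```

Running this from a machine inside your own network is not enough on its own; as noted below, checks should also come from an external location.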

What it costs (honest ranges)

Basic monitoring (small business): $20-100/month

  • External uptime monitoring: $10-30/month (5-10 websites)
  • System monitoring: Free (Prometheus, Zabbix) or $10-50/month (commercial)
  • Log aggregation: $0-50/month (under 1GB/day)
  • SMS alerts: $5-20/month

Mid-range (growing company): $200-600/month

  • APM (Application Performance Monitoring): $100-300/month
  • Infrastructure monitoring: $50-150/month
  • Log management: $50-200/month
  • On-call rotation tools: $10-50/user/month

Hidden costs:

  • Alert fatigue: If everything alerts, nothing matters
  • Time investigating false positives: 2-10 hours/week if poorly configured

What can go wrong

1. Monitoring everything with default thresholds "We set up monitoring and enabled all checks with default settings..."

  • Result: 50 alerts per day. Team ignores all of them. Real outage gets missed.
  • Prevention: Start with 5 critical alerts. Add more only after tuning thresholds.

2. Only monitoring from one location "Our monitoring system checks our website every minute..."

  • Result: Monitoring system loses power. No alerts. Website actually down for 4 hours.
  • Prevention: External monitoring from multiple geographic locations.

3. No runbooks for common alerts Alert fires. Team scrambles to figure out what it means and how to fix it.

  • Result: 30 minutes lost investigating before starting actual fix.
  • Prevention: Every alert has a runbook: What it means, how to diagnose, how to fix.

Vendor questions (copy/paste)

  1. "What happens if your monitoring service goes down?"
  2. "Can you monitor from multiple geographic regions?"
  3. "How long do you retain historical monitoring data?"
  4. "What alerting channels do you support? (Email, SMS, Slack, PagerDuty?)"
  5. "Can we set different thresholds for business hours vs. nights/weekends?"

Minimum viable implementation

Week 1: Monitor uptime

  • [ ] Sign up for external uptime monitoring service
  • [ ] Add your 5 most critical URLs:
    • Main website
    • Login page
    • API endpoint (if applicable)
    • Admin portal
    • Customer-facing application
  • [ ] Set check frequency: Every 1-5 minutes
  • [ ] Configure alerts: Email + SMS to 2 people
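A common refinement on the Week 1 setup is to alert only after several consecutive failed checks, so one dropped packet doesn't page anyone, and to alert once per incident rather than every minute. A sketch of that logic (contact addresses and the failure count are placeholders):

```python
ALERT_CONTACTS = ["oncall-primary@example.com",   # placeholder
                  "oncall-backup@example.com"]    # placeholder
FAILURES_BEFORE_ALERT = 3  # e.g. 3 checks x 1-minute interval ~= 3 minutes

class UptimeAlerter:
    """Track consecutive failures for one URL and decide when to notify."""

    def __init__(self, failures_before_alert=FAILURES_BEFORE_ALERT):
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, check_ok):
        """Feed one check result; return the contacts to notify (may be empty)."""
        if check_ok:
            self.consecutive_failures = 0
            self.alerted = False  # recovery re-arms the alert
            return []
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failures_before_alert
                and not self.alerted):
            self.alerted = True  # alert once per incident, not every minute
            return ALERT_CONTACTS
        return []
```

Hosted uptime services implement this for you (often called "confirmation checks"); the sketch just shows what to look for in their settings.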

Week 2: Monitor infrastructure

  • [ ] Install monitoring agent on systems
  • [ ] Monitor:
    • CPU usage (alert if > 80% for 10 minutes)
    • Memory usage (alert if > 85%)
    • Disk space (alert if < 20% free)
    • Network connectivity (alert if unreachable for 2 minutes)
  • [ ] Test each alert (artificially trigger it)
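The Week 2 thresholds are easiest to test (per the last checkbox above) if the evaluation is a pure function, separate from metric collection. A sketch, with the same thresholds as the list; in practice CPU and memory usually come from an agent (e.g. Prometheus node_exporter) rather than the standard library, and the CPU alert should require the breach to persist (the "for 10 minutes" above) before firing:

```python
import shutil

THRESHOLDS = {
    "cpu_percent_max": 80,        # sustained breach, not a single spike
    "memory_percent_max": 85,
    "disk_free_percent_min": 20,
}

def evaluate(metrics, thresholds=THRESHOLDS):
    """Return the list of threshold breaches for one snapshot of metrics."""
    alerts = []
    if metrics["cpu_percent"] > thresholds["cpu_percent_max"]:
        alerts.append("cpu")
    if metrics["memory_percent"] > thresholds["memory_percent_max"]:
        alerts.append("memory")
    if metrics["disk_free_percent"] < thresholds["disk_free_percent_min"]:
        alerts.append("disk")
    return alerts

def disk_free_percent(path="."):
    """The one metric the standard library can gather portably."""
    usage = shutil.disk_usage(path)
    return 100 * usage.free / usage.total
```

Testing an alert is then just calling `evaluate` with fake metrics, before you ever artificially stress a real machine.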

Week 3: Monitor application performance

  • [ ] Set up response time monitoring
  • [ ] Alert if response time > 3 seconds
  • [ ] Monitor error rates (alert if > 5% of requests fail)
  • [ ] Track database query performance (alert if queries > 1 second)
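The Week 3 checks work on a window of recent requests rather than single data points. A sketch using the thresholds above, where each request is recorded as `(duration_seconds, succeeded)`; using the 95th percentile rather than the maximum is one common choice so a single slow outlier doesn't page anyone:

```python
MAX_RESPONSE_SECONDS = 3.0
MAX_ERROR_RATE = 0.05  # 5% of requests failing

def app_alerts(requests):
    """Return which application-level alerts fire for a window of requests.

    Each request is a (duration_seconds, succeeded) tuple.
    """
    if not requests:
        return []
    alerts = []
    durations = sorted(d for d, _ in requests)
    # p95 instead of max: one slow outlier shouldn't trigger an alert
    p95 = durations[min(len(durations) - 1, int(0.95 * len(durations)))]
    if p95 > MAX_RESPONSE_SECONDS:
        alerts.append("slow_responses")
    error_rate = sum(1 for _, ok in requests if not ok) / len(requests)
    if error_rate > MAX_ERROR_RATE:
        alerts.append("high_error_rate")
    return alerts
```

APM products compute these for you; the sketch shows what their "error rate" and "response time" alerts are doing underneath.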

Week 4: Create runbooks

  • [ ] For each alert type, document:
    • What this alert means
    • How to diagnose the problem
    • Common causes and fixes
    • Escalation path if fix doesn't work
  • [ ] Test each runbook with a team member who didn't write it
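A fill-in template covering the four items above; the alert name, numbers, and escalation contact are placeholders to replace with your own:

```
Alert: disk_space_low                      (example)
What it means:   Less than 20% free disk on the affected host.
How to diagnose: Find which partition is full and what is growing
                 (logs? database? backups?).
Common fixes:    Rotate/compress logs; remove old backups; extend the volume.
Escalation:      If not resolved in 30 minutes, page <secondary on-call>.
```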

When to hire help

DIY-friendly if:

  • Under 10 systems/applications
  • Simple architecture (web system + database)
  • Can tolerate learning curve

Get professional help if:

  • Complex distributed systems
  • Multiple environments (dev, staging, prod)
  • Compliance requirements for monitoring
  • 24/7 operations required
  • Previous incidents due to lack of visibility

Warning signs:

  • You've had outages you didn't know about
  • Customers report problems before your team knows
  • No one checks monitoring dashboards regularly
  • Alerts go to email, but no one reads them
  • Last review of alert thresholds was 1+ year ago
