
Monitoring Basics: Know When Something Is Wrong

Learn what to monitor (uptime, performance, security), how to set alert thresholds, and how to avoid alert fatigue - the monitoring death spiral.

Last updated: January 26, 2026

The 60-second version

Monitoring is your early warning system. It should answer: Is it up? Is it slow? Is it secure? Without monitoring, you learn about problems when customers call. With monitoring, you fix problems before customers notice. This article explains the minimum viable monitoring setup that catches real issues without generating noise.

What this solves (in real business terms)

Scenario: Your website has been down for 2 hours. You find out when a customer posts on social media. Revenue lost: $5K. Reputation damage: Worse.

Without monitoring, you're flying blind. With basic monitoring, you get an alert 60 seconds after the site goes down and can start fixing it immediately.

What to monitor:

  1. Uptime: Is it reachable?
  2. Performance: Is it slow?
  3. Security: Are there failed login attempts?
  4. Capacity: Are we running out of disk/memory/CPU?
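The first two questions can be probed with a single scripted check. Below is a minimal sketch using only Python's standard library; the URL and the 3-second "slow" threshold are illustrative placeholders, not recommendations. (Security and capacity need other data sources: auth logs and host metrics.)

```python
import time
import urllib.request

def check_url(url, timeout=10, slow_threshold=3.0):
    """Answer 'is it up?' and 'is it slow?' for one URL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed = time.monotonic() - start
            return {
                "url": url,
                "up": 200 <= resp.status < 400,
                "seconds": round(elapsed, 2),
                "slow": elapsed > slow_threshold,
            }
    except Exception as exc:  # DNS failure, timeout, connection refused...
        return {"url": url, "up": False, "seconds": None,
                "slow": False, "error": str(exc)}

# Example (placeholder URL):
# check_url("https://example.com")
```

Running this from a machine inside your own network is not enough on its own; as noted below, checks should also come from an external location.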

What it costs (honest ranges)

Basic monitoring (small business): $20-100/month

  • External uptime monitoring: $10-30/month (5-10 websites)
  • System monitoring: Free (Prometheus, Zabbix) or $10-50/month (commercial)
  • Log aggregation: $0-50/month (under 1GB/day)
  • SMS alerts: $5-20/month

Mid-range (growing company): $200-600/month

  • APM (Application Performance Monitoring): $100-300/month
  • Infrastructure monitoring: $50-150/month
  • Log management: $50-200/month
  • On-call rotation tools: $10-50/user/month

Hidden costs:

  • Alert fatigue: If everything alerts, nothing matters
  • Time investigating false positives: 2-10 hours/week if poorly configured

What can go wrong

1. Monitoring everything with default thresholds "We set up monitoring and enabled all checks with default settings..."

  • Result: 50 alerts per day. Team ignores all of them. Real outage gets missed.
  • Prevention: Start with 5 critical alerts. Add more only after tuning thresholds.

2. Only monitoring from one location "Our monitoring system checks our website every minute..."

  • Result: Monitoring system loses power. No alerts. Website actually down for 4 hours.
  • Prevention: External monitoring from multiple geographic locations.

3. No runbooks for common alerts Alert fires. Team scrambles to figure out what it means and how to fix it.

  • Result: 30 minutes lost investigating before starting actual fix.
  • Prevention: Every alert has a runbook: What it means, how to diagnose, how to fix.

Vendor questions (copy/paste)

  1. "What happens if your monitoring service goes down?"
  2. "Can you monitor from multiple geographic regions?"
  3. "How long do you retain historical monitoring data?"
  4. "What alerting channels do you support? (Email, SMS, Slack, PagerDuty?)"
  5. "Can we set different thresholds for business hours vs. nights/weekends?"

Minimum viable implementation

Week 1: Monitor uptime

  • [ ] Sign up for external uptime monitoring service
  • [ ] Add your 5 most critical URLs:
    • Main website
    • Login page
    • API endpoint (if applicable)
    • Admin portal
    • Customer-facing application
  • [ ] Set check frequency: Every 1-5 minutes
  • [ ] Configure alerts: Email + SMS to 2 people
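A common refinement on the Week 1 setup is to alert only after several consecutive failed checks, so one dropped packet doesn't page anyone, and to alert once per incident rather than every minute. A sketch of that logic (contact addresses and the failure count are placeholders):

```python
ALERT_CONTACTS = ["oncall-primary@example.com",   # placeholder
                  "oncall-backup@example.com"]    # placeholder
FAILURES_BEFORE_ALERT = 3  # e.g. 3 checks x 1-minute interval ~= 3 minutes

class UptimeAlerter:
    """Track consecutive failures for one URL and decide when to notify."""

    def __init__(self, failures_before_alert=FAILURES_BEFORE_ALERT):
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, check_ok):
        """Feed one check result; return the contacts to notify (may be empty)."""
        if check_ok:
            self.consecutive_failures = 0
            self.alerted = False  # recovery re-arms the alert
            return []
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failures_before_alert
                and not self.alerted):
            self.alerted = True  # alert once per incident, not every minute
            return ALERT_CONTACTS
        return []
```

Hosted uptime services implement this for you (often called "confirmation checks"); the sketch just shows what to look for in their settings.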

Week 2: Monitor infrastructure

  • [ ] Install monitoring agent on systems
  • [ ] Monitor:
    • CPU usage (alert if > 80% for 10 minutes)
    • Memory usage (alert if > 85%)
    • Disk space (alert if < 20% free)
    • Network connectivity (alert if unreachable for 2 minutes)
  • [ ] Test each alert (artificially trigger it)
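The Week 2 thresholds are easiest to test (per the last checkbox above) if the evaluation is a pure function, separate from metric collection. A sketch, with the same thresholds as the list; in practice CPU and memory usually come from an agent (e.g. Prometheus node_exporter) rather than the standard library, and the CPU alert should require the breach to persist (the "for 10 minutes" above) before firing:

```python
import shutil

THRESHOLDS = {
    "cpu_percent_max": 80,        # sustained breach, not a single spike
    "memory_percent_max": 85,
    "disk_free_percent_min": 20,
}

def evaluate(metrics, thresholds=THRESHOLDS):
    """Return the list of threshold breaches for one snapshot of metrics."""
    alerts = []
    if metrics["cpu_percent"] > thresholds["cpu_percent_max"]:
        alerts.append("cpu")
    if metrics["memory_percent"] > thresholds["memory_percent_max"]:
        alerts.append("memory")
    if metrics["disk_free_percent"] < thresholds["disk_free_percent_min"]:
        alerts.append("disk")
    return alerts

def disk_free_percent(path="."):
    """The one metric the standard library can gather portably."""
    usage = shutil.disk_usage(path)
    return 100 * usage.free / usage.total
```

Testing an alert is then just calling `evaluate` with fake metrics, before you ever artificially stress a real machine.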

Week 3: Monitor application performance

  • [ ] Set up response time monitoring
  • [ ] Alert if response time > 3 seconds
  • [ ] Monitor error rates (alert if > 5% of requests fail)
  • [ ] Track database query performance (alert if queries > 1 second)
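The Week 3 checks work on a window of recent requests rather than single data points. A sketch using the thresholds above, where each request is recorded as `(duration_seconds, succeeded)`; using the 95th percentile rather than the maximum is one common choice so a single slow outlier doesn't page anyone:

```python
MAX_RESPONSE_SECONDS = 3.0
MAX_ERROR_RATE = 0.05  # 5% of requests failing

def app_alerts(requests):
    """Return which application-level alerts fire for a window of requests.

    Each request is a (duration_seconds, succeeded) tuple.
    """
    if not requests:
        return []
    alerts = []
    durations = sorted(d for d, _ in requests)
    # p95 instead of max: one slow outlier shouldn't trigger an alert
    p95 = durations[min(len(durations) - 1, int(0.95 * len(durations)))]
    if p95 > MAX_RESPONSE_SECONDS:
        alerts.append("slow_responses")
    error_rate = sum(1 for _, ok in requests if not ok) / len(requests)
    if error_rate > MAX_ERROR_RATE:
        alerts.append("high_error_rate")
    return alerts
```

APM products compute these for you; the sketch shows what their "error rate" and "response time" alerts are doing underneath.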

Week 4: Create runbooks

  • [ ] For each alert type, document:
    • What this alert means
    • How to diagnose the problem
    • Common causes and fixes
    • Escalation path if fix doesn't work
  • [ ] Test each runbook with a team member who didn't write it
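A fill-in template covering the four items above; the alert name, numbers, and escalation contact are placeholders to replace with your own:

```
Alert: disk_space_low                      (example)
What it means:   Less than 20% free disk on the affected host.
How to diagnose: Find which partition is full and what is growing
                 (logs? database? backups?).
Common fixes:    Rotate/compress logs; remove old backups; extend the volume.
Escalation:      If not resolved in 30 minutes, page <secondary on-call>.
```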

When to hire help

DIY-friendly if:

  • Under 10 systems/applications
  • Simple architecture (web system + database)
  • Can tolerate learning curve

Get professional help if:

  • Complex distributed systems
  • Multiple environments (dev, staging, prod)
  • Compliance requirements for monitoring
  • 24/7 operations required
  • Previous incidents due to lack of visibility

Warning signs:

  • You've had outages you didn't know about
  • Customers report problems before your team knows
  • No one checks monitoring dashboards regularly
  • Alerts go to email, but no one reads them
  • Last review of alert thresholds was 1+ year ago
