Monitoring and observability: what every IT team should track
Monitoring vs observability: the distinction matters
The terms monitoring and observability are often used interchangeably, but the practices solve different problems. Understanding the difference helps you build systems that are not only reliable but also diagnosable when things go wrong.
Monitoring is the practice of collecting predefined metrics and setting thresholds that trigger alerts. It answers known questions: is the server up? Is CPU usage above 90%? Is the API responding within acceptable latency?
Observability goes further. It gives you the ability to ask new questions about your systems without deploying new instrumentation. When a customer reports slow page loads at 14:00 on a Tuesday, observability lets you trace that experience through load balancers, application servers, databases, and third-party APIs to find the root cause - even if you never anticipated that specific failure mode.
For most South African businesses, the practical takeaway is this: start with solid monitoring, then progressively build observability as your systems grow more complex.
The three pillars of observability
Modern observability rests on three complementary data types. Each provides a different lens into system behaviour.
Metrics
Metrics are numerical measurements collected at regular intervals. They are cheap to store, fast to query, and excellent for dashboards and alerts.
- Counters - values that only go up (total requests, total errors)
- Gauges - values that go up and down (current memory usage, active connections)
- Histograms - distributions of values (response time percentiles)
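The three metric types can be sketched in plain Python. These are illustrative classes, not a real metrics library - in practice a client library such as a Prometheus SDK provides them:

```python
import bisect

class Counter:
    """A value that only increases, e.g. total requests served."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters never decrease")
        self.value += amount

class Gauge:
    """A value that can go up and down, e.g. active connections."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Observations sorted into buckets by upper bound, e.g. latency in ms."""
    def __init__(self, buckets=(50, 100, 250, 500, 1000)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # final bucket = overflow
    def observe(self, value):
        self.counts[bisect.bisect_left(self.buckets, value)] += 1

requests = Counter()
requests.inc()          # one request served
conns = Gauge()
conns.set(42)           # 42 connections currently open
latency = Histogram()
latency.observe(120)    # lands in the <=250 ms bucket
```

The bucket boundaries here are arbitrary; real histograms are tuned to the latency range you actually care about.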
Logs
Logs are timestamped records of discrete events. They provide context that metrics cannot: the specific error message, the user ID involved, the request payload that triggered a failure.
Structured logging (JSON format with consistent fields) makes logs far more useful than unstructured text. Invest in log standardisation early - retrofitting it across dozens of services is painful.
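A minimal sketch of structured logging with Python's standard `logging` module - the field names (`user_id`, `request_id`, `duration_ms`) are examples of the kind of consistent schema you would standardise on, not a prescribed set:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object with consistent fields."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry structured context passed via logging's `extra` argument.
        for key in ("user_id", "request_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment failed", extra={"user_id": "u-482", "request_id": "r-9913"})
```

Because every line is valid JSON with the same keys, a log platform can index and filter on `user_id` or `request_id` instead of grepping free text.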
Traces
Distributed traces follow a single request as it moves through multiple services. Each service adds a span to the trace, recording what it did and how long it took. Traces are indispensable for diagnosing latency in microservice architectures or systems that depend on multiple APIs.
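To make the span concept concrete, here is a hand-rolled sketch of how spans nest within one trace. A real system would use a tracing SDK such as OpenTelemetry rather than this toy recorder; the service names are hypothetical:

```python
import time
import uuid
from contextlib import contextmanager

trace_spans = []  # spans collected for one request

@contextmanager
def span(name, trace_id, parent_id=None):
    """Record what one service did and how long it took."""
    span_id = uuid.uuid4().hex[:8]
    start = time.monotonic()
    try:
        yield span_id
    finally:
        trace_spans.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": (time.monotonic() - start) * 1000,
        })

# One request passing through three layers, each adding a span.
trace_id = uuid.uuid4().hex
with span("api-gateway", trace_id) as root:
    with span("orders-service", trace_id, parent_id=root) as orders:
        with span("postgres-query", trace_id, parent_id=orders):
            time.sleep(0.01)  # simulated database work
```

The parent/child links are what let a trace viewer reconstruct the request path and show exactly where the time went.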
What to monitor: the four golden signals
Google’s Site Reliability Engineering handbook defines four golden signals that apply to virtually any service. These should form the baseline of your monitoring strategy.
Latency
How long requests take to complete. Track the latency of successful and failed requests separately - a fast error is still an error, and slow successes may indicate degradation before failure.
- Measure at the 50th, 95th, and 99th percentiles, not just averages
- Set baselines during normal operation and alert on deviation
- Track latency at the edge (user experience) and at each service boundary
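The steps above can be illustrated with Python's standard `statistics` module - the sample values are invented to show why averages mislead:

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from raw latency samples (needs >= 2 samples)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# 90% of requests take 20 ms, 10% take 500 ms.
samples = [20] * 90 + [500] * 10
# The average is 68 ms - but p95 and p99 are 500 ms, which is
# what the slowest tenth of your users actually experience.
print(latency_percentiles(samples))
```

This is exactly the failure mode of average-only dashboards: a mean of 68 ms looks healthy while one user in ten waits 500 ms.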
Traffic
The volume of demand on your system. This could be HTTP requests per second, database queries per minute, or messages processed per hour.
- Establish normal traffic patterns (daily, weekly, monthly cycles)
- Correlate traffic spikes with performance changes
- Use traffic data to inform capacity planning
Errors
The rate of requests that fail. This includes explicit errors (HTTP 500s, exceptions) and implicit ones (HTTP 200 responses with incorrect content, timeouts treated as successes).
- Distinguish between client errors (4xx) and server errors (5xx)
- Track error rates as a percentage of total traffic, not just absolute counts
- Alert on sustained error rate increases, not individual errors
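A minimal sketch of those three rules in Python - the window size, 5% threshold, and three-check sustain period are illustrative values you would tune for your own traffic:

```python
from collections import deque

class ErrorRateMonitor:
    """Alert on a sustained error percentage, not on individual errors."""
    def __init__(self, window=100, threshold_pct=5.0, sustained_checks=3):
        self.outcomes = deque(maxlen=window)  # True = error, False = success
        self.threshold_pct = threshold_pct
        self.sustained_checks = sustained_checks
        self.breaches = 0

    def record(self, is_error):
        self.outcomes.append(is_error)

    def error_rate_pct(self):
        """Error rate as a percentage of recent traffic, not a raw count."""
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """Fire only after the threshold is breached several checks in a row."""
        if self.error_rate_pct() > self.threshold_pct:
            self.breaches += 1
        else:
            self.breaches = 0
        return self.breaches >= self.sustained_checks

monitor = ErrorRateMonitor()
for _ in range(92):
    monitor.record(False)
for _ in range(8):
    monitor.record(True)  # 8% of the window is errors
```

A single failed request never pages anyone; only an elevated rate that persists across consecutive evaluation intervals does.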
Saturation
How full your system is. Every resource has a limit - CPU, memory, disk I/O, network bandwidth, database connections. Saturation monitoring tells you how close you are to those limits.
- Monitor resource utilisation as a percentage of capacity
- Set alerts at 70-80% for resources that degrade gradually
- For resources with hard cliffs (disk space, connection pools), alert earlier
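The two-threshold idea can be sketched like this - the 80% and 60% cut-offs are illustrative, matching the guidance above that hard-cliff resources need earlier warning:

```python
# Illustrative thresholds: gradually degrading resources alert at 80%,
# hard-cliff resources (disk space, connection pools) alert earlier, at 60%.
THRESHOLDS = {"gradual": 80.0, "cliff": 60.0}

def saturation_alerts(resources):
    """resources: list of (name, kind, used, capacity) tuples."""
    alerts = []
    for name, kind, used, capacity in resources:
        pct = 100.0 * used / capacity
        if pct >= THRESHOLDS[kind]:
            alerts.append(f"{name} at {pct:.0f}% of capacity")
    return alerts

print(saturation_alerts([
    ("cpu", "gradual", 55, 100),           # 55% of a gradual resource: fine
    ("disk", "cliff", 130, 200),           # 65% of a cliff resource: alert
    ("db-connections", "cliff", 40, 100),  # 40%: still fine
]))
# → ['disk at 65% of capacity']
```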
Alerting best practices
Poorly configured alerting is worse than no alerting. Alert fatigue - where teams ignore alerts because most are noise - is one of the biggest operational risks in IT.
Alert on symptoms, not causes
Alert when users are affected (high error rate, slow response times), not when a single metric looks unusual (CPU spike that resolves in seconds). Cause-based alerts generate noise. Symptom-based alerts generate action.
Tier your alerts
- Critical - customer-facing service is down or severely degraded. Pages the on-call engineer immediately.
- Warning - something is trending toward a problem but isn’t impacting users yet. Sends a notification to a channel.
- Informational - useful for dashboards and post-incident review but doesn’t notify anyone.
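The tiers map naturally onto a routing table. A sketch in Python - the channel names are placeholders, not real integrations:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # pages the on-call engineer immediately
    WARNING = "warning"    # notifies a team channel
    INFO = "informational" # dashboards and post-incident review only

# Hypothetical routing table; destinations are illustrative placeholders.
ROUTES = {
    Severity.CRITICAL: ["pagerduty:on-call", "slack:#incidents"],
    Severity.WARNING: ["slack:#ops-alerts"],
    Severity.INFO: [],
}

def route(severity):
    """Return where an alert of this severity should be delivered."""
    return ROUTES[severity]
```

Encoding the tiers in configuration like this keeps the paging decision out of individual alert rules, so "who gets woken up" is decided in one place.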
Include context in every alert
An alert that says “high CPU on web-03” is less useful than one that says “web-03 CPU at 94% for 10 minutes, current request rate 2x normal, top process: database connection pool exhaustion.” Include links to relevant dashboards and runbooks.
Review and prune regularly
Schedule a monthly review of all alerts. Archive any that haven’t fired in 90 days or that consistently fire without requiring action. Every active alert should have a clear owner and a documented response procedure.
Tool categories
You don’t need to buy one platform that does everything. Most mature monitoring stacks combine tools across several categories.
- Infrastructure monitoring - tracks servers, VMs, containers, and network devices. Examples: Zabbix, Datadog, Prometheus with node exporters.
- Application performance monitoring (APM) - instruments application code to track transactions, dependencies, and errors. Examples: New Relic, Dynatrace, Elastic APM.
- Log management - aggregates, indexes, and searches log data from all sources. Examples: Elasticsearch/Kibana, Grafana Loki, Splunk.
- Distributed tracing - correlates requests across services. Examples: Jaeger, Zipkin, Tempo.
- Synthetic monitoring - simulates user interactions from external locations to detect outages and performance issues before real users are affected.
- Uptime and status pages - simple endpoint checks with public-facing status communication. Examples: UptimeRobot, Pingdom, Statuspage.
For organisations running a mix of on-premise and cloud workloads - which describes most South African businesses - a unified monitoring approach is critical. Your infrastructure team should have visibility across all environments from a single pane of glass.
Building a monitoring strategy
A common mistake is adopting tools before defining what you need to know. Start with these steps:
- Inventory your services - list every application, database, queue, and external dependency.
- Define SLIs and SLOs - for each service, decide which metrics matter most (service level indicators) and what acceptable performance looks like (service level objectives).
- Instrument progressively - start with infrastructure metrics and uptime checks, then add application-level monitoring, then distributed tracing.
- Automate response - where possible, automate remediation for known issues (auto-scaling, service restarts, failover).
- Review after every incident - update monitoring to detect the root cause earlier next time.
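The SLO step above is easiest to reason about as an error budget. A minimal sketch - the 99.9% target and request counts are example figures:

```python
def error_budget(slo_target_pct, total_requests, failed_requests):
    """How much of the period's allowed failure budget has been spent."""
    allowed_failures = total_requests * (100.0 - slo_target_pct) / 100.0
    spent_pct = 100.0 * failed_requests / allowed_failures
    return allowed_failures, spent_pct

# A 99.9% availability SLO over 1,000,000 requests allows 1,000 failures.
# 400 failures so far means 40% of the budget is spent.
allowed, spent = error_budget(99.9, 1_000_000, 400)
print(f"budget: {allowed:.0f} failures, {spent:.0f}% spent")
```

Framing SLOs as a budget turns "is this error rate acceptable?" into a concrete question: how fast are we burning the budget, and will it last the period?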
A managed IT partner can accelerate this process by bringing established monitoring frameworks and 24/7 coverage, freeing your internal team to focus on product and business priorities.
Cloud-native monitoring considerations
If you’re running workloads in AWS, Azure, or Google Cloud, each provider offers native monitoring tools (CloudWatch, Azure Monitor, Cloud Operations). These are a good starting point for cloud resources, but they don’t cover on-premise infrastructure or cross-cloud environments.
For businesses with cloud architecture spanning multiple providers or hybrid deployments, a vendor-neutral monitoring stack avoids lock-in and provides consistent visibility.
Getting started
Good monitoring is not a one-time project - it’s a practice that matures alongside your systems. Start with the four golden signals, invest in structured logging, and build from there.
If your team lacks the capacity to build and maintain a monitoring stack, or if you’re experiencing alert fatigue and incident response gaps, ITHQ can help design and implement an observability strategy that fits your environment and budget.
Talk to our team about monitoring, observability, and managed operations.