System Monitoring with Prometheus & Grafana

The Pull Model Architecture

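Unlike push-based systems such as StatsD or Graphite, where applications send metrics to a collector, Prometheus pulls (scrapes) metrics from its targets over HTTP. The Pushgateway is part of the Prometheus ecosystem, not an alternative to it: it exists for short-lived batch jobs that cannot be scraped directly.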

App / Host (/metrics)  <--scrape--  Prometheus Server (TSDB)  <--query--  Grafana (Dashboard)
                                             |
                                             v  alerts
                                    Alertmanager (PagerDuty/Slack)

Why Pull?

Central Control: Prometheus decides when and what to scrape.
Reliability: If the app goes down, the next scrape fails and Prometheus sets the target's up metric to 0, so an outage is visible within one scrape interval.
Simplicity: No heavy agent required on the host; just an HTTP endpoint.
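
To make the "just an HTTP endpoint" point concrete, here is a minimal sketch of a scrape target written with only the Python standard library. The metric store REQUESTS_TOTAL, the render_metrics and serve helpers, and port 8000 are assumptions made for illustration; a real application would normally use an official client library rather than format the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric store (illustration only).
REQUESTS_TOTAL = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape job at port 8000 would be enough for Prometheus to collect http_requests_total from this process.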

The 4 Metric Types

1. Counter

A value that only goes UP (or resets on restart).

http_requests_total

2. Gauge

A value that can go up and down.

memory_usage_bytes, temperature_celsius
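
The difference between the first two types can be sketched with two toy classes. These are illustrative stand-ins written for this article, not the API of any Prometheus client library.

```python
class Counter:
    """Monotonic value: it only increases and starts again at 0 after a restart."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount
```

Because a counter never decreases while the process is alive, any drop in its value can be read as a restart. That property is what lets rate() work across restarts, and it is why a value that can legitimately fall (memory, temperature) should be a gauge.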

3. Histogram

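Counts observations (e.g., request durations) in cumulative buckets: each bucket counts every observation less than or equal to its upper bound, given in the le label. A histogram also exposes _sum and _count series.

request_duration_seconds_bucket{le="0.5"}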
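
A rough sketch of how observations end up in le buckets is shown below. The Histogram class and the bucket bounds are hypothetical and only illustrate the cumulative counting; client libraries implement this for you.

```python
import math


class Histogram:
    """Toy histogram with Prometheus-style cumulative buckets."""

    def __init__(self, bounds):
        # An implicit +Inf bucket always catches the largest observations.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, not yet cumulative
        self.sum = 0.0

    def observe(self, value):
        self.sum += value
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1
                break

    def cumulative(self):
        """Return (le, count) pairs where each count includes all smaller buckets."""
        total = 0
        pairs = []
        for bound, count in zip(self.bounds, self.counts):
            total += count
            pairs.append((bound, total))
        return pairs
```

For example, after observing 0.05, 0.3, 0.4 and 2.0 with bounds 0.1, 0.5 and 1.0, the le="0.5" bucket reports 3, because it includes the observation that also fell under le="0.1".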

4. Summary

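Similar to a Histogram, but quantiles are calculated on the client side. Because each instance reports finished quantiles, they cannot be meaningfully aggregated across instances.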

rpc_duration_seconds
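
One practical consequence of client-side quantiles is that per-instance values cannot simply be averaged. The sketch below uses made-up latencies and an exact nearest-rank quantile (real summaries use a streaming estimate) to show the gap.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile of a list of observations."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies (seconds) seen by two instances of one service.
fast_instance = [0.1] * 100
slow_instance = [1.0] * 100

# What you get by averaging the two per-instance p99 values.
averaged_p99 = (quantile(fast_instance, 0.99) + quantile(slow_instance, 0.99)) / 2

# What the p99 actually is across all requests.
true_p99 = quantile(fast_instance + slow_instance, 0.99)
```

Here the averaged value lands halfway between the two instances, while the true 99th percentile across all requests is the slow instance's latency. Histogram buckets avoid this because bucket counts can be summed before the quantile is estimated.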

PromQL: Querying the Data

PromQL (Prometheus Query Language) is a functional language whose expressions evaluate to instant vectors, range vectors, or scalars.

Calculating Rate (Requests Per Second)

# Calculate the per-second rate of requests over the last 5 minutes
rate(http_requests_total[5m])

# Sum it up by Service (removing other labels like 'instance')
sum(rate(http_requests_total[5m])) by (service)
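
To show roughly what these two queries compute, here is a simplified sketch. The function names simple_rate and sum_by are hypothetical. Unlike the real rate(), this version does not extrapolate to the edges of the window, so treat it as an approximation of the idea rather than of the exact result.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # The counter dropped, so the process restarted: count from zero.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def sum_by(series, label):
    """Sum values grouped by one label, dropping all other labels.

    series: list of (labels_dict, value) pairs.
    """
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)
```

With samples 100, 130, 160, 10, 40 taken 15 seconds apart, the drop from 160 to 10 is treated as a restart, so the total increase is 100 over 60 seconds.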

99th Percentile Latency

# Calculate the approximate 99th percentile duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
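
The word "approximate" matters: the quantile is estimated by linear interpolation inside the bucket that contains the target rank. The following is a simplified sketch of that idea, assuming non-negative bucket bounds and 0 < q <= 1. It is not the Prometheus source code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with +Inf as the last bound.
    """
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # The quantile lies beyond the highest finite bound; report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    if count == prev_count:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
```

The estimate can never be more precise than the bucket layout allows, so bucket bounds should be chosen around the latencies you care about (for example, near your SLO threshold).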

Alerting & Recording Rules

Don't page on causes such as CPU usage. Alert on symptoms that users actually feel (high error rate, high latency).

groups:
- name: example
  rules:
  # RECORDING RULE: Pre-compute complex query
  - record: job:http_inprogress_requests:sum
    expr: sum(http_inprogress_requests) by (job)

  # ALERTING RULE: Fire if condition is true for 10m
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "High request latency for job {{ $labels.job }}"
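
The effect of the for clause can be sketched as a small state machine: an alert whose expression becomes true is first pending and only starts firing once the condition has held for the whole duration. The alert_states function below is a hypothetical illustration of that behavior, not Prometheus code.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true), oldest first.
    """
    states = []
    active_since = None
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_for = ts - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states
```

With evaluations every 5 minutes and for: 10m (600 seconds), a condition that is true three times in a row fires on the third evaluation. A single false evaluation in between sends it back to inactive, which is how the for clause filters out short spikes.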

Summary

Service Discovery: Prometheus finds targets automatically (via K8s, EC2, Consul).
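Exporters: Use "Node Exporter" for system metrics. Instrument your own applications with a client library, and use exporters (often run as sidecars) for third-party software that cannot expose /metrics itself.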
Visualization: Use Grafana Variables to create dynamic dashboards (e.g., select Pod).