Observability & Data

Monitoring with Prometheus

Prometheus is a widely adopted open-source monitoring system built around a time series database. Learn the pull model, PromQL queries, and how to structure your alerts.

The Pull Model Architecture

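Unlike push-based systems such as StatsD or Graphite, where applications send metrics to a collector, Prometheus pulls (scrapes) metrics from its targets over HTTP. The Pushgateway is part of the Prometheus ecosystem and exists only as a workaround for short-lived batch jobs that cannot be scraped.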

[Diagram: the Prometheus Server (TSDB) scrapes /metrics on each App / Host; Grafana (Dashboard) queries the server; the server sends alerts to Alertmanager, which notifies PagerDuty/Slack.]

Why Pull?

  • Central Control: Prometheus decides when and what to scrape.
  • Reliability: If the app goes down, the next scrape fails and Prometheus records up == 0 for that target, so the outage is visible within one scrape interval.
  • Simplicity: No heavy agent required on the host; just an HTTP endpoint.
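To make the "just an HTTP endpoint" point concrete, here is a minimal sketch of a scrape target using only the Python standard library. The `REQUEST_COUNTS` dictionary, the `method` label, and port 8000 are assumptions for illustration; a real application would normally use an official client library rather than formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real app would use a client library.
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes this path by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at the host on port 8000 would be enough for Prometheus to collect the counter; nothing else needs to run on the host.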

The 4 Metric Types

1. Counter

A value that only increases, or resets to zero when the process restarts.

http_requests_total

2. Gauge

A value that can go up and down.

memory_usage_bytes, temperature_celsius

3. Histogram

Samples observations (e.g., request duration) into buckets.

http_request_duration_seconds_bucket{le="0.5"}

4. Summary

Similar to a Histogram, but calculates quantiles on the client side, so the resulting quantiles cannot be meaningfully aggregated across instances.

rpc_duration_seconds
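The semantics of the first three types can be sketched in a few lines of Python. These classes are illustrative stand-ins, not the API of any Prometheus client library; the point is that a counter rejects decreases, a gauge accepts any value, and histogram buckets are cumulative (each `le` bucket counts every observation at or below its bound).

```python
import math


class Counter:
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        # Gauges may move in either direction.
        self.value = value


class Histogram:
    def __init__(self, bounds):
        # Upper bounds (the "le" label); +Inf is always the last bucket.
        self.bounds = sorted(bounds) + [math.inf]
        self.buckets = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Cumulative: every bucket whose bound is >= value counts it.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.buckets[i] += 1
```

With bounds 0.1, 0.5, and 1.0, observing 0.05, 0.3, 0.7, and 2.0 leaves the bucket counts at 1, 2, 3, 4: the `+Inf` bucket always equals the total number of observations.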

PromQL: Querying the Data

PromQL (Prometheus Query Language) is a functional language whose expressions operate on instant vectors and range vectors of time series.

Calculating Rate (Requests Per Second)
# Calculate the per-second rate of requests over the last 5 minutes
rate(http_requests_total[5m])

# Sum it up by Service (removing other labels like 'instance')
sum(rate(http_requests_total[5m])) by (service)
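The arithmetic behind rate() can be sketched as follows. This is a simplified model, not Prometheus's implementation: it handles counter resets but skips the extrapolation to the window boundaries that the real function performs. The function name `simple_rate` and the `(timestamp, value)` sample layout are assumptions for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplification: the real rate() also extrapolates to the window edges.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter reset (e.g. process restart): count from zero again.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

For samples `[(0, 100), (60, 160), (120, 220)]` the counter grew by 120 over 120 seconds, so the result is 1.0 requests per second. Reset handling is why a counter should be wrapped in rate() rather than graphed raw.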
99th Percentile Latency
# Calculate the approximate 99th percentile duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
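The result is approximate because histogram_quantile only sees bucket counts, not individual observations. A rough sketch of the estimation, assuming linear interpolation inside the bucket that contains the target rank, is shown here. The function name `bucket_quantile` and the `(upper_bound, cumulative_count)` input layout are assumptions, and edge cases handled by the real function are omitted.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The buckets must be sorted by upper bound and end with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because the estimate can never be finer than the bucket boundaries, bucket bounds are best chosen close to the latency thresholds you care about.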

Alerting & Recording Rules

Don't alert on causes such as high CPU usage. Alert on symptoms that users actually feel (high error rate, high latency).

groups:
- name: example
  rules:
  # RECORDING RULE: Pre-compute complex query
  - record: job:http_inprogress_requests:sum
    expr: sum(http_inprogress_requests) by (job)

  # ALERTING RULE: Fire if the condition stays true for 10m
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "High request latency for job {{ $labels.job }}"
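The effect of the `for` clause can be modeled as a small state machine: an alert is pending while its condition holds and only starts firing once it has held continuously for the configured duration. The sketch below is a simplified model with assumed names (`alert_states`, one value per rule evaluation), not Prometheus's rule engine.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared: the timer restarts
            states.append("inactive")
    return states
```

With a 60-second evaluation interval and a 600-second hold, a value that stays above the threshold is pending for the first ten evaluations and fires on the eleventh. A single dip below the threshold restarts the timer, which is what keeps brief spikes from paging anyone.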

Summary

  • Service Discovery: Prometheus finds targets automatically (via K8s, EC2, Consul).
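  • Exporters: Use Node Exporter for host-level system metrics, instrument your own applications with a client library, and run exporters (often as sidecars) for third-party software that cannot expose /metrics itself.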
  • Visualization: Use Grafana Variables to create dynamic dashboards (e.g., select Pod).