Logging & Monitoring System Design | ELK, Prometheus & Observability

The Logging Pipeline (ELK/EFK)

Ship logs off the server as soon as they are written. On the node they are ephemeral (lost when a container or host is replaced); in the central archive they persist.

App (STDOUT) → Collector (Fluentd/Filebeat) → Buffer (Kafka/Redis) → Indexer (Elasticsearch) → UI (Kibana)

*The "Buffer" layer is optional but recommended for high-volume systems to handle backpressure.
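To make the role of the buffer concrete, here is a minimal in-process sketch of a collector and an indexer connected by a bounded queue. The names `run_pipeline`, `collector`, and `indexer` are hypothetical, and `queue.Queue` only stands in for Kafka/Redis: a real broker persists events and absorbs far larger bursts, whereas this toy version simply blocks the producer when the buffer is full.

```python
import queue
import threading


def run_pipeline(events, buffer_size=100):
    """Move events from a collector thread to an indexer thread via a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)  # stand-in for Kafka/Redis
    indexed = []                               # stand-in for Elasticsearch
    sentinel = object()                        # marks the end of the stream

    def collector():
        for event in events:
            # put() blocks while the buffer is full, so a slow indexer
            # slows the collector down instead of dropping events.
            buffer.put(event)
        buffer.put(sentinel)

    def indexer():
        while True:
            event = buffer.get()
            if event is sentinel:
                break
            indexed.append(event)

    threads = [threading.Thread(target=collector), threading.Thread(target=indexer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return indexed
```

With one producer and one consumer, events come out in the order they went in, regardless of how small `buffer_size` is.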

Start with Structured Logging

Do NOT log plain text strings. Log **JSON**. A message like "Error: User 123 failed" can only be parsed with a brittle regex written for that exact wording. `{"level":"error", "user_id":123}` parses into named fields with any JSON library.

❌ Bad (Unstructured)

[2023-10-05 10:00:01] ERROR Payment failed for user 555. Reason: Timeout.

✅ Good (JSON/Structured)

{ "timestamp": "2023-10-05T10:00:01Z", "level": "ERROR", "event": "payment_failed", "user_id": 555, "reason": "timeout", "service": "billing-service" }
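The difference between the two formats above shows up as soon as you try to query them. The sketch below filters log lines by field using only the standard library. `matching_events` is a hypothetical helper, and the second sample line (the `payment_ok` event) is invented for illustration.

```python
import json


def matching_events(lines, **criteria):
    """Return parsed events whose fields equal every given criterion."""
    matches = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured line: no fields to query
        if isinstance(event, dict) and all(
            event.get(key) == value for key, value in criteria.items()
        ):
            matches.append(event)
    return matches


logs = [
    '{"timestamp": "2023-10-05T10:00:01Z", "level": "ERROR", "event": "payment_failed", '
    '"user_id": 555, "reason": "timeout", "service": "billing-service"}',
    # Hypothetical second event, added only so the filter has something to exclude
    '{"timestamp": "2023-10-05T10:00:02Z", "level": "INFO", "event": "payment_ok", '
    '"user_id": 777, "service": "billing-service"}',
    "[2023-10-05 10:00:01] ERROR Payment failed for user 555. Reason: Timeout.",
]

failed_for_555 = matching_events(logs, user_id=555, level="ERROR")
```

The plain-text line mentions user 555 as well, but there is no field to match on, so it is skipped.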

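# Python Implementation (structlog)
import structlog

# structlog's default output is a human-readable console format.
# Configure JSONRenderer explicitly to get one JSON object per event.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# Emits JSON with event, level, timestamp, and the keyword fields
log.error("payment_failed",
          user_id=555,
          reason="timeout",
          amount=99.00)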
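Where adding a third-party dependency is not an option, similar output can be approximated with the standard `logging` module. This is a sketch, not a drop-in replacement: `JsonFormatter`, `make_logger`, and the `extra={"fields": {...}}` convention are assumptions of this example, not part of any library.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra fields arrive via extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name="billing-service", stream=None):
    """Build a logger that writes JSON lines to `stream` (STDOUT by default)."""
    handler = logging.StreamHandler(stream or sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate output via the root logger
    return logger
```

Usage: `make_logger().error("payment_failed", extra={"fields": {"user_id": 555, "reason": "timeout"}})` writes one JSON line to STDOUT, where the collector can pick it up.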

Collection Patterns: DaemonSet vs Sidecar

Pattern	Description	Pros	Cons
DaemonSet (Node Agent)	One collector (e.g., Fluent Bit) per Node. Reads all container logs from `/var/log/containers`.	Resource efficient (1 agent per node).	Hard to customize parsing per app.
Sidecar	Dedicated collector container inside each Pod.	Full isolation. Custom parsing logic per app.	High resource usage (N agents).
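The resource trade-off in the table is simple arithmetic: a DaemonSet scales with the number of nodes, a sidecar with the number of pods. The sketch below makes that explicit. The cluster size and the 50 MiB per-agent figure are made-up inputs for illustration, not measurements of any particular collector.

```python
def collector_memory_mib(nodes, pods, agent_memory_mib):
    """Estimate total collector memory for each pattern, in MiB."""
    return {
        "daemonset": nodes * agent_memory_mib,  # one agent per node
        "sidecar": pods * agent_memory_mib,     # one agent per pod
    }


# Hypothetical cluster: 10 nodes, 300 pods, 50 MiB per agent
estimate = collector_memory_mib(nodes=10, pods=300, agent_memory_mib=50)
# estimate == {"daemonset": 500, "sidecar": 15000}
```

The gap widens with pod density, which is why sidecars are usually reserved for the few apps that genuinely need custom parsing.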

The 3 Pillars of Observability

📜 Logs

Discrete events. "What happened?"
(e.g., Error stacktrace)

📈 Metrics

Aggregatable numbers. "Is it healthy?"
(e.g., CPU, Req/sec)

🧭 Traces

Request journey. "Where is the latency?"
(e.g., Span across microservices)
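Logs only help you follow a request journey if every line carries the same identifier. Below is a minimal sketch of that idea using `contextvars` from the standard library. The names `trace_id_var`, `log_event`, `charge_card`, and `handle_request` are hypothetical; in a real system the ID would also be forwarded to downstream services (for example in a request header), which this sketch does not cover.

```python
import contextvars
import json
import uuid

# Holds the current request's trace ID; None outside of a request.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


def log_event(event, **fields):
    """Print one JSON log line that automatically includes the current trace ID."""
    line = json.dumps({"event": event, "trace_id": trace_id_var.get(), **fields})
    print(line)
    return line


def charge_card(user_id):
    # No trace_id parameter needed: it is read from the context.
    return log_event("payment_failed", user_id=user_id, reason="timeout")


def handle_request(user_id, incoming_trace_id=None):
    """Reuse the caller's trace ID if one was passed in, else start a new one."""
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        first = log_event("request_received", user_id=user_id)
        second = charge_card(user_id)
        return [first, second]
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request
```

Searching the log store for a single `trace_id` then returns every line the request produced, in whichever function or service it was written.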

Summary

ELK Stack: Elasticsearch (Store), Logstash (Process), Kibana (View). A de facto standard for centralized logging.
Format: Always use JSON structured logging to enable powerful querying (e.g., `user_id:555`).
Context: Correlation IDs (Trace IDs) tie together the log lines that one request produces across services. Without them, debugging microservices is largely guesswork.