Observability & DevOps

Logging & Monitoring System

From print statements to production observability: Designing the ELK/EFK stack, Structured Logging, and Sidecar Collectors.

The Logging Pipeline (ELK/EFK)

Ship logs off the node as soon as they are written. Local log files are ephemeral (they disappear when a container or node is replaced), while the central archive is durable.

App (STDOUT) → Collector (Fluentd/Filebeat) → Buffer (Kafka/Redis) → Indexer (Elasticsearch) → UI (Kibana)

*The "Buffer" layer is optional but recommended for high-volume systems to handle backpressure.
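To make the backpressure point concrete, here is a minimal sketch of what the Buffer stage does, using an in-memory bounded queue as a stand-in for Kafka or Redis. The class name `BoundedLogBuffer` and its methods are hypothetical, and dropping events when the buffer is full is only one possible policy (blocking the producer or spilling to disk are others).

```python
import queue


class BoundedLogBuffer:
    """Hypothetical in-memory stand-in for the Buffer stage (Kafka/Redis in production)."""

    def __init__(self, max_events: int) -> None:
        # A bounded queue absorbs bursts without growing memory forever.
        self._queue: queue.Queue = queue.Queue(maxsize=max_events)
        self.dropped = 0

    def offer(self, event: dict) -> bool:
        """Enqueue without blocking; count the event as dropped if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, batch_size: int) -> list:
        """Pull up to batch_size events for one bulk write to the indexer."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


# The collector produces 5 events while the indexer is slow; capacity is 3.
buffer = BoundedLogBuffer(max_events=3)
accepted = [buffer.offer({"seq": i}) for i in range(5)]

# The indexer catches up and takes one batch of 2.
batch = buffer.drain(batch_size=2)
```

The `dropped` counter is worth exporting as a metric: a rising value suggests the indexer cannot keep up and the buffer needs more capacity or a durable backend.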

Start with Structured Logging

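Do NOT log plain text strings. Log **JSON**. A free-text line such as "Error: User 123 failed" can only be parsed with brittle regular expressions, whereas `{"level":"error", "user_id":123}` parses unambiguously into fields that can be indexed and queried.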

❌ Bad (Unstructured)
[2023-10-05 10:00:01] ERROR Payment failed for user 555. Reason: Timeout.
✅ Good (JSON/Structured)
{ "timestamp": "2023-10-05T10:00:01Z", "level": "ERROR", "event": "payment_failed", "user_id": 555, "reason": "timeout", "service": "billing-service" }
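# Python Implementation (Structlog)
import structlog

# structlog's default renderer prints key=value text, so JSON must be configured explicitly.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# Emits one JSON object per call
log.error("payment_failed",
          user_id=555,
          reason="timeout",
          amount=99.00)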
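Where adding a dependency is not an option, roughly the same JSON output can be sketched with only the standard library. The `JsonFormatter` class and the `fields` key passed through `extra` are illustrative names, not part of the `logging` module. The handler writes to an in-memory stream here so the result can be inspected; in production it would target `sys.stdout`.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (a convention assumed here).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for sys.stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("billing-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

logger.error("payment_failed", extra={"fields": {"user_id": 555, "reason": "timeout"}})

# The emitted line round-trips through a JSON parser, which is what the indexer relies on.
parsed = json.loads(stream.getvalue())
```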

Collection Patterns: DaemonSet vs Sidecar

| Pattern | Description | Pros | Cons |
|---|---|---|---|
| DaemonSet (Node Agent) | One collector (e.g., Fluent Bit) per Node. Reads all container logs from `/var/log`. | Resource efficient (1 agent per node). | Hard to customize parsing per app. |
| Sidecar | Dedicated collector container inside each Pod. | Full isolation. Custom parsing logic per app. | High resource usage (N agents). |
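The parsing trade-off can be illustrated with a sketch of what a node agent sees. It assumes the CRI log format written by containerd and CRI-O (`<timestamp> <stream> <P|F> <message>`); Docker's `json-file` driver uses a different layout. The function name `parse_cri_line` is hypothetical. Because one agent handles every app on the node, it can only apply a generic rule: try JSON, otherwise keep the raw text.

```python
import json


def parse_cri_line(line: str) -> dict:
    """Split one CRI-format log line and decode the app payload if it is JSON.

    Raises ValueError if the line does not have the four expected parts.
    """
    timestamp, stream, tag, message = line.rstrip("\n").split(" ", 3)
    record = {
        "timestamp": timestamp,   # written by the container runtime
        "stream": stream,         # "stdout" or "stderr"
        "partial": tag == "P",    # "P" marks a fragment of a long line, "F" a full line
    }
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict):
        record["app"] = payload       # structured log: fields are directly queryable
    else:
        record["message"] = message   # unstructured log: only full-text search is possible
    return record


structured = parse_cri_line(
    '2023-10-05T10:00:01.123456789Z stdout F {"level": "error", "user_id": 555}'
)
unstructured = parse_cri_line(
    "2023-10-05T10:00:02.000000000Z stderr F Payment failed for user 555"
)
```

A sidecar could instead run an app-specific parser for the second line, at the cost of one extra container per Pod.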

The 3 Pillars of Observability

📜 Logs

Discrete events. "What happened?"
(e.g., Error stacktrace)

📈 Metrics

Aggregatable numbers. "Is it healthy?"
(e.g., CPU, Req/sec)

🧭 Traces

Request journey. "Where is the latency?"
(e.g., Span across microservices)

Summary

  • ELK Stack: Elasticsearch (Store), Logstash (Process), Kibana (View). EFK swaps Logstash for Fluentd. Both are widely used for centralized logging.
  • Format: Always use JSON structured logging to enable powerful querying (e.g., `user_id:555`).
  • Context: Attach a Correlation ID (Trace ID) to every log line. Without one, following a single request across microservices is impractical.
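The Correlation ID point can be sketched with the standard library: a `contextvars.ContextVar` holds the ID for the current request, and a logging filter stamps it onto every record. All names here (`trace_id_var`, `TraceIdFilter`, `handle_request`, `charge_card`) are hypothetical, and a real service would read the incoming ID from a request header rather than a function argument.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled (None outside a request).
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


class JsonLineFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "trace_id": record.trace_id,
        })


stream = io.StringIO()  # stand-in for sys.stdout
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonLineFormatter())
logger = logging.getLogger("checkout-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def charge_card() -> None:
    # No trace ID is passed in; the filter picks it up from the context.
    logger.error("payment_failed")


def handle_request(incoming_trace_id=None) -> None:
    # Reuse the caller's ID if present so the same ID spans every service.
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("order_received")
        charge_card()
    finally:
        trace_id_var.reset(token)


handle_request("abc123")
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Every line emitted during the request carries the same `trace_id`, so a query such as `trace_id:abc123` in Kibana would return the full request history across services that propagate the ID.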