Observability

The ability to understand the internal state of a system by examining its external outputs, typically through logs, metrics, and traces.

Also known as: System Observability, Application Observability

What is Observability?

Observability is the ability to infer the internal state of a system by examining its outputs. Traditional monitoring tells you when something is wrong; observability helps you understand why, by providing the data needed to debug issues you did not anticipate.

Three Pillars of Observability

Logs

  • Event records
  • Contextual information
  • Debugging details
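
As a rough illustration of logs that carry an event record plus contextual information, the sketch below emits one JSON object per event using only the Python standard library. The logger name `checkout`, the `context` key, and the fields `order_id` and `retry` are hypothetical names chosen for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes an attribute
        # on the record; merge it in as top-level fields.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("payment failed", extra={"context": {"order_id": "o-123", "retry": 2}})

# Each line can be parsed back into fields instead of being grepped as text.
event = json.loads(stream.getvalue())
```

Because every field is a named key, a log backend can filter on `order_id` or `retry` directly rather than pattern-matching free text.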

Metrics

  • Numerical measurements
  • Time-series data
  • Performance indicators
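
To make the idea of numerical, time-stamped measurements concrete, here is a minimal in-memory sketch that stores samples per metric name and derives a latency percentile from them. `MetricStore` and the metric name `http_request_latency_ms` are hypothetical; a real metrics backend would aggregate and store data quite differently.

```python
import math
from collections import defaultdict


class MetricStore:
    """Toy time-series store: metric name -> list of (timestamp, value)."""

    def __init__(self):
        self.series = defaultdict(list)

    def record(self, name, value, timestamp):
        self.series[name].append((timestamp, value))

    def percentile(self, name, q):
        """Nearest-rank percentile (q in 0..100) over all recorded values."""
        values = sorted(value for _, value in self.series[name])
        if not values:
            raise ValueError(f"no samples recorded for {name!r}")
        rank = max(1, math.ceil(q / 100 * len(values)))
        return values[rank - 1]


store = MetricStore()
for ts, latency_ms in enumerate([120, 80, 200, 95, 110]):
    store.record("http_request_latency_ms", latency_ms, timestamp=ts)

p50 = store.percentile("http_request_latency_ms", 50)  # median latency
p95 = store.percentile("http_request_latency_ms", 95)  # tail latency
```

Percentiles are often preferred over averages as performance indicators because a small number of slow requests can hide behind a healthy-looking mean.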

Traces

  • Request flow
  • Distributed systems
  • Latency breakdown
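
The latency breakdown that traces provide can be sketched with a small span model: every span shares a trace ID, points at its parent, and records start and end times. The `Span` fields and the `self_time_ms` helper below are illustrative assumptions, not the data model of any particular tracing system, and the calculation assumes child spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_ms(spans):
    """Time spent in each span, excluding time spent in its direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        span.name: span.duration_ms - child_total.get(span.span_id, 0.0)
        for span in spans
    }


# One request flowing through a handler, an auth check, and a database query.
spans = [
    Span("t1", "a", None, "GET /checkout", 0, 100),
    Span("t1", "b", "a", "auth", 5, 25),
    Span("t1", "c", "a", "db.query", 30, 90),
]
breakdown = self_time_ms(spans)
```

In this made-up trace, most of the 100 ms request is attributed to `db.query`, which is the kind of answer a latency breakdown is meant to surface.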

Monitoring vs. Observability

Monitoring              | Observability
Known unknowns          | Unknown unknowns
Predefined dashboards   | Ad-hoc exploration
Alert on thresholds     | Debug novel issues
What is broken          | Why it's broken

Key Concepts

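Cardinality: The number of unique label-value combinations in a metric. High cardinality increases storage and query cost.

Sampling: Collecting a subset of telemetry rather than every event, to limit volume and cost.

Correlation: Linking logs, metrics, and traces, for example through a shared trace or request ID.

Context: Metadata and dimensions attached to telemetry, such as service name, version, or region.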
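
Cardinality and sampling are easier to reason about with numbers in hand. The sketch below counts distinct label combinations and makes a deterministic keep-or-drop decision per trace ID. The function names `cardinality` and `keep_trace` are hypothetical, and hashing the trace ID is just one possible sampling strategy.

```python
import hashlib


def cardinality(label_sets):
    """Count the distinct label combinations seen for a metric."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


def keep_trace(trace_id, rate):
    """Deterministic sampling: the same trace ID always gets the same decision.

    The trace ID is hashed to a 64-bit integer and kept when it falls in the
    lowest `rate` fraction of the range (rate is between 0.0 and 1.0).
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


observed = [
    {"method": "GET", "status": "200"},
    {"method": "GET", "status": "500"},
    {"method": "GET", "status": "200"},  # duplicate combination
]
series_count = cardinality(observed)

trace_ids = [f"trace-{i}" for i in range(1000)]
kept = [t for t in trace_ids if keep_trace(t, 0.1)]  # keeps roughly 10%
```

Adding a label such as a user ID multiplies the number of combinations, which is why unbounded values are usually kept out of metric labels and placed in logs or traces instead.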

Observability in AI Systems

LLM Observability

  • Prompt/response logging
  • Token usage
  • Latency metrics
  • Error tracking
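
The four signals above can be captured by wrapping each model call. The following is a minimal sketch: `model_fn` stands in for whatever client function actually calls the model, `observe_llm_call` is a hypothetical helper, and the whitespace-based token count is a crude placeholder for the tokenizer a real provider would use.

```python
import time


def observe_llm_call(model_fn, prompt, records):
    """Call model_fn(prompt) and append one observability record to `records`."""
    start = time.perf_counter()
    record = {"prompt": prompt, "response": None, "error": None}
    try:
        response = model_fn(prompt)
        record["response"] = response
        # Assumption: whitespace splitting approximates token usage.
        record["completion_tokens"] = len(response.split())
        return response
    except Exception as exc:
        record["error"] = type(exc).__name__  # error tracking
        raise
    finally:
        # Runs on both success and failure, so every call is recorded.
        record["prompt_tokens"] = len(prompt.split())
        record["latency_s"] = time.perf_counter() - start
        records.append(record)


def fake_model(prompt):
    """Stand-in for a real model client."""
    return "four words in reply"


def failing_model(prompt):
    raise TimeoutError("upstream timed out")


records = []
observe_llm_call(fake_model, "Summarize this ticket", records)
try:
    observe_llm_call(failing_model, "Summarize this ticket", records)
except TimeoutError:
    pass  # the failure is still captured in `records`
```

Recording in the `finally` block means failed calls produce a record too, which matters because errors and timeouts are often the calls most worth inspecting.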

RAG Observability

  • Retrieval quality
  • Context relevance
  • Generation accuracy
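
Retrieval quality is commonly summarized with precision and recall over the top k retrieved documents. The sketch below assumes a labeled set of relevant document IDs is available for each query, which is often the hard part in practice; the function name and the document IDs are made up for illustration.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and recall@k for a single query.

    retrieved_ids: ranked list of document IDs returned by the retriever.
    relevant_ids:  set of document IDs judged relevant (assumed to be labeled).
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=4)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found. Tracking both numbers over time helps separate retrieval problems from generation problems.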

Tools and Platforms

Open Source

  • Prometheus
  • Grafana
  • Jaeger
  • OpenTelemetry

Commercial

  • Datadog
  • New Relic
  • Splunk
  • Dynatrace

Best Practices

  • Standardize instrumentation
  • Use structured logging
  • Implement distributed tracing
  • Define SLIs and SLOs
  • Correlate signals
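
As one way to make "Define SLIs and SLOs" concrete, the sketch below computes an availability SLI from request counts and the share of the error budget still unspent for a given SLO target. The function names and the numbers are illustrative assumptions, not values from any real service.

```python
def availability_sli(good_requests, total_requests):
    """Service level indicator: fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo_target):
    """Fraction of the error budget left (negative means the SLO is breached)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


# Hypothetical window: 100,000 requests, 50 failures, 99.9% availability target.
sli = availability_sli(99_950, 100_000)
remaining = error_budget_remaining(99_950, 100_000, slo_target=0.999)
```

With a 99.9% target, 100 failures are allowed in this window; 50 have occurred, so about half of the budget remains. Alerting on how fast the budget is being spent is a common alternative to alerting on raw thresholds.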