What is Observability?

Observability is the ability to infer the internal state of a system by examining its outputs. Unlike traditional monitoring, which tells you when something is wrong, observability helps you understand why it is wrong by providing the data needed to debug previously unknown issues.

Three Pillars of Observability

Logs

Event records
Contextual information
Debugging details
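
As a rough illustration of an event record that carries contextual information, the sketch below emits each log line as a JSON object using Python's standard logging module. The logger name, the "context" attribute, and the field names (order_id, trace_id) are assumptions made for this example, not part of any standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}
        # and those keys are merged into the event.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes JSON events to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "payment declined",
        extra={"context": {"order_id": "o-123", "trace_id": "abc"}},
    )
```

Keeping the context as separate fields, rather than interpolating it into the message string, is what makes the events filterable later.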

Metrics

Numerical measurements
Time-series data
Performance indicators
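
To make the idea of time-series measurements and performance indicators concrete, here is a minimal sketch that stores latency samples and derives a percentile from them. The Sample type and the nearest-rank percentile method are choices made for this example; real metrics backends often use histograms or other estimators instead.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since the epoch
    value: float      # e.g. request latency in milliseconds


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of values."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Ten illustrative latency samples, one per second.
series = [Sample(timestamp=float(i), value=10.0 * (i + 1)) for i in range(10)]
latencies = [s.value for s in series]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```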

Traces

Request flow
Distributed systems
Latency breakdown
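
A latency breakdown can be sketched from span data alone. The example below assumes a simplified, hypothetical Span record (not the OpenTelemetry data model) and assumes that child spans run sequentially inside their parent, so a span's own time is its duration minus the durations of its direct children.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_by_span(spans: list[Span]) -> dict[str, float]:
    """Time spent in each span, excluding its direct children.

    Assumes children do not overlap one another within a parent.
    """
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans
    }


# One illustrative request flowing through three services.
trace = [
    Span("1", None, "GET /checkout", 0, 120),
    Span("2", "1", "auth-service", 5, 25),
    Span("3", "1", "payment-service", 30, 110),
    Span("4", "3", "db query", 40, 90),
]
breakdown = self_time_by_span(trace)
```

In this made-up trace the database query accounts for the largest share of the 120 ms request, which is the kind of question a trace is meant to answer.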

Monitoring vs. Observability

Monitoring	Observability
Known unknowns	Unknown unknowns
Predefined dashboards	Ad-hoc exploration
Alert on thresholds	Debug novel issues
What is broken	Why it's broken

Key Concepts

Cardinality: The number of unique values a metric's labels (dimensions) can take.
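
As a small sketch of how cardinality can be counted, the functions below take the label sets observed for a single metric and report how many distinct values one label has and how many distinct label combinations (time series) exist. The label names and values are invented for the example.

```python
def label_cardinality(label_sets: list[dict[str, str]], label: str) -> int:
    """Number of distinct values observed for one label."""
    return len({labels[label] for labels in label_sets if label in labels})


def series_cardinality(label_sets: list[dict[str, str]]) -> int:
    """Number of distinct label combinations, i.e. distinct time series."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


# Hypothetical label sets recorded for a request-count metric.
observed = [
    {"method": "GET", "status": "200", "user_id": "u1"},
    {"method": "GET", "status": "200", "user_id": "u2"},
    {"method": "POST", "status": "500", "user_id": "u1"},
    {"method": "GET", "status": "200", "user_id": "u1"},  # repeat of the first
]

total_series = series_cardinality(observed)
user_values = label_cardinality(observed, "user_id")
```

Labels with unbounded values, such as a user ID, are the usual cause of cardinality growth, because every new value creates a new series.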

Sampling: Collecting a subset of the data rather than all of it.
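
One common way to sample is to decide per trace, deterministically, from a hash of the trace ID, so that every service makes the same keep-or-drop decision for a given request. The sketch below is one possible implementation under that assumption; the function name and the choice of SHA-256 are illustrative only.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all traces."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep the
    # trace if it falls below the matching fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Roughly 10% of these hypothetical trace IDs should be kept.
kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
```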

Correlation: Linking logs, metrics, and traces, typically through shared identifiers such as a trace ID.
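
Correlation can be sketched as a join on a shared identifier. The example below assumes that both log events and spans are plain dictionaries carrying a "trace_id" key; that key name and the record shapes are assumptions for illustration.

```python
from collections import defaultdict


def group_by_trace(
    logs: list[dict], spans: list[dict]
) -> dict[str, dict[str, list[dict]]]:
    """Bundle the log events and spans that share a trace_id."""
    bundles: dict[str, dict[str, list[dict]]] = defaultdict(
        lambda: {"logs": [], "spans": []}
    )
    for event in logs:
        bundles[event["trace_id"]]["logs"].append(event)
    for span in spans:
        bundles[span["trace_id"]]["spans"].append(span)
    return dict(bundles)


# Hypothetical telemetry from two requests.
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "payment declined"},
    {"trace_id": "t2", "level": "INFO", "message": "order placed"},
]
spans = [
    {"trace_id": "t1", "name": "payment-service", "duration_ms": 80},
    {"trace_id": "t1", "name": "db query", "duration_ms": 50},
]
correlated = group_by_trace(logs, spans)
```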

Context: Metadata and dimensions attached to telemetry, such as service name, region, or request ID.

Observability in AI Systems

LLM Observability

Prompt/response logging
Token usage
Latency metrics
Error tracking
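
The four signals above can be captured by wrapping each model call. The sketch below is not tied to any real LLM SDK: model_fn is a hypothetical callable that takes a prompt and returns a dictionary with "text", "prompt_tokens", and "completion_tokens" keys, and records are simply appended to a list.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LLMCallRecord:
    prompt: str
    response: Optional[str]
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    error: Optional[str]


def observed_call(
    model_fn: Callable[[str], dict], prompt: str, sink: list
) -> str:
    """Call model_fn and record prompt, response, tokens, latency, errors."""
    start = time.perf_counter()
    try:
        result = model_fn(prompt)
    except Exception as exc:
        # Record the failure, then let the caller handle the exception.
        elapsed = time.perf_counter() - start
        sink.append(LLMCallRecord(prompt, None, 0, 0, elapsed, repr(exc)))
        raise
    elapsed = time.perf_counter() - start
    sink.append(
        LLMCallRecord(
            prompt,
            result["text"],
            result["prompt_tokens"],
            result["completion_tokens"],
            elapsed,
            None,
        )
    )
    return result["text"]
```

In practice the sink would be a logger or telemetry exporter, and prompts may need redaction before they are stored.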

RAG Observability

Retrieval quality
Context relevance
Generation accuracy
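
Retrieval quality is often tracked with ranking metrics. The sketch below computes precision@k and recall@k, assuming a labeled set of relevant document IDs is available for each query and that the retrieved list contains no duplicates; the document IDs are invented. Context relevance and generation accuracy usually need human or model-based grading and are not covered here.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical retriever output and ground-truth labels for one query.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}

p_at_3 = precision_at_k(retrieved, relevant, 3)
r_at_3 = recall_at_k(retrieved, relevant, 3)
```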

Tools and Platforms

Open Source

Prometheus
Grafana
Jaeger
OpenTelemetry

Commercial

Datadog
New Relic
Splunk
Dynatrace

Best Practices

Standardize instrumentation
Use structured logging
Implement distributed tracing
Define SLIs and SLOs
Correlate signals
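
To illustrate the SLI and SLO practice, the sketch below computes an availability SLI from request counts and the share of the error budget that remains against an SLO target. The request counts and the 99.9% target are made-up numbers, and the function names are hypothetical.

```python
def availability_sli(good: int, total: int) -> float:
    """Fraction of requests that were served successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Share of the error budget still unspent; negative means overspent."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# Illustrative window: 1,000,000 requests, 500 failures, 99.9% SLO.
sli = availability_sli(good=999_500, total=1_000_000)
remaining = error_budget_remaining(
    good=999_500, total=1_000_000, slo_target=0.999
)
```

With these numbers the SLO permits about 1,000 failed requests, so 500 failures leave roughly half of the budget.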