What is Latency?
Latency is the time delay between when a request is initiated and when a response begins or completes. In AI and computing contexts, latency determines how responsive a system feels and is critical to both user experience and overall system performance.
Latency Metrics
- Time to First Byte (TTFB): time until the first byte of the response arrives.
- Time to First Token (TTFT): time until the first AI-generated token is produced.
- Total Latency: time for the complete request-response cycle.
- P50/P95/P99 Latency: percentile-based measurements; P95, for example, is the latency that 95% of requests fall under.
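These percentiles can be computed directly from measured samples. A minimal sketch in Python using only the standard library (the sample values are invented for illustration):

```python
# Compute P50/P95/P99 latency from measured samples.
# The sample values below are invented for illustration.
import statistics

samples_ms = [112, 98, 134, 510, 101, 95, 620, 108, 99, 870]

# statistics.quantiles with n=100 returns the 1st..99th percentiles.
percentiles = statistics.quantiles(samples_ms, n=100)
p50, p95, p99 = percentiles[49], percentiles[94], percentiles[98]

print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```

Averages hide outliers; P95/P99 expose the tail latency that users actually notice.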
Latency in AI Systems
LLM API Latency
- Model inference time (prompt prefill before the first token)
- Network round trip to the API endpoint
- Token generation (decode), which grows with output length
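Wrapping a streaming call with a timer separates TTFT from total latency. A sketch assuming a hypothetical `stream_completion(prompt)` generator that yields tokens as they arrive; substitute your provider's streaming client:

```python
# Measure TTFT and total latency around a streaming LLM call.
# stream_completion is a placeholder for your provider's streaming API.
import time

def measure_latency(stream_completion, prompt: str):
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream_completion(prompt):  # yields tokens as they arrive
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        tokens.append(token)
    total = time.perf_counter() - start        # total request-response time
    return ttft, total, "".join(tokens)

# Example with a fake stream standing in for a real API:
def fake_stream(prompt):
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.05)  # simulated network/decode delay
        yield tok

ttft, total, text = measure_latency(fake_stream, "hi")
print(f"TTFT={ttft*1000:.0f}ms total={total*1000:.0f}ms")
```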
RAG System Latency
- Embedding generation for the query
- Retrieval time (vector search over the index)
- LLM inference on the augmented prompt
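Since each stage adds delay, timing them separately shows where a RAG pipeline spends its budget. A sketch with hypothetical `embed`, `search`, and `generate` callables standing in for real components:

```python
# Time each stage of a RAG pipeline separately.
# embed, search, and generate are placeholders for real components.
import time

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000  # ms

def rag_answer(query, embed, search, generate):
    vector, t_embed = timed(embed, query)         # embedding generation
    docs, t_retrieve = timed(search, vector)      # vector search / retrieval
    answer, t_llm = timed(generate, query, docs)  # LLM inference
    print(f"embed={t_embed:.0f}ms retrieve={t_retrieve:.0f}ms llm={t_llm:.0f}ms")
    return answer
```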
Contributing Factors
Network
- Distance to server
- Bandwidth
- Congestion
Processing
- Model size
- Compute resources
- Batch size
Infrastructure
- Server location
- Load balancing
- Caching
Optimization Strategies
Model-Level
- Smaller models
- Quantization
- Speculative decoding
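As an example of one such technique, dynamic quantization stores weights in int8, shrinking the model and speeding up inference on CPU. A minimal PyTorch sketch on a toy model (real LLM quantization uses dedicated tooling, but the mechanics are the same):

```python
# Dynamic int8 quantization of Linear layers with PyTorch.
# The toy model stands in for a real network, just to show the mechanics.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only Linear layers
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller and faster weights
```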
Infrastructure
- Edge deployment
- CDN usage
- GPU acceleration
Architecture
- Caching
- Streaming responses
- Async processing
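Of these, caching is often the quickest win: a repeated request skips inference entirely. A minimal in-memory sketch, with `generate` as a stand-in for a slow model call:

```python
# In-memory response cache: repeated prompts skip model inference.
import functools
import time

def generate(prompt: str) -> str:
    time.sleep(0.5)  # stand-in for slow model inference
    return f"response to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    return generate(prompt)  # runs only on a cache miss

cached_generate("hello")  # ~500 ms: cache miss, real call
cached_generate("hello")  # ~0 ms: served from cache
```

Production caches usually add TTLs and size bounds, and for LLM workloads often match on embeddings (semantic caching) rather than exact prompt strings.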
Acceptable Latency Targets
| Use Case | Target |
|---|---|
| Interactive chat | < 500 ms TTFT |
| Real-time (voice, autocomplete) | < 100 ms total |
| Batch processing | Throughput matters more than latency |
| Streaming | < 200 ms TTFT |
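These targets can be enforced as a simple SLO check over measured samples. A sketch using the interactive-chat row as the threshold (the sample values are invented):

```python
# Check measured TTFT samples against the < 500 ms interactive-chat target.
import statistics

def meets_slo(ttft_samples_ms, target_ms=500, percentile=95):
    cuts = statistics.quantiles(ttft_samples_ms, n=100)  # 1st..99th percentiles
    return cuts[percentile - 1] <= target_ms

print(meets_slo([220, 310, 280, 350, 400, 240, 330, 380, 260, 300]))  # True
```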