What is Latency?
Latency is the time delay between when a request is initiated and when a response begins or completes. In AI and computing contexts, latency determines how responsive a system feels and is critical to both user experience and overall system performance.
Latency Metrics
- Time to First Byte (TTFB): time until the first byte of the response arrives.
- Time to First Token (TTFT): time until the first AI-generated token is produced.
- Total Latency: time for the complete request-response cycle.
- P50/P95/P99 Latency: percentile-based measurements; P95, for example, is the latency that 95% of requests fall under.
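These percentiles can be computed directly from measured samples. A minimal sketch in Python using only the standard library (the sample values are invented for illustration):

```python
# Compute P50/P95/P99 latency from measured samples.
# The sample values below are invented for illustration.
import statistics

samples_ms = [112, 98, 134, 510, 101, 95, 620, 108, 99, 870]

# statistics.quantiles with n=100 returns the 1st..99th percentiles.
percentiles = statistics.quantiles(samples_ms, n=100)
p50, p95, p99 = percentiles[49], percentiles[94], percentiles[98]

print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```

Averages hide outliers; P95/P99 expose the tail latency that users actually notice.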
Latency in AI Systems
LLM API Latency
- Model inference time (prompt prefill before the first token)
- Network round trip to the API endpoint
- Token generation (decode), which grows with output length
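Wrapping a streaming call with a timer separates TTFT from total latency. A sketch assuming a hypothetical `stream_completion(prompt)` generator that yields tokens as they arrive; substitute your provider's streaming client:

```python
# Measure TTFT and total latency around a streaming LLM call.
# stream_completion is a placeholder for your provider's streaming API.
import time

def measure_latency(stream_completion, prompt: str):
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream_completion(prompt):  # yields tokens as they arrive
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        tokens.append(token)
    total = time.perf_counter() - start        # total request-response time
    return ttft, total, "".join(tokens)

# Example with a fake stream standing in for a real API:
def fake_stream(prompt):
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.05)  # simulated network/decode delay
        yield tok

ttft, total, text = measure_latency(fake_stream, "hi")
print(f"TTFT={ttft*1000:.0f}ms total={total*1000:.0f}ms")
```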
RAG System Latency
- Embedding generation for the query
- Retrieval time (vector search over the index)
- LLM inference on the augmented prompt
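Since each stage adds delay, timing them separately shows where a RAG pipeline spends its budget. A sketch with hypothetical `embed`, `search`, and `generate` callables standing in for real components:

```python
# Time each stage of a RAG pipeline separately.
# embed, search, and generate are placeholders for real components.
import time

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000  # ms

def rag_answer(query, embed, search, generate):
    vector, t_embed = timed(embed, query)         # embedding generation
    docs, t_retrieve = timed(search, vector)      # vector search / retrieval
    answer, t_llm = timed(generate, query, docs)  # LLM inference
    print(f"embed={t_embed:.0f}ms retrieve={t_retrieve:.0f}ms llm={t_llm:.0f}ms")
    return answer
```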
Contributing Factors
Network
- Distance to server
- Bandwidth
- Congestion
Processing
- Model size
- Compute resources
- Batch size
Infrastructure
- Server location
- Load balancing
- Caching
Optimization Strategies
Model-Level
- Smaller models
- Quantization
- Speculative decoding
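As an example of one such technique, dynamic quantization stores weights in int8, shrinking the model and speeding up inference on CPU. A minimal PyTorch sketch on a toy model (real LLM quantization uses dedicated tooling, but the mechanics are the same):

```python
# Dynamic int8 quantization of Linear layers with PyTorch.
# The toy model stands in for a real network, just to show the mechanics.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only Linear layers
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller and faster weights
```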
Infrastructure
- Edge deployment
- CDN usage
- GPU acceleration
Architecture
- Caching
- Streaming responses
- Async processing
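Of these, caching is often the quickest win: a repeated request skips inference entirely. A minimal in-memory sketch, with `generate` as a stand-in for a slow model call:

```python
# In-memory response cache: repeated prompts skip model inference.
import functools
import time

def generate(prompt: str) -> str:
    time.sleep(0.5)  # stand-in for slow model inference
    return f"response to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    return generate(prompt)  # runs only on a cache miss

cached_generate("hello")  # ~500 ms: cache miss, real call
cached_generate("hello")  # ~0 ms: served from cache
```

Production caches usually add TTLs and size bounds, and for LLM workloads often match on embeddings (semantic caching) rather than exact prompt strings.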
Acceptable Latency Targets
| Use Case | Target |
|---|---|
| Interactive chat | < 500 ms TTFT |
| Real-time (voice, autocomplete) | < 100 ms total |
| Batch processing | Throughput matters more than latency |
| Streaming | < 200 ms TTFT |
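These targets can be enforced as a simple SLO check over measured samples. A sketch using the interactive-chat row as the threshold (the sample values are invented):

```python
# Check measured TTFT samples against the < 500 ms interactive-chat target.
import statistics

def meets_slo(ttft_samples_ms, target_ms=500, percentile=95):
    cuts = statistics.quantiles(ttft_samples_ms, n=100)  # 1st..99th percentiles
    return cuts[percentile - 1] <= target_ms

print(meets_slo([220, 310, 280, 350, 400, 240, 330, 380, 260, 300]))  # True
```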