What is Data Lineage?
Data lineage is the practice of tracking data from its origin through every transformation and movement. It shows where data comes from, how it is processed, and where it goes, which makes it essential for governance, compliance, and debugging.
Lineage Components
- Source: where data originates.
- Transformations: how data is changed.
- Destination: where data ends up.
- Metadata: context about each step.
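The four components above can be pictured as fields of a single record. The following is a minimal sketch, not a standard schema: the class name `LineageRecord`, its field names, and the dataset and job names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """One hop of lineage: where data came from, what changed, where it went."""

    source: str                      # where the data originates
    transformations: list[str]       # how the data is changed, in order
    destination: str                 # where the data ends up
    metadata: dict[str, str] = field(default_factory=dict)  # context about the step

    def describe(self) -> str:
        """Render the hop as a single readable line."""
        steps = " -> ".join(self.transformations) or "(no transformation)"
        return f"{self.source} => [{steps}] => {self.destination}"


# Hypothetical example values.
record = LineageRecord(
    source="crm.customers",
    transformations=["drop_test_accounts", "normalize_email"],
    destination="warehouse.dim_customer",
    metadata={"job": "nightly_customer_load", "owner": "data-eng"},
)
print(record.describe())
```

Keeping metadata as a free-form dictionary is one possible design choice; a real catalog would likely enforce a stricter schema.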
Types of Lineage
Technical Lineage
- Column-level tracking
- ETL jobs
- Database queries
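As an illustration of column-level tracking, the sketch below stores, for each target column, the columns it is derived from, and walks that mapping back to the original sources. The column names and the helper `trace_to_origin` are hypothetical, and the mapping is assumed to contain no cycles.

```python
# Hypothetical column-level lineage: each target column maps to the
# columns it is derived from. All names are illustrative.
column_lineage = {
    "dim_customer.full_name": ["crm.customers.first_name", "crm.customers.last_name"],
    "dim_customer.email": ["crm.customers.email"],
    "fct_orders.customer_email": ["dim_customer.email"],
}


def trace_to_origin(column: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return the original source columns that feed `column`.

    Assumes the lineage mapping is acyclic; a cycle would recurse forever.
    """
    parents = lineage.get(column)
    if not parents:
        return {column}  # no recorded parents: treat the column as an origin
    origins: set[str] = set()
    for parent in parents:
        origins |= trace_to_origin(parent, lineage)
    return origins


origins = trace_to_origin("fct_orders.customer_email", column_lineage)
print(sorted(origins))
```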
Business Lineage
- Business process flow
- Report dependencies
- KPI derivation
Benefits
Compliance
- Audit trails
- Regulatory requirements
- Data subject requests
Data Quality
- Root cause analysis
- Impact assessment
- Trust verification
Operations
- Debugging pipelines
- Change management
- Migration planning
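Root cause analysis and impact assessment are both graph traversals over lineage: the first walks upstream from a broken dataset, the second walks downstream from a dataset about to change. The sketch below assumes lineage is available as a list of (upstream, downstream) pairs; the dataset names and function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical edges: (upstream dataset, downstream dataset).
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "marts.revenue"),
    ("staging.customers", "marts.revenue"),
    ("marts.revenue", "reports.weekly_kpis"),
]


def build_adjacency(edges, reverse=False):
    """Map each dataset to its direct successors (or predecessors if reverse)."""
    graph = defaultdict(set)
    for up, down in edges:
        if reverse:
            graph[down].add(up)
        else:
            graph[up].add(down)
    return graph


def reachable(start, graph):
    """Breadth-first search: every dataset reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def downstream(dataset):
    """Impact assessment: what is affected if `dataset` changes?"""
    return reachable(dataset, build_adjacency(EDGES))


def upstream(dataset):
    """Root cause analysis: what could have caused a problem in `dataset`?"""
    return reachable(dataset, build_adjacency(EDGES, reverse=True))


print(sorted(downstream("staging.orders")))
print(sorted(upstream("marts.revenue")))
```

The same traversal supports change management and migration planning: the downstream set of a table is the list of consumers to notify or migrate.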
Implementation Approaches
Manual Documentation
- Spreadsheets, wikis
- Labor-intensive
- Often outdated
Automated Collection
- Parse code/queries
- Monitor pipelines
- Real-time updates
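To give a feel for what parsing queries involves, here is a deliberately simplified sketch that pulls source and target tables out of a single `INSERT INTO ... SELECT` or `CREATE TABLE ... AS SELECT` statement using regular expressions. This is an assumption-laden toy, not how any particular tool works: it ignores CTEs, quoted identifiers, and comments, and it can be fooled by expressions such as `EXTRACT(YEAR FROM col)`. Production collectors would rely on a real SQL parser. The function name and table names are hypothetical.

```python
import re

# Deliberately simple patterns; they cover only the statement shapes noted above.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) edges found in one SQL statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []  # a plain SELECT writes nowhere, so it yields no edges
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return [(source, target.group(1)) for source in sources]


# Hypothetical statement used only to exercise the sketch.
sql = """
INSERT INTO marts.revenue
SELECT o.customer_id, SUM(o.amount)
FROM staging.orders AS o
JOIN staging.customers AS c ON c.id = o.customer_id
GROUP BY o.customer_id
"""
edges = extract_lineage(sql)
for source, target in edges:
    print(f"{source} -> {target}")
```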
Hybrid
- Automated collection for technical lineage
- Manual curation for business context
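One way to picture the hybrid approach: automatically collected edges define which datasets exist, and hand-maintained notes are attached to them by name. The sketch below assumes both inputs are plain Python structures; the dataset names, the context fields, and the helper functions are hypothetical.

```python
# Edges as an automated collector might emit them (hypothetical names).
technical_edges = [
    ("staging.orders", "marts.revenue"),
    ("marts.revenue", "reports.weekly_kpis"),
]

# Business context maintained by hand, keyed by dataset name.
business_context = {
    "marts.revenue": {
        "owner": "Finance",
        "definition": "Recognized revenue per customer",
    },
}


def annotate(edges, context):
    """Attach manual business context to every dataset that appears in the edges."""
    datasets = sorted({name for edge in edges for name in edge})
    return {name: context.get(name, {}) for name in datasets}


def missing_context(annotated):
    """List datasets that automation found but nobody has documented yet."""
    return [name for name, info in annotated.items() if not info]


annotated = annotate(technical_edges, business_context)
print(missing_context(annotated))
```

Reporting the undocumented datasets gives the manual side of the process a concrete worklist.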
Tools
- Apache Atlas
- Collibra
- Alation
- Informatica
- dbt (data build tool)