Intent
Normalize CSV inputs into a consistent, validated structure before use.
When to use
- You ingest CSVs from multiple sources with inconsistent columns.
- You need a stable, clean export for downstream systems.
- You want a repeatable data interface that survives schema drift.
Core mechanics
- Define the canonical column set and data types.
- Normalize casing and value formats, and fill defaults for missing optional fields.
- Validate required fields and quarantine bad rows.
- Emit a clean CSV as the contract for downstream consumers (see the sketch after this list).
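A minimal sketch of these mechanics using only Python's standard csv module. The canonical column set, the helper names, and the file paths are assumptions for illustration, not a fixed implementation; adapt them to your own schema.

```python
import csv

# Hypothetical canonical schema: column name -> (type caster, required?)
CANONICAL_COLUMNS = {
    "order_id": (str, True),
    "email": (str, True),
    "amount": (float, True),
    "country": (str, False),
}

def normalize_row(raw: dict) -> dict:
    """Map a raw row onto the canonical column set, lowercasing headers
    and stripping whitespace; missing columns default to empty strings."""
    lowered = {k.strip().lower(): (v or "").strip()
               for k, v in raw.items() if k is not None}
    return {col: lowered.get(col, "") for col in CANONICAL_COLUMNS}

def validate_row(row: dict) -> list:
    """Return a list of validation errors; an empty list means the row is clean."""
    errors = []
    for col, (cast, required) in CANONICAL_COLUMNS.items():
        value = row[col]
        if required and not value:
            errors.append(f"missing required field: {col}")
        elif value:
            try:
                cast(value)
            except ValueError:
                errors.append(f"bad value for {col}: {value!r}")
    return errors

def normalize_csv(src_path, clean_path, quarantine_path):
    """Read a raw CSV, write a clean canonical export, and quarantine bad rows."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(clean_path, "w", newline="", encoding="utf-8") as clean, \
         open(quarantine_path, "w", newline="", encoding="utf-8") as bad:
        fieldnames = list(CANONICAL_COLUMNS)
        clean_writer = csv.DictWriter(clean, fieldnames=fieldnames,
                                      quoting=csv.QUOTE_ALL)
        bad_writer = csv.DictWriter(bad, fieldnames=fieldnames + ["errors"],
                                    quoting=csv.QUOTE_ALL)
        clean_writer.writeheader()
        bad_writer.writeheader()
        for raw in csv.DictReader(src):
            row = normalize_row(raw)
            errors = validate_row(row)
            if errors:
                bad_writer.writerow({**row, "errors": "; ".join(errors)})
            else:
                clean_writer.writerow(row)
```

Writing the clean export with QUOTE_ALL keeps the output format predictable regardless of how the sources quoted their fields.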
Implementation checklist
- Document canonical columns and ownership.
- Build a transform step with validation rules.
- Track schema versions alongside exports (see the sketch after this list).
- Publish a sample export for consumers.
- Log row counts and errors per run.
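One way to track the schema version alongside each export is a small JSON sidecar written next to the file. The version string, helper name, and manifest fields below are assumptions, not a required convention.

```python
import json
import os
from datetime import datetime, timezone

# Hypothetical convention: bump this string whenever the canonical
# column set or its semantics change.
SCHEMA_VERSION = "v3"

def write_manifest(export_path, row_count, error_count):
    """Write a JSON sidecar next to the export so consumers can see which
    schema version produced it and how the run went."""
    manifest = {
        "export_file": os.path.basename(export_path),
        "schema_version": SCHEMA_VERSION,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "row_count": row_count,
        "error_count": error_count,
    }
    manifest_path = export_path + ".manifest.json"
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
    return manifest_path
```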
Failure modes and mitigations
- Silent column changes -> enforce schema checks.
- Encoding or quoting issues -> normalize encoding and force consistent quoting (sketched after this list).
- Dirty or missing values -> validate and quarantine rows.
- Consumer breakage -> version exports and communicate changes.
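The schema check and the encoding fix lend themselves to small guards that run before the transform. The expected header and the assumed source encoding below are placeholders.

```python
import csv

# Hypothetical canonical header; in practice, derive it from the documented schema.
EXPECTED_HEADER = ["order_id", "email", "amount", "country"]

def check_header(src_path):
    """Fail fast when a source silently renames, drops, or adds columns."""
    with open(src_path, newline="", encoding="utf-8-sig") as fh:  # utf-8-sig drops a BOM if present
        header = [h.strip().lower() for h in next(csv.reader(fh))]
    missing = [c for c in EXPECTED_HEADER if c not in header]
    unexpected = [c for c in header if c not in EXPECTED_HEADER]
    if missing or unexpected:
        raise ValueError(f"schema drift: missing={missing}, unexpected={unexpected}")

def reencode_csv(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite a file as UTF-8 with every field quoted, so downstream
    parsers see one encoding and one quoting style."""
    with open(src_path, newline="", encoding=src_encoding, errors="replace") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in csv.reader(src):
            writer.writerow(row)
```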
Observability and validation
- Input row count versus output row count (clean rows plus quarantined rows; see the sketch after this list).
- Validation error counts with sample records.
- Schema version and export timestamp.
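A sketch of the per-run report, assuming the pipeline collects the counts and quarantined rows shown earlier; the logger name and field names are illustrative.

```python
import logging

logger = logging.getLogger("csv_normalizer")

def report_run(input_rows, clean_rows, quarantined_rows, schema_version, exported_at):
    """Log the numbers needed to validate a run: counts should reconcile,
    and a few quarantined samples make errors easier to debug."""
    quarantined_count = len(quarantined_rows)
    if input_rows != clean_rows + quarantined_count:
        logger.error("row count mismatch: input=%d clean=%d quarantined=%d",
                     input_rows, clean_rows, quarantined_count)
    logger.info("schema_version=%s exported_at=%s input=%d clean=%d quarantined=%d",
                schema_version, exported_at, input_rows, clean_rows, quarantined_count)
    for sample in quarantined_rows[:5]:  # cap samples to keep logs readable
        logger.info("quarantined sample: %r", sample)
```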
Artifacts
- Canonical column documentation.
- Sample exports and data dictionaries.