Intent
Translate human-readable labels into stable system IDs using a controlled, versioned mapping file.
When to use
- You ingest data from multiple sources with inconsistent formats.
- Downstream automation expects stable fields and identifiers.
- You need a single, documented contract for data shape and meaning.
- You must track provenance and data quality over time.
Core mechanics
- Define the canonical schema and required fields.
- Normalize inputs (types, casing, IDs, missing values).
- Validate records and quarantine invalid entries.
- Version the schema and mapping rules.
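The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the field names (`label`, `amount`, `org_id`), the `LABEL_TO_ID` table, and the `SCHEMA_VERSION` value are all hypothetical and stand in for whatever the canonical schema actually defines.

```python
from dataclasses import dataclass, field

SCHEMA_VERSION = "1.0.0"  # hypothetical version label stamped on every output record

# Hypothetical mapping table: normalized human label -> system ID.
LABEL_TO_ID = {"acme corp": "ORG-001", "globex": "ORG-002"}

REQUIRED_FIELDS = ("label", "amount")  # assumed required fields


@dataclass
class Result:
    valid: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def normalize(record: dict) -> dict:
    """Return a canonical record, or raise ValueError describing the problem."""
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            raise ValueError(f"missing required field: {name}")
    # Collapse whitespace and lowercase so label lookups ignore formatting noise.
    label = " ".join(str(record["label"]).split()).lower()
    if label not in LABEL_TO_ID:
        raise ValueError(f"unmapped label: {label!r}")
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        raise ValueError(f"amount is not numeric: {record['amount']!r}") from None
    return {
        "org_id": LABEL_TO_ID[label],
        "amount": amount,
        "schema_version": SCHEMA_VERSION,
    }


def process(records) -> Result:
    """Normalize every record; invalid ones are quarantined, never dropped silently."""
    result = Result()
    for record in records:
        try:
            result.valid.append(normalize(record))
        except ValueError as exc:
            result.quarantined.append({"record": record, "error": str(exc)})
    return result
```

Keeping the quarantined record next to its error message means a reviewer can fix the mapping table or the source data without re-running the whole batch.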
Implementation checklist
- Document the canonical schema, field meanings, and owners.
- Build a normalization step with strict validation rules.
- Create mapping tables or reference lists for IDs.
- Add fixtures and tests for edge cases and schema drift.
- Emit normalized outputs with version metadata.
- Log validation errors and publish a report that reviewers can act on.
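One way to cover the "fixtures and tests for schema drift" item is a small table of known inputs paired with the drift each one should trigger. The sketch below assumes a two-field schema (`label` as `str`, `amount` as `float`); the `EXPECTED_FIELDS` table, the fixtures, and the function names are illustrative only.

```python
# Hypothetical canonical schema: field name -> expected Python type.
EXPECTED_FIELDS = {"label": str, "amount": float}


def detect_drift(record: dict) -> dict:
    """Compare one record against the expected schema and list every difference."""
    missing = sorted(set(EXPECTED_FIELDS) - set(record))
    unexpected = sorted(set(record) - set(EXPECTED_FIELDS))
    wrong_type = sorted(
        name
        for name, expected in EXPECTED_FIELDS.items()
        # None is left to null handling; only non-null values are type-checked.
        if record.get(name) is not None and not isinstance(record[name], expected)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# Each fixture: (name, input record, drift report we expect for it).
FIXTURES = [
    ("clean", {"label": "Acme", "amount": 10.0},
     {"missing": [], "unexpected": [], "wrong_type": []}),
    ("renamed column", {"name": "Acme", "amount": 10.0},
     {"missing": ["label"], "unexpected": ["name"], "wrong_type": []}),
    ("type change", {"label": "Acme", "amount": "10"},
     {"missing": [], "unexpected": [], "wrong_type": ["amount"]}),
]


def run_fixtures() -> list:
    """Return the names of fixtures whose drift report differs from the expected one."""
    return [name for name, record, expected in FIXTURES
            if detect_drift(record) != expected]
```

An empty list from `run_fixtures()` means every fixture behaved as documented; a new upstream format can be added as one more fixture row before the normalization code is changed.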
Failure modes and mitigations
- Schema drift causes silent breakage -> enforce schema checks.
- Incorrect mapping IDs -> validate against authoritative lists.
- Partial data overwrites good records -> define precedence rules.
- Hidden nulls or blanks (empty strings, whitespace, placeholders such as "N/A") -> convert them to one explicit null value during normalization.
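The precedence and null-handling mitigations can be combined in a single merge step. The following sketch assumes a record-level source ranking and a fixed set of null placeholders; the source names in `PRECEDENCE` and the tokens in `NULL_TOKENS` are hypothetical and would need to match the real feeds.

```python
# Hypothetical source ranking: a lower number means a more trusted source.
PRECEDENCE = {"crm": 0, "billing": 1, "import_csv": 2}

# Assumed placeholders that should be treated as "no value".
NULL_TOKENS = {"", "null", "none", "n/a", "na"}


def to_explicit_null(value):
    """Map blank strings and null placeholders to None; leave other values alone."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


def merge(existing: dict, incoming: dict) -> dict:
    """Merge two records shaped like {"source": str, "fields": dict}.

    A null incoming value never overwrites anything. A non-null incoming value
    fills a gap, and replaces an existing value only when the incoming source
    ranks at least as high as the existing one.
    """
    incoming_wins = PRECEDENCE[incoming["source"]] <= PRECEDENCE[existing["source"]]
    merged = {}
    for name in sorted(existing["fields"].keys() | incoming["fields"].keys()):
        old = to_explicit_null(existing["fields"].get(name))
        new = to_explicit_null(incoming["fields"].get(name))
        if new is None:
            merged[name] = old
        elif old is None or incoming_wins:
            merged[name] = new
        else:
            merged[name] = old
    return merged
```

With this rule a partial CSV import can add a missing phone number to a CRM record but cannot blank out or replace the CRM's email address.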
Observability and validation
- Row counts before and after normalization.
- Validation error counts and example records.
- Mapping coverage: percentage of input labels that resolve to a system ID.
- Schema version used per run.
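The metrics above fit naturally into one report object emitted per run. This is a sketch under assumed inputs: `errors` is taken to be a list of dicts with a `reason` key, and `labels_seen` is the list of normalized labels encountered in the input; neither shape comes from the original text.

```python
from collections import Counter


def build_run_report(rows_in, rows_out, errors, labels_seen, mapping,
                     schema_version, max_examples=3):
    """Summarize one normalization run as a plain dict that can be logged or stored."""
    error_counts = Counter(error["reason"] for error in errors)
    mapped = sum(1 for label in labels_seen if label in mapping)
    # Coverage is undefined when no labels were seen, so report None rather than 0.
    coverage = 100.0 * mapped / len(labels_seen) if labels_seen else None
    return {
        "schema_version": schema_version,
        "rows_in": rows_in,
        "rows_out": rows_out,
        "rows_dropped": rows_in - rows_out,
        "error_counts": dict(error_counts),
        "error_examples": errors[:max_examples],
        "mapping_coverage_pct": coverage,
    }
```

Comparing `rows_dropped` and `mapping_coverage_pct` across runs is a cheap way to notice a new upstream label or a changed export format before downstream automation does.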
Artifacts
- Canonical schema documentation.
- Reference mapping tables or ID catalogs.
- Sample normalized output files.