# Plan 009 (2026-02-21): Data Quality Pipeline

## Overview
Implement a clean, auditable data pipeline that replaces direct xlsx → DuckDB loading with a two-step process: extract xlsx sheets to canonical CSV files, then load the CSVs into DuckDB. Add a validation step between load and transform. Add `is_valid` flagging in dbt to preserve bad rows while keeping marts clean. Document every source sheet in `@plan/audit/` as a snapshot reference.
The key problems being solved:

- `raw_targets` is completely broken (702 auto-named columns, 400k garbage rows) due to a wrong header row
- `raw_students` is 20× bloated (201k rows, should be ~10k) due to compute/summary sheets being included
- 2023 LAPORAN has a 2-row merged header — sub-columns (`PROGRAM`, `INTAKE`, `NAMA SISWA`, `JUMLAH`) sit under a `PENDAPATAN` group header in row 6; the actual column names are in row 7
- Invalid `INTAKE` years (e.g. `SEPTEMBER 2032`) pass silently into marts
- `stg_marketing_activity` is a dead model — `raw_marketing_activity` is loaded but nothing downstream uses it
- No documented record of which sheets exist, which are loaded, and which are excluded
Depends on: Plan 008 (Package Restructure) — complete. All work targets `@packages/pipeline` and `@python/analytics`.

Audit: T-001 and T-002 (audit documentation) completed in planning session 2026-02-21. Findings embedded in Tasks.md and `@plan/audit/`.
## Goals

- Implement extract step in `@packages/pipeline`: SheetJS-based xlsx → `@source/clean/*.csv` for all configured sheets across all 4 years (2023–2026)
- Implement load step to read from `@source/clean/*.csv` into DuckDB `raw_*` tables (replacing direct xlsx reading)
- Implement validate step: SQL checks on `raw_*` tables post-load, output to `output/validation/*.json`
- Fix all `@source/config/*.yaml` files: correct header rows, add `merge_header_rows` for 2023 transactions only, remove all `marketing_activity` entries
- Remove `stg_marketing_activity.sql` and its source declaration — it is a dead model with no downstream dependencies
- Add `is_valid` + `invalid_reason` columns to `int_orders` in `@python/analytics`
- Add `audit_flagged_orders` dbt model (view) exposing all invalid rows
## Non-Goals

- Implementing a pipeline orchestrator (`run` sub-command) — stubs are fine for now
- Any EPN entity support
- Real-time or streaming data
- Connecting validation output to a dashboard or alert system
- Historical backfill of flagged data — flagging is forward-looking from this plan
- Extracting REKAP (marketing summary) sheets to CSV — these are derived summaries used for manual validation only, not source data
- Refactoring `stg_*` staging SQL to use new English column names — future plan
## Phases

- Phase 1: Audit docs — complete (done in planning session 2026-02-21)
- Phase 2: YAML config fixes — correct header rows, remove marketing_activity, add organizations entries
- Phase 3: Extract step — SheetJS-based xlsx → CSV for all years, all file types
- Phase 4: Load step — refactor load to read from `@source/clean/` CSVs; remove marketing_activity load block
- Phase 5: Validate step — SQL checks on `raw_*` tables, JSON report output
- Phase 6: dbt quality — remove `stg_marketing_activity`; add `is_valid` flagging; add `audit_flagged_orders`
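For Phase 2, the 2023 transactions config entry could look roughly like this (a sketch only: `merge_header_rows` and the row numbers come from this plan, but the surrounding key names such as `transactions` and `header_row` are illustrative assumptions, not the actual schema of `@source/config/*.yaml`):

```yaml
# Sketch of a 2023 config entry (key names other than merge_header_rows are assumptions).
transactions:
  header_row: 7                # actual column names live in row 7
  merge_header_rows: [6, 7]   # row 6 = PENDAPATAN group header, row 7 = sub-columns
```

2024–2026 LAPORAN entries would carry only the flat `header_row`, with no `merge_header_rows` key.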
## Success

- `@plan/audit/` contains sheet-index.md and 8 per-file snapshot docs — done in planning session
- All `@source/config/*.yaml` files updated: correct header rows, no marketing_activity entries, organizations entries for 2024–2026
- `pnpm run extract -- --entity IONS --year 2026` produces CSV files in `@source/clean/` for all 2026 sheets
- `@source/clean/` contains committed CSVs for all years (2023–2026): 48 transactions + 48 students + 1 targets + 3 organizations = 100 files
- All CSVs use canonical English column names; all years have the same columns (absent = empty string)
- `pnpm run sync -- --entity IONS --year 2026` loads from `@source/clean/` CSVs into `raw_*` tables
- `raw_targets` has correct column names and ~55 rows (not 199k)
- `raw_students` row count is ≤ 15,000 per year (not 201k)
- `raw_marketing_activity` table no longer created or loaded
- `stg_marketing_activity.sql` deleted; `uv run dbt run` succeeds without it
- `pnpm run validate -- --entity IONS --year 2026` produces `output/validation/ions-2026-validation.json`
- `int_orders` has `is_valid` and `invalid_reason` columns
- `int_enrollments` and all mart models filter to `is_valid = true`
- `mart_channel_marketing` still works (it reads `int_enrollments`, not `stg_marketing_activity`)
- `uv run dbt test` passes all schema tests
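The validation JSON named in the criteria above could take a shape like the following. This is a minimal sketch: only the output path and the per-check idea come from this plan; every field name here is an assumption.

```typescript
// Sketch of the structure behind output/validation/ions-2026-validation.json.
// All field names are illustrative assumptions, not a confirmed schema.
interface ValidationCheck {
  name: string;      // e.g. "raw_targets_row_count"
  passed: boolean;
  details?: string;  // human-readable failure context
}

interface ValidationReport {
  entity: string;      // "IONS"
  year: number;        // 2026
  generatedAt: string; // ISO timestamp
  checks: ValidationCheck[];
}

// Summarize a report for CLI output.
function summarize(report: ValidationReport): string {
  const failed = report.checks.filter((c) => !c.passed).length;
  return `${report.checks.length - failed}/${report.checks.length} checks passed`;
}
```

A design like this keeps each SQL check independently reportable, so one failing table does not hide the others.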
## Requirements

- Plan 008 completed: `@packages/pipeline` and `@python/analytics` exist
- `@source/raw/` contains all 8 xlsx files
- `@source/clean/` directory exists with `.gitkeep`
- `xlsx` (SheetJS) available via `catalog:util` in pipeline dependencies
- Python/uv available for dbt verification
## Context

### Why This Approach

- CSV intermediary layer: xlsx files have structural problems (merged headers, wrong header rows, hidden rows) that DuckDB's `read_xlsx` cannot handle correctly. Extracting to CSV in TypeScript with SheetJS gives full control over header detection, row skipping, hidden-row handling, and column mapping before data enters DuckDB.
- CSVs committed to git (frozen snapshots): provides a permanent, diffable audit trail. A new extraction only happens intentionally, not automatically on every sync run. This separates "re-extract from source" (rare) from "re-load into DuckDB" (common).
- Canonical English column names in CSVs: all years produce the same columns. Absent columns for a year are empty strings. This eliminates per-year column aliasing in dbt staging.
- `is_valid` flag, not hard exclusion: invalid rows are retained in `int_orders` with a flag. Full audit trail. Marts filter to `is_valid = true`.
- Removing `stg_marketing_activity`: it is a dead model — nothing downstream reads it. `mart_channel_marketing` reads from `int_enrollments` (payment channel breakdown from transaction data), not from REKAP-derived data. Removing it eliminates confusion and a broken dependency.
- REKAP sheets excluded from extraction: REKAP sheets are Excel-computed summaries derived from the monthly transaction sheets. Every number in them is computable from the transaction CSVs we already extract. They serve as manual validation reference only.
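The `is_valid` flagging could look roughly like this in `int_orders`. This is a sketch only: the upstream ref name, the `intake` column, and the single future-year check are assumptions for illustration, not the actual model.

```sql
-- Sketch of is_valid flagging (upstream ref and column names are assumptions).
with checked as (
    select
        *,
        case
            when try_cast(regexp_extract(intake, '[0-9]{4}') as integer)
                 > extract(year from current_date) + 1
                then 'intake year in the future'
        end as invalid_reason
    from {{ ref('stg_transactions') }}
)
select
    *,
    invalid_reason is null as is_valid
from checked
```

`audit_flagged_orders` then reduces to a view selecting `where not is_valid`, while marts filter to `is_valid = true`.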
### Key Constraints

- Only 2023 LAPORAN has a 2-row merged header: row 6 = `PENDAPATAN` group (spans cols E–H), row 7 = `PROGRAM`, `INTAKE`, `NAMA SISWA`, `JUMLAH`. 2024, 2025, and 2026 LAPORAN all have flat single-row headers at row 6. Only 2023 transactions needs `merge_header_rows: [6, 7]`.
- 2026 targets has a 2-row header at rows 10–11: row 10 = outer labels, row 11 = sub-labels (`SISWA STANDAR` under `TARGET` and `REALISASI`). Needs `merge_header_rows: [10, 11]`. Data starts at row 12. Rows 1–8 are a letter salutation — skip entirely.
- Targets exists only in 2026: the `EVALUASI DATA SISWA` sheet is absent in 2023, 2024, and 2025.
- Organizations (REFERENSI) exists only in 2024–2026: absent in 2023.
- Hidden rows must be ignored: xlsx files use Excel autoFilter extensively. 747 of 847 rows in JANUARI 2026 are marked hidden. Both SheetJS and openpyxl read ALL rows regardless of hidden state — this is correct. Never filter by `row_dimensions.hidden`.
- Targets CSV named by period: `EVALUASI DATA SISWA` row 9 contains `"DATA BULAN FEBRUARI 2026"`. Extract derives the month number from this string. Filename: `ions-2026-02-targets.csv`.
- `@source/clean/` CSVs are frozen: extract must NOT overwrite existing CSVs unless `--overwrite` is passed.
- 2023 students payment columns swapped: `CARA BAYAR` and `POLA PEMBAYARAN` are in reversed positions compared to 2024–2026. The YAML column map already handles this — `payment_method: CARA BAYAR`, `payment_pattern: POLA PEMBAYARAN` in all years, but the Indonesian values differ by year.
- 2026 DATABASE SISWA `SEPTEMBER25` typo: the September data sheet is misnamed `SEPTEMBER25` (not `SEPTEMBER26`). Config must reference this exact name.
- ICC = IONs Culinary College, ICA = IONs Culinary Academy: both appear as `unit` values in transaction data. Treated as separate units.
- `stg_marketing_activity` uses `adapter.get_columns_in_relation`: if the `raw_marketing_activity` table is removed from DuckDB without deleting the model file, dbt will error. Both must be removed together.
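The merged-header flattening implied by `merge_header_rows` could be sketched as "last non-empty row wins" per column, which matches the 2023 LAPORAN case where the real names live in row 7. This is an assumed semantics, not the final implementation; the 2026 targets case (two `SISWA STANDAR` sub-labels) would additionally need the group label prefixed to disambiguate duplicates.

```typescript
// Sketch: flatten merged header rows into one name per column.
// Assumption: the lowest non-empty header cell in each column is the real name.
function flattenHeaderRows(rows: (string | null | undefined)[][]): string[] {
  const width = Math.max(...rows.map((r) => r.length));
  const names: string[] = [];
  for (let col = 0; col < width; col++) {
    let name = "";
    for (const row of rows) {
      const cell = (row[col] ?? "").toString().trim();
      if (cell) name = cell; // later (lower) header rows win
    }
    names.push(name);
  }
  return names;
}
```

For 2023 this turns the `PENDAPATAN` group row plus the sub-column row into the flat `PROGRAM`, `INTAKE`, `NAMA SISWA`, `JUMLAH` names before the canonical English mapping is applied.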
### Edge Cases

- Empty monthly sheets: future months in 2025/2026 have 0 data rows but stubs exist. Extract produces a header-only CSV. Load skips empty CSVs gracefully (logs a warning, does not fail).
- Targets period detection: if row 9 of `EVALUASI DATA SISWA` does not contain a parseable month/year, fall back to the CLI `--year` argument and log a warning.
- Organizations matrix layout: REFERENSI is not a flat table. It's a pivot matrix: age cluster labels in col B, organization category headers (`SMA`, `SMP`, `TK-SD`, `UNIVERSITAS`, `LAIN-LAIN`) in row 3 cols D–H, values in the matrix cells. `readReferensiSheet` must unpivot to 3 columns: `organization`, `category`, `age_cluster`.
- `birth_place`/`birth_date` split: the `T/T/L` format is `"City, DD Month YYYY"`. Split at the first comma: before = `birth_place`, after (trimmed) = `birth_date`. If no comma is found, put the full value in `birth_place_date` only and leave `birth_place` and `birth_date` empty.
- `N0` column in REKAP (not extracted): the first column is `N0` with a zero, not `NO` with an O. Not relevant since REKAP is excluded.
### Tradeoffs

- SheetJS vs openpyxl for extraction: SheetJS chosen for language consistency with `@packages/pipeline`. Both read hidden rows by default (correct behavior).
- 100 CSV files committed to git: adds ~5–15 MB to the repo. Acceptable — provides full audit trail and diff visibility when source data changes.
- Removing `stg_marketing_activity` entirely vs keeping it as a stub: removing is cleaner. The REKAP data it was trying to load was always garbage (wrong structure). If REKAP analysis is needed in the future, it warrants a proper dedicated model with the stateful section reader.
- Organizations as 3 separate year CSVs (not combined): allows year-over-year comparison of the lookup list. dbt can `UNION ALL` them if needed.
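The REFERENSI unpivot described under Edge Cases could be sketched as follows. The cell addressing is simplified: real code would read the category headers (row 3, cols D–H) and the col-B age-cluster labels via SheetJS first, and the function name mirrors the `readReferensiSheet` helper this plan names.

```typescript
// Sketch: unpivot the REFERENSI matrix (categories across, age clusters down)
// into flat { organization, category, age_cluster } records.
interface OrgRecord {
  organization: string;
  category: string;    // e.g. SMA, SMP, TK-SD, UNIVERSITAS, LAIN-LAIN
  age_cluster: string; // label from col B
}

function unpivotReferensi(
  categories: string[],
  rows: { ageCluster: string; values: (string | null)[] }[],
): OrgRecord[] {
  const out: OrgRecord[] = [];
  for (const row of rows) {
    row.values.forEach((cell, i) => {
      if (cell && cell.trim()) {
        // Empty matrix cells simply produce no record.
        out.push({
          organization: cell.trim(),
          category: categories[i],
          age_cluster: row.ageCluster,
        });
      }
    });
  }
  return out;
}
```

Each year's output lands in its own `ions-{YYYY}-organizations.csv`, keeping the per-year files independently diffable.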
## Skills

- `plan` — plan file format and conventions
- `vitest` — additional tests for extract and validate logic
## Boundaries

- Always: Update Progress.md after each task completion
- Always: Verify `pnpm test:type` passes after any TypeScript change
- Always: Run `uv run dbt run && uv run dbt test` after any dbt model change
- Always: The `--overwrite` flag must be explicitly passed to re-extract an existing CSV — never silently overwrite
- Always: Read ALL rows from xlsx sheets — never filter by hidden row state
- Ask first: Any change to existing mart SQL beyond adding the `is_valid` filter to `int_enrollments`
- Ask first: Adding new dbt models beyond `audit_flagged_orders`
- Ask first: Changing `stg_*` SQL logic (staging refactor is future scope)
- Never: Delete or overwrite `@source/clean/` CSVs without the explicit `--overwrite` flag
- Never: Change `@source/raw/` xlsx files
- Never: Hardcode year numbers in extract/load/validate TypeScript — always read from config or CLI args
- Never: Extract REKAP sheets — they are excluded by design
## Questions

- Should extract produce one CSV per sheet? → Yes. Naming: `ions-{YYYY}-{MM}-{title}.csv`. Organizations: `ions-{YYYY}-organizations.csv` (no month).
- `month_intake` vs `intake`? → Use `intake` only. The 2026 `BULAN` column is dropped; `INTAKE` maps to `intake` for all years.
- Keep `birth_place_date` raw? → Yes. Also split into `birth_place` and `birth_date`. Leave the split fields blank if the format is inconsistent.
- Should REKAP sheets be extracted? → No. Derived summaries, used for manual validation only.
- Should `stg_marketing_activity` be removed? → Yes. Dead model, nothing downstream reads it.
- Extract all 12 months regardless of whether they have data? → Yes. Future months produce header-only CSVs.
- Organizations: separate CSV per year? → Yes. 2024, 2025, 2026 each get their own file.
- Hidden rows: read all or only visible? → Read all. Excel filters are display-only.
- Targets filename? → `ions-{YYYY}-{MM}-targets.csv` where MM is derived from row 9 content.
- `marketing_activity` column name collision? → Not relevant — REKAP excluded entirely.
- 2023/2024/2025 `merge_header_rows`? → Only 2023 transactions. 2024/2025 LAPORAN already have flat single-row headers.
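The targets-filename answer above depends on parsing the Indonesian month out of the row-9 label (`"DATA BULAN FEBRUARI 2026"`). A sketch, returning `null` so the caller can apply the CLI fallback described under Edge Cases:

```typescript
// Sketch: derive the zero-padded month number from the row-9 period label.
// Indonesian month names in calendar order; first match wins.
const BULAN = [
  "JANUARI", "FEBRUARI", "MARET", "APRIL", "MEI", "JUNI",
  "JULI", "AGUSTUS", "SEPTEMBER", "OKTOBER", "NOVEMBER", "DESEMBER",
];

function monthFromPeriodLabel(label: string): string | null {
  const upper = label.toUpperCase();
  const index = BULAN.findIndex((name) => upper.includes(name));
  // null signals "not parseable": fall back to CLI args and log a warning.
  return index === -1 ? null : String(index + 1).padStart(2, "0");
}
```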