feat: RFC 4180 compliant CSV parsing (closes #3) #12
- Add src/csv.zig: streaming single-pass state-machine CSV parser
  - State machine: field_start, unquoted, quoted, quote_saw
  - Handles quoted fields with embedded commas
  - Decodes escaped double-quotes ("") to "
  - Supports embedded newlines (\n, \r\n) inside quoted fields
  - Handles both \r\n and \n record terminators
  - Preserves existing behaviour for unquoted fields
  - 10 unit tests covering all acceptance criteria
- Update src/main.zig to use csv.CsvReader for header and data rows
  - Replaces readUntilDelimiterOrEofAlloc + splitScalar with csvReader
  - Preserves BOM stripping and column name trimming
  - No full-input buffering (streaming, byte-by-byte)
- Update build.zig: add 'zig build unit-test' step for csv.zig tests
Pull request overview
Adds a streaming, single-pass RFC 4180 CSV parser to support quoted fields (including embedded commas/newlines) and wires it into sql-pipe’s stdin ingestion path, with dedicated unit test and CI coverage.
Changes:
- Introduce `src/csv.zig`: streaming CSV state machine (`nextRecord`/`freeRecord`) + unit tests.
- Switch `src/main.zig` from line-splitting to the new CSV reader for header + data rows (preserving BOM stripping and header trimming intent).
- Add a `zig build unit-test` build step and run it in CI.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/main.zig` | Uses the new CSV reader for header and rows; keeps BOM stripping and column trimming logic. |
| `src/csv.zig` | New streaming RFC 4180 parser implementation and unit tests. |
| `build.zig` | Adds a dedicated unit-test build step for `src/csv.zig`. |
| `.github/workflows/ci.yml` | Runs `zig build unit-test` in the CI matrix. |
Comments suppressed due to low confidence (1)
src/main.zig:9
- The SQLITE_STATIC safety comment references the row buffer being freed via `allocator.free`, but rows are now freed via `csv_reader.freeRecord(record)`. Update this comment so it accurately reflects the lifetime guarantee (memory must remain valid through `sqlite3_step`).
```zig
const csv = @import("csv.zig");

// SQLITE_STATIC (null): SQLite assumes the memory is constant and won't free it.
// Safe here because sqlite3_step is called immediately after binding,
// before the row buffer is freed by the deferred allocator.free.
```
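Per the note above, a corrected comment might read as follows (wording is a suggestion, not taken from the diff):

```zig
// SQLITE_STATIC (null): SQLite assumes the memory is constant and won't free it.
// Safe here because sqlite3_step is called immediately after binding, before
// the record is released by the deferred csv_reader.freeRecord(record).
```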
```zig
if (header_record.len > 0 and std.mem.startsWith(u8, header_record[0], bom)) {
    const without_bom = header_record[0][bom.len..];
    allocator.free(header_record[0]);
    header_record[0] = try allocator.dupe(u8, without_bom);
}
```
`header_record` is declared as `const` but is later mutated (`header_record[0] = ...`) when stripping the BOM. This won't compile in Zig; make `header_record` a `var` (or assign the BOM-stripped value to a separate variable) before modifying the slice element.
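One way to apply the fix, sketched under the assumption that `nextRecord` returns a record of allocated `[]u8` fields. Note that the duplication happens before the free here, which also avoids the read-from-freed-memory in the original snippet (`without_bom` points into the buffer being freed):

```zig
if (header_record.len > 0 and std.mem.startsWith(u8, header_record[0], bom)) {
    // Duplicate first: the BOM-stripped slice aliases the old buffer.
    const without_bom = try allocator.dupe(u8, header_record[0][bom.len..]);
    allocator.free(header_record[0]);
    header_record[0] = without_bom;
}
```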
```zig
    else
        line;
if (row.len == 0) continue;
if (record.len == 0) continue;
```
This check no longer skips completely blank input lines. The previous line-based parser skipped `row.len == 0`; the CSV reader will return a 1-field record containing an empty string for an empty line, so `record.len == 0` will never be true. If blank lines should be ignored (as before), update the condition to detect an empty record (e.g., a single empty field) before binding/insert.
Suggested change:

```zig
// Skip completely blank input lines (represented as a single empty field).
if (record.len == 1 and record[0].len == 0) continue;
```
```zig
field.deinit(self.allocator);
for (fields.items) |f| self.allocator.free(f);
fields.deinit(self.allocator);
```
In the `EndOfStream` branch for `state == .quoted`, the code manually deinitializes/frees `field` and `fields` and then returns an error. Because `fields`/`field` already have `errdefer` cleanup, this path will double-free. Remove the manual cleanup here and rely on `errdefer` (or null out/reset the lists before returning the error).
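A sketch of the cleanup pattern the comment suggests, assuming unmanaged lists as in the diff (names taken from the quoted code): register `errdefer` once where the lists are created, and let the error return trigger it.

```zig
var fields: std.ArrayListUnmanaged([]u8) = .{};
errdefer {
    for (fields.items) |f| self.allocator.free(f);
    fields.deinit(self.allocator);
}
var field: std.ArrayListUnmanaged(u8) = .{};
errdefer field.deinit(self.allocator);

// ... byte loop ...
// On EOF while still inside a quoted field, just return the error.
// The errdefers above run exactly once on this path, so no manual
// cleanup (and no double-free) is needed here:
return error.UnterminatedQuotedField;
```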
```zig
while (true) {
    const byte = self.reader.readByte() catch |err| switch (err) {
        error.EndOfStream => {
```
`nextRecord` consumes input via `readByte()` in a tight loop. If the underlying reader isn't buffered (e.g. `std.fs.File.Reader`), this can cause many small reads and a large performance regression on big inputs. Consider documenting that callers should pass a buffered reader, or wrap the provided reader internally with `std.io.bufferedReader` when appropriate.
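For example, a call site could wrap an unbuffered file reader before constructing the parser. This sketch assumes a `csv.csvReader(allocator, reader)` constructor shape (the exact signature is not shown in this diff) and uses `std.io.bufferedReader` from the standard library:

```zig
const file = try std.fs.cwd().openFile("data.csv", .{});
defer file.close();

// readByte() now hits an in-memory buffer instead of issuing a syscall per byte.
var buffered = std.io.bufferedReader(file.reader());
var parser = csv.csvReader(allocator, buffered.reader());
```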
```zig
pub const CsvError = error{
    /// A quoted field was opened but no closing double-quote was found before EOF.
    UnterminatedQuotedField,
};
```
`CsvError.UnterminatedQuotedField` is part of the public surface and `nextRecord` can return it, but there's no unit test asserting this error is raised for an unterminated quoted field. Adding a test for this helps prevent regressions in the EOF/state handling.
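A regression test along these lines would cover it (the `csvReader` constructor name is assumed from the PR description):

```zig
test "unterminated quoted field returns error" {
    const input = "\"abc"; // opening quote, EOF before the closing quote
    var stream = std.io.fixedBufferStream(input);
    var reader = csvReader(std.testing.allocator, stream.reader());
    try std.testing.expectError(error.UnterminatedQuotedField, reader.nextRecord());
}
```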
Summary
Implements full RFC 4180 CSV parsing as a streaming, single-pass state machine. Closes #3.
Changes
- `src/csv.zig` (new): streaming single-pass CSV parser using a 4-state machine (`field_start`, `unquoted`, `quoted`, `quote_saw`):
  - Handles quoted fields with embedded commas
  - Decodes escaped double-quotes (`""` → `"`)
  - Supports embedded newlines (`\n`, `\r\n`) inside quoted fields
  - Handles both `\r\n` and `\n` record terminators
- `src/main.zig`: replaces `readUntilDelimiterOrEofAlloc` + `splitScalar` with `csv.csvReader` for header and data rows
- `build.zig`: adds a `zig build unit-test` step targeting `src/csv.zig`
- `.github/workflows/ci.yml`: runs the `Unit tests (csv.zig)` step (on all 3 OS matrix targets)

Acceptance Criteria
- Fields quoted with `"..."` are parsed correctly as a single value
- Escaped double-quotes (`""`) inside quoted fields are decoded to `"`
- Quoted fields containing newlines (`\n`) are read as a single multi-line value
- Covered by unit tests (`zig build unit-test`)