
feat: RFC 4180 compliant CSV parsing (closes #3) #12

Merged
vmvarela merged 2 commits into master from feature/issue-3-rfc4180-csv on Mar 2, 2026

Conversation

@vmvarela (Owner) commented Mar 2, 2026

Summary

Implements full RFC 4180 CSV parsing as a streaming, single-pass state machine. Closes #3.

Changes

src/csv.zig (new)

Streaming single-pass CSV parser using a 4-state machine (field_start, unquoted, quoted, quote_saw):

  • Quoted fields with embedded commas
  • Escaped double-quotes ("" decoded to ")
  • Embedded newlines (\n, \r\n) inside quoted fields
  • Both \r\n and \n record terminators
  • No full-input buffering — byte-by-byte streaming, safe for stdin
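
For illustration, the four-state machine above can be sketched in Python. The state names mirror the PR's Zig parser, but `next_record` and all details here are an approximation for explanatory purposes, not the actual src/csv.zig code:

```python
def next_record(stream):
    """Read one CSV record from an iterator of bytes; return None at EOF.

    Illustrative sketch of the field_start/unquoted/quoted/quote_saw
    state machine; error handling beyond unterminated quotes is omitted.
    """
    fields, field = [], []
    state = "field_start"
    for byte in stream:
        ch = chr(byte)
        if state == "quoted":
            if ch == '"':
                state = "quote_saw"       # closing quote, or first half of ""
            else:
                field.append(ch)          # commas and newlines are literal here
        elif state == "quote_saw":
            if ch == '"':
                field.append('"')         # "" decodes to a single "
                state = "quoted"
            elif ch == ',':
                fields.append("".join(field)); field = []
                state = "field_start"
            elif ch == '\n':
                fields.append("".join(field))
                return fields
            # the '\r' of a \r\n terminator is simply skipped
        else:  # field_start or unquoted
            if ch == '"' and state == "field_start":
                state = "quoted"
            elif ch == ',':
                fields.append("".join(field)); field = []
                state = "field_start"
            elif ch == '\n':
                if field and field[-1] == '\r':
                    field.pop()           # strip the \r of \r\n
                fields.append("".join(field))
                return fields
            else:
                field.append(ch)
                state = "unquoted"
    if state == "quoted":
        raise ValueError("unterminated quoted field")
    if fields or field:                   # final record lacking a trailing newline
        fields.append("".join(field))
        return fields
    return None
```

For example, `next_record(iter(b'"a,b","c""d"\n'))` yields `['a,b', 'c"d']`: the embedded comma stays inside the first field and the escaped quote is decoded.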

src/main.zig

  • Replaced readUntilDelimiterOrEofAlloc + splitScalar with csv.csvReader for header and data rows
  • BOM stripping and column name trimming preserved
  • Backward-compatible for all unquoted CSV inputs

build.zig

  • Added zig build unit-test step targeting src/csv.zig

.github/workflows/ci.yml

  • Added a "Unit tests (csv.zig)" step (runs on all 3 OS matrix targets)

Acceptance Criteria

  • Fields enclosed in "..." are parsed correctly as a single value
  • Embedded commas inside quoted fields are not treated as delimiters
  • Escaped double-quotes ("") inside quoted fields are decoded to "
  • Fields with embedded newlines (\n) are read as a single multi-line value
  • Existing behaviour for unquoted fields is unchanged
  • 10 unit tests covering edge cases (zig build unit-test)
  • Parser is single-pass and streaming — no buffering the full input
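
These behaviours can be spot-checked against Python's standard csv module, which also implements RFC 4180 quoting (shown here only as an independent reference, not as part of this PR):

```python
import csv
import io

# One record with an embedded comma, an escaped quote ("") and an
# embedded newline, all inside quoted fields, with \r\n terminators.
data = 'name,notes\r\n"Smith, J.","said ""hi""\nthen left"\r\n'
rows = list(csv.reader(io.StringIO(data)))

assert rows[0] == ['name', 'notes']
# Embedded comma is not a delimiter; "" decodes to "; the newline survives:
assert rows[1] == ['Smith, J.', 'said "hi"\nthen left']
```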

vmvarela added 2 commits March 2, 2026 13:40
- Add src/csv.zig: streaming single-pass state-machine CSV parser
  - State machine: field_start, unquoted, quoted, quote_saw
  - Handles quoted fields with embedded commas
  - Decodes escaped double-quotes ("") to "
  - Supports embedded newlines (\n, \r\n) inside quoted fields
  - Handles both \r\n and \n record terminators
  - Preserves existing behaviour for unquoted fields
  - 10 unit tests covering all acceptance criteria

- Update src/main.zig to use csv.CsvReader for header and data rows
  - Replaces readUntilDelimiterOrEofAlloc + splitScalar with csvReader
  - Preserves BOM stripping and column name trimming
  - No full-input buffering (streaming, byte-by-byte)

- Update build.zig: add 'zig build unit-test' step for csv.zig tests
Copilot AI review requested due to automatic review settings March 2, 2026 12:43
@github-actions bot added the type:feature (New functionality) and type:chore (Maintenance, refactoring, tooling) labels on Mar 2, 2026
@vmvarela vmvarela merged commit b48b8fd into master Mar 2, 2026
8 checks passed
@vmvarela vmvarela deleted the feature/issue-3-rfc4180-csv branch March 2, 2026 12:47
Copilot AI (Contributor) left a comment
Pull request overview

Adds a streaming, single-pass RFC 4180 CSV parser to support quoted fields (including embedded commas/newlines) and wires it into sql-pipe’s stdin ingestion path, with dedicated unit test and CI coverage.

Changes:

  • Introduce src/csv.zig streaming CSV state machine (nextRecord / freeRecord) + unit tests.
  • Switch src/main.zig from line-splitting to the new CSV reader for header + data rows (preserving BOM stripping and header trimming intent).
  • Add a zig build unit-test build step and run it in CI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

src/main.zig: Uses the new CSV reader for header and rows; keeps BOM stripping and column trimming logic.
src/csv.zig: New streaming RFC 4180 parser implementation and unit tests.
build.zig: Adds a dedicated unit-test build step for src/csv.zig.
.github/workflows/ci.yml: Runs zig build unit-test in the CI matrix.
Comments suppressed due to low confidence (1)

src/main.zig:9

  • The SQLITE_STATIC safety comment references the row buffer being freed via allocator.free, but rows are now freed via csv_reader.freeRecord(record). Update this comment so it accurately reflects the lifetime guarantee (memory must remain valid through sqlite3_step).
const csv = @import("csv.zig");

// SQLITE_STATIC (null): SQLite assumes the memory is constant and won't free it.
// Safe here because sqlite3_step is called immediately after binding,
// before the row buffer is freed by the deferred allocator.free.


Comment on lines +64 to +68
if (header_record.len > 0 and std.mem.startsWith(u8, header_record[0], bom)) {
const without_bom = header_record[0][bom.len..];
allocator.free(header_record[0]);
header_record[0] = try allocator.dupe(u8, without_bom);
}
Copilot AI commented Mar 2, 2026

header_record is declared as const but is later mutated (header_record[0] = ...) when stripping the BOM. This won’t compile in Zig; make header_record a var (or assign the BOM-stripped value to a separate variable) before modifying the slice element.

else
line;
if (row.len == 0) continue;
if (record.len == 0) continue;
Copilot AI commented Mar 2, 2026

This check no longer skips completely blank input lines. The previous line-based parser skipped row.len == 0; the CSV reader will return a 1-field record containing an empty string for an empty line, so record.len == 0 will never be true. If blank lines should be ignored (as before), update the condition to detect an empty record (e.g., a single empty field) before binding/insert.

Suggested change
if (record.len == 0) continue;
// Skip completely blank input lines (represented as a single empty field).
if (record.len == 1 and record[0].len == 0) continue;

Comment on lines +91 to +93
field.deinit(self.allocator);
for (fields.items) |f| self.allocator.free(f);
fields.deinit(self.allocator);
Copilot AI commented Mar 2, 2026

In the EndOfStream branch for state == .quoted, the code manually deinitializes/frees field and fields and then returns an error. Because fields/field already have errdefer cleanup, this path will double-free. Remove the manual cleanup here and rely on errdefer (or null out/reset the lists before returning the error).

Suggested change
field.deinit(self.allocator);
for (fields.items) |f| self.allocator.free(f);
fields.deinit(self.allocator);

Comment on lines +80 to +82
while (true) {
const byte = self.reader.readByte() catch |err| switch (err) {
error.EndOfStream => {
Copilot AI commented Mar 2, 2026

nextRecord consumes input via readByte() in a tight loop. If the underlying reader isn’t buffered (e.g. std.fs.File.Reader), this can cause many small reads and a large performance regression on big inputs. Consider documenting that callers should pass a buffered reader, or wrap the provided reader internally with std.io.bufferedReader when appropriate.

Comment on lines +14 to +17
pub const CsvError = error{
/// A quoted field was opened but no closing double-quote was found before EOF.
UnterminatedQuotedField,
};
Copilot AI commented Mar 2, 2026

CsvError.UnterminatedQuotedField is part of the public surface and nextRecord can return it, but there’s no unit test asserting this error is raised for an unterminated quoted field. Adding a test for this helps prevent regressions in the EOF/state handling.


Labels

type:chore (Maintenance, refactoring, tooling), type:feature (New functionality)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RFC 4180 compliant CSV parsing (quoted fields, embedded commas and newlines)

2 participants