
feat: RFC 4180 compliant CSV parsing (closes #3) #12

Merged
vmvarela merged 2 commits into master from feature/issue-3-rfc4180-csv on Mar 2, 2026

Conversation

@vmvarela (Owner) commented Mar 2, 2026

Summary

Implements full RFC 4180 CSV parsing as a streaming, single-pass state machine. Closes #3.

Changes

src/csv.zig (new)

Streaming single-pass CSV parser using a 4-state machine (field_start, unquoted, quoted, quote_saw):

  • Quoted fields with embedded commas
  • Escaped double-quotes ("" decoded to ")
  • Embedded newlines (\n, \r\n) inside quoted fields
  • Both \r\n and \n record terminators
  • No full-input buffering — byte-by-byte streaming, safe for stdin
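
For illustration, the four-state machine above can be sketched in Python. The state names mirror the PR's Zig parser, but `next_record` and all details here are an approximation for explanatory purposes, not the actual src/csv.zig code:

```python
def next_record(stream):
    """Read one CSV record from an iterator of bytes; return None at EOF.

    Illustrative sketch of the field_start/unquoted/quoted/quote_saw
    state machine; error handling beyond unterminated quotes is omitted.
    """
    fields, field = [], []
    state = "field_start"
    for byte in stream:
        ch = chr(byte)
        if state == "quoted":
            if ch == '"':
                state = "quote_saw"       # closing quote, or first half of ""
            else:
                field.append(ch)          # commas and newlines are literal here
        elif state == "quote_saw":
            if ch == '"':
                field.append('"')         # "" decodes to a single "
                state = "quoted"
            elif ch == ',':
                fields.append("".join(field)); field = []
                state = "field_start"
            elif ch == '\n':
                fields.append("".join(field))
                return fields
            # the '\r' of a \r\n terminator is simply skipped
        else:  # field_start or unquoted
            if ch == '"' and state == "field_start":
                state = "quoted"
            elif ch == ',':
                fields.append("".join(field)); field = []
                state = "field_start"
            elif ch == '\n':
                if field and field[-1] == '\r':
                    field.pop()           # strip the \r of \r\n
                fields.append("".join(field))
                return fields
            else:
                field.append(ch)
                state = "unquoted"
    if state == "quoted":
        raise ValueError("unterminated quoted field")
    if fields or field:                   # final record lacking a trailing newline
        fields.append("".join(field))
        return fields
    return None
```

For example, `next_record(iter(b'"a,b","c""d"\n'))` yields `['a,b', 'c"d']`: the embedded comma stays inside the first field and the escaped quote is decoded.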

src/main.zig

  • Replaced readUntilDelimiterOrEofAlloc + splitScalar with csv.csvReader for header and data rows
  • BOM stripping and column name trimming preserved
  • Backward-compatible for all unquoted CSV inputs

build.zig

  • Added zig build unit-test step targeting src/csv.zig

.github/workflows/ci.yml

  • Added a "Unit tests (csv.zig)" step (runs on all 3 OS matrix targets)

Acceptance Criteria

  • Fields enclosed in "..." are parsed correctly as a single value
  • Embedded commas inside quoted fields are not treated as delimiters
  • Escaped double-quotes ("") inside quoted fields are decoded to "
  • Fields with embedded newlines (\n) are read as a single multi-line value
  • Existing behaviour for unquoted fields is unchanged
  • 10 unit tests covering edge cases (zig build unit-test)
  • Parser is single-pass and streaming — no buffering the full input
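
These behaviours can be spot-checked against Python's standard csv module, which also implements RFC 4180 quoting (shown here only as an independent reference, not as part of this PR):

```python
import csv
import io

# One record with an embedded comma, an escaped quote ("") and an
# embedded newline, all inside quoted fields, with \r\n terminators.
data = 'name,notes\r\n"Smith, J.","said ""hi""\nthen left"\r\n'
rows = list(csv.reader(io.StringIO(data)))

assert rows[0] == ['name', 'notes']
# Embedded comma is not a delimiter; "" decodes to "; the newline survives:
assert rows[1] == ['Smith, J.', 'said "hi"\nthen left']
```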

vmvarela added 2 commits March 2, 2026 13:40
- Add src/csv.zig: streaming single-pass state-machine CSV parser
  - State machine: field_start, unquoted, quoted, quote_saw
  - Handles quoted fields with embedded commas
  - Decodes escaped double-quotes ("") to "
  - Supports embedded newlines (\n, \r\n) inside quoted fields
  - Handles both \r\n and \n record terminators
  - Preserves existing behaviour for unquoted fields
  - 10 unit tests covering all acceptance criteria

- Update src/main.zig to use csv.CsvReader for header and data rows
  - Replaces readUntilDelimiterOrEofAlloc + splitScalar with csvReader
  - Preserves BOM stripping and column name trimming
  - No full-input buffering (streaming, byte-by-byte)

- Update build.zig: add 'zig build unit-test' step for csv.zig tests
Copilot AI review requested due to automatic review settings March 2, 2026 12:43
@github-actions bot added the type:feature (New functionality) and type:chore (Maintenance, refactoring, tooling) labels on Mar 2, 2026
@vmvarela vmvarela merged commit b48b8fd into master Mar 2, 2026
8 checks passed
@vmvarela vmvarela deleted the feature/issue-3-rfc4180-csv branch March 2, 2026 12:47
Copilot AI (Contributor) left a comment
Pull request overview

Adds a streaming, single-pass RFC 4180 CSV parser to support quoted fields (including embedded commas/newlines) and wires it into sql-pipe’s stdin ingestion path, with dedicated unit test and CI coverage.

Changes:

  • Introduce src/csv.zig streaming CSV state machine (nextRecord / freeRecord) + unit tests.
  • Switch src/main.zig from line-splitting to the new CSV reader for header + data rows (preserving BOM stripping and header trimming intent).
  • Add a zig build unit-test build step and run it in CI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

src/main.zig: Uses the new CSV reader for header and rows; keeps BOM stripping and column trimming logic.
src/csv.zig: New streaming RFC 4180 parser implementation and unit tests.
build.zig: Adds a dedicated unit-test build step for src/csv.zig.
.github/workflows/ci.yml: Runs zig build unit-test in the CI matrix.
Comments suppressed due to low confidence (1)

src/main.zig:9

  • The SQLITE_STATIC safety comment references the row buffer being freed via allocator.free, but rows are now freed via csv_reader.freeRecord(record). Update this comment so it accurately reflects the lifetime guarantee (memory must remain valid through sqlite3_step).
const csv = @import("csv.zig");

// SQLITE_STATIC (null): SQLite assumes the memory is constant and won't free it.
// Safe here because sqlite3_step is called immediately after binding,
// before the row buffer is freed by the deferred allocator.free.


Comment on lines +64 to +68
if (header_record.len > 0 and std.mem.startsWith(u8, header_record[0], bom)) {
const without_bom = header_record[0][bom.len..];
allocator.free(header_record[0]);
header_record[0] = try allocator.dupe(u8, without_bom);
}
Copilot AI commented Mar 2, 2026

header_record is declared as const but is later mutated (header_record[0] = ...) when stripping the BOM. This won’t compile in Zig; make header_record a var (or assign the BOM-stripped value to a separate variable) before modifying the slice element.

else
line;
if (row.len == 0) continue;
if (record.len == 0) continue;
Copilot AI commented Mar 2, 2026

This check no longer skips completely blank input lines. The previous line-based parser skipped row.len == 0; the CSV reader will return a 1-field record containing an empty string for an empty line, so record.len == 0 will never be true. If blank lines should be ignored (as before), update the condition to detect an empty record (e.g., a single empty field) before binding/insert.

Suggested change
if (record.len == 0) continue;
// Skip completely blank input lines (represented as a single empty field).
if (record.len == 1 and record[0].len == 0) continue;

Comment on lines +91 to +93
field.deinit(self.allocator);
for (fields.items) |f| self.allocator.free(f);
fields.deinit(self.allocator);
Copilot AI commented Mar 2, 2026

In the EndOfStream branch for state == .quoted, the code manually deinitializes/frees field and fields and then returns an error. Because fields/field already have errdefer cleanup, this path will double-free. Remove the manual cleanup here and rely on errdefer (or null out/reset the lists before returning the error).

Suggested change
field.deinit(self.allocator);
for (fields.items) |f| self.allocator.free(f);
fields.deinit(self.allocator);

Comment on lines +80 to +82
while (true) {
const byte = self.reader.readByte() catch |err| switch (err) {
error.EndOfStream => {
Copilot AI commented Mar 2, 2026

nextRecord consumes input via readByte() in a tight loop. If the underlying reader isn’t buffered (e.g. std.fs.File.Reader), this can cause many small reads and a large performance regression on big inputs. Consider documenting that callers should pass a buffered reader, or wrap the provided reader internally with std.io.bufferedReader when appropriate.

Comment on lines +14 to +17
pub const CsvError = error{
/// A quoted field was opened but no closing double-quote was found before EOF.
UnterminatedQuotedField,
};
Copilot AI commented Mar 2, 2026

CsvError.UnterminatedQuotedField is part of the public surface and nextRecord can return it, but there’s no unit test asserting this error is raised for an unterminated quoted field. Adding a test for this helps prevent regressions in the EOF/state handling.


Labels

type:chore (Maintenance, refactoring, tooling), type:feature (New functionality)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RFC 4180 compliant CSV parsing (quoted fields, embedded commas and newlines)

2 participants