Conversation
…#4) - Add ColumnType enum (TEXT, INTEGER, REAL) and INFERENCE_BUFFER_SIZE constant - Add isInteger / isReal helpers and inferTypes function (buffer-first, single-pass) - Update createTable to emit correct SQLite affinity per column type - Replace insertRow with insertRowTyped using sqlite3_bind_int64 / _double / _text - Refactor parseArgs to return ParsedArgs struct with type_inference flag - Rewrite main: buffer first N rows, infer types, insert buffer then stream rest - Add --no-type-inference flag for pure TEXT mode Closes #4
There was a problem hiding this comment.
Pull request overview
Adds CSV column type inference to sql-pipe so numeric comparisons/aggregations behave correctly in SQLite (avoiding lexicographic TEXT semantics) while preserving a --no-type-inference opt-out for legacy behavior.
Changes:
- Introduces
ColumnType(TEXT/INTEGER/REAL) and infers types from a buffered first-N rows. - Creates the SQLite table with inferred column affinities and inserts rows using typed
sqlite3_bind_*APIs (empty → NULL). - Extends CLI parsing to support
--no-type-inferenceand adjustsmainto run buffer-first single-pass ingestion.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (c.sqlite3_bind_text(stmt, col_idx, val.ptr, @intCast(val.len), SQLITE_STATIC) != c.SQLITE_OK) | ||
| return error.BindFailed; | ||
| } | ||
| }, | ||
| .REAL => { | ||
| if (std.fmt.parseFloat(f64, val)) |f| { | ||
| if (c.sqlite3_bind_double(stmt, col_idx, f) != c.SQLITE_OK) | ||
| return error.BindFailed; | ||
| } else |_| { | ||
| if (c.sqlite3_bind_text(stmt, col_idx, val.ptr, @intCast(val.len), SQLITE_STATIC) != c.SQLITE_OK) | ||
| return error.BindFailed; | ||
| } | ||
| }, | ||
| .TEXT => { | ||
| if (c.sqlite3_bind_text(stmt, col_idx, val.ptr, @intCast(val.len), SQLITE_STATIC) != c.SQLITE_OK) | ||
| return error.BindFailed; |
There was a problem hiding this comment.
sqlite3_bind_text is using SQLITE_STATIC while the backing val memory comes from the CSV reader and is freed after the insert. With SQLITE_STATIC, SQLite may keep a pointer to the provided buffer until the statement is reset/rebound/finalized, so freeing the record after sqlite3_step can leave SQLite with a dangling pointer. Use SQLITE_TRANSIENT (or otherwise ensure the bound bytes outlive the statement/reset) for these binds, including the TEXT fallbacks in INTEGER/REAL branches.
There was a problem hiding this comment.
Analyzed: SQLITE_STATIC is correct here. The bound pointers remain valid throughout sqlite3_step because:
sqlite3_stepis called insideinsertRowTypedimmediately after all bindings- The caller frees the row buffer only after
insertRowTypedreturns sqlite3_resetat the top of the next call releases any prior references
SQLITE_TRANSIENT cannot be used in Zig 0.15 due to alignment constraints on function pointers (((sqlite3_destructor_type)-1) is not representable). The safety comment has been strengthened with this lifetime analysis.
| if (val.len == 0) { | ||
| // Empty / NULL value → bind as SQL NULL regardless of column type | ||
| _ = c.sqlite3_bind_null(stmt, col_idx); | ||
| } else switch (col_type) { |
There was a problem hiding this comment.
sqlite3_bind_null return codes are ignored here. If binding fails (e.g., OOM), the insert will proceed to sqlite3_step with an unbound parameter and the error will be harder to attribute. Check the return value and propagate error.BindFailed consistently with the other bind calls.
There was a problem hiding this comment.
Fixed: sqlite3_bind_null return values are now checked and error.BindFailed is returned on failure, both for empty values (line 360) and short-row padding (line 393).
| while (col_idx <= param_count) : (col_idx += 1) { | ||
| _ = c.sqlite3_bind_null(stmt, col_idx); | ||
| } |
There was a problem hiding this comment.
sqlite3_bind_null return codes are ignored when padding short rows. For consistency with other binding operations (and to avoid silent failures), check the return value for SQLITE_OK and return error.BindFailed if it fails.
There was a problem hiding this comment.
Fixed: sqlite3_bind_null return code is now checked here too. Returns error.BindFailed on failure.
| /// isReal(val) → bool | ||
| /// Pre: val is a valid UTF-8 slice | ||
| /// Post: result = val is parseable as a 64-bit floating-point number | ||
| /// AND is not an integer (otherwise isInteger should be used) | ||
| fn isReal(val: []const u8) bool { | ||
| if (val.len == 0) return false; | ||
| _ = std.fmt.parseFloat(f64, val) catch return false; | ||
| return true; |
There was a problem hiding this comment.
The doc comment for isReal says it returns true only for non-integers, but the implementation returns true for any value parseFloat accepts (including plain integers like "30"). Either adjust the comment to match the current behavior, or enforce the stated behavior by excluding integer strings.
There was a problem hiding this comment.
Fixed in a previous commit — the docstring now reads:
/// Post: result = val is parseable as a 64-bit floating-point number
/// Note: returns true for integers too; callers should check isInteger first
/// for finer classification.
src/main.zig
Outdated
| @@ -142,7 +271,12 @@ fn createTable( | |||
| try sql.append(allocator, ch); | |||
| } | |||
| try sql.append(allocator, '"'); | |||
| try sql.appendSlice(allocator, " TEXT"); | |||
| const type_str: []const u8 = if (i < types.len) switch (types[i]) { | |||
| .INTEGER => " INTEGER", | |||
| .REAL => " REAL", | |||
| .TEXT => " TEXT", | |||
| } else " TEXT"; | |||
| try sql.appendSlice(allocator, type_str); | |||
There was a problem hiding this comment.
createTable documents types.len = cols.len, but the implementation tolerates types being shorter and silently falls back to TEXT. This mismatch can hide caller bugs. Either enforce the precondition (e.g., assert/return error when lengths differ) or update the function contract to reflect the fallback behavior.
There was a problem hiding this comment.
Fixed: the redundant bounds guard was removed in a previous commit. createTable now uses types[i] directly since types.len == cols.len is guaranteed by the caller.
| // ─── Phase 1: determine column types ───────────────────────────────────── | ||
| // | ||
| // When type_inference = true (default): | ||
| // Buffer up to INFERENCE_BUFFER_SIZE rows, infer types, then stream the rest. | ||
| // When type_inference = false (--no-type-inference): | ||
| // Skip buffering; all columns are TEXT. | ||
| // | ||
| // In both branches, `types` owns a heap slice of length num_cols that is | ||
| // freed after the query executes. | ||
|
|
||
| // row_buffer: rows read during the inference phase. | ||
| // Each element is owned by the CSV reader; freed via freeRecord before return. | ||
| var row_buffer: std.ArrayList([][]u8) = .{}; | ||
| defer { | ||
| for (row_buffer.items) |row| csv_reader.freeRecord(row); | ||
| row_buffer.deinit(allocator); | ||
| } | ||
|
|
||
| const types: []ColumnType = if (parsed.type_inference) blk: { | ||
| // Buffer up to INFERENCE_BUFFER_SIZE non-empty rows | ||
| // Loop invariant I: row_buffer contains all non-empty rows read so far | ||
| // AND row_buffer.items.len <= INFERENCE_BUFFER_SIZE | ||
| // Bounding function: INFERENCE_BUFFER_SIZE - row_buffer.items.len | ||
| while (row_buffer.items.len < INFERENCE_BUFFER_SIZE) { | ||
| const rec = (try csv_reader.nextRecord()) orelse break; | ||
| if (rec.len == 0) { | ||
| csv_reader.freeRecord(rec); | ||
| continue; | ||
| } | ||
| try row_buffer.append(allocator, rec); | ||
| } | ||
| // {A4a: row_buffer holds min(N, total_rows) non-empty rows from stdin} | ||
|
|
||
| break :blk try inferTypes(row_buffer.items, num_cols, allocator); | ||
| // {A4b: types[j] is the inferred ColumnType for column j} | ||
| } else blk: { | ||
| // --no-type-inference: allocate and fill with TEXT | ||
| const t = try allocator.alloc(ColumnType, num_cols); | ||
| @memset(t, .TEXT); | ||
| break :blk t; | ||
| // {A4c: types[j] = TEXT for all j} | ||
| }; |
There was a problem hiding this comment.
Type inference + --no-type-inference changes the program’s externally observable behavior, but the repo’s integration test (build step test) still uses CAST(age AS INT) and doesn’t exercise either inference or the opt-out flag. Add/adjust an integration test case to assert numeric comparisons work without casts by default, and that --no-type-inference preserves the legacy TEXT behavior.
There was a problem hiding this comment.
Fixed: integration tests updated:
- Type inference test —
WHERE age > 27withoutCAST(), verifies numeric comparison works by default --no-type-inferencetest — proves string comparison is used (with TEXT,"9" > "2"is true but"10" > "2"is false)- REAL aggregation test —
max(price)andmin(price)return correct numeric results
- isReal postcondition now accurately states it returns true for integers too, with a Note directing callers to check isInteger first. - Remove redundant bounds guard in createTable (types.len == cols.len is guaranteed by the caller).
- Check sqlite3_bind_null return codes and propagate error.BindFailed (both for empty values and short-row padding) - Strengthen SQLITE_STATIC safety comment with detailed lifetime analysis (sqlite3_step completes before caller frees row buffer) - Replace integration test: remove CAST(age AS INT), use direct numeric comparison (type inference makes it unnecessary) - Add integration test for --no-type-inference (string comparison behavior) - Add integration test for max/min on REAL columns - Use bash for tests (process substitution requires it)
Summary
Closes #4
SQLite stored all values as
TEXTviasqlite3_bind_text, causing numeric comparisons likeWHERE age > 30to silently fail ('30' > '9'is false in string ordering). This PR implements buffer-first single-pass column type inference so the tool behaves correctly for numeric data without requiring explicit casts.Approach — buffer-first, single-pass
Changes
ColumnTypeenumTEXT,INTEGER,REAL+INFERENCE_BUFFER_SIZE = 100isInteger/isRealinferTypesTEXTwhen column has no datacreateTable[]const ColumnType; emits correct SQLite affinity per columninsertRowTypedsqlite3_bind_int64/_double/_text; empty fields →NULLParsedArgs+parseArgs.queryand.type_inference; parses--no-type-inferencemainAcceptance Criteria
[+-]?[0-9]+) bound assqlite3_bind_int64sqlite3_bind_doubleSELECT max(price), min(price) FROM treturns correct numeric results--no-type-inferenceflag allows opting out (pure TEXT mode, no buffering)