Skip to content

feat: column type inference (INTEGER, REAL, TEXT) for correct numeric queries#14

Merged
vmvarela merged 3 commits intomasterfrom
feature/4-column-type-inference
Mar 2, 2026
Merged

feat: column type inference (INTEGER, REAL, TEXT) for correct numeric queries#14
vmvarela merged 3 commits intomasterfrom
feature/4-column-type-inference

Conversation

@vmvarela
Copy link
Owner

@vmvarela vmvarela commented Mar 2, 2026

Summary

Closes #4

SQLite stored all values as TEXT via sqlite3_bind_text, causing numeric comparisons like WHERE age > 30 to silently fail ('30' > '9' is false in string ordering). This PR implements buffer-first single-pass column type inference so the tool behaves correctly for numeric data without requiring explicit casts.

Approach — buffer-first, single-pass

  1. Read the first N rows (default: 100) into an in-memory buffer while scanning to infer column types.
  2. Once types are determined (or N rows exhausted), create the table with inferred SQL types.
  3. Insert buffered rows first, then continue reading from stdin and inserting directly — no second pass, no seeking stdin.

Changes

Component What changed
ColumnType enum TEXT, INTEGER, REAL + INFERENCE_BUFFER_SIZE = 100
isInteger / isReal Pattern helpers for per-value classification
inferTypes Scans buffered rows; ignores empty/NULL; defaults to TEXT when column has no data
createTable Accepts []const ColumnType; emits correct SQLite affinity per column
insertRowTyped Uses sqlite3_bind_int64 / _double / _text; empty fields → NULL
ParsedArgs + parseArgs Returns struct with .query and .type_inference; parses --no-type-inference
main Buffer ≤100 rows → infer types → create table → insert buffer → stream rest

Acceptance Criteria

  • Integer columns (all values match [+-]?[0-9]+) bound as sqlite3_bind_int64
  • Float columns bound as sqlite3_bind_double
  • NULL/empty values do not prevent a column from being inferred as numeric
  • Columns with mixed types fall back to TEXT
  • SELECT max(price), min(price) FROM t returns correct numeric results
  • Type inference performed on first N rows (100) stored in a memory buffer
  • Remaining rows streamed and inserted directly after buffer consumed
  • --no-type-inference flag allows opting out (pure TEXT mode, no buffering)

…#4)

- Add ColumnType enum (TEXT, INTEGER, REAL) and INFERENCE_BUFFER_SIZE constant
- Add isInteger / isReal helpers and inferTypes function (buffer-first, single-pass)
- Update createTable to emit correct SQLite affinity per column type
- Replace insertRow with insertRowTyped using sqlite3_bind_int64 / _double / _text
- Refactor parseArgs to return ParsedArgs struct with type_inference flag
- Rewrite main: buffer first N rows, infer types, insert buffer then stream rest
- Add --no-type-inference flag for pure TEXT mode

Closes #4
Copilot AI review requested due to automatic review settings March 2, 2026 14:09
@github-actions github-actions bot added the type:feature New functionality label Mar 2, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CSV column type inference to sql-pipe so numeric comparisons/aggregations behave correctly in SQLite (avoiding lexicographic TEXT semantics) while preserving a --no-type-inference opt-out for legacy behavior.

Changes:

  • Introduces ColumnType (TEXT/INTEGER/REAL) and infers types from a buffered first-N rows.
  • Creates the SQLite table with inferred column affinities and inserts rows using typed sqlite3_bind_* APIs (empty → NULL).
  • Extends CLI parsing to support --no-type-inference and adjusts main to run buffer-first single-pass ingestion.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +365 to +380
if (c.sqlite3_bind_text(stmt, col_idx, val.ptr, @intCast(val.len), SQLITE_STATIC) != c.SQLITE_OK)
return error.BindFailed;
}
},
.REAL => {
if (std.fmt.parseFloat(f64, val)) |f| {
if (c.sqlite3_bind_double(stmt, col_idx, f) != c.SQLITE_OK)
return error.BindFailed;
} else |_| {
if (c.sqlite3_bind_text(stmt, col_idx, val.ptr, @intCast(val.len), SQLITE_STATIC) != c.SQLITE_OK)
return error.BindFailed;
}
},
.TEXT => {
if (c.sqlite3_bind_text(stmt, col_idx, val.ptr, @intCast(val.len), SQLITE_STATIC) != c.SQLITE_OK)
return error.BindFailed;
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlite3_bind_text is using SQLITE_STATIC while the backing val memory comes from the CSV reader and is freed after the insert. With SQLITE_STATIC, SQLite may keep a pointer to the provided buffer until the statement is reset/rebound/finalized, so freeing the record after sqlite3_step can leave SQLite with a dangling pointer. Use SQLITE_TRANSIENT (or otherwise ensure the bound bytes outlive the statement/reset) for these binds, including the TEXT fallbacks in INTEGER/REAL branches.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analyzed: SQLITE_STATIC is correct here. The bound pointers remain valid throughout sqlite3_step because:

  1. sqlite3_step is called inside insertRowTyped immediately after all bindings
  2. The caller frees the row buffer only after insertRowTyped returns
  3. sqlite3_reset at the top of the next call releases any prior references

SQLITE_TRANSIENT cannot be used in Zig 0.15 due to alignment constraints on function pointers (((sqlite3_destructor_type)-1) is not representable). The safety comment has been strengthened with this lifetime analysis.

Comment on lines +355 to +358
if (val.len == 0) {
// Empty / NULL value → bind as SQL NULL regardless of column type
_ = c.sqlite3_bind_null(stmt, col_idx);
} else switch (col_type) {
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlite3_bind_null return codes are ignored here. If binding fails (e.g., OOM), the insert will proceed to sqlite3_step with an unbound parameter and the error will be harder to attribute. Check the return value and propagate error.BindFailed consistently with the other bind calls.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: sqlite3_bind_null return values are now checked and error.BindFailed is returned on failure, both for empty values (line 360) and short-row padding (line 393).

Comment on lines 388 to 390
while (col_idx <= param_count) : (col_idx += 1) {
_ = c.sqlite3_bind_null(stmt, col_idx);
}
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlite3_bind_null return codes are ignored when padding short rows. For consistency with other binding operations (and to avoid silent failures), check the return value for SQLITE_OK and return error.BindFailed if it fails.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: sqlite3_bind_null return code is now checked here too. Returns error.BindFailed on failure.

Comment on lines +112 to +119
/// isReal(val) → bool
/// Pre: val is a valid UTF-8 slice
/// Post: result = val is parseable as a 64-bit floating-point number
/// AND is not an integer (otherwise isInteger should be used)
fn isReal(val: []const u8) bool {
if (val.len == 0) return false;
_ = std.fmt.parseFloat(f64, val) catch return false;
return true;
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment for isReal says it returns true only for non-integers, but the implementation returns true for any value parseFloat accepts (including plain integers like "30"). Either adjust the comment to match the current behavior, or enforce the stated behavior by excluding integer strings.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a previous commit — the docstring now reads:

/// Post: result = val is parseable as a 64-bit floating-point number
/// Note: returns true for integers too; callers should check isInteger first
///       for finer classification.

src/main.zig Outdated
Comment on lines +243 to +279
@@ -142,7 +271,12 @@ fn createTable(
try sql.append(allocator, ch);
}
try sql.append(allocator, '"');
try sql.appendSlice(allocator, " TEXT");
const type_str: []const u8 = if (i < types.len) switch (types[i]) {
.INTEGER => " INTEGER",
.REAL => " REAL",
.TEXT => " TEXT",
} else " TEXT";
try sql.appendSlice(allocator, type_str);
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createTable documents types.len = cols.len, but the implementation tolerates types being shorter and silently falls back to TEXT. This mismatch can hide caller bugs. Either enforce the precondition (e.g., assert/return error when lengths differ) or update the function contract to reflect the fallback behavior.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: the redundant bounds guard was removed in a previous commit. createTable now uses types[i] directly since types.len == cols.len is guaranteed by the caller.

Comment on lines +506 to +547
// ─── Phase 1: determine column types ─────────────────────────────────────
//
// When type_inference = true (default):
// Buffer up to INFERENCE_BUFFER_SIZE rows, infer types, then stream the rest.
// When type_inference = false (--no-type-inference):
// Skip buffering; all columns are TEXT.
//
// In both branches, `types` owns a heap slice of length num_cols that is
// freed after the query executes.

// row_buffer: rows read during the inference phase.
// Each element is owned by the CSV reader; freed via freeRecord before return.
var row_buffer: std.ArrayList([][]u8) = .{};
defer {
for (row_buffer.items) |row| csv_reader.freeRecord(row);
row_buffer.deinit(allocator);
}

const types: []ColumnType = if (parsed.type_inference) blk: {
// Buffer up to INFERENCE_BUFFER_SIZE non-empty rows
// Loop invariant I: row_buffer contains all non-empty rows read so far
// AND row_buffer.items.len <= INFERENCE_BUFFER_SIZE
// Bounding function: INFERENCE_BUFFER_SIZE - row_buffer.items.len
while (row_buffer.items.len < INFERENCE_BUFFER_SIZE) {
const rec = (try csv_reader.nextRecord()) orelse break;
if (rec.len == 0) {
csv_reader.freeRecord(rec);
continue;
}
try row_buffer.append(allocator, rec);
}
// {A4a: row_buffer holds min(N, total_rows) non-empty rows from stdin}

break :blk try inferTypes(row_buffer.items, num_cols, allocator);
// {A4b: types[j] is the inferred ColumnType for column j}
} else blk: {
// --no-type-inference: allocate and fill with TEXT
const t = try allocator.alloc(ColumnType, num_cols);
@memset(t, .TEXT);
break :blk t;
// {A4c: types[j] = TEXT for all j}
};
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type inference + --no-type-inference changes the program’s externally observable behavior, but the repo’s integration test (build step test) still uses CAST(age AS INT) and doesn’t exercise either inference or the opt-out flag. Add/adjust an integration test case to assert numeric comparisons work without casts by default, and that --no-type-inference preserves the legacy TEXT behavior.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: integration tests updated:

  1. Type inference testWHERE age > 27 without CAST(), verifies numeric comparison works by default
  2. --no-type-inference test — proves string comparison is used (with TEXT, "9" > "2" is true but "10" > "2" is false)
  3. REAL aggregation testmax(price) and min(price) return correct numeric results

vmvarela added 2 commits March 2, 2026 15:14
- isReal postcondition now accurately states it returns true for integers too,
  with a Note directing callers to check isInteger first.
- Remove redundant bounds guard in createTable (types.len == cols.len is
  guaranteed by the caller).
- Check sqlite3_bind_null return codes and propagate error.BindFailed
  (both for empty values and short-row padding)
- Strengthen SQLITE_STATIC safety comment with detailed lifetime analysis
  (sqlite3_step completes before caller frees row buffer)
- Replace integration test: remove CAST(age AS INT), use direct numeric
  comparison (type inference makes it unnecessary)
- Add integration test for --no-type-inference (string comparison behavior)
- Add integration test for max/min on REAL columns
- Use bash for tests (process substitution requires it)
@github-actions github-actions bot added the type:chore Maintenance, refactoring, tooling label Mar 2, 2026
@vmvarela vmvarela merged commit 265f866 into master Mar 2, 2026
5 checks passed
@vmvarela vmvarela deleted the feature/4-column-type-inference branch March 2, 2026 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:chore Maintenance, refactoring, tooling type:feature New functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Column type inference (INTEGER, REAL, TEXT) for correct numeric queries

2 participants