Skip to content

Column type inference (INTEGER, REAL, TEXT) for correct numeric queries #4

@vmvarela

Description

@vmvarela

Description

SQLite stores all values as TEXT when inserted via sqlite3_bind_text. This means numeric comparisons like WHERE age > 30 silently fail because '30' > '9' is false in string ordering. Column type inference makes the tool behave correctly for numeric data without the user needing to cast explicitly.

Acceptance Criteria

  • Integer columns (all values match [+-]?[0-9]+) are bound as sqlite3_bind_int64
  • Float columns (all values match a floating-point pattern) are bound as sqlite3_bind_double
  • NULL/empty values do not prevent a column from being inferred as numeric
  • Columns with mixed types fall back to TEXT
  • SELECT max(price), min(price) FROM t returns correct numeric results
  • Type inference is performed on the first N rows (default: 100) stored in a memory buffer
  • After the buffer is consumed and types are determined, remaining rows are streamed and inserted directly
  • A --no-type-inference flag allows opting out (pure TEXT mode, no buffering)

Notes

  • Approach — buffer-first, single-pass:
    1. Read the first N rows (default 100) into an in-memory buffer while scanning to infer column types.
    2. Once types are determined (or N rows are exhausted), create the table with the inferred types.
    3. Insert buffered rows first, then continue reading from stdin and inserting directly — no second pass needed.
    4. This avoids seeking stdin, which is not possible for piped input.
  • The "two-pass option" mentioned previously is not viable for stdin (which cannot seek). The buffer-first approach achieves the same result without seeking.
  • --no-type-inference is useful for large inputs where even buffering 100 rows is undesirable, or when the user knows the schema.

Refinement note (Sprint 1): Resolved contradictory notes about two-pass vs streaming. Adopted buffer-first single-pass approach. Updated AC and Notes accordingly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    mvpPart of the Minimum Viable Productpriority:highMust be in the next sprintsize:mMedium — 4 to 8 hoursstatus:readyRefined and ready for sprint selectiontype:featureNew functionality

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions