RFC 4180 compliant CSV parsing (quoted fields, embedded commas and newlines) #3

@vmvarela

Description

The current CSV parser splits each line on commas and trims line endings, but it does not handle RFC 4180 quoted fields. Under RFC 4180, a field that contains a comma, a double quote, or a line break must be enclosed in double quotes, and such fields are extremely common in real-world data.
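
To illustrate the failure mode, here is a minimal Python sketch. `naive_split` is a hypothetical stand-in for "split on commas and trim line endings"; it is not the actual sql-pipe code, and the sample row is made up.

```python
# Hypothetical stand-in for the current behaviour: trim the line ending,
# then split on every comma, ignoring quotes entirely.
def naive_split(line: str) -> list[str]:
    return line.rstrip("\r\n").split(",")

# One record with three logical fields: 1 | Smith, John | He said "hi"
row = '1,"Smith, John","He said ""hi"""\n'
fields = naive_split(row)

# The quoted comma is treated as a delimiter, so four fields come back,
# and the surrounding and doubled quotes are left undecoded.
print(len(fields), fields)
```

A quoted field with an embedded newline fails even earlier, because the record is cut in two before any splitting happens.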

Acceptance Criteria

  • Fields enclosed in "..." are parsed correctly as a single value
  • Embedded commas inside quoted fields are not treated as delimiters
  • Escaped double quotes ("") inside quoted fields are decoded to "
  • Fields with embedded newlines (\n or \r\n) are read as a single multi-line value
  • Existing behaviour for unquoted fields is unchanged
  • At least 5 unit tests covering edge cases
  • Parser is single-pass and streaming — no buffering the full input (stdin cannot seek)
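
The expected results for the criteria above can be pinned down with a reference implementation. The sketch below uses Python's standard `csv` module as an oracle; this is an assumption made for illustration, not the project's test framework, and the five inputs are made-up examples rather than the required test cases.

```python
import csv
import io

def parse(text: str) -> list[list[str]]:
    # newline="" stops StringIO from translating line endings, so the csv
    # module sees \n and \r\n inside quoted fields exactly as written.
    return list(csv.reader(io.StringIO(text, newline="")))

# input text -> expected rows, one case per acceptance criterion
cases = {
    'a,b,c\n': [["a", "b", "c"]],                         # unquoted, unchanged
    '"a,b",c\n': [["a,b", "c"]],                          # embedded comma
    '"say ""hi""",x\n': [['say "hi"', "x"]],              # escaped double quote
    '"line1\nline2",x\n': [["line1\nline2", "x"]],        # embedded \n
    '"line1\r\nline2",x\r\n': [["line1\r\nline2", "x"]],  # embedded \r\n
}

for text, expected in cases.items():
    assert parse(text) == expected, (text, parse(text))
```

The same input/expected pairs could be ported to whatever test harness the project uses.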

Notes

  • Reference: RFC 4180
  • Should handle both \n and \r\n line endings within quoted fields
  • Must use a single-pass streaming parser — this is required (not optional) because sql-pipe reads from stdin, which is not seekable. Buffering the entire file is not acceptable for large inputs.
  • Implement as a state machine over individual bytes/codepoints
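
The state-machine approach in these notes could look roughly like the Python sketch below. It is only an illustration under assumptions: `parse_csv`, the state names, and the lenient handling of malformed input (text after a closing quote, an unterminated quote at end of input) are all hypothetical choices, not decisions made in this issue.

```python
from typing import Iterable, Iterator

def parse_csv(chars: Iterable[str]) -> Iterator[list[str]]:
    """Yield rows from a stream of single characters in one pass."""
    FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)
    state = FIELD_START
    field: list[str] = []   # characters of the field being built
    row: list[str] = []     # completed fields of the current row
    pending_cr = False      # saw \r outside quotes; swallow a following \n

    for ch in chars:
        if pending_cr:
            pending_cr = False
            if ch == "\n":
                continue
        if state == QUOTED:
            if ch == '"':
                state = QUOTE_IN_QUOTED
            else:
                field.append(ch)      # commas, \r and \n are kept verbatim
            continue
        if state == QUOTE_IN_QUOTED:
            if ch == '"':             # "" decodes to one literal quote
                field.append('"')
                state = QUOTED
                continue
            state = UNQUOTED          # it was a closing quote; handle ch below
        if ch == ",":
            row.append("".join(field))
            field = []
            state = FIELD_START
        elif ch in "\r\n":
            row.append("".join(field))
            field = []
            yield row
            row = []
            state = FIELD_START
            pending_cr = ch == "\r"
        elif ch == '"' and state == FIELD_START:
            state = QUOTED
        else:
            field.append(ch)
            state = UNQUOTED

    # Flush a final row that has no trailing line ending.
    if state != FIELD_START or field or row:
        row.append("".join(field))
        yield row
```

Only the current field and row are held in memory, so the input never needs to be buffered or rewound. For stdin, the character stream could be supplied as `iter(lambda: sys.stdin.read(1), "")`.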

Refinement note (Sprint 1): Clarified that streaming/single-pass is a hard requirement, not a suggestion. Moved it to AC.

Metadata

Assignees

No one assigned

    Labels

      • mvp: Part of the Minimum Viable Product
      • priority:critical: Must fix immediately; blocks everything
      • size:l: Large (1 to 2 days)
      • status:ready: Refined and ready for sprint selection
      • type:feature: New functionality
