
Requested CSV Parsing Features [comment here for new requests] #3

@quinnj


This issue is for listing out and refining the desired functionality of working with CSV files. The options so far, more or less implemented:

  • Able to accept/detect compressed files as input; also able to tell CSV what compression to expect
  • Accept URLs as input (this may require some upstream work, as Requests.jl sucks hardcore right now for downloading, but maybe we can just use Base.download) [this might be something to revisit, but right now we support CSV.getfield(io::IOBuffer, ::Type{T}), which would allow for fairly seamless streaming code]
  • Ability to specify an arbitrary ASCII delimiter
  • Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (\r\n) [we're just going to accept \r, \n, and \r\n and handle those three automatically]
  • Ability to specify a quote character that quotes field values and allows delimiters/newlines in field values
  • Ability to specify an escape character that allows for the quote character inside a quoted field value
  • Ability to provide a custom header of column names
  • Ability to specify a custom line where the column names can be found in the file; the data must start on the line following the column names; not sure what to do about headerless CSV files (if there is such a thing)
  • Ability to specify the types CSV should expect for each column individually or for every column
  • Ability to specify the date format of a file
  • Ability to tell CSV to use 'x' number of rows for type inference
  • Ability to specify a thousands separator character
  • Ability to specify a decimal character (e.g. ',' for Europe); not sure how to handle the implementation here
  • Ability to specify a custom NULL value to expect (e.g. "NA", "\N", "NULL", etc.)
  • Ability to skip blank lines
  • Ability to do "SKIP, LIMIT" type SQL functionality when parsing (i.e. read only a chunk of a file)
  • Ability to not parse specified columns (ignore them)
  • Ability to specify a # of lines to skip at the end of a file
  • Right now we skip leading whitespace for numeric field parsing, but we don't ignore trailing whitespace
  • Ability to parse DateTime values
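To make the list above concrete, here is a minimal sketch of how a few of these options (custom delimiter, quote character, custom NULL sentinel) compose in practice. This uses Python's stdlib csv module purely as an analogy — the function name, keyword names, and defaults below are illustrative, not CSV.jl's actual API:

```python
import csv
import io

def read_csv(text, delim=",", quotechar='"', null="NA"):
    """Parse CSV text with a custom delimiter and quote character,
    mapping a custom NULL sentinel (e.g. "NA") to None."""
    reader = csv.reader(io.StringIO(text), delimiter=delim, quotechar=quotechar)
    return [[None if field == null else field for field in row]
            for row in reader]

# A quoted field may contain the delimiter, per the quote-character bullet above.
sample = 'a;"x;y";NA\n1;2;3\n'
print(read_csv(sample, delim=";"))
# -> [['a', 'x;y', None], ['1', '2', '3']]
```

The point is just that each bullet maps naturally to a keyword argument on a single reader entry point, which keeps the surface area small.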

This is of course less feature-rich than pandas or data.table's fread, but I also had an epiphany of sorts the other day with regard to bazillion-feature CSV readers. They have to provide so many features because their languages suck. Think about it: pandas needs to provide all these crazy options and parsing-function capabilities because otherwise you'd have to do additional processing in Python, which defeats the purpose of using a nice C pandas implementation. Same with R, to some extent.

For CSV, I want to take the approach that if a certain feature can be done post-parsing as efficiently as we'd be able to do it while parsing, then we shouldn't support it. Julia is great and fast; don't be afraid of processing your ugly, misshapen CSV files. We want this implementation to be fast and simple, with no need to clutter it with extraneous features. Sure, we can provide stuff that is convenient for this or that, but I really don't think we need to go overboard.
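As an illustration of that "do it after parsing" philosophy, consider the decimal-character bullet above (',' for Europe): instead of a parser-level option, you can read those columns as strings and convert them in one cheap post-pass. A sketch in Python (the helper name and convention are illustrative only):

```python
def fix_decimal_comma(rows, cols):
    """Convert European-style numbers like '1.234,56' to floats in place,
    for the given column indices: strip '.' thousands separators, then
    treat ',' as the decimal point."""
    for row in rows:
        for i in cols:
            row[i] = float(row[i].replace(".", "").replace(",", "."))
    return rows

data = [["Berlin", "1.234,56"], ["Wien", "78,9"]]
fix_decimal_comma(data, cols=[1])
# data is now [['Berlin', 1234.56], ['Wien', 78.9]]
```

A post-pass like this runs over already-parsed fields at native speed, so baking it into the parser buys little — which is exactly the argument for keeping the parser's option surface small.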

@johnmyleswhite @davidagold @jiahao @RaviMohan @StefanKarpinski
