This issue is for listing out and refining the desired functionality of working with CSV files. The options so far, more or less implemented:
Able to accept/detect compressed files as input; also able to tell CSV what compression to expect
Accept URLs as input (this may require some upstream work, as Requests.jl sucks hardcore right now for downloading, but maybe we can just use Base.download) [this might be something to revisit, but right now we support CSV.getfield(io::IOBuffer, ::Type{T}), which would allow for fairly seamless streaming code]
Ability to specify an arbitrary ASCII delimiter
Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (\r\n) [we're just going to accept \r, \n, and \r\n and handle those three automatically]
Ability to specify a quote character that quotes field values and allows delimiters/newlines in field values
Ability to specify an escape character that allows for the quote character inside a quoted field value
Ability to provide a custom header of column names
Ability to specify a custom line where the column names can be found in the file; the data must start on the line following the column names; not sure what to do about headerless CSV files (if there is such a thing)
Ability to specify the types CSV should expect for each column individually or for every column
Ability to specify the date format of a file
Ability to tell CSV to use 'x' number of rows for type inference
Ability to specify a thousands separator character
Ability to specify a decimal character (e.g. ',' for Europe); not sure how to handle the implementation here
Ability to specify a custom NULL value to expect (e.g. "NA", "\N", "NULL", etc.)
Ability to skip blank lines
Ability to do "SKIP, LIMIT" type SQL functionality when parsing (i.e. read only a chunk of a file)
Ability to not parse specified columns (ignore them)
Ability to specify a # of lines to skip at the end of a file
Right now, we skip leading whitespace for numeric field parsing, but we don't ignore trailing whitespace
parse DateTime values
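To make a few of the options above concrete (delimiter, quote character, escape character, and automatic handling of \r, \n, and \r\n), here is a rough sketch of the scanning logic. It is written in Python purely for illustration and is not how CSV is implemented. The function name `parse_rows` and its keyword arguments are hypothetical.

```python
def parse_rows(text, delim=",", quote='"', escape="\\"):
    """Split text into rows of string fields (hypothetical helper, illustration only)."""
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if in_quotes:
            # Escape character followed by the quote character yields a literal quote.
            if c == escape and i + 1 < n and text[i + 1] == quote:
                field.append(quote)
                i += 2
                continue
            if c == quote:
                in_quotes = False  # closing quote
            else:
                field.append(c)    # delimiters and newlines are kept verbatim
        elif c == quote:
            in_quotes = True       # opening quote
        elif c == delim:
            row.append("".join(field))
            field = []
        elif c in "\r\n":
            # Treat \r\n as a single line ending; bare \r and bare \n also end a row.
            if c == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
        else:
            field.append(c)
        i += 1
    # Flush a final row that has no trailing newline.
    if field or row:
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'a,b\r\n"x,1","say \\"hi\\""\r3,4\n'
print(parse_rows(sample))
```

In this sketch a blank line comes back as a row holding one empty field, so "skip blank lines" would be one extra check at the point where a row is appended.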
This is of course less feature-rich than pandas or data.table's fread, but I also had an epiphany of sorts the other day with regards to bazillion-feature CSV readers. They have to provide so many features because their languages suck. Think about it: pandas needs to provide all these crazy options and parsing function capabilities because otherwise you'd have to do additional processing in Python, which kills the purpose of using a nice C pandas implementation. Same with R to some extent.
For CSV, I want to take the approach that if a certain feature can be done post-parsing as efficiently as we'd be able to do it while parsing, then we shouldn't support it. Julia is great and fast; don't be afraid of processing your ugly, misshapen CSV files. We want this implementation to be fast and simple, with no need to clutter it with extraneous features. Sure, we can provide stuff that is convenient for this or that, but I really don't think we need to go overboard.
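As an illustration of the kind of clean-up that can happen after parsing rather than inside the parser, here is a small sketch that takes a column already read as strings, strips the trailing whitespace the numeric parser does not ignore, and maps a custom NULL marker to a missing value. It is in Python only for brevity; the helper name `clean_numeric_column` and the `null` argument are made up for this example.

```python
def clean_numeric_column(fields, null="NA"):
    """Convert raw string fields to floats after parsing (hypothetical helper)."""
    out = []
    for f in fields:
        f = f.strip()  # drop leading and trailing whitespace
        out.append(None if f == null else float(f))
    return out


print(clean_numeric_column(["1.5 ", " 2", "NA "]))
```

The point is only that a single pass over an already-parsed column is cheap, so the reader itself can stay small.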
@johnmyleswhite @davidagold @jiahao @RaviMohan @StefanKarpinski