ENH: Add encoding option to numpy text IO #4208
juliantaylor wants to merge 27 commits into numpy:master
|
Unfinished, but posted so people can have a look at the idea and know what I'm talking about on the mailing list. loadtxt somewhat works; genfromtxt does not yet. |
numpy/lib/npyio.py
Outdated
this explicit decode is needed for backward compatibility with zipped files which have not been opened in text mode
Meaning that unzip doesn't return strings?
not if you open it with "rb", probably also applies to normal files
|
The underlying assumption in the I/O port was that all scientific text files are actually 1-byte binary files, not text, which may well have been misguided, and leaves little room for unicode. The use of asbytes originates only from the fact that b'%d' % (20,) does not work. |
|
"The use of asbytes originates only from the fact that b'%d' % (20,) does not work." Interesting -- for the record, there is a big ol' thread about that on Python-dev, and it looks like that's going to be added: http://www.python.org/dev/peps/pep-0461/ But there are (far more ugly) ways to do it without new features: ('%d' % (20,)).encode('ascii'), for instance. |
|
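For reference, a minimal sketch of the two spellings discussed in this comment. It assumes Python 3.5 or later for the `%` operator on bytes (the PEP 461 feature); the helper name `int_to_ascii_bytes` is made up for illustration and is not part of the PR.

```python
def int_to_ascii_bytes(value):
    """Format an integer as ASCII bytes without bytes %-formatting."""
    # Format as str first, then encode. The parentheses matter: attribute
    # access binds tighter than %, so '%d' % (20,).encode('ascii') would
    # try to call .encode on the tuple and fail.
    return ('%d' % (value,)).encode('ascii')


workaround = int_to_ascii_bytes(20)

# PEP 461 spelling, available from Python 3.5 onward.
direct = b'%d' % (20,)
```

Both spellings produce the same bytes object, so the workaround only matters on interpreters that predate PEP 461.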
tests should now succeed after adding more hacks to keep supporting broken assumptions on data encoding. |
|
loading of 'S' dtype in a structured array most likely will not work yet. |
numpy/lib/tests/test_io.py
Outdated
This docstring isn't very informative ;) What, in particular, does the function do?
I guess it can be removed; it's some Python 2.5 kludge that was made worse in the py3 conversion.
|
@charris I think this and a yet-to-be-done genfromtxt fix should be in 1.9, but getting it regression-free is difficult, as there are all kinds of things people can be inputting to these functions that only work by accident. |
|
I'd be inclined to push this to 1.10 with the other genfromtxt fixes. The masked array fixes are in the same spot and I'd like both to have more time to settle out. Maybe we can do a 1.10 as soon as the datetime stuff gets done, or maybe sooner if it doesn't get done ;) |
|
@juliantaylor Could you revisit this when you finish with higher priority stuff? Might be easier with support for 2.5 dropped. Also interested in @pv's comment about text files. |
|
Pushing this off (again) to 1.11. |
|
@juliantaylor Needs a rebase. |
|
@juliantaylor Still interested in this? |
|
I hope so -- this would be nice :-) |
|
@juliantaylor Closing this. Please resubmit if you get the urge to continue. Anyone else interested in this is welcome to pull the code and give it a shot. |
Load data in chunks and fill it into an array grown with resize. This significantly reduces the memory consumption of the function.
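A minimal sketch of the chunk-and-resize idea described in this commit message, not the PR's actual code. The function name `load_in_chunks`, the whitespace-delimited float format, and the default `chunk_size` are assumptions made for illustration.

```python
import io
from itertools import islice

import numpy as np


def load_in_chunks(fh, ncols, chunk_size=1000):
    """Read whitespace-separated floats, growing the result chunk by chunk."""
    out = np.empty((0, ncols), dtype=float)
    while True:
        # Pull at most chunk_size lines so only one chunk of parsed
        # Python objects is alive at a time.
        lines = list(islice(fh, chunk_size))
        if not lines:
            break
        rows = [[float(tok) for tok in line.split()]
                for line in lines if line.strip()]
        if not rows:
            continue
        chunk = np.array(rows, dtype=float)
        start = out.shape[0]
        # In-place resize keeps the existing rows and appends room for the
        # new ones; refcheck=False because `out` is only referenced here.
        out.resize((start + chunk.shape[0], ncols), refcheck=False)
        out[start:] = chunk
    return out


result = load_in_chunks(io.StringIO("1 2\n3 4\n5 6\n"), ncols=2, chunk_size=2)
```

The memory saving comes from never holding the whole file as a list of Python lists; only one chunk of parsed rows exists next to the growing array.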
47978a7 to c482a5b
|
@juliantaylor Ready for review? |
|
|
||
| * The `np.einsum` function will use BLAS when possible | ||
| * ``genfromtxt``, ``loadtxt``, ``fromregex`` and ``savetxt`` can now handle files | ||
| with arbitrary encoding supported by Python. |
| _open = open | ||
|
|
||
| def _check_mode(mode, encoding, newline): | ||
| if "t" in mode: |
| raise ValueError("Argument 'newline' not supported in binary mode") | ||
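The diff above shows only fragments of `_check_mode`. A self-contained sketch of how such a check could look is given below; the name `check_mode_sketch` and the branches not visible in the diff are assumptions, modeled on the visible error message.

```python
def check_mode_sketch(mode, encoding=None, newline=None):
    """Reject argument combinations that make no sense for the given mode."""
    if "t" in mode:
        # Text and binary flags are mutually exclusive.
        if "b" in mode:
            raise ValueError("Invalid mode: %r" % (mode,))
    else:
        # Binary mode: text-only arguments must not be supplied.
        if encoding is not None:
            raise ValueError("Argument 'encoding' not supported in binary mode")
        if newline is not None:
            raise ValueError("Argument 'newline' not supported in binary mode")


# Accepted: text mode with an explicit encoding.
check_mode_sketch("rt", encoding="utf-8")
```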
|
|
||
| def _python2_bz2open(fn, mode, encoding, newline): | ||
| """ wrapper to open bz2 in text mode """ |
Needs docstring of the standard type with expanded explanation and documentation of parameters.
| return bz2.BZ2File(fn, mode) | ||
|
|
||
| def _python2_gzipopen(fn, mode, encoding, newline): | ||
| """ wrapper to open gzip in text mode """ |
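As an illustration of what a text-mode wrapper with the requested numpydoc-style docstring might look like, here is a Python 3 sketch. It is not the PR's Python 2 helper; the name `open_compressed_text` and its signature are hypothetical, and it relies only on the standard `gzip`, `bz2` and `io` modules.

```python
import bz2
import gzip
import io
import os
import tempfile


def open_compressed_text(path, opener, encoding="utf-8", newline=None):
    """
    Open a compressed file for reading as decoded text.

    Parameters
    ----------
    path : str
        Path of the compressed file.
    opener : callable
        Class that opens the file in binary mode, for example
        ``gzip.GzipFile`` or ``bz2.BZ2File``.
    encoding : str, optional
        Encoding used to decode the decompressed bytes.
    newline : str or None, optional
        Newline handling, passed on to ``io.TextIOWrapper``.

    Returns
    -------
    io.TextIOWrapper
        Text stream yielding ``str`` lines.
    """
    raw = opener(path, "rb")
    return io.TextIOWrapper(raw, encoding=encoding, newline=newline)


tmpdir = tempfile.mkdtemp()
payload = "größe 1\n".encode("latin-1")

gz_path = os.path.join(tmpdir, "data.txt.gz")
with gzip.open(gz_path, "wb") as fh:
    fh.write(payload)
with open_compressed_text(gz_path, gzip.GzipFile, encoding="latin-1") as fh:
    gz_line = fh.readline()

bz_path = os.path.join(tmpdir, "data.txt.bz2")
with bz2.open(bz_path, "wb") as fh:
    fh.write(payload)
with open_compressed_text(bz_path, bz2.BZ2File, encoding="latin-1") as fh:
    bz_line = fh.readline()
```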
| @@ -115,7 +173,7 @@ def __getitem__(self, key): | |||
|
|
|||
| _file_openers = _FileOpeners() | |||
This singleton seems a bit odd. Old design I expect.
| if isinstance(delimiter, unicode): | ||
| delimiter = delimiter.encode('ascii') | ||
| if (delimiter is None) or _is_bytes_like(delimiter): | ||
| if (delimiter is None) or isinstance(delimiter, basestring): |
So in Python 2 either ascii or unicode?
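For comparison, the same delimiter check can be sketched for Python 3, where the split is between `bytes` and `str` rather than `str` and `unicode`. The helper `normalize_delimiter` and its `latin1` default are assumptions for illustration, not code from this PR.

```python
def normalize_delimiter(delimiter, encoding="latin1"):
    """Return the delimiter as str, accepting None, bytes, or str."""
    if delimiter is None:
        # None conventionally means "split on any whitespace".
        return None
    if isinstance(delimiter, bytes):
        # Legacy callers may still pass b',' and similar.
        return delimiter.decode(encoding)
    if isinstance(delimiter, str):
        return delimiter
    raise TypeError("delimiter must be None, bytes, or str, not %r"
                    % (type(delimiter).__name__,))


fields = "1,2,3".split(normalize_delimiter(b","))
```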
| fmt = asstr(fmt) | ||
| delimiter = asstr(delimiter) | ||
|
|
||
| class WriteWrap(object): |
| return [] | ||
|
|
||
| def read_data(chunk_size): | ||
| # Parse each line, including the first |
Needs better docstring, in particular chunk_size.
|
|
||
| .. versionadded:: 1.10.0 | ||
| encoding: string, optional | ||
| Encoding used to decode the inputfile. Does not apply to input streams. |
What is a stream in this context?
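One way to read the distinction in the docstring: the `encoding` argument matters when the function is given a file name and opens the file itself, while a text stream that the caller has already opened is decoded by that stream. The sketch below assumes a NumPy release where `np.loadtxt` accepts `encoding` (the feature as eventually merged); the file name and contents are made up.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "words.txt")
with open(path, "wb") as fh:
    fh.write("café\nnaïve\n".encode("latin-1"))

# Case 1: a file name is passed, so loadtxt opens the file and uses
# `encoding` to decode its bytes.
from_path = np.loadtxt(path, dtype=str, encoding="latin-1")

# Case 2: the caller opens the file in text mode; the stream already
# yields decoded str lines, so its own encoding is what applies.
with open(path, encoding="latin-1") as stream:
    from_stream = np.loadtxt(stream, dtype=str)
```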
|
New functions need docstrings. Also, some of the comments look useful. |
|
Just to be clear, are the assumptions here that:
|
|
Rebased and squashed in #10054, so closing this. |
ENH: Add encoding option to numpy text IO.
Add an encoding flag to np.loadtxt so that text files in a non-default
encoding can be loaded.