ENH: Add encoding option to numpy text IO (#10054)
Conversation
OK, should be ready to go.
doc/release/1.14.0-notes.rst
nit: line break here and below
Yeah, I decided to leave off the rest of the release notes editing to just before the branch.
I think all the formatting changes could do with collecting together coherently too, when we get around to that.
numpy/lib/_iotools.py
It seems to me that this type of encoding-guessing goes against the python3 unicode model, which requires this kind of thing to be explicit. Can we perhaps add a warning here in python 3?
This is trying to preserve compatibility with previous behavior and is only called in three functions, two of which pass 'bytes' as the default. I went ahead and documented the default values in those places. I think we can revisit this after Python 2 is dropped. There has also been a long time desire to replace all these functions, perhaps by bringing over the Pandas functionality, so this may all become moot in the long term.
numpy/lib/tests/test__iotools.py
Shouldn't the contract be that s is now always unicode, and so the decode is unconditional?
_bytes_to_date is a helper function called by several other tests, so it probably needs to do the check. Mind that we are also trying to make things look as much as possible like they did before, so that check can be looked at as a compatibility check. In particular, I believe that encoding='bytes' means 'latin1' for this PR.
numpy/lib/npyio.py
Is this really true? It seems like we should now allow and encourage utf8.
I wondered about that too, but the problem is the pickled case.
if encoding not in ('ASCII', 'latin1', 'bytes'):
    # The 'encoding' value for pickle also affects what encoding
    # the serialized binary data of NumPy arrays is loaded in.
    # Pickle does not pass on the encoding information to NumPy.
    # The unpickling code in numpy.core.multiarray is written to
    # assume that unicode data appearing where binary should be
    # is in 'latin1'. 'bytes' is also safe, as is 'ASCII'.
    #
    # Other encoding values can corrupt binary data, and we
    # purposefully disallow them. For the same reason, the errors=
    # argument is not exposed, as values other than 'strict' can
    # similarly silently corrupt numerical data.
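For what it's worth, the reason 'latin1' is safe for binary data can be checked directly; a quick illustration (not part of the PR):

```python
# 'latin1' maps each byte 0-255 to the Unicode code point with the same
# value, so decoding and re-encoding arbitrary binary data is lossless.
raw = bytes(range(256))
roundtripped = raw.decode('latin1').encode('latin1')
assert roundtripped == raw
```

No other multi-byte-safe encoding has this property for all 256 byte values, which is why the whitelist above is so small.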
numpy/lib/npyio.py
No need for the asstr here, since the above line already requires x to be unicode in py3, and py2 doesn't care either way.
numpy/lib/npyio.py
Probably good to call them unicode strings to avoid ambiguity on python 2
numpy/lib/npyio.py
IMO this block would be clearer as:
try:
    new_data = []
    for row_tup in data:
        row = list(row_tup)
        for i in strcolidx:
            row[i] = row[i].encode('latin1')
        row_tup = tuple(row)
        new_data.append(row_tup)
    type_str = np.bytes_
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    data = new_data

Or even better:

def encode_unicode_cols(row_tup):
    row = list(row_tup)
    for i in strcolidx:
        row[i] = row[i].encode('latin1')
    return tuple(row)

try:
    data = [encode_unicode_cols(r) for r in data]
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    type_str = np.bytes_

type_str needs to be a numpy type, 'U' or 'S'. The 'U' is the default here, overridden if the try succeeds.
numpy/lib/npyio.py
Would be nice to use np.dtype((the_type, max(len(row[i]) for row in data))) here instead of string formatting.
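For reference, a (type, length) tuple builds the same fixed-width string dtype as the format string; a quick check (the sample data is made up, and np.str_ is used in place of the np.unicode_ alias):

```python
import numpy as np

# Hypothetical sample data standing in for `data`/`strcolidx` above.
data = [('abc',), ('de',)]
width = max(len(row[0]) for row in data)

assert np.dtype((np.bytes_, width)) == np.dtype('S3')
assert np.dtype((np.str_, width)) == np.dtype('U3')
```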
numpy/lib/npyio.py
Change to np.unicode_, and the 'S' below to np.bytes_
Force-pushed from 3726112 to 1a5e37d.
Most review comments have been addressed.
numpy/lib/npyio.py
This `or ()` is useless, as we already check that names is truthy above.
numpy/lib/_iotools.py
This now rejects bytes on Python 3. This is perhaps fine, but it does so silently, as there is no else case to fail on this if.
Hmm, the whole function looks a bit bogus as it never checks what the iterator returns. _is_bytes_like also doesn't do much except check that it can be combined with the bytes type, which does not include str in Python 3.
The documentation does say that missing_values is a str or sequence of str, although it would originally accept unicode also. I think in Python 3 after this PR it should not accept bytes, but that might not be backwards compatible with third party converters as the original only worked with byte strings in Python 3.
Maybe the thing to do here is make missing_values iterable, if not already, check each value for basestring, and fail if not.
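Something like this sketch, perhaps (the helper name is made up, and str stands in for basestring on Python 2):

```python
def normalize_missing_values(missing_values):
    # Accept a single string or an iterable of strings; reject bytes
    # and other non-string values instead of silently ignoring them.
    if isinstance(missing_values, str):
        return [missing_values]
    values = list(missing_values)
    for v in values:
        if not isinstance(v, str):
            raise TypeError("missing_values must be str or a sequence of str")
    return values
```

That would make the failure mode explicit for third-party converters that still pass byte strings.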
numpy/lib/npyio.py
Should this forward the encoding argument? (after the handling of "bytes" below)
I don't think so. The encoding parameter refers to the file encoding, while the comment strings are passed as an argument. A better option might be to not accept bytes strings for this.
I'll update the docstring to mention the default encoding. The previously used asbytes assumed 'latin1'.
Hmm, maybe not. There are a bunch of other passed string parameters, none of which are explicitly decoded if they are byte strings. The only thing special about comments is that it may be a sequence. I think we should simply drop the byte string option.
OK, what I have done is decode the delimiter also, and added a note to the documentation that if either delimiter or comments is passed as a byte string, it will be decoded as 'latin1'. We might run into some backwards compatibility problems here, but I don't expect many folks were passing byte strings, especially as that would assume the input file has some strange encoding that couldn't be handled by the latin1 assumption. We should maybe warn if either is a byte string.
Looking pretty good now.
I'm going to put the final bits of this off till tomorrow.
OK, I think this is done, but I will be surprised, and pleased, if it doesn't turn up some problems after it is merged :(.
Force-pushed from 4d493d3 to c9a44b1.
numpy/lib/tests/test__iotools.py
Can you elaborate on this TODO?
Not sure what it was supposed to test. However, the commented out portions work, so I have uncommented them and added tests with str and unicode. The upgrade is supposed to find and add a converter for the passed type.
I was running the wrong tests. So I am still not sure what the comment intended, but the (new?) upshot is that all input data not recognized as a boolean or numeric type is converted to unicode. I've made the test check that that is the case.
numpy/lib/tests/test__iotools.py
This should be nx.int_, not nx.integer, and the comment should say long, I think.
Curiously, numpy.integer is valid and is a long. Probably an obsolete usage going back to Numeric.
np.integer is the base class of np.int_ and friends. But as you observe, np.dtype(np.integer) promotes to long in the same way that np.dtype(np.number) promotes to float
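The promotion is easy to demonstrate (checked against current numpy):

```python
import numpy as np

# Abstract scalar types resolve to a concrete default when used as a dtype.
assert np.dtype(np.integer) == np.dtype(np.int_)
assert np.dtype(np.number) == np.dtype(np.float64)

# np.int_ is one of the concrete subclasses of the abstract np.integer.
assert issubclass(np.int_, np.integer)
```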
numpy/lib/tests/test__iotools.py
What happened here? A comment saying what -3 means would be helpful.
It's a list index into a list of tuples. The contents of the list were changed; it's self-explanatory if you take a look at what _mapper is.
Well, this test is garbage - conv._mapper[-3][0] is np.longdouble, yet it seems to be expecting it to be a complex value.
Also _mapper is a class property, not an instance one - making this test even more bizarre.
Maybe the test should be:
old_map = StringConverter._mapper.copy()
new_map = StringConverter(_bytes_to_date)._mapper
assert_equal(new_map, old_map)
Yeah, I'm not going to spend a lot of time fixing up crappy old tests/codes as long as they are no worse than they were. Note that the long double converter is also broken.
And the bytes converter is never reached.
Seems like a fair stance.
Note that my comments below are about new code though!
numpy/lib/_iotools.py
Message is a little odd in python 3 where string and unicode mean the same thing, but not sure how to fix that.
Yeah, needed the unicode for Python 2.
numpy/lib/_iotools.py
This is a bug - it was presumably supposed to use set(), but ends up producing a dict instead.
Does this code path ever get taken?
I don't think so; it's certainly not tested, since the code previously called append on a set.
I thought it used update, which is defined on both?
Maybe we get away with it because later on we test it for truthiness.
I think originally it was all written using a list, then someone upgraded to sets and missed fixing up this bit of code.
Nope, used append in a loop.
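For the record, the pitfall is that empty braces are a dict literal, which is presumably how the bug crept in:

```python
# {} is an empty dict, not an empty set; only non-empty brace
# literals like {''} produce sets.
assert type({}) is dict
assert type(set()) is set
assert type({''}) is set

# And append/update are not interchangeable between list and set:
assert hasattr([], 'append') and not hasattr(set(), 'append')
assert hasattr(set(), 'update') and not hasattr([], 'update')
```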
Force-pushed from 3d6fe25 to 9386ade.
numpy/lib/_iotools.py
This should say set(['']), not set(''), since the latter is just set(). It was correct as {''}
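Right — set() iterates its argument, so the two spellings differ:

```python
# A string argument is iterated character by character.
assert set('') == set()            # no characters, so an empty set
assert set(['']) == {''}           # one element: the empty string
assert set('ab') == {'a', 'b'}     # split into individual characters
```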
numpy/lib/tests/test_io.py
I am super confused by this function:
- Comment says decode, name says encode
- Input is unicode, comment says bytes
- Function returns False if it can decode
- Comment says default encoding, but code doesn't use sys.getdefaultencoding()
Removed. It was only used to determine if test_utf8_file_nodtype_unicode could be run. The test has been modified.
Minor cleanups of old code to reflect more modern usage.
More fixes and added a bit more documentation.
  for i in strcolidx:
-     column_types[i] = "|S%i" % max(len(row[i]) for row in data)
+     max_line_length = max(len(row[i]) for row in data)
+     column_types[i] = np.dtype((type_str, max_line_length))
This chunk needs to run for bytes too
Fixes numpygh-10394, due to regression in numpygh-10054
Rebase and squash of #4208
This modifies loadtxt and genfromtxt in several ways intended to add
unicode support for text files by adding an encoding keyword to np.load, np.genfromtxt, np.savetxt, and np.fromregex. The original
treatment of the relevant files was to open them as byte
files, whereas they are now opened as text files with an encoding. When
read, they are decoded to unicode strings for Python3 compatibility,
and when written, they are encoded as specified. For backward
compatibility, the default encoding in both cases is latin1.
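A minimal sketch of the resulting API (file name and data made up), assuming NumPy >= 1.14:

```python
import os
import tempfile

import numpy as np

# Write a text file with a non-latin1 header, then read it back,
# passing the encoding explicitly in both directions.
arr = np.array([1.5, 2.5, 3.5])
fname = os.path.join(tempfile.mkdtemp(), 'data.txt')
np.savetxt(fname, arr, header='données', encoding='utf-8')
out = np.loadtxt(fname, encoding='utf-8')
assert np.array_equal(arr, out)
```

With the default encoding=None, writing falls back to latin1 for backward compatibility, as described above.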