
ENH: Add encoding option to numpy text IO. #10054

Merged
charris merged 4 commits into numpy:master from charris:gh-4208
Nov 26, 2017

Conversation


@charris charris commented Nov 19, 2017

Rebase and squash of #4208

This modifies loadtxt and genfromtxt in several ways intended to add
unicode support for text files by adding an encoding keyword to
np.load, np.genfromtxt, np.savetxt, and np.fromregex. The original
treatment of the relevant files was to open them as byte
files, whereas they are now opened as text files with an encoding. When
read, they are decoded to unicode strings for Python3 compatibility,
and when written, they are encoded as specified. For backward
compatibility, the default encoding in both cases is latin1.
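
A minimal round-trip sketch of the new keyword (the file path and data here are illustrative, not from the PR):

```python
import os
import tempfile

import numpy as np

# Write a text file with an explicit encoding, then read it back.
arr = np.array([[1.5, 2.5], [3.5, 4.5]])
path = os.path.join(tempfile.mkdtemp(), "data.txt")
np.savetxt(path, arr, encoding="utf-8")
loaded = np.loadtxt(path, encoding="utf-8")
```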


charris commented Nov 19, 2017

Still need to add some docstrings.

OK, should be ready to go.

Member

nit: line break here and below

Member Author

Yeah, I decided to leave off the rest of the release notes editing to just before the branch.

Member

@eric-wieser eric-wieser Nov 19, 2017

I think all the formatting changes could do with collecting together coherently too, when we get around to that.

Member

@eric-wieser eric-wieser Nov 19, 2017

It seems to me that this type of encoding-guessing goes against the Python 3 unicode model, which requires this kind of thing to be explicit. Can we perhaps add a warning here in Python 3?

Member Author

This is trying to preserve compatibility with previous behavior and is only called in three functions, two of which pass 'bytes' as the default. I went ahead and documented the default values in those places. I think we can revisit this after Python 2 is dropped. There has also been a long time desire to replace all these functions, perhaps by bringing over the Pandas functionality, so this may all become moot in the long term.

Member

Shouldn't the contract be that s is now always unicode, and so the decode is unconditional?

Member Author

_bytes_to_date is a helper function called by several other tests, so it probably needs to do the check. Mind that we are also trying to make things look as much as possible like they did before, so that check can be looked at as a compatibility check. In particular, I believe that encoding='bytes' means 'latin1' for this PR.
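
A sketch of that compatibility mode as it ended up working (the input data is illustrative): with encoding='bytes', genfromtxt decodes the file as latin1 and hands string fields back as byte strings.

```python
import io

import numpy as np

# With encoding='bytes' the file is decoded as latin1 and string
# fields come back as byte-string ('S') columns, mimicking the old
# Python 2 era behavior.
data = np.genfromtxt(io.BytesIO(b"1,abc\n2,def"), delimiter=",",
                     dtype=None, encoding="bytes")
```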

Member

Is this really true? It seems like we should now allow and encourage utf8.

Member Author

I wondered about that too, but the problem is the pickled case.

    if encoding not in ('ASCII', 'latin1', 'bytes'):
        # The 'encoding' value for pickle also affects what encoding
        # the serialized binary data of NumPy arrays is loaded
        # in. Pickle does not pass on the encoding information to
        # NumPy. The unpickling code in numpy.core.multiarray is
        # written to assume that unicode data appearing where binary
        # should be is in 'latin1'. 'bytes' is also safe, as is 'ASCII'.
        #
        # Other encoding values can corrupt binary data, and we
        # purposefully disallow them. For the same reason, the errors=
        # argument is not exposed, as values other than 'strict' can
        # similarly silently corrupt numerical data.
        raise ValueError("encoding must be 'ASCII', 'latin1', or 'bytes'")

Member

No need for the asstr here, since the above line already requires x to be unicode in py3, and py2 doesn't care either way.

Member

Typo in oldd

Member

Probably good to call them unicode strings to avoid ambiguity on python 2

Member

@eric-wieser eric-wieser Nov 19, 2017

IMO this block would be clearer as:

try:
    new_data = []
    for row_tup in data:
        row = list(row_tup)
        for i in strcolidx:
            row[i] = row[i].encode('latin1')
        row_tup = tuple(row)
        new_data.append(row_tup)
    type_str = np.bytes_
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    data = new_data

Or even better

def encode_unicode_cols(row_tup):
    row = list(row_tup)
    for i in strcolidx:
        row[i] = row[i].encode('latin1')
    return tuple(row)

try:
    data = [encode_unicode_cols(r) for r in data]
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    type_str = np.bytes_

Member Author

type_str needs to be a numpy type, 'U' or 'S'. The 'U' is the default here, overridden if the try succeeds.

Member

Would be nice to use np.dtype((the_type, max(len(row[i]) for row in data))) here instead of string formatting

Member

@eric-wieser eric-wieser Nov 19, 2017

Change to np.unicode_, and the 'S' below to np.bytes_

@charris charris force-pushed the gh-4208 branch 4 times, most recently from 3726112 to 1a5e37d Compare November 19, 2017 22:54

charris commented Nov 19, 2017

Most review comments have been addressed.

Member

This or () is useless, as we already check that names is truthy above

Member

This now rejects bytes on Python 3. This is perhaps fine, but it does so silently, as there is no else case to fail on this if

Member Author

Hmm, the whole function looks a bit bogus as it never checks what the iterator returns. _is_bytes_like also doesn't do much except check that it can be combined with the bytes type, which does not include str in Python 3.

The documentation does say that missing_values is a str or sequence of str, although it would originally accept unicode also. I think in Python 3 after this PR it should not accept bytes, but that might not be backwards compatible with third party converters as the original only worked with byte strings in Python 3.

Maybe the thing to do here is make missing_values iterable, if not already, check each value for basestring, and fail if not.

Member

Should this forward the encoding argument? (after the handling of "bytes" below)

Member Author

@charris charris Nov 20, 2017

I don't think so. The encoding parameter refers to the file encoding, while the comment strings are passed as an argument. A better option might be to not accept bytes strings for this.

Member Author

I'll update the docstring to mention the default encoding. The previously used asbytes assumed 'latin1'.
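
For background (my addition, not part of the diff): latin1 maps code points 0-255 one-to-one onto byte values, which is why it makes a lossless round-trip default.

```python
# latin1 (ISO-8859-1) maps the first 256 Unicode code points directly
# onto byte values 0-255, so any byte sequence round-trips losslessly.
raw = bytes(range(256))
assert raw.decode('latin1').encode('latin1') == raw
assert 'é'.encode('latin1') == b'\xe9'
```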

Member Author

Hmm, maybe not. There are a bunch of other passed string parameters, none of which are explicitly decoded if they are byte strings. The only thing special about comments is that it may be a sequence. I think we should simply drop the byte string option.

Member Author

OK, what I have done is decode the delimiter also and add a note to the documentation that if either delimiter or comments is passed as a byte string, it will be decoded as 'latin1'. We might run into some backwards compatibility problems here, but I don't expect many folks were passing byte strings, especially as that would assume that the input file has some strange encoding that couldn't be handled by the latin1 assumption. We should maybe warn if either is a byte string.


@eric-wieser
Member

Looking pretty good now. The decode_line stuff still makes me a little uneasy, but I guess that's no worse than the asstr we had before, and we can deal with that later.


charris commented Nov 20, 2017

I'm going to put the final bits of this off till tomorrow.


charris commented Nov 20, 2017

The decode_line stuff still makes me a little uneasy, but I guess that's no worse than the asstr we had before, and we can deal with that later.

_decode_line does behave differently from asstr: it always returns unicode, whereas on Python 2 asstr converts unicode to str.


charris commented Nov 20, 2017

OK, I think this is done, but I will be surprised, and pleased, if it doesn't turn up some problems after it is merged :(.

@charris charris force-pushed the gh-4208 branch 2 times, most recently from 4d493d3 to c9a44b1 Compare November 21, 2017 22:27
Member

Can you elaborate on this TODO?

Member Author

Not sure what it was supposed to test. However, the commented out portions work, so I have uncommented them and added tests with str and unicode. The upgrade is supposed to find and add a converter for the passed type.

Member Author

Was running wrong tests. So I am still not sure what the comment intended, but the (new?) upshot is that all input data not recognized as boolean or numeric type is converted to unicode. I've made the test check that that is the case.
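
That upshot can be sketched as follows (the input is illustrative):

```python
import io

import numpy as np

# With a real text encoding, fields that are neither boolean nor
# numeric come back as unicode ('U') columns.
data = np.genfromtxt(io.StringIO("1,abc\n2,def"), delimiter=",",
                     dtype=None, encoding="utf-8")
```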

Member

This should be nx.int_, not nx.integer, and the comment should say long, I think.

Member Author

Curiously, numpy.integer is valid and is a long. Probably an obsolete usage going back to Numeric.

Member

np.integer is the base class of np.int_ and friends. But as you observe, np.dtype(np.integer) promotes to long in the same way that np.dtype(np.number) promotes to float
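
The class relationships can be checked directly (a quick sketch; note that passing the abstract np.integer to np.dtype was later deprecated, so a concrete type is preferable):

```python
import numpy as np

# np.integer and np.number are abstract scalar base classes;
# np.int_ is the concrete integer type corresponding to the C long.
assert issubclass(np.int_, np.integer)
assert issubclass(np.integer, np.number)
assert np.dtype(np.int_).kind == 'i'
```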

Member

What happened here? A comment saying what -3 means would be helpful.

Member Author

It's a list index into a list of tuples. The contents of the list were changed. It's self-explanatory if you take a look at what _mapper is.

Member

@eric-wieser eric-wieser Nov 23, 2017

Well, this test is garbage - conv._mapper[-3][0] is np.longdouble, yet it seems to be expecting it to be a complex value.

Member

@eric-wieser eric-wieser Nov 23, 2017

Also _mapper is a class property, not an instance one - making this test even more bizarre.

Maybe the test should be:

old_map = StringConverter._mapper.copy()
new_map = StringConverter(_bytes_to_date)._mapper
assert_equal(new_map, old_map)

Member Author

Yeah, I'm not going to spend a lot of time fixing up crappy old tests/code as long as they are no worse than they were. Note that the long double converter is also broken.

Member Author

And the bytes converter is never reached.

Member

Seems like a fair stance.

Note that my comments below are about new code though!

Member

Message is a little odd in python 3 where string and unicode mean the same thing, but not sure how to fix that.

Member Author

Yeah, needed the unicode for Python 2.

Member

This is a bug - it was presumably supposed to use set(), but ends up producing a dict instead.

Does this code path ever get taken?

Member Author

I don't think so, and it's certainly not tested, as it previously used append on a set.

Member

I thought it used update, which is defined on both?

Maybe we get away with it because later on we test it for truthiness.

Member Author

I think originally it was all written using a list, then someone upgraded to sets and missed fixing up this bit of code.

Member Author

Nope, used append in a loop.

@charris charris force-pushed the gh-4208 branch 2 times, most recently from 3d6fe25 to 9386ade Compare November 23, 2017 04:45
Member

@eric-wieser eric-wieser Nov 23, 2017

This should say set(['']), not set(''), since the latter is just set(). It was correct as {''}
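
To spell the pitfall out (a quick sketch): the set constructor iterates over its argument, so an empty string yields an empty set.

```python
# set() iterates its argument: set('') walks an empty string and
# produces the empty set, while set(['']) contains one element.
assert set('') == set()
assert set(['']) == {''}
assert len(set('ab')) == 2  # iterates the characters, not the string
```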

Member

@eric-wieser eric-wieser Nov 23, 2017

I am super confused by this function:

  • Comment says decode, name says encode
  • Input is unicode, comment says bytes
  • Function returns False if it can decode

Member

  • Comment says default encoding, but code doesn't use sys.getdefaultencoding()

Member Author

Removed. It was only used to determine if test_utf8_file_nodtype_unicode could be run. The test has been modified.

Minor cleanups of old code to reflect more modern usage.

charris commented Nov 24, 2017

More fixes and added a bit more documentation.

for i in strcolidx:
-    column_types[i] = "|S%i" % max(len(row[i]) for row in data)
+    max_line_length = max(len(row[i]) for row in data)
+    column_types[i] = np.dtype((type_str, max_line_length))
Member

This chunk needs to run for bytes too


3 participants