
ENH: Add encoding option to numpy text IO. #10054

Merged
charris merged 4 commits into numpy:master from charris:gh-4208
Nov 26, 2017

Conversation


@charris charris commented Nov 19, 2017

Rebase and squash of #4208

This modifies loadtxt and genfromtxt in several ways intended to add
unicode support for text files by adding an encoding keyword to
np.load, np.genfromtxt, np.savetxt, and np.fromregex. The original
treatment of the relevant files was to open them as byte
files, whereas they are now opened as text files with an encoding. When
read, they are decoded to unicode strings for Python3 compatibility,
and when written, they are encoded as specified. For backward
compatibility, the default encoding in both cases is latin1.
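
A minimal round-trip sketch of the new keyword (the file path and data here are illustrative, not from the PR):

```python
import os
import tempfile

import numpy as np

# Write a text file with an explicit encoding, then read it back.
arr = np.array([[1.5, 2.5], [3.5, 4.5]])
path = os.path.join(tempfile.mkdtemp(), "data.txt")
np.savetxt(path, arr, encoding="utf-8")
loaded = np.loadtxt(path, encoding="utf-8")
```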


charris commented Nov 19, 2017

Still need to add some docstrings.

OK, should be ready to go.

Member

nit: line break here and below

Member Author

Yeah, I decided to leave off the rest of the release notes editing to just before the branch.

Member

@eric-wieser eric-wieser Nov 19, 2017

I think all the formatting changes could do with collecting together coherently too, when we get around to that.

Member

@eric-wieser eric-wieser Nov 19, 2017

It seems to me that this type of encoding-guessing goes against the Python 3 unicode model, which requires this kind of thing to be explicit. Can we perhaps add a warning here in Python 3?

Member Author

This is trying to preserve compatibility with previous behavior and is only called in three functions, two of which pass 'bytes' as the default. I went ahead and documented the default values in those places. I think we can revisit this after Python 2 is dropped. There has also been a long time desire to replace all these functions, perhaps by bringing over the Pandas functionality, so this may all become moot in the long term.

Member

Shouldn't the contract be that s is now always unicode, and so the decode is unconditional?

Member Author

_bytes_to_date is a helper function called by several other tests, so it probably needs to do the check. Mind that we are also trying to make things look as much as possible like they did before, so that check can be looked at as a compatibility check. In particular, I believe that encoding='bytes' means 'latin1' for this PR.
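
A sketch of that compatibility mode as it ended up working (the input data is illustrative): with encoding='bytes', genfromtxt decodes the file as latin1 and hands string fields back as byte strings.

```python
import io

import numpy as np

# With encoding='bytes' the file is decoded as latin1 and string
# fields come back as byte-string ('S') columns, mimicking the old
# Python 2 era behavior.
data = np.genfromtxt(io.BytesIO(b"1,abc\n2,def"), delimiter=",",
                     dtype=None, encoding="bytes")
```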

Member

Is this really true? It seems like we should now allow and encourage utf8.

Member Author

I wondered about that too, but the problem is the pickled case.

    if encoding not in ('ASCII', 'latin1', 'bytes'):
        # The 'encoding' value for pickle also affects what encoding
        # the serialized binary data of NumPy arrays is loaded
        # in. Pickle does not pass on the encoding information to
        # NumPy. The unpickling code in numpy.core.multiarray is
        # written to assume that unicode data appearing where binary
        # should be is in 'latin1'. 'bytes' is also safe, as is 'ASCII'.
        #
        # Other encoding values can corrupt binary data, and we
        # purposefully disallow them. For the same reason, the errors=
        # argument is not exposed, as values other than 'strict' can
        # similarly silently corrupt numerical data.
        raise ValueError("encoding must be 'ASCII', 'latin1', or 'bytes'")

Member

No need for the asstr here, since the above line already requires x to be unicode in py3, and py2 doesn't care either way.

Member

Typo in oldd

Member

Probably good to call them unicode strings to avoid ambiguity on python 2

Member

@eric-wieser eric-wieser Nov 19, 2017

IMO this block would be clearer as:

try:
    new_data = []
    for row_tup in data:
        row = list(row_tup)
        for i in strcolidx:
            row[i] = row[i].encode('latin1')
        row_tup = tuple(row)
        new_data.append(row_tup)
    type_str = np.bytes_
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    data = new_data

Or even better

def encode_unicode_cols(row_tup):
    row = list(row_tup)
    for i in strcolidx:
        row[i] = row[i].encode('latin1')
    return tuple(row)

try:
    data = [encode_unicode_cols(r) for r in data]
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    type_str = np.bytes_

Member Author

type_str needs to be a numpy type, 'U' or 'S'. The 'U' is the default here, overridden if the try succeeds.

Member

Would be nice to use np.dtype((the_type, max(len(row[i]) for row in data))) here instead of string formatting

Member

@eric-wieser eric-wieser Nov 19, 2017

Change to np.unicode_, and the 'S' below to np.bytes_

@charris charris force-pushed the gh-4208 branch 4 times, most recently from 3726112 to 1a5e37d Compare November 19, 2017 22:54

charris commented Nov 19, 2017

Most review comments have been addressed.

Member

This or () is useless, as we already check that names is truthy above

Member

This now rejects bytes on Python 3. This is perhaps fine, but it does so silently, as there is no else case to fail on this if

Member Author

Hmm, the whole function looks a bit bogus as it never checks what the iterator returns. _is_bytes_like also doesn't do much except check that it can be combined with the bytes type, which does not include str in Python 3.

The documentation does say that missing_values is a str or sequence of str, although it would originally accept unicode also. I think in Python 3 after this PR it should not accept bytes, but that might not be backwards compatible with third party converters as the original only worked with byte strings in Python 3.

Maybe the thing to do here is make missing_values iterable, if not already, check each value for basestring, and fail if not.

Member

Should this forward the encoding argument? (after the handling of "bytes" below)

Member Author

@charris charris Nov 20, 2017

I don't think so. The encoding parameter refers to the file encoding, while the comment strings are passed as an argument. A better option might be to not accept bytes strings for this.

Member Author

I'll update the docstring to mention the default encoding. The previously used asbytes assumed 'latin1'.
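
For background (my addition, not part of the diff): latin1 maps code points 0-255 one-to-one onto byte values, which is why it makes a lossless round-trip default.

```python
# latin1 (ISO-8859-1) maps the first 256 Unicode code points directly
# onto byte values 0-255, so any byte sequence round-trips losslessly.
raw = bytes(range(256))
assert raw.decode('latin1').encode('latin1') == raw
assert 'é'.encode('latin1') == b'\xe9'
```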

Member Author

Hmm, maybe not. There are a bunch of other passed string parameters, none of which are explicitly decoded if they are byte strings. The only thing special about comments is that it may be a sequence. I think we should simply drop the byte string option.

Member Author

OK, what I have done is decode the delimiter also and add a note to the documentation that if either delimiter or comments is passed as a byte string, it will be decoded as 'latin1'. We might run into some backwards compatibility problems here, but I don't expect many folks were passing byte strings, especially as that would assume that the input file has some strange encoding that couldn't be handled by the latin1 assumption. We should maybe warn if either is a byte string.


@eric-wieser
Member

Looking pretty good now. The decode_line stuff still makes me a little uneasy, but I guess that's no worse than the asstr we had before, and we can deal with that later.


charris commented Nov 20, 2017

I'm going to put the final bits of this off till tomorrow.


charris commented Nov 20, 2017

The decode_line stuff still makes me a little uneasy, but I guess that's no worse than the asstr we had before, and we can deal with that later.

_decode_line does behave differently from asstr: it always returns unicode, whereas on Python 2 asstr converts unicode to str.


charris commented Nov 20, 2017

OK, I think this is done, but I will be surprised, and pleased, if it doesn't turn up some problems after it is merged :(.

@charris charris force-pushed the gh-4208 branch 2 times, most recently from 4d493d3 to c9a44b1 Compare November 21, 2017 22:27
Member

Can you elaborate on this TODO?

Member Author

Not sure what it was supposed to test. However, the commented out portions work, so I have uncommented them and added tests with str and unicode. The upgrade is supposed to find and add a converter for the passed type.

Member Author

Was running wrong tests. So I am still not sure what the comment intended, but the (new?) upshot is that all input data not recognized as boolean or numeric type is converted to unicode. I've made the test check that that is the case.
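
That upshot can be sketched as follows (the input is illustrative):

```python
import io

import numpy as np

# With a real text encoding, fields that are neither boolean nor
# numeric come back as unicode ('U') columns.
data = np.genfromtxt(io.StringIO("1,abc\n2,def"), delimiter=",",
                     dtype=None, encoding="utf-8")
```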

Member

This should be nx.int_, not nx.integer, and the comment should say long, I think.

Member Author

Curiously, numpy.integer is valid and is a long. Probably an obsolete usage going back to Numeric.

Member

np.integer is the base class of np.int_ and friends. But as you observe, np.dtype(np.integer) promotes to long in the same way that np.dtype(np.number) promotes to float
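
The class relationships can be checked directly (a quick sketch; note that passing the abstract np.integer to np.dtype was later deprecated, so a concrete type is preferable):

```python
import numpy as np

# np.integer and np.number are abstract scalar base classes;
# np.int_ is the concrete integer type corresponding to the C long.
assert issubclass(np.int_, np.integer)
assert issubclass(np.integer, np.number)
assert np.dtype(np.int_).kind == 'i'
```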

Member

What happened here? A comment saying what -3 means would be helpful.

Member Author

It's a list index into a list of tuples. The contents of the list were changed. It's self-explanatory if you take a look at what _mapper is.

Member

@eric-wieser eric-wieser Nov 23, 2017

Well, this test is garbage - conv._mapper[-3][0] is np.longdouble, yet it seems to be expecting it to be a complex value.

Member

@eric-wieser eric-wieser Nov 23, 2017

Also _mapper is a class property, not an instance one - making this test even more bizarre.

Maybe the test should be:

old_map = StringConverter._mapper.copy()
new_map = StringConverter(_bytes_to_date)._mapper
assert_equal(new_map, old_map)

Member Author

Yeah, I'm not going to spend a lot of time fixing up crappy old tests/code as long as they are no worse than they were. Note that the long double converter is also broken.

Member Author

And the bytes converter is never reached.

Member

Seems like a fair stance.

Note that my comments below are about new code though!

Member

Message is a little odd in python 3 where string and unicode mean the same thing, but not sure how to fix that.

Member Author

Yeah, needed the unicode for Python 2.

Member

This is a bug - it was presumably supposed to use set(), but ends up producing a dict instead.

Does this code path ever get taken?

Member Author

I don't think so, and it's certainly not tested, as it previously used append on a set.

Member

I thought it used update, which is defined on both?

Maybe we get away with it because later on we test it for truthiness.

Member Author

I think originally it was all written using a list, then someone upgraded to sets and missed fixing up this bit of code.

Member Author

Nope, used append in a loop.

@charris charris force-pushed the gh-4208 branch 2 times, most recently from 3d6fe25 to 9386ade Compare November 23, 2017 04:45
Member

@eric-wieser eric-wieser Nov 23, 2017

This should say set(['']), not set(''), since the latter is just set(). It was correct as {''}
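
To spell the pitfall out (a quick sketch): the set constructor iterates over its argument, so an empty string yields an empty set.

```python
# set() iterates its argument: set('') walks an empty string and
# produces the empty set, while set(['']) contains one element.
assert set('') == set()
assert set(['']) == {''}
assert len(set('ab')) == 2  # iterates the characters, not the string
```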

Member

@eric-wieser eric-wieser Nov 23, 2017

I am super confused by this function:

  • Comment says decode, name says encode
  • Input is unicode, comment says bytes
  • Function returns False if it can decode

Member

  • Comment says default encoding, but code doesn't use sys.getdefaultencoding()

Member Author

Removed. It was only used to determine if test_utf8_file_nodtype_unicode could be run. The test has been modified.

Minor cleanups of old code to reflect more modern usage.

charris commented Nov 24, 2017

More fixes and added a bit more documentation.

for i in strcolidx:
-    column_types[i] = "|S%i" % max(len(row[i]) for row in data)
+    max_line_length = max(len(row[i]) for row in data)
+    column_types[i] = np.dtype((type_str, max_line_length))
Member

This chunk needs to run for bytes too


3 participants