BUG: np.loadtxt cannot load text file with quoted fields separated by whitespace #22899
Describe the issue:
np.loadtxt cannot load a text file with quoted fields separated by whitespace (i.e. with delimiter=None, where fields may be separated by runs of whitespace rather than a single-character delimiter). It raises a ValueError. See the example below for details.
Bug Fix
This bug (possibly a typo) can be fixed by changing a single line (we verified that the fix below works, but I cannot submit a pull request).
Here is the block from numpy/core/src/multiarray/textreading/tokenize.cpp, function tokenizer_core, that contains the bug:
// lines 214 -- 228
case TOKENIZE_QUOTED_CHECK_DOUBLE_QUOTE:
    if (*pos == config->quote) {
        /* Copy the quote character directly from the config: */
        if (copy_to_field_buffer(ts,
                &config->quote, &config->quote+1) < 0) {
            return -1;
        }
        ts->state = TOKENIZE_QUOTED;
        pos++;
    }
    else {
        /* continue parsing as if unquoted */
        // BUG: the original line below is wrong:
        //     ts->state = TOKENIZE_UNQUOTED;
        // FIX: use ts->unquoted_state instead, as in lines 121 -- 144:
        ts->state = ts->unquoted_state;
    }
    break;
Note that TOKENIZE_UNQUOTED was replaced by ts->unquoted_state in the block on lines 121 -- 144, but the same change was apparently forgotten in the block on lines 214 -- 228.
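To illustrate why the wrong fallback state breaks whitespace-delimited parsing, here is a minimal Python model of the tokenizer's state handling (an illustrative sketch, not the actual NumPy implementation; the state names only loosely mirror those in tokenize.cpp):

```python
def tokenize(line, after_quote_state):
    """Split `line` on whitespace, honoring double quotes.

    `after_quote_state` models the state entered after a closing quote:
    "UNQUOTED" (the buggy fallback: with delimiter=None there is no
    single-character delimiter, so whitespace is never recognized as a
    field separator) or "UNQUOTED_WHITESPACE" (the correct
    ts->unquoted_state for whitespace-delimited mode).
    """
    fields, buf, state = [], "", "UNQUOTED_WHITESPACE"
    for ch in line:
        if state == "UNQUOTED_WHITESPACE":
            if ch == " ":
                if buf:
                    fields.append(buf)
                    buf = ""
                # else: skip repeated whitespace between fields
            elif ch == '"' and not buf:
                state = "QUOTED"
            else:
                buf += ch
        elif state == "UNQUOTED":
            # delimiter=None: no single-char delimiter to match, so
            # whitespace is swallowed into the field content
            buf += ch
        elif state == "QUOTED":
            if ch == '"':
                state = after_quote_state  # <-- the line at issue
            else:
                buf += ch
    if buf:
        fields.append(buf)
    return fields
```

With the buggy fallback, `tokenize('"alpha, #42" 10.0', "UNQUOTED")` yields a single field, which matches the "number of columns changed from 2 to 1" error below; with `"UNQUOTED_WHITESPACE"` it yields the expected two fields, and a line that starts with the unquoted field works either way, matching the second example in the repro.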
Reproduce the code example:
import numpy as np
from io import StringIO
# The code below raises ValueError
s = StringIO('"alpha, #42" 10.0\n"beta, #64" 2.0\n')
dtype = np.dtype([("label", "U12"), ("value", float)])
np.loadtxt(s, dtype=dtype, delimiter=None, quotechar='"')
# The code works if we swap positions of unquoted and quoted fields
s = StringIO('10 "alpha, #42"\n2.0 "beta, #64"\n')
dtype = np.dtype([("value", float), ("label", "U12")])
np.loadtxt(s, dtype=dtype, delimiter=None, quotechar='"')

Error message:
ValueError Traceback (most recent call last)
Cell In[13], line 3
1 s = StringIO('"alpha, #42" 10.0\n"beta, #64" 2.0\n')
2 dtype = np.dtype([("label", "U12"), ("value", float)])
----> 3 np.loadtxt(s, dtype=dtype, delimiter=None, quotechar='"')
File ~/miniconda3/lib/python3.10/site-packages/numpy/lib/npyio.py:1318, in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, quotechar, like)
1315 if isinstance(delimiter, bytes):
1316 delimiter = delimiter.decode('latin1')
-> 1318 arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
1319 converters=converters, skiplines=skiprows, usecols=usecols,
1320 unpack=unpack, ndmin=ndmin, encoding=encoding,
1321 max_rows=max_rows, quote=quotechar)
1323 return arr
File ~/miniconda3/lib/python3.10/site-packages/numpy/lib/npyio.py:979, in _read(fname, delimiter, comment, quote, imaginary_unit, usecols, skiplines, max_rows, converters, ndmin, unpack, dtype, encoding)
976 data = _preprocess_comments(data, comments, encoding)
978 if read_dtype_via_object_chunks is None:
--> 979 arr = _load_from_filelike(
980 data, delimiter=delimiter, comment=comment, quote=quote,
981 imaginary_unit=imaginary_unit,
982 usecols=usecols, skiplines=skiplines, max_rows=max_rows,
983 converters=converters, dtype=dtype,
984 encoding=encoding, filelike=filelike,
985 byte_converters=byte_converters)
987 else:
988 # This branch reads the file into chunks of object arrays and then
989 # casts them to the desired actual dtype. This ensures correct
990 # string-length and datetime-unit discovery (like `arr.astype()`).
991 # Due to chunking, certain error reports are less clear, currently.
992 if filelike:
ValueError: the number of columns changed from 2 to 1 at row 1; use `usecols` to select a subset and avoid this error

Runtime information:
NumPy: 1.23.4
Python: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]
Context for the issue:
This bug prevents us from using the newly rewritten np.loadtxt to load our text files with quoted fields, so we continue to use a wrapper around pandas.read_csv. Note that the rewritten np.loadtxt is 3-8 times faster (!) than pandas.read_csv, so we would obviously like to switch to the more performant function.
This is an easy one-line bugfix; it would be great if it could be included in the next NumPy release.
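Until the fix lands, a possible interim workaround (a sketch using only the standard library, not an official NumPy recipe) is to pre-split each line with shlex, which handles quoted fields separated by runs of whitespace, and then build the structured array from the result:

```python
import shlex
from io import StringIO

import numpy as np

s = StringIO('"alpha, #42" 10.0\n"beta, #64" 2.0\n')
dtype = np.dtype([("label", "U12"), ("value", float)])

# shlex.split honors double quotes and treats any run of whitespace
# as a separator, which is the behavior np.loadtxt fails to provide
# here; quotes are stripped from the resulting fields.
rows = [tuple(shlex.split(line)) for line in s if line.strip()]
arr = np.array([(label, float(value)) for label, value in rows], dtype=dtype)
```

This is slower than the C tokenizer, of course, but it avoids the ValueError until the one-line fix is released.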