BUG: np.loadtxt cannot load text file with quoted fields separated by whitespace #22899

@dmbelov

Describe the issue:

Description

np.loadtxt cannot load a text file whose quoted fields are separated by a run of whitespace (multiple " " instead of a single " "). It raises a ValueError. See the example below for details.

Bug Fix

This bug (typo?) can be fixed by changing a single line (we checked that the fix below works, but I cannot submit a pull request).

Here is the block from numpy/core/src/multiarray/textreading/tokenize.cpp function tokenizer_core that contains the bug:

// lines 214 -- 228
        case TOKENIZE_QUOTED_CHECK_DOUBLE_QUOTE:
            if (*pos == config->quote) {
                /* Copy the quote character directly from the config: */
                if (copy_to_field_buffer(ts,
                        &config->quote, &config->quote+1) < 0) {
                    return -1;
                }
                ts->state = TOKENIZE_QUOTED;
                pos++;
            }
            else {
                /* continue parsing as if unquoted */
// BUG: the original line below always switches to TOKENIZE_UNQUOTED:
//              ts->state = TOKENIZE_UNQUOTED;
// BUG FIX: replace TOKENIZE_UNQUOTED with ts->unquoted_state, as the code on lines 121-144 already does:
                ts->state = ts->unquoted_state;
            }
            break;

Note that the UNQUOTED state was already replaced by ts->unquoted_state in the block on lines 121 -- 144; the same replacement was probably just overlooked in the block on lines 214 -- 228.

Reproduce the code example:

import numpy as np
from io import StringIO

# The code below raises ValueError
s = StringIO('"alpha, #42"         10.0\n"beta, #64" 2.0\n')
dtype = np.dtype([("label", "U12"), ("value", float)])
np.loadtxt(s, dtype=dtype, delimiter=None, quotechar='"')


# The code works if we swap positions of unquoted and quoted fields
s = StringIO('10     "alpha, #42"\n2.0 "beta, #64"\n')
dtype = np.dtype([("value", float), ("label", "U12")])
np.loadtxt(s, dtype=dtype, delimiter=None, quotechar='"')

Error message:

ValueError                                Traceback (most recent call last)
Cell In[13], line 3
      1 s = StringIO('"alpha, #42"         10.0\n"beta, #64" 2.0\n')
      2 dtype = np.dtype([("label", "U12"), ("value", float)])
----> 3 np.loadtxt(s, dtype=dtype, delimiter=None, quotechar='"')

File ~/miniconda3/lib/python3.10/site-packages/numpy/lib/npyio.py:1318, in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, quotechar, like)
   1315 if isinstance(delimiter, bytes):
   1316     delimiter = delimiter.decode('latin1')
-> 1318 arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
   1319             converters=converters, skiplines=skiprows, usecols=usecols,
   1320             unpack=unpack, ndmin=ndmin, encoding=encoding,
   1321             max_rows=max_rows, quote=quotechar)
   1323 return arr

File ~/miniconda3/lib/python3.10/site-packages/numpy/lib/npyio.py:979, in _read(fname, delimiter, comment, quote, imaginary_unit, usecols, skiplines, max_rows, converters, ndmin, unpack, dtype, encoding)
    976     data = _preprocess_comments(data, comments, encoding)
    978 if read_dtype_via_object_chunks is None:
--> 979     arr = _load_from_filelike(
    980         data, delimiter=delimiter, comment=comment, quote=quote,
    981         imaginary_unit=imaginary_unit,
    982         usecols=usecols, skiplines=skiplines, max_rows=max_rows,
    983         converters=converters, dtype=dtype,
    984         encoding=encoding, filelike=filelike,
    985         byte_converters=byte_converters)
    987 else:
    988     # This branch reads the file into chunks of object arrays and then
    989     # casts them to the desired actual dtype.  This ensures correct
    990     # string-length and datetime-unit discovery (like `arr.astype()`).
    991     # Due to chunking, certain error reports are less clear, currently.
    992     if filelike:

ValueError: the number of columns changed from 2 to 1 at row 1; use `usecols` to select a subset and avoid this error

Runtime information:

1.23.4
3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0]

Context for the issue:

This bug prevents us from using the newly written np.loadtxt to load our text files with quoted fields, so we have to continue using a wrapper around pandas.read_csv. Note that the newly written np.loadtxt is 3-8 times faster (!) than pandas.read_csv, so we would naturally like to switch to the more performant function.

This is an easy bugfix (just one line); it would be great if you could include it in the next NumPy release.
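For anyone hitting the same error before a fixed release is available, here is a minimal, standard-library-only sketch of a workaround. It is not part of NumPy: the helper name parse_quoted_whitespace is hypothetical, and it assumes that shlex quoting rules are close enough to the file format (double-quoted fields, no backslashes or single quotes inside the data, no doubled-quote escapes).

```python
import shlex
from io import StringIO


def parse_quoted_whitespace(lines):
    """Split each line on runs of whitespace, keeping quoted fields intact.

    Hypothetical helper, not a NumPy API. shlex.split() leaves '#' literal
    because its `comments` argument defaults to False.
    """
    rows = []
    for line in lines:
        fields = shlex.split(line)
        if fields:  # skip blank lines
            rows.append(tuple(fields))
    return rows


# Same input as the failing np.loadtxt call in the reproducer.
s = StringIO('"alpha, #42"         10.0\n"beta, #64" 2.0\n')
rows = [(label, float(value)) for label, value in parse_quoted_whitespace(s)]
print(rows)  # [('alpha, #42', 10.0), ('beta, #64', 2.0)]
```

The resulting list of tuples can then be passed to np.array(rows, dtype=dtype) with the structured dtype from the reproducer. This pure-Python path will be much slower than np.loadtxt, so it is only meant as a stopgap.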
