loadtxt: strange handling of empty lines with max_rows #19785
Description
np.loadtxt counts rows in a way that produces surprising results with the max_rows parameter. Empty lines (i.e. lines containing only a newline character) are skipped during parsing, but they are still counted toward max_rows. As a result, if a file contains blank lines, you have to include those blank lines in max_rows to get the desired number of data rows.
Reproducing code example:
Let's say you have the following data, which includes empty lines between data rows:
>>> import io
>>> import numpy as np
>>> f = io.StringIO("1 2\n\n3 4\n\n5 6\n\n")
Reading the whole file works as expected (empty rows are ignored):
>>> np.loadtxt(f)
array([[1., 2.],
       [3., 4.],
       [5., 6.]])
Now let's say you want to read only the first 2 data rows, so you try using max_rows:
>>> f.seek(0)
>>> np.loadtxt(f, max_rows=2) # expecting np.array([[1., 2.], [3., 4.]])
array([1., 2.])
This is because blank lines are not ignored when the row count is incremented.
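To make the two counting rules concrete, here is a toy model in plain Python. It is not NumPy's parser: `toy_read` and its `count_blank_lines` flag are hypothetical names used only for illustration, and the model only covers blank lines that follow a data row.

```python
def toy_read(text, max_rows, count_blank_lines):
    """Toy model of row counting; NOT NumPy's actual implementation."""
    rows = []
    counted = 0
    for line in text.splitlines():
        if counted >= max_rows:
            break
        fields = line.split()
        if fields:
            # A data line is parsed and always counts toward max_rows.
            rows.append([float(x) for x in fields])
            counted += 1
        elif count_blank_lines:
            # A blank line yields no data but still uses up one unit of max_rows.
            counted += 1
    return rows


data = "1 2\n\n3 4\n\n5 6\n\n"
print(toy_read(data, 2, count_blank_lines=True))   # [[1.0, 2.0]]
print(toy_read(data, 2, count_blank_lines=False))  # [[1.0, 2.0], [3.0, 4.0]]
```

With `count_blank_lines=True` the model reproduces the single row seen above; with `count_blank_lines=False` it returns the two data rows one would intuitively expect.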
There is a further inconsistency: blank lines that precede the first data row are not counted toward max_rows, unlike blank lines that occur after the first data row:
>>> f = io.StringIO("\n\n\n\n\n1 2\n\n3 4\n\n5 6\n\n")
>>> # The empty lines *preceding* the first data row are ignored for row counting,
>>> # but empty lines *after* the first data row are not.
>>> np.loadtxt(f, max_rows=3)
array([[1., 2.],
       [3., 4.]])
Expected Behavior
I think it would be an improvement if empty lines were ignored for the purposes of row counting. IMO max_rows should be interpreted as the maximum number of rows in the returned array, rather than the maximum number of newlines in the file.
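Until that is the case, one possible workaround is to filter out blank lines before they reach np.loadtxt, so that the limit applies to data rows only. The sketch below relies on np.loadtxt accepting a list of strings; the helper name `loadtxt_data_rows` is hypothetical and not part of NumPy.

```python
import io
import itertools

import numpy as np


def loadtxt_data_rows(fileobj, max_rows, **kwargs):
    """Hypothetical helper: parse at most `max_rows` non-blank lines."""
    # Drop whitespace-only lines so they can never be counted as rows.
    nonblank = (line for line in fileobj if line.strip())
    # Take only the first `max_rows` remaining lines and hand them to loadtxt.
    selected = list(itertools.islice(nonblank, max_rows))
    return np.loadtxt(selected, **kwargs)


f = io.StringIO("1 2\n\n3 4\n\n5 6\n\n")
print(loadtxt_data_rows(f, max_rows=2))  # two data rows: [1, 2] and [3, 4]
```

Because the limit is applied before parsing, max_rows is not passed to np.loadtxt at all. Note that this sketch only handles blank lines; comment lines would still be counted by the slice and would need their own filter.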
See also BIDS-numpy/npreadtext#20
NumPy/Python version information:
1.22.0.dev0+954.g5ae53e93b 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0]