BUG: numpy.loadtxt read more lines than specified with max_rows #26754

@l09rin

Describe the issue:

Hi all,

I am trying to read a long file in batches of 80000 lines at a time (I avoid reading the entire file at once, since I also need to process some very large files). The sample file I use consists of 400 thousand lines of 5 columns each (mixed characters and floats, all read as str type), but loadtxt() reads 110000 lines each time instead of 80000.

If I try to read fewer lines (50000 in the example code), it works properly. Also, if I use genfromtxt() instead of loadtxt(), I get the right number of lines.
I would like to keep using loadtxt(), since I saw that it is twice as fast as genfromtxt(), but I do not understand why it does not behave correctly in some cases. I attach the code that shows the problem and also a sample data file.

data.txt
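The attachment itself is not reproduced in this text. For anyone who wants to run the example without it, the following is a minimal sketch that writes a stand-in file with the shape described above (400 thousand lines, 5 whitespace-separated columns mixing characters and floats). The function name `write_sample_file`, the one-letter label column, and the random values are all assumptions made for illustration; they are not the contents of the original data.txt.

```python
import random


def write_sample_file(path, n_lines=400_000, seed=0):
    """Write a synthetic stand-in for data.txt (hypothetical contents).

    Each line has 5 whitespace-separated columns: one single-letter
    label followed by four floats.
    """
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w") as out:
        for _ in range(n_lines):
            label = rng.choice("ABC")
            values = " ".join(f"{rng.uniform(-10, 10):.6f}" for _ in range(4))
            out.write(f"{label} {values}\n")
```

Calling `write_sample_file("sample.txt")` and pointing the script at that file should exercise the same code path, although the row counts observed may depend on the NumPy version.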

Reproduce the code example:

import numpy as np

# genfromtxt(): five consecutive reads of 80000 rows each from the same open file.
print("Looping with np.genfromtxt() :")
with open("data.txt", "r") as f:
    for i in range(5):
        conf = np.genfromtxt(f, dtype="str", max_rows=80000)
        print("conf n.", i + 1, ":", "loaded", len(conf), "lines.")

# loadtxt(): same loop, but each read returns 110000 rows instead of 80000.
print("Looping with np.loadtxt() :")
with open("data.txt", "r") as f:
    for i in range(5):
        conf = np.loadtxt(f, dtype="str", max_rows=80000)
        print("conf n.", i + 1, ":", "loaded", len(conf), "lines.")

# loadtxt() with a smaller max_rows returns the requested number of rows.
print("Loading 50000 lines with np.loadtxt() is ok:")
conf = np.loadtxt("data.txt", dtype="str", max_rows=50000)
print("conf n. 1 :", "loaded", len(conf), "lines.")
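A possible interim workaround, sketched under the assumption that requests of at most 50000 rows behave correctly (as the last call in the script suggests): split each batch into smaller loadtxt() calls on the same open file and concatenate the results. The helper name `load_rows` and its `piece_rows` parameter are hypothetical, and the 50000 default comes from the observation in this report, not from documented NumPy behavior.

```python
import warnings

import numpy as np


def load_rows(fh, n_rows, piece_rows=50000):
    """Read up to n_rows rows of strings from the open text file fh.

    The request is split into loadtxt() calls of at most piece_rows rows
    each (hypothetical helper; the limit is an assumption from the report).
    """
    pieces = []
    remaining = n_rows
    while remaining > 0:
        want = min(piece_rows, remaining)
        with warnings.catch_warnings():
            # loadtxt() warns when the file is already exhausted.
            warnings.simplefilter("ignore", UserWarning)
            piece = np.loadtxt(fh, dtype=str, max_rows=want, ndmin=2)
        if piece.size == 0:
            break  # nothing left to read
        pieces.append(piece)
        remaining -= len(piece)
        if len(piece) < want:
            break  # end of file reached inside this piece
    if not pieces:
        return np.empty((0, 0), dtype=str)
    # Concatenating string arrays keeps the widest item size.
    return np.concatenate(pieces)
```

In the loop above, `conf = load_rows(f, 80000)` would take the place of the direct `np.loadtxt(...)` call. The result is always two-dimensional because of `ndmin=2`, so a batch with a single row has shape (1, 5) rather than (5,).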

Error message:

No response

Python and NumPy Versions:

1.25.0
3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

Runtime Environment:

[{'numpy_version': '1.25.0',
'python': '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]',
'uname': uname_result(system='Linux', node='giovanni-deskuu', release='6.5.0-27-generic', version='#28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_KNL',
'AVX512_KNM',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Haswell',
'filepath': '/home/logrin/.local/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-7a851222.3.23.so',
'internal_api': 'openblas',
'num_threads': 24,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23'}]

Context for the issue:

I use numpy for data analysis in university research, since it is a fast tool to read and manipulate very large data sets.

Metadata

Labels: 00 - Bug, sprintable (issue fits the time-frame and setting of a sprint)
