15

Consider this Python program:

import sys

lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1

print lc, sys.argv[1]

Running it on my 6 GB text file, it completes in ~2 minutes.

Question: is it possible to go faster?

Note that the same time is required by:

wc -l myfile.txt

so I suspect the answer to my question is just a plain "no".

Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting tricks (like keeping line-count metadata in the file).

PS: I tagged this question "linux" because I'm interested only in Linux-specific answers. Feel free to give OS-agnostic, or even other-OS, answers if you have them.

See also the follow-up question

4
  • 3
    Have a look at a very similar discussion here: stackoverflow.com/questions/845058/… Commented May 11, 2009 at 17:11
  • 3
    Likely the bulk of the time here is spent waiting on the disk. Commented May 11, 2009 at 17:12
  • See my answer about using posix_fadvise(2) to your follow up question. Commented May 14, 2009 at 0:41
  • 2
    I'm late to the party, but for big files "sed -n '$=' filename" is faster than "wc -l" Commented Dec 5, 2013 at 11:24

9 Answers

13

Throw hardware at the problem.

As gs pointed out, your bottleneck is the hard disk transfer rate. So no, you can't improve your time with a better algorithm, but you can buy a faster hard drive.

Edit: Another good point by gs: you could also use a RAID configuration to improve your speed. This can be done with either hardware or software (e.g. on OS X, Linux, Windows Server, etc.).


Governing Equation

(Amount to transfer) / (transfer rate) = (time to transfer)

(6000 MB) / (60 MB/s) = 100 seconds

(6000 MB) / (125 MB/s) = 48 seconds


Hardware Solutions

The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".

Or you could check out the WD Velociraptor hard drive (10,000 rpm).

Also, I hear the Seagate Cheetah is a good option (15,000 rpm with a sustained 125 MB/s transfer rate).



10

The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.

First, be sure your 6GB file read is I/O bound, not CPU bound.

If it's I/O bound, consider the "Fan-Out" design pattern (a sketch follows the list below).

  • A parent process spawns a bunch of children.

  • The parent reads the 6 GB file and deals rows out to the children by writing to their STDIN pipes. The 6 GB read time will remain constant. The row dealing should involve as little parent processing as possible; only very simple filters or counts should be used.

    A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.

  • Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduced) results. Later, the results in those files can be consolidated.
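
Below is a minimal Python 3 sketch of this layout. The script names (parent.py, worker.py), the four-worker count, and the round-robin dealing are my own illustrative choices, not part of the original answer.

parent.py

import subprocess
import sys

N_WORKERS = 4  # illustrative; tune to the number of children you want

# Spawn the children; each gets its own STDIN pipe and its own result file.
workers = [
    subprocess.Popen(
        [sys.executable, "worker.py", "result_%d.txt" % i],
        stdin=subprocess.PIPE,
    )
    for i in range(N_WORKERS)
]

# Deal rows out round-robin; keep parent-side processing as small as possible.
with open(sys.argv[1], "rb") as f:
    for i, line in enumerate(f):
        workers[i % N_WORKERS].stdin.write(line)

# Closing STDIN signals end-of-input to each child; then wait for them all.
for w in workers:
    w.stdin.close()
    w.wait()

worker.py

import sys

count = 0
for line in sys.stdin.buffer:   # rows dealt by the parent
    count += 1                  # stand-in for the real per-row work

# Write the summarized (reduced) result to a small disk file for later consolidation.
with open(sys.argv[1], "w") as out:
    out.write("%d\n" % count)

Running python parent.py myfile.txt would then leave result_0.txt through result_3.txt to be consolidated afterwards.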

6 Comments

Probably (on the third bullet) you meant that all the children should talk to each other in memory, since the disk is already very busy.
Yes, but in your third bullet you wrote: "Each child should probably write a simple disk file."
@Davide: Sorry -- didn't see what you were getting at. There's no easy fan-in with pipes; therefore the final results are simplest to process as disk files. There aren't a lot of ways around this. The final result(s) must be written somewhere. Many small files are less impact than one big file because you have more opportunities for some child to be non-blocking and working.
Surely fan-out is only useful if you are bound by a single CPU core, but have more cores available. If you are I/O bound, it's not going to make any difference.
@CloudCho "I/O bound" means that you are at the limit of the speed of your disk. In the case of a traditional hard disk, it physically can't spin any faster. In the case of an SSD, the electronics simply aren't designed to be able to move more data.
7

You can't get any faster than the maximum disk read speed.

To reach the maximum disk speed you can use the following two tips (a sketch combining them follows the list):

  1. Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader (available in Python 2.6+).
  2. Do the newline counting in another thread, in parallel.
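
A rough Python 3 sketch of the two tips combined: one thread reads big binary blocks while another counts the newlines. The 1 MiB block size, the bounded queue, and the hand-off via queue.Queue are my own assumptions, not taken from the answer.

import sys
import queue
import threading

BLOCK = 1024 * 1024              # illustrative 1 MiB read size
chunks = queue.Queue(maxsize=8)  # bounded hand-off buffer between the threads

def reader(path):
    # Keep the disk busy by reading large binary blocks.
    with open(path, "rb", buffering=BLOCK) as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            chunks.put(block)
    chunks.put(None)             # sentinel: no more data

def counter(result):
    # Count newlines in parallel with the reading.
    total = 0
    while True:
        block = chunks.get()
        if block is None:
            break
        total += block.count(b"\n")
    result.append(total)

result = []
t = threading.Thread(target=counter, args=(result,))
t.start()
reader(sys.argv[1])
t.join()
print(result[0], sys.argv[1])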

4 Comments

-1 I don't see how doing the newline counting in another thread would speed things up. It will just slow things down. Waiting on threads doesn't make you wait faster.
Normally you would be right. However, in this case the thread reading from the file will wait for I/O while the other thread parses the newlines. That way the reader thread won't wait for the parser thread to parse the newlines between consecutive reads.
I'm accepting this answer even though in this particular case it is not worth the effort, since the job-per-line is very low and I'm already going at the hardware's maximum speed. See also the follow-up question for further details.
I agree with nosklo. I think the increment is so fast as to be irrelevant, and another thread could even make such a thing slower. Also, the for loop is already buffered in Python by default. I doubt that using BufferedReader to make the buffer larger would help.
6

plain "no".

You've pretty much reached maximum disk speed.

I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.
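
For illustration, here is a small Python 3 sketch of the binary-chunk variant. The 1 MiB chunk size is my own choice, and the answer does not promise it will beat the plain line loop.

import sys

lc = 0
with open(sys.argv[1], "rb") as f:
    while True:
        chunk = f.read(1024 * 1024)   # 1 MiB binary chunks
        if not chunk:
            break
        lc += chunk.count(b"\n")      # count newlines per chunk

print(lc, sys.argv[1])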


5

If you assume that a disk can read 60 MB/s, you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think you can get any faster, because the disk is the bottleneck.

2 Comments

Where does that 20 in your calculation come from? Did you mean 6000 / 60 = 100? 60, not 20, right?
I first wanted to calculate it with 20MB/s, but then I thought that this is too slow.
2

As others have said: "no".

Almost all of your time is spent waiting for I/O. If this is something that you need to do more than once, and you have a machine with tons of RAM, you could keep the file in memory. If your machine has 16 GB of RAM, you'll have 8 GB available in /dev/shm to play with.
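
If the file fits, here is a hedged sketch of the /dev/shm idea: pay the disk cost once by copying into the RAM-backed tmpfs mount, then let every later pass read from memory. The copy step and paths are my own illustration, not from the answer.

import os
import shutil
import sys

src = sys.argv[1]
cached = os.path.join("/dev/shm", os.path.basename(src))

# One-time copy into the RAM-backed tmpfs mount.
if not os.path.exists(cached):
    shutil.copy(src, cached)

# Subsequent passes read from memory instead of the disk.
lc = 0
with open(cached, "rb") as f:
    for line in f:
        lc += 1
print(lc, cached)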

Another option: if you have multiple machines, this problem is trivial to parallelize. Split the file among the machines, have each of them count its newlines, and add the results.


2

2 minutes sounds about right to read an entire 6 GB file. There's not really much you can do to the algorithm or the OS to speed things up. I think you have two options:

  1. Throw money at the problem and get better hardware. Probably the best option if this project is for your job.

  2. Don't read the entire file. I don't know what you are trying to do with the data, so maybe you have no option but to read the whole thing. On the other hand, if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.


2

This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.

Once you've reduced your file size, when you go to iterate through this file you just:

  1. Load the smaller file into memory (or use stream object).
  2. Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
  3. Work on the decompressed file data as you wish.

I've found this process to be up to 3x faster in the best case than the plain I/O-bound approach. It's a bit outside the scope of the question, but it's an old one and people may find it useful.


Example:

compress.py

import zlib

with open("big.csv", "rb") as f:
    compressed = zlib.compress(f.read())
    open("big_comp.csv", "wb").write(compressed)

iterate.py

import zlib

with open("big_comp.csv", "rb") as f:
    big = zlib.decompress(f.read())
    for line in big.split("\n"):
        line = reversed(line)


-1

PyPy provides optimised input/output that can be up to 7 times faster.

1 Comment

It would be helpful if you added your code to support this.
