Description
Aspects of accessing PP and UM fields file data have sometimes been very slow, for quite a while. I had previously always assumed that this was a cf.aggregation issue, which it very much sometimes was! ... but I think aggregation now performs pretty well.
@theabro kindly raised a case of reading a CF field from a 16 GB PP file. The CF Field itself comprised 2040 (= 24 x 85) 2-d PP fields:
>>> print(f)
<CF Field: id%UM_m01s50i500_vn1300(time(24), atmosphere_hybrid_height_coordinate(85), latitude(144), longitude(192))>

Accessing the full data array with a = f.array is taking ~11,000 seconds - far too long!
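For reference, a timing of this kind can be reproduced along these lines (a minimal sketch: the file name is a placeholder and selecting the first field is purely illustrative; cf.read and Field.array are the standard cf-python entry points):

>>> import time
>>> import cf
>>> f = cf.read("16GB_file.pp")[0]  # placeholder path; pick the field shown above
>>> start = time.perf_counter()
>>> a = f.array                     # realise the full data as a numpy array
>>> print(f"f.array took {time.perf_counter() - start:.0f} seconds")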
Investigations showed that the reason for this was that the whole PP file was being parsed (i.e. all headers read and processed) for every 2-d PP field that contributes to the array, i.e. 2040 times in this case.
Stopping this parsing reduces the time taken to get the full array, on the same machine, to ~2 seconds (!). The entire 16 GB can be read from disk in ~3.5 minutes.
The size of the file per se is not the cause of the problem; rather it is the large number of individual lookup headers in the file: 162,888 in this case. For my small test cases with fewer than 5 PP fields, the slow down is invisible :(
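One way to avoid the repeated work (not necessarily how the forthcoming PR does it) is to cache the parsed lookup headers per file, so that the expensive full-file parse happens once rather than once per 2-d field. A minimal sketch, using hypothetical helper names rather than cf's real internals:

from functools import lru_cache

@lru_cache(maxsize=None)
def parse_all_lookups(filename):
    # Hypothetical: read and process every lookup header in the PP file.
    # With the cache, this runs once per file, not once per 2-d field.
    lookups = []
    # ... open filename, read each lookup header, record the byte offset
    # of its data record ...
    return tuple(lookups)

def read_2d_field(filename, index):
    # Hypothetical: read a single 2-d PP field using the cached headers.
    lookups = parse_all_lookups(filename)  # cached after the first call
    # ... seek straight to lookups[index] and read just that 2-d array ...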
Long overdue PR to follow.