Improve the performance of reading and accessing the data of PP and UM fields files #746

@davidhassell

Description

Accessing PP and UM fields file data has sometimes been very slow, and has been for quite a while. I had previously assumed that this was a cf.aggregation issue, which it sometimes very much was! ... but I think aggregation now performs pretty well.

@theabro kindly raised a case of reading a CF field from a 16 GB PP file. The CF field itself comprised 2040 (= 24 x 85) 2-d PP fields:

>>> print(f)
<CF Field: id%UM_m01s50i500_vn1300(time(24), atmosphere_hybrid_height_coordinate(85), latitude(144), longitude(192))>

Accessing the full data array with a = f.array is taking ~11,000 seconds - far too long!

Investigation showed that the cause was that the whole PP file was being parsed (i.e. all headers read and processed) for every 2-d PP field that contributes to the array, i.e. 2040 times in this case.

Stopping this re-parsing reduces the time taken to get the full array, on the same machine, to ~2 seconds (!). The entire 16 GB can be read from disk in ~3.5 minutes.

The size of the file per se is not the cause of the problem, but rather the large number of individual lookup headers in the file: 162,888 in this case. For my small test cases with fewer than 5 PP fields, the slow-down is invisible :(
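To illustrate the idea behind the fix, here is a minimal sketch of caching the parsed lookup headers per file, so that the expensive full-file scan happens once rather than once per 2-d field. All names here (`parse_lookup_headers`, `read_2d_field`, `model.pp`) are hypothetical stand-ins, not the actual cf-python implementation:

```python
from functools import lru_cache

parse_count = 0  # counts how many times the expensive scan runs


@lru_cache(maxsize=None)
def parse_lookup_headers(filename):
    """Stand-in for the expensive scan of every lookup header in a PP file."""
    global parse_count
    parse_count += 1
    # A real implementation would read and decode all lookup headers here.
    return {"filename": filename, "n_headers": 162888}


def read_2d_field(filename, index):
    """Read one 2-d PP field; the header parse is cached after the first call."""
    headers = parse_lookup_headers(filename)
    return (headers["filename"], index)


# 2040 field reads, but only one pass over the file's headers.
fields = [read_2d_field("model.pp", i) for i in range(2040)]
```

With the cache in place, `parse_count` stays at 1 for all 2040 reads; without it, the file would be re-parsed 2040 times.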

Long overdue PR to follow.

Metadata


Labels

bug (Something isn't working), performance (Relating to speed and memory performance)
