Speed improvements #96
This commit avoids a lot of allocations per protobuf message by no longer constructing the BTreeMap in the serde-value structs. Some tests had to be modified, since the keys of the resulting objects are no longer ordered by name but by the fdset field order.
Coincidentally, this also fixes #95, since the unbounded queue has been removed.
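To illustrate why the key order changes, here is a minimal standard-library sketch. It is not the code from this PR: `via_btreemap` and `streamed` are hypothetical names, and plain `(name, value)` pairs stand in for decoded protobuf fields. Collecting into a BTreeMap allocates per message and sorts keys by name, while writing each field as it is visited keeps the descriptor order.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Sketch of the old approach: collect every field into a BTreeMap first.
/// This allocates owned keys and map nodes per message and sorts keys by name.
fn via_btreemap(fields: &[(&str, i64)]) -> String {
    let map: BTreeMap<String, i64> = fields
        .iter()
        .map(|(k, v)| (k.to_string(), *v))
        .collect();
    let body: Vec<String> = map
        .iter()
        .map(|(k, v)| format!("\"{}\":{}", k, v))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Sketch of the new approach: write each field to the output as it is
/// visited, so no intermediate map is built and field order is preserved.
fn streamed(fields: &[(&str, i64)], out: &mut String) {
    out.push('{');
    for (i, (k, v)) in fields.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        // Writing into a String cannot fail, so unwrap is safe here.
        write!(out, "\"{}\":{}", k, v).unwrap();
    }
    out.push('}');
}

fn main() {
    // Hypothetical fields, listed in descriptor (declaration) order.
    let fields: [(&str, i64); 3] = [("name", 1), ("id", 2), ("cargo", 3)];

    println!("{}", via_btreemap(&fields));

    let mut out = String::new();
    streamed(&fields, &mut out);
    println!("{}", out);
}
```

The first line printed has its keys sorted by name, the second keeps them in declaration order, which is the kind of difference the modified tests account for.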
This is excellent. I'll need some time to refamiliarize myself with this code, fetch your branch, compile it, run the tests, etc. But thanks in advance for the contribution.
It might also be possible to save on the allocation per message by avoiding the Iterator interface and instead providing a function that lets the caller reuse the message buffer. I am not sure about the performance improvements there, though. EDIT: reusing the message buffer seems to account for ~5s in my use case (a further improvement of about 6%). Since this would require some changes to the interface of the stream-delimit subcrate, I'd open a new pull request for discussion if you are open to the idea.
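A rough sketch of what such a buffer-reusing reader could look like, assuming the `i32be` stream format means a 4-byte big-endian length prefix before each message. The function name `read_into` and its signature are hypothetical and are not the stream-delimit API.

```rust
use std::io::{self, Read};

/// Reads one length-prefixed message into `buf`, reusing its allocation.
/// Returns Ok(false) when the stream ends before a new length prefix.
/// Simplification: a truncated prefix is also treated as end of stream.
fn read_into<R: Read>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<bool> {
    let mut prefix = [0u8; 4];
    match reader.read_exact(&mut prefix) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(false),
        Err(e) => return Err(e),
    }

    // Assumed framing: signed 32-bit big-endian length; reject negatives.
    let len = usize::try_from(i32::from_be_bytes(prefix))
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "negative length"))?;

    // clear + resize keeps the existing capacity, so after the first few
    // messages no further allocation is usually needed.
    buf.clear();
    buf.resize(len, 0);
    reader.read_exact(buf)?;
    Ok(true)
}

fn main() -> io::Result<()> {
    // Two framed messages: "hi" and "abc".
    let data = [0u8, 0, 0, 2, b'h', b'i', 0, 0, 0, 3, b'a', b'b', b'c'];
    let mut input: &[u8] = &data;

    // One buffer is shared across all messages.
    let mut buf = Vec::new();
    while read_into(&mut input, &mut buf)? {
        println!("{} bytes", buf.len());
    }
    Ok(())
}
```

An Iterator that yields `Vec<u8>` has to hand out a fresh vector per item, whereas here the caller owns the buffer; the trade-off is that each message must be fully processed before the next read overwrites it.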
I'm open. |
I have time to clone and test this today.
Can you compare your code against this previous (merged) PR: #81? That PR added the call to "spawn" in the first place, and now I see you're removing that part.
I am not sure what you mean by compare, as the first measurement (and flamegraph) was already done on master, which includes the threaded code. Before #81 was merged, each message was also deserialized to a temporary serde_value::Value object, suffering from the same allocations and double traversal it now does. Regarding the removal of the threads, it's still faster because: In theory it might be possible to parallelize on a message level (transcoding multiple messages simultaneously).
Just wanted to make sure we're not walking back any desired behavior. Anyway it looks OK on my end, performance improved.
Thanks. I could either cut a release or wait for the next idea you had.
This pull request aims to improve processing speed by avoiding the construction of serde-value objects; these are implemented as B-trees, which leads to many allocations for every proto message.
For my test case (converting 200k fairly heavy protobuf messages) I was able to observe the following speed improvements:
Before:
$ time ./pq --msgtype simmo.TranshipmentCandidate --fdsetdir ./fdset --stream i32be < candidates.proto.ld > /dev/null
real 2m20.320s
user 3m7.052s
sys 0m20.936s
After:
$ time ./pq --msgtype simmo.TranshipmentCandidate --fdsetdir ./fdset --stream i32be < candidates.proto.ld > /dev/null
real 1m19.370s
user 1m14.457s
sys 0m4.568s
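To put the two wall-clock figures side by side, the small sketch below converts the `real` values reported by `time` into seconds and derives the ratio. `parse_time` is a hypothetical helper that only handles the `XmY.ZZZs` form shown above.

```rust
/// Parses a `time`-style duration such as "2m20.320s" into seconds.
/// Only the minutes-and-seconds form used above is supported.
fn parse_time(s: &str) -> Option<f64> {
    let (minutes, rest) = s.split_once('m')?;
    let seconds = rest.strip_suffix('s')?;
    Some(minutes.parse::<f64>().ok()? * 60.0 + seconds.parse::<f64>().ok()?)
}

fn main() {
    // The `real` values from the two runs above.
    let before = parse_time("2m20.320s").expect("valid duration");
    let after = parse_time("1m19.370s").expect("valid duration");

    println!("before: {:.2}s, after: {:.2}s", before, after);
    println!("speedup: {:.2}x", before / after);
    println!("wall-clock reduction: {:.1}%", (1.0 - after / before) * 100.0);
}
```

For these numbers the run drops from 140.32s to 79.37s, which is roughly a 1.77x speedup, or about 43% less wall-clock time.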
Here you can see flamegraphs clearly showing the allocations taking a lot of CPU time:
[flamegraph image: before]
And after:
[flamegraph image: after]