Skip to content

Use module state and heap types for C Extension#1125

Merged
asvetlov merged 19 commits intomasterfrom
state
Apr 4, 2025
Merged

Use module state and heap types for C Extension#1125
asvetlov merged 19 commits intomasterfrom
state

Conversation

@asvetlov
Copy link
Member

@asvetlov asvetlov commented Apr 3, 2025

  1. Module state is crucial for running multiple module instances in sub-interpreters. It is the recommended way to write C Extensions, and CPython built-in modules were rewritten in this way.
  2. It is also recommended that heap types be used over static types for C Extensions. Let's do it.

The PR eliminates global variables except for constant module definitions as a side effect.

@codspeed-hq
Copy link

codspeed-hq bot commented Apr 3, 2025

CodSpeed Performance Report

Merging #1125 will not alter performance

Comparing state (4addc79) with master (018643a)

Summary

✅ 244 untouched benchmarks

@asvetlov
Copy link
Member Author

asvetlov commented Apr 3, 2025

Okay, module state finding is not for free.
Let's see what can we do with it.

@asvetlov
Copy link
Member Author

asvetlov commented Apr 4, 2025

performance is good again after adding a pointer to the module state to pair_list_t

@bdraco
Copy link
Member

bdraco commented Apr 4, 2025

multidict_tp_init looks a bit slower. 6.3.2 showed up being a bit slower in yarl in aio-libs/yarl#1492 (comment). Not sure if anything can be done there or it is what it is and we should ack and merge the bump in yarl.

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Apr 4, 2025
@asvetlov
Copy link
Member Author

asvetlov commented Apr 4, 2025

multidict_tp_init looks a bit slower. 6.3.2 showed up being a bit slower in yarl in aio-libs/yarl#1492 (comment). Not sure if anything can be done there or it is what it is and we should ack and merge the bump in yarl.

Yeah, tp_init has no direct access to the module state and should do a lookup over the class's hierarchy.
Hopefully, if a user doesn't derive from MultiDict / CIMultiDict, the lookup ends on the topmost class.
But it takes some small time, still.
It is expected, and there is no space to do anything here.

@asvetlov asvetlov marked this pull request as ready for review April 4, 2025 08:46
@asvetlov asvetlov requested a review from webknjaz as a code owner April 4, 2025 08:46
@bdraco
Copy link
Member

bdraco commented Apr 4, 2025

Testing

  • Production 1
  • Production 2
  • aiohttp master
  • aiohttp 3.11

@asvetlov
Copy link
Member Author

asvetlov commented Apr 4, 2025

@bdraco for yarl/aiohttp needs, I could imagine exposing public C API for use from Cython.
It is possible, but it should be done carefully. Something like datetime C API is exposed.

Another thing that definitely could improve the performance is switching from pair_list_t to hashtable. It should be pretty close to the structure used for cpython dict itself with the following changes:

  1. Keys are always a PyUnicode_Type identity objects.
  2. No shared keys.
  3. Multiple equal keys could be supported by the hashtable. To achieve it, DKIX_DUMMY should never migrate back to ACTIVE but the table should be rebalanced if there are too many deleted items.

I'd like to experiment with hashtable, O(N) vs O(1) for lookup operations looks interesting.

@bdraco
Copy link
Member

bdraco commented Apr 4, 2025

@bdraco for yarl/aiohttp needs, I could imagine exposing public C API for use from Cython. It is possible, but it should be done carefully. Something like datetime C API is exposed.

These areas are warm and I see them in production profiles. I'm not sure if they could benefit from a C API

https://github.com/aio-libs/aiohttp/blob/eff1dd754c526aae967fe80075aca1df4c28666b/aiohttp/payload.py#L167
https://github.com/aio-libs/aiohttp/blob/eff1dd754c526aae967fe80075aca1df4c28666b/aiohttp/client_reqrep.py#L390
https://github.com/aio-libs/aiohttp/blob/eff1dd754c526aae967fe80075aca1df4c28666b/aiohttp/client_reqrep.py#L415

Might be enough to implement the above patterns as multidict functions but with the flexibility of LooseHeaders it may not result in much real world improvement.

Another thing that definitely could improve the performance is switching from pair_list_t to hashtable. It should be pretty close to the structure used for cpython dict itself with the following changes:

  1. Keys are always a PyUnicode_Type identity objects.
  2. No shared keys.
  3. Multiple equal keys could be supported by the hashtable. To achieve it, DKIX_DUMMY should never migrate back to ACTIVE but the table should be rebalanced if there are too many deleted items.

I'd like to experiment with hashtable, O(N) vs O(1) for lookup operations looks interesting.

O(N) sounds really nice. I'm guessing this probably makes more difference than a C API

@asvetlov
Copy link
Member Author

asvetlov commented Apr 4, 2025

Please keep the PR open for a while, I'd like to apply minor improvements.

Copy link
Member

@bdraco bdraco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked over the code. LGTM. I don't have my head fully wrapped around the Py_TPFLAGS_MANAGED_WEAKREF but it looks good AFAICT.

Testing went well. No regressions or obvious performance issues observed

@bdraco
Copy link
Member

bdraco commented Apr 4, 2025

Near midnight for me (UTC-10) so off to bed. Let me know if anything should be retested on prod and I'm happy to give it a spin tomorrow when I get up

@asvetlov
Copy link
Member Author

asvetlov commented Apr 4, 2025

auto-headers processing needs #292 implemented for good performance.
Creation CIMultiDict from C could give use a few ms but I don't it worth the effort.
When I'm speaking about public C API I'm mostly thinking about .get()/.getall(), .add() operations, and a fast iteration over keys/values.
All these are micro-optimizations, sure. Who knows the speedup on macro-benchmarks? I guess it is not too high, except for request parsing and response building.

@asvetlov
Copy link
Member Author

asvetlov commented Apr 4, 2025

@bdraco Py_TPFLAGS_MANAGED_WEAKREF is py3.12 feature for making multidicts weak-referenceable.
tp_weaklistoffset doesn't work with heap types (and with multiple inheritance, this is another beast).
For py <3.12 I use a fallback with __weaklistoffset__ member. It works everywhere but Py_TPFLAGS_MANAGED_WEAKREF should be used instead if possible.

@asvetlov asvetlov merged commit 17f08a1 into master Apr 4, 2025
46 of 47 checks passed
@asvetlov asvetlov deleted the state branch April 4, 2025 10:25
@bdraco
Copy link
Member

bdraco commented Apr 4, 2025

auto-headers processing needs #292 implemented for good performance. Creation CIMultiDict from C could give use a few ms but I don't it worth the effort. When I'm speaking about public C API I'm mostly thinking about .get()/.getall(), .add() operations, and a fast iteration over keys/values. All these are micro-optimizations, sure. Who knows the speedup on macro-benchmarks? I guess it is not too high, except for request parsing and response building.

Yeah. I explored it a bit in aio-libs/aiohttp#10696 and it doesn't seem like there is much to gain there. Maybe some potential saving on iterating but even than usually its two or three keys being iterated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bot:chronographer:provided There is a change note present in this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants