bpo-47053: Reduce deoptimization in BINARY_OP_INPLACE_ADD_UNICODE #31318
Conversation
A benchmark:

```python
from pyperf import Runner, perf_counter
import sys

LENGTHS = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9]
DATA = [('a',) * x for x in LENGTHS]

def bench(loops):
    data = DATA * loops
    t0 = perf_counter()
    for s in data:
        res = ''
        for c in s:
            res += c
    return perf_counter() - t0

runner = Runner()
runner.bench_time_func("inplace add str", bench)
```
Looks good for the microbenchmark. Do you have numbers for the full suite? Are these the stats for this PR?
On my not-that-stable laptop with GCC on WSL, using --enable-optimizations --with-lto, I get a 1.02x faster geometric mean: https://gist.github.com/sweeneyde/7fb779d28c55ba4b5e8d40f0bf8f596f
Yes, that's correct.
The previous microbenchmark was with MSVC; I got some different results with related benchmarks on GCC:

```python
from pyperf import Runner, perf_counter
import sys

LENGTHS = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9]
DATA = [('a',) * x for x in LENGTHS]

def bench1(loops):
    data = DATA * loops
    t0 = perf_counter()
    for s in data:
        res = ''
        for c in s:
            res += c
    return perf_counter() - t0

def bench2(loops):
    data = []
    n = 10_000
    for i in range(loops):
        data.append(("a" * n, "a"))
        data.append(("", "a"))
    t0 = perf_counter()
    for s, c in data:
        s += c
        s += c
        s += c
    return perf_counter() - t0

runner = Runner()
runner.bench_time_func("bench1", bench1)
runner.bench_time_func("bench2", bench2)
```

It could just be a random result of how PGO decides to shuffle things around?
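A hedged sketch of what the bench2 shape stresses (my reading, not from the PR itself): each target string is also referenced from a tuple stored in `data`, so more than just the loop variable points at it and CPython cannot reuse the buffer when appending.

```python
import sys

# Each string in `data` is held by a tuple as well as the loop variable,
# so its reference count is higher than a bare local's would be.
data = [("a" * 100, "a")]
for s, c in data:
    # References: the tuple in `data`, the local `s`, and getrefcount's
    # own argument -- strictly more than a local-only string would have.
    assert sys.getrefcount(s) > 2
    before = id(s)
    s += c              # the old buffer is still referenced by the tuple,
                        # so a new string object must be allocated
    assert id(s) != before
```

This is the situation where an in-place append is impossible regardless of which guard the specialization uses.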
~0% misses is a very clear improvement on 96.6%. I'm not worried that the benchmark numbers are a bit noisy, given this is a clear improvement.
Hopefully "left-hand side is the same as the assignment target" is more stable and less miss-prone than `Py_REFCNT == 2`. Note that PyUnicode_Append already has lots of overhead, and it checks whether it's safe to work in place.
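For contrast, a hedged illustration of the favorable case (a CPython implementation detail, not a language guarantee): in the `res += c` loop pattern, only the local variable references the string, so the interpreter may grow the buffer in place rather than copying the whole string on every iteration.

```python
# When `res` is the only reference to the string, CPython's in-place
# append fast path can apply; the loop stays correct either way.
res = ""
for c in "a" * 1000:
    res += c            # eligible for in-place growth: refcount is minimal
assert res == "a" * 1000
```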
See faster-cpython/ideas#269
https://bugs.python.org/issue47053