bpo-47053: Reduce deoptimization in BINARY_OP_INPLACE_ADD_UNICODE #31318
Conversation
A benchmark:

```python
from pyperf import Runner, perf_counter
import sys

LENGTHS = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9]
DATA = [('a',) * x for x in LENGTHS]

def bench(loops):
    data = DATA * loops
    t0 = perf_counter()
    for s in data:
        res = ''
        for c in s:
            res += c
    return perf_counter() - t0

runner = Runner()
runner.bench_time_func("inplace add str", bench)
```
Looks good for the microbenchmark. Do you have numbers for the full suite? Are these the stats for this PR?
On my not-that-stable laptop with GCC on WSL, using --enable-optimizations --with-lto, I get a 1.02x faster geometric mean: https://gist.github.com/sweeneyde/7fb779d28c55ba4b5e8d40f0bf8f596f
Yes, that's correct.
The previous microbenchmark was with MSVC; I got some different results with related benchmarks on GCC:

```python
from pyperf import Runner, perf_counter
import sys

LENGTHS = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9]
DATA = [('a',) * x for x in LENGTHS]

def bench1(loops):
    data = DATA * loops
    t0 = perf_counter()
    for s in data:
        res = ''
        for c in s:
            res += c
    return perf_counter() - t0

def bench2(loops):
    data = []
    n = 10_000
    for i in range(loops):
        data.append(("a" * n, "a"))
        data.append(("", "a"))
    t0 = perf_counter()
    for s, c in data:
        s += c
        s += c
        s += c
    return perf_counter() - t0

runner = Runner()
runner.bench_time_func("bench1", bench1)
runner.bench_time_func("bench2", bench2)
```

It could just be a random result of how PGO decides to shuffle things around?
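A hedged sketch of what the bench2 shape stresses (my reading, not from the PR itself): each target string is also referenced from a tuple stored in `data`, so more than just the loop variable points at it and CPython cannot reuse the buffer when appending.

```python
import sys

# Each string in `data` is held by a tuple as well as the loop variable,
# so its reference count is higher than a bare local's would be.
data = [("a" * 100, "a")]
for s, c in data:
    # References: the tuple in `data`, the local `s`, and getrefcount's
    # own argument -- strictly more than a local-only string would have.
    assert sys.getrefcount(s) > 2
    before = id(s)
    s += c              # the old buffer is still referenced by the tuple,
                        # so a new string object must be allocated
    assert id(s) != before
```

This is the situation where an in-place append is impossible regardless of which guard the specialization uses.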
~0% misses is a very clear improvement on 96.6%. I'm not worried that the benchmark numbers are a bit noisy, given this is a clear improvement.
Hopefully "left-hand side is the same as the assignment target" is more stable and less miss-prone than `Py_REFCNT == 2`. Note that PyUnicode_Append already has lots of overhead, and it checks whether it's safe to work in place.
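For contrast, a hedged illustration of the favorable case (a CPython implementation detail, not a language guarantee): in the `res += c` loop pattern, only the local variable references the string, so the interpreter may grow the buffer in place rather than copying the whole string on every iteration.

```python
# When `res` is the only reference to the string, CPython's in-place
# append fast path can apply; the loop stays correct either way.
res = ""
for c in "a" * 1000:
    res += c            # eligible for in-place growth: refcount is minimal
assert res == "a" * 1000
```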
See faster-cpython/ideas#269
https://bugs.python.org/issue47053