Optimize Vec when cloning from a slice#15471
Conversation
llvm is currently not able to conver `Vec::extend` into a memcpy for `Copy` types, which results in methods like `Vec::push_all` to run twice as slow as it should be running. This patch takes the unsafe `Vec::clone` optimization to speed up all the operations that are cloning a slice into a `Vec`. before: test vec::tests::bench_clone_from_0000_0000 ... bench: 12 ns/iter (+/- 2) test vec::tests::bench_clone_from_0000_0010 ... bench: 125 ns/iter (+/- 4) = 80 MB/s test vec::tests::bench_clone_from_0000_0100 ... bench: 360 ns/iter (+/- 33) = 277 MB/s test vec::tests::bench_clone_from_0000_1000 ... bench: 2601 ns/iter (+/- 175) = 384 MB/s test vec::tests::bench_clone_from_0010_0000 ... bench: 12 ns/iter (+/- 2) test vec::tests::bench_clone_from_0010_0010 ... bench: 125 ns/iter (+/- 10) = 80 MB/s test vec::tests::bench_clone_from_0010_0100 ... bench: 361 ns/iter (+/- 28) = 277 MB/s test vec::tests::bench_clone_from_0100_0010 ... bench: 131 ns/iter (+/- 13) = 76 MB/s test vec::tests::bench_clone_from_0100_0100 ... bench: 360 ns/iter (+/- 9) = 277 MB/s test vec::tests::bench_clone_from_0100_1000 ... bench: 2575 ns/iter (+/- 168) = 388 MB/s test vec::tests::bench_clone_from_1000_0100 ... bench: 356 ns/iter (+/- 20) = 280 MB/s test vec::tests::bench_clone_from_1000_1000 ... bench: 2605 ns/iter (+/- 167) = 383 MB/s test vec::tests::bench_from_slice_0000 ... bench: 11 ns/iter (+/- 0) test vec::tests::bench_from_slice_0010 ... bench: 115 ns/iter (+/- 5) = 86 MB/s test vec::tests::bench_from_slice_0100 ... bench: 309 ns/iter (+/- 170) = 323 MB/s test vec::tests::bench_from_slice_1000 ... bench: 2065 ns/iter (+/- 198) = 484 MB/s test vec::tests::bench_push_all_0000_0000 ... bench: 7 ns/iter (+/- 0) test vec::tests::bench_push_all_0000_0010 ... bench: 79 ns/iter (+/- 7) = 126 MB/s test vec::tests::bench_push_all_0000_0100 ... bench: 342 ns/iter (+/- 18) = 292 MB/s test vec::tests::bench_push_all_0000_1000 ... bench: 2873 ns/iter (+/- 75) = 348 MB/s test vec::tests::bench_push_all_0010_0010 ... bench: 154 ns/iter (+/- 8) = 64 MB/s test vec::tests::bench_push_all_0100_0100 ... bench: 518 ns/iter (+/- 18) = 193 MB/s test vec::tests::bench_push_all_1000_1000 ... bench: 4490 ns/iter (+/- 223) = 222 MB/s after: test vec::tests::bench_clone_from_0000_0000 ... bench: 12 ns/iter (+/- 1) test vec::tests::bench_clone_from_0000_0010 ... bench: 123 ns/iter (+/- 5) = 81 MB/s test vec::tests::bench_clone_from_0000_0100 ... bench: 367 ns/iter (+/- 23) = 272 MB/s test vec::tests::bench_clone_from_0000_1000 ... bench: 2618 ns/iter (+/- 252) = 381 MB/s test vec::tests::bench_clone_from_0010_0000 ... bench: 12 ns/iter (+/- 1) test vec::tests::bench_clone_from_0010_0010 ... bench: 124 ns/iter (+/- 7) = 80 MB/s test vec::tests::bench_clone_from_0010_0100 ... bench: 369 ns/iter (+/- 34) = 271 MB/s test vec::tests::bench_clone_from_0100_0010 ... bench: 123 ns/iter (+/- 6) = 81 MB/s test vec::tests::bench_clone_from_0100_0100 ... bench: 371 ns/iter (+/- 25) = 269 MB/s test vec::tests::bench_clone_from_0100_1000 ... bench: 2713 ns/iter (+/- 532) = 368 MB/s test vec::tests::bench_clone_from_1000_0100 ... bench: 369 ns/iter (+/- 14) = 271 MB/s test vec::tests::bench_clone_from_1000_1000 ... bench: 2611 ns/iter (+/- 194) = 382 MB/s test vec::tests::bench_from_slice_0000 ... bench: 7 ns/iter (+/- 0) test vec::tests::bench_from_slice_0010 ... bench: 108 ns/iter (+/- 4) = 92 MB/s test vec::tests::bench_from_slice_0100 ... bench: 235 ns/iter (+/- 24) = 425 MB/s test vec::tests::bench_from_slice_1000 ... bench: 1318 ns/iter (+/- 96) = 758 MB/s test vec::tests::bench_push_all_0000_0000 ... bench: 7 ns/iter (+/- 0) test vec::tests::bench_push_all_0000_0010 ... bench: 70 ns/iter (+/- 4) = 142 MB/s test vec::tests::bench_push_all_0000_0100 ... bench: 176 ns/iter (+/- 16) = 568 MB/s test vec::tests::bench_push_all_0000_1000 ... bench: 1125 ns/iter (+/- 94) = 888 MB/s test vec::tests::bench_push_all_0010_0010 ... bench: 159 ns/iter (+/- 15) = 62 MB/s test vec::tests::bench_push_all_0100_0100 ... bench: 363 ns/iter (+/- 12) = 275 MB/s test vec::tests::bench_push_all_1000_1000 ... bench: 2860 ns/iter (+/- 415) = 349 MB/s
src/libcollections/vec.rs
Outdated
There was a problem hiding this comment.
Instead of taking src: &[T], does this optimize as well with src: Iterator<T>?
There was a problem hiding this comment.
I just prototyped this out, and while for most functions it performed the same, for some reason this version of Vec::push_all is half way between the original implementation, and my initial &[T] implementation. Sometimes I just don't understand microbenchmarks. I'll investigate it more tomorrow.
test vec::tests::bench_clone_from_0000_0000 ... bench: 9 ns/iter (+/- 0)
test vec::tests::bench_clone_from_0000_0010 ... bench: 128 ns/iter (+/- 12) = 78 MB/s
test vec::tests::bench_clone_from_0000_0100 ... bench: 379 ns/iter (+/- 14) = 263 MB/s
test vec::tests::bench_clone_from_0000_1000 ... bench: 2624 ns/iter (+/- 187) = 381 MB/s
test vec::tests::bench_clone_from_0010_0000 ... bench: 9 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0010_0010 ... bench: 127 ns/iter (+/- 9) = 78 MB/s
test vec::tests::bench_clone_from_0010_0100 ... bench: 386 ns/iter (+/- 26) = 259 MB/s
test vec::tests::bench_clone_from_0100_0010 ... bench: 129 ns/iter (+/- 6) = 77 MB/s
test vec::tests::bench_clone_from_0100_0100 ... bench: 374 ns/iter (+/- 19) = 267 MB/s
test vec::tests::bench_clone_from_0100_1000 ... bench: 2696 ns/iter (+/- 164) = 370 MB/s
test vec::tests::bench_clone_from_1000_0100 ... bench: 384 ns/iter (+/- 27) = 260 MB/s
test vec::tests::bench_clone_from_1000_1000 ... bench: 2638 ns/iter (+/- 75) = 379 MB/s
test vec::tests::bench_extend_0000_0000 ... bench: 9 ns/iter (+/- 0)
test vec::tests::bench_extend_0000_0010 ... bench: 129 ns/iter (+/- 6) = 77 MB/s
test vec::tests::bench_extend_0000_0100 ... bench: 418 ns/iter (+/- 23) = 239 MB/s
test vec::tests::bench_extend_0000_1000 ... bench: 3108 ns/iter (+/- 296) = 321 MB/s
test vec::tests::bench_extend_0010_0010 ... bench: 209 ns/iter (+/- 9) = 47 MB/s
test vec::tests::bench_extend_0100_0100 ... bench: 616 ns/iter (+/- 69) = 162 MB/s
test vec::tests::bench_extend_1000_1000 ... bench: 4891 ns/iter (+/- 243) = 204 MB/s
test vec::tests::bench_from_slice_0000 ... bench: 4 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010 ... bench: 109 ns/iter (+/- 9) = 91 MB/s
test vec::tests::bench_from_slice_0100 ... bench: 233 ns/iter (+/- 11) = 429 MB/s
test vec::tests::bench_from_slice_1000 ... bench: 1331 ns/iter (+/- 53) = 751 MB/s
test vec::tests::bench_push_all_0000_0000 ... bench: 4 ns/iter (+/- 1)
test vec::tests::bench_push_all_0000_0010 ... bench: 64 ns/iter (+/- 4) = 156 MB/s
test vec::tests::bench_push_all_0000_0100 ... bench: 230 ns/iter (+/- 21) = 434 MB/s
test vec::tests::bench_push_all_0000_1000 ... bench: 1777 ns/iter (+/- 248) = 562 MB/s
test vec::tests::bench_push_all_0010_0010 ... bench: 140 ns/iter (+/- 3) = 71 MB/s
test vec::tests::bench_push_all_0100_0100 ... bench: 407 ns/iter (+/- 18) = 245 MB/s
test vec::tests::bench_push_all_1000_1000 ... bench: 3773 ns/iter (+/- 661) = 265 MB/s
There was a problem hiding this comment.
So I figured out an internal loop that lets llvm optimize an interator down into my fast version, but unfortunately it's not drop safe:
fn do_clone_hack_external_iter2<'a, Iter: Iterator<&'a u8>>(dst: &mut Vec<u8>, mut src: Iter) {
let mut dst_len = dst.len();
for x in src {
let x = x.clone();
unsafe {
ptr::write(
dst.as_mut_slice().unsafe_mut_ref(dst_len),
x);
dst_len += 1;
}
}
unsafe {
dst.set_len(dst_len);
}
}
Setting the vec length from inside that hot loop prevents this optimization from happening. I've put my experiments in this gist if anyone else wants to play with it: https://gist.github.com/erickt/7a3fd8689159e69ed74c
There was a problem hiding this comment.
That is unfortunate! It's a shame that this optimization can't apply to patterns such as extend or collect, but I suppose we'll just have to wait to teach LLVM a little more about aliasing!
This changes Vec::from_slice to call unsafe_push_all_clone directly to avoid doing an unnecessary reserve_additional call
llvm is currently not able to conver `Vec::extend` into a memcpy for `Copy` types, which results in methods like `Vec::push_all` to run twice as slow as it should be running. This patch takes the unsafe `Vec::clone` optimization to speed up all the operations that are cloning a slice into a `Vec`. before: ``` test vec::tests::bench_clone_from_0000_0000 ... bench: 12 ns/iter (+/- 2) test vec::tests::bench_clone_from_0000_0010 ... bench: 125 ns/iter (+/- 4) = 80 MB/s test vec::tests::bench_clone_from_0000_0100 ... bench: 360 ns/iter (+/- 33) = 277 MB/s test vec::tests::bench_clone_from_0000_1000 ... bench: 2601 ns/iter (+/- 175) = 384 MB/s test vec::tests::bench_clone_from_0010_0000 ... bench: 12 ns/iter (+/- 2) test vec::tests::bench_clone_from_0010_0010 ... bench: 125 ns/iter (+/- 10) = 80 MB/s test vec::tests::bench_clone_from_0010_0100 ... bench: 361 ns/iter (+/- 28) = 277 MB/s test vec::tests::bench_clone_from_0100_0010 ... bench: 131 ns/iter (+/- 13) = 76 MB/s test vec::tests::bench_clone_from_0100_0100 ... bench: 360 ns/iter (+/- 9) = 277 MB/s test vec::tests::bench_clone_from_0100_1000 ... bench: 2575 ns/iter (+/- 168) = 388 MB/s test vec::tests::bench_clone_from_1000_0100 ... bench: 356 ns/iter (+/- 20) = 280 MB/s test vec::tests::bench_clone_from_1000_1000 ... bench: 2605 ns/iter (+/- 167) = 383 MB/s test vec::tests::bench_from_slice_0000 ... bench: 11 ns/iter (+/- 0) test vec::tests::bench_from_slice_0010 ... bench: 115 ns/iter (+/- 5) = 86 MB/s test vec::tests::bench_from_slice_0100 ... bench: 309 ns/iter (+/- 170) = 323 MB/s test vec::tests::bench_from_slice_1000 ... bench: 2065 ns/iter (+/- 198) = 484 MB/s test vec::tests::bench_push_all_0000_0000 ... bench: 7 ns/iter (+/- 0) test vec::tests::bench_push_all_0000_0010 ... bench: 79 ns/iter (+/- 7) = 126 MB/s test vec::tests::bench_push_all_0000_0100 ... bench: 342 ns/iter (+/- 18) = 292 MB/s test vec::tests::bench_push_all_0000_1000 ... bench: 2873 ns/iter (+/- 75) = 348 MB/s test vec::tests::bench_push_all_0010_0010 ... bench: 154 ns/iter (+/- 8) = 64 MB/s test vec::tests::bench_push_all_0100_0100 ... bench: 518 ns/iter (+/- 18) = 193 MB/s test vec::tests::bench_push_all_1000_1000 ... bench: 4490 ns/iter (+/- 223) = 222 MB/s ``` after: ``` test vec::tests::bench_clone_from_0000_0000 ... bench: 12 ns/iter (+/- 1) test vec::tests::bench_clone_from_0000_0010 ... bench: 123 ns/iter (+/- 5) = 81 MB/s test vec::tests::bench_clone_from_0000_0100 ... bench: 367 ns/iter (+/- 23) = 272 MB/s test vec::tests::bench_clone_from_0000_1000 ... bench: 2618 ns/iter (+/- 252) = 381 MB/s test vec::tests::bench_clone_from_0010_0000 ... bench: 12 ns/iter (+/- 1) test vec::tests::bench_clone_from_0010_0010 ... bench: 124 ns/iter (+/- 7) = 80 MB/s test vec::tests::bench_clone_from_0010_0100 ... bench: 369 ns/iter (+/- 34) = 271 MB/s test vec::tests::bench_clone_from_0100_0010 ... bench: 123 ns/iter (+/- 6) = 81 MB/s test vec::tests::bench_clone_from_0100_0100 ... bench: 371 ns/iter (+/- 25) = 269 MB/s test vec::tests::bench_clone_from_0100_1000 ... bench: 2713 ns/iter (+/- 532) = 368 MB/s test vec::tests::bench_clone_from_1000_0100 ... bench: 369 ns/iter (+/- 14) = 271 MB/s test vec::tests::bench_clone_from_1000_1000 ... bench: 2611 ns/iter (+/- 194) = 382 MB/s test vec::tests::bench_from_slice_0000 ... bench: 7 ns/iter (+/- 0) test vec::tests::bench_from_slice_0010 ... bench: 108 ns/iter (+/- 4) = 92 MB/s test vec::tests::bench_from_slice_0100 ... bench: 235 ns/iter (+/- 24) = 425 MB/s test vec::tests::bench_from_slice_1000 ... bench: 1318 ns/iter (+/- 96) = 758 MB/s test vec::tests::bench_push_all_0000_0000 ... bench: 7 ns/iter (+/- 0) test vec::tests::bench_push_all_0000_0010 ... bench: 70 ns/iter (+/- 4) = 142 MB/s test vec::tests::bench_push_all_0000_0100 ... bench: 176 ns/iter (+/- 16) = 568 MB/s test vec::tests::bench_push_all_0000_1000 ... bench: 1125 ns/iter (+/- 94) = 888 MB/s test vec::tests::bench_push_all_0010_0010 ... bench: 159 ns/iter (+/- 15) = 62 MB/s test vec::tests::bench_push_all_0100_0100 ... bench: 363 ns/iter (+/- 12) = 275 MB/s test vec::tests::bench_push_all_1000_1000 ... bench: 2860 ns/iter (+/- 415) = 349 MB/s ``` This also includes extra benchmarks for `Vec` and `MemWriter`.
Bumps [slab](https://github.com/tokio-rs/slab) from 0.4.10 to 0.4.11. changelog: none
llvm is currently not able to conver
Vec::extendinto a memcpy forCopytypes, which results in methods likeVec::push_allto run twice as slow as it should be running. This patch takes the unsafeVec::cloneoptimization to speed up all the operations that are cloning a slice into aVec.before:
after:
This also includes extra benchmarks for
VecandMemWriter.