-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Performance of Weak.blit #9258
Description
We noticed that sometimes Weak.blit takes time proportional to the total size of the arrays involved, instead of being proportional to the size of the chunk being copied. This is particularly noticeable when the chunks are small.
In particular, we have a use case where contents is moved from one weak array to another, skipping over the references that were emptied, which involved many blits of length 1.
Note that blit is the appropriate operation in this case rather than get followed by set because we don't want to make the elements live (therefore making them potentially survive for a whole extra GC cycle).
Note also that weak hashtable (Weak.Make) is probably also affected because it uses blits of size 1 in its resize function to copy between buckets, although I haven't tested this empirically.
The code snippet at the end demonstrates the problem. (blit_elementwise is sometimes very fast and sometimes very slow).
The benchmark spends ~99.5% of its time in caml_ephe_clean function (defined in runtime/caml/weak.h). The function cleans the whole array regardless of how much of it is actually being accessed. It's called only when the gc phase is Phase_clean, which explains why the cost is not always there.
One proposed idea is to add two more parameters to caml_ephe_clean specifying what portion of the array to clean and to have caml_ephemeron_blit_key pass that information down.
The benchmark code:
let blit_elementwise arr1 arr2 =
let rec loop i =
if i < 0
then ()
else (
Weak.blit arr1 i arr2 i 1;
loop (i - 1))
in
loop (Weak.length arr1 - 1)
;;
let f () =
let arr1 = Weak.create 20000 in
let arr2 = Weak.create 20000 in
for i = 0 to 20000 - 1 do
Weak.set arr1 i (Some (Array.make 1 "x"))
done;
blit_elementwise arr1 arr2
;;
let rec loop i =
if i = 0
then ()
else (
f ();
Printf.printf ".%!";
loop (i - 1))
;;
let () = loop 100
let () = Printf.printf "\n%!"
$ time ./a.out
....................................................................................................
real 1m2.823s
user 1m2.677s
sys 0m0.012s