Currently, the device extension is unreasonably slow when copying any array to a device because it synchronizes after every buffer copy. For string/binary types, an additional copy is needed to find the length of the next buffer. There is no technical limitation preventing these copies from occurring in parallel; however, the initial PR did not implement this.
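To make the cost concrete, here is a toy model of the two strategies. It is not the device extension's API: `FakeStream`, `copy_sync_per_buffer`, and `copy_sync_once` are hypothetical names, and the "stream" is a plain Python queue standing in for a real device stream. It only counts synchronization points and does not model the extra copy needed for string/binary types.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class FakeStream:
    """Hypothetical stand-in for a device stream: copies are queued,
    and only take effect when synchronize() is called."""
    pending: list = field(default_factory=list)
    sync_count: int = 0

    def copy_async(self, dst: bytearray, src: bytes) -> None:
        # Enqueue the copy without waiting for it to complete.
        self.pending.append((dst, src))

    def synchronize(self) -> None:
        # Complete all queued copies and record one synchronization point.
        for dst, src in self.pending:
            dst[:] = src
        self.pending.clear()
        self.sync_count += 1


def copy_sync_per_buffer(stream: FakeStream, buffers: list) -> list:
    """Behaviour described above: wait after every buffer copy."""
    out = []
    for src in buffers:
        dst = bytearray(len(src))
        stream.copy_async(dst, src)
        stream.synchronize()
        out.append(dst)
    return out


def copy_sync_once(stream: FakeStream, buffers: list) -> list:
    """Possible improvement: enqueue every copy, then wait a single time."""
    out = [bytearray(len(src)) for src in buffers]
    for dst, src in zip(out, buffers):
        stream.copy_async(dst, src)
    stream.synchronize()
    return out


# Buffers of a two-element string array ["ab", "cde"]:
# validity bitmap, int32 offsets, and character data.
buffers = [b"\x03", struct.pack("<3i", 0, 2, 5), b"abcde"]

per_buffer_stream = FakeStream()
once_stream = FakeStream()
result_per_buffer = copy_sync_per_buffer(per_buffer_stream, buffers)
result_once = copy_sync_once(once_stream, buffers)
```

In this model both strategies produce identical copies, but the first synchronizes once per buffer while the second synchronizes once per array. On a real device the single wait would also let the copies overlap.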
References: