utf8lut: Vectorized UTF-8 converter

As it was already noted, this blog has moved to https://dirtyhandscoding.github.io. The new article utf8lut: Vectorized UTF-8 converter has been published there.

 

Some time ago I wrote a surprising answer to stackoverflow question “Fastest way to get IPv4 address from string”. At that very moment I discovered that _mm_shuffle_epi8 instruction combined with sufficiently large lookup-table can do wonders. Actually, this exact methodology was used to implement vectorized merging algorithm in one of my previous blog posts.

Inspired by the new discovery, I tried to apply this LUT + shuffle idea to some well-known and moderately generic problem. I tried to apply the idea to accelerate conversion from UTF-8 to UTF-16 and was successful. Initially, I had to postpone all efforts on the problem due to PhD-related activities. After that I thought that I would write a scientific article on the topic. When it became clear that I don’t want to lose time on anything like that anymore, I decided to write about it in a blog. As a result, almost three years after the initial work (update: it is four years already), I finally managed to write this report, describing the algorithm and the utf8lut library.

(…continue reading…)

Posted in High performance | Leave a comment

Blog moved to dirtyhandscoding.github.io

To everyone reading this.

The blog has been moved to GitHub pages.
The new address is https://dirtyhandscoding.github.io/.

Continue reading

Posted in No category | Leave a comment

Vectorizing small fixed-size sort

After a long break, I can finally return to the topic which was started in the previous blog post.

Imagine that we have a small array of compile-time constant size with integers. For instance, N = 32. And we want to sort it as fast as possible. What is the best solution for this problem?

The wisest of you would suggest simply using std::sort, because every modern implementation of it is well-optimized and contains special handling of small subarrays to accelerate the generic quicksort algorithm. The ones who don’t trust libraries would suggest using insertion sort: this is the algorithm usually used to handle small cases in std::sort. The performance geeks and regular stackoverflow visitors would definitely point to sorting networks: every question like “the fastest way to sort K integers” ends up with a solution based on them (N=6, N=10, what, why). I’m going to present a much less known way to sort small arrays of 32-bit keys, with performance comparable to sorting networks.

Continue reading

Posted in High performance | Tagged , , | Leave a comment

Addendum to Performance comparison: linear search vs binary search

The previous blog post got some attention and several good questions and suggestions. So I feel that I should sum up the main points noted by readers. Since the main article is already too long, I decided to keep all of this as a separate post.

Continue reading

Posted in High performance | Tagged , | Leave a comment

Performance comparison: linear search vs binary search.

While working on an implementation of merge sort promised in the previous article, I realized that I’d like to use one neat little thing, which is worth its own post. It is a simple strategy for sorting or doing comparison-based tasks, which works wonderfully when input data is small enough.

Suppose that we have a very small array and we want to sort it as fast as possible. Indeed, applying some fancy O(N log N) algorithm is not a good idea: although it has optimal asymptotic performance, its logic is too complicated to outperform simple bubble-sort-like algorithms which take O(N^2) time instead. That’s why every well-optimized sorting algorithm based on quicksort (e.g. std::sort) or mergesort includes some simple quadratic algorithm which is run for sufficiently small subarrays like N <= 32.

What exactly should we strive for to get an algorithm efficient for small N? Here is the list of things to look for:

  1. Avoid branches whenever possible: unpredictable ones are very slow.
  2. Reduce data dependency: this allows to fully utilize processing units in CPU pipeline.
  3. Prefer simple data access and manipulation patterns: this allows to vectorize the algorithm.
  4. Avoid complicated algorithms: they almost always fail on one of the previous points, and they sometimes do too much work for small inputs.

I decided to start investigating a simpler problem first, which is solved by std::lower_bound: given a sorted array of elements and a key, find index of the first array element greater or equal than the key. And this investigation soon developed into a full-length standalone article.

Continue reading

Posted in High performance | Tagged , , , | 11 Comments

Vectorizing std::merge with vpermd from AVX2 and lookup table

Recently I stumbled upon a question on stackoverflow, which asked how to vectorize computing symmetric difference of two sorted int32 arrays using AVX2. I decided to try doing it myself, and this post is about what I achieved. Of course, the best way to compute symmetric difference of sorted sets is by running ordinary merge algorithm (e.g. with std::merge) for the arrays plus some simple postprocessing. I’ll concentrate only on the generic merging algorithm here.

I’ll handle 32-bit integer keys only. Having keys of larger size would reduce significantly the efficiency of the algorithm (as usual with vectorization). Using 32-bit floating point keys is not much different from using integer keys; moreover, sorting 32-bit floats can be easily reduced to sorting 32-bit integers, which is often used to run radix sort on floating point data (see this and that). Also I’ll briefly discuss the case when 32-bit values are attached to 32-bit keys (sorting without values is pretty useless in practice).

Continue reading

Posted in High performance | Tagged , , | 3 Comments

Doom 3: C++ enhanced RTTI and memory debugging

Although Doom 3 was released 13 years ago, there is still of lot of interesting stuff to find in it. The main reason for that is: it was developed in the time of changes. The ID software team had just moved from C to C++ (using Visual C++ 6). Graphic cards were moving from fixed pipeline to programming pipeline, CPUs were getting SIMD extensions. All of this caused a lot of diversity in the code: crazy mix of C++/C/asm code, several renderer backends, code acceleration for MMX/SSE/AltiVec (yeah, Macs used PowerPC at that time). Not to mention tons of scripts and defs, including C++-like scripting language, and in-game GUI system.

I have to deal with Doom 3 engine in the context of The Dark Mod (TDM for short).

This article is about memory debugging configuration in Doom 3 SDK (more precisely, “Debug with inlines and memory log”).

Example

To get started, here are some messages dumped to game console in the memory debugging configuration (current version of TDM):

Uninitialized_warning_console

Continue reading

Posted in The Dark Mod | Tagged , | Leave a comment