Kontainers: STL-like containers focusing on performance

This project started while I was browsing the Doom 3 source code published by id Software. Inspecting the code that iterates over a hash table, I noticed it used the hash table improperly. After further investigation of the hash table's implementation, I realised there must be a better way. I started researching hash table implementations and hash functions, and before I knew it, I was creating my own set of containers, purely for fun. I had also previously researched the 64-bit C++ programming practices described in the articles published on PVS-Studio's site. Therefore, I was determined to create all the containers for both 32-bit and 64-bit.

Performance

Data provided by the benchmark program has been plotted in the charts below.
Building the latest revision from Bitbucket produces the same executable(s).

Kontainers VS STL

In this chart, x64 Kontainers performance and x64 STL performance are plotted against each other. As the chart shows, Kontainers are faster in most cases.

The following containers are always faster (under any configuration):
• Map
• Queue
• String(StrReplacing)

The following container is usually faster (slower under certain configurations):
• String (StrExpanding)

The following containers have negligible speed difference:
• List
• Heap

For more details on why their speeds differ, see the Container implementations section.

x86 Kontainers VS x64 Kontainers

Knowing why 64-bit programs can be either slower or faster than their 32-bit counterparts, this chart offers insight into each container's implementation. Since every container is faster in x64, we may conclude that good 32-bit/64-bit programming practice has been maintained, and that the implementations do not depend heavily on pointers (see the next section).

x86 STL VS x64 STL

There is one particular aspect of 32-bit and 64-bit programs that lets us infer that STL's unordered_map and queue may use excessive pointer code. Pointers in 32-bit programs are 4 bytes, as opposed to 8 bytes in 64-bit programs. Thus, if a section of code uses pointers excessively, its memory usage will be noticeably greater, which reduces (data) cache efficiency. In addition, common pointer-based implementations of hash tables, maps, and queues all make use of linked lists. This supports the inference that pointers are being used excessively in these containers.

If you want to see more graphs, you can download the spreadsheet (Excel 2010) containing all the data here.

Container implementations

Click a container's name below to view further details.

Code Links: Header, Footer, Tests
Stats: consistently faster than STL; x64 Kontainer's Map is 1.3203 times faster than x64 STL unordered_map.

What is a hash table?
A hash table is similar to a phone book or a dictionary. The data is associated with an identifier which can be looked up. For example, say you have a personal phone book containing all your friends' phone numbers. A friend's name is the key for the phone number (the data or value). However, unlike a phone book, the data is not stored in order, hence STL's elegant name "unordered_map". Instead, the key is "hashed" to create an identifying hash, which is used to look up the data faster. The only problem is that the hash is not unique to the key, i.e., the hash for "Jen" could be the same as the hash for "Ben". The solution to this problem differs between implementations. Consequently, all hash tables must have two things: a hash function and a means of resolving hash collisions.

Hash function
Many hash functions exist, each with different strengths and weaknesses. For example, some are very fast but likely to collide; others are slower but guarantee unique hashes. Therefore, the purpose must be considered when selecting a hash function. Murmur2 was selected for Kontainers. The reasons are as follows (according to this research): it is fast and produces high-quality hashes, making it collide infrequently. Its quality was verified using the hash function benchmarking tool SMHasher. Other hash functions considered were: FNV-1a, xxHash, and Tiger-192.

Collision resolutions (two will be discussed here)
Buckets / chaining: Chaining uses a container to store all the data that hashes to the same value, which is how hash collisions are managed. The container can be a linked list, a vector, or a balanced tree. Usually it is a linked list, which adds overhead to every lookup and insert.

Open addressing with linear probing: There are several forms of open addressing: linear probing, random probing, quadratic probing, and double hashing. A good explanation of these can be found in this lecture. Linear probing examines the next array entry upon a hash collision. For example, suppose a collision occurred at index 100. The next step is to check whether index 101 is empty; if it is occupied, 102 is checked, and so on. If you're feeling sharp, you may have just noticed a problem: if you removed an element, e.g. at index 101, then the element at index 102 would no longer be reachable (since 101 is now empty, the probing stops there). Therefore, with linear probing, erasing an element is usually not implemented.

The pros of buckets / chaining (unordered_map) are:
• Efficient memory use -- the container uses approximately as much memory as the inserted elements themselves, because linked lists allocate per element.
• No rehashing required -- when an open-addressed hash table needs to grow, all its elements must be reinserted into a new table; a chained table avoids this.

The cons of buckets / chaining (unordered_map) are:
• Inherits linked list performance -- very poor cache efficiency.
• Frequent memory allocation -- every time a new element is added, a memory allocation takes place (or at least, one is requested from the allocator).

The pros of open addressing with linear probing (Map from Kontainers):
• Efficient cache use -- thanks to locality of reference, looking up and adding elements is fast.
• Convenient serialisation -- with the data held in one contiguous array, serialising it is straightforward.
• Infrequent memory allocation -- memory is allocated only when the container needs to resize.

The cons of open addressing with linear probing (Map from Kontainers):
• Resizing -- resizing requires rehashing, which essentially means building a new hash table.
• Complex remove functionality -- the removal of elements is complicated and usually not implemented.
• Inefficient memory use -- it can use up to 50% more memory than the inserted elements require (depending on the resize load factor).

Code Links: Header, Footer
Stats: Kontainer's Set was not tested; its performance should be identical to Kontainer's Map.

What is a set?
By a set, we mean the very same type of set you find in mathematics. Each value in the set occurs only once.

A set, in a broad sense, is similar to a hash table. The difference is that the data is itself the key. This is why the data, rather than a separate key, is hashed and compared.

Kontainer's Set is implemented in the same way as Kontainer's Map. See the section above for more details.

Code Links: Header, Footer, Tests
Stats: significantly faster than STL; x64 Kontainer's queue is 3.028 times faster than x64 STL queue.

What is a queue?
It is just like a queue at the bank, and usually just as slow: you join at the back and wait your turn until you reach the front. People in the middle of the queue cannot leave. People are added at the back and removed from the front, and the number of additions and removals does not have to be equal.

A common implementation of a queue is a linked list. It is memory efficient but has poor cache efficiency. An alternative is a circular buffer: essentially a single block of memory accessed in a circular fashion. For example, say its size is 3 (indices 0 to 2); after index 2 is accessed, the next element is index 0. A circular buffer is less memory efficient than a linked list, but it is cache efficient. Kontainer's Queue (a circular buffer) has an additional (optional) feature to resize when full, as opposed to overwriting previous data.

Code Links: Header, Footer, Tests
Stats: There are two tests for strings: StrExpanding and StrReplacing. With StrExpanding, x64 Kontainer's Str is 4.4548 times faster than x64 STL string. With StrReplacing, x64 Kontainer's Str is 1.5688 times faster than x64 STL string. Kontainer's StrReplacing is significantly faster than STL. Kontainer's StrExpanding is slower than STL in one configuration.

What is a string?
A string is a container for storing characters. Strings are similar to an STL vector, except specialised for characters, and they usually provide extra functions to make them more convenient to use.

Str is Kontainer's string class. Its source code is vast, as it provides many additional functions. There is a single header and a separate footer for each category of functions: additional, format, comparison, and operators. In addition to these files, there is a set of wrapper functions around the <cstring> functions. This has two advantages: asserts can be added before the <cstring> functions are called, and the use of Agner Fog's assembly library can be toggled. For more information on the assembly library, including its speed impact, see the Further features section.

Kontainer's Str class is implemented as a simple resizing C array. It uses <cstring> functions such as memcpy, memmove, and strtok whenever possible. In addition, the Str class resizes differently from STL vector: instead of doubling in capacity, it increases by a set amount. The idea is to save memory. For example, if the programmer knows a string will not be larger than 512 characters, setting the size increase to 128 or 256 keeps the strings small.

Code Links: Header, Footer, Tests
Stats: x64 Kontainer's List is 1.0123 times faster than x64 STL vector. In other configurations List can be marginally slower than vector.

What is a List / vector?
An automatically resizing C style array.

There is not much one can say about a resizing C-style array. List resizes the same way vector does, doubling in capacity. List features additional functions that a C array does not, such as erase, insert, shuffle, sort, and find.

Code Links: Header, Footer
Stats: none; with correct inlining, this is a C array.

What is a ListFixed?
It's quite like a List, except it is fixed in size. It includes the additional functions such as erase, insert, shuffle, and sort.

Code Links: Header, Footer
Stats: none

What is an Array?
Array is a minimalistic wrapper for a C array; it's intended as a drop-in replacement. Its purpose is to provide a programmer with run time assertions.

Code Links: Header, Footer, Tests
Stats: This is the least successful container. Kontainer's Heap is marginally (1.0569 times) slower than STL priority_queue in most of the test configurations.

What is a heap?
A heap is a type of tree structure. What makes it different from a generic tree is that the data is partially ordered (the heap property): the root node is either the smallest or the largest value in the whole tree (depending on the ordering).

There is a trick you can do with trees that works especially well with heaps: using a little maths to calculate indices (look at line 541 onwards in the footer), the whole tree can be stored in an array rather than as individual objects linked by pointers. This improves cache efficiency at the cost of some wasted memory at the end of the array. I believe STL uses the same method for its priority_queue, which would explain the similar performance.


Testing

Testing is intended to be as fair as possible. An example of the test process (Map container) is as follows:

The test function for the container is called, and it creates an instance of the test class:
void MapVSunordered_map::RunMapTests()
{
	MapVSunordered_map mapTest;               // creates test class instance on stack
	TestUtils::testCode("Karl Map Test",      // calls testing function with:
				[&](){mapTest.MapRunA();},    // lambda containing the tests to run multiple times
				mapTest.GetLoopCount().GetRunsArgument(),        // how many times to run the test
				mapTest.GetLoopCount().GetCyclesArgument() );    // how many calls of the test function constitutes one test
}

The test class instance is used in a lambda function passed to the test bed code. The lambda calls "MapRunA()" which looks like this:
void MapVSunordered_map::MapRunA()
{
	for(size_t i = 0; i < m_loopCount.GetOuterLoopA(); ++i )
	{
		TestStringToIndex( m_loopCount.GetArgumentA() );
		TestIndexToString( m_loopCount.GetArgumentB() );
	}
}

"m_loopCount" is a class which contains magic numbers. It is initialised on the construction of the test class. TestStringToIndex and TestIndexToString are the actual tests. To find the matching STL version, look for the STL container name followed by RunMapA, in this case it is "unordered_mapRunMapA". You can find both these functions here. For more information on each test see the Container implementation section; simply open the test footer, skip to the bottom of the file, and follow the code flow.


Further features

Click a feature's name below to view further details.

Agner Fog has a very interesting website covering many topics. I was particularly interested in his asmlib, which provides hand-coded assembly versions of functions such as memcpy and strcmp. They boast significant speed increases over the standard versions thanks to their use of SIMD. Naturally, with the project being focused on performance, why not try them out? A wrapper class was made so these functions can be toggled at will. Below are charts of the results:


The string class is dramatically faster with Agner's asmlib. Everything else is not affected much.


An even bigger increase for the string class with Agner's asmlib. Again, everything else is not affected much.

Note: the charts at the top of this page show Kontainers with Agner's asmlib; however, assertions such as "Map is consistently faster than unordered_map" do include tests without Agner's asmlib.

For more charts with and without Agner's asmlib, see the spreadsheet (Excel 2010) here.

After reading the section on random number generation in Numerical Recipes, I learnt about combining random number generators. I also read the source code for all their generators, and couldn't help noticing that they are all somewhat shorter than other random generators such as the Mersenne Twister, yet still do exceptionally well in the Diehard tests. Out of curiosity, I wanted to create a combined generator. The core of my generator is Numerical Recipes' "Ran" generator (third edition, chapter 7, page 342). Since I am a fan of generating random numbers in the range 0.0 to 1.0, I wanted to improve the quality of the double generation. This was achieved by combining it with a random number generator that uses a totally different algorithm, taken from the Doom 3 source code (line 144). As stipulated in Numerical Recipes, combining generators based on different algorithms improves their strength: one's weaknesses are masked by the other's strengths. To test my generator, I took inspiration from a post about hash functions. In much the same way the poster rendered hashes into an image for visual inspection, I could render the generated numbers and inspect them visually.

In order for there to be any point in inspecting results, they should be compared to other generators. The following random number generators were tested and visually inspected:
• Doom3 Random2 (IEEE trick): float
• Numerical Recipes Ran: double, integer
• Numerical Recipes Ran1: double
• Numerical Recipes RanFib: double
• Random2 combined with RanFib: double
• Mersenne Twister: double, integer
• SIMD-oriented Fast Mersenne Twister (SFMT): double, integer
• Mother-of-all: double, integer
• SFMT combined with Mother-of-all: double, integer
• Krand (my combined generator of Ran and Random2): double, integer
• C rand: integer (insufficient range for this test)

What you are looking for is the number of patterns you can see in the images: the more patterns or shapes visible, the lower the quality of the generator. Additionally, the file size can give an indication of quality.

Note: C rand was used with the modulus operator, as is common practice. I have always been sceptical about this. For example, consider the following:

int num = rand() % 15;  // what happens when you modulus by an odd number? (pigeonhole principle)
num = rand() % 1999;    // 1999 just happens to be prime, how will this affect our results?
num = rand() % 65536;   // C rand doesn't have enough range for this.

In order to ensure correct execution in both 32-bit and 64-bit configurations, PVS-Studio's static code analysis was used; it specialises in detecting warnings and errors specific to 32-bit/64-bit portability. In addition, Cppcheck was used to find a few warnings that were missed. Both of these were used only after resolving the essential warnings flagged by compiling with /Wall (all compiler warnings enabled).


Future work

Even though I am currently working on a different project, there are still interesting tasks that could be done:
• Upgrade to MSVS 2012 -- This comes with upgraded STL and C++11 features. STL performance should improve.
• Upgrade codebase to C++11 -- How much improvement would move assignment and move constructors make?
• Cross platform -- adding support for GCC or Clang. Are their STL implementations faster than MSVC's?
• Memory copy speed test -- Since many of the tests use a lot of memory allocation, the performance of the machine's RAM will certainly affect the time elapsed. Adding a test to measure its performance could assist in explaining why one machine is faster than another.
• Assertion of correct memory use -- Valgrind or a similar tool should be used to check for memory errors and vulnerabilities.