Saturday, January 11, 2014

Chapter 11: Hashing Support

The hash table is a very popular data structure which is used in almost every domain of programming. It stores data as '(key, value)' pairs. Typically, a hash table uses a hash function to compute an index into an array of buckets during insertion and lookup of a key. With a well-behaved hash function these operations take almost O(1) time, and this fast lookup is what makes the data structure so popular.


C++11 has introduced support for hash tables in the standard library. There are four associative container classes which directly provide hash table functionality. As usual, there are two variants each of 'map' and 'set', which carry the same meaning as their ordered counterparts; a short declaration sketch follows the list.

1. unordered_map
2. unordered_set
3. unordered_multimap
4. unordered_multiset
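
To make these names concrete, the following is a minimal declaration sketch of the four containers; the key and value types here are arbitrary choices for illustration:

//Code Begins

#include<string>
#include<unordered_map>
#include<unordered_set>

std::unordered_map<std::string, int>      um;   /* unique keys            */
std::unordered_set<std::string>           us;   /* unique elements        */
std::unordered_multimap<std::string, int> umm;  /* duplicate keys allowed */
std::unordered_multiset<std::string>      ums;  /* duplicates allowed     */

//Code Ends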


If the reader is not familiar with the hash table data structure, I suggest reading about it in any standard algorithms book. Here we would mainly focus on how we can use the STL implementation of the hash table. Hence, in the following sections, I assume that the reader is aware of the hash table concept.


We all know that 'std::map/std::set' insert keys based on the less-than operator. These containers use a red-black tree in their internal implementation. Since elements are kept in less-than order, the red-black tree guarantees 'O(log(n))' time complexity for element lookup.


However, this does not hold true for the hash table based container classes. The elements are not guaranteed to be stored in any particular order. In fact, the order depends on many properties of the hash table, and it may vary between different runs of the same program. Basically, a hash table needs to provide 'O(1)' complexity for key lookup, so it stores the elements in the container in a very different way.


If you want to look up random key entries in a sequence, you should ideally go for the hash table variants of the container classes (unordered). If you want the elements to be kept in an ordered way, and you can compromise somewhat on lookup time, then you should use the earlier variants of the container classes (ordered).


You may not see significant performance benefits from a hash table container class compared to an ordered container class when the container is small. However, once the container size is large, we start to observe significant benefits. As usual, do not take anyone's word for it: write the code and measure the time. We should always measure the performance of our code ourselves before adopting anything for our requirements.
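
As a starting point for such measurements, the following is a minimal timing sketch, not a rigorous benchmark; the container size and the key format are arbitrary choices. It fills a 'std::map' and a 'std::unordered_map' with the same keys and measures the time spent looking every key up again.

//Code Begins

#include<chrono>
#include<iostream>
#include<map>
#include<string>
#include<unordered_map>
#include<vector>

/* Measure how long it takes to look up every key once. */
template<typename Map>
long long time_lookups(const Map& m, const std::vector<std::string>& keys) {
    auto start = std::chrono::steady_clock::now();
    long long found = 0;
    for(const auto& k: keys) {
        found += m.count(k);  /* keeps the loop observable */
    }
    auto stop = std::chrono::steady_clock::now();
    if(found != static_cast<long long>(keys.size())) {
        std::cout<<"unexpected lookup misses"<<std::endl;
    }
    return std::chrono::duration_cast
            <std::chrono::microseconds>(stop-start).count();
}

int main(int argc, const char* argv[]) {
    const int n = 100000;  /* arbitrary size for illustration */
    std::vector<std::string> keys;
    std::map<std::string, int> ordered;
    std::unordered_map<std::string, int> unordered;

    for(int i = 0; i < n; ++i) {
        keys.push_back("key" + std::to_string(i));
        ordered.emplace(keys.back(), i);
        unordered.emplace(keys.back(), i);
    }

    std::cout<<"std::map lookups: "
            <<time_lookups(ordered, keys)<<" us"<<std::endl;
    std::cout<<"std::unordered_map lookups: "
            <<time_lookups(unordered, keys)<<" us"<<std::endl;
    return 0;
}

//Code Ends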


The following example demonstrates that we should not rely on the order of elements in the sequence for the hash table container classes. The order can vary even between different runs of the same program.




//Code Begins

#include<map>
#include<unordered_map>
#include<iostream>
#include<string>

typedef std::map<std::string, int> mymap;
typedef std::unordered_map<std::string, int> myunorderedmap;

void learn_hash_support(void) {
    mymap m_a{ {"Mantosh",7},{"Kumar",5},{"Sharma",6},
        {"Ram",3},{"Chandra",7},{"Aryan",5},
        {"Deepansu",8},{"Shakuntala",10},
        {"Phool",5},{"Parbati",7}};

    myunorderedmap m_b{ {"Mantosh",7},{"Kumar",5},{"Sharma",6},
        {"Ram",3},{"Chandra",7},{"Aryan",5},
        {"Deepansu",8},{"Shakuntala",10},
        {"Phool",5},{"Parbati",7}};

    /* Walk both containers side by side: the ordered map */
    /* iterates in key order, the unordered map in an     */
    /* implementation-defined order.                      */
    auto itr1 = m_a.begin();
    auto itr2 = m_b.begin();
    for(; (itr1 != m_a.end() && itr2 != m_b.end()); ++itr1, ++itr2) {
        std::cout<<"("<<itr1->first<<","<<itr1->second<<")"<<"\t\t\t"
                <<"("<<itr2->first<<","<<itr2->second<<")"<<std::endl;
    }
}

int main(int argc, const char* argv[]) {
    learn_hash_support();
    return 0;
}

//Code Ends



When I run the above program, I get the following output. Here we can see that the outputs are different for the ordered and unordered versions for the same set of inputs.


//Console Output Begins

$ ./test
(Aryan,5)           (Phool,5)
(Chandra,7)         (Deepansu,8)
(Deepansu,8)        (Chandra,7)
(Kumar,5)           (Aryan,5)
(Mantosh,7)         (Ram,3)
(Parbati,7)         (Parbati,7)
(Phool,5)           (Sharma,6)
(Ram,3)             (Shakuntala,10)
(Shakuntala,10)     (Kumar,5)
(Sharma,6)          (Mantosh,7)

//Console Output Ends



The C++11 standard also provides a standalone way of using the hash function. This is good, as we can now experiment with the hash function on a given data set. The basic requirement of a hash function is that it must always give the same hash value for the same input. The second requirement is that it should produce different hash values even for similar keys, so that keys are spread over the table as uniformly as possible. These properties play a significant role during key lookup in the table.
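
As a quick illustration of these two requirements, here is a minimal sketch; the keys are arbitrary examples, and note that the standard only guarantees determinism within a single run of the program:

//Code Begins

#include<functional>
#include<iostream>
#include<string>

int main(int argc, const char* argv[]) {
    std::hash<std::string> hash_fun;

    /* Requirement 1: the same input must always produce */
    /* the same hash value (within one program run).     */
    std::cout<<(hash_fun("Mantosh") == hash_fun("Mantosh"))<<std::endl;

    /* Requirement 2: even very similar keys should get  */
    /* very different hash values.                       */
    std::cout<<hash_fun("abc")<<std::endl;
    std::cout<<hash_fun("abd")<<std::endl;
    return 0;
}

//Code Ends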


In the following example, I have stored words, one per line, in the file 'input.txt'. The words are not unique and have multiple entries in the file. First I store them into a 'std::set', which ensures that only unique words remain once reading from the file is done.


Next I define a 'hash_fun' function object from the standard library. The idea here is to calculate the hash value for each of the words which we have already stored in the 'std::set' container. I store the hash values into another 'std::set' container. Ideally, both containers should end up the same size, which would confirm that the hash function is behaving well, as it generated a different hash value for every different key.


At the end of the program, I print the '(key, hash value)' pairs. We can see that even for very similar strings the hash values are completely different, which is another indication of a good hash function. I hope this makes sense to the reader.



//Code Begins

#include<iostream>
#include<vector>
#include<string>
#include<fstream>
#include<functional>
#include<set>

std::set<std::string> words;

template<typename T>
void display(const T& val) {
    std::cout<<val<<std::endl;
}

void read_and_store(void) {
    std::string node;
    std::string file = "input.txt";
    std::ifstream ifstr{file};

    if(ifstr.is_open()){
        while(std::getline(ifstr,node)) {
            words.insert(node);
            node.clear();
        }
        ifstr.close();
    }
}

void do_some_hash_stuff(void) {
    std::set<size_t> hash_val;
    std::hash<std::string> hash_fun;

    for(const auto& i: words) {
        auto val = hash_fun(i);
        hash_val.insert(val);
    }

    display(words.size());
    display(hash_val.size());

    auto itr1 = words.begin();
    auto itr2 = hash_val.begin();
    for(; (itr1 != words.end() && itr2 != hash_val.end()); ++itr1, ++itr2) {
        std::cout<<*itr1<<"\t\t\t"<<*itr2<<std::endl;
    }
}

int main(int argc, const char* argv[]) {
    read_and_store();
    do_some_hash_stuff();
    return 0;
}

//Code Ends



When I run the above program, I get the following output.


//Console Output Begins

$ ./test
114
114

've             35910839017159818
And             45719090434930041
If              257679905247666261
Let's           728709960516072112
So              828576016503041753
The             891443442213174483
There           1053890543232546217
They            1083224364347218984
We              1778721970047496877
..................................
..................................

//Console Output Ends




The following is a more or less complete example of hash table usage for a particular problem. First I read the words and calculate their lengths so that I can store them in a 'std::unordered_multimap' as '(key, value)' pairs. After that, I fetch various pieces of information using the bucket interface of this particular hash table.


You can visualize the hash table as a vector where each element of the vector is itself a kind of list. Each index of the vector is known as a bucket, and ideally the values should be distributed uniformly over the entire vector. Here I fetch the total number of buckets in the current hash table. I also print the count of keys in each bucket. The output is fairly large: there are many buckets with a count of 0, and there are many words which land in the same bucket.


This is expected with the input which I am using in these programs. There are multiple entries of the same key, which leads to a situation where one bucket holds more than one element. Ideally, the list of elements at a particular index of the vector should not grow very large, or there would be no guarantee of 'O(1)' complexity for the lookup operation. To maintain this, the hash table sometimes increases its size and recalculates the hash values of all elements so that they can be placed in their new buckets. This is known as rehashing; it is expensive and ideally should not happen very frequently.
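
Because rehashing is expensive, when we roughly know the number of elements in advance we can pre-size the table so that insertions do not keep triggering rehashes. Here is a minimal sketch; the expected element count is a hypothetical value:

//Code Begins

#include<string>
#include<unordered_map>

int main(int argc, const char* argv[]) {
    std::unordered_multimap<std::string, int> words;

    /* reserve() sizes the bucket array for at least this */
    /* many elements up front (hypothetical count), so    */
    /* insertions up to that point need not rehash.       */
    words.reserve(10000);

    /* Alternatively, request at least a given number of  */
    /* buckets directly.                                  */
    words.rehash(2048);
    return 0;
}

//Code Ends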


In the later part of the program, I find the bucket number for a particular key. After this, I want to print all elements which have been hashed to this bucket number. The standard provides a 'begin/end' style interface to walk through the list of keys in a particular bucket. These 'begin/end' overloads are different from the ones which we use to walk through the elements of the entire container.


In this program, I have also calculated the 'load factor' of this particular hash table. The standard provides an interface to fetch it directly, but here I have calculated it from its definition: the number of elements divided by the number of buckets. At the end of the program, I also fetch the load factor through the standard interface, so that we can verify that both values are the same. As we know, the 'load factor' is a very important piece of information which the hash table uses to decide when the entire table should be rehashed.




//Code Begins
#include<iostream>
#include<string>
#include<fstream>
#include<utility>
#include<unordered_map>

typedef std::unordered_multimap<std::string, int> myunorderedmultimap;
typedef myunorderedmultimap::size_type sztype;
typedef myunorderedmultimap::local_iterator uniterator;

myunorderedmultimap words;

template<typename T>
void display(const T& val) {
    std::cout<<val<<std::endl;
}

void read_and_store(void) {
    std::string str;
    std::pair<std::string, int> node;
    std::string file = "input.txt";
    std::ifstream ifstr{file};

    if(ifstr.is_open()){
        while(std::getline(ifstr, str)) {
            node.first = str;
            node.second = str.length();
            words.insert(node);
            str.clear();
        }
        ifstr.close();
    }
}

void learn_bucket(void) {
    sztype bucketcnt = words.bucket_count();
    sztype maxbucketcnt = words.max_bucket_count();

    /* Think of the output below as a grid: a non-zero   */
    /* entry means that bucket holds that many elements. */
    /* Ideally the distribution should be fairly uniform.*/
    for(sztype index = 0; index < bucketcnt; ++index) {
        std::cout<<words.bucket_size(index)<<"\t";
    }
    std::cout<<std::endl;

    /* Now, let's find out in which bucket a particular  */
    /* key lives.                                        */
    std::string key{"mathematical"};
    sztype bucketno = words.bucket(key);
    display(bucketno);

    /* Print all elements which are stored in that same  */
    /* bucket, using the bucket-local begin/end.         */
    for(uniterator itr = words.begin(bucketno);
            itr != words.end(bucketno); ++itr) {
        std::cout<<itr->first<<"\t"<<itr->second<<std::endl;
    }

    /* Now, let's calculate the load factor manually:    */
    /* load factor = size / bucket count.                */
    sztype sz = words.size();
    float ldfactor = (float(sz)/float(bucketcnt));
    display(ldfactor);
}


void learn_load_factor(void) {
    float ldfactor = words.load_factor();
    float maxldfactor = words.max_load_factor();

    display(ldfactor);
    display(maxldfactor);
}


void learn_hash_stuff(void) {
    read_and_store();
    learn_bucket();
    learn_load_factor();
}

int main(int argc, const char* argv[]) {
    learn_hash_stuff();
    return 0;
}


//Code Ends



When I run the above program, I get the following output.


//Console Output Begins
$ ./test
0    0    0    0    0    0    0    0    6    0    0    0    0    6    0    0
0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
0    6    0    0    0    0    0    0    0    0    0    0    0    0    6    0
24   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
.............................................................
.............................................................
13
mathematical    12
mathematical    12
mathematical    12
mathematical    12
mathematical    12
mathematical    12
0.630672
0.630672
1

//Console Output Ends



This is more or less a continuation of the previous example. I broke it into two parts so that it would be easier to understand. In this program, I describe how we can force a rehash at the moment we want it to happen.


First I store the 'load factor' of the current hash table. I also save the bucket number of a particular key. Then I set the 'max load factor' of the hash table to half of its current value. As we have discussed, rehashing typically depends on the 'load factor': when it goes beyond some threshold limit (the max load factor), a rehash of the entire table happens.


I fetch the bucket number of the same key again. We can see that its bucket number has now changed. The total number of buckets has also increased, to bring the 'load factor' back under the limit. This confirms that a rehash of the entire table was indeed triggered once we changed its 'max load factor'.


In this program, I have also shown how we can fetch the hash function object of a particular hash table. This is not especially important, but if we need the hash function for some other use, we can get at it. This is the same as getting the allocator object from a particular container class. I have given two very similar inputs to this hash function, yet we can see that their hash values are completely different. As I mentioned earlier, this is the sign of a good hash function. Of course, the same behavior should hold true for all kinds of input, not just for a few of them.



//Code Begins

#include<iostream>
#include<string>
#include<fstream>
#include<utility>
#include<unordered_map>

typedef std::unordered_multimap<std::string, int> hash_multimap;
typedef hash_multimap::size_type sztype;
typedef hash_multimap::hasher hashobject;

hash_multimap words;

template<typename T>
void display(const T& val) {
    std::cout<<val<<std::endl;
}

void read_and_store(void) {
    std::string str;
    std::pair<std::string, int> node;
    std::string file = "input.txt";
    std::ifstream ifstr{file};

    if(ifstr.is_open()){
        while(std::getline(ifstr, str)) {
            node.first = str;
            node.second = str.length();
            words.insert(node);
            str.clear();
        }
        ifstr.close();
    }
}

void adjust_loadfactor_rehash(void) {
    float ldfactor = words.load_factor();
    float maxldfactor = words.max_load_factor();

    std::cout<<"Bucket Count:"<<words.bucket_count()<<"\t"
            <<"Max Bucket Count:"<<words.max_bucket_count()
            <<std::endl;
    std::cout<<"Current Load Factor is: "
            <<words.load_factor()<<std::endl;

    std::string key{"mathematical"};
    sztype bucketno1 = words.bucket(key);
    std::cout<<key<<",word was found on the bucket before hash: "
            <<bucketno1<<std::endl;

    ldfactor=2.0*maxldfactor;
    words.max_load_factor(ldfactor);

    std::cout<<"Bucket Count:"<<words.bucket_count()<<"\t"
            <<"Max Bucket Count:"<<words.max_bucket_count()
            <<std::endl;

    sztype bucketno2 = words.bucket(key);
    std::cout<<key<<" was found in bucket "
            <<bucketno2<<" after rehash"<<std::endl;
    std::cout<<"Current Load Factor is: "
            <<words.load_factor()<<std::endl;
}

void use_hash_function(void) {
    hashobject  hobj=words.hash_function();
    std::size_t hval_a=hobj("Mantosh");
    std::size_t hval_b=hobj("Macintosh");
    display("hval_a = ");
    display(hval_a);

    display("hval_b = ");
    display(hval_b);
}


void learn_hash_stuff(void) {
    read_and_store();
    adjust_loadfactor_rehash();
    use_hash_function();
}


int main(int argc, const char* argv[]) {
    learn_hash_stuff();
    return 0;
}

//Code Ends



When I run the above program, I get the following output.



//Console Output Begins

$ ./test
Bucket Count:1741    Max Bucket Count:576460752303423487
Current Load Factor is: 0.630672
mathematical was found in bucket 13 before rehash

Bucket Count:2357    Max Bucket Count:576460752303423487
mathematical was found in bucket 2188 after rehash
Current Load Factor is: 0.465846

hval_a =
5777427744200277933
hval_b =
10830944572245689348

//Console Output Ends


