Hashtable in C

There's many times when the problem you are solving needs something like a map of keys to values with O(1) insertion and retrieval. C doesn't come with one already ready-for-use like more "modern" languages like Python, so you gotta roll up your sleeves and do it yourself.

However, if you search around for "how to implement a hash table in C", you'll often find material that hashes to a fixed number of buckets and then has a linked list of items. This is slower than it needs to be and introduces complicated memory management that would be better avoided.

So we'll implement our own instead. As a bonus, we'll add some extra properties like easy C-style iteration, ordered insertion and realloc-friendly memory usage.

You could easily "templatize" this code.

Basic Reqirements

Your key type needs two basic properties:

Equatible (can compare two keys for equality).
Hashability (can map the key to a well-distributed number).

Your value type can be whatever.

Design Decisions

Limit the number of items in the hashtable to pow(2, 31). This let's us use 32 bit indexes which saves some memory.
Single memory allocation.

Data Layout

You'll (re-)allocate a block of memory such that it is laid out as follows:

|       Items          |        Indexes        |
+---------+------------+-----------------------+
|Occupied | Unoccupied | Indexes into occupied |
+---------+------------+-----------------------+

|- count -|            |----- 2*capacity ------|
|----- capacity -------|

When an item is inserted, the key is hashed and used to start the search in the Indexes array. If it finds a valid index, it uses it to index into the Items array and compare the key there. If it's a match, then return the pointer to the value half of that item. Otherwise, keep going. If an invalid index is found, that means the key is not in the table already. Insert the key after the end of the occupied portion of the items array and increment count. Store this new index into the Indexes array where you found the invalid index.

When the items array is filled, realloc the memory block (the enlarged Items array will now overlap with what was once the Indexes array). Invalidate all indexes in the new Indexes array and regenerate it from the Items array.

In this design, there is just a single allocation, which makes de-allocating the table easier. Additionally, it is re-alloc friendly and if you can re-alloc in place, then no copying of the Items needs to be done, which can be a valuable property if you are allocating from a linear allocator.

Another property is that the table remembers the order in which keys were inserted, which can be a useful property and makes debugging easier.

Finally, you don't need to remember the special way of how to iterate over this table or have some iterator macros or construct. You iterate it like you would an array:

void foo(void){
  for(size_t i = 0; i < table.count; i++){
    Pair* p = &table.data[i];
    // do whatever with p
  }
}

Implementation

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Define these as makes sense.
uint32_t hash_key(const KeyType* key);
_Bool key_eq(const KeyType* a, const KeyType* b);

// You'll probably name these something else in your
// implementation to avoid collisions.
typedef struct Pair Pair;
struct Pair {
  KeyType key;
  ValueType value;
};

typedef struct Table Table;
struct Table {
  uint32_t count, capacity;
  Pair* data;
};

enum {TABLE_EMPTY_SLOT=UINT32_MAX};

// Grows the table to the new size.
// Returns 0 on success and non-zero on failure (usually
// allocation failure).
int
grow_table(Table* table, size_t new_size);

// If key is already in the table, returns the pointer to
// the corresponding value.
// Otherwise, inserts the key into the table and returns a
// pointer to the uninitialized value.
// Returns NULL if the table needed to grow and that
// failed.
ValueType*
table_getsert(Table* table, KeyType* key);

// If key is in the table, returns the pointer to the
// corresponding value.
// Otherwise returns NULL.
ValueType*
table_has(const Table* table, const KeyType* key);

// This is faster than modulo while preserving the same
// properties if all the bits are equally well distributed.
// Do *not* use identity hash with this.
uint32_t
fast_reduce32(uint32_t x, uint32_t y){
  return ((uint64_t)x * (uint64_t)y) >> 32;
}

int
grow_table(Table* table, size_t new_cap){
  size_t cap = table->capacity;
  if(new_cap <= cap) return 0;
  if(new_cap > UINT32_MAX/2) return 1;

  // On a 32 bit system, you'll want to check
  // for overflow here.
  size_t new_size = new_cap*(sizeof(Pair)
                    +2*sizeof(uint32_t));
  char* new_data = realloc(table->data, new_size);
  if(!new_data) return 1;
  table->data = (Pair*)new_data;
  table->capacity = (uint32_t)new_cap;

  size_t offset = new_cap * sizeof(Pair);
  uint32_t* indexes = (uint32_t*)(new_data+offset);
  size_t idx_cap = 2*new_cap;
  memset(indexes, 0xff, idx_cap * sizeof *indexes);

  Pair* items = (Pair*)new_data;
  for(size_t i = 0; i < table->count; i++){
    Pair* p = &items[i];
    uint32_t hash = hash_key(&p->key);
    uint32_t idx = fast_reduce32(hash, (uint32_t)idx_cap);
    while(indexes[idx] != TABLE_EMPTY_SLOT){
      idx++;
      if(idx == idx_cap) idx = 0;
    }
    indexes[idx] = i;
  }
  return 0;
}

ValueType*
table_getsert(Table* table, KeyType* key){
  if(table->count == table->capacity){
    size_t new_cap;
    if(table->capacity)
        new_cap = 2*table->capacity;
    else
        new_cap = 32;
    int err = grow_table(table, new_cap);
    if(err) return NULL;
  }
  uint32_t cap = table->capacity;
  uint32_t idx_cap = cap * 2;
  uint32_t hash = hash_key(key);
  uint32_t idx = fast_reduce32(hash, idx_cap);

  size_t offset = cap * sizeof(Pair);
  char* data = (char*)table->data;
  uint32_t* indexes = (uint32_t*)(data + offset);
  Pair* items = table->data;
  for(;;){
    uint32_t i = indexes[idx];
    if(i == TABLE_EMPTY_SLOT){
      indexes[idx] = table->count;
      items[table->count].key = *key;
      return &items[table->count++].value;
    }

    Pair* p = &items[i];
    if(key_eq(&p->key, key))
      return &p->value;

    idx++;
    if(idx == idx_cap) idx = 0;
  }
}

ValueType*
table_has(const Table* table, const KeyType* key){
  if(!table->count) return NULL; // empty table
  uint32_t cap = table->capacity;
  uint32_t idx_cap = cap * 2;
  uint32_t hash = hash_key(key);
  uint32_t idx = fast_reduce32(hash, idx_cap);

  size_t offset = cap * sizeof(Pair);
  char* data = (char*)table->data;
  uint32_t* indexes = (uint32_t*)(data + offset);
  Pair* items = table->data;
  for(;;){
    uint32_t i = indexes[idx];
    if(i == TABLE_EMPTY_SLOT)
      return NULL;

    Pair* p = &items[i];
    if(key_eq(&p->key, key))
      return &p->value;

    idx++;
    if(idx == idx_cap) idx = 0;
  }
}

Deletion

I didn't show to handle deleting items. Funnily enough I haven't found much need to delete things, but you have a few options.

Somehow mark the pair as deleted (using a sentinel for the value part or a book-keeping bitmask or whatever) and adjust the table growing code to shift non-deleted items down. This makes iteration more complicated and growing slower. This retains O(1) deletion.
Eagerly shift the non-deleted items into place and fixup the indexes. This makes deletion O(n).

There's probably other options as well, but you can tune

Alterations

There's some ways you could change the above design.

If hashing is expensive, and you want to avoid the re-hashing on grow, you can store the hash in the Pairs.
You could use 3 different allocations for the keys/values/indexes. There's not really much point to doing this.

Conclusion

It's certainly inconvenient to need to implement your own basic data structures in C, but in doing so you develop the skill to tailor them to your specific problem. And as shown above, you can get a functional hashtable in not very much code at all.