Database indexes

Note

At first you should read what’s index in design section.

Currently there are two main Index implementations. You need to know the difference between them before you will choose the base for your Index. Remember that you don’t need to code whole index code, just required methods or even parts of them.

The default implemented indexes uses 2 files. One is for index metadata / structure, the second one is for storage. So the total number of files opened by database is usually number_of_indexes * 2. Indexes makes usage of sparse files to store information.

To see how to write index see below

Note

The object that you pass to database to add index, it’s not the same object that will be database side!

See also

Index functions for description of powerful mechanism of inside index (database) functions.

Using custom index is quite simple. You need to decide if your index needs to be Hash based (details) or Tree based (details for pros and cons)

Warning

Please remember that the object that is finally used by Database is not the object that you passed into!

There is special class attribute to solve import problems in indexes in DB.

custom_header
It’s string that will be inserted to final index file. It’s useful to pass the custom imports there. You will find an example in Examples - secure storage.
storage_class
It defines what storage to use. By default all indexes will use CodernityDB.storage.Storage. If your Storage needs to be initialized in custom way please look at Examples - secure storage.

Hash Index

This index is based on Hash Table with separate chaining. You can refer also for ISAM.

  • Pros
    • Fast
  • Cons
    • Records are not in order of insert / update / delete but in random like
    • Can be queried only for given key, or iterate over all keys

There are two major implementations

They differs in several places, for details you should read the code of them both.

See also

Hash Index speed tests
For speed tests
CodernityDB.hash_index.HashIndex
For documentation
conflict resolution
When using 0xfffff hash_lim after 1200 inserts there is 50% probability that conflict will occur (birthday problem). Conflicts are solved by separate chaining, so keys with the same hash function results are linked into list, then traversed when needed.
duplicate keys
For duplicate keys the same mechanism is used as for conflict resolution. All indexes different than id one can accept more than one record with the same key (get_many()).

Hash Index details

Note

For api documentation please see HashIndex

See also

Multikeys index
for Multiindex hash based implementation (more than one key per database data).

Below you will find explained in details parameters for that index type.

key_format

It defines what type is your key.

The most important for you as you’re interested to write index for your use is the key size.

For example if you want to use MD5 based index you would need set the key_format to 16s which would set the size for key to 16 characters exactly how long is md5 hash (digest).

An example code for Md5 based index can be found Design, more examples in Examples

Note

For format specification and explanation please visit Python struct documentation

hash_lim

It defines how big results will return index hash function.

Current default is 0xfffff which means 1048575 different hash function results.

Hint

In perfect conditions you will be able to store those number of unique records without conflicts, in practice you will be able to store like 1200 records without conflict with 50% probability (for example birthday problem). Look up when conflict occurs is slower because linked list is traversed. More information about conflicts Hash Index.

Hint

If you want to have index that searches for let’s say True and False values, you should set that parameter to 2. Because the rest values will be not used (however nothing bad will happen).

make_key_value

(make_key_value())

That function is called by database when inserting new or updating objects in database. It has to return None if index is not matched (not required to operate on it) and 2 values if index is matched. That 2 values are in order: key and value. Please remember that key must fit your entry_line_format.

make_key

(make_key())

That function is called when query operations are performed on database. It should format the key correctly to match that one returned by CodernityDB.index.Index.make_key_value()

entry_line_format

(for advanced users, please check if key format is not enough for you)

Entry line format contains all metadata required for Hash Index to work.

First You need to decide it will look like. The default is <32s{key}IIcI which means in order:

  1. mark for use little-endian encoding <
  2. document id format 32s
  3. index key format {key}, it will be replaced with c or if defined with value from key_format parameter.
  4. start of a record in storage format I
  5. size of a record in storage format I
  6. status format c (you probably do not want to change it)
  7. next record (in case of conflicts) format I

Note

If you expect that your index might require more than 4294967295 bytes of space or metadata (that’s the max number for I format), change it to Q.

Hash Index Example

Let’s assume that you want to write hash based index that will separate objects with a % 2 == 0 from a % 2 == 1.

class AIndex(HashIndex):

    def __init__(self, *args, **kwargs):
        kwargs['key_format'] = '?'
        kwargs['hash_lim'] = 2
        super(AIndex, self).__init__(*args, **kwargs)

    def make_key_value(self, data):
        val = data.get('a')
        if val is none:
            return None
        return val % 2, None

    def make_key(self, key):
        return key

It will allow you to perform for example:

[...]
number_of_zeros = db.count(db.get_many, 'a', 0)
number_of_ones = db.count(db.get_many, 'a', 1)
[...]

Note

Please see Examples for more examples, and CodernityDB.hash_index for full Hash Index documentation

B Plus Tree Index

This index is based on B Plus Tree. Duplicate keys are stored inside Tree structure (on leafs/nodes).

  • Pros
    • Can be queried for range queries
    • Records are in order (depends of your keys)
  • Cons
    • Slower than Hash based indexes

See also

Tree Index speed tests
For speed tests
CodernityDB.tree_index.TreeBasedIndex
For documentation
Multikeys index
for Multiindex tree based implementation (more than one key per database data).
duplicate keys
Duplicate keys are stored inside tree structure. So in worst case when you have more duplicate keys than node_size tree will became sub-optimal (a half of one node will be always empty)

Tree Index details

Note

For api documentation please see TreeBasedIndex

That index is based on BPlus tree. It’s main advantage is that it stores data in order. And you can make range queries.

key_format
It’s the same as in hash index, so please refer to key_format in hash index details
pointer_format
It’s information about internal pointer format in tree. Change it only when you need it (for example when your index file might be bigger than 4294967295 bytes)
meta_format
Contains similar information as entry_line_format in hash index. Change it when you really need to.
node_capacity
One of the most important parameters in whole tree index. It defines how big is one leaf / node inside tree. If you expect much data to come into your index, you should play with a bit to adjust the most correct size. Generally, bigger value means less tree height, but it might mean more shifts inside when doing insert operation. It can be said, that bigger node_capacity means faster get operations.

Tree Index Example

class SimpleTreeIndex(TreeBasedIndex):

    def __init__(self, *args, **kwargs):
        kwargs['node_capacity'] = 13
        kwargs['key_format'] = 'I'
        super(SimpleTreeIndex, self).__init__(*args, **kwargs)

    def make_key_value(self, data):
        a_val = data.get('a')
        if a_val is not None:
            return a_val, None
        return None

    def make_key(self, key):
        return key

It will allow you to perform for example:

[...]
from_3_to_10 = db.get_many('tree', limit=-1, start=3, end=10, inclusive_start=True, inclusive_end=True)
[...]

And you will get all records that have a value from 3 to 10.

Multikeys index

Multikeys indexes (aka Multiindex) are indexes where you can have more than one key per database data. They share the same properties as their bases (Hash Index and B Plus Tree Index).

Imagine something like infix search:

class TreeMultiTest(MultiTreeBasedIndex):

    custom_header = """from CodernityDB.tree_index import MultiTreeBasedIndex
from itertools import izip"""

    def __init__(self, *args, **kwargs):
        kwargs['key_format'] = '16s'
        super(TreeMultiTest, self).__init__(*args, **kwargs)
        self.__l = kwargs.get('w_len', 2)

    def make_key_value(self, data):
        name = data['w']
        l = self.__l
        max_l = len(name)
        out = set()
        for x in xrange(l - 1, max_l):
            m = (name, )
            for y in xrange(0, x):
                m += (name[y + 1:],)
            out.update(set(''.join(x).rjust(16, '_').lower() for x in izip(*m)))  #ignore import error
        return out, dict(w=name)

    def make_key(self, key):
        return key.rjust(16, '_').lower()

By using that index you will be able to perform infix search over all words in your database. Only one difference from non multi index is that make_key_value has to return iterable (the best will be set because it has unique values). Then you can easily run something like that:

db = Database('/tmp/multi')
db.create()
db.add_index(TreeMultiTest(db.path, "words"))

db.insert(dict(w='Codernity'))

print db.get('words', 'dern')['w']  # "Codernity"
print db.get('words', 'cod')['w']  # "Codernity"

As you can see implementing infix/suffix/prefix search mechanism in CodernityDB is very easy.

Note

Multiindex requires more time to insert data. Get speed is exactly as fast as in non multiindex (same rules applies to both of them).

Obviously that’s not only one use case for that indexes, it’s just probably the most obvious usage example.

Currently both Hash and Tree indexes have multiindex implementations: MultiHashIndex and MultiTreeBasedIndex (yes they are just prefixed by word Multi).

Index functions

Quite important thing in CodernityDB are index functions. You can do with them anything you want they have access to database object, so they can perform operations on multiple indexes. If you want join like operation, you should write function. Then you will be able to run that function database side when using CodernityDB-HTTP. The only mandatory argument for that kind of function is db, the rest are function arguments.

Writing function is easy see an example there:

def run_timeline(self, db, user, limit):
    u = db.get('user', user)
    it = db.get_many(self.name, user, end=10 ** 11, limit=limit, with_doc=True)
    for curr in it:
        curr['username'] = user
        curr['email'] = u['email']
        curr['pub_date'] = curr['doc']['pub_date']
        curr['text'] = curr['doc']['text']
        del curr['doc']
        yield curr

That function performs simple JOIN operation (that JOIN known from SQL databases). As you can see it’s exactly the same as you would code in your code to archive that.

Function should start it’s name from run_ then you can call it:

gen = db.run("index", "timeline", "user_a", 10)

As mentioned before, while you work in embedded mode it makes no big difference, but when using CodernityDB-HTTP it makes huge.

Note

Please remember that CodernityDB is not relational Database, forcing it to work in that model will usually work, but it’s not recommended. You should try to denormalize it (Database normalization).

Note

Please see Examples for more examples, and CodernityDB.tree_index for full documentation

Easier way of creating indexes

Do I really need to code indexes in Python, is there an easier way to create indexes?

You bet there is! We prepared special, simplified mode for creating indexes. That code will be translated to Python code, and then used exactly in the same way as the rest indexes (so that simple indexes are exactly the same fast as pure Python ones). They are just simple by name not by possibilities.

Note

Don’t be surprised, if we will name it SimpleIndex in several places there.

Usage of this mode is really basic, you just need to provide 2 (or optionally 3) things:

  • properties of the index like this:
name = MyTestIndex
type = HashIndex
key_format = I
etc. etc. ...
  • body for make_key_value containing our simplified syntax:
make_key_value:
a > 1: 1, None
a, None
  • and optionally body for make_key (if you don’t provide it, it will be generated automatically and set to return key value as it is):
make_key:
key > 1: 1
key

Syntax of function body is really basic, every line preceded by a statement with colon at the end means that if conditions before colon are met, function will return everything that follows colon. If there is no colon, value will be always returned. Of course the order of lines actually matters, so if you provide body like that:

a > 1: 1, None
a > 2: 2, None

The second value will be never returned (because if a is less then 1 it’s for sure less than 2). Every name will be look for in dictionaries given to functions, so body like:

a > 1: a, None

will generate python code like that:

if data["a"] > 1:
    return data["a"], None

That’s everything you need to know to work with our simplified index creator, you just need to always provide name and type in index properties and provide body for make_key_value, which has to return always two values (the 2nd has to be a dictionary or None). Here you have some examples and their equivalents in python. Remember that this simplified creator doesn’t provide python power, so if you want to write more sophisticated index, you will have to learn python.

name = Test
type = HashIndex
key_format = 16s

make_key_value:
md5(a),None

make_key:
md5(key)

is an equivalent to:

class Test( HashIndex ):
    def __init__(self,*args,**kwargs):
        kwargs["key_format"] = '16s'
        super( Test ,self).__init__(*args,**kwargs)
    def make_key_value(self,data):
        return md5 ( data["a"] ) .digest() , None
    def make_key(self,key):
        return md5 ( key ) .digest()

Keywords & Helpers in simple index

Please note that Python None value can be written as: none, None, null.

Supported logic operators:

  • and
  • &&
  • or
  • ||

Functions that you can use in make_key and make_key_value:

str(value)

it will convert value to string

Parameters:value – value to convert to string
Returns:string value of value
Return type:string
len(value)

it will return length of a value

Parameters:value – a value to check length
Returns:length of value
Return type:integer
md5(value)

it will return md5 value of a string

Parameters:value (string) – string value to get md5 from it
Returns:md5 sum of value
Return type:string
fix_r(value, length)

it will return fixed string length. Value shorter than length will be right filled with _.

Parameters:
  • value (string) – value to adjust string length
  • length (integer) – how long should be output string
Returns:

fixed size string

Return type:

string

infix(value, min_len, max_len, fixed_len)

it will generate all possible infixes of value not shorter than min_len and not longer than max_len while all of them will have fixed length defined in fixed_len (which works exactly as fix_r)

Parameters:
  • value (string) – a string which all infixes will be generated from
  • min_len (integer) – minimal length of an infix
  • max_len (integer) – maximal length of an infix
  • fixed_len (integer) – fixed size of all infixes
Returns:

set containing fixed size infixes

Return type:

set

prefix(value, min_len, max_len, fixed_len)

it will generate all possible prefixes of value not shorter than min_len and not longer than max_len while all of them will have fixed length defined in fixed_len (which works exactly as fix_r)

Parameters:
  • value (string) – a string which all prefixes will be generated from
  • min_len (integer) – minimal length of an prefix
  • max_len (integer) – maximal length of an prefix
  • fixed_len (integer) – fixed size of all prefixes
Returns:

set containing fixed size prefixes

Return type:

set

suffix(value, min_len, max_len, fixed_len)

it will generate all possible suffixes of value not shorter than min_len and not longer than max_len while all of them will have fixed length defined in fixed_len (which works exactly as fix_r)

Parameters:
  • value (string) – a string which all suffixes will be generated from
  • min_len (integer) – minimal length of an suffix
  • max_len (integer) – maximal length of an suffix
  • fixed_len (integer) – fixed size of all suffixes
Returns:

set containing fixed size suffixes

Return type:

set

Note

Obviously you can use that simple indexes in CodernityDB-HTTP without any problem.

Note

Error reporting / handling system in that mode will tell you exactly what’s wrong with your code.

Tables, collections...?

OK, I got it, but can I store more than one data type in Database. Is there something like table or collection ?

Note

In CodernityDB Demos you can find minitwit example which is rewrite from Sqlite application.

Sure! You can use Index mechanism do to it. As it has been mentioned before, Index mechanism in CodernityDB is like read only Table in SQL databases (see index design). So all you need is to define how your records will differ each other.

Let’s assume that you want to users and users and some data that belongs to user. You will probably want to be able to get all users, and all things that belongs to him, right? So.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
class AllUsers(HashIndex):

    def __init__(self, *args, **kwargs):
        kwargs['key_format'] = '16s'
        super(AllUsers, self).__init__(*args, **kwargs)

    def make_key(self, key):
        return md5(key).digest()

    def make_key_value(self, data):
        if data.get('_t') == 'user':
            return md5(data.get('name')).digest(), None


class AllItems(HashIndex):

    def __init__(self, *args, **kwargs):
        kwargs['key_format'] = '16s'
        super(AllItems, self).__init__(*args, **kwargs)

    def make_key(self, key):
        return md5(key).digest()

    def make_key_value(self, data):
        if data.get('_t') == 'item':
            return md5(data.get('owner')).digest(), None

Having that indexes in your database will allow you to query for single user and for items of that user. Isn’t it simple? As you can see, index in CodernityDB is not an index that you probably get used to. It’s much more.

How an index code is processed by CodernityDB?

When you provide CodernityDB with an index, it uses getsource function from inspect module to get index code. This means, that after you call add_index function, it will look for the index class in current scope, take the whole class code as it is (including intends) and place it inside it’s own code. Hence there are few cons you have to bear in mind:
  • You can not generate class code on the fly inside, let’s say, a function, like that:
def create_index(self,a):
    class MyIndex(HashIndex):
        def __init__(self, *args, **kwargs):
            kwargs['key_format'] = 'I'
            super(MyIndex, self).__init__(*args, **kwargs)
        def make_key_value(self,data):
            if data['a'] == a:
                return data['b'], None
        def make_key(self,key):
            return key
    return MyIndex

Despite of code being correct in python terms, it will produce an error in CodernityDB, since class isn’t defined in proper scope.

  • You can not provide index class code with a variable defined outside this class:
a = 5
class MyIndex(HashIndex):
    def __init__(self, *args, **kwargs):
        kwargs['key_format'] = 'I'
        super(MyIndex, self).__init__(*args, **kwargs)
    def make_key_value(self,data):
        if data['a'] == a:
            return data['b'], None
    def make_key(self,key):
            return key

Even if now class is in proper scope, the example won’t work, because variable a isn’t known to CodernityDB.

Sharding in indexes

For advanced users we have sharded indexes.

All you need to do if you want to use Sharded indexes is just:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
from CodernityDB.database import Database
from CodernityDB.sharded_hash import ShardedUniqueHashIndex, ShardedHashIndex
from CodernityDB.hash_index import HashIndex

import random


class CustomIdSharded(ShardedUniqueHashIndex):

    custom_header = 'from CodernityDB.sharded_hash import ShardedUniqueHashIndex'

    def __init__(self, *args, **kwargs):
        kwargs['sh_nums'] = 10
        super(CustomIdSharded, self).__init__(*args, **kwargs)


class MySharded(ShardedHashIndex):

    custom_header = 'from CodernityDB.sharded_hash import ShardedHashIndex'

    def __init__(self, *args, **kwargs):
        kwargs['sh_nums'] = 10
        kwargs['key_format'] = 'I'
        kwargs['use_make_keys'] = True
        super(MySharded, self).__init__(*args, **kwargs)

    def make_key_value(self, data):
        return data['x'], None

    def calculate_shard(self, key):
        return key % self.sh_nums


y = 1500 * 'y'

db = Database('/tmp/shard')

db.create(with_id_index=False)
db.add_index(CustomIdSharded(db.path, 'id'))
db.add_index(MySharded(db.path, 'x'))


# it makes non sense to use sharding with such small number of records
for x in xrange(10 ** 4):
    db.insert({'x': x, 'y': y})


print db.get('x', random.randint(0, 10 ** 4))['_id']

Warning

Just remember that you have to hardcode ShardIndex parameters (unlike other indexes). So you really should derive from it’s class.

Performance

Consider this script

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
#!/usr/bin/env python

from CodernityDB.database import Database

from CodernityDB.sharded_hash import ShardedUniqueHashIndex, ShardedHashIndex

import shutil
import sys
import time

try:
    shutil.rmtree('/tmp/shard')
except:
    pass

NUM = int(sys.argv[1])

db = Database('/tmp/shard')

if sys.argv[2] == 'sh':
    class MyShard(ShardedUniqueHashIndex):
        custom_header = 'from CodernityDB.sharded_hash import ShardedUniqueHashIndex'

        def __init__(self, *args, **kwargs):
            kwargs['sh_nums'] = 10
            super(MyShard, self).__init__(*args, **kwargs)

    db.create(with_id_index=False)
    db.add_index(MyShard(db.path, 'id'))
else:
    db.create()


st = time.time()

for x in xrange(NUM):
    db.insert({'x': 1})

end = time.time()

print end - st
Number of inserts Time in sharded Time in non sharded
5 000 000 65.405 seconds 74.699 seconds
10 000 000 148.095 seconds 186.383 seconds

As you can see, sharding does matter. It gives you almost 25% performance boost. Totally free. Similar performance boost applies to get operations.

Note

What’s even more important in Sharding is that as you probably already know CodernityDB index metadata stores data position and size by using struct module. By default those fields are I format (unsigned int). So when you need to change that format to Q without sharding, you probably can switch to sharding and still use I format. I format can accept values up to 4294967295 bytes so about 4GB. Having 100 shards will mean that you can index up to 4GB * 100 data.

Note

Currently one index can have up to 255 shards.