Note
At first you should read what’s index in design section.
Currently there are two main Index implementations. You need to know the difference between them before you will choose the base for your Index. Remember that you don’t need to code whole index code, just required methods or even parts of them.
The default implemented indexes uses 2 files. One is for index metadata / structure, the second one is for storage. So the total number of files opened by database is usually number_of_indexes * 2. Indexes makes usage of sparse files to store information.
To see how to write index see below
Note
The object that you pass to database to add index, it’s not the same object that will be database side!
See also
Index functions for description of powerful mechanism of inside index (database) functions.
Using custom index is quite simple. You need to decide if your index needs to be Hash based (details) or Tree based (details for pros and cons)
Warning
Please remember that the object that is finally used by Database is not the object that you passed into!
There is special class attribute to solve import problems in indexes in DB.
This index is based on Hash Table with separate chaining. You can refer also for ISAM.
There are two major implementations
They differs in several places, for details you should read the code of them both.
See also
Note
For api documentation please see HashIndex
See also
Below you will find explained in details parameters for that index type.
It defines what type is your key.
The most important for you as you’re interested to write index for your use is the key size.
For example if you want to use MD5 based index you would need set the key_format to 16s which would set the size for key to 16 characters exactly how long is md5 hash (digest).
An example code for Md5 based index can be found Design, more examples in Examples
Note
For format specification and explanation please visit Python struct documentation
It defines how big results will return index hash function.
Current default is 0xfffff which means 1048575 different hash function results.
Hint
In perfect conditions you will be able to store those number of unique records without conflicts, in practice you will be able to store like 1200 records without conflict with 50% probability (for example birthday problem). Look up when conflict occurs is slower because linked list is traversed. More information about conflicts Hash Index.
Hint
If you want to have index that searches for let’s say True and False values, you should set that parameter to 2. Because the rest values will be not used (however nothing bad will happen).
That function is called by database when inserting new or updating objects in database. It has to return None if index is not matched (not required to operate on it) and 2 values if index is matched. That 2 values are in order: key and value. Please remember that key must fit your entry_line_format.
That function is called when query operations are performed on database. It should format the key correctly to match that one returned by CodernityDB.index.Index.make_key_value()
(for advanced users, please check if key format is not enough for you)
Entry line format contains all metadata required for Hash Index to work.
First You need to decide it will look like. The default is <32s{key}IIcI which means in order:
Note
If you expect that your index might require more than 4294967295 bytes of space or metadata (that’s the max number for I format), change it to Q.
Let’s assume that you want to write hash based index that will separate objects with a % 2 == 0 from a % 2 == 1.
class AIndex(HashIndex):
def __init__(self, *args, **kwargs):
kwargs['key_format'] = '?'
kwargs['hash_lim'] = 2
super(AIndex, self).__init__(*args, **kwargs)
def make_key_value(self, data):
val = data.get('a')
if val is none:
return None
return val % 2, None
def make_key(self, key):
return key
It will allow you to perform for example:
[...]
number_of_zeros = db.count(db.get_many, 'a', 0)
number_of_ones = db.count(db.get_many, 'a', 1)
[...]
Note
Please see Examples for more examples, and CodernityDB.hash_index for full Hash Index documentation
This index is based on B Plus Tree. Duplicate keys are stored inside Tree structure (on leafs/nodes).
See also
Note
For api documentation please see TreeBasedIndex
That index is based on BPlus tree. It’s main advantage is that it stores data in order. And you can make range queries.
class SimpleTreeIndex(TreeBasedIndex):
def __init__(self, *args, **kwargs):
kwargs['node_capacity'] = 13
kwargs['key_format'] = 'I'
super(SimpleTreeIndex, self).__init__(*args, **kwargs)
def make_key_value(self, data):
a_val = data.get('a')
if a_val is not None:
return a_val, None
return None
def make_key(self, key):
return key
It will allow you to perform for example:
[...]
from_3_to_10 = db.get_many('tree', limit=-1, start=3, end=10, inclusive_start=True, inclusive_end=True)
[...]
And you will get all records that have a value from 3 to 10.
Multikeys indexes (aka Multiindex) are indexes where you can have more than one key per database data. They share the same properties as their bases (Hash Index and B Plus Tree Index).
Imagine something like infix search:
class TreeMultiTest(MultiTreeBasedIndex):
custom_header = """from CodernityDB.tree_index import MultiTreeBasedIndex
from itertools import izip"""
def __init__(self, *args, **kwargs):
kwargs['key_format'] = '16s'
super(TreeMultiTest, self).__init__(*args, **kwargs)
self.__l = kwargs.get('w_len', 2)
def make_key_value(self, data):
name = data['w']
l = self.__l
max_l = len(name)
out = set()
for x in xrange(l - 1, max_l):
m = (name, )
for y in xrange(0, x):
m += (name[y + 1:],)
out.update(set(''.join(x).rjust(16, '_').lower() for x in izip(*m))) #ignore import error
return out, dict(w=name)
def make_key(self, key):
return key.rjust(16, '_').lower()
By using that index you will be able to perform infix search over all words in your database. Only one difference from non multi index is that make_key_value has to return iterable (the best will be set because it has unique values). Then you can easily run something like that:
db = Database('/tmp/multi')
db.create()
db.add_index(TreeMultiTest(db.path, "words"))
db.insert(dict(w='Codernity'))
print db.get('words', 'dern')['w'] # "Codernity"
print db.get('words', 'cod')['w'] # "Codernity"
As you can see implementing infix/suffix/prefix search mechanism in CodernityDB is very easy.
Note
Multiindex requires more time to insert data. Get speed is exactly as fast as in non multiindex (same rules applies to both of them).
Obviously that’s not only one use case for that indexes, it’s just probably the most obvious usage example.
Currently both Hash and Tree indexes have multiindex implementations: MultiHashIndex and MultiTreeBasedIndex (yes they are just prefixed by word Multi).
Quite important thing in CodernityDB are index functions. You can do with them anything you want they have access to database object, so they can perform operations on multiple indexes. If you want join like operation, you should write function. Then you will be able to run that function database side when using CodernityDB-HTTP. The only mandatory argument for that kind of function is db, the rest are function arguments.
Writing function is easy see an example there:
def run_timeline(self, db, user, limit):
u = db.get('user', user)
it = db.get_many(self.name, user, end=10 ** 11, limit=limit, with_doc=True)
for curr in it:
curr['username'] = user
curr['email'] = u['email']
curr['pub_date'] = curr['doc']['pub_date']
curr['text'] = curr['doc']['text']
del curr['doc']
yield curr
That function performs simple JOIN operation (that JOIN known from SQL databases). As you can see it’s exactly the same as you would code in your code to archive that.
Function should start it’s name from run_ then you can call it:
gen = db.run("index", "timeline", "user_a", 10)
As mentioned before, while you work in embedded mode it makes no big difference, but when using CodernityDB-HTTP it makes huge.
Note
Please remember that CodernityDB is not relational Database, forcing it to work in that model will usually work, but it’s not recommended. You should try to denormalize it (Database normalization).
Note
Please see Examples for more examples, and CodernityDB.tree_index for full documentation
Do I really need to code indexes in Python, is there an easier way to create indexes?
You bet there is! We prepared special, simplified mode for creating indexes. That code will be translated to Python code, and then used exactly in the same way as the rest indexes (so that simple indexes are exactly the same fast as pure Python ones). They are just simple by name not by possibilities.
Note
Don’t be surprised, if we will name it SimpleIndex in several places there.
Usage of this mode is really basic, you just need to provide 2 (or optionally 3) things:
name = MyTestIndex
type = HashIndex
key_format = I
etc. etc. ...
make_key_value:
a > 1: 1, None
a, None
make_key:
key > 1: 1
key
Syntax of function body is really basic, every line preceded by a statement with colon at the end means that if conditions before colon are met, function will return everything that follows colon. If there is no colon, value will be always returned. Of course the order of lines actually matters, so if you provide body like that:
a > 1: 1, None
a > 2: 2, None
The second value will be never returned (because if a is less then 1 it’s for sure less than 2). Every name will be look for in dictionaries given to functions, so body like:
a > 1: a, None
will generate python code like that:
if data["a"] > 1:
return data["a"], None
That’s everything you need to know to work with our simplified index creator, you just need to always provide name and type in index properties and provide body for make_key_value, which has to return always two values (the 2nd has to be a dictionary or None). Here you have some examples and their equivalents in python. Remember that this simplified creator doesn’t provide python power, so if you want to write more sophisticated index, you will have to learn python.
name = Test
type = HashIndex
key_format = 16s
make_key_value:
md5(a),None
make_key:
md5(key)
is an equivalent to:
class Test( HashIndex ):
def __init__(self,*args,**kwargs):
kwargs["key_format"] = '16s'
super( Test ,self).__init__(*args,**kwargs)
def make_key_value(self,data):
return md5 ( data["a"] ) .digest() , None
def make_key(self,key):
return md5 ( key ) .digest()
Please note that Python None value can be written as: none, None, null.
Supported logic operators:
Functions that you can use in make_key and make_key_value:
it will convert value to string
| Parameters: | value – value to convert to string |
|---|---|
| Returns: | string value of value |
| Return type: | string |
it will return length of a value
| Parameters: | value – a value to check length |
|---|---|
| Returns: | length of value |
| Return type: | integer |
it will return md5 value of a string
| Parameters: | value (string) – string value to get md5 from it |
|---|---|
| Returns: | md5 sum of value |
| Return type: | string |
it will return fixed string length. Value shorter than length will be right filled with _.
| Parameters: |
|
|---|---|
| Returns: | fixed size string |
| Return type: | string |
it will generate all possible infixes of value not shorter than min_len and not longer than max_len while all of them will have fixed length defined in fixed_len (which works exactly as fix_r)
| Parameters: |
|
|---|---|
| Returns: | set containing fixed size infixes |
| Return type: | set |
it will generate all possible prefixes of value not shorter than min_len and not longer than max_len while all of them will have fixed length defined in fixed_len (which works exactly as fix_r)
| Parameters: |
|
|---|---|
| Returns: | set containing fixed size prefixes |
| Return type: | set |
it will generate all possible suffixes of value not shorter than min_len and not longer than max_len while all of them will have fixed length defined in fixed_len (which works exactly as fix_r)
| Parameters: |
|
|---|---|
| Returns: | set containing fixed size suffixes |
| Return type: | set |
Note
Obviously you can use that simple indexes in CodernityDB-HTTP without any problem.
Note
Error reporting / handling system in that mode will tell you exactly what’s wrong with your code.
OK, I got it, but can I store more than one data type in Database. Is there something like table or collection ?
Note
In CodernityDB Demos you can find minitwit example which is rewrite from Sqlite application.
Sure! You can use Index mechanism do to it. As it has been mentioned before, Index mechanism in CodernityDB is like read only Table in SQL databases (see index design). So all you need is to define how your records will differ each other.
Let’s assume that you want to users and users and some data that belongs to user. You will probably want to be able to get all users, and all things that belongs to him, right? So.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | class AllUsers(HashIndex):
def __init__(self, *args, **kwargs):
kwargs['key_format'] = '16s'
super(AllUsers, self).__init__(*args, **kwargs)
def make_key(self, key):
return md5(key).digest()
def make_key_value(self, data):
if data.get('_t') == 'user':
return md5(data.get('name')).digest(), None
class AllItems(HashIndex):
def __init__(self, *args, **kwargs):
kwargs['key_format'] = '16s'
super(AllItems, self).__init__(*args, **kwargs)
def make_key(self, key):
return md5(key).digest()
def make_key_value(self, data):
if data.get('_t') == 'item':
return md5(data.get('owner')).digest(), None
|
Having that indexes in your database will allow you to query for single user and for items of that user. Isn’t it simple? As you can see, index in CodernityDB is not an index that you probably get used to. It’s much more.
When you provide CodernityDB with an index, it uses getsource function from inspect module to get index code. This means, that after you call add_index function, it will look for the index class in current scope, take the whole class code as it is (including intends) and place it inside it’s own code. Hence there are few cons you have to bear in mind:
def create_index(self,a):
class MyIndex(HashIndex):
def __init__(self, *args, **kwargs):
kwargs['key_format'] = 'I'
super(MyIndex, self).__init__(*args, **kwargs)
def make_key_value(self,data):
if data['a'] == a:
return data['b'], None
def make_key(self,key):
return key
return MyIndex
Despite of code being correct in python terms, it will produce an error in CodernityDB, since class isn’t defined in proper scope.
- You can not provide index class code with a variable defined outside this class:
a = 5
class MyIndex(HashIndex):
def __init__(self, *args, **kwargs):
kwargs['key_format'] = 'I'
super(MyIndex, self).__init__(*args, **kwargs)
def make_key_value(self,data):
if data['a'] == a:
return data['b'], None
def make_key(self,key):
return key
Even if now class is in proper scope, the example won’t work, because variable a isn’t known to CodernityDB.
For advanced users we have sharded indexes.
All you need to do if you want to use Sharded indexes is just:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | from CodernityDB.database import Database
from CodernityDB.sharded_hash import ShardedUniqueHashIndex, ShardedHashIndex
from CodernityDB.hash_index import HashIndex
import random
class CustomIdSharded(ShardedUniqueHashIndex):
custom_header = 'from CodernityDB.sharded_hash import ShardedUniqueHashIndex'
def __init__(self, *args, **kwargs):
kwargs['sh_nums'] = 10
super(CustomIdSharded, self).__init__(*args, **kwargs)
class MySharded(ShardedHashIndex):
custom_header = 'from CodernityDB.sharded_hash import ShardedHashIndex'
def __init__(self, *args, **kwargs):
kwargs['sh_nums'] = 10
kwargs['key_format'] = 'I'
kwargs['use_make_keys'] = True
super(MySharded, self).__init__(*args, **kwargs)
def make_key_value(self, data):
return data['x'], None
def calculate_shard(self, key):
return key % self.sh_nums
y = 1500 * 'y'
db = Database('/tmp/shard')
db.create(with_id_index=False)
db.add_index(CustomIdSharded(db.path, 'id'))
db.add_index(MySharded(db.path, 'x'))
# it makes non sense to use sharding with such small number of records
for x in xrange(10 ** 4):
db.insert({'x': x, 'y': y})
print db.get('x', random.randint(0, 10 ** 4))['_id']
|
Warning
Just remember that you have to hardcode ShardIndex parameters (unlike other indexes). So you really should derive from it’s class.
Consider this script
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | #!/usr/bin/env python
from CodernityDB.database import Database
from CodernityDB.sharded_hash import ShardedUniqueHashIndex, ShardedHashIndex
import shutil
import sys
import time
try:
shutil.rmtree('/tmp/shard')
except:
pass
NUM = int(sys.argv[1])
db = Database('/tmp/shard')
if sys.argv[2] == 'sh':
class MyShard(ShardedUniqueHashIndex):
custom_header = 'from CodernityDB.sharded_hash import ShardedUniqueHashIndex'
def __init__(self, *args, **kwargs):
kwargs['sh_nums'] = 10
super(MyShard, self).__init__(*args, **kwargs)
db.create(with_id_index=False)
db.add_index(MyShard(db.path, 'id'))
else:
db.create()
st = time.time()
for x in xrange(NUM):
db.insert({'x': 1})
end = time.time()
print end - st
|
| Number of inserts | Time in sharded | Time in non sharded |
|---|---|---|
| 5 000 000 | 65.405 seconds | 74.699 seconds |
| 10 000 000 | 148.095 seconds | 186.383 seconds |
As you can see, sharding does matter. It gives you almost 25% performance boost. Totally free. Similar performance boost applies to get operations.
Note
What’s even more important in Sharding is that as you probably already know CodernityDB index metadata stores data position and size by using struct module. By default those fields are I format (unsigned int). So when you need to change that format to Q without sharding, you probably can switch to sharding and still use I format. I format can accept values up to 4294967295 bytes so about 4GB. Having 100 shards will mean that you can index up to 4GB * 100 data.
Note
Currently one index can have up to 255 shards.