Add New Document Store#

DocumentArray can be easily extended to support a new document store. As we have seen in the previous chapters, a document store can be a SQL/NoSQL/vector database, or even an in-memory data structure.

For DocArray, the motivation for onboarding a new store is often:

  • having persistence that better fits the use case;

  • pulling from an existing data source;

  • supporting advanced query languages, e.g. nearest-neighbor retrieval.

For the database vendor, the motivation is often:

  • having a powerful, well-designed and well-maintained Python client for your document store;

  • plugging your document store into the Jina AI ecosystem (e.g. Jina, Hub, CLIP-as-service, Finetuner, etc.) and creating synergy with Jina AI.

After the extension, users can enjoy the convenient and powerful DocumentArray API on top of your document store. The user experience is the same as with a regular DocumentArray; no extra learning is required.

This chapter walks you through adding a new document store. Specifically, we extend DocumentArray to support a new document store called mydocstore. The final usage will look like the following:

from docarray import DocumentArray

da = DocumentArray(storage='mydocstore', config={...})

Let’s get started!

Step 1: create the folder#

Go to the docarray/array/storage folder and create a sub-folder for your document store. Let’s call it mydocstore. You will need to create four empty files in that folder:

README.md
docarray
    |
    |--- array
            |
            |--- storage
                    |
                    |--- mydocstore
                            |
                            |--- __init__.py
                            |--- getsetdel.py
                            |--- seqlike.py
                            |--- backend.py

These four files contain the interfaces needed to make the extension work with DocumentArray. Additionally, if your storage backend supports approximate nearest-neighbor search, you can include another file, find.py.

Step 2: implement getsetdel.py#

Your getsetdel.py should look like the following:

from docarray.array.storage.base.getsetdel import BaseGetSetDelMixin
from docarray import Document


class GetSetDelMixin(BaseGetSetDelMixin):
    def _get_doc_by_id(self, _id: str) -> 'Document':
        # to be implemented
        ...

    def _del_doc_by_id(self, _id: str):
        # to be implemented
        ...

    def _set_doc_by_id(self, _id: str, value: 'Document'):
        # to be implemented
        ...

    def _load_offset2ids(self):
        # to be implemented
        ...

    def _save_offset2ids(self):
        # to be implemented
        ...

You will need to implement the above five functions, which correspond to the logic of getting, setting and deleting items via a string .id. They are essential to ensure DocumentArray works.

Note that DocumentArray maintains an offset2ids mapping to allow list-like behaviour. This mapping is inherited from BaseGetSetDelMixin. Therefore, you need to implement methods to persist this mapping if you also want to persist the ordering of Documents inside the storage.

Keep in mind that _del_doc_by_id and _set_doc_by_id must not update offset2ids; we handle that for you at an upper level. Also, make sure that _set_doc_by_id performs an upsert operation and removes the old ID (_id) in case value.id is different from _id.
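To make the contract of the five essentials concrete, here is a minimal sketch backed by a plain dict. It does not import DocArray: Doc is a stand-in for Document, DictGetSetDel is a hypothetical class that does not inherit from BaseGetSetDelMixin, and the offset2ids mapping is simplified to a plain list. Treat it as an illustration of the upsert and "do not touch offset2ids" rules, not as the way any real backend is written.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for docarray.Document; only the fields this sketch needs."""
    text: str = ''
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class DictGetSetDel:
    """Hypothetical dict-backed store implementing the five essentials."""

    def __init__(self):
        self._store: Dict[str, Doc] = {}      # plays the role of the database
        self._saved_offsets: List[str] = []   # plays the role of the persisted ordering
        self._offset2ids: List[str] = []      # simplified; maintained by the upper level

    def _get_doc_by_id(self, _id: str) -> Doc:
        return self._store[_id]  # raises KeyError for unknown IDs

    def _del_doc_by_id(self, _id: str):
        del self._store[_id]  # offset2ids is deliberately left untouched

    def _set_doc_by_id(self, _id: str, value: Doc):
        # Upsert: drop the old key when the new Document carries a different ID.
        if _id != value.id:
            self._store.pop(_id, None)
        self._store[value.id] = value

    def _load_offset2ids(self):
        self._offset2ids = list(self._saved_offsets)

    def _save_offset2ids(self):
        self._saved_offsets = list(self._offset2ids)


store = DictGetSetDel()
store._set_doc_by_id('a', Doc(text='first', id='a'))
store._set_doc_by_id('a', Doc(text='second', id='b'))  # 'a' is replaced by 'b'
```

After the second call the store holds a single Document under the ID b, which is the behaviour the upsert rule asks for.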

Tip

Let’s call the above five functions the essentials.

If you aim for high performance, it is recommended to implement the other methods directly rather than relying on your essentials. They are: _get_docs_by_ids, _del_docs_by_ids, _clear_storage, _set_doc_value_pairs, _set_doc_value_pairs_nested, _set_docs_by_ids. You can get their full signatures from BaseGetSetDelMixin. These functions define more fine-grained get/set/delete logic that is frequently used in DocumentArray.

Implementing them is fully optional, and you can implement only some of them rather than all. Any method you do not implement falls back to a generic-but-slow version that is based on your five essentials.
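The difference between the generic fallback and a dedicated bulk method can be sketched as follows. BaseStore, SlowStore and FastStore are hypothetical stand-ins that only mimic the pattern; the signatures are simplified, and the real ones should be taken from BaseGetSetDelMixin. The round_trips counter simulates calls to a remote backend.

```python
from typing import Dict, Iterable, List


class BaseStore:
    """Stand-in for the base mixin: the bulk method falls back to the essential."""

    def _get_doc_by_id(self, _id: str) -> str:
        raise NotImplementedError

    def _get_docs_by_ids(self, ids: Iterable[str]) -> List[str]:
        # Generic version: one call to the essential per ID.
        return [self._get_doc_by_id(_id) for _id in ids]


class SlowStore(BaseStore):
    """Implements only the essential and inherits the generic bulk method."""

    def __init__(self, rows: Dict[str, str]):
        self._rows = rows
        self.round_trips = 0  # counts simulated calls to the backend

    def _get_doc_by_id(self, _id: str) -> str:
        self.round_trips += 1
        return self._rows[_id]


class FastStore(SlowStore):
    """Overrides the bulk method with a single simulated batch request."""

    def _get_docs_by_ids(self, ids: Iterable[str]) -> List[str]:
        self.round_trips += 1
        return [self._rows[_id] for _id in ids]


rows = {'a': 'doc-a', 'b': 'doc-b', 'c': 'doc-c'}
slow, fast = SlowStore(rows), FastStore(rows)
slow_result = slow._get_docs_by_ids(['a', 'b', 'c'])
fast_result = fast._get_docs_by_ids(['a', 'b', 'c'])
```

Both stores return the same Documents, but the fallback costs one simulated round trip per ID while the override costs one in total.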

See also

As a reference, check out GetSetDelMixin to see how we implement this for SQLite.

Step 3: implement seqlike.py#

Your seqlike.py should look like the following:

from typing import Iterable, Iterator, Union, TYPE_CHECKING
from docarray.array.storage.base.seqlike import BaseSequenceLikeMixin

if TYPE_CHECKING:
    from docarray import Document


class SequenceLikeMixin(BaseSequenceLikeMixin):
    def __eq__(self, other):
        ...

    def __contains__(self, x: Union[str, 'Document']):
        ...

    def __repr__(self):
        ...

    def __add__(self, other: Union['Document', Iterable['Document']]):
        ...

    def insert(self, index: int, value: 'Document'):
        # Optional. By default, this will add a new item and update offset2ids
        # if you want to customize this, make sure to handle offset2ids
        ...

    def append(self, value: 'Document'):
        # Optional. Override this if you have a better implementation than inserting at the last position
        ...

    def extend(self, values: Iterable['Document']) -> None:
        # Optional. Override this if you have better implementation than appending one by one
        ...

    def __len__(self):
        # Optional. By default, this will rely on offset2ids to get the length
        ...

    def __iter__(self) -> Iterator['Document']:
        # Optional. By default, this will rely on offset2ids to iterate
        ...

Most of the interfaces come from Python’s standard MutableSequence.
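For readers less familiar with MutableSequence, the following standard-library sketch shows why append and extend are optional: once insert and the item accessors exist, collections.abc.MutableSequence derives the rest. TinySequence is a hypothetical class used for illustration only and is unrelated to DocArray's own base mixin.

```python
from collections.abc import MutableSequence


class TinySequence(MutableSequence):
    """Implements only the five abstract methods of MutableSequence."""

    def __init__(self):
        self._items = []

    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        self._items[index] = value

    def __delitem__(self, index):
        del self._items[index]

    def __len__(self):
        return len(self._items)

    def insert(self, index, value):
        self._items.insert(index, value)


seq = TinySequence()
seq.append('a')          # derived: inserts at the last position
seq.extend(['b', 'c'])   # derived: appends one by one
has_b = 'b' in seq       # derived: __contains__ iterates over the items
```

The derived versions are correct but generic, which is why a backend with a cheaper bulk write may still want to override extend.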

See also

As a reference, to see how we implement this for SQLite, check out SequenceLikeMixin.

Step 4: implement backend.py#

Your backend.py should look like the following:

from typing import Optional, TYPE_CHECKING, Union, Dict
from dataclasses import dataclass

from docarray.array.storage.base.backend import BaseBackendMixin

if TYPE_CHECKING:
    from docarray.typing import (
        DocumentArraySourceType,
    )


@dataclass
class MyDocStoreConfig:
    config1: str
    config2: str
    config3: Dict
    ...


class BackendMixin(BaseBackendMixin):
    def _init_storage(
        self,
        _docs: Optional['DocumentArraySourceType'] = None,
        config: Optional[Union[MyDocStoreConfig, Dict]] = None,
        **kwargs
    ):
        super()._init_storage(_docs, config, **kwargs)
        ...

_init_storage is a very important function that is called during DocumentArray construction. You will need to handle different construction & copy behaviors in this function.

MyDocStoreConfig is a dataclass that holds the configs. You can expose arguments of your document store in this dataclass and allow users to customize them. In the _init_storage function, you need to parse config from either a MyDocStoreConfig object or a Dict.
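One possible way to normalize the config argument is sketched below. The fields host, port and extra are hypothetical placeholders for whatever your store needs, and parse_config is an illustrative helper name, not a DocArray function; inside a real backend this logic would live in _init_storage.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional, Union


@dataclass
class MyDocStoreConfig:
    # Hypothetical options; a real store would expose its own.
    host: str = 'localhost'
    port: int = 1234
    extra: Dict = field(default_factory=dict)


def parse_config(
    config: Optional[Union[MyDocStoreConfig, Dict]] = None,
) -> MyDocStoreConfig:
    """Return a MyDocStoreConfig regardless of how the user passed the config."""
    if config is None:
        return MyDocStoreConfig()
    if isinstance(config, MyDocStoreConfig):
        return config
    if isinstance(config, dict):
        known = {f.name for f in fields(MyDocStoreConfig)}
        unknown = set(config) - known
        if unknown:
            # Fail early on typos instead of silently ignoring them.
            raise ValueError(f'unknown config keys: {sorted(unknown)}')
        return MyDocStoreConfig(**config)
    raise TypeError(f'unsupported config type: {type(config).__name__}')


from_dict = parse_config({'port': 9999})
from_default = parse_config()
```

Rejecting unknown keys is a design choice made for this sketch; a backend could just as well ignore or forward them.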

See also

As a reference, check out BackendMixin to see how we implement this for SQLite.

Step 5: implement find.py#

If your storage backend supports approximate nearest neighbor search, you can allow users to use this feature within docarray. To do so, add a find.py file that looks like the following:

from typing import TYPE_CHECKING, TypeVar, List, Union

if TYPE_CHECKING:
    import numpy as np
    from docarray import DocumentArray

    # Define the expected input type that your ANN search supports
    MyDocumentStoreArrayType = TypeVar('MyDocumentStoreArrayType', np.ndarray, ...)


class FindMixin:
    def _find_similar_vectors(
        self, query: 'MyDocumentStoreArrayType', limit=10
    ) -> 'DocumentArray':
        """Expects a MyDocumentStoreArrayType vector query and should return a DocumentArray of results retrieved from
        the storage backend"""
        ...

    def _find(
        self, query: 'MyDocumentStoreArrayType', limit: int = 10, **kwargs
    ) -> Union['DocumentArray', List['DocumentArray']]:
        """Returns `limit` approximate nearest neighbors given a batch of input queries.
        If the query is a single query, should return a DocumentArray, otherwise a list of DocumentArrays containing
        the closest Documents for each query.
        """
        ...
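The single-query versus batch contract of _find can be sketched without any vector database. The example below performs an exact brute-force search (not approximate), uses plain Python lists instead of numpy arrays, and returns (id, distance) tuples instead of a DocumentArray. BruteForceFind is a hypothetical class that only mirrors the two method names from the skeleton above.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hits = List[Tuple[str, float]]


class BruteForceFind:
    """Exact nearest-neighbor search over an in-memory dict of vectors."""

    def __init__(self, vectors: Dict[str, Vector]):
        self._vectors = vectors

    def _find_similar_vectors(self, query: Vector, limit: int = 10) -> Hits:
        # Score every stored vector by Euclidean distance, closest first.
        scored = [
            (doc_id, math.dist(query, vec)) for doc_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:limit]

    def _find(self, query, limit: int = 10, **kwargs):
        # A flat vector is a single query; a sequence of vectors is a batch.
        is_single = len(query) > 0 and not isinstance(query[0], (list, tuple))
        if is_single:
            return self._find_similar_vectors(query, limit=limit)
        return [self._find_similar_vectors(q, limit=limit) for q in query]


index = BruteForceFind({'a': [0.0, 0.0], 'b': [1.0, 0.0], 'c': [5.0, 5.0]})
single = index._find([0.9, 0.0], limit=2)               # one list of hits
batch = index._find([[0.0, 0.1], [4.0, 4.0]], limit=1)  # one list of hits per query
```

A real backend would replace the loop in _find_similar_vectors with a call to its own index and wrap the hits in a DocumentArray.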

Step 6: summarize everything in __init__.py#

Your __init__.py should look like the following:

from abc import ABC

from .backend import BackendMixin, MyDocStoreConfig
from .getsetdel import GetSetDelMixin
from .seqlike import SequenceLikeMixin

__all__ = ['StorageMixins', 'MyDocStoreConfig']


class StorageMixins(BackendMixin, GetSetDelMixin, SequenceLikeMixin, ABC):
    ...

Copying and pasting this as-is will do the job.

If you have implemented a find.py module, make sure to also import and inherit the FindMixin:

from .find import FindMixin


class StorageMixins(FindMixin, BackendMixin, GetSetDelMixin, SequenceLikeMixin, ABC):
    ...
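How the mixins combine can be checked with a small, self-contained sketch. The classes below are empty stand-ins that reuse the names from this chapter; they only demonstrate Python's method resolution order for such a class and say nothing about the real implementations.

```python
from abc import ABC


class BackendMixin:
    def _init_storage(self):
        return 'backend ready'


class GetSetDelMixin:
    ...


class SequenceLikeMixin:
    ...


class FindMixin:
    def _find(self, query, limit: int = 10, **kwargs):
        return []


class StorageMixins(FindMixin, BackendMixin, GetSetDelMixin, SequenceLikeMixin, ABC):
    ...


# Attribute lookup walks the bases from left to right.
mro_names = [klass.__name__ for klass in StorageMixins.__mro__]
```

Because FindMixin comes first in the list of bases, its methods take precedence if another mixin defines a method with the same name.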

Step 7: subclass from DocumentArray#

Create a file mydocstore.py under docarray/array/:

README.md
docarray
    |
    |--- array
            |
            |--- mydocstore.py
            |--- storage
                    |
                    |--- mydocstore
                            |
                            |--- __init__.py
                            |--- getsetdel.py
                            |--- seqlike.py
                            |--- backend.py

The file content should look like the following:

from .document import DocumentArray

from .storage.mydocstore import StorageMixins, MyDocStoreConfig

__all__ = ['MyDocStoreConfig', 'DocumentArrayMyDocStore']


class DocumentArrayMyDocStore(StorageMixins, DocumentArray):
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)

Step 8: add entrypoint to DocumentArray#

We are almost there! Now we need to add the entrypoint to the DocumentArray constructor to allow users to use the mydocstore backend as follows:

from docarray import DocumentArray

da = DocumentArray(storage='mydocstore')

Go to docarray/array/document.py and add mydocstore there:

class DocumentArray(AllMixins, BaseDocumentArray):
    
    ...
    
    def __new__(cls, *args, storage: str = 'memory', **kwargs) -> 'DocumentArrayLike':
        if cls is DocumentArray:
            if storage == 'mydocstore':
                from .mydocstore import DocumentArrayMyDocStore

                instance = super().__new__(DocumentArrayMyDocStore)
            elif storage == 'memory':
                from .memory import DocumentArrayInMemory
                ...  

Done! Now DocumentArray(storage='mydocstore') should give you a DocumentArrayMyDocStore!
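The dispatch pattern used in Steps 7 and 8 can be tried in isolation. In this sketch Array, InMemoryArray and MyDocStoreArray are hypothetical stand-ins for DocumentArray and its storage-specific subclasses; the real classes carry mixins and lazy imports that are left out here.

```python
class Array:
    """Stand-in for DocumentArray: __new__ picks the subclass from `storage`."""

    def __new__(cls, *args, storage: str = 'memory', **kwargs):
        if cls is Array:
            # Called as Array(...): dispatch on the storage name.
            if storage == 'memory':
                return super().__new__(InMemoryArray)
            if storage == 'mydocstore':
                return super().__new__(MyDocStoreArray)
            raise ValueError(f'unsupported storage: {storage!r}')
        # Called on a subclass directly: no dispatch needed.
        return super().__new__(cls)

    def __init__(self, *args, storage: str = 'memory', **kwargs):
        # Accepts the same arguments as __new__; a real backend would connect here.
        self.config = kwargs.get('config')


class InMemoryArray(Array):
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)


class MyDocStoreArray(Array):
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)


default = Array()
custom = Array(storage='mydocstore', config={'key': 'value'})
direct = MyDocStoreArray()
```

The subclasses override __new__ so that constructing them directly skips the storage argument, which mirrors the DocumentArrayMyDocStore.__new__ shown in Step 7.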

On pull request: add tests and type-hint#

You are welcome to contribute your extension back to DocArray. You will need to include DocumentArrayMyDocStore in at least the following tests:

tests/unit/array/test_advance_indexing.py
tests/unit/array/test_sequence.py
tests/unit/array/test_construct.py

Please also add an @overload type hint to docarray/array/document.py:

class DocumentArray(AllMixins, BaseDocumentArray):
    ...

    @overload
    def __new__(
        cls,
        _docs: Optional['DocumentArraySourceType'] = None,
        storage: str = 'mydocstore',
        config: Optional[Union['MyDocStoreConfig', Dict]] = None,
    ) -> 'DocumentArrayMyDocStore':
        """Create a MyDocStore-powered DocumentArray object."""
        ...

Now you are ready to commit the contribution and open a pull request.