Serialization#

DocArray is designed to be “ready-to-wire” at any time. Serialization is therefore essential: DocumentArray provides multiple serialization methods that let you transfer a DocumentArray object over the network and across different microservices. You can also store/load DocumentArray objects to/from disk.

  • JSON string: .from_json()/.to_json()

    • Pydantic model: .from_pydantic_model()/.to_pydantic_model()

  • Bytes (compressed): .from_bytes()/.to_bytes()

    • Disk serialization: .save_binary()/.load_binary()

  • Base64 (compressed): .from_base64()/.to_base64()

  • Protobuf Message: .from_protobuf()/.to_protobuf()

  • Python List: .from_list()/.to_list()

  • Pandas Dataframe: .from_dataframe()/.to_dataframe()

  • Cloud: .push()/.pull()

From/to JSON#

Tip

If you are building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, it is highly recommended to check out FastAPI/Pydantic and follow the methods there.

Important

Depending on which protocol you use, this feature requires pydantic or protobuf dependency. You can do pip install "docarray[common]" to install it.

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])
da.to_json()
[{"id": "a677577877b611eca3811e008a366d49", "parent_id": null, "granularity": null, "adjacency": null, "blob": null, "tensor": null, "mime_type": "text/plain", "text": "hello", "weight": null, "uri": null, "tags": null, "offset": null, "location": null, "embedding": null, "modality": null, "evaluations": null, "scores": null, "chunks": null, "matches": null}, {"id": "a67758f477b611eca3811e008a366d49", "parent_id": null, "granularity": null, "adjacency": null, "blob": null, "tensor": null, "mime_type": "text/plain", "text": "world", "weight": null, "uri": null, "tags": null, "offset": null, "location": null, "embedding": null, "modality": null, "evaluations": null, "scores": null, "chunks": null, "matches": null}]
da_r = DocumentArray.from_json(da.to_json())

da_r.summary()
                  Documents Summary                   
                                                      
  Length                 2                            
  Homogenous Documents   True                         
  Common Attributes      ('id', 'mime_type', 'text')  
                                                      
                     Attributes Summary                     
                                                            
  Attribute   Data type   #Unique values   Has empty value  
 ────────────────────────────────────────────────────────── 
  id          ('str',)    2                False            
  mime_type   ('str',)    1                False            
  text        ('str',)    2                False            

See also

To load an arbitrary JSON file, please set protocol=None as described here.

More parameters and usages can be found in the Document-level From/to JSON.

From/to bytes#

Important

Depending on your values of protocol and compress arguments, this feature may require protobuf and lz4 dependencies. You can do pip install "docarray[full]" to install it.

Serialization into bytes often yields a more compact representation than JSON. As with Document serialization, a DocumentArray can be serialized with different protocol and compress combinations. In its simplest form:

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])
da.to_bytes()
b'\x80\x03cdocarray.array.document\nDocumentArray\nq\x00)\x81q\x01}q\x02(X\x05\x00\x00\x00_dataq\x03]q\x04(cdocarray.document\nDocument\nq\x05) ...
da_r = DocumentArray.from_bytes(da.to_bytes())

da_r.summary()
                  Documents Summary                   
                                                      
  Length                 2                            
  Homogenous Documents   True                         
  Common Attributes      ('id', 'mime_type', 'text')  
                                                      
                     Attributes Summary                     
                                                            
  Attribute   Data type   #Unique values   Has empty value  
 ────────────────────────────────────────────────────────── 
  id          ('str',)    2                False            
  mime_type   ('str',)    1                False            
  text        ('str',)    2                False      

Tip

If you go with the default protocol and compress settings, you can simply use bytes(da), which is more Pythonic.
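
For example, a round trip under the default settings looks like this:

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])

# bytes(da) is equivalent to da.to_bytes() with the default protocol and compress
da_r = DocumentArray.from_bytes(bytes(da))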

The table below summarizes the supported serialization protocols:

  protocol=...              Description                                   Portable to other languages if they implement
                                                                          Remarks
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  pickle-array (default)    Serialize the whole array in one-shot using   Often fastest. Not portable to other languages.
                            Python pickle.                                Insecure in production.
  protobuf-array            Serialize the whole array using               Portable to other languages if they implement
                            DocumentArrayProto.                           DocumentArrayProto. 2GB max-size (pre-compression)
                                                                          restriction by Protobuf.
  pickle                    Serialize elements one-by-one using Python    Allow streaming. Not portable to other languages.
                            pickle.                                       Insecure in production.
  protobuf                  Serialize elements one-by-one using           Allow streaming. Portable to other languages if
                            DocumentProto.                                they implement DocumentProto. No max-size
                                                                          restriction.

For compressions, the following algorithms are supported: lz4, bz2, lzma, zlib, gzip. The most frequently used ones are lz4 (fastest) and gzip (most widely used).

If you specify a non-default protocol and compress in to_bytes(), you need to specify the same values in from_bytes().
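
For example, a sketch of a round trip with non-default settings (protobuf plus gzip is just one valid combination from the table above):

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])

# the same protocol and compress must be passed to both ends
data = da.to_bytes(protocol='protobuf', compress='gzip')
da_r = DocumentArray.from_bytes(data, protocol='protobuf', compress='gzip')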

Depending on the use case, you can choose the one that works best for you. Here is a benchmark on serializing a DocumentArray with one million near-empty Documents (i.e. initialized with DocumentArray.empty(...), where each Document has only an id).

[Figures: benchmark-size.svg and benchmark-time.svg, comparing serialized size and serialization time across protocols and compressions]

The benchmark was conducted on the codebase of Jan. 5, 2022.

Depending on how you interpret the results, the figures above can over- or under-estimate the serialization latency: one may argue that near-empty Documents are unrealistic, but serializing a DocumentArray with one million Documents is equally unrealistic. In practice, the DocumentArrays passed between microservices are relatively small, say in the thousands, to better overlap network latency with computational overhead.

Wire format of pickle and protobuf#

When setting protocol=pickle or protobuf, the resulting bytes look like the following:

--------------------------------------------------------------------------------------------------------
|   version    |   len(docs)    |  doc1_bytes  |  doc1.to_bytes()  |  doc2_bytes  |  doc2.to_bytes() ...
---------------------------------------------------------------------------------------------------------
| Fixed-length |  Fixed-length  | Fixed-length |  Variable-length  | Fixed-length |  Variable-length ...
--------------------------------------------------------------------------------------------------------
      |               |               |                  |                 |               |
    uint8           uint64          uint32        Variable-length         ...             ...

Here version is a uint8 that specifies the version of the DocumentArray serialization format, followed by len(docs), a uint64 that specifies the number of serialized Documents. Afterwards, doc1_bytes states how many bytes are used to serialize doc1, followed by doc1.to_bytes(), the byte data of the Document itself. The pattern doc_k_bytes and doc_k.to_bytes() repeats len(docs) times.
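
As a rough illustration, here is a hedged sketch of reading that header from a file written with protocol='protobuf' and no compression. The file name and the big-endian byte order are assumptions made for illustration; consult the DocArray source for the authoritative layout.

import struct

# hypothetical file saved via da.save_binary('my_docarray.bin', protocol='protobuf')
with open('my_docarray.bin', 'rb') as f:
    version = struct.unpack('>B', f.read(1))[0]   # uint8: serialization format version
    num_docs = struct.unpack('>Q', f.read(8))[0]  # uint64: number of serialized Documents
    doc1_len = struct.unpack('>I', f.read(4))[0]  # uint32: byte length of the first Document
    doc1_raw = f.read(doc1_len)                   # the first Document's doc.to_bytes() payload

print(version, num_docs, doc1_len)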

From/to disk#

If you want to store a DocumentArray on disk, use .save_binary(filename, protocol, compress), where protocol and compress refer to the protocol and compression methods used to serialize the data. To load a DocumentArray from disk, use .load_binary(filename, protocol, compress).

For example, the following snippet shows how to save/load a DocumentArray in my_docarray.bin.

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])

da.save_binary('my_docarray.bin', protocol='protobuf', compress='lz4')
da_rec = DocumentArray.load_binary(
    'my_docarray.bin', protocol='protobuf', compress='lz4'
)
da_rec.summary()
                  Documents Summary                   
                                                      
  Length                 2                            
  Homogenous Documents   True                         
  Common Attributes      ('id', 'mime_type', 'text')  
                                                      
                     Attributes Summary                     
                                                            
  Attribute   Data type   #Unique values   Has empty value  
 ────────────────────────────────────────────────────────── 
  id          ('str',)    2                False            
  mime_type   ('str',)    1                False            
  text        ('str',)    2                False            
                                                                                               

Users do not need to remember the protocol and compression method when loading: you can simply encode protocol and compress in the file extension:

filename.protobuf.gzip
         ~~~~~~~~ ^^^^
             |      |
             |      |-- compress
             |
             |-- protocol

When a filename in the above format is passed to .save_binary, you can load it back with .load_binary without specifying protocol and compress again.

The previous code snippet can be simplified to

da.save_binary('my_docarray.protobuf.lz4')
da_rec = DocumentArray.load_binary('my_docarray.protobuf.lz4')

Stream large binary serialization from disk#

In particular, if a serialization uses protocol='pickle' or protocol='protobuf', you can load it via streaming with constant memory consumption by setting streaming=True:

from docarray import DocumentArray, Document

da_generator = DocumentArray.load_binary(
    'xxxl.bin', protocol='pickle', compress='gzip', streaming=True
)

for d in da_generator:
    d: Document
    # go nuts with `d`

From/to base64#

Important

Depending on your values of protocol and compress arguments, this feature may require protobuf and lz4 dependencies. You can do pip install "docarray[full]" to install it.

Serializing into base64 can be useful when a binary string is not allowed, e.g. in a REST API. This can easily be done via to_base64() and from_base64(). As with binary serialization, you can specify protocol and compress:

from docarray import DocumentArray

da = DocumentArray.empty(10)

d_str = da.to_base64(protocol='protobuf', compress='lz4')
print(len(d_str), d_str)
176 BCJNGEBAwHUAAAD/Iw+uQdpL9UDNsfvomZb8m7sKIGRkNTIyOTQyNzMwMzExZWNiM2I1MWUwMDhhMzY2ZDQ5MgAEP2FiNDIAHD9iMTgyAB0vNWUyAB0fYTIAHh9myAAdP2MzYZYAHD9jODAyAB0fZDIAHT9kMTZkAABQNjZkNDkAAAAA

To deserialize, remember to set the correct protocol and compress:

from docarray import DocumentArray

da = DocumentArray.from_base64(d_str, protocol='protobuf', compress='lz4')
da.summary()
  Length                 10       
  Homogenous Documents   True     
  Common Attributes      ('id',)  
                                  
                     Attributes Summary                     
                                                            
  Attribute   Data type   #Unique values   Has empty value  
 ────────────────────────────────────────────────────────── 
  id          ('str',)    10               False                                                                    

From/to Protobuf#

Serializing to a Protobuf Message is less frequently used unless you are working with the Python Protobuf API. Nonetheless, you can use from_protobuf() and to_protobuf() to get a Protobuf Message object in Python.

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])
da.to_protobuf()
docs {
  id: "2571b8b66e4d11ec9f271e008a366d49"
  text: "hello"
  mime_type: "text/plain"
}
docs {
  id: "2571ba466e4d11ec9f271e008a366d49"
  text: "world"
  mime_type: "text/plain"
}
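
A round trip back from the Message object is straightforward. The sketch below also serializes the intermediate object with the standard Protobuf API, since it is a regular Protobuf message:

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])

proto = da.to_protobuf()                   # DocumentArrayProto message
raw = proto.SerializeToString()            # standard Protobuf wire format, if you need raw bytes
da_r = DocumentArray.from_protobuf(proto)  # reconstruct the DocumentArray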

From/to list#

Important

This feature requires protobuf or pydantic dependency. You can do pip install "docarray[full]" to install it.

Serializing to/from a Python list is less frequently used, for the same reason as Document.to_dict(): it is often an intermediate step of serializing to JSON. You can do:

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])
da.to_list()
[{'id': 'ae55782a6e4d11ec803c1e008a366d49', 'text': 'hello', 'mime_type': 'text/plain'}, {'id': 'ae557a146e4d11ec803c1e008a366d49', 'text': 'world', 'mime_type': 'text/plain'}]
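
The reverse direction works the same way, for example as a simple round trip:

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])

# round trip through a plain Python list of dicts
da_r = DocumentArray.from_list(da.to_list())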

See also

More parameters and usages can be found in the Document-level From/to dict.

From/to dataframe#

Important

This feature requires pandas dependency. You can do pip install "docarray[full]" to install it.

One can convert between a DocumentArray object and a pandas.DataFrame object.

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])
da.to_dataframe()
                                 id   text   mime_type
0  43cb93b26e4e11ec8b731e008a366d49  hello  text/plain
1  43cb95746e4e11ec8b731e008a366d49  world  text/plain

To build a DocumentArray from a dataframe:

df = ...
da = DocumentArray.from_dataframe(df)
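
For instance, a minimal round trip through pandas (the columns follow the to_dataframe() output shown above):

from docarray import DocumentArray, Document

da = DocumentArray([Document(text='hello'), Document(text='world')])

df = da.to_dataframe()                   # pandas.DataFrame with id/text/mime_type columns
da_r = DocumentArray.from_dataframe(df)  # rebuild the DocumentArray from the dataframe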

From/to cloud#

Important

This feature requires rich and requests dependency. You can do pip install "docarray[full]" to install it.

push() and pull() allow you to serialize a DocumentArray object to Jina Cloud and share it across machines.

Consider that you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you have everything you need in a DocumentArray. You can easily store it in the cloud via:

from docarray import DocumentArray

da = DocumentArray(...)  # heavylifting, processing, GPU task, ...
da.push('myda123', show_progress=True)
[Figure: da-push.png, the progress bar shown during da.push]

Then on your local laptop, simply pull it:

from docarray import DocumentArray

da = DocumentArray.pull('<username>/myda123', show_progress=True)

Now you can continue working locally, analyzing da or visualizing it. Your friends and colleagues who know the token myda123 can also pull that DocumentArray, which makes it easy to share results quickly.

The maximum size of an upload is 4GB under the protocol='protobuf' and compress='gzip' setting. The lifetime of an upload is one week after its creation.

To avoid unnecessary downloads when the upstream DocumentArray is unchanged, you can use DocumentArray.pull(..., local_cache=True).
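
For example (the name below is the same hypothetical one used earlier):

from docarray import DocumentArray

# reuse the locally cached copy if the upstream DocumentArray has not changed
da = DocumentArray.pull('<username>/myda123', show_progress=True, local_cache=True)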

See also

DocArray allows pushing, pulling, and managing your DocumentArrays in Jina AI Cloud. Read more about how to manage your data in Jina AI Cloud, using either the console or the DocArray Python API, in the Data Management section.