FastAPI/Pydantic#

Long story short, DocArray supports the pydantic data model via PydanticDocument and PydanticDocumentArray.

But this is probably too short to make any sense. So let’s take a step back and see what this means.

When you want to send or receive a Document or DocumentArray object via a REST API, you can use .from_json/.to_json to convert the object to and from JSON. This was introduced in the Serialization section.
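
For example, a minimal round-trip looks like the following sketch (assuming a plain text Document):

from docarray import Document

d = Document(text='hello world')
json_str = d.to_json()  # serialize the Document into a JSON string
d_restored = Document.from_json(json_str)  # rebuild the Document on the receiving side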

This approach, although intuitive to many data scientists, is not the modern way of building API services. Your engineer friends won’t be happy if you give them a service like this. The main problem here is the lack of data validation.

Of course, you can include data validation inside your service logic, but it is often tedious and error-prone: you need to check field by field, repeating things like isinstance(field, int), not to mention handling nested JSON.
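
To see why, here is a hypothetical hand-rolled validator (the function and fields are made up for illustration); every field needs its own check, and nested JSON needs recursion on top:

def validate_payload(payload: dict):
    # hypothetical manual validation: one check per field, easy to get wrong
    if not isinstance(payload.get('id'), str):
        raise ValueError('`id` must be a string')
    if 'weight' in payload and not isinstance(payload['weight'], (int, float)):
        raise ValueError('`weight` must be a number')
    # ... and so on for every field, plus recursion into nested objects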

Modern web frameworks validate the data before it enters the core logic. For example, FastAPI leverages pydantic to validate input & output data.

This chapter introduces how to leverage DocArray’s pydantic support in a FastAPI service to build a modern API service. The fundamentals of FastAPI can be learned from its docs; I won’t repeat them here.

Tip

Features introduced in this chapter require fastapi and pydantic as dependencies. Please run pip install "docarray[full]" to enable them.

JSON Schema#

You can get the JSON Schema (OpenAPI itself is based on JSON Schema) of Document and DocumentArray via get_json_schema().

from docarray import Document
Document.get_json_schema()
{
  "$ref": "#/definitions/PydanticDocument",
  "definitions": {
    "PydanticDocument": {
      "title": "PydanticDocument",
      "type": "object",
      "properties": {
        "id": {
          "title": "Id",
          "type": "string"
        },

from docarray import DocumentArray
DocumentArray.get_json_schema()
{
  "title": "DocumentArray Schema",
  "type": "array",
  "items": {
    "$ref": "#/definitions/PydanticDocument"
  },
  "definitions": {
    "PydanticDocument": {
      "title": "PydanticDocument",
      "type": "object",
      "properties": {
        "id": {
          "title": "Id",

Give these to your engineer friends and they will be happy, as they can now understand what data format you are working with. These schemas also help them easily integrate DocArray into any web service.
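
For example, one simple way to hand the schema over (a sketch; the file name is arbitrary) is to dump it to a standalone JSON file that can be fed into OpenAPI tooling:

import json

from docarray import Document

# write the schema to disk so it can be shared or consumed by other tools
with open('document-schema.json', 'w') as fp:
    json.dump(Document.get_json_schema(), fp, indent=2)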

Validate incoming Document and DocumentArray#

You can import the PydanticDocument and PydanticDocumentArray pydantic data models and use them to type-hint your endpoints. This enables data validation.

from docarray.document.pydantic_model import PydanticDocument, PydanticDocumentArray
from fastapi import FastAPI

app = FastAPI()

@app.post('/single')
async def create_item(item: PydanticDocument):
    ...

@app.post('/multi')
async def create_array(items: PydanticDocumentArray):
    ...

Let’s now send some invalid payloads:

from starlette.testclient import TestClient
client = TestClient(app)

response = client.post('/single', {'hello': 'world'})
print(response, response.text)
response = client.post('/single', {'id': [12, 23]})
print(response, response.text)
<Response [422]> {"detail":[{"loc":["body"],"msg":"value is not a valid dict","type":"type_error.dict"}]}
<Response [422]> {"detail":[{"loc":["body"],"msg":"value is not a valid dict","type":"type_error.dict"}]}

Both get rejected with a 422 error, as they are not valid.
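
For comparison, a well-formed payload should pass validation. Here is a sketch, assuming the text field is sent as a proper JSON body:

response = client.post('/single', json={'text': 'hello world'})
print(response)  # expected: <Response [200]>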

Convert between pydantic model and DocArray objects#

PydanticDocument and PydanticDocumentArray are mainly for data validation. When you want to implement real logic, you need to convert them into Document or DocumentArray. This can be easily achieved via from_pydantic_model(). When you are done with processing and want to send the result back, you can call to_pydantic_model().

In a nutshell, the whole procedure looks like the following:

(Figure lifetime-pydantic.svg: pydantic models at the API boundary, Document/DocumentArray objects inside the service logic.)

Let’s see an example:

from docarray import Document, DocumentArray
from docarray.document.pydantic_model import PydanticDocument, PydanticDocumentArray

@app.post('/single')
async def create_item(item: PydanticDocument):
    d = Document.from_pydantic_model(item)
    # now `d` is a Document object
    ...  # process `d` however you want
    return d.to_pydantic_model()
    

@app.post('/multi')
async def create_array(items: PydanticDocumentArray):
    da = DocumentArray.from_pydantic_model(items)
    # now `da` is a DocumentArray object
    ...  # process `da` however you want
    return da.to_pydantic_model()

Limit returned fields by response model#

Supporting the pydantic data model means much more than data validation. One useful pattern is to define a smaller data model and restrict the response to certain fields of the Document.

Imagine we have a DocumentArray with .embeddings on the server side, but we do not want to return them to the client, either because they are meaningless to end users or too big to transfer. You can simply define the fields of interest via pydantic.BaseModel and then pass it to response_model=.

from pydantic import BaseModel
from docarray import Document

class IdOnly(BaseModel):
    id: str

@app.get('/single', response_model=IdOnly)
async def get_item_no_embedding():
    d = Document(embedding=[1, 2, 3])
    return d.to_pydantic_model()

And you get:

<Response [200]> {'id': '065a5548756211ecaa8d1e008a366d49'}
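
For completeness, this response can be fetched with the same test client as before (a sketch; the actual id value differs on every run):

response = client.get('/single')
print(response, response.json())  # only `id` survives; `embedding` is filtered out by the response model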

Limit returned results recursively#

The same idea applies to DocumentArray as well. Say after .match(), you are only interested in .id: the parent .id and the .id of every match. You can declare a BaseModel as follows:

from typing import List, Optional
from pydantic import BaseModel

class IdAndMatch(BaseModel):
    id: str
    matches: Optional[List['IdAndMatch']]

Bind it to response_model:

import numpy as np

@app.get('/get_match', response_model=List[IdAndMatch])
async def get_match_id_only():
    da = DocumentArray.empty(10)
    da.embeddings = np.random.random([len(da), 3])
    da.match(da)
    return da.to_pydantic_model()

Then you get a nice result containing the ids of the matches (to potentially unlimited depth).

[{'id': 'ef82e4f4756411ecb2c01e008a366d49',
  'matches': [{'id': 'ef82e4f4756411ecb2c01e008a366d49', 'matches': None},
              {'id': 'ef82e6d4756411ecb2c01e008a366d49', 'matches': None},
              {'id': 'ef82e760756411ecb2c01e008a366d49', 'matches': None},
              {'id': 'ef82e7ec756411ecb2c01e008a366d49', 'matches': None},
              ...

If 'matches': None is annoying to you (it appears because you didn’t compute second-degree matches), you can further leverage FastAPI’s response_model_exclude_none feature and do:

@app.get('/get_match', response_model=List[IdAndMatch], response_model_exclude_none=True)
async def get_match_id_only():
    ...

Finally, you get a very clean result with ids and matches only:

[{'id': '3da6383e756511ecb7cb1e008a366d49',
  'matches': [{'id': '3da6383e756511ecb7cb1e008a366d49'},
              {'id': '3da63a14756511ecb7cb1e008a366d49'},
              {'id': '3da6392e756511ecb7cb1e008a366d49'},
              {'id': '3da63b72756511ecb7cb1e008a366d49'},
              {'id': '3da639ce756511ecb7cb1e008a366d49'},
              {'id': '3da63a5a756511ecb7cb1e008a366d49'},
              {'id': '3da63ae6756511ecb7cb1e008a366d49'},
              {'id': '3da63aa0756511ecb7cb1e008a366d49'},
              {'id': '3da63b2c756511ecb7cb1e008a366d49'},
              {'id': '3da63988756511ecb7cb1e008a366d49'}]},
 {'id': '3da6392e756511ecb7cb1e008a366d49',
  'matches': [{'id': '3da6392e756511ecb7cb1e008a366d49'},
              {'id': '3da639ce756511ecb7cb1e008a366d49'},
              ...

More tricks and usage patterns of the pydantic model can be found in its docs; the same goes for FastAPI. I strongly recommend interested readers go through their documentation.