Access Modality#

Tip

It is strongly recommended to go through the Access Documents section first before continuing.

Accessing modality means accessing the sub-Documents corresponding to a dataclass field.

In the last chapter, we learned how to represent a multimodal document via @dataclass and type annotation from docarray.typing. We also learned that a multimodal dataclass can be converted into a Document object easily. That means if we have a list of multimodal dataclass objects, we can build a DocumentArray out of them:

from docarray import Document, dataclass, DocumentArray
from docarray.typing import Image, Text


@dataclass
class MMDoc:
    banner: Image
    description: Text


da = DocumentArray(
    [
        Document(
            MMDoc(banner='test-1.jpeg', description='this is a test white-noise image')
        ),
        Document(
            MMDoc(banner='test-2.jpeg', description='another test image but in black')
        ),
    ]
)

da.summary()
╭────────────── Documents Summary ───────────────╮
│                                                │
│   Length                    2                  │
│   Homogenous Documents      True               │
│   Has nested Documents in   ('chunks',)        │
│   Common Attributes         ('id', 'chunks')   │
│   Multimodal dataclass      True               │
│                                                │
╰────────────────────────────────────────────────╯
╭──────────────────────── Attributes Summary ────────────────────────╮
│                                                                    │
│   Attribute   Data type         #Unique values   Has empty value   │
│  ────────────────────────────────────────────────────────────────  │
│   chunks      ('ChunkArray',)   2                False             │
│   id          ('str',)          2                False             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

A natural question would be, how do we select those Documents that correspond to MMDoc.banner?

This chapter describes how to select the sub-documents that correspond to a modality from a DocumentArray. So let me reiterate the logic here: when calling Document() to build Document object from a dataclass object, each field in that dataclass will generate a sub-document nested under .chunks or even .chunks.chunks.chunks at arbitrary level (except primitive types, which are stored in the tags of the root Document). To process a dataclass field via existing DocArray API/Jina/Hub Executor, we need a way to accurately select those sub-documents from the nested structure, which is the purpose of this chapter.

Selector Syntax#

Following the syntax convention described in Access Documents, a modality selector also starts with @, it uses . to indicate the field of the dataclass. Selecting a DocumentArray always results in another DocumentArray.

@.[field1, field2, ...]
^^ ~~~~~~  ~~~~~~
||   |       |
||   |-------|
||       |
||       | --- indicate the field of dataclass
||
|| ------ indicate the start of modality selector
|
| ---- indicate the start of selector

Use the above DocumentArray as an example,

da['@.[banner]']
╭───────────────────────────── Documents Summary ──────────────────────────────╮
│                                                                              │
│   Length                 2                                                   │
│   Homogenous Documents   True                                                │
│   Common Attributes      ('id', 'parent_id', 'granularity', 'tensor',        │
│                          'mime_type', 'uri', 'modality')                     │
│   Multimodal dataclass   False                                               │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────── Attributes Summary ────────────────────────╮
│                                                                   │
│   Attribute     Data type      #Unique values   Has empty value   │
│  ───────────────────────────────────────────────────────────────  │
│   granularity   ('int',)       1                False             │
│   id            ('str',)       2                False             │
│   mime_type     ('str',)       1                False             │
│   modality      ('str',)       1                False             │
│   parent_id     ('str',)       2                False             │
│   tensor        ('ndarray',)   2                False             │
│   uri           ('str',)       2                False             │
│                                                                   │
╰───────────────────────────────────────────────────────────────────╯
da['@.[description]']
╭───────────────────────────── Documents Summary ──────────────────────────────╮
│                                                                              │
│   Length                 2                                                   │
│   Homogenous Documents   True                                                │
│   Common Attributes      ('id', 'parent_id', 'granularity', 'text',          │
│                          'modality')                                         │
│   Multimodal dataclass   False                                               │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭────────────────────── Attributes Summary ──────────────────────╮
│                                                                │
│   Attribute     Data type   #Unique values   Has empty value   │
│  ────────────────────────────────────────────────────────────  │
│   granularity   ('int',)    1                False             │
│   id            ('str',)    2                False             │
│   modality      ('str',)    1                False             │
│   parent_id     ('str',)    2                False             │
│   text          ('str',)    2                False             │
│                                                                │
╰────────────────────────────────────────────────────────────────╯

Select multiple fields#

You can select multiple fields by including them in the square brackets, separated by a comma ,.

da['@.[description, banner]']
╭───────────────────────────── Documents Summary ──────────────────────────────╮
│                                                                              │
│   Length                        4                                            │
│   Homogenous Documents          False                                        │
│   2 Documents have attributes   ('id', 'parent_id', 'granularity', 'text',   │
│                                 'modality')                                  │
│   2 Documents have attributes   ('id', 'parent_id', 'granularity',           │
│                                 'tensor', 'mime_type', 'uri', 'modality')    │
│   Multimodal dataclass          False                                        │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────── Attributes Summary ─────────────────────────────╮
│                                                                              │
│   Attribute     Data type                 #Unique values   Has empty value   │
│  ──────────────────────────────────────────────────────────────────────────  │
│   granularity   ('int',)                  1                False             │
│   id            ('str',)                  4                False             │
│   mime_type     ('str',)                  2                False             │
│   modality      ('str',)                  2                False             │
│   parent_id     ('str',)                  2                False             │
│   tensor        ('ndarray', 'NoneType')   4                True              │
│   text          ('str',)                  3                False             │
│   uri           ('str',)                  3                False             │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

Slice dataclass objects#

Remember each dataclass object corresponds to one Document object, you can first slice the DocumentArray before selecting the field. Specifically, you can do

@r[slice].[field1, field2, ...]

where slice can be any slice syntax accepted in Access Documents.

For example, to select the sub-Document .banner for only the first Document,

da['@r[:1].[banner]']
╭───────────────────────────── Documents Summary ──────────────────────────────╮
│                                                                              │
│   Length                 1                                                   │
│   Homogenous Documents   True                                                │
│   Common Attributes      ('id', 'parent_id', 'granularity', 'tensor',        │
│                          'mime_type', 'uri', 'modality')                     │
│   Multimodal dataclass   False                                               │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────── Attributes Summary ────────────────────────╮
│                                                                   │
│   Attribute     Data type      #Unique values   Has empty value   │
│  ───────────────────────────────────────────────────────────────  │
│   granularity   ('int',)       1                False             │
│   id            ('str',)       1                False             │
│   mime_type     ('str',)       1                False             │
│   modality      ('str',)       1                False             │
│   parent_id     ('str',)       1                False             │
│   tensor        ('ndarray',)   1                False             │
│   uri           ('str',)       1                False             │
│                                                                   │
╰───────────────────────────────────────────────────────────────────╯

Slice List[Type] fields#

If a field is annotated as a List of DocArray types, it will create a DocumentArray, one can add slicing after the field selector to further restrict the size of the sub-Documents.

from typing import List

from docarray import Document, dataclass, DocumentArray
from docarray.typing import Image, Text


@dataclass
class MMDoc:
    banner: List[Image]
    description: Text


da = DocumentArray(
    [
        Document(
            MMDoc(
                banner=['test-1.jpeg', 'test-2.jpeg'],
                description='this is a test white image',
            )
        ),
        Document(
            MMDoc(
                banner=['test-1.jpeg', 'test-2.jpeg'],
                description='another test image but in black',
            )
        ),
    ]
)

for d in da['@.[banner][:1]']:
    print(d.uri)
test-1.jpeg
test-1.jpeg

To summarize, slicing can be put in front of the field selector to restrict the number of dataclass objects; or can be put after the field selector to restrict the number of sub-Documents.

Select nested fields#

A field can be annotated as a DocArray dataclass. In this case, the nested structure from the latter dataclass is copied to the former’s .chunks. To select the deeply nested field, one can simply follow:

@.[field1, field2, ...].[nested_field1, nested_field1, ...]

For example,

from docarray import dataclass, Document, DocumentArray
from docarray.typing import Image, Text


@dataclass
class BannerDoc:
    description: Text = 'this is a test empty image'
    banner: Image = 'test-1.jpeg'


@dataclass
class ColumnArticle:
    featured: BannerDoc
    description: Text = 'this is a column article'
    website: str = 'https://jina.ai'


c1 = ColumnArticle(featured=BannerDoc(banner='test-1.jpeg'))
c2 = ColumnArticle(featured=BannerDoc(banner='test-2.jpeg'))

da = DocumentArray([Document(c1), Document(c2)])

for d in da['@.[featured].[banner]']:
    print(d.uri)
test-1.jpeg
test-2.jpeg