This example walks you through how to use DocArray to process multiple data modalities in tandem.
To do this comfortably and cleanly, you can use DocArray's dataclass feature.
See also
This example works with image and text data.
If you are not yet familiar with how to process these modalities individually, you may want to check out the
respective examples first: Image and Text.
If you work with multiple modalities at the same time, most likely they stand in some relation with each other.
DocArray's dataclass feature allows you to model your data and these relationships, using the language of your domain.
Suppose you want to model a page of a newspaper that contains a main text, an image, and an image description.
You can model this example in the following way:
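from docarray import dataclass, Document
from docarray.typing import Image, Text


@dataclass
class Page:
    main_text: Text
    image: Image
    description: Text


page = Page(
    main_text='Hello world',
    image='apple.png',
    description='This is the image of an apple',
)
doc = Document(page)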
If your domain requires a more complex model, you can use advanced features to represent that accurately.
For this example, we look at a journal which consists of a cover page and multiple other pages, as well as some metadata.
Further, each page contains a main text, and can contain an image and an image description.
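One way to model this is with nested dataclasses (a minimal sketch; the field names follow the instantiation below): image and description get None defaults so that a page can omit them, and the metadata goes into a JSON field:

from typing import List

from docarray import dataclass
from docarray.typing import Image, JSON, Text


@dataclass
class Page:
    main_text: Text
    image: Image = None
    description: Text = None


@dataclass
class Journal:
    cover: Page
    pages: List[Page]
    metadata: JSON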
You can instantiate this complex Document in the same way as before:
from docarray import Document

pages = [
    Page(
        main_text='Hello world',
        image='apple.png',
        description='This is the image of an apple',
    ),
    Page(main_text='Second page'),
]

journal = Journal(
    cover=Page(main_text='DocArray Daily', image='apple.png'),
    pages=pages,
    metadata={'author': 'Jina AI', 'issue': '1'},
)

doc = Document(journal)
doc.summary()
After instantiation, each modality can be accessed directly from the Document:
from docarray import dataclass, Document
from docarray.typing import Image, Text


@dataclass
class Page:
    main_text: Text
    image: Image
    description: Text


page = Page(
    main_text='Hello world',
    image='apple.png',
    description='This is the image of an apple',
)
doc = Document(page)

print(doc.main_text)
print(doc.main_text.text)
print(doc.image)
print(doc.image.tensor)
Common use cases, such as neural search, involve generating embeddings for your data.
There are two ways of doing this, each with its own use cases:
generating an individual embedding for each modality, and generating one overall embedding for the entire Document.
If you have a DocumentArray of multi-modal Documents, you can embed the modalities of each
Document in the following way:
from docarray import DocumentArray, Document

da = DocumentArray(
    [
        Document(
            Page(
                main_text='First page',
                image='apple.png',
                description='This is the image of an apple',
            )
        ),
        Document(
            Page(
                main_text='Second page',
                image='apple.png',
                description='Still the same image of the same apple',
            )
        ),
    ]
)

from torchvision.models import resnet50

img_model = resnet50(pretrained=True)

# embed textual data
da['@.[description, main_text]'].apply(lambda d: d.embed_feature_hashing())

# embed image data
da['@.[image]'].apply(
    lambda d: d.set_image_tensor_shape(shape=(224, 224))
    .set_image_tensor_channel_axis(original_channel_axis=-1, new_channel_axis=0)
    .set_image_tensor_normalization(channel_axis=0)
)
da['@.[image]'].embed(img_model)

print(da['@.[description, main_text]'].embeddings.shape)
print(da['@.[image]'].embeddings.shape)
From the individual embeddings you can create a combined embedding for the entire Document.
This can be useful, for example, when you want to compare different Documents based on all the modalities that they store.
import numpy as np


def combine_embeddings(d):
    # any (more sophisticated) function could go here
    d.embedding = np.concatenate(
        [d.image.embedding, d.main_text.embedding, d.description.embedding]
    )
    return d


da.apply(combine_embeddings)
print(da.embeddings.shape)
Let's assume you have multiple pages, and you want to find the page that contains an image similar to the image of some other page
(the query page).
Subindices
For this search task we use DocumentArray subindices.
This pattern is especially important when using a Document store, since it avoids loading all Documents into memory.
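For example, a disk-backed DocumentArray with an image subindex could be set up as follows (a minimal sketch, assuming the SQLite Document store; other stores work analogously):

from docarray import DocumentArray

# sketch: keep Documents on disk (SQLite) and maintain a subindex over the
# image chunks, so that search on '@.[image]' does not require loading
# every Document into memory
da = DocumentArray(
    storage='sqlite',
    config={'connection': 'example.db'},
    subindex_configs={'@.[image]': None},
)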
First, create your dataset and query Document:
from docarray import dataclass, Document, DocumentArray
from docarray.typing import Image, Text


@dataclass
class Page:
    main_text: Text
    image: Image
    description: Text


query_page = Page(
    main_text='Hello world',
    image='apple.png',
    description='This is the image of an apple',
)
query = Document(query_page)  # our query Document

da = DocumentArray(
    [
        Document(
            Page(
                main_text='First page',
                image='apple.png',
                description='This is the image of an apple',
            )
        ),
        Document(
            Page(
                main_text='Second page',
                image='pear.png',
                description='This is an image of a pear',
            )
        ),
    ],
    subindex_configs={'@.[image]': None},
)  # our dataset of pages
Finally, you can perform a search using find() to find the closest image,
and retrieve the parent Document that contains that image:
closest_match_img = da.find(query.image, on='@.[image]')[0][0]
print('CLOSEST IMAGE:')
closest_match_img.summary()
print('PAGE WITH THE CLOSEST IMAGE:')
closest_match_page = da[closest_match_img.parent_id]
closest_match_page.summary()
Output
CLOSEST IMAGE:
📄 Document: 5922ee1ad0dbfe707301b573f98c5939
╭─────────────┬────────────────────────────────────────────────────────────────╮
│ Attribute   │ Value                                                          │
├─────────────┼────────────────────────────────────────────────────────────────┤
│ parent_id   │ e6266f88f6ebcb3417358440934bcf81                               │
│ granularity │ 1                                                              │
│ tensor      │ <class 'numpy.ndarray'> in shape (3, 224, 224), dtype: float32│
│ mime_type   │ image/png                                                      │
│ uri         │ apple.png                                                      │
│ embedding   │ <class 'torch.Tensor'> in shape (1000,), dtype: float32       │
│ modality    │ image                                                          │
│ scores      │ defaultdict(<class 'docarray.score.NamedScore'>, {'cosine':   │
│             │ {'value': -1.1920929e-07}})                                    │
╰─────────────┴────────────────────────────────────────────────────────────────╯
PAGE WITH THE CLOSEST IMAGE:
📄 Document: e6266f88f6ebcb3417358440934bcf81
└── Chunks
    ├── 📄 Document: 29a0e323e2e9befcc42e9823b111f90f
    │   ╭─────────────┬──────────────────────────────────╮
    │   │ Attribute   │ Value                            │
    │   ├─────────────┼──────────────────────────────────┤
    │   │ parent_id   │ e6266f88f6ebcb3417358440934bcf81 │
    │   │ granularity │ 1                                │
    │   │ text        │ First page                       │
    │   │ modality    │ text                             │
    │   ╰─────────────┴──────────────────────────────────╯
    ├── 📄 Document: 5922ee1ad0dbfe707301b573f98c5939
    │   ╭─────────────┬────────────────────────────────────────────────────────────────╮
    │   │ Attribute   │ Value                                                          │
    │   ├─────────────┼────────────────────────────────────────────────────────────────┤
    │   │ parent_id   │ e6266f88f6ebcb3417358440934bcf81                               │
    │   │ granularity │ 1                                                              │
    │   │ tensor      │ <class 'numpy.ndarray'> in shape (3, 224, 224), dtype: float32│
    │   │ mime_type   │ image/png                                                      │
    │   │ uri         │ apple.png                                                      │
    │   │ embedding   │ <class 'torch.Tensor'> in shape (1000,), dtype: float32       │
    │   │ modality    │ image                                                          │
    │   ╰─────────────┴────────────────────────────────────────────────────────────────╯
    └── 📄 Document: 175e386b1aa248f9387db46341b73e05
        ╭─────────────┬──────────────────────────────────╮
        │ Attribute   │ Value                            │
        ├─────────────┼──────────────────────────────────┤
        │ parent_id   │ e6266f88f6ebcb3417358440934bcf81 │
        │ granularity │ 1                                │
        │ text        │ This is the image of an apple    │
        │ modality    │ text                             │
        ╰─────────────┴──────────────────────────────────╯
Similarly, you might want to find the page in your dataset that is overall most similar to a query page.
To do that, you first embed each modality of every Document, and then combine the individual embeddings into one overall embedding:
from torchvision.models import resnet50
import numpy as np

from docarray import DocumentArray

img_model = resnet50(pretrained=True)

# embed text data in query and dataset
query.main_text.embed_feature_hashing()
query.description.embed_feature_hashing()
da['@.[description, main_text]'].apply(lambda d: d.embed_feature_hashing())

# embed image data in query and dataset (same pipeline as in the example above)
for imgs in (DocumentArray([query.image]), da['@.[image]']):
    imgs.apply(
        lambda d: d.set_image_tensor_shape(shape=(224, 224))
        .set_image_tensor_channel_axis(original_channel_axis=-1, new_channel_axis=0)
        .set_image_tensor_normalization(channel_axis=0)
    )
    imgs.embed(img_model)


# combine embeddings to overall embedding
def combine_embeddings(d):
    # any (more sophisticated) function could go here
    d.embedding = np.concatenate(
        [d.image.embedding, d.main_text.embedding, d.description.embedding]
    )
    return d


query = combine_embeddings(query)  # combine embeddings for query
da.apply(combine_embeddings)  # combine embeddings in dataset
Then you can perform the search directly on the top level, using find() just as in the subindex example above:
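# sketch: top-level find(), mirroring the subindex search above
closest_match = da.find(query)[0][0]
print('OVERALL CLOSEST PAGE:')
closest_match.summary()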