If your text data is too large to be written inline, or comes from a URI, you can also define the uri first and load the text into the Document later:
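For example, here is a minimal sketch that loads Pride and Prejudice from Project Gutenberg (the same URI used in the search example later in this section):

from docarray import Document

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt')
d.load_uri_to_text()  # fetches the remote file and fills d.text
print(len(d.text))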
Often, when you index or search textual Documents, you don’t want to treat thousands of words as one huge Document – some finer granularity would be nice. You can achieve this by leveraging Document chunks. For example, let’s split this simple Document at each ! mark:
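A minimal sketch, with a made-up three-sentence text:

from docarray import Document

d = Document(text='Hello there! Nice to meet you! Goodbye!')

# one chunk Document per segment, splitting at every "!" and dropping empty pieces
d.chunks.extend(
    Document(text=s.strip()) for s in d.text.split('!') if s.strip()
)

print(d.chunks[:, 'text'])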
Sometimes you need to encode the text into a numpy.ndarray before further computation. We provide some helper functions in Document and DocumentArray that allow you to do that easily.
For example, we have a DocumentArray with three Documents:
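A minimal sketch, using the same three texts as the max_length example below – get_vocabulary() builds an integer vocabulary over the whole DocumentArray, and convert_text_to_tensor() encodes each Document’s text with it:

from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)

# map every token in the DocumentArray to an integer id
vocab = da.get_vocabulary()

# encode each text into an integer tensor
for d in da:
    d.convert_text_to_tensor(vocab)
    print(d.tensor)

Without max_length, the resulting tensors have different lengths, one entry per word.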
When you have text of different lengths and want output .tensors to have the same length, you can define max_length during conversion:
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)

vocab = da.get_vocabulary()

for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)
    print(d.tensor)
As a bonus, you can also easily convert an integer ndarray back to text based on a given vocabulary. This is often termed “decoding”.
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)

vocab = da.get_vocabulary()

# encoding
for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)

# decoding
for d in da:
    d.convert_tensor_to_text(vocab)
    print(d.text)
a short phrase
word
this is a much longer sentence
Let’s search for "she entered the room" in Pride and Prejudice:
from docarray import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(lambda d: d.embed_feature_hashing())

q = (
    Document(text='she entered the room')
    .embed_feature_hashing()
    .match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)
)

print(q.matches[:, ('text', 'scores__jaccard')])
[['staircase, than she entered the breakfast-room, and congratulated',
'of the room.',
'She entered the room with an air more than usually ungracious,',
'entered the breakfast-room, where Mrs. Bennet was alone, than she',
'those in the room.'],
[{'value': 0.6, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.7142857142857143, 'ref_id': 'f47f7448709811ec960a1e008a366d49'}]]
You can create applications that search at chunk level using a subindex.
Imagine you want an application that searches at sentence granularity and returns the title of the Document containing the sentence closest to the query. For example, you have a database of song lyrics and want to find a song’s title from a small part of its lyrics that you remember (like the chorus).
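Here is a rough sketch of how such a chunk-level index could be declared. The annlite storage backend, the 256-dimensional random placeholder embeddings, and the example Document are assumptions for illustration only:

import numpy as np
from docarray import Document, DocumentArray

# root index: one Document per song (title in .text);
# the '@c' subindex separately indexes that Document's chunks (the lyric lines)
index = DocumentArray(
    storage='annlite',
    config={'n_dim': 256},
    subindex_configs={'@c': {'n_dim': 256}},
)

doc = Document(
    text='hypothetical song title',
    embedding=np.random.random(256),  # placeholder for a real sentence embedding
    chunks=[
        Document(text=line, embedding=np.random.random(256))
        for line in ('first lyric line', 'second lyric line')
    ],
)
index.append(doc)

# queries can then be answered at chunk level via index.find(..., on='@c'),
# and the matched chunk's .parent_id leads back to the song (and hence its title)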
Multi-modal Documents
Modelling nested Documents is often more convenient using DocArray’s dataclass API, especially when multiple modalities are involved.
song1_title = 'Old MacDonald Had a Farm'
song1 = """Old MacDonald had a farm, E-I-E-I-O
And on that farm he had some dogs, E-I-E-I-O
With a bow-wow here, and a bow-wow there,
Here a bow, there a bow, everywhere a bow-wow."""

song2_title = 'Ode an die Freude'
song2 = """Freude, schöner Götterfunken,
Tochter aus Elisium,
Wir betreten feuertrunken
Himmlische, dein Heiligthum.
Deine Zauber binden wieder,
was der Mode Schwerd getheilt;
Bettler werden Fürstenbrüder,
wo dein sanfter Flügel weilt."""
We can create one Document for each song, containing the song’s lines as chunks:
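A minimal sketch – the make_song_doc helper is just for illustration; it stores the title in the parent Document’s text and adds one chunk per non-empty lyric line:

from docarray import Document, DocumentArray

def make_song_doc(title, lyrics):
    # parent Document carries the title; its chunks carry the individual lines
    return Document(
        text=title,
        chunks=[Document(text=line) for line in lyrics.split('\n') if line.strip()],
    )

da = DocumentArray(
    [make_song_doc(song1_title, song1), make_song_doc(song2_title, song2)]
)
da.summary()

Matching a query sentence against the chunks (da['@c'], or a '@c' subindex as sketched earlier) and following the best match’s parent_id back to its parent Document then yields the song title.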