If your text data is large and cannot be written inline, or it comes from a URI, you can define the uri first and load the text into the Document later.
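The idea of "set the URI first, load the text later" can be illustrated with the standard library alone. The sketch below is not DocArray code: `load_text_from_uri` is a hypothetical helper, and the temporary file and its `file://` URI are stand-ins so the example runs without network access. In DocArray itself, the corresponding call that appears later on this page is `load_uri_to_text()`.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def load_text_from_uri(uri: str, encoding: str = "utf-8") -> str:
    """Fetch the resource behind `uri` and decode it as text.

    Hypothetical helper for illustration; not part of DocArray.
    """
    with urlopen(uri) as response:
        return response.read().decode(encoding)


# Create a small local file and address it through a file:// URI,
# so the text is only read when we explicitly ask for it.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("Hello world!", encoding="utf-8")
    uri = path.as_uri()
    text = load_text_from_uri(uri)

print(text)
```

The same helper would accept an `https://` URI, at the cost of a network round trip.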
When you index or search textual documents, you often don’t want to treat thousands of words as a single document; a finer granularity works better. You can achieve this by using the chunks of a Document. For example, let’s segment this simple document by the ! mark:
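As a rough illustration of the segmentation step, here is a plain-Python sketch that splits a string on `!` and keeps the non-empty pieces. The helper name `split_into_chunks` and the sample text are assumptions made for this example, not DocArray APIs; in DocArray, each resulting piece would be wrapped in its own Document and added to the parent's chunks.

```python
from typing import List


def split_into_chunks(text: str, delimiter: str = "!") -> List[str]:
    """Split `text` on `delimiter`, dropping empty or whitespace-only pieces.

    Hypothetical helper for illustration; not part of DocArray.
    """
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]


text = "Hello world!Goodbye world!See you soon!"
chunks = split_into_chunks(text)

for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Dropping empty pieces matters here because the text ends with the delimiter, which would otherwise produce a trailing empty chunk.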
Sometimes you may need to encode the text into a numpy.ndarray before further computation. We provide some helper functions in Document and DocumentArray that allow you to convert easily.
For example, we have a DocumentArray with three Documents:
When your texts have different lengths and you want every output .tensor to have the same length, you can define max_length when converting:
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)
vocab = da.get_vocabulary()

for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)
    print(d.tensor)
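To make the effect of max_length concrete, the following plain-Python sketch builds a word-to-index vocabulary and encodes each text into a fixed-length list of integers. It is not DocArray's implementation: `build_vocabulary` and `encode` are hypothetical helpers, and the reserved indices (0 for padding, 1 for unknown words), the left-side padding, and the truncation to the first max_length words are all assumptions of this sketch.

```python
from typing import Dict, List

PAD = 0      # assumption: index 0 is reserved for padding
UNKNOWN = 1  # assumption: index 1 is reserved for out-of-vocabulary words


def build_vocabulary(texts: List[str]) -> Dict[str, int]:
    """Assign each distinct lowercase word an index, starting at 2."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 2
    return vocab


def encode(text: str, vocab: Dict[str, int], max_length: int) -> List[int]:
    """Map words to indices, then truncate or left-pad to `max_length`."""
    ids = [vocab.get(word, UNKNOWN) for word in text.lower().split()]
    ids = ids[:max_length]  # assumption: keep the first max_length words
    return [PAD] * (max_length - len(ids)) + ids


texts = ["a short phrase", "word", "this is a much longer sentence"]
vocab = build_vocabulary(texts)
encoded = [encode(text, vocab, max_length=10) for text in texts]

for row in encoded:
    print(row)
```

Every row has exactly ten entries regardless of how many words the text contains, which is what allows the rows to be stacked into a single array.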
As a bonus, you can also easily convert an integer ndarray back to text based on a given vocabulary. This procedure is often called “decoding”.
from docarray import Document, DocumentArray

da = DocumentArray(
    [
        Document(text='a short phrase'),
        Document(text='word'),
        Document(text='this is a much longer sentence'),
    ]
)
vocab = da.get_vocabulary()

# encoding
for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)

# decoding
for d in da:
    d.convert_tensor_to_text(vocab)
    print(d.text)
a short phrase
word
this is a much longer sentence
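Decoding amounts to inverting the vocabulary and skipping any index that has no word attached. The sketch below shows this in plain Python; `decode` is a hypothetical helper and the small hand-written vocabulary is an assumption for illustration, not the output of DocArray's get_vocabulary().

```python
from typing import Dict, List


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Turn a list of indices back into text.

    Indices that are not in the vocabulary (for example a padding
    index such as 0) are skipped. Hypothetical helper, not DocArray API.
    """
    index_to_word = {index: word for word, index in vocab.items()}
    return " ".join(index_to_word[i] for i in ids if i in index_to_word)


# Assumed toy vocabulary; indices 0 and 1 are deliberately left unused.
vocab = {"a": 2, "short": 3, "phrase": 4}
padded_ids = [0, 0, 0, 2, 3, 4]

decoded = decode(padded_ids, vocab)
print(decoded)
```

Note that decoding of this kind recovers the words but not the original casing or punctuation if those were normalized away during encoding.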
Let’s search for "she entered the room" in Pride and Prejudice:
from docarray import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(lambda d: d.embed_feature_hashing())

q = (
    Document(text='she entered the room')
    .embed_feature_hashing()
    .match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)
)

print(q.matches[:, ('text', 'scores__jaccard')])
[['staircase, than she entered the breakfast-room, and congratulated',
'of the room.',
'She entered the room with an air more than usually ungracious,',
'entered the breakfast-room, where Mrs. Bennet was alone, than she',
'those in the room.'],
[{'value': 0.6, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.6666666666666666, 'ref_id': 'f47f7448709811ec960a1e008a366d49'},
{'value': 0.7142857142857143, 'ref_id': 'f47f7448709811ec960a1e008a366d49'}]]
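The scores above are Jaccard distances, so lower values mean closer matches. To show the general idea behind hashing words into a fixed number of buckets and comparing the results with Jaccard distance, here is a standard-library sketch. It is only an approximation of the concept: `hashed_buckets`, `jaccard_distance`, and `search` are hypothetical helpers, and the tokenization, the SHA-256 hash, and the bucket count of 256 are assumptions that need not match what embed_feature_hashing() does internally.

```python
import hashlib
from typing import List, Set, Tuple


def hashed_buckets(text: str, n_dim: int = 256) -> Set[int]:
    """Map each word to one of `n_dim` buckets using a stable hash."""
    buckets: Set[int] = set()
    for word in text.lower().split():
        token = word.strip(".,;:!?\"'")
        if not token:
            continue
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        buckets.add(int.from_bytes(digest[:8], "big") % n_dim)
    return buckets


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def search(query: str, lines: List[str], limit: int = 5) -> List[Tuple[float, str]]:
    """Rank `lines` by Jaccard distance to `query`, closest first."""
    q = hashed_buckets(query)
    scored = [(jaccard_distance(q, hashed_buckets(line)), line) for line in lines]
    return sorted(scored, key=lambda pair: pair[0])[:limit]


# A few sample lines standing in for the full text of the book.
lines = [
    "of the room.",
    "those in the room.",
    "She entered the room with an air more than usually ungracious,",
    "It is a truth universally acknowledged,",
]
results = search("she entered the room", lines, limit=3)

for distance, line in results:
    print(f"{distance:.3f}  {line}")
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash()`, whose string hashes vary between interpreter runs and would make the buckets irreproducible.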