docarray.document.mixins.text module#

class docarray.document.mixins.text.TextDataMixin[source]#

Bases: object

Provide helper functions for Document to support text data.

load_uri_to_text(charset='utf-8')[source]#

Convert uri to text in place.

Parameters

charset (str) – charset may be any character set registered with IANA

Return type

T

Returns

the Document itself, after processing
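As a rough illustration of what the charset parameter controls, the sketch below fetches the bytes behind a URI and decodes them with a given character set. It uses only the Python standard library; `read_uri_as_text` is a hypothetical helper and not part of docarray, and the real method may resolve URIs differently.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_uri_as_text(uri: str, charset: str = "utf-8") -> str:
    """Hypothetical helper: fetch the bytes behind ``uri`` and decode them."""
    with urlopen(uri) as response:
        return response.read().decode(charset)


# Write a small Latin-1 encoded file, then read it back through its file:// URI.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"
    path.write_bytes("café".encode("latin-1"))
    loaded = read_uri_as_text(path.as_uri(), charset="latin-1")
```

Decoding the same bytes with a mismatched charset would either raise an error or produce different characters, which is why the charset should match the encoding of the source.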

get_vocabulary(text_attrs=('text',))[source]#

Get the text vocabulary as a counter dict that maps each word to its frequency across all text_attrs.

Parameters

text_attrs (Tuple[str, …]) – the textual attributes where vocabulary will be derived from

Return type

Dict[str, int]

Returns

a vocabulary dictionary where each key is a word and each value is the frequency of that word across all text fields.
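The shape of the returned mapping can be sketched with `collections.Counter`. This is a minimal illustration, not the library's implementation: `count_words` is a hypothetical helper, and the tokenization (lowercasing, then splitting on non-word characters) is an assumption that may differ from what docarray does.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def count_words(texts: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: map each word to its frequency across all texts."""
    counter: Counter = Counter()
    for text in texts:
        # Assumed tokenization: lowercase, then keep runs of word characters.
        counter.update(re.findall(r"\w+", text.lower()))
    return dict(counter)


# Each string stands in for the value of one textual attribute.
vocab_counts = count_words(["hello world", "Hello again"])
```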

convert_text_to_tensor(vocab, max_length=None, dtype='int64')[source]#

Convert text to tensor in place.

The resulting tensor is a one-dimensional array; when max_length is given, its length is max_length.

To get the vocab of a DocumentArray, you can use jina.types.document.converters.build_vocab to

Parameters
  • vocab (Dict[str, int]) – a dictionary that maps each word to an integer index. Index 0 is reserved for padding and index 1 is reserved for unknown words in text, so vocab should not include these two entries.

  • max_length (Optional[int]) – the maximum length of the sequence. Sequences longer than this are cut off from the beginning. Sequences shorter than this are padded with 0 on the right-hand side.

  • dtype (str) – the dtype of the generated tensor

Return type

T

Returns

the Document itself, after processing
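The encoding rules described above can be sketched in plain Python. This follows the parameter descriptions as written (0 for padding, 1 for unknown words, truncation from the beginning, padding on the right); `encode` is a hypothetical helper, whitespace tokenization is an assumption, and the library's actual behavior may differ in detail.

```python
from typing import Dict, List, Optional

PAD, UNK = 0, 1  # reserved indices, as described for the vocab parameter


def encode(text: str, vocab: Dict[str, int], max_length: Optional[int] = None) -> List[int]:
    """Hypothetical helper: turn text into a fixed-length list of word indices."""
    # Assumed tokenization: split on whitespace. Unknown words map to UNK.
    ids = [vocab.get(word, UNK) for word in text.split()]
    if max_length is not None:
        if len(ids) > max_length:
            ids = ids[-max_length:]  # cut off from the beginning
        else:
            ids = ids + [PAD] * (max_length - len(ids))  # pad on the right
    return ids


vocab = {"hello": 2, "world": 3}
padded = encode("hello world", vocab, max_length=4)
truncated = encode("hello brave new world", vocab, max_length=2)
```

Here `padded` is `[2, 3, 0, 0]`, and `truncated` is `[1, 3]` because the words "brave" and "new" are unknown and only the last two positions are kept.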

convert_tensor_to_text(vocab, delimiter=' ')[source]#

Convert tensor to text in place.

Parameters
  • vocab (Union[Dict[str, int], Dict[int, str]]) – a dictionary that maps each word to an integer index. Index 0 is reserved for padding and index 1 is reserved for unknown words in text.

  • delimiter (str) – the delimiter used to join all words into text

Return type

T

Returns

the Document itself, after processing
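The reverse direction can be sketched the same way. This is an illustration under stated assumptions, not the library's code: `decode` is a hypothetical helper, it handles only a word-to-index vocab, and the `"<UNK>"` placeholder used for unknown indices is an assumed choice.

```python
from typing import Dict, Iterable

PAD = 0  # reserved padding index, as described for the vocab parameter


def decode(ids: Iterable[int], vocab: Dict[str, int],
           delimiter: str = " ", unknown: str = "<UNK>") -> str:
    """Hypothetical helper: turn a sequence of word indices back into text."""
    index_to_word = {index: word for word, index in vocab.items()}
    words = []
    for index in ids:
        if index == PAD:
            continue  # padding carries no text
        # Index 1 and any index missing from vocab fall back to the placeholder.
        words.append(index_to_word.get(index, unknown))
    return delimiter.join(words)


vocab = {"hello": 2, "world": 3}
restored = decode([2, 1, 3, 0, 0], vocab)
joined = decode([2, 3], vocab, delimiter="-")
```

Because unknown words collapse to a single index during encoding, the round trip is lossy: `restored` is `"hello <UNK> world"` rather than the original sentence.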

convert_text_to_datauri(charset='utf-8', base64=False)[source]#

Convert text to a data URI.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – whether to Base64-encode the data. Base64 encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. It is designed to be efficient for non-text 8-bit and binary data, and is sometimes used for text data that frequently uses non-US-ASCII characters.

Return type

T

Returns

the Document itself, after processing
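To show how the two parameters shape the result, the sketch below builds a data URI with the standard library. `text_to_datauri` is a hypothetical helper, and the `text/plain` media type is an assumption; the URI that docarray produces may be formatted differently.

```python
import base64 as b64
from urllib.parse import quote


def text_to_datauri(text: str, charset: str = "utf-8", base64: bool = False,
                    mimetype: str = "text/plain") -> str:
    """Hypothetical helper: encode text as a data URI."""
    data = text.encode(charset)
    if base64:
        # Base64 keeps the payload within 7-bit ASCII.
        payload = b64.b64encode(data).decode("ascii")
        return f"data:{mimetype};charset={charset};base64,{payload}"
    # Otherwise percent-encode the bytes that are not URI-safe.
    return f"data:{mimetype};charset={charset},{quote(data)}"


plain = text_to_datauri("hello world")
encoded = text_to_datauri("hello world", base64=True)
```

Percent-encoding stays readable for mostly-ASCII text, while Base64 tends to be more compact when many bytes fall outside the ASCII range.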