docarray.document.generators module#
- docarray.document.generators.from_ndarray(array, axis=0, size=None, shuffle=False, *args, **kwargs)[source]#
Create a generator for a given dimension of a numpy array.
- Parameters
  - array (np.ndarray) – the numpy ndarray data source
  - axis (int) – the axis to iterate over
  - size (Optional[int]) – the maximum number of sub-arrays
  - shuffle (bool) – shuffle the numpy data source beforehand
- Yield
documents
- Return type
Generator['Document', None, None]
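The axis/size/shuffle semantics above can be sketched in plain Python. This is a minimal illustration of the documented behavior for axis=0, not the library implementation; the helper name `iter_rows` is made up for this example:

```python
import random
from typing import Iterator, List, Optional

def iter_rows(rows: List[List[float]], size: Optional[int] = None,
              shuffle: bool = False, seed: Optional[int] = None) -> Iterator[List[float]]:
    """Yield sub-arrays (rows) from a 2-D list, mimicking the documented
    size/shuffle options of from_ndarray for axis=0."""
    if shuffle:
        rows = rows[:]                      # copy, so the caller's data is untouched
        random.Random(seed).shuffle(rows)
    for i, row in enumerate(rows):
        if size is not None and i >= size:  # cap the number of yielded items
            return
        yield row

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(list(iter_rows(data, size=2)))  # [[1.0, 2.0], [3.0, 4.0]]
```

With a real numpy array and the actual generator, each yielded sub-array is wrapped in a Document before being yielded.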
- docarray.document.generators.from_files(patterns, recursive=True, size=None, sampling_rate=None, read_mode=None, to_dataturi=False, exclude_regex=None, *args, **kwargs)[source]#
Create an iterator over a list of file paths or the content of the files.
- Parameters
  - patterns (Union[str, List[str]]) – the pattern may contain simple shell-style wildcards, e.g. '*.py', '[*.zip, *.gz]'
  - recursive (bool) – if true, the pattern '**' will match any files and zero or more directories and subdirectories
  - size (Optional[int]) – the maximum number of files
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - read_mode (Optional[str]) – specifies the mode in which the file is opened: 'r' for reading in text mode, 'rb' for reading in binary mode. If read_mode is None, iterate over filenames only.
  - to_dataturi (bool) – if set, Document.uri will be filled with a DataURI instead of the plain URI
  - exclude_regex (Optional[str]) – if set, filenames that match this pattern are not included
- Yield
file paths or binary content
Note
This function should not be used directly; use Flow.index_files() or Flow.search_files() instead.
- Return type
Generator['Document', None, None]
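The pattern/recursive/size/read_mode/exclude_regex options can be sketched with the standard library. This is an illustrative approximation of the behavior described above, not docarray's implementation; the helper name `iter_files` is hypothetical:

```python
import os
import re
import tempfile
from glob import glob
from typing import Iterator, List, Optional, Union

def iter_files(patterns: Union[str, List[str]], recursive: bool = True,
               size: Optional[int] = None, read_mode: Optional[str] = None,
               exclude_regex: Optional[str] = None) -> Iterator:
    """Yield file paths (read_mode=None) or file contents, mimicking the
    documented options of from_files."""
    if isinstance(patterns, str):
        patterns = [patterns]
    num = 0
    for pattern in patterns:
        for path in sorted(glob(pattern, recursive=recursive)):
            if exclude_regex and re.search(exclude_regex, os.path.basename(path)):
                continue                        # excluded by filename pattern
            if size is not None and num >= size:
                return
            num += 1
            if read_mode is None:
                yield path                      # iterate over filenames only
            else:
                with open(path, read_mode) as f:
                    yield f.read()              # 'r' -> str, 'rb' -> bytes

with tempfile.TemporaryDirectory() as d:
    for name in ('a.txt', 'b.txt', 'c.log'):
        with open(os.path.join(d, name), 'w') as f:
            f.write(name)
    txt = list(iter_files(os.path.join(d, '*.txt'), read_mode='r'))
    print(txt)  # ['a.txt', 'b.txt']
```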
- docarray.document.generators.from_csv(file, field_resolver=None, size=None, sampling_rate=None, dialect='excel', encoding='utf-8', *args, **kwargs)[source]#
Generator function for CSV. Yields documents.
- Parameters
  - file (Union[str, TextIO]) – a file path or file handler
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in the JSON/dict to the field names defined in Document
  - size (Optional[int]) – the maximum number of documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - dialect (Union[str, 'Dialect']) – defines a set of parameters specific to a particular CSV dialect. It can be a string that represents a predefined dialect on your system, or a csv.Dialect class that groups specific formatting parameters together. If you don't know the dialect and the default one does not work for you, you can try setting it to 'auto'.
  - encoding (str) – encoding used to read the CSV file. By default, 'utf-8' is used.
- Yield
documents
- Return type
Generator['Document', None, None]
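How `field_resolver` remaps CSV columns onto Document field names can be sketched with the standard csv module. A minimal stdlib illustration, assuming the mapping semantics described above (source column name to Document field name); the helper `iter_csv` is made up:

```python
import csv
import io
from typing import Dict, Iterator, Optional

def iter_csv(file_like, field_resolver: Optional[Dict[str, str]] = None,
             dialect: str = 'excel') -> Iterator[dict]:
    """Yield one dict per CSV row, renaming columns via field_resolver,
    e.g. mapping a CSV column 'body' onto a Document field 'text'."""
    field_resolver = field_resolver or {}
    for row in csv.DictReader(file_like, dialect=dialect):
        yield {field_resolver.get(k, k): v for k, v in row.items()}

raw = io.StringIO('id,body\n1,hello\n2,world\n')
docs = list(iter_csv(raw, field_resolver={'body': 'text'}))
print(docs)  # [{'id': '1', 'text': 'hello'}, {'id': '2', 'text': 'world'}]
```

In the real generator each remapped row is used to construct a Document rather than a plain dict.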
- docarray.document.generators.from_huggingface_datasets(dataset_path, field_resolver=None, size=None, sampling_rate=None, filter_fields=False, **datasets_kwargs)[source]#
Generator function for Hugging Face Datasets. Yields documents.
This function helps to load datasets from the Hugging Face Datasets Hub (https://huggingface.co/datasets) in Jina. Additional parameters can be passed to the datasets library using keyword arguments. The load_dataset method from the datasets library is used to load the datasets.
- Parameters
  - dataset_path (str) – a valid dataset path for the Hugging Face Datasets library
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
  - filter_fields (bool) – specifies whether to filter the dataset with the fields given in the field_resolver argument
  - **datasets_kwargs – additional arguments for the load_dataset method from the Datasets library. More details at https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset
- Yield
documents
- Return type
Generator['Document', None, None]
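The interaction between `field_resolver` and `filter_fields` can be sketched on plain dict records. This is a hedged stdlib illustration of the semantics described above (with `filter_fields=True`, only the keys listed in the resolver are kept before remapping), not the library implementation; `iter_records` and the sample rows are hypothetical:

```python
from typing import Dict, Iterator, List

def iter_records(records: List[dict], field_resolver: Dict[str, str],
                 filter_fields: bool = False) -> Iterator[dict]:
    """Remap record keys via field_resolver; with filter_fields=True,
    drop every key not listed in the resolver before remapping."""
    for rec in records:
        if filter_fields:
            rec = {k: v for k, v in rec.items() if k in field_resolver}
        yield {field_resolver.get(k, k): v for k, v in rec.items()}

rows = [{'sentence': 'hi', 'label': 1, 'idx': 0}]
print(list(iter_records(rows, {'sentence': 'text'}, filter_fields=True)))
# [{'text': 'hi'}]
```

A real call would look like `from_huggingface_datasets('glue', name='cola', split='train', field_resolver={'sentence': 'text'})`, with the extra keyword arguments forwarded to load_dataset.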
- docarray.document.generators.from_ndjson(fp, field_resolver=None, size=None, sampling_rate=None, *args, **kwargs)[source]#
Generator function for line-separated JSON. Yields documents.
- Parameters
  - fp (Iterable[str]) – file paths
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
- Yield
documents
- Return type
Generator['Document', None, None]
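The line-separated JSON parsing, together with the size and sampling_rate options, can be sketched with the standard json module. A minimal stdlib sketch of the documented semantics, not the library implementation; the helper `iter_ndjson` is made up:

```python
import json
import random
from typing import Iterator, Optional

def iter_ndjson(lines, size: Optional[int] = None,
                sampling_rate: Optional[float] = None,
                seed: Optional[int] = None) -> Iterator[dict]:
    """Parse one JSON document per line, optionally keeping each line with
    probability sampling_rate and capping the total at size."""
    rng = random.Random(seed)
    num = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if sampling_rate is not None and rng.random() > sampling_rate:
            continue                      # randomly drop this line
        if size is not None and num >= size:
            return
        num += 1
        yield json.loads(line)

ndjson = ['{"id": 1, "text": "a"}', '{"id": 2, "text": "b"}']
print(list(iter_ndjson(ndjson)))
# [{'id': 1, 'text': 'a'}, {'id': 2, 'text': 'b'}]
```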
- docarray.document.generators.from_lines(lines=None, filepath=None, read_mode='r', line_format='json', field_resolver=None, size=None, sampling_rate=None)[source]#
Generator function for lines, JSON and CSV. Yields documents or strings.
- Parameters
  - lines (Optional[Iterable[str]]) – a list of strings, each of which is considered a document
  - filepath (Optional[str]) – a text file in which each line contains a document
  - read_mode (str) – specifies the mode in which the file is opened: 'r' for reading in text mode, 'rb' for reading in binary mode
  - line_format (str) – the format of each line, json or csv
  - field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
  - size (Optional[int]) – the maximum number of documents
  - sampling_rate (Optional[float]) – the sampling rate between [0, 1]
- Yield
documents
- Return type
Generator['Document', None, None]
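The line_format dispatch can be sketched with the standard library. This is an illustrative approximation of the documented json/csv behavior, not docarray's implementation; the helper `iter_lines` and its plain-text fallback branch are made up for this example:

```python
import csv
import io
import json
from typing import Iterator

def iter_lines(lines, line_format: str = 'json') -> Iterator:
    """Dispatch on line_format: parse each line as JSON, treat the lines
    as a CSV document, or pass each line through as a plain string."""
    if line_format == 'json':
        for line in lines:
            yield json.loads(line)
    elif line_format == 'csv':
        yield from csv.DictReader(io.StringIO('\n'.join(lines)))
    else:
        yield from lines                  # plain text: one document per line

print(list(iter_lines(['{"a": 1}'], line_format='json')))   # [{'a': 1}]
print(list(iter_lines(['x,y', '1,2'], line_format='csv')))  # [{'x': '1', 'y': '2'}]
```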