docarray.array.mixins.match module

class docarray.array.mixins.match.MatchMixin

Bases: object

A mixin that provides match functionality to DocumentArrays.

match(darray, metric='cosine', limit=20, normalization=None, metric_name=None, batch_size=None, exclude_self=False, only_id=False, use_scipy=False, device='cpu', num_worker=1, **kwargs)

For each Document in self, compute its embedding-based nearest neighbours in another DocumentArray, and store the results in matches.

Note

'cosine', 'euclidean' and 'sqeuclidean' are supported natively without extra dependencies.
You can use other distance metrics provided by ``scipy``, such as `braycurtis`, `canberra`, `chebyshev`,
`cityblock`, `correlation`, `cosine`, `dice`, `euclidean`, `hamming`, `jaccard`, `jensenshannon`,
`kulsinski`, `mahalanobis`, `matching`, `minkowski`, `rogerstanimoto`, `russellrao`, `seuclidean`,
`sokalmichener`, `sokalsneath`, `sqeuclidean`, `wminkowski`, `yule`.
To use a scipy metric, set ``use_scipy=True``.
  • To rescale all match values into [0, 1], use dA.match(dB, normalization=(0, 1)).

  • To invert the distance into a score and rescale all values into [0, 1], use dA.match(dB, normalization=(1, 0)). Note that the order of the tuple is reversed compared with the previous case.

  • If a custom distance metric is provided, make sure that it returns distances and not similarities, meaning that smaller values are better.
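
The effect of the normalization tuple described above can be sketched in plain Python. This is an illustration of min-max rescaling only, not the library's implementation: `min_max_rescale` is a hypothetical helper, and the handling of the case where all distances are equal is an assumption.

```python
from typing import List, Sequence


def min_max_rescale(distances: Sequence[float], a: float, b: float) -> List[float]:
    """Hypothetical helper: map the smallest distance to a and the largest to b."""
    d_min, d_max = min(distances), max(distances)
    if d_max == d_min:
        # Assumption: when all distances are equal, map every value to a.
        return [float(a) for _ in distances]
    scale = (b - a) / (d_max - d_min)
    return [a + (d - d_min) * scale for d in distances]


distances = [2.0, 4.0, 6.0]

# normalization=(0, 1): values stay distances, the closest match gets 0.
as_unit_range = min_max_rescale(distances, 0, 1)

# normalization=(1, 0): values become scores, the closest match gets 1.
as_scores = min_max_rescale(distances, 1, 0)

print(as_unit_range, as_scores)
```

With (0, 1) the ordering of the values is preserved, so smaller still means closer; with (1, 0) the ordering is flipped, so larger means closer.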

Parameters

  • darray (DocumentArray) – the other DocumentArray to match against

  • metric (Union[str, Callable[[ForwardRef, ForwardRef], ForwardRef]]) – the distance metric

  • limit (Union[int, float, None]) – the maximum number of matches; defaults to 20 when not given.

  • normalization (Optional[Tuple[float, float]]) – a tuple (a, b) to be used with min-max normalization: the minimum distance is rescaled to a, the maximum distance is rescaled to b, and all other values are rescaled into the range between a and b.

  • metric_name (Optional[str]) – if provided, the match results will be marked with this string.

  • batch_size (Optional[int]) – if provided, darray is loaded in batches of at most batch_size elements each. When darray is big, this can significantly speed up the computation.

  • exclude_self (bool) – if set, Documents in darray with the same id as the left-hand Document will not be considered as matches.

  • only_id (bool) – if set, the returned matches will contain only the id.

  • use_scipy (bool) – if set, use scipy as the computation backend. Note that scipy does not support distance computation on sparse matrices.

  • device (str) – the computational device for .match(), can be either cpu or cuda.

  • num_worker (Optional[int]) – the number of parallel workers. If not given, then the number of CPUs in the system will be used. This argument is only effective when batch_size is set.

  • kwargs – other kwargs.
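
To make the roles of limit and exclude_self concrete, here is a minimal brute-force sketch in plain Python. It is a conceptual illustration under stated assumptions, not how the library computes matches: `cosine_distance` and `brute_force_match` are hypothetical names, Documents are modelled as an id-to-embedding dict, and embeddings are assumed to be non-zero vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(x: Vector, y: Vector) -> float:
    """Cosine distance (1 - cosine similarity): smaller means closer."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return 1.0 - dot / norm  # assumes both vectors are non-zero


def brute_force_match(
    left: Dict[str, Vector],
    right: Dict[str, Vector],
    limit: int = 20,
    exclude_self: bool = False,
) -> Dict[str, List[Tuple[str, float]]]:
    """Hypothetical sketch: for each left id, return up to `limit` nearest right ids."""
    results = {}
    for left_id, left_vec in left.items():
        scored = [
            (right_id, cosine_distance(left_vec, right_vec))
            for right_id, right_vec in right.items()
            # exclude_self drops the candidate that shares the left-hand id
            if not (exclude_self and right_id == left_id)
        ]
        scored.sort(key=lambda pair: pair[1])  # ascending distance
        results[left_id] = scored[:limit]
    return results


left = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
right = {"a": [1.0, 0.0], "c": [1.0, 1.0], "d": [0.0, 1.0]}

matches = brute_force_match(left, right, limit=2, exclude_self=True)
nearest_ids = {key: [rid for rid, _ in found] for key, found in matches.items()}
print(nearest_ids)
```

Without exclude_self, the entry "a" on the right would be the top match for "a" on the left, because its distance to itself is zero.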

Return type