Embed via Neural Network#

Important

embed() supports both CPU & GPU.

When DocumentArray has .tensors set, you can use a neural network to embed() it into vector representations, i.e. filling .embeddings. For example, our DocumentArray looks like the following:

from docarray import DocumentArray
import numpy as np

docs = DocumentArray.empty(10)
docs.tensors = np.random.random([10, 128]).astype(np.float32)

Let’s use a simple MLP in Pytorch/Keras/ONNX/Paddle as our embedding model:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(
        in_features=128,
        out_features=128,
    ),
    torch.nn.ReLU(),
    torch.nn.Linear(in_features=128, out_features=32))
import tensorflow as tf

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(32),
    ]
)

Preliminary: you need to first export a DNN model to ONNX via API/CLI. For example let’s use the PyTorch one:

data = torch.rand(1, 128)

torch.onnx.export(model, data, 'mlp.onnx', 
    do_constant_folding=True,  # whether to execute constant folding for optimization
    input_names=['input'],  # the model's input names
    output_names=['output'],  # the model's output names
    dynamic_axes={
        'input': {0: 'batch_size'},  # variable length axes
        'output': {0: 'batch_size'},
    })

Then load it as InferenceSession:

import onnxruntime

model = onnxruntime.InferenceSession('mlp.onnx')
import paddle

model = paddle.nn.Sequential(
    paddle.nn.Linear(
        in_features=128,
        out_features=128,
    ),
    paddle.nn.ReLU(),
    paddle.nn.Linear(in_features=128, out_features=32),
)

Now, you can simply do

docs.embed(model)

print(docs.embeddings)
tensor([[-0.1234,  0.0506, -0.0015,  0.1154, -0.1630, -0.2376,  0.0576, -0.4109,
          0.0052,  0.0027,  0.0800, -0.0928,  0.1326, -0.2256,  0.1649, -0.0435,
         -0.2312, -0.0068, -0.0991,  0.0767, -0.0501, -0.1393,  0.0965, -0.2062,

By default, the filled .embeddings is in the given model framework’s format. If you want it always be numpy.ndarray, use .embed(..., to_numpy=True).

You can specify .embed(..., device='cuda') when working with GPU. The device name identifier depends on the model framework that you are using.

On large DocumentArray that does not fit into GPU memory, you can set batch_size via .embed(..., batch_size=128).

You can use pretrained model from Keras/PyTorch/PaddlePaddle/ONNX for embedding:

import torchvision
model = torchvision.models.resnet50(pretrained=True)
docs.embed(model)

After getting .embeddings, you can visualize it using plot_embeddings(), find more details here.

Note that .embed() only works when you have .tensors set, if you have .texts set and your model function supports string as the input, then you can always do the following to get embeddings:

from docarray import DocumentArray

da = DocumentArray(...)
da.embeddings = my_text_model(da.texts)