Embedding#
Embedding is a multi-dimensional representation of a Document, often a [1, D] vector. Embeddings are a cornerstone of machine learning, and the attribute .embedding is designed to hold the embedding of a Document.
Like .tensor, you can assign it a Python (nested) List/Tuple, a NumPy ndarray, a SciPy sparse matrix (spmatrix), a TensorFlow dense or sparse tensor, a PyTorch dense or sparse tensor, or a PaddlePaddle dense tensor.
import numpy as np
import scipy.sparse as sp
import torch
import tensorflow as tf
from docarray import Document

d0 = Document(embedding=[1, 2, 3])  # Python list
d1 = Document(embedding=np.array([1, 2, 3]))  # NumPy ndarray
d2 = Document(embedding=np.array([[1, 2, 3], [4, 5, 6]]))  # 2D NumPy ndarray
d3 = Document(embedding=sp.coo_matrix([0, 0, 0, 1, 0]))  # SciPy sparse matrix
d4 = Document(embedding=torch.tensor([1, 2, 3]))  # PyTorch tensor
d5 = Document(embedding=tf.sparse.from_dense(np.array([[1, 2, 3], [4, 5, 6]])))  # TensorFlow sparse tensor
Unlike some other packages, DocArray does not silently cast the dtype to float32. If the right-hand side of the assignment is a PyTorch float64 tensor, the embedding stays a PyTorch float64 tensor.
To assign .tensor and .embedding of multiple Documents in bulk, use a DocumentArray: it is much faster and smarter than a for-loop.
Fill embedding via neural network#
On multiple Documents use DocumentArray
To embed multiple Documents, do not use this feature in a for-loop. Instead, put all Documents into a DocumentArray and call .embed() on it. You can find out more in Embed via Neural Network.
Usually you don’t want to assign the embedding manually; instead, you do something like:
d.tensor \
d.text ---> some DNN model ---> d.embedding
d.blob /
Once a Document has a content field set, you can use a deep neural network to embed() it, which means filling in .embedding. For example, say our Document looks like the following:
q = (Document(uri='/Users/hanxiao/Downloads/left/00003.jpg')
.load_uri_to_image_tensor()
.set_image_tensor_normalization()
.set_image_tensor_channel_axis(-1, 0))
Let’s embed it into a vector via ResNet50:
import torchvision

# Load an ImageNet-pretrained ResNet50 and use it to fill q.embedding
model = torchvision.models.resnet50(pretrained=True)
q.embed(model)
Find nearest neighbours#
On multiple Documents use DocumentArray
To match multiple Documents, do not use this feature in a for-loop. You can find out more in Match Nearest Neighbours.
Documents that have .embedding set can be “matched” against each other. In this example, we build ten Documents, put them into a DocumentArray, and then use another Document to search against them.
from docarray import DocumentArray, Document
import numpy as np
da = DocumentArray.empty(10)
da.embeddings = np.random.random([10, 256])
q = Document(embedding=np.random.random([256]))
q.match(da)
q.summary()
<Document ('id', 'embedding', 'matches') at 63a39fa86d6911eca6fa1e008a366d49>
└─ matches
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a39aee6d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a399d66d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a39b346d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a3999a6d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a39a626d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a397ba6d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a39a1c6d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a39ab26d6911eca6fa1e008a366d49>
├─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a399046d6911eca6fa1e008a366d49>
└─ <Document ('id', 'adjacency', 'embedding', 'scores') at 63a399546d6911eca6fa1e008a366d49>
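Conceptually, .match computes pairwise distances (cosine distance, which I believe is the default metric) between the query’s embedding and every embedding in the DocumentArray, then ranks the Documents from closest to farthest. A NumPy-only sketch of that computation, with stand-in arrays for da.embeddings and q.embedding:

```python
import numpy as np


def cosine_distance(q, X):
    """Cosine distance between a query vector q and each row of matrix X."""
    sims = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    return 1 - sims


X = np.random.random([10, 256])  # stand-in for da.embeddings
q = np.random.random(256)        # stand-in for q.embedding

dists = cosine_distance(q, X)
ranking = np.argsort(dists)  # indices into da, closest first
```

The Documents in q.matches above are ordered exactly like this ranking, with the per-match distance stored in each match’s scores field.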