feat: subindex for all document stores by JohannesMessner · Pull Request #456 · docarray/docarray

JohannesMessner · 2022-07-28T14:54:06Z

This draft PR stores secondary indices (subindices) as a dict of separate DocumentArrays.

Example usage:

This feature lets you store a separate index for a given nesting level, and then search through it.

This feature works with access paths like '@c' and '@cc', and also with multi-modal access paths like '@.[image]'. Example:

from docarray import dataclass, Document, DocumentArray
from docarray.typing import Image, Text


@dataclass
class Page:
    main_text: Text
    image: Image
    description: Text


query_page = Page(
    main_text='Hello world',
    image='testflow.jpg',
    description='This is the image of an apple',
)

query = Document(query_page)  # our query Document

da = DocumentArray(
    [
        Document(
            Page(
                main_text='First page',
                image='apple.png',
                description='This is the image of an apple',
            )
        ),
        Document(
            Page(
                main_text='Second page',
                image='testflow.jpg',
                description='This is an image of a pear',
            )
        ),
    ],
    subindex_configs={'@.[image]': None},  # specify subindices at DocumentArray creation time
)


from torchvision.models import resnet50

img_model = resnet50(pretrained=True)

# embed query
query.image.set_image_tensor_shape(shape=(224, 224)).set_image_tensor_channel_axis(
    original_channel_axis=-1, new_channel_axis=0
).set_image_tensor_normalization(channel_axis=0).embed(img_model)

# embed dataset
da['@.[image]'].apply(
    lambda d: d.set_image_tensor_shape(shape=(224, 224))
    .set_image_tensor_channel_axis(original_channel_axis=-1, new_channel_axis=0)
    .set_image_tensor_normalization(channel_axis=0)
).embed(img_model)


closest_match_img = da.find(query.image, on='@.[image]')[0][0]  # search through subindex using `on=`
print('CLOSEST IMAGE:')
closest_match_img.summary()
print('PAGE WITH THE CLOSEST IMAGE:')
closest_match_page = da[closest_match_img.parent_id]
closest_match_page.summary()
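
To make the idea behind the example easier to follow, here is a minimal, library-free sketch of what "a dict of separate indices, one per access path" could look like. It is not the implementation in this PR: `Doc`, `ToyStore`, and `cosine_distance` are hypothetical names invented for illustration, only plain '@c'-style paths are handled, and the search is a brute-force cosine ranking.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a Document (not the docarray class)."""

    id: str
    embedding: Optional[List[float]] = None
    parent_id: Optional[str] = None
    chunks: List["Doc"] = field(default_factory=list)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class ToyStore:
    """Root documents plus one separate index per configured access path."""

    def __init__(self, docs: List[Doc], subindex_paths: Sequence[str] = ()):
        self.docs: Dict[str, Doc] = {d.id: d for d in docs}
        # Assumption: subindices are kept in a dict keyed by access path.
        self.subindices: Dict[str, Dict[str, Doc]] = {}
        for path in subindex_paths:
            self.subindices[path] = {d.id: d for d in self._select(path)}

    def _select(self, path: str) -> List[Doc]:
        """Resolve '@c', '@cc', ... by descending one chunk level per 'c'."""
        if not path.startswith('@') or set(path[1:]) != {'c'}:
            raise ValueError(f'this sketch only supports @c-style paths: {path!r}')
        level = list(self.docs.values())
        for _ in path[1:]:
            level = [chunk for doc in level for chunk in doc.chunks]
        return level

    def find(self, query: Sequence[float], on: Optional[str] = None) -> List[Doc]:
        """Rank documents by cosine distance, at root level or in a subindex."""
        pool = self.subindices[on] if on is not None else self.docs
        candidates = [d for d in pool.values() if d.embedding is not None]
        return sorted(candidates, key=lambda d: cosine_distance(query, d.embedding))


pages = [
    Doc(id='page-1', chunks=[Doc(id='img-1', embedding=[1.0, 0.0], parent_id='page-1')]),
    Doc(id='page-2', chunks=[Doc(id='img-2', embedding=[0.0, 1.0], parent_id='page-2')]),
]
store = ToyStore(pages, subindex_paths=('@c',))

closest = store.find([0.1, 0.9], on='@c')[0]  # search the chunk-level subindex
parent = store.docs[closest.parent_id]  # walk back up to the owning page
print(closest.id, parent.id)  # prints: img-2 page-2
```

As in the example above, the match comes from the subindex and `parent_id` leads back to the root-level document that owns it.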

Design doc: https://docs.google.com/presentation/d/1rntTa1Ur2WmdAvOUEI01l2WeNSWgtqk2xjth_gdIr50/edit#slide=id.g13443d1ddb2_1_23

This PR is a continuation of #403.

codecov · 2022-07-28T15:00:30Z

Codecov Report

Merging #456 (4db4586) into main (933ddf0) will increase coverage by 2.64%.
The diff coverage is 96.47%.

@@            Coverage Diff             @@
##             main     #456      +/-   ##
==========================================
+ Coverage   83.37%   86.02%   +2.64%     
==========================================
  Files         134      134              
  Lines        6533     6640     +107     
==========================================
+ Hits         5447     5712     +265     
+ Misses       1086      928     -158

| Flag | Coverage Δ | |
|---|---|---|
| docarray | `86.02% <96.47%> (+2.64%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| docarray/array/document.py | `71.92% <ø> (ø)` | |
| docarray/array/mixins/match.py | `75.00% <ø> (ø)` | |
| docarray/array/storage/annlite/find.py | `93.33% <ø> (ø)` | |
| docarray/array/storage/elastic/getsetdel.py | `100.00% <ø> (+4.54%)` | ⬆️ |
| docarray/array/storage/memory/find.py | `91.22% <ø> (ø)` | |
| docarray/array/storage/base/getsetdel.py | `90.90% <88.88%> (+12.64%)` | ⬆️ |
| docarray/array/mixins/find.py | `87.62% <90.00%> (+0.27%)` | ⬆️ |
| docarray/array/storage/base/backend.py | `87.80% <93.33%> (+2.61%)` | ⬆️ |
| docarray/array/storage/base/seqlike.py | `86.84% <93.75%> (+6.07%)` | ⬆️ |
| docarray/__init__.py | `75.00% <100.00%> (ø)` | |
| ... and 39 more | | |

github-actions · 2022-08-09T07:28:06Z