Skip to content

feat: subindex for all document stores#456

Merged
nan-wang merged 72 commits intomainfrom
feat-secondary-index
Aug 10, 2022
Merged

feat: subindex for all document stores#456
nan-wang merged 72 commits intomainfrom
feat-secondary-index

Conversation

@JohannesMessner
Copy link
Copy Markdown
Member

@JohannesMessner JohannesMessner commented Jul 28, 2022

This PR is a draft for storing secondary indices as a dict with different docarrays

ToDo:

  • .find()
  • .match()
    • tests
  • add
  • delete
  • update
  • bug: multi-modal access path. Fix here: fix: proper setting of multimodal attributes #448
  • tests for multimodal (added in find and del)
  • Decide on syntax and argument names
    • Implement new nomenclature
  • Implement InMemory as syntax sugar
  • Validate/test other backends
    • ES
    • Qdrant
    • Weaviate
    • SQLite
  • set subindices correctly when chunks* are given at init
  • return subindex when corresponding access path is passed
  • Documentation
    • for end user
    • for backend devs

Example usage:

This feature lets you store a separate index for a given nesting level, and then search through it.

This feature works with access paths like '@c' and '@cc', but also with multi-modal access paths like '@.[image]. Example:

from docarray import dataclass, Document, DocumentArray
from docarray.typing import Image, Text


@dataclass
class Page:
    main_text: Text
    image: Image
    description: Text


query_page = Page(
    main_text='Hello world',
    image='testflow.jpg',
    description='This is the image of an apple',
)

query = Document(query_page)  # our query Document

da = DocumentArray(
    [
        Document(
            Page(
                main_text='First page',
                image='apple.png',
                description='This is the image of an apple',
            )
        ),
        Document(
            Page(
                main_text='Second page',
                image='testflow.jpg',
                description='This is an image of a pear',
            )
        ),
    ],
    subindex_configs = {'@.[image]': None})  # specify subindices at docarray creation time


from torchvision.models import resnet50

img_model = resnet50(pretrained=True)

# embed query
query.image.set_image_tensor_shape(shape=(224, 224)).set_image_tensor_channel_axis(
    original_channel_axis=-1, new_channel_axis=0
).set_image_tensor_normalization(channel_axis=0).embed(img_model)

# embed dataset
da['@.[image]'].apply(
    lambda d: d.set_image_tensor_shape(shape=(224, 224))
    .set_image_tensor_channel_axis(original_channel_axis=-1, new_channel_axis=0)
    .set_image_tensor_normalization(channel_axis=0)
).embed(img_model)


closest_match_img = da.find(query.image, on='@.[image]')[0][0]  # search through subindex using `on=`
print('CLOSEST IMAGE:')
closest_match_img.summary()
print('PAGE WITH THE CLOSEST IMAGE:')
closest_match_page = da[closest_match_img.parent_id]
closest_match_page.summary()

Design doc:

Design doc: https://docs.google.com/presentation/d/1rntTa1Ur2WmdAvOUEI01l2WeNSWgtqk2xjth_gdIr50/edit#slide=id.g13443d1ddb2_1_23

Continuation of this: #403

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 28, 2022

Codecov Report

Merging #456 (4db4586) into main (933ddf0) will increase coverage by 2.64%.
The diff coverage is 96.47%.

@@            Coverage Diff             @@
##             main     #456      +/-   ##
==========================================
+ Coverage   83.37%   86.02%   +2.64%     
==========================================
  Files         134      134              
  Lines        6533     6640     +107     
==========================================
+ Hits         5447     5712     +265     
+ Misses       1086      928     -158     
Flag Coverage Δ
docarray 86.02% <96.47%> (+2.64%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/array/document.py 71.92% <ø> (ø)
docarray/array/mixins/match.py 75.00% <ø> (ø)
docarray/array/storage/annlite/find.py 93.33% <ø> (ø)
docarray/array/storage/elastic/getsetdel.py 100.00% <ø> (+4.54%) ⬆️
docarray/array/storage/memory/find.py 91.22% <ø> (ø)
docarray/array/storage/base/getsetdel.py 90.90% <88.88%> (+12.64%) ⬆️
docarray/array/mixins/find.py 87.62% <90.00%> (+0.27%) ⬆️
docarray/array/storage/base/backend.py 87.80% <93.33%> (+2.61%) ⬆️
docarray/array/storage/base/seqlike.py 86.84% <93.75%> (+6.07%) ⬆️
docarray/__init__.py 75.00% <100.00%> (ø)
... and 39 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@github-actions github-actions bot added size/l and removed size/m labels Jul 28, 2022
Comment thread docs/advanced/document-store/subindex.md Outdated
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 9, 2022

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 9, 2022

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@JohannesMessner JohannesMessner marked this pull request as ready for review August 9, 2022 07:37
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 9, 2022

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

Comment thread docs/advanced/document-store/subindex.md
Comment thread docs/advanced/document-store/subindex.md
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 9, 2022

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

Copy link
Copy Markdown
Member

@samsja samsja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. it is a great PR !

Comment thread tests/unit/array/mixins/test_getset.py Outdated
@github-actions
Copy link
Copy Markdown

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

Copy link
Copy Markdown
Collaborator

@nan-wang nan-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

Comment thread docs/advanced/document-store/subindex.md
Comment thread docs/advanced/document-store/subindex.md Outdated
Comment thread docs/advanced/document-store/subindex.md
Comment thread tests/unit/array/mixins/test_getset.py Outdated
@github-actions
Copy link
Copy Markdown

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link
Copy Markdown

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link
Copy Markdown

📝 Docs are deployed on https://ft-feat-secondary-index--jina-docs.netlify.app 🎉

Copy link
Copy Markdown
Collaborator

@nan-wang nan-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👍

@nan-wang nan-wang merged commit ec7dcce into main Aug 10, 2022
@nan-wang nan-wang deleted the feat-secondary-index branch August 10, 2022 13:46
if getattr(self, '_subindices', None):
for selector, da in self._subindices.items():
ids_subindex = DocumentArray(self[ids])[selector, 'id']
da._del_docs_by_ids(ids_subindex)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here should be da._del_docs(ids_subindex), to also delete ids in self._offset2ids

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants