fix: proper setting of multimodal attributes by JohannesMessner · Pull Request #448 · docarray/docarray

JohannesMessner · 2022-07-25T10:04:33Z

Previously, setting a multi-modal attribute did not set chunks properly. This resulted in only Document level access working, while DocumentArray level access was broken.
This PR fixes this.

The previous tests were not comprehensive enough to catch this flaw. This PR adds tests to that effect.

codecov · 2022-07-25T10:13:24Z

Codecov Report

Merging #448 (d6ec766) into main (bb132b3) will increase coverage by 2.14%.
The diff coverage is 93.10%.

@@            Coverage Diff             @@
##             main     #448      +/-   ##
==========================================
+ Coverage   84.50%   86.64%   +2.14%     
==========================================
  Files         134      134              
  Lines        6497     6516      +19     
==========================================
+ Hits         5490     5646     +156     
+ Misses       1007      870     -137

Flag	Coverage Δ
docarray	`86.64% <93.10%> (+2.14%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
docarray/document/mixins/multimodal.py	`93.33% <92.85%> (-0.86%)`	⬇️
docarray/__init__.py	`75.00% <100.00%> (ø)`
docarray/array/mixins/setitem.py	`81.81% <0.00%> (+0.82%)`	⬆️
docarray/array/storage/weaviate/find.py	`78.48% <0.00%> (+1.26%)`	⬆️
docarray/base.py	`98.64% <0.00%> (+1.35%)`	⬆️
docarray/array/storage/sqlite/seqlike.py	`84.31% <0.00%> (+1.96%)`	⬆️
docarray/array/mixins/find.py	`87.35% <0.00%> (+2.29%)`	⬆️
docarray/array/mixins/io/binary.py	`97.45% <0.00%> (+2.54%)`	⬆️
docarray/array/storage/weaviate/seqlike.py	`77.41% <0.00%> (+3.22%)`	⬆️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4748fe5...d6ec766. Read the comment docs.

JoanFM

LGTM, just a comment

JoanFM · 2022-07-25T11:24:44Z

+                f'attribute {attr} is not a valid multi modal attribute.'
+                f' One possible cause is the usage of a non-supported type in the dataclass definition.'
            )
+        return int(pos)


when would this not be an int?

Not sure to be honest, in the original implementation this was done (but in a slightly different place), so I kept it.

JoanFM · 2022-07-25T11:26:19Z

+        if attr not in self._metadata['multi_modal_schema']:
+            raise ValueError(f'the Document schema does not contain attribute {attr}')
+
+        pos = self._metadata['multi_modal_schema'][attr].get('position')


I wonder if we can have a list of all these specific KEYS and their meaning so that we do not have them scattered around the code

Good point, but I would also have to collect them all around the code base first.

we can have a separate ticket

Hello, my name is Nicholas, I am new here, but I have been doing something for a small learning and automation time saving project that I could recreate and modify a bit and I do believe that it could automate the task of identifying your "keys / Symbolic-Arguments via first performing a search of all the standard source code files and read me documentations, located in the projects collection. As long as the files are encoded in standard us English text char encoding it wouldn't matter whether they be .py, .js, whatever. The agent would simply be searching through the source code looking for all the comment blocks, and extracting those as well as all the read me and help documentation files. Once it has collected at least the majority of said comments and documentation text from all included project files, it would count up the number of times repeatedly used, but non standard/non-dictionary words which are found to occur in a frequency that is above the statistical expectation for a similar non-dictionary phrase if not being used to convey some symbolic meaning or as a proxy for an argument/variable. It could then collect all such possible "KEYS" you guys are discussing, trying to create a unified index to be used by all code contributors and future users, and then at least all the charector strings which are repetitively utalied in the source comments and or documentation would be neatly indexed, in order of most frequent to least if that would be also helpful :). Would you guys think me creating a small program/script to do that would actually be very helpful? I know it's not a huge thing, but I am just getting into ML and I could certainly take on something like that. Once the list of possible keys has been created, of course, they should be discussed amongst the core staff and frequent contributors, I think. In fact, I would be okay with simply creating the tool, and then handing the code and the tool with its simple documentation and usage information to your team. Let me know what you think, if that would really be helpful, or not. It's okay if not, I would rather know it wouldn't really help much, than to commit my limited available coding / research time to a task that is of negligible value for the sake of someone trying to be polite. Lol, I don't get my feelings hurt, so just be clear and honest, and I'll appreciate your genuine communication over overly polite confusing language any day! Thanks either way. TheAssemboler, NickOrion21@gmail.com

I am working on a Java script tool that loads any document, website, .CSV, excel, ... Etc, basically any standard char. file, and functionally performs an array of regex swaps for simple string replacements in the target file. This was to allow much faster creation of custom eBAY listings using a custom HTML template I created to sell a large number of collectable trading cards, each being slightly different requiring a number of particular changes in the base template, and doing so manually was simply too slow.
I mention this However, because as I read your guys conversation it occures to me that I could conceivably adapt the code so that instead of doing reg ex swaps via arrays from custom target files and replacement values per item.
I could write a new version that could first perform a search of all the documents located in a collection, whether they be .py, .js, whatever, and search through the comments of each language to identify specific non-dictionary words that occur in a frequency that is above the statistical expectation for a similar non-dictionary phrase.

It could then collect all such possible "KEYS" you guys are discussing, trying to create a sort of unified index to be used by all code contributors and future users. Would that be very helpful? I know it's not a huge thing, but I am just getting into creating basic AI agents and leveraging the awesome power of ML and I could certainly take on something semi simple such as a key identification program. Once the list of possible keys has been created, of course, they should be discussed and officiated or removed amongst the core developers and and frequent contributes. In fact, I would be okay with simply creating the tool, and then handing the code and the tool with its simple documentation and usage information to your team. Let me know what you think, if that would actually be helpful, in a real way, or not. Thanks either way. The Assemboler, NickOrion21@gmail.com

JohannesMessner added 2 commits July 25, 2022 12:00

test: add more rigorous setting tests

50fe9e9

fix: proper setting of multimodal properties

4d39e2b

github-actions bot added size/m area/core area/testing component/document labels Jul 25, 2022

JohannesMessner mentioned this pull request Jul 25, 2022

feat: draft secondary index #403

Closed

7 tasks

test: remove print

c89bc01

JohannesMessner marked this pull request as draft July 25, 2022 10:09

fix: properly raise error on invalid multi modal type#

d6ec766

JoanFM reviewed Jul 25, 2022

View reviewed changes

JohannesMessner marked this pull request as ready for review July 25, 2022 11:36

JoanFM approved these changes Jul 25, 2022

View reviewed changes

JoanFM merged commit b905405 into main Jul 25, 2022

JoanFM deleted the fix-mm-set branch July 25, 2022 11:54

JohannesMessner mentioned this pull request Jul 28, 2022

feat: subindex for all document stores #456

Merged

21 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: proper setting of multimodal attributes#448

fix: proper setting of multimodal attributes#448
JoanFM merged 4 commits intomainfrom
fix-mm-set

JohannesMessner commented Jul 25, 2022

Uh oh!

codecov bot commented Jul 25, 2022 •

edited

Loading

Uh oh!

JoanFM left a comment

Uh oh!

JoanFM Jul 25, 2022

Uh oh!

JohannesMessner Jul 25, 2022

Uh oh!

JoanFM Jul 25, 2022

Uh oh!

JohannesMessner Jul 25, 2022

Uh oh!

JoanFM Jul 25, 2022

Uh oh!

NickOrion21 Jul 27, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JohannesMessner commented Jul 25, 2022

Uh oh!

codecov bot commented Jul 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

JoanFM left a comment

Choose a reason for hiding this comment

Uh oh!

JoanFM Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

JohannesMessner Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

JoanFM Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

JohannesMessner Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

JoanFM Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

NickOrion21 Jul 27, 2022

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Jul 25, 2022 •

edited

Loading