Skip to content

fix: proper setting of multimodal attributes#448

Merged
JoanFM merged 4 commits intomainfrom
fix-mm-set
Jul 25, 2022
Merged

fix: proper setting of multimodal attributes#448
JoanFM merged 4 commits intomainfrom
fix-mm-set

Conversation

@JohannesMessner
Copy link
Copy Markdown
Member

Previously, setting a multi-modal attribute did not set chunks properly. This resulted in only Document level access working, while DocumentArray level access was broken.
This PR fixes this.

The previous tests were not comprehensive enough to catch this flaw. This PR adds tests to that effect.

@JohannesMessner JohannesMessner marked this pull request as draft July 25, 2022 10:09
@codecov
Copy link
Copy Markdown

codecov bot commented Jul 25, 2022

Codecov Report

Merging #448 (d6ec766) into main (bb132b3) will increase coverage by 2.14%.
The diff coverage is 93.10%.

@@            Coverage Diff             @@
##             main     #448      +/-   ##
==========================================
+ Coverage   84.50%   86.64%   +2.14%     
==========================================
  Files         134      134              
  Lines        6497     6516      +19     
==========================================
+ Hits         5490     5646     +156     
+ Misses       1007      870     -137     
Flag Coverage Δ
docarray 86.64% <93.10%> (+2.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/document/mixins/multimodal.py 93.33% <92.85%> (-0.86%) ⬇️
docarray/__init__.py 75.00% <100.00%> (ø)
docarray/array/mixins/setitem.py 81.81% <0.00%> (+0.82%) ⬆️
docarray/array/storage/weaviate/find.py 78.48% <0.00%> (+1.26%) ⬆️
docarray/base.py 98.64% <0.00%> (+1.35%) ⬆️
docarray/array/storage/sqlite/seqlike.py 84.31% <0.00%> (+1.96%) ⬆️
docarray/array/mixins/find.py 87.35% <0.00%> (+2.29%) ⬆️
docarray/array/mixins/io/binary.py 97.45% <0.00%> (+2.54%) ⬆️
docarray/array/storage/weaviate/seqlike.py 77.41% <0.00%> (+3.22%) ⬆️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4748fe5...d6ec766. Read the comment docs.

Copy link
Copy Markdown
Member

@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a comment

f'attribute {attr} is not a valid multi modal attribute.'
f' One possible cause is the usage of a non-supported type in the dataclass definition.'
)
return int(pos)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would this not be an int?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure to be honest, in the original implementation this was done (but in a slightly different place), so I kept it.

if attr not in self._metadata['multi_modal_schema']:
raise ValueError(f'the Document schema does not contain attribute {attr}')

pos = self._metadata['multi_modal_schema'][attr].get('position')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can have a list of all these specific KEYS and their meaning so that we do not have them scattered around the code

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but I would also have to collect them all around the code base first.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can have a separate ticket

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  Hello, my name is Nicholas, I am new here, but I have been doing something for a small learning and automation time saving project that I could recreate and modify a bit and I do believe that it could automate the task of identifying your "keys / Symbolic-Arguments via first performing a search of all the standard source code files and read me documentations, located in the projects collection.  As long as the files are encoded in standard us English text char encoding it wouldn't matter whether they be .py, .js, whatever.  The agent would simply be searching through the source code looking for all the comment blocks, and extracting those as well as all the read me and help documentation files.  Once it has collected at least the majority of said comments and documentation text from all included project files, it would count up the number of times repeatedly used, but non standard/non-dictionary words which are found to occur in a frequency that is above the statistical expectation for a similar non-dictionary phrase if not being used to convey some symbolic meaning or as a proxy for an argument/variable.
  It could then collect all such possible "KEYS" you guys are discussing, trying to create a unified index to be used by all code contributors and future users, and then at least all the charector strings which are repetitively utalied in the source comments and or documentation would be neatly indexed, in order of most frequent to least if that would be also helpful  :). 
 Would you guys think me creating a small program/script to do that would actually be very helpful? I know it's not a huge thing, but I am just getting into ML and I could certainly take on something like that. Once the list of possible keys has been created, of course, they should be discussed amongst the core staff and frequent contributors, I think. In fact, I would be okay with simply creating the tool, and then handing the code and the tool with its simple documentation and usage information to your team. Let me know what you think, if that would really be helpful, or not. It's okay if not, I would rather know it wouldn't really help much, than to commit my limited available coding / research time to a task that is of negligible value for the sake of someone trying to be polite. Lol, I don't get my feelings hurt, so just be clear and honest, and I'll appreciate your genuine communication over overly polite confusing language any day! Thanks either way. TheAssemboler, NickOrion21@gmail.com

I am working on a Java script tool that loads any document, website, .CSV, excel, ... Etc, basically any standard char. file, and functionally performs an array of regex swaps for simple string replacements in the target file. This was to allow much faster creation of custom eBAY listings using a custom HTML template I created to sell a large number of collectable trading cards, each being slightly different requiring a number of particular changes in the base template, and doing so manually was simply too slow.
I mention this However, because as I read your guys conversation it occures to me that I could conceivably adapt the code so that instead of doing reg ex swaps via arrays from custom target files and replacement values per item.
I could write a new version that could first perform a search of all the documents located in a collection, whether they be .py, .js, whatever, and search through the comments of each language to identify specific non-dictionary words that occur in a frequency that is above the statistical expectation for a similar non-dictionary phrase.

   It could then collect all such possible "KEYS" you guys are discussing, trying to create a sort of unified index to be used by all code contributors and future users. Would that be very helpful? I know it's not a huge thing, but I am just getting into creating basic AI agents and leveraging the awesome power of ML and I could certainly take on something semi simple such as a key identification program.

    Once the list of possible keys has been created, of course, they should be discussed and officiated or removed amongst the core developers and and frequent contributes.  

   In fact, I would be okay with simply creating the tool, and then handing the code and the tool with its simple documentation and usage information to your team. Let me know what you think, if that would actually be helpful, in a real way, or not. Thanks either way. The Assemboler, NickOrion21@gmail.com

@JohannesMessner JohannesMessner marked this pull request as ready for review July 25, 2022 11:36
@JoanFM JoanFM merged commit b905405 into main Jul 25, 2022
@JoanFM JoanFM deleted the fix-mm-set branch July 25, 2022 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants