-
Notifications
You must be signed in to change notification settings - Fork 234
feat(v2): load da from csv and save to csv #1144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Will nested schemas work somehow? |
Yes, for nested fields you can give the access path as column names, e.g. 'image.url' to set the url of a field of type Image |
7f03c5e to
ed134c5
Compare
b21d518 to
5126056
Compare
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
27303af to
5babdae
Compare
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
tests/toydata/docs_nested.csv
Outdated
| @@ -0,0 +1,4 @@ | |||
| count,text,image,image2.url | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| count,text,image,image2.url | |
| count,text,image,image2__url |
Not really sure here but I think we cannot/should not use . inside column name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, do you mean like by csv convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah kind of because, though it is more gut feeling that actual knowledge about CSV convention. But generally I think we should keep with a__b sintax rather than a.b one. Same for vector database I believe @JohannesMessner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will adjust this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.
@JohannesMessner true, at other places we used the dot separation for the access paths (such as in traverse_flat(), e.g.da.traverse_flat(access_path='author.name'). @samsja mentioned that we should not use the . inside the column name. I am not sure how dangerous/against the rules this is
|
|
||
| def test_from_csv_without_schema_raise_exception(): | ||
| with pytest.raises(TypeError, match='no document schema defined'): | ||
| DocumentArray.from_csv(file_path=str(TOYDATA_DIR / 'docs_nested.csv')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense that this raise an Expection for now. But what about implementing at this level the auto schema detection ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for now the exception, I think the auto detection or handing over a schema some other way then setting .document_type I would handle in a separate PR, if that's good with you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes makes sense to do it in another PR
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
JohannesMessner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, but I think this will be a truly cool feature! We should promote this properly once it is out, I bet the data validation folks that already use pydantic would love it
docarray/array/array/io.py
Outdated
| """ | ||
| Check if a given access path is a valid path for a given Document class. | ||
| """ | ||
| field, _, remaining = access_path.partition('__') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: we will use __ instead of . separator for access paths
docarray/array/array/io.py
Outdated
| def _access_path_to_dict(access_path: str, value) -> Dict[str, Any]: | ||
| """ | ||
| Convert an access path and its value to a (potentially) nested dict. | ||
| EXAMPLE USAGE | ||
| .. code-block:: python | ||
| assert access_path_to_dict('image__url', 'img.png') == {'image': {'url': 'img.png'}} | ||
| """ | ||
| fields = access_path.split('__') | ||
| for field in reversed(fields): | ||
| result = {field: value} | ||
| value = result | ||
| return result | ||
|
|
||
|
|
||
| def _dict_to_access_paths(d: dict) -> Dict[str, Any]: | ||
| """ | ||
| Convert a (nested) dict to a Dict[access_path, value]. | ||
| Access paths are defines as a path of field(s) separated by "__". | ||
| EXAMPLE USAGE | ||
| .. code-block:: python | ||
| assert dict_to_access_paths({'image': {'url': 'img.png'}}) == {'image__url', 'img.png'} | ||
| """ | ||
| result = {} | ||
| for k, v in d.items(): | ||
| if isinstance(v, dict): | ||
| v = _dict_to_access_paths(v) | ||
| for nested_k, nested_v in v.items(): | ||
| new_key = '__'.join([k, nested_k]) | ||
| result[new_key] = nested_v | ||
| else: | ||
| result[k] = v | ||
| return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider making these methods global helpers, they will come in handy in other places of the code base as well, e.g. i already implemented something similar for document stores
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sure, sounds good. What exactly do you mean by making these methods global helpers, put them somewhere in a helper file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes exactly, in a helper.py or utils.py at the top level or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok done
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai>
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
bc46938 to
00a9ea7
Compare
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
|
📝 Docs are deployed on https://ft-feat-from-to-csv--jina-docs.netlify.app 🎉 |
JohannesMessner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice feature!
* feat: load from and to csv Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: from to csv Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * feat: add access path to dict Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: from to csv Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: clean up Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * docs: add docstring and update tmpdir in test Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: merge nested dicts Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: clean up Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: clean up Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * test: update test Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply samis suggestion from code review Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply suggestions from code review wrt access paths Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai> * fix: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai> * fix: apply suggestions from code review Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply suggestions from code review Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: typos Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * refactor: move helper functions to helper file Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * test: fix fixture Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> --------- Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai> Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Arnav Zutshi <arnzut1324@gmail.com>
Goals:
document_typeneeds to be set to determine schema.'image__url'None