Skip to content

Changing 1-to-M behaviour of on_disk_cache.#810

Closed
VitalyFedyunin wants to merge 3 commits intogh/VitalyFedyunin/23/basefrom
gh/VitalyFedyunin/23/head
Closed

Changing 1-to-M behaviour of on_disk_cache.#810
VitalyFedyunin wants to merge 3 commits intogh/VitalyFedyunin/23/basefrom
gh/VitalyFedyunin/23/head

Conversation

@VitalyFedyunin
Copy link
Contributor

@VitalyFedyunin VitalyFedyunin commented Oct 6, 2022

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as filename_fn

Stack from ghstack (oldest at bottom):

Differential Revision: D40148560

VitalyFedyunin added a commit that referenced this pull request Oct 6, 2022
ghstack-source-id: 4bfa54e
Pull Request resolved: #810
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 6, 2022
@VitalyFedyunin VitalyFedyunin requested a review from ejguan October 6, 2022 16:54
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!!! Thank you!!!

if filepath_fn is not None:
_check_unpickable_fn(filepath_fn)
filepath_fn = _generator_to_list(filepath_fn) if inspect.isgeneratorfunction(filepath_fn) else filepath_fn
assert not inspect.isgeneratorfunction(filepath_fn) # BC breaking, now only str is accepted as return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, we need to change the dataset implementation from TorchText side. e.g.: https://github.com/pytorch/text/blob/ff1fdfce8ac030a11638b2d94d54364144586253/torchtext/datasets/imdb.py#L107

Comment on lines +274 to +276
todo_dp: Any
cached_dp: Any
one_many_cached_dp: Any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: annotated as IterDataPipe

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`


Differential Revision: [D40148560](https://our.internmc.facebook.com/intern/diff/D40148560)

[ghstack-poisoned]
VitalyFedyunin added a commit that referenced this pull request Oct 11, 2022
ghstack-source-id: faa070d
Pull Request resolved: #810
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`


Differential Revision: [D40148560](https://our.internmc.facebook.com/intern/diff/D40148560)

[ghstack-poisoned]
VitalyFedyunin added a commit that referenced this pull request Oct 11, 2022
ghstack-source-id: 74bee20
Pull Request resolved: #810
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

VitalyFedyunin added a commit to pytorch/text that referenced this pull request Oct 12, 2022
Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]
VitalyFedyunin added a commit to pytorch/text that referenced this pull request Oct 13, 2022
Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]
Nayef211 pushed a commit to pytorch/text that referenced this pull request Oct 13, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]
joecummings added a commit to pytorch/text that referenced this pull request Oct 14, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]

Co-authored-by: Vitaly Fedyunin <vitalyf@fb.com>
@facebook-github-bot facebook-github-bot deleted the gh/VitalyFedyunin/23/head branch October 16, 2022 14:19
ejguan pushed a commit to ejguan/text that referenced this pull request Oct 20, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]

Co-authored-by: Vitaly Fedyunin <vitalyf@fb.com>
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 20, 2022
Summary:
Pull Request resolved: meta-pytorch#810

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40148560

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bb9c5f32546a3d4859cff6bda0cc84234312ab7
ejguan pushed a commit that referenced this pull request Oct 20, 2022
Summary:
Pull Request resolved: #810

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40148560

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bb9c5f32546a3d4859cff6bda0cc84234312ab7
abhinavarora pushed a commit to pytorch/text that referenced this pull request Oct 21, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with meta-pytorch/data#810




[ghstack-poisoned]

Co-authored-by: Vitaly Fedyunin <vitalyf@fb.com>

Co-authored-by: Joe Cummings <jrcummings@fb.com>
Co-authored-by: Vitaly Fedyunin <vitalyf@fb.com>
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 23, 2022
Summary:
Pull Request resolved: meta-pytorch#810

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40148560

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bb9c5f32546a3d4859cff6bda0cc84234312ab7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants