Filter while recursing #5425

bugfood · 2020-07-28T21:48:42Z

Issue #
#1138
Probably #1117

Description of changes:
This fleshes out #2105 and hopefully is suitable for inclusion. The patch from @clarete is unmodified except for resolution of a trivial merge conflict.

More description is available in #2105 ; my contributions are to:

Add a minor optimization.
Add a unit test.
Simplify surrounding unit tests.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov-commenter · 2020-07-28T21:55:02Z

Codecov Report

Merging #5425 (cbbe291) into develop (d693aed) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head cbbe291 differs from pull request most recent head 8f8bf7a. Consider uploading reports for the commit 8f8bf7a to get more accurate results

@@             Coverage Diff             @@
##           develop    #5425      +/-   ##
===========================================
- Coverage    92.87%   92.85%   -0.02%     
===========================================
  Files          204      204              
  Lines        16347    16313      -34     
===========================================
- Hits         15182    15148      -34     
  Misses        1165     1165

Impacted Files	Coverage Δ
awscli/customizations/s3/filegenerator.py	`98.94% <100.00%> (+0.02%)`	⬆️
awscli/customizations/s3/filters.py	`100.00% <100.00%> (ø)`
awscli/customizations/s3/subcommands.py	`97.22% <100.00%> (+<0.01%)`	⬆️
awscli/customizations/configure/writer.py	`99.04% <0.00%> (-0.10%)`	⬇️
awscli/customizations/emr/createcluster.py	`98.34% <0.00%> (-0.02%)`	⬇️
awscli/customizations/emr/helptext.py	`100.00% <0.00%> (ø)`
awscli/customizations/eks/get_token.py	`100.00% <0.00%> (ø)`
awscli/customizations/eks/update_kubeconfig.py	`97.93% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d693aed...8f8bf7a. Read the comment docs.

bugfood · 2020-08-06T21:34:59Z

Thanks @IhostVlad for the approval. Is there anything else I can/should do in order to get this merged?

Thanks,
Corey

clarete · 2020-08-06T22:26:27Z

Thanks for working on this, @bugfood! I wonder if @kyleknap could take a look at this and maybe help us getting this merged? 🙏

bugfood · 2020-08-07T18:51:08Z

Thanks @clarete for the original patch! I set out with the goal to write such a patch and found it already pretty much done.

* release-1.19.66: Bumping version to 1.19.66 Update changelog based on model updates

* release-1.19.67: Bumping version to 1.19.67 Update changelog based on model updates

* release-1.19.68: Bumping version to 1.19.68 Update changelog based on model updates

* release-1.19.69: Bumping version to 1.19.69 Update changelog based on model updates

* release-1.19.70: Bumping version to 1.19.70 Update changelog based on model updates

* release-1.19.71: Bumping version to 1.19.71 Update changelog based on model updates

* release-1.19.72: Bumping version to 1.19.72 Update changelog based on model updates

* release-1.19.73: Bumping version to 1.19.73 Update changelog based on model updates

* release-1.19.74: Bumping version to 1.19.74 Update changelog based on model updates

* release-1.19.75: Bumping version to 1.19.75 Update changelog based on model updates

* release-1.19.76: Bumping version to 1.19.76 Update changelog based on model updates

* release-1.19.77: Bumping version to 1.19.77 Update changelog based on model updates

* release-1.19.78: Bumping version to 1.19.78 Update changelog based on model updates

* release-1.19.79: Bumping version to 1.19.79 Update changelog based on model updates

* release-1.19.80: Bumping version to 1.19.80 Update changelog based on model updates

* release-1.19.81: Bumping version to 1.19.81 Update changelog based on model updates

* release-1.19.82: Bumping version to 1.19.82 Update changelog based on model updates

* release-1.19.83: Bumping version to 1.19.83 Update changelog based on model updates add explicit size to example fix note directives

* release-1.19.84: Bumping version to 1.19.84 Update changelog based on model updates New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub

* release-1.19.85: Bumping version to 1.19.85 Update changelog based on model updates Fixed some backslashes CLI examples for ec2, elasticache, globalaccelerator, kms, securityhub, ssm Update create-data-catalog.rst Update policy-exists.rst

* release-1.19.86: Bumping version to 1.19.86 Update changelog based on model updates

* release-1.19.87: Bumping version to 1.19.87 Update changelog based on model updates

* release-1.19.88: Bumping version to 1.19.88 Update changelog based on model updates

* release-1.19.89: Bumping version to 1.19.89 Update changelog based on model updates

* release-1.23.11: Bumping version to 1.23.11 Update changelog based on model updates Add support for specifying an OSReleaseLabel in create-cluster

* release-1.23.12: Bumping version to 1.23.12 Update changelog based on model updates Fix s3 cp example for grants to specific users

justindho · 2022-05-12T17:50:42Z

@bugfood Thanks for making the changes! The team has done another pass on the PR and had some questions/comments.

Is there a specific reason that additional filtering logic was added to filters.py? As @tim-finnigan mentioned in a previous comment, we'd like to remove the filters from the sync, cp, rm, and mv subcommands and keep filtering to one location (in the FileGenerator's list_objects method) to reduce complexity. We recognize that this is a bit of a refactor, but the team would still like to see the changes done here.
Could you rebase your changes off of the develop branch? It looks like the changes were rebased off of master, which introduced a lot of commits to this PR.
The tests you've added look pretty good. Just one minor comment: could you also add another assertion in text_include_exclude to verify that no operations were called with excluded paths (i.e., self.assertNotIn(path, excluded))?

bugfood · 2022-05-12T19:22:59Z

Thanks for the review.

I rebased first, then I added a couple commits, but I need to do more work on that later, so please don't re-review yet.

I added the new assertNotIn; no problem there.

I had earlier added the code to filters.py because the surrounding code is involved in the compilation of user-supplied filters into the actual filters used internally. Since I needed to auto-insert some patterns at pattern-compilation time, that seemed like a natural place. I'm not arguing that it must be that way, but I don't know what would be better.

For the filtering location, I'm not entirely sure what you are asking for and I may need clarification later, though I did spot a deficiency to address first.

Allow FileCreator to provide a consistent abtraction. Do not do this for TestSymlinksIgnoreFiles, though--this currently relies on FileGenerator returning paths with a double separator; use of self.files to generate the path in this case results in: Traceback (most recent call last): File "tests/unit/customizations/s3/test_filegenerator.py", line 369, in test_follow_symlink self.assertEqual(result_list[i], filename) AssertionError: u'/tmp/tmpVnUU91//foo.txt' != u'/tmp/tmpVnUU91/foo.txt' - /tmp/tmpVnUU91//foo.txt ? - + /tmp/tmpVnUU91/foo.txt

Any include-path deep in a directory structure implicitly includes its parent directories (though not necessarily the other contents of those parents). Currently, this does not matter, since the FileGenerator examines the entire tree and emits all deeply-included paths. In a subsequent patch, however, we want to restrict the FileGenerator from traversing excluded paths, so we need to explicitly include the parents of any deeply-included paths, or else a more shallow exclude will inhibit discovery of the deeply-included paths, even if that exclude is specified earlier in the command line.

The way the task pipeline that each command executes filters files *after* collecting information about it. For local files it means that even when the user doesn't want to sync these files, they'll still be stat'ed, which is unnecessary. This change uses the filters received from the command line inside of the `FileGenerator.should_ignore_file` method. So the list of files to be filtered won't contain files supposed to be excluded.

This catches bug aws#1117 (which is now fixed).

If desired, this could be squashed to: Corey Hickey: add a functional test for "aws s3 sync" include/exclude

Move s3 filtering into the FileGenerator and remove the external pass. This makes the location of s3 file filtering consistent with local file generation, at least as much as feasible. For local files, the external pass was redundant now that local filtering is done within the generation itself.

bugfood · 2022-05-12T22:14:45Z

@justindho can you please check the PR now?

The last two commits are work done today; the previous commits are unchanged.

I think this does what you meant with respect to moving the filtering into FileGenerator. I had pushed back on this earlier, but now I realize that the PR was making the filter handing inconsistent, so it is good to take care of this now. At least, if I understand correctly.

kyleknap

Hi @bugfood! I'll be the primary reviewer for the PR and will be reviewing any updates to the PR until it gets merged.

With regards to this comment:

I think this does what you meant with respect to moving the filtering into FileGenerator. I had pushed back on this earlier, but now I realize that the PR was making the filter handing inconsistent, so it is good to take care of this now. At least, if I understand correctly.

Yes that is what we were referring to. The update here looks great.

Otherwise, I mainly looked through the implementation and all of my comments were related to the implementation. I'll probably have some other smaller comments in a subsequent review, but I want to make sure we are in a good spot with regards to the approach to the implementation first. Thanks again for seeing this through!

kyleknap · 2022-05-17T21:23:40Z

awscli/customizations/s3/filegenerator.py

+                if list(self.file_filter.call([FileInfo(path)])):
+                    yield path, data
+
+    def list_objects_raw(self, s3_path, dir_op):


For this method let's add an underscore to it to make it _list_objects_raw to indicate it's a private method and is only used within this class.

kyleknap · 2022-05-17T23:38:38Z

awscli/customizations/s3/filegenerator.py

-                path = path[:-1]
            if os.path.islink(path):
                return True
+        if self.file_filter is not None:


It would be great if we can move the file filter check into its own helper method that we could reuse across the local and s3 listing methods. So something like this:

def _should_ignore_from_file_filter(self, path): return ( self.file_filter is None and not list(self.file_filter.call([FileInfo(path)])) )

This will make it a bit more reusable and improve readability as we are reducing the level of nesting/complexity in the if statements.

kyleknap · 2022-05-17T23:47:12Z

awscli/customizations/s3/filegenerator.py

        return False

-    def list_objects(self, s3_path, dir_op):
+    def list_objects(self, *args, **kwargs):


It would be great if we could consolidate the calls to the _list_objects_raw() to a single instance to help simplify the logic in list_objects():

def list_objects(self, s3_path, dir_op): for path, data in self._list_objects_raw(s3_path, dir_op): if self._should_ignore_from_file_filter(path): continue yield path, data

kyleknap · 2022-05-17T23:58:31Z

awscli/customizations/s3/filters.py

        for pattern in original_patterns:
            full_patterns.append(
                (pattern[0], os.path.join(rootdir, pattern[1])))
+            # When including paths deep in the directory tree, auto-include the


I'm still not sure I understand the case this logic is optimizing for after reading this comment: #5425 (comment) and the code. Would it be possible to illustrate the use/edge case with a sample directory structure and CLI commands? I think that would be really helpful.

In general, I'd prefer we do not change anything in the filters.py logic if we do not need to. Mainly the filters.py module is one of the older modules in the s3 package and we rarely update it. I think before adding any additional complexity to it, we do some refactoring of it to make it easier to follow/work with.

Thanks. I have to focus on another project for the time being, but I will circle back to this later for the technical changes requested.

I'm still not sure I understand the case this logic is optimizing for after reading this comment: #5425 (comment) and the code. Would it be possible to illustrate the use/edge case with a sample directory structure and CLI commands?

The use case is fairly simple:

aws s3 sync dir s3://bucket/dir '--exclude excluded/*' --include 'excluded/included/*'

If we make the FileGenerator no longer recurse into excluded directories, then excluded/included would never be examined and any files therein would have no chance to be included. What is needed is a specific include filter for excluded/included. We could make the user do it:

aws s3 sync dir s3://bucket/dir '--exclude excluded/*' --include excluded/included --include 'excluded/included/*'

...but that seems onerous and would break current behavior. In my PR, this is now verified by the commit with description add a functional test for "aws s3 sync" include/exclude.

The best solution I thought of was to auto-insert the necessary filter, since we know the existing filtering logic does what we need. I wanted to keep the special-case code to a minimum.

Thanks for the detailed examples that was really helpful in understanding it! In general with this PR, we definitely need to make sure we do not break current behavior (i.e. given any set of filters + file set, the CLI will continue to include/exclude the same set of files after this change with no needing to adjust the filters provided at the command line).

While it looks like the coded that got added does handle the missing edge case that you illustrated, I am concerned that we are going to be missing other edge cases after seeing the edge case you raised. For example, suppose we had this directory layout:

$ tree . . ├── directory │ ├── another-dir │ │ ├── exclude-me.txt │ │ └── include-me.py │ ├── exclude-me.txt │ └── include-me.py ├── exclude-me.txt └── include-me.py

If we only wanted to upload the .py files, a user would likely run a command like this:

$ aws s3 cp . s3://mybucket --recursive --exclude '*' --include '*.py' --dryrun

This would result in all of the .py files being uploaded:

(dryrun) upload: directory/another-dir/include-me.py to s3://mybucket/directory/another-dir/include-me.py (dryrun) upload: directory/include-me.py to s3://mybucket/directory/include-me.py (dryrun) upload: ./include-me.py to s3://mybucket/include-me.py

However, with the approach we are currently taking, we would still not traversing the directory and another-dir to get all of the .py files:

$ aws s3 cp . s3://mybucket --recursive --exclude '*' --include '*.py' --dryrun (dryrun) upload: ./include-me.py to s3://mybucket/include-me.py

In general, I'm not optimistic the original approach of just applying the filters to the directory before traversing it is going to work. In terms of options to pursue, I think we either:

Need to expand the logic of whether we traverse a directory to skip traversing a directory only if it is impossible for a file inside of it to be included based on the filters. I'm not sure if there is a straightforward way of doing this as we would need to account for all supported pattern symbols: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html#use-of-exclude-and-include-filters

Scope the optimization of not traversing directories to simple set of exclude/include patterns (e.g. we only apply this optimization to only filters that have only --exclude flags and/or apply the optimizations if there are no special characters used in the filter (e.g. *, ?, [], [!])

Thoughts? I want to think about this a bit more to see if I have any better suggestions, but I at least wanted to put this realization down now to start the discussion.

Hmm, that is rough.

Your first idea seems best, but I don't know how to do that either.

A third idea would be to give the user control, either by a global option --no-traverse-excludes or a separate argument --hard-exclude <path> (to replace --exclude <path>). This isn't ideal, since it would be best if the program does the right thing automatically, but it may be more straightforward to implement and would not have unexpected consequences for current uses.

Thebutlersguilty3 · 2022-05-30T00:04:57Z

Corey Hickey Friday, May 27, 5:34 PM --hard-exclude Sent from my Metro by T-Mobile 5G Device Get Outlook for Android<https://aka.ms/AAb9ysg>

________________________________ From: Corey Hickey ***@***.***> Sent: Friday, May 27, 2022 5:34:01 PM To: aws/aws-cli ***@***.***> Cc: Subscribed ***@***.***> Subject: Re: [aws/aws-cli] Filter while recursing (#5425) @bugfood commented on this pull request.

________________________________ In awscli/customizations/s3/filters.py<#5425 (comment)>:

@@ -101,6 +101,20 @@ def _full_path_patterns(self, original_patterns, rootdir):

for pattern in original_patterns: full_patterns.append( (pattern[0], os.path.join(rootdir, pattern[1]))) + # When including paths deep in the directory tree, auto-include the Hmm, that is rough. Your first idea seems best, but I don't know how to do that either. A third idea would be to give the user control, either by a global option --no-traverse-excludes or a separate argument --hard-exclude <path> (to replace --exclude <path>). This isn't ideal, since it would be best if the program does the right thing automatically, but it may be more straightforward to implement and would not have unexpected consequences for current uses. — Reply to this email directly, view it on GitHub<#5425 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AWLW4QLSPAPBDERAZF63KMDVME5UTANCNFSM4PK6QO4Q>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

tim-finnigan · 2023-08-30T17:43:36Z

Upon discussing with the team, this PR is not prioritized at this time but we are setting it to Draft to continue tracking the proposed changes. Per this earlier comment and discussion there are still some design implementation options that must be considered for a path forward here.

tim-finnigan · 2024-06-26T21:14:33Z

Checking in — we are going to close this as not prioritized at this time. There has not been any discussion regarding the changes in a couple of years, and further research is required into the best implementation here. But we are continuing to track the open related issues (#1138 and #1117)

bugfood mentioned this pull request Jul 28, 2020

s3.sync: Do not stat files that don't match filters #2105

Closed

IhostVlad approved these changes Jul 30, 2020

View reviewed changes

aws-sdk-python-automation added 24 commits May 5, 2021 02:41

Merge branch 'release-1.19.66'

4761b89

* release-1.19.66: Bumping version to 1.19.66 Update changelog based on model updates

Merge branch 'release-1.19.67'

e7951a4

* release-1.19.67: Bumping version to 1.19.67 Update changelog based on model updates

Merge branch 'release-1.19.68'

b0ad49c

* release-1.19.68: Bumping version to 1.19.68 Update changelog based on model updates

Merge branch 'release-1.19.69'

ce4a17f

* release-1.19.69: Bumping version to 1.19.69 Update changelog based on model updates

Merge branch 'release-1.19.70'

97100c5

* release-1.19.70: Bumping version to 1.19.70 Update changelog based on model updates

Merge branch 'release-1.19.71'

e07715e

* release-1.19.71: Bumping version to 1.19.71 Update changelog based on model updates

Merge branch 'release-1.19.72'

fc9b714

* release-1.19.72: Bumping version to 1.19.72 Update changelog based on model updates

Merge branch 'release-1.19.73'

44d21c4

* release-1.19.73: Bumping version to 1.19.73 Update changelog based on model updates

Merge branch 'release-1.19.74'

2a5fabd

* release-1.19.74: Bumping version to 1.19.74 Update changelog based on model updates

Merge branch 'release-1.19.75'

6eb2196

* release-1.19.75: Bumping version to 1.19.75 Update changelog based on model updates

Merge branch 'release-1.19.76'

483d42b

* release-1.19.76: Bumping version to 1.19.76 Update changelog based on model updates

Merge branch 'release-1.19.77'

028182e

* release-1.19.77: Bumping version to 1.19.77 Update changelog based on model updates

Merge branch 'release-1.19.78'

de3ddf9

* release-1.19.78: Bumping version to 1.19.78 Update changelog based on model updates

Merge branch 'release-1.19.79'

5db80b6

* release-1.19.79: Bumping version to 1.19.79 Update changelog based on model updates

Merge branch 'release-1.19.80'

154a4e6

* release-1.19.80: Bumping version to 1.19.80 Update changelog based on model updates

Merge branch 'release-1.19.81'

c793669

* release-1.19.81: Bumping version to 1.19.81 Update changelog based on model updates

Merge branch 'release-1.19.82'

d9d0226

* release-1.19.82: Bumping version to 1.19.82 Update changelog based on model updates

Merge branch 'release-1.19.83'

e600b8d

* release-1.19.83: Bumping version to 1.19.83 Update changelog based on model updates add explicit size to example fix note directives

Merge branch 'release-1.19.84'

1f6a183

* release-1.19.84: Bumping version to 1.19.84 Update changelog based on model updates New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub

Merge branch 'release-1.19.85'

86e226a

* release-1.19.85: Bumping version to 1.19.85 Update changelog based on model updates Fixed some backslashes CLI examples for ec2, elasticache, globalaccelerator, kms, securityhub, ssm Update create-data-catalog.rst Update policy-exists.rst

Merge branch 'release-1.19.86'

eb77fd6

* release-1.19.86: Bumping version to 1.19.86 Update changelog based on model updates

Merge branch 'release-1.19.87'

73426af

* release-1.19.87: Bumping version to 1.19.87 Update changelog based on model updates

Merge branch 'release-1.19.88'

189bed0

* release-1.19.88: Bumping version to 1.19.88 Update changelog based on model updates

Merge branch 'release-1.19.89'

46ffa92

* release-1.19.89: Bumping version to 1.19.89 Update changelog based on model updates

aws-sdk-python-automation added 2 commits May 10, 2022 18:06

Merge branch 'release-1.23.11'

0b6a3e5

* release-1.23.11: Bumping version to 1.23.11 Update changelog based on model updates Add support for specifying an OSReleaseLabel in create-cluster

Merge branch 'release-1.23.12'

b1263fe

* release-1.23.12: Bumping version to 1.23.12 Update changelog based on model updates Fix s3 cp example for grants to specific users

justindho added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. needs-rebase labels May 12, 2022

bugfood force-pushed the filter-while-recursing branch from cbbe291 to 8f8bf7a Compare May 12, 2022 18:33

bugfood and others added 7 commits May 12, 2022 14:56

add a functional test for "aws s3 sync" include/exclude

884ad4d

remove unused imports and variables from filegenerator unit test

375efe1

remove a couple unused local function assignments

b82c7b6

test that excluded files are not examined

26f85e6

This catches bug aws#1117 (which is now fixed).

bugfood force-pushed the filter-while-recursing branch from 9558f48 to 996e32a Compare May 12, 2022 22:10

bugfood added 2 commits May 12, 2022 15:11

add check that paths included are not intended to be excluded

91723b6

If desired, this could be squashed to: Corey Hickey: add a functional test for "aws s3 sync" include/exclude

bugfood force-pushed the filter-while-recursing branch from 996e32a to 0dccece Compare May 12, 2022 22:11

kdaily assigned kyleknap May 16, 2022

kdaily added responded and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels May 16, 2022

kyleknap reviewed May 18, 2022

View reviewed changes

kyleknap added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed responded labels May 18, 2022

tim-finnigan marked this pull request as draft August 30, 2023 17:43

tim-finnigan closed this Jun 26, 2024

Filter while recursing #5425

Filter while recursing #5425

Uh oh!

Conversation

bugfood commented Jul 28, 2020

Uh oh!

codecov-commenter commented Jul 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

bugfood commented Aug 6, 2020

Uh oh!

clarete commented Aug 6, 2020

Uh oh!

bugfood commented Aug 7, 2020

Uh oh!

justindho commented May 12, 2022

Uh oh!

bugfood commented May 12, 2022

Uh oh!

bugfood commented May 12, 2022

Uh oh!

kyleknap left a comment

Choose a reason for hiding this comment

Uh oh!

kyleknap May 17, 2022

Choose a reason for hiding this comment

Uh oh!

kyleknap May 17, 2022

Choose a reason for hiding this comment

Uh oh!

kyleknap May 17, 2022

Choose a reason for hiding this comment

Uh oh!

kyleknap May 17, 2022

Choose a reason for hiding this comment

Uh oh!

bugfood May 20, 2022

Choose a reason for hiding this comment

Uh oh!

kyleknap May 27, 2022

Choose a reason for hiding this comment

Uh oh!

bugfood May 27, 2022

Choose a reason for hiding this comment

Uh oh!

Thebutlersguilty3 commented May 30, 2022 via email

Uh oh!

tim-finnigan commented Aug 30, 2023

Uh oh!

tim-finnigan commented Jun 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

codecov-commenter commented Jul 28, 2020 •

edited

Loading