-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Filter while recursing #5425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter while recursing #5425
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #5425 +/- ##
===========================================
- Coverage 92.87% 92.85% -0.02%
===========================================
Files 204 204
Lines 16347 16313 -34
===========================================
- Hits 15182 15148 -34
Misses 1165 1165
Continue to review full report at Codecov.
|
|
Thanks @IhostVlad for the approval. Is there anything else I can/should do in order to get this merged? Thanks, |
|
Thanks @clarete for the original patch! I set out with the goal to write such a patch and found it already pretty much done. |
* release-1.19.66: Bumping version to 1.19.66 Update changelog based on model updates
* release-1.19.67: Bumping version to 1.19.67 Update changelog based on model updates
* release-1.19.68: Bumping version to 1.19.68 Update changelog based on model updates
* release-1.19.69: Bumping version to 1.19.69 Update changelog based on model updates
* release-1.19.70: Bumping version to 1.19.70 Update changelog based on model updates
* release-1.19.71: Bumping version to 1.19.71 Update changelog based on model updates
* release-1.19.72: Bumping version to 1.19.72 Update changelog based on model updates
* release-1.19.73: Bumping version to 1.19.73 Update changelog based on model updates
* release-1.19.74: Bumping version to 1.19.74 Update changelog based on model updates
* release-1.19.75: Bumping version to 1.19.75 Update changelog based on model updates
* release-1.19.76: Bumping version to 1.19.76 Update changelog based on model updates
* release-1.19.77: Bumping version to 1.19.77 Update changelog based on model updates
* release-1.19.78: Bumping version to 1.19.78 Update changelog based on model updates
* release-1.19.79: Bumping version to 1.19.79 Update changelog based on model updates
* release-1.19.80: Bumping version to 1.19.80 Update changelog based on model updates
* release-1.19.81: Bumping version to 1.19.81 Update changelog based on model updates
* release-1.19.82: Bumping version to 1.19.82 Update changelog based on model updates
* release-1.19.83: Bumping version to 1.19.83 Update changelog based on model updates add explicit size to example fix note directives
* release-1.19.84: Bumping version to 1.19.84 Update changelog based on model updates New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub
* release-1.19.85: Bumping version to 1.19.85 Update changelog based on model updates Fixed some backslashes CLI examples for ec2, elasticache, globalaccelerator, kms, securityhub, ssm Update create-data-catalog.rst Update policy-exists.rst
* release-1.19.86: Bumping version to 1.19.86 Update changelog based on model updates
* release-1.19.87: Bumping version to 1.19.87 Update changelog based on model updates
* release-1.19.88: Bumping version to 1.19.88 Update changelog based on model updates
* release-1.19.89: Bumping version to 1.19.89 Update changelog based on model updates
* release-1.23.11: Bumping version to 1.23.11 Update changelog based on model updates Add support for specifying an OSReleaseLabel in create-cluster
* release-1.23.12: Bumping version to 1.23.12 Update changelog based on model updates Fix s3 cp example for grants to specific users
|
@bugfood Thanks for making the changes! The team has done another pass on the PR and had some questions/comments.
|
cbbe291 to
8f8bf7a
Compare
|
Thanks for the review. I rebased first, then I added a couple commits, but I need to do more work on that later, so please don't re-review yet. I added the new I had earlier added the code to filters.py because the surrounding code is involved in the compilation of user-supplied filters into the actual filters used internally. Since I needed to auto-insert some patterns at pattern-compilation time, that seemed like a natural place. I'm not arguing that it must be that way, but I don't know what would be better. For the filtering location, I'm not entirely sure what you are asking for and I may need clarification later, though I did spot a deficiency to address first. |
Allow FileCreator to provide a consistent abtraction.
Do not do this for TestSymlinksIgnoreFiles, though--this currently
relies on FileGenerator returning paths with a double separator; use of
self.files to generate the path in this case results in:
Traceback (most recent call last):
File "tests/unit/customizations/s3/test_filegenerator.py", line 369, in test_follow_symlink
self.assertEqual(result_list[i], filename)
AssertionError: u'/tmp/tmpVnUU91//foo.txt' != u'/tmp/tmpVnUU91/foo.txt'
- /tmp/tmpVnUU91//foo.txt
? -
+ /tmp/tmpVnUU91/foo.txt
Any include-path deep in a directory structure implicitly includes its parent directories (though not necessarily the other contents of those parents). Currently, this does not matter, since the FileGenerator examines the entire tree and emits all deeply-included paths. In a subsequent patch, however, we want to restrict the FileGenerator from traversing excluded paths, so we need to explicitly include the parents of any deeply-included paths, or else a more shallow exclude will inhibit discovery of the deeply-included paths, even if that exclude is specified earlier in the command line.
The way the task pipeline that each command executes filters files *after* collecting information about it. For local files it means that even when the user doesn't want to sync these files, they'll still be stat'ed, which is unnecessary. This change uses the filters received from the command line inside of the `FileGenerator.should_ignore_file` method. So the list of files to be filtered won't contain files supposed to be excluded.
This catches bug aws#1117 (which is now fixed).
9558f48 to
996e32a
Compare
If desired, this could be squashed to: Corey Hickey: add a functional test for "aws s3 sync" include/exclude
Move s3 filtering into the FileGenerator and remove the external pass. This makes the location of s3 file filtering consistent with local file generation, at least as much as feasible. For local files, the external pass was redundant now that local filtering is done within the generation itself.
996e32a to
0dccece
Compare
|
@justindho can you please check the PR now? The last two commits are work done today; the previous commits are unchanged. I think this does what you meant with respect to moving the filtering into FileGenerator. I had pushed back on this earlier, but now I realize that the PR was making the filter handing inconsistent, so it is good to take care of this now. At least, if I understand correctly. |
kyleknap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bugfood! I'll be the primary reviewer for the PR and will be reviewing any updates to the PR until it gets merged.
With regards to this comment:
I think this does what you meant with respect to moving the filtering into FileGenerator. I had pushed back on this earlier, but now I realize that the PR was making the filter handing inconsistent, so it is good to take care of this now. At least, if I understand correctly.
Yes that is what we were referring to. The update here looks great.
Otherwise, I mainly looked through the implementation and all of my comments were related to the implementation. I'll probably have some other smaller comments in a subsequent review, but I want to make sure we are in a good spot with regards to the approach to the implementation first. Thanks again for seeing this through!
| if list(self.file_filter.call([FileInfo(path)])): | ||
| yield path, data | ||
|
|
||
| def list_objects_raw(self, s3_path, dir_op): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this method let's add an underscore to it to make it _list_objects_raw to indicate it's a private method and is only used within this class.
| path = path[:-1] | ||
| if os.path.islink(path): | ||
| return True | ||
| if self.file_filter is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if we can move the file filter check into its own helper method that we could reuse across the local and s3 listing methods. So something like this:
def _should_ignore_from_file_filter(self, path):
return (
self.file_filter is None and
not list(self.file_filter.call([FileInfo(path)]))
)This will make it a bit more reusable and improve readability as we are reducing the level of nesting/complexity in the if statements.
| return False | ||
|
|
||
| def list_objects(self, s3_path, dir_op): | ||
| def list_objects(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if we could consolidate the calls to the _list_objects_raw() to a single instance to help simplify the logic in list_objects():
def list_objects(self, s3_path, dir_op):
for path, data in self._list_objects_raw(s3_path, dir_op):
if self._should_ignore_from_file_filter(path):
continue
yield path, data| for pattern in original_patterns: | ||
| full_patterns.append( | ||
| (pattern[0], os.path.join(rootdir, pattern[1]))) | ||
| # When including paths deep in the directory tree, auto-include the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure I understand the case this logic is optimizing for after reading this comment: #5425 (comment) and the code. Would it be possible to illustrate the use/edge case with a sample directory structure and CLI commands? I think that would be really helpful.
In general, I'd prefer we do not change anything in the filters.py logic if we do not need to. Mainly the filters.py module is one of the older modules in the s3 package and we rarely update it. I think before adding any additional complexity to it, we do some refactoring of it to make it easier to follow/work with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I have to focus on another project for the time being, but I will circle back to this later for the technical changes requested.
I'm still not sure I understand the case this logic is optimizing for after reading this comment: #5425 (comment) and the code. Would it be possible to illustrate the use/edge case with a sample directory structure and CLI commands?
The use case is fairly simple:
aws s3 sync dir s3://bucket/dir '--exclude excluded/*' --include 'excluded/included/*'
If we make the FileGenerator no longer recurse into excluded directories, then excluded/included would never be examined and any files therein would have no chance to be included. What is needed is a specific include filter for excluded/included. We could make the user do it:
aws s3 sync dir s3://bucket/dir '--exclude excluded/*' --include excluded/included --include 'excluded/included/*'
...but that seems onerous and would break current behavior. In my PR, this is now verified by the commit with description add a functional test for "aws s3 sync" include/exclude.
The best solution I thought of was to auto-insert the necessary filter, since we know the existing filtering logic does what we need. I wanted to keep the special-case code to a minimum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed examples that was really helpful in understanding it! In general with this PR, we definitely need to make sure we do not break current behavior (i.e. given any set of filters + file set, the CLI will continue to include/exclude the same set of files after this change with no needing to adjust the filters provided at the command line).
While it looks like the coded that got added does handle the missing edge case that you illustrated, I am concerned that we are going to be missing other edge cases after seeing the edge case you raised. For example, suppose we had this directory layout:
$ tree .
.
├── directory
│ ├── another-dir
│ │ ├── exclude-me.txt
│ │ └── include-me.py
│ ├── exclude-me.txt
│ └── include-me.py
├── exclude-me.txt
└── include-me.py
If we only wanted to upload the .py files, a user would likely run a command like this:
$ aws s3 cp . s3://mybucket --recursive --exclude '*' --include '*.py' --dryrun
This would result in all of the .py files being uploaded:
(dryrun) upload: directory/another-dir/include-me.py to s3://mybucket/directory/another-dir/include-me.py
(dryrun) upload: directory/include-me.py to s3://mybucket/directory/include-me.py
(dryrun) upload: ./include-me.py to s3://mybucket/include-me.py
However, with the approach we are currently taking, we would still not traversing the directory and another-dir to get all of the .py files:
$ aws s3 cp . s3://mybucket --recursive --exclude '*' --include '*.py' --dryrun
(dryrun) upload: ./include-me.py to s3://mybucket/include-me.py
In general, I'm not optimistic the original approach of just applying the filters to the directory before traversing it is going to work. In terms of options to pursue, I think we either:
-
Need to expand the logic of whether we traverse a directory to skip traversing a directory only if it is impossible for a file inside of it to be included based on the filters. I'm not sure if there is a straightforward way of doing this as we would need to account for all supported pattern symbols: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html#use-of-exclude-and-include-filters
-
Scope the optimization of not traversing directories to simple set of exclude/include patterns (e.g. we only apply this optimization to only filters that have only
--excludeflags and/or apply the optimizations if there are no special characters used in the filter (e.g.*,?,[],[!])
Thoughts? I want to think about this a bit more to see if I have any better suggestions, but I at least wanted to put this realization down now to start the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that is rough.
Your first idea seems best, but I don't know how to do that either.
A third idea would be to give the user control, either by a global option --no-traverse-excludes or a separate argument --hard-exclude <path> (to replace --exclude <path>). This isn't ideal, since it would be best if the program does the right thing automatically, but it may be more straightforward to implement and would not have unexpected consequences for current uses.
|
Corey Hickey Friday, May 27, 5:34 PM
--hard-exclude
Sent from my Metro by T-Mobile 5G Device
Get Outlook for Android<https://aka.ms/AAb9ysg>
________________________________
From: Corey Hickey ***@***.***>
Sent: Friday, May 27, 2022 5:34:01 PM
To: aws/aws-cli ***@***.***>
Cc: Subscribed ***@***.***>
Subject: Re: [aws/aws-cli] Filter while recursing (#5425)
@bugfood commented on this pull request.
________________________________
In awscli/customizations/s3/filters.py<#5425 (comment)>:
@@ -101,6 +101,20 @@ def _full_path_patterns(self, original_patterns, rootdir):
for pattern in original_patterns:
full_patterns.append(
(pattern[0], os.path.join(rootdir, pattern[1])))
+ # When including paths deep in the directory tree, auto-include the
Hmm, that is rough.
Your first idea seems best, but I don't know how to do that either.
A third idea would be to give the user control, either by a global option --no-traverse-excludes or a separate argument --hard-exclude <path> (to replace --exclude <path>). This isn't ideal, since it would be best if the program does the right thing automatically, but it may be more straightforward to implement and would not have unexpected consequences for current uses.
—
Reply to this email directly, view it on GitHub<#5425 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AWLW4QLSPAPBDERAZF63KMDVME5UTANCNFSM4PK6QO4Q>.
You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
|
|
Upon discussing with the team, this PR is not prioritized at this time but we are setting it to Draft to continue tracking the proposed changes. Per this earlier comment and discussion there are still some design implementation options that must be considered for a path forward here. |
Issue #
#1138
Probably #1117
Description of changes:
This fleshes out #2105 and hopefully is suitable for inclusion. The patch from @clarete is unmodified except for resolution of a trivial merge conflict.
More description is available in #2105 ; my contributions are to:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.