Skip to content

Conversation

@bugfood
Copy link

@bugfood bugfood commented Jul 28, 2020

Issue #
#1138
Probably #1117

Description of changes:
This fleshes out #2105 and hopefully is suitable for inclusion. The patch from @clarete is unmodified except for resolution of a trivial merge conflict.

More description is available in #2105 ; my contributions are to:

  • Add a minor optimization.
  • Add a unit test.
  • Simplify surrounding unit tests.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2020

Codecov Report

Merging #5425 (cbbe291) into develop (d693aed) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head cbbe291 differs from pull request most recent head 8f8bf7a. Consider uploading reports for the commit 8f8bf7a to get more accurate results

@@             Coverage Diff             @@
##           develop    #5425      +/-   ##
===========================================
- Coverage    92.87%   92.85%   -0.02%     
===========================================
  Files          204      204              
  Lines        16347    16313      -34     
===========================================
- Hits         15182    15148      -34     
  Misses        1165     1165              
Impacted Files Coverage Δ
awscli/customizations/s3/filegenerator.py 98.94% <100.00%> (+0.02%) ⬆️
awscli/customizations/s3/filters.py 100.00% <100.00%> (ø)
awscli/customizations/s3/subcommands.py 97.22% <100.00%> (+<0.01%) ⬆️
awscli/customizations/configure/writer.py 99.04% <0.00%> (-0.10%) ⬇️
awscli/customizations/emr/createcluster.py 98.34% <0.00%> (-0.02%) ⬇️
awscli/customizations/emr/helptext.py 100.00% <0.00%> (ø)
awscli/customizations/eks/get_token.py 100.00% <0.00%> (ø)
awscli/customizations/eks/update_kubeconfig.py 97.93% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d693aed...8f8bf7a. Read the comment docs.

@bugfood
Copy link
Author

bugfood commented Aug 6, 2020

Thanks @IhostVlad for the approval. Is there anything else I can/should do in order to get this merged?

Thanks,
Corey

@clarete
Copy link

clarete commented Aug 6, 2020

Thanks for working on this, @bugfood! I wonder if @kyleknap could take a look at this and maybe help us getting this merged? 🙏

@bugfood
Copy link
Author

bugfood commented Aug 7, 2020

Thanks @clarete for the original patch! I set out with the goal to write such a patch and found it already pretty much done.

* release-1.19.66:
  Bumping version to 1.19.66
  Update changelog based on model updates
* release-1.19.67:
  Bumping version to 1.19.67
  Update changelog based on model updates
* release-1.19.68:
  Bumping version to 1.19.68
  Update changelog based on model updates
* release-1.19.69:
  Bumping version to 1.19.69
  Update changelog based on model updates
* release-1.19.70:
  Bumping version to 1.19.70
  Update changelog based on model updates
* release-1.19.71:
  Bumping version to 1.19.71
  Update changelog based on model updates
* release-1.19.72:
  Bumping version to 1.19.72
  Update changelog based on model updates
* release-1.19.73:
  Bumping version to 1.19.73
  Update changelog based on model updates
* release-1.19.74:
  Bumping version to 1.19.74
  Update changelog based on model updates
* release-1.19.75:
  Bumping version to 1.19.75
  Update changelog based on model updates
* release-1.19.76:
  Bumping version to 1.19.76
  Update changelog based on model updates
* release-1.19.77:
  Bumping version to 1.19.77
  Update changelog based on model updates
* release-1.19.78:
  Bumping version to 1.19.78
  Update changelog based on model updates
* release-1.19.79:
  Bumping version to 1.19.79
  Update changelog based on model updates
* release-1.19.80:
  Bumping version to 1.19.80
  Update changelog based on model updates
* release-1.19.81:
  Bumping version to 1.19.81
  Update changelog based on model updates
* release-1.19.82:
  Bumping version to 1.19.82
  Update changelog based on model updates
* release-1.19.83:
  Bumping version to 1.19.83
  Update changelog based on model updates
  add explicit size to example
  fix note directives
* release-1.19.84:
  Bumping version to 1.19.84
  Update changelog based on model updates
  New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub
  New CLI examples ssm, ec2, emr, codeguru-reviewer, detective, wafv2, securityhub
* release-1.19.85:
  Bumping version to 1.19.85
  Update changelog based on model updates
  Fixed some backslashes
  CLI examples for ec2, elasticache, globalaccelerator, kms, securityhub, ssm
  Update create-data-catalog.rst
  Update policy-exists.rst
* release-1.19.86:
  Bumping version to 1.19.86
  Update changelog based on model updates
* release-1.19.87:
  Bumping version to 1.19.87
  Update changelog based on model updates
* release-1.19.88:
  Bumping version to 1.19.88
  Update changelog based on model updates
* release-1.19.89:
  Bumping version to 1.19.89
  Update changelog based on model updates
* release-1.23.11:
  Bumping version to 1.23.11
  Update changelog based on model updates
  Add support for specifying an OSReleaseLabel in create-cluster
* release-1.23.12:
  Bumping version to 1.23.12
  Update changelog based on model updates
  Fix s3 cp example for grants to specific users
@justindho
Copy link
Contributor

@bugfood Thanks for making the changes! The team has done another pass on the PR and had some questions/comments.

  1. Is there a specific reason that additional filtering logic was added to filters.py? As @tim-finnigan mentioned in a previous comment, we'd like to remove the filters from the sync, cp, rm, and mv subcommands and keep filtering to one location (in the FileGenerator's list_objects method) to reduce complexity. We recognize that this is a bit of a refactor, but the team would still like to see the changes done here.
  2. Could you rebase your changes off of the develop branch? It looks like the changes were rebased off of master, which introduced a lot of commits to this PR.
  3. The tests you've added look pretty good. Just one minor comment: could you also add another assertion in text_include_exclude to verify that no operations were called with excluded paths (i.e., self.assertNotIn(path, excluded))?

@justindho justindho added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. needs-rebase labels May 12, 2022
@bugfood bugfood force-pushed the filter-while-recursing branch from cbbe291 to 8f8bf7a Compare May 12, 2022 18:33
@bugfood
Copy link
Author

bugfood commented May 12, 2022

Thanks for the review.

I rebased first, then I added a couple commits, but I need to do more work on that later, so please don't re-review yet.

I added the new assertNotIn; no problem there.

I had earlier added the code to filters.py because the surrounding code is involved in the compilation of user-supplied filters into the actual filters used internally. Since I needed to auto-insert some patterns at pattern-compilation time, that seemed like a natural place. I'm not arguing that it must be that way, but I don't know what would be better.

For the filtering location, I'm not entirely sure what you are asking for and I may need clarification later, though I did spot a deficiency to address first.

bugfood and others added 7 commits May 12, 2022 14:56
Allow FileCreator to provide a consistent abtraction.

Do not do this for TestSymlinksIgnoreFiles, though--this currently
relies on FileGenerator returning paths with a double separator; use of
self.files to generate the path in this case results in:

Traceback (most recent call last):
  File "tests/unit/customizations/s3/test_filegenerator.py", line 369, in test_follow_symlink
    self.assertEqual(result_list[i], filename)
AssertionError: u'/tmp/tmpVnUU91//foo.txt' != u'/tmp/tmpVnUU91/foo.txt'
- /tmp/tmpVnUU91//foo.txt
?                -
+ /tmp/tmpVnUU91/foo.txt
Any include-path deep in a directory structure implicitly includes its
parent directories (though not necessarily the other contents of those
parents).

Currently, this does not matter, since the FileGenerator examines the
entire tree and emits all deeply-included paths. In a subsequent patch,
however, we want to restrict the FileGenerator from traversing excluded
paths, so we need to explicitly include the parents of any
deeply-included paths, or else a more shallow exclude will inhibit
discovery of the deeply-included paths, even if that exclude is
specified earlier in the command line.
The way the task pipeline that each command executes filters files
*after* collecting information about it. For local files it means that
even when the user doesn't want to sync these files, they'll still be
stat'ed, which is unnecessary.

This change uses the filters received from the command line inside of
the `FileGenerator.should_ignore_file` method. So the list of files to
be filtered won't contain files supposed to be excluded.
This catches bug aws#1117 (which is now fixed).
@bugfood bugfood force-pushed the filter-while-recursing branch from 9558f48 to 996e32a Compare May 12, 2022 22:10
bugfood added 2 commits May 12, 2022 15:11
If desired, this could be squashed to:
Corey Hickey: add a functional test for "aws s3 sync" include/exclude
Move s3 filtering into the FileGenerator and remove the external pass.

This makes the location of s3 file filtering consistent with local file
generation, at least as much as feasible.

For local files, the external pass was redundant now that local
filtering is done within the generation itself.
@bugfood bugfood force-pushed the filter-while-recursing branch from 996e32a to 0dccece Compare May 12, 2022 22:11
@bugfood
Copy link
Author

bugfood commented May 12, 2022

@justindho can you please check the PR now?

The last two commits are work done today; the previous commits are unchanged.

I think this does what you meant with respect to moving the filtering into FileGenerator. I had pushed back on this earlier, but now I realize that the PR was making the filter handing inconsistent, so it is good to take care of this now. At least, if I understand correctly.

@kdaily kdaily added responded and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels May 16, 2022
Copy link
Contributor

@kyleknap kyleknap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bugfood! I'll be the primary reviewer for the PR and will be reviewing any updates to the PR until it gets merged.

With regards to this comment:

I think this does what you meant with respect to moving the filtering into FileGenerator. I had pushed back on this earlier, but now I realize that the PR was making the filter handing inconsistent, so it is good to take care of this now. At least, if I understand correctly.

Yes that is what we were referring to. The update here looks great.

Otherwise, I mainly looked through the implementation and all of my comments were related to the implementation. I'll probably have some other smaller comments in a subsequent review, but I want to make sure we are in a good spot with regards to the approach to the implementation first. Thanks again for seeing this through!

if list(self.file_filter.call([FileInfo(path)])):
yield path, data

def list_objects_raw(self, s3_path, dir_op):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this method let's add an underscore to it to make it _list_objects_raw to indicate it's a private method and is only used within this class.

path = path[:-1]
if os.path.islink(path):
return True
if self.file_filter is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we can move the file filter check into its own helper method that we could reuse across the local and s3 listing methods. So something like this:

def _should_ignore_from_file_filter(self, path):
      return (
           self.file_filter is None and
           not list(self.file_filter.call([FileInfo(path)]))
      )

This will make it a bit more reusable and improve readability as we are reducing the level of nesting/complexity in the if statements.

return False

def list_objects(self, s3_path, dir_op):
def list_objects(self, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could consolidate the calls to the _list_objects_raw() to a single instance to help simplify the logic in list_objects():

def list_objects(self, s3_path, dir_op):
     for path, data in self._list_objects_raw(s3_path, dir_op):
         if self._should_ignore_from_file_filter(path):
             continue
         yield path, data

for pattern in original_patterns:
full_patterns.append(
(pattern[0], os.path.join(rootdir, pattern[1])))
# When including paths deep in the directory tree, auto-include the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure I understand the case this logic is optimizing for after reading this comment: #5425 (comment) and the code. Would it be possible to illustrate the use/edge case with a sample directory structure and CLI commands? I think that would be really helpful.

In general, I'd prefer we do not change anything in the filters.py logic if we do not need to. Mainly the filters.py module is one of the older modules in the s3 package and we rarely update it. I think before adding any additional complexity to it, we do some refactoring of it to make it easier to follow/work with.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have to focus on another project for the time being, but I will circle back to this later for the technical changes requested.

I'm still not sure I understand the case this logic is optimizing for after reading this comment: #5425 (comment) and the code. Would it be possible to illustrate the use/edge case with a sample directory structure and CLI commands?

The use case is fairly simple:

aws s3 sync dir s3://bucket/dir '--exclude excluded/*' --include 'excluded/included/*'

If we make the FileGenerator no longer recurse into excluded directories, then excluded/included would never be examined and any files therein would have no chance to be included. What is needed is a specific include filter for excluded/included. We could make the user do it:

aws s3 sync dir s3://bucket/dir '--exclude excluded/*' --include excluded/included --include 'excluded/included/*'

...but that seems onerous and would break current behavior. In my PR, this is now verified by the commit with description add a functional test for "aws s3 sync" include/exclude.

The best solution I thought of was to auto-insert the necessary filter, since we know the existing filtering logic does what we need. I wanted to keep the special-case code to a minimum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed examples that was really helpful in understanding it! In general with this PR, we definitely need to make sure we do not break current behavior (i.e. given any set of filters + file set, the CLI will continue to include/exclude the same set of files after this change with no needing to adjust the filters provided at the command line).

While it looks like the coded that got added does handle the missing edge case that you illustrated, I am concerned that we are going to be missing other edge cases after seeing the edge case you raised. For example, suppose we had this directory layout:

$ tree .
.
├── directory
│   ├── another-dir
│   │   ├── exclude-me.txt
│   │   └── include-me.py
│   ├── exclude-me.txt
│   └── include-me.py
├── exclude-me.txt
└── include-me.py

If we only wanted to upload the .py files, a user would likely run a command like this:

$ aws s3 cp . s3://mybucket --recursive --exclude '*' --include '*.py' --dryrun

This would result in all of the .py files being uploaded:

(dryrun) upload: directory/another-dir/include-me.py to s3://mybucket/directory/another-dir/include-me.py
(dryrun) upload: directory/include-me.py to s3://mybucket/directory/include-me.py
(dryrun) upload: ./include-me.py to s3://mybucket/include-me.py

However, with the approach we are currently taking, we would still not traversing the directory and another-dir to get all of the .py files:

$ aws s3 cp . s3://mybucket --recursive --exclude '*' --include '*.py' --dryrun
(dryrun) upload: ./include-me.py to s3://mybucket/include-me.py

In general, I'm not optimistic the original approach of just applying the filters to the directory before traversing it is going to work. In terms of options to pursue, I think we either:

  • Need to expand the logic of whether we traverse a directory to skip traversing a directory only if it is impossible for a file inside of it to be included based on the filters. I'm not sure if there is a straightforward way of doing this as we would need to account for all supported pattern symbols: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html#use-of-exclude-and-include-filters

  • Scope the optimization of not traversing directories to simple set of exclude/include patterns (e.g. we only apply this optimization to only filters that have only --exclude flags and/or apply the optimizations if there are no special characters used in the filter (e.g. *, ?, [], [!])

Thoughts? I want to think about this a bit more to see if I have any better suggestions, but I at least wanted to put this realization down now to start the discussion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that is rough.

Your first idea seems best, but I don't know how to do that either.

A third idea would be to give the user control, either by a global option --no-traverse-excludes or a separate argument --hard-exclude <path> (to replace --exclude <path>). This isn't ideal, since it would be best if the program does the right thing automatically, but it may be more straightforward to implement and would not have unexpected consequences for current uses.

@kyleknap kyleknap added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed responded labels May 18, 2022
@Thebutlersguilty3
Copy link

Thebutlersguilty3 commented May 30, 2022 via email

@tim-finnigan
Copy link
Contributor

Upon discussing with the team, this PR is not prioritized at this time but we are setting it to Draft to continue tracking the proposed changes. Per this earlier comment and discussion there are still some design implementation options that must be considered for a path forward here.

@tim-finnigan tim-finnigan marked this pull request as draft August 30, 2023 17:43
@tim-finnigan
Copy link
Contributor

Checking in — we are going to close this as not prioritized at this time. There has not been any discussion regarding the changes in a couple of years, and further research is required into the best implementation here. But we are continuing to track the open related issues (#1138 and #1117)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

implementation needs-rebase response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

aws s3 sync traverses excluded directory - takes up a long time

10 participants