Skip to content

Conversation

@clarete
Copy link

@clarete clarete commented Aug 14, 2016

This probably helps on issue #1138.

I have a HUGE ~/src directory that contains projects that I clone from various version control systems. Which makes me want to avoid backing them up on my personal S3 bucket to save some coins. For that, I used the --exclude="src/*" parameter (I also make sure the aws command is called from my $HOME directory, since I learned that the filters start matching from the current directory -- more details can be found in the #1588 issue). Although it worked, it still scanned and stat'ed the entire directory.

This patch changes the FileGenerator class to

  1. take the filters from the command line as a parameter in its constructor
  2. Use the file_filter parameter to match the path received by .should_ignore_file and figure out if the path received by the method is allowed by the filters.

This patch can be considered a work in progress because I only tested my specific need. It still needs proper tests. I didn't really write any because I wanted to ask if the strategy I chose is enough. If that's the case, I'd spend time writing the proper tests. Otherwise, I'd love to hear a better idea. I may be able to rework it if it's not too time consuming.

The way the task pipeline that each command executes filters files
*after* collecting information about it. For local files it means that
even when the user doesn't want to sync these files, they'll still be
stat'ed, which is unnecessary.

This change uses the filters received from the command line inside of
the `FileGenerator.should_ignore_file` method. So the list of files to
be filtered won't contain files supposed to be excluded.
@clarete
Copy link
Author

clarete commented Aug 14, 2016

I hope you guys like the idea, it also seems to fix #1117.

@kyleknap
Copy link
Contributor

@clarete I think the approach is reasonable. I think this is something we would pull in.

The only thing that I would be wary of is to make sure the file comparisons in sync do not get out of order when comparing the two order lists. But I do not think that is an issue because the comparison of files happens after the usual filtering and you are just pushing the filtering to happen even earlier in the process.

If you can get some unit tests up, we will take a more scrutinized look and make sure that none of the integration tests are not broken and proceed from there.

@clarete
Copy link
Author

clarete commented Sep 5, 2016

Still interested, just didn't have time yet. Will post results as soon as my weird schedule allows. Sorry about that!

@gravyboat
Copy link

@clarete Are you still planning on finishing this?

@bugfood
Copy link

bugfood commented Jul 28, 2020

This patch still applies to current develop, albeit with a couple trivial merge conflicts. I tried it out and found it quite suitable. The largest benefit for my use case is that it makes aws no longer recurse deeply into directories that are excluded. It still stats the contents of excluded directories, but does not go into subdirectories thereof.

In my case, I am periodically syncing a large deep file tree from a netapp NFS share; netapp provides access to snapshots in .snapshot/8hour.2020-07-26_1815, .8hour.2020-07-27_0215, etc. for each snapshot. Naturally, I am excluding the snapshots, but with six snapshots and unpatched aws, the sync takes around seven times as long as it should (the source contents don't change substantially, so there is little actual data to transfer after the initial sync).

I have made a PR with a minor modification and an update to unit tests:
#5425
I'll try to work on that until it passes the CI tests (edit: CI passes)

@kdaily kdaily added needs-review This issue or pull request needs review from a core team member. pr:needs-review This PR needs a review from a Member. labels Aug 24, 2021
@tim-finnigan
Copy link
Contributor

Thank you for submitting this PR. We are going to use #5425 to track this going forward, as that includes these changes along with a few tests and modifications as mentioned above. I just left a response here on that PR requesting some additional changes.

@tim-finnigan tim-finnigan removed the needs-review This issue or pull request needs review from a core team member. label Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr:needs-review This PR needs a review from a Member.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants