cli: add support for userfile upload CLI command#50981
cli: add support for userfile upload CLI command#50981craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
9150770 to
eabe983
Compare
miretskiy
left a comment
There was a problem hiding this comment.
Reviewed 8 of 10 files at r1, 1 of 3 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)
pkg/cli/cli_test.go, line 1411 at r1 (raw file):
nodelocal upload and delete nodelocal files userfile upload and delete user scoped files
indentation off?
eabe983 to
0bb7b6c
Compare
adityamaru
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)
pkg/cli/cli_test.go, line 1411 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
indentation off?
Done.
knz
left a comment
There was a problem hiding this comment.
I'm glad this work is now started; it's super important.
I have a few remarks below.
One general design question I have is why it is so important to preserve the directory hierarchy that's present client-side to also be there server-side. Is it truly important that a client path a/b/c.csv gets passed as userfile://.../a/b/c.csv? Why not flatten all the complex paths to a single directory level server-side, for example userfile://.../a_b_c.csv?
(We happen to have a helper function to do that btw see my comment below)
Reviewed 8 of 10 files at r1, 3 of 3 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)
pkg/cli/nodelocal.go, line 86 at r3 (raw file):
// messages because we clean the path before using in the nodelocal // ExternalStorage methods. destination = strings.TrimPrefix(destination, "/")
Can you double-check the semantics of this on Windows:
-
windows uses a
//prefix to indicate a network directory. Your code would behave weirdly for that. -
if you want to trim the directory separator, be mindful that the separator on windows can be either
/or\. there's a function in the stdlib for this.
pkg/cli/userfile.go, line 92 at r3 (raw file):
// its contents to a SQL table with filename set to test/../../test.csv. This // is strange and we should come up with a better scheme of enforcing // "sensible" filenames.
Please have a look at type fileNameEscaper in zip.go, what it's used for, and if possible use it.
pkg/cli/userfile.go, line 106 at r3 (raw file):
// Currently we hardcode the db.schema prefix, in the future we might allow // users to specify this. userfileURI := fmt.Sprintf("userfile://%s%s",
Please avoid generating URIs by string concatenation in this way. You can instead:
- instantiate a url.URL
- populate its fields, in particular use filepath.Join for the Path component
- then use String() on the result.
pkg/cli/userfile.go, line 110 at r3 (raw file):
stmt, err := conn.conn.Prepare(sql.CopyInFileStmt(userfileURI, "crdb_internal", "user_file_upload"))
Please avoid magic strings in the code; especially magic strings that have to match in multiple places.
Instead, export a constant string from a common package dependency and use the constant in both places.
pkg/sql/conn_executor.go, line 1776 at r3 (raw file):
var cm copyMachineInterface var err error if table := cmd.Stmt.Table; (table.Table() == nodelocalFileUploadTable ||
I would recommend extracting this comparison itno a helper function, whose functionality would be documented in a comment.
pkg/testutils/lint/lint_test.go, line 1451 at r3 (raw file):
// Using deprecated method to COPY. stream.GrepNot(`pkg/cli/nodelocal.go:.* stmt.Exec is deprecated: .*`), stream.GrepNot(`pkg/cli/userfile.go:.* stmt.Exec is deprecated: .*`),
Please reach out to @mjibson to check whether the new method is already available; then use it if it is (and avoid modifying the linter).
Conversely, if the method is not available:
- add a TODO in your new code and also in nodelocal.go to indicate that the function needs to be replaced
- update the linter here to include matt's explanation into the comment (so that the next person coming to this has more words to understand what's going on)
madelynnblue
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)
pkg/cli/userfile.go, line 75 at r3 (raw file):
} if _, err := conn.conn.Exec(`BEGIN`, nil); err != nil {
ExecContext might work here? Try it out.
0bb7b6c to
4424c6f
Compare
There was a problem hiding this comment.
Thank you for the review! I brought up this question offline with the team and the general consensus was that there doesn't seem to be a compelling enough reason to replace / with _. A path such as /test/../../test.csv would get obfuscated to _test_______test.csv(provided we don't do any path cleaning) in the SQL tables. Having said this the current state is only to make forward progress - the follow-up approach being discussed is:
- Treat the destination path as a filename. So /a/b/c.csv refers to a file with that whole path as its name.
This is what gsutil does, and we think will be the easiest to convey to the user - "whatever you give as the destination will be the name used to reference that blob"
While we arrive at a conclusion, we think we should error out on seeing .. in the file path. We can always remove this error once we settle on a good user experience.
EDIT: We had some more discussions over user-friendly interaction with userfile upload over the last week. These can be tracked at issue -#51222 and PR - #51353.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @knz, @miretskiy, and @mjibson)
pkg/cli/nodelocal.go, line 86 at r3 (raw file):
Previously, knz (kena) wrote…
Can you double-check the semantics of this on Windows:
windows uses a
//prefix to indicate a network directory. Your code would behave weirdly for that.if you want to trim the directory separator, be mindful that the separator on windows can be either
/or\. there's a function in the stdlib for this.
Constructing a url.URL as you suggested let me get rid of the trimming. This should accommodate for both / and \ as well.
pkg/cli/userfile.go, line 75 at r3 (raw file):
Previously, mjibson (Matt Jibson) wrote…
ExecContext might work here? Try it out.
Done. stmt.Exec cannot be changed to StmtExecContext because the copyin driver does not implement the method from what I can tell.
pkg/cli/userfile.go, line 92 at r3 (raw file):
Previously, knz (kena) wrote…
Please have a look at
type fileNameEscaperinzip.go, what it's used for, and if possible use it.
Addressed in the PR level comment above.
pkg/cli/userfile.go, line 106 at r3 (raw file):
Previously, knz (kena) wrote…
Please avoid generating URIs by string concatenation in this way. You can instead:
- instantiate a url.URL
- populate its fields, in particular use filepath.Join for the Path component
- then use String() on the result.
Done.
pkg/cli/userfile.go, line 110 at r3 (raw file):
Previously, knz (kena) wrote…
Please avoid magic strings in the code; especially magic strings that have to match in multiple places.
Instead, export a constant string from a common package dependency and use the constant in both places.
Done.
pkg/sql/conn_executor.go, line 1776 at r3 (raw file):
Previously, knz (kena) wrote…
I would recommend extracting this comparison itno a helper function, whose functionality would be documented in a comment.
Done.
pkg/testutils/lint/lint_test.go, line 1451 at r3 (raw file):
add a TODO in your new code and also in nodelocal.go to indicate that the function needs to be replaced
Some improvements were made to switch Exec to ExecContext, but the stmt.Exec call which is caught by the linter cannot be changed as the copyin driver does not have this method implemented. I have added both the suggested comments.
d26187f to
7294f5b
Compare
knz
left a comment
There was a problem hiding this comment.
Reviewed 10 of 10 files at r4.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @miretskiy, and @mjibson)
pkg/cli/nodelocal.go, line 56 at r4 (raw file):
defer reader.Close() ctx, cancel := context.WithCancel(context.Background())
I'm confused - your uploadFile function is synchronous, what is there to cancel when it has returned?
pkg/cli/userfile.go, line 54 at r4 (raw file):
defer reader.Close() ctx, cancel := context.WithCancel(context.Background())
ditto - what is this cancel about
pkg/cli/userfile.go, line 101 at r4 (raw file):
// is strange and we should come up with a better scheme of enforcing // "sensible" filenames. if strings.Contains(destination, "..") {
This check looks overly restrictive. As stated, I cannot use a destination path that looks like this: a/b..c/d even though it ought to be valid.
Also, What do you make of the destinations a/./b and a/b? Are they the same? Are they considered different?
pkg/cli/userfile.go, line 116 at r4 (raw file):
userfileURL := url.URL{ Scheme: "userfile", Host: defaultQualifiedDBSchemaName + connURL.User.Username(),
Ok that might be a problem. We allow username that contain arbitrary unicode letters (character classes L/l) which means any letter with accents, diacritics etc.
Do we have a guarantee that all the stores server-side where the files will be created accept to name files using arbitrary unicode letters?
If not, you still need to adopt an escaping scheme of some sort.
(Also I recommend you add some tests for this case)
pkg/sql/copy_file_upload_test.go, line 55 at r4 (raw file):
func prepareFileUploadURI(user, testSendFile, copyInternalTable string) (string, error) { var uri string if copyInternalTable == NodelocalFileUploadTable {
nit: turn this if/elseif into a switch to make it more readable.
pkg/testutils/lint/lint_test.go, line 1451 at r3 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
add a TODO in your new code and also in nodelocal.go to indicate that the function needs to be replaced
Some improvements were made to switch
ExectoExecContext, but thestmt.Execcall which is caught by the linter cannot be changed as thecopyindriver does not have this method implemented. I have added both the suggested comments.
thanks
pkg/cli/userfile.go
Outdated
| // you upload a file to a destination such as test/../../test.csv, we write | ||
| // its contents to a SQL table with filename set to test/../../test.csv. This | ||
| // is strange and we should come up with a better scheme of enforcing | ||
| // "sensible" filenames. |
There was a problem hiding this comment.
I think Go URLs support path normalization per RFC3986. https://tools.ietf.org/html/rfc3986
Perhaps we could normalize a URL by parsing it, then reconstruct it and avoid some headaches?
There was a problem hiding this comment.
I still think we’re making a lot of these “paths” that we don’t need to — These are not actual file system paths, they’re just string names. S3 or google cloud storage just treat the name as a string, with slashes simply a display-time concept used for grouping results in the GUI. gsutil cp local-file gs://my_bucket/path/../to/./file produces a file named “ path/../to/./file” and if it is good enough for google, it is good enough for me.
There was a problem hiding this comment.
if it is good enough for google, it is good enough for me.
-
You say "S3 or GCS". What does S3 say?
-
S3 provides a HTTP interface to access stored files in a web browser. What does the web browser say about these special paths? Do they get normalized?
-
you're casually dismissing the "UX surprise" factor. Users will see this an ask questions. Wouldn't it be clearer to side-step these discussions preemptively?
There was a problem hiding this comment.
[13:21:41] $ aws s3 cp README.md s3://roland-test-s3encryption/a/path/../bar
upload: ./README.md to s3://bucket/a/path/../bar
[13:22:00] $ aws s3 ls s3://bucket/a/
PRE path/
[13:22:10] $ aws s3 ls s3://bucket/a/path/
PRE ../
[13:22:14] $ aws s3 ls s3://bucket/a/path/../
2020-07-14 13:22:01 6494 bar
There was a problem hiding this comment.
AWS HTTP API just treats the dots as part of the name. A GET request to "http://bucket.s3-website-us-east-1.amazonaws.com/a/path/../bar" will return that file, after a aws s3api put-object-acl --bucket bucket --key a/path/foo --acl public-read, where, again, in the key we see dots are just dots.
There was a problem hiding this comment.
So, is it a bit surprising? yeah, it could be, to someone who expects it to behave like their shell. But given that userfile sits side-by-side with s3 and gcs in our storage options, I'm in favor of it behaving consistently with them. The interaction we're building here maps 1:1 with aws s3 cp local remote or gsutil cp local remote and obviously they both came to the same conclusion: just treat the supplied remote name as a string key, with no special internal rewriting rules, and let the user decide what to put in it.
There was a problem hiding this comment.
The interaction we're building here maps 1:1 with
I hear you and as long as this is true, all the points you make (and the choices in the PR) are OK.
Except that I expect this not to be true all the time. Before we know it we'll want to mock the stores by using a filesystem-backed store instead. Or there's a new cloud provider with a different idea on the topic.
More generally, I am not keen to look just at S3 and GCS and then rewrite 60 years of filesystem design history, while praying that the future will not re-align. This is ad-hoc revisionism and bound to bite us in the future.
Finally, this creates an inconsistency between the handling of s3/gcs-backed bulk operations, and that supported by nodelocal://. Because nodelocal does normalize its paths.
There was a problem hiding this comment.
If or when we choose to back this with a real file system, it'd be on us to ensure ensure we internally use file names that underlying filesystem handles, either via escaping of the user-provided name, hashing it, or a completely opaque identifier to which we maintain a mapping. This would be true anyway, regardless of dot segments, due to unicode or other special chars. I'm confident we could solve that if we needed to.
Yes, nodelocal normalizes paths, but if anything, I'd prefer it didn't -- we have to take extreme care when doing so to ensure that the post-normalization path abides by the base-path restriction. I'd be in favor of, instead of checking that post-normalization we're still in the base, we instead simply required normalization to be noop, that the post-norm result match the input, and thus rejected paths that used dot segments when interacting with nodelocal. Additionally, we control the semantics of nodelocal but we don't control the semantics of the cloud providers, so if we want to provide consistent semantics across out external IO providers, then the fact nodelocal is the odd-one-out is what we should fix, not emulate.
There was a problem hiding this comment.
If or when we choose to back this with a real file system, it'd be on us to ensure ensure we internally use file names that underlying filesystem handles, either via escaping of the user-provided name, hashing it, or a completely opaque identifier to which we maintain a mapping
Ok that works for me.
So, is it a bit surprising? yeah, it could be, to someone who expects it to behave like their shell. But given that userfile sits side-by-side with s3 and gcs in our storage options, I'm in favor of it behaving consistently with them.
This does not address the fact that a user will expect a/./b and a/b to be equivalent, as well as a/b/../c and a/c.
I get that S3/GCS have a different opinion on the topic, but it's unlikely that the database user will know/understand this. I think it's on CockroachDB to "make data easy" and remove this source of surprise.
At the very least I'm in favor of blocking these patterns. I would be surprised if users actively desire to support them, an I'd be willing to block this feature until the point we have evidence it's needed.
There was a problem hiding this comment.
If or when we choose to back this with a real file system, it'd be on us to ensure ensure we internally use file names that underlying filesystem handles, either via escaping of the user-provided name, hashing it, or a completely opaque identifier to which we maintain a mapping
Ok that works for me.
I concur here. The keys for the blobs should have no relation to the underlying storage system.
At the very least I'm in favor of blocking these patterns. I would be surprised if users actively desire to support them, an I'd be willing to block this feature until the point we have evidence it's needed.
Sounds good to me.
7294f5b to
0debb50
Compare
adityamaru
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @knz, @miretskiy, and @mjibson)
pkg/cli/nodelocal.go, line 56 at r4 (raw file):
Previously, knz (kena) wrote…
I'm confused - your
uploadFilefunction is synchronous, what is there to cancel when it has returned?
Done.
pkg/cli/userfile.go, line 54 at r4 (raw file):
Previously, knz (kena) wrote…
ditto - what is this cancel about
Done.
pkg/cli/userfile.go, line 101 at r4 (raw file):
Previously, knz (kena) wrote…
This check looks overly restrictive. As stated, I cannot use a destination path that looks like this:
a/b..c/deven though it ought to be valid.Also, What do you make of the destinations
a/./banda/b? Are they the same? Are they considered different?
The follow-up PR which adopts the new scheme of user input handling gets rid of this check entirely. If we commit to the notion that the filename is what the user provided as the destination arg then a/./b and a/b will be different in the underlying UserFileTableStorage.
pkg/cli/userfile.go, line 116 at r4 (raw file):
Previously, knz (kena) wrote…
Ok that might be a problem. We allow username that contain arbitrary unicode letters (character classes L/l) which means any letter with accents, diacritics etc.
Do we have a guarantee that all the stores server-side where the files will be created accept to name files using arbitrary unicode letters?If not, you still need to adopt an escaping scheme of some sort.
(Also I recommend you add some tests for this case)
As discussed offline the username is only used as a prefix of the SQL tables. It follows the same behavior as if one created a table using CREATE TABLE with a special Unicode character. As an aside I think SHOW TABLES does not handle unicode characters correctly, as SHOW USERS does. I have filed a separate issue for this.
pkg/sql/copy_file_upload_test.go, line 55 at r4 (raw file):
Previously, knz (kena) wrote…
nit: turn this if/elseif into a switch to make it more readable.
Done.
0debb50 to
511d6d1
Compare
| " not provided") | ||
| } | ||
|
|
||
| // FileTableStorage is not backed by a file system and so the name of the file |
knz
left a comment
There was a problem hiding this comment.
LGTM modulo adding a sanity check about special chars
Reviewed 5 of 5 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @adityamaru, @dt, @knz, @miretskiy, and @mjibson)
pkg/cli/userfiletable_test.go, line 40 at r5 (raw file):
c.Run(fmt.Sprintf("userfile upload %s /test/file1.csv", file)) c.Run(fmt.Sprintf("userfile upload %s /test/../../file1.csv", file)) c.Run(fmt.Sprintf("userfile upload %s /test/./file1.csv", file))
can you add at least one test with special characters here.
pkg/storage/cloudimpl/file_table_storage.go, line 45 at r5 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
@knz @dt moved the check to the server-side. Also added a couple of tests in
userfile_test.go.
thanks!
This change adds a CLI command allowing users to upload local files to the user scoped file table ExternalStorage. The command `userfile upload` uses the existing COPY protocol (similar to `nodelocal upload`) to upload files and write them to the UserFileTableSystem. The UserFileTableSystem is backed by two SQL tables which are currently always created with `defaultdb.public.user` as the prefix of the qualified name. In the future we may allow users to specify the `db.schema` they wish to store their tables in. The command takes a source and destination path. The former is used to find the file contents locally, the latter is used to reference the file and its related metadata/payload in the SQL tables. Known limitations: - All destination paths must start with `/`, this is to help us disambiguate filepath from `db.schema` name when we allow users to specify that in the future. - Destination paths must not have a `..` in them. Since the UserFilTableSystem is not a "real" file system, storing SQL rows with filenames such as /test/../test.csv seems strange. We will work on enforcing a better naming scheme. Release note (cli change): Adds a userfile upload command that can be used to upload a file to the user scoped blob storage: `userfile upload source/file /destination/of/file`
511d6d1 to
252d60b
Compare
adityamaru
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @dt, @knz, @miretskiy, and @mjibson)
pkg/cli/userfiletable_test.go, line 40 at r5 (raw file):
Previously, knz (kena) wrote…
can you add at least one test with special characters here.
done, thanks for pointing this out it had slipped through the cracks. I had to "unescape" the URL to account for arbitrary unicode characters.
knz
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r6.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @dt, @miretskiy, and @mjibson)
can you also change your other PR with the "list" feature to ensure that special chars properly roundtrip across store-listing. |
|
TFTRs! bors r+ |
Build succeeded |
This change adds a CLI command allowing users to upload local files to
the user scoped file table ExternalStorage. The command
userfile uploaduses the existing COPY protocol (similar tonodelocal upload)to upload files and write them to the UserFileTableSystem. The
UserFileTableSystem is backed by two SQL tables which are currently
always created with
defaultdb.public.useras the prefix of thequalified name. In the future we may allow users to specify the
db.schemathey wish to store their tables in.The command takes a source and destination path. The former is used to
find the file contents locally, the latter is used to reference the file
and its related metadata/payload in the SQL tables.
Known limitations:
All destination paths must start with
/, this is to help usdisambiguate filepath from
db.schemaname when we allow users tospecify that in the future.
Destination paths must not have a
..in them. Since theUserFilTableSystem is not a "real" file system, storing SQL rows with
filenames such as /test/../test.csv seems strange. We will work on
enforcing a better naming scheme.
Informs: #47211
Release note (cli change): Adds a userfile upload command that can be
used to upload a file to the user scoped blob storage:
userfile upload source/file /destination/of/file