Skip to content

cli: add support for userfile upload CLI command#50981

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
adityamaru:user-file-upload
Jul 16, 2020
Merged

cli: add support for userfile upload CLI command#50981
craig[bot] merged 1 commit intocockroachdb:masterfrom
adityamaru:user-file-upload

Conversation

@adityamaru
Copy link
Copy Markdown
Contributor

This change adds a CLI command allowing users to upload local files to
the user scoped file table ExternalStorage. The command
userfile upload uses the existing COPY protocol (similar to nodelocal upload)
to upload files and write them to the UserFileTableSystem. The
UserFileTableSystem is backed by two SQL tables which are currently
always created with defaultdb.public.user as the prefix of the
qualified name. In the future we may allow users to specify the
db.schema they wish to store their tables in.

The command takes a source and destination path. The former is used to
find the file contents locally, the latter is used to reference the file
and its related metadata/payload in the SQL tables.

Known limitations:

  • All destination paths must start with /, this is to help us
    disambiguate filepath from db.schema name when we allow users to
    specify that in the future.

  • Destination paths must not have a .. in them. Since the
    UserFilTableSystem is not a "real" file system, storing SQL rows with
    filenames such as /test/../test.csv seems strange. We will work on
    enforcing a better naming scheme.

Informs: #47211

Release note (cli change): Adds a userfile upload command that can be
used to upload a file to the user scoped blob storage: userfile upload source/file /destination/of/file

@adityamaru adityamaru requested review from dt and miretskiy July 4, 2020 18:33
@adityamaru adityamaru requested a review from a team as a code owner July 4, 2020 18:33
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 10 files at r1, 1 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)


pkg/cli/cli_test.go, line 1411 at r1 (raw file):

  nodelocal         upload and delete nodelocal files
	userfile					upload and delete user scoped files

indentation off?

@miretskiy miretskiy self-requested a review July 6, 2020 00:10
Copy link
Copy Markdown
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)


pkg/cli/cli_test.go, line 1411 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

indentation off?

Done.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad this work is now started; it's super important.
I have a few remarks below.

One general design question I have is why it is so important to preserve the directory hierarchy that's present client-side to also be there server-side. Is it truly important that a client path a/b/c.csv gets passed as userfile://.../a/b/c.csv? Why not flatten all the complex paths to a single directory level server-side, for example userfile://.../a_b_c.csv?

(We happen to have a helper function to do that btw see my comment below)

Reviewed 8 of 10 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)


pkg/cli/nodelocal.go, line 86 at r3 (raw file):

	// messages because we clean the path before using in the nodelocal
	// ExternalStorage methods.
	destination = strings.TrimPrefix(destination, "/")

Can you double-check the semantics of this on Windows:

  1. windows uses a // prefix to indicate a network directory. Your code would behave weirdly for that.

  2. if you want to trim the directory separator, be mindful that the separator on windows can be either / or \. there's a function in the stdlib for this.


pkg/cli/userfile.go, line 92 at r3 (raw file):

	// its contents to a SQL table with filename set to test/../../test.csv. This
	// is strange and we should come up with a better scheme of enforcing
	// "sensible" filenames.

Please have a look at type fileNameEscaper in zip.go, what it's used for, and if possible use it.


pkg/cli/userfile.go, line 106 at r3 (raw file):

	// Currently we hardcode the db.schema prefix, in the future we might allow
	// users to specify this.
	userfileURI := fmt.Sprintf("userfile://%s%s",

Please avoid generating URIs by string concatenation in this way. You can instead:

  • instantiate a url.URL
  • populate its fields, in particular use filepath.Join for the Path component
  • then use String() on the result.

pkg/cli/userfile.go, line 110 at r3 (raw file):

	stmt, err := conn.conn.Prepare(sql.CopyInFileStmt(userfileURI, "crdb_internal",
		"user_file_upload"))

Please avoid magic strings in the code; especially magic strings that have to match in multiple places.

Instead, export a constant string from a common package dependency and use the constant in both places.


pkg/sql/conn_executor.go, line 1776 at r3 (raw file):

	var cm copyMachineInterface
	var err error
	if table := cmd.Stmt.Table; (table.Table() == nodelocalFileUploadTable ||

I would recommend extracting this comparison itno a helper function, whose functionality would be documented in a comment.


pkg/testutils/lint/lint_test.go, line 1451 at r3 (raw file):

				// Using deprecated method to COPY.
				stream.GrepNot(`pkg/cli/nodelocal.go:.* stmt.Exec is deprecated: .*`),
				stream.GrepNot(`pkg/cli/userfile.go:.* stmt.Exec is deprecated: .*`),

Please reach out to @mjibson to check whether the new method is already available; then use it if it is (and avoid modifying the linter).

Conversely, if the method is not available:

  • add a TODO in your new code and also in nodelocal.go to indicate that the function needs to be replaced
  • update the linter here to include matt's explanation into the comment (so that the next person coming to this has more words to understand what's going on)

Copy link
Copy Markdown
Contributor

@madelynnblue madelynnblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)


pkg/cli/userfile.go, line 75 at r3 (raw file):

	}

	if _, err := conn.conn.Exec(`BEGIN`, nil); err != nil {

ExecContext might work here? Try it out.

Copy link
Copy Markdown
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review! I brought up this question offline with the team and the general consensus was that there doesn't seem to be a compelling enough reason to replace / with _. A path such as /test/../../test.csv would get obfuscated to _test_______test.csv(provided we don't do any path cleaning) in the SQL tables. Having said this the current state is only to make forward progress - the follow-up approach being discussed is:

  • Treat the destination path as a filename. So /a/b/c.csv refers to a file with that whole path as its name.
    This is what gsutil does, and we think will be the easiest to convey to the user - "whatever you give as the destination will be the name used to reference that blob"

While we arrive at a conclusion, we think we should error out on seeing .. in the file path. We can always remove this error once we settle on a good user experience.

EDIT: We had some more discussions over user-friendly interaction with userfile upload over the last week. These can be tracked at issue -#51222 and PR - #51353.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @knz, @miretskiy, and @mjibson)


pkg/cli/nodelocal.go, line 86 at r3 (raw file):

Previously, knz (kena) wrote…

Can you double-check the semantics of this on Windows:

  1. windows uses a // prefix to indicate a network directory. Your code would behave weirdly for that.

  2. if you want to trim the directory separator, be mindful that the separator on windows can be either / or \. there's a function in the stdlib for this.

Constructing a url.URL as you suggested let me get rid of the trimming. This should accommodate for both / and \ as well.


pkg/cli/userfile.go, line 75 at r3 (raw file):

Previously, mjibson (Matt Jibson) wrote…

ExecContext might work here? Try it out.

Done. stmt.Exec cannot be changed to StmtExecContext because the copyin driver does not implement the method from what I can tell.


pkg/cli/userfile.go, line 92 at r3 (raw file):

Previously, knz (kena) wrote…

Please have a look at type fileNameEscaper in zip.go, what it's used for, and if possible use it.

Addressed in the PR level comment above.


pkg/cli/userfile.go, line 106 at r3 (raw file):

Previously, knz (kena) wrote…

Please avoid generating URIs by string concatenation in this way. You can instead:

  • instantiate a url.URL
  • populate its fields, in particular use filepath.Join for the Path component
  • then use String() on the result.

Done.


pkg/cli/userfile.go, line 110 at r3 (raw file):

Previously, knz (kena) wrote…

Please avoid magic strings in the code; especially magic strings that have to match in multiple places.

Instead, export a constant string from a common package dependency and use the constant in both places.

Done.


pkg/sql/conn_executor.go, line 1776 at r3 (raw file):

Previously, knz (kena) wrote…

I would recommend extracting this comparison itno a helper function, whose functionality would be documented in a comment.

Done.


pkg/testutils/lint/lint_test.go, line 1451 at r3 (raw file):

add a TODO in your new code and also in nodelocal.go to indicate that the function needs to be replaced

Some improvements were made to switch Exec to ExecContext, but the stmt.Exec call which is caught by the linter cannot be changed as the copyin driver does not have this method implemented. I have added both the suggested comments.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 10 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, @miretskiy, and @mjibson)


pkg/cli/nodelocal.go, line 56 at r4 (raw file):

	defer reader.Close()

	ctx, cancel := context.WithCancel(context.Background())

I'm confused - your uploadFile function is synchronous, what is there to cancel when it has returned?


pkg/cli/userfile.go, line 54 at r4 (raw file):

	defer reader.Close()

	ctx, cancel := context.WithCancel(context.Background())

ditto - what is this cancel about


pkg/cli/userfile.go, line 101 at r4 (raw file):

	// is strange and we should come up with a better scheme of enforcing
	// "sensible" filenames.
	if strings.Contains(destination, "..") {

This check looks overly restrictive. As stated, I cannot use a destination path that looks like this: a/b..c/d even though it ought to be valid.

Also, What do you make of the destinations a/./b and a/b? Are they the same? Are they considered different?


pkg/cli/userfile.go, line 116 at r4 (raw file):

	userfileURL := url.URL{
		Scheme: "userfile",
		Host:   defaultQualifiedDBSchemaName + connURL.User.Username(),

Ok that might be a problem. We allow username that contain arbitrary unicode letters (character classes L/l) which means any letter with accents, diacritics etc.
Do we have a guarantee that all the stores server-side where the files will be created accept to name files using arbitrary unicode letters?

If not, you still need to adopt an escaping scheme of some sort.

(Also I recommend you add some tests for this case)


pkg/sql/copy_file_upload_test.go, line 55 at r4 (raw file):

func prepareFileUploadURI(user, testSendFile, copyInternalTable string) (string, error) {
	var uri string
	if copyInternalTable == NodelocalFileUploadTable {

nit: turn this if/elseif into a switch to make it more readable.


pkg/testutils/lint/lint_test.go, line 1451 at r3 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

add a TODO in your new code and also in nodelocal.go to indicate that the function needs to be replaced

Some improvements were made to switch Exec to ExecContext, but the stmt.Exec call which is caught by the linter cannot be changed as the copyin driver does not have this method implemented. I have added both the suggested comments.

thanks

// you upload a file to a destination such as test/../../test.csv, we write
// its contents to a SQL table with filename set to test/../../test.csv. This
// is strange and we should come up with a better scheme of enforcing
// "sensible" filenames.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Go URLs support path normalization per RFC3986. https://tools.ietf.org/html/rfc3986

Perhaps we could normalize a URL by parsing it, then reconstruct it and avoid some headaches?

https://golang.org/src/net/url/url.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we’re making a lot of these “paths” that we don’t need to — These are not actual file system paths, they’re just string names. S3 or google cloud storage just treat the name as a string, with slashes simply a display-time concept used for grouping results in the GUI. gsutil cp local-file gs://my_bucket/path/../to/./file produces a file named “ path/../to/./file” and if it is good enough for google, it is good enough for me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is good enough for google, it is good enough for me.

  • You say "S3 or GCS". What does S3 say?

  • S3 provides a HTTP interface to access stored files in a web browser. What does the web browser say about these special paths? Do they get normalized?

  • you're casually dismissing the "UX surprise" factor. Users will see this an ask questions. Wouldn't it be clearer to side-step these discussions preemptively?

Copy link
Copy Markdown
Contributor

@dt dt Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[13:21:41] $ aws s3 cp README.md s3://roland-test-s3encryption/a/path/../bar
upload: ./README.md to s3://bucket/a/path/../bar

[13:22:00] $ aws s3 ls s3://bucket/a/
                           PRE path/

[13:22:10] $ aws s3 ls s3://bucket/a/path/
                           PRE ../

[13:22:14] $ aws s3 ls s3://bucket/a/path/../
2020-07-14 13:22:01       6494 bar

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AWS HTTP API just treats the dots as part of the name. A GET request to "http://bucket.s3-website-us-east-1.amazonaws.com/a/path/../bar" will return that file, after a aws s3api put-object-acl --bucket bucket --key a/path/foo --acl public-read, where, again, in the key we see dots are just dots.

Copy link
Copy Markdown
Contributor

@dt dt Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, is it a bit surprising? yeah, it could be, to someone who expects it to behave like their shell. But given that userfile sits side-by-side with s3 and gcs in our storage options, I'm in favor of it behaving consistently with them. The interaction we're building here maps 1:1 with aws s3 cp local remote or gsutil cp local remote and obviously they both came to the same conclusion: just treat the supplied remote name as a string key, with no special internal rewriting rules, and let the user decide what to put in it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interaction we're building here maps 1:1 with

I hear you and as long as this is true, all the points you make (and the choices in the PR) are OK.

Except that I expect this not to be true all the time. Before we know it we'll want to mock the stores by using a filesystem-backed store instead. Or there's a new cloud provider with a different idea on the topic.

More generally, I am not keen to look just at S3 and GCS and then rewrite 60 years of filesystem design history, while praying that the future will not re-align. This is ad-hoc revisionism and bound to bite us in the future.

Finally, this creates an inconsistency between the handling of s3/gcs-backed bulk operations, and that supported by nodelocal://. Because nodelocal does normalize its paths.

Copy link
Copy Markdown
Contributor

@dt dt Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If or when we choose to back this with a real file system, it'd be on us to ensure ensure we internally use file names that underlying filesystem handles, either via escaping of the user-provided name, hashing it, or a completely opaque identifier to which we maintain a mapping. This would be true anyway, regardless of dot segments, due to unicode or other special chars. I'm confident we could solve that if we needed to.

Yes, nodelocal normalizes paths, but if anything, I'd prefer it didn't -- we have to take extreme care when doing so to ensure that the post-normalization path abides by the base-path restriction. I'd be in favor of, instead of checking that post-normalization we're still in the base, we instead simply required normalization to be noop, that the post-norm result match the input, and thus rejected paths that used dot segments when interacting with nodelocal. Additionally, we control the semantics of nodelocal but we don't control the semantics of the cloud providers, so if we want to provide consistent semantics across out external IO providers, then the fact nodelocal is the odd-one-out is what we should fix, not emulate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If or when we choose to back this with a real file system, it'd be on us to ensure ensure we internally use file names that underlying filesystem handles, either via escaping of the user-provided name, hashing it, or a completely opaque identifier to which we maintain a mapping

Ok that works for me.

So, is it a bit surprising? yeah, it could be, to someone who expects it to behave like their shell. But given that userfile sits side-by-side with s3 and gcs in our storage options, I'm in favor of it behaving consistently with them.

This does not address the fact that a user will expect a/./b and a/b to be equivalent, as well as a/b/../c and a/c.

I get that S3/GCS have a different opinion on the topic, but it's unlikely that the database user will know/understand this. I think it's on CockroachDB to "make data easy" and remove this source of surprise.

At the very least I'm in favor of blocking these patterns. I would be surprised if users actively desire to support them, an I'd be willing to block this feature until the point we have evidence it's needed.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If or when we choose to back this with a real file system, it'd be on us to ensure ensure we internally use file names that underlying filesystem handles, either via escaping of the user-provided name, hashing it, or a completely opaque identifier to which we maintain a mapping

Ok that works for me.

I concur here. The keys for the blobs should have no relation to the underlying storage system.

At the very least I'm in favor of blocking these patterns. I would be surprised if users actively desire to support them, an I'd be willing to block this feature until the point we have evidence it's needed.

Sounds good to me.

Copy link
Copy Markdown
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @knz, @miretskiy, and @mjibson)


pkg/cli/nodelocal.go, line 56 at r4 (raw file):

Previously, knz (kena) wrote…

I'm confused - your uploadFile function is synchronous, what is there to cancel when it has returned?

Done.


pkg/cli/userfile.go, line 54 at r4 (raw file):

Previously, knz (kena) wrote…

ditto - what is this cancel about

Done.


pkg/cli/userfile.go, line 101 at r4 (raw file):

Previously, knz (kena) wrote…

This check looks overly restrictive. As stated, I cannot use a destination path that looks like this: a/b..c/d even though it ought to be valid.

Also, What do you make of the destinations a/./b and a/b? Are they the same? Are they considered different?

The follow-up PR which adopts the new scheme of user input handling gets rid of this check entirely. If we commit to the notion that the filename is what the user provided as the destination arg then a/./b and a/b will be different in the underlying UserFileTableStorage.


pkg/cli/userfile.go, line 116 at r4 (raw file):

Previously, knz (kena) wrote…

Ok that might be a problem. We allow username that contain arbitrary unicode letters (character classes L/l) which means any letter with accents, diacritics etc.
Do we have a guarantee that all the stores server-side where the files will be created accept to name files using arbitrary unicode letters?

If not, you still need to adopt an escaping scheme of some sort.

(Also I recommend you add some tests for this case)

As discussed offline the username is only used as a prefix of the SQL tables. It follows the same behavior as if one created a table using CREATE TABLE with a special Unicode character. As an aside I think SHOW TABLES does not handle unicode characters correctly, as SHOW USERS does. I have filed a separate issue for this.


pkg/sql/copy_file_upload_test.go, line 55 at r4 (raw file):

Previously, knz (kena) wrote…

nit: turn this if/elseif into a switch to make it more readable.

Done.

" not provided")
}

// FileTableStorage is not backed by a file system and so the name of the file
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knz @dt moved the check to the server-side. Also added a couple of tests in userfile_test.go.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo adding a sanity check about special chars

Reviewed 5 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @adityamaru, @dt, @knz, @miretskiy, and @mjibson)


pkg/cli/userfiletable_test.go, line 40 at r5 (raw file):

	c.Run(fmt.Sprintf("userfile upload %s /test/file1.csv", file))
	c.Run(fmt.Sprintf("userfile upload %s /test/../../file1.csv", file))
	c.Run(fmt.Sprintf("userfile upload %s /test/./file1.csv", file))

can you add at least one test with special characters here.


pkg/storage/cloudimpl/file_table_storage.go, line 45 at r5 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

@knz @dt moved the check to the server-side. Also added a couple of tests in userfile_test.go.

thanks!

This change adds a CLI command allowing users to upload local files to
the user scoped file table ExternalStorage.  The command `userfile
upload` uses the existing COPY protocol (similar to `nodelocal upload`)
to upload files and write them to the UserFileTableSystem. The
UserFileTableSystem is backed by two SQL tables which are currently
always created with `defaultdb.public.user` as the prefix of the
qualified name.  In the future we may allow users to specify the
`db.schema` they wish to store their tables in.

The command takes a source and destination path. The former is used to
find the file contents locally, the latter is used to reference the file
and its related metadata/payload in the SQL tables.

Known limitations:
- All destination paths must start with `/`, this is to help us
  disambiguate filepath from `db.schema` name when we allow users to
specify that in the future.

- Destination paths must not have a `..` in them. Since the
  UserFilTableSystem is not a "real" file system, storing SQL rows with
filenames such as /test/../test.csv seems strange. We will work on
enforcing a better naming scheme.

Release note (cli change): Adds a userfile upload command that can be
used to upload a file to the user scoped blob storage: `userfile upload
source/file /destination/of/file`
Copy link
Copy Markdown
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @dt, @knz, @miretskiy, and @mjibson)


pkg/cli/userfiletable_test.go, line 40 at r5 (raw file):

Previously, knz (kena) wrote…

can you add at least one test with special characters here.

done, thanks for pointing this out it had slipped through the cracks. I had to "unescape" the URL to account for arbitrary unicode characters.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @dt, @miretskiy, and @mjibson)

@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 16, 2020

done, thanks for pointing this out it had slipped through the cracks. I had to "unescape" the URL to account for arbitrary unicode characters.

can you also change your other PR with the "list" feature to ensure that special chars properly roundtrip across store-listing.

@adityamaru
Copy link
Copy Markdown
Contributor Author

TFTRs!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 16, 2020

Build succeeded

@craig craig bot merged commit 3ef8a89 into cockroachdb:master Jul 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants