
importccl: add userfile import tests and import benchmark#51077

Merged
craig[bot] merged 1 commit into cockroachdb:master from adityamaru:test-import-userfile
Jul 16, 2020
Conversation

@adityamaru
Contributor

@adityamaru adityamaru commented Jul 7, 2020

This change adds simple unit tests to ensure we can import from the new
userfile ExternalStorage. It also adds a benchmark to compare the
performance of importing from nodelocal against importing from userfile
storage.

Informs: #51222

@adityamaru adityamaru requested review from a team, dt and miretskiy July 7, 2020 18:38
@adityamaru adityamaru requested a review from a team as a code owner July 7, 2020 18:38
@cockroach-teamcity
Member

This change is Reviewable

@adityamaru
Contributor Author

adityamaru commented Jul 7, 2020

I will rebase once #50981 is merged in. I just wanted to sanity check that import works as expected. I am not sure to what degree we want to test the different kinds of IMPORT from UserFileTableStorage, considering the individual pieces of the ExternalStorage have their own tests and IMPORT itself is tested extensively against nodelocal.

@adityamaru adityamaru removed the request for review from a team July 7, 2020 18:41
@miretskiy miretskiy self-requested a review July 7, 2020 20:37
Contributor

@miretskiy miretskiy left a comment


Reviewed 6 of 12 files at r1, 1 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)


pkg/ccl/importccl/import_stmt_test.go, line 2861 at r2 (raw file):


This benchmark, as well as the benchmark you copied it from, is completely
broken and doesn't really report anything.

  1. The benchmark runs only once, trying to co-opt the b.N parameter into some
    function of bytes.
  2. These benchmarks use makeCSVData(b, numFiles, b.N*100, nodes, 16) files.
    It seems that these files are large, but they are not. You need to run the
    benchmark with the COCKROACH_REWRITE_CSV_TESTDATA=true environment variable
    to tell this method to regenerate data; otherwise, you're importing a few
    KBs. Even with this env variable, the file is still tiny (only 87K).
    Particularly important for your benchmark, userfile storage chunks files
    into 4MB chunks (if I recall correctly), so your import file should reflect
    that. I would just generate a brand new file in a temp directory and use
    that instead. Make it large enough: 10-50MB, something like that.
  3. Finally, you need to call b.SetBytes() to set the number of bytes
    processed per operation. Since you run the import only once, you should set
    it to the size of the input file. This will report throughput.
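The shape the reviewer asks for can be sketched as a minimal, self-contained Go benchmark: generate the input once outside the timed region, record its size with b.SetBytes, and time only the import-like work so go test reports MB/s. The function names, the temp file, and the read loop standing in for IMPORT are all illustrative, not the code in import_stmt_test.go:

```go
package main

import (
	"bytes"
	"io"
	"os"
	"path/filepath"
	"testing"
)

// benchOneShotImport writes the payload to disk once, records its size via
// b.SetBytes, then times only the "import" (here a plain read of the file,
// standing in for the real IMPORT statement).
func benchOneShotImport(b *testing.B, payload []byte) {
	dir := b.TempDir()
	path := filepath.Join(dir, "test_file") // hypothetical file name
	if err := os.WriteFile(path, payload, 0666); err != nil {
		b.Fatal(err)
	}
	b.SetBytes(int64(len(payload))) // bytes processed per iteration
	b.ResetTimer()                  // exclude setup from the timing
	for i := 0; i < b.N; i++ {
		f, err := os.Open(path)
		if err != nil {
			b.Fatal(err)
		}
		// Stand-in for running IMPORT against the file.
		if _, err := io.Copy(io.Discard, f); err != nil {
			b.Fatal(err)
		}
		f.Close()
	}
}

func BenchmarkOneShotImport(b *testing.B) {
	// ~384 KiB of CSV-shaped data; the real benchmark would use a much
	// larger file so userfile's chunking comes into play.
	benchOneShotImport(b, bytes.Repeat([]byte("a,b,c\n"), 1<<16))
}
```

With b.SetBytes set, go test -bench prints a MB/s column, which is the throughput comparison the benchmark is after.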

@adityamaru adityamaru force-pushed the test-import-userfile branch 2 times, most recently from 43466c5 to c4de779 Compare July 8, 2020 04:17
Contributor Author

@adityamaru adityamaru left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)


pkg/ccl/importccl/import_stmt_test.go, line 2861 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Thanks for pointing this out. I have fixed the benchmarks to generate a single ~50MiB file and then import it from either a nodelocal or a userfile storage source. The throughputs for both imports seem comparable, assuming the benchmarks are correctly set up.

Contributor

@miretskiy miretskiy left a comment


Let me know when the other PR merges -- I'd like to take a look after.

Reviewed 1 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)


pkg/ccl/importccl/import_stmt_test.go, line 2853 at r3 (raw file):

Quoted 10 lines of code…
	// Set up a directory for the data files.
	err := os.Mkdir(filepath.Join(baseDir, "test"), 0777)
	require.NoError(b, err)
	// Write the test data into a file.
	r, err := csvGen.Open()
	defer r.Close()
	content, err := ioutil.ReadAll(r)
	require.NoError(b, err)
	err = ioutil.WriteFile(filepath.Join(baseDir, "test", "data"), content, 0666)
	require.NoError(b, err)

I would do it slightly differently:

f, err := ioutil.TempFile(baseDir, "test_file")
require.NoError(b, err)
numBytes, err := io.Copy(f, r)
require.NoError(b, err)
b.SetBytes(numBytes)


pkg/ccl/importccl/import_stmt_test.go, line 2859 at r3 (raw file):

	// tweaked.
	require.GreaterOrEqual(b, len(content), 50*1024*1024)
	require.LessOrEqual(b, len(content), 55*1024*1024)

I'd say that's probably a bit too much -- it really doesn't matter exactly what the size is;
you're still reporting correct throughput. I'd drop those 2 checks and the comment.


pkg/ccl/importccl/import_stmt_test.go, line 2869 at r3 (raw file):

			CSV DATA ('%s')`,
			"nodelocal://0/test/data",
		))

I think it makes sense to refactor this entire benchmark into a helper -- the exact code is repeated in the userfile upload benchmark below:

func benchUserUpload(b *testing.B, uploadBaseURI string) {
	...
	sqlDB.Exec(b, fmt.Sprintf(
		`IMPORT TABLE t (a INT8 PRIMARY KEY, b STRING, INDEX (b), INDEX (a, b)) CSV DATA ('%s/test_file')`,
		uploadBaseURI,
	))
}

Invoke this as:

benchUserUpload(b, "nodelocal://0")

and

benchUserUpload(b, "userfile://defaultdb.public.root")

@miretskiy miretskiy self-requested a review July 8, 2020 13:03
@adityamaru adityamaru force-pushed the test-import-userfile branch from c4de779 to 3322ad1 Compare July 8, 2020 13:58
Contributor Author

@adityamaru adityamaru left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)


pkg/ccl/importccl/import_stmt_test.go, line 2853 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Done.


pkg/ccl/importccl/import_stmt_test.go, line 2859 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…


Done. Dropped the file size to 25MiB.


pkg/ccl/importccl/import_stmt_test.go, line 2869 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…


Done. nodelocal requires writing to a file, while userfile requires writing to the ExternalStorage, but apart from that all the logic is shared.
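The shared-helper shape described above can be sketched by passing the one differing step (the upload) as a callback. All names here are illustrative, and the timed body is a stand-in for executing IMPORT against a test cluster, not the real helper in pkg/ccl/importccl/import_stmt_test.go:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"testing"
)

// benchUserUpload owns the shared logic: generate the data, record its size
// with b.SetBytes, and time the IMPORT. The caller supplies only the upload
// step, which returns the URI the IMPORT statement should read from.
func benchUserUpload(b *testing.B, upload func(b *testing.B, data []byte) string) {
	data := []byte("1,foo\n2,bar\n") // stand-in for the generated CSV
	uri := upload(b, data)
	stmt := fmt.Sprintf(
		`IMPORT TABLE t (a INT8 PRIMARY KEY, b STRING, INDEX (b), INDEX (a, b)) CSV DATA ('%s')`,
		uri)
	b.SetBytes(int64(len(data)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = stmt // stand-in for sqlDB.Exec(b, stmt) against a test cluster
	}
}

// BenchmarkNodelocalUpload wires the helper to a nodelocal-style upload:
// write the file to local disk and import it by URI. A userfile variant
// would instead write through the ExternalStorage and return a userfile URI.
func BenchmarkNodelocalUpload(b *testing.B) {
	benchUserUpload(b, func(b *testing.B, data []byte) string {
		dir := b.TempDir()
		if err := os.WriteFile(filepath.Join(dir, "test_file"), data, 0666); err != nil {
			b.Fatal(err)
		}
		return "nodelocal://0/test_file"
	})
}
```

The callback keeps the timed section identical across both variants, so the reported throughputs differ only by the storage backend.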

@adityamaru adityamaru force-pushed the test-import-userfile branch 2 times, most recently from 0e3715e to ae323a7 Compare July 16, 2020 17:07
@adityamaru
Contributor Author

@miretskiy RFAL!

Contributor

@miretskiy miretskiy left a comment


Reviewed 1 of 3 files at r2, 1 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)

@miretskiy miretskiy self-requested a review July 16, 2020 17:52
This change adds simple unit tests to ensure we can import from the new
userfile ExternalStorage. It also adds a benchmark to compare the
performance of importing from nodelocal against importing from userfile
storage.

Release note: None
@adityamaru adityamaru force-pushed the test-import-userfile branch from ae323a7 to c6b9600 Compare July 16, 2020 18:04
@adityamaru
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jul 16, 2020

Build failed (retrying...)

@craig
Contributor

craig bot commented Jul 16, 2020

Canceled (will resume)

@craig
Contributor

craig bot commented Jul 16, 2020

Build succeeded

@craig craig bot merged commit dc736e8 into cockroachdb:master Jul 16, 2020