importccl: add userfile import tests and import benchmark #51077
craig[bot] merged 1 commit into cockroachdb:master
Conversation
I will rebase once #50981 is merged in. I just wanted to sanity-check that import works as expected. I am not sure to what degree we want to test the different kinds of IMPORT from UserFileTableStorage, considering the individual pieces of ExternalStorage have their own tests and IMPORT itself is tested extensively against nodelocal.
miretskiy
left a comment
Reviewed 6 of 12 files at r1, 1 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)
pkg/ccl/importccl/import_stmt_test.go, line 2861 at r2 (raw file):
This benchmark, as well as the benchmark you have copied it from, is completely broken and doesn't really report anything:
- This benchmark runs only once, trying to co-opt the b.N parameter into some function of bytes.
- These benchmarks use files generated by makeCSVData(b, numFiles, b.N*100, nodes, 16). It seems that these files are large, ... but they are not. You need to run the benchmark with the COCKROACH_REWRITE_CSV_TESTDATA=true environment variable to tell this method to regenerate the data; otherwise, you're importing a few KBs. Even with this env variable, the file is still tiny (only 87K). Particularly important for your benchmark, userfile storage chunks files into 4MB chunks (if I recall correctly), so I'd say your import file should reflect that. I would just generate a brand new file and use that instead; generate it in a temp directory, and make it large enough (10-50MB, something like that).
- Finally, you need to call b.SetBytes() to set the number of bytes processed per operation. Since you run the import only once, you should set it to the size of the input file. This will report throughput.
adityamaru
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)
pkg/ccl/importccl/import_stmt_test.go, line 2861 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Thanks for pointing this out. I have fixed the benchmarks to generate a single ~50MiB file and then import it from either a nodelocal or a userfile storage source. The throughputs for both imports seem comparable, assuming the benchmarks are correctly set up.
miretskiy
left a comment
Let me know when the other PR merges -- I'd like to take a look after.
Reviewed 1 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)
pkg/ccl/importccl/import_stmt_test.go, line 2853 at r3 (raw file):
Quoted 10 lines of code…

    // Set up a directory for the data files.
    err := os.Mkdir(filepath.Join(baseDir, "test"), 0777)
    require.NoError(b, err)
    // Write the test data into a file.
    r, err := csvGen.Open()
    defer r.Close()
    content, err := ioutil.ReadAll(r)
    require.NoError(b, err)
    err = ioutil.WriteFile(filepath.Join(baseDir, "test", "data"), content, 0666)
    require.NoError(b, err)
I would do it slightly differently:
f, err := ioutil.TempFile(baseDir, "test_file")
require.NoError(b, err)
numBytes, err := io.Copy(f, r)
require.NoError(b, err)
b.SetBytes(numBytes)
pkg/ccl/importccl/import_stmt_test.go, line 2859 at r3 (raw file):
    // tweaked.
    require.GreaterOrEqual(b, len(content), 50*1024*1024)
    require.LessOrEqual(b, len(content), 55*1024*1024)

I'd say that's probably a bit too much -- it really doesn't matter what the exact size is; you're still reporting correct throughput. I'd drop those two checks and the comment.
pkg/ccl/importccl/import_stmt_test.go, line 2869 at r3 (raw file):
    CSV DATA ('%s')`, "nodelocal://0/test/data",
    ))
I think it makes sense to refactor this entire benchmark into a helper -- the exact code is repeated in the userfile upload benchmark below:

    func benchUserUpload(b *testing.B, uploadBaseURI string) {
        ...
        sqlDB.Exec(b, fmt.Sprintf(
            `IMPORT TABLE t (a INT8 PRIMARY KEY, b STRING, INDEX (b), INDEX (a, b)) CSV DATA ('%s/test_file')`,
            uploadBaseURI))
    }

Invoke this as:

    benchUserUpload(b, "nodelocal://0")

and

    benchUserUpload(b, "userfile://defaultdb.public.root")
c4de779 to
3322ad1
Compare
adityamaru
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)
pkg/ccl/importccl/import_stmt_test.go, line 2853 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Done.
pkg/ccl/importccl/import_stmt_test.go, line 2859 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I'd say that it's probably a bit too much -- it really doesn't matter how exactly what the size is.
you're still reporting correct throughput. I'd drop those 2 checks and a comment.
Done. Dropped the file size to 25MiB.
pkg/ccl/importccl/import_stmt_test.go, line 2869 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I think it makes sense to refactor this entire benchmark into helper -- the exact code is repeated in the userfile upload benchmark below:
func benchUserUpload(b *testing.B, uploadBaseURI string) {
....
sqlDB.Exec(b,
fmt.Sprintf(
IMPORT TABLE t (a INT8 PRIMARY KEY, b STRING, INDEX (b), INDEX (a, b)) CSV DATA ('%s/test_file'),
uploadBaseURI
))
}Invoke this as:
benchUserUpload(b, "nodelocal://0")
and
benchUserUpload(b, "userfile://defaultdb.public.root")
Done. nodelocal requires writing to a local file, while userfile requires writing through ExternalStorage, but apart from that all the logic is shared.
@miretskiy RFAL!
miretskiy
left a comment
Reviewed 1 of 3 files at r2, 1 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @miretskiy)
This change adds simple unit tests to ensure we can import from the new userfile ExternalStorage. It also adds a benchmark to compare the performance of importing from nodelocal against importing from userfile storage.

Release note: None
bors r+
Build failed (retrying...)
Canceled (will resume)
Build succeeded
Informs: #51222