docs: Add README for HBase Tools and Beam import/export and validator pipelines #2949
Conversation
--bigtableTableId=$TABLE_NAME \
--destinationPath=$BUCKET_NAME/hbase_export/ \
--tempLocation=$BUCKET_NAME/hbase_temp]/ \
--maxNumWorkers=$(expr 10 \* $CLUSTER_NUM_NODES) \
Is 10x the right number here? I've typically seen 3x.
10x is much higher. I agree 3x works much better.
Also, what is expr 10? Should we just say numNodesInCluster?
The expr 10 \* just does the arithmetic inline, so you only need to provide the number of nodes for your cluster.
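For illustration, a minimal sketch of the inline arithmetic under discussion, using the 3x multiplier the reviewers suggest (the node count and variable names here are made up):

```
# Hypothetical example: with a 4-node cluster, the inline arithmetic expands
# the worker cap to 3 * 4 = 12.
CLUSTER_NUM_NODES=4
MAX_NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)
echo "$MAX_NUM_WORKERS"   # prints 12
```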
-export $MAXVERSIONS
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
I would recommend against adding the GZIP here. We can't import GZIP files via dataflow yet.
I took this from the solutions article; if you have more info on the command, we can dig deeper into it.
Create the table in your cluster.

## Importing to Bigtable

This folder contains pipelines to help import data via snapshots or sequence files
I don't think we have separate import/export folders per se.
I think it's just the way I worded this. I will ask someone on my team how to rephrase
Copy the schema file to a host which can connect to Google Cloud.

todo: how to do if it can't connect to the internet?
They can either use a VPC endpoint for GCS/S3 or ssh to the private VPC host via proxy.
Do we need to provide instructions on this, or is this something that will be clear? Also, my intention is to have environment variables for the output schema so it is clear what will be imported in the command further down. I might need some help figuring out how to say this.
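For the no-internet case, here is a rough sketch of the proxy approach mentioned above, assuming a bastion host that can reach Google Cloud; the host name, file paths, and bucket are all placeholders, not part of the README:

```
# Hypothetical two-hop copy through a bastion that has outbound access.
# Replace user@bastion-host, the local path, and gs://my-bucket with real values.
scp /path/to/hbase-schema.json user@bastion-host:/tmp/hbase-schema.json
ssh user@bastion-host "gsutil cp /tmp/hbase-schema.json gs://my-bucket/schema/"
```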
todo: how to do if it can't connect to the internet?

$SCHEMA_FILE_PATH=/path/to/hbase-schema
not sure what is happening here.
jhambleton
left a comment
Really like the guide for both the schema tool and import/export tools. I ran through the snapshot import and schema tool and provided comments.
-export $MAXVERSIONS
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
trailing \ isn't conventional
--tempLocation=$BUCKET_NAME/hbase_temp]/ \
--maxNumWorkers=$(expr 10 \* $CLUSTER_NUM_NODES) \
nit: trailing slash, and hbase_temp] has a stray right bracket
--project=$PROJECT_ID \
--bigtableInstanceId=$INSTANCE_ID \
--bigtableTableId=$TABLE_NAME \
--hbaseSnapshotSourceDir=$SNAPSHOT_PATH/data \
Just a note: importsnapshot expected a /data subdir in my test, i.e. with --hbaseSnapshotSourceDir=$SNAPSHOT_PATH/data the tool will look for $SNAPSHOT_PATH/data/data/.hbase-snapshots
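To make that observation concrete, a sketch of the layout it implies (paths are illustrative and simply restate the note above, not verified independently):

```
# Expected layout implied by the note above:
# $SNAPSHOT_PATH/
# └── data/                      <-- value passed to --hbaseSnapshotSourceDir
#     └── data/
#         └── .hbase-snapshots/  <-- where the tool looks for snapshot metadata
```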
java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT
If using Java system properties, these need to be passed to the JVM before the -jar flag, e.g.:
java \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
-jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator

java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dgoogle.bigtable.output.filepath=$SCHEMA_FILE_PATH
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
There's no instruction for what to set $SCHEMA_FILE_PATH to. Also update the Java system props so they aren't interpreted as args, i.e.:
java \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dgoogle.bigtable.output.filepath=$SCHEMA_FILE_PATH
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
-jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator
Fixed, but I think it's a little confusing still
java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.input.filepath=$SCHEMA_FILE_PATH \
Update the Java system props here as well:
java \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.input.filepath=$SCHEMA_FILE_PATH \
-jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator
@jhambleton @vermas2012 I'm going to run through all these instructions to make sure they work as written, probably on Monday, and put in more updates. I need to take a step back from it since there are so many commands and instructions, so I want to come back with fresh eyes and really run through it to make sure everything is clear. I'm glad we're using the README as a starting point; being able to pull in the commands and instructions for any other tutorials and blogs will be really helpful and will speed up that work down the line.
@jhambleton @vermas2012 Just finished rerunning through the snapshot workflow and it worked; the only issue was with the sync job v2, which Shitanshu knows about. I also started writing up the instructions in a long-form doc, which will be good for either a blog post, codelab, or some other kind of content.
jhambleton
left a comment
Great updates. The steps for the snapshot import and schema tool are laid out nicely. See a few comments.
jhambleton
left a comment
Great job shaping these up! LGTM.
TABLE_NAME="my-new-table"
EXPORTDIR=/usr/[USERNAME]/hbase-${TABLE_NAME}-export
hadoop fs -mkdir -p ${EXPORTDIR}
MAXVERSIONS=2147483647
what is this number? Is this the limit?
Yeah, it looks like max int: 2147483647 is Integer.MAX_VALUE, so it effectively exports every cell version.
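A minimal sketch of how that value is typically used with the HBase Export job; the argument order follows the standard Export usage (table, output dir, versions), and the variable values come from the quoted README snippet:

```
# 2147483647 is Integer.MAX_VALUE; passing it as the <versions> argument asks
# Export to copy every cell version rather than only the latest one.
MAXVERSIONS=2147483647
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
    "$TABLE_NAME" "$EXPORTDIR" "$MAXVERSIONS"
```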
1. Run the import.
java -jar bigtable-beam-import-1.14.1-SNAPSHOT-shaded.jar importsnapshot \
1.20.0
1. Export the snapshot
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
-copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers 16
Should we make mappers an env variable? I am not sure 16 mappers will work for snapshots of all sizes.
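A sketch of that suggestion, with the mapper count pulled into a variable so it can be tuned to the snapshot size; NUM_MAPPERS is a made-up name, and 16 is just the value from the current draft:

```
# Parameterize the mapper count instead of hard-coding 16.
NUM_MAPPERS=16
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
```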
1. Create hashes for the table to be used during the data validation step.
hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
@jhambleton What should our guidance be for batchsize and numHashFiles? I would say it should depend on the snapshot size.
batchsize is the number of bytes we scan in computing a single hash. We'd likely set this closer to a KB. With TB-sized tables we should expect the results in the GBs. Instead of providing recommendations here, we should link to the hbase doc: http://hbase.apache.org/book.html#_step_1_hashtable
I think we should provide this information. I will add the link for now, but let's discuss adding a more thorough guide to the params
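A sketch of what that guidance could look like in the README, with batchsize set near 1 KB as suggested above; the output path and exact values are illustrative, and usage follows the linked HBase book section:

```
# batchsize = bytes hashed per entry (~1 KB per the reviewer); tune numhashfiles
# and the output location to the table size. See
# http://hbase.apache.org/book.html#_step_1_hashtable for details.
hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=1024 --numhashfiles=10 \
    $TABLE_NAME $BUCKET_NAME/hashtable-output
```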
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-DRAW_SCAN=true \
We don't yet support import of compressed files. That support is coming, but until then, we should not suggest compression.
@vermas2012 - dataflow supports import of compressed sequencefiles.
So should we leave it in?
@igorbernstein2 can our dataflow templates import compressed sequence files? I thought none of our pipelines worked with compressed data?
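If compressed sequence files do turn out to be unsupported, a sketch of the export with the compression flags dropped, per the recommendation earlier in this thread; the trailing table/output/versions arguments are illustrative and reuse variables from the quoted README snippet:

```
# Same Export invocation minus the GZIP-related -D flags; RAW_SCAN is kept
# from the original snippet.
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
    -DRAW_SCAN=true \
    "$TABLE_NAME" "$EXPORTDIR" "$MAXVERSIONS"
```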
grammar additional param info
@billyjacobson I updated the PR title to be a bit more descriptive, feel free to change as needed :)
vermas2012
left a comment
Approved with suggested changes.
add timestamp to output file
… pipelines (googleapis#2949)
* docs: Add README for HBase Tools and Beam import/export and validator pipelines.
* responding to some review comments
* more cleanups and adding hashes export and copy to bucket
* Reran through commands and fixed/cleaned up
* Cleanup for Jordan
* fix references to hbase-tools to hbase-1.x-tools
* update version grammar additional param info
* remove unnecessary commands add timestamp to output file
…via snapshots) (#3197)
* docs: Add README for HBase Tools and Beam import/export and validator pipelines (#2949)
* docs: Add README for HBase Tools and Beam import/export and validator pipelines.
* responding to some review comments
* more cleanups and adding hashes export and copy to bucket
* Reran through commands and fixed/cleaned up
* Cleanup for Jordan
* fix references to hbase-tools to hbase-1.x-tools
* update version grammar additional param info
* remove unnecessary commands add timestamp to output file
* docs: fix readme title for Bigtable HBase tools (#3013)
* docs: Fix broken links for HBase Migration tools (#3097)
* docs: Fix broken links
* use more refined link
* update header in readme
* revert schema translator class
* Update link generators and typo
READMEs for HBase Tools and Beam import/export and validator pipelines.