docs: Add README for HBase Tools and Beam import/export and validator pipelines #2949
Conversation
--bigtableTableId=$TABLE_NAME \
--destinationPath=$BUCKET_NAME/hbase_export/ \
--tempLocation=$BUCKET_NAME/hbase_temp]/ \
--maxNumWorkers=$(expr 10 \* $CLUSTER_NUM_NODES) \
Is 10x the right number here? I've typically seen 3x.
10x is much higher. I agree 3x works much better.
Also, what is expr 10? Should we just say numNodesInCluster?
The expr 10 \* just does the arithmetic inline, so you only need to provide the number of nodes for your cluster.
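For illustration, a minimal sketch of the inline arithmetic under discussion, using the 3x multiplier the reviewers suggest (the node count and variable names here are made up):

```
# Hypothetical example: with a 4-node cluster, the inline arithmetic expands
# the worker cap to 3 * 4 = 12.
CLUSTER_NUM_NODES=4
MAX_NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)
echo "$MAX_NUM_WORKERS"   # prints 12
```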
-export $MAXVERSIONS
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
I would recommend against adding the GZIP here. We can't import GZIP files via dataflow yet.
I took this from the solutions article; if you have more info on the command, we can dig deeper into it.
Create the table in your cluster.

## Importing to Bigtable

This folder contains pipelines to help import data via snapshots or sequence files
I don't think we have separate import/export folders per se.
I think it's just the way I worded this. I will ask someone on my team how to rephrase
Copy the schema file to a host which can connect to Google Cloud.

todo: how to do if it can't connect to the internet?
They can either use a VPC endpoint for GCS/S3 or ssh to the private VPC host via proxy.
Do we need to provide instructions on this, or is this something that will be clear? Also, my intention is to have environment variables for the output schema so it is clear what will be imported in the command further down. I might need some help figuring out how to say this.
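For the no-internet case, here is a rough sketch of the proxy approach mentioned above, assuming a bastion host that can reach Google Cloud; the host name, file paths, and bucket are all placeholders, not part of the README:

```
# Hypothetical two-hop copy through a bastion that has outbound access.
# Replace user@bastion-host, the local path, and gs://my-bucket with real values.
scp /path/to/hbase-schema.json user@bastion-host:/tmp/hbase-schema.json
ssh user@bastion-host "gsutil cp /tmp/hbase-schema.json gs://my-bucket/schema/"
```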
todo: how to do if it can't connect to the internet?

$SCHEMA_FILE_PATH=/path/to/hbase-schema
not sure what is happening here.
jhambleton
left a comment
Really like the guide for both the schema tool and import/export tools. I ran through the snapshot import and schema tool and provided comments.
-export $MAXVERSIONS
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
trailing \ isn't conventional
--tempLocation=$BUCKET_NAME/hbase_temp]/ \
--maxNumWorkers=$(expr 10 \* $CLUSTER_NUM_NODES) \
nit: trailing slash, and hbase_temp] has a stray right bracket
--project=$PROJECT_ID \
--bigtableInstanceId=$INSTANCE_ID \
--bigtableTableId=$TABLE_NAME \
--hbaseSnapshotSourceDir=$SNAPSHOT_PATH/data \
Just a note: importsnapshot expected a /data subdir in my test, i.e. with --hbaseSnapshotSourceDir=$SNAPSHOT_PATH/data the tool will look for $SNAPSHOT_PATH/data/data/.hbase-snapshots
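To make that observation concrete, a sketch of the layout it implies (paths are illustrative and simply restate the note above, not verified independently):

```
# Expected layout implied by the note above:
# $SNAPSHOT_PATH/
# └── data/                      <-- value passed to --hbaseSnapshotSourceDir
#     └── data/
#         └── .hbase-snapshots/  <-- where the tool looks for snapshot metadata
```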
java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT
If using Java system properties, these need to be passed to the JVM before the -jar flag, e.g.:
java \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
-jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator

java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dgoogle.bigtable.output.filepath=$SCHEMA_FILE_PATH
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
There's no instruction for what to set $SCHEMA_FILE_PATH to. Also update the Java system props so they aren't interpreted as args, i.e.:
java \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dgoogle.bigtable.output.filepath=$SCHEMA_FILE_PATH
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
-jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator
Fixed, but I think it's a little confusing still
java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.input.filepath=$SCHEMA_FILE_PATH \
Update the Java system props here as well:
java \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.input.filepath=$SCHEMA_FILE_PATH \
-jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator
@jhambleton @vermas2012 I'm going to run through all these instructions to make sure they work as written, probably on Monday, and put in more updates. I need to take a step back from it since there are so many commands and instructions, so I want to come back with fresh eyes and really run through it to make sure everything is clear. I'm glad we're using the README as a starting point; being able to pull in the commands and instructions for any other tutorials and blogs will be really helpful and will speed up that work down the line.
@jhambleton @vermas2012 Just finished rerunning through the snapshot workflow and it worked; the only issue was with the sync job v2, which Shitanshu knows about. I also started writing up the instructions in a long-form doc, which will be good for either a blog post, codelab, or some other kind of content.
jhambleton
left a comment
Great updates. The steps for the snapshot import and schema tool are laid out nicely. See a few comments.
jhambleton
left a comment
Great job shaping these up! LGTM.
TABLE_NAME="my-new-table"
EXPORTDIR=/usr/[USERNAME]/hbase-${TABLE_NAME}-export
hadoop fs -mkdir -p ${EXPORTDIR}
MAXVERSIONS=2147483647
what is this number? Is this the limit?
Yeah, it looks like max int: 2147483647 is Integer.MAX_VALUE, so it effectively exports every cell version.
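A minimal sketch of how that value is typically used with the HBase Export job; the argument order follows the standard Export usage (table, output dir, versions), and the variable values come from the quoted README snippet:

```
# 2147483647 is Integer.MAX_VALUE; passing it as the <versions> argument asks
# Export to copy every cell version rather than only the latest one.
MAXVERSIONS=2147483647
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
    "$TABLE_NAME" "$EXPORTDIR" "$MAXVERSIONS"
```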
1. Run the import.
java -jar bigtable-beam-import-1.14.1-SNAPSHOT-shaded.jar importsnapshot \
1.20.0
1. Export the snapshot
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
-copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers 16
Should we make mappers an env variable? I am not sure 16 mappers will work for snapshots of all sizes.
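A sketch of that suggestion, with the mapper count pulled into a variable so it can be tuned to the snapshot size; NUM_MAPPERS is a made-up name, and 16 is just the value from the current draft:

```
# Parameterize the mapper count instead of hard-coding 16.
NUM_MAPPERS=16
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
```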
1. Create hashes for the table to be used during the data validation step.
hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
@jhambleton What should our guidance be for batchsize and numHashFiles? I would say it should depend on the snapshot size.
batchsize is the number of bytes we scan in computing a single hash. We'd likely set this closer to a KB. With TB-sized tables we should expect the results in the GBs. Instead of providing recommendations here, we should link to the hbase doc: http://hbase.apache.org/book.html#_step_1_hashtable
I think we should provide this information. I will add the link for now, but let's discuss adding a more thorough guide to the params
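A sketch of what that guidance could look like in the README, with batchsize set near 1 KB as suggested above; the output path and exact values are illustrative, and usage follows the linked HBase book section:

```
# batchsize = bytes hashed per entry (~1 KB per the reviewer); tune numhashfiles
# and the output location to the table size. See
# http://hbase.apache.org/book.html#_step_1_hashtable for details.
hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=1024 --numhashfiles=10 \
    $TABLE_NAME $BUCKET_NAME/hashtable-output
```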
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-DRAW_SCAN=true \
We don't yet support import of compressed files. That support is coming, but until then, we should not suggest compression.
@vermas2012 - dataflow supports import of compressed sequencefiles.
So should we leave it in?
@igorbernstein2 can our dataflow templates import compressed sequence files? I thought none of our pipelines worked with compressed data?
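If compressed sequence files do turn out to be unsupported, a sketch of the export with the compression flags dropped, per the recommendation earlier in this thread; the trailing table/output/versions arguments are illustrative and reuse variables from the quoted README snippet:

```
# Same Export invocation minus the GZIP-related -D flags; RAW_SCAN is kept
# from the original snippet.
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
    -DRAW_SCAN=true \
    "$TABLE_NAME" "$EXPORTDIR" "$MAXVERSIONS"
```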
grammar additional param info
@billyjacobson I updated the PR title to be a bit more descriptive, feel free to change as needed :)
vermas2012
left a comment
Approved with suggested changes.
add timestamp to output file
… pipelines (googleapis#2949)
* docs: Add README for HBase Tools and Beam import/export and validator pipelines.
* responding to some review comments
* more cleanups and adding hashes export and copy to bucket
* Reran through commands and fixed/cleaned up
* Cleanup for Jordan
* fix references to hbase-tools to hbase-1.x-tools
* update version grammar additional param info
* remove unnecessary commands add timestamp to output file
…via snapshots) (#3197)
* docs: Add README for HBase Tools and Beam import/export and validator pipelines (#2949)
* docs: Add README for HBase Tools and Beam import/export and validator pipelines.
* responding to some review comments
* more cleanups and adding hashes export and copy to bucket
* Reran through commands and fixed/cleaned up
* Cleanup for Jordan
* fix references to hbase-tools to hbase-1.x-tools
* update version grammar additional param info
* remove unnecessary commands add timestamp to output file
* docs: fix readme title for Bigtable HBase tools (#3013)
* docs: Fix broken links for HBase Migration tools (#3097)
* docs: Fix broken links
* use more refined link
* update header in readme
* revert schema translator class
* Update link generators and typo
READMEs for HBase Tools and Beam import/export and validator pipelines.