
Conversation

@billyjacobson
Contributor

READMEs for HBase Tools and Beam import/export and validator pipelines.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> ☕️

@billyjacobson billyjacobson requested review from a team as code owners April 27, 2021 21:18
@product-auto-label product-auto-label bot added the api: bigtable Issues related to the googleapis/java-bigtable-hbase API. label Apr 27, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Apr 27, 2021
@vermas2012 vermas2012 self-requested a review April 27, 2021 21:22
--bigtableTableId=$TABLE_NAME \
--destinationPath=$BUCKET_NAME/hbase_export/ \
--tempLocation=$BUCKET_NAME/hbase_temp]/ \
--maxNumWorkers=$(expr 10 \* $CLUSTER_NUM_NODES) \
Contributor Author

Is 10x the right number here? I've typically seen 3x.

Member

10x is much higher than needed. I agree 3x works much better.

Member

Also, what is `expr 10`? Should we just say `numNodesInCluster`?

Contributor Author

The `expr 10 \*` just does the arithmetic inline, so you only need to provide the number of nodes for your cluster.
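
For illustration, here's a minimal sketch of what that inline arithmetic evaluates to (the node count is an assumed example value):

```
# Minimal sketch, assuming a 4-node cluster; values are illustrative only.
CLUSTER_NUM_NODES=4
MAX_NUM_WORKERS=$(expr 10 \* $CLUSTER_NUM_NODES)  # expr multiplies inline
echo $MAX_NUM_WORKERS                              # prints 40
```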

-export $MAXVERSIONS
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
Member

I would recommend against adding the GZIP here. We can't import GZIP files via Dataflow yet.
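
A sketch of what the Export invocation might look like with the compression flags dropped, per this suggestion; the positional arguments assume the standard HBase Export usage (table, output dir, versions) and reuse variables defined elsewhere in the README:

```
# Sketch only: Export without GZIP compression, since compressed files can't be imported yet.
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
  -DRAW_SCAN=true \
  $TABLE_NAME $EXPORTDIR $MAXVERSIONS
```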

Contributor Author

I took this from the solutions article; if you have more info on the command, we can dig deeper into it.

Create the table in your cluster.
## Importing to Bigtable

This folder contains pipelines to help import data via snapshots or sequence files
Member

I don't think we have separate import/export folders per se.

Contributor Author

I think it's just the way I worded this. I will ask someone on my team how to rephrase it.


Copy the schema file to a host which can connect to Google Cloud.

todo: how to do if it can't connect to the internet?
Member

They can either use a VPC endpoint for GCS/S3 or SSH to the private VPC host via a proxy.

Contributor Author

Do we need to provide instructions on this, or is this something that will be clear? Also, my intention is to have environment variables for the outputted schema so it's clear what will be imported in the command further down. I might need some help figuring out how to phrase this.

todo: how to do if it can't connect to the internet?

```
$SCHEMA_FILE_PATH=/path/to/hbase-schema
Member

Not sure what is happening here.
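
Presumably the snippet is meant to be a plain shell variable assignment; in bash the leading $ belongs only on expansion, not on the left-hand side of an assignment. A minimal sketch of the likely intent:

```
# Likely intent (assumption): assign the path without the leading "$", then reference it with "$".
SCHEMA_FILE_PATH=/path/to/hbase-schema
echo $SCHEMA_FILE_PATH
```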

Contributor

@jhambleton jhambleton left a comment

Really like the guide for both the schema tool and the import/export tools. I ran through the snapshot import and schema tool and provided comments.

-export $MAXVERSIONS
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
Contributor

Trailing \ isn't conventional.

Comment on lines 99 to 100
--tempLocation=$BUCKET_NAME/hbase_temp]/ \
--maxNumWorkers=$(expr 10 \* $CLUSTER_NUM_NODES) \
Contributor

Nit: trailing slash, and hbase_temp] has a stray right bracket.

--project=$PROJECT_ID \
--bigtableInstanceId=$INSTANCE_ID \
--bigtableTableId=$TABLE_NAME \
--hbaseSnapshotSourceDir=$SNAPSHOT_PATH/data \
Contributor

Just a note: importsnapshot expected a subdir for /data in my test, i.e. with --hbaseSnapshotSourceDir=$SNAPSHOT_PATH/data the tool will look for $SNAPSHOT_PATH/data/data/.hbase-snapshots.
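
To illustrate with assumed paths, the extra /data level means the snapshot metadata is resolved one directory deeper than the flag value suggests:

```
# Illustration only, paths taken from the note above:
#   flag:           --hbaseSnapshotSourceDir=$SNAPSHOT_PATH/data
#   tool looks in:  $SNAPSHOT_PATH/data/data/.hbase-snapshots
ls "$SNAPSHOT_PATH/data/data"   # snapshot metadata would be found under this directory
```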

Comment on lines 49 to 55
java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT
Contributor

If using Java system properties, these need to be passed to the JVM before the -jar flag, such as:

java \
 -Dgoogle.bigtable.project.id=$PROJECT_ID \
 -Dgoogle.bigtable.instance.id=$INSTANCE_ID \
 -Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
 -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
 -Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
 -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
  com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator 

Comment on lines 76 to 81
java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
-Dgoogle.bigtable.output.filepath=$SCHEMA_FILE_PATH
-Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
-Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
Contributor

There's no instruction for what to set $SCHEMA_FILE_PATH to. Also, update the Java system props so they aren't interpreted as program args, i.e.:

java \
 -Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
 -Dgoogle.bigtable.output.filepath=$SCHEMA_FILE_PATH \
 -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
 -Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
 -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
  com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator 

Contributor Author

Fixed, but I think it's still a little confusing.

Comment on lines 99 to 103
java -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator \
-Dgoogle.bigtable.project.id=$PROJECT_ID \
-Dgoogle.bigtable.instance.id=$INSTANCE_ID \
-Dgoogle.bigtable.input.filepath=$SCHEMA_FILE_PATH \
Contributor

Update the Java sys props:

 java \
 -Dgoogle.bigtable.project.id=$PROJECT_ID \
 -Dgoogle.bigtable.instance.id=$INSTANCE_ID \
 -Dgoogle.bigtable.input.filepath=$SCHEMA_FILE_PATH \
 -jar bigtable-hbase-tools-1.14.1-SNAPSHOT-jar-with-dependencies.jar \
  com.google.cloud.bigtable.hbase.tools.HBaseSchemaTranslator 

@billyjacobson
Contributor Author

@jhambleton @vermas2012 I'm going to run through all these instructions to make sure they work as written, probably on Monday, and put in more updates. I need to take a step back since there are so many commands and instructions, so I want to come back with fresh eyes and really run through it to make sure everything is clear. I'm glad we're using the README as a starting point; being able to pull in the commands and instructions for other tutorials and blogs will be really helpful and will speed up that work down the line.

@billyjacobson
Contributor Author

@jhambleton @vermas2012 Just finished rerunning through the snapshot workflow and it worked; the only issue was with the sync job v2, which Shitanshu knows about. I also started writing up the instructions in a long-form doc, which will be good for either a blog post, codelab, or some other kind of content.

Contributor

@jhambleton jhambleton left a comment

Great updates. The steps for the snapshot import and schema tool are laid out nicely. See a few comments.

Contributor

@jhambleton jhambleton left a comment

Great job shaping these up! LGTM.

TABLE_NAME="my-new-table"
EXPORTDIR=/usr/[USERNAME]/hbase-${TABLE_NAME}-export
hadoop fs -mkdir -p ${EXPORTDIR}
MAXVERSIONS=2147483647
Contributor

What is this number? Is this the limit?

Contributor Author

Yeah, looks like max int (2^31 - 1)?
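
A quick shell check confirms the value is the signed 32-bit integer maximum:

```
# 2^31 - 1 is the signed 32-bit max, matching the MAXVERSIONS value above.
echo $(( 2**31 - 1 ))   # prints 2147483647
```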

1. Run the import.
```
java -jar bigtable-beam-import-1.14.1-SNAPSHOT-shaded.jar importsnapshot \
Contributor

1.20.0

1. Export the snapshot
```
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
-copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers 16
Member

Should we make mappers an env variable? I am not sure 16 mappers will work for snapshots of all sizes.
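
A minimal sketch of that suggestion, with a hypothetical NUM_MAPPERS variable replacing the hard-coded 16:

```
# Sketch only; NUM_MAPPERS is a hypothetical variable, and the right value
# likely depends on the snapshot size.
NUM_MAPPERS=16
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
  -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
```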

@jhambleton

```
1. Create hashes for the table to be used during the data validation step.
```
hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
Member

@jhambleton What should our guidance be for the batchsize and numHashFiles? I would say it should depend on the snapshot size.

Contributor

batchsize is the number of bytes we scan when computing a single hash. We'd likely set this closer to a KB. With TB-sized tables we should expect the results in the GBs. Instead of providing recommendations here, we should link to the HBase doc: http://hbase.apache.org/book.html#_step_1_hashtable
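
For example, a sketch with batchsize set around a kilobyte (values assumed, not an official recommendation; $HASH_OUTPUT_PATH is a hypothetical placeholder for the HashTable output directory):

```
# Sketch only: hash in ~1 KB batches; tune batchsize/numhashfiles per the HBase book link above.
hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=1024 --numhashfiles=10 \
  $TABLE_NAME $HASH_OUTPUT_PATH
```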

Contributor Author

I think we should provide this information. I will add the link for now, but let's discuss adding a more thorough guide to the params.

bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-DRAW_SCAN=true \
Member

We don't yet support import of compressed files. That support is coming, but until then, we should not suggest compression.

Contributor

@vermas2012 - Dataflow supports import of compressed SequenceFiles.

Contributor Author

So should we leave it in?

Member

@igorbernstein2 Can our Dataflow templates import compressed sequence files? I thought none of our pipelines worked with compressed data?

@kolea2 kolea2 changed the title from "docs: Add README" to "docs: Add README for HBase Tools and Beam import/export and validator pipelines" Jun 2, 2021
@kolea2
Contributor

kolea2 commented Jun 2, 2021

@billyjacobson I updated the PR title to be a bit more descriptive, feel free to change as needed :)

Member

@vermas2012 vermas2012 left a comment

Approved with suggested changes.

@billyjacobson billyjacobson added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 7, 2021
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 7, 2021
@billyjacobson billyjacobson merged commit e05b548 into master Jun 7, 2021
@billyjacobson billyjacobson deleted the update-hbase-import-readme branch June 7, 2021 17:30
billyjacobson added a commit to billyjacobson/java-bigtable-hbase that referenced this pull request Aug 30, 2021
… pipelines (googleapis#2949)

* docs:  Add README for HBase Tools and Beam import/export and validator pipelines.

* responding to some review comments

* more cleanups and adding hashes export and copy to bucket

* Reran through commands and fixed/cleaned up

* Cleanup for Jordan

* fix references to hbase-tools to hbase-1.x-tools

* update version
grammar
additional param info

* remove unnecessary commands
add timestamp to output file
billyjacobson added a commit to billyjacobson/java-bigtable-hbase that referenced this pull request Aug 31, 2021
… pipelines (googleapis#2949)

* docs:  Add README for HBase Tools and Beam import/export and validator pipelines.

* responding to some review comments

* more cleanups and adding hashes export and copy to bucket

* Reran through commands and fixed/cleaned up

* Cleanup for Jordan

* fix references to hbase-tools to hbase-1.x-tools

* update version
grammar
additional param info

* remove unnecessary commands
add timestamp to output file
billyjacobson added a commit that referenced this pull request Sep 7, 2021
…via snapshots) (#3197)

* docs:  Add README for HBase Tools and Beam import/export and validator pipelines (#2949)

* docs:  Add README for HBase Tools and Beam import/export and validator pipelines.

* responding to some review comments

* more cleanups and adding hashes export and copy to bucket

* Reran through commands and fixed/cleaned up

* Cleanup for Jordan

* fix references to hbase-tools to hbase-1.x-tools

* update version
grammar
additional param info

* remove unnecessary commands
add timestamp to output file

* docs: fix readme title for Bigtable HBase tools (#3013)

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/java-bigtable-hbase/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> ☕️

* docs: Fix broken links for HBase Migration tools (#3097)

* docs: Fix broken links

* use more refined link

* update header in readme

* revert schema translator class

* Update link generators and typo
mutianf pushed a commit to mutianf/java-bigtable-hbase that referenced this pull request Sep 20, 2022
…via snapshots) (googleapis#3197)

* docs:  Add README for HBase Tools and Beam import/export and validator pipelines (googleapis#2949)

* docs:  Add README for HBase Tools and Beam import/export and validator pipelines.

* responding to some review comments

* more cleanups and adding hashes export and copy to bucket

* Reran through commands and fixed/cleaned up

* Cleanup for Jordan

* fix references to hbase-tools to hbase-1.x-tools

* update version
grammar
additional param info

* remove unnecessary commands
add timestamp to output file

* docs: fix readme title for Bigtable HBase tools (googleapis#3013)

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- [ ] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/java-bigtable-hbase/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [ ] Ensure the tests and linter pass
- [ ] Code coverage does not decrease (if any source code was changed)
- [ ] Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> ☕️

* docs: Fix broken links for HBase Migration tools (googleapis#3097)

* docs: Fix broken links

* use more refined link

* update header in readme

* revert schema translator class

* Update link generators and typo