Catalog Implementation for hive and hadoop. #187
Parth-Brahmbhatt wants to merge 1 commit into apache:master from
Conversation
I have deliberately left the Tables interface and the implementation for BaseMetaStore tables, as we have not decided what we are going to do about the transaction APIs. We can either take this PR as an opportunity to address that, or move those APIs under BaseMetastoreCatalog and remove the Tables interface and all of its implementations.
    @Override
    public String toString() {
      // assumes a level it self won't have a period in it, otherwise it will be ambiguous
nit: typo - it self -> itself
    default boolean tableExists(TableIdentifier tableIdentifier) {
      try {
        getTable(tableIdentifier);
        return false;
Looks like it should return true here.
     */
    - boolean tableExists(TableIdentifier tableIdentifier);
    + default boolean tableExists(TableIdentifier tableIdentifier) {
    +   try {
The implementation here simplifies code but also introduces exception-handling to ordinary control flow. An alternative is to leave the method blank and give the implementors discretion about how to do it.
I think it is reasonable to do this in a default implementation. You can't expect the default implementation to do everything the best way.
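A minimal sketch of the corrected default method. Catalog and NoSuchTableException below are simplified stand-ins for the Iceberg types, not the PR's actual interfaces; the point is only that the try branch should return true:

```java
// Simplified stand-in for the Iceberg exception type.
class NoSuchTableException extends RuntimeException {
}

// Simplified stand-in for the Catalog interface in this PR.
interface Catalog {
  Object getTable(String identifier); // throws NoSuchTableException if absent

  default boolean tableExists(String identifier) {
    try {
      getTable(identifier);
      return true; // lookup succeeded, so the table exists (the PR returned false here)
    } catch (NoSuchTableException e) {
      return false;
    }
  }
}

class TableExistsDemo {
  public static void main(String[] args) {
    // Fake catalog that only knows one table.
    Catalog catalog = identifier -> {
      if (!"hivedb.tbl".equals(identifier)) {
        throw new NoSuchTableException();
      }
      return new Object();
    };
    System.out.println(catalog.tableExists("hivedb.tbl"));     // true
    System.out.println(catalog.tableExists("hivedb.missing")); // false
  }
}
```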
    @@ -28,7 +29,7 @@
    /**
     * Top level Catalog APIs that supports table DDLs and namespace listing.
Can you give a brief explanation on the rationale for leaving out the namespace/table listing methods?
    @Override
    public void dropTable(TableIdentifier tableIdentifier) {
      validateTableIdentifier(tableIdentifier);
      HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
Looks like clients.get() is a more legitimate method to use here? Alternatively, clients.run() can be used here.
That's correct. newClient is protected so that subclasses can provide an implementation, not so that other classes in this package can call it. The correct way to do this is to use run. That handles reconnections and uses the client pool. get is also private.
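The borrow-and-return pattern described here can be sketched with a generic pool whose run method checks a client out, applies the action, and always puts the client back. ClientPoolSketch is a hypothetical stand-in written for illustration, not Iceberg's actual pool API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for a metastore client pool (not Iceberg's actual
// API): callers never hold a raw client; run() borrows one, applies the
// action, and always returns the client to the pool.
class ClientPoolSketch<C> {
  private final Deque<C> pool = new ArrayDeque<>();
  private final Supplier<C> clientFactory;

  ClientPoolSketch(Supplier<C> clientFactory) {
    this.clientFactory = clientFactory;
  }

  synchronized <R> R run(Function<C, R> action) {
    C client = pool.isEmpty() ? clientFactory.get() : pool.pop();
    try {
      return action.apply(client);
    } finally {
      pool.push(client); // returned to the pool even when the action throws
    }
  }
}
```

Under this shape, dropTable would go through something like clients.run(client -> client.dropTable(db, name)) rather than calling the protected newClient directly, which leaks one connection per call.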
    - * @param location a path URI (e.g. hdfs:///warehouse/my_table)
    - * @return newly created table implementation
    + * @param tableIdentifier an identifier to identify this table in a namespace.
    + * @param schema the schema for this table, can not be null.
Would be good to add a Preconditions.checkNotNull() check to assert this.
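The suggested fail-fast check might look like the sketch below. It uses java.util.Objects.requireNonNull as a stdlib stand-in for Guava's Preconditions.checkNotNull, and the createTable signature is illustrative, not the PR's exact one:

```java
import java.util.Objects;

// Illustrative createTable that rejects a null schema up front, as the
// reviewer suggests, instead of failing later inside a metastore call.
class CreateTableSketch {
  static String createTable(String tableIdentifier, Object schema) {
    Objects.requireNonNull(schema, "schema can not be null");
    return "created " + tableIdentifier;
  }

  public static void main(String[] args) {
    System.out.println(createTable("hivedb.tbl", new Object())); // created hivedb.tbl
    try {
      createTable("hivedb.tbl", null);
    } catch (NullPointerException e) {
      System.out.println(e.getMessage()); // schema can not be null
    }
  }
}
```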
    @@ -1,18 +1,3 @@
    /*
     * Copyright 2017 Netflix, Inc.
Looks like you missed adding the apache header.
    public void testCreate() throws TException {
      // Table should be created in hive metastore
      final org.apache.hadoop.hive.metastore.api.Table table = metastoreClient.getTable(DB_NAME, TABLE_NAME);
      varifyTable(TABLE_IDENTIFIER);
nit: typo varifyTable -> verifyTable
     * @param tableIdentifier an identifier to identify this table in a namespace.
     * @return instance of {@link Table} implementation referred by {@code tableIdentifier}
     */
    Table getTable(TableIdentifier tableIdentifier);
Should be loadTable because get usually implies that the operation is a quick fetch, like a getter would be on a Java object.
    @Override
    public String toString() {
      // assumes a level it self won't have a period in it, otherwise it will be ambiguous
      return Arrays.stream(levels).collect(joining("."));
Is it necessary to use the streams API here? What about Joiner.on(".").join(levels)? Seems simpler.
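For comparison, the two forms are shown side by side below. String.join is used as the JDK analogue of Guava's Joiner.on(".").join(levels), since the Guava dependency is not available in a standalone sketch:

```java
import java.util.Arrays;
import static java.util.stream.Collectors.joining;

// The streams form from the PR next to the simpler join form the reviewer
// suggests; both produce the same dotted namespace string.
class NamespaceToStringSketch {
  static String withStreams(String[] levels) {
    return Arrays.stream(levels).collect(joining("."));
  }

  static String withJoin(String[] levels) {
    return String.join(".", levels); // JDK analogue of Joiner.on(".").join(levels)
  }

  public static void main(String[] args) {
    String[] levels = {"prod", "db", "tbl"};
    System.out.println(withStreams(levels)); // prod.db.tbl
    System.out.println(withJoin(levels));    // prod.db.tbl
  }
}
```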
        tableIdentifier.name());
    }

    protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
I think this should be in Hive, not in the base implementation. The base could be used for metastores that support more nesting than 1 namespace level.
    Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
        "schema name as namespace");
    String schemaName = tableIdentifier.namespace().levels()[0];
    Preconditions.checkArgument(schemaName != null && !schemaName.isEmpty(), "schema name can't be null or " +
Should TableIdentifier and Namespace check that none of the parts are null instead? I think that is going to be easier than validating in lots of other places.
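Centralizing the check could look like the sketch below, where a simplified stand-in for Namespace rejects null or empty levels once, at construction time, so call sites like validateTableIdentifier no longer re-check them:

```java
import java.util.Arrays;

// Simplified stand-in for Namespace (not the real Iceberg class) that
// validates its parts in the constructor, as the reviewer proposes.
class NamespaceSketch {
  private final String[] levels;

  NamespaceSketch(String... levels) {
    for (String level : levels) {
      if (level == null || level.isEmpty()) {
        throw new IllegalArgumentException("namespace level can't be null or empty");
      }
    }
    this.levels = Arrays.copyOf(levels, levels.length);
  }

  String[] levels() {
    return Arrays.copyOf(levels, levels.length);
  }

  public static void main(String[] args) {
    System.out.println(new NamespaceSketch("hivedb").levels().length); // 1
    try {
      new NamespaceSketch("hivedb", "");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // namespace level can't be null or empty
    }
  }
}
```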
    protected static final void validateTableIdentifier(TableIdentifier tableIdentifier) {
      Preconditions.checkArgument(tableIdentifier.hasNamespace(), "metastore tables should have schema as namespace");
      Preconditions.checkArgument(tableIdentifier.namespace().levels().length == 1, "metastore tables should only have " +
At least for Hive, I think that the correct term is "database" and not "schema" for the namespace.
    import org.apache.iceberg.exceptions.NoSuchTableException;
    import org.apache.iceberg.exceptions.RuntimeIOException;

    public class HadoopCatalog implements Catalog, Configurable {
Why does this implement Configurable? Seems strange to me.
     * Top level Catalog APIs that supports table DDLs and namespace listing.
     */
    - public interface Catalog {
    + public interface Catalog extends Closeable {
Why make all catalogs Closeable?
    static final String DB_NAME = "hivedb";
    static final String TABLE_NAME = "tbl";
    static final TableIdentifier TABLE_IDENTIFIER =
        new TableIdentifier(Namespace.namespace(new String[] {DB_NAME}), TABLE_NAME);
Looks like namespace should take String... (varargs).
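The varargs signature removes the explicit array at call sites. The Namespace class below is a simplified stand-in mirroring the factory name used in the PR, not the real Iceberg class:

```java
// Simplified stand-in for the Namespace factory discussed above.
class Namespace {
  private final String[] levels;

  private Namespace(String[] levels) {
    this.levels = levels;
  }

  // String... instead of String[]: callers can write namespace(DB_NAME)
  // rather than namespace(new String[] {DB_NAME}).
  static Namespace namespace(String... levels) {
    return new Namespace(levels);
  }

  String[] levels() {
    return levels;
  }

  public static void main(String[] args) {
    Namespace ns = Namespace.namespace("hivedb");
    System.out.println(ns.levels().length); // 1
  }
}
```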
    import org.apache.iceberg.exceptions.AlreadyExistsException;
    import org.apache.iceberg.exceptions.NoSuchTableException;

    public abstract class BaseMetastoreCatalog implements Catalog {
Should this replace BaseMetastoreTables?
    validateTableIdentifier(to);

    HiveMetaStoreClient hiveMetaStoreClient = this.clients.newClient();
    String location = ((BaseTable) getTable(from)).operations().current().file().location();
I don't think renaming the table should change the location. Data can be stored in the old location and referenced by a new name.
    try {
      Table table = hiveMetaStoreClient.getTable(oldDBName, oldTableName);

      // hive metastore renames the table's directory as part of renaming the table.
    tables.close();
    this.tables = null;
    try {
      metastoreClient.getTable(DB_NAME, TABLE_NAME);
Why does this get the table first?
    .protocolFactory(new TBinaryProtocol.Factory())
    .minWorkerThreads(3)
    - .maxWorkerThreads(5);
    + .maxWorkerThreads(32);
What was the reason for changing this? Keeping it low helps us catch client leaks, like the one introduced by the newClient call in the Hive catalog.
    hiveCatalog.dropTable(TABLE_IDENTIFIER);

    metastoreClient.getTable(DB_NAME, TABLE_NAME);
It is usually better to use assertThrows because you can check the error message. Then you wouldn't need to load the table first.
    private Configuration conf;

    - public HadoopTables() {
    + public HadoopCatalog() {
I don't think it makes sense to move the Hadoop implementation to be a Catalog. Identifiers don't handle paths well and there are quite a few changes for not much benefit.
What about just moving Hive to use the Catalog interface? Do we need to move HadoopTables or can we leave it as-is?
|
Do we see the new
    @@ -18,6 +18,7 @@
     */
    package org.apache.iceberg.catalog;
Should we require an empty line after the header? I think it is true for most of the files. However, we don't enforce it in checkstyle.xml. Shall we add PACKAGE_DEF to EmptyLineSeparator?
I'm fine either way. If we choose to enforce it, we'll have to add it as a separate PR.
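If it is enforced later, the checkstyle change being discussed would presumably look like adding PACKAGE_DEF to the EmptyLineSeparator tokens; the exact token list below is illustrative, not the project's actual configuration:

```xml
<!-- Hypothetical checkstyle.xml fragment: including PACKAGE_DEF makes
     EmptyLineSeparator require a blank line after the package statement. -->
<module name="EmptyLineSeparator">
  <property name="tokens"
            value="PACKAGE_DEF, IMPORT, CLASS_DEF, INTERFACE_DEF, METHOD_DEF"/>
  <property name="allowNoEmptyLineBetweenFields" value="true"/>
</module>
```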
    @@ -28,7 +29,7 @@
    /**
     * Top level Catalog APIs that supports table DDLs and namespace listing.
Should the comment be "Top-level Catalog API that supports ..." or "Top-level Catalog APIs that support ..."?
I think it should be singular API.
    - public interface Catalog {
    + public interface Catalog extends Closeable {
      /**
       * creates the table or throws {@link AlreadyExistsException}.
nit: what about having the same formatting for Javadoc in this file? I see that some comments start with a capital letter and some do not.
    public abstract TableOperations newTableOps(Configuration newConf, TableIdentifier tableIdentifier);

    protected String defaultWarehouseLocation(Configuration hadoopConf, TableIdentifier tableIdentifier) {
      String warehouseLocation = hadoopConf.get("hive.metastore.warehouse.dir");
Should we rely on a Hive Metastore property to get a table base location? Maybe the Hive specific implementation can set the Hive warehouse dir, though I'm not sure whether we need to provide an implementation here or leave it abstract.
Yeah, good point. I think this should be left abstract and implemented in the Hive module.
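The refactor being agreed on might look like this sketch: the base class only declares the hook, and the Hive catalog supplies the Hive-specific property lookup. Class shapes and the Map-based configuration are simplified stand-ins, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// The base catalog declares the location hook but provides no default.
abstract class BaseCatalogSketch {
  protected abstract String defaultWarehouseLocation(String database, String table);
}

// Only the Hive implementation knows about hive.metastore.warehouse.dir.
class HiveCatalogSketch extends BaseCatalogSketch {
  private final Map<String, String> conf;

  HiveCatalogSketch(Map<String, String> conf) {
    this.conf = conf;
  }

  @Override
  protected String defaultWarehouseLocation(String database, String table) {
    String warehouse = conf.get("hive.metastore.warehouse.dir");
    return warehouse + "/" + database + ".db/" + table;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("hive.metastore.warehouse.dir", "hdfs:///warehouse");
    HiveCatalogSketch catalog = new HiveCatalogSketch(conf);
    System.out.println(catalog.defaultWarehouseLocation("hivedb", "tbl"));
    // hdfs:///warehouse/hivedb.db/tbl
  }
}
```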
    try {
      metastoreClient.getTable(DB_NAME, TABLE_NAME);
      metastoreClient.dropTable(DB_NAME, TABLE_NAME);
      this.catalog.close();
Why do we close the Catalog after drop?
I think this was to avoid exhausting handlers in the test metastore. The problem is that this is instantiating new clients instead of using the pool.
Yes, for metastore use cases.