Update iceberg-spark to use HiveTables #239
Conversation
```java
  HadoopTables tables = new HadoopTables(conf);
  return tables.load(path.get());
} else {
  HiveTables tables = new HiveTables(conf);
```
`HiveTables` creates a connection pool that we will need to close. Maybe we don't need to worry about this for now because Spark 3.0 will use a catalog with a more reasonable life-cycle, but it seems like a bad idea to leak connections. It is also difficult to close these connections because a new `HiveTables` is instantiated on every call.
What about creating a static cache and getting a `HiveTables` instance based on the value of `hive.metastore.uris`?
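A minimal sketch of that suggestion, assuming a Caffeine cache keyed on the metastore URI (the `HiveTablesCache` class, the `tablesFor` helper, and the package imports are illustrative, not code from this PR):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.iceberg.hive.HiveTables;

public class HiveTablesCache {
  // One HiveTables (and its connection pool) per metastore URI,
  // instead of a new instance on every load.
  private static final Cache<String, HiveTables> TABLES_CACHE = Caffeine.newBuilder().build();

  public static HiveTables tablesFor(Configuration conf) {
    // The metastore URI can be null in local mode, so default to "".
    String uri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
    return TABLES_CACHE.get(uri, ignored -> new HiveTables(conf));
  }
}
```

Cached instances stay open for the life of the JVM, trading an explicit close for a bounded number of connection pools.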
@aokolnychyi, nice work! It is ready other than the connection problem. I think we need to add a static cache to avoid creating too many HMS connections.
```java
import org.junit.Test;

import static org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS;
import static org.apache.iceberg.types.Types.NestedField.optional;
```
We disable `AvoidStaticImport` for tests in `checkstyle-suppressions.xml`. I still try to avoid static imports, but the statement would be too long in this particular case.
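For illustration, a hypothetical snippet showing what the static import saves (assuming a `Configuration conf` in scope):

```java
// Without the static import, the reference is fully qualified:
String uris = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");

// With the static import of METASTOREURIS, the same lookup shortens to:
String sameUris = conf.get(METASTOREURIS.varname, "");
```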
@aokolnychyi, I merged #240, so could you rebase and update this to use HiveCatalog?
Force-pushed from 03d1cc4 to a7e68c2, then to 6a8716f, and finally to 9b8b3a6.
```java
public static HiveCatalog loadCatalog(Configuration conf) {
  // metastore URI can be null in local mode
  String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
  return CATALOG_CACHE.get(metastoreUri, uri -> new HiveCatalog(conf));
}
```
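The `CATALOG_CACHE` field is not shown in this diff; a plausible declaration, assuming the same Caffeine-style cache as the sketch earlier in the thread (a guess, not the PR's actual code):

```java
private static final Cache<String, HiveCatalog> CATALOG_CACHE = Caffeine.newBuilder().build();
```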
This might be a bit problematic as we cache the Hadoop conf. Let me think about possible implications.
Okay, one case where this won't work is `hadoop.`-prefixed data source options.
Building a `Configuration` -> `HiveCatalog` cache doesn't seem like an option either.
We're okay with that for 2.4 support. This will be improved when using a real catalog for Spark 3.0. Also, the `hadoop.` options are mostly for configuring the read or write, not for controlling the metastore.
@rdblue, you are right. However, `hadoop.` data source options can be used to set `iceberg.compress.metadata` if we want to compress metadata only in particular tables (see the sketch below).
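For context, a read passing a `hadoop.`-prefixed option might look like the following sketch (assuming an existing `SparkSession spark` and an illustrative table path); with a catalog cached only by metastore URI, such per-read values would not reach the cached instance:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Options prefixed with "hadoop." are merged into the Hadoop Configuration
// used for this particular read.
Dataset<Row> df = spark.read()
    .format("iceberg")
    .option("hadoop.iceberg.compress.metadata", "true")
    .load("/path/to/table");
```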
Ideally, `iceberg.compress.metadata` should be a table property; it is the only entry in `ConfigProperties`. The problem is that `HadoopTables` requires this config up front to find tables. Maybe we can circumvent this by trying files with and without the `.gz` suffix, as sketched below.
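A rough sketch of that suffix-probing idea, assuming a Hadoop `FileSystem` and illustrative names:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: probe for the compressed metadata file first and
// fall back to the uncompressed name, so iceberg.compress.metadata does
// not need to be known before locating the table.
static Path resolveMetadataFile(FileSystem fs, Path metadataFile) throws IOException {
  Path gz = new Path(metadataFile.toString() + ".gz");
  return fs.exists(gz) ? gz : metadataFile;
}
```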
This PR enables support for `HiveTables` in `IcebergSource` and resolves #8.