Update iceberg-spark to use HiveTables #239
Conversation
```java
  HadoopTables tables = new HadoopTables(conf);
  return tables.load(path.get());
} else {
  HiveTables tables = new HiveTables(conf);
```
`HiveTables` creates a connection pool that we will need to close. Maybe we don't need to worry about this for now because Spark 3.0 will use a catalog with a more reasonable life-cycle, but it seems like a bad idea to leak connections. It is also difficult to close these connections because a new `HiveTables` is instantiated on every call.
What about creating a static cache and getting a `HiveTables` instance based on the value of `hive.metastore.uris`?
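A minimal sketch of that suggestion, assuming a Caffeine cache keyed on the metastore URI (the `HiveTablesCache` class, the `tablesFor` helper, and the package imports are illustrative, not code from this PR):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.iceberg.hive.HiveTables;

public class HiveTablesCache {
  // One HiveTables (and its connection pool) per metastore URI,
  // instead of a new instance on every load.
  private static final Cache<String, HiveTables> TABLES_CACHE = Caffeine.newBuilder().build();

  public static HiveTables tablesFor(Configuration conf) {
    // The metastore URI can be null in local mode, so default to "".
    String uri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
    return TABLES_CACHE.get(uri, ignored -> new HiveTables(conf));
  }
}
```

Cached instances stay open for the life of the JVM, trading an explicit close for a bounded number of connection pools.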
@aokolnychyi, nice work! It is ready other than the connection problem. I think we need to add a static cache to avoid creating too many HMS connections.
```java
import org.junit.Test;

import static org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS;
import static org.apache.iceberg.types.Types.NestedField.optional;
```
We disable `AvoidStaticImport` for tests in `checkstyle-suppressions.xml`. I still try to avoid static imports, but the statement would be too long in this particular case.
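For illustration, a hypothetical snippet showing what the static import saves (assuming a `Configuration conf` in scope):

```java
// Without the static import, the reference is fully qualified:
String uris = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");

// With the static import of METASTOREURIS, the same lookup shortens to:
String sameUris = conf.get(METASTOREURIS.varname, "");
```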
@aokolnychyi, I merged #240, so could you rebase and update this to use HiveCatalog?
Force-pushed from 03d1cc4 to a7e68c2, then to 6a8716f, and finally to 9b8b3a6.
```java
public static HiveCatalog loadCatalog(Configuration conf) {
  // metastore URI can be null in local mode
  String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
  return CATALOG_CACHE.get(metastoreUri, uri -> new HiveCatalog(conf));
}
```
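The `CATALOG_CACHE` field is not shown in this diff; a plausible declaration, assuming the same Caffeine-style cache as the sketch earlier in the thread (a guess, not the PR's actual code):

```java
private static final Cache<String, HiveCatalog> CATALOG_CACHE = Caffeine.newBuilder().build();
```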
This might be a bit problematic as we cache the Hadoop conf. Let me think about possible implications.
Okay, one case where this won't work is `hadoop.`-prefixed data source options.
Building a `Configuration` -> `HiveCatalog` cache doesn't seem like an option either.
We're okay with that for 2.4 support. This will be improved when using a real catalog for Spark 3.0. Also, the `hadoop.` options are mostly for configuring the read or write, not for controlling the metastore.
@rdblue, you are right. However, `hadoop.` data source options can be used to set `iceberg.compress.metadata` if we want to compress metadata only in particular tables (see the sketch below).
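For context, a read passing a `hadoop.`-prefixed option might look like the following sketch (assuming an existing `SparkSession spark` and an illustrative table path); with a catalog cached only by metastore URI, such per-read values would not reach the cached instance:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Options prefixed with "hadoop." are merged into the Hadoop Configuration
// used for this particular read.
Dataset<Row> df = spark.read()
    .format("iceberg")
    .option("hadoop.iceberg.compress.metadata", "true")
    .load("/path/to/table");
```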
Ideally, `iceberg.compress.metadata` should be a table property; it is the only entry in `ConfigProperties`. The problem is that `HadoopTables` requires this config up front to find tables. Maybe we can circumvent this by trying files with and without the `.gz` suffix, as sketched below.
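A rough sketch of that suffix-probing idea, assuming a Hadoop `FileSystem` and illustrative names:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: probe for the compressed metadata file first and
// fall back to the uncompressed name, so iceberg.compress.metadata does
// not need to be known before locating the table.
static Path resolveMetadataFile(FileSystem fs, Path metadataFile) throws IOException {
  Path gz = new Path(metadataFile.toString() + ".gz");
  return fs.exists(gz) ? gz : metadataFile;
}
```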
This PR enables support for `HiveTables` in `IcebergSource` and resolves #8.