It would be useful to allow the custom metadata described in #106 to be consumed by the Spark Data Source. For example, it would be helpful to encrypt the files upon writing them. But different users of the data source will have different ways they would want to use that custom metadata to inform how the data is read or written.
We therefore propose supporting a data source option that service-loads an implementation of the interface below in the data source reader and writer layers.
Below is an API sketch for such a plugin:
```java
class OutputFileWithMetadata {
  OutputFile file;
  Map<String, String> customMetadata;
}

interface IcebergSparkIO {
  /**
   * Opens a data file for reading.
   */
  InputFile open(DataFile file);

  /**
   * Opens a data file for reading, only specifying the location and custom metadata
   * associated with that file.
   *
   * (Location can be either URI or String, prefer URI for stricter semantics)
   */
  InputFile open(URI location, Map<String, String> customMetadata);

  /**
   * Create a new output file, returning a handle to write bytes to it as well as
   * custom metadata that should be included alongside the file when adding the
   * appropriate DataFile to the table.
   *
   * (Location can be either URI or String, prefer URI for stricter semantics)
   */
  OutputFileWithMetadata createNewFile(URI location);

  /**
   * Deletes the file at the given location. Called when jobs have to abort tasks
   * and clean up previously written files.
   *
   * (Location can be either URI or String, prefer URI for stricter semantics)
   */
  void delete(URI location);
}
```
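To illustrate how the custom metadata could drive the encryption use case mentioned above, here is a hedged, self-contained sketch. The types are deliberately simplified stand-ins (raw byte arrays instead of Iceberg's `InputFile`/`OutputFile`, a trivial XOR "cipher", and an assumed metadata key name `encryption-key`); the point is only the flow: `createNewFile` produces custom metadata that is stored with the `DataFile`, and `open` consumes that metadata to read the bytes back.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an encrypting IcebergSparkIO
// implementation. Real code would wrap Iceberg's InputFile/OutputFile
// and use a real cipher; XOR is used here only to keep the sketch short.
public class XorSparkIO {
  static final String KEY_PROP = "encryption-key"; // assumed metadata key name

  // "Writes" a file: encrypts the bytes with a per-file key and returns the
  // key as custom metadata, to be recorded alongside the DataFile entry.
  static Map.Entry<byte[], Map<String, String>> createNewFile(byte[] plain, byte key) {
    byte[] encrypted = xor(plain, key);
    Map<String, String> meta = new HashMap<>();
    meta.put(KEY_PROP, Integer.toString(key & 0xFF));
    return Map.entry(encrypted, meta);
  }

  // "Opens" a file: reads the key back out of the custom metadata and decrypts.
  static byte[] open(byte[] encrypted, Map<String, String> customMetadata) {
    byte key = (byte) Integer.parseInt(customMetadata.get(KEY_PROP));
    return xor(encrypted, key);
  }

  static byte[] xor(byte[] in, byte key) {
    byte[] out = new byte[in.length];
    for (int i = 0; i < in.length; i++) {
      out[i] = (byte) (in[i] ^ key);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] plain = "hello iceberg".getBytes(StandardCharsets.UTF_8);
    Map.Entry<byte[], Map<String, String>> written = createNewFile(plain, (byte) 42);
    byte[] roundTrip = open(written.getKey(), written.getValue());
    System.out.println(new String(roundTrip, StandardCharsets.UTF_8)); // prints "hello iceberg"
  }
}
```

Note that the reader never needs out-of-band key distribution here: everything required to open the file travels with the table metadata, which is exactly what the two-argument `open(location, customMetadata)` overload enables.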
It is difficult, however, to make IcebergSparkIO Serializable. If we're not careful, we would therefore have to service-load the implementation on every executor, possibly multiple times. We propose instead to service-load a provider class that can be passed the data source options data structure, so that the plugin only has to be service-loaded once and can then be serialized and distributed to the executor nodes. We therefore also require the interface below:
```java
interface IcebergSparkIOFactory extends Serializable {
  IcebergSparkIO create(DataSourceOptions options);
}
```
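The following sketch shows why the Serializable factory solves the problem: the factory is loaded once on the driver (in practice via `java.util.ServiceLoader`, which needs a `META-INF/services` registration and is omitted here to keep the example self-contained), crosses the driver-to-executor boundary through Java serialization, and only then produces the non-Serializable IO instance locally. `Options` is a `Map<String, String>` stand-in for Spark's `DataSourceOptions`, and the factory body is a placeholder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;

public class FactorySerializationDemo {

  interface SparkIO {                                // stand-in for IcebergSparkIO
    String describe();
  }

  interface SparkIOFactory extends Serializable {    // mirrors IcebergSparkIOFactory
    SparkIO create(Map<String, String> options);     // Map stands in for DataSourceOptions
  }

  // A concrete factory holds only serializable configuration; the heavyweight,
  // non-Serializable IO object is created on the executor side.
  static class XorIOFactory implements SparkIOFactory {
    private static final long serialVersionUID = 1L;

    @Override
    public SparkIO create(Map<String, String> options) {
      String keyId = options.getOrDefault("encryption.key-id", "default"); // assumed option name
      return () -> "xor-io(key-id=" + keyId + ")";
    }
  }

  // Simulates the driver serializing the factory into task state and an
  // executor deserializing it before calling create() locally.
  static SparkIOFactory shipToExecutor(SparkIOFactory factory) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(factory);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (SparkIOFactory) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    SparkIOFactory onDriver = new XorIOFactory();         // service-loaded once in practice
    SparkIOFactory onExecutor = shipToExecutor(onDriver); // arrives via task serialization
    SparkIO io = onExecutor.create(Map.of("encryption.key-id", "k1"));
    System.out.println(io.describe()); // prints "xor-io(key-id=k1)"
  }
}
```

This keeps the service-loading cost on the driver and avoids requiring the IO implementation itself to be Serializable, which matters when it holds things like open connections or native resources.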