Describe the bug
The TableInput property of AWS::Glue::Table has two separate Parameters sections: one at the table level and one inside the StorageDescriptor. The current S3Table implementation puts all custom parameters into the StorageDescriptor Parameters and leaves the table-level Parameters hard-coded.
The use case is dynamic partitioning, which uses projection.<dynamic-partitioning>.format and similar parameters to define how Glue (and Athena) parse the dynamic partitioning field. This is a common way to archive data into S3 using Kinesis Firehose.
Expected Behavior
When using the following code in the CDK:
const replication_table = new glue.S3Table(this, 'ReplicationTable', {
  database: replication_database,
  tableName: <Glue-Table-Name>,
  columns: <Columns>,
  partitionKeys: [{
    name: 'datehour',
    type: glue.Schema.STRING,
  }],
  bucket: eventsBucket,
  s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
  storedAsSubDirectories: true,
  storageParameters: [
    glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
    // The parameters that are relevant for the calculation of the dynamic partitioning
    glue.StorageParameter.custom('projection.enabled', 'true'),
    glue.StorageParameter.custom('projection.datehour.type', 'date'),
    glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'),
    glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'),
    glue.StorageParameter.custom('projection.datehour.interval', '1'),
    glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'),
    glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'),
    glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
    glue.StorageParameter.custom('compressionType', 'gzip'),
  ],
  dataFormat: glue.DataFormat.JSON,
  enablePartitionFiltering: true,
  compressed: true,
});
I expect to get the following CloudFormation snippet:
"ReplicationTable2E30ABDE": {
  "Type": "AWS::Glue::Table",
  "Properties": {
    "CatalogId": {
      "Ref": "AWS::AccountId"
    },
    "DatabaseName": {
      "Ref": "DatabaseB269D8BB"
    },
    "TableInput": {
      "Name": <Glue-Table-Name>,
      "Parameters": {
        "classification": "json",
        "partition_filtering.enabled": true,
        "projection.enabled": "true",
        "projection.datehour.type": "date",
        "projection.datehour.format": "yyyy/MM/dd",
        "projection.datehour.range": "2021/01/01,NOW",
        "projection.datehour.interval": "1",
        "projection.datehour.interval.unit": "DAYS",
        "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
        "EXTERNAL": "TRUE",
        "compressionType": "gzip"
      },
      "PartitionKeys": [
        {
          "Name": "datehour",
          "Type": "string"
        }
      ],
      "StorageDescriptor": {
        "Columns": [<Columns>],
        "Compressed": true,
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "Location": {
          "Fn::Join": [
            "",
            [
              "s3://",
              {
                "Ref": "EventsBucketCD4657F9"
              },
              "/events/table=<DynamoDB-Table-Name>/"
            ]
          ]
        },
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "Parameters": {
          "compression_type": "gzip"
        },
        "SerdeInfo": {
          "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
        },
        "StoredAsSubDirectories": true
      },
      "TableType": "EXTERNAL_TABLE"
    }
  }
}
Please note that the dynamic partitioning parameters are under the TableInput Parameters, not under the StorageDescriptor Parameters.
Current Behavior
Instead, I get the following stack snippet:
"Type": "AWS::Glue::Table",
"Properties": {
  "CatalogId": {
    "Ref": "AWS::AccountId"
  },
  "DatabaseName": {
    "Ref": "DatabaseB269D8BB"
  },
  "TableInput": {
    "Name": <Glue-Table-Name>,
    "Parameters": {
      "classification": "json",
      "has_encrypted_data": true,
      "partition_filtering.enabled": true
    },
    "PartitionKeys": [
      {
        "Name": "datehour",
        "Type": "string"
      }
    ],
    "StorageDescriptor": {
      "Columns": [<Columns>],
      "Compressed": true,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location": {
        "Fn::Join": [
          "",
          [
            "s3://",
            {
              "Ref": "EventsBucketCD4657F9"
            },
            "/events/table=<DynamoDB-Table-Name>/"
          ]
        ]
      },
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Parameters": {
        "compression_type": "gzip",
        "projection.datehour.enabled": "true",
        "projection.datehour.type": "date",
        "projection.datehour.format": "yyyy/MM/dd",
        "projection.datehour.range": "2021/01/01,NOW",
        "projection.datehour.interval": "1",
        "projection.datehour.interval.unit": "DAYS",
        "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
        "EXTERNAL": "TRUE",
        "compressionType": "gzip"
      },
      "SerdeInfo": {
        "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "StoredAsSubDirectories": true
    },
    "TableType": "EXTERNAL_TABLE"
  }
}
Please note that the dynamic partitioning parameters are added to the wrong Parameters section (StorageDescriptor instead of TableInput).
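To make the misplacement easy to spot in a synthesized template, a small check along these lines could be used, for example in a unit test. This is only a sketch: the `TableInputLike` shape, the `PROJECTION_PREFIXES` list and the function names are my own assumptions for illustration and are not part of the CDK.

```typescript
// Hypothetical minimal shape of a synthesized TableInput; only the fields inspected here.
interface TableInputLike {
  Parameters?: Record<string, unknown>;
  StorageDescriptor?: { Parameters?: Record<string, unknown> };
}

// Assumption: keys starting with these prefixes are dynamic partitioning settings
// that should live in the table-level Parameters.
const PROJECTION_PREFIXES = ['projection.', 'storage.location.template'];

function isProjectionKey(key: string): boolean {
  return PROJECTION_PREFIXES.some((prefix) => key.indexOf(prefix) === 0);
}

// Returns the projection-related keys found under StorageDescriptor.Parameters, sorted.
function findMisplacedProjectionKeys(tableInput: TableInputLike): string[] {
  const storageParams = tableInput.StorageDescriptor?.Parameters ?? {};
  return Object.keys(storageParams).filter(isProjectionKey).sort();
}

// Trimmed-down version of the snippet above.
const currentTableInput: TableInputLike = {
  Parameters: { classification: 'json' },
  StorageDescriptor: {
    Parameters: {
      compression_type: 'gzip',
      'projection.datehour.type': 'date',
      'storage.location.template': 's3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/',
    },
  },
};

console.log(findMisplacedProjectionKeys(currentTableInput));
```

With the expected output from the previous section, the same call would return an empty list, because the projection keys would sit in the table-level Parameters.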
Reproduction Steps
Use code similar to the following in your stack definition under /lib:
const replication_table = new glue.S3Table(this, 'ReplicationTable', {
  database: replication_database,
  tableName: <Glue-Table-Name>,
  columns: <Columns>,
  partitionKeys: [{
    name: 'datehour',
    type: glue.Schema.STRING,
  }],
  bucket: eventsBucket,
  s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
  storedAsSubDirectories: true,
  storageParameters: [
    glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
    // The parameters that are relevant for the calculation of the dynamic partitioning
    glue.StorageParameter.custom('projection.enabled', 'true'),
    glue.StorageParameter.custom('projection.datehour.type', 'date'),
    glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'),
    glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'),
    glue.StorageParameter.custom('projection.datehour.interval', '1'),
    glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'),
    glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'),
    glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
    glue.StorageParameter.custom('compressionType', 'gzip'),
  ],
  dataFormat: glue.DataFormat.JSON,
  enablePartitionFiltering: true,
  compressed: true,
});
Possible Solution
I can think of three options to solve the bug:
- Allow adding parameters to the tableInput and not only to the storageParameters - something like tableParameters.
- Allow access to the node after the constructor and allow the user to move the objects in the tableInput object.
- Add a method that will be specific to the dynamic-partitioning option, similar to the way that it is defined in a single value in Kinesis Firehose:
extendedS3DestinationConfiguration : {
prefix: 'events/table=!{partitionKeyFromQuery:tablename}/!{timestamp:yyyy/MM/dd}/',
errorOutputPrefix: 'errors/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/',
...
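To illustrate how options 1 and 3 could fit together, here is a rough sketch of a helper that builds the parameters for a date-typed partition from a few values. Everything in it is hypothetical: `DateProjectionOptions`, `dateProjectionParameters` and the idea of feeding the result into a `tableParameters` property are assumptions for discussion, not an existing CDK API. The parameter keys are the ones used in the snippets above.

```typescript
// Hypothetical options for a date-typed dynamic partition; names are illustrative only.
interface DateProjectionOptions {
  column: string;            // partition key name, e.g. 'datehour'
  format: string;            // e.g. 'yyyy/MM/dd'
  range: string;             // e.g. '2021/01/01,NOW'
  interval?: number;         // e.g. 1
  intervalUnit?: string;     // e.g. 'DAYS'
  locationTemplate?: string; // value for storage.location.template
}

// Builds the table-level parameters for one date-typed partition key.
function dateProjectionParameters(options: DateProjectionOptions): Record<string, string> {
  const prefix = `projection.${options.column}`;
  const params: Record<string, string> = {
    'projection.enabled': 'true',
    [`${prefix}.type`]: 'date',
    [`${prefix}.format`]: options.format,
    [`${prefix}.range`]: options.range,
  };
  if (options.interval !== undefined) {
    params[`${prefix}.interval`] = String(options.interval);
  }
  if (options.intervalUnit !== undefined) {
    params[`${prefix}.interval.unit`] = options.intervalUnit;
  }
  if (options.locationTemplate !== undefined) {
    params['storage.location.template'] = options.locationTemplate;
  }
  return params;
}

// The result is what a hypothetical tableParameters property could receive.
const tableParameters = dateProjectionParameters({
  column: 'datehour',
  format: 'yyyy/MM/dd',
  range: '2021/01/01,NOW',
  interval: 1,
  intervalUnit: 'DAYS',
  locationTemplate: 's3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/',
});

console.log(tableParameters);
```

The helper only produces a plain key/value map, so it would work with any of the three options as long as the values end up in the table-level Parameters.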
Additional Information/Context
As mentioned above, this is part of a common pipeline that replicates a DynamoDB table to S3 to allow analytical queries on that data from Athena. In the example above (extendedS3DestinationConfiguration), the user can define the format of the dynamic partitioning of the data in Firehose. If we fix this issue with a similarly focused method (option 3 above), it will be easy to extend constructs such as KinesisStreamsToKinesisFirehoseToS3, AwsDynamoDBKinesisStreamsS3 or KinesisFirehoseToS3 to support the creation of the Glue table on top of the data in S3.
CDK CLI Version
2.99.0 (build 0aa1096)
Framework Version
No response
Node.js Version
v16.18.1
OS
macOS
Language
TypeScript
Language Version
No response
Other information
No response