aws-glue-alpha: S3-Table table properties are added to the wrong parameters section #27365

@guyernest

Describe the bug

The TableInput property of AWS::Glue::Table has two separate Parameters maps: one at the table level (TableInput.Parameters) and one at the storage level (TableInput.StorageDescriptor.Parameters). The current S3Table implementation writes all custom parameters into StorageDescriptor.Parameters and leaves the table-level Parameters hard-coded.

The use case is dynamic partitioning, which relies on projection.<dynamic-partitioning>.format and similar parameters to tell Glue (and Athena) how to parse the dynamic partitioning field. This is a common way to archive data into S3 using Kinesis Firehose.

Expected Behavior

When using the following code in the CDK:

    const replication_table = new glue.S3Table(this, 'ReplicationTable', {
      database: replication_database,
      tableName: <Glue-Table-Name>,
      columns: <Columns>,
      partitionKeys: [{
        name: 'datehour',
        type: glue.Schema.STRING,
      }],
      bucket: eventsBucket,
      s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
      storedAsSubDirectories: true,
      storageParameters: [
        glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
        // The parameters that are relevant for the calculation of the dynamic partitioning
        glue.StorageParameter.custom('projection.enabled', 'true'),
        glue.StorageParameter.custom('projection.datehour.type', 'date'),
        glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'),
        glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'),
        glue.StorageParameter.custom('projection.datehour.interval', '1'),
        glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'),
        glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'),
        glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
        glue.StorageParameter.custom('compressionType', 'gzip'),
      ],
      dataFormat: glue.DataFormat.JSON,
      enablePartitionFiltering: true,
      compressed: true,
    });

I expect to get the following CloudFormation snippet:

  "ReplicationTable2E30ABDE": {
   "Type": "AWS::Glue::Table",
   "Properties": {
    "CatalogId": {
     "Ref": "AWS::AccountId"
    },
    "DatabaseName": {
     "Ref": "DatabaseB269D8BB"
    },
    "TableInput": {
     "Name": <Glue-Table-Name>,
     "Parameters": {
      "classification": "json",
      "partition_filtering.enabled": true,
      "projection.enabled": "true",
      "projection.datehour.type": "date",
      "projection.datehour.format": "yyyy/MM/dd",
      "projection.datehour.range": "2021/01/01,NOW",
      "projection.datehour.interval": "1",
      "projection.datehour.interval.unit": "DAYS",
      "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
      "EXTERNAL": "TRUE",
      "compressionType": "gzip"
     },
     "PartitionKeys": [
      {
       "Name": "datehour",
       "Type": "string"
      }
     ],
     "StorageDescriptor": {
      "Columns": [<Columns>],
      "Compressed": true,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location": {
       "Fn::Join": [
        "",
        [
         "s3://",
         {
          "Ref": "EventsBucketCD4657F9"
         },
         "/events/table=<DynamoDB-Table-Name>/"
        ]
       ]
      },
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Parameters": {
       "compression_type": "gzip"
      },
      "SerdeInfo": {
       "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "StoredAsSubDirectories": true
     },
     "TableType": "EXTERNAL_TABLE"
    }
   }

Please note that the projection parameters are under TableInput.Parameters, not under StorageDescriptor.Parameters.

Current Behavior

Instead, I get the following snippet in the synthesized stack:

  "Type": "AWS::Glue::Table",
   "Properties": {
    "CatalogId": {
     "Ref": "AWS::AccountId"
    },
    "DatabaseName": {
     "Ref": "DatabaseB269D8BB"
    },
    "TableInput": {
     "Name": <Glue-Table-Name>,
     "Parameters": {
      "classification": "json",
      "has_encrypted_data": true,
      "partition_filtering.enabled": true
     },
     "PartitionKeys": [
      {
       "Name": "datehour",
       "Type": "string"
      }
     ],
     "StorageDescriptor": {
      "Columns": [ <Columns>],
      "Compressed": true,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location": {
       "Fn::Join": [
        "",
        [
         "s3://",
         {
          "Ref": "EventsBucketCD4657F9"
         },
         "/events/table=<DynamoDB-Table-Name>/"
        ]
       ]
      },
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Parameters": {
       "compression_type": "gzip",
       "projection.enabled": "true",
       "projection.datehour.type": "date",
       "projection.datehour.format": "yyyy/MM/dd",
       "projection.datehour.range": "2021/01/01,NOW",
       "projection.datehour.interval": "1",
       "projection.datehour.interval.unit": "DAYS",
       "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
       "EXTERNAL": "TRUE",
       "compressionType": "gzip"
      },
      "SerdeInfo": {
       "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "StoredAsSubDirectories": true
     },
     "TableType": "EXTERNAL_TABLE"
    }
   }

Please note that the dynamic partitioning parameters end up in StorageDescriptor.Parameters instead of TableInput.Parameters.
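
To make the difference between the two snippets concrete, here is a minimal sketch that works on a plain object shaped like the TableInput fragment above and moves the projection keys from StorageDescriptor.Parameters up to TableInput.Parameters. The names TableInputLike, isTableLevelKey and moveTableLevelParameters are hypothetical and not part of aws-glue-alpha, and the rule for which keys count as table-level is an assumption made only for this illustration.

```typescript
// Plain-object shape mirroring the TableInput fragment of AWS::Glue::Table.
// Hypothetical types for illustration; not part of aws-glue-alpha.
type Params = Record<string, string | boolean>;

interface TableInputLike {
  Parameters: Params;
  StorageDescriptor: { Parameters: Params };
}

// Assumption: projection.* and storage.location.template belong at table level.
function isTableLevelKey(key: string): boolean {
  return key.indexOf("projection.") === 0 || key === "storage.location.template";
}

// Returns a copy in which table-level keys have been moved out of
// StorageDescriptor.Parameters and into TableInput.Parameters.
function moveTableLevelParameters(input: TableInputLike): TableInputLike {
  const tableParams: Params = { ...input.Parameters };
  const storageParams: Params = {};
  for (const key of Object.keys(input.StorageDescriptor.Parameters)) {
    const value = input.StorageDescriptor.Parameters[key];
    if (isTableLevelKey(key)) {
      tableParams[key] = value;
    } else {
      storageParams[key] = value;
    }
  }
  return {
    ...input,
    Parameters: tableParams,
    StorageDescriptor: { ...input.StorageDescriptor, Parameters: storageParams },
  };
}

// Shortened version of the "Current Behavior" fragment.
const current: TableInputLike = {
  Parameters: { classification: "json", "partition_filtering.enabled": true },
  StorageDescriptor: {
    Parameters: {
      compression_type: "gzip",
      "projection.enabled": "true",
      "projection.datehour.type": "date",
      "storage.location.template": "s3://<Replication-Bucket>/events/${datehour}/",
    },
  },
};

const fixed = moveTableLevelParameters(current);
console.log(JSON.stringify(fixed, null, 1));
```

The input object is left untouched, so the before and after shapes can be compared side by side.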

Reproduction Steps

Use code similar to the following in your stack definition under /lib:

    const replication_table = new glue.S3Table(this, 'ReplicationTable', {
      database: replication_database,
      tableName: <Glue-Table-Name>,
      columns: <Columns>,
      partitionKeys: [{
        name: 'datehour',
        type: glue.Schema.STRING,
      }],
      bucket: eventsBucket,
      s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
      storedAsSubDirectories: true,
      storageParameters: [
        glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
        // The parameters that are relevant for the calculation of the dynamic partitioning
        glue.StorageParameter.custom('projection.enabled', 'true'),
        glue.StorageParameter.custom('projection.datehour.type', 'date'),
        glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'),
        glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'),
        glue.StorageParameter.custom('projection.datehour.interval', '1'),
        glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'),
        glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'),
        glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
        glue.StorageParameter.custom('compressionType', 'gzip'),
      ],
      dataFormat: glue.DataFormat.JSON,
      enablePartitionFiltering: true,
      compressed: true,
    });

Possible Solution

I can think of three options to solve the bug:

  • Allow parameters to be added to the tableInput and not only to the storageParameters, for example through something like tableParameters.
  • Allow access to the node after construction so that the user can move the entries within the tableInput object.
  • Add a method specific to the dynamic-partitioning option, similar to the way it is defined in a single value in Kinesis Firehose:
        extendedS3DestinationConfiguration : {
          prefix: 'events/table=!{partitionKeyFromQuery:tablename}/!{timestamp:yyyy/MM/dd}/',
          errorOutputPrefix: 'errors/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/',
...
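
As a rough illustration of the third option, the sketch below derives the table-level projection parameters from a single description of the date partition. DateProjectionOptions and dateProjectionParameters are hypothetical names that do not exist in aws-glue-alpha; the parameter keys are the ones used in the example above, and the assumption is that the resulting map would be written to TableInput.Parameters.

```typescript
// Hypothetical option bag for a date-based dynamic partition.
// None of these names exist in aws-glue-alpha.
interface DateProjectionOptions {
  column: string;          // partition key, e.g. "datehour"
  format: string;          // date format, e.g. "yyyy/MM/dd"
  range: [string, string]; // e.g. ["2021/01/01", "NOW"]
  interval?: number;       // e.g. 1
  intervalUnit?: string;   // e.g. "DAYS"
  locationPrefix: string;  // S3 URI of the table root, ending with "/"
}

// Builds the parameter map that the issue expects under TableInput.Parameters.
function dateProjectionParameters(opts: DateProjectionOptions): Record<string, string> {
  const prefix = "projection." + opts.column;
  const params: Record<string, string> = { "projection.enabled": "true" };
  params[prefix + ".type"] = "date";
  params[prefix + ".format"] = opts.format;
  params[prefix + ".range"] = opts.range.join(",");
  if (opts.interval !== undefined) {
    params[prefix + ".interval"] = String(opts.interval);
  }
  if (opts.intervalUnit !== undefined) {
    params[prefix + ".interval.unit"] = opts.intervalUnit;
  }
  // The placeholder is built by concatenation so that "${...}" stays literal.
  params["storage.location.template"] = opts.locationPrefix + "${" + opts.column + "}/";
  return params;
}

const projection = dateProjectionParameters({
  column: "datehour",
  format: "yyyy/MM/dd",
  range: ["2021/01/01", "NOW"],
  interval: 1,
  intervalUnit: "DAYS",
  locationPrefix: "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/",
});
console.log(projection);
```

A single call like this would keep the partition format in one place, much as the Firehose prefix does, instead of spreading it over several custom parameters.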

Additional Information/Context

As mentioned above, this is part of a common pipeline that replicates a DynamoDB table to S3 to allow analytical queries on that data from Athena. In the example above (extendedS3DestinationConfiguration), the user can define the format of the dynamic partitioning of the data in Firehose. If we fix this issue with a similarly focused method (option 3 above), it will be easy to extend constructs such as KinesisStreamsToKinesisFirehoseToS3, AwsDynamoDBKinesisStreamsS3 or KinesisFirehoseToS3 to support the creation of the Glue table on top of the data in S3.

CDK CLI Version

2.99.0 (build 0aa1096)

Framework Version

No response

Node.js Version

v16.18.1

OS

macOS

Language

TypeScript

Language Version

No response

Other information

No response
