[BUG] UTF-8 Character Encoding Issues in opensearchproject/data-prepper container #5238
Bug Description
The opensearchproject/data-prepper container image mishandles UTF-8 characters when streaming data from DynamoDB to S3 buckets in NDJSON format: non-ASCII characters are replaced with question marks (?) in the output files.
Steps to Reproduce
- Set up data-prepper using the opensearchproject/data-prepper container image
- Create a DynamoDB table with items containing strings with non-ASCII characters (e.g., Mandarin, Tamil)
- Configure data-prepper to stream changes from the DynamoDB table to an S3 bucket using NDJSON format
- Observe the resulting S3 objects
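To make the last step less manual, here is a minimal sketch of a check that could be run against a downloaded S3 object. It assumes the object body is available as NDJSON text and that you know the original string values. The helper names `lossy_ascii` and `find_replaced_fields` are hypothetical and not part of data-prepper.

```python
import json


def lossy_ascii(text: str) -> str:
    """Return text with every non-ASCII character replaced by '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def find_replaced_fields(ndjson_text: str, expected: dict) -> list:
    """Return keys whose value in a record looks like a '?'-replaced copy
    of the expected original string (hypothetical helper)."""
    damaged = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        for key, original in expected.items():
            value = record.get(key)
            if (
                isinstance(original, str)
                and value != original
                and value == lossy_ascii(original)
            ):
                damaged.append(key)
    return damaged


# Example: the "name" field lost its Mandarin characters, "city" is intact.
sample = '{"name": "??", "city": "Chennai"}\n'
print(find_replaced_fields(sample, {"name": "你好", "city": "Chennai"}))
```

A field is reported only when its value matches the original exactly after '?' substitution, so legitimate question marks in the source data are not flagged.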
Actual Behavior
All non-ASCII characters in the original DynamoDB data are replaced with question marks (?) in the S3 output files.
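The report does not identify a root cause. The symptom is, however, what you would see if the NDJSON text were encoded with a charset that cannot represent these characters. The sketch below only illustrates that general effect in Python, with a made-up record; it is not data-prepper code and makes no claim about where in the pipeline the loss occurs.

```python
import json

# Made-up record with Mandarin and Tamil strings (illustrative only).
record = {"greeting": "你好", "language": "தமிழ்"}

# Serialize one NDJSON line without escaping non-ASCII characters.
line = json.dumps(record, ensure_ascii=False)

# Encoding as UTF-8 preserves every character.
utf8_bytes = line.encode("utf-8")

# Encoding as ASCII with replacement turns each non-ASCII character into '?'.
ascii_bytes = line.encode("ascii", errors="replace")

print(utf8_bytes.decode("utf-8"))
print(ascii_bytes.decode("ascii"))
```

The second printed line still parses as valid JSON, which is why this kind of loss can go unnoticed until someone inspects the values.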
Expected Behavior
All UTF-8 characters, including non-ASCII characters, should be preserved in the output NDJSON files exactly as they appear in the source DynamoDB table.
Workaround
Adding the environment variable LC_ALL=C.UTF-8 to the container configuration resolves the issue. This environment variable should be set by default in the container image to ensure proper UTF-8 handling.