[BUG] UTF-8 Character Encoding Issues in opensearchproject/data-prepper container #5238

@voronin-ilya

Bug Description
The opensearchproject/data-prepper container image mishandles UTF-8 characters when streaming data from DynamoDB to S3 buckets in NDJSON format: non-ASCII characters are replaced with question marks (?) in the output files.

Steps to Reproduce

  1. Set up data-prepper using the opensearchproject/data-prepper container image
  2. Create a DynamoDB table with items containing strings with non-ASCII characters (e.g., Mandarin, Tamil)
  3. Configure data-prepper to stream changes from the DynamoDB table to an S3 bucket using NDJSON format
  4. Observe the resulting S3 objects

Actual Behavior
All non-ASCII characters in the original DynamoDB data are replaced with question marks (?) in the S3 output files.

Expected Behavior
All UTF-8 characters, including non-ASCII characters, should be preserved in the output NDJSON files exactly as they appear in the source DynamoDB table.

Workaround
Setting the environment variable LC_ALL=C.UTF-8 in the container configuration resolves the issue. The container image should set this variable by default so that UTF-8 is handled correctly out of the box.
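
The symptom is consistent with text being encoded through a non-UTF-8 charset somewhere in the pipeline. This issue does not confirm the root cause, so treat the following as a minimal sketch of the general mechanism in Java, not as Data Prepper's actual code path: when a Java string is encoded with a charset that cannot represent a character (US-ASCII here), each unmappable character is replaced with `?`. The class name `CharsetDemo` and the helper `roundTrip` are hypothetical names used only for illustration. The sample string is written with Unicode escapes so the source compiles the same way regardless of the compiler's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: encode text to bytes with the given charset,
    // then decode those bytes back into a string with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // Two Mandarin characters (U+4F60, U+597D), written as escapes.
        String sample = "\u4f60\u597d";

        // The charset the JVM falls back to when code does not name one
        // explicitly; on some JVM versions it is derived from the locale.
        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII cannot represent these characters, so they are lost.
        System.out.println("US-ASCII: " + roundTrip(sample, StandardCharsets.US_ASCII));

        // UTF-8 preserves them.
        System.out.println("UTF-8:    " + roundTrip(sample, StandardCharsets.UTF_8));
    }
}
```

Running this inside the container with and without LC_ALL=C.UTF-8 and comparing the reported default charset is one way to check whether the locale is the relevant factor in a given environment.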


Labels: bug

Status: Done
