
KV-pair IR serialization of CLP-encoded strings inherits text IR's ~2 GiB logtype/variable size limits #2175

@junhaoliao

Bug

The KV-pair IR serializer delegates to the text IR CLP-encoding functions when serializing string values that contain spaces. These text IR functions enforce an INT32_MAX (~2 GiB) limit on logtype and dictionary variable strings. Since virtually all log messages contain spaces, this limit effectively applies to all log event message strings serialized into KV-pair IR.

How the limit propagates

  1. serialize_value_string() checks if the string contains a space:

    • No space → calls serialize_string(), which supports up to UINT32_MAX (~4 GiB) — not affected.
    • Has space → calls serialize_clp_string(), which in turn calls four_byte_encoding::serialize_message() or eight_byte_encoding::serialize_message().
  2. serialize_logtype() fails if the logtype exceeds INT32_MAX:

    } else if (length <= INT32_MAX) {
        ir_buf.push_back(cProtocol::Payload::LogtypeStrLenInt);
        serialize_int(static_cast<int32_t>(length), ir_buf);
    } else {
        // Logtype is too long for encoding
        return false;
    }
  3. DictionaryVariableHandler::operator() similarly fails if a single dictionary variable string exceeds INT32_MAX.
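
The propagation steps above can be modeled with a minimal sketch. It is not the CLP implementation: choose_path(), fits(), and the StringPath enum are hypothetical names introduced for illustration, and the sketch treats the logtype length as a single number rather than deriving it from the message.

```cpp
#include <cstdint>
#include <limits>
#include <string_view>

// Hypothetical model of the two serialization paths; names are illustrative only.
enum class StringPath { Plain, ClpEncoded };

// Mirrors the space check described for serialize_value_string().
inline StringPath choose_path(std::string_view value) {
    return value.find(' ') == std::string_view::npos ? StringPath::Plain
                                                     : StringPath::ClpEncoded;
}

// Largest length each path can encode, per the limits described in this issue.
inline std::uint64_t max_encodable_length(StringPath path) {
    if (StringPath::Plain == path) {
        // Plain strings use an unsigned 32-bit length field.
        return std::numeric_limits<std::uint32_t>::max();
    }
    // Logtypes and dictionary variables use a signed 32-bit length field.
    return static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// `encoded_length` is the plain string length, or the logtype (or single
// dictionary variable) length on the CLP-encoded path.
inline bool fits(StringPath path, std::uint64_t encoded_length) {
    return encoded_length <= max_encodable_length(path);
}
```

Under this model, a 3,000,000,000-byte value fits on the plain path but not on the CLP-encoded path, which is the asymmetry this issue describes.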

The KV-pair IR protocol constants (protocol_constants.hpp) define LogtypeStrLenInt (0x23) using a signed 32-bit integer for the length field, which is the root cause of the ~2 GiB ceiling. In contrast, the KV-pair IR native string path uses StrLenUInt (0x43) with an unsigned 32-bit integer (protocol_constants.hpp:61), supporting ~4 GiB.

Additionally, the IR stream preamble metadata (serialize_metadata()) is capped at UINT16_MAX (65,535 bytes), shared by both text IR and KV-pair IR.

Size limits

| Limit | Value | Source |
| --- | --- | --- |
| CLP-encoded logtype string | ~2 GiB (INT32_MAX) | encoding_methods.cpp:84-89 |
| CLP-encoded dictionary variable | ~2 GiB (INT32_MAX) | encoding_methods.cpp:61-65 |
| Plain string (KV-pair IR native) | ~4 GiB (UINT32_MAX) | utils.cpp:45-48 |
| IR stream preamble metadata | 64 KiB (UINT16_MAX) | utils.cpp:25-30 |
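
For reference, the exact byte values behind the approximate sizes in the table can be checked at compile time. This is a standalone sketch assuming C++17; the constant names (cClpEncodedMax and so on) are hypothetical and do not appear in the CLP sources.

```cpp
#include <cstdint>
#include <limits>

constexpr std::uint64_t cKiB = 1024;
constexpr std::uint64_t cGiB = 1024 * 1024 * cKiB;

// Hypothetical names for the three integer ceilings listed in the table.
constexpr std::uint64_t cClpEncodedMax = std::numeric_limits<std::int32_t>::max();
constexpr std::uint64_t cPlainStringMax = std::numeric_limits<std::uint32_t>::max();
constexpr std::uint64_t cMetadataMax = std::numeric_limits<std::uint16_t>::max();

// Each limit is one byte short of the rounded size quoted in the table.
static_assert(cClpEncodedMax == 2 * cGiB - 1);   // 2,147,483,647 bytes
static_assert(cPlainStringMax == 4 * cGiB - 1);  // 4,294,967,295 bytes
static_assert(cMetadataMax == 64 * cKiB - 1);    // 65,535 bytes
```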

Practical impact today

These limits are not the binding constraint today. The log-converter has a 64 MiB buffer limit per log event (LogConverter.hpp:41) which is hit first, and JSON ingestion defaults to 512 MiB per record (--max-document-size). The ~2 GiB IR limit would become the bottleneck only if those upstream limits are raised.

Future impact on log-viewer

Once #2174 (MongoDB 16 MiB BSON limit for search results) is resolved and large log events can be retrieved through the WebUI, the log-viewer's extraction path could also be affected. Currently:

  • The clp_s log-viewer extracts ordered JSON chunks via JsonConstructor (clp-s x --ordered), which does not go through the KV-pair IR serializer. However, if this extraction path is ever changed to use KV-pair IR, the same limits would apply.
  • The clp engine log-viewer extracts text IR streams via clo i, but the text IR logtype/variable limits were already enforced during ingestion, so extraction would not introduce new failures.

CLP version

3b4d13f

Environment

Any environment that serializes string values into KV-pair IR. Today this is the log-converter during unstructured text ingestion in the CLP-JSON package, though the 64 MiB LogConverter buffer limit is hit first.

Reproduction steps

  1. Bypass the LogConverter's 64 MiB buffer by calling the KV-pair IR Serializer directly (e.g., in a unit test) with a string value larger than INT32_MAX (~2 GiB) that includes at least one space.
  2. Observe that serialize_logtype() returns false, causing the serialization to fail.
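
A reproduction needs an input that takes the CLP-encoded path and whose logtype exceeds INT32_MAX. The helper below is a hypothetical sketch for building such a value; make_spaced_value() is not part of CLP. It assumes that letter-only tokens are kept as static text in the logtype, so the logtype is roughly as long as the input. The call into the actual Serializer is intentionally left out, since its exact signature is not shown in this issue.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a `length`-byte string of 'a' characters with a
// single space in the middle, so that the value contains a space and no digits.
inline std::string make_spaced_value(std::size_t length) {
    std::string value(length, 'a');
    if (length > 0) {
        value[length / 2] = ' ';
    }
    return value;
}
```

In a unit test, make_spaced_value(static_cast<std::size_t>(INT32_MAX) + 1) would produce the oversized input. That allocates more than 2 GiB of memory, so such a test would likely need to be opt-in.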

Labels

bug (Something isn't working)
