Skip to content

Add parse_formatted_bytes function to parse human-readable byte size#20107

Merged
lnkuiper merged 22 commits intoduckdb:mainfrom
EtgarDev:feature/etgarsh/parse-formatted-bytes
Jan 6, 2026
Merged

Add parse_formatted_bytes function to parse human-readable byte size#20107
lnkuiper merged 22 commits intoduckdb:mainfrom
EtgarDev:feature/etgarsh/parse-formatted-bytes

Conversation

@EtgarDev
Copy link
Contributor

@EtgarDev EtgarDev commented Dec 9, 2025

This PR introduces a new scalar function parse_formatted_bytes(VARCHAR) to DuckDB, which converts a human-readable byte size (e.g. "16 KiB) into a numeric (UBIGINT) number of bytes.

Changes:

  • Added parse_formatted_bytes (throws errors) and try_parse_formatted_bytes (returns NULLs on Errors)
  • Implementation of parsing logic in StringUtil (originally taken from DBConfig)
  • Updated the existing component (DBConfig) that relied on the old parsing logic to use the new StringUtil version.
  • Tests for the new scalar function

Why?

The initial motivation came from the implementation of #19726, which aimed to expose another column representing byte sizes.
During the review, @Mytherin suggested exposing a dedicated function instead, keeping duckdb_settings lean while providing a more generally useful utility.
try_parse_formatted_bytes can now be reused across various use cases where parsing byte-size strings is needed.

Note - try(parse_formatted_bytes) acted differently than expected, it seems like it doesn't handle runtime Exceptions. Exposing try_parse_formatted_bytes was my workaround.

Comment on lines +328 to +334
// Make sure the result is not greater than `idx_t` max value
constexpr double max_value = static_cast<double>(NumericLimits<idx_t>::Maximum());
const double double_multiplier = static_cast<double>(multiplier);

if (limit > (max_value / double_multiplier)) {
throw ParserException("Memory value out of range: value is too large");
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mytherin
This is my only change from the original implementation (from DBConfig).
There used to be a bug when we try to set values like 343897438957348957 TiB (the process actually crashed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that the bug doesn't reproduce on 1.4.2 but does reproduce for me in main and is fixed by this implementation.

return to_string(array[0]) + (bytes == 1 ? " byte" : " bytes");
}

idx_t StringUtil::ParseFormattedBytes(const string &arg) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uploaded the diff between the two, to make the review easier for this method
Image

@EtgarDev EtgarDev marked this pull request as ready for review December 11, 2025 11:08
@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 11, 2025 12:01
Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I have a few comments:

return to_string(array[0]) + (bytes == 1 ? " byte" : " bytes");
}

idx_t StringUtil::ParseFormattedBytes(const string &arg) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of throwing an exception in the function, could you implement this as:

bool StringUtil::TryParseFormattedBytes(const string &arg, idx_t result) {
    // ....
}

idx_t StringUtil::ParseFormattedBytes(const string &arg) {
    idx_t result;
    if (!TryParseFormattedBytes(arg, result)) {
        // throw
    }
}

This avoids having to do try/catch in other places, and the need to add the new exception class NonNumericMemoryException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we end up throwing the same exception for all failure cases,
with this change, DBConfig would no longer be able to catch a specific exception type (as it does now) since all parsing errors would collapse into a single generic one. Then, the client loses the ability to distinguish between different error scenarios in the existing behaviour.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could return a string with the error message instead? If this string is empty(), parsing has succeded without errors. If a non-empty string is returned, we can throw a ParserException with the string

Copy link
Contributor Author

@EtgarDev EtgarDev Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lnkuiper I understand, I'll follow this pattern then :)
Just to better understand the design preference here - is returning an error message (instead of throwing a typed exception) considered a more idiomatic approach in util functions in the codebase?
I'm asking because I'd love to understand the rationale in cases like this, especially around consistency with existing error handling patterns in DuckDB.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we prefer to return a bool/enum/string that tells us whether the function was successful, rather than try/catch, especially for "shallow" function calls like this.

We do have some try/catch, but this is when the error you need to catch is deeply nested in a function call chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lnkuiper I do notice that it makes the part when I handle a specific exception (in order convert it to a more specific exception 'memory limit) less neat.
I decided just to remove the limit from the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lnkuiper I think I understand the direction, but I am still a bit confused about how we are expected to handle diffrent kinds of errors with this pattern.

I'll give a concrete example from my latest changes.

I no longer want the parse_formatted_bytes to be responsible for negative to infinite conversion anymore.
I still want the DBConfig to support negative memories and to convert it to infinite (to preserve behaviour).
But now, it means that under TryParseFormattedBytes I should return an error string:

	if (limit < 0) {
		return "Memory cannot be negative";
	}

And then in DBConfig::ParseMemoeryLimit:

	idx_t result;
	string error = StringUtil::TryParseFormattedBytes(arg, result);

	if (!error.empty()) {
		if (error == "Memory cannot be negative") {
			result = -1;
		} else {
			throw ParserException(error);
		}
	}

	return result;

This feels a bit odd to handle errors like this, since it requires branching on specific error strings to decide control flow.
My intuition would be that if we need to distinguish between different failure modes, it might be clearer to return some structured error (for example an enum or error code), but that also starts to look like reimplementing exceptions in a different form.

So I wanted to sanity check - is matching on error strings the intended pattern here, or is there a cleaner way you would expect this to be handled? Curious to hear your thoughts :)

} catch (const Exception &) {
mask.SetInvalid(index);
return static_cast<idx_t>(0);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to remove the try/catch here now and set NULL based on string.empty() when using TryParseFormattedBytes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove try_parse_formatted_bytes entirely actually. This is what try is for - you can use try(parse_formatted_bytes(...)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mytherin I tried using the try on it and it didn't work. It seems like it doesn't catch exceptions during runtime, but is used only for casting failures and such. Am I missing something?

Copy link
Contributor Author

@EtgarDev EtgarDev Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lnkuiper I also removed the try-catch for now

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TRY operator only works for these exception types.

bool Exception::IsExecutionError(ExceptionType type) {
	switch (type) {
	case ExceptionType::INVALID_INPUT:
	case ExceptionType::OUT_OF_RANGE:
	case ExceptionType::CONVERSION:
		return true;
	default:
		return false;
	}
}

I think it makes sense to throw a InvalidInputException in your function, but keep it a ParserException for the config option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh cool! I didn't notice that. Thanks! :)

…o remove infinite mapping for invalid inputs and improve error messages
…tes` for improved error handling and cleaner parsing logic
@EtgarDev EtgarDev marked this pull request as ready for review December 15, 2025 12:42
@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 16, 2025 13:36
@EtgarDev EtgarDev marked this pull request as ready for review December 17, 2025 07:05
@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 17, 2025 12:48
@EtgarDev EtgarDev marked this pull request as ready for review December 22, 2025 15:45
@EtgarDev EtgarDev requested a review from lnkuiper December 22, 2025 15:45
Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just one more typo then this can be merged :)

#include "core_functions/scalar/string_functions.hpp"

namespace duckdb {
static void PraseFormattedBytesFunction(DataChunk &args, ExpressionState &state, Vector &result) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ParseFormattedBytesFunction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed :)

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 5, 2026 08:19
@EtgarDev EtgarDev requested a review from lnkuiper January 5, 2026 08:21
@EtgarDev EtgarDev marked this pull request as ready for review January 5, 2026 08:21
Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! LGTM

@lnkuiper lnkuiper merged commit 88cd26b into duckdb:main Jan 6, 2026
60 checks passed
@Tishj
Copy link
Contributor

Tishj commented Jan 7, 2026

	idx_t result;
	string error = StringUtil::TryParseFormattedBytes(arg, result);

	if (!error.empty()) {
		if (error == "Memory cannot be negative") {
			result = -1;
		} else {
			throw ParserException(error);
		}
	}

This causes a warning, when setting result = -1;
This should be: result = DConstants::INVALID_INDEX;

lnkuiper added a commit that referenced this pull request Jan 15, 2026
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Feb 26, 2026
Add `parse_formatted_bytes` function to parse human-readable byte size (duckdb/duckdb#20107)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Feb 26, 2026
Add `parse_formatted_bytes` function to parse human-readable byte size (duckdb/duckdb#20107)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants