Add parse_formatted_bytes function to parse human-readable byte size#20107
Add parse_formatted_bytes function to parse human-readable byte size#20107lnkuiper merged 22 commits intoduckdb:mainfrom
parse_formatted_bytes function to parse human-readable byte size#20107Conversation
| // Make sure the result is not greater than `idx_t` max value | ||
| constexpr double max_value = static_cast<double>(NumericLimits<idx_t>::Maximum()); | ||
| const double double_multiplier = static_cast<double>(multiplier); | ||
|
|
||
| if (limit > (max_value / double_multiplier)) { | ||
| throw ParserException("Memory value out of range: value is too large"); | ||
| } |
There was a problem hiding this comment.
@Mytherin
This is my only change from the original implementation (from DBConfig).
There used to be a bug when we try to set values like 343897438957348957 TiB (the process actually crashed)
There was a problem hiding this comment.
I can see that the bug doesn't reproduce on 1.4.2 but does reproduce for me in main and is fixed by this implementation.
…arse_formatted_bytes` by setting it to infinite
src/common/string_util.cpp
Outdated
| return to_string(array[0]) + (bytes == 1 ? " byte" : " bytes"); | ||
| } | ||
|
|
||
| idx_t StringUtil::ParseFormattedBytes(const string &arg) { |
- Remove unused import
lnkuiper
left a comment
There was a problem hiding this comment.
Thanks for the PR! I have a few comments:
src/common/string_util.cpp
Outdated
| return to_string(array[0]) + (bytes == 1 ? " byte" : " bytes"); | ||
| } | ||
|
|
||
| idx_t StringUtil::ParseFormattedBytes(const string &arg) { |
There was a problem hiding this comment.
Instead of throwing an exception in the function, could you implement this as:
bool StringUtil::TryParseFormattedBytes(const string &arg, idx_t result) {
// ....
}
idx_t StringUtil::ParseFormattedBytes(const string &arg) {
idx_t result;
if (!TryParseFormattedBytes(arg, result)) {
// throw
}
}This avoids having to do try/catch in other places, and the need to add the new exception class NonNumericMemoryException.
There was a problem hiding this comment.
But then we end up throwing the same exception for all failure cases,
with this change, DBConfig would no longer be able to catch a specific exception type (as it does now) since all parsing errors would collapse into a single generic one. Then, the client loses the ability to distinguish between different error scenarios in the existing behaviour.
There was a problem hiding this comment.
Could return a string with the error message instead? If this string is empty(), parsing has succeded without errors. If a non-empty string is returned, we can throw a ParserException with the string
There was a problem hiding this comment.
@lnkuiper I understand, I'll follow this pattern then :)
Just to better understand the design preference here - is returning an error message (instead of throwing a typed exception) considered a more idiomatic approach in util functions in the codebase?
I'm asking because I'd love to understand the rationale in cases like this, especially around consistency with existing error handling patterns in DuckDB.
There was a problem hiding this comment.
Yes, we prefer to return a bool/enum/string that tells us whether the function was successful, rather than try/catch, especially for "shallow" function calls like this.
We do have some try/catch, but this is when the error you need to catch is deeply nested in a function call chain.
There was a problem hiding this comment.
@lnkuiper I do notice that it makes the part when I handle a specific exception (in order convert it to a more specific exception 'memory limit) less neat.
I decided just to remove the limit from the string.
There was a problem hiding this comment.
@lnkuiper I think I understand the direction, but I am still a bit confused about how we are expected to handle diffrent kinds of errors with this pattern.
I'll give a concrete example from my latest changes.
I no longer want the parse_formatted_bytes to be responsible for negative to infinite conversion anymore.
I still want the DBConfig to support negative memories and to convert it to infinite (to preserve behaviour).
But now, it means that under TryParseFormattedBytes I should return an error string:
if (limit < 0) {
return "Memory cannot be negative";
}And then in DBConfig::ParseMemoeryLimit:
idx_t result;
string error = StringUtil::TryParseFormattedBytes(arg, result);
if (!error.empty()) {
if (error == "Memory cannot be negative") {
result = -1;
} else {
throw ParserException(error);
}
}
return result;This feels a bit odd to handle errors like this, since it requires branching on specific error strings to decide control flow.
My intuition would be that if we need to distinguish between different failure modes, it might be clearer to return some structured error (for example an enum or error code), but that also starts to look like reimplementing exceptions in a different form.
So I wanted to sanity check - is matching on error strings the intended pattern here, or is there a cleaner way you would expect this to be handled? Curious to hear your thoughts :)
…eaner error handling and improved parsing logic
| } catch (const Exception &) { | ||
| mask.SetInvalid(index); | ||
| return static_cast<idx_t>(0); | ||
| } |
There was a problem hiding this comment.
I think we should be able to remove the try/catch here now and set NULL based on string.empty() when using TryParseFormattedBytes.
There was a problem hiding this comment.
I think we should remove try_parse_formatted_bytes entirely actually. This is what try is for - you can use try(parse_formatted_bytes(...)).
There was a problem hiding this comment.
@Mytherin I tried using the try on it and it didn't work. It seems like it doesn't catch exceptions during runtime, but is used only for casting failures and such. Am I missing something?
There was a problem hiding this comment.
@lnkuiper I also removed the try-catch for now
There was a problem hiding this comment.
The TRY operator only works for these exception types.
bool Exception::IsExecutionError(ExceptionType type) {
switch (type) {
case ExceptionType::INVALID_INPUT:
case ExceptionType::OUT_OF_RANGE:
case ExceptionType::CONVERSION:
return true;
default:
return false;
}
}I think it makes sense to throw a InvalidInputException in your function, but keep it a ParserException for the config option.
There was a problem hiding this comment.
oh cool! I didn't notice that. Thanks! :)
… conversions to infinite
…o remove infinite mapping for invalid inputs and improve error messages
…tes` for improved error handling and cleaner parsing logic
…and updating invalid input messages.
…h/parse-formatted-bytes
…es in `extension_entries.hpp`
extension/core_functions/scalar/string/parse_formatted_bytes.cpp
Outdated
Show resolved
Hide resolved
…om the codebase.
…e_formatted_bytes` tests with improved error handling and try-function scenarios.
lnkuiper
left a comment
There was a problem hiding this comment.
Thanks! Just one more typo then this can be merged :)
| #include "core_functions/scalar/string_functions.hpp" | ||
|
|
||
| namespace duckdb { | ||
| static void PraseFormattedBytesFunction(DataChunk &args, ExpressionState &state, Vector &result) { |
There was a problem hiding this comment.
Nit: ParseFormattedBytesFunction
There was a problem hiding this comment.
good catch, fixed :)
…eFormattedBytesFunction`.
…h/parse-formatted-bytes
lnkuiper
left a comment
There was a problem hiding this comment.
Thanks for the changes! LGTM
idx_t result;
string error = StringUtil::TryParseFormattedBytes(arg, result);
if (!error.empty()) {
if (error == "Memory cannot be negative") {
result = -1;
} else {
throw ParserException(error);
}
}This causes a warning, when setting |
…20510) Addressing @Tishj 's comment: #20107 (comment)
Add `parse_formatted_bytes` function to parse human-readable byte size (duckdb/duckdb#20107)
Add `parse_formatted_bytes` function to parse human-readable byte size (duckdb/duckdb#20107) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>

This PR introduces a new scalar function
parse_formatted_bytes(VARCHAR)to DuckDB, which converts a human-readable byte size (e.g. "16 KiB) into a numeric (UBIGINT) number of bytes.Changes:
parse_formatted_bytes(throws errors) andtry_parse_formatted_bytes(returnsNULLs on Errors)StringUtil(originally taken fromDBConfig)StringUtilversion.Why?
The initial motivation came from the implementation of #19726, which aimed to expose another column representing byte sizes.
During the review, @Mytherin suggested exposing a dedicated function instead, keeping
duckdb_settingslean while providing a more generally useful utility.try_parse_formatted_bytescan now be reused across various use cases where parsing byte-size strings is needed.Note -
try(parse_formatted_bytes)acted differently than expected, it seems like it doesn't handle runtimeExceptions. Exposingtry_parse_formatted_byteswas my workaround.