Conversation
| for (auto &[tag_key, tag_val] : tags) { | ||
| #ifndef NDEBUG | ||
| // In debug build, verify tag_val is printable. | ||
| for (auto &c : tag_val) { | ||
| RAY_CHECK(isprint(c)) << "Found unprintable character " << static_cast<int>(c) | ||
| << " in " << tag_val; | ||
| } | ||
| combined_tags.emplace_back(TagKeyType::Register(t.first), std::move(t.second)); | ||
| #endif // NDEBUG | ||
| combined_tags.emplace_back(TagKeyType::Register(tag_key), std::move(tag_val)); |
There was a problem hiding this comment.
Seems this code is very similar to the above code. Could we dedup them?
There was a problem hiding this comment.
Do we have chance to run this code in debug mode? I'm worried that it won't catch things if we always use NDEBUG.
Also if the message fail to send due to non-printable chars, do we get some error code that we can check fail?
There was a problem hiding this comment.
Extracted the logic to a separate function.
This logic may not run for all possible input in debug mode, but I'm not sure if verification / applying fixes are useful in production mode. Unprintable char in tag value is usually a sign of misconfiguration or data corruption. Exporting these data seem to have low value. I doubt any dashboard will be built with the ? character in metric / dimension name. And recording these data can result in high memory usage in opencensus. Dropping the malformed data as in opencensus seems like the best approach in production. Later on if it turns out opencensus does not log the malformed data, we can have verification with error logging for 1 in 1000 CheckPrintableChar() calls.
There was a problem hiding this comment.
Btw this loop costs 1.5~2% of the benchmark.
…ats collection (ray-project#22054)" This reverts commit 9ac3f68.
Why are these changes needed?
This PR avoids some unnecessary copying and branching when recording event stats. It improves / recovers ~10% of
single_client_get_calls_Plasma_Storeperformance. On AWS EC2m5.8xlarge,single_client_get_calls_Plasma_Storecurrent: ~5200/ssingle_client_get_calls_Plasma_Storewith PR: ~5800/sWhen
RAY_event_stats=0,single_client_get_calls_Plasma_Storecan reach ~6800/s. If we want to optimize further, we can record data in opencensus only in intervals, or when the data are exported.Related issue number
#22053
Checks
scripts/format.shto lint the changes in this PR.