Add support for C++ std::regex to benchmarks#459
Add support for C++ std::regex to benchmarks#459mkrupcale wants to merge 3 commits intorust-lang:masterfrom
std::regex to benchmarks#459Conversation
* bench/Cargo.toml: add `re-stdcpp` feature * bench/build.rs: add `cstdcpp` library to bench build * bench/compile: add `re-stdcpp` feature to bench compile script * bench/run: add `re-stdcpp` feature to bench run script * bench/src/bench.rs: use `ffi::stdcpp::Regex`, define its `text!` macro, and `Text` type * bench/src/ffi/mod.rs: declare `stdcpp` module * bench/src/ffi/stdcpp.cpp: implement C API using C++ `std::regex` * bench/src/ffi/stdcpp.rs: Rust `Regex` API implementation using C++ `std::regex` C API wrapper * bench/src/main.rs: add stdcpp to bench main * bench/src/misc.rs: - do not run `no_exponential` benchmark for `re-stdcpp` feature because `libstdc++` `std::regex` implementation currently seems to have exponential behavior here - do not run `match_class_unicode` benchmark for `re-stdcpp` feature because `std::regex` ECMAScript grammar does not support unicode character classes * bench/src/sherlock.rs: - do not run `name_sherlock_nocase`, `name_holmes_nocase`, `name_sherlock_holmes_nocase`, `name_alt3_nocase`, `name_alt4_nocase`, `name_alt5_nocase`, `the_nocase`, `everything_greedy_nl`, and `line_boundary_sherlock_holmes` benchmarks for `re-stdcpp` feature because `std::regex` ECMAScript grammar does not support inline modifier syntax - do not run `letters`, `letters_upper`, and `letters_lower` benchmarks for `re-stdcpp` feature because `std::regex` ECMAScript grammar does not support unicode character classes - use a different regex for `everything_greedy` benchmark because `std::regex` '.' does not match '\r' - `words` benchmark for `std::regex` matches RE2 test result, so use that test for `re-stdcpp` feature as well - do not run `holmes_coword_watson` benchmark for `re-stdcpp` feature because `libstdc++` `std::regex` implementation currently seems to have exponential behavior here
|
This is amazing! Thanks so much for doing this. I am also surprised at how slow |
|
I'm not a C++ expert, but I can't spot any obvious deficiencies in your bindings either. e.g., I don't see any extra allocations going on. |
|
Yeah, I was pretty surprised by the results as well. The implementation here follows the RE2 one fairly closely, obviously adjusted to use the std::regex API, so I think at least those two implementations should be comparable. On the other hand, I wasn't sure how much overhead there was between the Rust native implementation and the FFI implementations, especially considering the use of unsafe blocks for the FFI implementations. I would also like to try using LLVM/Clang when building to test the |
The only overhead is probably a function call. One possibility would be to implement the iterator inside the glue code, which might give you inlining. However, this will likely only help in benchmarks that have a ton of matches. (Most don't.) Benchmarking against Clang probably won't show too much of a difference, especially with these speeds. Benchmarking Boost is a good idea though! :-) |
* bench/Cargo.toml: add `libcxx` feature
* bench/build.rs: link against `libc++` when `libcxx` feature is specified with `re-stdcpp` feature. The `cc::Build::cpp_set_stdlib("c++")` method is not used because GCC does not support the `-stdlib` flag.
* bench/run: add `stdcpp-libcxx` option for benchmarking `libc++` `std::regex` implementation.
This requires the user to specify the compiler `CXX` and compiler flags `CXXFLAGS` necessary for compiling with `libc++`[1].
[1] https://libcxx.llvm.org/docs/UsingLibcxx.html
|
I've added support for linking against the |
|
I've added support for benchmarking against Boost.Regex. The benchmark implementation reuses the Rust and C/C++ FFIs from |
|
Small update: I've enabled some additional tests for |
This is very similar to the `std::regex` benchmark implementation since Boost.Regex and `std::regex` have very similar APIs and regex grammar support. As such, it uses the `stdcpp` Rust and C FFIs to reduce code duplication.
* bench/Cargo.toml: add `re-boost` feature
* bench/build.rs: add `cboost` library to bench build. This uses a compiler preprocessor definition to indicate whether or not to use Boost when compiling the `stdcpp` FFI.
* bench/compile: add `re-boost` feature to bench compile script
* bench/run: add `re-boost` feature to bench run script
* bench/src/bench.rs: use `ffi::stdcpp::Regex`, define its `text!` macro, and `Text` type for feature `re-boost`
* bench/src/ffi/mod.rs: declare `stdcpp` module for `re-boost` feature
* bench/src/ffi/stdcpp.cpp: implement C API using C++ `boost::regex`. The Boost.Regex API is very similar to the `std::regex` API and therefore only uses a different namespace.
* bench/src/main.rs: add boost to bench main
* bench/src/misc.rs:
- do not run `match_class_unicode` benchmark for `re-boost` feature because `boost::regex` ECMAScript grammar does not support unicode character classes
* bench/src/sherlock.rs:
- do not run `letters`, `letters_upper`, and `letters_lower` benchmarks for `re-boost` feature because `boost::regex` ECMAScript grammar does not support unicode character classes
- use a different regex for `everything_greedy` benchmark because `boost::regex` '.' does not match '\r'
- `words` benchmark for `boost::regex` matches RE2 test result, so use that test for `re-boost` feature as well. Also fixes conditional compilation issue for `re-stdcpp`.
- do not run `holmes_coword_watson` benchmark for `re-boost` feature because Boost.Regex implementation currently seems to have exponential behavior here
|
Fixed some conditional compilation issues which were not enabling some tests as I had thought. On a side note, I do not particularly like the syntax for conditional compilation in Rust. It is not obvious how it works when compared to C/C++ preprocessor directives. In any case, I have some comparisons (82 tests) between |
|
A comment about libstdc++ (and libc++) |
|
@mkrupcale I finally had a chance to test this out and it all works! Thanks so much for doing this. :-) |
|
@BurntSushi no problem. Is there a reason why this pull request was closed, however? Seeing as there are benchmarks against other C and C++ regex implementations in this codebase, I'm not sure what the problem is with merging this additional set of C++ regex benchmarks. While I was doing this for my own interests, I don't see why it couldn't be useful to others and for future comparisons. |
|
@mkrupcale Oh! This was merged in 4e3a107, which should be linked to this PR. I did the merge manually in order to clean up the commit messages. Sadly, the Github UI doesn't really distinguish between "manually merged PR" and "closed PR without merging." |
|
Ah, that's my mistake. I didn't realize that was a reference to the merge commit. Apologies. |
|
Yeah, lately I've been trying to keep a cleaner history for all of my projects by doing the following:
Your PR however was lovingly split into 3 sensible commits. So the only way to merge that while also touching up the messages is manually. It's not clear if I will give up on this task yet or not, but so far it has been working somewhat well. :-) |
Detailed changes
re-stdcppfeaturecstdcpplibrary to bench buildre-stdcppfeature to bench compile scriptre-stdcppfeature to bench run scriptffi::stdcpp::Regex, define itstext!macro, andTexttypestdcppmodulestd::regexRegexAPI implementation using C++std::regexC API wrapperno_exponentialbenchmark forre-stdcppfeature becauselibstdc++std::regeximplementation currently seems to have exponential behavior herematch_class_unicodebenchmark forre-stdcppfeature becausestd::regexECMAScript grammar does not support unicode character classesname_sherlock_nocase,name_holmes_nocase,name_sherlock_holmes_nocase,name_alt3_nocase,name_alt4_nocase,name_alt5_nocase,the_nocase,everything_greedy_nl, andline_boundary_sherlock_holmesbenchmarks forre-stdcppfeature becausestd::regexECMAScript grammar does not support inline modifier syntaxletters,letters_upper, andletters_lowerbenchmarks forre-stdcppfeature becausestd::regexECMAScript grammar does not support unicode character classeseverything_greedybenchmark becausestd::regex'.' does not match '\r'wordsbenchmark forstd::regexmatches RE2 test result, so use that test forre-stdcppfeature as wellholmes_coword_watsonbenchmark forre-stdcppfeature becauselibstdc++std::regeximplementation currently seems to have exponential behavior hereResults
Unfortunately, the
libstdc++std::regeximplementation (version 7.3.1) on my machine did not do too well, often being tens or hundreds of times slower than PCRE2 or RE2. Some example results forlibstdc++are shown here: