Conversation
|
The data is obtained by My conclusion is there's no significant improvement in speed, but in memory usage. The difference in run-time is insignificant. My refactored version has 3.22% fewer calls to memory allocation in total. The memory allocation calls per second is less significant, and it's 7.03% fewer. The difference in temporary memory allocation in total and per second are not significant My refactored version has 12.79% fewer peak memory usage, and it's very significant. My refactored version has 4.14% fewer peak resident memory. The only downside, I have significantly much more leaked memory (25.90% increase). This should not be a big deal though. And notice that |
|
In the latest forced push, I replaced regex based matching with url parsing based matching. So for example, to extract repo and user name from a github URL, there's really no need to use regex at all. |
|
This is great! Love the improvements. 👏 |
|
Need to do more investigation on the performance changes. Then should Also, the CI would fail unless the library component is actually published. |
Thanks. 👍
Yes, please.
All good. Just let me know when this is ready for review and I'll try it locally and then we can merge this. |
|
To replicate, you need to install heaptrack git checkout develop
cargo clean && cargo build --release --all
mkdir develop
cd develop
for _i in {0..9}; do heaptrack ../target/release/lychee --no-progress -E https://wiki.archlinux.org/index.php/List_of_applications &> /dev/null; done
popd
git checkout master
cargo clean && cargo build --release
mkdir master
cd master
for _i in {0..9}; do heaptrack ../target/release/lychee --no-progress -E https://wiki.archlinux.org/index.php/List_of_applications &> /dev/null; done
popd
echo "ID, total_runtime, calls_to_alloc, calls_to_alloc_per_second, temp_mem_alloc, temp_mem_alloc_per_second, peak_heap_mem_MB, peak_RSS_MB, total_mem_leaked_KB" > data.csv
for _i in ./develop/heaptrack*; do echo "ID: develop"; heaptrack_print $_i | tail -n 6; done >> data.csv
for _i in ./master/heaptrack*; do echo "ID: master"; heaptrack_print $_i | tail -n 6; done >> data.csvThe next part is data cleaning. You can do that in 2,$s/[[:alnum:]| ()]*: //
2,$s/s\.$\|MB$\|KB$//
2,$s/\ (/\r/
2,$s/\/s)$//
2,$j
2s/ \(developer\|master\)/\r\1/g
2,$s/ /, /gThen
Next is to generate the output, you need install.packages(c("ggplot2", "ggpubr", "stringi", "data.table", "rlang", "rstatix"))
library(ggplot2)
library(ggpubr)
library(stringi)
library(data.table)
library(rlang)
library(rstatix)
DT <- data.table::fread(file = "data.csv")
pps <-
lapply(names(DT)[-1],
function (var) {
df <- DT[, .SD, .SDcols = var, by = ID]
caption <-
DT %>% t_test(new_formula(sym(var), sym("ID"))) %>% get_test_label(detailed = TRUE)
names(df) <- c("ID", "Measure")
ggplot(df, aes(x = ID, y = Measure, color = ID)) +
geom_boxplot() +
geom_jitter() +
labs(x = "", y = "", caption = caption) +
theme_minimal() +
theme(legend.position = "none")
})
labels <-
sapply(names(DT)[-1],
function (name) {
name %>%
stri_replace_all_fixed("_", " ") %>%
stri_trans_totitle() %>%
stri_replace_last_fixed("b", "B")
},
USE.NAMES = FALSE)
pp <-
eval(expr(
ggarrange(
!!!pps,
labels = labels,
ncol = 2,
nrow = 4,
font.label = list(size = 10)
)
))
ggexport(
pp,
filename = "figure1.png",
width = 1000,
height = 1400,
res = 100
) |
|
I'm replacing ad-hoc errors (via anyhow) to structural errors (via this-error). |
Yeah that was on the list anyway. Thanks for tackling this as well. 😀 |
|
I had handwritten the errors without using Another advantage is that I get rid of all dynamic dispatch of Errors. All errors have size known at compile time. The downside is I have to implement For As for the naming: |
|
If this looks good I will do some final clean ups and rebase the commits. |
| } | ||
| FALSE_POSITIVES.is_match(input) | ||
| pub fn is_false_positive(input: &str) -> bool { | ||
| input == FALSE_POSITIVE_PAT[0] |
There was a problem hiding this comment.
Even though FALSE_POSITIVE_PAT only contains one string so far, I wonder if we still want to iterate through all elements of that slice already so that we don't forget to update this code once we start adding more strings to FALSE_POSITIVE_PAT. What do you think?
There was a problem hiding this comment.
In the future we may want. So this is not future-proof.
But it's really hard to come up with a second 'false-positive' example.
Also, I haven't thought thoroughly but the naming 'false-positive' is not clear enough.
There was a problem hiding this comment.
True. Let's keep it like that then. Also can't come up with a better name at the moment. Maybe something like IGNORE_PATTERNS or KNOWN_PHONY_URLS could work. 🤔 We can come up with something better in the future.
lychee-lib/src/uri.rs
Outdated
| let url = if let Uri::Website(url) = self { | ||
| url | ||
| } else { | ||
| unsafe { std::hint::unreachable_unchecked() } |
There was a problem hiding this comment.
Just wanted to mention that I still have mixed feelings about the use of unsafe. 😄
It's fine and we can merge it like that, but I don't see the need for introducing unsafe if the perf improvement is not significant and there are no other significant upsides. It might be beneficial, but we'd have to measure that.
Yes, I'm aware that this totally safe in this case with the debug_assert, but why introduce unsafe when we can have a fully "safe" crate with 99% of the upsides?
My main concern is about future refactorings and missing an edge-case.
Sorry, don't want to start another long discussion. It's just a note that we have to revisit that topic and we might revert that decision in the future. 😅
There was a problem hiding this comment.
Well for this particular case, I'm able to avoid writing any explicit unwrap, expect, unrechable or std::hint::unreachable_unchecked.
if let Uri::Website(url) = self {
if matches!(url.domain(), Some("github.com") | Some("www.github.com")) {
if let Some(mut path) = url.path_segments() {
if let Some(owner) = path.next() {
if let Some(repo) = path.next() {
return Some((owner, repo));
}
}
}
}
}There was a problem hiding this comment.
Extracting this in a separate fn extract_github_info(url: &_) -> Option<(String, String)> { .. } would allow you to simply use the ? operator (Option<_> impls Try) and there would no need to be a single nesting layer 😃
There was a problem hiding this comment.
Good idea @drahnr. Would you be willing to open a PR for that?
There was a problem hiding this comment.
Actually, it was already implemented as extract_github in a follow up which I missed
|
That pull request is great! I learned a lot about making the code more ergonomic. Let's keep |
|
Well a few things happened during last few days. I further slimmed downed the new The only downside is we are not able to create create a superficial http error with empty message and a status code. The impact is minimal because we only need to do that in tests ( specifically I updated dependencies, and ordered in alphabetically. Now |
|
And about the performance gain. I wrote quite a lot on the benchmarks. To sum up
So
My guess is, To verify my hypothesis, I made a markdown file with top 1,000 websites in Alexa's rank, and removed any of them that timed out. As you can see, my refactored version IS faster for about 13%. |
|
@lebensterben, guess there is one more CI issue to fix before we can finally merge this: Source: https://github.com/lycheeverse/lychee/pull/208/checks?check_run_id=2331614860 |
|
@mre The error comes from L126 where it coverts a specific discriminant of And the rule for converting |
|
BTW... I've found other interfaces for further refactor. But I will keep them in another PR. |
|
You didn't merge it right. Now the problem should be fixed. |
|
Since it passes both cargo test and cargo lint, I'm going to squash those commits. Merge with care. I almost lost all the work... Added a few links. See |
8db2423 to
5e6faae
Compare
Resolved the conflict via Github directly as it only showed a single file to be changed: the I'm seeing a list of conflicting files now. Given my bad experiences with the last merge, perhaps you could try to resolve that this time to avoid further churn. Other than that, let's merge it in. 😊 |
- The binary component and library component are separated as two
packages in the same workspace.
- `lychee` is the binary component, in `lychee-bin/*`.
- `lychee-lib` is the library component, in `lychee-lib/*`.
- Users can now install only the `lychee-lib`, instead of both
components, that would require fewer dependencies and faster
compilation.
- Dependencies for each component are adjusted and updated. E.g.,
no CLI dependencies for `lychee-lib`.
- CLI tests are only moved to `lychee`, as it has nothing to do
with the library component.
- `Status::Error` is refactored to contain dedicated error enum,
`ErrorKind`.
- The motivation is to delay the formatting of errors to strings.
Note that `e.to_string()` is not necessarily cheap (though
trivial in many cases). The formatting is no delayed until the
error is needed to be displayed to users. So in some cases, if
the error is never used, it means that it won't be formatted at
all.
- Replaced `regex` based matching with one of the following:
- Simple string equality test in the case of 'false positivie'.
- URL parsing based test, in the case of extracting repository and
user name for GitHub links.
- Either cases would be much more efficient than `regex` based
matching. First, there's no need to construct a state machine for
regex. Second, URL is already verified and parsed on its creation,
and extracting its components is fairly cheap. Also, this removes
the dependency on `lazy-static` in `lychee-lib`.
- `types` module now has a sub-directory, and its components are now
separated into their own modules (in that sub-directory).
- `lychee-lib::test_utils` module is only compiled for tests.
- `wiremock` is moved to `dev-dependency` as it's only needed for
`test` modules.
- Dependencies are listed in alphabetical order.
- Imports are organized in the following fashion:
- Imports from `std`
- Imports from 3rd-party crates, and `lychee-lib`.
- Imports from `crate::*` or `super::*`.
- No glob import.
- I followed suggestion from `cargo clippy`, with `clippy::all` and
`clippy:pedantic`.
|
@mre |
|
Hate to say it, but there is one more lint error. 😅 Compiling lychee-lib v0.6.0 (/home/runner/work/lychee/lychee/target/package/lychee-lib-0.6.0)
error[E0277]: the trait bound `types::error::ErrorKind: From<reqwest::error::Error>` is not satisfied
--> src/types/error.rs:127:45
|
127 | hubcaps::Error::Reqwest(e) => e.into(),
| ^^^^ the trait `From<reqwest::error::Error>` is not implemented for `types::error::ErrorKind`
|
= help: the following implementations were found:
<types::error::ErrorKind as From<(PathBuf, std::io::Error)>>
<types::error::ErrorKind as From<(std::string::String, url::ParseError)>>
<types::error::ErrorKind as From<(std::string::String, url::ParseError, fast_chemail::ParseError)>>
<types::error::ErrorKind as From<Infallible>>
and 6 others
= note: required because of the requirements on the impl of `Into<types::error::ErrorKind>` for `reqwest::error::Error`I swear I didn't touch it this time. 🤞 Also, shouldn't just work? I'm getting error[E0277]: the trait bound `url::Url: uri::_::_serde::Serialize` is not satisfied
--> lychee-lib/src/uri.rs:15:5
|
15 | /// Website URL or mail address
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `uri::_::_serde::Serialize` is not implemented for `url::Url`
|
= note: required by `uri::_::_serde::ser::SerializeStruct::serialize_field`
error[E0277]: the trait bound `url::Url: Deserialize<'_>` is not satisfied
--> lychee-lib/src/uri.rs:15:5
|
15 | /// Website URL or mail address
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `Deserialize<'_>` is not implemented for `url::Url`
|
= note: required by `next_element`
error[E0277]: the trait bound `url::Url: Deserialize<'_>` is not satisfied
--> lychee-lib/src/uri.rs:15:5
|
15 | /// Website URL or mail address
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `Deserialize<'_>` is not implemented for `url::Url`
|
= note: required by `next_value`
error: aborting due to 3 previous errorsWhat's the right command to install it locally? |
|
Mystery solved: |
|
This is some restriction of rust. These two are the same struct. I solved it by introduced a new enum variant. |
|
Nicely done. Should |
|
Yes it works. Once |
|
Yaaaay! You've done it. 🚀 Great success. Thanks a lot for pushing through and for your patience. 👏 |
|
I'll publish lychee-lib now. |


Changes:
lycheeis the binary component, inlychee-bin/*.lychee-libis the library component, inlychee-lib/*.lychee-lib, instead of both components, that would require fewer dependencies and faster compilation.lychee-lib.lychee, as it has nothing to do with the library component.Status::Erroris refactored to contain dedicated error enum,ErrorKind.e.to_string()is not necessarily cheap (though trivial in many cases). The formatting is no delayed until the error is needed to be displayed to users. So in some cases, if the error is never used, it means that it won't be formatted at all.regexbased matching with one of the following:regexbased matching. First, there's no need to construct a state machine for regex. Second, URL is already verified and parsed on its creation, and extracting its components is fairly cheap. Also, this removes the dependency onlazy-staticinlychee-lib.typesmodule now has a sub-directory, and its components are now separated into their own modules (in that sub-directory).lychee-lib::test_utilsmodule is only compiled for tests.wiremockis moved todev-dependencyas it's only needed fortestmodules.stdlychee-lib.crate::*orsuper::*.cargo clippy, withclippy::allandclippy:pedantic.Benchmark results of running the refactored version and the current origin/HEAD version is listed below.
The sample size is very small (n=6)
Frankly, I don't know which part contribute to most of improvement.