Concurrent fetchers by stefan-kolb · Pull Request #3881 · JabRef/jabref

stefan-kolb · 2018-03-22T14:12:42Z

Trying to improve the speed of the fulltext fetcher:

First, the authoritative publisher is queried for the PDF (if the user has access).
Afterwards, we query all remaining sources and take the first result.
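
The two steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `Fetcher` interface, the class name, and the use of `String` for the PDF location are assumptions made to keep the example self-contained.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class FirstResultWins {

    /** Hypothetical stand-in for a fulltext fetcher; not JabRef's actual interface. */
    interface Fetcher {
        Optional<String> findFullText(String doi);
    }

    static Optional<String> find(Fetcher authoritative, List<Fetcher> others, String doi) {
        // Step 1: ask the authoritative publisher first.
        try {
            Optional<String> result = authoritative.findFullText(doi);
            if (result.isPresent()) {
                return result;
            }
        } catch (RuntimeException e) {
            // Publisher lookup failed; fall through to the remaining sources.
        }
        if (others.isEmpty()) {
            return Optional.empty(); // invokeAny rejects an empty task list
        }
        // Step 2: query the remaining sources in parallel; the first success wins.
        // A task must throw when it finds nothing, otherwise invokeAny would accept it.
        List<Callable<String>> tasks = others.stream()
                .<Callable<String>>map(f -> () -> f.findFullText(doi)
                        .orElseThrow(() -> new IllegalStateException("no full text")))
                .collect(Collectors.toList());
        ExecutorService executor = Executors.newFixedThreadPool(others.size());
        try {
            return Optional.of(executor.invokeAny(tasks));
        } catch (ExecutionException e) {
            return Optional.empty(); // every remaining fetcher failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdownNow(); // cancel fetchers that are still running
        }
    }
}
```

`invokeAny` returns as soon as one task completes successfully, which is why an empty result is turned into an exception here.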

@JabRef/developers WDYT?

tobiasdiez

In general, I like the idea. However, to some extent the order of the fulltext fetchers is also important. For example, we probably prefer the published paper over just the preprint.

Siedlerchr

Yeah! Awesome!

stefan-kolb · 2018-03-22T14:26:48Z

@tobiasdiez If that really matters, we could use invokeAll and assign priorities to the fetchers, or something like that. I'm not sure whether (or how often) the preprint really differs from the published paper in practice.
That would still give us the parallelism.
And the more fetchers we add, the more it pays off: the time complexity stays ~ N instead of N*NumFetcher.
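
A minimal sketch of the invokeAll-plus-priorities idea, under stated assumptions: the `Fetcher` class, its numeric `priority` field (lower means more trusted), and the `String` result type are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PrioritizedFetch {

    /** Hypothetical fetcher with a priority; a lower number means a more trusted source. */
    static final class Fetcher {
        final int priority;
        final Function<String, Optional<String>> lookup;

        Fetcher(int priority, Function<String, Optional<String>> lookup) {
            this.priority = priority;
            this.lookup = lookup;
        }
    }

    static Optional<String> findBest(List<Fetcher> fetchers, String doi) {
        if (fetchers.isEmpty()) {
            return Optional.empty();
        }
        // Sort a copy so the futures come back ordered from most to least trusted.
        List<Fetcher> byPriority = new ArrayList<>(fetchers);
        byPriority.sort(Comparator.comparingInt(f -> f.priority));
        List<Callable<Optional<String>>> tasks = byPriority.stream()
                .<Callable<Optional<String>>>map(f -> () -> f.lookup.apply(doi))
                .collect(Collectors.toList());
        ExecutorService executor = Executors.newFixedThreadPool(tasks.size());
        try {
            // invokeAll blocks until every task is done and keeps the task order.
            for (Future<Optional<String>> future : executor.invokeAll(tasks)) {
                try {
                    Optional<String> result = future.get();
                    if (result.isPresent()) {
                        return result; // best-ranked fetcher that found something
                    }
                } catch (ExecutionException e) {
                    // This fetcher failed; try the next one in priority order.
                }
            }
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdown();
        }
    }
}
```

All fetchers still run in parallel, but the caller waits for the slowest one before a result is chosen.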

tobiasdiez · 2018-03-22T14:43:49Z

Priorities, or clustering into authority, journals, and preprints, would be a good solution in my opinion.

I know a few instances where authors didn't update their arXiv preprint with the revised and published version. Since even the slightest change could shift the equation or theorem numbering, having the published PDF is in general desirable.

stefan-kolb · 2018-03-23T12:27:12Z

I thought about this for a moment.
Two problems still persist:

* The time complexity then becomes max(v_1, ..., v_n) instead of min(v_1, ..., v_n).
* The decision complexity goes up: clustering, invoking all fetchers, checking which highest-rated authority has a URL, and downloading from that URL. Hopefully the download succeeds; if it does not, it will probably stay broken, because that source will always have the highest priority.
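
To spell out the comparison, a rough model may help. It assumes that v_i is the time fetcher i needs to answer, that all n fetchers start at the same moment, and that S is the (non-empty) set of fetchers that actually find a PDF; none of these symbols beyond v_i appear in the discussion itself.

```latex
\begin{align*}
T_{\text{sequential}}  &\le \sum_{i=1}^{n} v_i
    && \text{(one fetcher after another, worst case)} \\
T_{\text{first wins}}  &= \min_{i \in S} v_i
    && \text{(parallel, take the first success)} \\
T_{\text{select best}} &= \max_{1 \le i \le n} v_i
    && \text{(parallel, wait for all, then rank)}
\end{align*}
```

Since all v_i are non-negative, min <= max <= sum, so waiting for every fetcher is never slower than the worst sequential run but can be much slower than taking the first result.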

What really annoys me is that the download takes so much time now.
Your priority might be to get the right document.

Not sure which way to go here.

stefan-kolb · 2018-03-23T12:31:46Z

At the moment we try the original publisher for 10 seconds via the DOIResolution.
After that, some alternatives such as IEEE might be better than Google Scholar, but none of them will be the original publisher site anyhow!
Maybe we can risk it 😄
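
For reference, a time limit like the 10 seconds mentioned above could be enforced along these lines. This is a generic sketch using `Future.get` with a timeout; the helper name `callWithTimeout` is made up, and it is not a description of how DOIResolution is actually implemented.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedLookup {

    /** Runs the task on a worker thread and gives up once the timeout has elapsed. */
    static <T> Optional<T> callWithTimeout(Callable<T> task, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeout, unit));
        } catch (TimeoutException | ExecutionException e) {
            return Optional.empty(); // too slow or failed; the caller moves on to other sources
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdownNow(); // interrupts the task if it is still running
        }
    }
}
```

A caller would wrap the publisher lookup, for example `callWithTimeout(() -> lookup(doi), 10, TimeUnit.SECONDS)` with a hypothetical `lookup` method, and fall back to the other fetchers on an empty result.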

Siedlerchr · 2018-03-23T13:43:01Z

Maybe we can offer a switch, e.g. "Prefer official papers over preprints"?
Google Scholar may have the PDF, but its BibTeX data is worse than that of any other source.
We could also give Semantic Scholar a try: https://www.semanticscholar.org/
They link to the original paper.

stefan-kolb · 2018-03-23T15:18:13Z

I implemented a possible solution in #3882
Not 100% sure if it is correct, but it could be a step in the right direction.
Fetching should be a little faster now as all fetchers are queried in parallel.
I'm still not sure if I like it that way.

tobiasdiez · 2018-03-23T17:29:28Z

Ok, these are good points. What do you think about combining both approaches: we cluster the fetchers by trust level and run all fetchers in a cluster in parallel. Thus the running time is still the min within each cluster.

Something like:

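// Assumes TrustLevel is declared from most to least trusted and fetch returns a URL.
for (TrustLevel trustLevel : TrustLevel.values()) {
    // Wrap every fetcher of this trust level in a task.
    List<Callable<URL>> tasks = fetchers.stream()
            .filter(fetcher -> fetcher.getTrustLevel() == trustLevel)
            .<Callable<URL>>map(fetcher -> () -> fetcher.fetch(entry))
            .collect(Collectors.toList());

    if (tasks.isEmpty()) {
        continue; // invokeAny rejects an empty task list
    }
    try {
        // First successful fetcher in this bracket wins.
        return executor.invokeAny(tasks);
    } catch (ExecutionException e) {
        // No fetcher successful, continue in next trust level bracket
    }
}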

(Fetcher.fetch should throw an exception if no url could be found, otherwise the above code does not work).

stefan-kolb · 2018-03-24T16:12:04Z

It's probably easier and clearer to just run all fetchers as it is now and then select the best authority. I don't see much benefit in running the clusters one after another; it only adds multiple loop passes.
I'm not sure, but most of the time we will get the PDF from the lowest authority, which means we need to traverse the loop multiple times. I guess the average performance will be better when we run all of them in parallel.

stefan-kolb · 2018-03-26T14:47:50Z

Closed in favor of #3882

* Parallel fetchers and first wins
* Trust level implementation #3881
* Fix ordering
* Add tests
* Code style
* Trust levels
* Google refactoring
* Syntax error
* Reduce calls by one as mimeType is already known for fulltext as PDF #3879
* Fix test
* Unued imports
* Remove test
* Refactoring
* Feedback
* Graceful shutdown and force shutdown for non-terminating tasks
* 60 seconds
* Revert test
* Add Getters
* Mock tests
* Refactor to lambda
* Revert "60 seconds" This reverts commit 27fa0e8.
* Revert "Graceful shutdown and force shutdown for non-terminating tasks" This reverts commit f59a3c6.
* Remove unused method

stefan-kolb added 4 commits March 22, 2018 13:56

Increase timeout as DOI resolution often fails

d2c8b91

Return DOI page if it directly redirects to a PDF

ddc3c94

No Excepetions are thrown here

fa40343

Parallel fetchers and first wins

9613438

stefan-kolb added the "status: ready-for-review" label (Pull Requests that are ready to be reviewed by the maintainers) Mar 22, 2018

tobiasdiez reviewed Mar 22, 2018

Siedlerchr approved these changes Mar 22, 2018

stefan-kolb added a commit that referenced this pull request Mar 23, 2018

Trust level implementation #3881

1305a30

stefan-kolb closed this Mar 26, 2018

stefan-kolb deleted the concurrent-fetchers branch March 28, 2018 15:48

stefan-kolb mentioned this pull request Oct 7, 2020

Make the DOI Resolution Fetcher return nothing when the DOI leads to a host for which a tailored fetcher exists #6937

Closed

5 tasks

stefan-kolb wants to merge 4 commits into master from concurrent-fetchers
