Skip to content

Concurrent fetchers#3881

Closed
stefan-kolb wants to merge 4 commits into
masterfrom
concurrent-fetchers
Closed

Concurrent fetchers#3881
stefan-kolb wants to merge 4 commits into
masterfrom
concurrent-fetchers

Conversation

@stefan-kolb

Copy link
Copy Markdown
Member

Trying to improve the speed of the fulltext fetcher:

  • First the authoritative publisher is queried for the PDF (if user has access)
  • Afterwards we query all remaining sources and take the first result

@JabRef/developers WDYT?

@stefan-kolb stefan-kolb added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Mar 22, 2018

@tobiasdiez tobiasdiez left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I like the idea. However, to some extend the order of the fulltext fetcher is also important. For example, we probably prefer to have a published paper over just the preprint.

@Siedlerchr Siedlerchr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah! Awesone!

@stefan-kolb

stefan-kolb commented Mar 22, 2018

Copy link
Copy Markdown
Member Author

@tobiasdiez If that really matters, we could invokeAll and assign priorities to the fetchers or something like that. I'm not sure if the preprint (in reality) really differs (or how often) from the published paper.
Would still give us the parallelism.
And the more fetcher we have we can still keep the time complexity ~ N and not N*NumFetcher.

@tobiasdiez

Copy link
Copy Markdown
Member

Priorities or clustering in Authority, Journals and preprints would be a good solution in my opinion.

I know a few instances where authors didn't update their arxiv preprint with the revised and published version. Since even the slightest changes could shift the equation or theorem numbering, having the published PDF is in general desirable.

@stefan-kolb

Copy link
Copy Markdown
Member Author

I thought about this for a moment.
The problem that still persists is that we

  • have a time complexity of max(v_1,..., v_n) then instead of min(v_1,..., v_n)
  • decision complexity goes up (clustering, invokingall, checking which highest rated authority has a url, downloading this URL (hopefully it succeeds then if not probably the download is broken all the time as it will always be the highest priority....)

What really annoys me is that the download takes so much time now.
Your priority might be to get the right document.

Note sure which way to go here.

@stefan-kolb

stefan-kolb commented Mar 23, 2018

Copy link
Copy Markdown
Member Author

At the moment we try the original publisher for 10 seconds via the DOIResolution.
Afterwards there might be better alternatives like IEEE than GoogleScholar, but it will not be the original publisher site then anyhow!
Maybe we can risk it 😄

@Siedlerchr

Siedlerchr commented Mar 23, 2018

Copy link
Copy Markdown
Member

Maybe we can offer a switch? E.g. Prefer Official papers over Preprints?
Google Scholar has maybe PDF but the bibtex data of it are worse than every other page
We maybe could also give semanticscholar a try: https://www.semanticscholar.org/
They link to the orgiinal paper

stefan-kolb added a commit that referenced this pull request Mar 23, 2018
@stefan-kolb

Copy link
Copy Markdown
Member Author

I implemented a possible solution in #3882
Not 100% sure if it is correct but it could be a step into the right direction.
Fetching should be a little faster now as all fetchers are queried in parallel.
I'm still not sure if I like it that way.

@tobiasdiez

tobiasdiez commented Mar 23, 2018

Copy link
Copy Markdown
Member

Ok, these are good points. What do think about combining both approaches: we cluster the fetcher by trust level and run all fetchers in a cluster in parallel. Thus the performance is still min in each cluster.

Something like:

for (TrustLevel trustLevel : TrustLevel.values()) {
    var tasks = fetchers.stream()
            .filter(fetcher::getTrustLevel == trustLevel)
            .map(fetcher -> () -> fetcher.fetch(entry))
            .collect(list());

     try {
         return executer.invokeAny(tasks);
     } catch( ExecutionException) {
          // No fetcher successful, continue in next trust level bracket
     } 
}

(Fetcher.fetch should throw an exception if no url could be found, otherwise the above code does not work).

@stefan-kolb

stefan-kolb commented Mar 24, 2018

Copy link
Copy Markdown
Member Author

It's probably easier an more clear to just run all fetchers as it is now and then select the best authority. Don't see too much benefits running them after another except for multiple code loops then.
Not sure but most of the times we will get the PDF from the lowest authority which means we need to traverse the loop multiple times. I gues the average performance will be better when we run all of them in parallel then.

@stefan-kolb

Copy link
Copy Markdown
Member Author

Closed in favor of #3882

tobiasdiez pushed a commit that referenced this pull request Mar 28, 2018
* Parallel fetchers and first wins

* Trust level implementation #3881

* Fix ordering

* Add tests

* Code style

* Trust levels

* Google refactoring

* Syntax error

* Reduce calls by one as mimeType is already known for fulltext as PDF #3879

* Fix test

* Unued imports

* Remove test

* Refactoring

* Feedback

* Graceful shutdown and force shutdown for non-terminating tasks

* 60 seconds

* Revert test

* Add Getters

* Mock tests

* Refactor to lambda

* Revert "60 seconds"

This reverts commit 27fa0e8.

* Revert "Graceful shutdown and force shutdown for non-terminating tasks"

This reverts commit f59a3c6.

* Remove unused method
@stefan-kolb stefan-kolb deleted the concurrent-fetchers branch March 28, 2018 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants