Skip to content

Rewrite google scholar fetcher to new infrastructure#2101

Merged
koppor merged 9 commits into
masterfrom
new-gScholar-fetcher
Oct 12, 2016
Merged

Rewrite google scholar fetcher to new infrastructure#2101
koppor merged 9 commits into
masterfrom
new-gScholar-fetcher

Conversation

@matthiasgeiger

@matthiasgeiger matthiasgeiger commented Sep 30, 2016

Copy link
Copy Markdown
Member

Follow up of #2082:

Google Scholar search fetcher now use the new infrastructure - old class GoogleScholarFetcher.java with its excessive interweaving with the UI has been removed.

New implementation uses the Cookie approach @tobiasdiez proposed in #2082 (comment)

As I had no problems with Captchas during development I'll check whether getting more than the 10 hits of the first result page is possible...

  • Get more hits?
  • Change in CHANGELOG.md described
  • Tests created for changes
  • Screenshots added (for bigger UI changes)
  • Manually tested changed features in running JabRef
  • Check documentation status (Issue created for outdated help page at help.jabref.org?)
  • If you changed the localization: Did you run gradle localizationUpdate?

@Siedlerchr Siedlerchr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but i am just wondering if There is no longer a method retuning the help page name from that enum

@matthiasgeiger matthiasgeiger changed the title [WIP] Rewrite google scholar fetcher to new infrastructure Rewrite google scholar fetcher to new infrastructure Oct 5, 2016
@matthiasgeiger matthiasgeiger added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Oct 5, 2016
@matthiasgeiger

Copy link
Copy Markdown
Member Author

@Siedlerchr Thanks for the hint! I've overridden the method now.

@lenhard lenhard left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine (some nitpicking of course). Could you please merge master into this PR, so that we can really test that the fetcher is functional?

String content = URLDownload.createURLDownloadWithBrowserUserAgent(queryURL)
.downloadToString(StandardCharsets.UTF_8);

Matcher m = LINK_TO_BIB_PATTERN.matcher(content);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more one letter variable names, please :-)

}

private BibEntry downloadEntry(String link) throws IOException, FetcherException {
String s = URLDownload.createURLDownloadWithBrowserUserAgent(link).downloadToString(StandardCharsets.UTF_8);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, please improve naming in this method.

@matthiasgeiger

Copy link
Copy Markdown
Member Author

@lenhard Done. Branch already has been rebased.

@tobiasdiez tobiasdiez left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good to me, just a few minor remarks.

}

String url = String.format(SEARCH_URL,
String url = String.format(SEARCH_IN_TITLE_URL,

@tobiasdiez tobiasdiez Oct 5, 2016

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use the UriBuilder instead of replacing strings. It is more readable and you don't need to take care of manually encoding the input string.


List<BibEntry> foundEntries = new ArrayList<>(10);

String queryURL = String.format(BASIC_SEARCH_URL, URLEncoder.encode(query, StandardCharsets.UTF_8.name()));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: UriBuilder


String queryURL = String.format(BASIC_SEARCH_URL, URLEncoder.encode(query, StandardCharsets.UTF_8.name()));
addHitsFromQuery(foundEntries, queryURL);
String content;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

content variable is not needed, right?


return foundEntries;
} catch (IOException e) {
throw new FetcherException("Fetching entries from Google Scholar failed: ", e);

@tobiasdiez tobiasdiez Oct 5, 2016

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message should probably localized (applies also to the code below). I think there is similar error handling in SearchBasedParserFetcher.

throw new FetcherException("Parsing entries from Google Scholar bib file failed.");
} else {
BibEntry entry = entries.iterator().next();
entry.clearField(BibEntry.KEY_FIELD);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other fetcher keep the bibtexkey. Not sure what the best approach is, but we should be consistent.

@lenhard

lenhard commented Oct 5, 2016

Copy link
Copy Markdown
Member

Hm, for me this still freezes on second search.

Also, I am still getting this weird exception on the command line:

Exception in thread "JabRef CachedThreadPool" java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at java.awt.Container.createHierarchyEvents(Container.java:1441)
    at java.awt.Container.createHierarchyEvents(Container.java:1441)
    at java.awt.Container.createHierarchyEvents(Container.java:1441)
    at java.awt.Container.createHierarchyEvents(Container.java:1441)
    at java.awt.Container.createHierarchyEvents(Container.java:1441)
    at java.awt.Dialog.conditionalShow(Dialog.java:953)
    at java.awt.Dialog.show(Dialog.java:1037)
    at java.awt.Component.show(Component.java:1671)
    at java.awt.Component.setVisible(Component.java:1623)
    at java.awt.Window.setVisible(Window.java:1014)
    at java.awt.Dialog.setVisible(Dialog.java:1005)
    at net.sf.jabref.gui.importer.fetcher.GeneralFetcher.lambda$actionPerformed$5(GeneralFetcher.java:220)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

What about that?

@matthiasgeiger

Copy link
Copy Markdown
Member Author

Are you sure that you've pulled the rebased branch @lenhard?

GeneralFetcher.java is exactly the line which I reverted... and it works for me here.

@matthiasgeiger

Copy link
Copy Markdown
Member Author

Fixed the aspects @tobiasdiez has mentioned. Exception: Exception localization - this should be discussed more generally...

@lenhard Can you please retest whether the freezing problem persists for you?

@lenhard

lenhard commented Oct 6, 2016

Copy link
Copy Markdown
Member

Well, JabRef doesn't freeze, but I seem to be blocked from Google Scholar now with the fetcher:

Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://scholar.google.com/scholar%3Fhl%3Den%26btnG%3DSearch%26q%3Dtest&hl=en&q=CGMSBMELm6gYh_jXvwUiGQDxp4NLijN0g-BTQduzpa67i0vP6rY1eFo
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1840) ~[?:1.8.0_92]
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441) ~[?:1.8.0_92]
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254) ~[?:1.8.0_92]
    at net.sf.jabref.logic.net.URLDownload.downloadToString(URLDownload.java:136) ~[main/:?]
    at net.sf.jabref.logic.importer.fetcher.GoogleScholar.addHitsFromQuery(GoogleScholar.java:135) ~[main/:?]
    at net.sf.jabref.logic.importer.fetcher.GoogleScholar.performSearch(GoogleScholar.java:120) ~[main/:?]
    ... 5 more

Unfortunately, filling out the captcha does not help. Of course, I cannot guarantee that the freezing does not return when I manage to get unblocked ;-)

Oh, and to be sure I executed a hard reset and pull for this branch, so I should be up to date.

pdfLink = Optional.of(new URL(s));
break;
try {
URIBuilder uriBuilder = new URIBuilder(SEARCH_IN_TITLE_URL);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not covered by tests? Or is just my Firefox plugin failing?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This the "old" part of the fetcher responsible for fulltext downloads. The tests are currently not executed at travis as they led to errors if the IP got blocked by google (See https://github.com/JabRef/jabref/blob/master/src/test/java/net/sf/jabref/logic/importer/fetcher/GoogleScholarTest.java).
However, as my new tests (currently) are running through I'll check whether this still persists.

@koppor

koppor commented Oct 10, 2016

Copy link
Copy Markdown
Member

Besides minor comments: Fetcher works for me without exceptions. I just tested the search string "bpel4chor".

@koppor koppor merged commit 7e76724 into master Oct 12, 2016
@koppor koppor deleted the new-gScholar-fetcher branch October 12, 2016 09:07
Siedlerchr added a commit that referenced this pull request Oct 13, 2016
* upstream/master: (24 commits)
  hotfix NPE in the main table (#2158)
  Enhance side pane toggle (#1605)
  Added gray background text to authors field to assist newcomers (#2147)
  Rewrite google scholar fetcher to new infrastructure (#2101)
  Found entries will be shown when dropping a pdf with xmp meta data (#2150)
  Japanese translation (#2155)
  Add shortcuts to context menu (#2131)
  [WIP] Doi resolution ignore (#2154)
  Fix DuplicationChecker and key generator (#2151)
  just close popup on first ESC
  keep entry in sight
  Fix isSharedDatabaseAlreadyPresent check.
  don't change entry after search if editing an entry (#2129)
  Change download URL to downloads.jabref.org (#2145)
  fix switching edited textfield in the entry editor with TAB (#2138)
  Update me.champeau.gradle.jmh' version from 0.3.0 to 0.3.1
  Update install4j plugin from 6.1.2 to 6.1.3
  Remove gradle plugin com.github.youribonnaffe.gradle.format as it is deprecated and the config uses a non-existing file
  Update gradle from 3.0 to 3.1
  Fix changing the font size not working when importing preferences (#2069)
  ...

# Conflicts:
#	src/main/java/net/sf/jabref/gui/externalfiles/DroppedFileHandler.java
#	src/main/java/net/sf/jabref/gui/importer/fetcher/GoogleScholarFetcher.java
zesaro pushed a commit to zesaro/jabref that referenced this pull request Nov 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component: fetcher status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants