Fix for 12354 - Title and booktitle fields should not contain url#12431
Conversation
…L, a URL that is a false positive or a URL that is accepted in the title or booktitle
…n the Title and BookTitle inputs. Also added tests to identify whether a URL refers to a domain for the Title and BookTitle.
There was a problem hiding this comment.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
There was a problem hiding this comment.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
There was a problem hiding this comment.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
There was a problem hiding this comment.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
LinusDietz
left a comment
There was a problem hiding this comment.
I think that this PR generally goes into the right direction.
- I would simplify the message to the user, maybe @JabRef/developers have another suggestion?
- As far as I see the checker for a
DOMAIN_ONLY_PATTERNdoes not do anything? Would the functionality be still the same if you removed those checks?
| } | ||
|
|
||
| if (FULL_URL_PATTERN.matcher(value).find()) { | ||
| return Optional.of(Localization.lang("The title contains a full URL which is forbidden")); |
There was a problem hiding this comment.
I would keep the message simple: The title contains a URL
| return Optional.of(Localization.lang("The book title contains a full URL which is forbidden")); | ||
| } | ||
|
|
||
| if (DOMAIN_ONLY_PATTERN.matcher(value).find()) { |
There was a problem hiding this comment.
why do you check this case? It seems unnecessary to me.
There was a problem hiding this comment.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
There was a problem hiding this comment.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
…this addition. Regex in FULL_URL_PATTERN ensures that a domain URL can be distinguished from a full one.
… message that appears when the user enters a URL as a book title.
…or message that appears when the user enters a URL as a book title.
|
@RapidShotzz @LinusDietz Could you please fix the issue description. Currently, the template is still present I assume, this issue does not close #333, does it? |
|
@koppor @LinusDietz The issue description has been updated! |
I don't see this in the code.
|
koppor
left a comment
There was a problem hiding this comment.
This is a good starting point.
Note that you can search for an arbitrary string in the code using Ctrl+Shift+F. I searched for https?, which is are the first letters of your regex - and found a good match.
Please refactor the code.
| private static final Pattern FULL_URL_PATTERN = Pattern.compile( | ||
| "(https?://\\S+/\\S+|www\\.\\S+/\\S+)", Pattern.CASE_INSENSITIVE); |
There was a problem hiding this comment.
I think, it would make more sense to re-use org.jabref.logic.cleanup.URLCleanup#URL_REGEX.
- Refator ´org.jabref.logic.cleanup.URLCleanup#URL_REGEX` to reside in org.jabref.logic.util.URLUtil
- Make use of
org.jabref.logic.util.URLUtil#URL_REGEXin your code.
There was a problem hiding this comment.
Still unchanged according to https://github.com/JabRef/jabref/pull/12431/files.
Is this variable still used? I think, it can be removed.
| private static final Pattern FULL_URL_PATTERN = Pattern.compile( | ||
| "(https?://\\S+/\\S+|www\\.\\S+/\\S+)", Pattern.CASE_INSENSITIVE); |
There was a problem hiding this comment.
Use of org.jabref.logic.util.URLUtil#URL_REGEX in your code. (see above)
|
|
||
| @Test | ||
| void booktitleDoesNotAcceptFullURL() { | ||
| assertNotEquals(Optional.empty(), checker.checkValue("Proceedings of the https://example.com/conference")); |
There was a problem hiding this comment.
Can you do a @ParameterizedTest and @CsvSource - so that the parameters are very easy to type?
| } | ||
|
|
||
| @ParameterizedTest(name = "{index}. Title: \"{1}\" {0}") | ||
| @MethodSource("invalidTitlesWithURLs") |
| private static final Pattern FULL_URL_PATTERN = Pattern.compile( | ||
| "(https?://\\S+/\\S+|www\\.\\S+/\\S+)", Pattern.CASE_INSENSITIVE); |
There was a problem hiding this comment.
Still unchanged according to https://github.com/JabRef/jabref/pull/12431/files.
Is this variable still used? I think, it can be removed.
| * For GUI-oriented URL utilities see {@link org.jabref.gui.fieldeditors.URLUtil}. | ||
| */ | ||
| public class URLUtil { | ||
| public static final String URL_REGEX = "(?i)\\b((?:https?|ftp)://[^\\s]+)"; |
There was a problem hiding this comment.
Why was it impossibel to use URL_EXP?
Just add a documentation that you needed to search in strings and that you assume that URLs consist of non-whitespace only.
The variable can be private - because you don't read it.
koppor
left a comment
There was a problem hiding this comment.
Thank you for the follow-up.
The PR does not contain a screnshot, thus I have to guess on some places.
| } | ||
|
|
||
| if (URLUtil.URL_PATTERN.matcher(value).find()) { | ||
| return Optional.of(Localization.lang("The book title contains a URL")); |
There was a problem hiding this comment.
I totally overlooked it.
Consistency is key on tools! -- Now read 4 lines above: That starts with booktitle.
| return Optional.of(Localization.lang("The book title contains a URL")); | |
| return Optional.of(Localization.lang("booktitle contains a URL")); |
If you don't like it because of bad English language, change the localiaztion on line 19.
Seeing the title checker, maybe just remove "The book title" and "booktitle" in both strings.
| } | ||
|
|
||
| if (URLUtil.URL_PATTERN.matcher(value).find()) { | ||
| return Optional.of(Localization.lang("The title contains a URL")); |
There was a problem hiding this comment.
Be consistent with line 56
| return Optional.of(Localization.lang("The title contains a URL")); | |
| return Optional.of(Localization.lang("contains a URL")); |
| public class URLUtil { | ||
| private static final String URL_REGEX = "(?i)\\b((?:https?|ftp)://[^\\s]+)"; | ||
| /** | ||
| * Pattern match a string containing a URL with a protocol |
There was a problem hiding this comment.
| * Pattern match a string containing a URL with a protocol | |
| * Pattern matches a string containing a URL with a protocol |
Please a add a empty line before the JavaDoc, too. Otherwise, it is too close to the other variable
Changed localization key for URL pattern matcher Update URL_PATTERN JavaDoc Update JabRef_en.properties 'contains a URL' key
koppor
left a comment
There was a problem hiding this comment.
The architecture is wrong. One could have seen an indication for this, because the same code was put twice - and code duplication should be avoided.
See following code in FieldCheckers:
fieldCheckers.put(StandardField.BOOKTITLE, new BooktitleChecker());
fieldCheckers.put(StandardField.TITLE, new TitleChecker(databaseContext));One field - one checker.
Later
fieldCheckers.put(StandardField.DATE, new DateChecker());
fieldCheckers.put(StandardField.URLDATE, new DateChecker());
fieldCheckers.put(StandardField.EVENTDATE, new DateChecker());
fieldCheckers.put(StandardField.ORIGDATE, new DateChecker());One checker for several fields.
Thus, you need to create a class "NoUrlChecker" containing your logic.
| ### Added | ||
|
|
||
| - We added a feature for copying entries to libraries, available via the context menu, with an option to include cross-references. [#12374](https://github.com/JabRef/jabref/pull/12374) | ||
| - We added an error message that appears when a user tries to enter a URL as a book title. [#12354](https://github.com/JabRef/jabref/issues/12354) |
There was a problem hiding this comment.
| - We added an error message that appears when a user tries to enter a URL as a book title. [#12354](https://github.com/JabRef/jabref/issues/12354) | |
| - We added an integrity check if a URL appears in a title title. [#12354](https://github.com/JabRef/jabref/issues/12354) |
| if (URLUtil.URL_PATTERN.matcher(value).find()) { | ||
| return Optional.of(Localization.lang("contains a URL")); | ||
| } |
There was a problem hiding this comment.
The architecture is wrong - sorry. Please update. (See general PR comments)
Moved no url checking logic from TitleChecker and BooktitleChecker to NoURLChecker Updated CHANGELOG.md
| Entries\ copied\ successfully,\ including\ cross-references.=Entries copied successfully, including cross-references. | ||
| Entries\ copied\ successfully,\ without\ cross-references.=Entries copied successfully, without cross-references. | ||
|
|
||
| contains\ a\ URL=contains a URL |
There was a problem hiding this comment.
Should be moved to some other integrity check outputs, but I leave this as future work to keep things going.
koppor
left a comment
There was a problem hiding this comment.
Fixed the merge conflict by moving the new string near to another integrity check
There was a problem hiding this comment.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

This PR closes #12354 and the link: #12354
We have flagged full URLs that are not supposed to appear in the title and booktitle fields by displaying an error message to inform the user. Regex is used to identify keywords for this
To avoid false positives and incorrect flagging, we accept titles that mention website names as part of the topic e.g. Applying Trip@dvice Recommendation Technology to www.visiteurope.com.
The integrity check focuses on ensuring that URLs which have a start structure of http://, https:// or www. are not mistakenly included in the title/booktitle fields. In terms of minimising false positives, the check only flags full URLs that are followed by a path and will avoid flagging domain names or references that are linked to valid research titles.
Mandatory checks
CHANGELOG.mddescribed in a way that is understandable for the average user (if change is visible to the user)