Fix for 12354 - Title and booktitle fields should not contain url by RapidShotzz · Pull Request #12431 · JabRef/jabref

RapidShotzz · 2025-01-30T11:26:20Z

This PR closes #12354.

We flag full URLs that should not appear in the title and booktitle fields by showing an error message to the user. A regular expression is used to identify the URLs.

To avoid false positives and incorrect flagging, we accept titles that mention website names as part of the topic e.g. Applying Trip@dvice Recommendation Technology to www.visiteurope.com.

The integrity check ensures that URLs starting with http://, https:// or www. are not mistakenly included in the title and booktitle fields. To minimise false positives, the check only flags full URLs that are followed by a path; it does not flag bare domain names or references that are part of valid research titles.
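A minimal sketch of the behaviour described above, using the FULL_URL_PATTERN regex from the first revision of this PR. The class name FullUrlCheckSketch and the helper containsFullUrl are hypothetical and exist only for illustration; they are not part of JabRef.

```java
import java.util.List;
import java.util.regex.Pattern;

class FullUrlCheckSketch {
    // Regex from the first revision of this PR: a scheme or "www." prefix,
    // followed by a host, a slash, and at least one path character.
    static final Pattern FULL_URL_PATTERN = Pattern.compile(
            "(https?://\\S+/\\S+|www\\.\\S+/\\S+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the value contains a URL with a path.
    static boolean containsFullUrl(String value) {
        return FULL_URL_PATTERN.matcher(value).find();
    }

    public static void main(String[] args) {
        List<String> values = List.of(
                "Proceedings of the https://example.com/conference",
                "Applying Trip@dvice Recommendation Technology to www.visiteurope.com.");
        for (String value : values) {
            System.out.println(containsFullUrl(value) + " <- " + value);
        }
    }
}
```

The first value contains a scheme, a host and a path, so it is flagged; the second only mentions a domain name without a path, so it passes.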

Mandatory checks

I own the copyright of the code submitted and I licence it under the MIT license
Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

github-actions

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

LinusDietz

I think that this PR is generally heading in the right direction.

I would simplify the message shown to the user; maybe @JabRef/developers have another suggestion?
As far as I can see, the check for DOMAIN_ONLY_PATTERN does not do anything. Would the functionality still be the same if you removed those checks?

LinusDietz · 2025-01-30T17:55:59Z

        }

+        if (FULL_URL_PATTERN.matcher(value).find()) {
+            return Optional.of(Localization.lang("The title contains a full URL which is forbidden"));


I would keep the message simple: The title contains a URL

LinusDietz · 2025-01-30T17:59:07Z

+            return Optional.of(Localization.lang("The book title contains a full URL which is forbidden"));
+        }
+
+        if (DOMAIN_ONLY_PATTERN.matcher(value).find()) {


Why do you check this case? It seems unnecessary to me.

koppor · 2025-02-03T13:21:50Z

@RapidShotzz @LinusDietz Could you please fix the issue description? Currently, the template is still present.

I assume this issue does not close #333, does it?

RapidShotzz · 2025-02-03T13:26:02Z

@koppor @LinusDietz The issue description has been updated!

koppor · 2025-02-03T13:37:07Z

To avoid false positives and incorrect flagging, we accept titles that mention website names as part of the topic e.g. Applying Trip@dvice Recommendation Technology to www.visiteurope.com.

I don't see this in the code.

No Test case containing Trip@dvice
No special regular expression.

koppor

This is a good starting point.

Note that you can search for an arbitrary string in the code using Ctrl+Shift+F. I searched for https?, which are the first characters of your regex, and found a good match.

Please refactor the code.

koppor · 2025-02-03T13:33:23Z

+    private static final Pattern FULL_URL_PATTERN = Pattern.compile(
+            "(https?://\\S+/\\S+|www\\.\\S+/\\S+)", Pattern.CASE_INSENSITIVE);


I think it would make more sense to reuse org.jabref.logic.cleanup.URLCleanup#URL_REGEX.

Refactor org.jabref.logic.cleanup.URLCleanup#URL_REGEX to reside in org.jabref.logic.util.URLUtil.

Make use of org.jabref.logic.util.URLUtil#URL_REGEX in your code.

Still unchanged according to https://github.com/JabRef/jabref/pull/12431/files.

Is this variable still used? I think, it can be removed.

koppor · 2025-02-03T13:33:49Z

+    private static final Pattern FULL_URL_PATTERN = Pattern.compile(
+            "(https?://\\S+/\\S+|www\\.\\S+/\\S+)", Pattern.CASE_INSENSITIVE);


Make use of org.jabref.logic.util.URLUtil#URL_REGEX in your code (see above).

koppor · 2025-02-03T13:34:45Z

+
+    @Test
+    void booktitleDoesNotAcceptFullURL() {
+        assertNotEquals(Optional.empty(), checker.checkValue("Proceedings of the https://example.com/conference"));


Can you do a @ParameterizedTest and @CsvSource - so that the parameters are very easy to type?

koppor · 2025-02-03T13:35:21Z

        }
+
+        @ParameterizedTest(name = "{index}. Title: \"{1}\" {0}")
+        @MethodSource("invalidTitlesWithURLs")


@CsvSource could be easier here.

koppor

Minor comments

koppor · 2025-02-10T19:08:56Z

+    private static final Pattern FULL_URL_PATTERN = Pattern.compile(
+            "(https?://\\S+/\\S+|www\\.\\S+/\\S+)", Pattern.CASE_INSENSITIVE);


Still unchanged according to https://github.com/JabRef/jabref/pull/12431/files.

Is this variable still used? I think, it can be removed.

koppor · 2025-02-10T19:11:35Z

 * For GUI-oriented URL utilities see {@link org.jabref.gui.fieldeditors.URLUtil}.
 */
 public class URLUtil {
+    public static final String URL_REGEX = "(?i)\\b((?:https?|ftp)://[^\\s]+)";


Why was it impossible to use URL_EXP?

Just add documentation stating that you needed to search within strings and that you assume URLs consist of non-whitespace characters only.

The variable can be private because it is not read from outside the class.
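A minimal sketch of how the regex quoted above behaves when it is used to search inside a field value. The class name UrlSearchSketch and the helper containsUrl are hypothetical; compiling the string into a shared Pattern and calling find() is an assumption about how the constant is consumed, based on the URLUtil.URL_PATTERN calls that appear later in this PR.

```java
import java.util.regex.Pattern;

class UrlSearchSketch {
    // Regex as quoted in the review: a protocol (http, https or ftp) followed by
    // one or more non-whitespace characters. (?i) makes it case-insensitive.
    private static final String URL_REGEX = "(?i)\\b((?:https?|ftp)://[^\\s]+)";

    // Assumption: the regex is compiled once and the Pattern is what callers use.
    static final Pattern URL_PATTERN = Pattern.compile(URL_REGEX);

    // Hypothetical helper: find() searches anywhere in the string,
    // whereas matches() would require the whole value to be a URL.
    static boolean containsUrl(String value) {
        return URL_PATTERN.matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(containsUrl("Proceedings of the https://example.com/conference"));
        System.out.println(containsUrl("Applying Trip@dvice Recommendation Technology to www.visiteurope.com"));
    }
}
```

Because the pattern requires a protocol, a bare domain such as www.visiteurope.com is not reported, and a protocol that is not followed by any non-whitespace character does not count as a URL either.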

koppor

Thank you for the follow-up.

The PR does not contain a screenshot, so I have to guess in some places.

koppor · 2025-02-12T21:43:36Z

        }

+        if (URLUtil.URL_PATTERN.matcher(value).find()) {
+            return Optional.of(Localization.lang("The book title contains a URL"));


I totally overlooked it.

Consistency is key in tools! Now read 4 lines above: that message starts with booktitle.

Suggested change

-            return Optional.of(Localization.lang("The book title contains a URL"));
+            return Optional.of(Localization.lang("booktitle contains a URL"));

If you don't like it because it is poor English, change the localization on line 19.

Looking at the title checker, maybe just remove "The book title" and "booktitle" from both strings.

koppor · 2025-02-12T21:44:28Z

        }

+        if (URLUtil.URL_PATTERN.matcher(value).find()) {
+            return Optional.of(Localization.lang("The title contains a URL"));


Be consistent with line 56

Suggested change

-            return Optional.of(Localization.lang("The title contains a URL"));
+            return Optional.of(Localization.lang("contains a URL"));

koppor · 2025-02-12T21:45:59Z

 public class URLUtil {
+    private static final String URL_REGEX = "(?i)\\b((?:https?|ftp)://[^\\s]+)";
+    /**
+     * Pattern match a string containing a URL with a protocol


Suggested change

-     * Pattern match a string containing a URL with a protocol
+     * Pattern matches a string containing a URL with a protocol

Please add an empty line before the JavaDoc, too. Otherwise, it is too close to the other variable.

koppor

The architecture is wrong. The duplicated code was already an indication of this: the same logic appears twice, and code duplication should be avoided.

See the following code in FieldCheckers:

        fieldCheckers.put(StandardField.BOOKTITLE, new BooktitleChecker());
        fieldCheckers.put(StandardField.TITLE, new TitleChecker(databaseContext));

One field - one checker.

Later

            fieldCheckers.put(StandardField.DATE, new DateChecker());
            fieldCheckers.put(StandardField.URLDATE, new DateChecker());
            fieldCheckers.put(StandardField.EVENTDATE, new DateChecker());
            fieldCheckers.put(StandardField.ORIGDATE, new DateChecker());

One checker for several fields.

Thus, you need to create a class "NoUrlChecker" containing your logic.
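A minimal sketch of the "one checker for several fields" layout requested above. Everything here is a simplified stand-in and not JabRef's actual API: ValueCheckerSketch replaces the real checker interface, plain strings replace StandardField and Localization.lang, and a Map of lists replaces the real registry in FieldCheckers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Simplified stand-in for the checker interface (an assumption for this sketch).
interface ValueCheckerSketch {
    Optional<String> checkValue(String value);
}

// The URL logic lives in exactly one class instead of being duplicated.
class NoUrlCheckerSketch implements ValueCheckerSketch {
    // Regex as quoted in the review of URLUtil.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?i)\\b((?:https?|ftp)://[^\\s]+)");

    @Override
    public Optional<String> checkValue(String value) {
        if (value == null || value.isBlank()) {
            return Optional.empty();
        }
        if (URL_PATTERN.matcher(value).find()) {
            return Optional.of("contains a URL");
        }
        return Optional.empty();
    }
}

class FieldCheckersSketch {
    // Registers the same checker for several fields.
    static Map<String, List<ValueCheckerSketch>> build() {
        Map<String, List<ValueCheckerSketch>> fieldCheckers = new HashMap<>();
        ValueCheckerSketch noUrlChecker = new NoUrlCheckerSketch();
        for (String field : List.of("title", "booktitle")) {
            fieldCheckers.computeIfAbsent(field, key -> new ArrayList<>()).add(noUrlChecker);
        }
        return fieldCheckers;
    }

    // Runs every checker registered for the field and collects the messages.
    static List<String> check(Map<String, List<ValueCheckerSketch>> checkers,
                              String field, String value) {
        List<String> messages = new ArrayList<>();
        for (ValueCheckerSketch checker : checkers.getOrDefault(field, List.of())) {
            checker.checkValue(value).ifPresent(messages::add);
        }
        return messages;
    }
}
```

With this layout, the field-specific checkers keep their own rules, and the URL rule is added to each field by registration rather than by copying code.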

koppor · 2025-02-14T10:37:45Z

 ### Added

 - We added a feature for copying entries to libraries, available via the context menu, with an option to include cross-references. [#12374](https://github.com/JabRef/jabref/pull/12374)
+- We added an error message that appears when a user tries to enter a URL as a book title. [#12354](https://github.com/JabRef/jabref/issues/12354)


Suggested change

- We added an error message that appears when a user tries to enter a URL as a book title. [#12354](https://github.com/JabRef/jabref/issues/12354)

- We added an integrity check if a URL appears in a title title. [#12354](https://github.com/JabRef/jabref/issues/12354)

koppor · 2025-02-14T11:23:32Z

+        if (URLUtil.URL_PATTERN.matcher(value).find()) {
+            return Optional.of(Localization.lang("contains a URL"));
+        }


The architecture is wrong - sorry. Please update. (See general PR comments)

koppor · 2025-02-25T15:23:19Z

 Entries\ copied\ successfully,\ including\ cross-references.=Entries copied successfully, including cross-references.
 Entries\ copied\ successfully,\ without\ cross-references.=Entries copied successfully, without cross-references.
+
+contains\ a\ URL=contains a URL


Should be moved to some other integrity check outputs, but I leave this as future work to keep things going.

koppor

Fixed the merge conflict by moving the new string near to another integrity check

RapidShotzz added 2 commits January 29, 2025 10:29

Added Integrity message to identify whether the user inputs a full URL, a URL that is a false positive or a URL that is accepted in the title or booktitle

8e9ce3e

Added tests to check whether a full URL is identified with an error in the Title and BookTitle inputs. Also added tests to identify whether a URL refers to a domain for the Title and BookTitle.

70199bf

github-actions Bot reviewed Jan 30, 2025

Fixed missing localisation keys in JabRef_en.properties

f445361

github-actions Bot reviewed Jan 30, 2025

removed added keys in JabRef_en.properties

27d507e

github-actions Bot reviewed Jan 30, 2025

Add localization key

e508a59

github-actions Bot reviewed Jan 30, 2025

Fix book title checker URL pattern localization

69f9af7

LinusDietz reviewed Jan 30, 2025

amended and simplified integrity message for booktitle and title

febe100

github-actions Bot reviewed Jan 30, 2025

RapidShotzz added 2 commits January 30, 2025 19:39

amended JabRef_en.properties file

a92651f

Adjusted Title

00bc4e8

github-actions Bot reviewed Jan 30, 2025

RapidShotzz added 2 commits January 30, 2025 19:52

amended booktitle message and keys in JabRef_en.properties

8e0e571

removed DOMAIN_ONLY_PATTERN check as functionality was unaffected by this addition. Regex in FULL_URL_PATTERN ensures that a domain URL can be distinguished from a full one.

4bf336d

LinusDietz added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Jan 31, 2025

RapidShotzz and others added 4 commits January 31, 2025 22:52

Added an entry in CHANGELOG.md to specify that we have added an error message that appears when the user enters a URL as a book title.

41ba56a

Amended an entry in CHANGELOG.md to specify that we have added an error message that appears when the user enters a URL as a book title.

6febffb

resolved merge conflict

553fb6e

Merge branch 'main' into fix-12354-title-should-not-contain-url

89f9bdb

koppor requested changes Feb 3, 2025

koppor added status: changes-required Pull requests that are not yet complete and removed status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers labels Feb 3, 2025

RapidShotzz and others added 2 commits February 6, 2025 14:48

refactored code to include Regex in URLUtil and adjusted testing.

f1c469a

Add title and booktitle tests for protocol without URL

b79fae5

RapidShotzz requested a review from koppor February 7, 2025 14:09

koppor requested changes Feb 10, 2025

Remove unused FULL_URL_PATTERN and document URL_PATTERN

4f8eb59

koppor requested changes Feb 12, 2025

Update BooktitleChecker and Titlechecker

9c776c5

Changed localization key for URL pattern matcher Update URL_PATTERN JavaDoc Update JabRef_en.properties 'contains a URL' key

RapidShotzz requested a review from koppor February 13, 2025 16:08

koppor requested changes Feb 14, 2025

koppor marked this pull request as draft February 17, 2025 10:57

11raphael and others added 2 commits February 18, 2025 14:18

Add NoURLChecker

6cdd6d2

Moved no url checking logic from TitleChecker and BooktitleChecker to NoURLChecker Updated CHANGELOG.md

Merge branch 'main' into fix-12354-title-should-not-contain-url

cddcade

LinusDietz added status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers and removed status: changes-required Pull requests that are not yet complete labels Feb 21, 2025

LinusDietz marked this pull request as ready for review February 21, 2025 16:40

ThiloteE changed the title ~~Fix 12354 title should not contain url~~ Fix for 12354 - Title and booktitle fields should not contain url Feb 24, 2025

koppor previously approved these changes Feb 25, 2025

Merge branch 'main' into fix-12354-title-should-not-contain-url

6dd7484

koppor dismissed their stale review via 6dd7484 February 25, 2025 15:25

koppor removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Feb 25, 2025

koppor enabled auto-merge February 25, 2025 15:25

koppor previously approved these changes Feb 25, 2025

github-actions Bot reviewed Feb 25, 2025

Fix JabRef_en.properties

ac0a98a

koppor dismissed their stale review via ac0a98a February 25, 2025 16:45

koppor approved these changes Feb 25, 2025

koppor added this pull request to the merge queue Feb 25, 2025

Merged via the queue into JabRef:main with commit 1c5a73c Feb 25, 2025
