Skip to content

Fix for issue 5850: Journal abbreviations in UTF-8 not recognized#7639

Merged
Siedlerchr merged 30 commits into
JabRef:mainfrom
MrGhabi:fix-for-issue-5850
Apr 23, 2021
Merged

Fix for issue 5850: Journal abbreviations in UTF-8 not recognized#7639
Siedlerchr merged 30 commits into
JabRef:mainfrom
MrGhabi:fix-for-issue-5850

Conversation

@MrGhabi

@MrGhabi MrGhabi commented Apr 17, 2021

Copy link
Copy Markdown
Contributor

Fixes #5850

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

Reproduce the issue:

  1. New library
  2. New article
  3. BibeTx source adds the following:
    @article{杨芙清2005软件工程技术发展思索,
      title={软件工程技术发展思索},
      author={杨芙清},
      journal={软件学报},
      volume={16},
      number={1},
      year={2005},
      publisher={Citeseer}
    }
    
  4. click "check integrity"

The main reason for this bug is the check-tools Check integrity only accept the charset ASCII. It works well in English citations, but jabref has users across the world and they have different charsets.

The screenshot:

  1. before:
    before

  2. after
    intigrity-check

The way to fix:

  1. First I find the bug related to the class ASCIICharacterChecker.java
  2. In this class
boolean asciiOnly = CharMatcher.ascii().matchesAllOf(field.getValue());

any non-ASCII encoded characters will be warned.
3. Then I remove the steps in IntegrityCheck .

And still, I want to give a warning about non-UTF8 encoded characters.

  1. I get the default encoding from the system (Since we need to give a warning when the field cannot be decoded in UTF-8. And this may happen when users' default encoding charset is non-UTF-8)
  2. check whether the value of fields(string) can be decoded in UTF-8
  3. if not, just give a warning about "Non-UTF-8 encoded found"

To check this, we need first set out the default charset(for example GBK) in the whole environment.
Then we can get the following warning when using Integrity check:
image

public List<IntegrityMessage> check(BibEntry entry) {
List<IntegrityMessage> results = new ArrayList<>();
for (Map.Entry<Field, String> field : entry.getFieldMap().entrySet()) {
Charset charset = Charset.forName(System.getProperty("file.encoding"));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would extract this out of the loop, as it doesn't depend on the loop.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to use System.getProperty("file.encoding") and not say the encoding specified in the Library properties?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since different users have different charsets due to the operating system or the default settings of the computer. And System.getProperty("file.encoding") is used get the default charset. If the charset is not UTF-8, we should give a warning about that.
And the reason not to use the Library properties & Database properties: Maybe the user doesn't know the default charset in his computer or he set the charset for jabref, but we should give a warning about that since Non-UTF-8 charset may cause to garbled.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks for the explanation.

But doesn't this give a lot of false positives? Say I have my library encoded in Charset A, and my systems default is Charset B. If all characters in my database are properly encoded with Charset A, then I shouldn't get any warnings even though some of the characters may not be encodable in Charset B, right?

But I also have to admit that I do not yet understand the use-case from the user perspective, so maybe I'm missing something obvious.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about that. In my test, if one user's default charset is A then his paste-board, his input is all encoded in A. So when he input something maybe just garbled. Maybe there is an example: #7629
So the scenario may be rare. That's the reason I don't choose to get the charset by Charset charset = bibDatabaseContext.getMetaData().getEncoding().orElse(preferences.getDefaultEncoding());

By the way, I have a question about the design. If the bibtex is only allowed in ascii in design, why do we allow the user to save it into different charsets?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, reviewer! @tobiasdiez After thinking for a long time and doing some tests, I think maybe it's better to give 2 kinds of warning:

  1. In BibLatex, if the Library charset is not UTF-8, then give a warning Non-UTF-8 field found.
  2. In both BibLatex and BibTeX, if the System env is not UTF-8, give the warning Non-UTF-8 env, may cause garbled.
    And I'm eagerly waiting for your suggestions and reply!

Comment thread CHANGELOG.md Outdated

@Siedlerchr Siedlerchr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks already good. PLease have a look at the checkstyle issues

@Siedlerchr Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Apr 17, 2021
@tobiasdiez

Copy link
Copy Markdown
Member

I might be not up-to-date, but I always thought UTF8 characters are only allowed in biblatex and that bibtex only handles asci characters. Did this change?

@MrGhabi

MrGhabi commented Apr 17, 2021

Copy link
Copy Markdown
Contributor Author

I might be not up-to-date, but I always thought UTF8 characters are only allowed in biblatex and that bibtex only handles asci characters. Did this change?

Yeah, some journals and papers use non-ASCII characters as their names.. etc(just as the bib in bibtex I added before). and maybe it is difficult to do with it in jabref. The details are shown in the issue. So I think maybe it is better to trade them equally.

@tobiasdiez

Copy link
Copy Markdown
Member

I don't really have experience with say Chinese names (as authors or journals) with bibtex. But the only evidence I could find was always suggesting bibLAtex, since bibtex doesn't support UTF8, see e.g. https://tex.stackexchange.com/questions/100092/how-to-include-a-chinese-paper-in-reference-via-bibtex.

So does it make more sense to keep the asci check for bibtex, and add the new utf8 check for biblatex?

@Siedlerchr

Copy link
Copy Markdown
Member

I agree with @tobiasdiez we need the utf8 check for biblatex and the ascii checker for bibtex then.

@MrGhabi

MrGhabi commented Apr 17, 2021

Copy link
Copy Markdown
Contributor Author

I don't really have experience with say Chinese names (as authors or journals) with bibtex. But the only evidence I could find was always suggesting bibLAtex, since bibtex doesn't support UTF8, see e.g. https://tex.stackexchange.com/questions/100092/how-to-include-a-chinese-paper-in-reference-via-bibtex.

So does it make more sense to keep the asci check for bibtex, and add the new utf8 check for biblatex?

I agree with @tobiasdiez we need the utf8 check for biblatex and the ascii checker for bibtex then.

Good idea!I will refactor my code to meet this need! (After searching more information about bibtex and biblatex, I agree with you~ )

And the check of utf8 for biblatex maybe it's not a bug but an enhancement? (laugh) I will focus on it!

@MrGhabi

MrGhabi commented Apr 17, 2021

Copy link
Copy Markdown
Contributor Author

Hi Reviewers! I have added the UTF-8 check for biblatex and recovery the ASCII check for bibtex!
Is there anything I should do to give a better submission?

@Siedlerchr

Copy link
Copy Markdown
Member

So far looks good, you only need to add the new localization string the l10 files, see here for more details https://devdocs.jabref.org/getting-into-the-code/code-howtos#using-localization-correctly

@MrGhabi

MrGhabi commented Apr 17, 2021

Copy link
Copy Markdown
Contributor Author

Hi reviewers!I added this statement to all language packs, but I rely on Google Translate for most of my translations, so please double check it for errors~

@Siedlerchr

Copy link
Copy Markdown
Member

You only need to add it to the English file. All otherttranslations are managed by crowdin.

@MrGhabi

MrGhabi commented Apr 17, 2021

Copy link
Copy Markdown
Contributor Author

You only need to add it to the English file. All otherttranslations are managed by crowdin.

Emmm, so I need to subtract all the files except the English file, right?

@MrGhabi

MrGhabi commented Apr 18, 2021

Copy link
Copy Markdown
Contributor Author

I have changed that. Hope everything goes well...

@MrGhabi

MrGhabi commented Apr 19, 2021

Copy link
Copy Markdown
Contributor Author

I added the javaDoc for UTFChecker and fix a little problem in my Junit test.

String NonUTF8 = "";
try {
NonUTF8 = new String("你好,这条语句使用GBK字符集".getBytes(), "GBK");
} catch (Exception e) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simply remove that catch here and add throws Exception to the test method

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK!

add 2 Junit Test for UTF8Checker.UTF8EncodingChecker in UTF8CheckerTest

add 2 Junit Test for IntegrityCheck in IntegrityCheckTest
@MrGhabi MrGhabi requested a review from tobiasdiez April 19, 2021 12:26
@MrGhabi

MrGhabi commented Apr 19, 2021

Copy link
Copy Markdown
Contributor Author

Hi reviewers! I have added 2 Junit Test for UTF8Checker and 2 for IntegrityCheck. I'm not quite sure if these test cases are redundant and standardized, so please give me some advice if problems exist!

Comment thread src/test/java/org/jabref/logic/integrity/IntegrityCheckTest.java Outdated
@MrGhabi MrGhabi requested a review from Siedlerchr April 21, 2021 23:00
@Siedlerchr Siedlerchr merged commit 434250d into JabRef:main Apr 23, 2021
@Siedlerchr

Copy link
Copy Markdown
Member

Thanks a lot for your contribution!

Siedlerchr added a commit that referenced this pull request Apr 24, 2021
…om.tngtech.archunit-archunit-junit5-api-0.18.0

* upstream/main:
  Fix exception when searching (#7659)
  Fixes Jabref#7660 (#7663)
  Fix for issue 5850: Journal abbreviations in UTF-8 not recognized (#7639)
  Fix SSLHandshake Exception by using bypass (#7657)
  Fix for issue 7633: Unable to download arXiv pdfs if Title contains curly brackets (#7652)
  Fix#7195 partly Opacity of disabled icon-buttons
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Journal abbreviations in UTF-8 not recognized

3 participants