Copyrights owner and licenses identification models #1078
Merged
I've done some tests, and updated the Grobid.odd to add the copyrightOnwners. I've added some tests (one is failing, I'm not sure it's a bug).
@lfoppiano I changed |
return processHeaderDocumentReturnXml_post(inputStream, consolidate, includeRawAffiliations);
@DefaultValue("0") @FormDataParam(INCLUDE_RAW_AFFILIATIONS) String includeRawAffiliations,
@DefaultValue("0") @FormDataParam(INCLUDE_RAW_COPYRIGHTS) String includeRawCopyrights) {
return processHeaderDocumentReturnXml_post(inputStream, consolidate, includeRawAffiliations, includeRawCopyrights);
Check warning
Code scanning / CodeQL
Information exposure through a stack trace
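The new `includeRawCopyrights` form parameter arrives as a string with default `"0"`, so it has to be turned into a boolean somewhere before it reaches the header processing. A minimal sketch of that conversion follows; the class and method names (`RawCopyrightsFlag`, `parseFlag`) are hypothetical and not taken from this PR, and treating only `"1"` as true is an assumption based on the `"0"` default shown in the diff.

```java
public class RawCopyrightsFlag {

    /**
     * Hypothetical helper: interprets a "0"/"1" form parameter as a boolean.
     * Only "1" (ignoring surrounding whitespace) enables the option;
     * null, "0" and any other value leave it disabled.
     */
    public static boolean parseFlag(String value) {
        return value != null && value.trim().equals("1");
    }

    public static void main(String[] args) {
        // The default value from the diff ("0") keeps the raw section out.
        System.out.println(parseFlag("0"));
        // A client opting in would send includeRawCopyrights=1.
        System.out.println(parseFlag("1"));
    }
}
```

Defaulting unknown or missing values to false keeps existing clients unaffected, since they never send the new parameter.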
Update of XML schema (also for the latest Pub2TEI version) -> #1084
This PR integrates two new models: one to identify the copyright owner of a document (publisher, authors or unknown) and one to identify the license, if provided, under which the document file is shared (e.g. CC-BY, CC-BY-NC, etc.). The models currently only work if the `delft` engine is selected. If this engine is not selected, the identification is skipped.

In the TEI, the result is serialized as follows (the example is https://peerj.com/articles/cs-1022/):

To encode the copyright owner, we use an attribute `@resp` ("responsible party") and add a comment explaining how to interpret it. Note that the standard `@resp` in TEI should be a pointer; here we restrict it to two possible values to avoid overcomplicating it. When the copyright owner is undecided by the classifier or unknown, there is no `@resp` attribute on the `<availability>` element.

In addition, the service now includes a boolean parameter `includeRawCopyrights` that controls whether the `<availability>` part includes the full copyright/license section that has been extracted (under an added element `<p type="raw">`). This section is used by the classifier to determine the copyright owner and the license.

To have it working, edit `grobid-home/config/grobid.yaml` to indicate `delft` as the engine for the two new models:

Latest evaluations:
TODO:
@resp
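The `@resp` and `<p type="raw">` rules from the PR description can be sketched as follows. This is an illustration only, not the serialization code of this PR: the class `AvailabilitySketch`, the `Owner` enum and the attribute values `publisher` and `authors` are assumptions inferred from the description, and the license element is left out because its exact encoding is not shown here.

```java
public class AvailabilitySketch {

    /** Copyright owner classes named in the PR description (enum is hypothetical). */
    public enum Owner { PUBLISHER, AUTHORS, UNDECIDED }

    /** Minimal XML escaping for the raw text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Builds an <availability> element:
     * - @resp is written only when the owner is decided (assumed values: publisher, authors);
     * - the raw copyright/license section is added as <p type="raw"> only when requested.
     */
    public static String toAvailability(Owner owner, String rawText, boolean includeRaw) {
        StringBuilder sb = new StringBuilder("<availability");
        if (owner == Owner.PUBLISHER) {
            sb.append(" resp=\"publisher\"");
        } else if (owner == Owner.AUTHORS) {
            sb.append(" resp=\"authors\"");
        }
        sb.append(">");
        if (includeRaw && rawText != null && !rawText.isEmpty()) {
            sb.append("<p type=\"raw\">").append(escape(rawText)).append("</p>");
        }
        sb.append("</availability>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAvailability(Owner.AUTHORS, "Copyright 2022 A & B", true));
        System.out.println(toAvailability(Owner.UNDECIDED, "Copyright 2022 A & B", false));
    }
}
```

Omitting the attribute for the undecided case, rather than writing a third value, mirrors the description's choice to keep `@resp` limited to two values.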