Annotated crossref citation training data #854
Trying to reduce incidents of non-URL tokens getting added to start or end of URLs.
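As a rough illustration of what "narrowing" a URL span means, here is a minimal Python sketch that trims leading and trailing non-URL tokens from an annotated span. The helper name `narrow_url_span`, the scheme pattern, and the set of trailing punctuation are all assumptions made for this example; they are not taken from GROBID or from the training data in this PR.

```python
import re

# Hypothetical heuristics, for illustration only.
_URL_START = re.compile(r"(?:https?://|www\.)", re.IGNORECASE)
_TRAILING_PUNCT = ".,;:)]}>\"'"


def narrow_url_span(span: str) -> str:
    """Return the span with non-URL tokens removed from both ends."""
    match = _URL_START.search(span)
    if match is None:
        # No recognizable URL start: only trim surrounding whitespace.
        return span.strip()
    # Drop everything before the URL, then keep the first whitespace-free token.
    url = span[match.start():].split()[0]
    # Drop sentence punctuation that got attached to the end of the URL.
    return url.rstrip(_TRAILING_PUNCT)


print(narrow_url_span("Available at: http://example.org/page."))
```

Stripping a trailing `)` is a simplification: some legitimate URLs, Wikipedia article URLs among them, can end with a closing parenthesis, so a real cleanup pass would need to check for balanced parentheses first.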
…red' references These are hand-labeled from these notes: https://gist.github.com/bnewbold/b437e363e6a0429719c65c751babe84d
Yes, prefixes are used to identify the identifier type, and the same goes for the trailing section. Currently, we extract identifiers in general and then use regexes to identify the type of the identifier and normalize it. For the moment there is not enough training data to classify the identifier more directly, either during sequence labeling (we don't want to multiply labels and end up too imbalanced, for accuracy reasons) or by classifying the extracted content (and we would still need to normalize the identifier afterwards, so regexes are hard to avoid).
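To make the "extract first, then classify and normalize with regex" step concrete, here is a minimal Python sketch. The patterns, the function name `classify_identifier`, and the choice to lowercase DOIs are assumptions for illustration; they are not GROBID's actual rules, and the arXiv pattern only covers new-style identifiers with an optional `[archive.XX]` section tag.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns, for illustration only; not GROBID's actual rules.
_ARXIV = re.compile(
    r"^(?:arxiv:\s*)?(\d{4}\.\d{4,5}(?:v\d+)?)(?:\s*\[[a-z\-]+\.[A-Z]{2}\])?$",
    re.IGNORECASE,
)
_DOI = re.compile(
    r"^(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def classify_identifier(raw: str) -> Tuple[Optional[str], str]:
    """Return (identifier type, normalized value) for an extracted identifier string."""
    text = raw.strip()
    match = _ARXIV.match(text)
    if match:
        # The prefix and trailing section tag only help recognize the type;
        # the normalized value keeps just the bare identifier.
        return "arxiv", match.group(1)
    match = _DOI.match(text)
    if match:
        # DOIs are case-insensitive, so lowercasing is one possible normalization.
        return "doi", match.group(1).lower()
    # Unknown type: return the trimmed text unchanged.
    return None, text


print(classify_identifier("arXiv:1706.03762 [cs.LG]"))
```

In this sketch the `arxiv:` prefix and the `[cs.LG]` section are consumed by the pattern but dropped from the normalized value, which mirrors the idea that they serve to identify the type rather than being part of the identifier itself.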
This is covered by the
Authors are normally only human persons, in the sense of the Berne Convention. Although it is still common to see an organization as author in the US (like the "author organization" field in MARC), under the Berne Convention a creative work must normally always have author/moral rights associated with humans, who have to be acknowledged.
I think "in principle", For preprint I think using |
grobid-trainer/resources/dataset/citation/corpus/crossref_raw_citations.xml
grobid-trainer/resources/dataset/name/citation/corpus/crossref_raw_citations.xml
I submitted a review, in case it is not visible.
I pushed an update, and resolved all but one of the review comments (for a citation to a thesis, whether to use
Thanks a lot @bnewbold! The updated doc looks very good and clear to me.
@bnewbold If you plan to add more annotated data, do you prefer to do it in this PR, or should I merge it?
This is my first batch of annotated citations and names. I should note that while citations to webpages and Wikipedia are somewhat unusual and non-traditional in general, they are of particular interest to me.
Some particular things I wasn't sure about:
- Should the `arxiv:` prefix be included in the tag, or outside of it? What about the trailing "arxiv section" tag, like `[cs.LG]`?
- `type="m"` for things like web page titles, or if it was ambiguous whether the thing being cited was a book. I did use `type="a"` for arxiv.org pre-prints, even though these are not in a "journal" in the usual sense.

Also made some small tweaks to the existing `citations.xml` training data, to narrow tags around URLs.

Closes: #847