Annotated crossref citation training data by bnewbold · Pull Request #854 · grobidOrg/grobid

bnewbold · 2021-11-09T03:13:41Z

This is my first batch of citations and names annotated. I should note that while citations to webpages, and Wikipedia, are sort of weird and non-traditional in general, they are of particular interest to me.

Some particular things I wasn't sure about:

should the arxiv: prefix be included in the identifier tag, or outside of it? what about the trailing "arxiv section" tag, like [cs.LG]?
for authors which are organizations, more like a consortium and less like a publisher, how should the name be tagged? For example, "Wikipedia" and "OpenAI"
following the annotation guide, I used title type="m" for things like web page titles, or if it was ambiguous if the thing being cited was a book. I did use type="a" for arxiv.org pre-prints, even though these are not in a "journal" in the usual sense

Also made some small tweaks to existing citations.xml training data, to narrow tags around URLs.

Closes: #847


kermitt2 · 2021-11-13T17:18:40Z

> should the arxiv: prefix be included in the identifier tag, or outside of it? what about the trailing "arxiv section" tag, like [cs.LG]?

Yes, prefixes are used to identify the identifier type, and the same goes for the trailing section. Currently, we just extract identifiers in general, and then use regexes to identify the type of each identifier and normalize it. For the moment there's not enough training data to classify the identifier more directly, either during sequence labeling (we don't want to multiply labels and end up too imbalanced, for accuracy reasons) or by classifying the extracted content (and we would still need to normalize the identifier afterwards, so regexes are hard to avoid).
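As a rough illustration of the "extract first, then classify and normalize with regex" step described above, here is a minimal Python sketch. The patterns, the `classify_identifier` name, and the returned fields are all assumptions made up for this example; they are not GROBID's actual regexes or API, and they only cover new-style arXiv identifiers and DOIs.

```python
import re

# Hypothetical patterns for illustration only; not GROBID's actual regexes.
# New-style arXiv id, with optional "arXiv:" prefix, version, and trailing section.
ARXIV_NEW = re.compile(
    r"^(?:arxiv:\s*)?(\d{4}\.\d{4,5})(v\d+)?"
    r"(?:\s*\[([a-z\-]+(?:\.[a-z\-]+)?)\])?$",
    re.IGNORECASE,
)
# DOI, with optional "doi:" prefix or doi.org resolver URL.
DOI = re.compile(
    r"^(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def classify_identifier(raw: str) -> dict:
    """Guess the type of an already-extracted identifier string and normalize it."""
    text = raw.strip()
    m = ARXIV_NEW.match(text)
    if m:
        number, version, section = m.group(1), m.group(2) or "", m.group(3)
        return {"type": "arxiv", "value": number + version, "section": section}
    m = DOI.match(text)
    if m:
        # Lowercasing is one possible normalization choice (DOIs are case-insensitive).
        return {"type": "doi", "value": m.group(1).lower(), "section": None}
    return {"type": "unknown", "value": text, "section": None}


arxiv_result = classify_identifier("arXiv:1706.03762v5 [cs.CL]")
doi_result = classify_identifier("https://doi.org/10.1000/ABC123")
other_result = classify_identifier("ISBN 978-3-16-148410-0")
```

The sketch shows why keeping the prefix and the trailing section inside the labeled span is useful: the downstream regex can rely on them to decide the identifier type before stripping them during normalization.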

> for authors which are organizations, more like a consortium and less like a publisher, how should the name be tagged? For example, "Wikipedia" and "OpenAI"

This is covered by the <orgname type="collaboration"> label in an academic context. It was introduced typically for HEP and astronomy collaborations, and I also used it for working groups in standardization organizations like the IETF. It's seen as a kind of alias for a group of human authors (e.g. all the members of the group/consortium are authors).
Otherwise, for companies or more traditional organizations (really legal entities) that "host" the work (like Wikipedia, OpenAI, or the IETF), we normally use just <orgname> and no author if no human authors are provided.

Authors are normally only human persons, in the sense of the Berne Convention. Although it's still common to see an organization as author in the US (like the "author organization" field in MARC), under the Berne Convention a creative work must normally always have author/moral rights associated with humans, who have to be acknowledged.
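To make the distinction above concrete, here is a small Python sketch that reads a few annotated references and reports who is credited in each. The sample references are invented for illustration, the `credited_party` helper is hypothetical, and the `<orgName>` spelling follows the normalization commit in this PR; none of this is taken from the actual training files.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated references, made up for illustration.
SAMPLE = """<listBibl>
<bibl><author>J. Doe and A. Smith</author>. Some paper.</bibl>
<bibl><orgName type="collaboration">ATLAS Collaboration</orgName>. Some paper.</bibl>
<bibl><orgName>OpenAI</orgName>. Some web page.</bibl>
</listBibl>"""


def credited_party(bibl: ET.Element) -> str:
    """Classify who is credited in one annotated reference."""
    if bibl.find("author") is not None:
        return "authors"
    org = bibl.find("orgName")
    if org is None:
        return "none"
    if org.get("type") == "collaboration":
        # Treated as an alias for a group of human authors.
        return "collaboration"
    # A legal entity that hosts the work; no author is annotated.
    return "organization"


root = ET.fromstring(SAMPLE)
kinds = [credited_party(b) for b in root.findall("bibl")]
```

Under these assumptions, a collaboration stands in for its human members, while a plain organization is recorded without any author.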

> following the annotation guide, I used title type="m" for things like web page titles, or if it was ambiguous if the thing being cited was a book. I did use type="a" for arxiv.org pre-prints, even though these are not in a "journal" in the usual sense

I think that "in principle" type="m" is for citing a website title and type="a" is for citing the title of a web page that belongs to a website. I'm not sure it's very easy to use in practice.

For preprints, I think using type="a" is the expected usage (it's a publication "unit" in arXiv).

grobid-trainer/resources/dataset/citation/corpus/crossref_raw_citations.xml

grobid-trainer/resources/dataset/name/citation/corpus/crossref_raw_citations.xml

kermitt2 · 2021-11-13T18:47:32Z

I submitted a review, in case it is not visible.


bnewbold · 2021-11-15T23:18:31Z

I pushed an update and resolved all but one of the review comments (for a citation to a thesis, whether to use <publisher> or <orgName>). I also updated the docs with some clarifications; those changes should be reviewed.

kermitt2 · 2021-11-18T08:06:57Z

Thanks a lot @bnewbold! The updated doc looks very good and clear to me.

kermitt2 · 2021-11-18T08:23:38Z

@bnewbold If you plan to add more annotated data, do you prefer to do it in this PR, or should I merge it?
I should be able to update the models in between.

bnewbold · 2021-11-22T18:58:56Z

@kermitt2 this is all I have for this specific batch. @miku is working on annotating a separate batch and will submit those for review in a separate branch.

I would propose we merge this PR as-is now, but hold off on retraining for a week?

kermitt2 · 2021-11-23T03:27:32Z

Thank you @bnewbold and @miku !


bnewbold added 2 commits November 8, 2021 19:04

citations.xml: small URL labeling adjustments

235dcea

Trying to reduce incidents of non-URL tokens getting added to start or end of URLs.

add citation and name-citation training date from crossref 'unstructured' references

2daad1d

These are hand-labeled from these notes: https://gist.github.com/bnewbold/b437e363e6a0429719c65c751babe84d

bnewbold mentioned this pull request Nov 9, 2021

GROBID citation parsing issues, for possible training #847

Closed

kermitt2 requested changes Nov 13, 2021

bnewbold added 3 commits November 15, 2021 15:03

updates to reference annotation docs

d02420e

Based on PR review

bioRxiv training refs: normalize <orgname> -> <orgName>

ca379f7

update citation annotations based on review

c64b1fe

kermitt2 approved these changes Nov 18, 2021

back to <orgName> for university of disseration

49c14ac

kermitt2 merged commit 938d12a into grobidOrg:master Nov 23, 2021

miku added a commit to miku/grobid that referenced this pull request Nov 25, 2021

add citations training data from crossref unstructured references

b3c19a3

A follow up on grobidOrg#854.

miku mentioned this pull request Nov 25, 2021

add citations training data from crossref unstructured references #864

Closed

kermitt2 merged 6 commits into grobidOrg:master from bnewbold:crossref-citation-training
