Annotated crossref citation training data #854

Merged
kermitt2 merged 6 commits into grobidOrg:master from bnewbold:crossref-citation-training on Nov 23, 2021

@bnewbold
Contributor

bnewbold commented Nov 9, 2021

This is my first batch of annotated citations and names. I should note that while citations to webpages and Wikipedia are somewhat unusual and non-traditional in general, they are of particular interest to me.

Some particular things I wasn't sure about:

  • should arxiv: prefix be included in , or outside of it? what about the trailing "arxiv section" tag, like [cs.LG]?
  • for authors which are organizations, more like a consortium and less like a publisher, how should the name be tagged? For example, "Wikipedia" and "OpenAI"
  • following the annotation guide, I used title type="m" for things like web page titles, or if it was ambiguous if the thing being cited was a book. I did use type="a" for arxiv.org pre-prints, even though these are not in a "journal" in the usual sense

Also made some small tweaks to existing citations.xml training data, to narrow tags around URLs.

Closes: #847

Trying to reduce incidents of non-URL tokens getting added to start or
end of URLs.
@kermitt2
Collaborator

kermitt2 commented Nov 13, 2021

> should arxiv: prefix be included in , or outside of it? what about the trailing "arxiv section" tag, like [cs.LG]?

Yes, prefixes are used to identify the identifier type, and the same goes for the trailing section. Currently, we extract identifiers in general, then use regexes to identify the type of each identifier and normalize it. There is not enough training data for the moment to classify the identifier more directly, either during sequence labeling (we don't want to multiply labels and make them too imbalanced, for accuracy reasons) or by classifying the extracted content (and we would still need to normalize the identifier afterwards, so regexes are hard to avoid).
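
As a rough illustration of the regex step described above, here is a minimal Python sketch that takes an already extracted identifier string, guesses its type from the prefix, and normalizes it. The patterns, the function name `classify_identifier`, and the sample identifiers are assumptions made for this example; they are not GROBID's actual regexes or API.

```python
import re

# Hypothetical patterns for illustration only; GROBID's real regexes may differ.
# arXiv: "arxiv:" prefix, new-style ID, optional version, optional "[section]" tag.
ARXIV_RE = re.compile(
    r"^arxiv:\s*(?P<id>\d{4}\.\d{4,5}(?:v\d+)?)"
    r"(?:\s*\[(?P<section>[\w.-]+)\])?$",
    re.IGNORECASE,
)
# DOI: optional "doi:" prefix or doi.org resolver URL, then "10.<registrant>/<suffix>".
DOI_RE = re.compile(
    r"^(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?(?P<id>10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def classify_identifier(raw):
    """Return (type, normalized_value) for an extracted identifier string."""
    text = raw.strip()
    match = ARXIV_RE.match(text)
    if match:
        # Drop the prefix and the trailing section tag, keep the bare ID.
        return ("arxiv", match.group("id"))
    match = DOI_RE.match(text)
    if match:
        # DOIs are case-insensitive, so lowercasing is one possible normalization.
        return ("doi", match.group("id").lower())
    return ("unknown", text)
```

With these assumed patterns, an identifier annotated together with its prefix and section tag, such as `arXiv:1706.03762 [cs.CL]`, can still be typed and reduced to the bare ID after extraction, which is why including both inside the annotated span is helpful rather than harmful.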

> for authors which are organizations, more like a consortium and less like a publisher, how should the name be tagged? For example, "Wikipedia" and "OpenAI"

This is covered by the <orgName type="collaboration"> label in an academic context. It was introduced typically for HEP and astronomy collaborations, and I also used it for working groups in standardization organizations like IETF, etc. It is seen as a kind of alias for a group of human authors (e.g. all the members of the group/consortium are authors).
Otherwise, for companies or more traditional organizations (really legal entities) that "host" the work (like Wikipedia, OpenAI, or IETF), we normally use just <orgName> and no author if no human authors are provided.

Authors are normally only human persons in the sense of the Berne Convention. Although it is still common in the US to see an organization as author (like the "author organization" field in MARC), under the Berne Convention a creative work must normally always have author/moral rights associated with humans, who have to be acknowledged.

> following the annotation guide, I used title type="m" for things like web page titles, or if it was ambiguous if the thing being cited was a book. I did use type="a" for arxiv.org pre-prints, even though these are not in a "journal" in the usual sense

I think that, in principle, type="m" is for citing a website title and type="a" is for citing the title of a web page that belongs to a website. I'm not sure this is very easy to apply in practice.

For preprints, I think using type="a" is the expected usage (a preprint is a publication "unit" within arXiv).

@kermitt2
Collaborator

I submitted a review, in case it is not visible.

@bnewbold
Contributor Author

I pushed an update, and resolved all but one of the review comments (for a citation to a thesis, whether to use <publisher> or <orgName>). I also updated the docs with some clarifications; those changes should be reviewed.

@kermitt2
Collaborator

Thanks a lot @bnewbold! The updated doc looks very good and clear to me.

@kermitt2
Collaborator

@bnewbold If you plan to add more annotated data, would you prefer to do it in this PR, or should I merge it?
I should be able to update the models in between.

@bnewbold
Contributor Author

@kermitt2 this is all I have for this specific batch. @miku is working on annotating a separate batch and will submit those for review in a separate branch.

I would propose we merge this PR as-is now, but hold off on retraining for a week?

@kermitt2
Collaborator

Thank you @bnewbold and @miku!

@kermitt2 kermitt2 merged commit 938d12a into grobidOrg:master Nov 23, 2021
miku added a commit to miku/grobid that referenced this pull request Nov 25, 2021

Development

Successfully merging this pull request may close these issues.

GROBID citation parsing issues, for possible training
