Closed
Description
In #273, it was suggested that each sentence in CoNLL-U should have its ID encoded in a header (comment) line in a standardized way, e.g. # sent_id = 123. This issue is about the format of the ID itself (i.e. the 123 part) and about the related question of storing parallel treebanks in CoNLL-U.
My motivation
- The CoNLL-U format should be used not only for storing UD treebanks (frozen in v1.2, 1.3 etc.) but also as a data interchange format and for various NLP tools, in all intermediate stages of the pipeline. See Validity of CoNLL-U #242.
- I would like to store parallel treebanks with word alignment in CoNLL-U format. For many reasons (e.g. efficient parallel processing, serialization, streaming, consistency and alignment) it is useful to have all the languages in one file (interleaved as: sent1-langA, sent1-langB, sent2-langA, sent2-langB etc.). We plan to release the Czech-English treebank CzEng 1.6 with 62M sentences in this format. See a sample.
  By parallel treebanks I mean not only different languages and paraphrases, but also alternative annotations of the same sentence, e.g. gold and automatic.
- I would like to store word-alignment and coreference links (and possibly other types of relations) in CoNLL-U files. Coreference can go across sentences. This has some consequences for IDs. I plan to open a separate issue for this soon.
- I would like to keep the CoNLL-U format simple (not bloated like CoNLL2009).
My proposal
In short: bundle_id/zone
An example of a valid sent_id is f123-s9/en_udpipe.
- The first part (f123-s9) is called bundle_id. In parallel treebanks it is shared by all translations of the same sentence (which together form a so-called bundle). The internal structure of bundle_id can reflect the original treebank numbering, e.g. here f123 is the filename and s9 is the 9th bundle in that file. I suggest that the bundle_id format be restricted by the regex [a-zA-Z0-9_-]+. We can make it less strict if needed for some legacy data, but it should contain neither whitespace nor a slash.
- The second part (en_udpipe) is the so-called zone. It can be omitted in treebanks where each bundle has just one zone (so the zone is an empty string). If present, it must be separated from the bundle_id by a slash and it must match the regex ^[a-z-]+(_[a-zA-Z0-9-]+)?$. The internal structure of zone is language_selector, where the _selector part is optional.
  - language is an ISO 639 (or rather IETF) language code.
  - selector is any string (^[a-zA-Z0-9-]+$), which allows storing parallel sentences in the same language. E.g. udpipe indicates that the tree was parsed using UDPipe. Another example: the selectors ref and mt may distinguish a reference translation from a machine translation.
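To make the proposed structure concrete, here is a minimal Python sketch of how a tool might split and validate a sent_id using the regexes above. The names parse_sent_id and SentId are hypothetical (they are not part of any existing library), and treating a trailing slash with an empty zone as an error is my assumption, since the proposal only says the zone may be omitted.

```python
import re
from typing import NamedTuple, Optional

# Regexes taken from the proposal; fullmatch() anchors them at both ends.
BUNDLE_RE = re.compile(r"[a-zA-Z0-9_-]+")
ZONE_RE = re.compile(r"[a-z-]+(_[a-zA-Z0-9-]+)?")


class SentId(NamedTuple):
    """Hypothetical container for the parts of a sent_id."""
    bundle_id: str
    language: Optional[str]   # None when the zone is omitted
    selector: Optional[str]   # None when the zone has no _selector part


def parse_sent_id(sent_id: str) -> SentId:
    """Split 'bundle_id/zone' and check both parts against the proposed regexes."""
    bundle_id, slash, zone = sent_id.partition("/")
    if not BUNDLE_RE.fullmatch(bundle_id):
        raise ValueError(f"invalid bundle_id: {bundle_id!r}")
    if not slash:
        # No zone at all: a simple ID such as '123' is still valid.
        return SentId(bundle_id, None, None)
    # Assumption: 'bundle/' (slash but empty zone) is rejected here.
    if not ZONE_RE.fullmatch(zone):
        raise ValueError(f"invalid zone: {zone!r}")
    # The selector cannot contain '_', so the first '_' separates the two parts.
    language, _, selector = zone.partition("_")
    return SentId(bundle_id, language, selector or None)
```

For example, parse_sent_id("f123-s9/en_udpipe") yields bundle_id f123-s9, language en and selector udpipe, while a plain integer ID such as 123 is accepted with no zone.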
Notes
I know not everyone needs to work with (multi-) parallel treebanks stored in one file, so this proposal may sound too complex. However, note that
- You can use simple IDs (e.g. integers) as sent_id and just one language (one zone) per file. It is still valid according to the proposal.
- I think IDs should be optional in CoNLL-U (though I would like to see them in all UD v2 treebanks). All UD-compatible tools should handle files without IDs. This proposal is just for those who need IDs, so that they can use them in the same standardized way, which allows interoperability.
- We have a real need for such a format (e.g. releasing the CzEng treebank in CoNLL-U, evaluation and visualization tools, an MT system).
- We are working on a Python+Perl+Java API for UD called Udapi, which benefits from the proposal and also makes it easy to use (e.g. extract trees from one zone and store in a separate file). We want to invite the UD community to contribute to Udapi soon.