sent_id format and parallel treebanks

In #273, it was suggested that each sentence in CoNLL-U should have its ID encoded in a header (comment) in a standardized way, e.g. `# sent_id = 123`. This issue is about the format of the ID itself (i.e. the `123` part) and about the related question of storing parallel treebanks in CoNLL-U.
## My motivation
- The CoNLL-U format should be used not only for storing UD treebanks (frozen in v1.2, 1.3 etc.) but also as a data interchange format and for various NLP tools, in all intermediate stages of the pipeline. See #242.
- I would like to store parallel treebanks with word alignment in CoNLL-U format. For many reasons (e.g. efficient parallel processing, serialization, streaming, consistency and alignment) it is useful to have all the languages in one file, interleaved as sent1-langA, sent1-langB, sent2-langA, sent2-langB etc. We plan to release the Czech-English treebank CzEng 1.6 with 62M sentences in this format. See [a sample](http://ufallab.ms.mff.cuni.cz/~popel/czeng1.6-sample.conllu).
  By parallel treebanks I mean not only different languages and paraphrases, but also alternative annotations of the same sentence, e.g. gold and automatic.
- I would like to store word-alignment and coreference links (and possibly other types of relations) in CoNLL-U files. Coreference links can cross sentence boundaries, which has some consequences for IDs. I plan to open a separate issue for this soon.
- I would like to keep the CoNLL-U format simple (not bloated like CoNLL2009).
## My proposal

In short: **bundle_id/zone**

An example of a valid sent_id is `f123-s9/en_udpipe`.
- The first part (`f123-s9`) is called **bundle_id**. In parallel treebanks it is shared by all translations of the same sentence (which together form a so-called bundle). The internal structure of bundle_id can reflect the original treebank numbering, e.g. here f123 is the filename and s9 is the 9th bundle in that file. I suggest restricting the bundle_id format to the regex `[a-zA-Z0-9_-]+`. We can make it less strict if needed for some legacy data, but it must not contain whitespace or a slash.
- The second part (`en_udpipe`) is the so-called **zone**. It can be omitted in treebanks where each bundle has just one zone (the zone is then an empty string). If present, it must be separated from the bundle_id by a slash and it must match the regex `^[a-z-]+(_[a-zA-Z0-9-]+)?$`. The internal structure of zone is **language_selector**, where the _selector part is optional.
- **language** is an ISO 639 (or rather [IETF](https://en.wikipedia.org/wiki/IETF_language_tag)) language code.
- **selector** is any string matching `^[a-zA-Z0-9-]+$`. It makes it possible to store parallel sentences in the same language. E.g. `udpipe` indicates that the tree was parsed using [UDPipe](http://ufal.mff.cuni.cz/udpipe). Another example: the selectors `ref` and `mt` may distinguish a reference translation from a machine translation.
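To make the rules above concrete, here is a minimal Python sketch of how a tool might split and validate a sent_id. The names `SentId` and `parse_sent_id` are hypothetical (they are not part of any existing library), and rejecting a trailing slash with an empty zone is my assumption, since the proposal does not say how to treat it.

```python
import re
from typing import NamedTuple, Optional

# Regexes taken from the proposal; fullmatch() anchors them at both ends.
BUNDLE_RE = re.compile(r"[a-zA-Z0-9_-]+")
ZONE_RE = re.compile(r"([a-z-]+)(?:_([a-zA-Z0-9-]+))?")


class SentId(NamedTuple):
    """Hypothetical container for the parts of a sent_id."""
    bundle_id: str
    language: Optional[str]  # None when the zone is omitted
    selector: Optional[str]  # None when the zone has no _selector part


def parse_sent_id(sent_id: str) -> SentId:
    """Split 'bundle_id/zone' and validate both parts."""
    bundle_id, slash, zone = sent_id.partition("/")
    if not BUNDLE_RE.fullmatch(bundle_id):
        raise ValueError(f"invalid bundle_id: {bundle_id!r}")
    if not slash:
        # No slash at all: a simple ID with an empty zone.
        return SentId(bundle_id, None, None)
    match = ZONE_RE.fullmatch(zone)
    if match is None:
        # Assumption: 'id/' (slash followed by nothing) is treated as an error.
        raise ValueError(f"invalid zone: {zone!r}")
    return SentId(bundle_id, match.group(1), match.group(2))


if __name__ == "__main__":
    print(parse_sent_id("f123-s9/en_udpipe"))
    print(parse_sent_id("123"))
```

Because the bundle_id may not contain a slash, splitting on the first slash is enough; a second slash ends up in the zone and is rejected by the zone regex.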
## Notes

I know not everyone needs to work with (multi-)parallel treebanks stored in one file, so this proposal may sound too complex. However, note that:
- You can use simple IDs (e.g. integers) as sent_id and just one language (one zone) per file. It is still valid according to the proposal.
- I think IDs should be optional in CoNLL-U (though I would like to see them in all UD v2 treebanks). All UD-compatible tools should handle files without IDs. This proposal is just for those who need IDs, so that they use them in the same standardized way, which allows interoperability.
- We have a real need for such a format (e.g. releasing the CzEng treebank in CoNLL-U, evaluation and visualization tools, an MT system).
- We are working on a Python+Perl+Java API for UD called [Udapi](http://udapi.github.io/), which benefits from the proposal and also makes it easy to use (e.g. extract trees from one zone and store in a separate file). We want to invite the UD community to contribute to Udapi soon.
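As an illustration of why a standardized zone makes such processing easy, here is a rough plain-Python sketch that extracts one zone from an interleaved file. It does not use Udapi and is not its API; the function names and the tiny `SAMPLE` data are made up for this example, and it assumes sentences are separated by blank lines and carry a `# sent_id = ...` comment.

```python
import re
from typing import Iterator, List

# Assumed comment layout, following the '# sent_id = 123' example above.
SENT_ID_RE = re.compile(r"#\s*sent_id\s*=\s*(\S+)")


def split_sentences(text: str) -> Iterator[List[str]]:
    """Yield each sentence as a list of lines (blank lines separate sentences)."""
    block: List[str] = []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def zone_of(block: List[str]) -> str:
    """Return the zone of a sentence, or '' if it has no sent_id or no zone."""
    for line in block:
        match = SENT_ID_RE.match(line)
        if match:
            return match.group(1).partition("/")[2]
    return ""


def extract_zone(text: str, zone: str) -> str:
    """Keep only the sentences whose sent_id belongs to the given zone."""
    kept = [b for b in split_sentences(text) if zone_of(b) == zone]
    return "".join("\n".join(b) + "\n\n" for b in kept)


# Made-up two-bundle sample, interleaved as sent1-en, sent1-cs, sent2-en, sent2-cs.
SAMPLE = (
    "# sent_id = f1-s1/en\n1\tHello\t_\t_\t_\t_\t_\t_\t_\t_\n\n"
    "# sent_id = f1-s1/cs\n1\tAhoj\t_\t_\t_\t_\t_\t_\t_\t_\n\n"
    "# sent_id = f1-s2/en\n1\tThanks\t_\t_\t_\t_\t_\t_\t_\t_\n\n"
    "# sent_id = f1-s2/cs\n1\tDíky\t_\t_\t_\t_\t_\t_\t_\t_\n\n"
)

if __name__ == "__main__":
    print(extract_zone(SAMPLE, "cs"), end="")
```

The extracted sentences keep their full sent_id, so the bundle_id still links each tree back to its translations in the other zones.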

