Description
We would like to use CoNLL-U as a data interchange format and define two levels of "validity". The first level (analogous to XML well-formedness) means that any CoNLL-U-conforming tool can read the file. The second level (analogous to DTD validation) means that the file strictly conforms to the UD definition for a given language/treebank (including the sets of language-specific deprel, feats and misc values).
- The CoNLL-U specification does not distinguish any levels of validity.
- It does not link to the validate.py tool.
- It does not mention that only unix-style LF newlines are allowed (but validate.py checks it).
- It does not explicitly forbid UPOS=_ (but validate.py does not allow underscores as UPOS except for multiword tokens).
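To make the last two points concrete, here is a minimal sketch of what such first-level checks could look like. It is not the validate.py implementation; the function name `check_conllu_basics` and the message wording are made up for illustration, and it only assumes the standard 10-column layout with UPOS in the fourth column and multiword tokens identified by a range ID such as `1-2`.

```python
def check_conllu_basics(data: bytes) -> list[str]:
    """Return human-readable problems found in raw CoNLL-U bytes.

    Hypothetical helper for illustration only; it covers just two checks:
    LF-only newlines and '_' as UPOS on non-multiword tokens.
    """
    problems = []
    # Any carriage return means the file does not use pure LF newlines.
    if b"\r" in data:
        problems.append("file contains CR characters; only LF newlines are expected")
    text = data.decode("utf-8")
    for lineno, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")
        # Skip blank lines (sentence separators) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(
                f"line {lineno}: expected 10 tab-separated columns, found {len(cols)}"
            )
            continue
        token_id, upos = cols[0], cols[3]
        # Multiword token lines have a range ID such as "1-2".
        is_multiword = "-" in token_id
        if upos == "_" and not is_multiword:
            problems.append(f"line {lineno}: UPOS is '_' on a non-multiword token")
    return problems


if __name__ == "__main__":
    sample = b"# text = Hello\n1\tHello\thello\t_\t_\t_\t0\troot\t_\t_\n\n"
    for problem in check_conllu_basics(sample):
        print(problem)
```

Whether an underscore UPOS should count as a first-level error or only a second-level (UD-specific) one is exactly the question raised here, so the sketch reports it as a problem only to mirror the current validate.py behaviour described above.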
As for the last point, I thought that UPOS=X means a word that cannot be assigned any other part-of-speech category (e.g. code switching), whereas an underscore means an unknown/empty value and should be used, e.g., if no tagger (or human annotator) has been applied (yet).
Once the CoNLL-U specification page becomes too long, it would probably be good to move the details to a separate page, so that the main page stays brief.