Validity of CoNLL-U #242

@martinpopel

We would like to use CoNLL-U as a data interchange format and define two levels of "validity". The first level (analogous to XML well-formedness) means that any CoNLL-U-conforming tool can read the file. The second level (analogous to DTD validation) means that the file strictly conforms to the UD definition for a given language/treebank (including the sets of language-specific deprel, feats and misc values).
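To make the two levels concrete, here is a minimal sketch of how a tool might report them. It is not how validate.py works: the names `Validity`, `token_rows` and `validity_level` are hypothetical, the level-1 test is reduced to "every token line has 10 tab-separated columns", and the level-2 test is reduced to checking DEPREL against a caller-supplied set standing in for a treebank's language-specific definition.

```python
from enum import Enum


class Validity(Enum):
    INVALID = 0      # not readable as CoNLL-U at all
    WELL_FORMED = 1  # level 1: any CoNLL-U-conforming tool can read it
    VALID = 2        # level 2: also conforms to the treebank's label sets


def token_rows(text):
    """Yield column lists for token lines, skipping comments and blank lines."""
    for line in text.split("\n"):
        if line and not line.startswith("#"):
            yield line.split("\t")


def validity_level(text, allowed_deprels):
    """Return the highest validity level the text reaches (simplified)."""
    rows = list(token_rows(text))
    # Level 1 (simplified): every token line has exactly 10 columns.
    if any(len(cols) != 10 for cols in rows):
        return Validity.INVALID
    # Level 2 (simplified): every non-empty DEPREL (8th column) must be
    # in the set assumed to be defined for this language/treebank.
    for cols in rows:
        deprel = cols[7]
        if deprel != "_" and deprel not in allowed_deprels:
            return Validity.WELL_FORMED
    return Validity.VALID
```

A real level-2 check would also cover feats and misc values; the point here is only that a file can pass level 1 while failing level 2.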

  • The CoNLL-U specification does not distinguish any levels of validity.
  • It does not link the validate.py tool.
  • It does not mention that only unix-style LF newlines are allowed (but validate.py checks it).
  • It does not explicitly forbid UPOS=_ (but validate.py does not allow underscores as UPOS except for multiword tokens).
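The last two points can be illustrated with a small sketch. This is not the code of validate.py, only an approximation of the two checks described above, under the assumption that multiword tokens are recognized by a range ID such as `1-2`. The function name `check_newlines_and_upos` and the error messages are hypothetical.

```python
import re

# Assumption: multiword tokens are the lines whose ID is a range like "1-2".
MWT_ID = re.compile(r"^\d+-\d+$")


def check_newlines_and_upos(data: bytes):
    """Return a list of error strings for the two checks sketched here."""
    errors = []
    # Check 1: only unix-style LF newlines, so any CR byte is an error.
    if b"\r" in data:
        errors.append("non-LF newline (CR found)")
    text = data.decode("utf-8")
    for lineno, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # keep going so later errors are reported too
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no UPOS
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append(f"line {lineno}: expected 10 columns, got {len(cols)}")
            continue
        token_id, upos = cols[0], cols[3]
        # Check 2: UPOS may be "_" only on multiword token lines.
        if upos == "_" and not MWT_ID.match(token_id):
            errors.append(f"line {lineno}: UPOS is '_' on a non-multiword token")
    return errors
```

If the specification decided to allow `_` as "not yet annotated", check 2 would move from the first validity level to the second, or be dropped.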

As for the last point, I thought that UPOS=X means a word which cannot be assigned any other part-of-speech category (e.g. code switching), whereas an underscore means an unknown/empty value and should be used e.g. if no tagger (or human annotator) has been applied (yet).

Once the CoNLL-U specification page becomes too long, it would probably be good to move the details to a separate page, so that the main page stays brief.
