
Standardize sentence-level comments that are used in multiple treebanks #273

@dan-zeman


Originally reported by @jeanm in #272 (comment)

I noticed UD_French has a comment above each annotated sentence with the unannotated text. For example:

# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1   Aviator _   PROPN   _   _   0   root    _   _
2   ,   _   PUNCT   _   _   1   punct   _   _
3   un  _   DET _   _   4   det _   _
4   film    _   NOUN    _   _   1   appos   _   _
5   sur _   ADP _   _   7   case    _   _
6   la  _   DET _   _   7   det _   _
7   vie _   NOUN    _   _   4   nmod    _   _
8   de  _   ADP _   _   9   case    _   _
9   Hughes  _   PROPN   _   _   7   nmod    _   _
10  .   _   PUNCT   _   _   1   punct   _   _

Many other treebanks have some sort of sentence id in the comments too, but they all use different formats. It would be nice if these conventions could be standardized, perhaps by specifying an optional # sentence-text: <unannotated text> line before each annotated sentence.
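To illustrate what a shared convention would allow, here is a minimal Python sketch that reads `# key: value` comment lines such as the `sentid` and `sentence-text` lines quoted above. The function name `read_sentences` and the regular expression are hypothetical, chosen only for this example, and are not part of any UD tool. The sketch assumes sentences are separated by blank lines and splits columns on tabs when present, falling back to whitespace as in the excerpt above.

```python
import re

# Hypothetical pattern for comments like "# sentence-text: Aviator, un film ..."
COMMENT_RE = re.compile(r"^#\s*([^:\s]+)\s*:\s*(.*)$")


def read_sentences(lines):
    """Yield (comments, token_rows) for each blank-line-separated sentence."""
    comments, rows = {}, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if comments or rows:
                yield comments, rows
            comments, rows = {}, []
        elif line.startswith("#"):
            match = COMMENT_RE.match(line)
            if match:  # comments without a "key: value" shape are ignored
                comments[match.group(1)] = match.group(2).strip()
        else:
            # Assumption: tab-separated columns, else whitespace-separated.
            rows.append(line.split("\t") if "\t" in line else line.split())
    if comments or rows:
        yield comments, rows


sample = """\
# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1   Aviator _   PROPN   _   _   0   root    _   _
2   ,   _   PUNCT   _   _   1   punct   _   _
"""

sentences = list(read_sentences(sample.splitlines()))
comments, rows = sentences[0]
print(comments["sentid"], "|", comments["sentence-text"])
```

With a single agreed key for the sentence text, a reader like this would work unchanged across treebanks. Today each treebank's comment format would need its own pattern.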
