sigmorphon/2022SegmentationST

SIGMORPHON 2022 Shared Task on Morpheme Segmentation

Morphemes (prefixes, suffixes, and root words) are the smallest meaningful units of words. Our proposed shared task is morpheme segmentation: converting a text into a sequence of morphemes. To prepare a dataset for this task, we integrated all basic types of morphological databases: UniMorph (Kirov et al., 2018b; McCarthy et al., 2020) for inflectional morphology; MorphyNet (Batsuren et al., 2021) for derivational morphology; and Universal Dependencies (Nivre et al., 2017) and ten editions of Wiktionary for compound morphology and root words. We expect this task to help the NLP community innovate in subword-based tokenization. This shared task has two parts: word-level morpheme segmentation (Part 1) and sentence-level morpheme segmentation (Part 2).

Please join our Google Group to stay up to date. Click here to register for the task!

Please open an issue if you have any questions.

The two subtasks will be scored separately. Participating teams may submit as many systems as they want to as many subtasks as they want.

Part 1: Word-level Morpheme Segmentation

At the word level, participants will be asked to segment a given word into a sequence of morphemes. Input words cover all types of word forms: root words, derived words, inflected words, and compound words.

Data

Training and development data are UTF-8-encoded tab-separated values files. Each example occupies a single line and consists of the input word, the corresponding morpheme sequence, and the corresponding morphological category. The following shows three lines of English data:

inaccuracies  in @@accurate @@cy @@s  110
dictionary  dictionary  000
screwdriver  screw @@drive @@er  011

Note: The third column, the morphological category, is an optional feature that may be used only to oversample or undersample the training data.

The first example is an inflected and derived word with a prefix (in-) and suffixes (-cy and -s), and the second example is a root word. The third example is a compound word that also carries a derivational suffix (-er). In the test datasets, we will provide only the first column, the input words.
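As an illustration of the file format described above, the following is a minimal sketch of a line parser. The names `Example` and `parse_line` are hypothetical and not part of the task's tooling; the sketch assumes that columns are separated by a tab, that morphemes are separated by single spaces, and that `@@` marks every non-initial morpheme.

```python
from typing import List, NamedTuple, Optional


class Example(NamedTuple):
    """One line of word-level data (hypothetical container, not official)."""
    word: str
    morphemes: List[str]      # bare morphemes, "@@" marker removed
    category: Optional[str]   # e.g. "110"; None when the column is absent


def parse_line(line: str) -> Example:
    """Parse one tab-separated line: word, morpheme sequence, category."""
    fields = line.rstrip("\n").split("\t")
    word = fields[0]
    morphemes: List[str] = []
    if len(fields) > 1:
        for token in fields[1].split():
            # "@@" marks a morpheme that continues the current word.
            morphemes.append(token[2:] if token.startswith("@@") else token)
    # Test files contain only the first column, so both extras are optional.
    category = fields[2] if len(fields) > 2 else None
    return Example(word, morphemes, category)


if __name__ == "__main__":
    print(parse_line("inaccuracies\tin @@accurate @@cy @@s\t110"))
```

Because the second and third columns are treated as optional, the same function can read the test files, which contain input words only.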

Languages

Development languages are:

  1. ces: Czech
  2. eng: English
  3. fra: French
  4. hun: Hungarian
  5. spa: Spanish
  6. ita: Italian
  7. lat: Latin
  8. rus: Russian

Data Statistics

word class   English  Spanish  Hungarian  French  Italian  Russian  Czech   Latin
100           126544   502229     410662  105192   253455   221760      -  831991
010           203102    18449      24923   67983    41092    72970      -       0
101            13790      458     101189     478      317     1909      -       0
000           101938    15843       6952   13619    21037     2921      -   50338
011             5381       82       1654     506      140      328      -       0
110           106570   346862     323119  126196   237104   481409      -       0
001            16990      248       3320    1684      431      259      -       0
111             3059      343      54279     186      158     2658      -       0
total words   577374   884514     926098  382797   553734   784214  38682  882329

Word category description

For some of the development languages, we provide the word categories so that participants can handle the imbalance across morphological categories.

word class  Description                       English example (input ==> output)
100         Inflection only                   played ==> play @@ed
010         Derivation only                   player ==> play @@er
101         Inflection and Compound           wheelbands ==> wheel @@band @@s
000         Root words                        progress ==> progress
011         Derivation and Compound           tankbuster ==> tank @@bust @@er
110         Inflection and Derivation         urbanizes ==> urban @@ize @@s
001         Compound only                     hotpot ==> hot @@pot
111         Inflection, Derivation, Compound  trackworkers ==> track @@work @@er @@s
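The three-digit word class reads as three flags: inflection, derivation, compound, in that order. The sketch below decodes a class code and shows one possible way to oversample rare classes in the training data. Both function names are hypothetical, and balancing every class up to the size of the largest one is only one of several reasonable strategies, not something the task prescribes.

```python
import random
from typing import Dict, List, Tuple


def decode_category(code: str) -> Dict[str, bool]:
    """Turn a code such as "110" into named flags (digit order as in the table)."""
    if len(code) != 3 or set(code) - {"0", "1"}:
        raise ValueError(f"not a three-digit binary word class: {code!r}")
    return {
        "inflection": code[0] == "1",
        "derivation": code[1] == "1",
        "compound": code[2] == "1",
    }


def oversample(examples: List[Tuple[str, str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Resample (item, category) pairs so every category matches the largest one.

    Assumption: sampling with replacement from each smaller category.
    """
    rng = random.Random(seed)
    by_category: Dict[str, List[Tuple[str, str]]] = {}
    for example in examples:
        by_category.setdefault(example[1], []).append(example)
    if not by_category:
        return []
    target = max(len(items) for items in by_category.values())
    balanced: List[Tuple[str, str]] = []
    for items in by_category.values():
        balanced.extend(items)
        # Draw the missing number of extra examples; k may be zero.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


if __name__ == "__main__":
    print(decode_category("011"))
```

Given the statistics above, where for example class 011 is far smaller than class 100 in every language, full balancing repeats rare examples many times, so a milder target size may be preferable in practice.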

Baseline results

The following table shows the word-level task results of a pretrained BertTokenizer on English. The pretrained tokenizer was taken from HuggingFace.

word class  inflection  derivation  compound  Recall  Precision  F1     Lev. distance
101         yes         no          yes       48.8    51.59      50.16  1.56
001         no          no          yes       49.5    43.76      46.45  1.66
111         yes         yes         yes       27.45   34.86      30.71  3.28
011         no          yes         yes       27.16   32.77      29.7   3.35
010         no          yes         no        33.91   25.31      28.99  2.76
110         yes         yes         no        26.21   24.95      25.56  3.31
100         yes         no          no        19.58   12.49      15.25  2.79
000         no          no          no        6.63    2.42       3.55   2.16
total       -           -           -         28.28   20.86      24.01  2.72
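To compare a WordPiece tokenizer against the gold segmentations, its output has to be written in the task's `@@` notation. The sketch below shows one way to do that conversion; it is an assumption about how such a baseline could be scored, not a description of how the numbers above were produced. The function name is hypothetical, and the token list in the usage line is an invented example rather than real tokenizer output.

```python
from typing import List


def wordpiece_to_segmentation(tokens: List[str]) -> str:
    """Rewrite WordPiece tokens for ONE word in the task's output format.

    WordPiece marks continuation pieces with a leading "##"; the task marks
    every non-initial morpheme with a leading "@@".
    """
    pieces = []
    for index, token in enumerate(tokens):
        piece = token[2:] if token.startswith("##") else token
        pieces.append(piece if index == 0 else "@@" + piece)
    return " ".join(pieces)


if __name__ == "__main__":
    # In practice the tokens could come from the transformers library, e.g.
    #   BertTokenizer.from_pretrained(<checkpoint>).tokenize(word)
    # The checkpoint used for the baseline is not stated in this document.
    print(wordpiece_to_segmentation(["play", "##er"]))  # hypothetical tokens
```

Note that a pretrained BERT tokenizer may also lowercase its input, depending on the checkpoint, which would affect exact-match scoring against cased gold morphemes.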

Part 2: Sentence-level Morpheme Segmentation

At the sentence level, participating systems are expected to predict a sequence of morphemes for a given sentence. The following shows two lines of English data:

Six weeks of basic training. Six week @@s of base @@ic train @@ing .
Fistfights, please. Fist @@fight @@s , please .

The following shows two lines of Mongolian data:

Гэрт эмээ хоол хийв. Гэр @@т эмээ хоол хийх @@в .
Би өдөр эмээ уусан. Би өдөр эм @@ээ уух @@сан .

In the example above, эмээ is a homonym of two different words: in the first sentence it means "grandmother", and in the second it is an inflected form of "medicine". Only the second, inflected form is segmentable (эм @@ээ), so the correct segmentation depends on the context.
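In the sentence-level output, the `@@` marker is what ties morphemes back to their words: a token without the marker starts a new word, and a token with it continues the previous one. A minimal sketch of recovering that grouping follows; the function name `group_by_word` is hypothetical, and the sketch assumes tokens are separated by whitespace.

```python
from typing import List


def group_by_word(segmented: str) -> List[List[str]]:
    """Group a segmented sentence into one morpheme list per word."""
    words: List[List[str]] = []
    for token in segmented.split():
        continues = token.startswith("@@")
        morpheme = token[2:] if continues else token
        if continues and words:
            words[-1].append(morpheme)   # continues the previous word
        else:
            words.append([morpheme])     # starts a new word
    return words


if __name__ == "__main__":
    for word in group_by_word("Би өдөр эм @@ээ уух @@сан ."):
        print(word)
```

Punctuation marks appear as separate tokens in the examples above, so they come out as single-morpheme words.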

Languages

Development languages are:

  1. ces: Czech
  2. eng: English
  3. mon: Mongolian

Data Statistics

Language   train  dev   test
Czech      1000   500   500
English    11007  1783  1846
Mongolian  1000   500   500

Evaluation

We will provide Python evaluation scripts that report the following evaluation measures:

  • Precision - fraction of correctly predicted morphemes among all predicted morphemes
  • Recall - fraction of correctly predicted morphemes among all gold morphemes
  • F-measure - the harmonic mean of precision and recall
  • Edit distance - average Levenshtein distance between the predicted output and the gold instance
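The four measures can be sketched as follows. This is not the official evaluation script. It assumes that a morpheme counts as correct when it appears in both the predicted and the gold sequence of the same instance (multiset overlap, pooled over all instances), and that the Levenshtein distance is computed over the characters of the full output strings. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit-cost insert, delete, substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate(pairs: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Score (predicted, gold) segmentation strings."""
    correct = n_predicted = n_gold = total_distance = 0
    for predicted, gold in pairs:
        p, g = Counter(predicted.split()), Counter(gold.split())
        correct += sum((p & g).values())   # multiset intersection
        n_predicted += sum(p.values())
        n_gold += sum(g.values())
        total_distance += levenshtein(predicted, gold)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    distance = total_distance / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "distance": distance}


if __name__ == "__main__":
    print(evaluate([("play @@er", "play @@er"),
                    ("in @@accuracies", "in @@accurate @@cy @@s")]))
```

Under these assumptions a system that leaves every word unsegmented still earns credit on root words, which is why the per-class breakdown in the baseline table is informative alongside the totals.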

Timeline

Development Phase

Generalization Phase

  • April 8, 2022: Training and development splits for surprise languages released.

Evaluation Phase

  • April 15, 2022: Test splits for development and surprise languages are released.
  • April 29, 2022: Participants' submissions due.

Write-up Phase

  • May 13, 2022: Participants' draft system description papers due.
  • May 20, 2022: Participants' camera-ready system description papers due.

Organizers

  • Khuyagbaatar Batsuren (National University of Mongolia)
  • Gábor Bella (University of Trento)
  • Aryaman Arora (Georgetown University)
  • Viktor Martinović (University of Vienna)
  • Kyle Gorman (Graduate Center, City University of New York)
  • Zdeněk Žabokrtský (Charles University)
  • Amarsanaa Ganbold (National University of Mongolia)
  • Šárka Dohnalová (Charles University)
  • Magda Ševčíková (Charles University)
  • Kateřina Pelegrinová (University of Ostrava)
  • Fausto Giunchiglia (University of Trento)
  • Ryan Cotterell (ETH Zürich)
  • Ekaterina Vylomova (University of Melbourne)

License

The data is released under the Creative Commons Attribution-ShareAlike 3.0 Unported License inherited from Wiktionary itself.

References

Kirov, C., Cotterell, R., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., Faruqui, M., Mielke, S., McCarthy, A., Kübler, S., Yarowsky, D., Eisner, J., and Hulden, M. (2018). UniMorph 2.0: Universal Morphology. Proceedings of LREC 2018.

McCarthy, A.D., Kirov, C., Grella, M., Nidhi, A., Xia, P., Gorman, K., Vylomova, E., Mielke, S.J., Nicolai, G., Silfverberg, M., and Arkhangelskij, T. (2020). UniMorph 3.0: Universal Morphology. Proceedings of LREC 2020.

Batsuren, K., Bella, G., and Giunchiglia, F. (2021). MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology. Proceedings of SIGMORPHON 2021 (pp. 39-48).

Nivre, J., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, M.J., Asahara, M., Ateyah, L., Attia, M., Atutxa, A., Augustinus, L., and Badmaeva, E. (2017). Universal Dependencies 2.1.
