This folder contains the data releases for the CLEF-HIPE shared task on named entity recognition and classification (NERC) and entity linking (EL) in historical newspapers.
A second edition of the shared task was organised in 2022: see the HIPE-2022 website and data repository.
- v1.4 (released 11.02.2022): post-evaluation release with sentence splitting (all splits).
- test-masked-v1.3 (released 03.06.2020): masked test dataset for evaluation of system runs for task bundle 5.
- test-masked-v1.2 (released 25.05.2020): masked test dataset for evaluation of system runs for task bundles 1-4.
- training-v1.2 (released 12.10.2020): fourth version of training and dev datasets for HIPE. Main change: additional data for French and German.
- training-v1.1 (released 7.04.2020): this release fixes a problem in the German validation set (see issue #5), as well as a problem with the escaping of double quotes in the TSV exports.
- training-v1.0 (released 26.03.2020): second version of training and dev datasets for German and French, and of dev dataset for English (there won't be training data for English). Main changes are quantitative (more documents and therefore more mentions and linked entities). A v1.1 release with improved quality is planned for the beginning of April.
- training-v0.9 (released 20.02.2020): first version of training and dev datasets for German and French, and first version of dev dataset for English (there won't be training data for English).
- sample-v1.0 (released 10.01.2020): 10 French and 8 German content items fully annotated.
The HIPE datasets are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
If you use this data, please consider citing
- the extended overview paper:
M. Ehrmann, M. Romanello, A. Flückiger, and S. Clematide, Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers. In: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 2020, vol. 2696, p. 38. doi: 10.5281/zenodo.4117566.
@inproceedings{ehrmann_extended_2020,
title = {Extended {Overview} of {CLEF HIPE} 2020: {Named Entity Processing} on {Historical Newspapers}},
booktitle = {{CLEF 2020 Working Notes}. {Working Notes} of {CLEF} 2020 - {Conference} and {Labs} of the {Evaluation Forum}},
author = {Ehrmann, Maud and Romanello, Matteo and Fl{\"u}ckiger, Alex and Clematide, Simon},
editor = {Cappellato, Linda and Eickhoff, Carsten and Ferro, Nicola and N{\'e}v{\'e}ol, Aur{\'e}lie},
year = {2020},
volume = {2696},
pages = {38},
publisher = {{CEUR-WS}},
address = {{Thessaloniki, Greece}},
doi = {10.5281/zenodo.4117566},
url = {https://infoscience.epfl.ch/record/281054},
}
- the Springer LNCS overview:
M. Ehrmann, M. Romanello, A. Flückiger, and S. Clematide, Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In: Arampatzis et al. (eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2020. Lecture Notes in Computer Science, vol. 12260. Springer, Cham. doi: https://doi.org/10.1007/978-3-030-58219-7_21
@InProceedings{10.1007/978-3-030-58219-7_21,
author="Ehrmann, Maud and Romanello, Matteo and Fl{\"u}ckiger, Alex and Clematide, Simon",
editor="Arampatzis, Avi and Kanoulas, Evangelos and Tsikrika, Theodora and Vrochidis, Stefanos and Joho, Hideo and Lioma, Christina and Eickhoff, Carsten and N{\'e}v{\'e}ol, Aur{\'e}lie and Cappellato, Linda and Ferro, Nicola",
title="Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers",
booktitle="Experimental IR Meets Multilinguality, Multimodality, and Interaction",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="288--310",
isbn="978-3-030-58219-7"
}
Results of participating teams appear in the working notes proceedings, published by CEUR Workshop, and were presented at the CLEF conference in September 2020 (see the YouTube playlist).
