Automatic Keyword Extraction from Historical Document Images

Terasawa, Kengo; Nagasaki, Takeshi; Kawashima, Toshio

doi:10.1007/11669487_37

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3872)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

This paper presents a method for automatically extracting keywords from historical document images. The method is language independent because it is purely appearance based: it requires neither lexical information nor any statistical language model. Moreover, because it does not need word segmentation, it can be applied to Eastern languages that do not put clear spacing between words. The first half of the paper describes an algorithm that retrieves document image regions similar in appearance to a given query image. Evaluated in terms of recall and precision, the algorithm achieved average precision of over 80–90%. The second half of the paper describes a keyword extraction method that works even when no query word is explicitly specified. Because efficient pruning techniques reduce the computational cost, the system could successfully extract keywords from relatively large documents.
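To make the recall-precision evaluation concrete, the following is a minimal sketch of how average precision can be computed for one query from a ranked list of retrieved regions. It uses the common information-retrieval definition (the mean of the precision values at the ranks where a relevant region appears); the abstract does not state the exact formula the authors used, so this definition is an assumption. The function name `average_precision` and its parameters are hypothetical and chosen for illustration only.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision of one ranked result list (illustrative sketch).

    ranked_relevance: list of booleans, best match first; True marks a
        retrieved region that really contains the query word.
    total_relevant: number of true occurrences in the whole collection.
        Assumption: if omitted, every true occurrence is taken to be
        present somewhere in the ranked list.
    """
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0  # no relevant regions at all: define AP as zero

    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # precision at this rank = relevant results so far / rank
            precision_sum += hits / rank
    return precision_sum / total_relevant


# Hypothetical ranking: relevant regions at ranks 1, 3 and 4.
example = [True, False, True, True, False]
print(round(average_precision(example), 4))  # (1/1 + 2/3 + 3/4) / 3
```

Passing `total_relevant` explicitly penalises occurrences the retrieval step missed entirely, which matters when the ranked list is truncated.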

Author information

Authors and Affiliations

School of Systems Information Science, Future University-Hakodate, 116-2 Kamedanakano-cho, Hakodate-shi, Hokkaido, 041-8655, Japan
Kengo Terasawa, Takeshi Nagasaki & Toshio Kawashima

Editor information

Editors and Affiliations

Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012, Bern, Switzerland
Horst Bunke
DocRec Ltd, 34 Strathaven Place, 7001, Atawhai, Nelson, New Zealand
A. Lawrence Spitz

About this paper

Cite this paper

Terasawa, K., Nagasaki, T., Kawashima, T. (2006). Automatic Keyword Extraction from Historical Document Images. In: Bunke, H., Spitz, A.L. (eds) Document Analysis Systems VII. DAS 2006. Lecture Notes in Computer Science, vol 3872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11669487_37

DOI: https://doi.org/10.1007/11669487_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32140-8
Online ISBN: 978-3-540-32157-6
eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science
