Abstract
This paper presents a method for automatic keyword extraction from historical document images. The method is language independent because it is purely appearance based: it requires neither lexical information nor a statistical language model. Moreover, because it needs no word segmentation, it can be applied to Eastern languages that do not put clear spacing between words. The first half of the paper describes an algorithm that retrieves document image regions similar in appearance to a given query image. Evaluated in terms of recall and precision, the algorithm achieved an average precision of over 80–90%. The second half of the paper describes a keyword extraction method that works even when no query word is explicitly specified. Because efficient pruning techniques reduce the computational cost, the system could extract keywords successfully from relatively large documents.
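The abstract reports retrieval quality as average precision without giving the evaluation protocol. As a point of reference, the following is a minimal sketch of the standard non-interpolated average precision for one ranked retrieval list; it is not taken from the paper. The function name `average_precision` and its parameters are hypothetical, and the sketch assumes each retrieved region has already been judged relevant or not relevant to the query.

```python
from typing import Optional, Sequence


def average_precision(
    ranked_relevance: Sequence[bool],
    total_relevant: Optional[int] = None,
) -> float:
    """Non-interpolated average precision for one ranked result list.

    ranked_relevance[i] is True when the result at rank i + 1 is a
    correct match for the query. total_relevant is the number of correct
    matches in the whole collection; when omitted, it is assumed that
    every correct match appears somewhere in the ranked list.
    """
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each correct match.
            precision_sum += hits / rank
    return precision_sum / total_relevant


if __name__ == "__main__":
    # Correct matches at ranks 1 and 3 of a four-item result list.
    print(average_precision([True, False, True, False]))
```

Averaging this value over a set of queries gives the mean average precision; whether the reported 80–90% figure is averaged this way is not stated in the abstract.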
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Terasawa, K., Nagasaki, T., Kawashima, T. (2006). Automatic Keyword Extraction from Historical Document Images. In: Bunke, H., Spitz, A.L. (eds) Document Analysis Systems VII. DAS 2006. Lecture Notes in Computer Science, vol 3872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11669487_37
DOI: https://doi.org/10.1007/11669487_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32140-8
Online ISBN: 978-3-540-32157-6

